Day 18

SmolVLA fine-tuning on LIBERO-Spatial

This is a valid v1.0 placeholder page for the later curriculum arc. Full interactive lab treatment ships after Week 1 dogfooding.

LECTURE & READING

Glossary primer (12 min)

  • SmolVLA: Hugging Face's compact (about 450M parameters) vision-language-action model (VLA), released 2025. Designed for fine-tuning on consumer GPUs.
  • VLA (vision-language-action model): A model that takes images and text instructions as input and outputs robot actions. RT-2 (2023) was the first major one; π0, GR00T, and SmolVLA are descendants.
  • LIBERO: A benchmark of 100+ short manipulation tasks with language instructions. LIBERO-Spatial is its suite of 10 spatial-relation tasks ("put the bowl on the right of the plate").
  • LoRA (Low-Rank Adaptation): Fine-tune by adding small trainable low-rank matrices to attention layers while keeping the base weights frozen. Reduces fine-tuning memory by ~5×.
  • PaliGemma: Google's vision-language model (VLM; 3B params, 224×224 images) and the backbone of π0. SmolVLA does not use it; its backbone is the much smaller SmolVLM-2.
  • Action expert: The head that converts the backbone's hidden states into robot actions such as joint targets. In SmolVLA this is a compact transformer trained with flow matching that predicts chunks of actions, not a plain MLP.
  • Pretrained policy: A model trained on a large mixture of data; we fine-tune it for our specific tasks rather than training from scratch.
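
The LoRA entry above can be made concrete with a little arithmetic. The sketch below counts trainable parameters for one weight matrix under full fine-tuning versus a rank-r adapter. The layer size and rank are hypothetical values chosen for illustration, not SmolVLA's actual configuration.

```python
# Illustrative arithmetic only: d_out, d_in and rank are hypothetical,
# not taken from SmolVLA's real architecture or LoRA config.

def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters in the two low-rank factors:
    B with shape (d_out, rank) and A with shape (rank, d_in)."""
    return rank * (d_out + d_in)


d_out, d_in, rank = 1024, 1024, 16
full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)

print(f"full fine-tune: {full:,} trainable parameters")
print(f"LoRA (rank {rank}): {lora:,} trainable parameters")
print(f"fraction trained: {lora / full:.1%}")
```

Note that the saving applies to gradients and optimizer state for the adapted weights. Activations and the frozen base weights still occupy memory, so the end-to-end memory reduction is smaller than the parameter ratio suggests.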

Real-world analogy

A VLA is a chef who has read every cookbook (vision-language pretraining) and is now learning your specific kitchen (LIBERO tasks). You don't reteach them how to chop onions; you just demonstrate "how I want it done in this kitchen". LoRA is teaching them via post-it notes (small trainable layers) instead of rewriting the cookbook.

Hour 1 — Reading

Hour 2 — Setup + verify pretrained inference

cd ~/robo47-il
source .venv/bin/activate

# Download SmolVLA pretrained
python -c "
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy
policy = SmolVLAPolicy.from_pretrained('lerobot/smolvla_base')
print(f'Loaded SmolVLA, {sum(p.numel() for p in policy.parameters())/1e6:.1f}M params')
"

Expected: a line of the form Loaded SmolVLA, <N>M params, where N is roughly 450 (SmolVLA is about 0.45B parameters, not billions).

LAB

Hour 3 — Lab: LoRA-fine-tune SmolVLA on LIBERO-Spatial (90 min)

What you're building. Fine-tune SmolVLA on LIBERO-Spatial via LoRA. Evaluate zero-shot (no fine-tuning) first, then again after fine-tuning. Quantify the lift.

What success looks like at the end. You have:

  1. Zero-shot SmolVLA success rate on LIBERO-Spatial: 0.30–0.45 (the pretrained baseline).
  2. Fine-tuned SmolVLA success rate: 0.70–0.85.
  3. LoRA fine-tuning peaking at ~25 GB of GPU memory (vs ~50 GB for a full fine-tune).
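
One way to quantify the lift is to compare the two success rates together with a confidence interval, because 50 episodes gives a fairly noisy estimate. The sketch below uses a Wilson score interval. The episode counts are hypothetical placeholders picked to fall inside the target ranges above; they are not measured results.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts out of 50 evaluation episodes each (placeholders, not results).
n_episodes = 50
zero_shot_successes = 19
fine_tuned_successes = 39

zs_rate = zero_shot_successes / n_episodes
ft_rate = fine_tuned_successes / n_episodes
lift = ft_rate - zs_rate

zs_lo, zs_hi = wilson_interval(zero_shot_successes, n_episodes)
ft_lo, ft_hi = wilson_interval(fine_tuned_successes, n_episodes)

print(f"zero-shot:  {zs_rate:.2f}  (95% CI {zs_lo:.2f} to {zs_hi:.2f})")
print(f"fine-tuned: {ft_rate:.2f}  (95% CI {ft_lo:.2f} to {ft_hi:.2f})")
print(f"absolute lift: {lift:+.2f}")
```

If the two intervals overlap heavily, the lift may be within evaluation noise; running more episodes or more seeds narrows the intervals.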

Step 1 — Zero-shot evaluation (15 min)

lerobot-eval \
  --policy.path=lerobot/smolvla_base \
  --env.type=libero --env.task=libero_spatial \
  --eval.n_episodes=50 \
  --output_dir=runs/smolvla_zeroshot \
  --seed=1

Full source continues in the committed curriculum files. The v1.0 page exposes the day flow and lab surface without inventing content.
