Week 3: Imitation Learning, Day 16
ACT (Action Chunking Transformer) on ALOHA insertion
This is a valid v1.0 placeholder page for the later curriculum arc. Full interactive lab treatment ships after Week 1 dogfooding.
LECTURE & READING
Glossary primer (10 min)
- ACT (Action Chunking Transformer): Stanford, 2023. A CVAE-based transformer that predicts the next K actions per call and is deployed in a sliding window.
- CVAE: Conditional Variational AutoEncoder. Trains an encoder to map (obs, action_chunk) → latent z and a decoder to map (obs, z) → action_chunk. At inference, set z = 0 (the prior mean) for determinism.
- Action chunk size K: hyperparameter, typically 100 (one full insertion!). Trade-off: longer chunks give smoother motion but react more slowly.
- Temporal ensembling: average overlapping chunks. The action executed at time t blends the predictions for t from the chunks issued at t, t-1, t-2, ..., with exponentially decaying weights.
- Backbone: vision encoder. ACT uses ResNet-18 by default.
- 51M parameters: ACT's typical size. Fits on a 24GB GPU for training.
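The CVAE entry above can be made concrete with a small numeric sketch. It assumes the usual CVAE objective: a reconstruction term plus a weighted KL term that pulls the encoder's Gaussian toward a standard normal prior. The ACT paper uses an L1 reconstruction loss. The function names and the kl_weight value are illustrative and are not LeRobot's API.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def l1_loss(pred_chunk, target_chunk):
    """Mean absolute error over a flattened action chunk."""
    if len(pred_chunk) != len(target_chunk):
        raise ValueError("chunks must have the same length")
    return sum(abs(p - t) for p, t in zip(pred_chunk, target_chunk)) / len(target_chunk)


def cvae_loss(pred_chunk, target_chunk, mu, log_var, kl_weight=10.0):
    """Reconstruction + weighted KL. kl_weight here is an illustrative value."""
    return l1_loss(pred_chunk, target_chunk) + kl_weight * kl_to_standard_normal(mu, log_var)


def inference_latent(latent_dim):
    """At inference the encoder is skipped and z is fixed to the prior mean."""
    return [0.0] * latent_dim
```

When the encoder output already matches the prior (mu = 0, log_var = 0), the KL term vanishes and only the reconstruction error remains, which is why fixing z = 0 at inference is a natural deterministic choice.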
Real-world analogy
ACT is "watch the master, then plan ahead — predict the whole next move (e.g. the entire reach-grasp-insert sequence) in one go, instead of just the next 1/30th of a second."
Hour 1 — Reading
- ACT paper, full sections 1–3 (~30 min): https://tonyzhaozh.github.io/aloha/
- ALOHA paper revisited (already partly read): https://arxiv.org/abs/2304.13705
- LeRobot ACT tutorial blog post (~15 min): https://huggingface.co/blog/lerobot-act
Hour 2 — Read the LeRobot ACT implementation
- Open ~/robo47-il/.venv/lib/python3.12/site-packages/lerobot/policies/act/modeling_act.py. Read for ~30 min. Find:
  - The CVAE encoder and decoder transformers.
  - Where chunk_size=100 is consumed.
  - The temporal ensembling logic in select_action.
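Before reading select_action, it may help to have a reference picture of temporal ensembling. The sketch below is a plain-Python illustration for scalar actions, not LeRobot's code. It assumes the policy is queried every step and follows the weighting described in the ACT paper, w_i = exp(-m * i) with i = 0 for the oldest prediction. The names ensemble_action and run_with_ensembling and the value of m are hypothetical.

```python
import math


def ensemble_action(predictions, m=0.01):
    """Blend all predictions for one timestep, ordered oldest first.

    Weights are w_i = exp(-m * i), so i = 0 (the oldest prediction)
    receives the largest weight.
    """
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions)) / total


def run_with_ensembling(policy, horizon, chunk_size, m=0.01):
    """Query `policy(t)` at every step and execute the blended action.

    `policy(t)` must return `chunk_size` actions meant for steps
    t, t + 1, ..., t + chunk_size - 1.
    """
    chunks = []    # chunks[s] is the chunk predicted at step s
    executed = []
    for t in range(horizon):
        chunks.append(policy(t))
        oldest = max(0, t - chunk_size + 1)
        # The chunk predicted at step s holds its guess for step t at index t - s.
        preds_for_t = [chunks[s][t - s] for s in range(oldest, t + 1)]
        executed.append(ensemble_action(preds_for_t, m))
    return executed
```

With m = 0 this reduces to a plain average of the overlapping predictions. Larger m leans harder on the older predictions, which smooths motion at the cost of slower reaction to new observations.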
LAB
Hour 3 — Lab: train ACT, eval, beat the BC baseline (90 min)
What you're building. Train ACT on the same ALOHA insertion dataset for 20k steps (≈45 min on 1× H100). Compare directly against Day 15's Behavior Cloning (BC) baseline.
What success looks like at the end. You have:
1. ACT checkpoint at runs/act_aloha/checkpoints/last/pretrained_model/.
2. Eval success rate ≈ 0.70–0.95 (vs BC's 0.05–0.30): the >0.5 win condition.
3. Eval video showing smooth, deliberate insertions.
4. Side-by-side comparison plot figures/day16_act_vs_bc.png.
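The win condition in item 2 can be checked mechanically once you have per-episode outcomes from both evals. This is a minimal sketch assuming each eval yields a list of booleans, one per episode. The helper names are hypothetical, and how you extract the outcomes from your eval output is left to you.

```python
def success_rate(outcomes):
    """Fraction of episodes that succeeded; `outcomes` is a list of booleans."""
    if not outcomes:
        raise ValueError("need at least one episode")
    return sum(1 for ok in outcomes if ok) / len(outcomes)


def meets_win_condition(act_outcomes, bc_outcomes, threshold=0.5):
    """True when ACT clears the threshold and also beats the BC baseline."""
    act_rate = success_rate(act_outcomes)
    bc_rate = success_rate(bc_outcomes)
    return act_rate > threshold and act_rate > bc_rate
```

With only a few dozen eval episodes, a single episode moves the rate by several points, so treat results close to the 0.5 threshold with caution.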
Step 1 — Train ACT (45 min)
cd ~/robo47-il
source .venv/bin/activate
lerobot-train \
  --policy.type=act \
  --dataset.repo_id=lerobot/aloha_sim_insertion_human \
  --env.type=aloha \
  --env.task=AlohaInsertion-v0 \
  --batch_size=8 \
  --steps=20000 \
  --eval_freq=5000 \
  --save_freq=5000 \
  --output_dir=runs/act_aloha \
  --wandb.enable=true \
  --wandb.project=robo47 \
  --seed=1
Full source continues in the committed curriculum files. The v1.0 page exposes the day flow and lab surface without inventing content.
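As a sanity check on the flags in the Step 1 command, the arithmetic below works out what they imply. It assumes evals and checkpoints fire on every multiple of their frequency, which is worth confirming against your installed LeRobot version. The helper names are hypothetical.

```python
def training_schedule(steps, batch_size, eval_freq, save_freq):
    """Derive sample count and eval/save steps from the CLI flag values."""
    return {
        "samples_seen": steps * batch_size,
        # Assumption: triggers fire whenever step is a multiple of the frequency.
        "eval_steps": list(range(eval_freq, steps + 1, eval_freq)),
        "save_steps": list(range(save_freq, steps + 1, save_freq)),
    }


def steps_per_second(steps, minutes):
    """Average throughput needed to finish `steps` in `minutes`."""
    return steps / (minutes * 60)


schedule = training_schedule(steps=20000, batch_size=8, eval_freq=5000, save_freq=5000)
throughput = steps_per_second(steps=20000, minutes=45)
```

With the values from the command this gives 160,000 samples seen, four evals and four checkpoints (at steps 5000, 10000, 15000, and 20000), and a required average of about 7.4 steps per second to hit the 45-minute estimate.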