Week 3: Imitation Learning
Day 21
Week 3 capstone-track reflection + fresh-clone
This is a valid v1.0 placeholder page for the later curriculum arc. Full interactive lab treatment ships after Week 1 dogfooding.
LECTURE & READING
Glossary primer (5 min)
No new terms. Today is reflection + reproducibility (whether others can reliably get the same result).
Hour 1 — Capstone Track D pre-design (40 min)
Track D of the Week 7 capstone is "dexterous bimanual ACT/π0.5". Even if you don't choose this track, treat today as designing the experiment you'd run if you did. Write docs/day21_track_d_design.md:
# Track D: bimanual dexterous policy
## Hypothesis
"Fine-tuning π0.5 on 50 episodes of bimanual cup-to-saucer placement (collected via teleop) will achieve ≥0.7 success rate vs 0.0 for zero-shot π0.5."
## Variables
- IV: fine-tuning data (none / 50 episodes)
- DV: success rate over 20 eval episodes
- Controls: same eval poses, same camera intrinsics, seed=1
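
A success rate measured over 20 episodes carries substantial sampling noise, which matters when comparing against thresholds like 0.7 or 0.5. The following is a minimal sketch of a Wilson score interval for a binomial proportion. `wilson_interval` is a hypothetical helper name, and treating episodes as independent Bernoulli trials is an assumption. For 14 successes out of 20, it gives roughly 0.48 to 0.86.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%).

    Assumes episodes are independent trials with a shared success probability.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Example: 14 of 20 eval episodes succeeded (observed rate 0.7).
low, high = wilson_interval(14, 20)
print(f"0.70 observed, interval about [{low:.2f}, {high:.2f}]")
```

An interval this wide suggests either raising the number of eval episodes or stating the hypothesis with the uncertainty in mind.
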
## Experiments
1. Zero-shot π0.5 eval (baseline)
2. ACT trained from scratch on 50 episodes (from-scratch baseline, expected lower bound)
3. SmolVLA LoRA on 50 episodes (mid-cost)
4. π0.5 LoRA on 50 episodes (target)
## Compute budget
- 1× H100 for ~6 hours total (1.5 h per fine-tune × 3, plus evals).
## Risks / kill criteria
- If π0.5 zero-shot is already > 0.5, hypothesis falsified — pivot to harder task.
- If ACT-from-scratch matches π0.5 LoRA, the VLA isn't adding value — investigate why.
Hour 2 — Fresh-clone test (40 min)
cd /tmp
rm -rf w3-test
git clone <your-w3-imitation-url> w3-test
cd w3-test
uv venv --python 3.12 .venv && source .venv/bin/activate
uv pip install -r requirements.txt
# Re-run BC eval on saved checkpoint (cheap)
lerobot-eval \
--policy.path=$HF_CACHE/bc_aloha_seed1/pretrained_model \
--env.type=aloha --env.task=AlohaInsertion-v0 \
--eval.n_episodes=10 --seed=1 \
--output_dir=fresh_eval/
grep success_rate fresh_eval/metadata.json
Expected: the same number as your original eval (within ±0.05, due to env randomness).
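
The comparison against the original eval can also be scripted rather than read off by eye. This is a minimal sketch, assuming both runs wrote a JSON file with a top-level `success_rate` key (the grep above suggests such a field, but the exact schema is an assumption). `load_success_rate` and `reproduces` are hypothetical helpers, not part of lerobot.

```python
import json
from pathlib import Path


def load_success_rate(path):
    """Read the success rate from a JSON file.

    Assumes a top-level `success_rate` key; adjust to the real schema.
    """
    return float(json.loads(Path(path).read_text())["success_rate"])


def reproduces(original, fresh, tol=0.05):
    """Return True if the fresh-clone rate is within `tol` of the original."""
    # Small epsilon so that a difference of exactly `tol` is not lost to rounding.
    return abs(original - fresh) <= tol + 1e-9
```

A possible use, where `orig_eval/metadata.json` is a placeholder for wherever the original eval was saved: `reproduces(load_success_rate("orig_eval/metadata.json"), load_success_rate("fresh_eval/metadata.json"))`.
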
LAB
Hour 3 — Week 3 retro (45 min)
RETRO_w3.md:
# Week 3 retro
## Numbers
| Method | Task | Success rate (mean ± std, n seeds) |
|---|---|---|
| BC | ALOHA insertion | 0.17 ± 0.08 (3) |
| ACT | ALOHA insertion | 0.78 ± 0.12 (3) |
| ACT | PushT | 0.65 ± 0.07 (1) |
| DP | PushT | 0.92 ± 0.04 (1) |
| VQ-BeT | PushT | 0.85 (1) |
| SmolVLA zs | LIBERO-Spatial | 0.36 (1) |
| SmolVLA LoRA | LIBERO-Spatial | 0.79 (1) |
| OFT zs | LIBERO 10ep | 0.30 (1) |
| OFT LoRA | LIBERO 10ep | 0.70 (1) |
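
The "mean ± std, n seeds" cells can be generated from per-seed results instead of typed by hand. This is a minimal sketch, with `summarize` as a hypothetical helper. It uses the sample standard deviation, which needs at least two seeds, so a single-seed cell is printed without a ± term. The per-seed values in the example are made up for illustration.

```python
import statistics


def summarize(rates):
    """Format per-seed success rates as 'mean ± std (n)'.

    The std is omitted when only one seed is available.
    """
    n = len(rates)
    mean = statistics.fmean(rates)
    if n < 2:
        return f"{mean:.2f} ({n})"
    std = statistics.stdev(rates)  # sample std, n - 1 denominator
    return f"{mean:.2f} ± {std:.2f} ({n})"


# Illustrative values only, not results from this course.
print(summarize([0.7, 0.8, 0.9]))
print(summarize([0.85]))
```
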
## Reproducibility
- BC eval reproduced within ±0.03 of the original.
## What I learned
1. ACT > BC by ~4.6× on ALOHA insertion (0.78 vs 0.17). Action chunking matters a lot, even before the transformer comes into play.
2. DP > ACT only on multimodal tasks. ACT is faster + simpler.
3. VQ-BeT is competitive with DP at 5× the inference speed (discrete is a feature, not a limitation).
4. LoRA on SmolVLA: 1.3% trainable params, 2.2× lift in success rate.
5. OpenVLA-OFT runs inference faster but starts at similar zero-shot performance to SmolVLA; the OFT win is inference latency, not accuracy.
6. Zero-shot VLA on LIBERO-Spatial is ~0.3, not 0.0. The pretraining is doing real work.
7. Fine-tuning works dramatically better than I'd assumed for 10 episodes; it's almost too easy.
## What still confuses me
- Why does ACT's CVAE z=0 inference work so well? The KL divergence regularization should make the prior matter, but z=0 is treated as deterministic.
- DP's score-matching loss is unstable with `bf16` early in training. Why doesn't this hurt at convergence?
Step 4 — Log + commit (5 min)
cd ~/robo47/w3-imitation
git add docs/ RETRO_w3.md
git commit -m "Day 21: Week 3 retro + Track D design + fresh-clone reproducibility"
git push
Full source continues in the committed curriculum files. The v1.0 page exposes the day flow and lab surface without inventing content.