Week 5: VLA Architectures
Day 35
Week 5 retro + capstone Track A design
This is a valid v1.0 placeholder page for the later curriculum arc. Full interactive lab treatment ships after Week 1 dogfooding.
LECTURE & READING
Hour 1 — Capstone Track A pre-design (40 min)
docs/day35_track_a_design.md:
# Track A: VLA fine-tune for a custom task
## Hypothesis
"Fine-tuning π0.7 on 30 episodes of *cleaning a cluttered desk* will achieve
≥ 0.6 success rate, vs ≤ 0.2 for π0.7 zero-shot."
## Data collection
- Use LeRobot SO-101 arm or sim teleop.
- 30 episodes, ~30 s each.
- Cluttered desk: pens, papers, mugs, books. Goal: organize into 3 piles.
## Variables
- IV: fine-tune (none / LoRA / full)
- DV: success rate over 20 eval episodes (different scenes)
## Experiments
1. π0.7 zero-shot (baseline)
2. π0.7 + LoRA r32 on 30 episodes
3. (stretch) π0.7 + LoRA r128
## Compute
- 1× H100, ~3 hours per fine-tune.
## Risk
- 30 episodes might be insufficient. Kill criterion: if LoRA r32 < 0.3, try r128 or collect more data.
Hour 2 — Fresh-clone test (45 min)
For Week 5, the bulk of the fresh-clone test is downloading HuggingFace checkpoints. Reproducibility (whether others can reliably get the same result) here means checking that LoRA fine-tunes converge to similar numbers from a fresh clone.
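"Similar numbers" needs a noise band, because a success rate measured over 20 eval episodes (the DV in the Track A design) is a coarse estimate. The sketch below is one way to make that concrete, not the course's prescribed check: it computes a Wilson score interval for a success rate and treats two runs as agreeing when each run's point estimate falls inside the other's interval. The function names `wilson_interval` and `runs_agree`, and the agreement rule itself, are assumptions introduced here for illustration.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)
    ) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


def runs_agree(successes_a: int, successes_b: int, trials: int = 20) -> bool:
    """Hypothetical agreement rule: each point estimate lies in the other's interval."""
    lo_a, hi_a = wilson_interval(successes_a, trials)
    lo_b, hi_b = wilson_interval(successes_b, trials)
    p_a, p_b = successes_a / trials, successes_b / trials
    return lo_a <= p_b <= hi_a and lo_b <= p_a <= hi_b


if __name__ == "__main__":
    # 12/20 successes: the point estimate is 0.6, but the interval is wide.
    low, high = wilson_interval(12, 20)
    print(f"12/20 -> [{low:.2f}, {high:.2f}]")
    print("12/20 vs 14/20 agree:", runs_agree(12, 14))
    print("12/20 vs 4/20 agree:", runs_agree(12, 4))
```

With 20 episodes, a measured 0.6 is compatible with a true rate anywhere from roughly 0.39 to 0.78, so a fresh-clone run that lands at 0.7 is not evidence of a reproducibility problem, while one that lands at 0.2 is.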
LAB
Hour 3 — Week 5 retro + commit
RETRO_w5.md:
# Week 5 retro
## What I learned
1. RT-1 → π0.7 lineage: vision encoder upgrades (CNN → SigLIP → DINOv3)
and action repr changes (discrete → continuous + flow matching) are
the two main story arcs.
2. LoRA on a 3B-param VLA fits comfortably on 1× H100 (24 GB peak).
3. Zero-shot success rates are surprisingly consistent across VLAs (0.30–0.45).
The pretraining mix is doing real work; the architecture is secondary.
4. After 5k LoRA steps, all VLAs converge to ~0.7-0.8 on LIBERO-Spatial.
The architecture matters less than I thought after fine-tuning.
5. Closed-source models (Gemini Robotics, Helix) are accessed via API or
not at all. Pragmatic: have one open-weights workhorse (π0).
6. Real-time deployment: on latency, CogACT beats π0, which beats the others;
whole-upper-body control needs a Helix-style System 1/System 2 split.
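The memory observation in item 2, and the r32 vs r128 choice in the Track A design, can be sanity-checked with a rough count of LoRA trainable parameters. A rank-r adapter on a linear layer with input width d_in and output width d_out adds r × (d_in + d_out) parameters. The sketch below applies that formula; the layer shapes and layer count are hypothetical placeholders chosen only to make the arithmetic concrete, not the dimensions of π0 or any other named model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearShape:
    """Shape of one linear layer that receives a LoRA adapter."""
    d_in: int
    d_out: int


def lora_params(shape: LinearShape, rank: int) -> int:
    """Trainable parameters of one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (shape.d_in + shape.d_out)


def total_lora_params(shapes: list[LinearShape], rank: int, num_layers: int) -> int:
    """Total adapter parameters when the same set of shapes repeats in every layer."""
    return num_layers * sum(lora_params(s, rank) for s in shapes)


# HYPOTHETICAL configuration: four 2048x2048 attention projections per layer,
# 18 transformer layers. Replace with the real shapes of the model being tuned.
HYPOTHETICAL_SHAPES = [LinearShape(2048, 2048)] * 4
HYPOTHETICAL_NUM_LAYERS = 18

if __name__ == "__main__":
    for rank in (32, 128):
        n = total_lora_params(HYPOTHETICAL_SHAPES, rank, HYPOTHETICAL_NUM_LAYERS)
        print(f"rank {rank}: {n:,} trainable adapter parameters")
```

Adapter size grows linearly with rank, so under these assumed shapes r128 has exactly four times the trainable parameters of r32. Both are small next to a 3B-parameter base model, which is consistent with the adapter not being the dominant memory cost.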
## What still confuses me
- Why is zero-shot performance so consistent across architectures? Suggests
the bottleneck is data, not modeling.
- Flow matching vs diffusion: I see both in production. When does each win?
- How do these scale to 100B params? RT-2 was 55B; nothing released since
has matched that size. Is that a data ceiling or a compute ceiling?
Deliverable checklist
Full source continues in the committed curriculum files. The v1.0 page exposes the day flow and lab surface without inventing content.
Completion controls unlock when this day graduates from placeholder to full lab.