Week 5 retro + capstone Track A design

This is a valid v1.0 placeholder page for the later curriculum arc. Full interactive lab treatment ships after Week 1 dogfooding.

LECTURE & READING

Hour 1 — Capstone Track A pre-design (40 min)

docs/day35_track_a_design.md:

# Track A: VLA fine-tune for a custom task

## Hypothesis
"Fine-tuning π0.7 on 30 episodes of *cleaning a cluttered desk* will achieve
≥ 0.6 success rate, vs ≤ 0.2 for π0.7 zero-shot."

## Data collection
- Use LeRobot SO-101 arm or sim teleop.
- 30 episodes, ~30 s each.
- Cluttered desk: pens, papers, mugs, books. Goal: organize into 3 piles.
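
As a rough sanity check on how much data this plan yields, the arithmetic can be sketched as below. The episode count and length come from the plan; the 30 Hz recording rate is an assumption for illustration and should be replaced with whatever the teleop setup actually logs.

```python
# Rough size of the demonstration set described in the plan.
EPISODES = 30              # from the plan
SECONDS_PER_EPISODE = 30   # "~30 s each", from the plan
CONTROL_HZ = 30            # ASSUMPTION: recording rate is not stated in the plan


def dataset_budget(episodes: int, seconds_per_episode: float, hz: float) -> dict:
    """Return total demonstration time and frame count for a teleop dataset."""
    total_seconds = episodes * seconds_per_episode
    return {
        "total_seconds": total_seconds,
        "total_minutes": total_seconds / 60,
        "total_frames": int(total_seconds * hz),
    }


budget = dataset_budget(EPISODES, SECONDS_PER_EPISODE, CONTROL_HZ)
print(budget)
```

Under these assumptions the whole dataset is about 15 minutes of demonstration, which is worth keeping in mind when reading the Risk section.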

## Variables
- IV: fine-tune (none / LoRA / full)
- DV: success rate over 20 eval episodes (different scenes)
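
With only 20 eval episodes, the measured success rate is noisy. A minimal sketch of a 95% Wilson score interval for a binomial success rate is below; the choice of the Wilson interval (rather than some other interval) is an assumption of this sketch, not something the plan specifies.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


# The two thresholds from the hypothesis, expressed as counts out of 20.
for successes in (12, 4):  # 12/20 = 0.6, 4/20 = 0.2
    low, high = wilson_interval(successes, 20)
    print(f"{successes}/20: [{low:.2f}, {high:.2f}]")
```

At 20 episodes the interval is roughly 0.4 wide in total: about 0.39 to 0.78 for 12/20 and about 0.08 to 0.42 for 4/20. A result that lands exactly on the hypothesized thresholds is therefore only weakly separated, and a larger gap or more eval episodes would make the comparison more convincing.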

## Experiments
1. π0.7 zero-shot (baseline)
2. π0.7 + LoRA r32 on 30 episodes
3. (stretch) π0.7 + LoRA r128
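
To see what moving from r32 to r128 changes, recall that LoRA adds two low-rank matrices per adapted weight, so a `d_out × d_in` weight gains `r * (d_in + d_out)` trainable parameters. The sketch below applies that formula; the hidden size, layer count, and number of adapted projections are hypothetical placeholders, not the real π0.7 shapes.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # A is (rank x d_in), B is (d_out x rank).
    return rank * (d_in + d_out)


# ASSUMPTION: hypothetical model shape, for illustration only.
HIDDEN = 2048           # width of each adapted projection
LAYERS = 18             # number of transformer blocks
ADAPTED_PER_LAYER = 4   # e.g. four square attention projections per block


def total_lora_params(rank: int) -> int:
    """Total LoRA parameters across all adapted matrices in the toy model."""
    return LAYERS * ADAPTED_PER_LAYER * lora_params(HIDDEN, HIDDEN, rank)


for rank in (32, 128):
    print(f"r{rank}: {total_lora_params(rank):,} trainable parameters")
```

Whatever the true shapes are, the adapter size scales linearly with rank, so r128 trains four times as many adapter parameters as r32 on the same set of layers.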

## Compute
- 1× H100, ~3 hours per fine-tune.

## Risk
- 30 episodes might be insufficient. Kill criterion: if LoRA r32 reaches a success rate < 0.3, try r128 or collect more data.
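
Writing the kill criterion down as a small decision rule before running anything makes it harder to move the goalposts afterwards. This is a minimal sketch; the function name and the returned strings are illustrative, and only the 0.3 threshold comes from the plan.

```python
def kill_criterion(r32_success_rate: float, threshold: float = 0.3) -> str:
    """Decide the next step from the LoRA r32 success rate (threshold from the plan)."""
    if not 0.0 <= r32_success_rate <= 1.0:
        raise ValueError("success rate must be in [0, 1]")
    if r32_success_rate < threshold:
        return "escalate: try LoRA r128 or collect more data"
    return "continue: keep LoRA r32 as the main result"


# Example: 5 successes out of 20 eval episodes falls below the threshold.
print(kill_criterion(5 / 20))
```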

Hour 2 — Fresh-clone test (45 min)

For Week 5, the bulk of the fresh-clone time goes to downloading HuggingFace checkpoints; the value is in checking that LoRA fine-tunes converge to similar numbers from a fresh clone.
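
One way to make "similar numbers" concrete is to compare the metrics from the original run against those from the fresh clone with an explicit tolerance. The sketch below assumes each run produces a flat dictionary of metric names to floats; the metric name, the example values, and the 0.1 tolerance are all placeholders, not measured results.

```python
def compare_runs(reference: dict, fresh: dict, tol: float = 0.1) -> list:
    """Return a message for each metric that is missing or differs by more than tol."""
    failures = []
    for name, ref_value in reference.items():
        if name not in fresh:
            failures.append(f"{name}: missing in fresh-clone run")
        elif abs(fresh[name] - ref_value) > tol:
            failures.append(
                f"{name}: reference {ref_value:.2f} vs fresh {fresh[name]:.2f}"
            )
    return failures


# Placeholder numbers for illustration only.
reference_run = {"lora_r32_success": 0.75}
fresh_clone_run = {"lora_r32_success": 0.70}

problems = compare_runs(reference_run, fresh_clone_run)
print("fresh-clone test passed" if not problems else problems)
```

The tolerance should reflect eval noise: with a small number of eval episodes, run-to-run differences of several points are expected even when nothing is broken.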

LAB

Hour 3 — Week 5 retro + commit

RETRO_w5.md:

# Week 5 retro

## What I learned
1. RT-1 → π0.7 lineage: vision encoder upgrades (CNN → SigLIP → DINOv3)
   and action repr changes (discrete → continuous + flow matching) are
   the two main story arcs.
2. LoRA on a 3B-param VLA fits comfortably on 1× H100 (24 GB peak).
3. Zero-shot success rates are surprisingly consistent across VLAs (0.30–0.45).
   The pretraining mix is doing real work; the architecture is secondary.
4. After 5k LoRA steps, all VLAs converge to ~0.7-0.8 on LIBERO-Spatial.
   The architecture matters less than I thought after fine-tuning.
5. Closed-source models (Gemini Robotics, Helix) are accessed via API or
   not at all. Pragmatic: have one open-weights workhorse (π0).
6. Real-time deployment: on latency, CogACT beats π0, which beats the others;
   whole-upper-body control needs a Helix-style System 1/System 2 split.

## What still confuses me
- Why is zero-shot performance so consistent across architectures? Suggests
  the bottleneck is data, not modeling.
- Flow matching vs diffusion: I see both in production. When does each win?
- How do these scale to 100B params? RT-2 was 55B; nothing released since
  has matched that size. Is that a data ceiling or a compute ceiling?

Deliverable checklist

docs/day35_track_a_design.md complete
RETRO_w5.md written
Fresh-clone test verifies HuggingFace checkpoint downloads work
All Week 5 commits pushed

Full source continues in the committed curriculum files. The v1.0 page exposes the day flow and lab surface without inventing content.
