ACT (Action Chunking Transformer) on ALOHA insertion

This is a valid v1.0 placeholder page for the later curriculum arc. Full interactive lab treatment ships after Week 1 dogfooding.

LECTURE & READING

Glossary primer (10 min)

ACT (Action Chunking Transformer) — Stanford 2023. CVAE-based transformer that predicts the next K actions per call and is deployed in a sliding window.
CVAE — Conditional Variational AutoEncoder. Training fits an encoder that maps (obs, action_chunk) → latent z and a decoder that maps (obs, z) → action_chunk. At inference, set z = 0 (the prior mean) for determinism.
K — Chunk-size hyperparameter, typically 100. Trade-off: longer chunks give smoother motion but react more slowly to new observations.
Temporal ensembling — Average overlapping chunks. The action executed at time t blends the predictions for t made by the chunks issued at t, t-1, t-2, ..., weighted exponentially.
Backbone — Vision encoder. ACT uses ResNet-18 by default.
51M parameters — ACT's typical size. Fits on a 24 GB GPU for training.
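
The temporal ensembling entry above can be made concrete with a small NumPy sketch. It assumes an array `chunks` where `chunks[s, i]` is the action predicted at step s for step s + i, and it weights predictions by exp(-coeff * i) with i = 0 for the oldest one. The function name, the array layout, and the default `coeff = 0.01` are illustrative assumptions, not a copy of the LeRobot code.

```python
import numpy as np


def temporal_ensemble(chunks: np.ndarray, t: int, coeff: float = 0.01) -> np.ndarray:
    """Blend every stored prediction for timestep t into one action.

    chunks has shape (T, K, action_dim); chunks[s, i] is the action that the
    chunk issued at step s predicted for step s + i (assumed layout).
    """
    _, chunk_len, _ = chunks.shape
    # Chunks issued at steps t-K+1 .. t all cover timestep t (oldest first).
    starts = np.arange(max(0, t - chunk_len + 1), t + 1)
    preds = chunks[starts, t - starts]  # shape (n_overlapping, action_dim)
    # Exponential weights; index 0 belongs to the oldest prediction.
    weights = np.exp(-coeff * np.arange(len(starts)))
    weights /= weights.sum()
    return weights @ preds


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo_chunks = rng.normal(size=(10, 4, 2))  # T=10 steps, K=4, 2-D actions
    print(temporal_ensemble(demo_chunks, t=5))
```

With `coeff = 0` the blend is a plain average of the overlapping predictions; a larger positive `coeff` shifts weight toward older chunks, which smooths motion at the cost of slower reaction to new observations.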

Real-world analogy

ACT is "watch the master, then plan ahead — predict the whole next move (e.g. the entire reach-grasp-insert sequence) in one go, instead of just the next 1/30th of a second."

Hour 1 — Reading

ACT paper, full sections 1–3 (~30 min): https://tonyzhaozh.github.io/aloha/
ALOHA paper revisited (already partly read): https://arxiv.org/abs/2304.13705
LeRobot ACT tutorial blog post (~15 min): https://huggingface.co/blog/lerobot-act

Hour 2 — Read the LeRobot ACT implementation

Open ~/robo47-il/.venv/lib/python3.12/site-packages/lerobot/policies/act/modeling_act.py. Read for ~30 min. Find:
The CVAE encoder and decoder transformers.
Where chunk_size=100 is consumed.
The temporal ensembling logic in select_action.
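
One way to speed up the hunt through the file is to list the line numbers where likely keywords appear. The sketch below is a plain text scan; the helper name `find_mentions` and the search terms are hypothetical choices, and nothing here assumes particular identifiers exist in `modeling_act.py`.

```python
from collections.abc import Iterable
from pathlib import Path


def find_mentions(lines: Iterable[str], needles: tuple[str, ...]) -> dict[str, list[int]]:
    """Return the 1-based line numbers on which each needle occurs."""
    hits: dict[str, list[int]] = {needle: [] for needle in needles}
    for lineno, line in enumerate(lines, start=1):
        for needle in needles:
            if needle in line:
                hits[needle].append(lineno)
    return hits


if __name__ == "__main__":
    # Path copied from the reading instructions; adjust if your venv differs.
    source = (
        Path.home()
        / "robo47-il/.venv/lib/python3.12/site-packages"
        / "lerobot/policies/act/modeling_act.py"
    )
    # Search terms are guesses at useful anchors, not guaranteed identifiers.
    terms = ("chunk_size", "select_action", "vae", "ensembl")
    if source.exists():
        with source.open(encoding="utf-8") as handle:
            for term, found in find_mentions(handle, terms).items():
                print(f"{term}: {found[:10]}")
    else:
        print(f"not found: {source}")
```

Jump to the reported lines in your editor and read outward from there rather than scrolling the file top to bottom.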

LAB

Hour 3 — Lab: train ACT, eval, beat the BC baseline (90 min)

What you're building. Train ACT on the same ALOHA insertion dataset for 20k steps (≈45 min on 1× H100). Compare directly against Day 15's BC baseline.

What success looks like at the end. You have:
1. ACT checkpoint at runs/act_aloha/checkpoints/last/pretrained_model/.
2. Eval success rate ≈ 0.70–0.95 (vs BC's 0.05–0.30), clearing the >0.5 win condition.
3. Eval video showing smooth, deliberate insertions.
4. Side-by-side comparison plot figures/day16_act_vs_bc.png.

Step 1 — Train ACT (45 min)

cd ~/robo47-il
source .venv/bin/activate

lerobot-train \
  --policy.type=act \
  --dataset.repo_id=lerobot/aloha_sim_insertion_human \
  --env.type=aloha \
  --env.task=AlohaInsertion-v0 \
  --batch_size=8 \
  --steps=20000 \
  --eval_freq=5000 \
  --save_freq=5000 \
  --output_dir=runs/act_aloha \
  --wandb.enable=true \
  --wandb.project=robo47 \
  --seed=1
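
As a sanity check on the flags above, the training budget can be worked out directly. This sketch assumes a checkpoint and an evaluation are triggered whenever the step count is a multiple of `save_freq` or `eval_freq` respectively, and it takes the ≈45 min wall-clock figure from the lab description; the helper name is hypothetical.

```python
def training_budget(
    steps: int,
    batch_size: int,
    save_freq: int,
    eval_freq: int,
    wall_clock_min: float,
) -> dict[str, object]:
    """Derive simple totals from the training flags (assumed semantics)."""
    return {
        # Each optimisation step consumes one batch.
        "samples_seen": steps * batch_size,
        # Assumes triggers fire on exact multiples of the frequency.
        "checkpoint_steps": list(range(save_freq, steps + 1, save_freq)),
        "eval_steps": list(range(eval_freq, steps + 1, eval_freq)),
        # Throughput needed to finish within the stated wall-clock time.
        "steps_per_sec": steps / (wall_clock_min * 60),
    }


if __name__ == "__main__":
    budget = training_budget(
        steps=20_000, batch_size=8, save_freq=5_000, eval_freq=5_000, wall_clock_min=45
    )
    for key, value in budget.items():
        print(f"{key}: {value}")
```

With the values from the command this gives 160,000 samples seen and checkpoints at steps 5000, 10000, 15000, and 20000, so a run that falls well below roughly 7 steps per second will overshoot the 45-minute estimate.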

Full source continues in the committed curriculum files. The v1.0 page exposes the day flow and lab surface without inventing content.

Papers you will re-read after this

ACT — Action Chunking Transformer