RT-1, RT-2 history + survey

This is a valid v1.0 placeholder page for the later curriculum arc. Full interactive lab treatment ships after Week 1 dogfooding.

LECTURE & READING

Glossary primer (10 min)

RT-1 (Robotic Transformer 1): Google, 2022. First serious general-purpose robot transformer. 35M params, EfficientNet vision encoder, FiLM conditioning, discrete action tokens.
RT-2: Google, 2023. Co-fine-tunes a VLM (PaLI-X) on robot data. Web-scale knowledge in a VLA.
PaLM-E: Google, 2023. Embodied multimodal LM, predecessor to RT-2.
OpenX-Embodiment: 2023 collaboration pooling 1M+ trajectories from 34 research labs across 21 institutions, covering 22 embodiments. A standard pretraining dataset for VLAs.
Embodiment: Specific robot type (Franka Panda, Stretch, Jaco, etc.). Cross-embodiment policies generalize across them.
Action tokens: Discrete representation of actions (one token per action dimension, 256-bin vocabulary). RT-1 originated them; some VLAs still use them, others have moved to continuous actions.
VLM (vision-language model): LLM with image input, e.g. PaLI-X, PaLM-E. SigLIP is a common image encoder inside one. The "frozen brain" of a VLA.
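
The action-token entry above can be made concrete with a minimal sketch of uniform binning. The 256-bin vocabulary comes from the glossary; the 7-dimensional action, the [-1, 1] bounds, and the function names `tokenize` and `detokenize` are assumptions for illustration, not RT-1's actual code.

```python
NUM_BINS = 256  # per-dimension vocabulary size, from the glossary entry


def tokenize(action, low, high, num_bins=NUM_BINS):
    """Map each continuous action dimension to an integer bin in [0, num_bins - 1]."""
    tokens = []
    for value, lo, hi in zip(action, low, high):
        value = min(max(value, lo), hi)      # clip to the valid range
        scaled = (value - lo) / (hi - lo)    # normalise to [0, 1]
        tokens.append(min(int(scaled * num_bins), num_bins - 1))
    return tokens


def detokenize(tokens, low, high, num_bins=NUM_BINS):
    """Map each bin index back to the centre of its bin."""
    return [lo + (t + 0.5) / num_bins * (hi - lo)
            for t, lo, hi in zip(tokens, low, high)]


# Hypothetical 7-dimensional action with assumed bounds of [-1, 1] per dimension.
low = [-1.0] * 7
high = [1.0] * 7
action = [0.10, -0.25, 0.0, 0.5, -1.0, 1.0, 0.33]

tokens = tokenize(action, low, high)
recovered = detokenize(tokens, low, high)
max_error = max(abs(r - a) for r, a in zip(recovered, action))
print(tokens)
print(max_error)
```

The round trip is lossy: the reconstruction error per dimension is at most half a bin width, (high - low) / (2 * 256). That quantization error is one reason later models moved to continuous action heads.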

Real-world analogy

RT-1 was "let's see if a transformer can learn robot control." RT-2 was "let's see if a transformer that already knows the world can learn it faster, with semantic generalization." Every VLA since is a refinement of the RT-2 idea.

Hour 1 — RT-1 + RT-2 papers

RT-1 paper, sections 1–4 (~25 min): https://robotics-transformer1.github.io/
RT-2 paper, sections 1–4 (~30 min): https://robotics-transformer2.github.io/

Hour 2 — OpenX + survey

OpenX-Embodiment paper, abstract + Sec 3 (~25 min): https://robotics-transformer-x.github.io/
VLA survey (2024), abstract + categorization figure (~20 min): https://arxiv.org/abs/2405.14093 or https://arxiv.org/abs/2412.14058

LAB

Hour 3 — Lab: lineage table + RT-2 inference (75 min)

What you're building. A reference document mapping the VLA family tree, with architectural and parameter details. Plus: run inference on a small RT-2 variant via HuggingFace.

Step 1 — Build the lineage table (45 min)

Create docs/day29_vla_lineage.md:

# VLA family tree (April 2026)

| Model | Year | Org | Params | Backbone | Vision | Action repr | Notes |
|---|---|---|---|---|---|---|---|
| RT-1 | 2022 | Google | 35M | Transformer decoder | EfficientNet + FiLM | Discrete tokens | First serious robot transformer |
| RT-2 | 2023 | Google | 55B (PaLI-X-55B) | PaLI-X | ViT-22B | Discrete tokens | Web-scale knowledge |
| RT-2-X | 2023 | OpenX | 55B | PaLI-X | ViT-22B | Discrete tokens | Cross-embodiment via OpenX-Embodiment |
| OpenVLA | 2024 | Stanford/TRI | 7B | Llama-2-7B | DINOv2 + SigLIP | Discrete tokens | Open-weights, $20k to train |
| OpenVLA-OFT | 2025 | Stanford/TRI | 7B | Llama-2-7B | DINOv2 + SigLIP | L1 regression head | 3x faster inference |
| π0 | 2024 | Physical Intelligence | 3.3B | PaliGemma-3B | SigLIP | Continuous (flow matching) | First production VLA |
| π0.5 | 2025 | Physical Intelligence | 3.3B | PaliGemma | SigLIP | Continuous + chunks | Hierarchical (low+high level) |
| π0.6 | 2025 | Physical Intelligence | 4B | PaliGemma | DINOv3 | Continuous (flow matching) | Better vision encoder |
| π0.7 | 2026 | Physical Intelligence | 4B | PaliGemma | DINOv3 | Continuous + JEPA aux loss | Latest, generalist |
| GR00T N1 | 2025 | NVIDIA | 2B | Eagle-2 | SigLIP | Diffusion head | Humanoid-focused |
| GR00T N1.5 | 2025 | NVIDIA | 2B | Eagle-2 + skill tokens | SigLIP | Diffusion head | Skill abstraction |
| GR00T N1.6 | 2026 | NVIDIA | 3B | Eagle-2 | DINOv3 | Diffusion head | World-model-aware |
| Helix | 2025 | Figure | 7B (S2 VLM) + 80M (S1 policy) | Custom | Custom | Continuous | Whole-upper-body, real-time (S1 at 200 Hz) |
| Gemini Robotics | 2025 | Google DeepMind | "Gemini-2.0-class" | Gemini-2.0 | Native | Continuous chunks | Gemini Robotics-ER variant for embodied reasoning (ER) |
| Gemini Robotics-ER 1.6 | 2026 | Google DeepMind | "Gemini-2.5-class" | Gemini-2.5 | Native | Continuous chunks | Apr 14, 2026 release |
| RDT-1B | 2024 | THU/SAIL | 1B | DiT (diffusion) | SigLIP | Diffusion in joint space | Bimanual specialist |
| CogACT | 2024 | THU | 0.6B | Cog-style transformer | SigLIP | Continuous chunks | Lightweight |
| SmolVLA | 2025 | HF | 450M | SmolVLM-2 | SigLIP | Continuous (flow matching) | Consumer-GPU fine-tune |

## Architectural trends 2024 → 2026

1. **Vision encoder upgrade**: SigLIP → DINOv3.
2. **Action repr**: discrete tokens → continuous + flow/diffusion.
3. **Action chunking**: borrowed from ACT, now universal.
4. **Auxiliary losses**: pure imitation → +JEPA (π0.7) +world-model (GR00T N1.6).
5. **Real-time**: 3 Hz (RT-2) → 50–200 Hz (Helix).
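
Trends 3 and 5 interact: with action chunking, one forward pass emits several actions, so the model can run far slower than the control loop it feeds. The sketch below works through that arithmetic. The 50 Hz and 200 Hz control rates come from the list above; the chunk size of 16 and both function names are assumptions for illustration, and the calculation assumes every action in a chunk is executed open-loop before the next forward pass.

```python
def required_inference_hz(control_hz, chunk_size):
    """Forward passes per second needed when each pass yields `chunk_size` actions."""
    return control_hz / chunk_size


def latency_budget_ms(inference_hz):
    """Maximum time one forward pass may take at the given inference rate."""
    return 1000.0 / inference_hz


CHUNK_SIZE = 16  # hypothetical; real models pick their own chunk length

for control_hz in (50, 200):
    hz = required_inference_hz(control_hz, CHUNK_SIZE)
    print(f"{control_hz} Hz control, chunk {CHUNK_SIZE}: "
          f"{hz:.3f} forward passes/s, {latency_budget_ms(hz):.0f} ms per pass")

# Without chunking (chunk size 1) the model must run at the full control rate.
no_chunk_budget = latency_budget_ms(required_inference_hz(50, 1))
print(f"50 Hz control, no chunking: {no_chunk_budget:.0f} ms per pass")
```

Under these assumptions a model that manages only about 3 forward passes per second can still drive a 50 Hz loop, whereas without chunking the same loop leaves 20 ms per pass. Schemes that re-plan before a chunk is exhausted need proportionally more inference.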

Full source continues in the committed curriculum files. The v1.0 page exposes the day flow and lab surface without inventing content.


Papers you will re-read after this

Open X-Embodiment / RT-X
RT-2: web knowledge transfers to control