Week 5: VLA Architectures, Day 29
RT-1, RT-2 history + survey
This is a valid v1.0 placeholder page for the later curriculum arc. Full interactive lab treatment ships after Week 1 dogfooding.
LECTURE & READING
Glossary primer (10 min)
- RT-1 (Robotic Transformer 1) — Google 2022. First serious general-purpose robot transformer. 35M params, EfficientNet vision encoder, FiLM conditioning, discrete action tokens.
- RT-2 — Google 2023. Co-fine-tunes a Vision-Language Model (VLM), PaLI-X, on robot data. Web-scale knowledge in a manipulation policy.
- PaLM-E — Google 2023. Multimodal embodied LM, predecessor to RT-2.
- OpenX-Embodiment — 2023 collaboration: 1M+ robot trajectories covering 22 embodiments, pooled from existing datasets across 21 institutions. Standard pretraining dataset.
- Embodiment — Specific robot type (Franka Panda, Stretch, Jaco, etc.). Cross-embodiment policies generalize across several robot types.
- Action tokens — Discrete representation of actions (one token per action dimension, each drawn from a 256-bin vocabulary). RT-1 introduced them; some VLAs still use them, while others have moved to continuous outputs.
- Vision-Language Model (VLM) — LLM with image input, e.g. PaLI-X and PaLM-E (SigLIP is a vision encoder often used inside one). The pretrained "brain" of a Vision-Language-Action model (VLA); some VLAs keep it frozen, others fine-tune it.
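The "one token per dimension, 256-vocab" idea from the action tokens entry can be sketched in a few lines. This is a minimal illustration assuming uniform binning between per-dimension bounds; the function names, the bounds, and the example action are hypothetical, and the exact binning used by RT-1 or any other model should be checked against its paper or code.

```python
from typing import List, Sequence

NUM_BINS = 256  # vocabulary size per action dimension, as in the glossary entry


def tokenize_action(action: Sequence[float], low: Sequence[float],
                    high: Sequence[float], num_bins: int = NUM_BINS) -> List[int]:
    """Map each continuous action dimension to an integer bin in [0, num_bins - 1].

    Assumes uniform bins between low[i] and high[i] (an illustrative choice).
    """
    tokens = []
    for value, lo, hi in zip(action, low, high):
        clipped = min(max(value, lo), hi)          # keep the value inside the bounds
        fraction = (clipped - lo) / (hi - lo)      # normalize to [0, 1]
        tokens.append(min(int(fraction * num_bins), num_bins - 1))
    return tokens


def detokenize_action(tokens: Sequence[int], low: Sequence[float],
                      high: Sequence[float], num_bins: int = NUM_BINS) -> List[float]:
    """Map each token back to the center of its bin."""
    return [lo + (t + 0.5) * (hi - lo) / num_bins
            for t, lo, hi in zip(tokens, low, high)]


if __name__ == "__main__":
    # Hypothetical 3-dimensional action with bounds [-1, 1] on every dimension.
    low, high = [-1.0] * 3, [1.0] * 3
    tokens = tokenize_action([-1.0, 0.0, 1.0], low, high)
    print(tokens)                                  # [0, 128, 255]
    print(detokenize_action(tokens, low, high))    # values within half a bin of the input
```

The round trip loses at most half a bin width per dimension, which is the precision cost that continuous action heads (flow matching, diffusion, L1 regression) are meant to avoid.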
Real-world analogy
RT-1 was "let's see if a transformer can learn manipulation." RT-2 was "let's see if a transformer that already knows the world can learn manipulation faster, with semantic generalization." Every Vision-Language-Action model (VLA) since is a refinement of the RT-2 idea.
Hour 1 — RT-1 + RT-2 papers
- RT-1 paper, sections 1–4 (~25 min): https://robotics-transformer1.github.io/
- RT-2 paper, sections 1–4 (~30 min): https://robotics-transformer2.github.io/
Hour 2 — OpenX + survey
- OpenX-Embodiment paper, abstract + Sec 3 (~25 min): https://robotics-transformer-x.github.io/
- VLA Survey (2024) — abstract + categorization figure (~20 min): https://arxiv.org/abs/2405.14093 or https://arxiv.org/abs/2412.14058
LAB
Hour 3 — Lab: lineage table + RT-2 inference (75 min)
What you're building. A reference document mapping the VLA family tree with architectural and parameter details. Plus: run inference on a small RT-2 variant via HuggingFace.
Step 1 — Build the lineage table (45 min)
Create docs/day29_vla_lineage.md:
# VLA family tree (April 2026)
| Model | Year | Org | Params | Backbone | Vision | Action repr | Notes |
|---|---|---|---|---|---|---|---|
| RT-1 | 2022 | Google | 35M | Decoder-only Transformer | EfficientNet + FiLM | Discrete tokens | First serious robot transformer |
| RT-2 | 2023 | Google | 55B (PaLI-X-55B) | PaLI-X | ViT-22B | Discrete tokens | Web-scale knowledge |
| RT-2-X | 2023 | Google / OpenX | 55B | PaLI-X | ViT-22B | Discrete tokens | Cross-embodiment via OpenX-Embodiment |
| OpenVLA | 2024 | Stanford/TRI | 7B | Llama-2-7B | DINOv2 + SigLIP | Discrete tokens | Open-weights, $20k to train |
| OpenVLA-OFT | 2025 | Stanford/TRI | 7B | Llama-2-7B | DINOv2 + SigLIP | L1 regression head | 3x faster inference |
| π0 | 2024 | Physical Intelligence | 3.3B | PaliGemma-3B | SigLIP | Continuous (flow matching) | First production VLA |
| π0.5 | 2025 | Physical Intelligence | 3.3B | PaliGemma | SigLIP | Continuous + chunks | Hierarchical (low+high level) |
| π0.6 | 2025 | Physical Intelligence | 4B | PaliGemma | DINOv3 | Continuous (flow matching) | Better vision encoder |
| π0.7 | 2026 | Physical Intelligence | 4B | PaliGemma | DINOv3 | Continuous + JEPA aux loss | Latest, generalist |
| GR00T N1 | 2025 | NVIDIA | 2B | Eagle-2 | SigLIP | Diffusion head | Humanoid-focused |
| GR00T N1.5 | 2025 | NVIDIA | 2B | Eagle-2 + skill tokens | SigLIP | Diffusion head | Skill abstraction |
| GR00T N1.6 | 2026 | NVIDIA | 3B | Eagle-2 | DINOv3 | Diffusion head | World-model-aware |
| Helix | 2025 | Figure | 7B (S2) + 80M (S1) | Custom dual-system (VLM + visuomotor transformer) | Custom | Continuous, 200 Hz control | Whole-upper-body, real-time |
| Gemini Robotics | 2025 | Google DeepMind | "Gemini-2.0-class" | Gemini-2.0 | Native | Continuous chunks | Gemini Robotics-ER variant for embodied reasoning |
| Gemini Robotics-ER 1.6 | 2026 | Google DeepMind | "Gemini-2.5-class" | Gemini-2.5 | Native | Continuous chunks | Apr 14, 2026 release |
| RDT-1B | 2024 | THU/SAIL | 1B | DiT (diffusion) | SigLIP | Diffusion in joint space | Bimanual specialist |
| CogACT | 2024 | MSRA/THU | 7B + DiT action module | Llama-2-7B (Prismatic VLM) | DINOv2 + SigLIP | Diffusion (continuous chunks) | Componentized action module |
| SmolVLA | 2025 | HF | 0.45B | SmolVLM2 | SigLIP | Continuous (flow matching) | Consumer-GPU fine-tune |
## Architectural trends 2024 → 2026
1. **Vision encoder upgrade**: SigLIP → DINOv3.
2. **Action repr**: discrete tokens → continuous + flow/diffusion.
3. **Action chunking**: borrowed from ACT, now universal.
4. **Auxiliary losses**: pure imitation → +JEPA (π0.7) +world-model (GR00T N1.6).
5. **Real-time**: 3 Hz (RT-2) → 50–200 Hz (Helix).

Full source continues in the committed curriculum files. The v1.0 page exposes the day flow and lab surface without inventing content.