COMPUTER-VISION | FOUNDATIONAL | 2023-07-28

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Lisa Lee, Tsang-Wei Edward Lee, Sergey Levine, Yao Lu, Henryk Michalewski, Igor Mordatch, Karl Pertsch, Kanishka Rao, Krista Reymann, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Pierre Sermanet, Jaspiar Singh, Anikait Singh, Radu Soricut, Huong Tran, Vincent Vanhoucke, Quan Vuong, Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Jialin Wu, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, Brianna Zitkovich

ARCHITECTURE: VLA (vision-language-action model)
ROBOT: not specified in abstract
KEY METRIC: 6k evaluation trials
TASK: manipulation, grasping

RT-2 is a vision-language-action (VLA) model: it takes images and language as input and outputs robot actions. It teaches robots to understand and execute complex commands by drawing on the knowledge already captured in internet-scale vision-language models. The core idea is simple: represent robot actions as text tokens and train on them alongside natural language. The resulting robot can follow commands it never saw during training, such as "pick up the extinct animal" (it picks up a dinosaur toy) or "find something to use as a hammer" (it reasons about what makes a good hammer and picks a rock). This matters because previous robot learning systems were islands of knowledge, unable to generalize beyond their narrow training data. RT-2 opens the door to robots that can reason semantically and adapt to novel situations by inheriting the common-sense understanding that vision-language models learned from billions of images and text snippets on the web.

THE PROBLEM

Before RT-2, robot learning faced a fundamental generalization problem. Systems like RT-1 (its predecessor) were trained on specific robot trajectories in controlled environments. They learned mappings from image to action but couldn't reason about what they were seeing or why. When presented with a novel object or a command that didn't appear in the training data, they often failed outright. The core limitation: robots had to memorize every scenario. Vision-language models (like CLIP, PaLM-E, PaLI-X) addressed this for perception and language understanding through massive pretraining, but nobody had figured out how to meaningfully combine that semantic understanding with robot control. Previous attempts either treated vision-language models as separate perception modules (losing the reasoning benefits) or used hybrid approaches that didn't exploit the full power of the pretrained models. The gap was concrete and costly: a robot trained to pick up red cups wouldn't pick up a red mug, even though a human instantly understands the semantic similarity.

HOW IT WORKS

1. Represent Actions as Language Tokens

The key insight: instead of outputting robot commands as continuous values (joint angles, coordinates), express them as text tokens, just like natural language words. A single robot action, say a 7-dimensional arm movement plus an episode-termination flag, becomes a sequence like "1 128 91 241 5 101 127 217". Each number is a discrete token ID obtained by quantizing one dimension of the original continuous action space (the RT-2 paper uses 256 uniform bins per dimension). This unified representation means the model treats actions and language as the same kind of thing. Why this matters: large language models are extremely good at predicting sequences of tokens. By making actions tokens, you can reuse the architectural machinery of vision-language models without modification. There's no need for a special action head or a separate training procedure.
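The quantization step can be made concrete with a small sketch. This is not the authors' code: the per-dimension bounds, the function names, and the token layout (termination flag first, then seven continuous dimensions) are assumptions chosen for illustration. Only the idea of uniform binning into integer tokens comes from the text above.

```python
from typing import List, Sequence, Tuple

NUM_BINS = 256  # assumption for this sketch: 256 uniform bins per dimension


def tokenize_action(action: Sequence[float],
                    low: Sequence[float],
                    high: Sequence[float],
                    terminate: bool = False) -> str:
    """Map a continuous action to a space-separated string of integer tokens."""
    tokens: List[int] = [int(terminate)]  # hypothetical layout: flag first
    for value, lo, hi in zip(action, low, high):
        clipped = min(max(value, lo), hi)           # keep value inside the bounds
        frac = (clipped - lo) / (hi - lo)           # normalize to [0, 1]
        tokens.append(int(round(frac * (NUM_BINS - 1))))  # nearest bin index
    return " ".join(str(t) for t in tokens)


def detokenize_action(text: str,
                      low: Sequence[float],
                      high: Sequence[float]) -> Tuple[bool, List[float]]:
    """Invert tokenize_action, up to quantization error."""
    ids = [int(t) for t in text.split()]
    terminate = bool(ids[0])
    action = [lo + (i / (NUM_BINS - 1)) * (hi - lo)
              for i, lo, hi in zip(ids[1:], low, high)]
    return terminate, action


if __name__ == "__main__":
    low, high = [-1.0] * 7, [1.0] * 7  # illustrative bounds, not from the paper
    text = tokenize_action([0.3, -0.7, 0.0, 0.9, -0.2, 0.55, 1.0], low, high)
    print(text)
    print(detokenize_action(text, low, high))
```

With this scheme the round-trip error in each dimension is at most half a bin width, (high - low) / (2 * 255), which is the precision the model gives up in exchange for treating actions as ordinary tokens.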

2. Co-Fine-Tune on Robot and Web Data Together

Take a pretrained vision-language model (VLM), here PaLM-E with 12B parameters or PaLI-X with 55B, and jointly fine-tune it on two data streams: (1) robot trajectory data (image + language command + action tokens), and (2) internet-scale vision-language tasks like visual question answering (VQA). The co-fine-tuning strategy keeps some of the original vision and language training data in the mix so the model doesn't "forget" its web knowledge while learning robot control. Why this matters: this is much simpler than previous approaches, and it works because the model's vision and language understanding are now directly connected to action generation. The robot learns not just how to move, but why it's moving, based on semantic understanding. The model sees that "pick up the extinct animal" requires identifying a dinosaur, and that knowledge transfers from VQA training, where the model learned to recognize dinosaurs in images.
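A minimal sketch of the data-mixing idea, under stated assumptions: the prompt template, the `robot_fraction` parameter, and both helper names are hypothetical, and images are omitted to keep the example text-only. The point is only that robot examples and web examples share one (prompt, target) format and are sampled into the same batch.

```python
import random
from typing import Dict, List


def make_robot_example(instruction: str, action_tokens: str) -> Dict[str, str]:
    """Format a robot step like a VQA pair; the template is an assumption."""
    return {
        "prompt": f"Q: what action should the robot take to {instruction}? A:",
        "target": action_tokens,
    }


def mixed_batch(robot_data: List[Dict[str, str]],
                web_data: List[Dict[str, str]],
                batch_size: int,
                robot_fraction: float,
                rng: random.Random) -> List[Dict[str, str]]:
    """Sample a batch where each slot is robot data with prob. robot_fraction."""
    batch = []
    for _ in range(batch_size):
        source = robot_data if rng.random() < robot_fraction else web_data
        batch.append(rng.choice(source))
    return batch


if __name__ == "__main__":
    robot = [make_robot_example("pick up the cup", "0 128 91 241 5 101 127 217")]
    web = [{"prompt": "Q: what animal is in the image? A:", "target": "a dinosaur"}]
    # robot_fraction=0.5 is an illustrative value, not one reported in the source.
    for example in mixed_batch(robot, web, 4, 0.5, random.Random(0)):
        print(example)
```

Because both streams produce the same kind of training pair, the same next-token loss applies to every element of the batch, which is what lets web knowledge and action prediction share weights.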

3. Evaluate Emergent Semantic Reasoning

The researchers ran 6,000 evaluation trials, covering three categories of emergent behaviors that the robot was never explicitly trained on: (1) symbol understanding (place an object on the number 5 or on the red icon), (2) reasoning tasks (pick the smallest object, or the one closest to another object), and (3) chain-of-thought multi-step reasoning (find an improvised hammer by picking a rock, or suggest an energy drink for someone who is tired). These aren't programmed behaviors; they emerge from the model's learned semantic understanding. Why this matters: this is strong evidence that the model isn't just memorizing patterns. It applies concepts and relationships learned from internet data to robot control. This is fundamentally different from prior robot learning systems, which would simply fail on unseen task types.

4. Measure Generalization Against Baselines

RT-2 was compared head-to-head against RT-1 (the previous state of the art) and VC-1 (a vision pretraining baseline) in blind A/B studies across multiple generalization axes: novel objects, novel scenes, and novel combinations. On emergent reasoning tasks, RT-2 showed a 3x improvement over baselines. On broader generalization, the improvement was approximately 2x. Why this matters: these aren't cherry-picked examples. They're systematic measurements across thousands of trials, which is unusually rigorous for robotics research. A 2-3x improvement in generalization is a large step toward real-world deployment.
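To make the "3x" and "2x" figures concrete, the sketch below shows how such a ratio is computed from per-trial outcomes. The outcome lists are made-up placeholders, not data from the paper, and the function names are hypothetical.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of trials that succeeded."""
    if not outcomes:
        raise ValueError("need at least one trial")
    return sum(outcomes) / len(outcomes)


def improvement_ratio(model: Sequence[bool], baseline: Sequence[bool]) -> float:
    """How many times higher the model's success rate is than the baseline's."""
    base = success_rate(baseline)
    if base == 0:
        raise ValueError("baseline never succeeded; ratio is undefined")
    return success_rate(model) / base


if __name__ == "__main__":
    # Placeholder outcomes for illustration only (True = task succeeded).
    model_trials = [True] * 6 + [False] * 4
    baseline_trials = [True] * 2 + [False] * 8
    print(improvement_ratio(model_trials, baseline_trials))
```

A ratio like this depends heavily on the baseline's absolute success rate, so it is worth reading alongside the raw rates rather than on its own.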

5. Ablate Critical Design Choices

The researchers tested two hypotheses: (1) Does model size matter? They compared 5B vs. 55B parameter versions. (2) Does initialization matter? They compared training from scratch vs. fine-tuning vs. co-fine-tuning. Results showed that both pretrained weights and larger model size significantly boost performance. Why this matters: it indicates that the improvements come specifically from internet-scale pretraining, not just from more data or a better architecture. It's a clear validation that the approach is sound and not a lucky artifact.

KEY RESULTS

Emergent Semantic Reasoning Improvement: 3x better than RT-1 and VC-1

vs. RT-1 (previous SOTA) and VC-1 (vision pretraining baseline)

On tasks the robot never trained on (symbol understanding, reasoning, chain-of-thought), RT-2 succeeded where baselines failed. This 3x improvement is the clearest evidence that knowledge from internet-scale training transfers to robot reasoning.

Novel Object Generalization: ~2x improvement

vs. RT-1 and other baselines across all generalization axes

When shown objects it hadn't seen during training, RT-2 performed roughly twice as well. This is the practical metric that matters for real-world deployment, because robots in factories and homes encounter novel objects constantly.

Total Evaluation Trials: 6,000

vs. typical robotics papers with hundreds of trials

The sheer scale of evaluation (6k trials) gives statistical confidence that these results aren't noise. Robotics is notoriously noisy; this level of rigor is rare and reassuring.
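The effect of trial count on statistical confidence can be illustrated with a Wilson score interval for a success rate. This is a generic statistics sketch, not an analysis from the paper; the 60% success rate and the trial counts passed in are illustrative assumptions.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Same observed 60% success rate at two very different trial counts.
    print(wilson_interval(30, 50))
    print(wilson_interval(3600, 6000))
```

The interval narrows roughly with the square root of the number of trials. Note that a total of 6k trials is split across many tasks and conditions, so the interval for any single condition is wider than the one for the pooled count.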

Model Size Impact: 55B parameters >> 5B parameters

Ablation comparing the PaLI-X 55B and 5B variants

Larger pretrained models show significantly better transfer. This suggests that the benefit of web-scale knowledge grows with model capacity: a larger model retains more of that knowledge and generalizes better as a robot policy.

WHY DEVELOPERS SHOULD CARE

If you're building robotics software, RT-2 changes the starting point. Before, you'd train a robot model on thousands of hand-annotated demonstrations, and it would work only in the narrow slice of the world it was trained on. Now, you can use pretrained vision-language models to give your robot common-sense reasoning "for free." The practical implication: instead of collecting 10,000 robot trajectories to teach a robot to pick up cups, you might collect 1,000 and get better generalization because the model learned about cup-ness from the internet. More importantly, RT-2 demonstrates that robots can reason semantically: they understand not just "move arm to position (0.5, 0.3, 0.2)" but "pick up the thing that would make a good hammer." This opens doors to robots understanding user intent in natural language, adapting to novel scenarios, and operating in unstructured real-world environments. For a developer, the lesson is: stop thinking of vision, language, and action as separate problems. Unify them through a shared token representation, build on existing pretrained models, and let the network discover the semantic connections. This is how you get emergent capabilities you didn't explicitly program.

LIMITATIONS

RT-2 still has significant gaps. The robot trajectories used were from relatively constrained manipulation tasks (grasping, object placement in table-top environments), so it's unclear how well this generalizes to locomotion, navigation, or dynamic tasks. The action tokenization is lossy: by converting continuous control into discrete tokens, the model loses fine-grained precision, which could be problematic for tasks requiring delicate manipulation. The chain-of-thought reasoning, while impressive, is still rudimentary. The model can pick a rock as a hammer, but it's not clear how well it would handle truly complex multi-step planning or recovery from failure. There's also a deployment question: these models have 12-55B parameters, which is computationally expensive for edge robots. Finally, the evaluation, while extensive, is still in simulation or controlled lab settings; real-world robustness at scale remains unproven.
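A back-of-the-envelope calculation shows why 12-55B parameters is a deployment concern. The sketch below counts weight storage only; the 2-bytes-per-parameter figure (16-bit weights) is an assumption, and activations, caches, and runtime overhead are ignored.

```python
def weight_memory_gb(num_params: int, bytes_per_param: int = 2) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    for name, params in [("12B model", 12_000_000_000), ("55B model", 55_000_000_000)]:
        print(name, weight_memory_gb(params), "GB at 16-bit precision")
```

Under these assumptions the weights alone come to about 24 GB and 110 GB respectively, which is well beyond what typical on-robot compute can hold.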

WHAT COMES NEXT

The next frontier is scaling the RT approach to locomotion and more complex manipulation, combining it with fuller chain-of-thought planning (where the robot generates intermediate reasoning steps), and deploying it to real robots in unstructured environments. A future RT-3 would likely add: (1) real-time online learning, where robots update their understanding as they encounter novel objects, (2) multi-modal reasoning (interpreting gestures and tone of voice, not just language commands), (3) failure recovery and self-correction (when a grasp fails, the robot reasons about why and adjusts), and (4) better integration with world models so the robot can plan multiple steps ahead. The long-term vision is a universal robot brain that understands language, vision, and physical consequence as deeply as humans do, and RT-2 is a critical stepping stone showing that internet-scale pretraining is a promising path forward.
