COMPUTER-VISION | FOUNDATIONAL | 2023-07-28

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Lisa Lee, Tsang-Wei Edward Lee, Sergey Levine, Yao Lu, Henryk Michalewski, Igor Mordatch, Karl Pertsch, Kanishka Rao, Krista Reymann, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Pierre Sermanet, Jaspiar Singh, Anikait Singh, Radu Soricut, Huong Tran, Vincent Vanhoucke, Quan Vuong, Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Jialin Wu, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, Brianna Zitkovich

ARCHITECTURE

VLA (vision-language-action model)

ROBOT

not specified in abstract

KEY METRIC

TASK

manipulation, grasping

RT-2 is a vision-language-action model that teaches robots to understand and execute complex commands by drawing on the knowledge stored in internet-scale vision-language models. The core idea is simple: represent robot actions as text tokens, train them alongside natural language, and the robot can follow commands it never saw during robot training, such as "pick up the extinct animal" (it picks up a dinosaur toy) or "find something to use as a hammer" (it reasons about what makes a good hammer and picks a rock). This matters because previous robot learning systems were islands of knowledge, unable to generalize beyond their narrow training data. RT-2 opens the door to robots that can reason semantically and adapt to novel situations by inheriting the common-sense understanding that vision-language models learned from billions of images and text snippets on the web.

THE PROBLEM

Before RT-2, robot learning faced a fundamental problem. Systems like RT-1 (its predecessor) were trained on specific trajectories in controlled environments: they learned mappings from "image → action" but couldn't reason about what they were seeing or why. When presented with a novel object or a command that didn't appear in the training data, they often failed. The core limitation: robots effectively had to memorize every scenario. Vision-language models (like CLIP, PaLM-E, PaLI-X) addressed this for vision and language understanding through massive pretraining, but nobody had figured out how to meaningfully combine that semantic understanding with robot control. Previous attempts either treated vision-language models as separate modules (losing the reasoning benefits) or used hybrid approaches that didn't exploit the full power of the pretrained models. The gap was concrete and costly: a robot trained to pick up red cups wouldn't pick up a red mug, even though a human instantly understands the semantic similarity.

HOW IT WORKS

Represent Actions as Language Tokens

The key insight: instead of outputting robot commands as traditional continuous values (joint angles, end-effector coordinates), express them as text tokens, just like natural language words. A single action (say, a 7-dimensional arm movement plus an episode-termination flag) becomes a sequence like "1 128 91 241 5 101 127 217". Each number is a discrete bin index, quantized from the original continuous space and emitted as a token from the model's vocabulary. This unified representation means the model treats actions and language as the same kind of thing. Why this matters: large language models are extremely good at predicting sequences of tokens. By making actions tokens, you can reuse the architectural machinery of vision-language models without modification. There's no need for a special "action head" or a separate training procedure.
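The tokenization step can be sketched as below. This is a minimal illustration, not the authors' code: the bin count of 256, the per-dimension range of [-1, 1], the leading termination flag, and the function names `tokenize_action` and `detokenize_action` are all assumptions made for the example.

```python
import numpy as np

NUM_BINS = 256  # assumed number of uniform bins per action dimension


def tokenize_action(action, low, high, terminate=False):
    """Map a continuous action vector to a space-separated string of bin indices."""
    action = np.asarray(action, dtype=np.float64)
    low = np.asarray(low, dtype=np.float64)
    high = np.asarray(high, dtype=np.float64)
    # Normalize each dimension to [0, 1], clipping out-of-range values.
    unit = np.clip((action - low) / (high - low), 0.0, 1.0)
    # Uniform binning; the upper edge of the range falls into the last bin.
    bins = np.minimum((unit * NUM_BINS).astype(int), NUM_BINS - 1)
    # Hypothetical layout: termination flag first, then one index per dimension.
    return " ".join(str(t) for t in [int(terminate), *bins.tolist()])


def detokenize_action(text, low, high):
    """Invert tokenize_action, decoding each bin index to its bin center."""
    ids = [int(t) for t in text.split()]
    terminate = bool(ids[0])
    bins = np.array(ids[1:], dtype=np.float64)
    low = np.asarray(low, dtype=np.float64)
    high = np.asarray(high, dtype=np.float64)
    action = low + (bins + 0.5) / NUM_BINS * (high - low)
    return terminate, action


low, high = -np.ones(7), np.ones(7)  # assumed action limits
text = tokenize_action([0.0, -1.0, 1.0, 0.5, -0.5, 0.25, 0.0], low, high)
print(text)  # 0 128 0 255 192 64 160 128
```

Because the output is an ordinary string, it can be used as the training target of any text-generating model; the robot controller only needs `detokenize_action` to turn the generated string back into a command.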

Co-Fine-Tune on Robot and Web Data Together

Take a pretrained vision-language model (PaLM-E with 12B parameters or PaLI-X with 55B) and jointly fine-tune it on two data streams: (1) robot data (image + language command + action tokens), and (2) internet-scale vision-language tasks like visual question answering. The co-fine-tuning strategy keeps some of the original vision and language data in the mix so the model doesn't "forget" its web knowledge while learning robot control. Why this matters: this is simpler than previous approaches, and it works because the model's vision and language understanding are now directly connected to action generation. The model learns not just how to move, but *why* it's moving, based on semantic understanding. It sees that "pick up the extinct animal" requires identifying a dinosaur, and that knowledge transfers from VQA, where the model learned to recognize dinosaurs in images.
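A rough sketch of the data-mixing side of co-fine-tuning follows. It only illustrates drawing each training batch from both streams at a fixed ratio; the function name `mixed_batches`, the example dictionaries, and the 50/50 `robot_fraction` are hypothetical choices for the example, not values taken from the paper.

```python
import random


def mixed_batches(robot_examples, web_examples, batch_size,
                  robot_fraction, num_batches, seed=0):
    """Yield batches that combine robot and web examples at a fixed ratio."""
    rng = random.Random(seed)
    n_robot = round(batch_size * robot_fraction)
    n_web = batch_size - n_robot
    for _ in range(num_batches):
        # Sample with replacement from each stream, then shuffle them together
        # so the model sees both kinds of targets within every batch.
        batch = rng.choices(robot_examples, k=n_robot) + rng.choices(web_examples, k=n_web)
        rng.shuffle(batch)
        yield batch


# Both streams share one format: an input prompt and a text target.
robot_examples = [
    {"source": "robot", "prompt": "pick up the cup", "target": "0 128 91 241 5 101 127 217"},
]
web_examples = [
    {"source": "web", "prompt": "What animal is in the image?", "target": "a dinosaur"},
]

for batch in mixed_batches(robot_examples, web_examples,
                           batch_size=8, robot_fraction=0.5, num_batches=2):
    print([example["source"] for example in batch])
```

The point of the shared format is that no part of the training loop needs to know which stream an example came from: action strings and VQA answers are both just text targets.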

Evaluate Emergent Semantic Reasoning

As part of roughly 6,000 evaluation trials, the researchers tested three categories of emergent behaviors that the robot was never explicitly trained on: (1) Symbol understanding (place an object on the number 5 or on the red icon), (2) Reasoning tasks (pick the smallest object, or the one closest to another object), and (3) Chain-of-thought multi-step reasoning (find an improvised hammer by picking a rock, or suggest an energy drink for someone tired). These aren't programmed behaviors; they emerge from the model's learned semantic understanding. Why this matters: this is strong evidence that the model isn't just memorizing patterns. It is applying concepts and relationships learned from internet data to robot control. This is fundamentally different from prior robot policies, which would simply fail on unseen task types.

Measure Generalization Against Baselines

RT-2 was compared head-to-head against RT-1 (the previous state-of-the-art) and VC-1 (a vision pretraining baseline) in blind A/B studies across multiple axes: novel objects, novel scenes, and novel combinations. On emergent reasoning tasks, RT-2 showed a 3x improvement over baselines. On broader generalization, the improvement was approximately 2x. Why this matters: these aren't cherry-picked examples but systematic measurements across thousands of trials, which is unusually thorough for robotics research. A 2-3x improvement in generalization is a large step toward real-world deployment.
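To make the "Nx improvement" figures concrete, the sketch below shows one common way to compare success rates from repeated trials and to attach an uncertainty estimate. The Wilson score interval is a standard statistical tool chosen here for illustration; the source does not say which analysis the authors used, and the trial counts are invented placeholders, not results from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a success rate (z=1.96 gives roughly 95% coverage)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def improvement_ratio(new_successes, new_trials, base_successes, base_trials):
    """Ratio of two success rates, e.g. new policy versus baseline."""
    return (new_successes / new_trials) / (base_successes / base_trials)


# Placeholder counts for illustration only (not taken from the paper).
print(improvement_ratio(50, 100, 25, 100))  # 2.0
print(wilson_interval(50, 100))
print(wilson_interval(500, 1000))  # same rate, more trials, narrower interval
```

The last two lines show why trial count matters: the same observed success rate yields a tighter interval when it is backed by ten times as many trials.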

Ablate Critical Design Choices

The researchers tested two hypotheses: (1) Does model size matter? They compared 5B vs. 55B parameter versions. (2) Does initialization matter? They compared training from scratch vs. fine-tuning vs. co-fine-tuning. Results showed that both pretrained weights and larger model size significantly boost performance. Why this matters: this indicates that the improvements don't come just from having more data or a better architecture, but specifically from leveraging internet-scale pretraining. It's a clear validation that the approach is sound and not a lucky artifact.

KEY RESULTS

Emergent Semantic Reasoning Improvement: 3x better than RT-1 and VC-1

vs. RT-1 (previous SOTA) and VC-1 (vision pretraining baseline)

On tasks the robot was never trained on (symbol understanding, reasoning, chain-of-thought), RT-2 succeeded about three times as often as the baselines. This 3x improvement is the clearest evidence that knowledge from internet-scale pretraining transfers to robot reasoning.

Novel Object Generalization: ~2x improvement

vs. RT-1 and other baselines across all generalization axes

When shown objects it hadn't seen during training, RT-2 performed roughly twice as well. This is the practical generalization that matters for real-world deployment: robots in factories and homes encounter novel objects constantly.

Total Evaluation Trials: 6,000

vs. typical robotics papers with hundreds of trials

The sheer scale of the evaluation (6k trials) gives statistical confidence that these results aren't due to chance. Robotics experiments are notoriously noisy; this level of rigor is rare and reassuring.

Model Size Impact: 55B parameters >> 5B parameters

Ablation comparing PaLI-X 55B vs. 5B variant

Larger pretrained models show significantly better transfer. This suggests that the benefit of web-scale knowledge grows with model capacity: a larger model carries more internet knowledge into robot control and generalizes better.

WHY DEVELOPERS SHOULD CARE

If you're building robotics software, RT-2 changes the starting point. Before, you'd train a model on thousands of hand-annotated demonstrations, and it would work only in the narrow slice of the world it was trained on. Now, you can leverage pretrained vision-language models to give your robot common-sense reasoning "for free." The practical implication: instead of collecting 10,000 trajectories to teach a robot to pick up cups, you might collect 1,000 and get better generalization because the model learned about cup-ness from the internet. More importantly, RT-2 demonstrates that robots can reason semantically: they understand not just "move arm to position (0.5, 0.3, 0.2)" but "pick up the thing that would make a good hammer." This opens doors to robots understanding user intent in natural language, adapting to novel scenarios, and operating in unstructured real-world environments. For a developer, the lesson is: stop thinking of vision, language, and control as separate problems. Unify them through a shared token representation, leverage existing pretrained models, and let the network discover the semantic connections. This is how you get emergent capabilities you didn't explicitly program.

LIMITATIONS

RT-2 still has significant gaps. The robot trajectories used were from relatively constrained tasks (grasping, object placement in table-top environments), so it's unclear how well this generalizes to other kinds of tasks, including dynamic ones. The tokenization is lossy: by converting continuous actions into discrete tokens, the model loses fine-grained precision, which could be problematic for tasks requiring delicate manipulation. The chain-of-thought reasoning, while impressive, is still "rudimentary": the model can pick a rock as a hammer, but it's not clear how well it would handle truly complex multi-step planning or recovery from failure. There's also a compute question: these models have 12-55B parameters, which is expensive to run on edge robots. Finally, the evaluation, while extensive, is still in controlled lab settings; real-world deployment at scale remains unproven.
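The precision loss from tokenization can be bounded with a short calculation. This is a sketch under stated assumptions: each action dimension is clipped to a range $[a_{\min}, a_{\max}]$, split into $N$ uniform bins, and decoded to the bin center. The symbols and the 10 cm example range are illustrative, not values from the paper.

```latex
% Worst-case round-trip error for one action dimension:
% a is the original value, \hat{a} the value decoded from its bin center.
\[
  |a - \hat{a}| \;\le\; \frac{a_{\max} - a_{\min}}{2N}
\]
% Example with N = 256 bins and a hypothetical 0.1 m range:
\[
  \frac{0.1\ \text{m}}{2 \cdot 256} \approx 1.95 \times 10^{-4}\ \text{m} \approx 0.2\ \text{mm}
\]
```

Under these assumptions the error per step is small relative to the range (about 0.2%), so whether it matters depends on how wide the range has to be and how precise the task is.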

WHAT COMES NEXT

The next frontier is scaling RT-2 to more complex tasks, combining it with explicit chain-of-thought (where the model generates intermediate reasoning steps), and deploying it to real robots in unstructured environments. A successor might plausibly add: (1) real-time adaptation, where robots update their understanding as they encounter novel objects, (2) multi-modal reasoning (interpreting gestures and tone of voice, not just language commands), (3) failure detection and self-correction (when a grasp fails, the robot reasons about why and adjusts), and (4) better integration with world models so the robot can plan multiple steps ahead. The long-term vision is a universal robot brain that understands language, vision, and physical consequence as deeply as humans do, and RT-2 is an important stepping stone suggesting that leveraging internet-scale pretraining is a promising path forward.
