IMITATION-LEARNING · CURRENT · 2023-10-13

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Open X-Embodiment Collaboration et al.

ARCHITECTURE

transformer (RT-X)

ROBOT

22 different robots (multi-robot)

DATASET

160266 tasks across 527 skills

TASK

manipulation

Imagine if every robot had to learn from scratch, as if each device needed its own separately trained AI model. That has been the robotics status quo: inefficient and expensive. Open X-Embodiment changes this by pooling data from 22 different robots across 21 institutions to train a single generalist policy called RT-X. The result is a transformer-based model trained on data covering 160,266 tasks across 527 skills that can transfer knowledge between robots. When RT-X is trained simultaneously on data from a UR5 arm, a Jaco manipulator, a Fetch robot, and 19 other platforms, robots that never saw certain skills in their own data still improve at performing them. This is the robotics counterpart of what happened in computer vision and NLP when large pretrained models emerged: a move from "one model per problem" to "one model for the whole domain."

THE PROBLEM

Before this work, robotic learning suffered from radical fragmentation. Every lab trained separate models for its specific robot, task, and environment. If you wanted to teach a robotic arm to pick up objects, you'd collect data on that robot, train a model, and hope it worked. If you then wanted the same arm to do something new, you'd start almost from scratch. This meant massive redundancy: researchers were independently collecting data of robots doing similar tasks, but what one robot learned about grasping never benefited another robot. Earlier large-scale work existed (such as RT-1 from Google), but it was typically limited to a handful of robots within one lab. The scaling challenge was enormous. How do you standardize data formats across 21 different institutions? How do you handle vastly different morphologies, camera angles, action spaces, and task definitions? The field lacked both the collaborative infrastructure and the empirical evidence that a single model could genuinely improve performance across diverse hardware.

HOW IT WORKS

Create a Standardized Data Format (RLDS)

The first major challenge was not algorithmic; it was organizational. Contributors agreed on a shared schema for representing robot interactions: images, proprioceptive states, action commands, and language descriptions. Each institution's data (from single arms to bimanual systems to quadrupeds) was converted into this common format, which the paper builds on RLDS, an episode-based format serialized as TFRecord files. The result is a pooled dataset covering 160,266 tasks. This is unglamorous work, but it is the foundation: without it, you cannot train a single model across different robots, because their data representations are incompatible. Think of it as standardizing the input layer before you even get to the neural network.
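
To make the idea of a shared schema concrete, here is a minimal sketch in plain Python. It is not the project's actual format or API: the `Step` and `Episode` classes, their field names, the `convert_lab_a` function, and the source record layout are all hypothetical, chosen only to mirror the four kinds of data listed above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Step:
    """One timestep in the shared format (field names are illustrative)."""
    image: bytes            # encoded camera frame
    proprio: List[float]    # proprioceptive state, e.g. joint positions
    action: List[float]     # action command issued at this step
    instruction: str        # natural-language task description


@dataclass
class Episode:
    """One demonstration: a robot identifier plus an ordered list of steps."""
    robot: str
    steps: List[Step] = field(default_factory=list)


def convert_lab_a(record: Dict[str, Any]) -> Episode:
    """Convert one record from a hypothetical lab-specific layout.

    Assumed source layout: parallel per-step lists under 'frames', 'joints',
    and 'cmds', plus a single 'task' string for the whole episode.
    """
    n = len(record["frames"])
    if not (len(record["joints"]) == len(record["cmds"]) == n):
        raise ValueError("per-step lists must have equal length")
    steps = [
        Step(
            image=record["frames"][i],
            proprio=list(record["joints"][i]),
            action=list(record["cmds"][i]),
            instruction=record["task"],
        )
        for i in range(n)
    ]
    return Episode(robot=record["robot_name"], steps=steps)


# Toy record, made up for illustration.
raw = {
    "robot_name": "ur5",
    "task": "pick up the apple",
    "frames": [b"\x00", b"\x01"],
    "joints": [[0.0] * 6, [0.1] * 6],
    "cmds": [[0.0] * 7, [0.01] * 7],
}
episode = convert_lab_a(raw)
print(episode.robot, len(episode.steps))
```

Each contributing lab would need its own converter of this kind; once every dataset produces the same `Episode` shape, a single training pipeline can consume all of them.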

Scale to a Multi-Modal Transformer (RT-X Architecture)

RT-X uses a transformer backbone in the style of RT-1, now trained on the pooled multi-robot dataset. The model takes image observations and a natural-language instruction and outputs actions. The key insight: a sufficiently large, well-regularized transformer can learn shared representations across different morphologies. Through attention, the model can learn which parts of the pooled experience are relevant to the current robot and task, even if that experience came from a different robot's data. This is positive transfer: the model learns that "moving toward an object" is conceptually similar whether it is driving a Jaco arm or a UR5, even though the joint angles and velocities differ.
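
One common way for a transformer to output robot actions, used by RT-1-style models, is to discretize each action dimension into a fixed number of bins so that an action becomes a short sequence of tokens. The sketch below shows uniform binning and its inverse. The bin count of 256, the function names, and the per-dimension bounds are assumptions for illustration, not the released implementation.

```python
from typing import List, Sequence

NUM_BINS = 256  # assumed number of bins per action dimension


def tokenize(action: Sequence[float], low: Sequence[float],
             high: Sequence[float]) -> List[int]:
    """Map each continuous action dimension to a bin in [0, NUM_BINS - 1]."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        if hi <= lo:
            raise ValueError("each upper bound must exceed its lower bound")
        clipped = min(max(a, lo), hi)          # clamp out-of-range values
        frac = (clipped - lo) / (hi - lo)      # position within [0, 1]
        tokens.append(min(int(frac * NUM_BINS), NUM_BINS - 1))
    return tokens


def detokenize(tokens: Sequence[int], low: Sequence[float],
               high: Sequence[float]) -> List[float]:
    """Map each bin index back to the centre of its bin."""
    return [lo + (t + 0.5) / NUM_BINS * (hi - lo)
            for t, lo, hi in zip(tokens, low, high)]


# A 7-dimensional toy action (position, rotation, gripper), bounds assumed.
low, high = [-1.0] * 7, [1.0] * 7
action = [0.0, 0.5, -1.0, 1.0, 0.25, -0.25, 1.0]
tokens = tokenize(action, low, high)
print(tokens)  # [128, 192, 0, 255, 160, 96, 255]
recovered = detokenize(tokens, low, high)
```

The round trip loses at most half a bin width per dimension, which is the price of turning control into a token-prediction problem that a language-model-style decoder can handle.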

[Videos: teaser; move red pepper to tray; pick ice cream; move red pepper to A]

Evaluate Transfer Learning Across 22 Robot Platforms

The critical validation: does training on diverse data actually make individual robots better? The team fine-tuned RT-X for specific robots and measured whether it outperformed models trained only on that robot's data. They tested both in-distribution tasks (skills seen during training) and out-of-distribution tasks (new tasks the RT-X model hadn't encountered). The results showed consistent improvements: robots benefited from exposure to other robots' strategies. A Fetch robot, for instance, learned to handle certain patterns from UR5 demonstrations. The improvement was more than marginal, and it is evidence that the scaling hypothesis holds in robotics.
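
The comparison described above reduces to a per-robot difference in success rate between the multi-robot model and a single-robot baseline. A minimal sketch follows, assuming each evaluation trial is recorded as a boolean. The function names, robot names, and trial outcomes are placeholders for illustration and are not results from the paper.

```python
from typing import Dict, List


def success_rate(trials: List[bool]) -> float:
    """Fraction of successful trials."""
    if not trials:
        raise ValueError("need at least one trial")
    return sum(trials) / len(trials)


def compare(baseline: Dict[str, List[bool]],
            generalist: Dict[str, List[bool]]) -> Dict[str, float]:
    """Per-robot change in success rate (generalist minus baseline)."""
    return {
        robot: success_rate(generalist[robot]) - success_rate(baseline[robot])
        for robot in baseline
    }


# Placeholder outcomes, invented for this example only.
baseline = {
    "robot_a": [True, False, False, False],
    "robot_b": [True, True, False, False],
}
generalist = {
    "robot_a": [True, True, False, False],
    "robot_b": [True, True, True, False],
}
deltas = compare(baseline, generalist)
positive_transfer = all(d > 0 for d in deltas.values())
print(deltas, positive_transfer)
```

A positive delta on a robot means the pooled model beat the model trained only on that robot's data; a real evaluation would also need enough trials per task for the differences to be meaningful.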

Enable Few-Shot Adaptation via In-Context Learning

Because transformers support in-context learning (conditioning on examples supplied in the input), RT-X can adapt to new tasks or new robot configurations with just a handful of demonstrations. Instead of retraining from scratch, you provide 1-5 examples of the desired behavior, and the model uses its pretrained representations to generalize. This dramatically reduces the data collection burden for new robots or new tasks: you're not starting cold, you're starting from a model that already carries experience from 22 different morphologies.

MORE DEMONSTRATIONS

[Videos: cable routing (RT-1-X); task-agnostic open drawer (RT-1-X); NYU environment; cloth sweeping; Jaco play; move apple near cloth; move apple on cloth; move apple between can and orange]

KEY RESULTS

Dataset Scale: 160,266 tasks across 527 distinct skills from 22 robots

vs. Prior multi-robot work typically involved 2-5 robots from a single lab

This is roughly 10-50× larger than previous multi-robot datasets. The scale is what enables the transformer to learn robust, general representations. Larger datasets tend to yield better generalization, and robotics has been severely limited by data scarcity. The dataset alone is a contribution to the field.

Positive Transfer Evidence: RT-X models trained on multi-robot data outperform single-robot baselines on most tested robots

vs. Models trained only on individual robot data or smaller multi-robot subsets

This is the core claim, and it holds up. The transfer isn't magical (you don't get 100% improvement), but it is consistent across platforms. Robots that had never performed certain skills improved when running models that had seen those skills on other platforms. This supports the view that the foundation-model hypothesis is viable in robotics in practice, not just in theory.

Skill Diversity: 527 distinct manipulation skills

vs. Typical single-robot systems train on 1-20 skills per study

Diversity forces the model to learn abstractions. Instead of overfitting to a narrow task distribution, RT-X had to find principles that apply across ice cream scooping, cable routing, object repositioning, and cloth sweeping. This breadth is why transfer works: the model learned robust primitives, not task-specific hacks.

Institutional Collaboration: 21 institutions contributed data

vs. Robotics data collection typically happens in 1-2 labs

This coordination is unprecedented in robotics and required significant organizational effort. It demonstrates that cross-institutional, open-science approaches are feasible and valuable. It also means the results are less likely to be a quirk of one lab's setup or bias—they're validated across diverse experimental conditions.

WHY DEVELOPERS SHOULD CARE

If you're building robotics software, this paper matters for three reasons.

First, you can now start from a pretrained policy that has seen more experience than any individual team could collect. Instead of collecting 10,000 demonstrations to teach a robot a new skill, you might need 100, because the model already carries manipulation concepts learned from other robots. Those figures are illustrative, but a reduction on that scale (100×) would translate directly into faster iteration and lower cost.

Second, the paper is evidence that robotics is moving toward the foundation-model era. Just as you would use BERT or GPT as a starting point in NLP, the next generation of robotics software is likely to build on pretrained policies like RT-X. It is worth designing your robot interfaces and data pipelines to be compatible with these models now. The standardized data format used here is an early version of such a standard; expect it to evolve, but the idea is durable.

Third, the collaborative infrastructure matters as much as the model. The Open X-Embodiment project is open-sourcing datasets and models and is accepting contributions. If you work with a robot type not yet well represented (humanoid, quadruped, aquatic), contributing your data makes the model better for everyone, including you. This is the opposite of the typical ML mindset in which data is a competitive advantage: in robotics, pooling data makes everyone stronger.
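
One way to keep a robot interface compatible with a pretrained policy is to put a thin adapter between the policy's generic action and the hardware-specific command. The sketch below assumes the policy emits an end-effector delta plus a gripper value. The class names, units, limits, and command dictionary are all hypothetical and do not correspond to any real robot SDK or to the RT-X release.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class EndEffectorAction:
    """Assumed policy output: Cartesian delta, rotation delta, gripper."""
    dxyz: Sequence[float]   # translation delta in metres
    drpy: Sequence[float]   # roll/pitch/yaw delta in radians
    gripper: float          # 0.0 = fully open, 1.0 = fully closed


class RobotAdapter(ABC):
    """Translates a generic action into one robot's command format."""

    @abstractmethod
    def to_command(self, action: EndEffectorAction) -> Dict[str, object]:
        ...


class ParallelGripperArm(RobotAdapter):
    """Hypothetical arm with a parallel gripper and a per-step motion limit."""

    def __init__(self, max_step_m: float = 0.02, max_width_m: float = 0.08):
        self.max_step_m = max_step_m
        self.max_width_m = max_width_m

    def to_command(self, action: EndEffectorAction) -> Dict[str, object]:
        def clamp(v: float) -> float:
            # Limit each translation component to the safe step size.
            return max(-self.max_step_m, min(self.max_step_m, v))

        closed = max(0.0, min(1.0, action.gripper))
        return {
            "translate_m": [clamp(v) for v in action.dxyz],
            "rotate_rad": list(action.drpy),
            "gripper_width_m": (1.0 - closed) * self.max_width_m,
        }


arm = ParallelGripperArm()
cmd = arm.to_command(
    EndEffectorAction(dxyz=[0.05, -0.01, 0.0], drpy=[0.0, 0.0, 0.1], gripper=0.25)
)
print(cmd)
```

With this split, swapping in a different pretrained policy or a different robot only touches one side of the adapter, which is the kind of compatibility the paragraph above argues for.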

LIMITATIONS

RT-X doesn't solve the morphology problem entirely. A gripper with 2 fingers transfers better to another 2-finger gripper than to a 3-finger hand, suggesting the model still learns hardware-specific features rather than fully abstract principles. The dataset is also heavily weighted toward tabletop manipulation; door opening and contact-rich tasks are underrepresented. Fine-tuning is still required for best performance: you can't just point RT-X at a new robot and expect mastery. The paper also doesn't deeply explore failure modes: which robot types benefit most, and which skills fail to transfer? Finally, sim-to-real transfer remains open: almost all the data comes from real robots, and adding simulated data might improve performance further, though it brings its own challenges.

WHAT COMES NEXT

The natural next steps are scaling further (more robots, more institutions, more diversity), improving data efficiency (can we get these results with 1/10th the data?), and extending beyond tabletop settings. We'll likely see specialized variants, along the lines of a hypothetical RT-X-Humanoid, RT-X-Mobile, or RT-X-Surgical. There's also the question of continual learning: can RT-X keep adapting as robots encounter new scenarios after deployment? And the really ambitious question: can a single RT-X scale beyond manipulation to long-horizon tasks and other capabilities? The paper already includes RT-2-X, a larger variant built on the RT-2 vision-language-action model, which suggests the team is pursuing these extensions. Expect the robotics field to gradually consolidate around open-source foundation models, similar to how Hugging Face transformed NLP.

RELATED PAPERS

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation

Octo: An Open-Source Generalist Robot Policy