IMITATION-LEARNING | CURRENT | 2023-10-13

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Open X-Embodiment Collaboration et al.

ARCHITECTURE
transformer (RT-X)
ROBOT
22 different robots (multi-robot)
DATASET
160,266 tasks across 527 skills
TASK
manipulation

Imagine if every robot had to learn from scratch, like training a new AI model for every single task on every single device. That has been the robotics status quo: inefficient and expensive. Open X-Embodiment changes this by pooling data from 22 different robots across 21 institutions to train a single "generalist" policy called RT-X. The result is a transformer-based model trained on 160,266 tasks across 527 different skills that can transfer knowledge between robots. When you train RT-X on data from a UR5 arm, a Jaco manipulator, a Fetch robot, and 19 others simultaneously, robots that never saw certain skills improve at performing them anyway. This is the robotics equivalent of what happened in computer vision and NLP when large pretrained models emerged: a move from "one model per problem" to "one foundation model for the whole domain."


THE PROBLEM

Before this work, robotic learning suffered from radical fragmentation. Every lab trained separate models for its specific robot, task, and environment. If you wanted to teach a robotic arm to pick up objects, you'd collect data on that robot, train a model, and hope it worked. If you then wanted that same arm to do something new, you'd start almost from scratch. This meant massive redundancy: researchers were independently collecting data of robots doing similar tasks, but the knowledge gained by one robot learning to grasp never benefited another robot. Earlier work in this direction existed (like RT-1 from Google's robotics team), but it was typically limited to a handful of robots within one lab. The scaling challenge was enormous: how do you even standardize data formats across 21 different institutions? How do you handle vastly different robot morphologies, camera angles, action spaces, and task definitions? The field lacked both the collaborative infrastructure and the empirical evidence that a single model could genuinely improve performance across diverse hardware.

HOW IT WORKS

1

Create a Standardized Data Format (RLDS)

The first major challenge wasn't algorithmic; it was organizational. RT-X contributors defined a shared schema for representing robot interactions: observation images, proprioceptive states, action commands, and task language descriptions. Each institution's data (from fixed single arms to mobile manipulators) was converted into this common format, yielding 160,266 labeled tasks. The released dataset uses RLDS, an episode-based format. This is unglamorous work, but it's the foundation. Without it, you can't train a single model across different robots at all, because their data representations are incompatible. Think of it as standardizing the input layer before you even get to the neural network.
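
The conversion step described above can be sketched in a few lines. This is a minimal illustration, not the project's actual schema: the `Step` and `Episode` classes, their field names, the `convert_lab_a` helper, and the layout of `raw_record` are all hypothetical and chosen only to mirror the four ingredients named in the text (images, proprioceptive state, actions, and a language description).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Step:
    """One timestep in the shared format (field names are illustrative)."""
    image: List[List[int]]   # camera observation, here a tiny 2-D grid
    proprio: List[float]     # proprioceptive state, e.g. joint readings
    action: List[float]      # action command sent to the robot


@dataclass
class Episode:
    """One demonstration: which robot, what instruction, and its steps."""
    robot: str
    instruction: str
    steps: List[Step] = field(default_factory=list)


def convert_lab_a(raw: Dict[str, Any]) -> Episode:
    """Map a hypothetical lab-specific record onto the shared Episode."""
    steps = [
        Step(image=f["rgb"], proprio=f["joints"], action=f["cmd"])
        for f in raw["frames"]
    ]
    return Episode(
        robot=raw["platform"],
        instruction=raw["task_text"],
        steps=steps,
    )


# A made-up record in one lab's private layout (assumption, not real data).
raw_record = {
    "platform": "UR5",
    "task_text": "pick up the apple",
    "frames": [
        {"rgb": [[0, 0], [0, 0]], "joints": [0.0, 0.1], "cmd": [0.01, 0.0]},
        {"rgb": [[1, 1], [1, 1]], "joints": [0.01, 0.1], "cmd": [0.0, 0.02]},
    ],
}

episode = convert_lab_a(raw_record)
print(episode.robot, episode.instruction, len(episode.steps))
```

Each contributing lab would need its own small converter of this kind; once every episode has the same shape, a single training pipeline can read all of them.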

2

Scale to a Multi-Modal Transformer (RT-X Architecture)

RT-X uses a transformer backbone similar to RT-1, but now trained on the massive pooled dataset. The model ingests image observations and natural language task descriptions, then outputs robot actions. The key insight: a sufficiently large, well-regularized transformer can learn shared representations across different robot morphologies. One plausible reading is that the attention mechanism learns which parts of the pooled experience are relevant to the current robot and task, even if that information came from a different robot's data. This is positive transfer: the model learns that 'moving toward an object' is conceptually similar whether you're a Jaco arm or a UR5, even though the joint angles and velocities are different.
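
To make "trained on the pooled dataset" concrete, here is a minimal sketch of how training batches might be drawn from several robots' data at once. The weighted-mixture scheme, the `sample_batch` function, the robot keys, and the weights are all assumptions for illustration; the paper's actual mixture recipe is not described in this summary.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple


def sample_batch(
    datasets: Dict[str, List[str]],
    weights: Dict[str, float],
    batch_size: int,
    rng: random.Random,
) -> List[Tuple[str, str]]:
    """Draw (robot, example) pairs, picking the source robot by weight."""
    robots = list(datasets)
    probs = [weights[r] for r in robots]
    batch = []
    for _ in range(batch_size):
        # First choose which robot's data to sample from ...
        robot = rng.choices(robots, weights=probs, k=1)[0]
        # ... then choose one example uniformly from that robot's data.
        batch.append((robot, rng.choice(datasets[robot])))
    return batch


# Placeholder episode identifiers standing in for real trajectories.
pooled = {
    "ur5": ["ur5_ep0", "ur5_ep1", "ur5_ep2"],
    "jaco": ["jaco_ep0"],
    "fetch": ["fetch_ep0", "fetch_ep1"],
}
# Hypothetical mixture weights; small datasets can be up-weighted this way.
mixture = {"ur5": 0.4, "jaco": 0.3, "fetch": 0.3}

batch = sample_batch(pooled, mixture, batch_size=8, rng=random.Random(0))
print(Counter(robot for robot, _ in batch))
```

Because every batch mixes embodiments, each gradient step pushes the shared weights toward representations that work for all of them, which is the mechanism the positive-transfer claim relies on.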

[Figures: teaser overview; demonstrations of "move red pepper to tray", "pick ice cream", and "move red pepper to A"]
3

Evaluate Transfer Learning Across 22 Robot Platforms

The critical validation: does training on diverse robot data actually make individual robots better? The team evaluated RT-X on specific robots and measured whether it outperformed models trained only on that robot's data. They tested both in-distribution tasks (skills seen during pretraining) and out-of-distribution (OOD) generalization (new tasks the RT-X model hadn't encountered). The results showed consistent improvements: robots benefited from exposure to other robots' data. In this setup, a Fetch robot can pick up manipulation patterns that appear only in UR5 demonstrations. This wasn't a marginal improvement; it was evidence that the scaling hypothesis works in robotics.
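
The comparison described above reduces to a per-robot difference in success rate between the pooled model and a single-robot baseline. The sketch below shows that arithmetic only. The robot names and trial counts are placeholders invented for illustration, not results from the paper, and `success_rate` and `transfer_gain` are hypothetical helper names.

```python
from typing import Dict, Tuple

# (successes, trials) for one model on one robot
Counts = Tuple[int, int]


def success_rate(counts: Counts) -> float:
    """Fraction of evaluation trials that succeeded."""
    successes, trials = counts
    if trials <= 0:
        raise ValueError("trials must be positive")
    return successes / trials


def transfer_gain(results: Dict[str, Dict[str, Counts]]) -> Dict[str, float]:
    """Pooled-model success rate minus baseline success rate, per robot."""
    return {
        robot: success_rate(r["pooled"]) - success_rate(r["baseline"])
        for robot, r in results.items()
    }


# Placeholder counts, NOT numbers from the paper.
results = {
    "robot_a": {"baseline": (10, 20), "pooled": (14, 20)},
    "robot_b": {"baseline": (8, 20), "pooled": (8, 20)},
}

gains = transfer_gain(results)
for robot, gain in gains.items():
    print(f"{robot}: {gain:+.2f}")
```

A positive value means the robot did better with the pooled model than with a model trained on its own data alone; a value near zero or below would indicate no transfer or negative transfer for that platform.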

4

Enable Few-Shot Adaptation via In-Context Learning

Because transformers support in-context learning (showing examples in the input), RT-X can adapt to new tasks or new robot configurations with just a handful of demonstrations. Instead of retraining from scratch, you provide 1 to 5 examples of the desired behavior, and the model uses its pretrained representations to generalize. This dramatically reduces the data collection burden for new robots or new tasks: you're not starting cold, you're adapting a model that already understands manipulation across 22 different morphologies.

MORE DEMONSTRATIONS

[Demonstration videos: cable routing (RT-1-X); task-agnostic drawer opening (RT-1-X); NYU environment; cloth sweeping; Jaco play; move apple near cloth; move apple on cloth; move apple between can and orange]

KEY RESULTS

Dataset Scale: 160,266 tasks across 527 distinct skills from 22 robots

vs. Prior multi-robot work typically involved 2-5 robots from a single lab

This is roughly 10-50× larger than previous multi-robot datasets. The scale is what enables the transformer to learn robust, general representations. Bigger datasets = better generalization, and robotics had been severely limited by data scarcity. This dataset alone is a contribution to the field.

Positive Transfer Evidence: RT-X models trained on multi-robot data outperform single-robot baselines on most tested robots

vs. Models trained only on individual robot data or smaller multi-robot subsets

This is the core claim and it holds up. The transfer isn't magical (you don't get 100% improvement), but it is consistent across most tested robots. Robots that had never performed certain skills improved when they used models that had seen those skills on other platforms. This is strong evidence that the foundation-model hypothesis is viable in robotics, not just theoretical.

Skill Diversity: 527 distinct manipulation skills

vs. Typical single-robot systems train on 1-20 skills per study

Diversity forces the model to learn abstractions. Instead of overfitting to a narrow task, RT-X had to find principles that apply across ice cream scooping, cable routing, object repositioning, and cloth manipulation. This breadth is why transfer works: the model learned robust primitives, not task-specific hacks.

Institutional Collaboration: 21 institutions contributed data

vs. Robotics data collection typically happens in 1-2 labs

This coordination is unprecedented in robotics and required significant organizational effort. It demonstrates that cross-institutional, open-science approaches are feasible and valuable. It also means the results are less likely to be a quirk of one lab's setup or bias—they're validated across diverse experimental conditions.


WHY DEVELOPERS SHOULD CARE

If you're building robotics software, this paper is a watershed moment. First, you can now start with a pretrained policy that has seen more robot experience than any individual team could collect. Instead of collecting 10,000 demonstrations to teach a robot a new task, you might need 100, because the model already understands manipulation concepts from other robots. That would be a 100× reduction in data collection, which translates directly into faster deployment and lower cost. Second, this paper shows that robotics is moving toward the foundation model era. Just as you'd use BERT or GPT as a starting point in NLP, the next generation of robotics software will use pretrained policies like RT-X. You should be thinking now about how to design your robot interfaces and data pipelines to be compatible with these models. The shared data format (RLDS, in the released dataset) is the earliest version of this standard; expect it to evolve, but the idea is durable. Third, the collaborative infrastructure matters as much as the model. The Open X-Embodiment project is open-sourcing datasets and models, and it is accepting contributions. If you're working with a robot type that is not yet well represented (for example, humanoid or aquatic platforms), contributing your data makes the foundation model better for everyone, including you. This is the opposite of the typical ML dynamic where data is a competitive advantage: in robotics, pooling data makes everyone stronger.

LIMITATIONS

RT-X doesn't solve the morphology problem entirely. A gripper with 2 fingers transfers better to another 2-finger gripper than to a 3-finger hand, suggesting the model is still learning hardware-specific features rather than fully abstract manipulation principles. The dataset is also heavily weighted toward tabletop manipulation and grasping; mobile manipulation, door opening, and contact-rich tasks are underrepresented. Fine-tuning is still required for best performance: you can't just point RT-X at a new robot and expect zero-shot mastery. Additionally, the paper doesn't deeply explore failure modes: which robot types benefit most? Which skills fail to transfer? And there's the unsolved problem of sim-to-real transfer: almost all data is from real robots, but adding simulation data might further improve generalization (though it brings its own challenges).

WHAT COMES NEXT

The natural next steps are scaling further (more robots, more institutions, more skill diversity), improving data efficiency (can we get these results with 1/10th the data?), and extending beyond tabletop to mobile manipulation and whole-body control. We'll likely see specialized variants: RT-X-Humanoid, RT-X-Mobile, RT-X-Surgical. There's also the question of true online learning: can RT-X continuously adapt as robots encounter new scenarios in deployment? And the really ambitious question: can a single RT-X scale to include locomotion, navigation, and long-horizon planning, not just manipulation? The paper already evaluates RT-2-X, a larger variant built on a vision-language model, which suggests the team is pursuing these extensions. Expect the robotics field to gradually consolidate around open-source foundation models, similar to how Hugging Face transformed NLP.

RELATED PAPERS