9 robotic platforms (multi-platform: includes Franka Panda, UR5, and others from Open X-Embodiment dataset)
DATASET
800k trajectories from Open X-Embodiment dataset
TASK
manipulation
Octo is a breakthrough in robot learning that does something previously thought impossible: train a single policy on 800,000 robot trajectories from 9 different robots and sensor configurations, then finetune it to completely new robots in just a few hours on a consumer GPU. This is the "ImageNet moment" for robotics. Instead of training robot policies from scratch for each new task (which took weeks), developers can now start with Octo's pretrained weights and adapt to their specific robot in an afternoon. The model is a transformer-based diffusion policy, released in 27M and 93M parameter versions, that understands both language commands ("pick up the pen") and goal images, and it's open-source. In real robot experiments across 9 different platforms, from UR5 industrial arms to WidowX arms to dual-arm systems, Octo either matches or beats specialized policies trained from scratch, and when finetuned it outperforms the next-best baseline by 52 percentage points on average.
THE PROBLEM
Before Octo, robotics researchers faced a painful scaling problem. Each new robot platform required collecting thousands of demonstrations, training a policy from scratch for weeks, and hoping it would generalize. Existing 'generalist' models like RT-1 and RT-2 could handle multiple robots, but they were either closed-source, required massive computational resources (RT-2-X has 55 billion parameters), or couldn't adapt to new sensor configurations: if your robot had force-torque sensors or a different gripper, you were out of luck. The Open X-Embodiment dataset unlocked something powerful, 800k trajectories across diverse platforms, but nobody had figured out how to build a model that could leverage that diversity while remaining practical for everyday robotics labs. The gap wasn't just technical; it was about whether a single pretrained model could actually beat task-specific training, especially when adapting to completely unseen robot configurations.
HOW IT WORKS
1. Multi-Modal Transformer-Diffusion Architecture
Octo uses a transformer backbone (the same architecture that powers GPT) to process sequences of observations: camera images, proprioceptive data like joint angles, and any other sensors your robot has. Instead of predicting robot actions directly, it uses diffusion (a technique from generative AI that iteratively refines noisy predictions into clean ones) to output action distributions. This is crucial because it captures uncertainty: when multiple valid action sequences exist, the diffusion model learns the whole landscape rather than picking one. The architecture accepts variable-length observation histories and supports both language and goal-image conditioning, meaning you can say "place the mug on the shelf" or show the model a photo of the target state. This flexibility is why Octo can handle force-torque inputs the original training data never saw.
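The iterative refinement described above can be sketched as a standard DDPM-style reverse loop. This is a minimal illustration, not Octo's implementation: the names `make_schedule`, `oracle_noise`, and `sample_action` are hypothetical, the linear noise schedule is an assumed choice, and the learned, transformer-conditioned denoiser is replaced by an analytic stand-in that already knows the single "correct" action.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and the running product of alpha = 1 - beta."""
    betas = [
        beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        for i in range(num_steps)
    ]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars


def oracle_noise(noisy_action, alpha_bar, target_action):
    """Stand-in for the learned denoiser (assumption for this sketch).

    A real policy would predict the noise from the noisy action plus the
    transformer's encoding of observations and the language or goal input.
    Here the noise is computed analytically for one known target action.
    """
    return [
        (x - math.sqrt(alpha_bar) * a) / math.sqrt(1.0 - alpha_bar)
        for x, a in zip(noisy_action, target_action)
    ]


def sample_action(target_action, num_steps=20, seed=0):
    """Start from Gaussian noise and denoise step by step into an action."""
    rng = random.Random(seed)
    betas, alpha_bars = make_schedule(num_steps)
    action = [rng.gauss(0.0, 1.0) for _ in target_action]
    for t in reversed(range(num_steps)):
        eps = oracle_noise(action, alpha_bars[t], target_action)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [
            (x - coef * e) / math.sqrt(1.0 - betas[t])
            for x, e in zip(action, eps)
        ]
        if t > 0:
            # Re-inject a little noise on every step except the last one.
            sigma = math.sqrt(betas[t])
            action = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            action = mean
    return action
```

With a perfect denoiser and a single valid action, the loop lands on that action. With a learned denoiser trained on demonstrations that contain several valid actions, different random seeds would instead land on different valid actions, which is the property the paragraph above relies on.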
2. Pretraining on 800k Trajectories Across 25 Datasets
Octo trains on the Open X-Embodiment dataset, a diverse collection spanning different robot arms (Franka, UR5, WidowX), camera configurations, and manipulation tasks from pick-and-place to precise insertion. The key insight is that this diversity forces the model to learn abstract manipulation skills that transfer. If you only trained on UR5 data, the model would overfit to UR5's specific kinematics. By seeing multiple robots solve similar tasks in different ways, Octo learns what manipulation actually means at a deeper level. The training uses the same objective across all 25 datasets despite heterogeneity in labels (some have language, some don't), sensors, and action types, a significant engineering challenge that required flexible input/output handling.
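One common way to train a single objective over heterogeneous data is to map every sample onto a shared layout and carry masks for whatever is missing. The sketch below illustrates that idea only; the function name `to_common_schema`, the field names, and the zero-padding convention are assumptions for illustration and are not taken from Octo's data pipeline.

```python
def to_common_schema(sample, camera_keys, action_dim):
    """Map one dataset-specific timestep onto a shared layout with masks.

    `sample` is a plain dict that may or may not contain a "language"
    string or an "images" dict; "action" is a list of floats whose length
    varies between datasets. Missing inputs are marked in the masks so a
    loss or attention layer can ignore them.
    """
    instruction = sample.get("language")
    has_language = bool(instruction)
    raw_images = sample.get("images", {})
    images = {key: raw_images.get(key) for key in camera_keys}

    action = list(sample["action"])
    if len(action) > action_dim:
        raise ValueError("action has more dimensions than the shared layout")
    pad = action_dim - len(action)

    return {
        "language": instruction if has_language else "",
        "language_mask": has_language,
        "images": images,
        "image_mask": {key: images[key] is not None for key in camera_keys},
        "action": action + [0.0] * pad,
        "action_mask": [True] * len(action) + [False] * pad,
    }


if __name__ == "__main__":
    with_language = {
        "language": "pick up the pen",
        "images": {"wrist": "wrist_frame"},
        "action": [0.1, 0.2],
    }
    print(to_common_schema(with_language, ["primary", "wrist"], action_dim=4))
```

Under this scheme a dataset without language labels and a dataset without a wrist camera produce batches of the same shape, and the masks tell the model which entries are real.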
3. Efficient Finetuning Protocol with Adapter Layers
Here's where practical impact happens. When adapting Octo to a new robot, say one with a different gripper or joint control instead of end-effector control, you don't retrain the entire 93M parameter model. Instead, Octo uses parameter-efficient finetuning: you freeze most of the pretrained weights and add small adapter layers that learn robot-specific mappings. This runs on consumer GPUs (an RTX 3090 or similar) with just 100 target demonstrations (~30 minutes of robot data). The same finetuning recipe works across all tasks, with no hyperparameter tuning per robot. This is radically different from previous approaches that required careful tuning for each new domain.
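The freeze-and-adapt idea as described in this section can be shown on a deliberately tiny model. This is a toy sketch under stated assumptions: the "pretrained model" is a one-dimensional linear map, the adapter is a residual scale and shift, and `finetune_adapter` is a hypothetical name. It shows only that gradient updates touch the adapter while the frozen weights stay fixed, and says nothing about Octo's actual training code.

```python
def finetune_adapter(frozen, demos, steps=200, lr=0.05):
    """Fit a small residual adapter on top of frozen weights.

    frozen: dict with "w" and "b", never modified here.
    demos:  list of (observation, action) pairs from the new robot.
    Returns the adapter plus the loss before and after finetuning.
    """
    adapter = {"scale": 0.0, "shift": 0.0}

    def predict(x):
        base = frozen["w"] * x + frozen["b"]             # frozen path
        return base + adapter["scale"] * x + adapter["shift"]  # trainable path

    def loss():
        return sum((predict(x) - y) ** 2 for x, y in demos) / len(demos)

    initial_loss = loss()
    for _ in range(steps):
        # Gradients of the mean squared error w.r.t. adapter parameters only.
        g_scale = sum(2 * (predict(x) - y) * x for x, y in demos) / len(demos)
        g_shift = sum(2 * (predict(x) - y) for x, y in demos) / len(demos)
        adapter["scale"] -= lr * g_scale
        adapter["shift"] -= lr * g_shift
    return adapter, initial_loss, loss()


if __name__ == "__main__":
    pretrained = {"w": 1.5, "b": 0.0}
    # Hypothetical new robot whose correct mapping is action = 2 * obs + 1.
    demos = [(x, 2.0 * x + 1.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
    adapter, before, after = finetune_adapter(pretrained, demos)
    print(adapter, before, after)
```

Because only two numbers are trained, the update is cheap and the pretrained behavior is preserved exactly when the adapter is zero, which is the usual motivation for this style of finetuning.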
4. Flexible Sensor and Action Space Adaptation
Octo handles a critical real-world problem: robots aren't standardized. The Berkeley Peg Insert task required adding force-torque sensing (new observations the pretraining never saw), and Berkeley Pick-Up used joint position control instead of end-effector control (a new action space). Traditional models would need retraining. Octo's architecture handles this through input/output adapters, lightweight neural networks that translate between any sensor configuration and the model's internal representation. On Peg Insert (a notoriously difficult precision task), Octo achieved 70% success with force feedback, versus 5% for the next-best baseline. This adaptability is why Octo works across 9 different real platforms rather than just a curated subset.
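An input adapter of the kind described above can be pictured as a small projection that turns a new sensor reading into one more token of the width the backbone already expects. The sketch below is an assumption-laden illustration: `make_linear_projection`, `project`, and `build_token_sequence` are hypothetical names, the projection is a single randomly initialized linear layer, and tokens are plain Python lists.

```python
import random


def make_linear_projection(in_dim, token_dim, seed=0):
    """Randomly initialized (token_dim x in_dim) weight matrix."""
    rng = random.Random(seed)
    scale = 1.0 / in_dim ** 0.5
    return [
        [rng.uniform(-scale, scale) for _ in range(in_dim)]
        for _ in range(token_dim)
    ]


def project(weights, reading):
    """Map a raw sensor reading to a token-sized vector."""
    if any(len(row) != len(reading) for row in weights):
        raise ValueError("reading size does not match the projection")
    return [sum(w * x for w, x in zip(row, reading)) for row in weights]


def build_token_sequence(pretrained_tokens, extra_readings, projections):
    """Append one new token per extra sensor; existing tokens are untouched."""
    tokens = [list(tok) for tok in pretrained_tokens]
    for name, reading in extra_readings.items():
        tokens.append(project(projections[name], reading))
    return tokens


if __name__ == "__main__":
    token_dim = 4
    # Stand-ins for tokens the pretrained tokenizers would already produce.
    image_and_language_tokens = [[0.0] * token_dim, [1.0] * token_dim]
    projections = {"force_torque": make_linear_projection(6, token_dim)}
    wrench = [0.1, 0.0, -0.3, 0.0, 0.02, 0.0]  # 3 forces + 3 torques
    tokens = build_token_sequence(
        image_and_language_tokens, {"force_torque": wrench}, projections
    )
    print(len(tokens), len(tokens[-1]))
```

The design point is that the backbone sees a longer sequence of same-sized tokens, so a new sensor needs only a new projection rather than a new backbone. A new action space can be handled symmetrically with a new output layer.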
KEY RESULTS
Zero-shot performance on WidowX tasks (language-conditioned): 50% success rate
vs. RT-1-X at 20% and RT-2-X at 50%
Out-of-the-box on tasks from the pretraining distribution, Octo matches the 55B parameter RT-2-X while being only 27M parameters (roughly 2,000x smaller). Goal image conditioning pushes this to 75% average success, 25 percentage points higher than language alone. This indicates the pretraining worked: the model genuinely learned reusable manipulation skills.
Finetuning performance across 6 new robot setups: 72% average success rate
vs. training from scratch at 20% and VC-1 pretraining at 15%
With ~100 target demonstrations and a few hours on consumer GPUs, Octo outperforms the next-best baseline by 52 percentage points (72% vs. 20%). Critically, this uses the same finetuning recipe for all 6 setups (CMU Baking, Stanford Coffee, Berkeley Peg Insert, Berkeley Pick-Up, Berkeley Bimanual, Berkeley Coke), including tasks with new sensors and action spaces. This consistency is rare in robotics; usually you retune hyperparameters for every new robot.
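Because success rates are themselves percentages, a gap between them is easiest to read in percentage points. The short check below uses only the success rates quoted in this section; the function names are illustrative.

```python
def percentage_point_gap(rate_a, rate_b):
    """Absolute difference between two success rates given in percent."""
    return rate_a - rate_b


def relative_factor(rate_a, rate_b):
    """How many times larger rate_a is than rate_b."""
    return rate_a / rate_b


if __name__ == "__main__":
    octo, scratch, vc1 = 72, 20, 15  # average success rates quoted above, in %
    print(percentage_point_gap(octo, scratch))  # gap to training from scratch
    print(percentage_point_gap(octo, vc1))      # gap to VC-1 pretraining
    print(relative_factor(octo, scratch))       # ratio of the two success rates
```

The gap to training from scratch is 52 percentage points, which corresponds to a success rate about 3.6 times as high.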
Finetuning time on consumer hardware: a few hours on standard consumer GPUs
vs. weeks for training specialized policies from scratch
A developer can now adapt Octo to their specific robot setup in an afternoon without waiting for expensive cloud compute. This democratizes robotics research: labs with limited budgets can now participate in robotics development. For comparison, training RT-2 from scratch requires thousands of TPU hours.
Octo achieves comparable zero-shot performance at a roughly 600x smaller model scale (93M vs 55B), making it deployable on edge hardware and much cheaper to finetune. The two-model lineup lets developers choose between accuracy (Base) and deployment simplicity (Small).
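The size comparison can be checked directly from the parameter counts quoted in this article (27M and 93M for Octo, 55B for RT-2-X). The helper name `size_ratio` is illustrative.

```python
def size_ratio(large_params, small_params):
    """How many times more parameters the larger model has."""
    return large_params / small_params


if __name__ == "__main__":
    rt2x = 55e9
    for name, params in (("Octo 27M", 27e6), ("Octo 93M", 93e6)):
        print(f"RT-2-X is about {size_ratio(rt2x, params):.0f}x larger than {name}")
```

This gives a ratio of about 2,000 for the 27M model and about 590 for the 93M model.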
WHY DEVELOPERS SHOULD CARE
If you're building robotics software, Octo changes your development workflow fundamentally. Before, you collected robot data, spent weeks training a policy, and hoped it worked on your specific hardware. Now you start with pretrained weights and finetune in hours. This unlocks a new class of robotics applications: startups can build products without massive ML teams, research labs can focus on novel task design rather than training infrastructure, and roboticists can iterate rapidly on new hardware without restarting the entire training pipeline. The architecture teaches important lessons about generalization: the key to building policies that work across robots isn't just data diversity, it's having flexible input/output layers that can adapt to any sensor configuration. This pattern will likely become standard. As a developer, you should understand why diffusion models matter here: they capture action uncertainty better than deterministic policies, which is critical when multiple valid behaviors exist (reaching for a cup can succeed with slightly different trajectories). You should also recognize that this is the beginning of a paradigm shift.
Just as transformers revolutionized NLP by enabling transfer learning, pretrained robot policies will become the default starting point rather than a luxury. The open-source nature matters enormously: Octo and its weights are publicly available, so your robotics startup or research project isn't locked into proprietary platforms.
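The point about multiple valid behaviors can be made concrete with a toy case. Suppose demonstrations steer around an obstacle either left (-1.0) or right (+1.0). A deterministic policy trained with mean squared error converges to the average of the demonstrations, which here means driving straight at the obstacle, while a policy that samples from the demonstrated distribution always picks a valid side. The sampler below simply draws from the demonstrations; it stands in for a generative policy and is not a diffusion model.

```python
import random


def mse_optimal_action(demo_actions):
    """The single action that minimizes mean squared error: the mean."""
    return sum(demo_actions) / len(demo_actions)


def sample_demonstrated_action(demo_actions, rng):
    """Stand-in for a generative policy: draw from the demonstrated modes."""
    return rng.choice(demo_actions)


if __name__ == "__main__":
    demos = [-1.0, 1.0, -1.0, 1.0]  # steer left or right around an obstacle
    rng = random.Random(0)
    print("deterministic:", mse_optimal_action(demos))  # 0.0, straight ahead
    print("sampled:", [sample_demonstrated_action(demos, rng) for _ in range(5)])
```

Averaging two good answers produces a bad one in this setting, which is why modeling the full action distribution matters for manipulation.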
LIMITATIONS
Octo isn't magic. It works well on tabletop manipulation but hasn't been tested on locomotion, manipulation in clutter, or tasks requiring reasoning about physics. The zero-shot performance drops significantly when the test task is far from the pretraining data (WidowX tasks reach 50%, but insertion tasks drop to lower rates without finetuning). Finetuning requires at least ~100 demonstrations; if you have fewer, the gains diminish. The model's size (93M parameters) still requires reasonable compute for deployment; robots with embedded systems may struggle. There are also subtle failure modes: tasks with long horizons (>30 steps) or precise contact-rich manipulation (threading a needle) remain challenging. The evaluation is limited to single-arm and basic dual-arm systems; complex manipulation with coordinated multi-arm systems isn't addressed.
Finally, like all learning-based approaches, Octo can fail under distribution shift: a new gripper material or significantly different lighting can cause performance degradation.
WHAT COMES NEXT
The trajectory is clear: expect larger pretrained models trained on orders of magnitude more data as more robotics labs contribute to datasets like Open X-Embodiment. The next version will likely incorporate video understanding (learning from YouTube robot videos), multi-task language understanding (understanding complex compositional instructions), and online learning (adapting policies in real-time as the robot encounters new scenarios). There's also research into distillation: could you compress a 93M parameter Octo into a 5M parameter model that runs on robots' onboard compute? Finally, the field needs better understanding of when and why finetuning works: a theory explaining the generalization gap would guide architecture design. Longer term, expect a shift toward foundation models for robotics similar to how BERT transformed NLP: one massive pretrained model fine-tuned for everything from logistics to surgery, trained on multi-year-spanning datasets of millions of robot hours.