IMITATION-LEARNING · CURRENT · 2024-05-20

Octo: An Open-Source Generalist Robot Policy

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, Sergey Levine

ARCHITECTURE

transformer-based policy

ROBOT

9 robotic platforms (multi-platform: includes Franka Panda, UR5, and others from Open X-Embodiment dataset)

DATASET

800k trajectories from Open X-Embodiment dataset

TASK

manipulation

Octo does something that was long considered impractical: it trains a single policy on 800,000 trajectories from 9 different robots and configurations, and that policy can then be finetuned to completely new robots in just a few hours on a consumer GPU. This is arguably the "ImageNet moment" for robotics. Instead of training policies from scratch for each new robot (which took weeks), developers can start from Octo's pretrained weights and adapt them to their specific setup in an afternoon. The model is a 27M to 93M parameter transformer-based policy that understands both language commands ("pick up the pen") and images, and it is open-source. In real-robot experiments across 9 different platforms, from UR5 industrial arms to WidowX grippers to dual-arm systems, Octo matches or beats specialized policies trained from scratch, and when finetuned it outperforms the next-best baseline by 52 percentage points on average.

THE PROBLEM

Before Octo, robotics researchers faced a painful scaling problem. Each new platform required collecting thousands of demonstrations, training a policy from scratch for weeks, and hoping it would generalize. Existing "generalist" models like RT-1 and RT-2 could handle multiple robots, but they were closed-source, required massive computational resources (RT-2-X has 55 billion parameters), or couldn't adapt to new configurations: if your robot had force-torque sensors or a different action space, you were out of luck. The Open X-Embodiment dataset unlocked something powerful, 800k trajectories across diverse platforms, but nobody had figured out how to build a model that could leverage that diversity while remaining practical for everyday robotics labs. The gap wasn't just technical; it was about whether a single pretrained model could beat task-specific policies, especially when adapting to completely unseen configurations.

HOW IT WORKS

Multi-Modal Transformer-Diffusion Architecture

Octo uses a transformer backbone (the same architecture family that powers GPT) to process sequences of observations: camera images, proprioceptive data like joint angles, and any other sensors your robot has. Instead of predicting actions directly, it uses diffusion (a technique from generative AI that iteratively refines noisy predictions into clean ones) to output action distributions. This is crucial because it captures uncertainty: when multiple valid action sequences exist, the diffusion model learns the whole landscape rather than picking one. The architecture accepts variable-length histories and supports both language and goal-image conditioning, meaning you can say "place the mug on the shelf" or show the model a photo of the target state. This flexibility is why Octo can handle force-torque inputs that the pretraining data never contained.
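To make the "iteratively refines noisy predictions" idea concrete, here is a minimal sketch of a DDPM-style reverse sampling loop for one action vector. It is not Octo's implementation: the linear noise schedule, the step count, and `toy_noise_predictor` are assumptions for illustration. In a real policy the noise predictor would be a learned network conditioned on the transformer's output; here it is an oracle for a distribution with a single valid action, so the loop can be checked end to end.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.2):
    """Linear beta schedule (an assumed choice) plus cumulative alpha products."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    return betas, alphas, alpha_bars


def toy_noise_predictor(noisy_action, alpha_bar, target_action):
    """Hypothetical stand-in for the learned network.

    For a data distribution concentrated on one target action, the noise
    contained in a noisy sample can be computed in closed form.
    """
    return [(x - math.sqrt(alpha_bar) * t) / math.sqrt(1.0 - alpha_bar)
            for x, t in zip(noisy_action, target_action)]


def sample_action(target_action, num_steps=50, seed=0):
    """Start from Gaussian noise and denoise step by step into an action."""
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    action = [rng.gauss(0.0, 1.0) for _ in target_action]
    for step in reversed(range(num_steps)):
        eps = toy_noise_predictor(action, alpha_bars[step], target_action)
        coef = betas[step] / math.sqrt(1.0 - alpha_bars[step])
        mean = [(x - coef * e) / math.sqrt(alphas[step])
                for x, e in zip(action, eps)]
        if step > 0:
            # Intermediate steps re-inject a small amount of noise.
            sigma = math.sqrt(betas[step])
            action = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            # The final step returns the denoised mean.
            action = mean
    return action


if __name__ == "__main__":
    print(sample_action([0.3, -0.1, 0.5]))
```

With a learned predictor and a multimodal dataset, different random seeds would land on different valid actions, which is the property the paragraph above describes.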

Pretraining on 800k Trajectories Across 25 Datasets

Octo trains on the Open X-Embodiment dataset, a diverse collection spanning different arms (Franka, UR5, WidowX), camera configurations, and tasks of varying precision. The key insight is that this diversity forces the model to learn abstract skills that transfer. If you only trained on UR5 data, the model would overfit to the UR5's specifics. By seeing multiple robots solve similar tasks in different ways, Octo learns what manipulation actually means at a deeper level. Training uses the same objective across all 25 datasets despite heterogeneity in labels (some have language, some don't), sensors, and action types, a significant engineering challenge that required flexible input/output handling.
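One common way to train a single objective over heterogeneous datasets is to map every sample into a shared schema and carry a mask for each optional field, so missing language or missing sensors are ignored rather than treated as data. The sketch below illustrates that idea only; the field names, `PROPRIO_DIM`, and the `unify` helper are hypothetical and are not taken from the Octo codebase.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

PROPRIO_DIM = 7  # hypothetical size of the proprioceptive vector


@dataclass
class UnifiedSample:
    image: List[float]
    language: str
    language_mask: bool   # False when the source dataset has no instruction
    proprio: List[float]
    proprio_mask: bool    # False when the source dataset has no proprio data


def unify(raw: Dict[str, Any]) -> UnifiedSample:
    """Map a raw sample with optional fields into the shared schema."""
    instruction = raw.get("language")
    proprio = raw.get("proprio")
    return UnifiedSample(
        image=list(raw["image"]),
        language=instruction if instruction is not None else "",
        language_mask=instruction is not None,
        # Missing sensors are zero-padded; the mask tells the model to ignore them.
        proprio=list(proprio) if proprio is not None else [0.0] * PROPRIO_DIM,
        proprio_mask=proprio is not None,
    )


if __name__ == "__main__":
    with_language = unify({"image": [0.1, 0.2], "language": "pick up the pen"})
    without_language = unify({"image": [0.3, 0.4], "proprio": [0.0] * PROPRIO_DIM})
    print(with_language.language_mask, without_language.language_mask)
```

Because every sample has the same shape after `unify`, batches can mix datasets freely, and the masks decide which inputs contribute to each prediction.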

Efficient Finetuning Protocol with Adapter Layers

Here's where the practical impact happens. When adapting Octo to a new robot, say one with different sensors or a different action space, you don't retrain the entire 93M parameter model. Instead, Octo uses parameter-efficient finetuning: you freeze most of the pretrained weights and add small adapter layers that learn robot-specific mappings. This runs on consumer GPUs (an RTX 3090 or similar) with just 100 target demonstrations (~30 minutes of data). The same finetuning recipe works across all tasks, with no hyperparameter tuning per task. This is radically different from previous approaches that required careful tuning for each new domain.
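The freeze-and-adapt pattern described above can be shown on a deliberately tiny model. This is a toy sketch, not Octo's finetuning code: the "backbone" is a single frozen weight, the "adapter" is a trainable scale and bias, and the data, learning rate, and step count are all made up for illustration.

```python
def predict(x, frozen_w, adapter):
    """Frozen backbone feature followed by a small trainable adapter."""
    feature = frozen_w * x
    return adapter["scale"] * feature + adapter["bias"]


def mse(data, frozen_w, adapter):
    """Mean squared error of the adapted model on (x, y) pairs."""
    return sum((predict(x, frozen_w, adapter) - y) ** 2 for x, y in data) / len(data)


def finetune_adapter(data, frozen_w, steps=500, lr=0.05):
    """Gradient descent that updates only the adapter; frozen_w is never touched."""
    adapter = {"scale": 1.0, "bias": 0.0}
    n = len(data)
    for _ in range(steps):
        grad_scale = 0.0
        grad_bias = 0.0
        for x, y in data:
            feature = frozen_w * x
            err = adapter["scale"] * feature + adapter["bias"] - y
            grad_scale += 2.0 * err * feature / n
            grad_bias += 2.0 * err / n
        adapter["scale"] -= lr * grad_scale
        adapter["bias"] -= lr * grad_bias
    return adapter


if __name__ == "__main__":
    frozen_w = 2.0  # stands in for the pretrained weights
    # Target "robot" behaves like y = 3x + 1, which the frozen model alone cannot fit.
    demos = [(x, 3.0 * x + 1.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
    before = mse(demos, frozen_w, {"scale": 1.0, "bias": 0.0})
    adapter = finetune_adapter(demos, frozen_w)
    print(before, mse(demos, frozen_w, adapter), adapter)
```

Only two numbers are learned here, which is the appeal of the approach: the number of trainable parameters, and therefore the memory and data needed, scales with the adapter rather than with the backbone.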

Flexible Sensor and Action Space Adaptation

Octo handles a critical real-world problem: robots aren't standardized. The Berkeley Peg Insert task required adding force-torque sensing (new observations the pretrained model never saw), and Berkeley Pick-Up used joint position control instead of end-effector control (a new action space). Traditional models would need retraining. Octo's architecture handles this through input/output adapters: lightweight neural networks that translate between any configuration and the model's internal representation. On Peg Insert (a notoriously difficult precision task), Octo achieved 70% success with force-torque inputs, versus 5% for the next-best baseline. This adaptability is why Octo works across 9 different real platforms rather than just a curated subset.
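An input adapter of the kind described above can be as small as one linear projection that turns a new sensor reading into a token the backbone already knows how to consume. The sketch below shows that shape-level idea with plain lists; `EMBED_DIM`, the random initialization, and the helper names are assumptions for illustration rather than Octo's actual tokenizer code.

```python
import random

EMBED_DIM = 8  # hypothetical token width of the pretrained backbone


def make_linear_adapter(input_dim, embed_dim=EMBED_DIM, seed=0):
    """Randomly initialized projection matrix with shape embed_dim x input_dim."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(input_dim)]
            for _ in range(embed_dim)]


def project(weights, reading):
    """Map a raw sensor reading to a single token embedding."""
    if any(len(row) != len(reading) for row in weights):
        raise ValueError("sensor reading does not match adapter input size")
    return [sum(w * x for w, x in zip(row, reading)) for row in weights]


def append_sensor_token(tokens, weights, reading):
    """Return a new token sequence with the projected sensor token at the end."""
    return tokens + [project(weights, reading)]


if __name__ == "__main__":
    image_tokens = [[0.0] * EMBED_DIM for _ in range(3)]  # placeholder tokens
    force_torque = [0.2, -0.1, 9.8, 0.01, 0.0, -0.02]     # 6-axis reading
    adapter = make_linear_adapter(input_dim=len(force_torque))
    extended = append_sensor_token(image_tokens, adapter, force_torque)
    print(len(image_tokens), len(extended), len(extended[-1]))
```

The backbone sees one more token of the usual width, so nothing inside it has to change shape; only the projection weights are new and need to be learned during finetuning.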

KEY RESULTS

Zero-shot performance on WidowX tasks (language-conditioned): 50% success rate

vs. RT-1-X at 20% and RT-2-X at 50%

Out of the box, on tasks from the pretraining distribution, Octo matches the 55B parameter RT-2-X while being only 27M parameters (roughly 2,000x smaller). Goal-image conditioning pushes this to 75% average success, 25 percentage points higher than language alone. This suggests the pretraining worked: the model genuinely learned reusable skills.

Finetuning performance across 6 new robot setups: 72% average success rate

vs. training from scratch at 20% and VC-1 pretraining at 15%

With ~100 target demonstrations and a few hours on consumer GPUs, Octo outperforms the next-best baseline by 52 percentage points (72% vs. 20%). Critically, this uses the same finetuning recipe for all 6 setups (CMU Baking, Stanford Coffee, Berkeley Peg Insert, Berkeley Pick-Up, Berkeley Bimanual, Berkeley Coke), including tasks with new sensors and action spaces. This consistency is rare in robotics, where you usually retune hyperparameters for every new robot.

Finetuning time on consumer hardware: a few hours on standard consumer GPUs

vs. weeks for training specialized policies from scratch

A developer can now adapt Octo to their specific setup in an afternoon without waiting for expensive cloud compute. This democratizes robotics research: labs with limited budgets can now take part. For comparison, training RT-2 from scratch requires thousands of TPU hours.

Parameter efficiency: Octo-Base has 93M parameters, Octo-Small has 27M parameters

vs. RT-2-X at 55 billion parameters

Octo achieves comparable performance at roughly 600x smaller model scale (93M vs 55B), making it far more practical to deploy on edge hardware and much cheaper to finetune. The two-model lineup lets developers choose between accuracy (Base) and a smaller footprint (Small).
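The size ratios follow directly from the parameter counts quoted in this section. The short check below uses only those three numbers; the constant and function names are chosen here for illustration.

```python
RT2X_PARAMS = 55e9        # RT-2-X, as quoted above
OCTO_BASE_PARAMS = 93e6   # Octo-Base
OCTO_SMALL_PARAMS = 27e6  # Octo-Small


def size_ratio(large, small):
    """How many times larger one model is than another, by parameter count."""
    return large / small


base_ratio = size_ratio(RT2X_PARAMS, OCTO_BASE_PARAMS)
small_ratio = size_ratio(RT2X_PARAMS, OCTO_SMALL_PARAMS)

print(f"RT-2-X vs Octo-Base:  {base_ratio:.0f}x")   # 591x
print(f"RT-2-X vs Octo-Small: {small_ratio:.0f}x")  # 2037x
```

So RT-2-X is about 600 times larger than Octo-Base and about 2,000 times larger than Octo-Small.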

WHY DEVELOPERS SHOULD CARE

If you're building robotics software, Octo changes your development workflow fundamentally. Before, you collected data, spent weeks training a policy, and hoped it worked on your specific hardware. Now you start with pretrained weights and finetune in hours. This unlocks a new class of robotics applications: startups can build products without massive ML teams, research labs can focus on novel design rather than infrastructure, and roboticists can iterate rapidly on new hardware without restarting the entire pipeline. The architecture also teaches an important lesson: the key to building policies that work across robots isn't just data diversity, it's having flexible input/output layers that can adapt to any configuration. This pattern will likely become standard. As a developer, you should understand why diffusion models matter here: they capture uncertainty better than deterministic policies, which is critical when multiple valid behaviors exist (reaching for a cup can succeed with slightly different trajectories). You should also recognize that this is the beginning of a paradigm shift. Just as pretrained transformers revolutionized NLP, pretrained policies will become the default starting point rather than a luxury. The open-source nature matters enormously: Octo and its weights are publicly available, so your robotics startup or research project isn't locked into proprietary platforms.

LIMITATIONS

Octo isn't magic. It works well on tabletop manipulation but hasn't been tested in clutter or on tasks requiring reasoning about physics. Performance drops significantly when the test setup is far from the pretraining data (WidowX achieves 50%, but tasks on unseen setups drop to lower rates without finetuning). Finetuning requires at least ~100 demonstrations; if you have fewer, the gains diminish. The model's size (93M parameters) still requires reasonable compute for inference, so robots with embedded systems may struggle. There are also subtle failure modes: tasks with long horizons (>30 steps) or precise contact-rich manipulation (threading a needle) remain challenging. The evaluation is limited to single-arm and basic dual-arm systems; complex manipulation with coordinated multi-arm systems isn't addressed. Finally, like all learning-based approaches, Octo can fail in out-of-distribution conditions: a new material or significantly different lighting can cause performance degradation.

WHAT COMES NEXT

The direction is clear: expect larger pretrained models trained on orders of magnitude more data as more robotics labs contribute to datasets like Open X-Embodiment. The next version will likely incorporate video understanding (learning from YouTube videos), multi-task language understanding (handling complex compositional instructions), and online adaptation (adapting policies in real time as the robot encounters new scenarios). There's also research into distillation: could you compress a 93M parameter Octo into a 5M parameter model that runs on robots' onboard compute? Finally, the field needs a better understanding of when and why finetuning works: a theory explaining the gap would guide architecture design. Longer term, expect a shift toward foundation models for robotics similar to how BERT transformed NLP: one massive pretrained model finetuned for everything from logistics to surgery, trained on datasets that span years and millions of hours.
