IMITATION-LEARNING | FOUNDATIONAL | 2024-01-04

Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation

Zipeng Fu, Tony Z. Zhao, Chelsea Finn

ARCHITECTURE

behavior cloning

ROBOT

Mobile ALOHA (ALOHA with mobile base)

DATASET

50 demonstrations per task

KEY METRIC

90%

TASK

mobile manipulation, bimanual manipulation, cooking, navigation

Mobile ALOHA represents a major step forward in practical robotics: a robot that can move around your kitchen, use both arms in coordination, and learn complex multi-step tasks like sautéing shrimp, opening cabinets, and calling elevators, all from just 50 human demonstrations per task. The breakthrough here isn't just the hardware (though that's impressively affordable); it's the insight that you can teach a mobile robot to do real-world manipulation by combining demonstrations on a mobile platform with data from existing static-arm datasets. With this approach, the system achieves success rates of up to 90% on genuinely difficult tasks. This matters because almost every useful job requires both mobility and dexterity, and until now combining those two capabilities has been prohibitively complex.

THE PROBLEM

Before Mobile ALOHA, robot learning had a major blind spot: its successes were mostly stationary. Previous work like the original ALOHA system could teach stationary robots to do impressive manipulation (think: folding laundry with both arms while staying in one place). But the real world doesn't work that way. Real tasks require a robot to navigate to a location, then manipulate something, then move again. The gap was huge. Existing systems either (a) used expensive, hard-to-control whole-body interfaces that required experts, or (b) trained on simplified tasks that didn't require coordinated mobile and bimanual control. You couldn't easily collect high-quality demonstrations for complex tasks because the systems were clunky. And when researchers tried imitation learning on mobile manipulation, sample efficiency was terrible: you'd need hundreds of demos, not dozens.

HOW IT WORKS

Design a low-cost, intuitive teleoperation interface

The team built Mobile ALOHA by augmenting the existing ALOHA two-arm system with a mobile base and, critically, a whole-body teleoperation interface that maps the operator's movements directly to the robot. Think of it as puppeteering rather than joystick driving: the operator is physically tethered to the base and pushes or pulls it around by walking, while both hands move a pair of leader arms that the robot's arms mirror in real time. This lowers the cognitive load on the human demonstrator: you're not thinking "move joint 3 by 5 degrees," you're just moving naturally and the interface does the mapping. This matters tremendously: if data collection is hard and requires expert operators, you won't get 50 clean demonstrations. You'll get 5 bad ones. By making teleoperation intuitive, the team made data collection fast and high-quality.

[Video: Mobile ALOHA]

Collect mobile manipulation demonstrations

Using the teleoperation interface, humans collected 50 demonstrations per task for complex, real-world activities: sautéing shrimp (involves navigating the kitchen, arm coordination, timing), opening a two-door wall cabinet (spatial reasoning plus bimanual coordination), calling and entering an elevator (navigation plus manipulation), and washing a pan under a faucet (fine-grained bimanual manipulation near water and sharp objects). Each demo is recorded as a sequence of full-body states: mobile base position and orientation, joint angles for both arms, and gripper states. The key insight is that 50 demos is achievable for humans in a reasonable timeframe, unlike hundreds.
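One way to picture the recorded "sequence of full-body states" is as a list of per-timestep records. The sketch below is illustrative only: the class and field names are hypothetical, and the choice of six joints per arm and a planar `(x, y, heading)` base pose are assumptions, not details taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Timestep:
    """One recorded step of a demonstration (field names are hypothetical)."""
    base_pose: Tuple[float, float, float]  # assumed planar pose: x, y, heading
    left_arm_joints: List[float]           # joint angles of the left arm
    right_arm_joints: List[float]          # joint angles of the right arm
    left_gripper: float                    # gripper opening, left hand
    right_gripper: float                   # gripper opening, right hand


@dataclass
class Demonstration:
    """A single teleoperated demo: a task label plus an ordered list of steps."""
    task: str
    steps: List[Timestep] = field(default_factory=list)

    def state_vector(self, t: int) -> List[float]:
        """Flatten step t into one vector: base, then left arm, then right arm."""
        s = self.steps[t]
        return [*s.base_pose,
                *s.left_arm_joints, s.left_gripper,
                *s.right_arm_joints, s.right_gripper]


# Toy usage: one step with 6 joints per arm (an assumption for illustration).
demo = Demonstration(task="wash pan")
demo.steps.append(Timestep((0.0, 0.0, 0.0), [0.0] * 6, [0.0] * 6, 1.0, 1.0))
print(len(demo.state_vector(0)))  # 3 base + 7 left + 7 right = 17
```

Flattening everything into one vector per timestep is a design convenience for feeding a network; a real recording would also carry camera images and timestamps.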

[Video: teleoperation in a restroom, 10x speed]

Train with behavior cloning and co-training

The team used straightforward behavior cloning: train a neural network to predict the next action (arm movements, base movements, gripper commands) given the current observation. But here's where it gets clever: they didn't train only on the 50 demos. They co-trained on existing static ALOHA datasets (bimanual demonstrations with no mobile base). This seems counterintuitive: why would stationary-arm data help a mobile robot? The answer is that a lot of the hard problem (coordinating two arms in space, understanding object geometry) is identical whether the robot moves or not. The mobile base is almost a separate channel. By training on both datasets simultaneously, the network learns the manipulation skills from the larger static dataset and the mobility skills from the smaller mobile dataset. This is why 50 demos works: you're not learning everything from scratch.
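The co-training idea can be sketched as a batch sampler that mixes the two datasets, plus a simple imitation loss. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the 50/50 mixing ratio (`p_mobile`), the zero-padding of base commands for static data, the action dimensions, and the L1 loss are all illustrative choices.

```python
import random
from typing import List, Optional, Sequence, Tuple

Example = Tuple[List[float], List[float]]  # (observation, action)

ARM_DIMS = 14   # assumed: 7 values per arm (6 joints + 1 gripper)
BASE_DIMS = 2   # assumed: linear and angular base velocity


def pad_static_action(arm_action: Sequence[float]) -> List[float]:
    """Static-arm demos carry no base command, so append zeros for the base."""
    return list(arm_action) + [0.0] * BASE_DIMS


def sample_cotrain_batch(mobile: Sequence[Example],
                         static: Sequence[Example],
                         batch_size: int,
                         p_mobile: float = 0.5,
                         rng: Optional[random.Random] = None) -> List[Example]:
    """Draw each example from the mobile set with probability p_mobile,
    otherwise from the static set (with its action padded to full size)."""
    rng = rng or random.Random()
    batch: List[Example] = []
    for _ in range(batch_size):
        if rng.random() < p_mobile:
            batch.append(rng.choice(mobile))
        else:
            obs, act = rng.choice(static)
            batch.append((obs, pad_static_action(act)))
    return batch


def bc_loss(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Behavior cloning objective: mean absolute error between the predicted
    action and the demonstrated action (L1 is one common choice)."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


# Toy data: 3 mobile examples with 16-dim actions, 5 static with 14-dim actions.
mobile_data = [([0.0], [0.1] * (ARM_DIMS + BASE_DIMS)) for _ in range(3)]
static_data = [([0.0], [0.2] * ARM_DIMS) for _ in range(5)]
batch = sample_cotrain_batch(mobile_data, static_data, 8, rng=random.Random(0))
print(all(len(action) == ARM_DIMS + BASE_DIMS for _, action in batch))
```

Sampling per example rather than concatenating the datasets keeps the small mobile set from being drowned out by the larger static one, which is the practical point of co-training.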

Deploy and evaluate on real kitchen tasks

After training, the policy runs autonomously on the tasks it learned. No explicit optimization at run time, just inference: given the current observation, output the next action. The team tested on diverse, hard tasks: sautéing requires timing and force control; opening cabinets requires precise door handling; elevators require navigation and button pressing. They measured task success (did the robot complete the full task?) and safety (did it break anything or hurt anyone?). The result: up to 90% success on some tasks. The number stands out not because 90% is magical, but because 90% on real-world kitchen tasks with only 50 demos is unprecedented.
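The deployment loop described above (observe, predict, act, repeat) and the success-rate metric can be sketched in a few lines. Everything here is hypothetical scaffolding: `run_episode`, `ToyEnv`, and the constant policy stand in for the real robot interface and trained network, which the summary does not specify.

```python
from typing import Callable, List, Sequence, Tuple

Observation = List[float]
Action = List[float]


def run_episode(policy: Callable[[Observation], Action],
                reset: Callable[[], Observation],
                step: Callable[[Action], Tuple[Observation, bool]],
                max_steps: int) -> bool:
    """Closed-loop rollout: each action comes from one policy call on the
    latest observation. Returns True if the task finished within max_steps."""
    obs = reset()
    for _ in range(max_steps):
        action = policy(obs)
        obs, done = step(action)
        if done:
            return True
    return False


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of trials in which the full task was completed."""
    return sum(outcomes) / len(outcomes)


class ToyEnv:
    """Stand-in for the robot: the 'task' is to reach position 1.0 on a line."""

    def __init__(self) -> None:
        self.pos = 0.0

    def reset(self) -> Observation:
        self.pos = 0.0
        return [self.pos]

    def step(self, action: Action) -> Tuple[Observation, bool]:
        self.pos += action[0]
        return [self.pos], self.pos >= 1.0


env = ToyEnv()
constant_policy = lambda obs: [0.25]  # placeholder for a trained network
outcomes = [run_episode(constant_policy, env.reset, env.step, max_steps=10)
            for _ in range(5)]
print(success_rate(outcomes))
```

On a real system the `done` signal would come from a human judging whether the full task was completed, so a figure like 90% corresponds to 9 successful trials out of 10.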

MORE DEMONSTRATIONS

cook shrimp

wipe wine

take elevator

use cabinets

wash pan

push chairs

high five third person

high five moving cam

wipe wine 9 trials 8x speed

take elevator 5 trials 8x speed

use cabinets 3 pots 8x speed

use cabinets distractors

push chairs morning 7 trials 8x speed

push chairs night 6 trials 8x speed

KEY RESULTS

Success rate on complex mobile manipulation tasks: up to 90%

vs. ~10-20% for behavior cloning without co-training on static data

This is the headline result. Tasks like sautéing shrimp or opening a two-door cabinet require 10-30 sequential actions with both arms and mobile base coordination. Achieving 90% on real test instances outside the training demonstrations is state-of-the-art for imitation learning at this complexity.
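To see why 90% on a task with 10-30 sequential actions is demanding, consider how reliable each individual step must be. The sketch below assumes, purely for illustration, that steps succeed or fail independently with the same probability; real tasks violate this, so treat the numbers as a rough guide. The function name is hypothetical.

```python
def required_step_success(task_success: float, num_steps: int) -> float:
    """Per-step success probability p such that p ** num_steps equals
    task_success, assuming independent and equally reliable steps."""
    return task_success ** (1.0 / num_steps)


for n in (10, 30):
    p = required_step_success(0.9, n)
    print(f"{n} steps: each step must succeed with probability {p:.4f}")
```

Under this assumption each step has to succeed roughly 99% of the time or better, which is why small per-step errors compound so quickly on long tasks.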

Demonstrations required per task: 50

vs. 100-500 for previous mobile manipulation learning approaches

Sample efficiency is critical in robotics because collecting real-world demonstrations is slow and expensive. Going from 100-500 demonstrations to 50 through co-training cuts requirements by 50-90%, a massive practical win. It means a company could train a new task in a workday instead of a week.

Number of distinct mobile manipulation tasks learned: 6 (sautéing, cabinet opening, elevator, pan washing, and others)

vs. 1-2 for previous work

Breadth matters. The team didn't just get one task working; they demonstrated that the system generalizes across multiple task types: navigation, bimanual coordination, fine-motor control, and reasoning about object geometry. This suggests the approach isn't a one-off trick.

WHY DEVELOPERS SHOULD CARE

If you're building robotics software, Mobile ALOHA teaches you three critical lessons. First: intuitive data collection beats powerful algorithms. The team didn't invent a new fancy learning algorithm; they used basic behavior cloning that's been around for years. What they did differently was make it trivially easy to collect good data. Developers should obsess over reducing friction in data collection, because 50 clean demos beat 500 noisy ones. Second: transfer across datasets is real and underutilized. The insight that static-arm datasets help mobile robots isn't obvious, but it works. When you're building a system, hunt for related tasks or morphologies you can borrow data from, since it can cut your data requirements substantially. Third, this shows that complex real-world manipulation is now tractable with imitation learning. For 5-10 years, the robotics field was split between 'tabletop manipulation' (solved-ish with deep learning) and 'mobile manipulation in real homes' (impossibly hard). Mobile ALOHA narrows that gap. As a developer, this means you can now prototype practical home robots or kitchen automation using end-to-end learning, not hand-engineered pipelines.

LIMITATIONS

Mobile ALOHA doesn't solve generalization. If the robot learns to sauté shrimp in one kitchen, it might fail in another kitchen with a different stove layout or lighting. The system is also brittle to object variation: a different pan shape, or shrimp orientation, can cause failure. Additionally, 90% success means 1 in 10 times the robot fails, which is unacceptable for safety-critical tasks or tasks with low error tolerance (like holding a hot pan). The approach also requires accurate state estimation (knowing where the base is, where the arms are) and doesn't handle cases where you need strategic long-horizon reasoning or error recovery: if the robot spills wine, it can't adaptively replan. Finally, the 50-demo requirement still assumes access to the full teleoperation rig and supporting infrastructure, which is expensive to set up.

WHAT COMES NEXT

The next generation of Mobile ALOHA will likely tackle three frontiers: (1) generalization across environments: training a single policy that works in any kitchen, not just the one where it was trained; (2) error recovery: teaching robots not just how to do a task right, but what to do when things go wrong; (3) semantic understanding: learning from language descriptions of tasks, not just raw observations, so a human can say 'cook the shrimp like you did yesterday but hotter' and the robot adapts. The data collection interface will probably also become cheaper and more accessible, eventually dropping to consumer VR hardware instead of custom rigs. And crucially, these systems may be deployed in real homes and restaurants, not just research labs, which would create a feedback loop of real data that makes them better.


RELATED PAPERS

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Octo: An Open-Source Generalist Robot Policy

Dex-Net 2.0: Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics