mobile manipulation, bimanual manipulation, cooking, navigation
Mobile ALOHA represents a major leap forward in practical robotics: a robot that can move around your kitchen, use both arms in coordination, and learn complex multi-step tasks like sautéing shrimp, opening cabinets, and calling elevators, all from just 50 human demonstrations per task. The breakthrough here isn't just the robot hardware (though that's impressively affordable); it's the insight that you can teach a mobile robot to do real-world manipulation by combining demonstrations on a mobile platform with data from existing static-arm datasets. With this approach, the system reaches success rates of up to 90% on genuinely difficult tasks. This matters because almost every useful robot job requires both mobility and dexterity, and until now, combining those two capabilities has been prohibitively complex.
THE PROBLEM
Before Mobile ALOHA, robotics imitation learning had a major blind spot: mobility. Previous work like the original ALOHA system could teach stationary robots to do impressive bimanual manipulation (think: folding laundry with both arms while staying in one place). But the real world doesn't work that way. Real tasks require a robot to navigate to a location, then manipulate something, then move again. The gap was huge. Existing mobile manipulation systems either (a) used expensive, hard-to-control whole-body teleoperation interfaces that required experts, or (b) trained on simplified tasks that didn't require coordinated mobile and bimanual control. You couldn't easily collect high-quality demonstrations for complex mobile manipulation because the teleoperation systems were clunky. And when researchers tried behavior cloning on mobile manipulation, data efficiency was terrible: you'd need hundreds of demos, not dozens.
HOW IT WORKS
1. Design a low-cost, intuitive teleoperation interface
The team built Mobile ALOHA by augmenting the existing ALOHA two-arm system with a mobile base and, critically, a whole-body teleoperation interface that maps the operator's movements directly to the robot. The operator is physically tethered to the base and pushes or pulls it around while moving two leader arms that the robot's arms mirror in real time. This lowers the cognitive load on the human demonstrator: you're not thinking "move joint 3 by 5 degrees," you're just moving naturally and the interface does the mapping. This matters tremendously: if data collection is hard and requires expert operators, you won't get 50 clean demonstrations. You'll get 5 bad ones. By making teleoperation intuitive, they made data collection fast and high-quality.
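At its core, "the interface does the mapping" means a function from operator input to robot command. The following is a minimal sketch, assuming the operator's input arrives as one reading per robot joint. The function name `map_operator_to_robot` and the joint limits are hypothetical and are not taken from the Mobile ALOHA codebase.

```python
from typing import List, Sequence, Tuple


def map_operator_to_robot(
    operator_joints: Sequence[float],
    joint_limits: Sequence[Tuple[float, float]],
) -> List[float]:
    """Turn operator joint readings into robot joint targets.

    Each reading is passed through unchanged unless it falls outside the
    robot's allowed range, in which case it is clamped to the nearest limit.
    """
    if len(operator_joints) != len(joint_limits):
        raise ValueError("need exactly one (low, high) limit per joint")
    return [
        min(max(reading, low), high)
        for reading, (low, high) in zip(operator_joints, joint_limits)
    ]
```

The operator never sees this function. They move, and the clamped targets keep the robot inside its safe range.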
2. Collect mobile manipulation demonstrations
Using the teleoperation interface, humans collected 50 demonstrations per task for complex, real-world activities: sautéing shrimp (navigation around the kitchen, arm coordination, timing), opening a two-door wall cabinet (spatial reasoning plus bimanual coordination), calling and entering an elevator (perception plus navigation), and washing a pan under a faucet (fine-grained bimanual control near water and sharp objects). Each demo is recorded as a sequence of full-body states: mobile base position and orientation, joint angles for both arms, and gripper states. The key insight is that 50 demos is achievable for humans in a reasonable timeframe, unlike hundreds.
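One way to picture such a recording is as a list of body states, one per timestep. The sketch below is illustrative only: the class and field names (`BodyState`, `Demonstration`, `base_pose`) are assumptions, not the project's actual data format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class BodyState:
    base_pose: Tuple[float, float, float]  # assumed layout: x, y, yaw
    left_arm_joints: List[float]           # joint angles in radians
    right_arm_joints: List[float]
    left_gripper: float                    # 0.0 = closed, 1.0 = open (assumed)
    right_gripper: float


@dataclass
class Demonstration:
    task: str
    states: List[BodyState] = field(default_factory=list)

    def to_state_action_pairs(self) -> List[Tuple[BodyState, BodyState]]:
        """Pair each state with the state that followed it.

        Using the next recorded state as the action target is a
        simplification chosen for this sketch.
        """
        return list(zip(self.states[:-1], self.states[1:]))
```

A demonstration with N recorded states yields N - 1 training pairs, which is the form a supervised learner consumes.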
teleop restroom 10x speed
3. Train with behavior cloning and co-training
The team used straightforward supervised learning: train a neural network to predict the next robot action (arm movements, base movements, gripper commands) given the current state. But here's where it gets clever: they didn't train only on the 50 mobile manipulation demos. They co-trained on existing static ALOHA datasets (bimanual manipulation with no mobile base). This seems counterintuitive: why would stationary-arm data help a mobile robot? The answer is that much of the hard problem (coordinating two arms in space, understanding object geometry) is identical whether the robot moves or not. The mobile base is almost a separate control channel. By training on both datasets simultaneously, the network learns the manipulation skills from the larger static dataset and the mobility skills from the smaller mobile dataset.
This is why 50 demos works—you're not learning everything from scratch.
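Here is a rough sketch of how co-training batches could be assembled, assuming each training example is drawn from either the mobile or the static dataset with a fixed probability. The 50/50 default, the function names, and the zero-padding of the missing base command for static examples are all assumptions made for illustration.

```python
import random
from typing import Iterator, List, Sequence


def pad_static_action(arm_action: Sequence[float], base_dims: int = 2) -> List[float]:
    """Static-arm data has no base command, so append zeros for it
    (an assumed convention that keeps action sizes consistent)."""
    return list(arm_action) + [0.0] * base_dims


def cotraining_batches(
    static_data: Sequence,
    mobile_data: Sequence,
    batch_size: int,
    num_batches: int,
    mobile_prob: float = 0.5,
    seed: int = 0,
) -> Iterator[List]:
    """Yield batches that mix examples from both datasets.

    For every slot in a batch, pick the mobile dataset with probability
    `mobile_prob` and the static dataset otherwise, then sample one
    example uniformly from the chosen dataset.
    """
    rng = random.Random(seed)
    for _ in range(num_batches):
        batch = []
        for _ in range(batch_size):
            source = mobile_data if rng.random() < mobile_prob else static_data
            batch.append(rng.choice(source))
        yield batch
```

Sampling per example rather than per epoch means the small mobile dataset is seen as often as the large static one, so neither drowns out the other.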
4. Deploy and evaluate on real kitchen tasks
After training, the robot runs autonomously on the tasks it learned. No planning, no explicit optimization, just behavior cloning: given the current observation, output the action. The team tested on diverse, hard tasks: sautéing requires timing and force control; opening cabinets requires precise door manipulation; elevators require navigation and button pressing. They measured success rate (did the robot complete the full task?) and safety (did it break anything or hurt anyone?). The result: up to 90% success on some tasks. This stands out in robotics imitation learning, not because 90% is magical, but because 90% on real-world kitchen tasks with only 50 demos is unprecedented.
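The deployment loop and the success-rate metric are simple enough to sketch directly. The environment interface here (`reset`, plus `step` returning an observation, a done flag, and a success flag) is a hypothetical stand-in for the real robot, not an actual API.

```python
from typing import Any, Callable, Protocol, Tuple


class Env(Protocol):
    """Hypothetical interface standing in for the physical robot."""

    def reset(self) -> Any: ...

    def step(self, action: Any) -> Tuple[Any, bool, bool]: ...


def run_episode(policy: Callable[[Any], Any], env: Env, max_steps: int = 1000) -> bool:
    """Run one autonomous attempt: observe, act, repeat until done.

    Returns True only if the environment reports success before the
    step budget runs out.
    """
    observation = env.reset()
    for _ in range(max_steps):
        observation, done, success = env.step(policy(observation))
        if done:
            return success
    return False


def success_rate(policy: Callable[[Any], Any], env: Env, trials: int) -> float:
    """Fraction of independent attempts that completed the full task."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return sum(run_episode(policy, env) for _ in range(trials)) / trials
```

There is no planner in the loop: the policy is called once per step and its output is sent straight to the robot.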
MORE DEMONSTRATIONS
cook shrimp
wipe wine
take elevator
use cabinets
wash pan
push chairs
high five third person
high five moving cam
wipe wine 9 trials 8x speed
take elevator 5 trials 8x speed
use cabinets 3 pots 8x speed
use cabinets distractors
push chairs morning 7 trials 8x speed
push chairs night 6 trials 8x speed
KEY RESULTS
Success rate on complex mobile manipulation tasks: up to 90%
vs. ~10-20% for behavior cloning without co-training on static data
This is the headline result. Tasks like sautéing shrimp or opening a two-door cabinet require 10-30 sequential actions with both arms and mobile base coordination. Achieving 90% on real test trials the robot was not trained on is state-of-the-art for imitation learning at this task complexity.
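To see why long action sequences are hard, consider a back-of-the-envelope model in which a task is a chain of steps that each succeed independently with the same probability, so the whole task succeeds with that probability raised to the number of steps. Real robot failures are not independent, so this is only a rough intuition, and the function names are illustrative.

```python
def task_success(per_step: float, steps: int) -> float:
    """Probability that every one of `steps` independent steps succeeds."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step reliability needed to reach `target` overall success."""
    return target ** (1.0 / steps)
```

Under this model, reaching 90% over 10 steps needs roughly 99% reliability per step, and over 30 steps roughly 99.6%. A policy that is 95% reliable per step would finish a 30-step task only about 21% of the time.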
Demonstrations required per task: 50
vs. 100-500 for previous mobile manipulation learning approaches
Data efficiency is critical in robotics because collecting real-world demonstrations is slow and expensive. Going from 100-500 demonstrations down to 50 cuts requirements by 50-90%, which is a massive practical win. It means a company could train a new robot skill in a workday instead of a week.
Number of distinct mobile manipulation tasks learned: 7 (sautéing, cabinet opening, elevator, pan washing, and others)
vs. 1-2 for previous work
Breadth matters. The team didn't just get one task working; they demonstrated that the system generalizes across multiple skill types: navigation, bimanual coordination, fine-motor control, and reasoning about object geometry. This suggests the approach isn't a one-off trick.
WHY DEVELOPERS SHOULD CARE
If you're building robotics software, Mobile ALOHA teaches you three critical lessons. First: intuitive data collection beats powerful algorithms. The team didn't invent a fancy new learning algorithm; they used basic behavior cloning that's been around for years. What they did differently was make it trivially easy to collect good data. Developers should obsess over reducing friction in data collection, because 50 clean demos beat 500 noisy ones. Second: transfer learning is real and underutilized. The insight that static-arm datasets help mobile robots isn't obvious, but it works. When you're building a robot system, hunt for related tasks or robot morphologies you can borrow data from; it can cut your data requirements by half. Third: this proves that complex real-world manipulation is now tractable with imitation learning. For 5-10 years, the robotics field was split between 'tabletop manipulation' (solved-ish with deep learning) and 'mobile manipulation in real homes' (impossibly hard). Mobile ALOHA closes that gap.
As a developer, this means you can now build practical home robots or kitchen automation using end-to-end learning, not hand-engineered pipelines.
LIMITATIONS
Mobile ALOHA doesn't solve generalization. If the robot learns to sauté shrimp in one kitchen, it might fail in another kitchen with a different stove layout or lighting. The system is also brittle to distribution shift: a different pan shape or shrimp orientation can cause failure. Additionally, 90% success means the robot fails 1 time in 10, which is unacceptable for safety-critical tasks or tasks with low error tolerance (like holding a hot pan). The approach also requires accurate state estimation (knowing where the robot is, where the arms are) and doesn't handle cases that need strategic long-horizon reasoning or error recovery: if the robot spills wine, it can't adaptively replan. Finally, the 50-demo requirement still assumes access to the full teleoperation rig and training infrastructure, which is expensive to set up.
WHAT COMES NEXT
The next generation of Mobile ALOHA will likely tackle three frontiers: (1) generalization across environments: training a single policy that works in any kitchen, not just the one where it was trained; (2) error recovery: teaching robots not just how to do a task right, but what to do when things go wrong; (3) semantic understanding: learning from language descriptions of tasks, not just raw observations, so a human can say 'cook the shrimp like you did yesterday but hotter' and the robot adapts. You'll also see the data collection interface become even cheaper and more accessible, eventually dropping to consumer VR hardware instead of custom rigs. And crucially, we'll see these systems deployed in real homes and restaurants, not just research labs, which will create a feedback loop of real data that makes them better.