CONTROL CURRENT 2025-08-11

BeyondMimic: From Motion Tracking to Versatile Humanoid Control via Guided Diffusion

Qiayuan Liao, Takara E. Truong, Xiaoyu Huang, Yuman Gao, Guy Tevet, Koushil Sreenath, C. Karen Liu

ARCHITECTURE
diffusion policy, latent diffusion model with classifier guidance
ROBOT
humanoid
TASK
locomotion, manipulation, motor control

BeyondMimic demonstrates that humanoid robots can learn to perform genuinely impressive athletic feats (aerial cartwheels, spin-kicks, flip-kicks, sprinting) directly from human motion capture data, and then compose these skills to solve completely new tasks like obstacle avoidance and joystick control. The breakthrough is twofold: first, a motion tracking framework that reliably converts human motion data into real robot hardware commands without the usual unnatural jerking or motion-specific tuning; second, a diffusion model that learns these primitives and generalizes them zero-shot to unseen tasks. This matters because previous humanoid control systems either produced stiff, inhuman-looking motions or required hand-tuning for each new skill. BeyondMimic uses a single setup and shared hyperparameters across 14 completely different motion sequences, which is a massive engineering win. The team then deployed everything to real hardware without retraining; that zero-shot sim-to-real transfer is the kind of thing that keeps roboticists up at night trying to achieve.

THE PROBLEM

Before BeyondMimic, humanoid robot control faced two stubborn problems. First, motion tracking (making a robot physically follow human motion capture data) was either brittle or required constant hand-tweaking per motion type. Systems like physics-informed neural networks and reinforcement learning (RL) approaches would either produce jerky, unnatural movements or need task-specific reward shaping for each new behavior. Second, motion imitation was siloed: you'd train one policy for walking, another for jumping, another for manipulation. Composing these into new behaviors or adapting them to novel goals at test time was nearly impossible. Prior work (like mocap-guided RL or motion-specific controllers) could handle individual skills but couldn't scale across diverse, highly dynamic motions, and certainly couldn't generalize to tasks outside their training distribution. The industry standard was: one motion type = one training run = one hyperparameter set. BeyondMimic breaks this.

HOW IT WORKS

1

Compact Motion-Tracking MDP Formulation

The authors designed a motion tracking setup that removes motion-specific tuning by treating kinematic tracking as a standard reinforcement learning (RL) problem, with one twist: they use a single, shared reward function across all 14 different motions (cartwheels, sprints, kicks, etc.). The key innovation is a compact formulation that tracks reference joint angles while letting the policy discover balance, timing, and ground contact implicitly. This sidesteps the usual approach of hand-crafting cost functions for each skill type. The same MDP setup and hyperparameters work for both a slow cartwheel and a fast sprint, which suggests the formulation captures something fundamental about how humanoids move. The motion tracking pipeline outputs trajectories that are both dynamically feasible (the robot can actually execute them) and human-like (they look natural, not robotic).
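To make the "one reward function for every motion" idea concrete, here is a minimal sketch. It is not the paper's reward: the exponential-of-squared-error form, the name `tracking_reward`, the `sigma` value, and the three-joint reference poses are all assumptions chosen for illustration. The point is only that the function and its hyperparameter stay the same while the reference motion changes.

```python
import numpy as np

def tracking_reward(joint_pos, ref_joint_pos, sigma=0.5):
    """Hypothetical tracking reward: 1.0 for perfect tracking, decaying toward 0 as error grows."""
    err = np.asarray(joint_pos, dtype=float) - np.asarray(ref_joint_pos, dtype=float)
    mean_sq_err = float(np.mean(err ** 2))
    return float(np.exp(-mean_sq_err / sigma ** 2))

# Made-up single-frame reference joint angles (radians) for two very different motions.
reference_motions = {
    "cartwheel": np.array([0.3, -1.2, 0.8]),
    "sprint": np.array([0.9, 0.1, -0.4]),
}

# The same function and the same sigma score both motions; only the reference differs.
rewards = {
    name: tracking_reward(ref + 0.1, ref)  # robot is 0.1 rad off on every joint
    for name, ref in reference_motions.items()
}
```

Because the reward depends only on the error relative to the reference, an identical tracking error earns an identical score for the cartwheel and the sprint. That is the property that lets one hyperparameter set serve all motions in this toy setting.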

2

Distillation into a Unified Latent Diffusion Model

Once the team had mastered tracking 14 diverse motions, they faced the composition problem: how do you blend these into a single policy that can switch between them or combine them on the fly? They distilled all these motion-tracking policies into a single latent diffusion model, essentially a generative model that learns the underlying structure of human motion. Diffusion models are powerful here because they naturally capture uncertainty and multimodality: there are many valid ways to reach a waypoint or avoid an obstacle. By encoding policies in a learned latent space, the model becomes amenable to test-time guidance. This is where the real versatility comes from: you can now specify goals at test time using simple cost functions (like 'reach this waypoint' or 'avoid this obstacle'), and the diffusion model generates a trajectory that solves the task while staying faithful to learned human motion priors. You're not retraining; you're reusing.
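The "simple cost functions" mentioned above might look something like the following sketch. The function names, the planar (x, y) trajectory representation, and the circular obstacle are illustrative assumptions, not the paper's definitions. Each function maps a candidate trajectory to a scalar, where lower is better.

```python
import numpy as np

def waypoint_cost(xy_traj, waypoint):
    """Squared distance from the final trajectory point to the waypoint."""
    return float(np.sum((xy_traj[-1] - waypoint) ** 2))

def obstacle_cost(xy_traj, center, radius):
    """Sum of squared penetration depths into a circular obstacle."""
    dist = np.linalg.norm(xy_traj - center, axis=1)
    penetration = np.maximum(radius - dist, 0.0)  # zero when outside the circle
    return float(np.sum(penetration ** 2))

# A straight-line candidate trajectory of root positions (hypothetical values, in meters).
traj = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
goal = np.array([2.0, 1.0])

# Costs for different objectives can simply be summed into one test-time objective.
total = waypoint_cost(traj, goal) + obstacle_cost(traj, np.array([1.0, 0.5]), 1.0)
```

Here the trajectory ends 1 m from the goal and its middle point sits 0.5 m inside the obstacle, so both terms contribute to `total`. Keeping the costs smooth in the trajectory matters because guidance typically needs their gradients.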

3

Classifier Guidance for Zero-Shot Task Generalization

The final trick is classifier guidance, a diffusion-specific technique for steering a generative model toward specific objectives at test time. The authors use a lightweight classifier or cost function that scores how well a candidate motion achieves a goal (e.g., reaching a waypoint, avoiding an obstacle, responding to joystick commands). During inference, they nudge the diffusion sampling process toward high-scoring trajectories. This is powerful because it lets the model solve tasks it never saw during training without any fine-tuning. Joystick teleoperation? Never trained on that explicitly. Obstacle avoidance with specific geometries? Works zero-shot. Motion inpainting (filling in missing frames)? Works too. The classifier guidance acts as a bridge between the motion priors learned from human data and the new task at hand. Critically, this all transfers to real hardware without retraining: the real robot runs the same diffusion model and classifier guidance that worked in simulation.
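The nudging step can be illustrated with a toy loop. This is a sketch under heavy assumptions: `toy_denoise` stands in for the learned diffusion model (it simply pulls the sample toward a fixed "prior" point), the noise schedule is reduced to an optional constant `noise_std`, and the cost is a squared distance to a waypoint. None of these names or values come from the paper. What it shows is the mechanism: after each denoising update, the sample is shifted along the negative gradient of the task cost.

```python
import numpy as np

def toy_denoise(x, prior_mean, step=0.2):
    """Stand-in for a learned denoiser: moves the sample toward the motion prior."""
    return x + step * (prior_mean - x)

def waypoint_cost_grad(x, waypoint):
    """Gradient of the cost ||x - waypoint||^2 with respect to x."""
    return 2.0 * (x - waypoint)

def guided_sample(x_init, prior_mean, waypoint, guidance_scale,
                  n_steps=50, noise_std=0.0, seed=0):
    """Iterative denoising with a cost-gradient nudge after every step."""
    rng = np.random.default_rng(seed)
    x = np.array(x_init, dtype=float)
    for _ in range(n_steps):
        x = toy_denoise(x, prior_mean)                            # follow the prior
        x = x - guidance_scale * waypoint_cost_grad(x, waypoint)  # steer toward the goal
        x = x + noise_std * rng.standard_normal(x.shape)          # optional sampling noise
    return x

prior_mean = np.array([0.0, 0.0])  # where the unguided toy model ends up
waypoint = np.array([1.0, 0.0])    # test-time goal the model was never trained on

unguided = guided_sample([3.0, 3.0], prior_mean, waypoint, guidance_scale=0.0)
guided = guided_sample([3.0, 3.0], prior_mean, waypoint, guidance_scale=0.1)
```

With the guidance scale at zero the sample settles at the prior. With a positive scale it settles between the prior and the waypoint, which mirrors the trade-off described above: the result moves toward the goal but stays anchored to what the model learned. A larger scale pulls harder toward the goal at the expense of fidelity to the prior.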

KEY RESULTS

Motion Tracking Success Across Diverse Skills

14 distinct motion sequences (cartwheels, spin-kicks, flip-kicks, sprinting) tracked with state-of-the-art quality using 1 unified MDP setup

vs. prior work requiring motion-specific tuning for each behavior

This is the paper's foundational win. Previous systems would need 14 different reward functions, 14 different hyperparameter searches, and 14 different training runs. BeyondMimic does it with one. That's not just convenient: it suggests the formulation is capturing something robust and generalizable about humanoid dynamics. The fact that the same setup handles both acrobatic maneuvers (cartwheels) and locomotion (sprinting) is striking.

Zero-Shot Transfer to Unseen Tasks on Real Hardware

Waypoint navigation, joystick teleoperation, and obstacle avoidance all work without retraining

vs. prior humanoid control requiring task-specific fine-tuning or sim-to-real adaptation

This is the paper's biggest claim and the hardest to achieve. The diffusion model trained on motion primitives can be steered toward completely new objectives (obstacle geometry it never saw, arbitrary waypoints) using only test-time cost functions, and then deployed directly on real robot hardware. No retraining, no re-tuning of domain randomization. This zero-shot closing of the sim-to-real gap is exceptionally rare in robotics and suggests the learned representations are robust and task-agnostic.

Composability of Motion Primitives

Dynamic task switching and motion composition (e.g., navigate, then kick, then return) using a single diffusion model

vs. prior systems with siloed, non-composable policies

Previous humanoid systems trained separate policies for each skill and struggled to compose them. BeyondMimic's diffusion formulation allows seamless switching and blending. This unlocks richer behaviors: not just 'do a cartwheel' but 'navigate around obstacles while maintaining motion quality.' It's a step toward the kind of hierarchical, flexible control humans naturally perform.

WHY DEVELOPERS SHOULD CARE

For a developer building humanoid robotics software, BeyondMimic presents a paradigm shift. First, it shows that motion capture data is a genuinely scalable supervision signal if you have the right abstraction: you don't need hand-crafted rewards per skill, just a compact tracking MDP. If you're trying to build a humanoid that can do multiple behaviors, this framework says: collect mocap, track it robustly, distill into a diffusion model, and you get composability almost for free. Second, zero-shot generalization via classifier guidance is a practical tool: instead of training separate policies for joystick control, waypoint following, and obstacle avoidance, you train once on motion primitives, then guide at test time. This is much more sample-efficient and maintainable. Third, the sim-to-real transfer without retraining is rare enough that it's worth studying. The authors likely succeeded because their learned representations (motion primitives) are robust: they capture human motion structure rather than task-specific artifacts. If you're deploying to real hardware, this teaches you to invest in learning good latent representations early rather than patching sim-to-real gaps later.
The takeaway: use human demonstrations as a prior, distill to a latent space, and guide at test time. This is a more scalable path to versatile humanoid control than reward engineering or single-skill training.

LIMITATIONS

BeyondMimic's main limitations are practical rather than conceptual. The motion tracking pipeline still relies on external mocap or state estimation to function well: the real robot demos used mocap for waypoint and obstacle location, which isn't available in many real-world scenarios. The diffusion model, while versatile, still operates on a foundation of motion primitives: it can't invent entirely new classes of movement not present in the training data, and performance likely degrades gracefully as you request behaviors farther from the learned manifold. The paper doesn't deeply analyze failure modes: when does classifier guidance fail? How sensitive is performance to cost function design? Also, the approach assumes access to high-quality human mocap data for the relevant skill space, which limits its applicability to domains where good mocap isn't available (underwater robots, aerial drones, etc.). Finally, the computational cost of diffusion sampling at 10+ Hz for real-time control isn't discussed, which matters for deployment.

WHAT COMES NEXT

The natural next steps are (1) removing the mocap dependency by developing better proprioceptive and visual state estimation so the robot can operate fully autonomously in real environments, (2) extending the framework to full-body manipulation by including arm and hand tracking in the mocap imitation pipeline, (3) scaling to longer-horizon reasoning (the current work is good for 1-5 second behaviors; what about multi-minute tasks with planning?), and (4) exploring online adaptation: can the diffusion model update its motion priors from real-world experience without retraining from scratch? There's also an interesting direction in hierarchical diffusion: use one diffusion model to select high-level motion primitives and another to refine them for specific tasks. Finally, if sim-to-real transfer worked this well, there's probably a paper in understanding *why*: what properties of the learned representations made them transfer? That understanding could unlock even more robust frameworks for other robot morphologies.
