CONTROL · CURRENT · 2025-08-11

BeyondMimic: From Motion Tracking to Versatile Humanoid Control via Guided Diffusion

Qiayuan Liao, Takara E. Truong, Xiaoyu Huang, Yuman Gao, Guy Tevet, Koushil Sreenath, C. Karen Liu

ARCHITECTURE

diffusion policy, latent diffusion model with classifier guidance

ROBOT

humanoid

TASK

locomotion, manipulation, motor control

BeyondMimic demonstrates that humanoid robots can learn demanding athletic skills (aerial cartwheels, spin-kicks, flip-kicks, sprinting) directly from human motion capture data, and then compose these skills to solve new tasks such as waypoint navigation and joystick teleoperation. The contribution is twofold: first, a motion tracking framework that reliably converts human motion data into commands for real hardware without the usual unnatural jerking or motion-specific tuning; second, a diffusion model that learns these primitives and generalizes them to unseen tasks. This matters because previous humanoid systems either produced stiff, inhuman-looking motions or required hand-tuning for each new motion. BeyondMimic uses a single setup and shared hyperparameters across 14 different motion sequences, which removes most of the per-motion engineering effort. The team then deployed everything to real hardware without retraining, a transfer step that is notoriously hard to get right in robotics.

THE PROBLEM

Before BeyondMimic, humanoid control faced two stubborn problems. First, motion tracking, the process of making a robot physically follow human motion capture data, was either brittle or required constant hand-tweaking per motion type. Earlier systems, including physics-informed neural networks and related approaches, would either produce jerky, unnatural movements or need task-specific reward shaping for each new behavior. Second, motion imitation was siloed: you'd train one policy for walking, another for jumping, and yet another for each further skill. Composing these into new behaviors or adapting them to novel goals at test time was nearly impossible. Prior work (such as mocap-guided or motion-specific controllers) could handle individual skills but couldn't scale across diverse, highly dynamic motions, and certainly couldn't generalize to tasks outside its training distribution. The de facto standard was: one motion type = one training run = one hyperparameter set. BeyondMimic breaks this pattern.

HOW IT WORKS

Compact Motion-Tracking MDP Formulation

The authors designed a motion tracking setup that removes motion-specific tuning by treating kinematic tracking as a standard RL problem, with one twist: they use a single, shared MDP setup across all 14 motions (cartwheels, sprints, kicks, etc.). The key idea is a compact formulation that tracks reference angles while letting the policy discover balance, timing, and ground contact implicitly. This sidesteps the usual approach of hand-crafting cost functions for each motion type. The same MDP setup and hyperparameters work for both a slow cartwheel and a fast sprint, which suggests the formulation captures something fundamental about how humanoids move. The motion tracking pipeline outputs trajectories that are both dynamically feasible (the robot can actually execute them) and human-like (they look natural, not robotic).
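To make the idea of one shared tracking objective concrete, here is a minimal sketch. It is not the authors' reward: it assumes the tracked quantities are joint angles and uses a single exponential tracking term. The names `TrackingConfig` and `tracking_reward`, the value `sigma=0.5`, and the reference frames are all hypothetical, chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackingConfig:
    """Hypothetical shared settings, reused unchanged for every motion."""
    sigma: float = 0.5  # assumed tolerance (radians) for joint-angle error


def tracking_reward(joint_angles, reference_angles, cfg):
    """Return a reward in (0, 1]: 1.0 for perfect tracking, decaying with error."""
    if len(joint_angles) != len(reference_angles):
        raise ValueError("joint and reference vectors must have equal length")
    # Mean squared error between the robot's joint angles and the reference frame.
    mse = sum((q - r) ** 2 for q, r in zip(joint_angles, reference_angles)) / len(joint_angles)
    return math.exp(-mse / cfg.sigma ** 2)


SHARED_CFG = TrackingConfig()

# The same config scores frames from two very different (made-up) motions.
cartwheel_ref = [0.0, 1.2, -0.8]
sprint_ref = [0.4, -0.3, 0.9]
r_perfect = tracking_reward(cartwheel_ref, cartwheel_ref, SHARED_CFG)
r_offset = tracking_reward([0.5, -0.2, 1.0], sprint_ref, SHARED_CFG)
```

The design point is that `SHARED_CFG` never changes between motions: only the reference frames differ, which mirrors the "one setup, shared hyperparameters" claim above.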

Distillation into a Unified Latent Diffusion Model

Once the team had mastered tracking 14 diverse motions, they faced the composition problem: how do you blend these into a single policy that can switch between them or combine them on the fly? They distilled all of the motion-tracking policies into a single latent diffusion model, essentially a generative model that learns the underlying structure of human motion. Diffusion models are well suited here because they naturally capture uncertainty and multimodality: there are many valid ways to reach a waypoint or avoid an obstacle. Because the policies are encoded in a learned latent space, the model becomes amenable to test-time guidance. This is where the versatility comes from: you can specify goals at test time using simple cost functions (like 'reach this waypoint' or 'avoid this obstacle'), and the diffusion model generates a trajectory that solves the task while staying faithful to learned human motion priors. You're not retraining; you're reusing.
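As an illustration of what "simple cost functions" can look like, the sketch below scores a planned trajectory against a waypoint and a circular obstacle. This is an assumption-laden toy rather than the paper's formulation: the trajectory is reduced to a list of hypothetical 2D root positions, and `waypoint_cost`, `obstacle_cost`, `total_cost`, and `obstacle_weight` are illustrative names.

```python
import math


def waypoint_cost(trajectory, waypoint):
    """Squared distance from the final planned position to the waypoint."""
    x, y = trajectory[-1]
    return (x - waypoint[0]) ** 2 + (y - waypoint[1]) ** 2


def obstacle_cost(trajectory, center, radius):
    """Sum of squared penetration depths into a circular obstacle."""
    total = 0.0
    for x, y in trajectory:
        dist = math.hypot(x - center[0], y - center[1])
        total += max(0.0, radius - dist) ** 2
    return total


def total_cost(trajectory, waypoint, center, radius, obstacle_weight=10.0):
    """Weighted sum of both terms; the weight is an assumed tuning knob."""
    return (waypoint_cost(trajectory, waypoint)
            + obstacle_weight * obstacle_cost(trajectory, center, radius))


# Two hypothetical plans toward the waypoint (2, 0) with an obstacle at (1, 0).
straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
detour = [(0.0, 0.0), (1.0, 0.6), (2.0, 0.0)]
cost_straight = total_cost(straight, (2.0, 0.0), (1.0, 0.0), 0.5)
cost_detour = total_cost(detour, (2.0, 0.0), (1.0, 0.0), 0.5)
```

Both plans reach the waypoint, but the straight one passes through the obstacle and is penalized, so a sampler steered by this cost would prefer the detour.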

Classifier Guidance for Zero-Shot Task Generalization

The final ingredient is classifier guidance, a diffusion-specific technique for steering a generative model toward specific objectives at test time. The authors use a lightweight classifier or cost function that scores how well a candidate motion achieves a task (e.g., reaching a waypoint, avoiding an obstacle, responding to joystick commands). During inference, they nudge the diffusion sampling process toward high-scoring trajectories. This lets the model solve tasks it never saw during training without any retraining. Joystick teleoperation? Never trained on that explicitly. Obstacle avoidance with specific geometries? Works. Motion inpainting (filling in missing frames)? Works too. Classifier guidance acts as a bridge between the motion priors learned from human data and the new task at hand. Critically, this all transfers to real hardware without retraining: the real robot runs the same diffusion model and classifier guidance that worked in simulation.
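The guidance mechanism can be sketched on a toy problem. The code below is not the authors' sampler: a hand-written `toy_denoise_mean` stands in for the learned diffusion model, the "sample" is a single 2D point rather than a motion trajectory, and the step sizes and noise schedule are assumptions. It only shows the core move of cost-based guidance, which is shifting each reverse step's mean along the negative gradient of a task cost.

```python
import random


def numerical_grad(cost, x, eps=1e-5):
    """Central finite-difference gradient of a scalar cost at point x."""
    grad = []
    for i in range(len(x)):
        upper = list(x)
        lower = list(x)
        upper[i] += eps
        lower[i] -= eps
        grad.append((cost(upper) - cost(lower)) / (2 * eps))
    return grad


def toy_denoise_mean(x, prior_mode, pull=0.2):
    """Stand-in for the learned model: move the sample toward a prior mode."""
    return [xi + pull * (mi - xi) for xi, mi in zip(x, prior_mode)]


def guided_sample(cost, prior_mode, guidance_scale, steps=50, seed=0):
    """Reverse-diffusion-style loop with the mean shifted by -scale * grad(cost)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in prior_mode]  # start from pure noise
    for t in range(steps, 0, -1):
        mean = toy_denoise_mean(x, prior_mode)
        grad = numerical_grad(cost, x)
        mean = [m - guidance_scale * g for m, g in zip(mean, grad)]
        noise_std = 0.1 * (t - 1) / steps  # noise shrinks to zero at the last step
        x = [m + noise_std * rng.gauss(0.0, 1.0) for m in mean]
    return x


goal = [2.0, 0.0]
prior_mode = [0.0, 0.0]


def goal_cost(x):
    """Squared distance to the goal; plays the role of the task cost."""
    return sum((xi - gi) ** 2 for xi, gi in zip(x, goal))


unguided = guided_sample(goal_cost, prior_mode, guidance_scale=0.0)
guided = guided_sample(goal_cost, prior_mode, guidance_scale=0.1)
```

With guidance off, the sample collapses onto the prior mode. With guidance on, it settles between the prior mode and the goal, which is the trade-off between task cost and learned prior that the guidance scale controls.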

KEY RESULTS

Motion Tracking Success Across Diverse Skills

14 distinct motion sequences (cartwheels, spin-kicks, flip-kicks, sprinting) tracked with state-of-the-art quality using 1 unified MDP setup

vs. prior work requiring motion-specific tuning for each behavior

This is the paper's foundational win. Previous systems would need 14 different reward functions, 14 different hyperparameter searches, and 14 different training runs. BeyondMimic does it with one. That's not just convenient; it suggests the formulation captures something robust and generalizable about humanoid control. That the same setup handles both acrobatic maneuvers (cartwheels) and locomotion (sprinting) is striking.

Zero-Shot Transfer to Unseen Tasks on Real Hardware

Waypoint navigation, joystick teleoperation, and obstacle avoidance all work without retraining

vs. prior humanoid control requiring task-specific fine-tuning or sim-to-real adaptation

This is the paper's biggest claim and the hardest to achieve. The diffusion model trained on motion primitives can be steered toward completely new objectives (obstacle geometry it never saw, arbitrary waypoints) using only test-time cost functions, and then deployed directly on real hardware. No retraining, no re-tuning. Closing the sim-to-real gap this way is rare in robotics and suggests the learned representations are robust and task-agnostic.

Composability of Motion Primitives

Dynamic task switching and motion composition (e.g., navigate, then kick, then return) using a single diffusion model

vs. prior systems with siloed, non-composable policies

Previous humanoid systems trained separate policies for each skill and struggled to compose them. BeyondMimic's diffusion formulation allows switching and blending within one model. This unlocks richer behaviors: not just 'do a cartwheel' but 'navigate around obstacles while maintaining motion quality.' It's a step toward the kind of hierarchical, flexible motor control humans naturally perform.

WHY DEVELOPERS SHOULD CARE

For a developer building humanoid robotics software, BeyondMimic suggests a different workflow. First, it shows that motion capture data is a scalable supervision signal if you have the right abstraction: you don't need hand-crafted rewards per motion, just a compact tracking MDP. If you're trying to build a humanoid that can do multiple behaviors, this framework says: collect mocap, track it robustly, distill into a diffusion model, and you get composability almost for free. Second, zero-shot generalization via classifier guidance is a practical tool: instead of separate policies for joystick control, waypoint following, and obstacle avoidance, you train once on motion primitives, then guide at test time. This should be easier to maintain, and plausibly cheaper in training data, than one policy per task. Third, the sim-to-real transfer without retraining is rare enough that it's worth studying. The authors likely succeeded because their learned representations (motion primitives) are robust: they capture human motion structure rather than task-specific artifacts. If you're deploying to real hardware, the lesson is to invest in learning good latent representations early rather than patching sim-to-real gaps later. The takeaway: use human demonstrations as a prior, distill to a diffusion model, and guide at test time. This is a more scalable path to versatile humanoid control than reward engineering or single-skill training.

LIMITATIONS

BeyondMimic's main limitations are practical rather than conceptual. The motion tracking pipeline still relies on external mocap or state estimation to function well: the real-robot demos used mocap for waypoint and obstacle locations, which isn't available in many real-world scenarios. The diffusion model, while versatile, is still bounded by its motion data: it can't invent entirely new classes of movement not present in the data, and performance likely degrades as you request behaviors farther from the learned manifold. The paper doesn't deeply analyze failure modes: when does classifier guidance fail? How sensitive is performance to cost function design? The approach also assumes access to high-quality human mocap data for the target behaviors, which limits its applicability to domains where good mocap isn't available (underwater robots, aerial drones, etc.). Finally, the computational cost of diffusion sampling at 10+ Hz isn't discussed, which matters for real-time control.
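To see why sampling cost matters at these rates, a back-of-the-envelope budget helps. The sketch below assumes one full sampling pass per control cycle; the step count of 20 and the 50 Hz case are hypothetical numbers for illustration, not figures from the paper.

```python
def per_step_budget_ms(control_rate_hz, denoising_steps):
    """Wall-clock time available per denoising step, in milliseconds."""
    if control_rate_hz <= 0 or denoising_steps <= 0:
        raise ValueError("rate and step count must be positive")
    period_ms = 1000.0 / control_rate_hz  # time between control updates
    return period_ms / denoising_steps


# Assumed: 20 denoising steps per control cycle.
budget_10hz = per_step_budget_ms(control_rate_hz=10, denoising_steps=20)
budget_50hz = per_step_budget_ms(control_rate_hz=50, denoising_steps=20)
```

Under these assumptions each network evaluation, including any guidance gradient, has about 5 ms at 10 Hz and about 1 ms at 50 Hz, which is why the number of denoising steps and the control rate have to be chosen together.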

WHAT COMES NEXT

The natural next steps are (1) removing the mocap dependency by developing better proprioceptive and visual state estimation so the robot can operate fully autonomously in real environments, (2) extending the framework to full-body manipulation by including arm and hand tracking in the mocap imitation pipeline, (3) scaling to longer-horizon reasoning (the current work is good for 1-5 second behaviors; what about multi-minute tasks?), and (4) exploring online adaptation: can the diffusion model update its motion priors from real-world experience without retraining from scratch? There's also an interesting direction in hierarchical diffusion: use one diffusion model to select high-level motion primitives and another to refine them for specific tasks. Finally, if sim-to-real transfer worked this well, there's probably a paper in understanding *why*: what properties of the learned representations made them transfer? That understanding could unlock even more robust frameworks for other morphologies.
