DEPTH-ESTIMATION | CURRENT | 2026-02-17

Perceptive Humanoid Parkour: Chaining Dynamic Human Skills via Motion Matching

Zhen Wu, Xiaoyu Huang, Lujie Yang, Yuanhang Zhang, Koushil Sreenath, Xi Chen, Pieter Abbeel, Rocky Duan, Angjoo Kanazawa, Carmelo Sferrazza, Guanya Shi, C. Karen Liu

ARCHITECTURE

motion matching, reinforcement learning, behavior cloning with DAgger

ROBOT

Unitree G1 humanoid robot

KEY METRIC

96%

TASK

locomotion, parkour, obstacle traversal

Imagine a humanoid that doesn't just walk: it parkours. The Unitree G1 running PHP climbs 1.25-meter walls (96% of its own height), vaults over obstacles, and chains together multiple acrobatic skills in real time using only an onboard depth camera. This matters because earlier humanoid systems handled only much lower obstacles and rarely chained skills together. PHP gets there by combining three key ideas: (1) capturing human parkour motion through motion matching, which treats movement as a nearest-neighbor search in motion space, (2) using reinforcement learning to make the robot actually execute these human-inspired trajectories, and (3) adding depth perception so the robot autonomously decides which skill to use based on what it sees. The result feels fundamentally different from prior work: this robot moves with the fluidity and adaptability of a human, not the rigid, pre-planned motions of a conventional controller.

ARCHITECTURE

THE PROBLEM

Before PHP, humanoid research had achieved stable walking on varied terrain, but parkour (dynamic, adaptive, human-like movement) remained out of reach. Prior work fell into two camps: (1) end-to-end RL agents trained in simulation that transfer poorly to real robots and struggle to compose multiple skills, and (2) hand-crafted motion controllers that work for specific tasks but lack expressiveness and don't adapt on the fly. The core limitation: robots lacked both the motion expressiveness of humans and the perceptual awareness to decide in real time which skill to execute. A robot might nail climbing one obstacle, but couldn't decide from depth input whether to climb the next one or step over it. Existing motion-capture retargeting also ignored the long-horizon composition problem: you could reproduce one motion, but chaining motions smoothly while preserving human fluidity was unsolved.

HOW IT WORKS

Motion Matching: Compose Atomic Human Skills into Fluid Trajectories

The team started by capturing human parkour motion from video and mocap data, then retargeted it to the robot's body proportions. Rather than learning motion generation from scratch, they formulated motion composition as a nearest-neighbor search: given the robot's current state, find the closest matching human motion in a feature space (capturing pose and context, among other features). This preserves the elegance of human movement, because the output is always drawn from recorded motion rather than from interpolation or blending that would destroy its natural rhythm. They built atomic skills (climbing, vaulting, stepping, rolling) and stitched them together seamlessly. The key insight: humans don't plan parkour as a sequence of joint angles; they transition fluidly between skills. Motion matching captures that fluidity by always finding the next frame that best matches the robot's current state, creating long-horizon trajectories that feel natural, not robotic.
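As a rough illustration of the nearest-neighbor idea, here is a minimal Python sketch. The feature layout (hip height, forward speed, distance to the next obstacle), the weights, and the toy database are all assumptions made for this example; they are not the paper's actual features or data.

```python
import math
from dataclasses import dataclass


@dataclass
class Frame:
    clip: str        # atomic skill clip the frame belongs to
    index: int       # frame index inside that clip
    features: tuple  # hypothetical: (hip height m, forward speed m/s, obstacle distance m)


def distance(a, b, weights):
    """Weighted Euclidean distance between two feature vectors."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def match(database, query, weights):
    """Return the database frame whose features are closest to the query."""
    return min(database, key=lambda frame: distance(frame.features, query, weights))


# Toy database of retargeted human frames (values invented for illustration).
DATABASE = [
    Frame("walk", 0, (0.95, 1.0, 3.0)),
    Frame("walk", 1, (0.95, 1.0, 2.5)),
    Frame("vault", 0, (0.90, 2.0, 1.0)),
    Frame("vault", 1, (1.10, 2.2, 0.3)),
    Frame("climb", 0, (0.80, 0.5, 0.4)),
]
WEIGHTS = (1.0, 1.0, 1.0)  # per-feature importance, assumed equal here

# Query built from the robot's current state: fast approach, obstacle about 0.9 m ahead.
best = match(DATABASE, (0.92, 1.9, 0.9), WEIGHTS)
print(best.clip, best.index)  # prints "vault 0"
```

In a full system the matched frame would seed playback of that clip, and the search would be repeated every few frames so the reference can switch skills as the situation changes.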

RL Expert Policies: Train Robots to Actually Track Human Motions

Motion matching gives you the target motion, but the robot must execute it with real physics, imperfect actuators, and ground reaction forces. The team trained a separate RL expert policy for each atomic skill, each one learning to track the kinematic reference from motion matching while staying robust to perturbations. These experts were powerful but skill-specific: each one mastered climbing, or vaulting, or rolling. Notably, they do not command raw torques, which makes the learning problem tractable. Each policy effectively learns 'how hard do I push my legs to follow this climbing motion despite terrain variation?' Training a separate expert per skill is computationally expensive, but critical for real-world transfer.
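One common way to express "track the kinematic reference" in RL is an exponential tracking reward. The sketch below assumes that form. The function name, the `sigma` value, and the use of joint positions alone are illustrative choices; the paper's actual reward terms are not described here.

```python
import math


def tracking_reward(joint_pos, ref_joint_pos, sigma=0.25):
    """Reward in (0, 1]: 1.0 for perfect tracking, decaying with squared joint error.

    sigma (radians) sets how quickly the reward falls off; 0.25 is an assumed value.
    """
    sq_err = sum((q - q_ref) ** 2 for q, q_ref in zip(joint_pos, ref_joint_pos))
    return math.exp(-sq_err / sigma ** 2)


# Hypothetical reference pose for three joints (radians).
reference = [0.1, -0.4, 0.8]

r_perfect = tracking_reward(reference, reference)
r_near = tracking_reward([0.2, -0.4, 0.8], reference)  # 0.1 rad off on one joint
r_far = tracking_reward([0.5, -0.4, 0.8], reference)   # 0.4 rad off on one joint
print(r_perfect, r_near, r_far)
```

A reward shaped like this gives the expert a smooth signal: small deviations cost little, large ones are penalised heavily, and the policy is free to find whatever actuation keeps it close to the reference under perturbations.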

Policy Distillation with DAgger: Collapse Multiple Experts into One Depth-Based Policy

Here's the practical problem: deploying 10 separate expert policies on a real robot is cumbersome and slow. The team distilled all experts into a single student policy using DAgger (Dataset Aggregation), a technique that iteratively rolls out the student, has the experts label the states it visits with the actions they would have taken, and trains the student to mimic those labels. Crucially, the student takes only onboard depth images as input, with no ground-truth state. During training, DAgger gathers data where the expert knows the true state and picks the best action; the student sees only depth and learns to make the same decision. This closed-loop data collection is key: the student's early mistakes take it into states where the expert shows it how to recover. The result: one lightweight policy that runs in real time on onboard compute, selecting and executing any of the parkour skills based on what the depth camera sees.
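The DAgger loop itself is short. Below is a toy Python sketch in a one-dimensional world, under loud assumptions: the expert, the quantised "sensor", and the linear student are all invented for illustration and stand in for the paper's RL experts, depth camera, and neural network.

```python
def expert_action(state):
    """Privileged expert: sees the true state and steers it toward zero."""
    return -0.5 * state


def observe(state):
    """Limited student sensor: a coarsely quantised view of the true state."""
    return round(state, 1)


class LinearStudent:
    """Student policy action = w * observation, fitted by least squares."""

    def __init__(self):
        self.w = 0.0

    def fit(self, data):
        num = sum(obs * act for obs, act in data)
        den = sum(obs * obs for obs, _ in data)
        self.w = num / den if den > 0 else 0.0

    def act(self, state):
        return self.w * observe(state)  # the student never uses the true state


def rollout(policy, start, steps):
    """Run a policy in the toy world and return the states it visited."""
    state, visited = start, []
    for _ in range(steps):
        visited.append(state)
        state = state + policy(state)
    return visited


def dagger(iterations=5, start=1.0, steps=8):
    student, dataset = LinearStudent(), []
    for i in range(iterations):
        # First iteration follows the expert; later ones follow the student,
        # so the data covers the states the student actually reaches.
        policy = expert_action if i == 0 else student.act
        states = rollout(policy, start, steps)
        # The expert relabels every visited state with its own action.
        dataset += [(observe(s), expert_action(s)) for s in states]
        student.fit(dataset)  # retrain on the aggregated dataset
    return student, dataset


student, dataset = dagger()
final_state = rollout(student.act, 1.0, 8)[-1]
print(len(dataset), final_state)
```

The structural point is the split between what the labeller sees (`state`) and what the student sees (`observe(state)`), which mirrors the privileged-expert, depth-only-student setup described above.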

Video clips: teaser, cat dash

Perception-Driven Decision-Making: Autonomous Skill Selection

The final piece is autonomous, context-aware behavior. The student policy doesn't just execute one fixed skill; it continuously perceives the obstacle landscape via the depth camera and decides whether to step over, climb, vault, or roll based on obstacle geometry and height. The operator provides only a discrete 2D command (go forward, turn left/right, toggle speed), and the policy handles the high-level decision-making. This is perception in the robotics sense: the robot uses sensor data to inform behavior selection in real time. Real-world adaptation is critical here: if an obstacle is displaced mid-run (the paper tests this), the policy revises its decision and adjusts the skill chain on the fly. This closes the loop between what the camera sees and what the motors do.

MORE DEMONSTRATIONS

Video clips: roll, obstacle displacement, step climb 3, multi good, continuous step

KEY RESULTS

Maximum Obstacle Climbing Height: 1.25 meters

vs. 96% of the G1's 1.3m height; prior humanoid systems rarely exceeded 0.3-0.5m

This is the flashiest result: the robot climbs an obstacle almost as tall as itself. For context, most humanoid robots from prior work could step over 0.2-0.3m obstacles; here, we're seeing roughly four to six times that. Climbing 1.25m requires explosive leg power, precise balance at the peak, and coordinated descent, all executed fluidly.
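The ratios behind this result can be checked directly from the heights quoted above. This is only arithmetic on the numbers in this summary; the variable names are illustrative.

```python
climb_height_m = 1.25
robot_height_m = 1.3               # G1 height as quoted in this summary
prior_step_range_m = (0.2, 0.3)    # typical prior step-over heights quoted above

height_ratio = climb_height_m / robot_height_m
gains = [climb_height_m / h for h in prior_step_range_m]

print(f"{height_ratio:.0%}")            # prints "96%"
print([round(g, 2) for g in gains])     # prints "[6.25, 4.17]"
```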

Long-Horizon Multi-Obstacle Traversal with Real-Time Adaptation: Successful navigation of obstacle courses with closed-loop obstacle displacement

vs. Prior motion-matching or RL work typically handled single-skill execution or pre-planned sequences, not adaptive multi-skill chains

The paper demonstrates the robot running a course with multiple obstacles, autonomously selecting skills and adapting when obstacles are moved in real time. This is harder than climbing one wall: the robot must compose skills, handle transitions, and recover from errors. That it adapts in real time, rather than following a pre-planned sequence, is evidence that the system generalizes beyond its training data.

Skill Diversity: 10+ distinct parkour skills demonstrated (climbing, vaulting, rolling, crawling, stepping, sitting)

vs. Prior methods typically specialized in 1-2 skills per approach

The framework isn't a one-trick solution. It demonstrates cat vaults, speed vaults, platform climbs, rolling down from heights, crawling under obstacles, and more. This variety comes from the motion-matching + RL + DAgger pipeline, which scales to multiple skills without multiplying complexity for the student policy.

Perception Latency and Computational Load: Real-time execution on onboard compute (depth-based, single student policy)

vs. No multi-expert switching or expensive state estimation; lighter than running separate RL policies

The distillation to a single depth-based policy is pragmatic. The robot doesn't need ground-truth state or external tracking, just an onboard depth camera and one neural network. This is deployable on real hardware without a lab full of cameras.

PERFORMANCE COMPARISON

WHY DEVELOPERS SHOULD CARE

For software developers building robotics systems, PHP demonstrates three critical lessons. First, motion matching is an underrated tool for motion generation. Instead of learning everything from raw pixels with RL (which is expensive to train and often fails in the real world), you can leverage human motion as a prior. Treat motion generation as a search problem: find the best-matching human motion, then learn to execute it. This dramatically cuts training time and improves motion quality. Second, skill composition through modular experts that distill into a single student is a practical architecture. You don't need to train one monolithic policy; break the problem into pieces (one expert per skill), then compress that knowledge into a lightweight policy that runs on real robots. DAgger is the glue: it lets you transfer expert knowledge to a policy that runs on different, more limited sensors. Third, perception-driven behavior is non-negotiable for real-world robotics. The robot isn't executing a fixed plan; it's perceiving obstacles in real time and adapting its skill selection. This is what enables the closed-loop obstacle displacement demos: the robot isn't brittle to perturbations. If you're building robotics software, think about how to combine (1) pre-trained motion priors, (2) learnable experts, and (3) lightweight policies that make real-time decisions. PHP shows this scales to complex, dynamic tasks like parkour.

LIMITATIONS

PHP's limitations are real and worth acknowledging. First, motion matching requires high-quality human motion data: the system only captures skills present in the motion dataset. If humans don't parkour in a certain way, the robot won't either. Second, the distillation pipeline (expert → DAgger → student policy) is complex and requires careful data collection; it's not as simple as end-to-end training. Third, the system relies on onboard depth sensing, which has limited range and can struggle with reflective surfaces or fast-moving obstacles. Fourth, the discrete command interface is limiting: the operator must still actively command the robot, so it's not fully autonomous in deciding when to parkour. Fifth, evaluation is limited to one robot (Unitree G1) and relatively controlled obstacle courses; generalization to wildly different morphologies or unstructured outdoor terrain is unproven. Finally, the paper doesn't deeply analyze failure modes: when does the robot fail to climb or vault? What are the geometric or kinematic boundaries of the approach?

WHAT COMES NEXT

The obvious next frontier is full autonomy: instead of a human sending commands, the robot plans where it wants to go, perceives the obstacle course, and navigates on its own. This requires adding high-level planning (for example, graph search over obstacle configurations) on top of the perception-driven skill selection. A second direction is transfer: can you train motion matching and policies in simulation (where data is effectively unlimited) and transfer to new hardware without retraining? The distillation pipeline hints at this, but it's not fully demonstrated. Third is generalization across humanoid morphologies: does PHP work on Boston Dynamics Atlas, Tesla Optimus, or other humanoids with different proportions and actuators? If yes, it becomes a general framework; if no, each robot needs its own tuning. Fourth is handling longer obstacle courses with hundreds of obstacles and more unpredictable geometry. Fifth is integrating higher-level reasoning beyond skill selection, such as energy-efficient route selection and semantic understanding of the environment. The dream: a humanoid that explores unknown terrain, perceives obstacles, plans a parkour route, and executes it autonomously, all in real time. PHP is a big step toward that.

