CONTROL · CURRENT · 2023-10-25

TD-MPC2: Scalable, Robust World Models for Continuous Control

Nicklas Hansen, Hao Su, Xiaolong Wang

ARCHITECTURE
model-based RL with implicit world model and latent trajectory optimization
ROBOT
multiple embodiments
KEY METRIC
317M parameters, 80 tasks
TASK
continuous control, trajectory optimization

TD-MPC2 is a model-based reinforcement learning (RL) system that learns to control robots by building an internal model of how the world works, then planning actions in that model's compressed latent space. What makes this a big deal: a single 317-million-parameter agent learned to perform 80 different tasks, from a robot dog running and trotting to manipulation tasks like assembling objects and pulling coffee handles, without per-task retraining. Across 104 total tasks spanning four domains (DMControl, Meta-World, ManiSkill2, and MyoSuite, covering simulated physics, dexterous manipulation, and muscle-based control), TD-MPC2 consistently outperformed both model-free methods like SAC and prior model-based approaches like DreamerV3, using the exact same hyperparameters everywhere. For a developer, this means you can build one learning system that generalizes across radically different robot embodiments and goals instead of engineering a separate solution for each task.

ARCHITECTURE

THE PROBLEM

Before TD-MPC2, continuous control in robotics faced a scalability problem. Prior model-based RL methods like the original TD-MPC required careful hyperparameter tuning for each new task or domain: what worked for a simulated robot dog didn't work for manipulation tasks. Model-free methods like SAC were more robust but learned inefficiently, requiring millions of environment interactions because they didn't build a world model to reason about consequences. DreamerV3, the previous state of the art, improved multi-task learning but still struggled with the diversity problem: training a single agent on 80+ tasks across different embodiments (quadrupeds, hands, humanoids) and action spaces (continuous joint commands, muscle activations) seemed out of reach. The core limitation was that existing world models either required task-specific tuning or collapsed under the complexity of learning multiple incompatible environments simultaneously.

HOW IT WORKS

1. Learning an Implicit World Model in Latent Space

Instead of learning to reconstruct observations such as images (which wastes model capacity), TD-MPC2 learns a decoder-free world model: it compresses observations into a compact latent representation and learns to predict how that latent state evolves under actions, without ever reconstructing pixels. This is crucial for scaling: by avoiding pixel prediction, the model conserves parameters for understanding task dynamics. The implicit world model essentially asks: "given what I observe now and the action I take, what will the next latent state be?" The representation is trained with objectives that include temporal difference (TD) learning, the same principle that powers value-function learning in RL, so the model and the value estimates used for planning are learned together.
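To make the "decoder-free" idea concrete, here is a minimal sketch under strong simplifying assumptions: the encoder, latent dynamics, and reward head are plain linear maps, and the class and function names (`ImplicitWorldModel`, `world_model_loss`) are hypothetical, not taken from the released code. The point it illustrates is that the prediction target is the encoding of the next observation, so no term in the loss ever reconstructs the observation itself.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; stands in for learned weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class ImplicitWorldModel:
    """Toy decoder-free model: encoder, latent dynamics, reward head.

    All parts are linear maps (an assumption made for brevity).
    There is deliberately no decoder back to observation space.
    """

    def __init__(self, obs_dim, act_dim, latent_dim, seed=0):
        rng = random.Random(seed)
        self.W_enc = rand_matrix(latent_dim, obs_dim, rng)
        self.W_dyn = rand_matrix(latent_dim, latent_dim + act_dim, rng)
        self.w_rew = rand_matrix(1, latent_dim + act_dim, rng)

    def encode(self, obs):
        return matvec(self.W_enc, obs)

    def next_latent(self, z, action):
        return matvec(self.W_dyn, z + action)  # list concatenation

    def reward(self, z, action):
        return matvec(self.w_rew, z + action)[0]


def world_model_loss(model, obs, action, reward, next_obs):
    """Latent consistency error plus reward prediction error."""
    z = model.encode(obs)
    z_pred = model.next_latent(z, action)
    z_target = model.encode(next_obs)  # the target lives in latent space
    consistency = sum((p - t) ** 2 for p, t in zip(z_pred, z_target))
    reward_err = (model.reward(z, action) - reward) ** 2
    return consistency + reward_err


model = ImplicitWorldModel(obs_dim=4, act_dim=2, latent_dim=3)
loss = world_model_loss(
    model,
    obs=[0.1, 0.2, 0.3, 0.4],
    action=[0.5, -0.5],
    reward=1.0,
    next_obs=[0.2, 0.3, 0.4, 0.5],
)
```

A real training setup would minimize this loss with gradient descent and would also include a value-learning (TD) term; both are omitted here to keep the sketch short.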

2. Planning via Latent Trajectory Optimization

Once the world model is trained, TD-MPC2 plans with it at every step rather than simply replaying a fixed behavior. Planning is a sampling-based trajectory optimization carried out in latent space: the agent samples candidate action sequences over a short horizon, rolls each one forward through the learned world model, scores it by predicted reward plus a learned value estimate for what happens beyond the horizon, and iteratively refits its sampling distribution toward the best-scoring sequences. A learned policy proposes some of the candidates, which makes the search far more efficient than starting from purely random actions. The optimization runs on each step as the robot acts, creating a tight loop: observe → plan in latent space → execute → observe again. Because planning is local and repeated online, model errors have limited room to accumulate, which is what makes the system robust to them.
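The observe → plan → execute loop can be sketched as receding-horizon control. The example below is only an illustration on a toy 1-D system: `toy_model` stands in for a learned latent model, the planner is a simple sample-and-refit search chosen for brevity rather than the paper's optimizer, and every parameter value (horizon, sample count, iterations) is an arbitrary assumption.

```python
import random


def toy_model(state, action):
    """Stand-in for a learned latent model: returns (next_state, reward)."""
    next_state = state + action
    reward = -abs(next_state)  # reward is highest at the origin
    return next_state, reward


def plan(state, horizon=3, samples=64, iterations=4, elites=8, rng=None):
    """Sample action sequences, keep the best, and refit the sampling mean."""
    rng = rng or random.Random(0)
    mean = [0.0] * horizon
    std = 1.0
    for _ in range(iterations):
        scored = []
        for _ in range(samples):
            # Actions are clipped to [-1, 1], a made-up actuator limit.
            seq = [max(-1.0, min(1.0, rng.gauss(m, std))) for m in mean]
            s, total = state, 0.0
            for a in seq:
                s, r = toy_model(s, a)  # imagined rollout, no real steps
                total += r
            scored.append((total, seq))
        scored.sort(key=lambda item: item[0], reverse=True)
        best = [seq for _, seq in scored[:elites]]
        mean = [sum(seq[t] for seq in best) / elites for t in range(horizon)]
        std = max(0.1, std * 0.5)  # narrow the search each iteration
    return mean[0]  # execute only the first planned action


def run_episode(start=3.0, steps=10):
    """Observe, plan, execute one action, then replan from the new state."""
    rng = random.Random(0)
    state = start
    for _ in range(steps):
        action = plan(state, rng=rng)
        state, _ = toy_model(state, action)
    return state


final_state = run_episode()
```

Only the first action of each plan is executed before replanning from the newly observed state. That replanning step is what lets the controller correct for mistakes in its own predictions.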

3. Scaling Through Architectural Improvements and Better Data

TD-MPC2 makes several technical improvements over the original: better latent regularization (keeping the learned representation stable), improved action normalization, and more careful reward scaling. But the real scaling breakthrough comes from training on massive, diverse datasets. The authors collected 545M environment transitions from 240 separate single-task agents across 80 tasks, then trained one unified 317M-parameter agent on all of it. The key insight: collecting this data is cheap (done offline via parallel RL training), but training one large multi-task model generalizes better than 80 separate specialists. This mirrors how large language models work: scale up data and model size, and new capabilities appear.
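As a rough picture of what "train one agent on all of it" involves on the data side, the sketch below pools per-task transition buffers into one dataset and tags each transition with an integer task id so a single model can tell tasks apart. The `Transition` layout, the function names, and the use of an integer id are assumptions for illustration; the released datasets and the paper's task-conditioning mechanism may be organized differently.

```python
import random
from collections import namedtuple

Transition = namedtuple("Transition", "task_id obs action reward next_obs")


def pool_buffers(buffers):
    """Merge {task_name: [(obs, action, reward, next_obs), ...]} into one list,
    tagging every transition with an integer task id."""
    task_ids = {name: i for i, name in enumerate(sorted(buffers))}
    pooled = [
        Transition(task_ids[name], *step)
        for name in sorted(buffers)
        for step in buffers[name]
    ]
    return pooled, task_ids


def sample_batch(pooled, batch_size, rng):
    """Draw a uniform batch; tasks appear in proportion to their data."""
    return [rng.choice(pooled) for _ in range(batch_size)]


# Tiny made-up buffers; real buffers would hold millions of transitions.
buffers = {
    "dog-run": [([0.0], [0.1], 1.0, [0.1]), ([0.1], [0.2], 0.5, [0.3])],
    "coffee-pull": [([1.0], [-0.1], 0.0, [0.9])],
}
pooled, task_ids = pool_buffers(buffers)
batch = sample_batch(pooled, batch_size=4, rng=random.Random(0))
```

Uniform sampling over the pooled list means tasks with more data are seen more often. Whether to rebalance across tasks is a design choice this sketch does not address.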

Demo videos (TD-MPC2): humanoid spin, humanoid run, humanoid stand.

4. Single Hyperparameter Set Across All Domains

A hidden win in TD-MPC2 is robustness to hyperparameter choices. The method achieves strong performance on DMControl (simulated control, including humanoids and quadrupeds), Meta-World (robot-arm manipulation), ManiSkill2 (object manipulation and assembly), and MyoSuite (muscle-actuated control), all with one learning rate, batch size, and planning horizon. This is not a minor technical detail: in prior work, you'd need different hyperparameters for simulated rigid-body control vs. muscle control, or for vision-based tasks vs. proprioceptive ones. The unified performance suggests TD-MPC2 found a sweet spot in the algorithm design that generalizes across this landscape.
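In practice, "one hyperparameter set" means the only thing that changes between domains is the environment interface, not the learning settings. The sketch below shows that separation with a frozen dataclass. All numeric values and dimensions are placeholders rather than the paper's settings, and the names (`Hyperparams`, `EnvSpec`, `build_run_config`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperparams:
    """Learning settings shared by every run (placeholder values)."""
    learning_rate: float = 3e-4
    batch_size: int = 256
    horizon: int = 3


@dataclass(frozen=True)
class EnvSpec:
    """Per-domain interface: the only thing allowed to differ."""
    obs_dim: int
    act_dim: int


SHARED = Hyperparams()

# Made-up dimensions, for illustration only.
ENV_SPECS = {
    "dmcontrol": EnvSpec(obs_dim=24, act_dim=6),
    "metaworld": EnvSpec(obs_dim=39, act_dim=4),
    "maniskill2": EnvSpec(obs_dim=51, act_dim=7),
    "myosuite": EnvSpec(obs_dim=115, act_dim=39),
}


def build_run_config(domain):
    """Combine the shared hyperparameters with one domain's interface."""
    spec = ENV_SPECS[domain]
    return {
        "domain": domain,
        "obs_dim": spec.obs_dim,
        "act_dim": spec.act_dim,
        "hyperparams": SHARED,
    }


configs = [build_run_config(name) for name in ENV_SPECS]
```

Freezing the dataclass makes accidental per-domain overrides raise an error, which is one lightweight way to keep a "no per-task tuning" claim honest in your own experiments.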

Demo videos (TD-MPC2): assembly, dog run, turn faucet 1, cheetah run.

MORE DEMONSTRATIONS

Demo videos (TD-MPC2): pick ycb 1, dog trot, coffee pull, obj hold, pick ycb 5, walker run, pick place wall, stick pull, cartpole swingup, pendulum swingup, pick ycb 6, pick ycb 0, walker run backwards, sweep into, cup spin, reacher three hard, peg unplug side, key turn, dog walk, fish swim, quadruped run, lever pull, hopper hop, stack cube, finger turn hard, pen twirl, pendulum spin, cup catch, door open, pick cube, cheetah run front, pick ycb 2, acrobot swingup, finger spin, turn faucet 0, stick push, bin picking, push wall, plate slide back side, hopper hop backwards, cheetah jump, pick out of hole, turn faucet 2, reacher hard, pick ycb 3, pick ycb 4, cheetah run backwards.

FIGURES

KEY RESULTS

Performance across 104 online RL tasks: consistently outperforms SAC (model-free) and DreamerV3 (prior model-based), with a single hyperparameter set

vs. DreamerV3 and SAC which require task-specific tuning

This is the broadest continuous control benchmark reported in RL at the time. The fact that one set of hyperparameters works across 4 completely different task domains (locomotion, dexterous manipulation, object assembly, muscle control) is unprecedented. It means the algorithm found a robustly general solution rather than one that exploits domain-specific structure.

Massively multi-task scaling (single 317M-parameter agent): successfully learns 80 tasks across multiple domains, embodiments, and action spaces

vs. prior multi-task agents which typically handle 10-20 tasks

This is the first demonstration that a single large agent can handle this level of task diversity. The agent controls different robot bodies (quadrupeds, humanoids, hands), operates in different spaces (proprioceptive vs. high-dimensional), and solves qualitatively different problems (locomotion vs. manipulation). Success here supports the view that scaling model size is a promising direction for robotics AI.

Scaling efficiency (1M to 317M parameters): agent capabilities consistently increase with model size on the 80-task dataset

vs. saturating performance in prior methods

The paper shows that performance doesn't plateau as you add parameters: larger models solve harder and more diverse tasks. This suggests the training paradigm is fundamentally sound and hints at a power-law scaling relationship similar to language models. A developer should care: it suggests that future improvements are likely to come from bigger models and more data, not only from algorithm tweaks.
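If you want to check for a power-law trend in your own scaling runs, the usual approach is a straight-line fit in log-log space: a relationship of the form score = c · size^k becomes log(score) = log(c) + k · log(size). The sketch below does that fit with the standard library only. The data is synthetic, generated from a known power law purely to show that the fit recovers it; the intermediate model sizes and all scores are invented and are not results from the paper.

```python
import math


def fit_power_law(sizes, scores):
    """Least-squares fit of score = c * size**k in log-log space."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in scores]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    k = cov / var                        # slope = exponent
    c = math.exp(mean_y - k * mean_x)    # intercept = log of the prefactor
    return c, k


# Synthetic data from score = 2 * size**0.25; NOT results from the paper.
sizes = [1e6, 1e7, 1e8, 3.17e8]
scores = [2 * s ** 0.25 for s in sizes]
c, k = fit_power_law(sizes, scores)
```

With real measurements the points will not sit exactly on a line, so inspect the residuals before reading much into the fitted exponent, especially with only a handful of model sizes.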

Open-source release: 324 model checkpoints (1M-317M parameters) + two datasets (545M and 345M transitions, 34GB and 20GB)

vs. most prior work which released only paper results

The authors released all training data and model weights. This is rare in robotics RL and immediately enables community validation, transfer learning to new tasks, and faster iteration by others. The 545M-transition dataset alone is a valuable asset for future research.

PERFORMANCE COMPARISON

WHY DEVELOPERS SHOULD CARE

For a developer building robotics software, TD-MPC2 changes what's possible in two ways. First, it shows that model-based RL (learning to plan with a learned world model) can be practical and scalable. If you're building a robot that needs to adapt to new tasks, TD-MPC2 suggests you should invest in building good world models rather than collecting endless task-specific data. The planning-in-latent-space approach is clever: you get the sample efficiency of model-based methods (needing fewer environment interactions) without the computational cost of pixel reconstruction. Second, the scaling results suggest that bigger, unified agents beat task-specialized ones. Instead of training 80 separate RL agents for 80 tasks, train one large agent on all the data at once. This is counterintuitive compared to supervised learning (where you'd need separate classifiers), but it works here because the tasks share underlying physics. For a developer at a robotics company, this means: focus on collecting diverse, high-quality transition data, then let a large unified model learn from it. The code and models are open-source, so you can start experimenting immediately rather than reimplementing from scratch.

LIMITATIONS

TD-MPC2 does not solve several critical problems for real robots. First, all experiments are in simulation or with pre-recorded offline datasets; there is no demonstration of the 317M agent controlling a physical robot in real time. Sim-to-real transfer is hard, and latent dynamics learned from simulation may not capture real-world friction, wear, and unexpected contacts. Second, the method requires learning a separate world model, which adds training complexity and latency compared to pure model-free methods. If you have unlimited real-world data and don't care about sample efficiency, SAC might be simpler. Third, the paper doesn't thoroughly explore failure modes: what happens when the world model is confidently wrong? Planning in latent space can amplify model errors. Finally, scaling to 317M parameters helps, but there's no clear evidence that this approach scales to robot tasks requiring high-dimensional visual reasoning (like navigating cluttered scenes) or long-horizon reasoning (100+ steps into the future).

WHAT COMES NEXT

The natural next step is closing the sim-to-real gap: taking the 317M agent trained in simulation and successfully deploying it to physical robots. This requires either better domain randomization during training or fine-tuning with real-world data. Beyond that, the architecture suggests a path toward more general robotics AI: scale further (500M+ parameters), train on more diverse tasks (500+ instead of 80), and potentially add vision as input rather than relying on proprioceptive state. The paper hints at scaling laws; if they hold, a 1B-parameter agent trained on 1000+ tasks across real and simulated domains might exhibit surprising generalization. Another frontier is interpretability: what does the 317M agent's latent space actually represent? Understanding this could unlock better transfer learning and faster adaptation to new tasks.

RELATED PAPERS