CONTROL | CURRENT | 2023-10-25

TD-MPC2: Scalable, Robust World Models for Continuous Control

Nicklas Hansen, Hao Su, Xiaolong Wang

ARCHITECTURE

model-based RL with implicit world model and latent trajectory optimization

ROBOT

multiple embodiments

KEY METRIC

317M parameters, 80 tasks

TASK

continuous control, trajectory optimization

TD-MPC2 is a model-based reinforcement learning system that learns to control robots by building an internal model of how the world works, then planning actions in that model's compressed latent space. Why it matters: a single 317-million-parameter agent learned to perform 80 different tasks, from a dog running and trotting to manipulation tasks like assembling objects and pulling coffee handles, all without retraining. Across 104 total tasks spanning four domains (DMControl, Meta-World, ManiSkill2, and MyoSuite), TD-MPC2 consistently outperformed both model-free methods like SAC and prior model-based approaches like DreamerV3, using the same hyperparameters everywhere. For a developer, this means you can build one learning system that generalizes across radically different embodiments and goals instead of engineering a separate solution for each task.

THE PROBLEM

Before TD-MPC2, continuous control in robotics faced a scalability problem. Prior model-based methods like the original TD-MPC required careful hyperparameter tuning for each new task or domain: what worked for a simulated dog didn't work for manipulation tasks. Model-free methods like SAC were more robust but learned inefficiently, requiring millions of environment interactions because they didn't build a model to reason about consequences. DreamerV3, a strong prior model-based method, was more general but still struggled with the diversity problem: training a single agent on 80+ tasks across different embodiments (quadrupeds, hands, humanoids) and action spaces (continuous commands, muscle activations) seemed out of reach. The core limitation was that existing world models either required task-specific tuning or collapsed under the complexity of learning many incompatible environments simultaneously.

HOW IT WORKS

Learning an Implicit World Model in Latent Space

Instead of learning to reconstruct observations such as images (which wastes model capacity), TD-MPC2 learns a "decoder-free" world model: it compresses observations into a compact latent representation and learns to predict how that latent state evolves under actions, without ever reconstructing pixels. This is crucial for scaling, because by avoiding pixel prediction the model conserves parameters for understanding dynamics. The implicit model essentially asks: "given what I observe now and the action I take, what will the next latent state be?" The representation is trained with temporal difference (TD) learning alongside latent and reward prediction, the same principle that powers value-function learning in reinforcement learning, so the whole system is optimized toward the quantities that matter for control.
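To make "decoder-free" concrete, here is a minimal sketch of the pieces involved: an encoder, a latent dynamics function, and a reward head, with no decoder anywhere. Everything in it is an illustrative assumption rather than the authors' code: the class name `ImplicitWorldModel`, the helper `consistency_loss`, the linear maps (real agents use neural networks), and the dimensions are all hypothetical.

```python
import random


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; a stand-in for learned network weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class ImplicitWorldModel:
    """Toy decoder-free model: it encodes, predicts, and scores, but never reconstructs."""

    def __init__(self, obs_dim, act_dim, latent_dim, seed=0):
        rng = random.Random(seed)
        self.w_enc = rand_matrix(latent_dim, obs_dim, rng)               # observation -> latent
        self.w_dyn = rand_matrix(latent_dim, latent_dim + act_dim, rng)  # (latent, action) -> next latent
        self.w_rew = rand_matrix(1, latent_dim + act_dim, rng)           # (latent, action) -> reward

    def encode(self, obs):
        return matvec(self.w_enc, obs)

    def next_latent(self, z, action):
        return matvec(self.w_dyn, z + action)  # list concatenation of latent and action

    def reward(self, z, action):
        return matvec(self.w_rew, z + action)[0]


def consistency_loss(model, obs, action, next_obs):
    """Mean squared error between the predicted next latent and the encoded next observation.

    The target is compared in latent space, so no pixels are ever reconstructed.
    """
    z_pred = model.next_latent(model.encode(obs), action)
    z_target = model.encode(next_obs)  # would be treated as a fixed target during training
    return sum((p - t) ** 2 for p, t in zip(z_pred, z_target)) / len(z_pred)


model = ImplicitWorldModel(obs_dim=4, act_dim=2, latent_dim=3)
loss = consistency_loss(model, [0.1, 0.2, 0.3, 0.4], [0.5, -0.5], [0.2, 0.2, 0.4, 0.3])
print(f"latent consistency loss: {loss:.6f}")
```

The point of the sketch is the shape of the objective: the prediction error is measured between two latent vectors, so model capacity is never spent on rendering observations.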

Planning via Latent Trajectory Optimization

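Once the world model is trained, TD-MPC2 plans with it at every step rather than acting from a policy alone. Planning uses Model Predictive Path Integral (MPPI) control, a sampling-based, derivative-free optimizer that runs directly in the latent space: it samples candidate action sequences over a short horizon, rolls each one forward through the learned model, scores it by predicted reward plus a learned value estimate at the final step, and then refits the sampling distribution toward the highest-scoring candidates over a few iterations. A learned policy prior proposes some of the candidates, so the search does not start from scratch. The terminal value estimate is what lets a short horizon stand in for long-term return, which keeps planning much cheaper than searching over long trajectories. Only the first action of the plan is executed before the agent replans, creating a tight loop: observe → plan in latent space → execute → observe again. This local, online replanning is what makes the system robust to model errors.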
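The observe → plan → execute loop can be sketched as a receding-horizon planner. The sketch below uses an MPPI-style sampling update, the derivative-free optimizer reported for TD-MPC and TD-MPC2. It is a toy under stated assumptions, not the authors' implementation: the 1-D latent state, the functions `toy_dynamics`, `toy_reward`, and `toy_value`, and every numeric setting (horizon, sample count, temperature) are hypothetical stand-ins for learned networks and tuned values.

```python
import math
import random


def toy_dynamics(z, a):
    """Hypothetical stand-in for the learned latent dynamics (1-D latent)."""
    return z + 0.5 * a


def toy_reward(z, a):
    """Hypothetical reward: stay near z = 0 with small actions."""
    return -(z * z) - 0.01 * a * a


def toy_value(z):
    """Hypothetical terminal value estimate used to bootstrap past the horizon."""
    return -(z * z)


def rollout_return(z0, actions, gamma=0.99):
    """Discounted reward along an imagined latent rollout plus a terminal value."""
    z, total, discount = z0, 0.0, 1.0
    for a in actions:
        total += discount * toy_reward(z, a)
        z = toy_dynamics(z, a)
        discount *= gamma
    return total + discount * toy_value(z)


def mppi_plan(z0, horizon=3, samples=64, elites=8, iters=4, temperature=0.5, seed=0):
    """Iteratively refit a Gaussian over action sequences toward high-return samples."""
    rng = random.Random(seed)
    mean, std = [0.0] * horizon, [1.0] * horizon
    for _ in range(iters):
        # Sample candidate sequences, clamped to the action range [-1, 1].
        candidates = [
            [max(-1.0, min(1.0, rng.gauss(m, s))) for m, s in zip(mean, std)]
            for _ in range(samples)
        ]
        scored = sorted(
            ((rollout_return(z0, c), c) for c in candidates),
            key=lambda pair: pair[0],
            reverse=True,
        )[:elites]
        best = scored[0][0]
        # Exponentially weight the elites by return (softmax with a temperature).
        weights = [math.exp((ret - best) / temperature) for ret, _ in scored]
        wsum = sum(weights)
        mean = [
            sum(w * c[t] for w, (_, c) in zip(weights, scored)) / wsum
            for t in range(horizon)
        ]
        std = [
            max(0.05, math.sqrt(
                sum(w * (c[t] - mean[t]) ** 2 for w, (_, c) in zip(weights, scored)) / wsum
            ))
            for t in range(horizon)
        ]
    return mean


def control_loop(z0, steps=10):
    """Replan at every step and execute only the first planned action."""
    z = z0
    for step in range(steps):
        plan = mppi_plan(z, seed=step)
        z = toy_dynamics(z, plan[0])
    return z


print("final latent state:", control_loop(2.0))
```

Executing only the first action and replanning from the new state is what limits how far model errors can compound, at the cost of running the planner on every step.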

Scaling Through Architectural Improvements and Better Data

TD-MPC2 makes several technical improvements over the original TD-MPC: better latent normalization (keeping the learned representation stable), improved reward and value prediction, and more careful scaling. But the real scaling breakthrough comes from training on massive, diverse datasets. The authors collected 545M transitions from 240 separate single-task agents across 80 tasks, then trained one unified 317M-parameter agent on all of it. The key insight: collecting this data is cheap (done offline via parallel simulation), and one large multi-task model generalizes better than 80 separate specialists. This mirrors how large language models work: scale up data and model size, and new capabilities appear.
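A quick back-of-the-envelope calculation puts the dataset figures above in proportion. It uses only the numbers quoted in this summary; the per-agent and per-task values are plain averages and assume nothing about how transitions are actually distributed across tasks.

```python
total_transitions = 545_000_000   # transitions in the 80-task dataset
source_agents = 240               # single-task agents that generated the data
tasks = 80
model_parameters = 317_000_000    # largest multi-task agent

agents_per_task = source_agents // tasks
avg_transitions_per_agent = total_transitions / source_agents
avg_transitions_per_task = total_transitions / tasks
transitions_per_parameter = total_transitions / model_parameters

print(f"agents per task:           {agents_per_task}")
print(f"avg transitions per agent: {avg_transitions_per_agent:,.0f}")
print(f"avg transitions per task:  {avg_transitions_per_task:,.0f}")
print(f"transitions per parameter: {transitions_per_parameter:.2f}")
```

On average that is three source agents and roughly 6.8M transitions per task, and fewer than two transitions per parameter for the largest model.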

Demo videos: humanoid spin, humanoid run, humanoid stand

Single Hyperparameter Set Across All Domains

A hidden win in TD-MPC2 is robustness to hyperparameter choices. The method achieves strong performance on DMControl (simulated humanoids and quadrupeds), Meta-World (robot arm manipulation), ManiSkill2 (object manipulation), and MyoSuite (muscle-actuated control), all with one learning rate, batch size, and planning horizon. This is not a minor technical detail: in prior work, you'd need different hyperparameters for locomotion vs. muscle control, or for vision-based tasks vs. proprioceptive ones. The unified performance suggests TD-MPC2 found a sweet spot in the algorithm design that generalizes across this landscape.

Demo videos: assembly, dog run, turn faucet 1, cheetah run

MORE DEMONSTRATIONS

Demo videos: pick ycb 1, dog trot, coffee pull, obj hold, pick ycb 5, walker run, pick place wall, stick pull, cartpole swingup, pendulum swingup, pick ycb 6, pick ycb 0, walker run backwards, sweep into, cup spin, reacher three hard, peg unplug side, key turn, dog walk, fish swim, quadruped run, lever pull, hopper hop, stack cube, finger turn hard, pen twirl, pendulum spin, cup catch, door open, pick cube, cheetah run front, pick ycb 2, acrobot swingup, finger spin, turn faucet 0, stick push, bin picking, push wall, plate slide back side, hopper hop backwards, cheetah jump, pick out of hole, turn faucet 2, reacher hard, pick ycb 3, pick ycb 4, cheetah run backwards

KEY RESULTS

Performance across 104 online RL tasks: consistently outperforms SAC (model-free) and DreamerV3 (prior model-based), with a single hyperparameter set

vs. SAC and DreamerV3 baselines

This was among the broadest continuous control evaluations reported at the time. The fact that one set of hyperparameters works across 4 very different domains (DMControl, Meta-World, ManiSkill2, MyoSuite) is notable. It suggests the algorithm found a robustly general solution rather than one that exploits domain-specific structure.

Massively multi-task scaling, single 317M-parameter agent: successfully learns 80 tasks across multiple domains, embodiments, and action spaces

vs. prior multi-task agents which typically handle 10-20 tasks

This is the first demonstration that a single large agent can handle this level of diversity. The agent controls different bodies (quadrupeds, humanoids, hands), operates with observation and action spaces of different dimensionality, and solves qualitatively different problems (locomotion vs. manipulation). Success here supports the view that scaling model size is a promising direction for robotics AI.

Scaling efficiency, 1M to 317M parameters: agent capabilities consistently increase with model size on the 80-task dataset

vs. saturating performance in prior methods

The paper shows that performance doesn't plateau as you add parameters—larger models solve harder and more diverse tasks. This suggests the paradigm is fundamentally sound and hints at a power-law scaling relationship similar to language models. A developer should care: this means future improvements will come from bigger models and more data, not algorithm tweaks.

Open-source release: 324 model checkpoints (1M-317M parameters) + two datasets (545M and 345M transitions, 34GB and 20GB)

vs. most prior work which released only paper results

The authors released all data and model weights. This is rare in robotics and immediately enables community validation, fine-tuning to new tasks, and faster iteration by others. The 545M-transition dataset alone is a valuable asset for future research.

WHY DEVELOPERS SHOULD CARE

For a developer building robotics software, TD-MPC2 changes what's possible in two ways. First, it shows that model-based RL, learning to plan with a learned world model, can be practical and scalable. If you're building a robot that needs to adapt to new tasks, TD-MPC2 suggests you should invest in building good world models rather than collecting endless task-specific data. The planning-in-latent-space approach is clever: you get the sample efficiency of model-based methods (needing fewer interactions) without the computational cost of pixel reconstruction. Second, the scaling results suggest that bigger, unified agents beat task-specialized ones. Instead of 80 separate agents for 80 tasks, train one large agent on all the data at once. This is counterintuitive compared to the common practice of training a separate model per task (for example, separate classifiers), but it works here because the tasks share underlying physics. For a developer at a robotics company, this means: focus on collecting diverse, high-quality transition data, then let a large unified model learn from it. The code and models are open-source, so you can start experimenting immediately rather than reimplementing from scratch.

LIMITATIONS

TD-MPC2 does not solve several critical problems for real robots. First, all experiments are in simulation or with pre-recorded offline datasets: there's no demonstration of the 317M agent actually controlling a physical robot in real time. Sim-to-real transfer is hard, and latent dynamics learned from simulation may not capture real-world effects such as wear and unexpected contacts. Second, the method requires learning a separate world model, which adds complexity and compute compared to pure model-free methods. If you have unlimited real-world data and don't care about sample efficiency, SAC might be simpler. Third, the paper doesn't thoroughly explore failure modes: what happens when the world model is confidently wrong? Planning in latent space can amplify model errors. Finally, scaling to 317M parameters helps, but there's no clear evidence this approach scales to tasks requiring high-dimensional visual reasoning (like navigating cluttered scenes) or long-horizon reasoning (100+ steps into the future).

WHAT COMES NEXT

The natural next step is closing the sim-to-real gap: taking the 317M agent trained in simulation and successfully deploying it to physical robots. This likely requires either better randomization during training or fine-tuning with real-world data. Beyond that, the architecture suggests a path toward more general robotics AI: scale further (500M+ parameters), train on more diverse tasks (500+ instead of 80), and rely more heavily on vision as input rather than proprioceptive state. The paper hints at scaling laws; if they hold, a 1B-parameter agent trained on 1000+ tasks across real and simulated domains might exhibit surprising new capabilities. Another frontier is interpretability: what does the 317M agent's latent space actually represent? Understanding this could unlock faster adaptation to new tasks.
