Day 38

DreamerV3 — RL in a learned world model

This is a valid v1.0 placeholder page for the later curriculum arc. Full interactive lab treatment ships after Week 1 dogfooding.

LECTURE & READING

Glossary primer (12 min)

  • DreamerV3 — DeepMind 2023 (Hafner et al.). Universal reinforcement learning (RL) algorithm: same hyperparameters across 150+ tasks. Trains a world model from pixels, then runs actor-critic reinforcement learning inside it.
  • Recurrent State-Space Model (RSSM) — DreamerV3's world-model core. Has deterministic state h_t and stochastic state z_t. Predicts (h, z, reward, done) given action.
  • Imagination — Roll out the world model from a real state for K steps (typically 15) and train actor/critic on imagined data. Drastically improves sample efficiency.
  • Symlog — symlog(x) = sign(x) · log(1 + |x|). Used on rewards/returns for stability across magnitudes.
  • Twohot encoding — Discretize critic targets into a categorical distribution. Reduces gradient noise.
  • R²-Dreamer — Apr 2025 successor (Berkeley). "Real-Robot Dreamer." Explicit handling of action delays and observation lag for hardware deployment.
  • Sample efficiency — Crucial for DreamerV3's appeal: solves Atari with ~50× less data than PPO.
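
The symlog and twohot entries above can be made concrete with a few lines of plain Python. This is a minimal sketch, not the code from the DreamerV3 repository: the function names, the five-bin grid, and the choice to clip out-of-range targets are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def symlog(x: float) -> float:
    """Compress magnitude while keeping sign: sign(x) * log(1 + |x|)."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(y: float) -> float:
    """Inverse of symlog: sign(y) * (exp(|y|) - 1)."""
    return math.copysign(math.expm1(abs(y)), y)


def twohot(value: float, bins: Sequence[float]) -> List[float]:
    """Spread `value` over its two neighbouring bins.

    The weights are chosen so that sum(p * b) equals `value`.
    `bins` must be sorted ascending; out-of-range values are clipped
    (an assumption of this sketch).
    """
    value = min(max(value, bins[0]), bins[-1])
    probs = [0.0] * len(bins)
    # Index of the first bin that is >= value.
    hi = next(i for i, b in enumerate(bins) if b >= value)
    if bins[hi] == value:
        probs[hi] = 1.0  # value sits exactly on a bin
        return probs
    lo = hi - 1
    w_hi = (value - bins[lo]) / (bins[hi] - bins[lo])
    probs[lo] = 1.0 - w_hi
    probs[hi] = w_hi
    return probs


# Hypothetical bin grid in symlog space; real implementations use many more bins.
bins = [-2.0, -1.0, 0.0, 1.0, 2.0]
target = symlog(5.0)                 # critic target, squashed
probs = twohot(target, bins)         # categorical training target
decoded = symexp(sum(p * b for p, b in zip(probs, bins)))  # back to a return
```

Decoding takes the expected bin value and applies symexp, which recovers the original return as long as it lies inside the bin range.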

Real-world analogy

PPO (Day 22) is "try, see what happened, update." DreamerV3 is "try, see what happened, build a mental model, imagine a thousand more attempts, update from those." When real interaction is expensive (real robots), imagination is cheap.
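
The "imagine more attempts" step can be sketched with a toy example. Everything here is hypothetical: `toy_model` stands in for a learned world model, `toy_policy` for the actor, and plain discounted returns (with an assumed discount of 0.99) stand in for the critic targets. The only detail taken from the glossary is the 15-step horizon.

```python
from typing import Callable, List, Tuple

Transition = Tuple[float, float, float]  # (state, action, reward)


def toy_model(state: float, action: float) -> Tuple[float, float]:
    """Hypothetical 'learned' dynamics: predicts next state and reward."""
    next_state = state + 0.1 * action
    reward = -abs(next_state)  # closer to zero is better
    return next_state, reward


def toy_policy(state: float) -> float:
    """Hypothetical actor: push the state toward zero, action in [-1, 1]."""
    return max(-1.0, min(1.0, -state))


def imagine(
    start_state: float,
    model: Callable[[float, float], Tuple[float, float]],
    policy: Callable[[float], float],
    horizon: int = 15,
) -> List[Transition]:
    """Roll the model forward from a real state without touching the env."""
    trajectory: List[Transition] = []
    state = start_state
    for _ in range(horizon):
        action = policy(state)
        next_state, reward = model(state, action)
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory


def discounted_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """Return-to-go for each imagined step (simplified critic targets)."""
    returns: List[float] = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


real_state = 2.0  # pretend this came from one real environment step
traj = imagine(real_state, toy_model, toy_policy)
targets = discounted_returns([r for _, _, r in traj])
```

One real state yields 15 imagined transitions here, which is where the data saving comes from: the actor and critic update on `traj` and `targets` without any further environment interaction.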

Hour 1 — Reading

Hour 2 — Read the JAX implementation

cd ~/robo47-wm
git clone https://github.com/danijar/dreamerv3
cd dreamerv3
Read in this order (~30 min):
  • dreamerv3/agent.py — top-level agent, RSSM + actor + critic
  • dreamerv3/jaxnets.py — RSSM forward
  • dreamerv3/jaxutils.py — symlog, twohot helpers
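
Before reading the RSSM forward pass, it may help to have a toy picture of the recurrence in mind. The sketch below is not taken from the repository: it uses scalar states, fixed hand-picked weights instead of learned networks, a tanh update in place of a recurrent cell, and a three-class categorical for z_t. All names and constants are assumptions.

```python
import math
import random
from typing import List, Tuple

Z_VALUES = [-1.0, 0.0, 1.0]  # hypothetical values for the 3 latent classes


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a small list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rssm_step(
    h: float, z: float, action: float, rng: random.Random
) -> Tuple[float, float, float, float]:
    """One toy RSSM step: returns (h_next, z_next, reward, done_prob)."""
    # Deterministic path: h_t depends on the previous h, z, and action.
    h_next = math.tanh(0.5 * h + 0.3 * z + 0.2 * action)
    # Stochastic path: z_t is sampled from a categorical conditioned on h_t.
    probs = softmax([-2.0 * h_next, 0.0, 2.0 * h_next])
    z_next = rng.choices(Z_VALUES, weights=probs, k=1)[0]
    # Hypothetical prediction heads that read the combined state.
    reward = h_next + 0.1 * z_next
    done_prob = 1.0 / (1.0 + math.exp(4.0 - abs(h_next)))
    return h_next, z_next, reward, done_prob


rng = random.Random(0)  # seeded so the rollout is reproducible
h, z = 0.0, 0.0
history = []
for action in [1.0, 1.0, -1.0, 0.0]:
    h, z, reward, done_prob = rssm_step(h, z, action, rng)
    history.append((h, z, reward, done_prob))
```

While reading the real code, the thing to look for is the same split: which tensors carry the deterministic state, where the stochastic state is sampled, and which heads read the concatenation of the two.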

LAB

Hour 3 — Lab: train DreamerV3 on a control task (90 min wall-clock)

What you're building. Train DreamerV3 on dmc_walker_walk (DeepMind Control Suite, Walker-2D walking). Compare wall-clock and sample efficiency vs Day 22's PPO.

What success looks like.
  1. DreamerV3 trains for 1M env steps and reaches mean reward ≥ 700 (out of ~1000 max).
  2. PPO baseline on the same task reaches similar reward but takes ~5× more env steps.
  3. Plot figures/day38_dreamer_vs_ppo.png showing both learning curves.
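
One way to check the first two criteria is to find the first logged env step at which each learning curve crosses the reward threshold, then take the ratio. This is a sketch under assumptions: the helper names are hypothetical, and the two curves below are made-up placeholder numbers for demonstration, not measured results from either algorithm.

```python
from typing import Optional, Sequence


def steps_to_threshold(
    env_steps: Sequence[int],
    mean_rewards: Sequence[float],
    threshold: float = 700.0,
) -> Optional[int]:
    """First logged env step whose mean reward reaches the threshold."""
    for step, reward in zip(env_steps, mean_rewards):
        if reward >= threshold:
            return step
    return None  # the curve never reached the threshold


def step_ratio(steps_a: Optional[int], steps_b: Optional[int]) -> Optional[float]:
    """How many times more env steps run B needed than run A."""
    if steps_a is None or steps_b is None or steps_a == 0:
        return None
    return steps_b / steps_a


# PLACEHOLDER curves (made up): replace with the logged values from your runs.
curve_a_steps = [0, 100, 200, 300, 400]
curve_a_rewards = [10.0, 350.0, 705.0, 820.0, 900.0]
curve_b_steps = [0, 200, 400, 600, 800]
curve_b_rewards = [10.0, 200.0, 450.0, 690.0, 710.0]

a = steps_to_threshold(curve_a_steps, curve_a_rewards)
b = steps_to_threshold(curve_b_steps, curve_b_rewards)
ratio = step_ratio(a, b)
```

The result depends on how often rewards are logged, so both runs should use the same evaluation interval before their ratio is compared with the ~5× target.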

Step 1 — Install + smoke test (10 min)

cd ~/robo47-wm/dreamerv3
uv pip install dm_control
uv pip install -r requirements.txt
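
The page does not show its smoke test, so the following is only one possible check that the dm_control install works. It assumes the Walker task is reachable as `suite.load("walker", "walk")` and steps it with random actions. It exercises the environment only, not DreamerV3, and the function name is hypothetical.

```python
from typing import Dict, Optional


def smoke_test_walker(n_steps: int = 10) -> Optional[Dict[str, float]]:
    """Step the DMC walker-walk task with random actions.

    Returns None if dm_control (or NumPy) is not importable, otherwise a
    small summary dict. Illustrative only.
    """
    try:
        import numpy as np
        from dm_control import suite
    except ImportError:
        return None  # install step has not succeeded yet

    env = suite.load(domain_name="walker", task_name="walk")
    spec = env.action_spec()
    time_step = env.reset()

    total_reward = 0.0
    steps = 0
    for _ in range(n_steps):
        # Sample uniformly inside the action bounds reported by the env.
        action = np.random.uniform(spec.minimum, spec.maximum, size=spec.shape)
        time_step = env.step(action)
        total_reward += time_step.reward or 0.0  # reward is None right after a reset
        steps += 1

    return {
        "steps": float(steps),
        "action_dim": float(spec.shape[0]),
        "total_reward": total_reward,
    }


result = smoke_test_walker(n_steps=5)
```

A `None` result means the import failed and the install commands above need another look. A dict means the simulator loads and steps.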

Full source continues in the committed curriculum files. The v1.0 page exposes the day flow and lab surface without inventing content.
