Day 23

MuJoCo Playground: massively-parallel PPO on a quadruped

This is a valid v1.0 placeholder page for the later curriculum arc. Full interactive lab treatment ships after Week 1 dogfooding.

LECTURE & READING

Glossary primer (12 min)

  • MuJoCo Playground — Google DeepMind's 2025 GPU-accelerated MuJoCo wrapper. Runs 4096 parallel envs on a single H100 with full physics. Released v0.1.0 in 2024-Q4.
  • MJX — MuJoCo translated to JAX, runs on GPU/TPU. The simulation engine under MuJoCo Playground.
  • Brax — DeepMind's earlier JAX-based RL library. Many Playground envs are inherited from it.
  • Massively-parallel RL — 1000s of envs run synchronously on GPU. Unlocks throughput-based RL (16M env steps/sec).
  • Go1 / Go2 — Unitree's quadruped robots; standard sim-to-real benchmarks.
  • Episode return — Sum of rewards in one episode. The headline metric.
  • Domain rand. config — Parameters (mass, friction, motor strength) randomized at episode start.
  • Reward shaping — Adding auxiliary terms (alive bonus, action smoothness, contact penalty) to the dense reward. Critical for locomotion.
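
To make the last three glossary entries concrete, here is a minimal sketch in plain Python. It is not Playground's reward or randomization code: the ranges in RANDOMIZATION_RANGES, the reward weights, and the function names are all illustrative assumptions.

```python
import random

# Hypothetical randomization ranges; real values belong in the env's config.
RANDOMIZATION_RANGES = {
    "mass_scale": (0.9, 1.1),
    "friction": (0.6, 1.2),
    "motor_strength_scale": (0.9, 1.1),
}

def sample_domain_params(rng):
    """Draw one set of physical parameters, as would happen at episode start."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANDOMIZATION_RANGES.items()}

def shaped_reward(forward_velocity, action, prev_action, num_bad_contacts):
    """Task reward plus auxiliary shaping terms (all weights are illustrative)."""
    tracking = forward_velocity                 # main task term
    alive_bonus = 0.1                           # paid on every step the robot stays up
    smoothness = -0.01 * sum((a - b) ** 2 for a, b in zip(action, prev_action))
    contact_penalty = -0.5 * num_bad_contacts   # e.g. the torso touching the ground
    return tracking + alive_bonus + smoothness + contact_penalty

def episode_return(rewards):
    """Undiscounted sum of per-step rewards over one episode."""
    return sum(rewards)

rng = random.Random(0)
params = sample_domain_params(rng)
step_reward = shaped_reward(0.5, [0.2, 0.0], [0.0, 0.0], num_bad_contacts=0)
print(params)
print(step_reward, episode_return([step_reward] * 3))
```

Because the shaping terms pay out on every step, the policy gets a learning signal long before it can walk, which is why the episode return can climb smoothly during training.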

Real-world analogy

If single-env PPO (Day 22) was driving one car around a track to learn racing, MuJoCo Playground is putting 4096 cars on 4096 parallel tracks at once and synchronizing their lessons. The throughput unlock is enormous.
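
The analogy can be sketched as a toy lockstep loop. This is plain Python with a made-up 1-D "car" rather than MJX physics, and the names step_one and step_batch are hypothetical. On a GPU the per-env loop inside step_batch would collapse into a single vectorized call.

```python
import random

def step_one(position, velocity, action, dt=0.02):
    """Advance one toy 1-D car; the action is an acceleration command."""
    velocity = velocity + action * dt
    position = position + velocity * dt
    reward = velocity * dt  # reward forward progress made during this step
    return position, velocity, reward

def step_batch(positions, velocities, actions):
    """Step every env exactly once, in lockstep."""
    results = [step_one(p, v, a) for p, v, a in zip(positions, velocities, actions)]
    new_positions, new_velocities, rewards = (list(x) for x in zip(*results))
    return new_positions, new_velocities, rewards

NUM_ENVS = 4096
rng = random.Random(0)
positions = [0.0] * NUM_ENVS
velocities = [0.0] * NUM_ENVS

total_transitions = 0
for _ in range(10):
    actions = [rng.uniform(-1.0, 1.0) for _ in range(NUM_ENVS)]
    positions, velocities, rewards = step_batch(positions, velocities, actions)
    total_transitions += len(rewards)

print(total_transitions)  # 40960
```

Ten synchronized steps already yield 40960 transitions for the learner, which is where the throughput comes from.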

Hour 1 — Reading

Hour 2 — Install + first env

ssh -i ~/.ssh/nebius_key ubuntu@<your-h100-ip>
cd ~ && mkdir -p robo47-rl && cd robo47-rl
uv venv --python 3.12 .venv && source .venv/bin/activate
uv pip install -U "jax[cuda12]" "mujoco-mjx>=3.7" "mujoco-playground"
uv pip install brax flax orbax wandb tensorboard

# Verify GPU JAX
python -c "
import jax; print('devices:', jax.devices())
"

Expected: devices: [CudaDevice(id=0)]. If [CpuDevice(id=0)], your JAX wheel is wrong (Day 0 troubleshooting #11).
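
If you prefer a check that reports the problem instead of a device list to read by eye, the following is a small sketch. The helper name describe_backend is hypothetical. It relies on jax.default_backend(), which is assumed here to report "gpu" for a working CUDA install.

```python
def describe_backend():
    """Return JAX's default backend name, or 'jax-missing' if JAX is not installed."""
    try:
        import jax
    except ImportError:
        return "jax-missing"
    return jax.default_backend()

backend = describe_backend()
if backend == "gpu":
    print("OK: JAX will run on the GPU")
else:
    print(f"Problem: JAX backend is {backend!r}; fix the install before training")
```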

# Smoke test: Spot quadruped env loads
python -c "
from mujoco_playground import registry, locomotion
env = registry.load('SpotJoystickFlatTerrain')
print(f'obs dim: {env.observation_size}, act dim: {env.action_size}')
"

Expected:

obs dim: 48, act dim: 12

LAB

Hour 3 — Lab: train Spot to walk in 10 minutes (75 min)

What you're building. Train MuJoCo Playground's SpotJoystickFlatTerrain (Boston Dynamics Spot quadruped, joystick locomotion) using PPO with 4096 parallel envs. Expect roughly 10 min of wall-clock time to reach a walking policy. Render an MP4 of the policy.

What success looks like at the end. You have:

  1. A trained Spot policy at runs/spot_ppo/policy.pkl reaching episode return ≥ 25 (well above the random baseline of ~0).
  2. A 30-second video, videos/day23_spot_walk.mp4, showing the simulated Spot walking forward at ~0.6 m/s.
  3. A training curve, figures/day23_spot_training.png, showing return rising from 0 to 25+ in ~6M env steps.
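
To get a feel for what ~6M env steps means with 4096 parallel envs, here is a short worked calculation. UNROLL_LENGTH is a hypothetical value chosen for illustration, not a setting taken from Playground's trainer config.

```python
import math

NUM_ENVS = 4096               # parallel envs, from the lab description
TARGET_ENV_STEPS = 6_000_000  # approximate env steps to reach return 25+
UNROLL_LENGTH = 20            # hypothetical rollout length per env per PPO iteration

# Each synchronized step advances every env once, so it yields NUM_ENVS transitions.
batched_steps = math.ceil(TARGET_ENV_STEPS / NUM_ENVS)

# One PPO iteration collects NUM_ENVS * UNROLL_LENGTH transitions before updating.
steps_per_iteration = NUM_ENVS * UNROLL_LENGTH
ppo_iterations = math.ceil(TARGET_ENV_STEPS / steps_per_iteration)

print(batched_steps)        # 1465
print(steps_per_iteration)  # 81920
print(ppo_iterations)       # 74
```

Under these assumptions the whole budget is only about 1465 synchronized simulator steps, so each individual env sees far less experience than a single-env run would need.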

Step 1 — Use Playground's PPO trainer (5 min)

Playground ships with a Brax PPO trainer optimized for its envs. Don't reimplement.

Full source continues in the committed curriculum files. The v1.0 page exposes the day flow and lab surface without inventing content.
