MuJoCo Playground: massively-parallel PPO on a quadruped

This is a valid v1.0 placeholder page for the later curriculum arc. Full interactive lab treatment ships after Week 1 dogfooding.

LECTURE & READING

Glossary primer (12 min)

MuJoCo Playground — Google DeepMind's 2025 GPU-accelerated MuJoCo wrapper. Runs 4096 parallel envs on a single H100 with full physics. Released v0.1.0 in 2024-Q4.
MJX — MuJoCo translated to JAX, runs on GPU/TPU. The engine under MuJoCo Playground.
Brax — Google's earlier JAX-based physics and RL library. Many Playground envs are inherited from it.
Massively parallel — Thousands of envs run synchronously on GPU. Unlocks throughput-based training (16M env steps/sec).
Go1 / Go2 — Unitree's quadruped robots; standard benchmarks.
Episode return — Sum of rewards in one episode. The headline metric.
Domain randomization config — Parameters (mass, friction, motor strength) randomized at start.
Reward shaping — Adding auxiliary terms (alive bonus, smoothness, penalty) to the reward. Critical for locomotion tasks.
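
To make the episode-return and shaping entries concrete, here is a minimal sketch of a shaped per-step reward and the episode return it produces. The StepInfo fields, the weights, and the choice of an energy-style penalty are illustrative assumptions, not Playground's actual reward terms or values.

```python
from dataclasses import dataclass


@dataclass
class StepInfo:
    """Per-step quantities a locomotion env might expose (hypothetical fields)."""
    forward_velocity: float  # m/s, the task signal
    action_delta: float      # |a_t - a_{t-1}|, a proxy for jerkiness
    torque_sq: float         # sum of squared joint torques
    alive: bool              # True while the robot has not fallen


# Hypothetical weights: positive terms are bonuses, negative terms are penalties.
W_TRACK, W_ALIVE, W_SMOOTH, W_ENERGY = 1.0, 0.1, -0.05, -0.001


def shaped_reward(s: StepInfo) -> float:
    """Task reward plus auxiliary shaping terms for one step."""
    task = W_TRACK * s.forward_velocity
    aux = (W_ALIVE * float(s.alive)
           + W_SMOOTH * s.action_delta
           + W_ENERGY * s.torque_sq)
    return task + aux


def episode_return(steps) -> float:
    """Episode return: the sum of per-step rewards over one episode."""
    return sum(shaped_reward(s) for s in steps)


steps = [StepInfo(0.5, 0.2, 10.0, True)] * 2
print(f"{episode_return(steps):.2f}")  # prints 1.16
```

Each step here earns 0.5 (task) + 0.1 (alive) - 0.01 (smoothness) - 0.01 (energy) = 0.58, so the two-step episode returns 1.16.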

Real-world analogy

If single-env PPO (Day 22) was like driving one car around a track to learn racing, MuJoCo Playground puts 4096 cars on 4096 parallel tracks at once and synchronizes their lessons. The throughput gain is enormous.
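
The analogy can be sketched in plain Python: many toy environments, each with its own randomized parameters, advance in lockstep, and one policy update consumes the transitions from all of them. This is a CPU-only illustration, not how MJX is implemented (on GPU the batching is typically expressed with jax.vmap). The 1-D point-mass dynamics, the randomization ranges, and UNROLL_LENGTH are hypothetical.

```python
import random

NUM_ENVS = 4096      # parallel envs, as in the text
UNROLL_LENGTH = 20   # hypothetical: steps collected per env before each update


def make_env_params(rng):
    """Domain randomization: sample per-env parameters once, at start."""
    return {
        "mass_scale": rng.uniform(0.9, 1.1),      # hypothetical range
        "motor_strength": rng.uniform(0.8, 1.2),  # hypothetical range
    }


def step_all(positions, velocities, params, action, dt=0.02):
    """Advance every env by one step in lockstep (toy 1-D point mass)."""
    new_v = [v + dt * action * p["motor_strength"] / p["mass_scale"]
             for v, p in zip(velocities, params)]
    new_x = [x + dt * v for x, v in zip(positions, new_v)]
    return new_x, new_v


rng = random.Random(0)
params = [make_env_params(rng) for _ in range(NUM_ENVS)]
x = [0.0] * NUM_ENVS
v = [0.0] * NUM_ENVS

for _ in range(UNROLL_LENGTH):
    x, v = step_all(x, v, params, action=1.0)

transitions_per_update = NUM_ENVS * UNROLL_LENGTH
print(transitions_per_update)  # prints 81920
```

Because every env has slightly different mass and motor strength, the 4096 trajectories diverge even under the same action, which is the point of randomizing: the policy sees a spread of dynamics in every batch.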

Hour 1 — Reading

MuJoCo Playground v0.1.0 release notes (~10 min): https://github.com/google-deepmind/mujoco_playground
Brax paper, abstract + Section 3 (~25 min): https://arxiv.org/abs/2106.13281
Watch the MuJoCo Playground tour (DeepMind release video, if available; otherwise the GIFs in the README)

Hour 2 — Install + first env

ssh -i ~/.ssh/nebius_key ubuntu@<your-h100-ip>
cd ~ && mkdir -p robo47-rl && cd robo47-rl
uv venv --python 3.12 .venv && source .venv/bin/activate
uv pip install -U "jax[cuda12]" "mujoco-mjx>=3.7" "playground"
uv pip install brax flax orbax wandb tensorboard

# Verify GPU JAX
python -c "
import jax; print('devices:', jax.devices())
"

Expected: devices: [CudaDevice(id=0)]. If [CpuDevice(id=0)], your JAX wheel is wrong (Day 0 troubleshooting #11).
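
If you want the check to fail loudly rather than rely on reading the device list, a slightly more defensive variant might look like the sketch below. It assumes only jax.default_backend(), which reports the platform JAX will use by default; the messages and the helper name jax_backend are illustrative.

```python
import importlib.util


def jax_backend():
    """Return JAX's default backend name, or None if JAX is not importable."""
    if importlib.util.find_spec("jax") is None:
        return None
    import jax
    return jax.default_backend()


backend = jax_backend()
if backend is None:
    print("JAX is not installed in this environment")
elif backend == "cpu":
    print("JAX fell back to CPU; reinstall the CUDA wheel before training")
else:
    print(f"JAX default backend: {backend}")
```

Running this inside the activated .venv on the H100 box should take the last branch; the CPU branch corresponds to the wrong-wheel case described above.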

# Smoke test: Spot quadruped env loads
python -c "
from mujoco_playground import registry, locomotion
env = registry.load('SpotJoystickFlatTerrain')
print(f'obs dim: {env.observation_size}, act dim: {env.action_size}')
"

Expected:

obs dim: 48, act dim: 12

LAB

Hour 3 — Lab: train Spot to walk in 10 minutes (75 min)

What you're building. Train MuJoCo Playground's SpotJoystickFlatTerrain (Boston Dynamics Spot quadruped, joystick control) using PPO with 4096 parallel envs. ~10 min wall-clock to a walking policy. Render an MP4 of the rollout.

What success looks like at the end. You have:
1. A trained Spot policy at runs/spot_pp0/policy.pkl reaching episode return ≥ 25 (well above the random-policy baseline of ~0).
2. A 30-second video videos/day23_spot_walk.mp4 showing the simulated Spot walking forward at ~0.6 m/s.
3. A training curve figures/day23_spot_training.png showing episode return rising from 0 to 25+ in ~6M env steps.
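
The targets above imply a few numbers worth sanity-checking before you start. The sketch below works them out and adds a small helper for deciding whether a logged return curve has cleared the threshold. It assumes the ~6M env steps fit inside the ~10 min budget; the 5-point averaging window and the sample curve are hypothetical.

```python
TOTAL_ENV_STEPS = 6_000_000   # ~6M env steps, from the success criteria
WALL_CLOCK_S = 10 * 60        # ~10 min wall-clock
NUM_ENVS = 4096
VIDEO_S = 30                  # video length in seconds
WALK_SPEED = 0.6              # m/s

steps_per_sec = TOTAL_ENV_STEPS / WALL_CLOCK_S   # average training throughput
steps_per_env = TOTAL_ENV_STEPS / NUM_ENVS       # steps each env contributes
distance_m = VIDEO_S * WALK_SPEED                # ground covered in the video

print(f"{steps_per_sec:.0f} env steps/sec")   # prints 10000 env steps/sec
print(f"{steps_per_env:.0f} steps per env")   # prints 1465 steps per env
print(f"{distance_m:.0f} m walked")           # prints 18 m walked


def meets_target(returns, threshold=25.0, window=5):
    """True if the mean of the last `window` logged returns is >= threshold."""
    if len(returns) < window:
        return False
    tail = returns[-window:]
    return sum(tail) / window >= threshold


curve = [0, 5, 12, 20, 26, 27, 28, 26, 25, 29]   # hypothetical eval returns
print(meets_target(curve))                        # prints True
```

Averaging over a short window rather than checking the single last point avoids declaring success on one lucky evaluation.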

Step 1 — Use Playground's PPO trainer (5 min)

Playground ships with a Brax PPO trainer optimized for its envs. Don't reimplement.

Full source continues in the committed curriculum files. The v1.0 page exposes the day flow and lab surface without inventing content.

Completion controls unlock when this day graduates from placeholder to full lab.