Day 22

PPO foundations + Abbeel primer + cart-pole from scratch

This is a valid v1.0 placeholder page for the later curriculum arc. Full interactive lab treatment ships after Week 1 dogfooding.

LECTURE & READING

Glossary primer (15 min)

  • Reinforcement Learning (RL): Learning a policy π(a|s) that maximizes expected return E[Σγᵗ rₜ] from interaction with an environment.
  • Markov Decision Process (MDP): The tuple (S, A, P, r, γ) of states, actions, transition kernel, reward, and discount.
  • Policy gradient: Update θ ← θ + α ∇_θ J(θ), where J(θ) = E[Σγᵗ rₜ] is the expected return under π_θ. The fundamental RL gradient.
  • Advantage A(s, a): How much better action a is than average at state s. A = Q − V.
  • GAE (Generalized Advantage Estimation): Smoothed advantage estimator: Â_t = Σ_{l≥0} (γλ)^l δ_{t+l}, where δ_t = r_t + γV(s_{t+1}) − V(s_t). λ trades bias against variance: λ = 0 gives the one-step TD error δ_t (low variance, more bias), and λ = 1 gives the discounted Monte Carlo return minus V(s_t) (less bias, high variance).
  • PPO (Proximal Policy Optimization): Schulman et al., 2017. Clips the policy ratio r_t(θ) = π_θ(a_t|s_t)/π_old(a_t|s_t) to prevent destructive updates. A widely used default RL algorithm as of 2026.
  • Clip ratio ε: Typically 0.2. Per-sample objective: min(r·A, clip(r, 1−ε, 1+ε)·A), where r is the policy ratio (not the reward) and A is the advantage.
  • Value function V(s): Critic. Predicts expected return from state s.
  • Entropy bonus: Encourages exploration by rewarding uncertain (high-entropy) policies. Coefficient 0.01 typical.
  • Rollout: Collect N steps of trajectory by running the policy in the env(s).
  • Vectorized envs: Run K parallel envs simultaneously to amortize the policy forward pass.
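
The GAE entry in the list above can be made concrete with a short backward recursion. This is a minimal sketch in plain Python, not the lab's implementation: the function name `compute_gae`, the `dones` flag convention, and the defaults γ = 0.99 and λ = 0.95 are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def compute_gae(
    rewards: Sequence[float],
    values: Sequence[float],
    last_value: float,
    dones: Sequence[bool],
    gamma: float = 0.99,  # assumed default, not from the lab source
    lam: float = 0.95,    # assumed default, not from the lab source
) -> Tuple[List[float], List[float]]:
    """Compute GAE advantages and value targets for one rollout of T steps.

    values[t] is V(s_t); last_value is V(s_T), used to bootstrap the final
    step; dones[t] is True when the episode ended after step t.
    """
    num_steps = len(rewards)
    advantages = [0.0] * num_steps
    gae = 0.0
    # Walk backwards so each step reuses the accumulated sum from step t+1.
    for t in reversed(range(num_steps)):
        next_value = last_value if t == num_steps - 1 else values[t + 1]
        not_done = 0.0 if dones[t] else 1.0
        # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), with no bootstrap past a terminal step
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # A_t = delta_t + gamma * lambda * A_{t+1}, reset at episode boundaries
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    # Value targets for the critic: advantage plus the value baseline.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

With γ = λ = 1 and a critic that always predicts zero, the advantages reduce to the reward-to-go, which is a quick sanity check for the recursion.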

Real-world analogy

PPO is "try a slightly different policy, but only step in the direction of improvement if the new policy isn't too different from the old one." The clipping is a guardrail. Without it, one lucky-but-low-probability sequence of actions can yank the policy into a region from which it never recovers.
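
To see the guardrail numerically, the sketch below evaluates the per-sample clipped objective min(r·A, clip(r, 1−ε, 1+ε)·A) for a single ratio and advantage. The function name `clipped_surrogate` is hypothetical, and a real implementation would apply this to tensors and average over a minibatch.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample PPO clipped objective (to be maximized).

    ratio is pi_new(a|s) / pi_old(a|s); eps is the clip ratio.
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the min makes the objective a pessimistic bound: moving the
    # ratio further in the "good" direction past the clip range earns nothing.
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Positive advantage: the objective stops growing once ratio exceeds 1 + eps.
    for ratio in (0.5, 1.0, 1.2, 1.5, 3.0):
        print(ratio, clipped_surrogate(ratio, advantage=1.0))
```

For a positive advantage the objective is flat for any ratio above 1 + ε, so its gradient is zero there and a single lucky sample cannot keep pulling the policy further. For a negative advantage the same happens below 1 − ε, while ratios above 1 + ε are still fully penalized.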

Hour 1 — Abbeel Deep RL primer

Watch Foundations of Deep RL, Lecture 4: TRPO and PPO

Pieter Abbeel's pacing on this is excellent. Watch at 1.25× if comfortable. The single most useful 30 minutes for understanding PPO.

Hour 2 — Spinning Up + 37 Implementation Details

The "37 details" blog is what separates "PPO works" from "PPO doesn't work" in your code. Read it before writing PPO from scratch.
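
As one example of the kind of detail that matters, advantage normalization is a commonly used PPO trick: advantages are rescaled to zero mean and unit standard deviation before the policy loss is computed. The sketch below is an assumption-laden illustration in plain Python. The name `normalize_advantages`, the population standard deviation, and the 1e-8 stabilizer are choices made here, not taken from the lab source.

```python
from typing import List, Sequence


def normalize_advantages(advantages: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Rescale a batch of advantages to zero mean and unit standard deviation.

    Uses the population standard deviation (divide by n); eps avoids a
    division by zero when every advantage in the batch is identical.
    """
    n = len(advantages)
    mean = sum(advantages) / n
    variance = sum((a - mean) ** 2 for a in advantages) / n
    std = variance ** 0.5
    return [(a - mean) / (std + eps) for a in advantages]
```

Normalizing keeps the scale of the policy gradient roughly constant as returns grow during training, which makes a fixed learning rate and clip ratio behave more consistently. Note that libraries differ on whether they divide by n or n − 1, so small numerical differences against a reference implementation are expected.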

LAB

Hour 3 — Lab: PPO from scratch on CartPole-v1 (90 min)

What you're building. A 250-line PPO implementation in pure PyTorch (no Stable-Baselines3, no CleanRL in the implementation itself) that solves Gymnasium's CartPole-v1, reaching an episode reward of 500 within 100k env steps. You'll log learning curves and compare them against the Stable-Baselines3 baseline.

What success looks like at the end. You have:

  1. w4-rl/src/day22_ppo_cartpole.py (~250 lines).
  2. CartPole-v1 reward reaches 500 (the env's max) within 100k steps for ≥ 2 of 3 seeds.
  3. figures/day22_ppo_curves.png showing reward vs env steps.
  4. Comparison: SB3 PPO trained on the same env reaches 500 in similar wall-clock; you're not faster, but you understand it.
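
The seed criterion above can be checked mechanically once learning curves are logged. The sketch below assumes a hypothetical log format, one list of (env_step, episode_reward) pairs per seed in step order, and treats "reaches 500" as the first single episode hitting the target. Both are assumptions for illustration, and the helper names are not part of the lab source.

```python
from typing import List, Optional, Sequence, Tuple

Curve = Sequence[Tuple[int, float]]  # (env_step, episode_reward), in step order


def first_step_reaching(curve: Curve, target: float = 500.0) -> Optional[int]:
    """Return the first env step at which an episode reward hits the target."""
    for env_step, episode_reward in curve:
        if episode_reward >= target:
            return env_step
    return None


def meets_success_criterion(
    curves: List[Curve],
    target: float = 500.0,
    step_budget: int = 100_000,
    min_seeds: int = 2,
) -> bool:
    """True if at least min_seeds curves reach the target within the budget."""
    solved = 0
    for curve in curves:
        step = first_step_reaching(curve, target)
        if step is not None and step <= step_budget:
            solved += 1
    return solved >= min_seeds
```

A stricter variant would require a moving average of episode rewards to reach the target, which is less sensitive to a single lucky episode.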

Step 1 — Install (5 min)

mkdir -p ~/robo47/w4-rl && cd ~/robo47/w4-rl
uv venv --python 3.12 .venv && source .venv/bin/activate
uv pip install torch "gymnasium[classic-control]" stable-baselines3 wandb matplotlib numpy

Full source continues in the committed curriculum files. The v1.0 page exposes the day flow and lab surface without inventing content.

Completion controls unlock when this day graduates from placeholder to full lab.