Day 37

LeWorldModel, ThinkJEPA, world-model–conditioned policies

This is a valid v1.0 placeholder page for the later curriculum arc. Full interactive lab treatment ships after Week 1 dogfooding.

LECTURE & READING

Glossary primer (12 min)

  • LeWorldModel — Hugging Face's 2025 world-model project. JEPA backbone + action conditioning + LeRobot integration. Designed to plug into existing imitation pipelines.
  • ThinkJEPA — Late 2025 / early 2026 paper. Adds reasoning steps to JEPA prediction: instead of one-shot prediction, the model iterates internal "thoughts" before producing the final embedding.
  • Latent rollout — Given obs_0 and an action sequence a_0, a_1, ..., a_{T-1}, predict obs_1, obs_2, ..., obs_T entirely in embedding space. The "imagined trajectory."
  • World-model–conditioned policy — Train the policy via imitation (Week 3-style), but augment its observations with a JEPA-predicted "imagined" next embedding. Adds foresight at low cost.
  • Auxiliary loss — Adding world-model prediction loss to the main imitation loss. Typically weighted 0.1–0.5×.
  • Closed-loop vs open-loop rollout — Closed: at each step, predict the next embedding from the real obs + action. Open: predict from the previous predicted embedding. Open-loop drifts faster.
  • Drift horizon — How many steps before predicted embeddings diverge measurably from reality. JEPA models typically drift after 20–50 steps; full rollouts beyond ~100 steps are unreliable.
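
The closed-loop vs open-loop distinction can be made concrete with a toy model. The sketch below is not a JEPA. It assumes a 2-D latent that the true dynamics rotate by a fixed angle plus the action, and a hypothetical learned predictor whose rotation is off by a small constant error (`MODEL_ERROR`). All names and constants here are illustrative assumptions.

```python
import math

BASE_ANGLE = 0.10   # true per-step rotation in radians (toy assumption)
MODEL_ERROR = 0.05  # constant angular error of the hypothetical learned predictor


def rotate(z, angle):
    """Rotate a 2-D latent vector z by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * z[0] - s * z[1], s * z[0] + c * z[1])


def true_step(z, action):
    """Ground-truth latent dynamics."""
    return rotate(z, BASE_ANGLE + action)


def model_step(z, action):
    """Imperfect predictor: same dynamics with a small angular error."""
    return rotate(z, BASE_ANGLE + action + MODEL_ERROR)


def cosine(u, v):
    dot = u[0] * v[0] + u[1] * v[1]
    return dot / (math.hypot(*u) * math.hypot(*v))


def rollout_cosines(z0, actions):
    """Return (open_loop, closed_loop) cosine similarity to the true latent per step."""
    open_cos, closed_cos = [], []
    z_true, z_open = z0, z0
    for a in actions:
        z_next = true_step(z_true, a)
        z_closed = model_step(z_true, a)  # closed-loop: restart from the real latent
        z_open = model_step(z_open, a)    # open-loop: feed the prediction back
        open_cos.append(cosine(z_open, z_next))
        closed_cos.append(cosine(z_closed, z_next))
        z_true = z_next
    return open_cos, closed_cos


open_cos, closed_cos = rollout_cosines((1.0, 0.0), actions=[0.0] * 30)
print(f"step  1: open={open_cos[0]:.3f} closed={closed_cos[0]:.3f}")
print(f"step 30: open={open_cos[-1]:.3f} closed={closed_cos[-1]:.3f}")
```

In this toy, the closed-loop error stays at one step's worth (cosine cos(0.05) ≈ 0.999 at every step), while the open-loop angular error grows linearly, so by step 30 the cosine has fallen to cos(1.5) ≈ 0.07. Real embedding drift will not follow this exact curve; the point is only that feeding predictions back compounds the error.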

Real-world analogy

LeWorldModel is the "GPS for imitation policies": the policy is the driver; the world model whispers "if you keep going this way, in 2 seconds you'll see X." That is useful information that improves decisions without needing photorealistic prediction.
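
In code terms, the "whisper" is just an extra input to the policy. The sketch below illustrates the world-model–conditioned observation from the glossary: the current embedding concatenated with an imagined next embedding. The `imagine_next` function is a hypothetical linear stand-in for a world model, and the dimensions and the choice to condition on the last executed action are assumptions for illustration, not LeWorldModel's actual interface.

```python
import numpy as np


def imagine_next(obs_emb, action, W_z, W_a):
    """Hypothetical stand-in for a world model: a fixed linear map of (embedding, action)."""
    return obs_emb @ W_z + action @ W_a


def policy_input(obs_emb, imagined_emb):
    """World-model-conditioned observation: current embedding plus imagined next one."""
    return np.concatenate([obs_emb, imagined_emb], axis=-1)


rng = np.random.default_rng(0)
EMB_DIM, ACT_DIM, BATCH = 16, 14, 4  # illustrative sizes
W_z = rng.normal(scale=0.1, size=(EMB_DIM, EMB_DIM))
W_a = rng.normal(scale=0.1, size=(ACT_DIM, EMB_DIM))

obs_emb = rng.normal(size=(BATCH, EMB_DIM))
last_action = np.zeros((BATCH, ACT_DIM))  # assumption: condition on the last executed action

imagined = imagine_next(obs_emb, last_action, W_z, W_a)
x = policy_input(obs_emb, imagined)
print(x.shape)  # (4, 32)
```

The policy network then consumes `x` instead of `obs_emb`, so its input width doubles while the rest of the imitation pipeline is unchanged.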

Hour 1 — Reading

Hour 2 — LeWorldModel codebase

cd ~/robo47-wm
git clone https://github.com/huggingface/leworldmodel
cd leworldmodel
uv pip install -e .

Read in this order (~30 min):
  • leworldmodel/models/world_model.py — the action-conditioned predictor
  • leworldmodel/training/policy_with_wm.py — example of imitation + world-model joint training
  • examples/aloha_with_wm.py — full training script
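
Before reading the training code, it may help to have the shape of a joint objective in mind. The sketch below combines an imitation loss with a weighted world-model prediction loss, following the glossary's auxiliary-loss entry (weight typically 0.1–0.5×). It is not taken from the repository: the function names, the use of mean squared error for both terms, and the toy shapes are assumptions.

```python
import numpy as np


def mse(pred, target):
    """Mean squared error over all elements."""
    return float(np.mean((pred - target) ** 2))


def joint_loss(pred_actions, expert_actions, pred_next_emb, target_next_emb, aux_weight=0.1):
    """Imitation loss plus a weighted world-model prediction loss.

    Returns (total, imitation, world_model) so each term can be logged separately.
    """
    imitation = mse(pred_actions, expert_actions)
    world_model = mse(pred_next_emb, target_next_emb)
    return imitation + aux_weight * world_model, imitation, world_model


# Toy batch: 2 samples, 14-dim actions, 8-dim embeddings (illustrative sizes).
total, imitation, world_model = joint_loss(
    pred_actions=np.zeros((2, 14)),
    expert_actions=np.ones((2, 14)),
    pred_next_emb=np.zeros((2, 8)),
    target_next_emb=np.full((2, 8), 2.0),
    aux_weight=0.1,
)
print(total, imitation, world_model)
```

Here the imitation term is 1.0 and the world-model term is 4.0, so the total is 1.0 + 0.1 × 4.0 = 1.4. Logging the two terms separately makes it easier to tell whether the auxiliary weight is letting the prediction loss dominate.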

LAB

Hour 3 — Lab: latent rollout on Day 36's clip (60 min)

What you're building. Use V-JEPA 2-AC to roll out 30 steps of imagined embeddings open-loop, starting from frame 0 and conditioned on 30 mock (all-zero) actions. Compare each predicted embedding against the ground-truth embedding at the same step, and measure how cosine similarity decays with horizon.
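
The two bookkeeping pieces of this lab, slicing a flat token tensor into per-frame groups and scoring a predicted frame against the real one, can be tried without a model. The sketch below uses random NumPy arrays as a stand-in for encoder output; the helper names and the small sizes are assumptions. It also assumes tokens are ordered frame by frame with one group per frame. An encoder that merges several frames into one temporal group would need `n_frames` set to the number of groups instead.

```python
import numpy as np


def split_frames(tokens, n_frames):
    """Reshape (1, n_frames * n_patches, D) tokens into (n_frames, n_patches, D)."""
    _, n_tokens, dim = tokens.shape
    if n_tokens % n_frames != 0:
        raise ValueError("token count is not a multiple of n_frames")
    return tokens.reshape(n_frames, n_tokens // n_frames, dim)


def mean_patch_cosine(pred, target, eps=1e-8):
    """Cosine similarity per patch (last axis), averaged over patches."""
    dot = np.sum(pred * target, axis=-1)
    norms = np.linalg.norm(pred, axis=-1) * np.linalg.norm(target, axis=-1)
    return float(np.mean(dot / np.maximum(norms, eps)))


ROLLOUT_STEPS, N_PATCHES, DIM, ACTION_DIM = 30, 4, 8, 14  # illustrative sizes
rng = np.random.default_rng(0)

# Random stand-in for encoder output over T+1 frames.
tokens = rng.normal(size=(1, (ROLLOUT_STEPS + 1) * N_PATCHES, DIM))
frames = split_frames(tokens, ROLLOUT_STEPS + 1)

# One mock action per rollout step (all zeros), shape (T, action_dim).
actions = np.zeros((ROLLOUT_STEPS, ACTION_DIM))

print(frames.shape, actions.shape)
print(mean_patch_cosine(frames[1], frames[1]))   # identical embeddings score ~1
print(mean_patch_cosine(frames[1], -frames[1]))  # opposite embeddings score ~-1
```

In the rollout, `frames[t + 1]` plays the role of the ground-truth target for the prediction made at step `t`.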

What success looks like. You have:

  1. src/day37_latent_rollout.py runnable.
  2. A plot, figures/day37_drift.png, showing cosine similarity vs horizon (1 to 30 steps).
  3. A curve that starts near 0.9 (1-step prediction is good) and decays to ~0.3–0.5 by step 30, which marks the drift horizon.
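
To turn the drift curve into a single number, one option is to report the first step at which cosine similarity falls below a threshold. The helper below is a minimal sketch of that idea. The function name is an assumption, the default threshold of 0.5 matches the dashed reference line drawn in the lab script's plot, and the curve used here is synthetic, not a model result.

```python
import math


def drift_horizon(cosines, threshold=0.5):
    """Return the first 1-indexed step whose cosine falls below threshold, or None."""
    for step, value in enumerate(cosines, start=1):
        if value < threshold:
            return step
    return None


# Synthetic curve for illustration only: cosine of a linearly growing angular error.
curve = [math.cos(0.05 * k) for k in range(1, 31)]

print(drift_horizon(curve, threshold=0.5))  # 21
print(drift_horizon(curve, threshold=0.0))  # None (never crosses within 30 steps)
```

Applied to the lab, `cosines` would be the open-loop list produced by the rollout. A result of `None` means the model stayed above the threshold for the whole horizon.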

Step 1 — Implement open-loop rollout (40 min)

# src/day37_latent_rollout.py
"""Day 37: open-loop latent rollout with V-JEPA 2-AC.
Demonstrates the drift horizon — when imagined trajectories diverge from reality.
"""
import numpy as np
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
from transformers import AutoModel, AutoVideoProcessor
import decord

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is present
DTYPE = torch.bfloat16
MODEL_NAME = "facebook/vjepa2-ac-vitl16"
VIDEO_PATH = "../../w3-imitation/runs/act_aloha/eval/videos/episode_0.mp4"
ROLLOUT_STEPS = 30


def main():
    proc = AutoVideoProcessor.from_pretrained(MODEL_NAME)
    model = AutoModel.from_pretrained(MODEL_NAME, torch_dtype=DTYPE).to(DEVICE).eval()

    vr = decord.VideoReader(VIDEO_PATH)
    frames = vr.get_batch(np.arange(ROLLOUT_STEPS + 1)).asnumpy()
    inputs = proc(videos=[frames], return_tensors="pt").to(DEVICE, DTYPE)

    with torch.no_grad():
        all_target_emb = model.encode_video(inputs.pixel_values)
        # Shape: (1, (T+1)*N_patches, D)

    # Open-loop rollout: start from frame 0 embedding, predict next, feed back, repeat
    n_patches_per_frame = all_target_emb.shape[1] // (ROLLOUT_STEPS + 1)
    current_emb = all_target_emb[:, :n_patches_per_frame].clone()  # frame 0
    # One mock zero action (14-dim); the same tensor is reused at every rollout step.
    actions = torch.zeros(1, 1, 14, device=DEVICE, dtype=DTYPE)

    cosines = []
    for t in range(ROLLOUT_STEPS):
        with torch.no_grad():
            next_emb = model.predict_action_conditioned(
                context_emb=current_emb, actions=actions
            )
        actual_emb = all_target_emb[:, (t+1)*n_patches_per_frame : (t+2)*n_patches_per_frame]
        cos = F.cosine_similarity(
            next_emb.float().reshape(-1, next_emb.shape[-1]),
            actual_emb.float().reshape(-1, actual_emb.shape[-1]),
            dim=-1
        ).mean().item()
        cosines.append(cos)
        current_emb = next_emb  # OPEN-LOOP: feed prediction back

    print(f"Step  1 cosine: {cosines[0]:.3f}")
    print(f"Step 10 cosine: {cosines[9]:.3f}")
    print(f"Step 30 cosine: {cosines[29]:.3f}")

    # Closed-loop comparison
    cosines_closed = []
    for t in range(ROLLOUT_STEPS):
        with torch.no_grad():
            current = all_target_emb[:, t*n_patches_per_frame : (t+1)*n_patches_per_frame]
            next_pred = model.predict_action_conditioned(
                context_emb=current, actions=actions
            )
        actual = all_target_emb[:, (t+1)*n_patches_per_frame : (t+2)*n_patches_per_frame]
        c = F.cosine_similarity(
            next_pred.float().reshape(-1, next_pred.shape[-1]),
            actual.float().reshape(-1, actual.shape[-1]),
            dim=-1
        ).mean().item()
        cosines_closed.append(c)

    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(range(1, ROLLOUT_STEPS + 1), cosines, "o-", label="open-loop")
    ax.plot(range(1, ROLLOUT_STEPS + 1), cosines_closed, "s-", label="closed-loop (1-step)")
    ax.axhline(0.5, color="r", linestyle="--", alpha=0.5)
    ax.set_xlabel("Rollout step")
    ax.set_ylabel("Mean cosine similarity")
    ax.set_title("V-JEPA 2-AC drift: open-loop vs closed-loop")
    ax.set_ylim(0, 1)
    ax.legend()
    ax.grid(alpha=0.3)
    plt.tight_layout()
    plt.savefig("../figures/day37_drift.png", dpi=120)
    print("Wrote figures/day37_drift.png")


if __name__ == "__main__":
    main()

Full source continues in the committed curriculum files. The v1.0 page exposes the day flow and lab surface without inventing content.
