Week 6: Frontier Embodiment, Day 36
V-JEPA 2, V-JEPA 2-AC, LeJEPA
This is a valid v1.0 placeholder page for the later curriculum arc. Full interactive lab treatment ships after Week 1 dogfooding.
LECTURE & READING
Glossary primer (15 min)
- JEPA (Joint Embedding Predictive Architecture) — Yann LeCun 2022. An architecture that predicts representations of future states (in embedding space), not pixels. Sidesteps the "predict every pixel" inefficiency of generative world models.
- V-JEPA — Video JEPA. First major instantiation, Meta 2024. Predicts masked video tubelets in DINO-style embedding space.
- V-JEPA 2 — Meta, released June 2025. Scaled-up V-JEPA: 1B+ params, pretrained on 1M+ hours of video.
- V-JEPA 2.1 — Meta Mar 2026. Latest checkpoint with better fine-tuning recipes and added action-conditioning hooks.
- V-JEPA 2-AC — Action-Conditioned variant. Fine-tuned for "given current obs and next action, predict next obs embedding." Usable as a world model for planning.
- LeJEPA — LeRobot's JEPA implementation (HuggingFace, 2025). Smaller, robotics-data-trained.
- Predictor — Small transformer that takes context embeddings + mask → predicts target embeddings.
- Stop-gradient — Applied to the target encoder's output, so no gradients flow back through the targets. Helps prevent collapse, where training settles on a degenerate constant embedding.
- EMA target — The target encoder's weights are an exponential moving average of the online encoder's weights. Standard self-supervised trick (BYOL, DINO, JEPA).
- Tubelet — A small spatiotemporal patch (e.g. 2×16×16) in video. The unit of masking.
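To make the tubelet entry concrete, the sketch below counts how many 2×16×16 tubelets a clip splits into and picks a random subset to mask. The clip size (16 frames at 224×224), the 75% mask ratio, and both function names are assumptions chosen for illustration. Uniform random masking is also a simplification: the V-JEPA paper describes a block-based masking strategy.

```python
import random


def tubelet_grid(frames, height, width, tubelet=(2, 16, 16)):
    """Return how many tubelets fit along (time, height, width)."""
    t, h, w = tubelet
    if frames % t or height % h or width % w:
        raise ValueError("clip dimensions must be divisible by the tubelet size")
    return frames // t, height // h, width // w


def random_tubelet_mask(grid, mask_ratio, seed=0):
    """Pick a random subset of flat tubelet indices to hide from the encoder."""
    n_t, n_h, n_w = grid
    total = n_t * n_h * n_w
    n_masked = int(total * mask_ratio)
    rng = random.Random(seed)  # seeded so the mask is reproducible
    return sorted(rng.sample(range(total), n_masked))


# Assumed clip: 16 frames of 224x224 pixels.
grid = tubelet_grid(frames=16, height=224, width=224)
masked = random_tubelet_mask(grid, mask_ratio=0.75)
print("tubelet grid (t, h, w):", grid)
print("masked tubelets:", len(masked), "of", grid[0] * grid[1] * grid[2])
```

Under these assumptions the clip becomes an 8×14×14 grid of 1,568 tubelets, and the predictor's job is to fill in embeddings for the masked positions.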
Real-world analogy
Traditional video models predict pixels ("what will the next frame look like?"), which is wasteful because most pixels (background, lighting) don't matter for action. JEPA predicts representations ("what will the next frame mean?"): it discards low-level texture noise and focuses on semantic content. Like the difference between transcribing every word in a meeting versus writing meeting minutes.
Hour 1 — Reading
- LeCun's A Path Towards Autonomous Machine Intelligence (2022) — sections on JEPA, ~25 min: https://openreview.net/pdf?id=BZ5a1r-kVsf
- V-JEPA paper, abstract + Section 3 (~20 min): https://arxiv.org/abs/2404.08471
- V-JEPA 2 / 2-AC blog (~15 min): https://ai.meta.com/blog/v-jepa-2-world-model-physical-reasoning/
Hour 2 — LeJEPA codebase
ssh -i ~/.ssh/nebius_key ubuntu@<your-h100-ip>
cd ~ && mkdir -p robo47-wm && cd robo47-wm
uv venv --python 3.12 .venv && source .venv/bin/activate
git clone https://github.com/huggingface/lejepa
cd lejepa
uv pip install -e .
Read these files for ~30 min:
- lejepa/models/predictor.py — the small predictor transformer
- lejepa/models/encoder.py — DINOv3-style ViT encoder
- lejepa/training/loss.py — the L1/L2 prediction loss + stop-gradient logic
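Before opening the loss file, it may help to have the two ideas in plain form. The sketch below is not the repository's code. It is a dependency-free illustration of an L1 prediction loss and an EMA target update, with hypothetical function names and toy vectors standing in for embeddings and weights.

```python
def l1_loss(predicted, target):
    """Mean absolute error between two equal-length embedding vectors."""
    if len(predicted) != len(target):
        raise ValueError("embeddings must have the same length")
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)


def ema_update(target_weights, online_weights, momentum=0.99):
    """Move each target weight a small step toward the online weight."""
    return [
        momentum * t + (1.0 - momentum) * o
        for t, o in zip(target_weights, online_weights)
    ]


# Toy values. In an autograd framework the target embedding would be
# detached (stop-gradient), so only the online side receives gradients.
loss = l1_loss([0.5, -1.0, 2.0], [0.0, -1.0, 1.0])
new_target = ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.9)
print("loss:", loss)
print("updated target weights:", new_target)
```

The target weights never receive gradients in this scheme. They only drift toward the online weights through the EMA step, which is why the two mechanisms are usually described together.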
LAB
Hour 3 — Lab: V-JEPA 2-AC zero-shot inference (75 min)
What you're building. Run V-JEPA 2-AC on a short video clip from one of your imitation-learning rollouts (Day 16's ACT eval video). Use it to predict the embedding of the next frame given current frame + next action, then verify the prediction matches the real next frame's embedding within a small tolerance.
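The shape of that evaluation loop can be sketched without a checkpoint. Everything named below is hypothetical: `encode` and `predict` stand in for whatever the V-JEPA 2-AC loading code exposes, and the toy frames and actions exist only so the sketch runs end to end.

```python
import random


def sample_frame_indices(num_frames, num_pairs, seed=0):
    """Choose random indices t so that (t, t + 1) are consecutive frames."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_frames - 1), num_pairs))


def collect_predictions(frames, actions, encode, predict, indices):
    """Return (predicted, actual) next-frame embedding pairs."""
    pairs = []
    for t in indices:
        z_now = encode(frames[t])
        z_pred = predict(z_now, actions[t])  # action-conditioned step
        z_next = encode(frames[t + 1])       # embedding of the real next frame
        pairs.append((z_pred, z_next))
    return pairs


# Toy stand-ins (assumptions): a "frame" is a 2-vector, the encoder is the
# identity, and the predictor adds the action to the current embedding.
def encode(frame):
    return list(frame)


def predict(z, action):
    return [zi + ai for zi, ai in zip(z, action)]


frames = [[float(i), 2.0 * i] for i in range(50)]
actions = [[1.0, 2.0] for _ in range(49)]

indices = sample_frame_indices(len(frames), num_pairs=10)
pairs = collect_predictions(frames, actions, encode, predict, indices)
print("sampled frame indices:", indices)
```

In the real lab the stand-ins would be replaced by the model's encoder and action-conditioned predictor, while the sampling and pairing logic stays the same.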
What success looks like at the end. You have:
1. w6-frontier/src/day36_vjepa2_ac.py runnable.
2. Console output: predicted-vs-actual embedding cosine similarity ≥ 0.85 across 10 random frame pairs.
3. Plot figures/day36_jepa_pred_quality.png showing cosine-similarity distribution; should be tightly clustered above 0.8.
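The success check above reduces to a cosine similarity per frame pair plus a threshold. The sketch below is one way to compute it with the standard library only. The helper names are hypothetical, and it assumes the 0.85 criterion applies to every pair rather than to the mean.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def summarize(similarities, threshold=0.85):
    """Report min/mean/max and whether every pair clears the threshold."""
    return {
        "min": min(similarities),
        "mean": sum(similarities) / len(similarities),
        "max": max(similarities),
        "all_pass": all(s >= threshold for s in similarities),
    }


# Toy (predicted, actual) pairs; real ones come from the model.
toy_pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical direction
    ([1.0, 1.0], [1.0, 0.9]),  # nearly aligned
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal, should fail the threshold
]
sims = [cosine_similarity(p, a) for p, a in toy_pairs]
report = summarize(sims)
print(report)
```

The same list of similarities can be fed to a histogram to produce the plot in item 3.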
Step 1 — Download V-JEPA 2-AC checkpoint (15 min)
huggingface-cli download facebook/vjepa2-ac-vitl16 --local-dir checkpoints/vjepa2_ac
ls -la checkpoints/vjepa2_ac/
Full source continues in the committed curriculum files. The v1.0 page exposes the day flow and lab surface without inventing content.