EgoScale and the data-collection paradigm shift

This is a valid v1.0 placeholder page for the later curriculum arc. Full interactive lab treatment ships after Week 1 dogfooding.

LECTURE & READING

Glossary primer (10 min)

EgoScale — Feb 2026 paper (Meta + collaborators). A massive egocentric video dataset (10,000+ hours) with synchronized hand pose, gaze, and language. Targeted at generalist VLAs.
Egocentric video — First-person video, e.g. from head-mounted GoPro or smart glasses. Captures from the actor's POV.
Project Aria — Meta's research smart-glasses platform. EgoScale uses Aria-style devices.
Labels from video — Use VLMs (Day 33) to auto-label actions from egocentric video. Cheap, but less accurate.
Hand-tracking model — Reconstructs 3D hand pose from RGB video. A standard approach is HaMeR, which predicts the parameters of the MANO parametric hand model.
Why this matters — Until 2025, data collection was bottlenecked by the hours it took (~$50/hour, slow). Egocentric video can be collected at scale (~$0.50/hour from existing footage), a 100× cost reduction.
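
To make the cost claim concrete, here is a small back-of-the-envelope calculation using only the figures quoted in the glossary ($50/hour, $0.50/hour, 10,000 hours). The function and variable names are illustrative, and the 10,000-hour figure is treated as a lower bound on dataset size.

```python
# Back-of-the-envelope cost comparison using the figures quoted in the glossary.
COST_PER_HOUR_BEFORE = 50.00  # USD per hour, the pre-2025 figure quoted above
COST_PER_HOUR_EGO = 0.50      # USD per hour, egocentric video from existing footage
DATASET_HOURS = 10_000        # lower bound on the stated EgoScale size


def dataset_cost(hours: float, cost_per_hour: float) -> float:
    """Total collection cost in USD for a dataset of the given size."""
    return hours * cost_per_hour


cost_before = dataset_cost(DATASET_HOURS, COST_PER_HOUR_BEFORE)
cost_ego = dataset_cost(DATASET_HOURS, COST_PER_HOUR_EGO)
reduction = cost_before / cost_ego

print(f"Before:     ${cost_before:,.0f}")
print(f"Egocentric: ${cost_ego:,.0f}")
print(f"Reduction:  {reduction:.0f}x")
```

At 10,000 hours the gap is $500,000 versus $5,000, which is why the per-hour price matters more than any single modeling choice at this scale.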

Real-world analogy

Pre-EgoScale: train a chef by having them re-cook each recipe 100× while a lab tech holds their hands and records every motion. Post-EgoScale: just film professional chefs doing their job in their kitchens; auto-extract what their hands did. Same data, 100× cheaper.

Hour 1 — Reading

EgoScale paper, abstract + Section 3 (~30 min): https://arxiv.org/abs/2602.xxxxx (search "EgoScale 2026")
Ego4D (predecessor): abstract + figures (~15 min): https://ego4d-data.org/
Project Aria: https://www.projectaria.com/research/

Hour 2 — Inspect EgoScale samples

If the dataset is downloadable:

huggingface-cli download facebook/egoscale-v1 --local-dir data/egoscale --include "samples/*"
ls data/egoscale/samples/

For each sample, look at:
The egocentric MP4 (~30 s)
The hand pose JSON (per-frame 3D positions)
The auto-generated labels
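
When looking at the hand pose JSON, a quick sanity check helps before trusting the data. The sketch below assumes a hypothetical schema (a top-level "fps" and a "frames" list, each frame holding a "hands" mapping from "left"/"right" to 21 [x, y, z] points); the real EgoScale layout may differ, so adapt the keys after opening an actual file. It builds a tiny synthetic file so it runs without the download.

```python
import json
import tempfile
from pathlib import Path

# 21 keypoints per hand is a common convention (wrist + 4 joints x 5 fingers).
# Assumption: the real sample files may use a different count or layout.
NUM_KEYPOINTS = 21


def summarize_hand_pose(path):
    """Return basic sanity stats for a per-frame hand-pose JSON file (hypothetical schema)."""
    data = json.loads(Path(path).read_text())
    frames = data["frames"]
    frames_with_hand = {}
    malformed = set()
    for i, frame in enumerate(frames):
        for side, keypoints in frame["hands"].items():
            well_formed = len(keypoints) == NUM_KEYPOINTS and all(
                len(point) == 3 for point in keypoints
            )
            if well_formed:
                frames_with_hand[side] = frames_with_hand.get(side, 0) + 1
            else:
                malformed.add(i)
    return {
        "num_frames": len(frames),
        "duration_s": len(frames) / data["fps"],
        "frames_with_hand": frames_with_hand,
        "malformed_frames": sorted(malformed),
    }


def fake_hand(offset):
    """Synthetic hand: 21 points spread along x, used only as stand-in data."""
    return [[offset + 0.01 * k, 0.0, 0.5] for k in range(NUM_KEYPOINTS)]


sample = {
    "fps": 30,
    "frames": [
        {"hands": {"left": fake_hand(0.0), "right": fake_hand(0.2)}},
        {"hands": {"right": fake_hand(0.2)}},       # left hand out of view
        {"hands": {"right": fake_hand(0.2)[:20]}},  # truncated, should be flagged
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "hand_pose.json"
    sample_path.write_text(json.dumps(sample))
    summary = summarize_hand_pose(sample_path)

print(summary)
```

Checks like these (missing hands, malformed frames) are worth running on every sample, since pose extracted automatically from video will have dropouts.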

LAB

Hour 3 — Lab: extract hand pose from a 30-second clip (60 min)

What you're building. Take an egocentric video clip you record yourself (point your phone at your hands while making a sandwich, ~30 s). Run HaMeR (open-source 3D hand pose reconstruction) on it. Output a 30-second timeline of 3D hand keypoints.
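
One possible shape for the output timeline is sketched below. It does not call HaMeR; it assumes you already have one list of 21 [x, y, z] keypoints per video frame (or None when no hand was detected) and shows how to attach timestamps and derive a simple wrist-speed signal. The names TimelineEntry, build_timeline, and wrist_speeds are illustrative, and treating index 0 as the wrist is an assumption to verify against your pose model's keypoint order.

```python
import math
from dataclasses import dataclass
from typing import List, Optional

Keypoints = List[List[float]]  # 21 points, each [x, y, z]


@dataclass
class TimelineEntry:
    t: float                         # timestamp in seconds
    keypoints: Optional[Keypoints]   # None when no hand was detected


def build_timeline(per_frame, fps):
    """Attach a timestamp (frame index / fps) to each frame's keypoints."""
    return [TimelineEntry(t=i / fps, keypoints=kp) for i, kp in enumerate(per_frame)]


def wrist_speeds(timeline, wrist_index=0):
    """Speed of the wrist between consecutive frames, in pose units per second.

    Returns None for any pair where either frame has no detection.
    Assumption: keypoint 0 is the wrist.
    """
    speeds = []
    for prev, cur in zip(timeline, timeline[1:]):
        if prev.keypoints is None or cur.keypoints is None:
            speeds.append(None)
            continue
        dist = math.dist(prev.keypoints[wrist_index], cur.keypoints[wrist_index])
        speeds.append(dist / (cur.t - prev.t))
    return speeds


def fake_hand(x):
    """Synthetic hand with every keypoint at the same x, used as stand-in data."""
    return [[x, 0.0, 0.5] for _ in range(21)]


# Five frames at 30 fps; the hand moves 0.01 units per frame and is lost in frame 2.
per_frame = [fake_hand(0.00), fake_hand(0.01), None, fake_hand(0.03), fake_hand(0.04)]
timeline = build_timeline(per_frame, fps=30)
speeds = wrist_speeds(timeline)

for entry in timeline:
    status = "hand" if entry.keypoints is not None else "no detection"
    print(f"t={entry.t:.3f}s  {status}")
print(speeds)
```

Keeping missing detections as explicit None entries, rather than dropping those frames, preserves the 30-second time axis so gaps stay visible in any later plot.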

Step 1 — Install HaMeR (15 min)

git clone --recursive https://github.com/geopavlakos/hamer
cd hamer
uv pip install -e ".[all]"
uv pip install -v -e third-party/ViTPose
huggingface-cli download geopavlakos/hamer --local-dir checkpoints/hamer

Step 2 — Record + run (30 min)

Full source continues in the committed curriculum files. The v1.0 page exposes the day flow and lab surface without inventing content.

Papers you will re-read after this

Perceptive humanoid parkour
OmniRetarget — humanoid loco-manipulation
BeyondMimic — guided diffusion humanoid
EgoScale — egocentric dexterous data