Day 13
Modern perception: DINOv3, SAM 3, Depth-Anything V2
This is a valid v1.0 placeholder page for the later curriculum arc. Full interactive lab treatment ships after Week 1 dogfooding.
LECTURE & READING
Glossary primer (12 min)
- Vision foundation model (VFM) — A large pretrained vision model whose features generalize to many downstream tasks. The 2025–2026 stack: DINOv3, SAM 3, Depth-Anything V2.
- DINOv3 — Meta's self-supervised ViT family (Aug 2025). The 7B-parameter version produces strong dense features without labels. Used as a frozen backbone in many VLAs and world models.
- DINO embedding — Per-patch feature vector (typically 1024-d for ViT-L). The patch vectors are stacked into a (num_patches, embed_dim) tensor per image.
- SAM 3 — Meta's Segment Anything Model 3 (Nov 2025 release), promptable instance and semantic segmentation across images and videos. Successor to SAM 2.
- Depth-Anything V2 — TikTok / HKU monocular depth estimation from a single RGB image. The base checkpoints predict relative depth; the fine-tuned metric variants output depth in meters. Widely used as the depth backbone in 2025–2026.
- GeoVF — Geometric Vision Foundation models (umbrella term used in 2025+ papers): MASt3R, DUSt3R, VGGT, Spatial-Tracker. Predict 3D geometry directly from images.
- Zero-shot inference — Applying a model with no task-specific training. The default mode for VFMs.
- CLS token — A learnable token whose final embedding is used as a global image representation in ViTs.
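The (num_patches, embed_dim) shape in the DINO embedding entry follows directly from the patch grid. The sketch below is a minimal illustration that loads no model. It assumes a 16-pixel patch size and a 1024-d embedding (ViT-L), and the helper names `patch_grid` and `feature_shape` are hypothetical. Any CLS or other prefix tokens a real backbone prepends are not counted here.

```python
def patch_grid(height, width, patch_size=16):
    """Return (rows, cols) of the patch grid for an image of the given size."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    return height // patch_size, width // patch_size


def feature_shape(height, width, embed_dim=1024, patch_size=16):
    """Shape of the stacked per-patch features: (num_patches, embed_dim)."""
    rows, cols = patch_grid(height, width, patch_size)
    return rows * cols, embed_dim


if __name__ == "__main__":
    # A 224 x 224 input with 16-pixel patches gives a 14 x 14 grid.
    print(feature_shape(224, 224))  # (196, 1024)
```

Knowing the grid size up front is what lets you reshape the flat patch features back into an image-like map for visualization.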
Real-world analogy
A vision foundation model is a polymath who's read every book ever written but hasn't done your specific homework. Ask them to "describe the cat in this photo" and they'll do it cold. Ask them to "diagnose this MRI" and they'll need a few hundred examples first: that is fine-tuning. The remarkable thing about 2024–2026 VFMs is how much they can do zero-shot.
Hour 1 — Concept
Pick from these (~50 min total):
- DINOv3 paper, just abstract + Sec. 3 (~15 min): https://arxiv.org/abs/2508.10104
- SAM 3 release notes / blog: https://ai.meta.com/research/publications/sam-3
- Depth-Anything V2 paper, abstract + figures (~10 min): https://arxiv.org/abs/2406.09414
- Yannic Kilcher's video on the original DINO (still highly relevant)
Hour 2 — Set up the perception stack
Provision your Nebius H100 instance for the day.
ssh -i ~/.ssh/nebius_key ubuntu@<your-instance-ip>
cd ~
mkdir -p robo47-percep && cd robo47-percep
uv venv --python 3.12 .venv
source .venv/bin/activate
uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
uv pip install transformers accelerate timm pillow matplotlib numpy
uv pip install opencv-python-headless einops
Test the stack:
python -c "
import torch
print(f'torch {torch.__version__}, cuda {torch.cuda.is_available()}, dev {torch.cuda.get_device_name(0)}')
"
Expected:
torch 2.5.x+cu124, cuda True, dev NVIDIA H100 80GB HBM3
LAB
Hour 3 — Lab: zero-shot perception pipeline on a household scene (90 min)
What you're building. A single Python script that takes one input image (a cluttered desk or kitchen) and produces:
1. DINOv3 patch embeddings visualized via PCA-RGB.
2. SAM 3 auto-segmentation of every visible instance.
3. Depth-Anything V2 metric depth.
4. A 4-panel composite figure: input | DINO features | SAM masks | depth heatmap.
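The PCA-RGB view of the patch embeddings can be sketched as follows. This is a minimal illustration, not the committed lab code. It assumes the features are already available as a NumPy array of shape (num_patches, embed_dim), and it uses random values as a stand-in for real DINOv3 output. The function name `pca_rgb` is hypothetical.

```python
import numpy as np


def pca_rgb(features, grid_h, grid_w):
    """Project (num_patches, embed_dim) features onto their top 3 principal
    components and rescale each component to [0, 1] for display as RGB."""
    if features.shape[0] != grid_h * grid_w:
        raise ValueError("num_patches must equal grid_h * grid_w")
    # Center the features so the SVD directions are principal components.
    centered = features - features.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:3].T  # (num_patches, 3)
    # Min-max normalize each of the three channels independently.
    lo = projected.min(axis=0)
    hi = projected.max(axis=0)
    rgb = (projected - lo) / np.maximum(hi - lo, 1e-8)
    return rgb.reshape(grid_h, grid_w, 3)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_features = rng.normal(size=(14 * 14, 1024))  # stand-in, not real DINO output
    image = pca_rgb(fake_features, 14, 14)
    print(image.shape)  # (14, 14, 3)
```

The sign of each principal component is arbitrary, so the colors have no fixed meaning. What matters is that patches with similar features end up with similar colors.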
What success looks like at the end. You have:
1. w2-systems/src/day13_percep_stack.py runnable end-to-end on a single image.
2. Composite figure figures/day13_percep_stack.png showing all 4 panels at 4 × 256 px each.
3. Console output: shape of DINO features (N_patches, 1024), number of SAM masks (typically 10–30), depth range in meters.
4. Wall-clock time per inference logged: DINO < 200 ms, SAM 3 < 1 s, Depth-Anything V2 < 200 ms on H100.
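One way to log the per-model wall-clock times is a small timing context manager. The sketch below uses only the standard library. The name `timed` and its `sync` parameter are hypothetical. CUDA kernels launch asynchronously, so when timing GPU inference you would pass `torch.cuda.synchronize` as `sync`; otherwise the timer can stop before the work has finished.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results, sync=None):
    """Store the elapsed wall-clock time in milliseconds in results[label].

    sync is an optional zero-argument callable invoked before starting and
    before stopping the clock (for example torch.cuda.synchronize on a GPU).
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    try:
        yield
    finally:
        if sync is not None:
            sync()
        results[label] = (time.perf_counter() - start) * 1000.0


if __name__ == "__main__":
    timings = {}
    with timed("dummy_step", timings):
        time.sleep(0.05)  # stand-in for a model forward pass
    for name, ms in timings.items():
        print(f"{name}: {ms:.1f} ms")
```

Run one or two untimed warm-up inferences before measuring, since the first call usually includes one-off setup costs.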
Step 1 — Get a test image (5 min)
Either upload a photo or download a public one:
Full source continues in the committed curriculum files. The v1.0 page exposes the day flow and lab surface without inventing content.