Modern perception: DINOv3, SAM 3, Depth-Anything V2

This is a valid v1.0 placeholder page for the later curriculum arc. Full interactive lab treatment ships after Week 1 dogfooding.

LECTURE & READING

Glossary primer (12 min)

Vision Foundation Model (VFM) — A large pre-trained vision model whose features generalize to many downstream tasks. The 2025–2026 stack: DINOv3, SAM 3, Depth-Anything V2.
DINOv3 — Meta's self-supervised ViT (Aug 2025). The 7B-parameter version produces strong dense features without labels. Used as a frozen backbone in many VLAs and world models.
DINO embedding — Per-patch feature vector (typically 1024-d for ViT-L). The vectors for one image are stacked into a (num_patches, embed_dim) tensor.
SAM 3 — Meta's Segment Anything Model 3 (Nov 2025 release): promptable instance + semantic segmentation across images and videos. Successor to SAM 2.
Depth-Anything V2 — TikTok / HKU monocular depth estimation model. Produces dense depth maps from a single image (relative depth by default; metric depth needs the fine-tuned metric checkpoints). Widely used as the depth backbone in 2025–2026.
GeoVF — Geometric Vision Foundation models (umbrella term used in 2025+ papers): MASt3R, DUSt3R, VGGT, Spatial-Tracker. Predict 3D geometry directly from images.
Zero-shot — Apply a model with no task-specific fine-tuning. The default mode for VFMs.
CLS token — A learnable token whose final embedding is used as a global image representation in ViTs.
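
The tensor shapes in the glossary follow from simple arithmetic. A minimal sketch, assuming a square-patch ViT with a patch size of 16 and a 1024-d embedding (both defaults and the helper names `patch_grid` and `feature_shape` are assumptions for illustration; check the config of the checkpoint you actually load, which may also prepend CLS or register tokens):

```python
def patch_grid(height: int, width: int, patch_size: int = 16) -> tuple[int, int]:
    """Return (rows, cols) of the patch grid for an image of the given size."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    return height // patch_size, width // patch_size


def feature_shape(height: int, width: int,
                  embed_dim: int = 1024, patch_size: int = 16) -> tuple[int, int]:
    """Shape of the per-image patch feature tensor: (num_patches, embed_dim)."""
    rows, cols = patch_grid(height, width, patch_size)
    return rows * cols, embed_dim


print(patch_grid(224, 224))     # (14, 14)
print(feature_shape(224, 224))  # (196, 1024)
```

Doubling the image side quadruples the number of patches, which is why dense-feature extraction cost grows quickly with resolution.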

Real-world analogy

A vision foundation model is a polymath who's read every book ever written but hasn't done your specific homework. Ask them to "describe the cat in this photo" and they'll do it cold. Ask them to "diagnose this MRI" and they'll need a few hundred examples first: that is fine-tuning. The remarkable thing about 2024–2026 VFMs is how much they can do zero-shot.

Hour 1 — Concept

Pick from these (~50 min total):

DINOv3 paper, just abstract + Sec. 3 (~15 min): https://arxiv.org/abs/2508.10104
SAM 3 release notes / blog: https://ai.meta.com/research/publications/sam-3
Depth-Anything V2 paper, abstract + figures (~10 min): https://arxiv.org/abs/2406.09414
Yannic Kilcher's video on the original DINO (still highly relevant)

Hour 2 — Set up the perception stack

Provision your Nebius H100 instance for the day.

ssh -i ~/.ssh/nebius_key ubuntu@<your-instance-ip>

cd ~
mkdir -p robo47-percep && cd robo47-percep
uv venv --python 3.12 .venv
source .venv/bin/activate

uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
uv pip install transformers accelerate timm pillow matplotlib numpy
uv pip install opencv-python-headless einops

Test the stack:

python -c "
import torch
print(f'torch {torch.__version__}, cuda {torch.cuda.is_available()}, dev {torch.cuda.get_device_name(0)}')
"

Expected (the exact torch version depends on what the index resolves at install time):

torch 2.5.x+cu124, cuda True, dev NVIDIA H100 80GB HBM3
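
Before moving on, it can help to confirm that every package installed above is importable. A minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the list reflects the import names (not the pip names) of the packages from the install commands:

```python
import importlib.util

# Import names for the packages installed above. The pip name and the import
# name differ for pillow (PIL) and opencv-python-headless (cv2).
REQUIRED_MODULES = [
    "torch", "torchvision", "transformers", "accelerate", "timm",
    "PIL", "matplotlib", "numpy", "cv2", "einops",
]


def missing_modules(names: list[str]) -> list[str]:
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("missing:", ", ".join(missing))
else:
    print("all perception-stack modules found")
```

This only checks that the modules can be located; it does not import them, so it will not catch a broken CUDA build.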

LAB

Hour 3 — Lab: zero-shot perception pipeline on a household scene (90 min)

What you're building. A single Python script that takes one input image (a cluttered desk or kitchen) and produces:

1. DINOv3 patch embeddings visualized via PCA-RGB.
2. SAM 3 auto-segmentation of every visible instance.
3. Depth-Anything V2 depth.
4. A 4-panel composite figure: input | DINO features | SAM masks | depth heatmap.
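
The PCA-RGB visualization in item 1 can be sketched independently of any model. The example below is a sketch under stated assumptions, not the lab's committed implementation: it uses random features as a stand-in for real DINOv3 patch embeddings, and the function name `pca_rgb` and the 14 × 14 grid are hypothetical. It projects each patch vector onto the top three principal components and rescales each component to [0, 1] so it can be shown as an RGB image.

```python
import numpy as np


def pca_rgb(features: np.ndarray, grid_hw: tuple[int, int]) -> np.ndarray:
    """Project (N, D) patch features to a (rows, cols, 3) image in [0, 1]."""
    rows, cols = grid_hw
    if features.shape[0] != rows * cols:
        raise ValueError("number of patches must equal rows * cols")
    # Center the features, then take the top-3 principal directions via SVD.
    centered = features - features.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:3].T  # shape (N, 3)
    # Min-max scale each component independently so it maps to one color channel.
    lo = projected.min(axis=0, keepdims=True)
    hi = projected.max(axis=0, keepdims=True)
    scaled = (projected - lo) / np.maximum(hi - lo, 1e-8)
    return scaled.reshape(rows, cols, 3)


# Stand-in for real patch embeddings: 14 x 14 patches, 1024-d each.
rng = np.random.default_rng(0)
fake_features = rng.normal(size=(14 * 14, 1024)).astype(np.float32)
rgb = pca_rgb(fake_features, (14, 14))
print(rgb.shape)  # (14, 14, 3)
```

With real features, the resulting low-resolution image is usually upsampled to the input size (for example with nearest-neighbor interpolation) before it goes into the composite figure.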

What success looks like at the end. You have:

1. w2-systems/src/day13_percep_stack.py runnable end-to-end on a single image.
2. Composite figure figures/day13_percep_stack.png showing all 4 panels at 4 × 256 px each.
3. Console output: shape of DINO features (N_patches, 1024), number of SAM masks (typically 10–30), depth range (in meters only with a metric-depth checkpoint; the default output is relative depth).
4. Wall-clock time per model logged: DINO < 200 ms, SAM 3 < 1 s, Depth-Anything V2 < 200 ms on H100.
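
One way to log the per-model wall-clock times is a small timing context manager. This is a sketch, not the lab's committed code: the names `timed` and `timings_ms` are hypothetical, and the sleep stands in for a model forward pass. The optional `sync` callback exists because CUDA kernels launch asynchronously; on a GPU you would pass `torch.cuda.synchronize` so the timer waits for the work to finish.

```python
import time
from contextlib import contextmanager

# Stage name -> elapsed wall-clock time in milliseconds.
timings_ms: dict[str, float] = {}


@contextmanager
def timed(stage: str, sync=None):
    """Record the wall-clock time of the enclosed block under `stage`.

    `sync` is an optional zero-argument callable invoked before starting and
    before stopping the timer (e.g. torch.cuda.synchronize on a GPU).
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    try:
        yield
    finally:
        if sync is not None:
            sync()
        timings_ms[stage] = (time.perf_counter() - start) * 1000.0


# Stand-in workload; replace with a real forward pass in the lab script.
with timed("dummy_stage"):
    time.sleep(0.01)

for stage, ms in timings_ms.items():
    print(f"{stage}: {ms:.1f} ms")
```

Run each model once before timing it, since the first call typically includes one-off costs such as weight loading and kernel compilation.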

Step 1 — Get a test image (5 min)

Either upload a photo or download a public one:

Full source continues in the committed curriculum files. The v1.0 page exposes the day flow and lab surface without inventing content.
