Day 33

Gemini Robotics + Robot Academy IBVS primer

This is a valid v1.0 placeholder page for the later curriculum arc. Full interactive lab treatment ships after Week 1 dogfooding.

LECTURE & READING

Glossary primer (10 min)

  • Gemini Robotics — Google DeepMind 2025. Robotics adaptation of Gemini-2.0/2.5. Natively multimodal (image + video + audio + text → action).
  • Gemini Robotics-ER — "Embodied Reasoning" variant. Spatial reasoning, scene understanding, plan generation.
  • Gemini Robotics-ER 1.6 — Apr 14, 2026 release. Latest ER. Stronger spatial grounding.
  • Native multimodal — Trained from scratch on mixed image/text/audio/video tokens. Not "text LLM + vision adapter."
  • VLM-as-policy — Use a vision-language model (VLM) directly as a policy by emitting end-effector (EE) deltas in language form (e.g. "move +0.05 m in x"). Gemini Robotics-ER does this.
  • IBVS (Image-Based Visual Servoing) — Classical: drive image features (e.g. the pixel position of an object) to a target by computing the visual-feature Jacobian. Predates VLAs by 30 years; conceptually similar to "policy outputs EE deltas given vision".
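
To make the VLM-as-policy entry concrete, here is a minimal sketch of turning a language-form delta such as "move +0.05 m in x" into a numeric end-effector delta. The command grammar and the helper name parse_ee_delta are assumptions made for illustration; they are not a documented Gemini Robotics-ER output format.

```python
import re

# Hypothetical grammar (an assumption): "move <signed number> m in <x|y|z>".
_DELTA_RE = re.compile(r"move\s+([+-]?\d+(?:\.\d+)?)\s*m\s+in\s+([xyz])", re.IGNORECASE)
_AXIS_INDEX = {"x": 0, "y": 1, "z": 2}


def parse_ee_delta(text: str) -> tuple:
    """Return an (dx, dy, dz) end-effector delta in metres parsed from text."""
    matches = _DELTA_RE.findall(text)
    if not matches:
        raise ValueError(f"no end-effector delta found in: {text!r}")
    delta = [0.0, 0.0, 0.0]
    for magnitude, axis in matches:
        # Several commands in one reply accumulate per axis.
        delta[_AXIS_INDEX[axis.lower()]] += float(magnitude)
    return tuple(delta)


print(parse_ee_delta("move +0.05 m in x"))
```

A real controller would also clamp the parsed delta to a safe step size before sending it to the robot.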

Real-world analogy

Gemini Robotics is "Tesla Autopilot": vertically integrated, proprietary, fed by enormous private data. ER 1.6 is the latest "FSD beta" with sharper spatial reasoning.

Hour 1 — Robot Academy IBVS primer (visual intuition first)

Watch the Visual Servoing masterclass, focusing on the Image-Based VS lessons (~35 min): https://robotacademy.net.au/masterclass/vision-and-motion/

Why now? IBVS predates VLAs by decades, but the conceptual loop ("vision → EE delta") is the same. Modern policies can be read as IBVS with a learned visual-feature Jacobian in place of an analytically derived one. Watching Corke's animated IBVS demos makes "what does Gemini Robotics-ER actually do?" click.
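
The "vision → EE delta" loop can be sketched with the classical IBVS control law v = -gain * L^+ (s - s*), where s is the current image feature, s* the target feature, and L the visual-feature Jacobian (interaction matrix). The sketch below is a deliberately reduced case and rests on stated assumptions: one point feature in normalized image coordinates, camera translation in x/y only, and a known constant depth, so that L = -(1/depth) * I. The function names, gain, and time step are illustrative choices.

```python
def ibvs_velocity(s, s_star, depth, gain):
    """Camera x/y velocity for one point feature.

    With L = -(1/depth) * I, the law v = -gain * L^+ (s - s*)
    reduces to v = gain * depth * (s - s*).
    """
    ex, ey = s[0] - s_star[0], s[1] - s_star[1]
    return (gain * depth * ex, gain * depth * ey)


def simulate(s, s_star, depth=0.5, gain=1.0, dt=0.1, steps=50):
    """Euler-integrate the feature motion s_dot = L v under the IBVS law."""
    for _ in range(steps):
        vx, vy = ibvs_velocity(s, s_star, depth, gain)
        # Feature motion in the image: s_dot = -(1/depth) * v.
        s = (s[0] - vx / depth * dt, s[1] - vy / depth * dt)
    return s


final = simulate(s=(0.2, -0.1), s_star=(0.0, 0.0))
print(final)  # the feature error shrinks by a factor (1 - gain * dt) each step
```

A learned policy replaces the hand-derived L with something fitted from data, but the error-driven structure of the loop stays the same.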

Hour 2 — Reading

LAB

Hour 3 — Lab: Gemini Robotics-ER inference via API (75 min)

What you're building. Use Google's Gemini API (which exposes Gemini Robotics-ER 1.6 publicly as of Apr 2026) to run spatial-reasoning queries on images, then use the responses to drive a simulated Panda arm toward a designated object.
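
One possible piece of glue between a spatial-reasoning response and robot motion is sketched below. It makes no API call. It assumes the response text is a JSON list of objects shaped like {"point": [y, x], "label": "..."} with coordinates normalized to 0-1000; treat that shape as an assumption and check it against the current API documentation. The sample response, the image size, and the helper names are made up for illustration.

```python
import json


def find_point(response_text, label):
    """Return the [y, x] point for the first item with a matching label."""
    for item in json.loads(response_text):
        if item.get("label") == label:
            return item["point"]
    raise KeyError(f"label not found: {label}")


def point_to_pixels(point_yx, width, height, scale=1000.0):
    """Convert an assumed normalized [y, x] point to (u, v) pixel coordinates."""
    y, x = point_yx
    return (x / scale * width, y / scale * height)


def pixel_error(uv, width, height):
    """Offset of the detected point from the image centre, in pixels."""
    return (uv[0] - width / 2, uv[1] - height / 2)


# Made-up sample response, not real model output.
sample = '[{"point": [500, 250], "label": "red cube"}, {"point": [100, 900], "label": "mug"}]'
uv = point_to_pixels(find_point(sample, "red cube"), width=640, height=480)
print(uv, pixel_error(uv, width=640, height=480))
```

The pixel error is the quantity an IBVS-style loop would drive toward zero to centre the designated object in the camera view.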

Step 1 — Set up API key (10 min)

uv pip install google-generativeai
export GOOGLE_API_KEY="<your-key>"  # from https://ai.google.dev

Step 2 — Spatial reasoning query (30 min)

Full source continues in the committed curriculum files. The v1.0 page exposes the day flow and lab surface without inventing content.

Completion controls unlock when this day graduates from placeholder to full lab.