Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Mengyuan Yan, Andy Zeng
ARCHITECTURE
THE PROBLEM
Before SayCan, there were two separate worlds that couldn't talk to each other. On one side, you had large language models (GPT-3-era models) that could generate plausible sequences of actions: "to clean a spill, you would find a towel, wet it, and wipe the area." On the other side, you had roboticists who had trained low-level skills through reinforcement learning (RL) or imitation learning (IL): discrete behaviors like "pick up object" or "go to location." The problem: language models have no idea what your specific robot can actually do. When researchers asked a language model "I spilled my drink, can you help?" it would generate responses like "You could try using a vacuum cleaner" or even "I'm sorry, I didn't mean to spill it." These are reasonable sentences, but they're completely infeasible for a mobile manipulator in a kitchen. The core issue is that language models are trained on text from the internet, which tells them nothing about your robot's embodiment or about what is physically possible given its current state and environment. Prior work either used language models without grounding (producing infeasible plans) or manually engineered task decompositions (requiring tedious human specification for every new instruction).
HOW IT WORKS
Query the Language Model for Feasible Next Steps
Weight Feasibility with Value Functions
Iteratively Plan and Execute
Scale Performance with Better Language Models
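The first two steps above can be sketched in a few lines. SayCan scores each candidate skill by multiplying the language model's estimate that the skill is a useful next step by the value function's estimate that the skill can succeed from the current state, then picks the highest-scoring skill. The skill names and every number below are invented for illustration; in a real system the first dictionary would come from language model likelihoods and the second from learned value functions.

```python
from typing import Dict


def select_skill(llm_prob: Dict[str, float], affordance: Dict[str, float]) -> str:
    """Return the skill with the highest combined score.

    llm_prob[s]:   how useful the language model rates skill s as the next step.
    affordance[s]: estimated probability that skill s succeeds from the current state.
    """
    combined = {skill: llm_prob[skill] * affordance[skill] for skill in llm_prob}
    return max(combined, key=combined.get)


# Hypothetical scores for the instruction "I spilled my drink, can you help?"
llm_prob = {
    "pick up the sponge": 0.50,  # sounds right, but no sponge is in view yet
    "find a sponge": 0.40,
    "go to the table": 0.10,
}
affordance = {
    "pick up the sponge": 0.05,  # value function: unlikely to succeed right now
    "find a sponge": 0.90,
    "go to the table": 0.95,
}

llm_only_choice = max(llm_prob, key=llm_prob.get)
grounded_choice = select_skill(llm_prob, affordance)
print("language model alone:", llm_only_choice)
print("with affordance weighting:", grounded_choice)
```

Multiplying the two scores means a skill has to be both relevant and feasible to win; either factor near zero rules it out.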
MORE DEMONSTRATIONS
KEY RESULTS
84% planning success, vs. ~60-70% with FLAN (the initial language model)
This measures how often the language model, weighted by value functions, selects the right next skill for a long-horizon task. 84% means that for most complex instructions, the system is choosing semantically and physically appropriate actions. Against a baseline of 60-70%, the error rate falls from 30-40% to 16%, which roughly halves it and shows that scaling the language model substantially improves robot planning.
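Relative error reduction compares error rates (one minus the success rate) rather than success rates. A quick check, assuming the FLAN baseline falls somewhere in the 60-70% range quoted above; the helper name is ours, not the paper's.

```python
def relative_error_reduction(baseline_success: float, new_success: float) -> float:
    """Fraction of the baseline's errors that the new system eliminates."""
    baseline_error = 1.0 - baseline_success
    new_error = 1.0 - new_success
    return (baseline_error - new_error) / baseline_error


# Evaluate both ends of the quoted baseline range against the 84% result.
for baseline in (0.60, 0.70):
    reduction = relative_error_reduction(baseline, 0.84)
    print(f"baseline {baseline:.0%} -> 84%: {reduction:.0%} of errors removed")
```

With these inputs the reduction lands between roughly 47% and 60%, depending on which end of the baseline range is used.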
74% execution success, vs. roughly 50% with FLAN
Even if the right skill is selected, the robot still has to physically execute it. 74% means roughly 3 out of 4 times, the selected skill executes successfully in the real world. This accounts for failures in both the skill's value function (overestimating success likelihood) and the skill itself. The fact that this improves with better language models suggests that semantic understanding of context helps: when the language model better understands the task, it requests skills at moments when they're most likely to succeed.
vs. short-horizon or language-free baselines
The system was tested on real kitchen and office tasks that required multiple steps across different spatial locations (navigate, pick, place, return). These are genuinely complex: "bring me a Coke" requires finding an object in an unknown location, picking it, and bringing it to the user. This isn't toy manipulation; it's the kind of instruction a person would actually give a home robot.
PERFORMANCE COMPARISON
WHY DEVELOPERS SHOULD CARE
If you're building robotics software, SayCan teaches you something fundamental: don't try to make one component do everything. Language models are incredible at semantic reasoning but terrible at physical grounding. Value functions excel at environment-specific decision-making but can't understand abstract intentions. By combining them, each compensates for the other's weaknesses. More practically, this means your natural language interface for robots doesn't have to be engineered from scratch. You can leverage pretrained language models off the shelf (OpenAI's API, open-source models like LLaMA) and combine them with whatever skills you've already trained via imitation learning (IL) or reinforcement learning (RL). The framework is general in principle: it is not tied to a particular skill set or language model.
The second major insight is about iteration and grounding. Many roboticists try to plan everything upfront ("predict the entire sequence of 10 steps and then execute"), but SayCan plans one step at a time in a loop, constantly checking: "can the robot actually do this right now?" This is more robust because plans adapt to reality. If something goes wrong, the next iteration re-scores every skill against the actual state and can redirect. For a software developer unfamiliar with robotics, this is counterintuitive: we're used to planning everything and executing plans faithfully. But robots live in the real world where things fail, slip, or move unexpectedly. Grounded, iterative planning is the right mental model.
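To make the loop concrete, here is a toy closed-loop run in which the first grasp attempt slips. Everything in it is a hypothetical stand-in: the three skill names, the hand-written scores, and the two-flag world state. In this sketch the language model stand-in sees only the step history, and the real state enters through the affordance estimates, so the failed grasp is caught by the affordance term and retried.

```python
from typing import Dict, List

PICK = "pick up the sponge"
BRING = "bring the sponge to the user"
DONE = "done"


def llm_scores(history: List[str]) -> Dict[str, float]:
    """Stand-in for language model scoring; it sees only the step history."""
    last = history[-1] if history else None
    if last is None:
        return {PICK: 0.7, BRING: 0.2, DONE: 0.1}
    if last == PICK:
        return {PICK: 0.2, BRING: 0.7, DONE: 0.1}
    return {PICK: 0.1, BRING: 0.1, DONE: 0.8}


def affordances(state: Dict[str, bool]) -> Dict[str, float]:
    """Stand-in for value functions; estimates depend on the observed state."""
    holding, delivered = state["holding_sponge"], state["delivered"]
    return {
        PICK: 0.0 if holding else 0.9,
        BRING: 0.9 if holding else 0.05,
        DONE: 1.0 if delivered else 0.05,
    }


def execute(skill: str, state: Dict[str, bool], slips: bool) -> None:
    """Apply the skill's effect to the toy state unless the gripper slips."""
    if skill == PICK and not slips:
        state["holding_sponge"] = True
    elif skill == BRING and state["holding_sponge"]:
        state["holding_sponge"] = False
        state["delivered"] = True


def run_closed_loop(max_steps: int = 10) -> List[str]:
    """Pick one skill per iteration by combined score, execute it, then re-score."""
    state = {"holding_sponge": False, "delivered": False}
    history: List[str] = []
    pick_attempts = 0
    for _ in range(max_steps):
        lm, aff = llm_scores(history), affordances(state)
        skill = max(lm, key=lambda s: lm[s] * aff[s])
        history.append(skill)
        if skill == DONE:
            break
        slips = False
        if skill == PICK:
            pick_attempts += 1
            slips = pick_attempts == 1  # inject a single failure on the first pick
        execute(skill, state, slips)
    return history


steps = run_closed_loop()
print(steps)
```

A plan fixed upfront from the language scores alone would go straight from the first pick to the bring step. The loop instead selects the pick skill a second time, because the affordance for bringing stays low while the gripper is empty.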
LIMITATIONS
SayCan has real limitations, and they are worth understanding. First, it requires pretrained skills: you need a library of behaviors like "pick up," "go to location," "open drawer" already trained and working. The paper doesn't address how to acquire these skills or how to build them for new robots or tasks; that's assumed solved. Second, value functions sometimes overestimate success likelihood, leading the robot to attempt skills that then fail. The 74% success rate means 26% of selected skills fail in execution, which is significant. Third, the system struggles with novel or compositionally complex instructions that require skills it wasn't trained for. If the robot has never learned to "tie a knot" or "wash a cup," no amount of language understanding will help. Finally, the approach is demonstrated on a specific mobile manipulator in controlled environments (kitchens, offices). Generalization to different robot morphologies or highly dynamic environments isn't explored. The language model also can't correct for systematic biases in its training: if the training data has stereotypes or incorrect assumptions about how to accomplish tasks, those persist in the robot's behavior.
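One simple way to see whether a value function is overestimating success is to compare its predictions with logged outcomes. The sketch below is not from the paper: the log format and the helper name are assumptions, and the four entries are made-up data.

```python
from statistics import mean
from typing import List, Tuple


def overestimation_gap(log: List[Tuple[float, bool]]) -> float:
    """Mean predicted success minus observed success rate.

    A positive value means the value function was overconfident on this log.
    """
    predicted = mean(p for p, _ in log)
    observed = mean(1.0 if succeeded else 0.0 for _, succeeded in log)
    return predicted - observed


# Hypothetical (predicted success, actual outcome) pairs for one skill.
log = [(0.9, True), (0.9, False), (0.8, True), (0.8, False)]
gap = overestimation_gap(log)
print(f"overestimation gap: {gap:.2f}")
```

A gap near zero suggests the estimates can be trusted as feasibility scores. A large positive gap means the planner will keep selecting skills that are unlikely to work.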
WHAT COMES NEXT
The trajectory is clear: tighter integration with foundation models and better value functions. The paper shows that simply scaling the language model (FLAN → PaLM) yields major improvements, so we should expect continued gains as GPT-4-class models and beyond are applied. The next frontier is learning value functions online: currently they're fixed after training, but a robot that updates its affordance estimates as it explores would be far more capable. Another direction is moving beyond single-robot execution to multi-robot coordination: can a language model coordinate multiple robots with different skill sets? The 2022 update teases "chain of thought prompting," suggesting the field is experimenting with having the language model verbalize its reasoning about why a particular sequence of steps makes sense, which could improve transparency and robustness. We'll also see this pattern, grounded language models for robotics, applied to navigation-only robots, manipulation-only arms, and humanoid systems. The core principle (language models for semantic planning + environment-grounded value functions for execution) is agnostic to embodiment.