LEARNING | FOUNDATIONAL | 2022-04-04

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Mengyuan Yan, Andy Zeng

ARCHITECTURE
language model with pretrained skills and value functions
ROBOT
mobile manipulator
TASK
manipulation, navigation, long-horizon instruction following

Imagine telling a robot "I spilled my drink, can you help?" and having it actually understand what to do: not just generate text like "try using a vacuum cleaner," but pick up a towel, locate the spill, and clean it. That's what SayCan does. This paper tackles a fundamental problem: large language models hold a great deal of semantic knowledge about the world, but they are disconnected from physical reality. A language model trained on internet text knows *what* cleaning a spill means, but it doesn't know *if* your robot can actually do it, or *whether* attempting to pick up a cup will succeed given the current state of the environment. SayCan bridges this gap by combining language models with pretrained real-world skills and their value functions, essentially giving the language model "hands and eyes" grounded in what the robot can actually do. The result: a mobile manipulator that completes long-horizon, abstract natural language instructions in real kitchens and offices.

THE PROBLEM

Before SayCan, there were two separate worlds that couldn't talk to each other. On one side, you had large language models (GPT-3-era models) that could generate plausible sequences of actions: "to clean a spill, you would find a towel, wet it, and wipe the area." On the other side, you had roboticists who had trained low-level skills through reinforcement learning (RL) or imitation learning (IL): discrete behaviors like "pick up object" or "go to location." The problem: language models have no idea what your specific robot can actually do. When researchers asked a language model "I spilled my drink, can you help?" it would generate responses like "You could try using a vacuum cleaner" or even "I'm sorry, I didn't mean to spill it." These are reasonable sentences, but they're infeasible or useless for a mobile manipulator in a kitchen. The core issue is that language models are trained on internet text, which gives them no direct experience of physics, the robot's embodiment, or what's physically possible given your robot's current state and environment. Prior work either used language models without grounding (producing infeasible plans) or manually engineered task decompositions (requiring tedious human specification for every new instruction).

HOW IT WORKS

1

Query the Language Model for Feasible Next Steps

Instead of asking the language model to write a full plan as free text, SayCan uses a dialogue-style prompt. Given a high-level instruction like "Bring me a Coke can," the prompt sets up a response of the form "1. Find a Coke can, 2. Pick it up, 3. Bring it to you, 4. Done." The language model handles this kind of decomposition well because it has seen many human-written instructions broken into steps. However, and this is crucial, the system doesn't let the model generate steps freely and then execute them blindly. Instead, the text description of each skill the robot could perform is treated as a candidate next step, and the language model scores how likely each candidate is as a continuation of the prompt. That score reflects how much the skill makes progress toward the goal, which language models judge well because it is semantic reasoning about plans and instructions.
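The scoring step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `build_prompt`, `toy_log_likelihood`, and `score_skills` are hypothetical names, and the word-overlap scorer is only a stand-in for a real language model that would return the log-probability of each skill description given the prompt.

```python
import math


def build_prompt(instruction, steps_so_far):
    """Format the instruction and already-executed steps as a dialogue-style prompt."""
    lines = [f"Human: {instruction}", "Robot: I would:"]
    for i, step in enumerate(steps_so_far, start=1):
        lines.append(f"{i}. {step}")
    # Leave the next numbered step open for a candidate skill to complete.
    lines.append(f"{len(steps_so_far) + 1}.")
    return "\n".join(lines)


def toy_log_likelihood(prompt, skill_text):
    """HYPOTHETICAL stand-in for a language model.

    A real system would return log p(skill_text | prompt) from the model.
    Here we just count words shared between the prompt and the skill.
    """
    prompt_words = set(prompt.lower().split())
    skill_words = set(skill_text.lower().split())
    return float(len(prompt_words & skill_words))


def score_skills(prompt, skills, log_likelihood=toy_log_likelihood):
    """Score every candidate skill and normalize with a softmax."""
    raw = {skill: log_likelihood(prompt, skill) for skill in skills}
    peak = max(raw.values())
    exp = {skill: math.exp(value - peak) for skill, value in raw.items()}
    total = sum(exp.values())
    return {skill: value / total for skill, value in exp.items()}


skills = ["find a coke can", "pick up the coke can", "go to the table", "done"]
prompt = build_prompt("Bring me a Coke can", steps_so_far=[])
llm_scores = score_skills(prompt, skills)
most_relevant = max(llm_scores, key=llm_scores.get)
```

The point of the design is that the model never emits free text the robot cannot act on: it only ranks descriptions of skills that already exist.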

2

Weight Feasibility with Value Functions

Here's where the magic happens: each of the robot's pretrained skills has an associated value function, a neural network trained to estimate the probability that executing this skill will succeed from the current state. For example, a "pick up object" skill has a value function that looks at the current camera image and predicts: "given the current pose of the gripper and the location of the object, how likely is it that this pick-up will succeed?" SayCan multiplies the language model's score (semantic relevance) by the value function's score (physical feasibility). A skill might be semantically perfect ("pick up the Coke can" absolutely makes sense for the instruction), but if the camera shows the can is too far away or occluded, the value function scores it low, and the system won't select it yet. This combination is the core insight: language models provide world knowledge, value functions provide environment grounding.
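The multiplication described above can be sketched in a few lines. The numbers and the `select_skill` helper are made up for illustration; in a real system the first dictionary would come from the language model and the second from the skills' value functions evaluated on the current observation.

```python
def select_skill(llm_scores, affordance_scores):
    """Combine semantic relevance and physical feasibility, then pick the best skill."""
    combined = {
        skill: llm_scores[skill] * affordance_scores[skill]
        for skill in llm_scores
    }
    best = max(combined, key=combined.get)
    return best, combined


# Illustrative values only: the language model prefers picking up the can...
llm_scores = {
    "pick up the coke can": 0.6,
    "find a coke can": 0.3,
    "go to the table": 0.1,
}
# ...but the value functions say the can is not reachable from here.
affordance_scores = {
    "pick up the coke can": 0.1,
    "find a coke can": 0.9,
    "go to the table": 0.8,
}

best, combined = select_skill(llm_scores, affordance_scores)
```

With these values the semantically preferred skill loses to "find a coke can", which is the behavior the paragraph describes: a relevant but currently infeasible skill is deferred.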

3

Iteratively Plan and Execute

Once a skill is selected (the one with the highest combined score), the robot executes it. Then the process repeats: the system appends the executed skill to the language model's response and queries it again, in effect asking "what should I do next?" This creates a sequential planning loop where each step is chosen in light of the robot's current situation. For "bring me a Coke," this might look like: execute "go to the kitchen," then ask the model again (it might favor "find the Coke can"), execute that, ask again ("pick up the can"), and so on. The system terminates when the selected step is "done." This iterative approach is critical because it means the plan adapts to reality: if the first Coke can is unreachable, the value function will score "pick it up" low, so the system can instead choose "find another Coke can" if the language model also rates it as useful. The robot continuously grounds the language model in its actual situation.
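The loop described above can be sketched as follows. Everything here is a toy under stated assumptions: `saycan_loop`, `toy_llm_score`, `toy_affordance`, and `toy_execute` are hypothetical names, the language-model scorer simply follows a fixed script, and the "world" is a small dictionary rather than a robot.

```python
def saycan_loop(instruction, skills, llm_score, affordance, execute, max_steps=10):
    """Select one skill at a time by combined score until "done" is chosen."""
    history = []
    for _ in range(max_steps):
        scores = {
            skill: llm_score(instruction, history, skill) * affordance(skill)
            for skill in skills
        }
        best = max(scores, key=scores.get)
        if best == "done":
            break
        execute(best)
        history.append(best)  # Fed back into the next language-model query.
    return history


# --- Toy stand-ins (assumptions, not the paper's components) ---
SCRIPT = ["find a coke can", "pick up the coke can", "bring it to you", "done"]
state = {"can_visible": False, "holding": False, "delivered": False}


def toy_llm_score(instruction, history, skill):
    """Pretend language model: favors the next step of a fixed script."""
    expected = SCRIPT[min(len(history), len(SCRIPT) - 1)]
    return 0.7 if skill == expected else 0.1


def toy_affordance(skill):
    """Pretend value functions: success probability given the toy state."""
    if skill == "pick up the coke can":
        return 0.9 if state["can_visible"] else 0.05
    if skill == "bring it to you":
        return 0.9 if state["holding"] else 0.05
    if skill == "done":
        return 1.0
    return 0.9  # "find a coke can" is always feasible in this toy world.


def toy_execute(skill):
    """Pretend robot: executing a skill updates the toy state."""
    if skill == "find a coke can":
        state["can_visible"] = True
    elif skill == "pick up the coke can":
        state["holding"] = True
    elif skill == "bring it to you":
        state["delivered"] = True


plan = saycan_loop("Bring me a Coke can", SCRIPT, toy_llm_score, toy_affordance, toy_execute)
```

Because the affordance is re-evaluated on every iteration, a skill that was infeasible a moment ago can become the best choice as soon as the state changes.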

4

Scale Performance with Better Language Models

A beautiful property of SayCan is that it directly benefits from improvements in language models. The paper's 2022 update integrated Google's PaLM (Pathways Language Model), a substantially larger and more capable language model than the initial FLAN model. With PaLM instead of FLAN, the system improved from selecting the correct sequence of skills ~60-70% of the time to 84%, and successful execution rose to 74%, roughly cutting error rates in half. This isn't a tweak to the algorithm; it's the same approach with a better language model. For developers, this is huge: as language models improve (and they do, rapidly), your robotics system can get better without changes to the planning algorithm.

KEY RESULTS

Correct Skill Sequence Selection (with PaLM): 84%

vs. ~60-70% with FLAN (the initial language model)

This measures how often the language model, weighted by value functions, selects a correct sequence of skills for a long-horizon task. 84% means that for most complex instructions, the system is choosing semantically and physically appropriate actions. Against the ~60-70% baseline, that roughly halves the planning error rate (from 30-40% down to 16%), showing that scaling the language model substantially improves robot planning.
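As a quick check on the size of that improvement, the relative error reduction can be worked out from the figures quoted in this article (84% with PaLM against roughly 60-70% with FLAN). The helper name below is just for illustration.

```python
def relative_error_reduction(old_success, new_success):
    """Fraction of the old error rate that the new system eliminates."""
    old_error = 1.0 - old_success
    new_error = 1.0 - new_success
    return (old_error - new_error) / old_error


# Baseline is quoted as a range, so evaluate both ends of it.
low_baseline = relative_error_reduction(0.60, 0.84)
high_baseline = relative_error_reduction(0.70, 0.84)

print(f"vs. 60% baseline: {low_baseline:.0%} fewer planning errors")
print(f"vs. 70% baseline: {high_baseline:.0%} fewer planning errors")
```

Depending on which end of the baseline range is used, the planning error rate falls by about 47% to 60%, which is consistent with describing the change as roughly halving errors.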

Successful Execution Rate (with PaLM): 74%

vs. roughly 50% with FLAN

Even if the right skills are selected, the robot still has to physically execute them. 74% means that roughly 3 out of 4 instructions are carried out successfully in the real world. The gap between planning and execution accounts for failures in both the value functions (overestimating success likelihood) and the skills themselves. That execution improves with a better language model is expected, since a correct plan is a precondition for a successful run; it may also suggest that when the language model better understands the task, it requests skills at moments when they're more likely to succeed.

Task Complexity Tested: Long-horizon, abstract instructions on a mobile manipulator

vs. short-horizon or language-free baselines

The system was tested on real kitchen and office tasks that required multiple steps across different spatial locations (navigate, pick, place, return). These are genuinely complex: "bring me a Coke" requires finding an object in an unknown location, picking it up, and bringing it to the user. This isn't toy manipulation; it's the kind of instruction a person would actually give a home robot.

WHY DEVELOPERS SHOULD CARE

If you're building robotics software, SayCan teaches you something fundamental: don't try to make one component do everything. Language models are strong at semantic reasoning but weak at physical grounding. Value functions excel at environment-specific decision-making but can't understand abstract intentions. By combining them, each compensates for the other's weaknesses. More practically, this means your natural language interface for robots doesn't have to be engineered from scratch. You can use pretrained language models off the shelf (OpenAI's API, open-source models like LLaMA) and combine them with whatever skills you've already trained via imitation learning (IL) or reinforcement learning (RL). The framework is general: in principle it works with any skill set that comes with success estimates and any language model that can score candidate text.

The second major insight is about iteration and grounding. Many roboticists try to plan everything upfront ("predict the entire sequence of 10 steps and then execute"), but SayCan plans one step at a time in a loop, constantly checking: "can the robot actually do this right now?" This is more robust because plans adapt to reality. If something goes wrong, the value functions see the actual state on the next iteration, and the combined score can redirect the plan. For a software developer unfamiliar with robotics, this is counterintuitive: we're used to planning everything and executing plans faithfully. But robots live in the real world where things fail, slip, or move unexpectedly. Grounded, iterative planning is the right mental model.

LIMITATIONS

SayCan has real limitations that are worth understanding. First, it requires pretrained skills: you need a library of behaviors like "pick up," "go to location," "open drawer" already trained and working. The paper doesn't address how to acquire these skills or how to build them for new robots or tasks; that's assumed solved. Second, value functions sometimes overestimate success likelihood, leading the robot to attempt skills that then fail. The 74% success rate means roughly 26% of instructions are not completed in execution, which is significant. Third, the system struggles with novel or compositionally complex instructions that require skills it wasn't trained for. If the robot has never learned to "tie a knot" or "wash a cup," no amount of language understanding will help. Finally, the approach is demonstrated on a specific mobile manipulator in controlled environments (kitchens, offices). Generalization to different robot morphologies or highly dynamic environments isn't explored. The language model also can't correct for systematic biases in its training: if the training data has stereotypes or incorrect assumptions about how to accomplish tasks, those persist in the robot's behavior.

WHAT COMES NEXT

The trajectory is clear: tighter integration with foundation models and better value functions. The paper shows that simply scaling the language model (FLAN → PaLM) yields major improvements, so we should expect continued gains as GPT-4-class models and beyond are applied. The next frontier is learning value functions online: currently they're fixed after training, but a robot that updates its affordance estimates as it explores would be far more capable. Another direction is moving beyond single-robot execution to multi-robot coordination: can a language model coordinate multiple robots with different skill sets? The 2022 update teases "chain of thought prompting," suggesting the field is experimenting with having the language model verbalize its reasoning about why a particular sequence of steps makes sense, which could improve transparency and robustness. We'll also see this pattern (grounded language models for robotics) applied to navigation-only robots, manipulation-only arms, and humanoid systems. The core principle (language models for semantic planning + environment-grounded value functions for execution) is agnostic to embodiment.

RELATED PAPERS