LEARNING | FOUNDATIONAL | 2022-04-04

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Mengyuan Yan, Andy Zeng

ARCHITECTURE

language model with pretrained skills and value functions

ROBOT

mobile manipulator

TASK

manipulation, navigation, long-horizon instruction following

Imagine telling a robot "I spilled my drink, can you help?" and having it actually understand what to do: not just generate text like "try using a vacuum cleaner," but pick up a towel, locate the spill, and clean it. That's what SayCan does. This paper tackles a fundamental problem: large language models have rich semantic knowledge about the world, but they're disconnected from physical reality. A language model trained on internet text knows *what* cleaning a spill means, but it doesn't know *if* your robot can actually do it, or *whether* attempting to pick up a cup will succeed given the current state of the environment. SayCan bridges this gap by combining language models with real-world skills and value functions, essentially giving the language model "hands and eyes" grounded in what the robot can actually do. The result: a mobile manipulator that completes long-horizon, abstract natural language instructions in real kitchens and offices.

THE PROBLEM

Before SayCan, there were two separate worlds that couldn't talk to each other. On one side, you had large language models (GPT-3 era models) that could generate plausible sequences of actions: "to clean a spill, you would find a towel, wet it, and wipe the area." On the other side, you had roboticists who had trained low-level skills through reinforcement learning or imitation learning: discrete behaviors like "pick up object" or "go to location." The problem: language models have no idea what your specific robot can actually do. When researchers asked a language model "I spilled my drink, can you help?" it would generate responses like "You could try using a vacuum cleaner" or even "I'm sorry, I didn't mean to spill it." These are reasonable sentences, but they're infeasible for a mobile manipulator in a kitchen. The core issue is that language models are trained on text from the internet, which carries little grounded information about physics or about what's physically possible given your robot's current state and environment. Prior work either used language models without grounding (producing infeasible plans) or relied on manually engineered decompositions (requiring tedious human specification for every new instruction).

HOW IT WORKS

Query the Language Model for Feasible Next Steps

Instead of asking the language model to generate a full plan from scratch, SayCan uses a dialogue structure. Given a high-level instruction like "Bring me a Coke can," the system prompts the language model in a format whose natural continuation is a sequence of sub-tasks: "1. Find a Coke can, 2. Pick it up, 3. Bring it to you, 4. Done." The language model handles this well because it has seen many human-written instructions and their decompositions. However, and this is crucial, the system doesn't execute free-form text blindly. Instead, each step is treated as a candidate skill that the robot could perform. The language model is essentially scoring the likelihood that each skill makes progress toward the goal, which it does well because this is semantic reasoning about plans and instructions.

Weight Feasibility with Value Functions

Here's where the magic happens: each of the robot's pretrained skills has an associated value function, a neural network trained to estimate the probability that executing this skill will succeed from the current state. For example, a "pick up object" skill has a value function that looks at the current camera image and predicts: "given the current pose of the robot and the location of the object, how likely is it that this pick-up will succeed?" SayCan multiplies the language model's score (semantic relevance) by the value function's score (physical feasibility). A skill might be semantically perfect ("pick up the Coke can" absolutely makes sense for the instruction), but if the camera shows the can is too far away or occluded, the value function scores it low, and the system won't select it yet. This combination is the core insight: language models provide world knowledge, value functions provide grounding.
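The multiplication described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SkillScore` class, the `select_skill` helper, and all numeric scores are hypothetical and made up for the example; they are not taken from the paper or from any released code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SkillScore:
    """Hypothetical container for the two scores attached to one candidate skill."""
    skill: str
    llm_prob: float  # language model's score: is this a useful next step?
    value: float     # value function's score: can it succeed from the current state?

    @property
    def combined(self) -> float:
        # Semantic relevance multiplied by physical feasibility.
        return self.llm_prob * self.value


def select_skill(scores: List[SkillScore]) -> SkillScore:
    """Pick the candidate with the highest combined score."""
    return max(scores, key=lambda s: s.combined)


# Made-up numbers: the can is not reachable yet, so "pick up" is
# semantically attractive but scored as physically unlikely to succeed.
candidates = [
    SkillScore("pick up the Coke can", llm_prob=0.60, value=0.05),
    SkillScore("find a Coke can", llm_prob=0.30, value=0.90),
    SkillScore("go to the table", llm_prob=0.10, value=0.95),
]
best = select_skill(candidates)
print(best.skill, round(best.combined, 3))
```

With these illustrative numbers, the language model alone would prefer "pick up the Coke can," but the low feasibility score pushes the combined choice toward "find a Coke can."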

Iteratively Plan and Execute

Once a skill is selected (the one with the highest combined score), the robot executes it. Then the process repeats: the system appends the executed skill to the language model's response and queries it again, asking "what should I do next?" This creates a sequential loop where each step is chosen in light of the steps already taken and the robot's current situation. For "bring me a Coke," this might look like: execute "go to the kitchen," then ask the model again (it might suggest "find the Coke can"), execute that, ask again ("pick up the can"), and so on. The system terminates when the language model outputs "Done." This iterative approach is critical because it means the plan adapts to reality: if the first Coke can is unreachable, the value function will score "pick it up" low, so the system can instead choose "find another Coke can" if the language model suggests it. The loop continuously grounds the language model in the robot's actual situation.
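The select, execute, and re-query loop can be sketched as follows. Everything here is a toy stand-in: `saycan_loop`, `toy_llm_score`, `toy_value_fn`, `toy_execute`, the state dictionary, and the scores are hypothetical names and values chosen for illustration. A real system would replace them with a language model query, learned value functions, and robot skill policies.

```python
from typing import Callable, Dict, List

PLAN = ["find a Coke can", "pick up the Coke can", "bring it to you", "done"]


def saycan_loop(instruction: str, skills: List[str],
                llm_score: Callable[[str, List[str], str], float],
                value_fn: Callable[[str, Dict], float],
                execute: Callable[[str, Dict], Dict],
                state: Dict, max_steps: int = 10) -> List[str]:
    """Repeatedly pick the best-scoring skill, run it, and re-score."""
    history: List[str] = []
    for _ in range(max_steps):
        best = max(
            skills,
            key=lambda s: llm_score(instruction, history, s) * value_fn(s, state),
        )
        history.append(best)  # the chosen skill becomes part of the next query
        if best == "done":
            break
        state = execute(best, state)
    return history


def toy_llm_score(instruction: str, history: List[str], skill: str) -> float:
    # Stand-in for a language model: prefers the next step of a fixed plan.
    next_step = PLAN[min(len(history), len(PLAN) - 1)]
    return 0.7 if skill == next_step else 0.1


def toy_value_fn(skill: str, state: Dict) -> float:
    # Stand-in for learned value functions: feasibility depends on the state.
    if skill == "pick up the Coke can":
        return 0.9 if state["can_visible"] else 0.05
    if skill == "bring it to you":
        return 0.9 if state["holding_can"] else 0.05
    return 0.9


def toy_execute(skill: str, state: Dict) -> Dict:
    # Stand-in for running a skill on the robot; returns the new state.
    new_state = dict(state)
    if skill == "find a Coke can":
        new_state["can_visible"] = True
    elif skill == "pick up the Coke can":
        new_state["holding_can"] = True
    return new_state


result = saycan_loop("Bring me a Coke can", PLAN, toy_llm_score, toy_value_fn,
                     toy_execute, {"can_visible": False, "holding_can": False})
print(result)
```

The `max_steps` cap is a design choice of this sketch so that the loop cannot run forever if "done" is never selected.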

Scale Performance with Better Language Models

A useful property of SayCan is that it directly benefits from improvements in language models. The paper's 2022 update integrated Google's PaLM (Pathways Language Model), a substantially larger and more capable language model than the initial FLAN model. With PaLM instead of FLAN, the system improved from selecting the correct sequence of skills 70% of the time to 84%, and successful execution rose from 61% to 74%, cutting error rates roughly in half. This isn't a tweak to the algorithm; it's the same approach with a better language model. For developers, this is significant: as language models improve (and they do, rapidly), your robotics system can get better without changes to the planning code.

KEY RESULTS

Correct Skill Sequence Selection (with PaLM): 84%

vs. 70% with FLAN (the initial language model)

This measures how often the language model, weighted by value functions, selects the right sequence of skills for a task. 84% means that for most complex instructions, the system is choosing semantically and physically appropriate actions. Compared with FLAN, this cuts the planning error rate roughly in half, showing that scaling the language model substantially improves planning.
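As a rough check of the error-reduction arithmetic, the relative reduction can be computed from the two success rates. The sketch below assumes a 70% FLAN planning rate and the 84% PaLM rate; the function name `relative_error_reduction` is a hypothetical helper, not something from the paper.

```python
def relative_error_reduction(baseline_success: float, new_success: float) -> float:
    """Fraction of the baseline's errors that the new system eliminates."""
    baseline_error = 1.0 - baseline_success
    new_error = 1.0 - new_success
    return (baseline_error - new_error) / baseline_error


# Planning: errors fall from 30% to 16% of instructions.
planning = relative_error_reduction(0.70, 0.84)
print(f"{planning:.1%}")
```

Under these assumed rates the reduction comes out a little under one half, which is why "roughly half" is a fair summary.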

Successful Execution Rate (with PaLM): 74%

vs. 61% with FLAN

Even if the right skill is selected, the robot still has to physically execute it. 74% means that roughly 3 out of 4 times, the selected plan executes successfully in the real world. This accounts for failures in both the skill's value function (overestimating success likelihood) and the execution itself. The fact that this improves with better language models suggests that semantic understanding of context helps: when the language model better understands the task, it requests skills at moments when they're most likely to succeed.

Task Complexity Tested: Long-horizon, abstract instructions on a mobile manipulator

vs. short-horizon or language-free baselines

The system was tested on real kitchen and office tasks that required multiple steps across different spatial locations (navigate, pick, place). These are genuinely complex: "bring me a Coke" requires finding an object in an unknown location, picking it up, and bringing it to the user. This isn't a toy task; it's the kind of instruction a person would actually give a home robot.

WHY DEVELOPERS SHOULD CARE

If you're building robotics software, SayCan teaches you something fundamental: don't try to make one component do everything. Language models are strong at semantic reasoning but weak at physical grounding. Value functions excel at environment-specific decision-making but can't understand abstract intentions. By combining them, each compensates for the other's weaknesses. More practically, this means your natural language interface for robots doesn't have to be engineered from scratch. You can use pretrained language models off the shelf (OpenAI's API, open-source models like LLaMA) and combine them with whatever skills you've already trained via reinforcement learning or imitation learning. The framework is general: in principle it works with any skill set and any language model.

The second major insight is about iteration and grounding. Many roboticists try to plan everything upfront ("predict the entire sequence of 10 steps and then execute"), but SayCan plans one step at a time in a loop, constantly checking: "can the robot actually do this right now?" This is more robust because plans adapt to reality. If something goes wrong, the next iteration re-scores every skill against the actual state and can redirect. For a software developer unfamiliar with robotics, this is counterintuitive: we're used to planning everything up front and executing plans faithfully. But robots live in the real world, where things fail, slip, or move unexpectedly. Grounded, iterative planning is the right mental model.

LIMITATIONS

SayCan's limitations are worth stating plainly. First, it requires pretrained skills: you need a library of behaviors like "pick up," "go to location," and "open drawer" already trained and working. The paper doesn't address how to acquire these skills or how to build them for new robots or tasks; that's assumed solved. Second, value functions sometimes overestimate success likelihood, leading to attempts that fail. The 74% execution rate means about 26% of attempts fail in execution, which is significant. Third, the system struggles with novel or compositionally complex instructions that require skills it wasn't trained for. If the robot has never learned to "tie a knot" or "wash a cup," no amount of language understanding will help. Finally, the approach is demonstrated on a specific mobile manipulator in controlled environments (kitchens, offices). Generalization to different morphologies or highly dynamic environments isn't explored. The language model also can't correct for systematic biases in its training data: if the data has stereotypes or incorrect assumptions about how to accomplish tasks, those persist in the robot's behavior.

WHAT COMES NEXT

The direction is clear: tighter integration with foundation models and better value functions. The paper shows that simply scaling the language model (FLAN → PaLM) yields major improvements, so we should expect continued gains as GPT-4-class models and beyond are applied. The next frontier is learning value functions online: currently they're fixed after training, but a robot that updates its estimates as it explores would be far more capable. Another direction is moving beyond a single robot to multi-robot coordination: can a language model coordinate multiple robots with different skill sets? The 2022 update teases "chain of thought prompting," suggesting the field is experimenting with having the language model verbalize its reasoning about why a particular sequence of steps makes sense, which could improve transparency. We'll also likely see this pattern, grounded language models for robotics, applied to navigation-only robots, manipulation-only arms, and humanoid systems. The core principle (language models for semantic reasoning plus environment-grounded value functions for feasibility) is agnostic to embodiment.
