Humans' natural language instructions are inherently ambiguous. Standard language grounding and planning methods fail to resolve this ambiguity. We propose FISER, which explicitly reasons about the human's internal intentions as intermediate steps.
The robot disambiguates the instruction into a concrete, robot-understandable task in the social reasoning phase (Phase 1), and then performs grounded planning in the embodied reasoning phase (Phase 2).
For AI agents to be helpful to humans, they should be able to follow natural language instructions to complete everyday cooperative tasks in human environments. However, real human instructions are inherently ambiguous, because human speakers assume the listener has sufficient prior knowledge about their hidden goals and intentions. Standard language grounding and planning methods fail to address such ambiguities because they do not model human internal goals as additional partially observable factors in the environment. We propose a new framework, Follow Instructions with Social and Embodied Reasoning (FISER), aiming for better natural language instruction following in collaborative embodied tasks.
Our framework makes explicit inferences about human goals and intentions as intermediate reasoning steps. We implement a set of Transformer-based models and evaluate them on a challenging benchmark, HandMeThat. We empirically demonstrate that using social reasoning to explicitly infer human intentions before making action plans surpasses purely end-to-end approaches. We also compare our implementation with strong baselines, including Chain-of-Thought prompting on the largest available pre-trained language models, and find that FISER provides better performance on the embodied social reasoning tasks under investigation, reaching the state of the art on HandMeThat.
Our problem formulation and proposed method: white nodes represent observable variables, while grey nodes are unobservable. The robot is given the trajectory with a final state at time t', and an utterance u. We propose to explicitly model the human's intentions by representing the human's overall plan G^h as a set of predicates p_k. We further assume that the human selects a subgoal p* for which they need help, and then specifies the robot's task G^r, which is the human's underlying intention when saying u.
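To make this formulation concrete, here is a minimal sketch of how the observed and latent variables could be represented in code. The class and field names (`Observation`, `LatentIntent`, and so on) are our own illustrative choices, not part of the benchmark or the released implementation.

```python
# A minimal sketch of the variables in the formulation above (illustrative names).
from dataclasses import dataclass

@dataclass
class Observation:
    trajectory: list[str]   # human actions observed up to time t'
    state: dict             # symbolic world state at time t'
    utterance: str          # the (possibly ambiguous) instruction u

@dataclass
class LatentIntent:
    human_plan: set[str]    # G^h: a set of goal predicates, e.g. {"inside(book_1, box_2)", ...}
    subgoal: str            # p*: the predicate the human wants help with
    robot_task: str         # G^r: the concrete task the robot should carry out
```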
Our FISER framework decomposes the problem into two parts: social reasoning and embodied reasoning.
Specifically, social reasoning aims to predict the sub-task for which the human is requesting assistance, which can be inferred from both the instruction and the human's observed action history in the shared environment.
After grounding the instruction into a robot-understandable task, the robot can plan and interact with the environment in a separate embodied reasoning phase.
To further enhance the model's ability to follow ambiguous instructions, we add an explicit plan recognition stage, in which a set of logical predicates is used to infer the human's overall plan.
We implement a Transformer-based model, trained with supervised learning, that predicts the specified sub-tasks (and the human's underlying plan) at intermediate layers. This step-by-step approach differs from the end-to-end methods commonly employed in previous work.
The Transformer-based model takes four types of input, which are passed separately through different Transformer encoder layers and interact with one another through a Modality Interaction module after each layer. The first 2N layers form the social reasoning phase and the last N layers form the embodied reasoning phase. The embeddings at layer 2N are used to recognize the robot's task, and the final-layer embeddings are used to predict actions.
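The PyTorch sketch below illustrates this layered structure under our own assumptions: per-modality encoder stacks, a cross-attention-based Modality Interaction module, the 2N/N phase split, and an auxiliary task head read out at layer 2N. Dimensions, pooling choices, and module names are illustrative rather than the exact released architecture; a plan recognition head could be attached at an earlier intermediate layer in the same way.

```python
# A hedged sketch of the layered structure described above (not the released code).
import torch
import torch.nn as nn

class ModalityInteraction(nn.Module):
    """Lets the four per-modality streams exchange information after each layer."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.cross = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, streams):
        # Concatenate all modality tokens and let each stream attend to the joint sequence.
        joint = torch.cat(streams, dim=1)
        return [self.cross(s, joint, joint)[0] + s for s in streams]

class FISERSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=4, N=2, n_tasks=100, n_actions=50):
        super().__init__()
        self.N = N
        total_layers = 3 * N  # 2N social-reasoning layers + N embodied-reasoning layers

        def make_layer():
            return nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

        # One encoder layer per modality at each depth, plus one interaction module.
        self.encoders = nn.ModuleList(
            [nn.ModuleList([make_layer() for _ in range(4)]) for _ in range(total_layers)]
        )
        self.interactions = nn.ModuleList(
            [ModalityInteraction(d_model, n_heads) for _ in range(total_layers)]
        )
        self.task_head = nn.Linear(d_model, n_tasks)      # robot-task readout at layer 2N
        self.action_head = nn.Linear(d_model, n_actions)  # action readout at the final layer

    def forward(self, streams):
        # `streams`: list of 4 tensors (one per input modality), each (B, T_i, d_model).
        task_logits = None
        for depth, (layers, interact) in enumerate(zip(self.encoders, self.interactions), 1):
            streams = [enc(s) for enc, s in zip(layers, streams)]
            streams = interact(streams)
            if depth == 2 * self.N:
                # End of the social reasoning phase: recognize the robot's task.
                # Pooling the first (instruction) stream is an assumption for this sketch.
                task_logits = self.task_head(streams[0].mean(dim=1))
        action_logits = self.action_head(streams[0].mean(dim=1))
        return task_logits, action_logits
```

In this sketch both heads read from the pooled first stream; the released model's readout and pooling may differ, but the phase split and the auxiliary supervision point at layer 2N follow the description above.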
We evaluate our framework by training a Transformer-based model from scratch on the challenging HandMeThat benchmark, and compare it with multiple competitive baselines, including the state-of-the-art prior work on HandMeThat and Chain-of-Thought (CoT) prompting on the largest available pre-trained language models.
The HandMeThat benchmark introduces household ambiguous-instruction-following tasks rendered in text. Its instructions are split into four difficulty levels, and the gaps between levels correspond to different challenges.
Level 1: No ambiguity, pure planning. E.g., "give me the book on the sofa."
Level 2: Social reasoning is required, but inferring the human's goal is sufficient to resolve the ambiguity. E.g., the human is storing books into a box and then asks, "could you pass that from the sofa?"
Level 3: Pragmatic reasoning about language use is further required. E.g., there are books scattered around the room and only one coat, which is on the sofa. Both a book and the coat would help with the human's goal, and the human asks, "could you pass that from the sofa?" In this case, the human is asking for the book, since "from the sofa" is needed to disambiguate which book the human is referring to, but would be unnecessary if the human wanted the coat.
Level 4: Tasks with inherent ambiguity. The ambiguity cannot be resolved from the available information alone, but can potentially be resolved with a strong prior over what the human is likely to do. E.g., taking one more fruit will complete the goal of packing a picnic, but there are many kinds of fruit in the refrigerator to choose from. From the perspective of completing the goal, any fruit will do, but human preferences may make a difference: the human may want an apple rather than a banana this time.
Key Insights:
Explicitly modeling human intentions works better than directly predicting actions.
Separating the social and embodied reasoning steps by explicitly recognizing the robot's task is beneficial.
Explicitly recognizing the human's plan further helps with the social reasoning stage in the most ambiguous cases (Level 4).
Pre-trained LLMs, despite having access to common-sense knowledge, do not adequately perform the complex social and embodied reasoning required by this task. Incorporating domain-specific knowledge through CoT prompting can help.
Failure case analysis for GPT-4 Turbo with CoT prompts versus our models trained from scratch with the FISER framework.
Planning Failure: Hallucination (going to a place where the target object is not located, i.e., failing to locate the object), missing steps, or invalid actions.
Redundant Behavior: The model gives the human an object that is already at its target location or even one which was just manipulated by the human.
Incorrect Intention: For GPT-4, common-sense reasoning is performed but is not aligned with the ground-truth human intention. For the Transformer, the model can reach the object it has selected, but that object is not what the human wants.
Prompting methods alone cannot provide the model with the type of social and embodied reasoning needed to solve this task. Training a small-scale model on this specific domain, however, can solve the problem more efficiently and reliably.
We run an additional experiment to see whether the performance of the pre-trained LLMs could be improved. Here we provide additional assistance by filtering out a proportion of the irrelevant objects in the environment (which assumes access to the ground-truth human goals).
Even when only relevant objects remain, GPT-4 Turbo achieves a success rate below 85%, which aligns with the planning failure rates in the failure case analysis. As the number of irrelevant objects increases, its success rate drops dramatically, showing the challenge of social reasoning in HandMeThat tasks.
The LLM's performance relies on a very large proportion of objects being filtered out, which suggests that LLMs cannot effectively select relevant environment information and focus on the relevant objects, a capability required for embodied reasoning.
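As an illustration of the assistance provided in this experiment, the sketch below keeps the goal-relevant objects plus a configurable fraction of the irrelevant ones, assuming access to the ground-truth goal predicates. The function name and the simple string-containment relevance test are our own simplifications, not the benchmark's actual filtering code.

```python
# A minimal sketch of goal-aware object filtering (illustrative, not the benchmark API).
import random

def filter_environment(objects, goal_predicates, keep_irrelevant_fraction=0.0):
    """Keep objects mentioned in the ground-truth goal, plus a fraction of the rest."""
    relevant = {o for o in objects if any(o in pred for pred in goal_predicates)}
    irrelevant = [o for o in objects if o not in relevant]
    kept = random.sample(irrelevant, int(len(irrelevant) * keep_irrelevant_fraction))
    return sorted(relevant) + kept

# Example: only "apple_1" is goal-relevant; half of the other objects are kept.
print(filter_environment(
    ["apple_1", "banana_2", "plate_3", "coat_1"],
    ["inside(apple_1, basket_1)"],
    keep_irrelevant_fraction=0.5,
))
```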
We study the challenging HandMeThat benchmark, comprising ambiguous instruction following tasks requiring sophisticated embodied and social reasoning. We find that existing approaches for training models end-to-end, or for prompting powerful pre-trained LLMs, are both insufficient to solve these tasks. We hypothesized that performance could be improved by building a model that explicitly performs social reasoning to infer the human's intentions from their prior actions in the environment. Our results provide evidence for this hypothesis, and show that our approach, Follow Instructions with Social and Embodied Reasoning (FISER), enhances performance over the most competitive prompting baselines by 70%, setting the new state-of-the-art for HandMeThat.
@article{wan2024fiser,
  author  = {Wan, Yanming and Wu, Yue and Wang, Yiping and Mao, Jiayuan and Jaques, Natasha},
  title   = {Infer Human's Intentions Before Following Natural Language Instructions},
  journal = {arXiv preprint arXiv:2409.18073},
  year    = {2024},
}