
Learning to Follow Navigational Directions

Adam Vogel and Dan Jurafsky
Department of Computer Science, Stanford University
{acvogel,jurafsky}@stanford.edu

Abstract

We present a system that learns to follow navigational natural language directions. Where traditional models learn from linguistic annotation or word distributions, our approach is grounded in the world, learning by apprenticeship from routes through a map paired with English descriptions. Lacking an explicit alignment between the text and the reference path makes it difficult to determine what portions of the language describe which aspects of the route. We learn this correspondence with a reinforcement learning algorithm, using the deviation of the route we follow from the intended path as a reward signal. We demonstrate that our system successfully grounds the meaning of spatial terms like above and below into geometric properties of paths.

Figure 1: A path appears on the instruction giver's map, who describes it to the instruction follower. Sample instruction giver utterances:
1. go vertically down until you're underneath eh diamond mine
2. then eh go right until you're
3. you're between springbok and highest viewpoint

1 Introduction

Spatial language usage is a vital component for physically grounded language understanding systems. Spoken language interfaces to robotic assistants (Wei et al., 2009) and Geographic Information Systems (Wang et al., 2004) must cope with the inherent ambiguity in spatial descriptions.

The semantics of imperative and spatial language is heavily dependent on the physical setting it is situated in, motivating automated learning approaches to acquiring meaning. Traditional accounts of learning typically rely on linguistic annotation (Zettlemoyer and Collins, 2009) or word distributions (Curran, 2003). In contrast, we present an apprenticeship learning system which learns to imitate human instruction following, without linguistic annotation. Solved using a reinforcement learning algorithm, our system acquires the meaning of spatial words through grounded interaction with the world. This draws on the intuition that children learn to use spatial language through a mixture of observing adult language usage and situated interaction in the world, usually without explicit definitions (Tanz, 1980).

Our system learns to follow navigational directions in a route following task. We evaluate our approach on the HCRC Map Task corpus (Anderson et al., 1991), a collection of spoken dialogs describing paths to take through a map. In this setting, two participants, the instruction giver and instruction follower, each have a map composed of named landmarks. Furthermore, the instruction giver has a route drawn on her map, and it is her task to describe the path to the instruction follower, who cannot see the reference path. Our system learns to interpret these navigational directions, without access to explicit linguistic annotation.

We frame direction following as an apprenticeship learning problem and solve it with a reinforcement learning algorithm, extending previous work on interpreting instructions by Branavan et al. (2009). Our task is to learn a policy, or mapping from world state to action, which most closely follows the reference route. Our state space combines world and linguistic features, representing both our current position on the map and the communicative content of the utterances we are interpreting. During training we have access to the reference path, which allows us to measure the utility, or reward, for each step of interpretation.
Using this reward signal as a form of supervision, we learn a policy to maximize the expected reward on unseen examples.

2 Related Work

Figure 2: The instruction giver and instruction follower face each other, and cannot see each other's maps.

Levit and Roy (2007) developed a spatial semantics for the Map Task corpus. They represent instructions as Navigational Information Units, which decompose the meaning of an instruction into orthogonal constituents such as the reference object, the type of movement, and quantitative aspect. For example, they represent the meaning of "move two inches toward the house" as a reference object (the house), a path descriptor (towards), and a quantitative aspect (two inches). These representations are then combined to form a path through the map. However, they do not learn these representations from text, leaving natural language processing as an open problem. The semantics in our paper is simpler, eschewing quantitative aspects and path descriptors, and instead focusing on reference objects and frames of reference. This simplifies the learning task, without sacrificing the core of their representation.

Learning to follow instructions by interacting with the world was recently introduced by Branavan et al. (2009), who developed a system which learns to follow Windows Help guides. Our reinforcement learning formulation follows closely from their work. Their approach can incorporate expert supervision into the reward function in a similar manner to this paper, but also allows for quite robust unsupervised learning. The Map Task corpus is free form conversational English, whereas the Windows instructions are written by a professional. In the Map Task corpus we only observe expert route following behavior, but are not told how portions of the text correspond to parts of the path, leading to a difficult learning problem.

The semantics of spatial language has been studied for some time in the linguistics literature. Talmy (1983) classifies the way spatial meaning is encoded syntactically, and Fillmore (1997) studies spatial terms as a subset of deictic language, which depends heavily on non-linguistic context. Levinson (2003) conducted a cross-linguistic semantic typology of spatial systems. Levinson categorizes the frames of reference, or spatial coordinate systems,1 into

1. Egocentric: Speaker/hearer centered frame of reference. Ex: "the ball to your left".

2. Allocentric: Speaker independent. Ex: "the road to the north of the house".

Levinson further classifies allocentric frames of reference into absolute, which includes the cardinal directions, and intrinsic, which refers to a featured side of an object, such as "the front of the car". Our spatial feature representation follows this egocentric/allocentric distinction. The intrinsic frame of reference occurs rarely in the Map Task corpus and is ignored, as speakers tend not to mention features of the landmarks beyond their names.

1 Not all languages exhibit all frames of reference. Terms for 'up' and 'down' are exhibited in most all languages, while 'left' and 'right' are absent in some. Gravity breaks the symmetry between 'up' and 'down' but no such physical distinction exists for 'left' and 'right', which contributes to the difficulty children have learning them.

Regier (1996) studied the learning of spatial language from static 2-D diagrams, learning to distinguish between terms with a connectionist model. He focused on the meaning of individual terms, pairing a diagram with a given word. In contrast, we learn from whole texts paired with a path, which requires learning the correspondence between text and world.
We use similar geometric features as Regier, capturing the allocentric frame of reference.

Spatial semantics have also been explored in physically grounded systems. Kuipers (2000) developed the Spatial Semantic Hierarchy, a knowledge representation formalism for representing different levels of granularity in spatial knowledge. It combines sensory, metrical, and topological information in a single framework. Kuipers et al. demonstrate its effectiveness on a physical robot, but did not address the learning problem.

More generally, apprenticeship learning is well studied in the reinforcement learning literature, where the goal is to mimic the behavior of an expert in some decision making domain. Notable examples include Abbeel and Ng (2004), who train a helicopter controller from pilot demonstration.

3 The Map Task Corpus

The HCRC Map Task Corpus (Anderson et al., 1991) is a set of dialogs between an instruction giver and an instruction follower. Each participant has a map with small named landmarks. Additionally, the instruction giver has a path drawn on her map, and must communicate this path to the instruction follower in natural language. Figure 1 shows a portion of the instruction giver's map and a sample of the instruction giver language which describes part of the path.

The Map Task Corpus consists of 128 dialogs, together with 16 different maps. The speech has been transcribed and segmented into utterances, based on the length of pauses. We restrict our attention to just the utterances of the instruction giver, ignoring the instruction follower. This is to reduce redundancy and noise in the data: the instruction follower rarely introduces new information, instead asking for clarification or giving confirmation. The landmarks on the instruction follower's map sometimes differ in location from the instruction giver's. We ignore this caveat, giving the system access to the instruction giver's landmarks, without the reference path.

Our task is to build an automated instruction follower. Whereas the original participants could speak freely, our system does not have the ability to query the instruction giver and must instead rely only on the previously recorded dialogs.

Figure 3: Sample state transition. Both actions get credit for visiting the great rock after the indian country. Action a1 also gets credit for passing the great rock on the correct side.

4 Reinforcement Learning Formulation

We frame the direction following task as a sequential decision making problem. We interpret utterances in order, where our interpretation is expressed by moving on the map. Our goal is to construct a series of moves in the map which most closely matches the expert path.

We define intermediate steps in our interpretation as states in a set S, and interpretive steps as actions drawn from a set A. To measure the fidelity of our path with respect to the expert, we define a reward function R : S × A → R+ which measures the utility of choosing a particular action in a particular state. Executing action a in state s carries us to a new state s', and we denote this transition function by s' = T(s, a). All transitions are deterministic in this paper.2

2 Our learning algorithm is not dependent on a deterministic transition function and can be applied to domains with stochastic transitions, such as robot locomotion.

For training we are given a set of dialogs D. Each dialog d ∈ D is segmented into utterances (u1, ..., um) and is paired with a map, which is composed of a set of named landmarks (l1, ..., ln).

4.1 State

The states of our decision making problem combine both our position in the dialog d and the path we have taken so far on the map. A state s ∈ S is composed of s = (ui, l, c), where l is the named landmark we are located next to and c is a cardinal direction drawn from {North, South, East, West} which determines which side of l we are on. Lastly, ui is the utterance in d we are currently interpreting.
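To make the formulation concrete, the following is a minimal sketch (ours, not the authors' code) of how the state tuple s = (ui, l, c) and the map landmarks might be represented; the class and field names here are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Cardinal sides on which a path can pass a landmark.
SIDES = ("North", "South", "East", "West")

@dataclass(frozen=True)
class Landmark:
    name: str   # e.g. "diamond mine"
    x: float    # map coordinates
    y: float

@dataclass(frozen=True)
class State:
    utterance_index: int           # index of the utterance u_i being interpreted
    landmark: Optional[Landmark]   # landmark l we are currently next to
    side: Optional[str]            # cardinal direction c, drawn from SIDES

@dataclass(frozen=True)
class Action:
    target: Optional[Landmark]     # landmark l' to move toward; None encodes the null action
    side: Optional[str]            # side c' on which to pass the target
```

A dialog would then pair its utterance strings with the landmarks of its map, and the reward R(s, a) and transition T(s, a) of the following subsections operate on these tuples.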
4.2 Action

An action a ∈ A is composed of a named landmark l', the target of the action, together with a cardinal direction c' which determines which side to pass l' on. Additionally, a can be the null action, with l' = l and c' = c. In this case, we interpret an utterance without moving on the map. A target l' together with a cardinal direction c' determine a point on the map, which is a fixed distance from l' in the direction of c'.

We make the assumption that at most one instruction occurs in a given utterance. This does not always hold true: the instruction giver sometimes chains commands together in a single utterance.

4.3 Transition

Executing action a = (l', c') in state s = (ui, l, c) leads us to a new state s' = T(s, a). This transition moves us to the next utterance to interpret, and moves our location to the target of the action. If a is the null action, s' = (ui+1, l, c); otherwise s' = (ui+1, l', c'). Figure 3 displays the state transitions of two different actions.

To form a path through the map, we connect these states with a path planner3 based on A*, where the landmarks are obstacles. In a physical system, this would be replaced with a robot motion planner.

3 We used the Java Path Planning Library, available at http://www.cs.cmu.edu/˜ggordon/PathPlan/.

4.4 Reward

We define a reward function R(s, a) which measures the utility of executing action a in state s. We wish to construct a route which follows the expert path as closely as possible. We consider a proposed route P close to the expert path Pe if P visits landmarks in the same order as Pe, and also passes them on the correct side.

For a given transition s = (ui, l, c), a = (l', c'), we have a binary feature indicating if the expert path moves from l to l'. In Figure 3, both a1 and a2 visit the next landmark in the correct order.

To measure if an action is to the correct side of a landmark, we have another binary feature indicating if Pe passes l' on side c'. In Figure 3, only a1 passes l' on the correct side.

In addition, we have a feature which counts the number of words in ui which also occur in the name of l'. This encourages us to choose policies which interpret language relevant to a given landmark.

Our reward function is a linear combination of these features.

4.5 Policy

We formally define an interpretive strategy as a policy π : S → A, a mapping from states to actions. Our goal is to find a policy π which maximizes the expected reward Eπ[R(s, π(s))]. The expected reward of following policy π from state s is referred to as the value of s, expressed as

    V^π(s) = Eπ[R(s, π(s))]    (1)

When comparing the utilities of executing an action a in a state s, it is useful to define a function

    Q^π(s, a) = R(s, a) + V^π(T(s, a))
              = R(s, a) + Q^π(T(s, a), π(T(s, a)))    (2)

which measures the utility of executing a, and following the policy π for the remainder. A given Q function implicitly defines a policy π by

    π(s) = argmax_a Q(s, a).    (3)

Basic reinforcement learning methods treat states as atomic entities, in essence estimating V^π as a table. However, at test time we are following new directions for a map we haven't previously seen. Thus, we represent state/action pairs with a feature vector φ(s, a) ∈ R^K. We then represent the Q function as a linear combination of the features,

    Q(s, a) = θ^T φ(s, a)    (4)

and learn weights θ which most closely approximate the true expected reward.
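As a minimal sketch of Equations (3) and (4) (our own illustration, not the paper's code), the linear Q-function and the greedy policy it induces could be written as follows; `featurize` stands in for the feature function φ of Section 4.6, and `candidate_actions` is assumed to enumerate the (landmark, side) pairs plus the null action for the current map.

```python
import numpy as np

def q_value(theta: np.ndarray, phi: np.ndarray) -> float:
    """Equation (4): Q(s, a) = theta^T phi(s, a)."""
    return float(theta @ phi)

def greedy_policy(state, candidate_actions, featurize, theta):
    """Equation (3): choose the action with the highest estimated Q value."""
    return max(candidate_actions,
               key=lambda a: q_value(theta, featurize(state, a)))
```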
4.6 Features

Our features φ(s, a) are a mixture of world and linguistic information. The linguistic information in our feature representation includes the instruction giver utterance and the names of landmarks on the map. Additionally, we furnish our algorithm with a list of English spatial terms, shown in Table 1. Our feature set includes approximately 200 features. Learning exactly which words influence decision making is difficult; reinforcement learning algorithms have problems with the large, sparse feature vectors common in natural language processing.

above, below, under, underneath, over, bottom, top, up, down, left, right, north, south, east, west, on

Table 1: The list of given spatial terms.

For a given state s = (u, l, c) and action a = (l', c'), our feature vector φ(s, a) is composed of the following features (two of which are sketched in code after the list):

• Coherence: The number of words w ∈ u that occur in the name of l'.

• Landmark Locality: Binary feature indicating if l' is the closest landmark to l.

• Direction Locality: Binary feature indicating if cardinal direction c' is the side of l' closest to (l, c).

• Null Action: Binary feature indicating if l' = NULL.

• Allocentric Spatial: Binary feature which conjoins the side c we pass the landmark on with each spatial term w ∈ u. This allows us to capture that the word above tends to indicate passing to the north of the landmark.

• Egocentric Spatial: Binary feature which conjoins the cardinal direction we move in with each spatial term w ∈ u. For instance, if (l, c) is above (l', c'), the direction from our current position is south. We conjoin this direction with each spatial term, giving binary features such as "the word down appears in the utterance and we move to the south".
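As an illustration of two of these features, the sketch below is ours, not the authors' implementation, and assumes a particular tokenization of utterances and landmark names.

```python
SPATIAL_TERMS = {"above", "below", "under", "underneath", "over", "bottom",
                 "top", "up", "down", "left", "right",
                 "north", "south", "east", "west", "on"}

def coherence(utterance_words, target_landmark_name):
    """Coherence feature: how many utterance words appear in the landmark name."""
    name_words = set(target_landmark_name.lower().split())
    return sum(1 for w in utterance_words if w.lower() in name_words)

def allocentric_features(utterance_words, action_side):
    """Allocentric spatial features: conjoin the side on which we pass the
    landmark (e.g. "North") with each spatial term present in the utterance."""
    present = {w.lower() for w in utterance_words} & SPATIAL_TERMS
    return {f"allocentric:{w}:{action_side}": 1.0 for w in present}

# Example: "go above the diamond mine" while passing the mine to the North
words = "go above the diamond mine".split()
print(coherence(words, "diamond mine"))        # -> 2
print(allocentric_features(words, "North"))    # -> {'allocentric:above:North': 1.0}
```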

Input: Dialog set D, Reward function R, Feature function φ, Transition function T, Learning rate αt
Output: Feature weights θ
1  Initialize θ to small random values
2  until θ converges do
3      foreach Dialog d ∈ D do
4          Initialize s0 = (l1, u1, ∅), a0 ∼ Pr(a0|s0; θ)
5          for t = 0; st non-terminal; t++ do
6              Act: st+1 = T(st, at)
7              Decide: at+1 ∼ Pr(at+1|st+1; θ)
8              Update:
9                  ∆ ← R(st, at) + θ^T φ(st+1, at+1)
10                      − θ^T φ(st, at)
11                 θ ← θ + αt φ(st, at) ∆
12         end
13     end
14 end
15 return θ

Algorithm 1: The SARSA learning algorithm.

5 Approximate Dynamic Programming

Given this feature representation, our problem is to find a parameter vector θ ∈ R^K for which Q(s, a) = θ^T φ(s, a) most closely approximates E[R(s, a)]. To learn these weights θ we use SARSA (Sutton and Barto, 1998), an online learning algorithm similar to Q-learning (Watkins and Dayan, 1992).

Algorithm 1 details the learning algorithm, which we follow here. We iterate over training documents d ∈ D. In a given state st, we act according to a probabilistic policy defined in terms of the Q function. After every transition we update θ, which changes how we act in subsequent steps.

Exploration is a key issue in any RL algorithm. If we act greedily with respect to our current Q function, we might never visit states which are actually higher in value. We utilize Boltzmann exploration, for which

    Pr(at | st; θ) = exp((1/τ) θ^T φ(st, at)) / Σ_{a'} exp((1/τ) θ^T φ(st, a'))    (5)

The parameter τ is referred to as the temperature, with a higher temperature causing more exploration, and a lower temperature causing more exploitation. In our experiments τ = 2.

Acting with this exploration policy, we iterate through the training dialogs, updating our feature weights θ as we go. The update step looks at two successive state transitions. Suppose we are in state st, execute action at, receive reward rt = R(st, at), transition to state st+1, and there choose action at+1. The variables of interest are (st, at, rt, st+1, at+1), which motivates the name SARSA.

Our current estimate of the Q function is Q(s, a) = θ^T φ(s, a). By the Bellman equation, for the true Q function

    Q(st, at) = R(st, at) + max_{a'} Q(st+1, a')    (6)

After each action, we want to move θ to minimize the temporal difference,

    R(st, at) + Q(st+1, at+1) − Q(st, at)    (7)
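A rough Python transliteration of the Boltzmann exploration policy in Equation (5) and the update in lines 9–11 of Algorithm 1 might look as follows; this is our sketch, reusing the hypothetical `featurize` helper from the earlier examples.

```python
import numpy as np

def boltzmann_probs(theta, state, actions, featurize, tau=2.0):
    """Equation (5): softmax over Q values with temperature tau (tau = 2 in the paper)."""
    scores = np.array([theta @ featurize(state, a) for a in actions]) / tau
    scores -= scores.max()                  # subtract max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

def sarsa_step(theta, s, a, s_next, a_next, reward, featurize, alpha):
    """One SARSA update: move theta along the temporal difference (Algorithm 1, lines 9-11)."""
    phi = featurize(s, a)
    phi_next = featurize(s_next, a_next)
    delta = reward + theta @ phi_next - theta @ phi
    return theta + alpha * delta * phi
```

The learning rate passed as `alpha` would decay over time, as described next.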

Figure 4: Sample output from the SARSA policy on Map 4g and Map 10g. The dashed line is the reference path and the solid line is the path the system follows.

For each feature φi(st, at), we change θi proportional to the temporal difference (7), tempered by a learning rate αt. We update θ according to

    θ = θ + αt φ(st, at) (R(st, at) + θ^T φ(st+1, at+1) − θ^T φ(st, at))    (8)

Here αt is the learning rate, which decays over time.4 In our case, αt = 10/(10 + t), which was tuned on the training set. We determine convergence of the algorithm by examining the magnitude of updates to θ. We stop the algorithm when

    ||θt+1 − θt||∞ < ε    (9)

4 To guarantee convergence, we require Σt αt = ∞ and Σt αt^2 < ∞. Intuitively, the sum diverging guarantees we can still learn arbitrarily far into the future, and the sum of squares converging guarantees that our updates will converge at some point.

6 Experimental Design

We evaluate our system on the Map Task corpus, splitting the corpus into 96 training dialogs and 32 test dialogs. The whole corpus consists of approximately 105,000 word tokens. The maps seen at test time do not occur in the training set, but some of the human participants are present in both.

6.1 Evaluation

We evaluate how closely the path P generated by our system follows the expert path Pe. We measure this with respect to two metrics: the order in which we visit landmarks and the side we pass them on.

To determine the order in which P visits landmarks, we compute the minimum distance from P to each landmark, and threshold it at a fixed value. To score path P, we compare the order in which it visits landmarks to the expert path. A transition l → l' which occurs in P counts as correct if the same transition occurs in Pe. Let |P| be the number of landmark transitions in a path P, and N the number of correct transitions in P. We define the order precision as N/|P|, and the order recall as N/|Pe|.

We also evaluate how well we pass landmarks on the correct side. We calculate the distance of Pe to each side of the landmark, considering the path to visit a side of the landmark if the distance is below a threshold. This means that a path might be considered to visit multiple sides of a landmark, although in practice it is usually one. If C is the number of landmarks we pass on the correct side, we define the side precision as C/|P|, and the side recall as C/|Pe|.
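For illustration only, the order metrics of this section could be computed roughly as follows; the path representation, the distance threshold, and the helper names are our assumptions, not the evaluation code used in the paper.

```python
def landmark_visit_order(path_points, landmarks, threshold=5.0):
    """Return the sequence of landmark names a path visits, in order.

    A landmark counts as visited when the path comes within `threshold`
    map units of it (the threshold value is an arbitrary placeholder);
    consecutive duplicates are collapsed as a simplification.
    """
    visited = []
    for x, y in path_points:
        for lm in landmarks:
            dist = ((x - lm.x) ** 2 + (y - lm.y) ** 2) ** 0.5
            if dist < threshold and (not visited or visited[-1] != lm.name):
                visited.append(lm.name)
    return visited

def order_precision_recall(pred_order, expert_order):
    """Order precision N/|P| and recall N/|Pe| over landmark transitions."""
    pred = list(zip(pred_order, pred_order[1:]))        # transitions l -> l'
    expert = set(zip(expert_order, expert_order[1:]))
    n_correct = sum(1 for t in pred if t in expert)
    precision = n_correct / len(pred) if pred else 0.0
    recall = n_correct / len(expert) if expert else 0.0
    return precision, recall
```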

             Visit Order             Side
             P     R     F1          P     R     F1
  Baseline   28.4  37.2  32.2        46.1  60.3  52.2
  PG         31.1  43.9  36.4        49.5  69.9  57.9
  SARSA      45.7  51.0  48.2        58.0  64.7  61.2

Table 2: Experimental results. Visit order shows how well we follow the order in which the answer path visits landmarks. 'Side' shows how successfully we pass on the correct side of landmarks.

6.2 Comparison Systems

The baseline policy simply visits the closest landmark at each step, taking the side of the landmark which is closest. It pays no attention to the direction language.

We also compare against the policy gradient learning algorithm of Branavan et al. (2009). They parametrize a probabilistic policy Pr(a|s; θ) as a log-linear model, in a similar fashion to our exploration policy. During training, the learning algorithm adjusts the weights θ according to the gradient of the value function defined by this distribution.

Reinforcement learning algorithms can be classified into value based and policy based. Value based methods estimate a value function V for each state, then act greedily with respect to it. Policy learning algorithms directly search through the space of policies. SARSA is a value based method, and the policy gradient algorithm is policy based.

7 Results

Table 2 details the quantitative performance of the different algorithms. Both SARSA and the policy gradient method outperform the baseline, but still fall significantly short of expert performance. The baseline policy performs surprisingly well, especially at selecting the correct side to visit a landmark.

The disparity between the learning approaches and gold standard performance can be attributed to several factors. The language in this corpus is conversational, frequently ungrammatical, and contains troublesome aspects of dialog such as conversational repairs and repetition. Secondly, our action and feature space are relatively primitive, and don't capture the full range of spatial expression. Path descriptors, such as the difference between around and past, are absent, and our feature representation is relatively simple.

The SARSA learning algorithm accrues more reward than the policy gradient algorithm. Like most gradient based optimization methods, policy gradient algorithms oftentimes get stuck in local maxima, and are sensitive to the initial conditions. Furthermore, as the size of the feature vector K increases, the space becomes even more difficult to search. There are no guarantees that SARSA has reached the best policy under our feature space, and this is difficult to determine empirically. Thus, some accuracy might be gained by considering different RL algorithms.

Figure 5: This figure shows the relative weights of spatial features organized by spatial word. The top row shows the weights of allocentric (landmark-centered) features. For example, the top left figure shows that when the word above occurs, our policy prefers to go to the north of the target landmark. The bottom row shows the weights of egocentric (absolute) spatial features. The bottom left figure shows that given the word above, our policy prefers to move in a southerly cardinal direction.

8 Discussion

Examining the feature weights θ sheds some light on our performance. Figure 5 shows the relative strength of weights for several spatial terms.
Recall that the two main classes of spatial features in φ are egocentric (what direction we move in) and allocentric (on which side we pass a landmark), combined with each spatial word.

Allocentric terms such as above and below tend to be interpreted as going to the north and south of landmarks, respectively. Interestingly, our system tends to move in the opposite cardinal direction, i.e. the agent moves south in the egocentric frame of reference. This suggests that people use above when we are already above a landmark. South slightly favors passing on the south side of landmarks, and has a heavy tendency to move in a southerly direction. This suggests that south is used more frequently in an egocentric reference frame.

Our system has difficulty learning the meaning of right. Right is often used as a conversational filler, and also for dialog alignment, such as

    "right okay right go vertically up then between the springboks and the highest viewpoint."

Furthermore, right can be used in both an egocentric or allocentric reference frame. Compare

    "go to the uh right of the mine"

which utilizes an allocentric frame, with

    "right then go eh uh to your right horizontally"

which uses an egocentric frame of reference. It is difficult to distinguish between these meanings without syntactic features.

9 Conclusion

We presented a reinforcement learning system which learns to interpret natural language directions. Critically, our approach uses no semantic annotation, instead learning directly from human demonstration. It successfully acquires a subset of spatial semantics, using reinforcement learning to derive the correspondence between instruction language and features of paths. While our results are still preliminary, we believe our model represents a significant advance in learning natural language meaning, drawing its supervision from human demonstration rather than word distributions or hand-labeled semantic tags. Framing language acquisition as apprenticeship learning is a fruitful research direction which has the potential to connect the symbolic, linguistic domain to the non-symbolic, sensory aspects of cognition.

Acknowledgments

This research was partially supported by the National Science Foundation via a Graduate Research Fellowship to the first author and award IIS-0811974 to the second author, and by the Air Force Research Laboratory (AFRL), under prime contract no. FA8750-09-C-0181. Thanks to Michael Levit and Deb Roy for providing digital representations of the maps and a subset of the corpus annotated with their spatial representation.

References

Pieter Abbeel and Andrew Y. Ng. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-first International Conference on Machine Learning. ACM Press.

A. Anderson, M. Bader, E. Bard, E. Boyle, G. Doherty, S. Garrod, S. Isard, J. Kowtko, J. Mcallister, J. Miller, C. Sotillo, H. Thompson, and R. Weinert. 1991. The HCRC map task corpus. Language and Speech, 34:351-366.

S.R.K. Branavan, Harr Chen, Luke Zettlemoyer, and Regina Barzilay. 2009. Reinforcement learning for mapping instructions to actions. In ACL-IJCNLP '09.

James Richard Curran. 2003. From Distributional to Semantic Similarity. Ph.D. thesis, University of Edinburgh.

Charles Fillmore. 1997. Lectures on Deixis. Stanford: CSLI Publications.

Benjamin Kuipers. 2000. The spatial semantic hierarchy. Artificial Intelligence, 119(1-2):191-233.

Stephen Levinson. 2003. Space In Language And Cognition: Explorations In Cognitive Diversity. Cambridge University Press.

Michael Levit and Deb Roy. 2007. Interpretation of spatial language in a map task. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 37(3):667-679.

Terry Regier. 1996. The Human Semantic Potential: Spatial Language and Constrained Connectionism. The MIT Press.

Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. MIT Press.

Leonard Talmy. 1983. How language structures space. In Spatial Orientation: Theory, Research, and Application.

Christine Tanz. 1980. Studies in the acquisition of deictic terms. Cambridge University Press.

Hongmei Wang, Alan M. Maceachren, and Guoray Cai. 2004. Design of human-GIS dialogue for communication of vague spatial concepts. In GIScience.

C. J. C. H. Watkins and P. Dayan. 1992. Q-learning. Machine Learning, 8:279-292.

Yuan Wei, Emma Brunskill, Thomas Kollar, and Nicholas Roy. 2009. Where to go: interpreting natural directions using global inference. In ICRA '09: Proceedings of the 2009 IEEE International Conference on Robotics and Automation, pages 3761-3767, Piscataway, NJ, USA. IEEE Press.

Luke S. Zettlemoyer and Michael Collins. 2009. Learning context-dependent mappings from sentences to logical form. In ACL-IJCNLP '09, pages 976-984.