
Jointly Improving Parsing and Perception for Natural Language Commands through Human-Robot Dialog

Jesse Thomason,* Aishwarya Padmakumar,† Jivko Sinapov,‡ Nick Walker,* Yuqian Jiang,† Harel Yedidsion,† Justin Hart,† Peter Stone,† and Raymond J. Mooney†

*Paul G. Allen School of Computer Science and Engineering, University of Washington
†Department of Computer Science, University of Texas at Austin
‡Department of Computer Science, Tufts University

Abstract

In this work, we present methods for parsing natural language to underlying meanings and using robotic sensors to create multi-modal models of perceptual concepts. We combine these steps towards language understanding into a holistic agent for jointly improving parsing and perception on a robotic platform through human-robot dialog. We train and evaluate this agent on Amazon Mechanical Turk, then demonstrate it on a robotic platform initialized from that conversational data. Our experiments show that improving both parsing and perception components from conversations improves communication quality and human ratings of the agent.

Figure 1: Mechanical Turk web interface.

1 Introduction

Pre-programming robots with fixed language understanding components limits them, since different speakers and environments use and elicit different language. Rather than force users to use language that the robots around them can understand, robots should dynamically adapt, continually learning new language constructions and perceptual concepts as they are used in context.

Untrained human users providing natural language commands to robots expect world knowledge, perceptual knowledge, and semantic understanding from verbal robots. We present a holistic system for jointly improving parsing and perception on a robotic system for natural language commands through human-robot dialog. This learning agent uses clarification questions in a conversation with a human partner to understand language commands. The agent induces additional training data for a semantic parser, similar to prior work (Thomason et al., 2015), strengthening its parsing over time. The agent also uses opportunistic active learning (Thomason et al., 2017) to ask questions about nearby objects to refine its multi-modal perceptual models (Thomason et al., 2016) on the fly during command dialogs.

We evaluate this language understanding agent on Mechanical Turk (Figure 1), and implement the agent on a physical robot [1].

[1] The demonstration video can be viewed at https://youtu.be/PbOfteZ_CJc

2 Conversational Agent

We implement and evaluate a conversational dialog agent that uses a semantic parser to translate human utterances into semantic meaning representations, then grounds those meaning representations using both a static knowledge base of facts about an office environment and perceptual concept models that consider multi-modal representations of physical objects [2].

[2] The source code for this conversational dialog agent, as well as the experiments described in the following section, can be found at https://github.com/thomason-jesse/grounded_dialog_agent

We build multi-modal concept models to connect robot perception to concept labels. We use multi-modal feature representations across various sensorimotor contexts, obtained by exploring objects with a robot arm, performing behaviors such as grasping, lifting, and dropping them, in addition to extracting visual information from them (Figure 2) (Sinapov et al., 2016; Thomason et al., 2016).

Figure 2: The robot used in our experiment, and the objects explored by the robot for grounding perceptual predicates.

We connect these feature representations of objects to language labels by learning discriminative classifiers on the feature spaces for each perceptual language concept (e.g., red or heavy) (Sinapov et al., 2014; Thomason et al., 2016).
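To make this concrete, the following is a minimal sketch of a per-concept discriminative classifier over features concatenated across sensorimotor contexts; the behavior names, feature layout, and choice of a scikit-learn SVM are illustrative assumptions rather than the paper's exact models.

```python
# Minimal sketch: one binary classifier per perceptual concept (e.g., red, heavy)
# over features concatenated across sensorimotor contexts. The behaviors, feature
# layout, and SVM choice are assumptions for illustration.
import numpy as np
from sklearn.svm import SVC

BEHAVIORS = ["look", "grasp", "lift", "drop"]  # hypothetical exploration behaviors

def object_features(obj):
    """Concatenate per-behavior feature vectors for one explored object."""
    return np.concatenate([obj[b] for b in BEHAVIORS])

def train_concept_model(labeled_objects):
    """Fit a classifier for one concept from (feature_dict, 0/1 label) pairs."""
    X = np.stack([object_features(o) for o, _ in labeled_objects])
    y = np.array([label for _, label in labeled_objects])
    return SVC(probability=True).fit(X, y)

def concept_confidence(model, obj):
    """Confidence that the concept applies to an object."""
    probs = model.predict_proba(object_features(obj).reshape(1, -1))[0]
    return probs[list(model.classes_).index(1)]
```

The published concept models go further, training per-context classifiers and weighting contexts by their reliability; a single concatenated feature space is used here only to keep the sketch short.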
Since there are multiple possible groundings for ambiguous utterances like "the office", and varied confidences for perceptual concept models on different objects, we create a confidence distribution over the possible groundings of a semantic parse. This confidence distribution is used as part of an update procedure that helps the agent understand the user's intent during dialog.
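As an illustration of that distribution (the scoring and normalization below are assumptions, not the paper's exact procedure), candidate groundings can be scored by combining parse confidence with knowledge-base matches or concept-model confidences and then normalized:

```python
# Illustrative sketch: turn scored candidate groundings of one semantic parse
# into a normalized confidence distribution. The scoring scheme is an assumption.
def grounding_distribution(parse_conf, candidate_scores):
    """candidate_scores maps each candidate grounding to a score, e.g., 1.0 for
    a knowledge-base match or a concept-model confidence for a perceptual word."""
    weighted = {g: parse_conf * s for g, s in candidate_scores.items()}
    total = sum(weighted.values())
    if total == 0:
        return {g: 1.0 / len(weighted) for g in weighted}  # uninformative fallback
    return {g: w / total for g, w in weighted.items()}

# Hypothetical example: "the office" could ground to two rooms, and the
# available evidence favors one of them.
print(grounding_distribution(0.8, {"room_3512": 0.9, "room_3516": 0.2}))
```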
We implement a conversational dialog agent for command understanding similar to that in previous work (Thomason et al., 2015). The differences between this agent and the previous one are: 1) grounding semantic parses in both static knowledge and perceptual knowledge; 2) dynamically adding new ontological predicates for novel perceptual concepts; and 3) leveraging opportunistic active learning to refine perceptual concept models on the fly.

Dialog begins with a human user commanding the robot to perform a task. The agent maintains a belief state modeling the unobserved true task in the user's mind, and uses the language signals from the user to infer it. The command is first parsed by the agent's semantic parser, then grounded against static and perceptual knowledge. These groundings are used to update the agent's belief state, and the agent engages in a clarification dialog to refine that belief.

When describing objects in the real world, humans can use words the agent has never heard before. In this work, we examine the lexical neighbors of unknown words in word-embedding space and, if any neighbor is a known perceptual concept, ask the user directly whether the unseen word is perceptual in nature. We then check for synonyms (e.g., users marked tall as a synonym for long), and add a new perceptual concept if no known word is a suitable synonym (e.g., red, a neighbor of yellow, was added in our experiments in this way).
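A minimal sketch of this neighbor check follows, using cosine similarity over pre-trained word embeddings; the similarity threshold, number of neighbors considered, and question phrasings are assumptions.

```python
# Sketch of the unknown-word check: if an unseen word's embedding neighbors
# include known perceptual concepts, ask whether the word is perceptual and
# whether any neighbor is a synonym; otherwise register a new concept.
# The threshold and prompts are assumptions.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def handle_unknown_word(word, embeddings, known_concepts, ask_user, threshold=0.5):
    """embeddings: dict word -> vector; ask_user: yes/no question callable."""
    if word not in embeddings:
        return None
    concepts = [c for c in known_concepts if c in embeddings]
    neighbors = sorted(concepts, reverse=True,
                       key=lambda c: cosine(embeddings[word], embeddings[c]))
    if not neighbors or cosine(embeddings[word], embeddings[neighbors[0]]) < threshold:
        return None  # no perceptual neighbors; treat the word as non-perceptual
    if not ask_user(f"Can you tell whether something is '{word}' by looking at, "
                    f"touching, or listening to it?"):
        return None
    for concept in neighbors[:3]:
        if ask_user(f"Does '{word}' mean the same thing as '{concept}'?"):
            return concept  # reuse the existing concept as a synonym
    return word             # add a brand-new perceptual concept
```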

We introduce opportunistic active learning questions as a sub-dialog routine for the agent, in which it can query about objects local to the human and the robot to refine its perceptual concept models before applying them to the remote test object items, similar to previous work (Thomason et al., 2017).

3 Experiments

Mechanical Turk users instruct the agent in three tasks: navigation, delivery, and relocation. We deploy the agent in a simulated office environment populated by rooms, people, and object items. We randomly split the set of possible tasks into initialization (10%), train (70%), and test (20%) sets.

Sixteen users (graduate students at the university across several fields) engaged with a faux-agent on initialization-set tasks using the web interface. We used these commands as a scaffold on which to build an ontology, lexicon, and initial utterance-semantic parse pairs. Of these, 44 pairs, D0, were used to train an initial parser.

We use these initial parsing resources to create a baseline agent A1 with a parser P1 trained only on the initialization pairs D0 mentioned above and concept models for several predicates Pc,1, but with no initial object examples against which to train them.

Three training phases i are carried out by agent Ai, after which parser Pi+1 and concept predicates Pc,i+1 are trained to instantiate agent Ai+1. Agent A4, with parser P4 and perception models Pc,4, is tested by interacting with users trying to accomplish tasks from the unseen test set. We also test an ablation agent, A4*, with parser P1 and perception models Pc,4 (trained perception, but only the initial parser).

Table 1 gives a breakdown of the number of workers who engaged with our HITs.

Condition          | Submitted HIT | Completed Tasks | Vetted | Nav. Correct | Del. Correct | Rel. Correct
Train (A1, A2, A3) | 297           | 162             | 113    | 36           | 44           | 18
Untrained (A1)     | 150           | 67              | 44     | 17           | 22           | 10
Test* (A4*)        | 148           | 83              | 50     | 20           | 29           | 10
Test (A4)          | 143           | 79              | 42     | 16           | 23           | 10

Table 1: Breakdown of the number of workers in our experiment. We count only workers who submitted the HIT with the correct code. Workers who completed all tasks and the survey finished the HIT entirely. Vetted workers' data was kept for evaluation after basic checks. The Train condition (A1, A2, A3 agents) draws from the training set of tasks, while the Untrained (A1, untrained agent), Test* (A4*, agent with trained perception and untrained parser), and Test (A4, agent with trained parser and perception) conditions draw from the test set of tasks.
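The phase-by-phase retraining behind the Train condition in Table 1 amounts to a simple outer loop. The sketch below is schematic: the data-collection and retraining callables are hypothetical placeholders standing in for the Mechanical Turk phases and the parser and concept-model training code.

```python
# Schematic of incremental training: agent A_i's conversations yield new parser
# pairs and object labels, which train parser P_{i+1} and concepts Pc,{i+1} for
# agent A_{i+1}. All callables here are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Agent:
    parser: object
    concept_models: dict

def train_in_phases(parser, concept_models, d0, run_phase,
                    retrain_parser, retrain_concepts, num_phases=3):
    pairs, labels = list(d0), []
    for phase in range(1, num_phases + 1):
        agent = Agent(parser, concept_models)          # A_i
        new_pairs, new_labels = run_phase(agent, phase)
        pairs += new_pairs
        labels += new_labels
        parser = retrain_parser(pairs)                 # P_{i+1}
        concept_models = retrain_concepts(labels)      # Pc,{i+1}
    return Agent(parser, concept_models)               # A_4, evaluated on test tasks
```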

For our embodied demonstration, we use the BWIBot (Khandelwal et al., 2014, 2017), equipped with a Kinova MICO arm, an Xtion ASUS Pro camera, a Hokuyo lidar, a Blue Snowball microphone, and a speaker. Speech transcripts are provided by the Google Speech API [3] and speech synthesis is performed with the Festival Speech Synthesis System [4]. Tabletop perception, required for both the dialog interaction and the execution of the resulting command, is implemented with RANSAC (Fischler and Bolles, 1981) plane fitting and Euclidean clustering as provided by the Point Cloud Library (Rusu and Cousins, 2011).

[3] https://cloud.google.com/speech/
[4] http://www.cstr.ed.ac.uk/projects/festival/
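A simplified NumPy illustration of the RANSAC plane-fitting step follows; the deployed system uses the PCL implementation, and the iteration count and inlier threshold here are arbitrary.

```python
# Simplified RANSAC plane fit over an Nx3 point cloud (illustration only; the
# robot uses the Point Cloud Library implementation of this step).
import numpy as np

def ransac_plane(points, iters=200, threshold=0.01, seed=0):
    """Return (normal, d, inlier_mask) for the best-supported plane n.x + d = 0."""
    rng = np.random.default_rng(seed)
    best_inliers, best_model = None, None
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:
            continue  # degenerate (collinear) sample
        normal = normal / norm
        d = -np.dot(normal, sample[0])
        inliers = np.abs(points @ normal + d) < threshold
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers, best_model = inliers, (normal, d)
    return best_model[0], best_model[1], best_inliers

# Example: a flat tabletop plus a few off-plane "object" points.
table = np.c_[np.random.rand(500, 2), np.zeros(500)]
clutter = np.random.rand(30, 3) * 0.2 + [0.4, 0.4, 0.1]
normal, d, mask = ransac_plane(np.vstack([table, clutter]))
print(normal, round(d, 3), int(mask.sum()))
```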

Figure 4 gives quantitative and qualitative measures of performance on Mechanical Turk, as well as some snapshots from the embodied demonstration. The agent acquires new perceptual concept models (25 in total), and synonym words for existing concepts, during the three phases of training. Figure 3 shows the learned perceptual concept model for can on test objects.

Figure 3: The perceptual concept model learned for can after training from conversations with human users (test objects scored with confidences of 0.32, 0.22, 0.2, 0.13, 0.07, 0.03, 0.03, and 0).

Figure 4: Top: the average semantic slot F1 scores between the semantic roles in the target task and the task confirmed by the user ((a) Navigation Semantic F1; (b) Delivery Semantic F1; (c) Relocation Semantic F1). Middle: survey prompt responses about usability ((d) Use Navigation; (e) Use Delivery; (f) Use Relocation). Bottom: the agent learns a new word, rattling, which requires perception using the auditory sensing modality, and uses this new concept model to correctly identify and move the target item ((g) the robot asks questions about items to learn rattling; (h) the robot decides to grasp a rattling container; (i) the robot hands over the item at the destination).

4 Conclusion

In this article, we presented a holistic system for jointly improving semantic parsing and grounded perception on a robotic system for interpreting natural language commands during human-robot dialog. We show, via a large-scale Mechanical Turk experiment, that users are better able to communicate tasks and rate the system as more usable after this dialog-based learning procedure. We embody this learning agent in a physical robot platform to demonstrate its learning abilities for the non-visual word rattling.

Acknowledgements

This work was supported by a National Science Foundation Graduate Research Fellowship to the first author, an NSF EAGER grant (IIS-1548567), and an NSF NRI grant (IIS-1637736). Portions of this work have taken place in the Learning Agents Research Group (LARG) at UT Austin. LARG research is supported in part by NSF (CNS-1305287, IIS-1637736, IIS-1651089, IIS-1724157), TxDOT, Intel, Raytheon, and Lockheed Martin. Peter Stone serves on the Board of Directors of Cogitai, Inc. The terms of this arrangement have been reviewed and approved by the University of Texas at Austin in accordance with its policy on objectivity in research. We would like to thank Subhashini Venugopalan for her help in engineering deep visual feature extraction for language grounding.

References

Martin A. Fischler and Robert C. Bolles. 1981. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395.

Piyush Khandelwal, Fangkai Yang, Matteo Leonetti, Vladimir Lifschitz, and Peter Stone. 2014. Planning in action language BC while learning action costs for mobile robots. In Proceedings of the International Conference on Automated Planning and Scheduling (ICAPS).

Piyush Khandelwal, Shiqi Zhang, Jivko Sinapov, Matteo Leonetti, Jesse Thomason, Fangkai Yang, Ilaria Gori, Maxwell Svetlik, Priyanka Khante, Vladimir Lifschitz, J. K. Aggarwal, Raymond Mooney, and Peter Stone. 2017. BWIBots: A platform for bridging the gap between AI and human-robot interaction research. The International Journal of Robotics Research (IJRR), 36.

Radu Bogdan Rusu and Steve Cousins. 2011. 3D is here: Point Cloud Library (PCL). In IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China.

Jivko Sinapov, Priyanka Khante, Maxwell Svetlik, and Peter Stone. 2016. Learning to order objects using haptic and proprioceptive exploratory behaviors. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI).

Jivko Sinapov, Connor Schenck, and Alexander Stoytchev. 2014. Learning relational object categories using behavioral exploration and multimodal perception. In IEEE International Conference on Robotics and Automation (ICRA).

Jesse Thomason, Aishwarya Padmakumar, Jivko Sinapov, Justin Hart, Peter Stone, and Raymond J. Mooney. 2017. Opportunistic active learning for grounding natural language descriptions. In Proceedings of the 1st Annual Conference on Robot Learning (CoRL-17), volume 78, pages 67–76. Proceedings of Machine Learning Research.

Jesse Thomason, Jivko Sinapov, Maxwell Svetlik, Peter Stone, and Raymond Mooney. 2016. Learning multi-modal grounded linguistic semantics by playing "I Spy". In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI), pages 3477–3483.

Jesse Thomason, Shiqi Zhang, Raymond Mooney, and Peter Stone. 2015. Learning to interpret natural language commands through human-robot dialog. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), pages 1923–1929.