
Jointly Improving Parsing and Perception for Natural Language Commands through Human-Robot Dialog

Jesse Thomason,* Aishwarya Padmakumar,† Jivko Sinapov,‡ Nick Walker,* Yuqian Jiang,† Harel Yedidsion,† Justin Hart,† Peter Stone,† and Raymond J. Mooney†

*Paul G. Allen School of Computer Science and Engineering, University of Washington
†Department of Computer Science, University of Texas at Austin
‡Department of Computer Science, Tufts University

Abstract

In this work, we present methods for parsing natural language to underlying meanings and using robotic sensors to create multi-modal models of perceptual concepts. We combine these steps towards language understanding into a holistic agent for jointly improving parsing and perception on a robotic platform through human-robot dialog. We train and evaluate this agent on Amazon Mechanical Turk, then demonstrate it on a robotic platform initialized from that conversational data. Our experiments show that improving both parsing and perception components from conversations improves communication quality and human ratings of the agent.

Figure 1: Mechanical Turk web interface.

1 Introduction

Pre-programming robots with fixed language understanding components limits them, since different speakers and environments use and elicit different language. Rather than force users to use language that the robots around them can understand, robots should dynamically adapt, continually learning new language constructions and perceptual concepts as they are used in context.

Untrained human users providing natural language commands to robots expect world knowledge, perceptual knowledge, and semantic understanding from verbal robots. We present a holistic system for jointly improving parsing and perception on a robotic system for natural language commands through human-robot dialog. This learning agent uses clarification questions in a conversation with a human partner to understand language commands. The agent induces additional training data for a semantic parser, similar to prior work (Thomason et al., 2015), strengthening its parsing over time. The agent also uses opportunistic active learning (Thomason et al., 2017) to ask questions about nearby objects to refine its multi-modal perceptual models (Thomason et al., 2016) on the fly during command dialogs.

We evaluate this language understanding agent on Mechanical Turk (Figure 1), and implement the agent on a physical robot [1].

[1] The demonstration video can be viewed at https://youtu.be/PbOfteZ_CJc

2 Conversational Agent

We implement and evaluate a conversational dialog agent that uses a semantic parser to translate human utterances into semantic meaning representations, then grounds those meaning representations using both a static knowledge base of facts about an office environment and perceptual concept models that consider multi-modal representations of physical objects [2].

[2] The source code for this conversational dialog agent, as well as the experiments described in the following section, can be found at https://github.com/thomason-jesse/grounded_dialog_agent

We build multi-modal concept models to connect robot perception to concept labels. We use multi-modal feature representations across various sensorimotor contexts, obtained by exploring objects with a robot arm, performing behaviors such as grasping, lifting, and dropping them, in addition to extracting visual information from them (Figure 2) (Sinapov et al., 2016; Thomason et al., 2016).

Figure 2: The robot used in our experiment, and the objects explored by the robot for grounding perceptual predicates.

We connect these feature representations of objects to language labels by learning discriminative classifiers on the feature spaces for each perceptual language concept (e.g., red or heavy) (Sinapov et al., 2014; Thomason et al., 2016).
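To make this concrete, the following is a minimal sketch of a per-concept discriminative classifier over features concatenated across sensorimotor contexts; the behavior names, feature layout, and choice of a scikit-learn SVM are illustrative assumptions rather than the paper's exact models.

```python
# Minimal sketch: one binary classifier per perceptual concept (e.g., red, heavy)
# over features concatenated across sensorimotor contexts. The behaviors, feature
# layout, and SVM choice are assumptions for illustration.
import numpy as np
from sklearn.svm import SVC

BEHAVIORS = ["look", "grasp", "lift", "drop"]  # hypothetical exploration behaviors

def object_features(obj):
    """Concatenate per-behavior feature vectors for one explored object."""
    return np.concatenate([obj[b] for b in BEHAVIORS])

def train_concept_model(labeled_objects):
    """Fit a classifier for one concept from (feature_dict, 0/1 label) pairs."""
    X = np.stack([object_features(o) for o, _ in labeled_objects])
    y = np.array([label for _, label in labeled_objects])
    return SVC(probability=True).fit(X, y)

def concept_confidence(model, obj):
    """Confidence that the concept applies to an object."""
    probs = model.predict_proba(object_features(obj).reshape(1, -1))[0]
    return probs[list(model.classes_).index(1)]
```

The published concept models go further, training per-context classifiers and weighting contexts by their reliability; a single concatenated feature space is used here only to keep the sketch short.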
Since there are multiple possible groundings for ambiguous utterances like "the office", and varied confidences for perceptual concept models on different objects, we create a confidence distribution over the possible groundings of a semantic parse. This confidence distribution is used as part of an update procedure that helps the agent understand the user's intent during dialog.
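As an illustration of that distribution (the scoring and normalization below are assumptions, not the paper's exact procedure), candidate groundings can be scored by combining parse confidence with knowledge-base matches or concept-model confidences and then normalized:

```python
# Illustrative sketch: turn scored candidate groundings of one semantic parse
# into a normalized confidence distribution. The scoring scheme is an assumption.
def grounding_distribution(parse_conf, candidate_scores):
    """candidate_scores maps each candidate grounding to a score, e.g., 1.0 for
    a knowledge-base match or a concept-model confidence for a perceptual word."""
    weighted = {g: parse_conf * s for g, s in candidate_scores.items()}
    total = sum(weighted.values())
    if total == 0:
        return {g: 1.0 / len(weighted) for g in weighted}  # uninformative fallback
    return {g: w / total for g, w in weighted.items()}

# Hypothetical example: "the office" could ground to two rooms, and the
# available evidence favors one of them.
print(grounding_distribution(0.8, {"room_3512": 0.9, "room_3516": 0.2}))
```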
We implement a conversational dialog agent for command understanding similar to that in previous work (Thomason et al., 2015). The differences between this agent and the previous one are: 1) grounding semantic parses in both static knowledge and perceptual knowledge; 2) dynamically adding new ontological predicates for novel perceptual concepts; and 3) leveraging opportunistic active learning to refine perceptual concept models on the fly.

Dialog begins with a human user commanding the robot to perform a task. The agent maintains a belief state modeling the unobserved true task in the user's mind, and uses the language signals from the user to infer it. The command is first parsed by the agent's semantic parser, then grounded against static and perceptual knowledge. These groundings are used to update the agent's belief state, and the agent engages in a clarification dialog to refine that belief.

When describing objects in the real world, humans can use words the agent has never heard before. In this work, we examine the lexical neighbors of unknown words in word-embedding space and, if any neighbor is a known perceptual concept, ask the user directly whether the unseen word is perceptual in nature. We then check for synonyms (e.g., users marked tall as a synonym for long), and add a new perceptual concept if no known word is a suitable synonym (e.g., red, a neighbor of yellow, was added in our experiments in this way).
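A minimal sketch of this neighbor check follows, using cosine similarity over pre-trained word embeddings; the similarity threshold, number of neighbors considered, and question phrasings are assumptions.

```python
# Sketch of the unknown-word check: if an unseen word's embedding neighbors
# include known perceptual concepts, ask whether the word is perceptual and
# whether any neighbor is a synonym; otherwise register a new concept.
# The threshold and prompts are assumptions.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def handle_unknown_word(word, embeddings, known_concepts, ask_user, threshold=0.5):
    """embeddings: dict word -> vector; ask_user: yes/no question callable."""
    if word not in embeddings:
        return None
    concepts = [c for c in known_concepts if c in embeddings]
    neighbors = sorted(concepts, reverse=True,
                       key=lambda c: cosine(embeddings[word], embeddings[c]))
    if not neighbors or cosine(embeddings[word], embeddings[neighbors[0]]) < threshold:
        return None  # no perceptual neighbors; treat the word as non-perceptual
    if not ask_user(f"Can you tell whether something is '{word}' by looking at, "
                    f"touching, or listening to it?"):
        return None
    for concept in neighbors[:3]:
        if ask_user(f"Does '{word}' mean the same thing as '{concept}'?"):
            return concept  # reuse the existing concept as a synonym
    return word             # add a brand-new perceptual concept
```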

We introduce opportunistic active learning questions as a sub-dialog routine for the agent, in which it can query about objects local to the human and the robot to refine its perceptual concept models before applying them to the remote test object items, similar to previous work (Thomason et al., 2017).

3 Experiments

Mechanical Turk users instruct the agent in three tasks: navigation, delivery, and relocation. We deploy the agent in a simulated office environment populated by rooms, people, and object items. We randomly split the set of possible tasks into initialization (10%), train (70%), and test (20%) sets.

Sixteen users (graduate students at the university across several fields) engaged with a faux-agent on initialization-set tasks using the web interface. We used these commands as a scaffold on which to build an ontology, lexicon, and initial utterance-semantic parse pairs. Of these, 44 pairs, D0, were used to train an initial parser.

We use these initial parsing resources to create a baseline agent A1 with a parser P1 trained only on the initialization pairs D0 mentioned above and concept models for several predicates Pc,1, but with no initial object examples against which to train them.

Three training phases i are carried out by agent Ai, after which parser Pi+1 and concept predicates Pc,i+1 are trained to instantiate agent Ai+1. Agent A4, with parser P4 and perception models Pc,4, is tested by interacting with users trying to accomplish tasks from the unseen test set. We also test an ablation agent, A4*, with parser P1 and perception models Pc,4 (trained perception, but only the initial parser).

Table 1 gives a breakdown of the number of workers who engaged with our HITs.

Condition          | Submitted HIT | Completed Tasks | Vetted | Nav. Correct | Del. Correct | Rel. Correct
Train (A1, A2, A3) | 297           | 162             | 113    | 36           | 44           | 18
Untrained (A1)     | 150           | 67              | 44     | 17           | 22           | 10
Test* (A4*)        | 148           | 83              | 50     | 20           | 29           | 10
Test (A4)          | 143           | 79              | 42     | 16           | 23           | 10

Table 1: Breakdown of the number of workers in our experiment. We count only workers who submitted the HIT with the correct code. Workers who completed all tasks and the survey finished the HIT entirely. Vetted workers' data was kept for evaluation after basic checks. The Train condition (A1, A2, A3 agents) draws from the training set of tasks, while the Untrained (A1, untrained agent), Test* (A4*, agent with trained perception and untrained parser), and Test (A4, agent with trained parser and perception) conditions draw from the test set of tasks.
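The phase-by-phase retraining behind the Train condition in Table 1 amounts to a simple outer loop. The sketch below is schematic: the data-collection and retraining callables are hypothetical placeholders standing in for the Mechanical Turk phases and the parser and concept-model training code.

```python
# Schematic of incremental training: agent A_i's conversations yield new parser
# pairs and object labels, which train parser P_{i+1} and concepts Pc,{i+1} for
# agent A_{i+1}. All callables here are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Agent:
    parser: object
    concept_models: dict

def train_in_phases(parser, concept_models, d0, run_phase,
                    retrain_parser, retrain_concepts, num_phases=3):
    pairs, labels = list(d0), []
    for phase in range(1, num_phases + 1):
        agent = Agent(parser, concept_models)          # A_i
        new_pairs, new_labels = run_phase(agent, phase)
        pairs += new_pairs
        labels += new_labels
        parser = retrain_parser(pairs)                 # P_{i+1}
        concept_models = retrain_concepts(labels)      # Pc,{i+1}
    return Agent(parser, concept_models)               # A_4, evaluated on test tasks
```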

For our embodied demonstration, we use the BWIBot (Khandelwal et al., 2014, 2017), equipped with a Kinova MICO arm, an Xtion ASUS Pro camera, a Hokuyo lidar, a Blue Snowball microphone, and a speaker. Speech transcripts are provided by the Google Speech API [3] and speech synthesis is performed with the Festival Speech Synthesis System [4]. Tabletop perception, required for both the dialog interaction and the execution of the resulting command, is implemented with RANSAC (Fischler and Bolles, 1981) plane fitting and Euclidean clustering as provided by the Point Cloud Library (Rusu and Cousins, 2011).

[3] https://cloud.google.com/speech/
[4] http://www.cstr.ed.ac.uk/projects/festival/
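A simplified NumPy illustration of the RANSAC plane-fitting step follows; the deployed system uses the PCL implementation, and the iteration count and inlier threshold here are arbitrary.

```python
# Simplified RANSAC plane fit over an Nx3 point cloud (illustration only; the
# robot uses the Point Cloud Library implementation of this step).
import numpy as np

def ransac_plane(points, iters=200, threshold=0.01, seed=0):
    """Return (normal, d, inlier_mask) for the best-supported plane n.x + d = 0."""
    rng = np.random.default_rng(seed)
    best_inliers, best_model = None, None
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:
            continue  # degenerate (collinear) sample
        normal = normal / norm
        d = -np.dot(normal, sample[0])
        inliers = np.abs(points @ normal + d) < threshold
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers, best_model = inliers, (normal, d)
    return best_model[0], best_model[1], best_inliers

# Example: a flat tabletop plus a few off-plane "object" points.
table = np.c_[np.random.rand(500, 2), np.zeros(500)]
clutter = np.random.rand(30, 3) * 0.2 + [0.4, 0.4, 0.1]
normal, d, mask = ransac_plane(np.vstack([table, clutter]))
print(normal, round(d, 3), int(mask.sum()))
```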

Figure 4 gives quantitative and qualitative measures of performance on Mechanical Turk, as well as some snapshots from the embodied demonstration. The agent acquires new perceptual concept models (25 in total), and synonym words for existing concepts, during the three phases of training. Figure 3 shows the learned perceptual concept model for can on test objects.

Figure 3: The perceptual concept model learned for can after training from conversations with human users (test objects scored with confidences of 0.32, 0.22, 0.2, 0.13, 0.07, 0.03, 0.03, and 0).

Figure 4: Top: the average semantic slot F1 scores between the semantic roles in the target task and the task confirmed by the user ((a) Navigation Semantic F1; (b) Delivery Semantic F1; (c) Relocation Semantic F1). Middle: survey prompt responses about usability ((d) Use Navigation; (e) Use Delivery; (f) Use Relocation). Bottom: the agent learns a new word, rattling, which requires perception using the auditory sensing modality, and uses this new concept model to correctly identify and move the target item ((g) the robot asks questions about items to learn rattling; (h) the robot decides to grasp a rattling container; (i) the robot hands over the item at the destination).

4 Conclusion

In this article, we presented a holistic system for jointly improving semantic parsing and grounded perception on a robotic system for interpreting natural language commands during human-robot dialog. We show, via a large-scale Mechanical Turk experiment, that users are better able to communicate tasks and rate the system as more usable after this dialog-based learning procedure. We embody this learning agent in a physical robot platform to demonstrate its learning abilities for the non-visual word rattling.

Acknowledgements

This work was supported by a National Science Foundation Graduate Research Fellowship to the first author, an NSF EAGER grant (IIS-1548567), and an NSF NRI grant (IIS-1637736). Portions of this work have taken place in the Learning Agents Research Group (LARG) at UT Austin. LARG research is supported in part by NSF (CNS-1305287, IIS-1637736, IIS-1651089, IIS-1724157), TxDOT, Intel, Raytheon, and Lockheed Martin. Peter Stone serves on the Board of Directors of Cogitai, Inc. The terms of this arrangement have been reviewed and approved by the University of Texas at Austin in accordance with its policy on objectivity in research. We would like to thank Subhashini Venugopalan for her help in engineering deep visual feature extraction for language grounding.

References

Martin A. Fischler and Robert C. Bolles. 1981. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395.

Piyush Khandelwal, Fangkai Yang, Matteo Leonetti, Vladimir Lifschitz, and Peter Stone. 2014. Planning in action language BC while learning action costs for mobile robots. In Proceedings of the International Conference on Automated Planning and Scheduling (ICAPS).

Piyush Khandelwal, Shiqi Zhang, Jivko Sinapov, Matteo Leonetti, Jesse Thomason, Fangkai Yang, Ilaria Gori, Maxwell Svetlik, Priyanka Khante, Vladimir Lifschitz, J. K. Aggarwal, Raymond Mooney, and Peter Stone. 2017. BWIBots: A platform for bridging the gap between AI and human-robot interaction research. The International Journal of Robotics Research (IJRR), 36.

Radu Bogdan Rusu and Steve Cousins. 2011. 3D is here: Point Cloud Library (PCL). In IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China.

Jivko Sinapov, Priyanka Khante, Maxwell Svetlik, and Peter Stone. 2016. Learning to order objects using haptic and proprioceptive exploratory behaviors. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI).

Jivko Sinapov, Connor Schenck, and Alexander Stoytchev. 2014. Learning relational object categories using behavioral exploration and multimodal perception. In IEEE International Conference on Robotics and Automation (ICRA).

Jesse Thomason, Aishwarya Padmakumar, Jivko Sinapov, Justin Hart, Peter Stone, and Raymond J. Mooney. 2017. Opportunistic active learning for grounding natural language descriptions. In Proceedings of the 1st Annual Conference on Robot Learning (CoRL-17), volume 78, pages 67–76. Proceedings of Machine Learning Research.

Jesse Thomason, Jivko Sinapov, Maxwell Svetlik, Peter Stone, and Raymond Mooney. 2016. Learning multi-modal grounded linguistic semantics by playing "I Spy". In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI), pages 3477–3483.

Jesse Thomason, Shiqi Zhang, Raymond Mooney, and Peter Stone. 2015. Learning to interpret natural language commands through human-robot dialog. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), pages 1923–1929.