Miscommunication Detection and Recovery for Spoken Dialogue Systems in Physically Situated Contexts
Total Page:16
File Type:pdf, Size:1020Kb
Miscommunication Detection and Recovery for Spoken Dialogue Systems in Physically Situated Contexts Matthew Marge CMU-LTI-15-015 Language Technologies Institute School of Computer Science Carnegie Mellon University 5000 Forbes Ave., Pittsburgh, PA 15213 www.lti.cs.cmu.edu Thesis Committee: Alexander I. Rudnicky (Chair), Carnegie Mellon University Alan W Black, Carnegie Mellon University Justine Cassell, Carnegie Mellon University Matthias Scheutz, Tufts University Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Language and Information Technologies Copyright c 2015 Matthew Marge Keywords: physically situated dialogue, spoken dialogue systems, human-robot communication, human-robot interaction, recovery strategies, instance-based learning To my wife, Boyoung Abstract Intelligent agents that are situated in a physical space, like robots and some virtual assistants, enable people to accomplish tasks. Their potential utility in many applications has also brought about a desire to communicate with them effectively using spoken language. Similarly, advances in speech recognition and synthesis have enabled spoken dialogue systems to run on these platforms. These agents face an array of problems when they try to apply instructions from people to their surroundings, even when there are little to no speech recognition errors. These situated grounding problems can be referential ambiguities or impossible- to-execute instructions. Navigation tasks are a prime example of where such problems can occur. Existing approaches to dialogue system error detection and recovery traditionally use speech and language input as evidence. In this thesis, we incorporate additional information from the agent’s planner and surroundings to detect and recover from situated grounding problems. The goal of this thesis was to produce a framework for handling situated grounding problems in task-oriented spoken dialogue. First, we created a repre- sentation for grounding that incorporates an agent’s surroundings to judge when it is grounded (i.e., on the same page) with a user. Second, we developed a method for detecting situated grounding problems that combines evidence from the user’s language with the agent’s plan and map information. Third, we implemented re- covery strategies that allow an agent to repair situated grounding problems. Fourth, we used an instance-based learning approach to improve how an agent selects recovery strategies over time. When an agent encounters a new grounding problem, it looks back on its interaction history to consider how it resolved similar situations. Detecting and recovering from situated grounding problems allow spoken dialogue systems for agents to be more resilient to miscommunication and operate in a wide range of scenarios with people. v Acknowledgments When I began my time at Carnegie Mellon, I arrived with a passion for artificial intelligence and wanted to discover ways to improve how we can interact with and benefit from intelligent agents. I would like to thank my advisor, Alex Rudnicky, for helping me hone that passion into this body of work. He taught me to think about the big picture, while at the same time pondering deeply about the inner workings of system processes. I am grateful for the academic freedom he gave me to pursue this line of research. My thesis committee members, Alan Black, Justine Cassell, and Matthias Scheutz, have been exceptionally supportive from the moment I began exploring this thesis topic. I am thankful for my many conversations with Alan as we discussed this thesis and other academic pursuits. I also appreciate valuable advice from Justine and Matthias that helped me find the balance between ambition and empirically grounded work. My work has been supported by the National Science Foundation Graduate Research Fellowship. I have benefited from many fruitful interactions at Carnegie Mellon. I am thankful to Satanjeev Banerjee, Thomas Harris, and Carolyn Rose´ for their advice and many discussions during the early years. Thanks to Maxim Makatchev, Lu´ıs Marujo, Prasanna Kumar Muthukumar, and Vittorio Perera for their companionship during the highs and lows of the doctoral program. I would like to thank Aasish Pappu, Ming Sun, and Long Qin for being wonderful officemates and colleagues whom with I could brainstorm my research ideas. This work also benefited from regular meetings with the Dialogs on Di- alogs reading group; in particular I would like to thank David Cohen, Heriberto Cuayahuitl,´ Rohit Kumar, Jose´ David Lopes, Yoichi Matsuyama, Elijah Mayfield, Gabriel Parent, Alexandros Papangelis, Antoine Raux, and Zhou Yu for their in- sightful conversations. I would also like to thank the regular participants of the weekly Sphinx Lunch for their feedback on my research. I’m also thankful to many other friends I gained throughout my years at Carnegie Mellon: Nia Bradley, Jonathan Clark, Sourish Chaudhuri, Brian Coltin, Kevin Dela Rosa, Michael Denkowski, Brian Langner, Meghana Kshirsagar, Subhodeep Moitra, Kenton Murray, Gabriel Parent, Danny Rashid, David Reitter, Hugo Rodrigues, Nathan Schneider, and Rushin Shah. Thank you for keeping me upbeat and positive. My colleagues at the Army Research Lab have been a valuable resource for feedback and ideas. Thanks to Clare Voss for her support and advice on crossing vii the finish line. I’m thankful to Jamal Laoudi for his assistance with my final thesis experiments. Thanks are also in order for David Baran, Claire Bonial, Taylor Cassidy, Philip David, Arthur William Evans, Susan Hill, Reginald Hobbs, Judith Klavans, Stephen LaRocca, Douglas Summers-Stay, Stephen Tratz, Michelle Vanni, Garrett Warnell, and Stuart Young for their feedback on my dissertation. I’m grateful to Melissa Holland for giving me the opportunity to join such exceptional colleagues. During my time in Pittsburgh, I greatly benefited from weekly meetings with the First Trinity Lutheran Student Fellowship. Thanks to Pastor Eric Andrae for his prayers and support throughout the entirety of my doctoral program. I’m thankful to my family for their constant love and support. I could not have achieved this without my parents Susan and Kenneth; they encouraged me to pursue my childhood dreams from the beginning. But I am most thankful to the person that helped me reach the end – my wife, Boyoung. To her I am most grateful for helping me to hit this milestone. Thank you for your love, enthusiasm, and unwavering patience as I pushed through deadlines and a mountain of work so that this dissertation could be complete. I could not survive without you holding the fort and preparing daily lunches and dinners for me. Thank you for helping me keep my focus, and for setting us up for a wonderful future. viii Contents 1 Introduction 1 1.1 Problem Space . .2 1.2 Thesis Statement . .5 1.3 Thesis Contributions . .6 2 Miscommunication in Spoken Dialogue 9 2.1 Sources of Miscommunication in Dialogue . .9 2.1.1 Human-Human Miscommunication . 10 2.1.2 Miscommunication in Spoken Dialogue Systems . 11 2.2 Error Detection in Spoken Dialogue Systems . 11 2.3 Dialogue Error Recovery . 12 2.4 Chapter Summary . 14 3 Spoken Language for Agents in Physically Situated Contexts 15 3.1 Presentation Formats for Agents . 15 3.1.1 Method . 16 3.1.2 Measurements . 18 3.1.3 Results . 18 3.2 The TeamTalk Corpus: Route Instructions for Robots . 20 3.2.1 Robot Navigation with Route Instructions . 21 3.2.2 Open Space Navigation Task . 22 3.2.3 Participation . 24 3.2.4 Corpus Annotation . 25 3.2.5 Session Log Information . 28 3.3 Chapter Summary . 28 ix x CONTENTS 4 Grounding as an Error Handling Process 31 4.1 Related Work . 31 4.1.1 Models of Discourse Grounding: Theory . 32 4.1.2 Grounding as Decision Making Under Uncertainty . 33 4.1.3 Symbol Grounding: Theory and Applications . 36 4.1.4 Clarification Requests . 37 4.1.5 Learning Approaches for Dialogue Control . 38 4.2 Plan Grounding Model . 39 4.2.1 Question Content Selection . 40 4.2.2 Measuring Costs of Adjusting Constraints . 41 4.2.3 Scenario Discussion . 43 4.2.4 Question Presentation . 49 4.2.5 Canceling the Task . 49 4.3 Chapter Summary . 49 5 Miscommunication Recovery in Physically Situated Dialogue 51 5.1 Miscommunication in Physically Situated Domains . 51 5.2 Existing Work . 53 5.3 Method . 53 5.3.1 Experiment Design . 54 5.3.2 Hypotheses . 57 5.3.3 Measures . 58 5.3.4 Participation . 59 5.4 Results . 60 5.4.1 Correctness . 60 5.4.2 Referential Ambiguity . 61 5.4.3 Impossible-to-Execute . 62 5.5 Discussion . 63 5.5.1 Limitations . 66 5.6 Chapter Summary . 66 6 Detecting Situated Grounding Problems with TeamTalk 69 6.1 Virtual Agent Navigation Domain . 69 6.1.1 Object . 70 6.1.2 Action . 71 6.1.3 Path . 71 CONTENTS XI 6.1.4 Existential Knowledge . 71 6.1.5 Core Heuristics . 71 6.2 Research Platform: TeamTalk with USARSim . 71 6.2.1 TeamTalk Human-Robot Dialogue Framework . 71 6.2.2 USARSim and MOAST . 76 6.2.3 TeamTalk Software Architecture . 77 6.3 Identifying Miscommunication in Physically Situated Dialogue . 82 6.3.1 Levels of Understanding in Human-Computer Dialogue . 82 6.3.2 Core Capabilities for Detecting Miscommunication . 84 6.3.3 Maintaining Environment Context . 86 6.4 Detecting Grounding Problems with the Situated Reasoner . 87 6.4.1 Overview . 87 6.4.2 Representing Context in Physically Situated Dialogue . 88 6.4.3 Task-Independent Skills . 94 6.4.4 Physically Situated Skills . 95 6.4.5 Evaluation . 99 6.5 Chapter Summary . 99 7 Recovery Strategies for Physically Situated Contexts 101 7.1 Understanding Human Miscommunication Behaviors . 101 7.1.1 Human-Human Navigation Dialogue Corpora . 102 7.1.2 Recovery Strategies from Dialogue Corpora . 103 7.1.3 Tabulating Question Types in the SCARE Corpus .