
Adapting Spoken Dialog Systems Towards Domains and Users

Ming Sun
CMU-LTI-16-006

Language Technologies Institute
School of Computer Science
Carnegie Mellon University
5000 Forbes Ave., Pittsburgh, PA 15213
www.lti.cs.cmu.edu

Thesis Committee:
Alexander I. Rudnicky, CMU (Chair)
Alan W. Black, CMU
Roni Rosenfeld, CMU
Amanda Stent, YAHOO! Research

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Language and Information Technologies

© 2016, Ming Sun

Keywords: Lexicon learning, cloud speech recognition adaptation, high-level intention understanding, spoken dialog systems

Abstract

Spoken dialog systems are widely used, for example as voice applications or agents in smartphone and smart-car environments. However, speech systems are built using the developers' understanding of the application domain and of potential users in the field, driven by observations collected by sampling the population at a given time. The deployed models may therefore not fit real-life usage well, or may no longer be valid as the dynamics of the domain and its users change over time. A system that automatically adapts to its domain and users is naturally desirable.

In this thesis, we focus on three realistic problems in language-based communication between human and machine. First, current speech systems with fixed vocabularies have difficulty understanding out-of-vocabulary words (OOVs), leading to misunderstanding or even task failure. Our approaches can learn new words during the conversation, or even before it, by detecting the presence of OOVs or anticipating new words ahead of time. Our experiments show that OOV-related recognition and understanding errors can thereby be prevented. Second, cloud-based automatic speech recognition (cloud-ASR) is widely used by current dialog applications, but it lacks the flexibility to adapt to domains or users. Our method, which combines hypotheses from 1) a local, adaptive ASR and 2) the cloud-ASR, provides better recognition accuracy. Third, when interacting with a dialog system, a user's intention may span multiple domains hosted by the system. However, current multi-domain dialog systems are not aware of the user's high-level intentions, resulting in lost opportunities to assist the user in a timely manner or to personalize the interaction experience. We built models to recognize these complex user intentions and enable the system to communicate with the user at the task level, in addition to the individual domain level. We believe that adaptation at these three levels can improve the quality of human-machine interaction.

Acknowledgement

First of all, I am grateful to my four committee members. My advisor Alex guided me for the past six years and encouraged me to work on the research I am interested in. He taught me how to shape research ideas and design experiments. He has been extremely supportive of my research ever since, giving me the freedom to experiment with my ideas, offering helpful advice when I strayed off path, and even presenting my work at international conferences on my behalf due to a travel visa issue. Alan is the one who sparked my curiosity about the speech world when I took his undergraduate course and built a pizza-delivery dialog system back in 2009. He has been a fantastic resource for brainstorming research ideas, planning my career, and telling good jokes. Roni introduced me to statistical language modeling and machine learning.
I am honored to have worked with him as a teaching assistant in these two courses. His critical thinking and excellent teaching skills (e.g., deriving every formula on the board in colorful chalk) equipped me with the techniques necessary to conduct my own research. Amanda has been supportive and resourceful throughout my dissertation. She directed me to all kinds of inspiring external resources that broadened my horizons. Every time I left a meeting with her, I was able to imagine entirely new research directions that I had previously not thought possible. I am lucky to have had these four excellent committee members on board throughout my journey of discovering new possibilities in language technologies.

I want to thank Aasish Pappu for playing an April Fools' prank on me the night before my defense: he tricked me into believing that it is a US convention to mail a hard copy of the dissertation to one's external committee. Alok Parlikar consistently asked me about my defense date almost every other week for the past two years (for some hidden reason). I was fortunate to collaborate with my lab-mates on different projects: Long Qin, Matthew Marge, Yun-Nung Chen, Seshadri Sridharan, Zhenhao Hua and Justin Chiu. It has been a great experience and I have learned a lot from each of them. My work would not have been possible without the research assistants in my lab, Yulian Tamres-Rudnicky and Arnab Dash; we spent a long but fun time together in the lab collecting data. I also want to thank Thomas Schaaf, Ran Zhao, Jill Lehman and Albert Li for helping me prepare the defense talk, and the tutors at the CMU Global Communication Center for helping me revise the structure and language of this document.

I am grateful to my sponsors, the General Motors Advanced Technical Center (Israel) and YAHOO! Research. In particular, I want to thank Dr. Ute Winter, who funded part of my research, brought me to Israel for an internship, and has since been a great resource in planning my career.

Most of all, I want to thank my parents and my grandparents for believing that I would complete my PhD. They have had to develop so much independence at their age while I am 7,000 miles away from home. Without their constant love and understanding, I could not have come this far. Last but definitely not least, I want to thank Fei Lu, my fiancée, and her family for their support. This dissertation would not have been possible without Fei's constant supervision. Thank you for having faith in me and letting me pursue my dream. Thank you for tolerating me, not only for the past two years but for many more to come. This dissertation is as much yours as it is mine.

Contents

1 Introduction
  1.1 Thesis Statement
  1.2 Thesis Contributions
2 Lexicon Adaptation
  2.1 Introduction
  2.2 Detect-and-Learn Out-of-vocabulary Words
    2.2.1 Related Work
    2.2.2 Method
    2.2.3 Experiments
    2.2.4 Results
    2.2.5 OOV Learning in Dialog
    2.2.6 Detect-and-learn Summary
  2.3 Expect-and-Learn Out-of-vocabulary Words
    2.3.1 Related Work
    2.3.2 Lexicon Semantic Relatedness
    2.3.3 OOV Learning Procedure
    2.3.4 Experiments
    2.3.5 Expect-and-Learn Summary
  2.4 Conclusion
  2.5 Possible extensions
    2.5.1 Combination of two approaches
    2.5.2 Topic-level similarity in expect-and-learn
    2.5.3 OOV learning for dialog systems
3 Cloud ASR Adaptation
  3.1 Introduction
  3.2 Experiment Setup
  3.3 Local Recognizer Comparison
  3.4 Combine Local Recognizer with Cloud-based Recognizer
  3.5 Required Data for Adaptation
  3.6 Conclusion
4 Intention Adaptation
  4.1 Introduction
  4.2 Related Work
  4.3 Data Collection
    4.3.1 App Recording Interface
    4.3.2 Task Annotation
    4.3.3 Task-Related Spoken Dialog
    4.3.4 Data Statistics
  4.4 Modeling User Intentions
    4.4.1 High-level Intentions
    4.4.2 Online Prediction
  4.5 Conclusion
  4.6 Possible extensions
    4.6.1 Data Gathering
    4.6.2 Sharing Context
    4.6.3 Large Scale Dialog Framework with Web Connectivity
5 Conclusion
Bibliography

List of Figures

1.1 Spoken Dialog System
2.1 OOV detection performance for different fragments
2.2 OOV recovery results on 5K task
2.3 OOV recovery results on 20K task
2.4 The OOV prediction performance across different resources
3.1 Data split on the training set for comparing local recognizers
3.2 Individual decoder performance
3.3 Data split for combining local recognizer with cloud recognizer
3.4 System performance on different amounts of adaptation data
4.1 Multi-domain dialog examples
4.2 Example of activities grouped based on location
4.3 Annotation interface without user annotation
4.4 Multi-app annotation example
4.5 Multi-domain dialog example
4.6 Histogram of number of annotation sessions
4.7 Histogram of number of dialogs
4.8 Histogram of number of utterances
4.9 Histogram of number of apps
4.10 Intention understanding and realization example
4.11 Example of RepSeq with three app sequences
4.12 Evaluation of the neighbor-based intention models
4.13 Evaluation of the cluster-based intention models
4.14 Key phrases examples
4.15 Example of the agent learning a recurrence of a task
4.16 Example of the agent learning a new task
4.17 Illustration of overlapping domain knowledge

List of Tables

2.1 Sentences with OOVs
2.2 Candidate questions to elicit OOVs from the user
2.3 Oracle OOV discovery capability
2.4 Examples of related words in different web resources
2.5 Examples of salvaged OOVs and corresponding source IVs