Information State Based Speech Recognition
GOTHENBURG MONOGRAPHS IN LINGUISTICS 41

Information State Based Speech Recognition

Rebecca Jonson

Doctoral dissertation in Linguistics, University of Gothenburg

© Rebecca Jonson 2010
Printed by Reprocentralen, University of Gothenburg, 2010.
Dissertation edition, May 2010
Published at: http://hdl.handle.net/2077/22169
Distribution: Institutionen för filosofi, lingvistik och vetenskapsteori, Göteborgs Universitet, Box 200, 405 30 Göteborg

Abstract

Ph.D. dissertation in Linguistics at the University of Gothenburg, Sweden, 2010
Title: Information State Based Speech Recognition
Author: Rebecca Jonson
Language: English
Department: Department of Philosophy, Linguistics and Theory of Science, University of Gothenburg, Box 200, SE-405 30 Göteborg
Series: Gothenburg Monographs in Linguistics 41
Published at: http://hdl.handle.net/2077/22169

One of the pitfalls of spoken dialogue systems is the brittleness of automatic speech recognition (ASR). ASR systems often misrecognize user input, and they are unreliable when it comes to judging their own performance. Recognition failures and deficient confidence estimation affect the performance of a dialogue system as a whole and the impression it makes on the user. Humans outperform ASR systems on most tasks related to speech understanding. One of the reasons is that humans make use of much more knowledge; for example, they appear to take a variety of knowledge-based aspects of the current dialogue into account when processing speech. The main purpose of this thesis is to investigate whether speech recognition can also benefit from higher-level knowledge sources and dialogue context when used in spoken dialogue systems. One of the major contributions of this thesis is to provide more insight into which knowledge sources in spoken dialogue systems are potential contributors to the ASR task and how such knowledge can be represented computationally. In the framework of information state based dialogue management, the information state constitutes an important source of semantic and pragmatic knowledge. We investigate whether this knowledge can help to alleviate the search problem and improve reliability estimation in speech recognition. We call this knowledge- and context-aware approach to speech recognition information state based speech recognition.

The first part of the thesis investigates approaches to obtaining better initial language models more rapidly for spoken dialogue systems, as well as ways of dynamically selecting the most appropriate models based on the dialogue context. The second part concerns the use of the speech recognition output and investigates how additional knowledge sources can enhance a dialogue system's decision-making on how to proceed with and make use of speech recognition hypotheses. The thesis presents several experimental studies addressing these issues and proposes an integration of the explored techniques into the GoDiS dialogue system.

Keywords: dialogue systems, speech recognition, language modelling, dialogue move, dialogue context, ASR, higher level knowledge, linguistic knowledge, N-Best re-ranking, confidence scoring, confidence annotation, information state, ISU approach.

Acknowledgements

A long journey is coming to an end and it is time to express my gratitude to everyone travelling with me along the way. This journey would not even have started had it not been for my supervisor Prof.
Robin Cooper, who was the first to encourage me to become a researcher and gave me the opportunity to carry out my PhD studies remotely from Spain. Moreover, without Robin holding the compass and helping me to navigate I would probably be lost on the seven seas. Thanks for your guidance, support, patience and never-ending encouragement. Thanks for reading and re-reading every single paragraph of the manuscript. Also my deep gratitude for contributing to this thesis with your perspicaciousness, brilliant ideas and insightful comments.

In fact, the inspiration for this voyage started long before my PhD studies, with my first encounter with speech recognition and dialogue systems. Thanks to Prof. Jim Hieronymous and Prof. Luis Hernández for introducing me to the world of speech recognition and to Staffan Larsson for evoking my interest in dialogue systems and introducing me to the GoDiS “club”. The research questions in this thesis started to grow while working at Telefónica R&D (TID). I am therefore sincerely grateful to my dear TID colleagues for all the interesting discussions. Gracias!

This thesis has clearly benefitted from all the constructive feedback and insightful comments from my second supervisor Rolf Carlson, from Oliver Lemon (Chapter 8), from Steve Young and Karl Weilhammer (Chapter 4 and Chapter 7), from Joakim Nivre (Chapter 4, Chapter 6 and Chapter 8), from Stina Ericsson and from David Hjelm, as well as from all the valuable comments from anonymous reviewers of my published papers.

For a voyage to reach its end it is also necessary to have the machinery working. A big thank you to Robert Andersson for being a brilliant remote system administrator, keeping the boat's engine healthy and up to date. Warm thanks to the Gothenburg Dialogue lab people (Aarne, Ann-Charlotte, Björn, David, Håkan, Jessica, Peter, Staffan, Stina) who have contributed in many different ways and given me a helping hand with GF and GoDiS whenever needed. A special thanks to David Hjelm for your technical help, for inspiring discussions about speech and dialogue, and for being a brilliant colleague both at the university and at Artificial Solutions. I also owe a great deal to all the students and colleagues who participated in the experiments.

An asset when travelling is all the people you meet. I am so lucky to have had the opportunity, thanks to GSLT1, to get to know all the amazing GSLT PhD students. I am also happy to have met so many wonderful colleagues, both at the Department of Linguistics and at Artificial Solutions. I am sincerely grateful to have had the opportunity to participate in the TALK project2, to meet so many interesting people sharing the same research interests, and to receive funding for my work and for the conference travels to present it.

This journey would not have been possible without the support of my family and friends. Thanks to my parents for always believing in me and encouraging whatever journey I decide to embark on. Besos to Bella for being an extraordinary sister. Kramar to my brother Joakim and his wonderful kids. Lots of love to my everlasting friends (Annica, Kristin, Magdalena, Mira, Sofia, Therese) for being so close although we are so far away. My deepest thanks to Oscar for being the best possible travel mate and making every new day a thrilling adventure. Finally, thanks in advance to everyone who has the intention of reading parts of this thesis. Enjoy the journey!
1. The Swedish Graduate School of Language Technology
2. TALK (Talk and Look, Tools for Ambient Linguistic Knowledge), IST-507802

Contents

1 Introduction  1
  1.1 Why is speech so difficult for dialogue systems?  2
  1.2 Research questions  5
  1.3 Thesis outline  7

I Preliminaries  9

2 Background  11
  2.1 A brief introduction to ASR  11
    2.1.1 Digital signal processing  13
    2.1.2 Acoustic modelling  14
    2.1.3 Language modelling  17
    2.1.4 Decoding  19
    2.1.5 The three fundamental ranges  21
  2.2 A brief introduction to spoken dialogue systems  22
    2.2.1 Spoken language understanding  23
    2.2.2 Dialogue management  26
    2.2.3 Natural language generation  28
    2.2.4 A brief historical background  29
  2.3 Evaluation metrics in ASR  30
    2.3.1 Perplexity  30
    2.3.2 Word error rate and sentence error rate  31
    2.3.3 Word error rate vs concept error rate  33
    2.3.4 Dialogue move error rate  33
    2.3.5 Significance testing  34
  2.4 Survey of attempts to compensate for ASR deficiencies in spoken dialogue systems  35
    2.4.1 Improving the front end  36
    2.4.2 Improving digital signal processing  39
    2.4.3 Improving acoustic modelling  42
    2.4.4 Improving language modelling  45
    2.4.5 Improving ASR hypotheses selection  55
    2.4.6 Improving confidence annotation  59
    2.4.7 Improvement on other levels  62
  2.5 Conclusions  63

3 Baseline systems  65
  3.1 The information state update approach  65
  3.2 The TrindiKit platform for dialogue system development  66
  3.3 The GoDiS dialogue system  67
    3.3.1 GoDiS information state  67
    3.3.2 GoDiS dialogue moves  70
    3.3.3 GoDiS plans and accommodation  71
    3.3.4 GoDiS rule format  73
    3.3.5 GoDiS grounding behaviour  74
  3.4 The baseline GoDiS applications  78
    3.4.1 DJ-GoDiS: The MP3 player  78
    3.4.2 AgendaTalk: The talking calendar  80
  3.5 The TrindiKit logging format