Rapid Generation of Pronunciation Dictionaries for New Domains and Languages
Total Page:16
File Type:pdf, Size:1020Kb
Rapid Generation of Pronunciation Dictionaries for new Domains and Languages Zur Erlangung des akademischen Grades eines Doktors der Ingenieurwissenschaften von der Fakultät für Informatik des Karlsruher Instituts für Technologie (KIT) genehmigte Dissertation von Tim Schlippe aus Ettlingen Tag der mündlichen Prüfung: 10.12.2014 Erste Gutachterin: Prof. Dr.-Ing. Tanja Schultz Zweite Gutachterin: Prof. Marelie Davel Acknowledgements The road to a Doctorate is long and winding, and would be impossible were it not for the people met along the way. I would like to gratefully acknowledge my primary supervisor Prof. Tanja Schultz. She believed in my research and supported me with many useful discussions. Her great personality and excellent research skills had a very strong effect on my scientific career. All the travels to conferences, workshops and project meetings which are one of the best experiences in my life would be not possible without her support. Tanja and my co-advisor Prof. Marelie Davel also deserve thanks for reading through my dissertation to provide constructive criticism. It was very kind of Marelie to take the long trip from South Africa to Karlsruhe to participate in the dissertation committee. Special thanks to Dr. Stephan Vogel, Prof. Pascale Fung, Prof. Haizhou Li, Prof. Florian Metze, and Prof. Alan Black for their support regarding research visits for me and my students and for very close and beneficial collaboration. Furthermore, I would like to thank all my friends and colleagues at the Cog- nitive Systems Lab for a great time. Their support is magnificent. Thanks to Ngoc Thang Vu, Michael Wand, Matthias Janke, Dominic Telaar, Dominic Heger, Christoph Amma, Christian Herff, Felix Putze, Jochen Weiner, Mar- cus Georgi, Udhyakumar Nallasamy, Dirk Gehrig for many great travel ex- periences and lovely activities after work. Moreover, thanks to Helga Scherer and Alfred Schmidt for their support. Even he did not work at CSL and left the Karlsruhe Institute of Technology in 2009, Prof. Matthias Wölfel was always an inspiring example. Thanks to all students I have supervised in their master, diploma, bachelor theses and student research projects for their collaboration and encouragement. Finally, I am grateful for the patience and support of my family and friends. Especially, without my father Peter Schlippe who always supported me, my career would have never been possible. Sarah Kissel, Johannes Paulke and Norman Enghusen – thank you for always beeing there for me. Summary Automatic speech recognition allows people an intuitive communication with machines. Compared to the keyboard and mouse as input modalities, au- tomatic speech recognition enables a more natural and efficient way where hands and eyes are free for other activities. Applications, such as dicta- tion systems, voice control, voice-driven telephone dialogue systems and the translation of spoken language, are already present in daily life. However, automatic speech recognition systems exist only for a small fraction of the 7,100 languages in the world [LSF14] since the development of such systems is usually expensive and time-consuming. Therefore, porting speech technology rapidly to new languages with little effort and cost is an important part of research and development. Pronunciation dictionaries are a central component for both automatic speech recognition and speech synthesis. They provide the mapping from the ortho- graphic form of a word to its pronunciation, typically expressed as a sequence of phonemes. This dissertation will present innovative strategies and methods for the rapid generation of pronunciation dictionaries for new domains and languages. De- pending on various conditions, solutions are proposed and developed. – Start- ing from the straightforward scenario in which the target language is present in written form on the Internet and the mapping between speech and written language is close up to the difficult scenario in which no written form for the target language exists. We embedded many of the tools implemented in this work in the Rapid Language Adaptation Toolkit. Its web interface is publicly accessible and allows people to build initial speech recognition systems with little technical background. Furthermore, we demonstrate the potential of the developed methods in this thesis by many automatic speech recognition results. ii Summary The main achievements of this thesis for languages with writing system The main achievements of this thesis providing solutions for languages with writing system, can be summarized as follows: Rapid Vocabulary Selection: Strategies for the collection of Web texts (Crawling) help to define the vocabulary of the dictionary. It is important to collect words that are time- and topic-relevant for the domain of the automatic speech recognition system. This includes the use of information from RSS Feeds and Twitter which is particularly helpful for the automatic transcription of broadcast news. Rapid Generation of Text Normalization Systems: Text Normaliza- tion transfers writing variations of the same word into a canonical represen- tation. For rapid development of text normalization systems at low cost we present methods where Internet users generate training data for such systems by simple text editing (Crowdsourcing). Analyses of Grapheme-to-Phoneme Model Quality: Except for lang- uages with logographic writing systems, data-driven grapheme-to-phoneme (G2P) models trained from existing word-pronunciation pairs can be used to directly provide pronunciations for words that do not appear in the dic- tionary. In analyses with European languages we demonstrate that despite varying grapheme-to-phoneme relationships word-pronunciations pairs whose pronunciations contain 15k phoneme tokens are sufficient to obtain a high consistency in the resulting pronunciations for most languages. Automatic Error Recovery for Pronunciation Dictionaries: The man- ual production of pronunciations, e.g. collecting training data for grapheme- to-phoneme models, can lead to flawed or inadequate dictionary entries due to subjective judgements, typographical errors, and ’convention drift’ by multi- ple annotators. We propose completely automatic methods to detect, remove, and substitute inconsistent or flawed entries (filter methods). Web-based Tools and Methods for Rapid Pronunciation Creation: Crowdsourcing-based approaches are particularly helpful for languages with a complex grapheme-to-phoneme correspondence when data-driven approaches decline in performance or when no initial word-pronunciation pairs exist. Analyses of word-pronunciation pairs provided by Internet users on Wik- tionary (Web-derived pronunciations), demonstrate that many pronuncia- tions are available for languages of various language families – even for flex- ions and proper names. Additionally, Wiktionary’s paragraphs about the Summary iii word’s origin and language help us to automatically detect foreign words and produce their pronunciations with separate grapheme-to-phoneme models. In the second crowdsourcing approach a tool Keynounce was implemented and analyzed. Keynounce is an online game where users can proactively produce pronunciations. Finally, we present efficient methods for a rapid and low-cost semi-automatic pronunciation dictionary development. These are especially helpful to minimize potential errors and the editing effort in a human- or crowdsourcing-based pronunciation generation process. Cross-lingual G2P Model based Pronunciation Generation: For the production of the pronunciations we investigate strategies using grapheme- to-phoneme models derived from existing dictionaries of other languages, thereby substantially reducing the necessary manual effort. The main achievements of this thesis for languages without writing system The most difficult challenge is the construction of pronunciation dictionaries for languages and dialects, which have no writing system. If only a recording of spoken phrases is present, the corresponding phonetic transcription can be obtained using a phoneme recognizer. However, the resulting phoneme sequence does not contain any information about the word boundaries. This thesis presents procedures where exploiting the written translation of the spo- ken phrases helps the word discovery process (human translations guided lan- guage discovery). The assumption is that a human translator produces utter- ances in the (non-written) target language from prompts in a resource-rich source language. The main achievements of this thesis aimed at languages without writing system, can be summarized as follows: Word Segmentation from Phoneme Sequences through Cross-lingual Word-to-Phoneme Alignment: We implicitly define word boundaries from the alignment of words in the translation to phonemes. Furthermore, our procedures contain an automatic error compensation. Pronunciation Extraction from Phoneme Sequences through Cross- lingual Word-to-Phoneme Alignment: We gain a pronunciation dictio- nary whose words are represented by phoneme clusters and the corresponding pronunciations by their phoneme sequence. This enables an automatic “in- vention” of a writing system for the target language. iv Summary Zero-resource Automatic Speech Recognition: Finally, we tackle the task of bootstrapping automatic speech recognition systems without an a priori given language model, pronunciation dictionary, or transcribed speech data for the target language — available are only untranscribed speech and translations to resource-rich source languages of what