Accent Features and Idiodictionaries
PhD Dissertation

Accent Features and Idiodictionaries: On Improving Accuracy for Accented Speakers in ASR

Michael Tjalve
Department of Phonetics and Linguistics
University College London
March 2007

Declaration

I, Michael Tjalve, confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the thesis.

Copyright

The copyright of this thesis rests with the author and no quotation from it or information derived from it may be published without the prior written consent of the author.

© 2007 Michael Tjalve

ABSTRACT

One of the most widespread approaches to dealing with the problem of accent variation in ASR has been to choose the most appropriate pronunciation dictionary for the speaker from a predefined set of dictionaries. This approach is weak in two ways: firstly that accent types are more numerous and more variable than can be captured in a few dictionaries, even if the knowledge were available to create them; and secondly, accents vary in the composition and phonotactics of the phone inventory not just in which phones are used in which word.

In this work, we identify not the speaker's accent, but accent features which allow us to predict by rule their likely pronunciation of all words in the dictionary. Any given speaker is associated with a set of accent features, but it is not a requirement that those features constitute a known accent. We show that by building a pronunciation dictionary for an individual, an idiodictionary, recognition accuracy can be improved over a system using standard accent dictionaries.

The idiodictionary approach could be further enhanced by extending the set of phone models to improve the modelling of phone inventory and variation across accents. However an extended phoneme set is difficult to build since it requires specially-labelled training material, where the labelling is sensitive to the speaker's accent.
An alternative is to borrow phone models of a suitable quality from other languages. In this work, we show that this phonetic fusion of languages can improve the recognition accuracy of the speech of an unknown accent.

This work has practical application in the construction of speech recognition systems that adapt to speakers' accents. Since it demonstrates the advantages of treating speakers as individuals rather than just as members of a group, the work also has potential implications for how accents are studied in phonetic research generally.

ACKNOWLEDGEMENTS

Many people have contributed to this PhD in many different ways. First and foremost, I would like to thank my research supervisor Dr. Mark Huckvale for academic guidance and for many exciting discussions. I have enjoyed how he has challenged me and my ideas. It has been a privilege to work closely with one of the true experts in the field of accent variation in speech technology.

I would like to thank Infinitive Speech Systems and Visteon Corporation for sponsoring my research. The speech data, phonetic data and the speech recogniser used in the majority of the experiments described on the pages below belong to Infinitive Speech Systems, and without this agreement, it would not have been possible to complete the research within the timeframe I had planned.

I would like to thank Dr. Maha Kadirkamanathan, without whom this PhD would not even have begun. He has been my industrial mentor within speech technology and he is one of the few persons I have met on the commercial side of the speech community who truly understands the potential of academia and the industry working together.

I would also like to thank Darren, Omri, Mark, Paul, Bernard and Gary at Infinitive Speech Systems for many good discussions, ideas and help.

I am grateful to Dr. Roger Moore for giving me assistance with the Accents of the British Isles corpus.
Also thanks to Simon King and the Centre of Speech Technology Research at University of Edinburgh for making the Unisyn dictionary available.

Thanks to Mads Torgersen for very valuable proofreading and good suggestions.

I would like to thank my mother, Merete Tjalve, for always believing in me and for taking an important part in making me who I am. I would like to thank my father, Eskild Tjalve, for inspiring scientific thinking and for showing the value of thinking outside of the box.

Finally, I would like to thank my amazing life companion - my wife Rocío - and our wonderful children, Isabella and Oliver, for bearing up with me spending time on this project of mine. It has been a fascinating journey but it has also been a long one and I could not have done it without your support and understanding. Thank you for always being there. Jeg elsker jer!

TABLE OF CONTENTS

Abstract
Acknowledgements
Table of Contents
1 Introduction
1.1 Aims and overview of thesis
1.2 Scope of research
1.3 General notes about the experiments
1.3.1 The speech data
1.3.2 The ASR engine
1.3.3 The pronunciation dictionary
2 Accent Variation and ASR
2.1 Introductory remarks
2.2 Accent variation
2.3 The mechanics of an ASR engine
2.3.1 The acoustic signal
2.3.2 The front-end
2.3.3 The back-end
2.3.3.1 The acoustic models
2.3.3.2 The pronunciation dictionary
2.3.3.3 The grammar
2.3.4 The response
2.4 Why accent variation is a problem to ASR engines
2.5 Summary and discussion
3 Phonetics and Phonology in ASR
3.1 Introductory remarks
3.2 Phonetic and phonological variation
3.3 Phonetic and phonological representation
3.3.1 The phoneme set
3.3.2 The acoustic models
3.3.3 The pronunciation dictionary
3.4 Summary and discussion
4 Accent Variation Modelling
4.1 Introductory remarks
4.2 The acoustic models
4.2.1 Training of the acoustic models
4.2.1.1 Details of the experiment
4.2.1.2 Findings
4.2.2 Speaker adaptation of the acoustic models
4.2.2.1 Details of the experiment
4.2.2.2 Findings
4.3 The pronunciation dictionary
4.3.1 The canonical dictionary