University of Pennsylvania ScholarlyCommons

IRCS Technical Reports Series
Institute for Research in Cognitive Science

February 1998

Pronunciation Modeling In Speech Synthesis

Corey Andrew Miller
University of Pennsylvania


Miller, Corey Andrew, "Pronunciation Modeling In Speech Synthesis" (1998). IRCS Technical Reports Series. 55. https://repository.upenn.edu/ircs_reports/55

University of Pennsylvania Institute for Research in Cognitive Science Technical Report No. IRCS-98-09.


PRONUNCIATION MODELING IN SPEECH SYNTHESIS

Corey Andrew Miller

A DISSERTATION

in

Linguistics

Presented to the Faculties of the University of Pennsylvania in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

1998

____________________________
Mark Liberman
Supervisor of Dissertation

____________________________
George Cardona
Graduate Group Chairperson

© COPYRIGHT

Corey Andrew Miller

1998

DEDICATION

To Jonathan Connett.

ACKNOWLEDGMENTS

I am very pleased to have had the encouragement and support of a committee of three linguists for whom I have the greatest respect and admiration: Mark Liberman, William Labov and Eugene Buckley. Each of them made my transition back to Penn pleasant after what seemed like a long absence. It was a great pleasure to have Mark Randolph both as an external reader and as a colleague at Motorola. Mark’s work at MIT a decade ago has served as an inspiration to me. Orhan Karaali made this dissertation possible in this millennium. As my manager for over two years at Motorola, Orhan insisted on making my dissertation a priority at work. Harry Bliss provided his voice to this project and our whole group is very grateful for his patience and cooperation. My colleagues at Motorola listened to my ideas and provided technical and theoretical assistance at every turn: Noel Massey, Jerry Corrigan, William Thompson, Andrew Mackie, Erica Zeinfeld, Otto Schnurr, Lea Adams, Michael Murdock, Joseph Goldberg and Lynette Melnar. My entire management was supportive in this effort: Ira Gerson, Kevin Kloker and James Mikulski. D. J. Stockley made the pursuit of intellectual property a positive learning experience. My colleagues at my former workplace, Franklin Electronic Publishers, provided a great environment as I got my feet wet in the industrial world: Will Dowling, David Justice and Mike Wolff. I thank Matt Lennig and Kristin Precoda for introducing me to speech technology and software development at BNR. My linguist friends and colleagues at Penn and other schools provided support and encouragement while I lived in Philadelphia and afterwards: Christine Zeller, Tom Veatch, Peter Slomanson, Fabiola Varela-García, Mary O’Malley, Bill Reynolds, Hadass Sheffer, Stephanie Strassel, Nadia Biassou, Lisa Lavoie, Hikyoung Lee, Alex Dimitriadis, and Jason Eisner. I thank Brian Sietsema, Pat Keating, Sean Fulop and Howard Nusbaum for providing valuable advice and information. My mother always believed that this dissertation would become a reality, and to her I owe inexpressible gratitude for providing encouragement at the more difficult moments of my graduate career. The thought of the rest of my family including my father, Stephanie, Ken, Jordan, Avery, Lee, Elihu, and my late grandparents has always kept me going. Thank you Jonathan for paying the bills, arranging exotic vacations and, most of all, hanging in there. Finally, I would like to thank the Summer Institute of Linguistics for their excellent phonetic fonts, which are freely distributed on the World Wide Web.

ABSTRACT

PRONUNCIATION MODELING IN SPEECH SYNTHESIS

Corey Andrew Miller

Mark Liberman

This dissertation investigates the area of pronunciation modeling in speech synthesis. By pronunciation modeling, we mean architectures and principles for generating high-quality human-like pronunciations. The term pronunciation modeling has previously been applied in the context of speech recognition (e.g. Byrne et al. 1997). In that context, it describes theories and procedures for handling the pronunciation variation that naturally occurs across speakers. In contrast, our work is in the domain of text-to-speech synthesis, which, as we will show, requires modeling the pronunciation variation of an individual whose speech the synthesizer is attempting to model. We will explain our methodology for learning and reproducing pronunciation variation on an individual basis, and show how most crucial features of such variation can be easily generated using the architecture we describe. Throughout the course of this exposition, we highlight contributions to linguistic theory that such a thorough analysis of individual variation provides. We describe the postlexical module of an English text-to-speech synthesizer. This module is responsible for transforming underlying lexical pronunciations from a lexical database into contextually appropriate surface postlexical pronunciations. This transformation is achieved by machine learning of a corpus of hand-labeled postlexical pronunciations that have been aligned with lexical pronunciations. The machine learning is conducted by a neural network, whose architecture and data encoding we describe. A thorough analysis of the performance of the postlexical module is offered, with attention to the relative success of the neural network at learning a wide range of postlexical phenomena. We examine the extent to which a symbolic approach to allophony is warranted, and provide an acoustic analysis that attempts to provide an answer to this question. Assessments of the success of currently existing theories of phonetics, phonology and their interface are offered, based on the experience of generating a complete postlexical phonology of English for use in synthetic speech.

TABLE OF CONTENTS

Chapter 1. Introduction
  1.1. What is speech synthesis pronunciation modeling?
  1.2. Computational phonology and speech technology
    1.2.1. Linguistic aspects of speech synthesis
    1.2.2. Review of prior work in postlexical modeling
    1.2.3. Review of neural networks in phonology
  1.3. General phonological issues
    1.3.1. Phonetics-phonology interface
    1.3.2. Lexical phonology and the interface between syntax, morphology and phonology
  1.4. Sociolinguistics/Variation
  1.5. Overview of dissertation
Chapter 2. Rationale for modeling postlexical variation
  2.1. Evaluation of synthetic speech
    2.1.1. Intelligibility
    2.1.2. Comprehensibility
    2.1.3. Acceptability
    2.1.4. Naturalness
    2.1.5. Case studies involving postlexical variation
  2.2. Approximation to training data
  2.3. Cross-dialectal comprehension
  2.4. Benefits of variability
Chapter 3. Data sources
  3.1. Lexical database
    3.1.1. Characteristics of source dictionaries
    3.1.2. Transcription consistency and simplification
  3.2. Labeled speech corpus
    3.2.1. Syntactic and prosodic labeling
    3.2.2. Levels of labeling
Chapter 4. Experimental approach to comparing gradient vs. discrete aspects of postlexical variation
  4.1. Acoustic neural network
  4.2. Experimental procedure
    4.2.1. Experiment on [ə]/[ɨ]
    4.2.2. Experiment on [u]/[”]
    4.2.3. Experiment on /ɑ/ and /ɔ/
    4.2.4. Experiment on /o/ and /i/
  4.3. Conclusions from acoustic analyses of allophony
Chapter 5. Methods for learning segmental postlexical variation
  5.1. Creation of postlexical training materials
    5.1.1. Alignment of lexical and postlexical phones
    5.1.2. Creation of postlexical training database
  5.2. Characterization of learning problem
  5.3. Neural network architecture
  5.4. Data encoding
    5.4.1. Features for lexical phones
    5.4.2. Stress
    5.4.3. Syntactic and prosodic information
    5.4.4. Windowing
Chapter 6. Results
  6.1. Analysis of neural network
  6.2. General phonological analysis
  6.3. General error analysis
  6.4. Allophony
    6.4.1. Vowel fronting
    6.4.2. Glottalization of vowels
    6.4.3. Coronal allophones
  6.5. Vowel reduction in function words
  6.6. Dialect
Chapter 7. Conclusion
Appendix
References

LIST OF TABLES

Table 1-1: Performance on two speech recognition tasks
Table 1-2: Factors contributing to varying realization of phones in Switchboard
Table 1-3: Nespor and Vogel’s (1986) prosodic hierarchy
Table 2-1: Mixed results on successive comprehension measures
Table 2-2: Mute e in French
Table 3-1: Hypothetical example of inconsistent pronunciation across morphological paradigms
Table 3-2: Phone sets for three pronunciation dictionaries
Table 3-3: Stress in three dictionaries
Table 3-4: Materials in the labeled speech corpus
Table 3-5: Postlexical phones used in labeled corpus
Table 3-6: Prominence rankings
Table 3-7: Barry and Fourcin’s (1992) Labeling Typology
Table 3-8: Allophones in TIMIT labeling system
Table 4-1: Phonemicity and allophony in speech synthesis data sources
Table 4-2: Allophone quantities in testing subset
Table 4-3: Hypothesis tests for [ə]/[ɨ] experiment
Table 4-4: Euclidean distances between original speech and normal and collapsed conditions
Table 4-5: Hypothesis tests for [u]/[”] experiment
Table 4-6: Euclidean distances between original speech and normal and collapsed conditions
Table 4-7: Hypothesis tests for /ɑ/ and /ɔ/ experiment
Table 4-8: Euclidean distances between original speech and normal and collapsed conditions
Table 4-9: Hypothesis tests for experiment on /o/ and /i/
Table 4-10: Euclidean distances between original speech and normal and collapsed conditions
Table 4-11: Inter-transcriber reliability at two locations
Table 5-1: Unaligned lexical and postlexical phones
Table 5-2: Aligned lexical and postlexical phones
Table 5-3: Sequence comparison of two orthographies
Table 5-4: Operations required to transform industry to interest
Table 5-5: Unaligned lexical and postlexical phones without pseudophones
Table 5-6: Postlexical pseudophones
Table 5-7: Aligned lexical and postlexical phones with pseudophones
Table 5-8: Example of source insertion
Table 5-9: Cost table for lexical-postlexical alignment
Table 5-10: Fields in lexical-postlexical database
Table 5-11: Postlexical reflexes of each lexical phone
Table 5-12: Features for lexical phones
Table 5-13: Neural network block 4 input
Table 5-14: Neural network block 5 input
Table 5-15: Relative performance of postlexical neural network with three different window sizes
Table 6-1: Postlexical phenomena learned by postlexical neural network
Table 6-2: Dialect/labeling/lexical idiosyncrasies learned by network
Table 6-3: Summary of postlexical network results
Table 6-4: Allophonic errors
Table 6-5: Destressing errors
Table 6-6: Dialect errors
Table 6-7: t,d deletion by preceding phone
Table 6-8: Postlexical neural network performance at flapping lexical /t/
Table 6-9: /t/ glottalization by following phone
Table 6-10: /t/ glottalization by prosodic position
Table 6-11: /t/ glottalization by syllable accentuation
Table 6-12: Reduced and unreduced vowel-final function words before consonants and vowels
Table 6-13: Reduced and unreduced to before consonants and vowels in complete corpus
Table 6-14: Reduced and unreduced a depending on phrase-initial status in test data
Table 6-15: Reduced and unreduced a depending on phrase-initial status in complete data
Table 6-16: Vowel mergers before /r/
Table 6-17: Stem-final tensing in prefixes
Table 7-1: Comparison of lexical representation system with speech recognizer performance
Table A-1: TIMIT/IPA Correspondences

LIST OF ILLUSTRATIONS

Figure 1-1: Motorola speech synthesizer
Figure 1-2: Motorola synthesizer training scheme
Figure 1-3: Percentage of tokens transcribed using canonical pronunciations
Figure 3-1: Relational lexical database architecture
Figure 3-2: Speech labeling scheme
Figure 4-1: Acoustic Neural Network
Figure 4-2: Formant distribution of [ə] and [ɨ] in original speech
Figure 4-3: Formant distribution of [ə] and [ɨ] in normal neural network with allophonic labels
Figure 4-4: Formant distribution of [ə] and [ɨ] in collapsed neural network with single phonemic label used in training for both
Figure 4-5: Formant distribution of [u] and [”] in original speech
Figure 4-6: Formant distribution of [u] and [”] in normal neural network with allophonic labels
Figure 4-7: Formant distribution of [u] and [”] in collapsed neural network with single phonemic label used in training for both
Figure 4-8: Formant distribution of /ɑ/ and /ɔ/ in original speech
Figure 4-9: Formant distribution of /ɑ/ and /ɔ/ in normal neural network
Figure 4-10: Formant distribution of /ɑ/ and /ɔ/ in collapsed neural network
Figure 4-11: Phones following /ɑ/ and /ɔ/
Figure 4-12: Formant distribution of /o/ and /i/ in original speech
Figure 4-13: Formant distribution of /o/ and /i/ in normal neural network
Figure 4-14: Formant distribution of /o/ and /i/ in neural network with /i/ collapsed to /o/
Figure 5-1: Entropy of lexical phones
Figure 5-2: Number of postlexical reflexes of each lexical phone
Figure 5-3: Postlexical neural network
Figure 6-1: TDNN window weights for phone label stream
Figure 6-2: Weight distribution by input type
Figure 6-3: Distribution of schwa allophones
Figure 6-4: Distribution of /u/ allophones
Figure 6-5: Distribution of syllable-final /t/ allophones
Figure 6-6: /t/ allophones at intermediate phrase ends
Figure 6-7: /t/ allophones at intonational phrase ends
Figure 6-8: the allomorphy

Chapter 1. Introduction

1.1. What is speech synthesis pronunciation modeling?

This dissertation investigates the area of pronunciation modeling in speech synthesis. By pronunciation modeling, we mean architectures and principles for generating high-quality human-like pronunciations. Divay and Vitale (1997) have made the following prognostication for speech synthesis, “in the future, in text-to-speech systems, some segments and even syllables will disappear entirely and certain functors will be greatly attenuated.” We will describe methods for achieving this kind of speech synthesis.

The term pronunciation modeling has previously been applied in the context of speech recognition (e.g. Byrne et al. 1997). In that context, it describes theories and procedures for handling the pronunciation variation that naturally occurs across speakers. In contrast, our work is in the domain of text-to-speech synthesis, which, as we will show, requires modeling the pronunciation variation of an individual whose speech the synthesizer is attempting to model. This dissertation will show how properly modeling individual pronunciation variation is crucial to the production of synthetic speech that is comprehensible, natural and acceptable to listeners. We will explain our methodology for learning and reproducing pronunciation variation on an individual basis, and show how most crucial features of such variation can be easily generated using the architecture we describe. Throughout the course of this exposition, we highlight contributions to linguistic theory that such a thorough analysis of individual variation provides.

In this chapter, we first review the interactions of computational phonology and speech technology. We then explore other traditional phonological issues that this dissertation addresses. Subsequently, we discuss the inspiration that this work has drawn from sociolinguistic studies of variation in the phonological domain. We conclude the chapter with an overview of the rest of the dissertation.

1.2. Computational phonology and speech technology

This work may be situated in the developing field of computational phonology (see Bird 1995). One aspect of this field is the use of computers to test phonological theories on large datasets. Due to this orientation, computational phonology places an emphasis on the implementability of theoretical frameworks. This characteristic of computational phonology lends itself to employment in speech technology applications, where speed, clarity and descriptive adequacy are prized. Of course, computational phonology is not simply an application of phonology to real world speech technology problems. It is also engaged in the analysis of current phonological theory with respect to power, tractability and plausibility for human behavior (e.g. Kaplan and Kay 1994, Coleman 1995b).

The computational phonological techniques we will examine fall within the field of machine learning. These techniques can extract patterns from data that can then be applied to unseen data for application of the original processes. The machine learning technique that we will investigate most thoroughly is the artificial neural network. Neural networks are composed of multiple low-power processing elements that simulate, to some extent, the makeup of the human brain. Neural networks have been shown to provide insight into linguistic problems since the pioneering work of Rumelhart and McClelland (1986) on learning the past tense of English verbs and Sejnowski and Rosenberg (1987) on learning orthography-phonetics conversion.

An important aspect of many computational phonological techniques is their reversibility. For example, in both speech technology applications and models of human linguistic behavior, progressing in either direction between orthography and lexical pronunciation or lexical pronunciation and postlexical pronunciation is desired. If a computational technique has been worked out in one direction, the ease with which it can be applied to the reverse problem, such as between speech synthesis and speech recognition, is an indication of its usefulness and potentially, its explanatory value (e.g. Meng et al. 1996).

The reversibility of phonological processes is both a useful goal for engineering and a critical feature of human language processing. It would be surprising if our abilities to understand and produce speech used unrelated competencies and structures. The failure of theories of phonology to function reliably in a reversible manner has often been cited as a major criticism of those theories (Bird 1995, Coleman 1995a, Kaye 1989).

In this subsection, we will first explore the linguistic aspects of speech synthesis. We will then review attempts at postlexical modeling in both speech recognition and speech synthesis. Finally, we will examine the uses that have been made of neural networks in phonological investigation.

1.2.1. Linguistic aspects of speech synthesis

Speech synthesis refers to the creation of human-sounding speech by computer. Liberman (1994b) presents a taxonomy of the various forms of input to this process, such as telephone numbers, concepts or arbitrary text. We will be considering the kind of speech synthesis that takes computer-readable text and transforms it into a speech waveform. In order to perform this task accurately, the synthesis application needs to transform text into a linguistic representation that can then be converted into the acoustic domain. For example, if we want the synthesizer to pronounce the English word thought, we convert it into a phonological representation, such as /θɔt/, and then the synthesizer produces a waveform that (hopefully) sounds like /θɔt/.

[Footnote 1: We will be enclosing phonemic representations in slashes / /, and phonetic (allophonic) transcriptions in brackets [ ]. We will sometimes use slashes, as here, when the distinction is not important.]

There are a number of ways in which a synthesizer could be designed to produce an appropriate waveform for /θɔt/ and the other words that it is expected to pronounce. One common approach is concatenative synthesis, which takes acoustic subword units (e.g. phones, diphones, demidiphones, etc.) from a database of recorded speech and strings them together while smoothing the transitions between them.
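To make the concatenative idea concrete, the following toy sketch (Python with NumPy) strings together two subword units and smooths each join with a short linear crossfade. The unit names, sizes and the random "database" are invented for illustration; a real concatenative synthesizer selects units from a large recorded-speech inventory and uses considerably more careful smoothing.

    # Toy sketch of concatenative synthesis: join subword units with a crossfade.
    import numpy as np

    SAMPLE_RATE = 16000

    def crossfade_concat(units, overlap_ms=10):
        """Concatenate waveform units, blending each join with a linear crossfade."""
        overlap = int(SAMPLE_RATE * overlap_ms / 1000)
        out = units[0].astype(float)
        for unit in units[1:]:
            unit = unit.astype(float)
            fade_out = np.linspace(1.0, 0.0, overlap)   # tail of previous unit fades out
            fade_in = np.linspace(0.0, 1.0, overlap)    # head of next unit fades in
            join = out[-overlap:] * fade_out + unit[:overlap] * fade_in
            out = np.concatenate([out[:-overlap], join, unit[overlap:]])
        return out

    # Invented "database": random noise stands in for recorded diphone-like units.
    database = {name: np.random.randn(1600) for name in ["th-ao", "ao-t"]}
    waveform = crossfade_concat([database["th-ao"], database["ao-t"]])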

It has been found much more reasonable to store speech for units approximating the phoneme, rather than for words or letters. Storing examples of whole words would be prohibitive in terms of computer storage, given the number of words a general-purpose application is likely to encounter. In addition, such a system would be at a loss when confronted with novel words, such as names or neologisms. Storing speech examples based on letters would lead to inaccuracy, due to the irregularities of English spelling (consider through, bough, cough, enough).

[Footnote 2: However, AcuVoice (www.acuvoice.com/faq.html) claims to have a 152 megabyte “sound bank”.]

Consequently, units such as phonemes (or theory-neutral phones) are a common intermediary between text and speech. While the conception of text or words as strings of phonemes bundled with other information (such as stress and syllabification) is a common one that has proved useful to both linguistics and speech applications, numerous alternatives have been proposed. From the perspective of Optimality Theory, Golston (1996) proposes that morphemes be represented as constraint violations. Cohen (1995) introduces a non-symbolic phonetic notation derived from a self-organizing map (Kohonen 1988). Shillcock et al. (1993) employ a phonological system based on elements, using the apparatus of Government Phonology (Harris and Lindsey 1995).

In contrast to those approaches which attempt to model something smaller than the phoneme (in the sense that a molecule is smaller than a cell), Randolph (1989) advocates use of the syllable as the primary building block in a phonological system for speech recognition. While we will investigate the feasibility of some of these approaches to phonological representation, we will maintain the notion of the phoneme as a useful starting point.

All speech synthesizers build upon knowledge of human speech. However, synthesis methods can be situated along a continuum of knowledge-based to data-based systems. For example, Allen et al. (1987) describe synthesis by rule, where analyses of multiple speakers over many years gave the designers the ability to generate acoustic parameters by rule for various speech sounds. The resulting synthesis may resemble the speech of an individual whose voice was studied to learn appropriate acoustic parameters for various sounds; however, this is not necessarily the goal of such systems.

The comparison between knowledge- and data-oriented approaches is a recurrent theme in computational phonology and speech technology. For example, Daelemans, Gillis and Durieux (1994) contrast their own data-oriented or empiricist approach to the acquisition of stress, with various Principles and Parameters (Chomsky 1981) or nativist approaches, such as Dresher and Kaye (1990).

The advantage of data-oriented systems is that they can be applied to new languages and data without extensive revisions. Price (1996) provides an interesting commentary on the differing viewpoints of linguists and engineers who find themselves working together on natural language applications: engineers tend to look at symbolic systems, such as those employed by linguists, with some suspicion, while some linguists might at first be uncomfortable with statistical, data-driven approaches.

This dissertation will use an experimental speech synthesizer being developed at Motorola (Karaali et al. 1996, Gerson et al. 1996, Corrigan et al. 1997, Karaali et al. 1997, Karaali et al. forthcoming) as a test-bed for the ideas and experiments to be described below. The Motorola synthesizer (also known as MotorMouth) distinguishes itself from many other synthesizers through its extensive use of neural networks throughout the processing of text to speech (Karaali et al. 1998). Neural networks are used to perform machine learning of various aspects of the linguistic behavior of an individual whose voice is being modeled. The neural networks are trained on labeled samples of an individual’s speech. When a new synthetic voice is desired, the neural networks are retrained on samples of speech from the new voice. We will describe the various neural networks and the training procedure below.

Figure 1-1 is a graphic depiction of the synthesizer, indicating the different modules of which it is composed. In general, the linguistic module of the synthesizer is responsible for transforming orthographic text into a symbolic linguistic representation. The linguistic representation is then passed to an acoustic module that transforms it into acoustic parameters which can be synthesized into audible speech.

[Figure 1-1 is a block diagram of the synthesizer, with the following labeled components: Text; Text Analyzer; Lexical Database; Letter-to-Sound Conversion; Lexical Phonological Representation; Postlexical Module; Postlexical Phonological Representation; Duration Module; Phonetic Implementation Module; Speech.]

Figure 1-1: Motorola speech synthesizer

Sproat et al. (1998) use the term text analysis for many of the processes that take place inside the linguistic module. We divide the linguistic module into a text analyzer, letter-to-sound converter, postlexical module, and a duration module. The text analyzer first performs text preprocessing, which involves such operations as translating numbers and symbols into orthographic words. Text preprocessing might also involve the processing of e-mail headers, HTML codes and the ignoring of graphical objects.

Preprocessed text is then tokenized into word-sized units that are looked up in the lexical database. A disambiguation module (Karaali et al., forthcoming) assigns words parts of speech (e.g. Church 1988), and in some cases, semantic information (e.g. Yarowsky 1997), to help in selecting the proper pronunciation for homographs.

If a given word’s pronunciation is not present in the dictionary at all, a pronunciation is generated for that word by the letter-to-sound conversion module. Once lexical pronunciations have been determined for all the words of a given sentence, a lexical phonological representation is constructed. The lexical phonological representation is organized as a hierarchical prosodic phonological structure, including phones, syllables and phonological words, as well as higher-order constituents.
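The lookup-then-fallback behavior just described can be illustrated with a minimal sketch. The dictionary entries and the trivial stand-in for letter-to-sound conversion below are invented; the real system uses a large lexical database and a trained conversion module.

    # Minimal sketch: consult the lexicon first, fall back to letter-to-sound.
    LEXICON = {"thought": "T ao t", "cat": "k ae t"}   # toy entries, toy phone codes

    def letter_to_sound(word):
        # Stand-in for the letter-to-sound conversion module (not a real converter).
        return " ".join(word)

    def lexical_pronunciation(word):
        word = word.lower()
        if word in LEXICON:
            return LEXICON[word]          # pronunciation found in the dictionary
        return letter_to_sound(word)      # out-of-dictionary word: generate one

    print(lexical_pronunciation("thought"))   # from the dictionary
    print(lexical_pronunciation("zyzzyva"))   # falls back to letter-to-sound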

The lexical phonological representation is transformed and augmented to produce a postlexical phonological representation. In general, the postlexical module is responsible for transforming more or less phonemic, lexical pronunciations into postlexical pronunciations typical of the connected speech of the individual whose voice is being modeled. The modifications that take place in the postlexical module at present involve only segmental insertions, deletions and substitutions, such as flapping and t,d deletion. These transformations are achieved by means of a neural network that has been trained to learn the correspondences between the pronunciations used in the speech database, to be described below, and the pronunciations used in the lexical database.
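One way to picture the learning problem the postlexical network faces is as a per-phone classification task: each aligned lexical phone, together with a window of neighboring phones, predicts its postlexical realization. The sketch below constructs training examples in that spirit; the example alignment, phone labels and window size are invented for illustration, and the actual alignment, feature encoding and windowing are described in Chapter 5.

    # Sketch: turn an aligned lexical-postlexical sequence into (input, target) pairs.
    aligned = [                       # (lexical phone, postlexical phone)
        ("s", "s"), ("IH", "IH"), ("t", "dx"), ("IY", "IY"),   # "city": /t/ -> flap
    ]

    def training_examples(aligned_pairs, window=1):
        lex = [lp for lp, _ in aligned_pairs]
        for i, (_, post) in enumerate(aligned_pairs):
            left = lex[max(0, i - window):i]
            right = lex[i + 1:i + 1 + window]
            context = (["#"] * (window - len(left)) + left,     # pad at edges
                       right + ["#"] * (window - len(right)))
            yield (lex[i], context), post      # input features, target label

    for example in training_examples(aligned):
        print(example)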

In the future, the postlexical module will also be responsible for modeling the prosody of the sentence and larger units of discourse in symbolic terms, such as ToBI (Tones and Break Indices) labels (Beckman and Elam 1997). Ladd (1996, 5) makes explicit the notion of handling intonation in the postlexical phonology. We will not be describing the prosody generation aspects of the postlexical module in this dissertation; however, we will be discussing the use of prosodic information in determining segmental postlexical phenomena.

The phonological representation, having been modified to reflect postlexical phonology, is then submitted to a Duration Module (Corrigan et al. 1997), which assigns durations to each phone, according to the speech of the individual being modeled. The duration module uses a neural network that has been trained on postlexical phonological representations to determine appropriate contextual durations.

Finally, the postlexical phonological representation including constituent durations is submitted to a Phonetic Implementation (also known as Acoustic) Module (Karaali et al. 1997), which transforms the phonological representation into spectral parameters that are synthesized into a speech waveform. The acoustic module consists of a neural network that has been trained to learn the correspondences between postlexical phonological representations and spectral parameters in a hand-labeled speech database of a particular individual.

Essentially, the linguistic module’s responsibility in the runtime operation of the synthesizer is to recreate the kinds of linguistic representations that were applied to the speech that the acoustic neural network was trained on. This will assure that the synthesized speech most closely matches the speech of the individual who is recorded in the database, which is the ultimate goal of this kind of speech synthesis. Figure 1-2 summarizes the neural network training that is involved for each voice. Each of the neural networks trains on information acquired from a labeled speech database of the voice being modeled. In fact, the label files are analyzed to form a postlexical phonological representation analogous to the one created from text input described above.

A figure of merit, besides ultimate voice quality, that can be used to evaluate development progress on the synthesizer, is the fidelity with which postlexical phonological representations derived from text match those derived from speech database label files.

[Footnote 3: Wightman and Talkin (1997), Ljolje et al. (1997) and Vorstermans and Martens (1994) discuss various automatic labeling schemes.]

[Footnote 4: The concept of voice fonts that preserve individual speech characteristics has recently caught the public imagination: Andrew Pollack, “Sound Bites and Then Some: Computers May Soon Use Your Voice to Say Anything”, New York Times, 21 April 1997, Business Day, p. 1.]

[Figure 1-2 is a diagram with the following labeled components: Labeled Speech Database; Postlexical Phonological Representation; Postlexical Module, Duration Module and Acoustic Module.]

Figure 1-2: Motorola synthesizer training scheme

If a new voice is desired, there are two requirements for customizing the acoustic module. First, a sufficiently sized speech database of the new voice must be labeled in the manner required by the synthesizer. Second, a neural network must be trained to learn the correspondence between the linguistic labels and the spectral parameters of the particular database.

If the language spoken by a new voice has already been handled in the past, the modules that create the lexical phonological representation from text may not require any changes; otherwise, a new dictionary and preprocessing module will be required. In the case of each new voice, the postlexical module’s neural network is retrained to learn the correspondences between the pronunciations used in the current voice’s speech database and the pronunciation dictionary. In this way, the postlexical module can account for three sources of potentially idiosyncratic behavior: that found in dictionary transcriptions, that found in the speech database labeling, and that found in the habits of the particular speaker.

Finally, the duration module is retrained so that durations characteristic of the new voice are learned. It is believed that by recreating not only the linguistic habits of the speaker but also the indexical properties of his or her voice, the synthesizer’s intelligibility will be enhanced. Pisoni (1993) reports on the positive contribution of such indexical voice properties to intelligibility.

The integration of neural networks all the way from text to speech is a hallmark of the system (Karaali et al. 1998), and the success that it has may be attributable to the faithfulness with which it attempts to model human behavior. In addition, due to the potential reversibility of the networks of which it is composed, the system has promise for adaptation to some of the tasks of speech recognition.

[Footnote 5: Of course, certain dialect differences, such as those between British and American English, might well necessitate a new dictionary, or even new sentence tokenization routines: consider British “Mr Jones” vs. American “Mr. Jones”.]

1.2.2. Review of prior work in postlexical modeling

There are at least three dimensions along which research in postlexical modeling can be described: direction, technology and perspective. Direction involves whether lexical forms are being transformed into postlexical forms, as required in speech synthesis, or whether lexical forms are being determined on the basis of postlexical forms, as in speech recognition. In other words, the problem for speech synthesis is to determine, based on a lexical form, the most appropriate postlexical form, whereas in speech recognition, the most likely lexical form must be selected for a captured postlexical form.

Technology involves the mechanisms employed by the researchers, such as neural networks, finite state machines or decision trees. Perspective involves whether the enterprise has been undertaken by linguists or psycholinguists interested in human language capabilities or speech technologists working in automatic speech recognition or speech synthesis. We will arrange our review of postlexical modeling by discussing synthesis (lexical-postlexical) and then recognition (postlexical-lexical) approaches. Within each approach, we will discuss the technologies and perspectives employed. Finally, we present a section on phonetic studies that have analyzed the postlexical variation in corpora without a particular speech technology application in mind.

Williams (1994) developed a computational implementation of lexical phonology and tested it by trying to generate output forms along a stylistic continuum in both English and Spanish. In the case of English, her system is based on the description of lexical phonology provided by Halle and Mohanan (1985), and Booij and Rubach (1987). Williams achieved what she described as more casual speech by progressively moving rules to later strata. She acknowledges that some of the forms generated are not actually attested in reports on casual speech.

Gildea and Jurafsky (1996) describe methods for automatically inducing finite state transducers to perform lexical-postlexical conversion from data. They artificially introduced postlexical phenomena, such as flapping, into a lexical database, and then attempted to induce transducers from the converted data. A first attempt failed because important linguistic information was not provided to the algorithm. A subsequent attempt is shown to succeed after the introduction of linguistic knowledge. Gildea and Jurafsky (1996) claim that their model represents an alternative between purely nativist and purely empiricist approaches, in that it attempts to represent prior knowledge as a set of learning biases that guide an empirical learning algorithm. They note that the work of Riley (1991) and Withgott and Chen (1993) is similar in this regard (see below).

Despite the introduction of linguistic knowledge, Gildea and Jurafsky claim that their approach is not necessarily nativist; rather, they suggest that “assuming in the system some very general and fundamental properties of phonological knowledge (whether innate or previously learned) and learning others empirically may provide a basis for future learning models” (1996, 499). This characterizes the postlexical learning methods discussed in this dissertation as well: we provide the postlexical neural network with various phonological and prosodic information because we suspect that the network will find it useful, but we do not tell the network how to make use of it.

Gildea and Jurafsky group the learning biases they introduce as follows: faithfulness, community and context. Faithfulness refers to input-output correspondence, in this case between underlying and surface forms. They achieve faithfulness by aligning input and output phones using dynamic programming. Community refers to the similarities between segments that can be expressed by distinctive features. Context refers to a knowledge of preceding and following phonological environments.
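Faithfulness via alignment can be made concrete with a standard dynamic-programming (edit-distance) alignment between an underlying and a surface phone string, in the spirit of the procedure described above and of the lexical-postlexical alignment used later in this dissertation. The costs and the example strings below are illustrative only.

    # Minimal dynamic-programming alignment cost between two phone strings.
    def align(source, target, sub_cost=1, indel_cost=1):
        n, m = len(source), len(target)
        # dist[i][j] = cheapest way to turn source[:i] into target[:j]
        dist = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            dist[i][0] = i * indel_cost
        for j in range(1, m + 1):
            dist[0][j] = j * indel_cost
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                match = dist[i - 1][j - 1] + (0 if source[i - 1] == target[j - 1] else sub_cost)
                dist[i][j] = min(match,
                                 dist[i - 1][j] + indel_cost,   # deletion
                                 dist[i][j - 1] + indel_cost)   # insertion
        return dist[n][m]

    # Underlying /s IH t IY/ vs. surface [s IH dx IY] ("city" with a flap): one substitution.
    print(align(["s", "IH", "t", "IY"], ["s", "IH", "dx", "IY"]))   # -> 1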

Gildea and Jurafsky aver that the phonological example they elaborate is meant to suggest the kinds of biases that may be added to empiricist induction models, rather than to serve as a practical phonological learning device. They suggest that stochastic algorithms for language induction are much more likely to be a fruitful research direction due to the noise and nondeterminism inherent to linguistic data.

Albano and Moreira (1996) discuss some of the issues involved in selecting appropriate lexical and postlexical pronunciations for a Portuguese speech synthesizer. Since Portuguese phonemics is straightforwardly derivable from the language’s orthography, an exception dictionary played a fairly small role in letter-to-sound conversion. Albano and Moreira chose an underspecified phonemic representation as the first stage of output of their letter-to-sound procedure, rather than a postlexical representation more characteristic of surface phonetic phenomena. One of the reasons they cite for this is the difficulty of referring to segments on the phonetic level.

Albano and Moreira adopted a two-step letter-to-phone conversion, with the first operating on words and the second operating on sentences, producing a postlexical representation. They distinguish between postlexical and phonetic processes, assuming that the former do not introduce any segmental symbols beyond those already in the lexicon, i.e. they are structure preserving. Continuous, or gradient, processes, such as coarticulation phenomena are taken to be phonetic and are not represented in symbolic terms.

Fitt (1997) discusses a method of generating regional accents for an English speech synthesizer. She divides phonological processes into three categories possessing both obligatory and optional components: pre-lexicon transformations, post-lexicon transformations and connected speech rules. Pre-lexicon transformations assist in the generation of a basic pronunciation lexicon for each accent; for example, non-prevocalic /r/ can be removed from the lexicons of non-rhotic accents such as British RP (Received Pronunciation). Post-lexicon transformations introduce allophones, such as flapping in American English. Fitt (1997) offers some examples of connected speech rules, but it is not clear how these are different from her post-lexicon transformations. It is interesting to note that the postlexical neural network that we will be describing here covers some of the functionality of each of Fitt’s sets of transformations.

Church (1983) argues that postlexical variation can actually improve recognition performance rather than derailing it. Speech recognizers that are equipped with knowledge about postlexical variation will be more robust for the variety of styles, tempos and dialects (in the speaker-independent case) with which they are confronted.

Lippmann (1996) reports results on the relative performance of speech recognizers on various speech recognition tasks, shown below in Table 1-1. These results indicate that the most improvement is required on spontaneous speech recognition tasks, such as the Switchboard corpus (Godfrey et al. 1992). Spontaneous speech is marked by precisely the kinds of postlexical variation that we intend to explore and for whose production we hope to provide accurate and efficient computational techniques.

Table 1-1: Performance on two speech recognition tasks

Task                   Human word error rate   Machine word error rate
Wall Street Journal    1%                      12%
Switchboard            4%                      66%

In the context of speech recognition, Riley and Ljolje (1996), Withgott and Chen (1993), and Cohen (1989) all describe methods for developing pronunciation networks based on postlexical pronunciations. This approach encodes much possible variation (perhaps too much) in each lexical entry. The first two methods differ from ours in their use of decision trees rather than neural networks. Gildea and Jurafsky (1996, 527) criticize decision tree approaches:

One problem with these particular approaches is that since the decision tree for each segment is learned separately, the technique has difficulty forming generalizations about the behavior of similar segments. In addition, no generalizations are made about segments in similar contexts, or about long-distance dependencies.

We hope to show how our neural network approach to postlexical modeling addresses these criticisms.

Riley and Ljolje (1996) observed an improvement in word recognition accuracy of almost 3% when their system’s dictionary included multiple postlexical variants as opposed to one phonemic representation per word. Cremelie and Martens (1996) and Fosler et al. (1996) also report improvements when baseform lexica are enhanced with postlexical variation.

Several investigators have examined the mapping between lexical and postlexical forms using the TIMIT database (Withgott and Chen 1993, Randolph 1990, Riley 1989, 1991). The goal in these cases was to enhance recognition dictionaries with postlexical variation. All of these efforts used species of decision trees, such as classification and regression trees (CART, Breiman et al. 1993), rather than the neural network approach that we employ. Sproat and Riley (1996) describe a method for generating weighted finite state transducers for lexical-postlexical conversion (among other things) from decision trees.

In contrast to the above-mentioned methods, which rely on hand-labeled corpora, such as TIMIT, Tajchman et al. (1995) describe a method for assigning probabilities to optional phonological rules across speakers that uses automatic speech recognition. Tajchman et al. benchmarked the probabilities generated by their system against hand-transcribed data, and showed a relatively good fit. They note that the optionality of postlexical rules makes the induction problem non-deterministic, whether the rules are considered phonological or phonetic.

Tajchman et al. took a lexicon of underlying forms (which resembles our lexical database in form and content, see section 3.1), and applied various phonological rules to it to produce a new lexicon of (in some cases, multiple) surface forms. Then a speech recognition system was used on a corpus of recorded read speech from multiple speakers to check how many times each surface form occurred in the corpus. Since each surface form was keyed to the rules that had produced it, it was possible to obtain a count for the application of each rule. This count, combined with a count of the times a rule did not apply, yielded a probability for each rule.
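The counting scheme just described can be illustrated with a toy example. The surface forms, the rule name and the recognition counts below are invented; the real procedure operates over a full lexicon and recognizer output from many speakers.

    # Toy estimate of an optional rule's probability from recognized surface forms.
    from collections import Counter

    # surface form -> optional rules that derived it from underlying "t uw" ("to")
    derivations = {
        "t uw": set(),                  # unreduced form, no optional rule applied
        "t ax": {"vowel-reduction"},    # reduced form; rule name is hypothetical
    }
    recognized = Counter({"t uw": 30, "t ax": 70})   # invented recognition counts

    applied = sum(n for form, n in recognized.items()
                  if "vowel-reduction" in derivations[form])
    print(applied / sum(recognized.values()))   # estimated rule probability: 0.7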

Tajchman et al. found that men were more likely than women to employ many of the postlexical rules they examined. Since they also found that male speech was faster than female speech in their corpus, they presumed a relationship between these two facts.

Tajchman et al. note that the decision tree work of Withgott and Chen (1993) and Riley (1991) allows a more fine-grained analysis than their rule-based algorithm. They note, however, that a liability of the decision tree method is that it is more difficult to extract generalizations across classes of phonemes to which rules can apply.

[Footnote 6: TIMIT is a blend of TI (Texas Instruments) and MIT (Massachusetts Institute of Technology), two of the institutions that collaborated to create this database (Fisher et al. 1987).]

Two studies on variation in the TIMIT database, Keating et al. (1994) and Byrd (1994), bear important resemblances to our own study. TIMIT is a corpus of read speech from 630 speakers of American English, varying along geographical, age and gender lines, which has been transcribed by hand according to the principles described in Seneff and Zue (1988). Each speaker uttered 10 sentences, 2 of which were recorded from all speakers. Because it is a multispeaker database, the conclusions about variation will undoubtedly be different from the ones we draw from our single-speaker database. Nevertheless, the fact that both databases are labeled in a similar fashion (see section 3.2) allows us to draw on what Keating et al. and Byrd have learned.

Keating et al.’s study is divided into a transcription study and an acoustic study. The transcription study examines variation in the pronunciation of the word the, with particular attention to the quality of the vowel in relation to the following environment. We will compare their results to our own in section 6.5. The acoustic study examined the effect of vowel context on the acoustics of velar stop consonants. It was found that the frequency of the burst peak was higher when the F2 of the following vowel was higher; indeed, following vowels were shown to have a much stronger effect on stop release bursts than those preceding.

Byrd (1994) examined relations of sex and dialect to reduction in the TIMIT database. She found that men had an average rate of 4.69 syllables per second, while women had an average rate of 4.42 syllables per second, a significant difference (p = .0001). Dialect was also found to be significant (p = .0001) for rate differences, ranging from slowest to fastest in the following order: South, South Midland, New York City, North, West, North Midland, North East and “Army Brat” (people who had moved around).

Byrd found that, in addition to speaking more slowly, women exhibited more conservative behavior on some postlexical processes. For example, she notes that women released sentence-final stops more often than men. She also found that women produced significantly fewer flaps than the men. Byrd also examined the central vowels in TIMIT and found the following distribution from a total of 17,858 central vowels: 55% [ɨ] (ix), 18% [ʌ] (ah), 27% [ə] (ax). Speakers in New York City and the West appear to have a greater preference for [ɨ] than [ə], while the North Midland behaves in the reverse manner. We will contrast Byrd’s results on reduced vowels with our own in section 6.4.

Byrd notes that since in TIMIT, “no phonemic transcription is provided, automating an investigation into many phonological processes is difficult or impossible” (1994, 43). We hope to show in this dissertation that a phonemic transcription can be aligned with the phonetic transcriptions in such a database, resulting in much interesting information about such phonological processes.

Strassel (1997) and Fulop and Keating (1996) make use of the Switchboard corpus to examine pronunciation variation. Switchboard is a recorded multispeaker database of spontaneous speech between strangers available from the Linguistic Data Consortium. It has been orthographically transcribed, although some sections have been phonetically transcribed (e.g. Fulop and Keating 1996, Greenberg et al. 1996). The principal advantage for examining phonological processes in Switchboard over TIMIT is its use of spontaneous speech. Strassel examined flapping in Switchboard and HUB-4, a corpus of broadcast news recordings, and we will discuss her results in comparison to our own with respect to this variable in section 6.4.

Fulop and Keating examined a subset of Switchboard, consisting of up to 40 tokens each of 48 lexical items. The lexical items were approximately evenly distributed between function and content words. The database subset was transcribed independently by two graduate student transcribers. Fulop and Keating examined transcriber reliability for each lexical item. They differentiated the high-reliability words (≥ 90%) in (1) from the low-reliability words (≥ 70%) in (2).

(1) bear, chips, farmers, goal, like, therefore
(2) but, customer, have, I, to, was, wasn’t, what’s

They noted that reliability was not correlated with the number of phonemes in a word, but was generally lower for “high-frequency words, words that reduce”. While this seems like a reasonable suggestion, we believe that it would be important to examine the function/content distinction among high-frequency words to understand any interactions. From the small sample of words in (1) and (2), it appears that the low-reliability words are more likely to be function words.

Fulop and Keating developed a contextual database of the transcriptions in their study of the Switchboard corpus. This database included information about each word’s canonical dictionary pronunciation, in addition to other linguistic information, such as stress, syllabification, word boundaries, function/content distinction, and preceding and following phones and phonemes. The particular information provided was based on Withgott and Chen (1993). We will describe a similar database, also borrowing from a description in Keating et al. (1994), that we set up for analyzing our own results in section 5.1. Fulop and Keating built context trees to analyze the factors contributing to the varying realizations of phones. Their results are summarized in Table 1-2.

Table 1-2: Factors contributing to varying realization of phones in Switchboard

Factor                                  Majority class        Phones
Preceding/following phoneme             vowels                Q, «, E, e, I, o, b, f
Syllable structure position             resonants             l, n, r
Position within word and/or syllable    plosives              d, k, t
No major context effects                stridents, labials    m, s, v, w, z

Source: Data from Fulop and Keating (1996).

Fulop and Keating also examined the extent to which surface phones matched phonemes in words’ canonical dictionary pronunciations. Figure 1-3 organizes the dictionary phones according to the percentage of tokens that were transcribed using that phone.

Contributing to the low faithfulness of stops to the dictionary pronunciation is a postlexical transcription system employing separate symbols for aspiration and release. In section 5.1, we will discuss a different measure of variability, entropy, which does not suffer to the same extent on account of transcription artifacts between the lexical and postlexical levels.
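For concreteness, the measure presumably at issue is the standard Shannon entropy of the distribution of postlexical reflexes of a lexical phone, estimated from corpus counts; the exact formulation used in section 5.1 may differ in detail. In LaTeX notation:

    H(l) = -\sum_{r} P(r \mid l)\,\log_2 P(r \mid l)

where l is a lexical phone, r ranges over its observed postlexical reflexes, and P(r | l) is the relative frequency with which l is realized as r. A lexical phone with a single invariant realization has entropy 0, while one with many equally likely reflexes has high entropy.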

[Figure 1-3 plots, for each dictionary phone, the percentage of tokens transcribed with the canonical phone, ranging from near 100% for phones such as [m], [f], [s] and [v] down to roughly 20% for [ə].]

Figure 1-3: Percentage of tokens transcribed using canonical pronunciations

Note: Chart generated using data from Fulop and Keating (1996).

1.2.3. Review of neural networks in phonology

In this section, we will describe various uses to which neural networks have been put in phonology. These include using neural networks to convert orthography to phonetics, and the use of neural networks as phonological representations themselves. Once again employing the concept of perspective established earlier, we note that the applications of neural networks have been both theoretical, in both psycholinguistics and formal phonology, and practical, for speech technology.

Letter-to-sound conversion refers to the conversion of orthography to pronunciation. It has applications in speech synthesis, for the pronunciation of out-of-dictionary words, and in database search, where it can be used to determine likely variants of a given spelling of a name (Divay and Vitale 1997). Historically, speech synthesizers designed to handle arbitrary text relied on a fairly small “exception” dictionary and letter-to-sound rules.

This is the approach used in the MITalk project (Allen et al., 1987). In the case of MITalk, the exception dictionary consisted of “morphs” which were combined with each other and with affixes by means of morphophonological rules. The letter-to-sound rules were explicit contextual rules, roughly of the form: “orthographic c is phonetic /k/ unless it is before orthographic i or e, in which case it is phonetic /s/”. Coltheart (1978) proposed a similar model for reading.
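The quoted rule can be stated as a tiny function. This implements only the rule as quoted (it mispredicts words like cello), and real rule systems contain hundreds of ordered, interacting rules.

    # Toy version of the single contextual letter-to-sound rule quoted above.
    def pronounce_c(word, i):
        """Return /s/ for orthographic c before i or e, otherwise /k/."""
        nxt = word[i + 1] if i + 1 < len(word) else ""
        return "s" if nxt in ("i", "e") else "k"

    print(pronounce_c("city", 0))    # -> s
    print(pronounce_c("cat", 0))     # -> k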

Hundreds of ordered letter-to-sound rules are required for acceptable accuracy in English. As computational power and memory have become cheaper, enlarging “exception” dictionaries to the point where they cover most of the vocabulary has become common (Liberman and Church, 1992). Nevertheless, alternative methods need to be available for out-of-dictionary words. Such words include many proper names, such as people, businesses and products, neologisms, domain-specific terminology (or jargon), as well as abbreviations and acronyms. Hetherington (1995) discusses the out-of-vocabulary problem for speech recognition, and provides data on relative proportions of out-of-dictionary words stemming from such different sources.

Alternatives to letter-to-sound rules have been sought with the intention of reducing the labor of linguists involved in developing and maintaining ordered lists of rules. According to Church (1986), developing a rule-based approach requires “a few years of intense effort by a highly skilled expert. The end result is often very difficult to debug and to maintain.”

One important step in moving beyond letter-to-sound rules was NETtalk (Sejnowski and Rosenberg, 1987), which employed a neural network trained on a commercial dictionary to determine pronunciations from text. Sejnowski and Rosenberg attempted to demonstrate how the network, when consulted in early stages of training, simulated a child’s speech, and gradually achieved adult-like competence with more training. In a similar vein, Seidenberg (1989) posits a neural-network-based model for human lexical access and mediation between orthography and phonology.

Sejnowski and Rosenberg (1987) employed a “1 out of n” coding scheme for their orthographic input. That is, each letter was represented to the neural network as an atomic unit. In contrast, their phonemic output was encoded in a distributed manner, with phonemes represented as bundles of features.
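The difference between the two coding schemes can be made concrete with a small sketch. The letter inventory and the toy feature bundle below are assumptions for illustration; they are not the actual codes used by Sejnowski and Rosenberg (1987).

    import string

    # "1 out of n" (one-hot) coding: each letter is an atomic, unanalyzed unit.
    LETTERS = list(string.ascii_lowercase)

    def one_hot(letter: str) -> list[int]:
        return [1 if l == letter else 0 for l in LETTERS]

    # Distributed coding: each phoneme is a bundle of (toy) feature values,
    # so similar phonemes share parts of their representation.
    PHONEME_FEATURES = {
        "p": {"voiced": 0, "nasal": 0, "labial": 1, "stop": 1},
        "b": {"voiced": 1, "nasal": 0, "labial": 1, "stop": 1},
        "m": {"voiced": 1, "nasal": 1, "labial": 1, "stop": 0},
    }

    def distributed(phoneme: str) -> list[int]:
        feats = PHONEME_FEATURES[phoneme]
        return [feats[f] for f in ("voiced", "nasal", "labial", "stop")]

    print(one_hot("c"))      # 26-dimensional vector with a single 1
    print(distributed("b"))  # [1, 0, 1, 1]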

While most prior neural network approaches to letter-to-sound conversion have focused on a single output (Sejnowski and Rosenberg 1987, McCulloch et al. 1987, Ainsworth and Warren 1992, Adamson and Damper 1996), neural networks can be designed to posit several candidate possibilities for a given input. For example, Deshmukh et al. (1996) describe a neural network implementation designed to provide n-best pronunciations for proper nouns.

Hare (1990) used a Jordan (1986) style neural network to investigate Hungarian vowel harmony, in particular, “transparent” vowels which neither display harmony themselves, nor block the spread of the harmonizing feature to other vowels. Since the same vowels can be harmonic in some contexts and transparent to harmony in others, the problem has typically been dealt with by positing a number of derivational sources for the segments in question. Hare (1990) first performs simulations on bit patterns and then does neural network simulations on bit patterns that represent the features of Hungarian vowels. She arrives at a notion of similarity, in which vowels that are most alike in certain features become more alike in others. She finds that this offers a plausible explanation of the harmony facts. Hare (1990, 148) summarizes the value of her neural network simulations as offering a “dynamic model that allows for testing of hypotheses about phonological processing”.

Inspired by Hare (1990), Rodd (1997) used an Elman (1990) net to examine the phonotactics of Turkish vowel harmony. Rodd (1997) constructed nets with varying numbers of hidden layers. She examined how the hidden layers, upon presentation of Turkish data, became detectors for various phonological features, including consonantal, frontness and sonority, which is involved in determining what segments may appear in consonant cluster sequences. We will examine neural network weights in the hidden layers of our postlexical network in an effort to see what features the network appears to make the most use of in determining postlexical variation.
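One simple way of carrying out such an examination is to rank the input features by the total magnitude of their connections into the hidden layer. The sketch below uses a random matrix in place of trained weights, and the feature names are hypothetical; it only illustrates the kind of inspection intended, not the actual postlexical network.

    import numpy as np

    # Stand-in for learned input-to-hidden weights: rows are input features,
    # columns are hidden units. In practice these would be read from the
    # trained network.
    feature_names = ["consonantal", "front", "sonorant", "stressed", "word_final"]
    W = np.random.randn(len(feature_names), 8)

    # Features with heavily weighted connections are candidates for what the
    # hidden layer has become a detector for.
    importance = np.abs(W).sum(axis=1)
    for name, score in sorted(zip(feature_names, importance), key=lambda p: -p[1]):
        print(f"{name:12s} {score:.2f}")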

Goldsmith (1992, 1993) discusses harmonic phonology, which is an attempt to use neural networks as a form of phonological representation. He uses network activations to model stress and syllabification patterns, modeling different systems by altering the network connection weights.

From a psycholinguistic perspective, Gaskell et al. (1995) describe a recurrent neural network designed to perform what they call “phonological inference” from surface forms to underlying lexical forms. Following Lahiri and Marslen-Wilson (1991), Gaskell et al. maintain that “phonological representations in the mental lexicon are structured in accordance with the theory of radical underspecification (Archangeli 1988)”. In a manner similar to Gildea and Jurafsky (1996), Gaskell et al. (1995) introduced postlexical variation, in their case, place assimilation, into a database of lexical pronunciations that was in turn generated automatically (with hand corrections) from a text corpus. Shillcock et al. (1993) similarly generated postlexical pronunciations automatically from a text corpus, in an effort to study English phonotactics with a recurrent neural network.

1.3. General phonological issues

Our research into pronunciation modeling in speech synthesis dovetails with several areas in phonological theory. Many of our contributions will be at the interface between phonology and the other fields of linguistics, including phonetics, morphology and syntax. First, we will describe work on the phonetics-phonology interface, which bears upon our treatment of postlexical phonology. Subsequently, we will describe research on the interaction of phonology with syntax and morphology. That discussion will include a treatment of lexical phonology, with particular attention to various taxonomies of postlexical rules and their domains of application. Finally, we will describe the relationship between our work and the studies of phonological variation that have predominantly occupied sociolinguists.

1.3.1. Phonetics-phonology interface

Although they focus mainly on the phonetic interpretation of fundamental frequency,

Liberman and Pierrehumbert (1984) advance a view whereby all postlexical processes are to be treated as “facts about the phonetic realization of phonological representations, rather than modifications of phonological representations themselves” (Liberman and

Pierrehumbert 1984, 166). In contrast, Kiparsky (1985) argues that there are two types of postlexical rules, one with gradient phonetic outputs, the other with categorical phonological outputs. We will consider his evidence and examine other subdivisions of postlexical rules in the next section.

Our work on postlexical modeling suggests that there is at least some descriptive adequacy to describing postlexical phenomena in symbolic, categorical terms typical of phonology, rather than the gradient terms of phonetics. We understand that this approach has certain limitations; however, we feel that these can be compensated for in the synthesizer's acoustic module in ways to be described below. In a description of casual speech phenomena, Browman and Goldstein (1990) show how apparent insertions, deletions and substitutions of segments can be interpreted as modifications in gestures, some of which are rendered inaudible. We will describe an experiment designed to determine the relationship between the phonological-phonetic and acoustic-phonetic representations in Chapter 4. In this subsection, we will describe some of the issues surrounding the ascription of phenomena to phonetics, phonology or their interface. This debate has also been discussed in terms of the difference between coarticulation, a gradient phonetic phenomenon, and assimilation, a categorical, phonological phenomenon (e.g. Wood 1996).

Pierrehumbert and Talkin (1992) found that the varying, gradient realization of /ʔ/ and /h/ was reminiscent of the mapping between underlying tones and surface fundamental frequency contours (Liberman and Pierrehumbert 1984). They found that allophony in /ʔ/ and /h/ was dependent on the prosodic environment in which they were located. For this reason, Pierrehumbert and Talkin argue against a view that segregates tone and intonation into a "suprasegmental" component, a view that they claim pervades speech technology (Lea 1980, Allen et al. 1987, Waibel 1988). To further investigate Pierrehumbert and Talkin's demonstration of prosodic effects on segmental allophony, we will describe in Chapter 5 the ways in which prosodic information is made available to the postlexical module of the Motorola speech synthesizer, whose function is to predict appropriate postlexical phones.

Footnote 7: Scobbie (1995) implies he prefers the term categorial to categorical in his use of sic in quoting Zsiga's (1995) use of categorical. Perusal of several dictionaries and articles on the topic at hand has not convinced us that one term is preferable to the other, so we will use the more common categorical to mean "relating to categories". The connotation of "unconditional, unqualified" is not inappropriate.

Pierrehumbert and Talkin conclude that an intrinsically quantitative representation, oriented toward critical aspects of articulation, appears to offer more insight than the traditional fine phonetic transcription. We will investigate the relationship between postlexical transcription and quantitative phonetic implementation in Chapter 4.

In a similar manner to Pierrehumbert and Talkin, Pierrehumbert and Frisch (1997) find that phrasal prosody plays an important role in determining allophonic glottalization, both in the case of vowel-vowel hiatus and voiceless stops before sonorants. In fact, they claim that successful synthesis of contextually appropriate glottalization requires an architecture with a running window over a fully parsed phonological structure. We will describe just such a running window in our postlexical neural network in section 5.4.
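The idea of a running window over a parsed phonological structure can be sketched as a sliding window of phones annotated with prosodic information. The feature set, window size and example below are assumptions for illustration; they are not the design of Pierrehumbert and Frisch's proposal nor of the network described in section 5.4.

    from dataclasses import dataclass

    @dataclass
    class Phone:
        symbol: str
        stressed: bool
        word_initial: bool
        phrase_initial: bool

    def windows(phones, size=2, pad=Phone("#", False, False, False)):
        """Yield (left context, center phone, right context) for each position."""
        padded = [pad] * size + list(phones) + [pad] * size
        for i in range(size, len(padded) - size):
            yield padded[i - size:i], padded[i], padded[i + 1:i + size + 1]

    # Hypothetical phone sequence for "the cat", with toy prosodic annotations.
    utterance = [Phone("dh", False, True, True), Phone("ah", False, False, False),
                 Phone("k", False, True, False), Phone("ae", True, False, False),
                 Phone("t", False, False, False)]

    for left, center, right in windows(utterance):
        print([p.symbol for p in left], center.symbol, [p.symbol for p in right])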

Pierrehumbert and Frisch (1997) maintain that glottalization is an important allophonic phenomenon in English that is critical for natural-sounding synthesis. We will describe our success at learning when to glottalize in section 6.4.

Pierrehumbert and Frisch (1997) reiterate the importance of not separating segmental and suprasegmental phenomena. They show how control of fundamental frequency is important for realizing "segmental" glottalization. The Motorola synthesizer's acoustic neural network learns fundamental frequency at the same time as other acoustic parameters, avoiding this separation problem (Karaali et al. 1997).

Zsiga (1995) investigates the phone /ʃ/ in both lexical and postlexical contexts. In an acoustic and electropalatographic study, she finds that lexical /ʃ/, whether underlying as in mesh, or derived as in impression, appears to have categorical properties. In contrast, postlexical /ʃ/, derived in contexts such as miss you, shows a gradient progression between /s/ and /ʃ/ across time. In her analysis of these phenomena, Zsiga (1995) finds that phonological features best handle the categorical situation, while (overlapping) gestures (e.g. Browman and Goldstein 1990) best handle the gradient situation. In particular, she asserts that the pattern of overlap between /s/ and /j/ in miss you is such that no postlexical rule of palatalization is required; the effect of palatalization would simply be the acoustic consequence of the normal pattern of overlap.

Zsiga (1997) similarly distinguishes a set of categorical phenomena that are best handled with autosegmental-featural representations from a gradient set of phenomena that are best handled with Browman and Goldstein's gestural model. In particular, she finds that Igbo vowel harmony is categorical, while Igbo vowel assimilation is gradient. Zsiga (1997) concludes that when a straightforward and principled phonetic explanation exists, proposing phonological rules for gradient processes unnecessarily complicates the phonology. This is consonant with Scobbie's (1995) response to Zsiga (1995), "What do we do when phonology is powerful enough to imitate phonetics?".

Sproat and Fujimura (1993) find that there is no reason to treat the light and dark allophones of /l/ as categorically distinct phonological (or phonetic) entities in English.

They find that /l/ is phonetically implemented as a lighter or darker variant depending upon such factors as its position within the syllable and the duration of the prosodic context containing the /l/. As with Zsiga, they offer a gestural explanation for the gradience found. As with Pierrehumbert and Talkin (1992) and Pierrehumbert and Frisch

(1997), they find that prosodic information, here, the intrasyllabic position of phonological elements, is necessary to explain the attested allophony.

In a manner analogous to Zsiga (1995) who criticizes “phonological” rules of palatalization, Sproat and Fujimura (1993) contrast their findings with those of Halle and

Mohanan (1985) who derive dark /l/’s from light /l/’s via a phonological rewrite rule that adds feature [+back] to /l/’s in postvocalic position. Sproat and Fujimura (1993) ascribe to Halle and Mohanan the assumption that in order to describe how one goes from an abstract phonemic representation to actual pronunciation, one needs an intermediate level of representation where allophones are represented as categorical entities. Sproat and

Fujimura (1993) claim that by arguing that much of the variation in quality in English /l/ allophones can be explained by a combination of factors including duration and intrasyllabic position, they have shown that the mapping from an abstract phonological representation to actual pronunciation does not require the positing of a level of representation where [l] and [ɫ] are distinct phonological or physiological entities.

Sproat and Fujimura (1993, 309) suggest the burden of proof is on those who would claim that allophones are distinct to justify the necessity of that assumption:

...it should no longer be taken for granted that two allophones, even ones as apparently different as the [tʰ] and [ʔ] of /t/ represent two distinct categories: we expect that variation in such cases will also prove to be graded, once enough contexts are considered and sufficient articulatory data are examined.

Our experiment regarding the allophony of /u/ and /“/, described in Chapter 4, takes up this challenge.

1.3.2. Lexical phonology and the interface between syntax, morphology and phonology

Our delineation of the problem of determining appropriate contextual postlexical pronunciations from canonical lexical pronunciations is founded in the theory of lexical phonology (Kiparsky 1982). In this theory, lexical rules are seen as occurring on lexical items before they are inserted into contexts with other words. Postlexical rules occur both within and between words once such contexts are established.

Several researchers have looked more closely at postlexical rules in an effort to see whether further distinctions can profitably be made. The main themes for distinguishing among postlexical rules include whether or not they need to refer to syntactic information and whether their application is gradient or categorical.

We will examine postlexical taxonomies suggested by Kiparsky (1985); by Nespor and Vogel (1986), known as prosodic phonology; by Kaisse (1985), involving P1 and P2 rules; by Mohanan, with his syntactic and postsyntactic strata; and by Hayes (1990), with his precompiled phrasal rules. We will also discuss relevant issues regarding the phonology-syntax connection in the work of Selkirk (1984).

As stated earlier, Kiparsky takes issue with Liberman and Pierrehumbert’s (1984) claim that all postlexical rules are phonetic implementation rules, arguing that “at least some postlexical processes are truly phonological, feature-changing rules” (Kiparsky 1985, 86).

To explain this point, Kiparsky (1985) provides examples of lexical and postlexical voicing in English. He illustrates anticipatory lexical voicing assimilation that is triggered by and applies to obstruents, as in a[dz] and a[ps], but not sonorants as in to[kn] or a[pl]. In contrast, he exemplifies perseverative postlexical voicing assimilation that can both apply to sonorants (1a) and be triggered by them (1b):

(1) a. c[r8]y, p[l]ay, sp[l]it
    b. back[t], bagg[d], bann[d], kidd[Þd]
       back[s], bag[z], bell[z], bush[Þz]

Kiparsky takes (1a) to be uncontroversially postlexical, due to structure preservation, which “determines point-blank that any rule which introduces marked specifications of

lexically non-distinctive features must be postlexical" (1985, 93). (1a) is an example of a phonetic implementation rule, due to its gradient and variable output. He also takes

(1b) to be postlexical, since the same process can occur to cliticized reduced forms of is and has. However, in contrast to (1a), Kiparsky (1985) takes (1b) to be “postlexical phonological”. Anticipating Zsiga (1995) with respect to lexical and postlexical palatalization, he states that “gradient processes...form a ‘cline’...(and) appear to be simply the postlexical applications of rules which in the lexicon function in strictly categorical fashion” (1985, 94).
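The perseverative assimilation in (1b) amounts to copying the voicing of the stem-final segment onto a suffix obstruent. The sketch below illustrates the pattern for the plural suffix only; the voicing inventory is a toy assumption, and sibilant-final stems (which take the vowel-initial variant) are ignored.

    # Toy illustration of perseverative postlexical voicing assimilation (1b):
    # the suffix obstruent agrees in voicing with the preceding segment.
    VOICED = {"b", "d", "g", "v", "z", "m", "n", "l", "r", "a", "e", "i", "o", "u"}

    def plural_suffix(stem_final_phone: str) -> str:
        """Choose /s/ or /z/ for the plural suffix after a non-sibilant stem."""
        return "z" if stem_final_phone in VOICED else "s"

    print(plural_suffix("k"))  # back -> back[s]
    print(plural_suffix("g"))  # bag  -> bag[z]
    print(plural_suffix("l"))  # bell -> bell[z]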

Nespor and Vogel (1986) carried forward Kiparsky’s attempts to subdivide postlexical rules with their elaboration of the prosodic hierarchy. It is important to establish the connection between this work and the theory of lexical phonology. Nespor and Vogel do not make it particularly explicit, merely noting in a parenthetical comment that “the rules of postlexical phonology correspond to the prosodic rules of the present proposal”

(Nespor and Vogel 1986, 18). Nespor and Vogel make clear that they are not discussing phonological processes that must make reference to syntactic or morphological information (27-33), whether at or below the word level. In general, Nespor and Vogel can be seen as providing a principled way to explain phonological processes that occur across words. They provide a prosodic hierarchy of structures shown in Table 1-1.

Table 1-1: Nespor and Vogel's (1986) prosodic hierarchy

Domain                     Symbol
syllable                   σ
foot                       Σ
phonological word          ω
clitic group               Χ
phonological phrase        φ
intonational phrase        Ι
phonological utterance     Υ

Nespor and Vogel point out that the prosodic domains they describe do not necessarily coincide with syntactic boundaries; in fact, they find the latter insufficient to describe many cross-word phonological phenomena. They present various processes in several languages that are constrained to occur within the various prosodic boundaries they establish.
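For implementation purposes, the hierarchy can be encoded as nothing more than an ordered list of domain types, so that one can, for example, check whether one domain dominates another. The sketch below encodes only the ordering in Table 1-1; the helper function is a hypothetical convenience.

    # Nespor and Vogel's (1986) prosodic hierarchy, from smallest to largest domain.
    PROSODIC_HIERARCHY = [
        "syllable", "foot", "phonological word", "clitic group",
        "phonological phrase", "intonational phrase", "phonological utterance",
    ]

    def dominates(larger: str, smaller: str) -> bool:
        """True if `larger` is a higher (more inclusive) domain than `smaller`."""
        return PROSODIC_HIERARCHY.index(larger) > PROSODIC_HIERARCHY.index(smaller)

    print(dominates("phonological phrase", "clitic group"))  # True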

Kaisse (1985) analyzes the different types of postlexical phenomena from a different perspective from Nespor and Vogel. She recognizes three types of postlexical, or connected speech, phenomena:

1. Variants that can be accounted for by listing suppletive allomorphs in the lexicon; e.g. cliticization.
2. Variants produced by rules of external sandhi. These have syntactic, morphological and/or lexical conditions in addition to phonological ones.
3. Fast speech rules, which are entirely dependent on rate and phonological information.

Kaisse calls (2) P1 rules and (3) P2 rules. P1 rules apply before pause insertion and before P2 rules. She claims that P2 rules are the postlexical rules of Mohanan and Kiparsky.

Kaisse (1990) suggests that P1 and P2 rules differ in the same way that lexical rules are considered to differ from postlexical rules in the “classical” sense. According to this conception, P1 rules are categorical, while P2 rules are gradient.

Kaisse (1990) compares her postlexical typology with that of Nespor and Vogel (1986).

She claims that most of the P1 rules she concentrated on in Kaisse (1985) apply within the phonological phrases of the prosodic hierarchy. However, she asserts that rules operating within clitic groups and phonological phrases belong to a different prosodic hierarchy than those operating within intonation phrases or utterances. According to

Kaisse (1990, 129), “the larger prosodic domains are less grammaticalized than clitic groups or phonological phrases. They are not nearly so intimately related to, nor derived from, syntactic categories and syntactic concepts”. In other words, the rules applying in the smaller prosodic domains would be P1 rules, while those applying in the larger domains would be P2 rules.

Randolph (1989, 85-86) distinguishes between intrinsic and extrinsic allophones.

Extrinsic allophones refer to “phonetic alternations that are attributed to the structural context of an underlying phoneme,” for example, its position in a larger phonological unit, such as a syllable. In contrast, intrinsic allophones refer to phonetic alternations resulting from “inherent mechanical constraints imposed on the articulatory mechanism”.

Randolph (1989) mentions that speaking rate and dialect also play a role in a speaker’s

selection of extrinsic allophones. As examples of extrinsic allophones in English,

Randolph (1989) mentions the aspirated, unaspirated, glottalized, released, unreleased, flapped and retroflexed variants of /t/, light and dark /l/, and syllable-initial and syllable-final /r/. Although Randolph (1989, 93-94) expresses some discomfort with qualitative allophonic symbols, he makes use of them in a series of experiments on stop realization. While allophonic symbols may be useful for analyzing extrinsic allophony,

Randolph (1989, 85) observes that to describe intrinsic allophony, speech must “be treated at the physical level, as multi-dimensional, where the dimensions pertain to individual vocal tract articulators”.

Mohanan (1986, 145) proposes that the postlexical module consists of two submodules, namely a syntactic submodule in which syntax is available, and a postsyntactic module of phonetic implementation, in which syntactic information is not available. Mohanan mentions English a/an alternation as an example of a rule taking place at the syntactic stratum. Mohanan (1986) claims that this is an improvement over Mohanan (1982), where there was no division in the postlexical module. In the earlier version, a/an alternation was assigned to the lexicon, a strategy Mohanan (1986, 180) calls a "contrivance".

Interestingly, Hayes (1990) proposes the solution that Mohanan (1986) rejects. Hayes notes that prosodic phonology (Nespor and Vogel 1986) does not handle rules that require syntactic information, which he calls “direct-syntax rules”. He notes that the avowed inability to handle such rules makes the theory harder to falsify, since if the prosodic

hierarchy cannot explain a particular phrasal phonological phenomenon, syntax can be appealed to, potentially indiscriminately. As an alternative, Hayes proposes to eliminate postlexical rules that refer to syntactic information, by precompiling diacritically-marked allomorphic information in the lexicon. According to Hayes (1990, 87), "at the interface of syntax and phrasal phonology, the appropriate diacritically-marked allomorphs are inserted in the relevant syntactic contexts." This is similar to Liberman's (1994a) suggestion that phonological selection occur on the basis of comparison of allomorphs in a single optimality tableau.

Liberman (1994a) discusses the concept of phonological optionality using the example of

Latin enclitics. He distinguishes phonological optionality from both sociolinguistic

(discussed in section 1.4) and phonetic variation (discussed in section 1.3.1). Our own study approximates that of Liberman in the sense that sociolinguistic variation is minimized due to the use of a corpus of read speech (section 3.2), and we are analyzing the attested variation in phonological terms (with the caveats expressed above). Proper investigation of phonological optionality requires substantial examples in varying contexts:

...investigations of this sort are most comfortably carried out in the context of a suitably-annotated corpus. In this context, propensities can be estimated from observed frequencies, and proposed conclusions can be checked or challenged by other scholars. (Liberman 1994a, 88)

Liberman observes that the variation he finds in certain prepositions does not constitute a regular phonological process in Latin. For this reason, he proposes that the variation observed in his study is best considered allomorphy.

Another case in which phonological variation occurs only in a subset of the vocabulary is that of function words. Hayes (1995, 88) discusses McCarthy and Prince's (1986, 1990) observation that word minimality predictions typically hold only for content words, e.g. nouns, verbs and adjectives. Function words can be subminimal, such as English a [ə] and the [ðə]. Selkirk (1984), in her chapter on "Function words: destressing and cliticization," notes that function words have "strong" and "weak" forms

(335). The weak forms are characterized by stresslessness, and the possible deletion of vowels or consonants. Selkirk notes the phrasal aspect of weak form realization, observing that function words “must be sufficiently ‘close,’ syntactically speaking, to what follows (or, in a few cases, to what precedes)” (336). In these cases, the weak forms may be thought of as clitics. Furthermore, function words exhibit particularly close juncture, increasing their likelihood of participating in sandhi rules.

Selkirk attributes the special behavior of function words to a Principle of the Categorial

Invisibility of Function Words (PCI). While the reduction of function words bears a resemblance to lexical vowel reduction, it seems to us to be a clear case of postlexical

phonology, given its instantiation in phrasal contexts. In section 6.5, we will examine the varied behavior of function words in our corpus, attributing many of their alternations to allomorphy, since the processes they undergo and the contexts in which they undergo them are often specific to function words, or even to particular function words (e.g. the discussion of the in section 6.5).

As we will describe below, our stipulation of the problem of modeling postlexical variation conflates to some extent what has traditionally been considered allophony (such as flapping), phonologically conditioned allomorphy (such as [ðə]/[ði] alternation), and function word destressing. We provide the syntactic (such as part of speech; see footnote 8) and prosodic information that the learning procedure can use, without stipulating how it must be used (cf. Gildea and Jurafsky's 1996 introduction of learning biases to a phonological learning problem). Building on the work of Pierrehumbert and her colleagues (Pierrehumbert and Talkin 1992, Pierrehumbert and Frisch 1997), we can consider the space of postlexical variation that we set out to learn as prosodically conditioned allophony and allomorphy.
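Concretely, each lexical phone can be thought of as a bundle of segmental, syntactic and prosodic fields handed to the learner without any stipulation of how they are to be combined. The record layout below is a purely hypothetical illustration, not the encoding actually described in Chapter 5.

    from dataclasses import dataclass

    # Hypothetical input record for one lexical phone; the network is free to
    # use or ignore any of these fields when predicting the postlexical phone.
    @dataclass
    class PostlexicalInput:
        lexical_phone: str      # underlying (dictionary) phone
        part_of_speech: str     # or a prominence value, cf. section 3.2.1
        is_function_word: bool
        syllable_position: str  # e.g. "onset", "nucleus", "coda"
        phrase_final: bool

    example = PostlexicalInput(lexical_phone="t", part_of_speech="DET",
                               is_function_word=True,
                               syllable_position="coda", phrase_final=False)
    print(example)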

1.4. Sociolinguistics/Variation

In this subsection, we will discuss both the relationship of our work to sociolinguistics and theories of variation in general. As discussed above with respect to the observations of Liberman (1994a), sociolinguistic variation may be carefully distinguished from both

phonological optionality and phonetic variation. Sociolinguistic variation includes pronunciation variation arising from stylistic, geographical and social class variation.

Perhaps a convenient way of grouping geographical and social class variation would be to call them “speech community variation”, since we have learned that even within the same geographical region, there is a complex interaction between sex, race, social class and other factors (e.g. Labov 1966). While it is clear that stylistic variation is an issue both within and across speakers, speech community variation can be subject to the same vacillation, when one considers phenomena such as accommodation (e.g. Street and Giles

1982), or audience design (Bell 1984).

Another dimension along which variation can be categorized is that of inter- and intraspeaker variation. The approach to modeling individual postlexical variation that we will present should be contrasted with approaches in both sociolinguistics and speech technology that model inter- rather than intraspeaker variation. In sociolinguistics, Guy

(1980) showed that in the case of t,d deletion, variation in the individual tends to mirror that of the speech community. However, Van de Velde and van Hout (1997) show how different explanations of the variation in Dutch n deletion result from an examination of individual, as opposed to aggregate, data. Approaches to pronunciation modeling in speech recognition (e.g. Tajchman et al. 1995) attempt to predict a range of dialect and postlexical phenomena across speakers.

Footnote 8: Actually, prominence (O'Shaughnessy 1976), as discussed in section 3.2.1.

It is important to clarify how the issues of intra- and interspeaker variation affect our own work. Our phonetically labeled database contains the speech of one speaker.

Concatenative and neural network synthesizers (e.g. Karaali et al. 1996, Gerson et al.

1996, Corrigan et al. 1997, Karaali et al. 1997) have proceeded under the assumption that they are more likely to be considered natural if they strive to learn the habits of one particular speaker, rather than if they attempt to average over several. For recognition, we might want to expand our postlexical model to include variation across speakers; however, that is beyond the scope of the present research.

In section 2.3 we discuss Labov's (1989, 1994) research on cross-dialectal comprehension. It is important to distinguish between lexical and postlexical variation in such studies. Labov's studies of vowel shifts (Labov 1991), many of which played a significant part in his cross-dialectal comprehension studies, are of the lexical variety; that is, words in a speaker's lexicon would presumably be stored with the vowels in question. In contrast, much sociolinguistic research has concentrated on postlexical variants, such as studies of t,d deletion in English (e.g. Guy 1980) or s deletion in Spanish (e.g. Poplack 1980).

Kiparsky (1995) has noted that historical linguistic changes of the lexical diffusion variety are often lexical, while examples of neogrammarian sound change are often postlexical. In parallel with the gradience characteristics of lexical and postlexical phonology discussed earlier, Kiparsky finds that regular sound change is gradient, and lexical diffusion quantal (1995, 643).

Footnote 9: Although Guy (1991) analyzes some instances of t,d deletion as lexical and others as postlexical.

Variation has been regarded as a nuisance both in formal linguistic theory (Chomsky

1965) and in acoustic phonetics (Klatt 1980). While some variationist work, such as

Guy’s (1991) explication of the exponential model, have proved the lawfulness of variation with respect to formal theories, separate traditions in both speech technology and speech perception have found acknowledgment of the inherent variability of speech to be beneficial, as we will elaborate below.

Church (1983) contrasted his work on phonological parsing for speech recognition with a tradition that had sought to characterize the automatic speech recognition problem as one in which variable signals needed to be mapped to an invariant, symbolic level, such as phonemes. Church showed how allophonic variants, particularly those involved in juncture phenomena in allegedly confusable sentences, actually provided important keys to their semantic resolution. Elman and McClelland (1986) show how variability on a spectral, or featural, level actually improves recognition performance, by providing important cues as to neighboring segments and coarticulatory phenomena.

Pisoni (1993, 1997) describes newer work in speech perception in which "episodic" memory, rather than "analytic" memory, is seen as critical to the speech perception process. The episodic approach is backed by findings that listeners understand familiar voices better than unfamiliar ones. Rather than reducing speech to invariant cues in order to decode it, listeners are seen as benefiting from indexical or paralinguistic aspects of the signal in their understanding efforts.

While the views of Elman and McClelland, Church and Pisoni are complementary, it is important to distinguish between linguistic variability, as is characterized by varying treatment of allophony by speakers, and paralinguistic variability, which is evidenced by different voice qualities, for example. The dimensions of style (Abe 1997) and emotion

(Murray and Arnott 1995) have also been investigated in the context of synthetic speech.

In a speech synthesis application, the goal may be to produce several voices, each one internally coherent. Each voice should command a range of styles appropriate to different topics and interlocutors. In speech recognition, the ideal system must be able to handle a variety of speakers speaking in a variety of styles about a variety of topics.

Speech technology is beginning to show an acute interest in variation. For example, the field of speaker adaptation within speech recognition attempts to identify speaker characteristics in an effort to be able to use more specific pronunciation models (Zhao

1997, Ström 1997). Research into automatic dialect classification is one example of this approach (Huggins and Patel 1996, Miller and Trischitta 1996). Undoubtedly, projects such as the Phonological Atlas of North America (Labov 1996) will inform these investigations. Junqua and Haton (1996) provide a survey of current research on style variation in the context of speech recognition. We hope to contribute to this growing research area.

As high-quality synthetic speech becomes more of a reality, the question of how to introduce what we have learned about variation becomes critically important. If indeed variation is as important to understanding and explaining language as we have been saying it is, how should we apply our knowledge to speech generation systems? In order to approach this issue, we need to adjust our perspective toward describing variation in such a way that it can serve as a model for a synthetic voice.

1.5. Overview of dissertation

In this chapter, we have circumscribed the problem of pronunciation modeling in speech synthesis. We have shown how we intend to explore it within the context of the postlexical module of a speech synthesizer being developed at Motorola. We have described the developing field of computational phonology and previous efforts at pronunciation modeling on corpora. We have reviewed literature in phonetics, phonology and their interface with syntax and morphology that we believe provides useful means for describing and explaining the kinds of postlexical variation we intend to investigate in this dissertation. In addition, we have examined sociolinguistic and psycholinguistic treatments of variation in an effort to explore the range of pronunciation variation and the factors that control it.

In chapter 2, we will provide a rationale for modeling postlexical variation in the way we are proposing. Chapter 3 will cover the data sources involved in our experiments, including a lexical database and a hand-transcribed speech corpus of one speaker. In chapter 4, we describe an experiment designed to explore the question of gradient and categorical treatments of allophony posed in this chapter. In chapter 5, we characterize the learning problem, and explain the methods we are using in our attempts to solve it; in particular, a neural network designed to map from lexical to postlexical pronunciations.

In chapter 6, we provide a phonological analysis of our results, breaking them down by phenomenon. In each case, we attempt to relate our findings to other experimental and theoretical reports. In chapter 7, we discuss our results as a whole, and consider their implications for both linguistics and speech technology.

Chapter 2. Rationale for modeling postlexical variation

In this chapter, we will discuss some of the most important reasons for modeling postlexical variation in the way that we propose for speech synthesis. First, we will discuss the various parameters for evaluating synthetic speech, in an effort to see what past research indicates about the importance of proper modeling of postlexical variation.

We will then consider specific aspects of the Motorola speech synthesizer that motivate the approach we have taken. We then look to sociolinguistic studies of cross-dialectal comprehension for motivation of the dialect modeling that our procedure allows.

2.1. Evaluation of synthetic speech

There are four main axes along which speech synthesizers have been evaluated: intelligibility, comprehensibility, acceptability/preference/quality, and naturalness.

Intelligibility is usually assessed on the segmental level, and to determine it, listeners are asked to identify, either by transcription or in multiple choice format, particular words with which they are presented. Comprehensibility usually refers to understanding of passages longer than the word; for example, sentences or paragraphs. In comprehensibility tests, listeners may be asked to either repeat or answer questions about what they hear.

Finally, acceptability (also known as quality or preference) refers to the reaction of listeners to the speech: whether they like it or not, or whether they like one synthesizer more than another. Naturalness is related to acceptability measures; however, attempting to isolate it represents an attempt to separate intelligibility from acceptability. Of course, the notions of intelligibility, comprehensibility, acceptability and naturalness are all intimately related; however, it is useful to identify the different sources of listener behavior with respect to synthetic speech. We will first review some general information about the evaluation of synthetic speech. Then we will examine some case studies in evaluation which bear upon our interest in modeling postlexical variation.

2.1.1. Intelligibility

According to Ralston et al. (1995), “intelligibility provides an index of the lower bounds of perceptual performance for a given transmission device when no higher-level linguistic context is provided.” Schmidt-Nielsen (1995) provides a review of intelligibility tests, many of which are employed in various telecommunications applications as well as synthetic speech. Common tests include the Modified Rhyme Test (MRT) (House et al.

1965) and the Diagnostic Rhyme Test (DRT) (Voiers 1983). Both are closed-response tests, with the MRT offering six choices for each item, the DRT offering two.

Schmidt-Nielsen notes that as the size of the response set decreases, intelligibility scores will be higher. The DRT provides a diagnostic score on several acoustic features

(Jakobson, Fant and Halle 1952).

In a study comparing the Motorola speech synthesizer to several commercial synthesizers,

Nusbaum, Francis and Luks (1995) preferred an intelligibility test using a transcription

task in which subjects typed in what they heard. They claim that this allows subjects "a better way of describing the segmental structure of their perception" compared to a forced choice task, as in the MRT. For the intelligibility experiment, subjects were played single words from several synthesizers and one human talker. They were instructed to type into a computer the word they heard in normal spelling, and to type a nonsense word if it did not sound like a word.

While we believe that more natural speech, such as results from using postlexical rather than lexical pronunciations, leads to increased intelligibility, this is not necessarily obvious. It might be argued that, in the context of poor-quality synthetic speech, hyperarticulated pronunciations give listeners a better clue to what is being said than more casual pronunciations do. It will be important to demonstrate that the acoustic, or signal processing, side of the synthesizer is of sufficient quality that using natural postlexical pronunciations actually improves intelligibility, as we believe it does in natural speech. Proof of these points remains for future work.

Bradlow et al. (1996) claim that speakers with larger vowel spaces are more intelligible than those with smaller vowel spaces. They also point out how fine acoustic-phonetic differences affect the intelligibility of consonant clusters, particularly with respect to the deletion and release of /t/ and /d/. They also show how listeners' ability to detect syllable affiliation is related to the duration of /s/ between words. Such results lead us to believe that attaining high intelligibility will require not only manipulation of the symbolic, or segmental, characteristics of postlexical speech, but also complex manipulation of acoustic-phonetic characteristics. In future work, we hope to show how a neural network approach to generating postlexical symbols and a phonetic implementation neural network can achieve the required fineness of distinctions, resulting in maximized intelligibility.

Research into "clear speech", a hyperarticulated variety of speech (for example, the way some people address the hard of hearing), reveals that such speech is more intelligible than normal speech (Picheny et al. 1985, 1986, Krause 1995). Such results might lead one to suspect that speech synthesizers are likely to be better understood the more hyperarticulated, and consequently less casual, they sound. However, we believe that over longer discourses, more conversational speech will prove to be more comprehensible. We hypothesize that hyperarticulated speech would be likely to distract attention from the content of the synthesized message by calling undue attention to how it is uttered.

2.1.2. Comprehensibility

According to Ralston et al. (1995, 240), most models of comprehension are based on data obtained from studies of reading, which typically try to account for only a subset of the factors known to influence comprehension. An important difference between the modalities of listening and reading is that written text is typically static and available for re-inspection, whereas spoken language is very much transitory and ephemeral.

According to Kintsch and van Dijk (1978), comprehension is the process of building up a coherent, connected propositional representation of a text's meaning.

Ralston et al. (1995) discuss the distinction between simultaneous and successive measures of comprehension:

Successive measures of comprehension are those made after the presentation of linguistic materials. Simultaneous tasks, on the other hand, measure comprehension in real-time as it takes place and usually require subjects to detect some secondary event occurring during the time that the comprehension takes place. (Ralston et al. 1995, 247)

Successive measures have been used much more than simultaneous measures, and they appear to be less sensitive, due to memory decay and reconstruction of information absent in the original materials.

Among successive measures of comprehension, Ralston et al. (1995) identify recall, recognition and sentence verification. Recall involves repeating materials after they have been heard. Recognition tasks assess memory for higher levels of representation, such as propositions derived from text. These include the familiar standardized multiple-choice reading comprehension tests. Finally, sentence verification tasks most approximate simultaneous measures of comprehension. Sentences are presented to listeners who must respond whether they are true or false. Since the judgments are most often trivial, such as

“A robin is a bird”, the main dependent variable of interest is response latency. Ralston et al. note that successive measures of comprehension have yielded mixed results on

comparisons of synthetic and natural speech. Table 2-1 summarizes some of these results.

Table 2-1: Mixed results on successive comprehension measures

Results: Reliable effects of voice on accuracy, but no training effects. Reports: Luce (1981), Moody and Joost (1986).
Results: Differences between voices when synthetic speech first encountered, but differences became smaller with even moderate exposure or training. Reports: Jenkins and Franklin (1981), McHugh (1976), Pisoni and Hunnicutt (1980).
Results: No comprehension differences between natural and synthetic speech, nor learning effect after training. Report: Schwab et al. (1985).

Source: Ralston et al. (1995, 265).

Simultaneous measures of comprehension include both phoneme and word monitoring tasks, where listeners are asked to identify when a particular phoneme or word occurred in a stream of speech. As in the sentence verification task, the main dependent variable is response latency. Ralston et al. (1995) summarize the results of several experiments including simultaneous measures of comprehension, which they claim are the first reported (Ralston et al. 1991). These experiments showed that natural speech consistently outperformed low-quality synthetic speech (Votrax) on word monitoring accuracy and latency, as well as on word and proposition recognition. A significant correlation was found with results from an MRT intelligibility task. An interesting finding regarding successive comprehension measures performed in the same test battery was that subjects listening to natural speech were more accurate for proposition-recognition sentences than word-recognition sentences, while subjects listening to Votrax speech showed the reverse pattern. Ralston et al. (1995) explain this asymmetry between natural and synthetic speech in the following passage:

The data are consistent with the hypotheses that comprehension is constrained by a limited capacity of processing resources that can be allocated to various comprehension sub-processes as well as other cognitive processes. Subjects listening to synthetic speech apparently allocate a greater proportion of the available resources to acoustic-phonetic processing, leaving relatively fewer resources for analysis of the linguistic meaning of the passage. (Ralston et al. 1995, 271)

We believe that this finding has important ramifications for our work. Obviously, in most practical uses, speech synthesis will be used to communicate propositional content.

Therefore, it is important to reduce the amount of attention that listeners must pay to the acoustic-phonetic properties of the speech in order to derive such meaning from synthesized text. We hope that over longer stretches of speech, naturalness, such as that engendered by attention to postlexical detail, will improve comprehensibility and reduce listener fatigue.

Ralston et al. (1995) note some general problems with comprehension studies, especially those that involve responding to published multiple choice tests. Since results on these tests have a good chance of being related to prior real-world knowledge, they suggest pre-screening test questions and discarding those that subjects can correctly answer without listening to the corresponding passages. Better yet, they recommend using fictional passages.

2.1.3. Acceptability

Nusbaum, Francis and Luks (1995) found that the Motorola research synthesizer was judged more acceptable by listeners than three commercial synthesizers. They assumed that acceptability ratings reflect several different characteristics of the speech, “including how well the listeners understand the message and how pleasant (or perhaps appropriate) the speech sounds.” However, the Motorola synthesizer lagged behind the others in segmental intelligibility. Schmidt-Nielsen (1995) also notes the possibility for a mismatch between acceptability and intelligibility. She notes that while acceptability (or quality) measures are often highly correlated with intelligibility, there are situations where intelligibility may be high, but speech quality is degraded. For example, noise removal techniques can improve acceptability but tend to lead to lower segmental intelligibility.

Given the potential mismatch between intelligibility and naturalness, we hope to investigate ways in which the intelligibility of a system can be improved, without losing naturalness and acceptability.

For the acceptability experiment, subjects were played speech in several domains where synthetic speech might be used, such as finding out movie times, or getting account information from a bank. Subjects were asked to rate the acceptability of each sentence for that domain on a scale of 1 to 7. A rating of 1 was described as “I would not use a service that used this voice”, while a rating of 7 was described as “I would prefer to use a service with this voice”. A rating of 4 was described as “This voice is acceptable”.

57 Schmidt-Nielsen (1995) discusses another kind of preference test, the paired comparison test. In such a test, the listener hears a sentence for each of two speech conditions and selects the one that is preferred. All pairs should be presented twice so that each system is presented both in first and second place. However, Schmidt-Nielsen notes that the use of ratings or category judgments instead of paired comparisons greatly simplifies the data collection since each system being tested need be rated only once for each speaker.

According to Schmidt-Nielsen (1995, 210), experimental evidence and informal experiences indicate that rank orderings assigned by use of rating scales and by paired comparisons are highly correlated.

2.1.4. Naturalness

Speech synthesizers that demonstrate knowledge of postlexical variation are likely to be perceived as more natural than those that do not. According to Allen et al. (1987, 87), postlexical rules (in their terms, phonological recoding rules) "are not 'sloppy speech' rules, but rather rules that aid the listener in hypothesizing the locations of words and phrase boundaries." However, as Sorin (1991) demonstrates with respect to the French mute e, given the current state of the signal processing end of speech synthesis, simulating natural pronunciations may result in a loss of intelligibility. There is an obvious tension between naturalness and intelligibility, and many current synthesizers have opted for intelligibility at the expense of naturalness. Increased acceptance of synthetic speech will most likely hinge on improvement in both measures.

Nusbaum, Francis, and Henly (1995) sought to tease apart the notion of naturalness from intelligibility and acceptability, with which it is often confounded. They believed that people can readily hear the source difference between natural speech and synthetic speech, from global levels of prosody to local acoustic-phonetics of spoken words.

Nusbaum, Francis and Henly note that from a practical perspective, unnatural voice quality seems more of an aesthetic issue than one that determines the usability of synthetic speech.

However, they note that as the intelligibility of synthesis improves, the naturalness of synthetic speech becomes increasingly important. As we will discuss in more detail below, this notion also motivated Sorin (1991) to reintroduce the French mute e into the

CNET (Centre National d’Etudes des Télécommunications) synthesizer once a certain intelligibility threshold had been reached. According to Nusbaum, Francis and Henly, naturalness is particularly important for voice prostheses, and will be an important factor in the acceptability of synthetic speech for use in a wide range of applications.

According to Nusbaum, Francis and Henly, naturalness is a voice quality that is purely subjective; however, some of the factors that may influence the perception of naturalness can be specified analytically. Synthetic speech differs from natural speech in prosodic and segmental structure, as well as source characteristics. Since Nusbaum, Francis and

Henly assume that segmental duration and timing, intonation, and amplitude variation are under the control of rules, they claim that “the patterning of these sources of information may show less variability than human speech” (Nusbaum, Francis and Henly 1995, 8).

They note that there are many opportunities for oversimplification and error in the rules of a text-to-speech system, at the level of both acoustic-phonetic and phonological rules. We hope to show that since our system is based on machine learning of postlexical variation, rather than rules, we can introduce more natural variation into our synthetic speech than would otherwise be possible.

Nusbaum, Francis and Henly attempted to develop tests of naturalness that would eliminate as much as possible the contribution of intelligibility, since at present, the intelligibility differences between natural and synthetic speech are still sufficiently large to affect the perception of naturalness. They also sought to develop tests that would target specific aspects of naturalness, rather than simply provide global ratings. To this end, they developed separate tests to assess the naturalness of source characteristics and the naturalness of lexical prosody.

In their first experiment, on the naturalness of source characteristics, Nusbaum, Francis and Henly sought to reduce contributions of segmental structure and prosody to the perception of naturalness as much as possible. Their view was that even at the level of an individual glottal pulse, there would exist differences between natural and synthetic speech perceptible to listeners. The experiment involved taking a single glottal pulse and concatenating it to produce a sustained vowel, thus eliminating the effects of prosody. By focusing on maximally discriminable single vowels isolated from context, they claim to have eliminated many of the effects of intelligibility on perception.

By iterating a single glottal pulse to form a sustained vowel, Nusbaum, Francis and Henly note that they are eliminating one aspect of source characteristics that could be important to the perception of naturalness, namely the contribution of the perception of variability between glottal pulses. To assess the contribution of variability, they created a second set of stimuli based on a sample of five successive glottal pulses that were iterated.

Nusbaum, Francis and Henly created vowel tokens by both concatenating single and five glottal pulses for /i/, /a/ and /u/ spoken by two synthesizers, DECtalk and Votrax, and two male talkers. Each vowel was played to subjects along with a text message identifying the vowel (thus eliminating any need for intelligibility judgments). Subjects were instructed to listen to each vowel and to decide whether it was produced by human or computer.
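The construction of these stimuli can be sketched numerically: a short glottal pulse, or a run of several pulses, is tiled end to end until the desired duration is reached. The pulse shape, sampling rate and duration below are assumptions for illustration only, not the actual stimuli used in the experiment.

    import numpy as np

    def sustained_vowel(pulse: np.ndarray, duration_s: float, sr: int = 16000):
        """Iterate a glottal pulse (or a short run of pulses) to form a
        sustained vowel of roughly the requested duration."""
        n_repeats = int(np.ceil(duration_s * sr / len(pulse)))
        return np.tile(pulse, n_repeats)[: int(duration_s * sr)]

    # Stand-in for one excised glottal pulse (~8 ms at 16 kHz); in the actual
    # experiment the pulse would be taken from recorded or synthesized speech.
    one_pulse = np.hanning(128)
    single = sustained_vowel(one_pulse, 0.5)            # one pulse iterated
    five = sustained_vowel(np.tile(one_pulse, 5), 0.5)  # five pulses iterated
    print(len(single), len(five))                       # both about 8000 samples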

An analysis of variance revealed that there were no reliable differences in classification performance between the vowels created from the concatenated sets of five glottal pulses and those created from a single glottal pulse. Despite this result, Nusbaum,

Francis and Henly believe that variability does play a role in naturalness; however, they observe that perhaps five glottal pulses were not enough to demonstrate it.

Nusbaum, Francis and Henly found that only /i/ proved to be diagnostic of naturalness, in the sense that subjects consistently judged the human speech for this vowel more natural than the synthetic speech. In addition, Votrax was judged less natural than DECtalk. Interestingly, DECtalk was actually judged more natural than human speech for /u/.

Nusbaum, Francis and Henly performed a second experiment, designed to assess the contribution of lexical prosody to naturalness judgments. To eliminate the influence of intelligibility, they low-pass filtered spoken words to eliminate as much segmental information as possible, leaving only prosodic structure. As in the first experiment, subjects were instructed to listen to a word and judge whether it was spoken by a human or a computer. The test words included mono-, di- and polysyllabic words of varying lengths spoken by DECtalk, Votrax and two male talkers. Results showed that across all words, listeners judged human speech the most natural, followed by DECtalk and Votrax.

Since the measures of naturalness proposed by Nusbaum, Francis, and Henly rely on the elimination of segmental variation from the test materials, it would be difficult to use such methods to assess the contribution to naturalness provided by the postlexical module described in this dissertation, since the phenomena being learned are segmental in nature.

In the following section, we will examine various ways in which the contribution of natural postlexical variation to synthetic speech has been assessed.

2.1.5. Case studies involving postlexical variation

We will now review two studies in synthetic speech which have sought to employ models of postlexical variation. Portele (1997) presented five phoneticians and six naive listeners with six sentences produced by a German speech synthesizer. He was focusing on

German schwa deletion in the context stop-schwa-sonorant. Each synthetic sentence was

produced in two forms, one with reduced vowel variants and the other without. The phoneticians preferred the reduced forms, while the naive listeners preferred the unreduced forms. According to Portele, "Synthetische Sprache wird jedoch anders perzipiert als natürliche Sprache, bei letzterer werden kanonische Realisierungen kaum toleriert, während synthetische Sprache durchaus 'deutlich' klingen darf" ("However, synthetic speech is perceived differently from natural speech. In natural speech, canonical realizations are hardly tolerated, while synthetic speech must sound thoroughly precise."). From this study, it appears that language professionals may have different, perhaps paternalistic, notions about what people want to hear from a speech synthesizer, and these judgments may not necessarily accord with those of the general public.

It is also possible that synthetic speech is still too unrealistic (i.e. not human sounding enough) to support very human characteristics like vowel reduction. That is, people may not be comfortable with a machine behaving in a manner they deem too human. Even a machine like Data on Star Trek: The Next Generation, who is human in so many respects, suggests that he is a machine by his failure to use contractions.

Sorin (1991) reports on the problem of mute e in French speech synthesis. She reports that the problem had previously been “hidden” due to the poor quality of synthetic speech. In other words, when a synthesizer’s basic segmental intelligibility is still in question, it probably does not make sense for researchers to spend time worrying about certain issues of postlexical variation. Now that quality had improved, she sought to address variation in mute e, in an effort to make the CNET synthesizer sound more natural.

Sorin (1991, 148) uses the term mute e to designate “those occurrences of /O/- or /§/-type vowels which can be either pronounced or omitted in French without modifying the meaning of the word or word sequence”, thus combining the varieties of e called by Delattre (1968) ‘muet’ and ‘instable’ or ‘caduc’. Table 2-1 provides examples of the variable in question.

Table 2-1: Mute e in French

Sorin (1991)   Delattre (1968)    Description             Examples
mute           muet               almost always elided    médecin
               caduc, instable    not always elided       le, appartement

Source: Sorin (1991, 148). Mute e’s in bold.

Most of the time, in colloquial French, the mute e is elided if its presence does not facilitate the pronunciation of the word or word sequence. A previous version of the CNET synthesizer only elided those mute e’s that appeared at the end of polysyllabic words. While it was thought that intelligibility was maximized by this procedure, listeners perceived the behavior as dysfluent or stumbling.

Sorin ran an experiment comparing conditions with differing treatments of mute e by the synthesizer. The main generalizations derived from this test were:

(1) When mute e is elided in synthetic speech, correct word identification is substantially lower than that obtained with natural speech pronounced the same way.

(2) Systematic pronunciation of intermediate mute e in synthetic speech leads to identification scores virtually identical to those obtained with natural speech, in which all examples of mute e are elided.

However, when subjects were asked to give a preference rating to the synthetic versions with and without elision, the one with all mute e’s pronounced was judged worst; it did not sound fully natural. While Sorin’s results show that introduction of some natural postlexical variation can lead to improved user preference, they also show that there may be an intelligibility cost to the introduction of that variation.

Nusbaum, Francis and Luks (1995) report a similar result to Sorin’s in an evaluation of several speech synthesizers, including two from Motorola. Nusbaum, Francis and Luks performed two sets of experiments, one set designed to determine the ranking of synthesizers with respect to intelligibility and another set designed to determine the ranking of the synthesizers with respect to acceptability.

The Motorola synthesizers did reliably better (at p < .05) in acceptability than the other synthesizers. In contrast, the Motorola synthesizers performed reliably worse than two of the other synthesizers in intelligibility. In the past, acceptability had been largely a measure of intelligibility, since synthesizers in general suffered from severe intelligibility problems. Today, intelligibility differences between synthesizers are less glaring, so acceptability judgments may be based more on judgments of naturalness, voice quality and prosody. The fact that the Motorola synthesizers were ranked lower for intelligibility than for acceptability is cited as evidence that listeners were judging acceptability and intelligibility along different lines.

Nusbaum, Francis and Luks were surprised that the most acceptable synthesizer was not found to be the most intelligible. They report that this is the first such discrepancy that they are aware of. However, it seems to us that Sorin experienced the same result: the more natural-sounding synthesis was in fact not the most intelligible.

2.2. Approximation to training data

Because of the way the acoustic module of the Motorola speech synthesizer is trained (described in section 1.2), it is particularly important that the data presented at run time approximate the data on which the system was trained, so that the synthesized speech resembles the original training data. The system is tested, or run, on text input from users. This input is then looked up in a lexical database, to be described in detail in section 3.1.

Rather than synthesizing lexical pronunciations directly, the lexical pronunciations from the dictionary are transformed into postlexical pronunciations typical of the speaker whose voice is being synthesized by means of the neural network based postlexical module.

Performing this transformation not only ensures that output pronunciations will accurately reflect postlexical processes for connected speech; it also ensures that the phonetic data that the acoustic neural network was trained on is matched as closely as possible by the phonetic data produced from text. For example, suppose that the speaker whose voice is being modeled has a merger of vowels before /r/, exemplified by pronouncing Mary, marry and merry as [meri]. This means that the acoustic neural network has never seen the sequence [Qr], as in some easterners’ pronunciation of marry, and thus if it were asked to synthesize such a sequence, it would not benefit from the accuracy of training data that it enjoys for sequences it has seen before. (Such sequences are indeed exhaustive, as is ensured by the phonetic balance of the training data; Egan 1948.) However, if the postlexical module, which is trained on transcriptions of the speaker being synthesized, transforms the tautosyllabic [Qr] sequence that is likely to be found in dictionaries to [er], then the phonetic neural network will be tested on a sequence with which it is familiar, improving the likelihood of especially accurate and natural synthesis of such a sequence.
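To make the mechanism concrete, the following is a minimal, purely illustrative sketch of the kind of lexical-to-postlexical rewriting just described. The ASCII phone symbols, the rule contexts and the function name are hypothetical simplifications of our own; the actual postlexical module learns such mappings from labeled data rather than applying hand-written rules.

# Illustrative sketch only: a hand-written lexical-to-postlexical rewrite of the
# kind the trained postlexical module is meant to learn from data. The phone
# symbols and the rule inventory here are hypothetical simplifications.

def to_postlexical(lexical_phones):
    """Map a list of lexical phone symbols to postlexical symbols."""
    out = list(lexical_phones)
    vowels = {"Q", "e", "i", "o", "u", "A", "@"}
    for i, p in enumerate(out):
        # Pre-/r/ merger: the modeled speaker says Mary, marry and merry alike,
        # so a dictionary [Qr] sequence is rewritten as [er].
        if p == "Q" and i + 1 < len(out) and out[i + 1] == "r":
            out[i] = "e"
        # Flapping: /t/ between vowels surfaces as a flap [R] (very rough context).
        if p == "t" and 0 < i < len(out) - 1 and out[i - 1] in vowels and out[i + 1] in vowels:
            out[i] = "R"
    return out

print(to_postlexical(["m", "Q", "r", "i"]))   # ['m', 'e', 'r', 'i']
print(to_postlexical(["T", "A", "t", "@"]))   # /t/ flapped between vowels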

2.3. Cross-dialectal comprehension

Modern text-to-speech systems utilize a pronunciation dictionary for pronunciation of most words (Liberman and Church 1992). Thus, the kinds of pronunciations provided in such a dictionary are critical to the synthesizer’s performance. Several studies have aimed to capture dialect variation in synthetic speech by providing a means for storing different pronunciations for different dialects in the synthesizer’s lexicon (Williams and Isard 1997, Fitt 1997, Bladon et al. 1987). In contrast to these phonological approaches, Hertz and Huffman (1992) show how detailed information on the phonetic implementation in each dialect is required.

As Labov and his colleagues in the Cross Dialectal Comprehension project have shown, segmental differences between American dialects can often result in serious misunderstandings (Labov 1989). Labov showed that Chicagoans, for example, had a marked local advantage in interpreting speech from Chicago, as compared to people from Birmingham and Philadelphia. Even when discourse context might have been thought to be a useful disambiguator, listeners often chose nonsense words to describe phonetic forms alien to their dialects (e.g. budgeroom over bedroom for [b«drum]).

Given that synthetic speech has been shown to be less understandable than natural speech (Pisoni 1997), it would seem that avoiding miscomprehension due to dialect differences is an important goal for synthetic speech. Therefore, if a synthesizer is to be used in a particular community, it may be a good idea to employ a synthesis of the local dialect, thus ensuring maximum intelligibility.

Nearness to the dialect of the listener may also be thought likely to engender greater acceptability for synthetic speech. For example, Williams and Isard (1997) suggest that a Scottish bank telephone service with synthetic speech would be better off having a Scottish accent than an English one. However, not all local dialects are prized by their speakers. Labov has shown in New York (1966) and elsewhere that locals react negatively to phonological characteristics of their local dialects.

In section 6.6, we will discuss the success of our neural network based postlexical module at learning dialect features of a particular speaker whose voice is to be synthesized.

Provided that this dialect is not subject to negative connotations on the part of its speakers, it is believed that such a procedure will lead to higher intelligibility, comprehensibility and acceptability of such a synthesizer.

2.4. Benefits of variability

Pisoni (1993, 1997) suggests that speech variability can actually enhance intelligibility. For example, Pisoni (1993) discusses results of training native Japanese speakers to distinguish English /r/ and /l/. When trained on variable stimuli, both across speakers and from different linguistic contexts within speakers, listeners improved in their perception of the /r/-/l/ distinction. Since, in the synthetic speech domain that we are considering, there is only one speaker, capturing the natural variability of segments in different environments may help improve intelligibility, though the effects of inter-speaker variability may be less applicable. One feature of the Motorola speech synthesizer’s neural network system that gives it the chance of reproducing more natural speech variability than a typical concatenation system is the size of the learned unit. The neural network learns the most appropriate 5 or 10 millisecond frame for each environment. Since concatenation systems usually employ speech units of much greater duration, the neural network system has the opportunity to insert more natural variability.

Chapter 3. Data sources

In this chapter, we will discuss the two principal data sources of our study: a lexical database and a hand-labeled speech corpus of one individual. The characteristics of these data sources, and the modifications we have made to them, are important to understanding the major goal of this dissertation, namely to generate postlexical pronunciations typical of the labeled speech corpus on the basis of the lexical pronunciations found in the lexical database.

3.1. Lexical database

We developed a relational database architecture, implemented in Microsoft Access, to contain pronunciation information (cf. Grimes 1988). As shown in Figure 3-1, the database, known as Lexorola, contains four fundamental tables: Variants, Orthography, Pronunciation and Properties. The Orthography table contains the orthography of the word. Each orthography has several potential spelling variants, which are contained in the Variants table. For example, color may be spelled colour. Each orthography has one or more pronunciations, which are stored in the Pronunciation table, along with the source dictionary of the pronunciation. Each pronunciation has one or more sets of properties, which are stored in the Properties table. These properties include part of speech and sense.

Such properties are important in choosing the correct pronunciation for non-homophonous homographs, such as record. As a verb, it is pronounced /r“°k‚rd/, while as a noun it is pronounced /°rEk“rd/. (We will use slashes / / to enclose lexical pronunciations and square brackets [ ] to enclose postlexical pronunciations.) An example where sense influences pronunciation is the word bow. When the word is used as a noun meaning ‘forward part of a ship’ it is pronounced /baU/; however, when it is used as a noun meaning ‘archer’s weapon’, it is pronounced /bo/. Homographs which are homophonous, such as bank meaning ‘riverside’ or ‘money business’, are not annotated in our lexical database.

Figure 3-1: Relational lexical database architecture

Orthography table: Orthography (primary key)
Variants table: Variants ID (primary key); Orthography (foreign key); Variant
Pronunciation table: Pronunciation ID (primary key); Orthography (foreign key); Pronunciation; Source
Properties table: Properties ID (primary key); Pronunciation ID (foreign key); Part of Speech; Sense

Note: Relationships are one-to-many, from Orthography to Variants, from Orthography to Pronunciation, and from Pronunciation to Properties. Figure adapted from Karaali et al. (forthcoming).
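As an illustration of the schema in Figure 3-1, here is a minimal sketch of the four tables expressed in SQL (via Python’s sqlite3 rather than Microsoft Access). The table and field names follow the figure, but the column types, the ASCII pronunciation strings and the sample rows are hypothetical.

# Minimal sketch of the Lexorola schema; sqlite3 stands in for Microsoft Access,
# and the pronunciation strings are illustrative ASCII placeholders.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Orthography (
    Orthography TEXT PRIMARY KEY
);
CREATE TABLE Variants (
    VariantsID  INTEGER PRIMARY KEY,
    Orthography TEXT REFERENCES Orthography(Orthography),
    Variant     TEXT
);
CREATE TABLE Pronunciation (
    PronunciationID INTEGER PRIMARY KEY,
    Orthography     TEXT REFERENCES Orthography(Orthography),
    Pronunciation   TEXT,
    Source          TEXT
);
CREATE TABLE Properties (
    PropertiesID    INTEGER PRIMARY KEY,
    PronunciationID INTEGER REFERENCES Pronunciation(PronunciationID),
    PartOfSpeech    TEXT,
    Sense           TEXT
);
""")

# One orthography, two POS-disambiguated pronunciations (record as noun vs. verb).
conn.execute("INSERT INTO Orthography VALUES (?)", ("record",))
conn.execute("INSERT INTO Pronunciation VALUES (?, ?, ?, ?)", (1, "record", "'rEk@rd", "CMU"))
conn.execute("INSERT INTO Pronunciation VALUES (?, ?, ?, ?)", (2, "record", "r@'kOrd", "CMU"))
conn.execute("INSERT INTO Properties VALUES (?, ?, ?, ?)", (1, 1, "noun", None))
conn.execute("INSERT INTO Properties VALUES (?, ?, ?, ?)", (2, 2, "verb", None))

# Select the pronunciation appropriate for a given part of speech.
row = conn.execute(
    "SELECT p.Pronunciation FROM Pronunciation p "
    "JOIN Properties pr ON pr.PronunciationID = p.PronunciationID "
    "WHERE p.Orthography = ? AND pr.PartOfSpeech = ?",
    ("record", "noun"),
).fetchone()
print(row[0])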

The relational lexical database was created from three source lexica: The Carnegie Mellon Pronouncing Dictionary (Weide 1995), Moby Pronunciator II (Ward 1996) and COMLEX English pronouncing lexicon (also known as Pronlex, Linguistic Data Consortium 1995). The dictionaries we used tended to list variant pronunciations associated with particular orthographies with no annotation, regardless of whether the variants were sociolinguistic (e.g. /k‚t/ and /kAt/ for caught), based on part-of-speech distinctions (such as live /lIv/ and live /laIv/) or semantics (such as lead /lEd/ and lead /lid/).

This kind of listing of variation is apparently suitable for speech recognition, where the object is to convert speech to text. In such an application, if the pronunciation /lid/ is recovered from the speech stream, it is sufficient to output the form lead; the fact that that orthography is also pronounced /lEd/ does not matter. However, speech recognition applications do need to be aware of, and correctly disambiguate, homophones like eye and I.

We removed sociolinguistic variants, such as dialect or style variants, in favor of a single plausible American English dialect, which we call “Generalized American”, from which various dialects and styles could be derived (cf. McLemore 1995). We distinguish this term from “General American”, which has been used by researchers to describe a dialect that they believe to be actually spoken by a segment of the community (e.g. Wells 1982, 10).

For example, although it is a minority American pronunciation, we preserved the distinction between voiced /w/ and voiceless /Ë/, as in ‘which witch’ /ËItS wItS/. A dialect with a merger of these two phones can be derived from such a lexicon, while a dialect that preserves the distinction cannot be derived from a lexicon with a merger, at least without reference to orthographic information. Of course, a lexicon based on these principles will end up reflecting no real dialect in particular. As Williams and Isard (1997, 2436) state in response to the idea of developing alternative accents for a synthesizer from a particular real base accent, “There seems to be no single accent containing all possible phonemes and distinctions of English accents”.

There have been several attempts at devising a transcription scheme for English that could accommodate some of the variation found in various dialects. Trager and Bloch (1941) describe a phonemic representation capable of describing a wide range of American dialects: “...it seems significant and convincing that the syllabic phonemes of all English dialects known to us...can be accommodated without forcing in the general system here set up” (Trager and Bloch 1941, 245). Labov (1991, 13) is a contemporary advocate of this system for similar reasons: “It is chosen as the representation that can best generate or relate current American dialects: a jumping-off point at which current sound changes begin”. Chomsky and Halle (1968, hereafter SPE) made use of underlying forms related to orthography, which reflect a much earlier stage of the language. They note that “there has...been little change in lexical representation since Middle English, and, consequently, we would expect (though we have not verified this in any detail) that lexical representation would differ very little from dialect to dialect in Modern English” (SPE, 54). In a discussion of /u/ allophony in section 4.2, we will point out how contemporary phonological environments may provide better explanations than Middle English word classes.

Wells (1982) describes vocalic variations in English dialects worldwide using keywords to denote lexical sets containing large numbers of words whose vowels are the same. Williams and Isard (1997) employ Wells’ keywords to facilitate developing speech synthesizers for new dialects. They find this approach easier than using phonemic transcriptions, since “it is not possible to choose one accent as a base form and systematically translate to others” (Williams and Isard 1997, 2435). According to Williams and Isard, if all vowels in the lexicon and the letter-to-sound conversion system use “keyvowels”, the same lexicon can be used for all accents. While we find this approach interesting, we believe it is a bit too hopeful to expect not to have to make lexical adjustments between major dialect groups, such as British and American. For example, many words such as corollary and ancillary have initial stress in American and antepenultimate stress in British. In addition, there exist consonantal differences such as between /skEdZu“l/ (American) and /SEdZu“l/ (British) for schedule, and /f“l°e/ (American) and /f°Il“t/ (British) for filet.

We filled in part-of-speech and sense information for those words whose pronunciation is dependent on that information. In addition, frequency-annotated part-of-speech tags for the remaining words were added to the lexical information, to aid in the disambiguation process described in Karaali et al. (1998). At present, the lexical pronunciation database has almost 200,000 pronunciations, of which over 1000 require disambiguation by part of speech and over 200 require disambiguation by sense. Synthesizers have modules which can tag incoming words for part of speech (Church 1989) or semantics (Yarowsky 1997) in order to help select the right pronunciations. Of course, this requires that lexica tag non-homophonous homographs for part of speech or semantics, as needed. One of the purposes of combining the three source lexica was to identify and tag for part of speech or semantics as many non-homophonous homographs as possible.

Transcription protocols are being developed, based on McLemore (1995) and Schmidt et al. (1993), in order to make dictionary entries stemming from different source lexica consistent. Dictionary transcriptions are being made more phonemic as opposed to phonetic, in order to simplify letter-to-sound learning, and to allow for speaker-specific postlexical allophony and dialect characteristics to be determined in the postlexical module.

We refer to the level of transcription found in the dictionary as phonemic, or lexical. We oppose this to the postlexical level in accordance with lexical phonology (Kiparsky 1982, 1985). Although the phonemic status of schwa is not entirely clear (e.g. Bolinger 1981, Giegerich 1992, 68-69), we do transcribe vowels that are ordinarily reduced (for example the initial vowel in about) with schwa as a practical matter. Lexical pronunciations are characterized by their appropriateness for use in isolation. They show a lack of possible vowel reduction (for example in function words such as can or had), and they do not feature segmental reflexes of postlexical rules such as flapping. These are often suitable starting points from which postlexical variants due to rate of speech, style and context will be derived. For example, the /t/ in the lexical transcription for thought, /T‚t/, might need to be modified to a flap [R] in the expression “I thought about it” to reflect conversational American speech (see Nespor and Vogel 1986 for more examples).

The fact that /t/ has several allophones has some important consequences for our discussion. First, we believe that the traditional phonemic level is a useful intermediary point for both synthesis applications and human speech processing (contra e.g. Cohen 1995), and that it is the appropriate level for pronunciation dictionaries and for the target pronunciations of an automatic letter-to-sound processing system. That is, we do not think it would be useful to store postlexical pronunciations in a dictionary for speech synthesis, for example both /T‚t/ and /T‚R/, although this is an approach that has been considered for speech recognition (see section 1.2). Storing allophonic variants in the dictionary would require more space, in addition to requiring a procedure for determining which variant to use in different contexts. We intend to perform the latter contextual determination in a manner, described below, that generates postlexical pronunciations from lexical pronunciations, so the postlexical variants do not need to be stored in the lexicon.

It is important to clarify the status of morphology in our lexical database. Allen et al. (1987), in an influential earlier approach, developed a morph dictionary containing 10,000 morphs. These morphs were used in combination with a morphological parsing scheme to simulate the coverage of a much larger dictionary. As Liberman and Church (1992) point out, the reduced cost of memory has allowed larger dictionaries to become common for speech synthesis. Since the electronic dictionaries that we are using have inflections and derivations spelled out (perhaps another artifact of their design towards speech recognition), we chose to maintain this characteristic. We have subsequently noted the importance of building morphological structure into the dictionary, as well as developing morphological analysis software.

Morphological sophistication would enable us to ensure paradigm uniformity, for example avoiding the hypothetical situation shown in Table 3-1, where some members of a paradigm are pronounced with one vowel and the rest with another. This kind of situation is a potential artifact of combining various dictionaries that filled out different paradigms to different extents based on text corpora or other principles, resulting in inconsistent or incomplete inflections or derivations.

Another advantage of morphological sophistication would be to avoid sending to the letter-to-sound conversion module orthographies that are not in the dictionary but are inflections or derivations of words that are. Analyzing a complex word into a stem that might already be in the dictionary, and adding affixes by means of morphophonological rules, would generate a potentially more reliable pronunciation than sending the whole word to the letter-to-sound conversion module (see the sketch following Table 3-1).

Table 3-1: Hypothetical example of inconsistent pronunciation across morphological paradigms

Orthography   Pronunciation
walk          /w‚k/
walked        /wAkt/
walks         /w‚ks/
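The following sketch illustrates the stem-plus-affix strategy referred to above: an out-of-dictionary inflection is pronounced by looking up its stem and applying a simple morphophonological suffix rule instead of falling back to letter-to-sound conversion. The toy lexicon, the ASCII phone strings and the voicing rule are hypothetical simplifications (the syllabic allomorphs of the suffixes, for instance, are omitted).

# A rough sketch (not the system's actual implementation) of routing an
# out-of-dictionary inflection through its stem plus a morphophonological
# suffix rule, rather than through letter-to-sound conversion.

LEXICON = {"walk": "w O k"}          # imagine "walked" and "walks" are missing

VOICELESS = {"p", "t", "k", "f", "T", "s", "S", "tS"}

def inflect_pronunciation(stem_pron, suffix):
    """Attach the regular past or plural/3sg suffix, agreeing in voicing."""
    last = stem_pron.split()[-1]
    if suffix == "ed":
        return stem_pron + (" t" if last in VOICELESS else " d")
    if suffix == "s":
        return stem_pron + (" s" if last in VOICELESS else " z")
    raise ValueError(suffix)

def pronounce(word):
    if word in LEXICON:
        return LEXICON[word]
    for suffix in ("ed", "s"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and stem in LEXICON:
            return inflect_pronunciation(LEXICON[stem], suffix)
    return None  # would fall back to the letter-to-sound module

print(pronounce("walked"))  # 'w O k t', consistent with the stem's vowel
print(pronounce("walks"))   # 'w O k s'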

Finally, it is important to clarify that although our dictionary represents a state of phonology after the completion of lexical phonological rules, it does not represent postlexical pronunciation. Tajchman et al. (1995) refer to the forms found in dictionaries such as the one we are using as postlexical due to the fact that morphological alternations are spelled out; however, we prefer to reserve that term for the output of our postlexical module, which relies on the context of words in utterances.

3.1.1. Characteristics of source dictionaries

The lexical database retains information about which words came from which dictionary. This information will prove useful in an attempt to see which dictionary’s transcriptions are most consistent. For example, the Pronlex database is thought to be of higher quality than the other sources, due to its manner of production and the transcription protocol to which it claims to adhere (McLemore 1995). In contrast, CMU (Weide 1995) notes that many of its pronunciations were in fact generated by a speech synthesizer, while Moby (Ward 1996) appears to have pronunciations stemming from several different sources, and consequently appears to lack consistency.

3.1.2. Transcription consistency and simplification

One of the transcription characteristics of all of the dictionaries is an inconsistency in the use of the low back vowels, /‚/ and /A/. This is, of course, not surprising, given the great variety of patterns in the distribution of these vowels across words and dialects (see Herold 1990, Labov 1996). One solution that we are contemplating is to enforce a distribution similar to that in British Received Pronunciation (RP), where orthographic ‘o’ would be pronounced /A/, while other “complex” combinations of letters, such as ‘augh’, ‘ough’, ‘aw’ and ‘au’, would be pronounced /‚/ (at least when the choice is between /‚/ and /A/). Although there is probably no American dialect with just such a distribution, it has the advantage of being plausible and principled, and of hopefully offending a minimum of people. Of course, it is also possible simply to eliminate such distinctions, whose merger is becoming more common among younger speakers (e.g. Hartman 1985). In any case, we will examine the extent to which the postlexical network “irons out” dialect asymmetries in the low back area between the lexical database and our recorded speaker in section 6.6. We examine the acoustic properties of /‚/ and /A/ in our recorded database in section 4.2.

We will now discuss some potentially inconsistent dictionary transcriptions that we have rectified. We chose to simplify the transcription system used in the lexical database, that is, to make it more phonemic, in order to improve letter-to-sound conversion performance and to allow the postlexical network to handle allophony in a speaker-specific fashion.

The apparent randomness in allophony in some of the source dictionaries used in our lexical database has been criticized: “Þ/“ are among the most inconsistent transcriptions in any dictionary and there is almost no consensus for Þ/“ transcription among different dictionaries” (Jiang et al. 1997). In addition, a profusion of symbols in a pronunciation dictionary complicates the task of transcribers employed to augment the vocabulary.

Table 3-1 shows the phone sets used in Pronlex, CMU and Lexorola. From Pronlex to Lexorola, there are progressively fewer symbols used. For example, only Pronlex employs syllabic nasals, while CMU and Lexorola express these sounds with schwa + nasal consonant. Pronlex and CMU both employ a single symbol for stressed and unstressed schwa + /r/, while Lexorola uses two symbols. Pronlex uses /Ë/, while Lexorola uses /h/ + /w/. Pronlex uses /“/ for the unstressed central vowel and /«/ for the stressed central vowel, CMU uses /«/ for both stressed and unstressed varieties, while Lexorola uses /“/ for both, a move also proposed by Ladefoged (1982: 29-30).

Table 3-1: Phone sets for three pronunciation dictionaries

Pronlex  CMU  Lexorola     Pronlex  CMU  Lexorola
Q        Q    Q            dZ       dZ   dZ
«        «                 S        S    S
E        E    E            T        T    T
I        I    I            Z        Z    Z
m`                         b        b    b
‚I       ‚I   ‚I           d        d    d
“½       “½                f        f    f
U        U    U            g        g    g
aU       aU   aU           h        h    h
aI       aI   aI           k        k    k
A        A    A            l        l    l
‚        ‚    ‚            m        m    m
eI       eI   eI           n        n    n
i        i    i            p        p    p
u        u    u            r        r    r
o        o    o            s        s    s
n`                         t        t    t
“             “            v        v    v
tS       tS   tS           w        w    w
D        D    D            j        j    j
N        N    N            z        z    z
Ë

Table 3-2 shows the varying treatment of stressed and unstressed vowels in the three dictionaries. In order to avoid problematic treatments of schwa (Jiang 1997, McLemore 1995), Lexorola has collapsed unstressed /I/ with /“/. McLemore (1995) proposes distinguishing between these two unstressed vowels by employing /I/ when an unstressed vowel is on either side of a [+coronal] segment, or when it represents an i in the orthography. This rule overgenerates in the case of words like attain /“.°teIn/. Perhaps McLemore’s rule could be improved by stating that /“/ becomes /I/ in the environment of a tautosyllabic coronal; however, we will suggest that leaving schwa-coloring unspecified in the lexicon, potentially in conjunction with a postlexical processor, will generate more accurate speaker-specific transcriptions. Acoustic results on schwa realization will be discussed in section 4.2, while symbolic results on schwa realization will be discussed in section 6.4.

Table 3-2: Stress in three dictionaries

            Pronlex               CMU                   Lexorola
vowel       pri   sec   unstr     pri   sec   unstr     pri   sec   unstr
Q           √     √     √         √     √     √         √     √     √
«           √     √               √     √     √
E           √     √     √         √     √     √         √     √     √
I           √     √     √         √     √     √         √     √
m`                      √
‚I          √     √               √     √     √         √     √     √
“½          √     √     √         √     √     √
U           √     √     √         √     √     √         √     √     √
aU          √     √               √     √     √         √     √     √
aI          √     √               √     √     √         √     √     √
A           √     √     √         √     √     √         √     √     √
‚           √     √     √         √     √     √         √     √     √
e           √     √     √         √     √     √         √     √     √
i           √     √     √         √     √     √         √     √     √
o           √     √     √         √     √     √         √     √     √
u           √     √     √         √     √     √         √     √     √
n`                      √
“                       √                               √     √     √

The benefit of schwa allophones in the dictionary is that the pronunciations can often be used without postlexical modification. The difficulty is that enforcing such distinctions in the dictionary can be problematic, as McLemore (1995) observes. If a postlexical module can be relied on, as in the present case, perhaps the allophonic variation in the dictionary can be dispensed with.

3.2. Labeled speech corpus

We will now describe the speech database used for training the postlexical neural network, in addition to the duration and acoustic neural networks described in section 1.2. The database consists of both sentence-length materials and single-word utterances. These materials were distributed as shown in Table 3-1, and consist in total of 7088 words. The table shows the total quantity of utterances of each type and shows the split between training and testing sentences for the postlexical network. Postlexical training was performed on a randomly selected 90% of the total materials, while all test results are given for the remaining 10% which were withheld from training.

Table 3-1: Materials in the labeled speech corpus

Sentence type            Total quantity   Train quantity   Test quantity   Mean word length   Source
“Harvard”                453              401              52              7.97               Egan (1944)
semantically anomalous   142              125              17              11.19              Corrigan (1996)
questions                45               43               2               10.18
news/manuals             20               19               1               8.09
words                    1066             958              108             1

A university-educated male speaker who grew up in Chicago was recorded with a close-talking microphone in a soundproof room reading the speech database when he was 36 and 38 years old. The talker was asked to read the materials in a normal voice. The talker also recorded a similar set of materials with the instruction that he speak “clearly”, but these were eliminated from the current study, as well as from acoustic model training, due to their unnaturalness.

The speech was labeled at a fairly narrow phonetic level in a manner similar to the TIMIT database (Seneff and Zue 1988) by an electrical engineer with some graduate training in linguistics. Although the database was originally labeled in accordance with TIMIT, some subsequent modifications to the phone set were made in accordance with the CSLU (Center for Spoken Language Understanding) labeling guidelines (Lander 1997): both aspirated and unaspirated voiceless stops were used, and a distinct symbol was introduced for glottal stops derived from /t/ as opposed to those that emerge intervocalically. Both of these changes were introduced by the labeler by rule before postlexical training had begun. In subsequent databases, aspiration distinctions have been entered by labelers by hand, on an auditory or signal display basis.

In Table 3-2 we have listed the phones used in the labeled corpus. We have translated the ASCII Worldbet symbols (Hieronymus 1993) used in the label files into a modified version of the IPA. IPA modifications include the labeling of stop closures with a ‘c’ suffix, and the labeling of the glottal stop derived from /t/ as /t?/. Stop releases are indicated without the ‘c’ suffix, and aspiration is noted by a superscript ‘h’. Stop closures and releases are labeled separately in some databases used for speech technology, due to the important acoustic distinctions between the two events.

Table 3-2: Postlexical phones used in labeled corpus

A Q « ‚ aU “ “8 “½ aI b bc tS d dc D R E l` m` n` N` ¶½ eI f g gc h â I Þ i dZ dZc k kH kc l m n N RÊ o ‚I p pH pc ? r s S t tH tc T U u ” v w j z Z t?

It is important to emphasize that the recorded database is read speech. Labov (1994, 157-158) reiterates his position that attention paid to speech is the central organizing principle of stylistic variation. Following Labov, spontaneous speech covers a spectrum ranging from casual speech, which is closest to the vernacular, through careful speech, such as in a formal interview. These are distinguished from controlled styles, such as reading, which are at the formal end of the stylistic continuum. According to Labov, “only in spontaneous speech will we find the most advanced tokens of linguistic change in progress” (1994, 158).

Blaauw (1994) discusses some of the prosodic aspects of read speech that allow listeners to discern it reliably from spontaneous speech. Read speech is probably closer to clear speech (Picheny 1985, 1986) than spontaneous speech, and is likely to feature fewer postlexical reductions (Faust 1995, Van Bergem 1993, Koopmans-Van Beinum 1992, Eskénazi 1992). Nevertheless, as will be seen, our read database provides a wide range of postlexical phenomena. Of course, it will be interesting to work on spontaneous speech in future work, as it will undoubtedly widen the range of phenomena encountered.

It is also possible that an acoustic model for a speech synthesizer will be more natural if it is trained on spontaneous speech. According to Laan (1997, 44):

Although most text-to-speech systems are fairly well intelligible nowadays, they often employ a somewhat “dry” and “mechanical” reading style that requires much concentration of the listener. Most of our knowledge on speech perception and speech production and thus speech synthesis too, is based upon research on read speech, often even with nonsense words. Therefore, studies on different types of speaking styles are requested to complement our basic knowledge about the process of speech production and perception.

We hope to take up this challenge and to test the hypothesis that synthesis based on spontaneous speech requires less listener concentration, when we analyze a new speaker currently being labeled, who was recorded with both read and spontaneous speech.

The manner in which the speech database that we are using was labeled will have an effect on the way in which we view postlexical variation. Lander (1997) points out how both the TIMIT and the more recent CSLU labeling systems for English are largely phonemic, but include several allophones, which are deemed “spectrally distinct” and “frequently occurring”. For example, nasalization is not indicated on vowels, but flaps, fronted /u/ and two reduced vowel qualities (Þ and “) can be indicated. Seneff and Zue (1988, 4) explain why additional allophonic variants, such as nasalized or lateralized vowels, are not employed: “Such information would surely be useful, but the decision-making process is prone to judgment error, and would require a significant increase in time and effort (for labeling).”

Greenberg et al. (1996) discuss the failure of phone-based models, despite their adequacy for carefully articulated or read speech, to capture “many of the spectro-temporal properties of spontaneous speech typical of informal spoken dialog”. However, we feel, following Keating et al. (1994, 135) with respect to the TIMIT database, that a conceptualization involving phonetic segments is useful:

The TIMIT labels represent a necessarily-arbitrary segmentation and categorization of utterances into phonetic segments. Obviously, any study of these labels will be acceptable only to researchers who accept segmental phonetic transcriptions as a useful record of speech-events. It is possible to admit that segmental transcriptions have no theoretical basis as formal representations of speech, and yet still find them useful: they are a shorthand for, a pointer to, key articulatory and acoustic events.

Laporte (1997, 411-412) also acknowledges the flaws of string-based phonetic transcriptions; however, he notes that they are convenient and useful and “standardized to quite a reasonable degree among linguists”. In addition, string-based analysis lends itself to the machinery of regular relations (Kaplan and Kay 1994).

3.2.1. Syntactic and prosodic labeling

Various syntactic and prosodic information was provided in the speech labeling scheme as well. Figure 3-1 illustrates the speech labeling scheme with all of the tiers that were labeled on the present speech corpus. We have discussed the phonetic tier above, and will describe the rest of the tiers below.

Syllable boundaries were provided on perceptual grounds where possible. According to the labeler, in cases where the speech signal did not permit an easy syllable division, The New Merriam-Webster Dictionary (Merriam-Webster 1989) was consulted. This dictionary’s syllabification was based on the practices of Webster’s Ninth New Collegiate Dictionary (Merriam-Webster 1984, David Justice, personal communication). Interestingly, Merriam’s syllabification policy changed between the ninth and tenth (Merriam-Webster 1996) editions of the Collegiate (Brian Sietsema, personal communication). The ninth edition uses a stress-conditioned resyllabification analysis (see Blevins 1995, 232), while the Tenth Collegiate appears to adhere more strictly to the Maximal Onset Principle (see Blevins 1995, 230). Examples (using Merriam’s transcription symbols) from the ninth Collegiate show that stressed short vowels often have codas: lemon °lem¾“n, dirty °d“rt¾e#, litter °lit¾“r, cf. healer °he#¾l“r. The same words from the tenth Collegiate show strict adherence to the maximal onset principle regardless of the length of the preceding vowel: °le¾m“n, °d“r¾te#, °li¾t“r, °he#¾l“r.

It should be noted that there is no labeling of resyllabification across words in the corpus. Nespor and Vogel (1986, 64) claim that resyllabification does not occur across words in “normal colloquial” speech; however, they do not exclude the possibility that it does occur in “fast or sloppy” speech. Contrary to what might have been expected, Labov (1997) reported that resyllabification across words in a vast vernacular database occurs with low frequency. Given these results, and the fact that our database is at the formal end of the stylistic continuum, the absence of labeling of resyllabification may not be critical.

Figure 3-1: Speech labeling scheme

Source: Karaali et al. (forthcoming)

There were two separate passes for marking word-level groupings and above. The first, syntax-oriented, phase involved assigning syntactic phrase, clause and sentence boundaries, function/content tags and prominence rankings (O’Shaughnessy 1976) to words. The second, prosody-oriented, phase involved labeling the database according to the ToBI conventions (Beckman and Elam 1997).

Phrase and clause boundaries were added according to traditional notions. Sentence boundaries were added at the beginning and end of each utterance, which coincided with textual sentences denoted by periods. Words were assigned the labels content and function, more or less corresponding to open and closed class words (see Haegeman 1991, 105). Each word was also assigned a prominence ranking, according to Table 3-1. These rankings are based on O’Shaughnessy (1976), who found that the part of speech of a word affects fundamental frequency in approximately the rank ordering shown in the table. Similar rankings were used by Allen et al. (1987).

Table 3-1: Prominence rankings

Ranking   Word types
0         articles
1         conjunctions, relative pronouns
2         prepositions, auxiliaries, nonemphatic modals, vocatives
3         personal pronouns
6         finite verbs, demonstrative pronouns
7         nouns, adjectives, ordinary adverbs, negative contractions
8         reflexive pronouns
9         emphatic modals (e.g. “I do like ice cream”)
10        quantifiers
11        interrogative words
12        negative adverbs
13        sentential adverbs

Note: Based on O’Shaughnessy (1976) and Allen et al. (1987).
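For concreteness, the rankings in Table 3-1 amount to a simple lookup from word class to prominence value, as in the following sketch. The class names mirror the table; the mapping from an actual word to one of these classes is assumed to be done elsewhere (e.g. by the part-of-speech tagger), and the dictionary name is our own.

# Table 3-1 as a lookup from coarse word class to prominence ranking.
PROMINENCE = {
    "article": 0,
    "conjunction": 1, "relative pronoun": 1,
    "preposition": 2, "auxiliary": 2, "nonemphatic modal": 2, "vocative": 2,
    "personal pronoun": 3,
    "finite verb": 6, "demonstrative pronoun": 6,
    "noun": 7, "adjective": 7, "ordinary adverb": 7, "negative contraction": 7,
    "reflexive pronoun": 8,
    "emphatic modal": 9,
    "quantifier": 10,
    "interrogative word": 11,
    "negative adverb": 12,
    "sentential adverb": 13,
}

print(PROMINENCE["noun"], PROMINENCE["article"])  # 7 0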

The ToBI labeling includes tiers with tones, breaks and words. The words tier simply contains the orthography for each word. The words tier was useful for obtaining the lexical pronunciation of each word from the lexical database, which was aligned with the postlexical pronunciations (see section 5.1). The breaks tier contains a number between 1 and 4 (including diacritics, which were ignored in the postlexical neural network encoding discussed in section 5.4), indicating the relative disjuncture between each pair of words. ‘4’, indicating the greatest possible disjuncture, signals the end of an intonational phrase, while ‘3’ signals an intermediate phrase (Pierrehumbert and Beckman 1988). ‘1’ signals the normal disjuncture between words, while ‘2’ is either a strong ‘1’ or a weak ‘3’. As will be seen in section 5.4, break values of ‘3’ and ‘4’ were used in the neural network encoding to signal the ends of intermediate and intonational phrases, respectively, while the other breaks were ignored.
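The reduction of the breaks tier for the network encoding can be pictured as in the sketch below. The function name and the diacritic handling are hypothetical, but the mapping (keep ‘3’ and ‘4’ as phrase-end markers, ignore the rest) is the one just described.

# Sketch of reducing ToBI break labels to phrase-boundary markers.
def phrase_boundary(break_label):
    """Return 'intermediate', 'intonational' or None for a ToBI break label."""
    core = break_label.rstrip("-p!")   # strip diacritics such as '3-' or '2p'
    if core == "4":
        return "intonational"
    if core == "3":
        return "intermediate"
    return None                        # break indices 1 and 2 are ignored

print([phrase_boundary(b) for b in ["1", "2", "3", "4-", "4"]])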

The tones tier contains pitch accents, which fall on accented syllables, phrase accents, which fall at the ends of intermediate and intonational phrases, and boundary tones, which fall at the ends of intonational phrases. The tones tier was used in the neural network encoding to mark whether or not a syllable was accented. The type of accent (e.g. H*, L*+H) was not provided to the postlexical neural network, though it was provided to the duration and phonetic implementation networks. While some research has shown the important effects of presence of accent on segmental allophony (e.g. Dilley et al. 1996), it is not clear whether type of accent should have a particular effect.

3.2.2. Levels of labeling

Barry and Fourcin (1992) present a useful typology of speech database labeling levels, described in Table 3-1, within which it will be useful to situate our labeling procedure. Our labeling system, as well as the TIMIT and CSLU labeling systems, shares characteristics of Barry and Fourcin’s acoustic-phonetic, narrow-phonetic and broad-phonetic levels. For example, our use of separate symbols for stop closures and releases fits in with the acoustic-phonetic level. Our use of some allophonic labels, such as those shown in Table 3-2, coincides with the narrow-phonetic level. An additional source of allophony in the speech corpus not shown in the table is the laryngealized vowels, marked in the corpus by a preceding glottal stop [?]. As has been stated above, the use of allophonic labels is somewhat arbitrary and incomplete, so our system also corresponds more to the broad-phonetic level for those phones lacking allophonic symbols in Table 3-2. CSLU’s (Lander 1997) labeling system actually permits stricter adherence to the narrow-phonetic level by its use of various diacritics to signal nasality, rounding, etc.

Table 3-1: Barry and Fourcin’s (1992) Labeling Typology

Physical: multitiered; can include data from speech analysis machinery, such as nasal transmission detectors or electropalatographs; can be linked to descriptions such as Browman and Goldstein (1986)

Acoustic-phonetic: uses established phonetic descriptors such as stop closure, release burst, aspiration; primary criterion for assigning a label to a stretch of speech is acoustic homogeneity; can retain base symbol link so that narrow and broad phonetic levels can be derived (semi-)automatically

Narrow-phonetic: uses IPA or ASCII equivalent; difficult since many processes are not categorical; differentiates as much as possible with respect to properties modifying the base sound (e.g. rounding, nasality)

Broad-phonetic: only employs symbols that have phonemic status, but uses them to indicate non-phonemic, continuous speech phenomena, e.g. good boy [gUb b‚I]

Phonemic: represents functionally distinctive sound units of a language; serves as mediator between sound signal and lexicon; decision on phonemic form of words cannot always be automatic, cf. either /iD“½/, /aID“½/

Table 3-2: Allophones in TIMIT labeling system

Phoneme   Allophones              Phoneme   Allophones
/u/       [u], [”]                /d/       [d], [R]
/“/       [“], [“8], [Þ]          /l/       [l], [l`]
/t/       [t], [t|], [R], [?]     /m/       [m], [m`]
/h/       [h], [â]                /n/       [n], [R)], [n`]

In our investigation of the linguistic representations useful at various stages in text-to-speech synthesis, we will pay particular attention to the interface between phonemic and allophonic representations.

Chapter 4. Experimental approach to comparing gradient vs. discrete aspects of postlexical variation

A potential benefit to labeling allophonic variation in the speech corpus is that better models for particular acoustic events can be learned by the Motorola synthesizer’s acoustic neural network described in section 1.2. For example, if the two /u/ allophones, [u] and [”], are used as labels, it is likely that the vowels corresponding to each label will be more tightly clustered in formant space than if only one phonemic /u/ label were employed. Provided one knows when to request each variant of /u/, the more appropriate allophone can be used. In addition, it may be suspected that training two separate allophones will allow /u/’s to cover more of the vowel space than they might if only one /u/ model were trained. If there were only one /u/ model, it would likely reflect an averaging of fronted and unfronted /u/’s, therefore representing a compression of the vowel space for /u/. Given Bradlow et al.’s (1996) finding that speakers with larger vowel spaces are more intelligible than those with smaller vowel spaces, training an acoustic module to have two /u/ models might well aid intelligibility.

In this section, we describe spectrographic analyses performed to examine the clustering of vowel allophones in an effort to assess the contribution of the allophonic labels. We describe an experiment in which we trained the acoustic neural network differentially using phonemic and allophonic labels, in an effort to see how well the network can capture allophony in acoustic space without the aid of allophonic labels. Based on the results of these experiments, we will explain our position on the use of allophonic labels.

It is a truism in phonetics that no two utterances, even of the same word, are identical. Nevertheless, linguists have sought to categorize sounds that have particular commonality. For example, phonemes can be posited on perceptual grounds via minimal pair tests. This research aims to shed light on the question of whether allophones of phonemes are in fact infinite or profitably limited to a small number, perhaps depending on the phoneme involved. As an example of the dual contribution of these possible extremes to the grammar, Labov (1994, chapter 7) discusses how both a gradient approach to phonetic implementation and a discrete approach to the phonological description of chain shifts results in the best description of these phenomena.

In addition to providing empirical evidence for characterizing the nature of phonetic processes, this research can be of practical benefit to speech technology applications, where such results can influence the modeling of postlexical processes as well as the representation used in pronunciation dictionaries and speech databases. These experiments will also provide acoustic phonetic evidence that bears on the debate between Liberman and Pierrehumbert (1984) and Kiparsky (1985), among others, regarding the categorical, gradient or mixed nature of postlexical processes.

We employ a speech synthesis system featuring a neural network for generating acoustic parameters for 5 millisecond frames of speech on the basis of current and surrounding linguistic information (Karaali et al. 1997) as a laboratory in which to assess whether particular postlexical processes are best treated as gradient phonetic implementation rules or discrete, symbolic phonological processes. We envisage the four possible conditions with respect to the symbolic representation of allophony in our speech synthesis system shown in Table 4-1. We will elaborate on each of these conditions below. We believe that Conditions 1 and 2 are the most viable, and those will be the ones distinguished in the experiments below.

Table 4-1: Phonemicity and allophony in speech synthesis data sources

Data source      Condition 1   Condition 2   Condition 3   Condition 4
dictionary       phonemic      phonemic      allophonic    allophonic
labeled corpus   phonemic      allophonic    allophonic    phonemic

Condition 1 is characterized by the use of phonemic labels in both the dictionary and the labeled corpus. In this case, all allophony will need to be handled by the acoustic neural network. As Barry and Fourcin (1992, 10) indicate, a phonemic labeling system allows transparent lexical access between the dictionary and the acoustic models generated from the corpus. Our experiments below will shed light on the extent to which relying on the acoustic module to handle all allophony is sufficient.

98 Condition 2 combines a phonemic dictionary with an allophonically labeled corpus. In this condition, a postlexical module to convert between phonemic lexical symbols and the allophonic symbols of the corpus and acoustic module is required. This condition is the one currently being pursued most intensely. We believe that keeping the dictionary phonemic allows it to be reused for a variety of voices. In addition, the postlexical and acoustic modules can be easily retrained for each speaker, in order to learn their individual postlexical phonology.

Condition 3 combines an allophonic dictionary with an allophonically labeled corpus. While lexical access might appear to be transparent, as in Condition 1, this approach is problematic. For example, it would be difficult to design an allophonic dictionary that would handle cross-word phenomena with ease. Due to the speaker-specific nature of postlexical phonology, the lexicon might need to be revised for each speaker, a potentially labor-intensive process. It might be suggested that a dictionary with a wide range of allophonic variants would be useful; however, a principled means by which to select the appropriate variant that takes into account a constellation of linguistic and paralinguistic information would need to be developed. Cohen (1989, chapter 5) discusses the notion of allophonic cooccurrence, whereby the association between such constellations of information and particular linguistic forms can be modeled in a speech recognition system.

Finally, Condition 4 uses an allophonic dictionary with a phonemically labeled corpus. The problems with an allophonic dictionary discussed for Condition 3 still apply. In addition, there would be no meaningful acoustic models for the various allophones in the dictionary to be associated with in the acoustic module. Given this situation, we feel that this condition would be unsatisfactory.

It is assumed that the acoustic neural network is affected by these labeling choices. We will analyze the clustering of the allophones in formant space in both Conditions 1 and 2 in order to assess the relative contribution of the various labeling schemes. This experiment will be informed by an analysis of the original speech.

There are several different perspectives from which to consider postlexical variation. One way is to look at the structural symbolic, or transcription, differences between lexical and postlexical representations. For example, certain kinds of deletion can be viewed as the deletion of a symbol, while flapping could be viewed as the substitution of one symbol for another. Of course, such a symbolic approach might be faulted for ignoring the finer-grained articulatory and acoustic changes that are taking place. Some postlexical variation might best be considered from an acoustic or articulatory perspective. For example, the study of vowel allophones’ formant measurements might shed light on their patterning.

In the case of the subphonemic vowel symbols, we intend to compare the acoustic neural network’s performance with and without such symbols, in an effort to see how much the network’s learning of postlexical variation is assisted by explicit labeling. We will attempt to determine the relative value of achieving postlexical variation by symbolic means versus acoustic, phonetic means. For example, we will examine the formant values of fronted /u/ when given a special label [”], versus letting the acoustic neural network determine phonetic allophony based on a single phonemic label /u/.

We propose to examine the first and second formant values of certain pairs of vocalic allophones produced by the speech synthesizer under three conditions. The first “original” or “natural” condition is the original natural speech. The second “normal neural network” or “allophonic label” condition involves training the acoustic neural network on the original TIMIT allophonic labels in our speech database. The third “collapsed neural network” or “phonemic label” condition will involve training the acoustic neural network on a version of the speech database where allophonic labels have been replaced by a “collapsing” label. This collapsing label might be the underlying phoneme, as in Table 3-2, or an arbitrary choice between the allophones to be collapsed. Note that in the collapsed condition, the features of the collapsing label are applied to both allophones.

We will examine the clustering in formant space of allophones of the same phoneme under the three conditions. If the allophonic label condition is more similar to the natural distribution than the phonemic label condition, the use of a ‘phonological’ label will be seen to contribute positively to the description of events. However, if the phonemic label condition results in a distribution at least as similar to the natural condition as the allophonic label condition, we can assume that this allophony is perhaps of a gradient type; that is, not necessary to describe in symbolic terms.
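One simple way to operationalize “more similar to the natural distribution” is sketched below: compute per-allophone mean F1/F2 under each condition and measure each condition’s distance from the natural means. The use of Euclidean distance between mean formant vectors is an illustrative choice of our own, not the analysis procedure reported in the following sections, and the numbers are invented.

# Illustrative comparison of formant distributions across conditions.
import numpy as np

def mean_formants(tokens):
    """tokens: list of (F1, F2) measurements for one allophone in one condition."""
    return np.mean(np.asarray(tokens, dtype=float), axis=0)

def distance_to_natural(natural_tokens, condition_tokens):
    return float(np.linalg.norm(mean_formants(natural_tokens) - mean_formants(condition_tokens)))

# Hypothetical measurements (Hz) for fronted [u] under the three conditions.
natural   = [(350, 1900), (360, 1850), (340, 1950)]
allophone = [(355, 1880), (345, 1920)]
phonemic  = [(380, 1500), (390, 1450)]

print(distance_to_natural(natural, allophone))  # smaller: closer to natural speech
print(distance_to_natural(natural, phonemic))   # larger: fronting lost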

4.1. Acoustic neural network

(This section was prepared with the help of Noel Massey.)

The role of the acoustic neural network (also known as the phonetic implementation network) is to convert the linguistic representation of speech into an acoustic representation of speech. Specifically, the output of the phonetic network is a vector of vocoder parameters for a five-millisecond frame of speech which can be converted into a speech waveform by using the synthesis portion of a vocoder (Karaali et al. 1997). The input to the acoustic neural network contains phone information, stress information, and syntactic information (it is also possible to include prosodic information, but the network described here was intentionally reduced in size and complexity in order to make training faster) for each frame of speech.

In order to convert the linguistic input into the acoustic output, a multi-layer neural network is used, as shown in Figure 4-1. In order to keep the neural network architectural descriptions manageable, we have adopted a block notation that allows nodes to be grouped into a more abstract level of representation (Karaali 1989). The notation incorporates two fundamental symbols. Rectangles represent a single layer of neural network nodes and encapsulate weight, bias and activation characteristics. All other components in our architectures, including blocks that facilitate input and output, are denoted with hexagons. Labeled arcs are used to describe the dimensionality of input and output. Blocks that receive more than one arc indicate input vectors that are concatenated from multiple sources. For reference, each block in each diagram is also indexed with a number surrounded by square brackets. This notation allows neural network architectures to be represented in a compact block format.

In order to enhance the neural network’s concept of phones, the phone information is supplemented with vectors of features, similar to those shown in Table 5-1. This somewhat redundant information improves the network’s ability to learn similarities between phones. The phone to feature blocks (blocks 3 and 4) convert the phones into sets of features. Note that in the collapsed conditions, the features take on values corresponding to the collapsing label.

Blocks 1 and 2 in Figure 4-1 represent the other linguistic inputs to the neural network. Block 1 utilizes a time-delay neural network (TDNN) window. The TDNN window provides a running window over 30 non-uniformly sampled frames over 415 milliseconds. More samples are taken towards the middle of the window than at the ends. This method provides a reasonable context over which to determine the appropriate acoustic parameters. Block 1 has the following inputs: prominence ranking (discussed in section 3.2), syllable stress (primary, secondary or unstressed), function/content specification, and the phone label. It is this phone label that is either modified or not in the different experimental conditions to be described.

Input block 2 provides information about the duration of the preceding and following four segments, as well as their distance to the current frame. This block has the possibility of providing information more distant than that provided in the TDNN window of Block 1. In addition, information is provided about the distance to preceding and following syntactic boundaries.

Output block 12 contains the coder parameters for each frame of speech. In the training situation, these are presented as target outputs at the same time as the inputs to blocks 1 and 2, and the neural network’s weights in blocks 5-11 are adjusted to minimize error. In the testing, or runtime, situation, when input is fed into blocks 1 and 2, output is emitted from block 12, which is then passed to a coder in order to be synthesized into speech.

[Block diagram: output block [12] at top, fed by neural network blocks [11], [10] and [9] (arc labels 14, 40 and 60), which are in turn fed by blocks [5]-[8] (arc labels 15, 15, 15 and 10); these receive input from phoneme-to-feature blocks [3] and [4] and input blocks [1] and [2], with arc labels including 30*56=1680, 6*53=318, 438, 8, 30*76=2280 and 446.]

Figure 4-1: Acoustic Neural Network

Source: Adapted from Karaali et al. (forthcoming).

4.2. Experimental procedure

We examined first and second formant values of four pairs of vowels in three separate conditions. Since this experiment was designed to investigate the contribution of allophonic labels to phonetic implementation, we will refer to the pairs of vowels in each experiment as “allophones”, even if they are not strictly allophones in a theoretical sense.

The first condition, “original” (speech), was a testing subset (97 out of 470 Harvard sentences)15 of the original recorded speech in our database, downsampled to 8000 samples per second. The second condition, “normal” (acoustic neural network), involved training our acoustic neural network on the training data using original allophone labels and testing it on the testing subset. The third condition, “collapsed” (acoustic neural network), was the same as the second condition, except that the phone labels and features of allophones of the phone under study were collapsed to those of a single phonemic label, thereby preventing the network from benefiting from whatever contribution separate allophonic symbols might provide. Table 4-1 shows the quantities of each of the allophones examined in this set of experiments. The testing subset comprises approximately 20% of the entire data and was withheld from acoustic neural network training.

15 This number is larger than the number of Harvard sentences used in postlexical training, discussed in section 3.2, due to the fact that several files needed to be eliminated from postlexical training due to various problems aligning lexical and postlexical forms (see section 5.1) in those files.

Table 4-1: Allophone quantities in testing subset

Collapsing Phone   "Allophone"   Quantity
/“/                [Þ]           123
                   [“]           84
/u/                [u]           11
                   [”]           26
/A/                /A/           50
                   /‚/           43
/o/                /o/           38
                   /i/           72

In one set of experiments, we examined the formant values of [“] and [Þ] in the first two conditions and collapsed them both with the /“/ label in the third condition. In a second set of experiments, we examined the formant values of [u] and [”] in the first two conditions and collapsed them both with the /u/ label in the third condition. Besides altering the phone label input, the third condition also provided the features of the replacement phone during network training.

After the initial sets of experiments were conducted, we chose to run two additional sets in order to gain more insight into the nature and value of this method for assessing the extent of symbolic mediation in phonetic implementation. We chose to consider /A/ and /‚/ as allophones for the purpose of this experiment, and to collapse them as /A/ in the collapsed condition. We thought that since they really are phonemes, with overlapping distributions, chaos should result in the collapsed condition. Contrary to our expectations, as we discuss below, the collapsed condition maintained a significant distinction between /A/ and /‚/, indicating that in our speaker's dialect the environments for /A/ and /‚/ may be somewhat complementary.

Finally, we chose to identify a pair of vowels whose distribution was not only patently not complementary, but whose realizations in formant space were substantially distinct. For this experiment, we used /o/ and /i/, and collapsed them as /o/. As we will discuss below, chaos does indeed result in the collapsed condition. If nothing else, this lends some credence to the results in the more meaningful experiments where the collapsed condition does not result in chaos.

Below we describe the results for each set of experiments. To assess the effects of the different training situations, we use formant values predicted by the Entropic ESPS version 5.2 formant program. F1 and F2 measurements were taken at the vowel midpoints for each occurrence of the vowels under study. A window duration of 0.1 seconds and a frame step of 0.005 seconds (in order to be synchronized with the frame size of the voice coder) were used. In the figures below, formant values are all placed on grids with the same dimensions, representing the entire vowel space of our speaker, in order to highlight their relative location in his vowel space.

In order to provide further graphical indications of each of the vowels' extent in space, we wrote a program to draw ellipses around the vowels according to the following: the ellipse midpoint is the intersection of the F1 and F2 means for the vowel in each condition. The major and minor axes of the ellipse are two standard deviations along F1 or F2. The degree of rotation of the ellipse is derived from the standard rotation of axes, where the angle θ is determined from Equation 4-1. The degree of rotation therefore captures the covariance of F1 and F2.

Equation 4-1: Rotation of axes

tan 2θ = covariance(F1, F2) / (variance(F1) − variance(F2))
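Purely as an illustration (this is not the program actually used), the ellipse construction just described can be sketched with matplotlib; the function and variable names are ours:

```python
import numpy as np
from matplotlib.patches import Ellipse

def vowel_ellipse(f1, f2, **style):
    """Dispersion ellipse for one vowel in one condition, following the text:
    centre at the F1/F2 means, axes of two standard deviations along F1 and
    F2, and rotation theta from tan(2*theta) = cov(F1,F2)/(var(F1)-var(F2))."""
    f1, f2 = np.asarray(f1, float), np.asarray(f2, float)
    theta = 0.5 * np.arctan2(2.0 * np.cov(f1, f2, ddof=0)[0, 1],
                             np.var(f1) - np.var(f2))
    # The dissertation's figures plot F2 on the x axis and F1 on the y axis.
    return Ellipse(xy=(f2.mean(), f1.mean()),
                   width=2.0 * f2.std(), height=2.0 * f1.std(),
                   angle=np.degrees(theta), fill=False, **style)

# Usage sketch: ax.add_patch(vowel_ellipse(f1_vals, f2_vals)); ax.invert_xaxis()
```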

We employed two different statistical tests to assess whether the means of two samples were actually sampled from the same population—the null hypothesis. In our case, the null hypothesis applied to formant measurements from two allophones of the same phoneme would mean that the allophones are not significantly distinct in formant space.

Using the tests we selected, we take probabilities of less than .05 to indicate that the null hypothesis is rejected, and that the allophones are distinct in formant space.

In order to examine both F1 and F2 simultaneously, we required a multivariate statistic.

We employed a two-dimensional Kolmogorov-Smirnov (K-S) test (Press et al. 1992) to determine whether two samples described by two variables are significantly different.

The K-S test returns a value D, whose probability determines the acceptance of the null hypothesis. We also ran a univariate statistic, the t-test (using Matlab), on each formant in each condition to see which formant(s) contributed to differences between allophones.
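The two-dimensional K-S test of Press et al. (1992) is not reproduced here, but the per-formant t-test step can be sketched as follows (using SciPy rather than Matlab; the function and variable names are ours):

```python
from scipy import stats

def formant_ttests(f1_a, f2_a, f1_b, f2_b, alpha=0.05):
    """Two-sample t-tests on F1 and F2 midpoints for two sets of vowel tokens.
    A p-value below alpha is taken to mean the two token sets are
    significantly distinct on that formant (the criterion used in the text)."""
    p_f1 = stats.ttest_ind(f1_a, f1_b).pvalue
    p_f2 = stats.ttest_ind(f2_a, f2_b).pvalue
    return {"F1": (p_f1, p_f1 < alpha), "F2": (p_f2, p_f2 < alpha)}
```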

In terms of this experiment, rejection of the null hypothesis on either formant for a pair of vowels indicates that the vowels are not sampled from the same population. Of course, this does not mean that they could not belong to the same phoneme; as is well known, some allophones of the same phoneme are acoustically very different, such as [R], [?] and [t] as allophones of /t/. It merely indicates whether they are a significantly distinct subgroup of that phoneme. If an allophone is distinct from another allophone of the same phoneme in the original speech, the acoustic neural network's goal can be seen as preserving such a distinction in its output. The purpose of this experiment is to see to what extent the phone category label of the specific allophone has an effect on this result.

As another method of assessing the differences between the normal and collapsed conditions with respect to the original speech, we chose to examine the Euclidean distance between F1 and F2 values in the different conditions. Euclidean distance, as shown in Equation 4-2, is a method for assessing the difference between two vectors, u and x, whose elements are indexed by j.

Equation 4-2: Euclidean distance

d = ( Σ_j |u_j − x_j|^2 )^(1/2)

In our case, the values of F1 and F2 midpoints for a given vowel in a given condition constitute such a vector. For each set of F1 and F2 values for a given vowel, we will compare the Euclidean distance between the original condition and both the normal and collapsed conditions. We assume that larger Euclidean distances in formant space will be associated with increased distortion from the original speech. Note that Euclidean distances can only be properly compared when the number of elements in a vector is the same, and the quantities are comparable. This is the case when comparing the difference between the collapsed and normal conditions with respect to the original speech for each formant separately.
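Equation 4-2 amounts to the following short function (a sketch; variable names are ours):

```python
import numpy as np

def condition_distance(original, resynthesized):
    """Euclidean distance (Equation 4-2) between two equal-length vectors of
    formant midpoints, e.g. the F1 values of every token of a vowel in the
    original speech versus the same tokens in the normal or collapsed
    condition."""
    u, x = np.asarray(original, float), np.asarray(resynthesized, float)
    return float(np.sqrt(np.sum(np.abs(u - x) ** 2)))
```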

4.2.1. Experiment on [“]/[Þ]

First we will report on the experiment considering [“]/[Þ] allophony. All plots and numerical results are based on training the acoustic neural network for 35 epochs. Table 4-1 shows the results of the hypothesis tests for this experiment in each condition. As can be seen from the significant values for the K-S statistic in each condition, we can reject the null hypothesis that the formant values for [“] and [Þ] come from the same distribution.

Therefore, we can say that [“] and [Þ] are distinct in the original speech, and that distinction is preserved by the “normal” acoustic neural network with allophonic labels, as well as the “collapsed” acoustic neural network with the single phonemic label for both allophones. It is interesting to note that the significance of the distinction goes down from the original to the normal to the collapsed condition, indicating that the distinction weakens as one proceeds through the conditions.

As can be seen from the t-test results in Table 4-1, F2 is significantly distinct between the two allophones in all three conditions. In contrast, F1 is significantly distinct between the two allophones in all but the collapsed condition. This result indicates that the F2 (front-back) difference is a more robust indicator of the difference between [“] and [Þ] than F1.

Not unexpectedly, the acoustic neural network, whether in the normal or collapsed condition, tended to blur the less robust distinction in the F1 (high-low) domain.

Table 4-1: Hypothesis tests for [“]/[Þ] experiment

condition   K-S        F1 t       F2 t
original    9.49E-20   1.84E-12   0
normal      1.61E-15   4.58E-06   0
collapse    6.10E-07   0.09       2.21E-12

Note: The only insignificant result (p > .05) is the F1 t value in the collapse condition.

Table 4-2 shows the Euclidean distances between each of the formants in the normal and collapsed condition from the original condition. As might be expected, the collapsed condition was more distant from the original condition than the normal condition for the F2 of [Þ] and both F1 and F2 of [“].

Table 4-2: Euclidean distances between original speech and normal and collapsed conditions

vowel   formant   normal   collapse
Þ       F1        727      699
        F2        1524     1633
“       F1        947      1097
        F2        1287     1417

Figure 4-1 shows the F1/F2 distributions of [“] and [Þ]. Note that the legend on this and subsequent charts uses the TIMIT symbols ([“] = ‘ax’ and [Þ] = ‘ix’). As can be seen in the figure, the distributions of the two vowels overlap, but [Þ] is fronter and somewhat higher than [“].

[Scatter plot: F1 (200-900 Hz) against F2 (2400-600 Hz) for 'ix' and 'ax' tokens in the original speech.]

Figure 4-1: Formant distribution of [“] and [Þ] in original speech

Note: Legend uses TIMIT labels.

Figure 4-2 shows the distribution of [“] and [Þ] in the normal neural network with allophonic labels for these phones preserved. In this case there appears to be more overlap between [“] and [Þ] than was seen in the original speech.

[Scatter plot: F1 (200-900 Hz) against F2 (2400-600 Hz) for 'ix' and 'ax' tokens in the normal neural network condition.]

Figure 4-2: Formant distribution of [“] and [Þ] in normal neural network with allophonic labels

Note: Legend uses TIMIT labels.

Figure 4-3 shows the distribution in formant space of [“] and [Þ] after training with an acoustic neural network where tokens of both allophones were represented by a single phonemic label, /“/, and features for /“/. The clouds of formant values overlap to a large degree; however, there still appear to be two distinct clumps, as borne out by the K-S statistic.

[Scatter plot: F1 (200-900 Hz) against F2 (2400-600 Hz) for 'ix' and 'ax' tokens in the collapsed neural network condition.]

Figure 4-3: Formant distribution of [“] and [Þ] in collapsed neural network with single phonemic label used in training for both

Note: Legend uses TIMIT labels.

4.2.2. Experiment on [u]/[”]

Labov (1991, 1994) has noted fronting of /u/ in southern, northern and western American dialects as well as British dialects. Labov (1991, 18-19) notes that there are two norms for /u/, one back and one front, in the speech of two young Chicago women, Debbie Wolf and Betty S. For a third young Chicago woman, Jackie H. with two /u/ norms, Labov (1994, 193) provides both iw and uw word class labels on a chart of her vowel space. From his figure 6.16 (Labov 1994, 193), it is apparent that the word class distinction between iw and uw is not diagnostic of whether /u/ tokens are fronted or not. Although he does not provide information on the conditioning of the bimodal distribution of /u/ for these Chicago speakers, Labov (1991, 26) observes that in Philadelphia and London, /u/ nuclei before liquids retain their back position, as in fool. Several speakers in London, Norwich, North Carolina and Atlanta described by Labov, Yaeger and Steiner (1972) appear to have bimodal /u/ distributions. However, Labov (1991, 34) notes that fronting of /u/ in Chicago is a relatively new phenomenon:

Recordings made in Chicago in the 1970s show no fronting at all of /uw/, but in 1988, 1989, and 1990, interviews carried out by the Project on Cross Dialectal Comprehension in Chicago showed uniform16 fronting of /uw/ to upper central position from all younger female speakers.

It is clear that our speaker, a Chicago male a few years older than the oldest Chicago female found to feature /u/ fronting, is fully participating in this process.

Table 4-1 summarizes the results for hypothesis testing in the [u]/[”] experiment. As with [“]/[Þ], the K-S statistic is significant in every condition, indicating that the null hypothesis that [u] and [”] come from the same distribution can be rejected. As with [“]/[Þ], the level of significance decreases from the original condition through the normal and collapsed conditions, indicating that the neural network blurs the distinction between the two allophones to some extent, but the elimination of the distinguishing allophonic labels appears to cause even more blurring.

16 It is not clear what is meant by "uniform", since Labov (1991) notes that /u/ is bimodally distributed among speakers who feature fronting of /u/.

When the formants are individuated, we see that F2 (front-back) is significantly distinct, as with [“]/[Þ]. F1 is not significant for the original or collapsed condition, but is significant in the normal condition. It is not clear what this indicates, but clearly F1 is not a reliable discriminator between [u] and [”].

Table 4-1: Hypothesis tests for [u]/[”] experiment

condition   K-S        F1 t       F2 t
original    8.40E-06   0.54       1.81E-10
normal      3.53E-05   1.31E-04   5.20E-07
collapse    0.04       0.68       0.0045

Note: The insignificant results (p > .05) are the F1 t values in the original and collapse conditions.

Table 4-2 gives the Euclidean distances between each of the formants in the normal and collapsed condition from the original condition. Interestingly, only the F2 of [u] is more distant from the original in the collapsed condition than in the normal condition. We might have expected the collapsed condition to always be more distant from the original speech than the normal condition.

Table 4-2: Euclidean distances between original speech and normal and collapsed conditions

vowel   formant   normal   collapse
”       F1        207      197
        F2        989      739
u       F1        161      48
        F2        847      1334

Figure 4-1 shows the F1/F2 distributions of [u] and [”]. Note that the legends for this and the subsequent two figures use the TIMIT labels ‘uw’ for [u], and ‘ux’ for [”]. In this figure, the clouds of the two allophones appear not to overlap at all, although they are contiguous.

[Scatter plot: F1 (200-900 Hz) against F2 (2400-600 Hz) for 'uw' and 'ux' tokens in the original speech.]

Figure 4-1: Formant distribution of [u] and [”] in original speech

Note: Legend uses TIMIT labels.

Figure 4-2 shows the F1/F2 distribution for [u] and [”] after they were trained with separate allophonic labels. The clouds for each allophone now appear to overlap slightly.

[Scatter plot: F1 (200-900 Hz) against F2 (2400-600 Hz) for 'uw' and 'ux' tokens in the normal neural network condition.]

Figure 4-2: Formant distribution of [u] and [”] in normal neural network with allophonic labels

Note: Legend uses TIMIT labels.

Figure 4-3 shows the F1/F2 distribution for [u] and [”] in the collapsed condition, that is, when both were trained with a single phonemic /u/ label and features. The [”] cloud appears to be almost wholly within the [u] cloud. Nevertheless, the two clouds have internal cohesion and there does not appear to be a random mix.

[Scatter plot: F1 (200-900 Hz) against F2 (2400-600 Hz) for 'uw' and 'ux' tokens in the collapsed neural network condition.]

Figure 4-3: Formant distribution of [u] and [”] in collapsed neural network with single phonemic label used in training for both

Note: Legend uses TIMIT labels.

4.2.3. Experiment on /A/ and /‚/

With the initial intention of providing a control for the previous two experiments, we performed an analogous experiment using the phonemes /A/ and /‚/. In the collapsed case, we collapsed both to /A/. Due to the phonemic status of the two collapsed entities, and the fact that they would have presumably overlapping distributions, we hypothesized that the collapsed neural network would be unable to preserve a distinction between them based on context alone, as in the cases of schwa and /u/ allophones.

As shown in Table 4-1, the statistical results from the formant values of /A/ and /‚/ actually resemble those of the allophonic cases examined previously. As might be expected, F1 and F2 taken together with the K-S statistic or separately with the t-test are significant for the original and normal conditions. Contrary to expectations, F1 and F2 taken together and F2 taken alone were still significantly distinct in the collapsed condition. We had thought that since /A/ and /‚/ are phonemes, they would occur in similar environments and it would be impossible for the collapsed neural network to establish a distinction based on environment alone, as it did in the allophonic cases examined above. We will discuss the extent to which the environments in which /A/ and /‚/ occur actually overlap below. Before that, we will examine the formant displays for the /A/ and /‚/ experiments.

Table 4-1: Hypothesis tests for /A/ and /‚/ experiment

condition   K-S        F1 t       F2 t
original    3.99E-16   2.23E-07   0
normal      4.97E-12   6.42E-04   0
collapse    0.002      0.1077     2.44E-05

Note: The only insignificant result (p > .05) is the F1 t value in the collapse condition.

Table 4-2 gives the Euclidean distances between each of the formants in the normal and collapsed condition from the original condition. As expected, the distance between the original and collapsed condition is greater than that between the normal and original condition for F1 and F2 of both vowels.

Table 4-2: Euclidean distances between original speech and normal and collapsed conditions

vowel   formant   normal   collapse
A       F1        711      966
        F2        774      874
‚       F1        472      476
        F2        492      886

Figure 4-1 shows the distribution in formant space of /A/ and /‚/. /‚/ tends to be higher and backer than /A/, though there is some overlap. It is interesting to note a strikingly similar overlap in the ellipses for /A/ and /‚/ drawn by Labov (1991, 19, Figure 1.5) for Betty S., a Chicago woman about ten years younger than our speaker. The closeness of particular phonemes in formant space recalls the concept of near-mergers (e.g. Labov, Karan and Miller 1991), in which a possible asymmetry has been shown to exist between production and perception with respect to phonemic contrasts that are rather small acoustically. Conducting a commutation test (Labov 1994, 356) for /A/ and /‚/ for the database speaker in order to assess whether /A/ and /‚/ are in a near-merger situation for him remains for future work.

[Scatter plot: F1 (200-900 Hz) against F2 (2400-600 Hz) for 'aa' and 'ao' tokens in the original speech.]

Figure 4-1: Formant distribution of /A/ and /‚/ in original speech

Note: Legend uses TIMIT symbols.

As seen in Figure 4-2, the normal neural network slightly enlarges the overlap area between /A/ and /‚/; however, the two phonemes remain largely distinct.

[Scatter plot: F1 (200-900 Hz) against F2 (2400-600 Hz) for 'aa' and 'ao' tokens in the normal neural network condition.]

Figure 4-2: Formant distribution of /A/ and /‚/ in normal neural network

Note: Legend uses TIMIT symbols.

In Figure 4-3, we see that the collapsing of the phone labels of /A/ and /‚/ to /A/ has resulted in substantial overlap between the two phonemes in formant space. In fact, /‚/ appears to be contained almost completely within the space for /A/. This recalls Figure 4-3 above, in which [”] was seen to be contained in [u]. Despite the visual overlap, the formant midpoints of /A/ and /‚/ were seen to be statistically distinct, both when F1 and F2 were taken together and when F2 was taken alone, as shown in Table 4-1. In addition, the shape and orientation of the /A/ and /‚/ ellipses appear to be maintained between the normal condition in Figure 4-2 and the collapsed condition in Figure 4-3.

[Scatter plot: F1 (200-900 Hz) against F2 (2400-600 Hz) for 'aa' and 'ao' tokens in the collapsed neural network condition.]

Figure 4-3: Formant distribution of /A/ and /‚/ in collapsed neural network

Note: Legend uses TIMIT symbols.

How could the collapsed neural network have preserved a distinction between /A/ and /‚/ unless it could appeal to distributional facts? We examined the distribution of /A/ and /‚/ in the entire database in an effort to see to what extent environments for each phoneme were overlapping or complementary. Because the following environment has been shown to explain various mergers and splits, such as the British broad a, involving [A] and [Q], and the Middle Atlantic short a, involving [Q] and [Q3]17 (Labov 1994, Chapter 11), we chose to examine it with respect to the /A/ and /‚/ distribution in our speaker.

Figure 4-4 shows the number of occurrences of /A/ and /‚/ before various phones as they were labeled in our corpus,18 not crossing word boundaries. /r/ was by far the most common following environment, but because it was approximately equally represented following /A/ and /‚/, it was eliminated from the figure. As can be seen, certain environments, such as /m/ and /p/, are exclusively preceded by /A/. One environment, a following /d/, was exclusively preceded by /‚/, but this may well be a result of a paucity of relevant data.

Nevertheless, several patterns are clear: preceding /l/, /N/, /g/, and the voiceless fricatives /s/, /f/ and /T/, /‚/ is much more common than /A/; while preceding /n/, /k/ and /b/, /A/ is much more common than /‚/. In a discussion of short o in American English, Thomas (1958, 119-120) notes that /‚/ is the usual pronunciation before voiceless fricatives and nasals, while /A/ is usual before the stops /p/, /b/, /t/ and /d/. Bronstein (1960, 164) notes that with orthographic o followed by "velar /k/, /g/ or /N/, /A/ predominates in Eastern New England and New York City...while /‚/ predominates elsewhere". Given that the environments for /A/ and /‚/ are somewhat skewed, though apparently not in complementary distribution, perhaps it is less surprising that the collapsed neural network was able to do as well as it did in distinguishing these two phonemes.

17 The tensed raised version of [Q] can be realized in several different ways; this is just one example.

[Bar chart: number of occurrences (0-100) of 'aa' and 'ao' tokens by following phone (TIMIT labels on the x axis).]

Figure 4-4: Phones following /A/ and /‚/

Note: Figure uses TIMIT labels.

18 We did simplify various varieties of stops for the purposes of this figure, for example, conflating released and unreleased versions.

4.2.4. Experiment on /o/ and /i/

As mentioned earlier, the original intention of the /A/ and /‚/ experiment was to demonstrate how collapsing two phonemes would result in chaos in the collapsed condition, due to the fact that phonemes are expected to have overlapping distributions.

As we saw, the fact that the distributions of /A/ and /‚/ in our speaker's dialect are somewhat skewed, and that /A/ and /‚/ are adjacent in acoustic space, meant that the collapsed condition produced results similar to those encountered with "true" allophones such as [u]/[”] and [“]/[Þ].

In order to find a better control experiment, we chose to examine /o/ and /i/ in the original, normal and collapsed conditions, with both vowels labeled /o/ in the collapsed condition. As can be seen in Figure 4-1, in the original speech, /i/ is high and front, and /o/ is low and back, and there is absolutely no overlap between them. If the labels for these two phones were collapsed as /o/, for example, the neural network would be presented with acoustically very different vowels with the same label, even more so than in the previous examples. In addition, in the collapsed case, contextual clues that a vowel is underlyingly an /o/ or an /i/ would be minimal, since both vowels are found in identical contexts throughout the vocabulary.

[Scatter plot: F1 (200-900 Hz) against F2 (2400-600 Hz) for 'iy' and 'ow' tokens in the original speech.]

Figure 4-1: Formant distribution of /o/ and /i/ in original speech

Note: Legend uses TIMIT symbols.

The hypothesis tests in Table 4-1 indicate that whether taken together or separately, the F1 and F2 values of /o/ and /i/ are significantly distinct in both the original and normal conditions. Perhaps somewhat unexpectedly, the collapsed results are also significantly distinct. This result may be due to the fact that there were almost twice as many /i/ tokens as /o/ tokens, resulting in some skewing of the collapsed data.

Table 4-1: Hypothesis tests for experiment on /o/ and /i/

condition   K-S        F1 t       F2 t
original    4.74E-18   0          0
normal      4.50E-18   0          0
collapse    0.0003     1.20E-05   3.20E-05

As can be seen in Table 4-2, the distances between the collapsed and original conditions are much greater than those between the normal and original conditions. It is not surprising that such a great distortion would be introduced by confounding labels of such discrete phonemes.

Table 4-2: Euclidean distances between original speech and normal and collapsed conditions

vowel   formant   normal   collapse
o       F1        271      692
        F2        693      3501
i       F1        278      619
        F2        1510     4703

Figure 4-2 shows that the normal neural network has not greatly altered the relationship between /i/ and /o/ that held in the original speech.

[Scatter plot: F1 (200-900 Hz) against F2 (2400-600 Hz) for 'iy' and 'ow' tokens in the normal neural network condition.]

Figure 4-2: Formant distribution of /o/ and /i/ in normal neural network

Note: Legend uses TIMIT symbols.

As can be seen in Figure 4-3, collapsing /i/ to /o/ results in a display that still contains two distinct clouds, but each cloud contains what appears to be a random mix of underlying /o/ and /i/ tokens. The collapsed neural network saw one label, /o/, associated with two acoustically distinct groups of phones. Interestingly, the network has preserved two acoustically distinct groups in the output, though these groups appear not to correspond to whether the original label was /o/ or /i/. In addition, the /i/-like cloud is somewhat lower and backer than in the normal condition, and the /o/-like cloud is somewhat higher and fronter.

These results indicate that collapsing /i/ and /o/ is disastrous for speech synthesis, in the sense that one could not be sure of getting an /i/- or /o/-like token when it was required. Of course, this experiment was not performed with the hope of instantiating such a collapse in a working system; rather, it was meant to demonstrate, perhaps indirectly, the validity of the other experiments, in which the collapsed results indicated that symbolic mediation was not absolutely required to maintain the distinction between allophones or quasi-allophones, as in the case of /A/ and /‚/.

[Scatter plot: F1 (200-900 Hz) against F2 (2400-600 Hz) for 'iy' and 'ow' tokens in the collapsed neural network condition.]

Figure 4-3: Formant distribution of /o/ and /i/ in neural network with /i/ collapsed to /o/

Note: Legend uses TIMIT symbols.

4.3. Conclusions from acoustic analyses of allophony

Perhaps the most interesting outcome of the acoustic investigations of allophony described here is that the allophone distributions in F1/F2 space remained distinct for [u]/[”], [“]/[Þ] and /A/-/‚/ despite the removal of the individual allophone labels and features in the collapsed condition. This indicates that the allophony in those cases is predictable from context alone. This is of course what phonemic theory predicts. Indeed, in our implementation, there are other factors, such as duration, available to the acoustic neural network. Perhaps the coherent durational properties of the allophones assisted in their being distinguished by the collapsed neural network. The experiment on /o/ and /i/ showed that if truly contrastive phonemes, with overlapping distributions, are collapsed, the acoustic neural network will not be able to preserve the distinction between the two phones.

The Euclidean distance measures generally showed that the collapsed condition was more distant from the original speech than the normal neural network condition. This result indicates that collapsing phone labels often reduces the ability of the neural network to reproduce the original behavior.

The graphical results also show that the acoustic neural network appears to keep the allophones further apart when it is trained with separate allophonic labels in the normal condition. If indeed vowel space dispersion is important to intelligibility, as Bradlow et al. (1996) claim, perhaps it is worth preserving the allophonic labels in the labeled database. It is difficult to envisage subjective evaluations involving discriminating the true allophones from each other, since phonemic theory predicts that the distinction will be imperceptible to the lay listener. However, discrimination tests involving /A/ and /‚/ might prove revealing.

There are reports that marginal distinctions do exist between [u]/[”] and [“]/[Þ] among some speakers; Giegerich (1992, 246), for example, struggles with the phonemic status of schwa, given a distinction among some speakers between the second vowels of purest [“] and purist [Þ]. Wells (1982, 167-168) discusses other pairs such as Lennon/Lenin, and claims that General American occupies an intermediate position between merging the two phones, what he calls the Weak Vowel Merger, and treating them as distinct.

As for [u] and [”], Kenyon and Knott (1953, xliii) describe a possible distinction between the vowels in brewed and brood, and lute and loot, with the former member of each pair having [”], and the latter [u]. Labov (1994, 162) considers the fronted members of each pair to belong to the iw word class,19 which are either reflexes of French u or the Middle English diphthongs exemplified in few and dew. These are contrasted with the uw word class, of which words like boot and do are members. According to Labov (1994, 162), "The phonemic distinctiveness of iw was the result of the loss of the conditioning /j/ glide after apicals, creating a contrast between dew and do, crude and croon, lute and loot. The contrast disappeared in many dialects and remained marginal in others." We will explore the correlation of the iw word class with fronting of /u/ further in section 6.4.

19 As a notational convention, we will place Labov's phoneme classes (1991, 13, derived from Trager and Bloch 1941) in bold.

In future work, we would like to examine whether other non-vocalic allophones, such as the flapped or glottalized variants of /t/, can be learned by the acoustic neural network without an allophonic label. In such cases, perhaps a speech recognizer could be used to assess the degree to which the allophonic target was reached in the different experimental conditions. Zue and Laferriere (1979) discuss some of the gradient acoustic properties of flapping, while Pierrehumbert and Frisch (1997) discuss acoustic correlates of glottalization.

Removing real allophonic labels (such as [u]/[”] and [“]/[Þ], and perhaps /A/ and /‚/) from the labeled corpus, at least those that are not structure preserving (Kiparsky 1985), might simplify the labelers’ task, and move us closer to Barry and Fourcin’s broad-phonetic labeling level. As concrete evidence of the added difficulty of allophonic transcription, we can examine the data in Table 4-1. This table contains results reported by Fulop and Keating (1996) on inter-transcriber reliability on a Switchboard transcription task at UCLA compared with results for English obtained at CSLU reported by Lander et al. (1995). At both locations, inter-transcriber reliability suffered when agreement on diacritics was counted. Fulop and Keating used diacritics to mark particular nondistinctive phenomena including nasalization and aspiration in English.

Table 4-1: Inter-transcriber reliability at two locations

Condition               UCLA    OGI
counting diacritics     77.5%   55%
excluding diacritics    80.1%   67%

Lander et al. (1995, 169) state the importance of a small label inventory to achieving labeling agreement, even above knowledge of the language to be labeled:

Label inventory seems to influence agreement more than knowledge of the language, because although transcribers were familiar with Spanish and English, they agreed more often in Spanish, with its smaller label inventory.

In Chapter 6, we will examine the postlexical neural network’s ability to learn some of the symbolic distinctions analyzed here in acoustic space. We hope to show that the level of accuracy achieved in symbol prediction, coupled with the higher fidelity of the uncollapsed neural networks for phonetic implementation, will provide a strong basis for a text-to-speech system capable of capturing the subtleties of speaker-specific variation.

Chapter 5. Methods for learning segmental postlexical variation

In this chapter, we will describe the mechanisms by which we learned postlexical variation. In Chapter 3 we described the training data for this enterprise; namely, a lexical database of lexical pronunciations and a labeled speech corpus of postlexical pronunciations. First, we will describe how these two data sources were combined to form a postlexical pronunciation database, with each postlexical phone aligned with its corresponding lexical phone. We will then attempt to characterize the learning problem by examining details about lexical-postlexical correspondences. Finally, we will describe the neural network architecture we have designed to learn a mapping between the lexical and postlexical domains.

5.1. Creation of postlexical training materials

In order to learn a mapping between lexical and postlexical phones, we needed to combine the postlexical information from the labeled speech corpus with the lexical information in the lexical database. This information needed to be combined in a way that would indicate the relationship between particular lexical phones and particular postlexical phones. That is, we needed to encode the faithfulness, or input-output alignment, between the two levels of representation. We will first describe how the lexical-postlexical alignment was achieved. We will then describe how the aligned lexical and postlexical information was combined into a postlexical database.

5.1.1. Alignment of lexical and postlexical phones

As was discussed in section 3.2, one of the labeling tiers of the labeled corpus contains the orthography for each word.20 Using the Motorola speech synthesizer's text analyzer (Karaali et al. forthcoming), the disambiguated lexical pronunciation associated with each orthography was obtained from the lexical database. At this stage, we had lexical and postlexical pronunciation strings for each word. Table 5-1 shows examples of such lexical and postlexical pronunciation strings for the word friendly. We call the juxtaposition of these pronunciations "unaligned" because we have made no effort to ensure that corresponding lexical and postlexical phones are associated with one another.

Table 5-1: Unaligned lexical and postlexical phones

level          1   2   3   4   5   6   7
lexical        f   r   E   n   d   l   i
postlexical    f   r   E   n   l   i

Since the learning method we will describe below is based on predicting the appropriate postlexical phone for each lexical phone, it is important that our training data associate postlexical phones with the lexical phone from which they are derived. Although the first four phones in Table 5-1 exhibit the correct lexical-postlexical associations, a mismatch has developed at phone 5, due to the deletion21 of lexical /d/ at the postlexical level. The impact of such a misalignment is great in our approach, since the various postlexical phones that occur opposite particular lexical phones are considered allophones of those lexical phones. So, in the case of position 5 in the unaligned strings in Table 5-1, [l] would be considered an allophone of /d/.

20 In the read speech of the current database, this is usually fairly easy to assign. However, in corpora of spontaneous speech, some deliberation is often required due to the extremely close juncture of certain words and cliticization.

In order to rectify this situation, we developed a method for aligning lexical and postlexical phones so that we might achieve an alignment as shown in Table 5-2. The alignment has inserted a placeholder in the postlexical level at position 5. Thus lexical /d/ can now be considered to have "deletion" as a possible allophone; an improvement over the unaligned situation, and one that is consistent with treatments of t,d deletion in phonology, sociolinguistics and speech technology (e.g. Guy 1980, 1991, Reynolds 1994, Randolph 1989).

Table 5-2: Aligned lexical and postlexical phones

level          1   2   3   4   5   6   7
lexical        f   r   E   n   d   l   i
postlexical    f   r   E   n       l   i

21 In accordance with the symbolic approach to allophony that we are taking in this part of the dissertation, we are simplifying and calling this a deletion. Of course, there is likely to be some remnant of a /d/ gesture (cf. Browman and Goldstein 1990). Sociolinguistic studies of t,d deletion (e.g. Guy 1980, 1991) make this same simplification.

Since manual alignment is a long and tedious process, several proposals have been advanced to perform this task automatically. Many of these attempt to align orthographies with their corresponding phonetic or phonological transcriptions. In our view, this is a more challenging problem than our current one of aligning lexical and postlexical pronunciations. For example, the orthography-phonetics alignment problem features many-to-one mappings in both directions: orthographic 'ough' can map to phonetic [‚] as in bought, and phonetic [ks] can map to orthographic 'x' as in tax. To avoid this complexity, some researchers have chosen to simplify this problem and impose a one-to-one mapping between orthography and phonetics (Sejnowski and Rosenberg 1987, Laporte 1997), resulting in a potential loss of explanatory value of their proposals.

In contrast, we feel that a one-to-one alignment between lexical and postlexical phones is less problematic, given the more straightforward mapping and taking into account certain modifications to the aligned phone set to be described below.

There have been many different methods proposed for the alignment of different levels of representation as well as tokens from different languages. Dedina and Nusbaum (1991) describe a simple approach that parses spellings and pronunciations into separate groups of consonants and vowels, and then maps consonant spelling groups to consonant phoneme groups, and vowel spelling groups to vowel phoneme groups. In the domain of historical linguistics, Covington (1996) similarly found that a simple consonant/vowel distinction was sufficient for aligning words from different languages presumed to be cognates. Knight and Graehl (1997, 131-132) use the expectation-maximization (EM) algorithm (Baum 1972) to align English sounds with their transliterated Japanese equivalents. Lawrence and Kaye (1986) describe a table-based approach to aligning orthographies and their pronunciations.22

The approach we have chosen to follow employs dynamic programming (Kruskal 1983).

Dynamic programming is a general method used for comparing sequences. It attempts to find the least costly path between two sequences, employing the operations of insertion, deletion and substitution. Each of these operations entails a particular cost. As will be seen below, the attribution of these costs is often application-specific. Table 5-3 presents one of Kruskal’s examples, a comparison of the orthographies industry and interest, including placeholders inserted in the alignment process. Table 5-4 shows the operations required to transform industry to interest.

22 Van Coile (1991) uses Hidden Markov Models, Lucas and Damper (1992) allow a neural network to come up with its own alignment, and Luk and Damper (1991, 1992, 1993) employ image-processing techniques and dynamic time warping to infer letter-phoneme correspondences.

Table 5-3: Sequence comparison of two orthographies

i   n   d   u   s   t       r       y
i   n               t   e   r   e   s   t

Table 5-4: Operations required to transform industry to interest

delete d
delete u
delete s
insert e
insert e
substitute y by s
insert t
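A minimal sketch of such a dynamic-programming alignment is given below (Python; this is an illustration of the general method, not the code actually used, and the names are ours). The `sub_cost` dictionary plays the role of the substitution cost table discussed later, and the insertion/deletion and default substitution costs follow the values reported in the text:

```python
def align(source, target, sub_cost, ins_del=7, default_sub=10):
    """Align a source (e.g. lexical) and target (e.g. postlexical) phone
    sequence by finding the least costly path of substitutions, insertions
    and deletions. Placeholders are written as '-'."""
    def cost(s, t):
        return sub_cost.get((s, t), 0 if s == t else default_sub)

    n, m = len(source), len(target)
    # dp[i][j] = cheapest cost of aligning source[:i] with target[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * ins_del
    for j in range(1, m + 1):
        dp[0][j] = j * ins_del
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i-1][j-1] + cost(source[i-1], target[j-1]),
                           dp[i-1][j] + ins_del,    # deletion from source
                           dp[i][j-1] + ins_del)    # insertion into target

    # Trace back to recover the aligned strings.
    a_src, a_tgt, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i and j and dp[i][j] == dp[i-1][j-1] + cost(source[i-1], target[j-1]):
            a_src.append(source[i-1]); a_tgt.append(target[j-1]); i, j = i - 1, j - 1
        elif i and dp[i][j] == dp[i-1][j] + ins_del:
            a_src.append(source[i-1]); a_tgt.append('-'); i -= 1
        else:
            a_src.append('-'); a_tgt.append(target[j-1]); j -= 1
    return a_src[::-1], a_tgt[::-1]

# friendly: lexical /d/ deletes postlexically, yielding a placeholder.
print(align(list("frEndli"), list("frEnli"), sub_cost={}))
```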

When the sequences to be aligned use the same alphabet, as in the orthographies in Table 5-3, only two costs are usually provided for substitution: a standard cost in the case where the sequence symbols are distinct, and no cost where they are identical. However, this policy prevents certain symbols from being considered closer to one another than other symbols. In the domain of aligning phonetic transcriptions, this would mean that perceptually small differences, such as [“] and [I], would be considered just as erroneous as large differences such as [b] and [k]. Gildea and Jurafsky (1996), Riley and Ljolje (1996), and Nerbonne and Heeringa (1997) all discuss cost functions that take the features of the phones to be aligned into account. The fewer differences in features between two symbols, the lower the cost of their substitution.

In the case of the alignment of lexical and postlexical phones, two distinct alphabets are being used. The cost function must take this into account. Before describing the nature of the cost function used here, we will discuss some modifications to the postlexical alphabet that were made to ensure a useful alignment. To facilitate the discussion, we will refer to the input alphabet, in our case lexical, as the source alphabet. The output alphabet, in our case postlexical, is the target alphabet.

Recall that the postlexical alphabet encodes both closure and release information for stops with separate symbols. Table 5-5 shows hypothetical lexical and postlexical forms for top. In order to achieve an appropriate alignment in this case, a placeholder would be required on the lexical level at position 1 or 2, thus shifting the other lexical phones to the right. At first glance, this may appear to be similar to the solution proposed in Table 5-2; however, it is unacceptable. The problem is related to the task at hand. We are seeking to predict postlexical phones given lexical phones, that is, we are predicting the target given the source. If some lexical phones, such as /d/, sometimes surface as null, indicating they were deleted, that is fine. However, there would be no principled way to predict when to have placeholders in lexical strings, or any source strings for that matter.

Table 5-5: Unaligned lexical and postlexical phones without pseudophones

Level          1    2    3    4
Lexical        t    A    p
Postlexical    tc   tH   A    pc

To alleviate this problem, certain pairs of postlexical phones were collapsed as one "pseudophone". This strategy was adapted from Sejnowski and Rosenberg (1987), who employed it in the domain of letter-to-sound conversion. They collapsed certain phones, which in their case were the target alphabet, like /k/ and /s/ for alignment with the letter x, which was part of their source alphabet. Riley and Ljolje (1996, 296) also handled this problem in a similar way for phonemic-phonetic mapping. Table 5-6 shows the postlexical phones that were collapsed as single pseudophones in our experiments. In general, they are stop closures and releases, and glottal stops and vowels. Use of the pseudophones permits straightforward alignments such as the one shown in Table 5-7.

Table 5-6: Postlexical pseudophones

? A,  ? Q,  ? «,  ? ‚,  ? aU,  ? “,  ? “½,  ? Þ,  ? i,  ? I,  ? “8,  ? aI,  ? E,  ? l`,  ? m`,  ? ¶½,  ? eI,  ? o,  ? ‚I,  ? U,  ? u,  ? ”,  ? l,  ? m,  ? j,  ? r,  ? w,  ? tS,  bc b,  dc d,  gc g,  pc pH,  pc p,  tc tH,  tc t,  kc kH,  kc k,  dZc dZ,  tSc tS

Table 5-7: Aligned lexical and postlexical phones with pseudophones

Level          1       2    3
Lexical        t       A    p
Postlexical    tc tH   A    pc

Outside of cases where the target postlexical string requires more symbols than the source lexical string due to the transcription conventions, source placeholders (insertions) are not generally problematic. Since most postlexical phenomena are reductions, the source lexical string often has more phones than the postlexical target string. However, due to the reading style of the postlexical database, as well as possibly due to idiosyncrasies of individual speakers, postlexical pronunciations (at least as defined here) can occasionally have more phones than the pronunciations in the lexical database.

Table 5-8 provides an example of this phenomenon. The speaker pronounced exhume with a glide following the /s/. To solve this problem, the lexical database could have been modified to contain the pronunciation with a glide, since the more common form (in American English) without the glide could be derived from it. This solution was rejected due to the rarity of such a pronunciation in American English (Wells 1982). Alternatively, a pseudophone encompassing postlexical /sj/ could have been introduced. This was rejected, since it would be harmful to a lexical-postlexical alignment of words like misuse, where lexical /s/ should align with postlexical [s] and lexical /j/ should align with postlexical [j]. Due to these problems, words like exhume pronounced as in Table 5-8 were eliminated from training.

Table 5-8: Example of source insertion

lexical        E   k   s       u   m
postlexical    E   k   s   j   u   m

Table 5-9 shows the cost table used for lexical-postlexical substitutions in our experiments. In accordance with Young et al. (1995), we used a cost of 7 for insertions and deletions, and 10 for substitutions not otherwise covered in Table 5-9. The table was developed by iteratively listing what were thought to be probable lexical-postlexical correspondences, and then examining the resulting alignments. While the automaticity of a feature-based alignment (e.g. Riley and Ljolje 1996, Gildea and Jurafsky 1996) might be desirable, the flexibility permitted by the "freer" approach described here was preferred. In addition, it is by no means clear that postlexical reflexes of a given lexical phone are necessarily close in feature space: consider the allophones of /t/, including flap and glottal stop.

Table 5-9: Cost table for lexical-postlexical alignment

lexical postlexical cost lexical postlexical cost lexical postlexical cost AA 0 aU aU 0 “I 0 A? A 0 aU ? aU 0 “? I 0 QQ 0 ““ 0 ““8 0 Q? Q 0 “? “ 0 “? “8 0 ‚‚ 0 “Þ 0 aI aI 0 ‚? ‚ 0 “? Þ 0 aI ? aI 0 bb 0 tS tS 0 DD 0 bbc b 0 tS tSc 0 dd 0 bbc 0 tS tSc tS 0 ddc d 0 EE 0 eI eI 0 ddc 0 E? E 0 eI ? eI 0 dR 0 ff 0 gg 0 II 0 hh 0 ggc g 0 I? I 0 hâ 0 ggc 0 IÞ 0 ii 0 dZ dZ 0 I? I 0 i? i 0 dZ dZc dZ 0 I“ 0 kkH 0 dZ dZc 0 I? “ 0 kkc kH0 ll 0 mm 0 kkc k 0 ll` 0 mm` 0 kk 0 NN 0 nn 0 kkc 0 ‚I ‚I 0 nRÊ 0 oo 0 ‚I ? ‚I 0 nn` 0 o? o 0 rr 0 SS 0 pp 0 r“½ 0 ss 0 ppH 0 r¶½ 0 TT 0 ppc pH0 r? “½ 0 tt 0 ppc p 0 UU 0 ttH 0 pp 0 U? U 0 ttc tH 0 uu 0 vv 0 ttc t 0 u? u 0 ww 0 ttc 0 u” 0 jj 0 t? 0 u? ” 0 ZZ 0 tR 0 zz 0 ““½ 0 nN 1

There are certain cases of artifactual postlexical deletion that are a product of some of the lexical database transcription simplifications discussed in section 3.1. For example, syllabic liquids and nasals are expressed with two symbols in the lexical database in order to reduce the size of the lexical phone inventory and consequently to increase the consistency of lexical pronunciations. Given our alignment procedure, the initial lexical schwas in such transcriptions are often "deleted" at the postlexical level, while the lexical liquids and nasals are associated with syllabic liquids and nasals at the postlexical level.

Sproat and Riley's (1996) description of a decision tree for determining the postlexical realization of lexical /A/ in the multispeaker TIMIT database points out an additional issue encountered in aligning lexical and postlexical pronunciations. One of the postlexical reflexes is found to be /‚/. Now, /A/-/‚/ alternation does not appear to be a lexical-postlexical issue. It seems more likely that the TIMIT speakers differ on whether they store the initial vowel of a word like sausage23 in the lexicon as /A/ or /‚/. Nevertheless, this represents the kind of dialect smoothing that our postlexical network performs, as described above in section 2.2, in addition to general postlexical processes.

5.1.2. Creation of postlexical training database

We created a relational database in Microsoft Access to contain the aligned lexical and postlexical information from the labeled corpus and lexical database. The database structure was inspired by Keating et al. (1994), who used a similar database to analyze the non-speech information contained in the TIMIT database. Our database differed from that of Keating et al. in that we had no need to supply a table containing names of speakers, since our procedure only analyzes data from one speaker at a time. An innovation with respect to Keating et al. was the inclusion of lexical phone information in addition to postlexical labeled phones, allowing for queries directly probing the success of learning postlexical variation.

Table 5-1 displays the fields in the main table of the database. These include fields for both original postlexical phones and neural network hypotheses of postlexical phones.

The generation of such hypotheses will be discussed in the following sections. Field names followed by a question mark (?) indicate fields containing binary “yes/no” values.

The previous phone ID and next phone ID fields allow for querying all of the field information of the next and previous phones, including the next and previous original postlexical phones, neural network postlexical phone hypotheses and lexical phones.

23 Sausage (like chocolate) is an example of a word whose initial vowel tends to be /A/ in Philadelphia and /‚/ in New York, neither of which cities is currently undergoing a merger in the low back area.

Table 5-1: Fields in lexical-postlexical database

orthographic word
original postlexical phone
neural network postlexical phone hypothesis
lexical phone
previous phone ID
next phone ID
syllable stress
word prominence
word type
syllable accented?
left syllable boundary?
right syllable boundary?
left word boundary?
right word boundary?
left phrase boundary?
right phrase boundary?
left clause boundary?
right clause boundary?
left sentence boundary?
right sentence boundary?
left intermediate phrase boundary?
right intermediate phrase boundary?
left intonation phrase boundary?
right intonation phrase boundary?
phone ID
file
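The dissertation's database was built in Microsoft Access; purely as an illustration, an analogous table could be declared as follows (SQLite via Python; the column names and types are our abbreviations of the fields above, not the original schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE phones (
    phone_id            INTEGER PRIMARY KEY,
    file                TEXT,
    orthographic_word   TEXT,
    lexical_phone       TEXT,
    postlexical_phone   TEXT,     -- original labeled phone
    nn_hypothesis       TEXT,     -- neural network postlexical hypothesis
    prev_phone_id       INTEGER REFERENCES phones(phone_id),
    next_phone_id       INTEGER REFERENCES phones(phone_id),
    syllable_stress     TEXT,
    word_prominence     TEXT,
    word_type           TEXT,
    syllable_accented   INTEGER,  -- binary yes/no fields stored as 0/1
    left_word_boundary  INTEGER,
    right_word_boundary INTEGER
    -- ... the remaining boundary fields follow the same pattern
)""")

# Example query: how often does lexical /t/ surface as a flap?
# conn.execute("SELECT COUNT(*) FROM phones WHERE lexical_phone='t' AND postlexical_phone='dx'")
```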

5.2. Characterization of learning problem

One way of posing the problem of mapping from lexical pronunciations to postlexical pronunciations is as follows: given a lexical phone, what postlexical phone is most appropriate? In some cases, this choice will be fairly easy due to a lack of allophony occurring with respect to particular lexical phones. For example, Fulop and Keating (1996) found that stridents did not undergo much alteration between the lexical and postlexical levels. In their study of transcription in the Switchboard corpus, Fulop and Keating examined each lexical phone and counted the number of times the lexical symbol appeared in the postlexical transcription for each phone. While this metric provides some idea of where the variability lies, we feel that it is not the best measure to use. One problem is that certain lexical symbols, such as /t/, have a different meaning when used on the lexical, as opposed to the postlexical level. For example, on the lexical level it combines both the closure and release aspects of /t/, whereas on the postlexical level it indicates an unaspirated t release.

The metric that we propose for examining the variability among the different phones with respect to lexical/postlexical mapping is entropy. Entropy can be seen as a measure of uncertainty. In this case, given a lexical phone, how certain can we be of its postlexical output? Entropy is an information-theoretic measure that describes the lower bound on the number of bits needed to encode a message. In our case, we will look at each lexical phone as a message whose content could be any one of its postlexical reflexes.

Borrowing from the discussion of entropy in Charniak (1993, 29), a message can be thought of as a random variable W that can take on one of several values V(W) and has a probability distribution P. In our case, W is a lexical phone, and the values V(W) are the postlexical reflexes. The probability distribution will be the probability of each reflex occurring on the postlexical level for the lexical phone in our testing database. Equation 5-1 contains the formula for calculating entropy, H, for each lexical phone W.

Equation 5-1: Entropy

H(W) = − Σ_{w ∈ V(W)} P(w) log2 P(w)

Table 5-1 lists each lexical phone with the number of each of the postlexical reflexes with which it was aligned in a testing subset of our lexical/postlexical database. These values were supplied to a program that calculated the entropy of each lexical phone, the results of which are shown in Figure 5-1. This can be compared to Figure 5-2, which shows the number of postlexical reflexes of each lexical phone. As can be seen in these figures, entropy does not correlate entirely with number of postlexical variants. It provides a measure that takes into consideration the distribution of the variants as well.
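A sketch of such an entropy computation (ours, not the original program; the illustrative counts are hypothetical) is:

```python
import math
from collections import Counter

def reflex_entropy(reflexes):
    """Entropy (Equation 5-1) of one lexical phone, given the list of
    postlexical reflexes observed opposite it in the aligned testing data."""
    counts = Counter(reflexes)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A phone that almost never varies has near-zero entropy; one whose reflexes
# are spread over several variants has higher entropy.
print(reflex_entropy(["f"] * 549 + ["other"]))                   # ~0.02 bits
print(reflex_entropy(["a"] * 40 + ["b"] * 35 + ["c"] * 25))      # ~1.56 bits
```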

Table 5-1: Postlexical reflexes of each lexical phone

AQ“i eI A 405 Q 532 del. 622 i 642 eI 521 ? A 17 ? Q 33 Þ 453 “ 387 E 59 del. 7 Þ 67 « 354 Þ 308 Þ 46 Þ 4 E 9 “ 337 I 51 “ 38 « 4 “8 8 I 79 ? i 9 ? eI 5 “ 3 “ 6 ? « 22 « 4 i 4 ‚ 3 del. 3 i 20 E 2 ? E 3 aU 1 ? Þ 4 ? “ 13 del. 1 ? “ 2 Q 1 ? E 1 “8 6 ? I 1 « 2 A 1 E 5 aI 1 del. 1 “½ 5 eI 1 ? I 3 ? Þ 3 r 3 A 2 U 1 aI 1 eI 1 o 1 dc d 1 ” 1

155 I t odk I 487 tc tH 566 o 314 dc d 504 kc kH 557 Þ 187 tc 273 ‚ 157 dc 235 kH 187 “ 43 tc t 214 del. 29 d 168 kc k 180 ? I 30 ? 170 ? o 12 del. 137 kc 131 del. 24 tH 134 ? ‚ 6 R 73 k 2 ? Þ 5 R 103 “ 1 tc 2 del. 1 “8 2 del. 51 A 1 tc tH 1 gc g 1 aI 2 t 7 Þ 1 « 1 gc 1 E 1 dc 1 aU 1 ? E 1 dc d 1 uEr gp ” 214 E 411 r 1243 gc g 165 pc pH 386 u 119 ? E 23 ¶½ 415 g 112 pc p 86 Þ 55 Þ 17 “½ 164 gc 16 pH 79 “ 9 “ 4 del. 15 del. 7 pc 78 “8 2 i 4 ? “½ 4 r 1 p 2 U 2 ? Þ 1 ? r 1 w 1 sU‚tSdZ s 1228 U 92 ‚ 210 tSc tS 147 dZc d 110 Z del. 8 del. 23 A 23 tS 34 dZ 35 S 1 Þ 10 ? ‚ 9 S 1 Z 7 z 1 “8 2 ? A 1 tc tH 1 T 1 « 1 ” 1 aU 1 lnSDaI l 1013 n 1251 S 183 D 916 aI 417 l` 149 RÊ 86 tS 1 T 4 ? aI 8 del. 4 n` 37 T 1 “ 1 A 2 ? l 2 N 2 tSc tS 1

156 bhmTv bc b 321 h 216 m 566 T 178 v 361 b 259 â 105 m` 2 del. 1 del. 9 bc 16 del. 43 ? m 2 s 1 f 2 z aU f ‚I w z 679 aU 151 f 549 ‚I 48 w 516 s 10 ? aU 5 « 1 ? ‚I 2 ? w 4 del. 5 jNZ j 130 N 221 Z 15 ? j 1

[Chart: entropy in bits (0-3) of each lexical phone; /“/, /t/ and /d/ are highest (roughly 2.5-3), while phones such as /s/, /m/, /w/, /j/, /D/, /f/, /N/ and /Z/ are lowest (near or below 0.5).]

Figure 5-1: Entropy of lexical phones

[Chart: number of postlexical reflexes (0-25) of each lexical phone; /“/ has the most (about 25), while /‚I/, /aU/, /w/, /j/, /f/, /N/ and /Z/ have the fewest.]

Figure 5-2: Number of postlexical reflexes of each lexical phone

5.3. Neural network architecture

For the experiments described below, we will be using Monnet, a highly advanced neural network simulator being developed at Motorola. The simulator is designed to allow for rapid training, as well as ease of experimentation with different topologies and architectures. The network can be trained with decreasing learning rate and learning momentum in a novel mixture of sequential and random training modes (Karaali 1994).

Figure 5-1 illustrates the postlexical neural network, in the block notation used to describe the acoustic neural network in Chapter 4. In the following subsections, we will describe the lexical input information provided to blocks 2-5, which results in an output postlexical phone at block 1. Blocks 6-11 are hidden layers which help to process the input information.

[1] OUT
  117
[11] NN
  20
[10] NN
  10          10          10          10
[6] NN      [7] NN      [8] NN      [9] NN
  46*9=414    53*9=477    4*9=36      14*9=126
[2] IN      [3] IN      [4] IN      [5] IN
(phone labels)  (features)  (stress)  (boundaries)

Figure 5-1: Postlexical neural network

5.4. Data encoding

One of the most important aspects of working with neural networks is using an appropriate coding scheme for the input and output representations. This is an area where domain knowledge, in this case phonological and linguistic, is critical. In fact, according to Cohen (1997), “it is better to focus on issues such as representations and prior knowledge rather than the learning dynamics, since there are no general purpose learning algorithms which are good for an arbitrary prior.”

With regard, for example, to the encoding of syntactic and prosodic information, we draw on various proposals, such as Nespor and Vogel (1986), Pierrehumbert and Frisch (1997), and Dilley et al. (1996), that indicate that prosodic information has an important effect on segmental postlexical rules. However, we merely provide the network with this information, letting it draw its own conclusions about which of these information sources is useful, to what extent, and for what processes, based on the patterns it finds in the training data.

In this way, we are following an empiricist approach that benefits from the introduction of learning biases (in the sense of Gildea and Jurafsky 1996). Our approach differs from a nativist, or rule-based approach, in that we are providing the tools to determine the effects of boundaries and the like, without enforcing how the boundaries and other information must be used. In fact, in many ways, we are conflating the various kinds of postlexical rules, both allophonic and allomorphic, which various proposals in lexical phonology (e.g. Kaisse 1985, Mohanan 1986) have sought to tease apart.

5.4.1. Features for lexical phones

Block 2 is supplied lexical phone label information in a 1-of-n encoding. Since this encoding is particularly sparse, we sought to permit the network to generalize beyond the specific phone label information by providing featural information for each lexical phone to Block 3. Initial experiments relied on a fairly traditional feature set with values for activated, unactivated and unspecified features, shown in Table 5-1. The feature table may appear to be rather overspecified. This is because the features shown are actually a subset of a larger feature set that is being used for both phonemes and allophones across a variety of languages for acoustic neural network training. In order to be able to encode diphthongs, each of the vowel features has a variant 1 and a variant 2, e.g. high 1, high 2. Variant 1 contains the feature value for the diphthong’s nucleus, while variant 2 contains the feature value for the diphthong’s glide. Monophthongs contain identical values for variants 1 and 2 for each feature. In future work, we hope to explore featural representations that employ the innovations of feature geometry, in order to better establish the relationships between features and clusters of features.
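As a minimal sketch of how such input might be coded, the fragment below builds a 1-of-n phone label vector (Block 2) alongside a ternary feature vector (Block 3), using +1/-1/0 for activated/unactivated/unspecified values. The phone inventory, feature names and values here are illustrative stand-ins, not the actual tables used with Monnet.

```python
# Illustrative phone set and feature values; not the actual feature table.
PHONES = ["t", "d", "s", "ah", "iy"]
FEATURES = {
    "t":  {"sonorant": -1, "continuant": -1, "alveolar": +1, "high1": 0},
    "ah": {"sonorant": +1, "continuant": +1, "alveolar": 0,  "high1": -1},
}
FEATURE_NAMES = ["sonorant", "continuant", "alveolar", "high1"]

def encode_phone(phone):
    # Block 2: 1-of-n phone label vector.
    label = [1.0 if p == phone else 0.0 for p in PHONES]
    # Block 3: ternary feature values (+1 activated, -1 unactivated, 0 unspecified).
    feats = [float(FEATURES.get(phone, {}).get(f, 0)) for f in FEATURE_NAMES]
    return label + feats

print(encode_phone("t"))
```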

Table 5-1: Features for lexical phones

“r ‚QADEI NS TUZaIa UbddZef ghi l mno‚Ipst tSuvwz Phone Vocalic +- +++- ++---+-++---+---+---++----+-+- Vowel +- +++- ++---+-++---+---+---++----+--- Sonorant +++++- +++- - +- ++---+---++++++- - - - +- +- Obstruent -----+--+++- +- - +++- +++- - ++- - ++++- +- + Continuant ++++++++- ++++++- - +++- +++- - ++- +- +++++ Affricate ------+------+---- Nasal ------+------++------Approximant -+------+------+- Front1 - 0- +- 0+- 000- 0++000+000+000- - 0000- 000 Front2 - 0- +- 0+- 000- 0- - 000+000+000- +0000- 000 Midfront1 -0---0-+000- 0- - 000- 000- 000- - 0000- 000 Midfront2 -0---0-+000- 0+- 000- 000- 000- - 0000- 000 Mid1 +0---0--000- 0- - 000- 000- 000- - 0000- 000 Mid2 +0---0--000- 0- - 000- 000- 000- - 0000- 000 Back1 - 0+- +0- - 000+0- - 000- 000- 000++0000+000 Back2 - 0+- +0- - 000+0- +000- 000- 000+- 0000+000 High1 -0---0--000- 0- - 000- 000+000- - 0000+000 High2 -0---0--000- 0- - 000+000+000- +0000+000 Midhigh1 -0---0-+000+0- - 000+000- 000++0000- 000 Midhigh2 -0---0-+000+0++000- 000- 000+- 0000- 000 Midlow1 +0+- - 0+- 000- 0- - 000- 000- 000- - 0000- 000 Midlow2 +0+- - 0+- 000- 0- - 000- 000- 000- - 0000- 000 Low1 - 0- ++0- - 000- 0++000- 000- 000- - 0000- 000 Low2 - 0- ++0- - 000- 0- - 000- 000- 000- - 0000- 000

163 “r ‚QADEI NS TUZaIa UbddZef ghi l mno‚Ipst tSuvwz Phone Bilabial 0- 000- 00---0-00+--0---0-+-00+---0-+- Labiodental 0- 000- 00---0-00---0+--0---00----0+-- Dental 0- 000+00- - +0- 00---0---0---00----0--- Alveolar 0+000- 00---0-00-+-0---0+-+00-++-0--+ Postalveolar 0+000- 00- +- 0+00- - +0---0---00---+0--- Retroflex -+------Palatal 0- 000- 00---0-00---0---0---00----0--- Velar 0- 000- 00+- - 0- 00---0-+-0---00----0-+- Uvular 0- 000- 00---0-00---0---0---00----0--- Glottal 0- 000- 00---0-00---0--+0---00----0--- Aspirated ------+------+-+----- Closure ------Lateral ------+------Voiced +++++++++- - ++++++++- +- ++++++- - - - ++++ Round1 - 0+- - 0- - 000+0- - 000- 000- 000++0000+0+0 Round2 - 0+- - 0- - 000+0- +000- 000- 000+- 0000+0+0

5.4.2. Stress

We sought to provide the neural network with various stress, accent and prominence information in Block 4. This block contains quantitative answers to the questions shown in Table 5-1.

Table 5-1: Neural network block 4 input

What is the stress of the syllable containing this phone: 1 if primary stressed, .5 if secondary stressed, 0 if unstressed.

What is the prominence (as described in O’Shaughnessy 1976) of the word containing the phone: 0-14.

Is the word containing the phone a function (closed class) word (0) or content (open class) word (1)?

Does the syllable containing the phone carry a pitch accent (e.g. H*, L* in the sense of Pierrehumbert 1980)?

5.4.3. Syntactic and prosodic information

Block 5 contains binary (yes/no) answers to the questions dealing with adjacency to various prosodic and syntactic boundaries as shown in Table 5-1.

Table 5-1: Neural network block 5 input

Does the phone have a (prosodic) syllable boundary on its left/right?

Does the phone have a (prosodic) word boundary on its left/right?

Does the phone have a (syntactic) phrase boundary on its left/right?

Does the phone have a (syntactic) clause boundary on its left/right?

Does the phone have a (syntactic) sentence or (prosodic) utterance boundary on its left/right?24

Does the phone have a prosodic intermediate phrase boundary on its left/right?

Does the phone have a prosodic intonation phrase boundary on its left/right?
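A minimal sketch of how the Block 4 and Block 5 answers might be packed into numeric vectors is given below; the Phone record, its attribute names, and the scaling of the 0-14 prominence score are assumptions made for illustration, not the actual data structures used with Monnet. The resulting vector widths (4 and 14 values per phone) are consistent with the 4*9 and 14*9 input widths in Figure 5-1.

```python
from dataclasses import dataclass, field

@dataclass
class Phone:                      # hypothetical record for one lexical phone
    stress: float                 # 1 primary, .5 secondary, 0 unstressed
    prominence: int               # 0-14 (O'Shaughnessy 1976)
    content_word: bool            # content (open class) vs. function word
    pitch_accented: bool          # carries H*, L*, etc.
    boundaries: dict = field(default_factory=dict)   # e.g. {"word_left": True}

BOUNDARY_KEYS = [f"{kind}_{side}"
                 for kind in ("syllable", "word", "phrase", "clause",
                              "sentence_utterance", "intermediate", "intonation")
                 for side in ("left", "right")]       # 14 yes/no answers

def block4(p):   # 4 values; scaling prominence to [0, 1] is an illustrative choice
    return [p.stress, p.prominence / 14.0,
            1.0 if p.content_word else 0.0,
            1.0 if p.pitch_accented else 0.0]

def block5(p):   # 14 binary values
    return [1.0 if p.boundaries.get(k, False) else 0.0 for k in BOUNDARY_KEYS]
```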

5.4.4. Windowing

Machine learning procedures might tend to give weight to the position of a phone in a word while learning letter-phoneme correspondences. While there are certain boundary, or edge, effects in English and other languages, absolute position (e.g. the third phone in a word) tends not to be important. In order to ensure that neural networks learn the importance of context without undue attribution of effects to absolute position, many researchers have proposed moving data through sliding windows composed of several phonemes or letters (in letter-to-sound conversion). Windows of length 7 (Sejnowski and Rosenberg 1987, McCulloch et al. 1987), 9 (Lucassen and Mercer 1984) and 10 (Cohen 1997) have been used. One of the motivations for such long windows, at least in letter-to-sound conversion, has been the observation of long-distance dependencies for determination of stress placement (e.g. Church 1986). However, Pierrehumbert and Frisch (1997, 22), in a study on synthesizing allophonic glottalization, describe the importance of using a running window for postlexical prediction:

In a faithful TTS implementation of the glottalization rule, the phonological structure will have to be parsed up to the intonation phrase, and the rule will have to have access to both the immediate segmental context and the prosodic strength of the surrounding syllables. This information is available, for example, if the system is implemented with a running window on fully parsed phonological structures, as in Pierrehumbert and Beckman (1988).

Since much of our data is based on isolated sentences, it is important to introduce the notion of padding. We padded each isolated sentence with enough “blank” phones so that assimilatory effects could not be learned across sentences which were contiguous simply by virtue of the training procedure.
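The following minimal sketch illustrates the padding and sliding-window scheme described above; the padding symbol, phone labels and window handling are illustrative choices, not the exact procedure used in training.

```python
PAD = "<blank>"

def windows(sentence_phones, size=9):
    """Yield one window per phone: the phone plus size//2 phones of context
    on each side, with "blank" padding so windows never span sentences."""
    half = size // 2
    padded = [PAD] * half + list(sentence_phones) + [PAD] * half
    for i in range(half, len(padded) - half):
        yield padded[i - half : i + half + 1]

# First window: four pads followed by the first five phones of the sentence.
print(next(windows(["D", "i", "t", "ae", "N", "k"], size=9)))
```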

We experimented with windows of length 3, 5 and 9. The overall results for three networks are shown in Table 5-1. These results indicate that the largest window size attempted, length 9, was the most successful at learning postlexical processes.

24 Since all the materials used for the current speaker are from read speech, primarily lists of sentences as described in section 3.2, syntactic sentence boundaries largely coincide with prosodic utterance boundaries (Nespor and Vogel 1986).

Table 5-1: Relative performance of postlexical neural network with three different window sizes

Window size    % matching vectors
3              61.2
5              92.9
9              94.9

Note: % matching vectors is an internal measure of success that is correlated with the percentage of phones correct. It is greater than that figure because it includes matches on padding vectors.25

At first, it is not entirely clear why a window of 9, that is, 4 phones in each direction, achieved the best results. It would seem that most postlexical processes are sensitive to neighboring phones alone, although findings such as that of Pierrehumbert and Frisch (1997) above show that the situation is more complex. In addition, Magen (1997) finds vowel-to-vowel coarticulation effects of V1 on V3 in /bV1b“V3b/ contexts, a distance of three phones, across a foot boundary.

Since the best results were achieved with a window of 9 phones, analyses will be based on such a network. All of the sources of information in blocks 2-5 are windowed in this way, and then passed through three hidden neural network layers, blocks 6-11; block 1 then emits one of 117 possible postlexical phones.

25 Padding vectors would have been factored out of the success measure if there had not been errors on them as well in the networks with smaller window sizes.

Chapter 6. Results

In this chapter, we will discuss the results of using the postlexical neural network described in Chapter 5. We will first examine the weights of the neural network after it has been trained in an effort to understand the extent to which it uses the various inputs described above. We will then examine the range of processes learned by the network, including allophony, allomorphy and dialect smoothing. Finally, we will assess the postlexical neural network’s performance at learning these processes, comparing our results with analogous results from other research.

6.1. Analysis of neural network

By examining the neural network’s distribution of weights after training, we can get an idea of the relative importance of different input information. Figure 6-1 shows the source processing element (PE) weights summed over all phonemes for each of the 9 TDNN window locations in block 6, which receives input from Stream 2. The current phone, which is in the middle of the window (position 5), has the most influence on the network’s output, represented by the largest absolute value of the sum of the weights. As might be expected, the influence of the other phones in the window decreases with distance from the current phone. Interestingly, the following phone appears to have more influence than the preceding phone. Giegerich (1992, 214) notes that anticipatory assimilation is much more productive than perseverative assimilation in English. Hock (1986, 63) makes the same observation across languages. We will also note the importance of following environment in our discussion of vowel fronting below.
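A minimal sketch of the kind of weight summary plotted in Figure 6-1 is given below; it assumes the trained input-to-hidden weights of block 6 can be arranged as a (hidden units x window positions x phone labels) array, which is a hypothetical layout, and it uses random numbers as a stand-in for the trained weights.

```python
import numpy as np

# Stand-in for the trained block 6 weights: 10 hidden PEs, 9 window
# positions, 46 phone-label inputs (layout assumed for illustration).
rng = np.random.default_rng(0)
W = rng.normal(size=(10, 9, 46))

# One plausible reading of the analysis in the text: sum the weights over
# hidden units and phone labels for each window position, then take the
# absolute value; position 5 is the current phone.
per_position = np.abs(W.sum(axis=(0, 2)))
for pos, value in enumerate(per_position, start=1):
    print(pos, round(float(value), 1))
```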


Figure 6-1: TDNN window weights for phone label stream

Figure 6-2 shows the weight allocation in block 10, which receives input from blocks 6, 7, 8 and 9, which in turn receive input from the various data streams. Each of blocks 6, 7, 8 and 9 supplies 10 PEs to block 10, representing the values related to the input lexical phones, features, stress and boundary information. The 10 inputs from each of the 4 categories are summed and these 4 sums are displayed in the figure. As shown in the figure, the contribution of lexical phones and lexical features appears to be greater than that of the stress and boundary related inputs in streams 4 and 5.


Figure 6-2: Weight distribution by input type

Note that these methods of analyzing the impact of various predictor variables have analogues in other phonological learning and analysis techniques. For example, Randolph (1989, 113) presents statistics on a percent information extracted measure for various phonological predictor variables, based on learning English stop realization in several corpora using CART (Classification and Regression Trees, Breiman et al. 1993). Guy and Boberg (1997, 159) present the factor weights for different features of preceding consonants in predicting t,d deletion using VARBRUL (e.g. Sankoff 1988).

6.2. General phonological analysis

We will first summarize the breadth of processes that have been learned by the postlexical neural network. Subsequently, we will examine particular processes in detail, with attention to the neural network’s performance at learning each process. Table 6-1 illustrates several of the allophonic and allomorphic phenomena learned by the postlexical neural network. These phenomena are all well-known postlexical phenomena of English. Many of them occur variably in the speech of individual speakers along a stylistic continuum.

Table 6-1: Postlexical phenomena learned by postlexical neural network

Phenomenon                     Lexical          Postlexical        Orthography
unreleased stops               fEd              fEd|               fed
glottalized vowels             Qnd              ?Qnd               and
glottalized consonants         stret            stre?              straight
d deletion                     Qnd fAlo         Qn fAlo            and follow
t deletion                     “br«pt stArt     “br«p| stArt|      abrupt start
destressing and assimilation   Di tQNk          DÞ tQNk            the tank
destressing and assimilation   Di waJndIN       D“ waJndIN         the winding
t flapping                     d“½ti            d“½Ri              dirty
nasal flapping                 korn“½           k‚rR)“½            corner
h voicing                      In h“½           Þn ⓽              in her
schwa epenthesis               k“½lz            k“½l`z             curls
/u/ fronting                   dun              d”n                dune
syllabic consonants            pud“lz           p”dl`z             poodles

Table 6-2 illustrates another class of phenomena learned by the postlexical neural network. These include dialect smoothing, and smoothing of what might well be considered idiosyncrasies (or conventions) of labeling either the lexical database or the labeled speech corpus. These phenomena learned by the postlexical network would probably not be considered postlexical in a description of natural language. These phenomena are the results of the neural network learning some of the idiosyncrasies of the particular dictionary-labeler-speaker triad at hand. For example, neutralization of vowels before /r/ may not be a postlexical process in the speaker’s actual grammar, since his mental lexicon probably stores neutralized versions of these vowels. However, when coupled with a generalized American English pronunciation lexicon, as described above, it is necessary to perform this neutralization actively, which is what the postlexical network does.

Table 6-2: Dialect/labeling/lexical idiosyncrasies learned by network

Phenomenon                    Lexical       Postlexical    Orthography
marry/merry/Mary merger       bQr“l         bErl`          barrel
schwa deletion/metathesis?    c&Ildr“nz     c&Ild“rnz      children’s
laxing before /r/             korn“½        k‚rR)“½        corner
wh voicing                    ËaJlst        waJlst         whilst

6.3. General error analysis

Analyzing performance on multiple postlexical phenomena at one time can be a daunting task. Tuning the data encoding to improve performance on one process can result in damaging performance on another. Nevertheless, we need some way of assessing the performance of a given architecture of the postlexical neural network. In this way, the architecture can be adjusted and improvements or degradations noted. If particular architectural or data changes result in improved performance, they can be preserved; otherwise, they can be discarded. The kinds of changes that have been found to affect network performance include providing new data, such as prosodic boundaries, or changing the way existing data is used, for example by changing the TDNN window size.

Perfect performance would be for the postlexical network to predict, based on lexical phones, the postlexical phones that were labeled in the labeled corpus. To be fair, we test for this on the 10% of testing data that was withheld from training. Perfect performance so defined is difficult, if not impossible, to attain.

Table 6-1 summarizes neural network performance on the testing subset given the network architecture described in Chapter 5. As can be seen, 87.8% of the predicted phones matched the phones in the original hand-labeled files. Since this result means that more than 1 in 10 predicted phones will differ on average from the original files, it is important to understand the nature of the incorrect phones, in order to assess the potential perceptual impact of this error rate. We will show how 84% of the errors, or 10% of the total phones, are actually acceptable variants of the original phones, increasing the overall success rate to 98% acceptable phones.

For example, 8.6% of the predicted phones were another allophone of the original phone. Table 6-2 provides details on these kinds of errors. As can be seen in the table, most of these errors are fairly subtle, such as the choice of a different reduced vowel allophone from the one actually selected by the speaker. We will discuss the conditioning of such choices in the data below.

Table 6-1: Summary of postlexical network results

Result                                                     Instances   Percentage
original phone = nn phone                                  2303        87.8%
nn phone = another allophone of original phone              225         8.6%
nn phone = another allomorph of original phone               37         1.4%
nn phone = another dialect’s version of original phone        6         0.2%
total acceptable phones                                    2571        98%
total unacceptable phones                                    52         2%

Table 6-2: Allophonic errors

neural original quantity neural original quantity network network “Þ 27 RÊ n 2 ޓ12 kc kc kH 2 Rdc d9 Rtc tH 2 ÞI9 tc tH R 2 IÞ8 pc pH pH 2 ¶½“½8 pc p pc 2 ?tc tH 7 ?E E 2 dc dc d 7 dc d R 2 hâ6 ‚A2 tc tH tH 6 “«2 u”5 “?«2 ”u5 tc ? 1 tc tc t 5 l` l 1 tc tH ? 5 kH kc 1 gc ggc 5 i?i1 dc d d 5 R?1 “½ ¶½ 5 «?«1 nRÊ 4 tc t tc tH 1 bc b b 4 pc pH pc 1 ?tc3 kc kH kc k 1 pc pc p 3 dZc dZ dZ 1 kc kc k 3 ?Þ Þ 1 âh3 ?I I 1 bbc b3 ?I ?Þ 1 tc t tc 3 E?E1 kc k kc 3 dc d dc 1 A?A3 tSc tS tS 1 “?“3 bc b bc 1 dc deleted 3 ?Q Q 1 tH tc tH 2 Q?Q1 tc tc tH 2 “I 1 pc pc pH 2 tc deleted 1 RÊ n 2 ? deleted 1 kc kc kH 2 s deleted 1 Rtc tH 2 tc t deleted 1 tc tH R 2 I?I1 pc pH pH 2

Another source of acceptable errors involves mismatches in optional vowel reduction. 1.3% of the incorrect phones represented the choice of either a reduced vowel by the neural network where the original speech had an unreduced vowel, or vice versa. These errors also include a different choice of allomorph for the. For example, the can surface as [D“], [Di] or [DÞ]. Function words are subject to certain destressing patterns (e.g. Selkirk 1984) which are not found in content words, at least in the formal style of speech analyzed here. While the kinds of vowel reductions generally found in function words may not be subject to the clear segmental conditioning that governs the allomorphy, we might say that they are prosodically conditioned. Finally, particular morphemes such as re- and be- are subject to destressing patterns similar to those undergone by function words.

Table 6-3: Destressing errors

Neural network   Original   Quantity   Examples
e                “          4          a
Þ                Q          4          had, than, and
Q                Þ          4          at, than, and
i                Þ          3          resigned, the ink, the arrow
Þ                e          3          a
u                Þ          2          to
e                Þ          2          a
Þ                ”          2          to
r                deleted    2          from
v                deleted    1          of
”                Þ          1          to
aI               A          1          I’ll
Þ                i          1          resumed
Þ                E          1          then
Þ                A          1          on
E                Þ          1          when
A                Þ          1          on
“                e          1          a
“½               r          1          are
“                deleted    1          from

A final class of acceptable errors consists of minor dialect differences between the neural network’s choice and the original speech. Since the neural network bases its hypotheses on lexical forms from a dictionary of Generalized American, as described in section 3.1, such errors are not unexpected. We will discuss the success of dialect smoothing below; however, Table 6-4 shows some cases that were not successfully smoothed.

Table 6-4: Dialect errors

Neural network   Original   Quantity   Examples
S                tS         1          lectured
S                tSc tS     1          attention
dZc dZ           Z          1          changed
I                E          1          England
gc g             deleted    1          England
?Q               ?E         1          arrow
deleted          “½         1          expiration

While we hope to have shown that the majority of mismatched phones between the neural network and the original speech are acceptable, this notion is actually fairly complex. Traditional phonemic theory differentiates between phonemes and allophones by saying that substituting one phoneme for another would result in a semantic difference, whereas substituting the “wrong” allophone might merely sound odd. Sociolinguistic studies of variation have shown that allophony is governed by a complex constellation of linguistic and social factors. Given this, an allophonic choice might be stylistically, socially or phonologically odd. Determining the oddness or acceptability of particular allophonic choices would require psycholinguistic analysis that we hope to pursue in future work.

In addition, in the future we hope to investigate methods for postprocessing or constraining the output of the postlexical neural network. While at present we are using the top-ranked phone for each slot, the neural network can provide multiple phone hypotheses and likelihoods for each slot. We can use n-gram statistics on English postlexical phones in order to traverse such networks, in effect realizing phonotactic constraints on the neural network output. Such a procedure might help eliminate a portion of the unacceptable errors that remain.
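As a minimal sketch of the kind of postprocessing envisioned here, the fragment below combines per-slot network likelihoods with bigram phone statistics in a Viterbi-style search. The phone symbols, probabilities and bigram model are purely illustrative; this is not part of the system described in the text.

```python
import math

def viterbi(nn_hyps, bigram):
    """nn_hyps[i] maps candidate postlexical phones for slot i to network
    likelihoods; bigram(p, q) gives an (illustrative) bigram probability."""
    best = {p: math.log(s) for p, s in nn_hyps[0].items()}
    paths = {p: [p] for p in nn_hyps[0]}
    for hyps in nn_hyps[1:]:
        new_best, new_paths = {}, {}
        for q, score in hyps.items():
            p = max(best, key=lambda p: best[p] + math.log(bigram(p, q)))
            new_best[q] = best[p] + math.log(bigram(p, q)) + math.log(score)
            new_paths[q] = paths[p] + [q]
        best, paths = new_best, new_paths
    return paths[max(best, key=best.get)]

# Illustrative use: two slots, with a bigram model that disfavors [tc][R].
hyps = [{"tc": 0.6, "R": 0.4}, {"R": 0.5, "tH": 0.5}]
print(viterbi(hyps, lambda p, q: 0.05 if (p, q) == ("tc", "R") else 0.3))
```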

6.4. Allophony

In this section, we will provide a detailed analysis of the neural network’s performance on several allophonic phenomena. First, we will discuss two phenomena involving vowels: fronting and glottalization. We will then investigate allophony in coronal consonants, including deletion, flapping, glottalization and aspiration. For each phenomenon, we will attempt to provide a post-hoc analysis of the factors that appear to govern the network’s phone choices. Where possible, we will provide details about other researchers’ success at describing or modeling these phenomena.

6.4.1. Vowel fronting

We will first examine the two cases of vowel allophony involving fronting that were examined in acoustic space in Chapter 4. There we discussed the phoneme /u/, whose main allophonic reflexes are [u] and [”], and /“/, whose main reflexes are [“] and [Þ].

The environment we will be examining to explain the allophonic variation involves the feature coronal. In this regard, the work of Hume and Clements (e.g. Hume 1992, Clements and Hume 1995) demonstrating that front vowels possess the coronal feature is particularly helpful. Indeed, we have found that the presence of coronal segments in the environment of /u/ and /“/ usually favors the fronted allophones [”] and [Þ] over their unfronted counterparts [u] and [“]. In accordance with work by Clements (1976), Zsiga (1995, 294-295) and others, we have found that [j] patterns with the coronals.

In addition, we have found that the dorsal component of syllable-final, dark or velarized /l/ tends to block fronting, despite its coronal component (cf. McLemore 1995, van Bergem 1994). According to Sproat and Fujimura (1993), /l/ involves a vocalic dorsal gesture and a consonantal apical gesture. In syllable-final position, the dorsal gesture is closer to the syllable nucleus than the apical gesture, while the reverse holds in syllable-initial position (Sproat and Fujimura 1993, 291). This accords with our analysis, in which we consider syllable-initial /l/ coronal, and syllable-final /l/ noncoronal with respect to environments surrounding vowel nuclei. As Sproat and Fujimura point out, intervocalic /l/, as in words like foolish or frolic, occupies an intermediate position articulatorily between syllable-initial and syllable-final /l/.

For the analyses below, we will consider [+coronal] to mean all traditional coronals with the addition of /j/ and excluding /l/’s that are unambiguously in the coda of their syllables (i.e. intervocalic /l/’s will be considered [+coronal]). [-coronal] will specify the complementary environment. We will be examining both left and right environments, ignoring the presence of word boundaries. We have excluded cases where either environment consisted of a sentence boundary, as the speech database consists of isolated sentences.
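The classification just described can be summarized as a small predicate; the sketch below is illustrative only, with stand-in phone symbols and a simplified coda test rather than the actual transcription system.

```python
# [+coronal] = traditional coronals plus /j/, with /l/ counted as coronal
# unless it is unambiguously in the coda of its syllable. Phone symbols and
# the coda flag are illustrative stand-ins.
CORONALS = {"t", "d", "s", "z", "S", "Z", "tS", "dZ", "T", "D", "n", "r", "l", "j"}

def is_coronal(phone, in_coda=False):
    if phone == "l":
        return not in_coda          # unambiguous coda /l/ counts as [-coronal]
    return phone in CORONALS

# e.g. classify the left and right neighbors of a target vowel
print(is_coronal("j"), is_coronal("l", in_coda=True))   # True False
```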

We will first examine the patterning of the allophonic reflexes of /“/. Figure 6-1 shows the realization of schwa allophones in the test data in both the original labeled speech and the output of the postlexical neural network. The unfronted [“] allophone appears to be most favored when noncoronal environments are on the left and right, and secondarily with a noncoronal environment on the right alone. The fronted [Þ] allophone appears to be most favored when coronal environments are on the left and right, and secondarily with a coronal environment on the right alone.

It may be interesting to examine the three exceptions to fronting in the environment surrounded by coronals in the original speech. In two of the exceptions, pantaloons and substitutes (underlying schwa in bold), the following coronal environment is not tautosyllabic with the vowel in question. As we suggested earlier with respect to McLemore’s (1995) rule for predicting [I]-like reduced vowels, tautosyllabicity with respect to surrounding coronals may be important. The remaining exception is in the word jealous. The exceptionality of this word highlights a potential problem with our decision to consider intervocalic /l/ as coronal.

The neural network appears to have performed similarly in the cases of surrounding coronals or noncoronals, in addition to the following-coronal-alone case. It appears to have performed less well in the preceding-coronal-alone case. It is possible that the neural network is relying on additional information beyond the coronality of the immediate environment.

[Bar chart: occurrences of the schwa allophones [ax] and [ix] in original speech and neural network output for the environments [-cor]__[-cor], [+cor]__[+cor], [+cor]__[-cor] and [-cor]__[+cor]; neural network accuracy figures of 80%, 67%, 79% and 83% are marked above the bars.]

Figure 6-1: Distribution of schwa allophones

Each environment is represented by one bar representing original speech and another bar representing neural network (nn) output. Within each bar, the proportion of occurrences of each allophone is represented. Atop the bar representing the neural network output is placed a percentage reflecting the accuracy of the neural network’s predictions with respect to the original data. The figure uses TIMIT labels.

We will now discuss other researchers’ findings with respect to the varied realization of reduced vowels in English. From a theoretical standpoint, Jensen (1993, 212-213) proposes to redress what he claims to be an inadequate treatment of variation in reduced vowel qualities in SPE. SPE recognizes, as we do, one underlyingly reduced vowel, /“/, which may be realized with different qualities depending on low-level phonetic rules (SPE, 110). Jensen (1993, 213) proposes a rule of vowel reduction and assimilation by which reduced vowels before palatals become [I]. He also proposes rules by which nonback nonlow underlying vowels reduce to [I], while all others reduce to [“]. Such rules rely on rather more abstract underlying representations than we have found convenient, as discussed in 3.1.

Byrd (1994), investigating the TIMIT database, found that [Þ] was more than twice as common as [“], across speakers. However, she found that women used [“] less than men, and that speakers from the North, New York City and the West used [“] less frequently, while speakers from the North Midland, South and South Midland used [“] more frequently. Men used [Þ] less frequently than would be predicted by random distribution. Speakers from the Northeast, New York City and the West used [Þ] more frequently than expected.

Browman and Goldstein (1992) discuss the concept of a “targetless” schwa. Such a schwa takes on the characteristics of surrounding phonetic material, in contrast with views holding that schwa’s target is in the center of the vowel space, with reduced vowels shifting towards it (e.g. SPE, 110). They cite evidence that schwa coloring is a function of neighboring stressed vowels in an examination of X-ray data of articulations by an American English speaker of utterances within the frame pV1p“°pV2p“, where V1 and V2 vary over /i, E, a, «, u/. Browman and Goldstein’s (1992) findings are not easily comparable to our results, since they examined the influence of surrounding vowels on schwa realization, while we examined the influence of surrounding consonants.

Van Bergem (1994) studied the effects of both neighboring consonants and vowels on the realization of schwa in Dutch nonsense words, using the frames C1“C2V and VC1“C2, where C1 and C2 varied over /p, t, k, f, s, X, m, n, ß, r, l, j, V/ and V varied over /i, aá, u/. As did Browman and Goldstein (1992), van Bergem (1994) noted that schwa lacks an articulatory target, underlining instead the importance of phonemic context for determining its acoustic realization.

Van Bergem (1994, 160) discusses schwa’s tendency to strive for maximum “compatibility” with the surrounding phonemic context. He notes that the straightness of vowel F2 tracks between consonants is a reasonable measure of such compatibility. For example, /I/ is compatible with the dentals /t/ and /n/ in the sense that they are both pronounced with a “fronting” tongue. Van Bergem (1994) demonstrates this spectrally by noting that /I/’s F2 track is fairly straight across the transition from /t/ to /n/, in contrast to that of /‚/. He suggests that this explains the presence of /I/-like schwa (our [Þ]) in the environment of dentals (rounded, horses). In contrast, van Bergem (1994) notes that the F2 track of /‚/ is more compatible than /I/ with the environment /V_l/. Consequently, he suggests this as an explanation for an /‚/-like schwa (our [“]) in words like vowel and gruel.

In an acoustic analysis of three white Chicago speakers, Veatch (1991, 163) found that the vowel reduction target was high central [Þ]. In accordance with Byrd’s and Veatch’s findings, [Þ] is the most common reduced vowel for our speaker. In accordance with McLemore’s (1995) hypothesis, coronal environment on the left and right is the most favorable environment for realization of [Þ]. From the data in Figure 6-1, it appears that following [+coronal] environment is more favorable than preceding [+coronal]. The greater effect of following position on schwa realization was also found by Browman and Goldstein (1992, 56), and van Bergem (1994), though it is only described with respect to following vowels.

We will now discuss our findings with respect to the distribution of the fronted [”] and unfronted [u] allophones of /u/ in the original speech and as predicted by the postlexical neural network, as shown in Figure 6-2. We have employed the same set of preceding and following environment classifications as we did with schwa. As we can see in the figure, the fronted variant is predicted most often in contexts with coronals on both the left and right. The preceding coronal environment alone is the next best predictor for fronted [”].

Interestingly, following coronal does not appear to predict the fronted variant, although the data is scant in that environment. Recall that with respect to schwa, following coronal actually predicted the fronted [Þ] more than preceding coronal. It is not clear why the effects of coronality with respect to direction should be reversed between these two cases.

The postlexical neural network appears to perform best in the case where /u/ is surrounded by coronals. It does not appear to have noted that following coronal does not predict fronting in the original data.

[Bar chart: occurrences of the /u/ allophones [uw] and [ux] in original speech and neural network output for the environments [-cor]__[-cor], [+cor]__[+cor], [+cor]__[-cor] and [-cor]__[+cor]; neural network accuracy figures of 86%, 70%, 33% and 67% are marked above the bars.]

Figure 6-2: Distribution of /u/ allophones

Each environment is represented by one bar representing original speech and another bar representing neural network (nn) output. Within each bar, the proportion of occurrences of each allophone is represented. Atop the bar representing the neural network output is placed a percentage reflecting the accuracy of the neural network’s predictions with respect to the original data. The figure uses TIMIT labels.

6.4.2. Glottalization of vowels

Dilley et al. (1996) examined several predictor factors for word-initial vowels surfacing with glottal onsets: whether the vowel occurs at the start of an intermediate or intonational phrase, whether the target syllable or word was pitch accented, whether the vowel was reduced, and preceding context. Of 125 word-initial vowels in the testing portion of the data, 21% were glottalized in the original speech, while 14% were glottalized by the postlexical neural network. Among the vowels glottalized by the postlexical neural network, 72% were glottalized in the original speech. Dilley et al. (1996, 432) analyzed 5 speakers, whose glottalization rates ranged from 17% to 45%.

Dilley et al. found that all speakers glottalized more when the syllable containing the vowel in question began an intermediate or intonational phrase. This was not true of our speaker. In our original data, the glottalization rate was still 21% at the start of intermediate phrases, and only 14% at the start of intonational phrases.

Both Dilley et al. (1996) and Pierrehumbert and Frisch (1997) found that vowels in accented syllables were more likely to be glottalized than those in unaccented syllables. Our speaker also demonstrated this behavior. 48% of accented word-initial vowels were glottalized in the original speech, while 45% were glottalized by the postlexical neural network. Among the vowels glottalized by the neural network, 79% were glottalized in the original speech.

6.4.3. Coronal allophones

In this section, we will examine postlexical neural network performance on coronals themselves, rather than their effects on other phones, as in the previous section. Coronals such as /t/ and /d/ are subject to a great deal of allophony in American English. In fact, according to the data in Figure 5-1, they have the most entropy of any consonants, and indeed any other phones, except for schwa. This entropy is reflected in the complexity of phonological accounts of /t/ allophony, which often involve rule-ordering (Nespor and Vogel 1986, Kiparsky 1979). Since the postlexical neural network does not have any concept of rule-ordering, the success that it has at modeling /t/ allophony may indicate that a simpler constraint-based approach is more suitable. The processes involving /t/ that we will examine include aspiration, flapping, and glottalization. The segmental nature of our treatment of allophony in this section will also enable us to study t,d deletion.

With only one exception,26 all syllable-initial /t/’s in the testing portion of the database surface with an aspirated release. The fact that our hand-syllabified database generally adhered to the principle of stress-conditioned resyllabification (see discussion in section 3.2) explains why intervocalic examples of flapping did not appear in syllable-initial position. Figure 6-1 summarizes the distribution of syllable-final /t/ allophones in the testing portion of the data. We have simplified the data by grouping together unreleased and unaspirated /t/ as “unaspirated”. As can be seen, the postlexical neural network’s general proportions of each of the phenomena appear to be in line with the original data. Glottalization and deletion appear to be harder for the network to learn than flapping and aspiration.

26 The exception is the word mountain [maUn¾?n`], whose hand-syllabification may be subject to debate.

[Bar chart: occurrences of syllable-final /t/ allophones (deleted, flap, glottal, unaspirated, aspirated) in original speech and neural network output; neural network accuracy figures of 80%, 80%, 39%, 75% and 40% are marked above the bars.]

Figure 6-1: Distribution of syllable-final /t/ allophones

Each environment is represented by one bar representing original speech and another bar representing neural network (nn) output. Within each bar, the proportion of occurrences of each allophone is represented. Atop the bar representing the neural network output is placed a percentage reflecting the accuracy of the neural network’s predictions with respect to the original data.

Let us try to assess the general constraints on the realization of these allophones. Deletion occurs only when /t/ is the final element of a consonant cluster in a syllable coda. Flapping occurs only in intervocalic environments, except in one case where it precedes [â] across a word boundary in “that heavy”. Glottalization occurs primarily before sonorants. Unreleased or unaspirated /t/ occurs before nonsonorants. Aspirated /t/ generally occurs at the ends of intonational phrases. We will look more closely at the factors explaining the variation in these allophones below.
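Read as a rough decision list, these tendencies could be rendered as follows; this is a descriptive summary of the generalizations above, not the network's actual behavior, and the context fields and phone symbols are illustrative.

```python
# Rough decision-list rendering of the observed tendencies for syllable-final
# /t/; a descriptive summary only. The ordering of clauses is a simplification.
SONORANTS = {"m", "n", "N", "l", "r", "w", "j", "vowel"}

def t_allophone(cluster_final, next_phone, intonational_phrase_final):
    if intonational_phrase_final:
        return "aspirated"          # aspirated releases dominate at IP ends
    if cluster_final:
        return "deleted"            # /t/ final in a coda consonant cluster
    if next_phone == "vowel":
        return "flap"               # intervocalic (post-vocalic context assumed)
    if next_phone in SONORANTS:
        return "glottal"            # glottalization primarily before sonorants
    return "unaspirated"            # unreleased/unaspirated before nonsonorants

print(t_allophone(False, "vowel", False))   # flap
```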

English coronal stop deletion, or t,d deletion, has occupied sociolinguists for many years, as it provides a good example of inherent variability, as described by Weinreich, Labov and Herzog (1968). According to Guy and Boberg (1997, 151), “The process is universally variable in English— that is, every speaker that has been observed deletes some, but not all, of their final stops”. The particular environment in which t,d deletion has been most studied is in tautosyllabic consonant clusters in which the /t/ or /d/ is the final element of the syllable coda. It is important to note that the concept of t,d “deletion” is somewhat of an abstraction. It is likely that a more precise analysis might find remnants of the /t/ or /d/ gesture in the remaining consonant (e.g. Browman and Goldstein 1990). Nevertheless, the notion of t,d deletion has proved useful in both sociolinguistics and speech technology (e.g. Randolph 1989).

Guy and Boberg propose an analysis whereby t,d deletion derives from the OCP (obligatory contour principle), which has been used to ban adjacent identical sequences on an autosegmental tier in languages, whether tones (Leben 1973), segments (McCarthy 1986), or features (Yip 1988). Guy and Boberg characterize /t/ and /d/ as [+coronal, -sonorant, -continuant]. They then characterize the other phones with which /t/ and /d/ might form clusters as in Table 6-1. Following Guy and Boberg, /r/ is marked with a question mark in the feature column. They note that while /r/ has traditionally been considered a coronal, the [“½] that might be found preceding /t/ is perhaps best considered noncoronal (Guy and Boberg 1997, 157).

Guy and Boberg observe that the OCP would favor t,d deletion after segments that have the closest match to /t/ and /d/ featurally. They note that in their data, there were no statistically significant differences among the segments which matched on two features, that is, /n/, the noncoronal stops and the sibilants. We have added the affricates /tS/ and /dZ/ to the sibilant group, as the part of them that is adjacent to the /t/ or /d/ is sibilant. Guy and Boberg noted less deletion on the segments that match /t/ or /d/ on only one feature.

We examined all tautosyllabic coda consonant clusters ending with /t/ and /d/ in the testing data. Our results, as shown in Table 6-1, match those of Guy and Boberg at the gross level of the two-feature matches exhibiting more deletion than the one-feature matches. Interestingly, our speaker exhibited no deletion at all in the one-feature matches, which may have been a product of the somewhat formal speaking style of our recordings.

Table 6-1: t,d deletion by preceding phone

Preceding phone        Features         N     % deleted          % deleted
                                              original speech    neural network
n                      [+cor,-cont]     33    42                 33
p, b, k, g             [-son,-cont]      9    22                 33
s, z, S, Z, tS, dZ     [+cor,-son]      26     8                  4
f, v                   [-son]            4     0                  0
l                      [+cor]           11     0                  0
m, N                   [-cont]           2     0                  0
r                      ?                13     0                  0
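A minimal sketch of the grouping behind Table 6-1 is given below: deletion rates of coda-final /t,d/ by the feature class of the preceding phone, following Guy and Boberg's feature-match classes. The observations in the example are illustrative, not the actual corpus counts.

```python
from collections import defaultdict

# Feature class of each possible preceding phone, as in Table 6-1.
CLASSES = {
    "n": "[+cor,-cont]",
    "p": "[-son,-cont]", "b": "[-son,-cont]", "k": "[-son,-cont]", "g": "[-son,-cont]",
    "s": "[+cor,-son]", "z": "[+cor,-son]", "S": "[+cor,-son]", "Z": "[+cor,-son]",
    "tS": "[+cor,-son]", "dZ": "[+cor,-son]",
    "f": "[-son]", "v": "[-son]", "l": "[+cor]", "m": "[-cont]", "N": "[-cont]", "r": "?",
}

def deletion_rates(observations):
    """observations: iterable of (preceding phone, was /t,d/ deleted?) pairs."""
    totals, deleted = defaultdict(int), defaultdict(int)
    for phone, was_deleted in observations:
        cls = CLASSES[phone]
        totals[cls] += 1
        deleted[cls] += int(was_deleted)
    return {cls: 100.0 * deleted[cls] / totals[cls] for cls in totals}

# Illustrative observations only.
print(deletion_rates([("n", True), ("n", False), ("s", False), ("l", False)]))
```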

Table 6-2 summarizes postlexical neural network performance on flapping lexical /t/. The correct rule applications were all in the classic intervocalic flapping environment before an unstressed vowel (e.g. Kahn 1976). In contrast, the failures to apply flapping were in two cross-word environments: that heavy and brought out (where the /t/ in question is in bold). The incorrect applications of flapping, as far as the original speech was concerned, were as follows: short of, thought of, and janitors (/t/ in question is in bold). In short of, an intonational phrase boundary separated the two words, and the speaker used a glottal stop. In thought of, an intermediate phrase boundary separated the two words and the speaker used an aspirated /t/.

Table 6-2: Postlexical neural network performance at flapping lexical /t/

Correct application of rule   Incorrect application of rule   Failure to apply rule
9                             3                               2

The postlexical neural network’s choice to flap in these phrases does not seem inappropriate, especially considering Nespor and Vogel’s (1986) demonstration that flapping often occurs across intonational phrases in the phonological utterance, the largest domain in the prosodic hierarchy. In janitors, the speaker used a fully released /t/.

Strassel (1997) finds the non-post-tonic environment exemplified by the /t/ in janitors to be variable in both the Switchboard and HUB-4 corpora. Interestingly, our speaker is Northern, and this environment was found by Strassel to be flapped categorically by speakers of dialects other than AAVE (African American Vernacular English), Southern and South Midland.

Glottalization in English occurs in two main environments: as an allophone of voiceless stops, as in button [b«?n`], and as an onset to vowels, as in ask Emma [Qsk ?Em“], which was discussed above. We will now investigate glottalization of /t/. Pierrehumbert and Frisch (1997, 12) found that, contrary to reports indicating that wider following environments were possible, glottalization of /t/ was found before sonorants only. In contrast to their findings with respect to vowel glottalization, they found that whether a syllable was accented did not play a role in /t/ glottalization. Table 6-3 shows the distribution of phones following glottalized /t/’s in both the original speech and the postlexical neural network. In addition to occurring before sonorants, as predicted by Pierrehumbert and Frisch, one glottal stop occurred before [f] in the original speech. It is interesting to note that this glottal stop occurs at the end of an intermediate phrase. The presence of several glottal stops at the ends of utterances provides a further clue to the importance of prosodic boundaries to /t/ glottalization.

Table 6-3: /t/ glottalization by following phone

Following phone   N    Original glottalizations   Neural net glottalizations   Neural network % correct
utterance final   8    2                          6                            0
n, m              5    4                          2                            20
r                 2    2                          2                            100
j, w              4    4                          4                            100
vowels            4    1                          3                            25
â                 1    0                          1                            0
f                 1    1                          0                            0

Table 6-4 presents data on /t/ glottalization by prosodic position, that is, whether the /t/ in question occurs at the end of an intermediate or intonational phrase, or whether it occurs non-phrase-finally. It appears that glottalization surfaces in all three environments.

Table 6-4: /t/ glottalization by prosodic position

N    Prosodic status               Original glottalizations   Neural net glottalizations   Neural net % correct
13   non-phrase-final              9                          10                           46
2    end of intermediate phrase    1                          2                            50
10   end of intonational phrase    4                          6                            0

We will now look more closely at the phrase-final environments, to see which /t/ allophones are most likely to surface there. Figure 6-2 displays /t/ allophony in intermediate phrase final position. Note that in our system, all intonational phrase boundaries are also considered intermediate phrase boundaries. As we can see, deletion and flapping are less likely here, compared with the display of syllable-final /t/ allophony in Figure 6-1. Note that the neural network accuracy percentage refers to the number of correct predictions of an allophone among the predictions of a particular allophone by the neural network. For example, of the 27 aspirated /t/’s predicted by the neural network, 25 were associated with aspirated /t/’s in the original speech.

[Bar chart: occurrences of /t/ allophones (deleted, flap, glottal, unaspirated, aspirated) at intermediate phrase ends in original speech and neural network output; neural network accuracy figures of 93%, 75%, 22%, 0% and 50% are marked above the bars.]

Figure 6-2: /t/ allophones at intermediate phrase ends

Each environment is represented by one bar representing original speech and another bar representing neural network (nn) output. Within each bar, the proportion of occurrences of each allophone is represented. Atop the bar representing the neural network output is placed a percentage reflecting the accuracy of the neural network’s predictions with respect to the original data.

Figure 6-3 shows the distribution of /t/ allophones at intonational phrase ends. Note that the preference for aspirated /t/ closures is even greater than at intermediate phrase ends.

Byrd (1994, 45) examined sentence-final /t/ releases in “that” in one of the sentences spoken by all speakers in the TIMIT corpus and found the following allophone distribution: 23% released, 67% unreleased, and 9% glottalized. Although Byrd did not find a significant effect of dialect on the release of oral stops in general, it is not clear whether there was such an effect just with respect to /t/.

[Bar chart: occurrences of /t/ allophones (deleted, flap, glottal, unaspirated, aspirated) at intonational phrase ends in original speech and neural network output; neural network accuracy figures of 93%, 0%, 0% and 0% are marked above the bars.]

Figure 6-3: /t/ allophones at intonational phrase ends

Each environment is represented by one bar representing original speech and another bar representing neural network (nn) output. Within each bar, the proportion of occurrences of each allophone is represented. Atop the bar representing the neural network output is placed a percentage reflecting the accuracy of the neural network’s predictions with respect to the original data.

In order to investigate Pierrehumbert and Frisch’s (1997) claim that syllable accentuation does not affect /t/ glottalization, we organized the data in Table 6-5 by whether or not the syllable containing the /t/ in question was accented. While a higher proportion of accented syllables are glottalized, Fisher’s exact test showed this not to be significant. It is interesting to note that the neural network’s performance on glottalization was generally poor, except before /r/, /j/ and /w/, where it was 100%.

Table 6-5: /t/ glottalization by syllable accentuation

N    Syllable accented   Original glottalizations   Neural net glottalizations   Neural net % correct
9    yes                 7                          5                            33
16   no                  7                          13                           25
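For reference, the significance test mentioned above can be reproduced from the counts in Table 6-5 (7 of 9 accented /t/'s glottalized, 7 of 16 unaccented); the sketch uses SciPy purely for illustration and is not the software used in the original analysis.

```python
from scipy.stats import fisher_exact

table = [[7, 9 - 7],     # accented:   glottalized, not glottalized
         [7, 16 - 7]]    # unaccented: glottalized, not glottalized
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(round(p_value, 3))  # expect a non-significant p-value, as reported in the text
```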

6.5. Vowel reduction in function words

In this section, we will analyze various examples of vowel reduction in function words. We are considering these phenomena apart from the general processes of allophony discussed above because they appear to be mainly limited to certain classes of words, namely function words. This limitation is true of postlexical reduction; however, many analyses of English lexical phonology (e.g. SPE, Halle and Mohanan 1985) rely on vowel reduction processes that occur throughout the vocabulary.

We believe that some instances of vowel reduction are best analyzed as phonologically-conditioned allomorphy. We will demonstrate that such an analysis best explains variation in the pronunciation of the in our corpus. In addition, we will show how such an analysis may not be appropriate for explaining reductions in other function words, despite their phonological similarity.

Keating et al. (1994) investigate vowel variation in the word the in the multi-speaker TIMIT database. They found that [i] was most common before consonants, while [“] and [Þ] were most common before vowels. Keating et al. (1994, 136) note that [“] and [Þ] are reduced with respect to [i] in duration and quality. They found that some age-grading was involved in the realization, and that younger speakers are failing to make the distinction along vowel length grounds, preferring [“] in all instances, with glottal stop insertion before vowels.

As shown in Figure 6-1, our speaker has maintained the traditional distinction of pronouncing the before consonants with a lax vowel (either [“] or [Þ]) and before vowels with a tense [i]. The postlexical neural network appears to have learned this distinction well, on the basis of a lexical representation of /Di/.

[Bar chart: counts of lax and tense vowel realizations of the in original speech and neural network output, before consonants and before vowels.]

Figure 6-1: the allomorphy

We do not consider variation in the to be a case of allophony, such as those examined in section 6.4, because it involves a process (/i/ → [“] /__C) that is not general across the vocabulary. It might, however, be argued to be part of a general case of vowel reduction in function words. In order to examine this, we will consider other vowel-final function words, to see whether the reduction of final vowels is correlated with the following segment in a similar way to the.

Table 6-1 examines all of the vowel-final function words that occur in the testing portion of the speech corpus for the possibility that the variation between reduced and unreduced vowels is correlated with following consonant or vowel as in the case of the. In this table, neural network correct percentages refer to correctness with respect to reduction; an exact match of reduced vowel was not required. Unless otherwise specified, reduced vowels are [“] and [Þ]. Only two of the words in the table reduce to [“] and [Þ] at all, namely a and to. Since a only occurs in this dialect before a consonant, we will have to look for an explanation for its variation between [e] and [“]/[Þ] elsewhere. Since the only token of to occurring before a vowel surfaced in reduced form, explaining reduction along the lines of the does not seem promising.

Table 6-1: Reduced and unreduced vowel-final function words before consonants and vowels

Word      N    NN % correct   Preceding consonant          Preceding vowel
                              reduced      unreduced       reduced    unreduced
a         19   37             10           9               0          0
be        1    100            0            1               0          0
by        10   100            0            8               0          2
he        2    100            0            2               0          0
I         1    0              1 ([A])      0               0          0
my        2    100            0            1               0          1
through   1    100            0            1               0          0
to        7    29             4            2               1          0
we        1    100            0            1               0          0
you       3    100            0            3               0          0

In order to further investigate the possibility that to reduction might be related to following vowel or consonant, we chose to examine the entire labeled corpus, despite the fact that neural network hypotheses would not be available for all tokens. The results of this examination are in Table 6-2. Interestingly, before vowels, to is reduced half as often as it is unreduced. Before consonants, to is reduced more than twice as much as it is unreduced. On the surface, this behavior appears to be similar to the, though not as categorical. With the, the unreduced [Di] form occurs almost exclusively before vowels, while the reduced forms, [D“] or [DÞ], occur almost exclusively before consonants. Unfortunately, the data is rather skewed in the preceding consonant direction, making statistical analysis somewhat tentative. However, Fisher’s exact test showed the difference between the preceding consonant and preceding vowel cases to be significant at p < .05.

Table 6-2: Reduced and unreduced to before consonants and vowels in complete corpus

N     Preceding consonant          Preceding vowel
      reduced      unreduced       reduced    unreduced
103   66           28              3          6

If we look further into a, one possible pattern emerges: if a occurs at the beginning of an intermediate or intonational phrase, its likelihood of not being reduced is 75%. However, Fisher’s exact test showed that partitioning a into reduced and unreduced sets on the basis of the word’s starting an intermediate or intonational phrase, as shown in Table 6-3, was not quite significant (p = .07). In order to have more data, we examined the complete corpus without regard to neural network hypotheses in Table 6-4. As can be seen, the preference for unreduced a phrase-initially is clearer when more data is considered. In fact, a chi-squared test of independence showed the difference between the phrase-initial and non-phrase-initial categories to be significant at p < .01.

Table 6-3: Reduced and unreduced a depending on phrase initial status in test data

N    Phrase-initial             Non-phrase-initial
     reduced     unreduced      reduced     unreduced
19   2           6              8           3

Table 6-4: Reduced and unreduced a depending on phrase-initial status in complete data

N     Phrase-initial             Non-phrase-initial
      reduced     unreduced      reduced     unreduced
201   17          57             75          52
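The chi-squared test mentioned above can likewise be reproduced from the counts in Table 6-4; SciPy is again used purely for illustration and is not the software used in the original analysis.

```python
from scipy.stats import chi2_contingency

table = [[17, 57],    # phrase-initial:     reduced, unreduced
         [75, 52]]    # non-phrase-initial: reduced, unreduced
chi2, p_value, dof, expected = chi2_contingency(table)
print(round(chi2, 1), p_value < 0.01)   # expect a significant result (p < .01)
```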

It is clear that understanding the segmental and prosodic factors explaining vowel reduction in function words other than the is more complex than it was for the allophonic and allomorphic phenomena we have analyzed previously. It is hoped that by analyzing more data and more complex forms of discourse, a more satisfactory understanding of this problem will emerge.

6.6. Dialect

We will now examine several dialect phenomena learned by the postlexical neural network. In these cases, the postlexical neural network has adapted the Generalized American dialect in the lexical database to the forms it has encountered in the labeled recorded database. Among the phenomena we will investigate are /Ë/ voicing, vowel variation before /r/, the low back merger and stem-final tensing.

The lexical database makes a distinction between voiced and voiceless wh, as in which [ËItS] vs. witch [wItS]; our speaker, however, does not. There were only two cases of orthographic wh in the test data, but the neural network correctly learned to use [w] instead of [Ë] in both instances. While in the past a distinction between /w/ and /Ë/ was widespread in the Northern and Southern United States (Kurath and McDavid 1961), according to the Phonological Atlas of North America,27 it appears to be a recessive feature in most contemporary American dialects.

As is well known, there is a great deal of variation in vowels before /r/ in American English. As was discussed in section 3.1, we chose to represent distinctions rather than mergers in such cases, since mergers would be easier to derive from distinctions than vice versa given only phonological information. For example, Lexorola lists marry /mQri/, Mary /meri/ and merry /mEri/. Now, these distinctions are only common in the Eastern United States (e.g. Hartman 1985), somewhat correlated with currently or formerly r-less areas (Miller 1993).

Table 6-1 shows neural network performance on three vowels before /r/ where the lexicon differs from the original speech. Due to the categorical nature of these differences, neural network performance is generally good. The one error in merging a lexical /Qr/ sequence was in the word arrow, which was also glottalized. It should be noted that the lexical database does not encode the distinction between morning /m‚rnIN/ and mourning /mornIN/, since that distinction is rather recessive in most regions of the United States (Wells 1982, 161). The choice between lexical /or/ and /‚r/ might be seen as fairly arbitrary, though /or/ would be preferable for dialects that do not otherwise need /‚r/ due to the A/‚ merger.

27 www.ling.upenn.edu/phono_atlas/maps/Map8.html.

Table 6-1: Vowel mergers before /r/

Lexical   N    % original /Er/   % neural network /Er/
/Qr/      3    100               67
/er/      5    100               100

Lexical   N    % original /‚r/   % neural network /‚r/
/or/      18   100               100

We have discussed the low back merger in American English in words like caught and cot with respect to both the lexical database in section 3.1, and the allophony experiments in Chapter 4. In the testing subset of the postlexical neural network data, there were only two cases of mismatch between /A/ and /‚/. In both cases, frolic and swan, the lexical pronunciation had /‚/ and the original postlexical pronunciation had /A/. The postlexical neural network did not successfully learn to imitate the original pronunciation in either case. As mentioned in section 3.1, consistency and reality with respect to varying dialects is very hard to achieve in the lexical transcription of the low back vowels. However, we believe that a future approach in which the lexicon actually merges the low back vowels may be promising, at least in dialects like the present one where the difference between /A/ and /‚/ has an allophonic character, as shown in the experiments of Chapter 4.

Halle and Mohanan (1985, 59) discuss the phenomenon of stem-final tensing, where nonlow vowels become tense without simultaneous diphthongization or lengthening. They provide examples where underlying /I/ becomes [i] at the ends of various constituents: city (word finally), cities (before inflection), city hall (stem-finally in compounds), and happiness (before some stress-neutral suffixes). Halle and Mohanan note that the presence of stem-final tensing in each of these environments is dialect-dependent, though they do not identify the dialects with which the environments are associated. We have noted a generalization of this process in our speaker in certain nonneutral prefixes (cf. discussion in Selkirk 1984, Chapter 3; Kreidler 1989, 227-230).

Table 6-2 shows all words with nonneutral be-, pre- and re- prefixes in the original data. Lexical and postlexical transcriptions are provided, with postlexical neural network hypotheses where available. As can be seen, in the majority of cases, lexical schwa becomes tense [i] in this environment. Although we believe this to be a dialect phenomenon, it may also be influenced by the formal style of the recordings. In addition, many of these words are somewhat rare, and, as Fidelholtz (1975) has shown, the likelihood of schwa in content words increases with their familiarity.

It is also possible that the speaker is forming an analogy on the neutral, productive re- prefix, as in reconnect, refreeze, etc., which is usually pronounced [ri] in dialects where the nonneutral prefixes in Table 6-2 are pronounced with [“].28 Though we have scant data, it is interesting to note that stem-final tensing was learned by the neural network in the case of beheld, though not in resumed.

28 Dialects that distinguish between the two re- prefixes would have minimal pairs in words like recall ‘remember’ vs. recall ‘take back’ and recount ‘tell’ vs. recount ‘count again’.

Table 6-2: Stem-final tensing in prefixes

orthography    lexical representation    original speech    neural network
bedraggled     b“                        bi
began          b“                        bi
beheld         b“                        bi                 bi
bemoaned       b“                        bi
bemused        b“                        bi
defense29      d“                        di
preceding      pr“                       pri
recalled       r“                        ri
received       r“                        ri
recording      r“                        ri
rejected       r“                        ri
repelled       r“                        ri
resources      r“                        ri
respected      r“                        ri
respectful     r“                        ri
restores       r“                        ri
resumed        r“                        ri                 rÞ
revenge        r“                        ri
revised        r“                        ri
revived        r“                        ri
revolve        r“                        ri
besieged       b“                        b“
precocious     pr“                       prÞ
preferred      pr“                       p“½
preparing      pr“                       p“½
reduce         r“                        rÞ
removed        r“                        rÞ
repaired       r“                        rÞ
reportedly     r“                        rÞ
responsible    r“                        rÞ

Note: The orthography column gives the full word; the other columns transcribe only the prefix. Neural network data are available only for the testing subset of the data.

29 According to Merriam-Webster (1993), when this word is used as an antonym of offense, it is often pronounced [°difEns].

Chapter 7. Conclusion

In this dissertation, we have presented a methodology for attaining appropriate speaker-specific context-dependent pronunciations for speech synthesis. The method is applicable in an economical fashion to a wide range of dialects, due to the use of a dialect-generalized dictionary and automatic training procedures requiring labeled speech corpora from individual speakers.

In contrast to pronunciation modeling work in speech recognition, we have emphasized the importance of modeling individual variation in an effort to provide a synthetic voice that is as natural as possible. Aiming to model individual variation has not always been considered a worthwhile or even a possible endeavor. According to Labov:

The construction of complete grammars for “idiolects”, even one’s own, is a fruitless and unrewarding task; we now know enough about language in its social context to realize that the grammar of the speech community is more regular and systematic than the behavior of any one individual. Unless the individual speech pattern is studied within the overall system of the community, it will appear as a mosaic of unaccountable and sporadic variation. (Labov 1972, 124)

In contrast to this view, we hope to have shown that an individual's phonological grammar can indeed be profitably modeled. Of course, we have gained a great deal by being able to compare and contrast our results with speech community studies.

We have introduced a postlexical neural network, capable of learning postlexical processes in symbolic terms. This network mediates between a generalized phonemic dictionary and a phonetic implementation network, which is responsible for generating acoustic parameters that will be synthesized into speech. In Chapter 4, we discussed a number of acoustic experiments designed to assess the extent to which some of the kinds of allophony learned by the postlexical network could actually be learned directly by the phonetic implementation network, that is, without symbolic mediation.

Since the results of the acoustic experiments were promising, in the sense that allophony in reduced vowels and /u/ could be learned without symbolic mediation, albeit not as well as with it, it is worth considering what prospects exist for future speech synthesis systems that aim to recreate individual variation with the kind of fidelity for which we have aimed here.

One of the major labor bottlenecks for the approach described here is the hand-labeling of speech corpora. This task is time-consuming, often tedious, and requires substantial training and expertise. In addition, attempts to divide the labor among several individuals often result in high degrees of labeler disagreement (Fulop and Keating 1996, Lander et al. 1995), which must be accounted for and dealt with appropriately.
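A simple way to quantify such disagreement is per-segment agreement between two labelers over time-aligned transcriptions. The sketch below is a minimal illustration with invented label sequences; it is not the procedure used in the studies cited above.

# Minimal sketch of inter-labeler agreement over aligned phone labels.
def percent_agreement(labels_a, labels_b):
    """Per-segment agreement between two aligned label sequences."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)

labeler_1 = ["dh", "ax", "k", "ae", "t"]
labeler_2 = ["dh", "ah", "k", "ae", "t"]   # disagreement on the reduced vowel
print(f"agreement: {percent_agreement(labeler_1, labeler_2):.1f}%")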

As an alternative to hand-labeling, several researchers have described automatic labeling techniques using automatic speech recognition (e.g. Wightman and Talkin 1997). Such techniques offer the promise of fast labeling of large amounts of data, at the possible expense of labeling precision. It is possible that the greater amount of data permitted by automatic labeling might compensate for any loss in accuracy. However, there are certain speech events, such as the closure and release of stops, whose tolerance for inaccuracy may be minimal.

One of the ways in which accuracy in automatic labeling is increased, and indeed the usual way in which segmental divisions are obtained, is by “force-aligning” the speech with a transcription of it. That transcription may originally be in orthographic form, in which case a phone transcription is obtained from a dictionary, from a letter-to-sound system, or by hand. In this situation, a speech-recognition dictionary that potentially offers several sociolinguistically variant pronunciations for words (as opposed to part-of-speech or semantic homographs; consider the discussion in section 3.1) may be available.
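Schematically, such a dictionary expands an orthographic transcript into a set of candidate phone sequences, which the aligner then scores against the speech. The following sketch uses a toy dictionary and hypothetical names; it is intended only to illustrate the expansion step under these assumptions, not the workings of any particular alignment system.

# Toy expansion of an orthographic transcript into candidate phone sequences.
from itertools import product

pronouncing_dict = {
    "marry": [["m", "Q", "r", "i"], ["m", "E", "r", "i"]],   # two variants
    "me":    [["m", "i"]],
}

def candidate_pronunciations(words):
    """Cartesian product of per-word variants: one candidate phone
    sequence per combination, to be scored by the aligner."""
    per_word = [pronouncing_dict[w] for w in words]
    for combo in product(*per_word):
        yield [phone for pron in combo for phone in pron]

for cand in candidate_pronunciations(["marry", "me"]):
    print(" ".join(cand))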

Let us consider the different possibilities for a speech recognition dictionary to be used in conjunction with an automatic transcription system. First, we will consider the implications of dictionary selection on the standard speech recognition problem, where the transcription of the utterance is unavailable. Riley and Ljolje (1996) contrast three different lexical representation systems with respect to the performance achieved on a speech recognition task. As shown in Table 7-1, Riley and Ljolje demonstrated that using multiple phonetic pronunciations with associated likelihoods (derived from corpus analysis or alignment techniques such as those discussed above) can result in improved word accuracy.

Table 7-1: Comparison of lexical representation system with speech recognizer performance

lexical system                        percentage word accuracy on Resource Management task
single phonemic                       93.4
single phonetic                       94.1
multiple phonetic with likelihoods    96.3

Source: Riley and Ljolje (1996).

Consider first the situation where a dictionary with a single phonemic transcription (similar to the synthesis dictionary described in section 3.1) is used. In this case, it is probable that mismatches will occur between the pronunciation used in the dictionary and the pronunciation used by the speaker. For example, a speaker might pronounce marry [mEri], while the dictionary has /mQri/. As a result, the automatically transcribed database will have a wider array of sounds labeled [Q] than might occur in a hand-labeled database. Acoustic models for synthesis generated from such an automatically labeled database will consequently have a more diffuse model of [Q].
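A toy numerical example makes the point: pooling tokens of [E] under the label [Q] inflates the spread of the resulting acoustic model relative to a model estimated from genuine [Q] tokens alone. The formant values below are invented for illustration and carry no empirical weight.

# Invented F1 values (Hz) showing how mislabeled tokens diffuse a model.
from statistics import pstdev

f1_true_Q = [660, 680, 700, 690]        # hypothetical F1 values for real [Q]
f1_E_mislabeled = [530, 550, 540]       # [E] tokens force-labeled as [Q]

print("sd of [Q] alone:     ", round(pstdev(f1_true_Q), 1))
print("sd of pooled 'model':", round(pstdev(f1_true_Q + f1_E_mislabeled), 1))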

Now, it is possible that the regularity of the [Q]/[E] "allophony" will be learnable and accurately reproduced by the phonetic implementation neural network or its equivalent. Indeed, we showed in Chapter 4 that our phonetic implementation network can learn allophonic distinctions without symbolic mediation. A speech synthesis system using an automatically transcribed (or hand-labeled) database at a dialect-generalized phonemic level would thus be able to dispense with a postlexical processor as described in this dissertation.

However, as we showed in Chapter 4, use of allophonic symbols results in increased fidelity between the original speech and that generated by the phonetic implementation network. This fact, coupled with the improved recognition results using allophonic symbols shown by Riley and Ljolje (1996), indicates that use of allophonic symbols may result in better automatic labeling as well as better phonetic models for synthesis.

Such a system would require two separate dictionaries, one containing multiple pronunciations for use with the automatic transcription system, and one containing dialect-generalized phonemic pronunciations for speech synthesis. A postlexical processor would be necessary to mediate between the synthesis dictionary and the allophonic transcription labels, much as in the current system. Rather than training the postlexical processor on hand-labeled speech as in the current system, it could be trained on aligned allophonic transcriptions from the automatic transcriber and dialect-generalized phonemic transcriptions from the synthesis dictionary.
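The required training pairs could be obtained by aligning the two symbol strings with standard dynamic-programming sequence comparison (cf. Kruskal 1983). The sketch below, with simplified symbols and unit costs, illustrates one way such an alignment might be computed; it is a minimal sketch under those assumptions, not the alignment procedure used elsewhere in this dissertation.

# Needleman-Wunsch-style alignment of a phonemic and an allophonic string.
def align(phonemic, allophonic, gap="-"):
    """Align two symbol sequences with unit insertion/deletion/substitution costs."""
    n, m = len(phonemic), len(allophonic)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i-1][j-1] + (phonemic[i-1] != allophonic[j-1])
            cost[i][j] = min(sub, cost[i-1][j] + 1, cost[i][j-1] + 1)
    # Trace back to recover the aligned symbol pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i-1][j-1] + (phonemic[i-1] != allophonic[j-1]):
            pairs.append((phonemic[i-1], allophonic[j-1])); i, j = i-1, j-1
        elif i > 0 and cost[i][j] == cost[i-1][j] + 1:
            pairs.append((phonemic[i-1], gap)); i -= 1
        else:
            pairs.append((gap, allophonic[j-1])); j -= 1
    return list(reversed(pairs))

# "city": phonemic /s I t i/ paired with an allophonic rendering with a flap.
print(align(["s", "I", "t", "i"], ["s", "I", "dx", "i"]))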

Riley and Ljolje (1996) actually describe two methods for generating multiple phonetic transcriptions. One involves training decision trees on the hand-labeled TIMIT corpus (6300 sentences), and the other involves using the same procedure on 37,000 automatically transcribed sentences from the North American Business (NAB) task. It was hoped that the allophonic information from the larger NAB database might prove of greater value for the recognition dictionary, despite any loss of accuracy accrued from use of an automatic method. In fact, word accuracy for the NAB task using pronunciation networks generated from automatic transcription was 0.8% better than that achieved using the TIMIT-based networks. Such results provide a promising indication that future postlexical networks could be trained using automatically transcribed data.

Speech recognition and synthesis applications attempt to simulate the human activities of listening and speaking. We believe that successful speech applications must strive for human-like competency in order to be accepted by humans. We also believe that human-like competency cannot be attained without adopting human-like methods in the conversion between text and linguistic representation.

We have explored the different levels at which linguistic information can be provided to a speech synthesizer, including the lexicon, speech labeling and a postlexical module. We hope to have presented a procedure for intelligently allocating linguistic resources at both the symbolic level and the level of acoustic-phonetic realization. We hope that this procedure results in speech that is considered natural and acceptable by listeners, in addition to being comprehensible when employed in situations where the message content is complex.

Appendix

Table A-1: TIMIT/IPA Correspondences

TIMIT  IPA    TIMIT  IPA    TIMIT  IPA    TIMIT  IPA
p      p      th     T      hh     h      ao     ‚
t      t      dh     D      w      w      ow     o
k      k      z      z      ah     «      uw     u
b      b      jh     dZ     ae     Q      uh     U
d      d      ch     tS     aa     A      ay     aI
g      g      l      l      eh     E      oy     ‚I
y      j      r      r      ey     e      aw     aU
v      v      m      m      ih     I      sh     S
s      s      n      n      iy     i      ax     “
en     n`     el     l`

References

Adamson, M., and Robert I. Damper. 1996. A recurrent network that learns to pronounce English text. ICSLP.

Ainsworth, W., and N. Warren. 1992. Applications of multilayer perceptrons in text-to-speech synthesis systems. Neural networks for vision, speech, and natural language, ed. R. Linggard, D. J. Meyers, and C. Nightingale, 256-288. London: Chapman & Hall.

Albano, Eleonora Cavalcante, and Agnaldo Antonio Moreira. 1996. Archisegment-based letter-to-phone conversion for concatenative speech synthesis in Portuguese. ICSLP.

Allen, Jonathan, Sheri Hunnicutt, and Dennis Klatt. 1987. From text to speech: The MITalk system. Cambridge: Cambridge University Press.

Alleva, F. and K. F. Lee. 1989. Automatic new word acquisition: Spelling from acoustics. Proceedings of the DARPA speech and natural language workshop.

Archangeli, D. 1988. Aspects of underspecification theory. Phonology 5:183-207.

Barry, W. J., and A. J. Fourcin. 1992. Levels of labeling. Computer Speech and Language 6:1-14.

Baum, L. E. 1972. An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process. Inequalities 3:1-8.

Beckman, Mary, and Gayle Ayers Elam. 1997. Guidelines for ToBI Labeling, version 3. Manuscript.

Bell, Alan. 1984. Language style as audience design. Language in Society 13: 145-204.

Bird, Steven. 1995. Computational phonology: A constraint-based approach. Cambridge: Cambridge University Press.

Bladon, Anthony, Rolf Carlson, Björn Granström, Sheri Hunnicutt, and Inger Karlsson. 1987. A text-to-speech system for British English and issues of dialect and style. European conference on speech technology, ed. J. Laver and M. Jack, volume 1.

Blaauw, Eleonora. 1994. The contribution of prosodic boundary markers to the perceptual difference between read and spontaneous speech. Speech Communication 14: 359-375.

Blevins, Juliette. 1995. The syllable in phonological theory. In The handbook of phonological theory, ed. John A. Goldsmith. Cambridge, Mass.: Blackwell.

Bolinger, Dwight. 1981. Two kinds of vowels, two kinds of rhythm. Manuscript reproduced by the Indiana University Linguistics Club.

Booij, G., and J. Rubach. 1987. Postcyclic vs. postlexical rules in lexical phonology. Linguistic Inquiry 18.

Bradlow, Ann R., Gina M. Torretta and David B. Pisoni. 1996. Intelligibility of normal speech I: Global and fine-grained acoustic-phonetic talker characteristics. Speech Communication 20:255-272.

Breiman, Leo, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. 1993. Classification and regression trees. New York: Chapman & Hall.

Broe, Michael. 1993. Specification theory: The treatment of redundancy in generative phonology. Ph.D. diss., University of Edinburgh.

Broe, Michael. 1996. A generalized information-theoretic measure for systems of phonological classification and recognition. ACL Workshop on Computational Phonology in Speech Technology, 17-24.

Broe, Michael. 1997. Stochastic constraints in phonological analysis. CLS monthly talk, University of Chicago.

Bronstein, Arthur J. 1960. The pronunciation of American English. New York: Appleton-Century-Crofts.

Browman, Catherine P., and Louis Goldstein. 1986. Towards an articulatory phonology. In Phonology Yearbook 3, ed. C. Ewen and J. Anderson, 219-252. Cambridge: Cambridge University Press.

Browman, Catherine P., and Louis Goldstein. 1990. Tiers in articulatory phonology, with some implications for casual speech. In Papers in Laboratory Phonology I, ed. John Kingston and Mary E. Beckman, 341-376. Cambridge: Cambridge University Press.

Browman, Catherine P., and Louis Goldstein. 1992. “Targetless” schwa: an articulatory analysis. In Gesture, Segment, Prosody, Papers in laboratory phonology II, ed. Gerard J. Docherty and D. Robert Ladd, 26-56. Cambridge: Cambridge University Press.

Byrd, Dani. 1994. Relations of sex and dialect to reduction. Speech Communication 15: 39-54.

Byrne, B., M. Finke, S. Khudanpur, J. McDonough, H. Nock, M. Riley, M. Saraclar, C. Wooters, and G. Zavaliagkos. 1997. Pronunciation modeling for conversational speech recognition: A status report from WS97. In 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, ed. Sadaoki Furui, B.-H. Juang, and Wu Chou, 26-33. Piscataway, N.J.: IEEE.

Charniak, Eugene. 1993. Statistical language learning. Cambridge, Mass.: MIT Press.

Chomsky, Noam. 1965. Aspects of the theory of syntax. Cambridge, Mass.: MIT Press.

Chomsky, Noam. 1981. Principles and parameters in syntactic theory. In Explanation in Linguistics: The Logical Problem of Language Acquisition, ed. Norbert Hornstein and David Lightfoot, 32-75. London: Longman.

Chomsky, Noam, and Morris Halle. 1968. The sound pattern of English. Cambridge, Mass.: MIT Press.

Church, Kenneth W. 1983. Phrase-structure parsing: a method for taking advantage of allophonic constraints. Ph.D. diss., MIT, 1983. Distributed by Indiana University Linguistics Club. Also published as Phonological parsing in speech recognition. 1987. Boston: Kluwer Academic Publishers.

Church, Kenneth W. 1986. Stress assignment in letter to sound rules for speech synthesis. ICASSP 86.2423-2426.

Church, Kenneth W. 1989. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. IEEE.

Clements, George N. 1976. Palatalization: linking or assimilation? CLS 12, 96-109.

Clements, George N. 1985. The geometry of phonological features. Phonology Yearbook 2.225-252.

Clements, George N., and Elizabeth V. Hume. 1995. The internal organization of speech sounds. In The handbook of phonological theory, ed. John A. Goldsmith, 245-306. Cambridge, MA: Blackwell.

Cohen, Andrew. 1995. Developing a non-symbolic phonetic notation for speech synthesis. Computational Linguistics 21:567-575.

Cohen, Andrew. 1997. Embedding prior phonetic knowledge in neural network based text-to-phonetics conversion. Technical report, University of Reading.

Cohen, Michael Harris. 1989. Phonological structures for speech recognition. Ph.D. diss., University of California, Berkeley.

Coile, Bert Van. 1990. Inductive Learning of Grapheme-to-Phoneme rules. ICSLP.19.1.1-19.1.4.

Coile, Bert Van. 1991. Inductive learning of pronunciation rules with the DEPES system. IEEE.

Coker, Cecil H., Kenneth. W. Church, and Mark Liberman. 1990. Morphology and rhyming: Two powerful alternatives to letter-to-sound rules for speech synthesis. Coling.

Coleman, John. 1993. English word-stress in unification-based grammar. In Computational Phonology, Edinburgh Working Papers in Cognitive Science 8, ed. T. Mark Ellison and James Scobbie, 97-106.

Coleman, John. 1995a. Phonology and computational linguistics— a personal overview. In Linguistics and Computation, ed. Jennifer Cole, Georgia M. Green and Jerry L. Morgan, 223-254. Stanford: CSLI Publications.

Coleman, John. 1995b. Declarative lexical phonology. In Frontiers of Phonology, ed. Jacques Durand and Francis Katamba, 333-382. London: Longman.

Coleman, John, and Janet Pierrehumbert. 1997. Stochastic phonological grammars and acceptability. Third Meeting of the ACL Special Interest Group in Computational Phonology, 49-56.

Coltheart, M. 1978. Lexical access in simple reading tasks. Strategies of Information Processing, ed. G. Underwood. New York: Academic.

Corrigan, Gerald. 1996. Speaker understandability as a function of prosodic parameters. Ph.D. diss., Northwestern University.

Corrigan, Gerald, Noel Massey and Orhan Karaali. 1997. Generating Segment Durations in a Text-to- Speech System: A Hybrid Rule-Based/Neural Network Approach. Eurospeech ‘97.

Covington, Michael A. 1996. An algorithm to align words for historical comparison. Computational Linguistics 22:481-496.

Cremelie, Nick, and Jean-Pierre Martens. 1996. Generation of word pronunciation networks from automatically learned inter-word coarticulation rules. Proceedings of ProRISC/IEEE Workshop on Circuits, Systems and Signal Processing, 89-94.

Daelemans, Walter, Steven Gillis, and Gert Durieux. 1994. The acquisition of stress: a data-oriented approach. Computational Linguistics 20:421-451.

Daelemans, Walter M. P., and Antal P. J. Van den Bosch. 1997. Language independent data-oriented grapheme-to-phoneme conversion. In Progress in speech synthesis, ed. Jan P. H. van Santen, Richard W. Sproat, Joseph P. Olive, and Julia Hirschberg, 77-89. New York: Springer.

Damper, Robert I. 1995. Self-learning and connectionist approaches to text-phoneme conversion. Connectionist models of memory and language, ed. by J. Levy, D. Bairaktaris and P. Cairns, 117-144. London: UCL Press.

Dedina, M. J., and Howard C. Nusbaum. 1991. PRONOUNCE: A program for pronunciation by analogy. Computer Speech and Language 5:55-64.

Delattre, Pierre. 1968. Le jeu de l’e instable intérieur en français. In Studies in French and comparative phonetics, 17-22. The Hague: Mouton.

Deshmukh, Neeraj, Mary Weber, and Joseph Picone. 1996. Automated generation of N-best pronunciations of proper nouns. ICSLP.

Dilley, L., S. Shattuck-Hufnagel, and M. Ostendorf. 1996. Glottalization of word-initial vowels as a function of prosodic structure. Journal of Phonetics 24:423-444.

Divay, Michel, and Anthony J. Vitale. 1997. Algorithms for grapheme-phoneme translation for English and French: Applications. Computational Linguistics 23: 495-524.

Dresher, B. Elan, and Jonathan D. Kaye. 1990. A computational learning model for metrical phonology. Cognition 34: 137-195.

Dutoit, Thierry. 1997. An introduction to text-to-speech synthesis. Dordrecht: Kluwer.

Egan, James P. 1944. Articulation testing methods, II. OSRD Report No. 3802.

Egan, James P. 1948. Articulation testing methods. Laryngoscope 58: 955-991.

Ellison, T. Mark. 1994. Phonological derivation in optimality theory. Coling 94.1007-1113.

Elman, Jeffrey L. 1990. Finding structure in time. Cognitive Science 14:179-211.

Elman, Jeffrey L. and James L. McClelland. 1986. Exploiting lawful variability in the speech wave. In Invariance and variability in speech processing, ed. J. S. Perkell and D. H. Klatt, 360-385. Hillsdale, N.J.: Erlbaum.

Elovitz, H.S., R. Johnson, A. McHugh, and J.E. Shore. 1976. Letter-to-sound rules for automatic translation of English text to phonetics. IEEE ASSP-24.446-459.

Eskénazi, Maxine. 1992. Changing speech styles: strategies in read speech and casual and careful spontaneous speech. ICSLP 92, 755-758.

Faust, Lioba. 1995. What do spectral and perceptual analyses reveal about spontaneous speech in dialogues of different style? ICPhS 95: Vol. 4, 236-239.

Fisher, William M., and Ira J. Hirsh. 1976. Intervocalic flapping in English. CLS, 183-198.

Fisher, W. M., V. Zue, J. Bernstein, and D. S. Pallett. 1987. An acoustic-phonetic database. Journal of the Acoustical Society of America, Supplement 1: 81.

Fitt, Susan. 1995. The pronunciation of unfamiliar native and non-native town names. Eurospeech 95, 2227-2230.

Fitt, Susan. 1997. The generation of regional pronunciations of English for speech synthesis. Eurospeech 97, 2447-2450.

Fosler, Eric, Mitch Weintraub, Steven Wegmann, Yu-Hung Kao, Sanjeev Khudanpur, Charles Galles, and Murat Saraclar. 1996. Automatic learning of word pronunciation from data. ICSLP ‘96.

Fulop, Sean A., and Patricia A. Keating. 1996. Pronunciation variability in the Switchboard corpus. Poster presented at the Third Joint Meeting of the Acoustical Society of America and the Acoustical Society of Japan.

Gaskell, M. Gareth, Mary Hare, and William D. Marslen-Wilson. 1995. A connectionist model of phonological representation in speech perception. Cognitive Science 19:407-439.

Gerson, Ira, Noel Massey, Gerald Corrigan, and Orhan Karaali. 1996. Neural network speech synthesis. Sixth Australian International Conference on Speech Science and Technology.

Giegerich, Heinz J. 1992. English phonology: an introduction. Cambridge: Cambridge University Press.

Gildea, Daniel, and Daniel Jurafsky. 1996. Learning bias and phonological-rule induction. Computational Linguistics 22:497-530.

Glushko, R. J. 1981. Principles for pronouncing print: The psychology of phonography. Interactive processes in reading, ed. by C. A. Perfetti. Hillsdale, N.J.: Lawrence Erlbaum.

Godfrey, John J., Edward C. Holliman, and Jane McDaniel. 1992. SWITCHBOARD: Telephone speech corpus for research and development. ICASSP ’92, I-517-I-520.

Golding, Andrew. 1991. Pronouncing names by a combination of rule-based and case-based reasoning. Ph.D. diss., Stanford University.

Goldsmith, John. 1990. Autosegmental and metrical phonology. Oxford: Basil Blackwell.

Goldsmith, John. 1992. Local modeling in phonology. Connectionism: theory and practice, ed. Steven Davis, 229-246. New York: Oxford University Press.

Goldsmith, John. 1993. Harmonic Phonology. In The last phonological rule, ed. John Goldsmith, 21-60. Chicago: University of Chicago Press.

Golston, Chris. 1996. Direct optimality theory: Representation as pure markedness. Language 72:713-748.

Greenberg, Steven, Joy Hollenback, and Dan Ellis. 1996. Insights into spoken language gleaned from phonetic transcriptions of the Switchboard corpus. ICSLP 96.

Grimes, Joseph E. 1988. Information dependencies in lexical subentries. In Relational models of the lexicon, ed. Martha Walton Evens. Cambridge, England: Cambridge University Press.

Gupta, Prahlad, and David S. Touretzky. 1994. Connectionist models and linguistic theory: investigations of stress systems in language. Cognitive Science 18: 1-50.

Guy, Gregory. 1980. Variation in the group and the individual: the case of final stop deletion. In Locating language in time and space, ed. William Labov, 1-36. New York: Academic Press.

Guy, Gregory. 1991. Explanation in variable phonology: An exponential model of morphological constraints. Language Variation and Change 3:1-22.

Guy, Gregory, and Charles Boberg. 1997. Inherent variability and the obligatory contour principle. Language Variation and Change 9:149-164.

Haegeman, Liliane. 1991. Introduction to government and binding theory. Oxford: Blackwell.

Halle, Morris, and K.P. Mohanan. 1985. Segmental phonology of modern English. Linguistic Inquiry 16: 57-116.

Hammond, Michael. 1995. Syllable parsing in English and French. ROA-58-0000.

Hammond, Michael. 1997. Vowel quantity and syllabification in English. Language 73:1-17.

Hare, Mary. 1990. The role of similarity in Hungarian vowel harmony: a connectionist account. Connection Science. 2: 123-150.

Harris, John, and Geoff Lindsey. 1995. The elements of phonological representation. Frontiers of Phonology: atoms, structure, derivations, ed. Jacques Durand and Francis Katamba, 34-79. London: Longman.

Hartman, James W. 1985. Guide to pronunciation. In Dictionary of American Regional English, Vol. 1, ed. Frederic G. Cassidy, xli-lxi. Cambridge, Mass.: Belknap/Harvard.

Hayes, Bruce. 1982. Extrametricality and English stress. Linguistic Inquiry 13:227-276.

Hayes, Bruce. 1990. Precompiled phrasal phonology. In The phonology-syntax connection, ed. Sharon Inkelas and Draga Zec, 85-108. Chicago: University of Chicago Press.

Hayes, Bruce. 1995. Metrical stress theory: principles and case studies. Chicago: University of Chicago Press.

Herold, Ruth. 1990. Mechanisms of merger: the implementation and distribution of the low back merger in eastern Pennsylvania. Ph.D. diss., University of Pennsylvania.

Hertz, Susan, and Marie K. Huffman. 1992. A nucleus-based timing model applied to multi-dialect speech synthesis by rule. ICSLP ’92, 1171-1174.

Hetherington, Irvine Lee. 1995. A characterization of the problem of new, out-of-vocabulary words in continuous-speech recognition and understanding. Ph.D. diss., MIT.

Hieronymus, James L. 1993. ASCII Phonetic Symbols for the World’s Languages: Worldbet. Bell Laboratories manuscript.

Hochberg, J., S. Mniszewski, T. Calleja, and A. Papcun. 1991. A default hierarchy for pronouncing English. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13.957-964.

Hock, Hans Heinrich. 1986. Principles of historical linguistics. Berlin: Mouton de Gruyter.

Hopcroft, John E., and Jeffrey D. Ullman. 1979. Introduction to automata theory, languages, and computation. Reading, MA: Addison-Wesley.

House, A. S., C. E. Williams, M. H. L. Hecker and K. D. Kryter. 1965. Articulation testing methods: consonantal differentiation with a closed response set. Journal of the Acoustical Society of America 37: 158-166.

Huggins, A. W. F., and Yogen Patel. 1996. The use of shibboleth words for automatically classifying speakers by dialect. ICSLP.

Hume, Elizabeth V. 1992. Front vowels, coronal consonants and their interaction in nonlinear phonology. Ph.D. diss., Cornell University.

Hunnicutt, Sheri, Helen Meng, Stephanie Seneff, and Victor Zue. 1993. Reversible letter-to-sound sound-to-letter generation based on parsing word morphology. Eurospeech, 1993.

IEEE. 1969. IEEE Recommended practice for speech quality measurements.

Jakobson, Roman, C. G. M. Fant, and Morris Halle. 1952. Preliminaries to speech analysis: the distinctive features and their correlates. Tech. Rep. No. 13, Acoustics Laboratory, MIT.

Jannedy, Stefanie, and Bernd Möbius. 1997. Name pronunciation in German text-to-speech synthesis. ANLP ’97, 49-56.

Jenkins, J. J., and L. D. Franklin. 1981. Recall of passages of synthetic speech. Paper presented at the Psychonomics Society meeting, November.

Jensen, John T. 1993. English phonology. Amsterdam: John Benjamins.

Jiang, Li, Hsiao-Wuen Hon and Xuedong Huang. 1997. Improvements on a trainable letter-to-sound converter. Eurospeech ‘97, 605-608.

Jordan, M. 1986. Serial order: a parallel distributed processing approach. ICS Report No. 8604. UC San Diego.

Junqua, Jean-Claude, and Jean-Paul Haton. 1996. Robustness in automatic speech recognition: fundamentals and applications. Boston: Kluwer.

Kahn, Daniel. 1976. Syllable-based generalizations in English phonology. Ph.D. diss., MIT.

Kaisse, Ellen. 1985. Connected speech. San Diego: Academic Press.

Kaisse, Ellen. 1990. Toward a typology of postlexical rules. In The phonology-syntax connection, ed. Sharon Inkelas and Draga Zec, 127-143. Chicago: University of Chicago Press.

Kaplan, Ronald M., and Martin Kay. 1994. Regular models of phonological rule systems. Computational Linguistics 20:331-378.

Karaali, Orhan. 1989. A very high performance neural network system architecture using grouped weight quantization. Ph.D. diss., Florida Atlantic University.

Karaali, Orhan. 1994. Fast random training method for recurrent backpropagation neural networks. Motorola Technical Developments 21:48-49.

Karaali, Orhan, Gerald Corrigan, and Ira Gerson. 1996. Speech Synthesis with Neural Networks. World Conference on Neural Networks.

Karaali, Orhan, Gerald Corrigan, Ira Gerson and Noel Massey. 1997. Text-to-Speech Conversion with Neural Networks: A Recurrent TDNN Approach. Eurospeech ‘97.

Karaali, Orhan, Gerald Corrigan, Ira Gerson, Noel Massey, Corey Miller, Otto Schnurr and Andrew Mackie. Forthcoming. Application of multiple neural networks for high quality text-to-speech synthesis.

Karaali, Orhan, Gerald Corrigan, Noel Massey, Corey Miller, Otto Schnurr, and Andrew Mackie. 1998. A high quality text-to-speech system composed of multiple neural networks. ICASSP.

Kaye, Jonathan. 1989. Phonology: A Cognitive View. Hillsdale, N.J.: Lawrence Erlbaum.

Keating, Patricia A., Dani Byrd, Edward Flemming, and Yuichi Todaka. 1994. Phonetic analyses of word and segment variation using the TIMIT corpus of American English. Speech Communication 14:131-142.

Kenstowicz, Michael. 1994. Phonology in generative grammar. Cambridge, MA: Blackwell.

Kenyon, John Samuel, and Thomas Albert Knott. 1953. A pronouncing dictionary of American English. Springfield, Mass.: G. & C. Merriam Company.

Kintsch, W., and T. A. van Dijk. 1978. Toward a model of text comprehension and production. Psychological Review 85:363-394.

Kiparsky, Paul. 1979. Metrical structure assignment is cyclic. Linguistic Inquiry 10: 421-441.

Kiparsky, Paul. 1985. Some consequences of lexical phonology. Phonology Yearbook 2: 85-138.

Kiparsky, Paul. 1995. The phonological basis of sound change. In The handbook of phonological theory, ed. by John A. Goldsmith, 640-670. Cambridge, Mass.: Blackwell.

Klatt, Dennis. 1980. Scriber and Lafs: two new approaches to speech analysis. In Trends in speech recognition, ed. Wayne A. Lea. Englewood-Cliffs, N.J.: Prentice-Hall.

Knight, Kevin, and Jonathan Graehl. 1997. Machine transliteration. In Proceedings of the 35th Annual Meeting of the Association for Computionational Linguistics, 128-135.

Kohonen, Teuvo. 1988. The ‘neural’ phonetic typewriter. IEEE Computer 21.11-22.

Kohonen, Teuvo. 1989. Self-organization and associative memory. Berlin: Springer.

Koopmans-Van Beinum, Florien J. 1991. Spectro-temporal reduction and expansion in spontaneous speech and read text: Focus words versus non-focus words. In ETRW: Phonetics and Phonology of Speaking Styles, 36.1-36.5.

Koopmans-Van Beinum, Florien J. 1992. The role of focus words in natural and synthetic continuous speech: Acoustic aspects. Speech Communication 11:439-452.

Koskenniemi, Kimmo. 1983. Two-level morphology: a general computational model of word-form recognition and production. Publication number 11, Department of Linguistics, University of Helsinki.

Krause, Jean Christine. 1995. The effects of speaking rate and speaking mode on intelligibility. Master’s thesis, MIT.

Kreidler, Charles W. 1989. The pronunciation of English. Oxford: Blackwell.

Kruskal, Joseph B. 1983. An overview of sequence comparison. In Time warps, string edits, and macromolecules, ed. by Joseph B. Kruskal and David Sankoff, 1-44. Reading, MA: Addison-Wesley.

Kurath, Hans, and Raven I. McDavid, Jr. 1961. The pronunciation of English in the Atlantic States. Ann Arbor: University of Michigan Press.

Laan, Gitta P. M. 1997. The contribution of intonation, segmental durations, and spectral features to the perception of a spontaneous and a read speaking style. Speech Communication 22: 43-65.

Labov, William. 1966. The social stratification of English in New York City. Washington, D.C.: Center for Applied Linguistics.

Labov, William. 1972. Language in the inner city: Studies in the Black English vernacular. Philadelphia: University of Pennsylvania Press.

Labov, William. 1989. The limitations of context. Chicago Linguistic Society, part 2, 171-200.

Labov, William. 1991. The three dialects of English. In New ways of analyzing sound change, ed. Penelope Eckert, 1-44. San Diego: Academic Press.

Labov, William. 1994. Principles of linguistic change. Oxford: Blackwell.

Labov, William. 1996. The organization of dialect diversity in North America. ICSLP.

Labov, William. 1997. Resyllabification. In Variation, change, and phonological theory, ed. Frans Hinskens, Roeland Van Hout, W. Leo Wetzels. Amsterdam: Benjamins.

Labov, William, Mark Karan, and Corey Miller. 1991. Near-mergers and the suspension of phonemic contrast. Language Variation and Change 3:33-74.

Labov, William, Malcah Yaeger, and Richard Steiner. 1972. A quantitative study of sound change in progress. Report on National Science Foundation Contract NSF-GS-3287, University of Pennsylvania.

Ladd, D. Robert. 1996. Intonational phonology. Cambridge: Cambridge University Press.

Ladefoged, Peter. 1982. A course in phonetics. 2nd ed. San Diego: Harcourt Brace Jovanovich.

Lahiri, A., and W. D. Marslen-Wilson. 1991. The mental representation of lexical form: a phonological approach to the recognition lexicon. Cognition 38:245-294.

Lander, Terri. 1997. CSLU labeling guide. Center for Spoken Language Understanding. Oregon Graduate Institute.

Lander, Terri, Beatrice Oshika, Ronald A. Cole, and Mark Fanty. 1995. Multi-language speech database: creation and phonetic labeling agreement. ICPhS 95: 166-169.

Laporte, Eric. 1997. Rational transductions for phonetic conversion and phonology. In Finite-state language processing, ed. Emmanuel Roche and Yves Schabes, 407-429. Cambridge: MIT Press.

Lawrence, S. G. C., and G. Kaye. 1986. Alignment of phonemes with their corresponding orthography. Computer Speech and Language 1:153-165.

Lea, W. A. 1980. Prosodic aids to speech recognition. In Trends in speech recognition, ed. W. A. Lea. Englewood Cliffs, N.J.: Prentice-Hall.

Leben, Will. 1973. Suprasegmental phonology. Ph.D. diss., MIT.

Liberman, Mark. 1993. Optionality and optimality. Manuscript.

Liberman, Mark. 1994a. Phonological optionality in Latin clitics. The Penn Review of Linguistics 18: 87-102.

Liberman, Mark. 1994b. Computer speech synthesis: its status and prospects. In Voice Communication between humans and machines, ed. David B. Roe and Jay C. Wilpon. Washington: National Academy of Sciences.

Liberman, Mark, and Kenneth W. Church. 1992. Text analysis and word pronunciation in text-to-speech synthesis. In Advances in speech signal processing, ed. S. Furui and M. Sondhi. New York: Marcel Dekker.

Liberman, Mark, and Janet Pierrehumbert. 1984. Intonational invariance under changes in pitch range and length. In Language sound structure, ed. Mark Aronoff and Richard T. Oehrle. Cambridge, Mass.: MIT Press.

Linguistic Data Consortium. 1995. COMLEX English pronouncing lexicon. Trustees of the University of Pennsylvania, version 0.2.

Lippmann, Richard P. 1996. Recognition by humans and machines: miles to go before we sleep. Speech Communication 18:247-248.

Ljolje, Andrej, Julia Hirschberg, and Jan P. H. van Santen. 1997. Automatic speech segmentation for concatenative inventory selection. In Progress in speech synthesis, ed. Jan P. H. van Santen, Richard W. Sproat, Joseph P. Olive, and Julia Hirschberg, 305-311. New York: Springer.

Logan, J. S., B. G. Greene, and D. B. Pisoni. 1989. Segmental intelligibility of synthetic speech produced by rule. JASA 86:566-581.

Lucas, S. M., and R. I. Damper. 1992. Syntactic neural networks for bi-directional text-phonetics translation. Talking machines: theories, models and applications, ed. by G. Bailly and C. Benoît. Amsterdam: Elsevier/North-Holland, 127-141.

Lucassen, J., and R. Mercer. 1984. An information-theoretic approach to the automatic determination of phonemic baseforms. ICASSP, 42.5.1-42.5.4.

Luce, P. A. 1981. Comprehension of fluent synthetic speech produced by rule. Research on perception progress report no. 7, 229-241. Bloomington, Indiana: Speech Research Laboratory, Psychology Department, Indiana University.

Luk, R. W. P., and Robert I. Damper. 1991. Stochastic transduction for text-to-phoneme conversion. Eurospeech 91, 779-782.

Luk, R. W. P., and Robert I. Damper. 1993. Inference of letter-phoneme correspondences using generalized stochastic transducers. ICSLP 93.

Luk, Robert W. P., and Robert I. Damper. 1991. A novel approach to inferring letter-phoneme correspondences. ICASSP.

Luk, Robert W. P., and Robert I. Damper. 1992. Inference of Letter-Phoneme Correspondences by Delimiting and Dynamic Time Warping Techniques. IEEE.

Luk, Robert W. P., and Robert I. Damper. 1992. Inference of letter-phoneme correspondences with pre- defined consonant and vowel patterns. IEEE.

Magen, Harriet S. 1997. The extent of vowel-to-vowel coarticulation in English. Journal of Phonetics 25: 187-205.

McCarthy, John. 1986. OCP effects: Gemination and antigemination. Linguistic Inquiry 17:207-263.

McCarthy, John, and Alan Prince. 1993. Generalized Alignment. Yearbook of Morphology.

McCarthy, John, and Alan Prince. 1995. Faithfulness and reduplicative identity. Manuscript.

McCulloch, M., M. Bedworth, and J. Bridle. 1987. NETspeak— a re-implementation of NETtalk. Computer Speech and Language 2: 289-302.

McHugh, A. 1976. Listener preference and comprehension tests of stress algorithms for a text-to-phonetic speech synthesis program. Naval Research Laboratory Report 8015.

McLemore, Cynthia. 1995. Pronlex transcription. Distributed with Comlex English pronouncing lexicon, Linguistic Data Consortium.

Meng, Helen M., Stephanie Seneff, and Victor Zue. 1996. Reversible letter-to-sound/sound-to-letter generation based on parsing word morphology. Speech Communication 18:47-63.

Meng, Helen M. 1995. Phonological parsing for bi-directional letter-to-sound/sound-to-letter generation. Ph.D. diss., MIT.

Merriam-Webster. 1984. Webster’s Ninth New Collegiate Dictionary. Springfield, Mass.: Merriam- Webster.

Merriam-Webster. 1989. The New Merriam-Webster Dictionary. Springfield, Mass.: Merriam-Webster.

Merriam-Webster. 1996. Merriam-Webster’s Collegiate Dictionary. 10th ed. Springfield, Mass.: Merriam-Webster.

Miller, Corey. 1993. American English /r/ and ambisyllabicity. Paper presented at Linguistic Society of America Annual Meeting, Los Angeles.

Miller, Corey, Orhan Karaali, and Noel Massey. 1997. Variation and synthetic speech. Paper presented at NWAVE 26, Quebec.

Miller, Corey, Orhan Karaali, and Noel Massey. 1998. Learning postlexical variation in an individual. Paper presented at Linguistic Society of America Annual Meeting, New York.

Miller, Corey, Noel Massey, and Orhan Karaali. 1998. Exploring the nature of postlexical processes. Paper presented at 22nd Penn Linguistics Colloquium, Philadelphia.

Miller, David R., and James Trischitta. 1996. Statistical dialect classification based on mean phonetic features. ICSLP.

Mohanan, K. P. 1982. Lexical phonology. Ph.D. diss., distributed by Indiana University Linguistics Club.

Mohanan, K. P. 1986. The theory of lexical phonology. Dordrecht: Reidel.

Moody, T. S., and M. G. Joost. 1986. Synthesized speech, digitized speech and recorded speech: A comparison of listener comprehension rates. Proceedings of the Voice Input/Output Society, Alexandria, Virginia.

Nagy, Naomi, and William T. Reynolds. 1996. Accounting for variable word-final deletion within optimality theory. In Sociolinguistic variation: data, theory, and analysis: selected papers from NWAV 23 at Stanford. Stanford: CSLI.

Nerbonne, John, and Wilbert Heeringa. 1997. Measuring dialect distance phonetically. In Proceedings of the third meeting of the ACL special interest group in computational phonology, ed. John Coleman, 11-18.

Nespor, Marina, and Irene Vogel. 1986. Prosodic phonology. Dordrecht: Foris.

Nusbaum, Howard C., Alexander L. Francis, and Anne S. Henly. 1995. Measuring the naturalness of synthetic speech. International Journal of Speech Technology 1: 7-19.

Nusbaum, Howard C., Alexander L. Francis, and Tracy Luks. 1995. Comparative evaluation of the quality of synthetic speech produced at Motorola. Research report, Spoken Language Research Laboratory, University of Chicago.

Nye, P. W., F. Ingemann, and L. Donald. 1975. Synthetic speech comprehension: A comparison of listener performances with and preferences among different speech forms. Haskins Laboratories: Status report on speech perception SR-41. 117-126. New Haven, CT: Haskins Laboratories.

O’Shaughnessy, Douglas. 1976. Modeling fundamental frequency, and its relationship to syntax, semantics, and phonetics. Ph.D. diss., MIT.

Paradis, Carole. 1986. On constraints and repair strategies. The Linguistic Review 6.71-97.

Paradis, Carole. 1995. Derivational constraints in phonology: evidence from loanwords and implications. CLS 31.360-374.

Parfitt, S. H., and R. A. Sharman. 1991. A bidirectional model of English pronunciation. Eurospeech ‘91, 2.800-804.

Picheny, Michael A., Nathaniel I. Durlach, and Louis D. Braida. 1985. Speaking clearly for the hard of hearing I: intelligibility differences between clear and conversational speech. Journal of speech and hearing research 28:96-103.

Picheny, Michael A., Nathaniel I. Durlach, and Louis D. Braida. 1986. Speaking clearly for the hard of hearing II: acoustic characteristics of clear and conversational speech. Journal of speech and hearing research 29:434-446.

Pierce, John R. 1980. An introduction to information theory: symbols, signals and noise. New York: Dover.

Pierrehumbert, Janet B., and Mary Beckman. 1988. Japanese tone structure. LI Monograph Series No. 15. Cambridge, Mass.: MIT Press.

Pierrehumbert, Janet B., and Stefan Frisch. 1997. Synthesizing allophonic glottalization. In Progress in speech synthesis, ed. Jan P. H. van Santen, Richard W. Sproat, Joseph P. Olive, and Julia Hirschberg, 9-26. New York: Springer.

Pierrehumbert, Janet B., and David Talkin. 1992. Lenition of /h/ and glottal stop. In Gesture, Segment, Prosody, Papers in laboratory phonology II, ed. Gerard J. Docherty and D. Robert Ladd, 90-117. Cambridge: Cambridge University Press.

Pisoni, David B. 1993. Long-term memory in speech perception: Some new findings on talker variability, speaking rate and perceptual learning. Speech Communication 13:109-125.

Pisoni, David B. 1997. Perception of synthetic speech. In Progress in speech synthesis, ed. Jan P. H. van Santen, Richard W. Sproat, Joseph P. Olive, and Julia Hirschberg, 541-560. New York: Springer.

Pisoni, David B., and Sheri Hunnicutt. 1980. Perceptual evaluation of MITalk: The MIT unrestricted text-to-speech system. ICASSP, 572-575.

Poplack, Shana. 1980. The notion of the plural in Puerto Rican Spanish: Competing constraints on /s/ deletion. In Locating language in time and space, ed. William Labov, 55-86. New York: Academic Press.

Portele, Thomas. 1996. Ein phonetisch-akustisch motiviertes Inventar zur Sprachsynthese deutscher Äußerungen. Tübingen: Max Niemeyer Verlag.

Portele, Thomas. 1997. Reduktionen in der einheitsbasierten Sprachsynthese. Fortschritte der Akustik. DAGA ‘97, Kiel.

Press, William H., Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. 1992. Numerical recipes in C. Cambridge: Cambridge University Press.

Price, Patti. 1996. Combining linguistic with statistical methods in automatic speech understanding. In The balancing act, ed. Judith L. Klavans and Philip Resnik, 119-134. Cambridge: MIT Press.

Prince, Alan, and Paul Smolensky. 1993. Optimality theory: constraint interaction in generative grammar. Manuscript.

Rabiner, Lawrence R., and Biing-Hwang Juang. 1993. Fundamentals of speech recognition. Englewood Cliffs, NJ: PTR Prentice Hall.

Ralston, James V., David B. Pisoni, and John W. Mullennix. 1995. Perception and comprehension of speech. In Applied Speech Technology, ed. A. Syrdal, R. Bennett and S. Greenspan, 233-287. Boca Raton: CRC.

Ralston, James V., David B. Pisoni, Scott E. Lively, B. G. Greene, and John W. Mullennix. 1991. Comprehension of synthetic speech produced by rule: word monitoring and sentence-by-sentence listening times. Human Factors 33:471-491.

Randolph, Mark A. 1989. Syllable-based constraints on properties of English sounds. Ph.D. diss., MIT.

Randolph, Mark A. 1990. A data-driven method for discovering and predicting allophonic variation. ICASSP 90.1177-1180.

Rentzepopoulous, Panagiotis A., and George K. Kokkinakis. 1996. Efficient multilingual phoneme-to-grapheme conversion based on HMM. Computational Linguistics 22:351-376.

Reynolds, William T. 1994. Variation and phonological theory. Ph.D. diss., University of Pennsylvania.

Reynolds, William T., and Naomi Nagy. 1994. Phonological variation in Faetar: an optimality account. Chicago Linguistics Society 30-2.

Reynolds, William T., and Hadass Sheffer. 1994. Variation and optimality. Penn Working Papers in Linguistics 1.

Riley, Michael. 1989. Some applications of tree-based modeling to speech and language. In Proceedings of the speech and natural language workshop, 339-352. DARPA, Morgan Kaufmann.

Riley, Michael. 1991. A statistical model for generating pronunciation networks. ICASSP 91.S11.1-S11.4.

Riley, Michael, and Andrej Ljolje. 1996. Automatic generation of detailed pronunciation lexicons. In Automatic speech and speaker recognition: advanced topics, ed. Chin-Hui Lee, Frank K. Soong, and K.K. Paliwal, 285-301. Boston: Kluwer.

Ristad, Eric Sven, and Peter N. Yianilos. 1997. Learning String Edit Distance. Princeton University Computer Science Department, Research Report CS-TR-532-96.

Roche, Emmanuel, and Yves Schabes. 1995. Deterministic part-of-speech tagging with finite-state transducers. Computational Linguistics 21.

Rodd, Jennifer. 1997. Recurrent neural network learning of phonological regularities in Turkish. In CoNLL97: Computational Natural Language Learning, ed. T. M. Ellison, 97-106. ACL.

Rumelhart, D. E., and J. L. McClelland. 1986. On learning the past tenses of English verbs. In Parallel distributed processing, Vol. 2, ed. James L. McClelland, David E. Rumelhart and the PDP Research Group, 216-271. Cambridge, Mass.: MIT Press.

Sankoff, David. 1988. Variable rules. In Sociolinguistics: An international handbook of the science of language and society, ed. Ulrich Ammon, Norbert Dittmar, and Klauss J. Mattheier. Berlin: de Gruyter.

Schmidt, M., S. Fitt, C. Scott and M. Jack. 1993. Phonetic transcription standards for European names (ONOMASTICA). Eurospeech 93, 279-282.

Schmidt-Nielsen, Astrid. 1995. Intelligibility and acceptability testing for speech technology. In Applied Speech Technology, ed. A. Syrdal, R. Bennett and S. Greenspan, 195-231. Boca Raton: CRC.

Schwab, E. C., Howard C. Nusbaum, and David B. Pisoni. 1985. Some effects of training on the perception of synthetic speech. Human Factors 27:395-408.

Scobbie, James M. 1995. What do we do when phonology is powerful enough to imitate phonetics? Comments on Zsiga. In Phonology and phonetic evidence, papers in laboratory phonology IV, ed. Bruce Connell and Amalia Arvaniti, 303-314. Cambridge: Cambridge University Press.

Seidenberg, Mark. 1989. Visual word recognition and pronunciation: a computational model and its implications. In Lexical representation and process, ed. William Marslen-Wilson. Cambridge: MIT Press.

Sejnowski, T., and C. Rosenberg. 1987. NETtalk: a parallel network that learns to pronounce English text. Complex Systems 1:145-168.

Selkirk, Elizabeth O. 1984. Phonology and syntax. Cambridge, Mass.: MIT Press.

Sells, Peter, John Rickford, and Thomas Wasow. 1996. Variation in negative inversion in AAVE: an optimality theoretic approach. In Sociolinguistic variation: data, theory, and analysis: selected papers from NWAV 23 at Stanford. Stanford: CSLI.

Seneff, Stephanie, and Victor Zue. 1988. Transcription and alignment of the TIMIT database. Manuscript.

Shillcock, Richard, Joe Levy, Geoff Lindsey, Paul Cairns, and Nick Chater. 1993. Connectionist modeling of phonological space. In Edinburgh Working Papers in Cognitive Science 8: Computational Phonology, ed. T. Mark Ellison and James M. Scobbie.

Sorin, Christel. 1991. Some observations on the processing of mute “e” in a French diphone-based speech synthesis system. Journal of Phonetics 19: 147-159.

Spiegel, Murray, and Marian Macchi. 1990. Development of the Orator synthesizer for network applications: name pronunciation accuracy, morphological analysis, customization for business listings, and acronym pronunciation. AVIOS.

Sproat, Richard. 1994. English noun-phrase accent prediction for text-to-speech. Computer Speech and Language, April, 1994: 79-94.

Sproat, Richard, and Osamu Fujimura. 1993. Allophonic variation in English /l/ and its implications for phonetic implementation. Journal of Phonetics 21:291-311.

Sproat, Richard, Bernd Möbius, Kazuaki Maeda, and Evelyne Tzoukermann. Multilingual text analysis. In Multilingual Text-to-Speech Synthesis: The Bell Labs Approach, ed. Richard Sproat, 31-87. Boston: Kluwer.

Sproat, Richard, and Michael Riley. 1996. Compilation of weighted finite-state transducers from decision trees. 34th Annual Meeting of the Association for Computational Linguistics, 215-222.

Stanfill, C., and D. Waltz. 1986. Towards memory-based reasoning. Communications of the ACM 29: 1213-1228.

Strassel, Stephanie M. 1997. Variation in North American English flapping. Paper presented at NWAVE 26, Quebec.

Street, R. J., Jr. and H. Giles. 1982. Speech accommodation theory: a social and cognitive approach to language and speech behavior. In Social cognition and communication, ed. M. Roloff and C. Berger. Beverly Hills: Sage.

Ström, Nikko. 1997. Speaker Modeling for Speaker Adaptation in Automatic Speech Recognition. In Talker Variability in Speech Processing, ed. Keith Johnson and John W. Mullennix, 167-190. San Diego: Academic Press.

Syrdal, Ann K. 1995. Text-to-speech systems. In Applied Speech Technology, ed. A. Syrdal, R. Bennett, and S. Greenspan, 99-126. Boca Raton: CRC Press.

Tajchman, Gary, Daniel Jurafsky, and Eric Fosler. 1995. Learning phonological rule probabilities from speech corpora with exploratory computational phonology. ACL-95, 1-8.

Tesar, Bruce. 1995. Computing optimal forms in optimality theory: basic syllabification. Technical Report CU-CS-763-95, Dept. of Computer Science, University of Colorado, Boulder.

Thomas, Charles Kenneth. 1958. An introduction to the phonetics of American English. 2nd ed. New York: Ronald Press.

Torkkola, Kari. 1993. An efficient way to learn English grapheme-to-phoneme rules automatically. IEEE.

Trager, George L., and Bernard Bloch. 1941. The syllabic phonemes of English. Language 17: 223-246.

Trager, George L., and Henry Lee Smith, Jr. 1956. An outline of English structure. Studies in linguistics: occasional papers, no. 3. 2nd printing. Washington, D.C.: American Council of Learned Societies.

Turkel, Bill. 1994. The acquisition of optimality theoretic systems. ROA-11-0000.

Urbanczyk, Suzanne C., and Stephen J. Eady. 1989. Assignment of syllable stress in a demisyllable-based text-to-speech synthesis system. IEEE.

van Bergem, Dick R. 1991. Acoustic and lexical vowel reduction. In ETRW: Phonetics and Phonology of Speaking Styles, 10.1-10.5.

van Bergem, Dick R. 1993. Acoustic vowel reduction as a function of sentence accent, word stress, and word class.

van Bergem, Dick R. 1994. A model of coarticulatory effects on the schwa. Speech Communication 14:143-162.

Van de Velde, Hans, and Roeland van Hout. 1997. Dangerous aggregations: A case study of Dutch /n/ deletion. Paper presented at NWAVE 26, Quebec.

Veatch, Thomas C. 1991. English vowels: their surface phonology and phonetic implementation in vernacular dialects. Single-spaced version. Ph.D. diss., University of Pennsylvania.

Vitale, Tony. 1991. An algorithm for high accuracy name pronunciation by parametric speech synthesizer. Computational Linguistics 17.257-276.

Voiers, W. D. 1983. Evaluating processed speech using the diagnostic rhyme test. Speech Technology Jan./Feb.: 30-39.

Vorstermans, A., and J. P. Martens. 1994. Automatic labeling of corpora for speech synthesis development. Proceedings ProRisc-94, 261-266.

Waibel, Alexander. 1988. Prosody and speech recognition. London: Pitman; San Mateo: Morgan-Kaufmann.

Walther, Markus. OT SIMPLE - a construction kit approach to optimality theory. ROA-152-1096.

Ward, Grady. 1996. Moby Pronunciator II.

Weide, Robert L. 1995. The Carnegie Mellon Pronouncing Dictionary. cmudict.0.4.

Weinreich, Uriel, William Labov, and Marvin Herzog. 1968. Empirical foundations for a theory of language change. In Directions for historical linguistics, ed. W. Lehmann and Y. Malkiel, 95-195. Austin: University of Texas Press.

Wells, J. C. 1982. Accents of English I. Cambridge: Cambridge University Press.

Wightman, Colin W., and David T. Talkin. 1997. The Aligner: text-to-speech alignment using Markov models. In Progress in speech synthesis, ed. Jan P. H. van Santen, Richard W. Sproat, Joseph P. Olive, and Julia Hirschberg, 314-323. New York: Springer.

Wilks, Yorick, and Mark Stevenson. 1996. The grammar of sense: Is word-sense tagging much more than part-of-speech tagging? cmp-lg/9607028.

Williams, Briony, and Stephen Isard. 1997. A keyvowel Approach to the synthesis of regional accents of English. Eurospeech ’97, 2435-2438.

Williams, Sheila M. 1994. Lexical phonology and speech style: Using a model to test a theory. In Proceedings of the first meeting of the ACL special interest group in computational phonology, 43-57.

Withgott, M. Margaret, and Francine R. Chen. 1993. Computational Models of American Speech. Stanford: CSLI.

Wood, Sidney A. J. 1996. Assimilation or coarticulation? Evidence from the temporal coordination of tongue gestures for the palatalization of Bulgarian alveolar stops. Journal of Phonetics 24: 139-164.

Woods, Anthony, Paul Fletcher, and Arthur Hughes. 1986. Statistics in language studies. Cambridge: Cambridge University Press.

Yarowsky, David. 1997. Homograph disambiguation in text-to-speech synthesis. In Progress in speech synthesis, ed. Jan P. H. van Santen, Richard W. Sproat, Joseph P. Olive, and Julia Hirschberg, 157-172. New York: Springer.

Yip, Moira. 1988. The obligatory contour principle and phonological rules: A loss of identity. Linguistic Inquiry 19:65-100.

Young, Steve, Joop Jansen, Julian Odell, Dave Ollason, and Phil Woodland. 1995. The HTK Book. Version 2.0. Washington: Entropic Research Laboratory.

Yvon, François. 1996. Grapheme-to-phoneme conversion using multiple unbounded overlapping chunks. Proceedings of NemLap ‘96.

Zhao, Yunxin. 1997. Overcoming speaker variability in automatic speech recognition: the speaker adaptation approach. In Talker variability in speech processing, ed. Keith Johnson and John W. Mullennix, 191-210. San Diego: Academic Press.

Zsiga, Elizabeth C. 1995. An acoustic and electropalatographic study of lexical and postlexical palatalization in American English. In Phonology and phonetic evidence, papers in laboratory phonology IV, ed. Bruce Connell and Amalia Arvaniti, 282-302. Cambridge: Cambridge University Press.

Zsiga, Elizabeth C. 1997. Features, gesture and Igbo vowels. Language 73:227-274.

Zubritskaya, Katya. 1994. Sound change in optimality theory and constraint tie. CLS 30-2.

Zue, Victor W., and Martha Laferriere. 1979. Acoustic study of medial /t,d/ in American English. Journal of the Acoustical Society of America 66:1039-1050.
