A Prototype Text Analyzer for Mandarin Chinese TTS System
Chiao-ting Fang
Uppsala University
Department of Linguistics and Philology
Master’s Programme in Language Technology
Master’s Thesis in Language Technology
June 11, 2017

Supervisors:
Yan Shao, Uppsala University
Kåre Sjölander, ReadSpeaker AB
Andreas Björk, ReadSpeaker AB

Abstract

This project presents a prototype of a text analyzer for a rule-based Mandarin Chinese TTS voice, including components for text normalization, lexicons, phonetic analysis, prosodic analysis, and a phone set. Our implementation shows that, despite the linguistic differences, it is feasible to build a Chinese voice with the TTS framework for European languages used at ReadSpeaker AB. A number of challenges in disambiguation and tone sandhi were identified during the implementation, and we discuss them in detail. Based on these cases, we design a comparison of existing voices to better understand the performance level of commercial TTS systems. The results confirm our conjecture about which cases are difficult and also show that there is considerable disagreement on the tone sandhi pattern among the voices. Further research on these topics will contribute to the development of future TTS voices.

Contents

Acknowledgements
1 Introduction
1.1 Purpose
1.2 Limitations
1.3 Outline
2 Text-to-Speech Systems
2.1 Architecture
2.2 Components of Unit Selection TTS system
2.2.1 Text Normalization
2.2.2 Phonetic Analysis
2.2.3 Prosodic Analysis
2.2.4 Internal Representation
2.2.5 Waveform Synthesis
2.3 Other approaches to synthesize speech
2.3.1 Articulatory Synthesis
2.3.2 Formant Synthesis
2.3.3 Diphone Synthesis
2.3.4 Hidden Markov Model-based Synthesis
2.3.5 New Experimental Approaches
3 An Overview of the Chinese Language
3.1 Introduction: Chinese or Mandarin?
3.2 Phonology
3.3 Phonetic Representation and Romanization
3.4 Morphology: What is a Word?
3.5 Writing Systems
4 Implementation
4.1 Text Normalization
4.1.1 Tokenization: ZPar
4.1.2 Normalization
4.2 Phonetic Analysis
4.2.1 Lexicons
4.2.2 Out-of-Vocabulary Words
4.2.3 Disambiguation
4.3 Internal Representation: the Phone Set
4.4 Prosodic Analysis
4.4.1 Prosody Beyond Tones
4.4.2 Third Tone Sandhi
4.4.3 Yi- and Bu-Tone Sandhi
4.5 Waveform Synthesis
4.5.1 Speech Database
4.5.2 Segmentation and Generating the Output
5 Evaluation
5.1 Evaluation Methods
5.1.1 Intelligibility
5.1.2 Naturalness
5.2 Existing Chinese TTS voices
5.3 Comparing the Voices
6 Conclusion
6.1 Summary
6.2 Future Work
Bibliography
A Complete List of Normalization Tasks
B Pinyin to R-sampa Mapping Chart
C Test Cases and Results

Acknowledgements

I would like to express my gratitude to my supervisor Yan for his constant help, devotion, and encouragement during the entire project. His guidance and feedback were invaluable.
I am most grateful to Andreas and Kåre, my supervisors at ReadSpeaker AB, for giving me the opportunity to work on this exciting project. With their patience and immense knowledge of TTS, they always managed to answer my questions. I also wish to thank Erik for his help with the implementation and Filip for checking the phone set. My time at ReadSpeaker has been both enjoyable and productive thanks to all the colleagues, especially the TTS team. I consider myself very lucky to be part of the group. I am indebted to Hoa and Caroline, who went through my writing tirelessly to improve it. This work would not have been possible without the support of my friends and, most of all, my family. I am sincerely thankful for their love and company whenever I need them the most.

1 Introduction

A text-to-speech (TTS) system takes text as input and tries to generate audio output of the text in the way that it would be read by a human. As speech is the most fundamental form of human language, a wide variety of applications are now equipped with some form of artificial voice, not only for users who have difficulty reading or understanding written text, but also as an aid for the general public. Synthesized speech is also used by people who are unable to talk with their own voice.

Speech synthesis has a long history: early attempts include talking machines that imitate articulatory movements, dating back to the 18th century (Jurafsky and Martin, 2009). Researchers in formant synthesis from the 1950s onward have successfully produced understandable speech by using varied signals to create the speech waveform, but the quality of the voice is far from natural (Black, 2000). Commercial approaches today are mainly based on the concatenation of recorded speech, made possible by more powerful computers and larger storage space.
The architecture of a modern concatenative TTS system generally includes two parts: text analysis and waveform synthesis (Taylor, 2009). In text analysis, depending on the language, the input may first need to be tokenized into words and sentences. The non-standard words in the input text, such as numbers, symbols, and abbreviations, are then converted to their written-out forms. The written text is later turned into a phonetic transcription, usually by lexicon lookup or by grapheme-to-phoneme rules. Suprasegmental features are also encoded for natural-sounding prosody. The waveform synthesizer then generates the speech by selecting appropriate segments from the speech database according to the transcription and prosodic markings. The output of the system is the artificial speech made by joining the chosen units.

With better computing power and recording devices available, TTS system development is no longer limited to research institutes and laboratories. Many voices on the market are of good quality, covering a wide range of languages. TTS is also a field of research and development for companies that wish to incorporate synthesized speech in their products. Although synthesized voices are now generally comprehensible, the models are continuously improved to handle tricky natural language cases as well as to capture and recreate the correct prosody.

1.1 Purpose

The goal of this thesis project is to explore the development of a text analyzer for a Mandarin Chinese TTS voice under the models described by Taylor (2009) and Jurafsky and Martin (2009). Mandarin Chinese is known for its logographic script and tones, which require different NLP approaches in the text analysis processes. The result of the project is a prototype capable of dealing with many common text analysis tasks in a Chinese TTS system. This project is carried out in collaboration with ReadSpeaker AB, a company in Uppsala, Sweden, that provides TTS solutions for digital texts and applications.
By working with a non-alphabetic and tonal language like Mandarin Chinese, we also hope to improve the robustness of the text analyzers used at ReadSpeaker in general.

1.2 Limitations

Although Latin letters and foreign words may occur in a Chinese text, we process only Chinese characters and speech sounds in this project. Non-character words are neither analyzed nor read in the output. Our focus is to improve the rule-based text analysis rather than to fix mistakes in individual words, so the output audio is only a demonstration of how the rules work rather than a sample voice of the product.

1.3 Outline

Chapter 2 provides an overview of the major text-to-speech approaches, with a focus on concatenative synthesis and unit selection. Chapter 3 provides the linguistic background of Chinese that is relevant for our TTS system. The implementation is described in Chapter 4. The common criteria for TTS evaluation are presented in Chapter 5, along with a survey of some existing Chinese TTS services and a comparison of their performance. Chapter 6 sums up the project and discusses possible future work.

2 Text-to-Speech Systems

Modern TTS systems are software programs that convert digital text into equivalent audio output. The conversion consists mainly of two processes, text analysis and waveform synthesis. This chapter provides an overview of the major framework and examines the components of the processes used for synthesized speech.

2.1 Architecture

Figure 2.1 shows the common form model of a TTS system proposed by Taylor (2009), which is widely adopted as the basis of concatenative TTS systems. The model is divided into two layers: the spoken and written signals of natural language, and their components, graphemes and phonemes. As both text and speech can be ambiguous in natural languages,