A Prototype Text Analyzer for TTS System

Chiao-ting Fang

Uppsala University
Department of Linguistics and Philology
Master’s Programme in Language Technology
Master’s Thesis in Language Technology

June 11, 2017

Supervisors:
Yan Shao, Uppsala University
Kåre Sjölander, ReadSpeaker AB
Andreas Björk, ReadSpeaker AB

Abstract

This project presents a prototype of a text analyzer for a rule-based Mandarin Chinese TTS voice, including components for text normalization, lexicons, phonetic analysis, prosodic analysis, and a phone set. Our implementation shows that despite the linguistic differences, it is feasible to build a Chinese voice with the TTS framework for European languages used at ReadSpeaker AB. A number of challenges in disambiguation and tone sandhi were identified during the implementation, which we discuss in detail. A comparison of existing voices, based on these cases, is designed to better understand the performance level of commercial TTS systems. The results verify our conjectures about the difficult cases and also show considerable disagreement on tone sandhi patterns among the voices. Further research on these topics will contribute to the development of future TTS voices.

Contents

Acknowledgements

1 Introduction
   1.1 Purpose
   1.2 Limitations
   1.3 Outline

2 Text-to-Speech Systems
   2.1 Architecture
   2.2 Components of a Unit Selection TTS System
      2.2.1 Text Normalization
      2.2.2 Phonetic Analysis
      2.2.3 Prosodic Analysis
      2.2.4 Internal Representation
      2.2.5 Waveform Synthesis
   2.3 Other Approaches to Synthesizing Speech
      2.3.1 Articulatory Synthesis
      2.3.2 Formant Synthesis
      2.3.3 Diphone Synthesis
      2.3.4 Hidden Markov Model-based Synthesis
      2.3.5 New Experimental Approaches

3 An Overview of the Chinese Language
   3.1 Introduction: Chinese or Mandarin?
   3.2 Phonology
   3.3 Phonetic Representation and Romanization
   3.4 Morphology: What is a Word?
   3.5 Writing Systems

4 Implementation
   4.1 Text Normalization
      4.1.1 Tokenization: ZPar
      4.1.2 Normalization
   4.2 Phonetic Analysis
      4.2.1 Lexicons
      4.2.2 Out-of-Vocabulary Words
      4.2.3 Disambiguation
   4.3 Internal Representation: the Phone Set
   4.4 Prosodic Analysis
      4.4.1 Prosody Beyond Tones
      4.4.2 Third Tone Sandhi
      4.4.3 Yi- and Bu-Tone Sandhi
   4.5 Waveform Synthesis
      4.5.1 Speech Database
      4.5.2 Segmentation and Generating the Output

5 Evaluation
   5.1 Evaluation Methods
      5.1.1 Intelligibility
      5.1.2 Naturalness
   5.2 Existing Chinese TTS Voices
   5.3 Comparing the Voices

6 Conclusion
   6.1 Summary
   6.2 Future Work

Bibliography

A Complete List of Normalization Tasks
B Pinyin to R-sampa Mapping Chart
C Test Cases and Results

Acknowledgements

I would like to express my gratitude to my supervisor Yan for his constant help, devotion, and encouragement during the entire project. His guidance and feedback were invaluable. I am most grateful to Andreas and Kåre, my supervisors at ReadSpeaker AB, for giving me the opportunity to work on this exciting project. With their patience and immense knowledge of TTS, they always managed to answer any question I had. I also wish to thank Erik for his help with the implementation and Filip for checking the phone set. My time at ReadSpeaker has been both enjoyable and productive thanks to all the colleagues, and especially the TTS team. I consider myself very lucky to be part of the group. I am indebted to Hoa and Caroline, who have gone through my writing tirelessly to improve it. This work would not have been possible without the support of my friends and, most of all, my family. I am sincerely thankful for their love and company whenever I need them the most.

1 Introduction

A text-to-speech (TTS) system takes text as input and tries to generate audio output of the text the way it would be read by a human. As speech is the most fundamental form of human language, a wide variety of applications are now equipped with some form of artificial voice, not only for users who have difficulties reading or understanding written text, but also as an aid for the general public. Synthesized speech is also used by people who are unable to talk with their own voice. Synthesized voices have a long history: early attempts include talking machines that imitated articulatory movements, dating back to the 18th century (Jurafsky and Martin, 2009). Researchers in formant synthesis from the 1950s onward successfully produced understandable speech by using varied signals to create speech waveforms, but the quality of the voice is far from natural (Black, 2000). Commercial approaches today are mainly based on the concatenation of recorded speech, made possible by more powerful computers and larger storage space.

The architecture of a modern concatenative TTS system generally includes two parts: text analysis and waveform synthesis (Taylor, 2009). In text analysis, tokenization of the words and sentences may be required first, depending on the language. Then the non-standard words in the input text, such as numbers, symbols, and abbreviations, are converted to their written-out forms. The written text is then turned into phonetic transcription, usually by lexicon lookup or by grapheme-to-phoneme rules. Suprasegmental features are also encoded for natural-sounding prosody. The waveform synthesizer then generates the speech by selecting appropriate segments from the speech database according to the transcription and prosodic markings. The output of the system is the artificial speech made by joining the chosen units.

With better computing power and recording devices available, TTS system development is no longer limited to research institutes and laboratories. Many voices on the market are of good quality, covering a wide range of languages. TTS is also a field of research and development for companies who wish to incorporate synthesized speech in their products. Although synthesized voices are now generally comprehensible, the models are continuously improved to handle tricky natural language cases as well as to capture and recreate the correct prosody.

1.1 Purpose

The goal of this thesis project is to explore the development of a text analyzer for a Mandarin Chinese TTS voice under the models described by Taylor (2009) and Jurafsky and Martin (2009). Mandarin Chinese is known for its logographic script and tones, which call for different NLP approaches in several of the processing steps. The result of the project is a prototype capable of handling many common text analysis tasks in a Chinese TTS system. The project is carried out in collaboration with ReadSpeaker AB, a TTS company in Uppsala, Sweden, that provides TTS solutions for digital texts and applications. By working with a non-alphabetic and tonal language like Mandarin Chinese, we also hope to improve the robustness of the text analyzers used at ReadSpeaker in general.

1.2 Limitations

Although Latin letters and foreign words may occur in a Chinese text, we only process Chinese characters and their speech sounds in this project. Non-character words are neither analyzed nor read in the output. Our focus is to improve the rule-based text analysis rather than to fix individual mistakes in words, so the output audio is only a demonstration of how the rules work rather than a sample voice of the product.

1.3 Outline

Chapter 2 provides an overview of the major text-to-speech approaches, with a focus on concatenative synthesis and unit selection. Chapter 3 presents the linguistic background of Chinese that is relevant for our TTS system. The implementation is described in Chapter 4. The common criteria for TTS evaluation are presented in Chapter 5, along with a survey of some existing Chinese TTS services and a comparison of their performance. Chapter 6 sums up the project and discusses possible future work.

2 Text-to-Speech Systems

Modern TTS systems are computer programs that convert digital text into equivalent audio output. The conversion mainly consists of two processes: text analysis and waveform synthesis. This chapter provides an overview of the major framework and examines the components of the processes used for synthesizing speech.

2.1 Architecture

Figure 2.1 shows the common form model of a TTS system proposed by Taylor (2009), which is widely adopted as the basis of concatenative TTS systems. The model is divided into two layers: the spoken and written signals of natural language, and their components – graphemes and phonemes. As both text and speech can be ambiguous in natural languages, Taylor introduces the idea of definite underlying forms (represented by words in the figure) that allow a one-to-one mapping of graphemes to phonemes. In this model, text analysis and waveform synthesis are viewed as decoding and encoding processes. Their goals are to reveal the forms and generate the speech accordingly. However, as forms do not exist in reality, the input is decoded into graphemes that serve as hints for the forms, which are turned into perceivable phonemes later. Taylor’s model provides a slightly abstract but general architecture of TTS systems: the main task is to find out what the text refers to and return the correct phones.

Figure 2.1: Common form model: the written signal is decoded into graphemes and the spoken signal is encoded from phonemes. Note that “words” is an underlying level (the forms) that serves as the transit between graphemes and phonemes.

An adapted version of Taylor’s common form model is described by Jurafsky and Martin (2009), shown in Figure 2.2, complete with more detailed steps. Here the input text goes through a number of processes in text analysis before being converted to a phonemic representation of the language. In waveform synthesis, matching speech units are chosen from the speech database and joined together to create the speech output. At every stage, the output is passed on to the next component as its input, creating an hourglass-shaped model. Except for prosodic analysis, most components are essentially similar to those in Taylor’s common form model. Text decoding covers both normalization and phonetic analysis; the latter plays an important role in the grapheme-to-phoneme transit. This model of unit selection and the terms used here will be adopted throughout our discussion.

Figure 2.2: Hourglass model for unit selection architecture: text passes through text analysis (text normalization, phonetic analysis, prosodic analysis) to the phonemic internal representation, and then through waveform synthesis (unit selection against a unit database) to speech.

2.2 Components of a Unit Selection TTS System

This section introduces the components of unit selection shown in Figure 2.2, with a focus on text analysis. Unit selection is a type of concatenative synthesis that joins the longest possible matching utterances from a large database, thus preserving the naturalness of the speech segments. We also incorporate the list of text analysis processes in Taylor (2009) to give a more general view of the whole system. The framework is language independent and most examples are in English, but approaches specifically required for Chinese are given as well.

2.2.1 Text Normalization

The goal of text normalization is to convert non-standard characters in the written text into their spelled-out forms. Common normalization handles numbers, acronyms, abbreviations, symbols, and so on. The pronunciation of numbers usually depends on the context. For example, 1685 can be decoded as a year or a cardinal number, and 10 can be “October” or “tenth” in 2017/10/10. Some acronyms are spoken letter by letter (BMW), some as normal words (UNESCO). Abbreviations need to be expanded, and sometimes disambiguation is required to find the correct form (Dr. can be “drive” or “doctor”). Another issue with abbreviations is the punctuation that comes with them. Some punctuation marks are helpful hints for determining sentence boundaries, where we can later insert pauses or adjust the intonation accordingly, but the exceptions caused by abbreviations must be handled first. Sentence splitting, or sentence tokenization, is thus an important task for such languages.

Besides the sentence-final position, pauses may occur in the middle of the speech. Word boundaries can be used for locating the possible occurrences. But for languages like Chinese that do not separate words with spaces, tokenization is necessary at this stage to ensure no pauses will be inserted incorrectly later. After this stage, the text should only contain graphemes of the language, possibly with sentence and word boundaries marked up to be used for prosodic analysis. Taylor (2009) mentions some other issues, such as the encoding of the input and multilingual text, which may also be dealt with at this stage.
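As an aside, the interaction between abbreviations and sentence boundaries is easy to demonstrate. The following is a minimal, hypothetical sketch in Python; the abbreviation list and function are ours, not part of any system described here:

    import re

    # Hypothetical abbreviation list; a real system would load a much larger one.
    ABBREVIATIONS = {"Dr.", "Mr.", "etc.", "e.g.", "i.e."}

    def split_sentences(text):
        """Split on '.', '!' or '?' followed by whitespace, unless the
        token ending in '.' is a known abbreviation."""
        sentences, start = [], 0
        for match in re.finditer(r"[.!?](?=\s|$)", text):
            end = match.end()
            last_token = text[start:end].split()[-1]
            if last_token in ABBREVIATIONS:
                continue  # the period belongs to an abbreviation, keep going
            sentences.append(text[start:end].strip())
            start = end
        if text[start:].strip():
            sentences.append(text[start:].strip())
        return sentences

    print(split_sentences("Dr. Smith arrived. He was late."))
    # ['Dr. Smith arrived.', 'He was late.']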

2.2.2 Phonetic Analysis

The next step is to find the corresponding phonetic transcription for the normalized text. Pronunciation dictionaries are often used to look up the words, but there are also many out-of-vocabulary (OOV) words that require special techniques. A common type of OOV is proper nouns. While a dictionary may include the most frequent ones, it is not possible to list all of them. Concatenation of existing entries is practical for many languages. For example, apelsin and juice can be combined to produce the pronunciation of apelsinjuice (‘orange juice’) in Swedish. Although this example is not an OOV word, the method works well for Chinese, as we can guess the pronunciation of any word as long as all the characters used are in the lexicon. For alphabet-based languages, grapheme-to-phoneme conversion is possible when the mapping between letters and sounds is systematic. This is relatively straightforward for languages like Finnish, but English may require additional rules for the irregular cases. Spiegel (2003) seeks to identify the possible source language of a name by looking at its letter sequences before applying language-specific grapheme-to-phoneme rules. A grapheme-to-phoneme

approach in Chinese relies on known characters. A human may be able to guess the pronunciation of an unknown character by its components, but a computer processing an encoding like Unicode has no means to handle characters with unknown or no encoding. To our knowledge, Chinese grapheme-to-phoneme rules do not deal with unknown characters. As the input of TTS is always in digital format, connecting the system to a larger dictionary will probably give a much better result than teaching computers to recognize the components of a character.

The first stage of text analysis in Taylor (2009) is viewed as a categorization task. The class of the input token is first identified – it can be a natural language token, an abbreviation, an email address, a time, and so on. The decoding is then based on this class to find the underlying form. During this stage, the token may be mapped to multiple possibilities, and disambiguation is used to determine the correct form. Approaches include using part-of-speech (POS) tags (record as noun or verb), looking at the context (bass occurring in close proximity to music probably means the instrument, not the fish), or machine learning models like Bayesian classification and decision trees (Yarowsky, 1997). Disambiguation is an important part of phonetic analysis, but it is not limited to words consisting only of graphemes. The earlier date and Dr. examples show cases where disambiguation is needed in normalization. Depending on the design of a system, the steps in text normalization and phonetic analysis may be carried out in a different order.

2.2.3 Prosodic Analysis

The last part of text analysis in Jurafsky and Martin (2009) is prosodic analysis, which Taylor (2009) also names as a separate stage in his overview of processes. Prosody is defined as the non-segmental features that generally stretch over more than one sound (segment) in spoken language, hence the term suprasegmental (Cruttenden, 1997). The tricky aspect of prosody is that the annotation conventions and models used today do not seem to capture all the features required for reproducing speech, which is why synthesized speech is often described as monotonous. TTS input also offers few clues on how the text should be read. One type of prosodic marking from text we have already seen is sentence and (for Chinese) word boundaries. Pauses may also occur mid-utterance, and their prediction generally relies on classification based on annotated training data. On the word level, a token can be stressed for syntactic or pragmatic reasons. English examples of the former case include the content/function word distinction and compound noun stress, which can be covered by rules. Pragmatic prominence is however harder to predict, as the distinction relies on semantic or discourse

information. Intonation is determined by pitch at the sentence level. In unit selection, it is mostly based on properties that come with the data rather than generated by models. Taylor (2009) regards Chinese tones as suprasegmental features in his model, but linguistically speaking they behave more like phonemes, in that the pitch change affects the meaning of a word (Ladefoged and Johnson, 2014). We take the phonemic view in our prototype, as the ReadSpeaker framework is designed for non-tonal European languages. To sum up, the implicitness of prosody in written text and the absence of effective internal prosodic models make reproducing natural, human-like prosody difficult.

2.2.4 Internal Representation

The processed input text is then turned into the internal phonemic representation, which generally consists of phones and prosodic information based on the previous analysis. For unit selection, prosodic markings may be rather simple, with only stress (for prominence), pauses, and some indication of intonation from the punctuation (like question marks). Other synthesis methods that require further signal processing to modify the prosody will also need information on the fundamental frequency and duration of the phones. This is the final stage of text analysis/decoding, and the output is then passed on to the waveform synthesis module.

2.2.5 Waveform Synthesis

In this section, we briefly introduce the Hunt and Black algorithm for finding and joining the best segments of utterances in unit selection (Hunt and Black, 1996). The length of a single unit varies across systems (ranging from half-phones to syllables), and the choice affects the size of the database. The preference can also be language specific – a syllable-based unit is preferable for Chinese, as a single grapheme (character) almost always represents a syllable (Taylor, 2009). Our prototype, however, is half-phone based, meaning the unit is half of a phone. Phones and their variants are most common for European languages, as the large number of possible syllables renders syllable-based systems impractical. Barker lists 15,831 syllables in English, while only around 1,600 are reported for Chinese (Duanmu, 2007).

Hunt and Black’s algorithm determines the best utterance sequence by looking at the target cost and the join cost. Target cost is calculated from the distance between the desired unit and the candidate. Ideally, the chosen unit and the target should have identical properties (phone, stress, position in the word/sentence, and adjacent phones). This is possible with databases of large coverage. The less alike the two units, the larger the target cost. The join cost computes how natural

the concatenation would be, by checking the acoustic similarity of the two units at their boundaries. If the two units occur consecutively in the database, the join cost is zero, as the join is considered completely natural. Again, we want the join cost to be as low as possible, so existing words or phrases in the database are favored over newly concatenated sequences. The combination of the two costs ensures that the generated speech will be as accurate and natural as possible. Modifications like weights are used to optimize results for different systems, but the concept remains the same.
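To make the search concrete, the sketch below implements the two costs and a dynamic-programming selection over candidate units. The unit features (a context label, recording id and position, boundary f0) and the cost functions are simplified stand-ins for illustration, not the features of any real system:

    # Toy unit selection: pick one candidate unit per target position,
    # minimizing total target cost plus join cost (Viterbi over the lattice).

    def target_cost(target, unit):
        # Stand-in: zero if the context labels match, a penalty otherwise.
        return 0.0 if unit["context"] == target["context"] else 1.0

    def join_cost(left, right):
        # Zero if the units were adjacent in the same recording (natural join).
        if left["rec_id"] == right["rec_id"] and left["pos"] + 1 == right["pos"]:
            return 0.0
        # Otherwise a toy acoustic distance at the unit boundary.
        return abs(left["f0_end"] - right["f0_start"]) / 100.0

    def select_units(targets, candidates):
        """candidates[i] lists the database units matching targets[i].
        Returns (total cost, best unit sequence)."""
        best = [(target_cost(targets[0], u), [u]) for u in candidates[0]]
        for target, cands in zip(targets[1:], candidates[1:]):
            new_best = []
            for u in cands:
                # Cheapest way to extend any previous path with unit u.
                cost, path = min(
                    ((c + join_cost(p[-1], u), p) for c, p in best),
                    key=lambda t: t[0],
                )
                new_best.append((cost + target_cost(target, u), path + [u]))
            best = new_best
        return min(best, key=lambda t: t[0])

    u1 = {"context": "stressed", "rec_id": 7, "pos": 3, "f0_start": 118.0, "f0_end": 121.0}
    u2 = {"context": "stressed", "rec_id": 7, "pos": 4, "f0_start": 121.0, "f0_end": 117.0}
    total, units = select_units(
        [{"context": "stressed"}, {"context": "stressed"}],
        [[u1], [u2]],
    )
    print(total)  # 0.0 -- the units are adjacent in recording 7, so the join is free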

2.3 Other Approaches to Synthesizing Speech

This section provides a short introduction to other approaches. Articulatory synthesis and formant synthesis are among the earliest attempts, but these techniques are rarely used now. Diphone synthesis is a type of concatenative method that predates unit selection, with the diphone as the base unit (Jurafsky and Martin, 2009). The HMM (Hidden Markov Model) is applied in many areas of language technology, and its potential in speech synthesis has been verified (Tokuda et al., 2013). In recent years, there has been great interest in combining state-of-the-art machine learning models such as deep learning and neural networks with the existing pipelines. Although unit selection remains the most popular model for commercial TTS voices, these frameworks are not completely exclusive, and progressive integration is expected in the near future.

2.3.1 Articulatory Synthesis

Articulatory synthesis generates speech by imitating human articulatory movements mechanically, one of the earliest examples being the talking machine invented by Wolfgang von Kempelen in the 18th century (Jurafsky and Martin, 2009). The input of the model is a sequence of parameters that specifies how the machine should be tuned. The model usually involves passing airflow through adjustable tubes with constrictions that mimic the vocal tract. Such a model requires extensive knowledge of articulatory movements, but the data is often obtained by invasive or arguably harmful means (like MRI or X-ray). It is also immensely difficult to produce voices of good quality with this method within the confines of a practical model. The intrusive nature and complexity make this approach mostly obsolete now, but it is still studied in related fields such as audio-visual synthesis.

Figure 2.3: A complex periodic wave consisting of a 100 Hz and a 1000 Hz simple wave (left) and an aperiodic wave with an irregular pattern (right) (Johnson, 2003)

2.3.2 Formant Synthesis

In acoustic phonetics, sound waves are either periodic or aperiodic: the former repeat in a regular pattern, while the latter lack any repeating pattern (Johnson, 2003). For voiced sounds such as vowels, the waveforms are complex periodic waves consisting of multiple simple waves (see Figure 2.3), each sound having its distinctive simple-wave components. As vowels are characterized by their formants, studying and modifying the composed waves can change the pitch and other acoustic qualities. This feature of sound gives rise to the approach of formant synthesis. The sound source is passed through formant filters that block certain frequencies, producing sounds with the required formants. For unvoiced sounds with aperiodic waves, white noise usually serves as the source to create the friction or obstruction in the consonants. Although this approach is more manageable and produces better results than articulatory synthesis, the output quality is still far from natural. The Klatt synthesizer (Klatt, 1980), one of the most complex formant synthesizers, built in the 1980s, is described as intelligible1, but the limitations of the rule-based model make it hard to capture the dynamics of natural speech. For this reason, this approach is no longer popular.

2.3.3 Diphone Synthesis

As the predecessor of unit selection, diphone synthesis follows a general outline very similar to what we have introduced in the previous sections. The main differences are the size of the speech database and the signal processing procedure (Jurafsky and Martin, 2009). A diphone is a unit that stretches from the middle of one phone to the middle of another. The idea is that by concatenating these units, we can avoid the gaps between phones, as they are naturally joined. A diphone database stores one copy of each diphone, so if there are N phones in the language, there will be at most N² diphones (some combinations might

1You can hear it at http://www.cs.indiana.edu/rhythmsp/ASA/partD.html, along with the output of other earlier synthesizers (Klatt, 1987)

be impossible). The diphones are recorded in context, clipped out of the recordings, and labeled to create the database. To generate the speech, the diphones are chosen according to the internal representation. As the units may differ in pitch, prosodic adjustment is necessary before the concatenation. One common technique is the TD-PSOLA (Time-Domain Pitch-Synchronous Overlap-and-Add) algorithm. The units are lengthened or shortened by repeating or clipping the existing waveform. To modify the fundamental frequency, the waves are framed according to their pitch. The frequency is increased by overlapping the frames, as the waves are brought closer together. Compared to unit selection, diphone synthesis requires more prosodic information, such as duration and fundamental frequency. The approach is still common, as its footprint is relatively small and it requires less computational power, but the naturalness is unavoidably affected by any sort of signal alteration (Jurafsky and Martin, 2009).
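The unit inventory itself is easy to illustrate: given the phone sequence of an utterance (padded with silence), the diphones to fetch are simply the adjacent pairs. A small sketch with a made-up phone sequence:

    def to_diphones(phones):
        """Turn a phone sequence (with boundary silences) into the list of
        diphone units to look up in the database."""
        return [f"{a}-{b}" for a, b in zip(phones, phones[1:])]

    # Hypothetical phone sequence for "hello", padded with silence:
    print(to_diphones(["sil", "h", "e", "l", "o", "sil"]))
    # ['sil-h', 'h-e', 'e-l', 'l-o', 'o-sil']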

2.3.4 Hidden Markov Model-based Synthesis

The Hidden Markov Model (HMM) is a dominant model in the field of automatic speech recognition (ASR), but its generative nature can also be used to synthesize speech. The basic concept is to swap the input and output of ASR in model training and synthesis; but instead of having speech as the direct output of the HMM, parameters for a source-filter system (similar to formant synthesis) are generated. The parameters are then used to produce the speech signals. Unlike the rule-based model in formant synthesis, an HMM learns from the data automatically and is therefore better at handling the dynamic nature of speech. The training creates multiple HMMs by finding the most probable speech parameters given the phonemes and the linguistic representation. In synthesis, the most likely models are chosen according to the specification of the input text. The models are then concatenated to form a sentence-level HMM that generates the parameters for the synthesizer. The details of HMM synthesis are described by Tokuda et al. (2013), who developed many important techniques for smoothing the discrepancies between the HMMs and processing the acoustic representations.

HMM synthesis has a number of advantages over other approaches. While it requires much less memory, the intelligibility and naturalness are reported to be on the same level as unit selection (Tokuda et al., 2013). The training and decoding processes are largely automated and only a small part of the model is language dependent, so it is relatively easy to develop a new voice. This makes HMM synthesis one of the most researched methods in recent years, and there have been a number of approaches combining it with unit selection (Tokuda et al., 2013).

2.3.5 New Experimental Approaches

The rapid progress in machine learning has given rise to many novel methods. Two of the most popular terms are neural networks and deep learning. A neural network is a machine learning model that uses nodes (known as neurons) to process the input cumulatively. The input is often passed through a number of nodes before finally reaching the output stage. The connections between multiple neurons form a net that can have many layers, hence the term deep learning. With neural networks, it is possible to tackle more complex and dynamic problems that are difficult to solve with traditional mathematical, rule-based models. The following paragraph briefly introduces how neural networks have been used in the development of TTS in the past few years.

A number of components in text analysis have been modeled with neural networks, including grapheme-to-phoneme rules (Rao et al., 2015) and intonation prediction (Ronanki et al., 2016). For waveform synthesis, WaveNet2 uses neural networks to model the waveforms directly (van den Oord et al., 2016). Char2Wav3 (Sotelo et al., 2017) and Tacotron (Wang et al., 2017) are two recently developed end-to-end synthesis systems, meaning that no text preprocessing is required. While Char2Wav outputs parameters for waveform generation, Tacotron is reported to synthesize spectrograms directly. Arik et al. (2017) introduce Deep Voice, a synthesizer that models all the components with neural networks and requires no prior TTS framework. These experimental approaches have achieved impressive outcomes with relatively small amounts of data (at the cost of intensive computation). Deep Voice, for example, was trained on a database with around 20 hours of speech data, compared to hundreds of hours for a unit selection database. With these new approaches and techniques, we can expect significant changes and progress in the development of TTS within a short period of time.

2https://deepmind.com/blog/wavenet-generative-model-raw-audio/ 3http://www.josesotelo.com/speechsynthesis/

3 An Overview of the Chinese Language

Text analysis is mostly language dependent, except for the new approaches mentioned in the previous section. In our case, many rules and processes in the framework are adapted to the linguistic features of Chinese. We therefore provide an introduction to the aspects of the language relevant to our implementation.

3.1 Introduction: Chinese or Mandarin?

When referring to the standard language, Chinese and Mandarin seem to be used interchangeably among the general public. For cultural and historical reasons, Chinese is often considered by its speakers as one language with many varieties (dialects)1, but the differences between these dialects can be immense. Chen (1999) names seven major dialect groups in the language, and Mandarin (Beifanghua, literally ‘North speech’) is the largest of all. According to the Oxford English Dictionary, the name Mandarin originally referred to the officials of imperial China. European missionaries visiting China in the 16th century recorded a universal version of the language used by the officials (Guanhua, ‘official speech’), hence the name Mandarin (Coblin, 2000). The standard language is based on the Beijing dialect, which is a member of the Mandarin group (Chen, 1999). For this reason, Chinese and Mandarin are both used as names of the language. Chinese speakers often regard the Beijing dialect or Mandarin as synonyms of the standard language, despite the fact that some distinctions do exist (Chen, 1999).

Ethnologue counts more than 1 billion Mandarin speakers: 70% of the 1.3 billion Chinese population and 19 million people in Taiwan speak some form of Mandarin dialect, making Mandarin the language with the most native speakers. Mandarin speakers can also be found in Singapore and in Chinese immigrant communities in many countries. These variants, along with the standard form of the language, do not differ from each other very much compared to the other Chinese dialect groups (Ramsey, 1987). The mutual intelligibility means that the more

1The definition of dialect can be very diverse among linguists. Here we adopt the view that all tongues are dialects; language is either a hypernym of these related tongues, or one of the dialects with more predominance and a standardized form that represents the whole group.

refined classification of Mandarin is usually limited to dialectal linguistic research. It is common to simply use Mandarin for the standard Chinese language, as opposed to tongues in other major dialect groups like Wu, Min, and Yue (Cantonese). All these groups belong to the Sino-Tibetan language family. In this project, the term Mandarin Chinese is used in the title and introduction for clarity, but from now on we will simply refer to the common/official2 form of the language as Chinese.

3.2 Phonology

We previously mentioned that syllable-based units would be well suited for Chinese, as the number of syllables is manageable. A single Chinese character can almost always be mapped to a syllable. As a character is the smallest writing unit (grapheme) of the TTS input, it is reasonable to use the syllable or its variants as the base unit, and there are many such systems (Shih and Sproat, 1996). But for our framework using half-phones, it is easier to divide the syllables into a handful of phones instead of specifying all possible syllables as units. It is nevertheless important to know the syllable structure of Chinese, as tones are syllable-based.

A Chinese syllable can be transcribed as CGVX, where C is an initial consonant, G a glide, V a vowel, and X either a final consonant or a glide (Duanmu, 2007). To reduce the number of phonemes used in the phone set, we have included the glide with the vowel and regard the combinations as diphthongs. The syllable structure can thus be rewritten as CVC, where both consonants are optional. While linguists hold differing opinions on the number of vowels and initial consonants, it is mostly agreed that the only possible final consonants are [ŋ] and [n]. Instead of the usual onset-vowel-coda structure in phonology, it is more common in Chinese to group the final consonant with the vowel and refer to the combination as the final; the initial consonant is accordingly known as the initial.

Tone, the use of pitch to distinguish meaning, is superimposed on the syllable. There are four tones in Chinese, usually transcribed phonetically with the Chao tone letters (Chao, 1930). Chao tone letters cover the pitch variation on a scale of 5, with 5 the highest and 1 the lowest. The four tones can be represented as 55, 35, 214, and 51, or with the four diacritics (ā, á, ǎ, à) on the vowel in Pinyin. They are also conventionally referred to as tones 1 to 4 in Chinese phonology. If no tone is present, the syllable is usually regarded as unstressed and shorter in duration; this is common among a number of function words. More details will be presented in the implementation chapter, where we introduce our phone set.
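To make the initial/final/tone structure concrete, the sketch below parses a tone-numbered Pinyin syllable into its three parts. The initial inventory is the standard Pinyin one; the function itself is our illustration, not part of the prototype:

    # Standard Pinyin initials, two-letter ones first so "zh" matches before "z".
    INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
                "g", "k", "h", "j", "q", "x", "r", "z", "c", "s"]

    def parse_syllable(syllable):
        """Split a tone-numbered Pinyin syllable into (initial, final, tone),
        e.g. 'zhong1' -> ('zh', 'ong', 1). The initial may be empty."""
        tone = int(syllable[-1]) if syllable[-1].isdigit() else 0  # 0 = neutral
        base = syllable.rstrip("012345")
        for ini in INITIALS:
            if base.startswith(ini):
                return ini, base[len(ini):], tone
        return "", base, tone

    print(parse_syllable("zhong1"))  # ('zh', 'ong', 1)
    print(parse_syllable("er2"))     # ('', 'er', 2)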

2Chinese does not have official status in Taiwan despite being the “default” language.

3.3 Phonetic Representation and Romanization

The transcription of Chinese characters using the Latin alphabet dates back to the 19th century, and there have been many attempts to create a practical phonetic representation for Chinese since then. The most widely used is Hànyǔ Pīnyīn, usually referred to as Pinyin (‘spelled-out sound’) for short. Pinyin has been the official romanization system in China since its creation nearly 70 years ago (Chen, 1999). From dictionary entries to input methods, it is now the default phonetic representation for many native as well as non-native speakers of Chinese3. The tones are marked with diacritics on the vowel of the syllable, but numbers are also commonly used for convenience (as in Han4yu3 Pin1yin1). A very important feature of Pinyin is that it is phonemic rather than phonetic, meaning that allophones in complementary distribution may be represented by the same symbol. It is nevertheless a good base for our phone set, as our algorithm will select units from the most appropriate context in the database. We will say more about the mapping between Pinyin and our phone set in the next chapter.

3.4 Morphology: What is a Word?

The concept of word is relatively straightforward for speakers of European languages, but the notion is not directly applicable to Chinese. Packard (2000) illustrates the difference using the term sociological word, proposed by the renowned Chinese linguist Yuen Ren Chao. A word under this definition is “the unit between a phoneme and a sentence that the general public is aware of”. This would be a word in English, but the Chinese equivalent is zì (字), a character or a spoken syllable. A single character can be a word that behaves pretty much as in English, but it is more common to have words that consist of two or more characters, known as cí (詞). For example, 手 is the character for hand, but 手機 (literally ‘hand’ and ‘machine’) means cellphone. New words are created by combining existing characters. Entries in a zì-lexicon remain pretty much the same, as we rarely create new characters now, but a cí-lexicon listing new combinations of characters is expanded all the time. Some combinations may be used to express more complex ideas and are more or less the equivalent of English fixed expressions or phrases, but they are sometimes included in a cí-lexicon as well. In short, a character is a morpheme in Chinese. Many characters have

3It should be noted that Pinyin was only introduced into Taiwan recently. A number of romanization systems have been adopted for names and road signs over the years, causing some degree of inconsistency in transliteration. My name, for example, would be Qiaoting in Pinyin. The most popular phonetic representation in Taiwan is Zhuyin, ‘sound-annotating (symbols)’, with 37 symbols derived from Chinese characters. Zhuyin is still dominant in Taiwan as it is mandatory at school. When Latin transliteration is required, Pinyin is often encouraged. However, as most people in Taiwan are unaware of the differences between the romanization systems, it is not uncommon that an old or even mixed system is used.

their own meanings in isolation and can be used to build up words or larger units like phrases.

Knowing the definition of a word is important for our TTS system, as pauses occur between word groups. It is possible to create a Chinese TTS voice that only considers individual characters, but the result is often poor: pauses may be inserted incorrectly within a word (such as 手 | 機). It is also hard to predict the duration of a syllable when the characters are taken out of their context (the combination), making the voice sound choppy or overlapping. Some characters can be pronounced in more than one way, and while the correct reading is often easy to tell within a combination, we have no means to determine it when the characters are processed separately. It is therefore common to use a tokenizer for Chinese NLP. The process of finding the word boundaries is also known as segmentation in Chinese. It should however be noted that there are no objective criteria for segmentation; Shih and Sproat (1996) report 75% agreement between native speakers. For our TTS system, the segmentation should at least mark up one acceptable version of the word boundaries so that no incorrect pauses are inserted.
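Our prototype relies on a statistical segmenter (ZPar, described in Chapter 4), but the task itself can be illustrated with the classic dictionary baseline, greedy forward maximum matching:

    def max_match(text, lexicon, max_len=4):
        """Greedy forward maximum matching: repeatedly take the longest
        lexicon entry starting at the current position, falling back to a
        single character. A simple baseline, not the method we actually use."""
        words, i = [], 0
        while i < len(text):
            for length in range(min(max_len, len(text) - i), 0, -1):
                candidate = text[i:i + length]
                if length == 1 or candidate in lexicon:
                    words.append(candidate)
                    i += length
                    break
        return words

    lexicon = {"手機", "好用"}  # toy lexicon
    print(max_match("手機很好用", lexicon))  # ['手機', '很', '好用']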

3.5 Writing Systems

Chinese is written in Chinese characters. As with many other written languages, variations existed before a standardized version was established. Among the variations are simplified characters with fewer strokes. Simplified characters remained largely informal until the late 1950s, when the Chinese government introduced a list of simplified characters as the standard character set. The list includes some existing simplified forms as well as some newly created characters based on the cursive script. The scheme also replaced some heterographs with a single character. Meanwhile, the un-simplified, traditional script continues to be used in Taiwan, Hong Kong, and Macao. While many characters are written differently, literate Chinese speakers have no difficulty understanding both scripts, although they mostly choose to write in the system they are familiar with.

The logographic nature of Chinese characters means that variations could be created easily by simply adding or removing strokes, especially before standardization. Large dictionaries may count tens of thousands of characters, but up to 40% of them may be variant characters (Chen, 1999). With computers, however, it is no longer possible to create new variants of characters by modifying the strokes. Characters that are not encoded cannot be input or processed, making the character set even more standardized as the existing characters continue to be used for digital texts. Our TTS system should accept input in both traditional and simplified characters. As long as the text is digitally coded,

our system should have no problem recognizing the characters. Encoding used to be an issue, but as more and more systems support Unicode, we ignore other, less common encoding formats and assume that the input is in UTF-8.

4 Implementation

In this chapter, we present our implementation of text analyzer components in the order shown in the hourglass model in Figure 2.2, as proposed by Jurafsky and Martin (2009).

4.1 Text Normalization

In text normalization, our goal is to convert the input text into words that can be processed later to determine their pronunciation. Two tasks are involved in this step: tokenization locates the word boundaries, and the normalizer turns non-characters into their spelled-out forms.

4.1.1 Tokenization: ZPar

Tokenization, also known as segmentation, is required for Chinese for both phonetic and prosodic reasons. Without the word boundaries, pauses might be inserted incorrectly within a word. The combination of characters also helps us figure out the pronunciation. We use ZPar (Zhang and Clark, 2011)1, a statistical parser that comes with segmentation and part-of-speech tagging features, to tokenize the input. After compiling, ZPar creates a model by training repeatedly on a segmented or tagged text. The model is then used for POS tagging and segmentation. The model for simplified Chinese is trained on the Penn Chinese Treebank 5.0 (CTB5)2. Another model, for traditional Chinese, is trained on the training set of the CKIP corpus provided by Academia Sinica in Taiwan for the Second International Chinese Word Segmentation Bakeoff in 2005 (Emerson, 2005). Unlike the Penn treebank, the CKIP dataset does not come with POS tags, so only segmentation is possible. Once the model is loaded, ZPar is reported to process around 50 sentences per second with satisfactory accuracy (Zhang and Clark, 2011). This is a huge advantage over the other, dictionary-based tokenizers that we tested.

1ZPar can be downloaded at https://github.com/frcchang/zpar/releases. An online manual can be found at http://people.sutd.edu.sg/~yue_zhang/doc/index.html
2Thanks to Yan Shao for providing the model. To use this model, replace ctb5 in line 16 with penn in zpar-0.7.5/src/chinese/tag.h.

A drawback of ZPar is that the model is difficult to adjust once trained. The perceptron-based model is cryptic, rendering it impossible to modify directly. An alternative is to add new words to the training set. Our experiment shows that ZPar is capable of getting a name right after four iterations of training on a small paragraph containing around 150 characters. It is however impractical to retrain the model every time we want to expand the vocabulary, given the size of our training data. It is also unclear how many occurrences of a word are required for the model to learn the pattern. Furthermore, modifying the training sets will affect the performance of ZPar: evaluation has to be repeated on every retraining iteration to make sure that we have the best model available. Despite this issue, ZPar’s performance on the CKIP test set, which has 4.3% OOVs, is a competitive 95% (Zhang and Clark, 2011), hinting that ZPar’s algorithm may be better at predicting the boundaries of unknown words than those of the other bakeoff participants. For our tokenization task, we will continue to use the same models. Further discussion of OOVs continues in the next section.

An issue with the ZPar segmentation module is that it sometimes separates an English word into several parts. Although we do not plan to process Latin letters in this project, it is important to keep them intact. Our guess is that the model fails to learn the English words, as they are rare in the training data. The word-based perceptron algorithm of ZPar also takes the position of a character within a word into consideration (Zhang and Clark, 2011), which means that English words with more letters are likely to be divided under the Chinese model. The issue was still present after we removed all Latin letters from the CKIP training data. Our workaround is a small script that takes the unsegmented and segmented texts as input and replaces the falsely segmented letters with the ones from the original text (sketched below). For some reason, this problem is only found with traditional Chinese texts. Our test shows that simplified Chinese is not affected by the number of English words in the training set.

To sum up, we have two ZPar models, one for traditional and one for simplified Chinese. As the simplified training set contains POS tags, we are able to tokenize simplified Chinese with the POS tagging module. Traditional Chinese uses the segmentation module, which occasionally breaks English words; a script is used to rejoin the separated English letters.
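A minimal sketch of such a repair script, assuming the runs of Latin letters in the original and segmented texts align one to one (the actual script is not reproduced here):

    import re

    # A run of Latin letters, possibly split by the segmenter's spaces.
    LATIN_RUN = re.compile(r"[A-Za-z]+(?:\s+[A-Za-z]+)*")

    def fix_latin(original, segmented):
        """Replace each (possibly falsely split) run of Latin letters in the
        segmented text with the corresponding run from the original text."""
        runs = iter(LATIN_RUN.findall(original))
        return LATIN_RUN.sub(lambda m: next(runs), segmented)

    original = "我用ReadSpeaker的軟體"
    segmented = "我 用 ReadSp eaker 的 軟體"
    print(fix_latin(original, segmented))  # 我 用 ReadSpeaker 的 軟體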

4.1.2 Normalization

Normalization is the process of converting non-characters such as numbers and symbols into their spelled-out form. In a rule-based analyzer, they are usually identified by regular expressions as the patterns are relatively distinctive. Our tokenized text, however, contains spaces and POS tags, making the construction

of the normalization rules very unintuitive. For example, 2017/09/19 is a common way of representing a date, but ZPar’s POS tagging module will label it as 2017_CD /_PU 09_CD /_PU 19_CD. Despite ZPar’s high accuracy, it is impractical to account for the tags while writing the rules. Our solution for now is to perform normalization before tokenization. Our test shows that ZPar has no problem tokenizing a sequence of Chinese characters embedded in the text. The accuracy may drop slightly, as the training data contains few spelled-out forms, but the normalized texts do not seem to pose any other difficulty to the tokenizer.

A common restriction for the normalization rules is that the string we are looking for should not be preceded or followed by any characters other than whitespace. This way, 09 in 2017/09/19 will not be extracted from its context and interpreted wrongly. This restriction is however not applicable to our untokenized Chinese text, as numbers are not delimited by spaces either. Luckily, the restriction in our existing normalizer only specifies numbers, symbols, and a range of Latin letters, so 2017年3 (‘year 2017’) will still be captured by the current rules. This is a fortunate coincidence for Chinese, as we have decided to normalize the text first. The alternative would be adding all characters to the restriction list and creating rules with the spaces in mind. Our current solution can be problematic for rare cases like 碳-14 (‘Carbon-14’), where a character is part of the normalization pattern. While the English equivalent is found successfully with the pattern [letters]-[numbers], 碳 is not a valid letter, and the general normalizer that handles all languages will read the hyphen as minus. Exception rules have to be made for all similar cases. If we wished to solve this problem at its root, the whole normalization process would be slowed down as the restriction list grew many thousands of times larger; again, the text would have to be tokenized first, and the rules would be affected accordingly. As such, we will take advantage of this workaround and leave the restriction unchanged for now. This issue shows that the robustness of a TTS framework can be challenged by a very different language like Chinese.

Unlike Latin letters, Chinese characters take up double width when displayed on computers. For aesthetic reasons, fullwidth Arabic numerals, letters, and punctuation marks also exist so that they can be lined up with characters (compare ＡＢＣ and ABC). Although it is not mandatory to use fullwidth numbers in Chinese text, they are not uncommon and must be taken into consideration. A small dictionary is created to map the fullwidth items to their halfwidth counterparts, and fullwidth characters are corrected in the general normalizer4. The conversion also covers all punctuation marks used in Taiwan and China,

3There is no space between 2017 and 年. Chinese characters are displayed in fullwidth, see the next paragraph.
4Thanks to Andreas Björk for adding the function.

including some special ones for vertical writing.

The Chinese normalizer is based on the Swedish normalization rules written in C++. Chinese normalization lacks some of the most common tasks in Western languages, such as acronyms, abbreviations, and ordinal numbers, but there are also a number of language-specific cases, like classifiers, to work with. Acronyms and abbreviations are represented with selected characters and can be treated as OOVs. For example, 臺大 (Táidà) is the abbreviation of 臺灣大學 (Táiwān dàxué, ‘National Taiwan University’). Ordinal numbers are created simply by adding the prefix 第 dì. Table 4.1 shows a list of common normalization techniques with examples; most patterns can be captured with regular expressions. Conversion to a unified format is required when a concept can be written in many different ways. For ambiguous cases, adjacent words are used to figure out the correct category. For example, 12345678 can be an ill-formatted phone number or a cardinal number; we can tell how to read it by looking for keywords like contact or call around the number. Small dictionaries are used to store pairs of symbols and their spelled-out forms. Although we do not work with English here, the English abbreviations for days of the week and months are kept, as they are very common in news and blog posts online. Sometimes we need to reorder the input for grammatical reasons: the input 12% is read percent twelve in Chinese, despite the fact that the symbol comes after the number. It is common to combine more than one technique from the list in a single rule. The complete list of normalization functions can be found in Appendix A.

Pattern Matching
• (0?[0-9]|1[0-9]|2[0-4]):[0-5][0-9] finds time notations like 23:59
• (0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[0-2]) ([12]\d{3}) finds dates like 09/19 2017

Format Conversion
• 1,000 and 1 000 are converted to 1000
• 19/09-2017 is replaced by 19/09 2017

Keyword Checking
• a long number string close to words like 聯絡 (‘contact’) → phone number
• 3/10 next to time-related words like 星期 (‘week’) → a date instead of a fraction

Dictionary Lookup
• math and currency symbols: =, <, %, $, €, ...
• common English abbreviations: Tel, Mon, Oct, ...

Reordering
• 12% → 百分之 (‘percent’) 十二 (‘twelve’)
• 25°C → 攝氏 (‘Celsius’) 二十五 (‘twenty-five’) 度 (‘degree’)

Table 4.1: Common normalization techniques and examples
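Two of these techniques are easy to sketch in code: the fullwidth-to-halfwidth mapping described earlier, and a pattern-matching rule for times. The regular expression is the one from Table 4.1; the spoken form returned below is a placeholder, as a real normalizer would spell the digits out in characters:

    import re

    def to_halfwidth(text):
        """Map fullwidth ASCII variants (U+FF01-U+FF5E) and the ideographic
        space (U+3000) to their halfwidth counterparts."""
        out = []
        for ch in text:
            code = ord(ch)
            if 0xFF01 <= code <= 0xFF5E:
                out.append(chr(code - 0xFEE0))
            elif code == 0x3000:
                out.append(" ")
            else:
                out.append(ch)
        return "".join(out)

    TIME = re.compile(r"(0?[0-9]|1[0-9]|2[0-4]):([0-5][0-9])")

    def read_time(match):
        # Placeholder spell-out: 23:59 -> 23點59分 (hours + minutes).
        return f"{int(match.group(1))}點{int(match.group(2))}分"

    print(TIME.sub(read_time, to_halfwidth("２３：５９")))  # 23點59分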

An interesting language-specific rule is the distinction between 二 (èr) and 兩 (liǎng) in Chinese. Both mean two, but they are used in different circumstances.

Èr is used for counting and ordinal numbers, as in 第二 (‘second’) and 一二三 (‘one two three’). When followed by a classifier, liǎng is chosen instead (兩本書, ‘two’ + classifier + ‘book’). A function is written to check the POS tag of the following character and replace èr with liǎng when it is a classifier (see the sketch at the end of this section). Number names like hundred (百), thousand (千), and ten thousand (萬) are also treated as classifiers, so the spell-out function for numbers must be changed accordingly. The function is also modified to accommodate the number notation system: instead of breaking numbers into groups of three digits, Chinese groups them by four digits, based on ten thousand. Numbers joined by a hyphen or a tilde are commonly used to represent a range of dates, times, or amounts; these patterns are handled with rules that convert the symbol to the correct spelled-out form. Besides language-specific rules, local knowledge is sometimes required for patterns like phone numbers.

Other than the names of months and days, many English abbreviations are not translated, particularly scientific units. For example, people say GB rather than the Chinese translation of Gigabyte. Many modern TTS systems are bilingual to some degree to handle cases like this, as English loanwords are ubiquitous, especially on the Internet. Moreover, such terms are likely to have different translations across the varieties of Chinese, as they are newly coined (十億位元組 versus 吉字节 for GB). Creating two voices for the varieties of Chinese is a good way of resolving the problem, pretty much like having Z read as [zi] by an American English voice but [zEd] by a British one. For our system, we have chosen to translate some of the most common terms and leave those with diverging translations untouched. The normalized output is in traditional Chinese, but the normalizer also recognizes context-related keywords in both scripts. Simplified text may thus contain a number of traditional characters from the normalizer, which are added manually to the lexicon for later lookup.
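A sketch of the èr/liǎng replacement described at the beginning of this section, assuming tagged tokens in (word, tag) form; we use ‘M’, the Penn Chinese Treebank tag for measure words (classifiers), but the exact tag set and function are assumptions, not the production code:

    CLASSIFIER_TAGS = {"M"}  # measure word (classifier) in the CTB tag set

    def fix_er_liang(tagged_tokens):
        """Replace 二 (èr) with 兩 (liǎng) when the next token is a classifier."""
        out = []
        for i, (word, tag) in enumerate(tagged_tokens):
            next_tag = tagged_tokens[i + 1][1] if i + 1 < len(tagged_tokens) else None
            if word == "二" and next_tag in CLASSIFIER_TAGS:
                word = "兩"
            out.append((word, tag))
        return out

    print(fix_er_liang([("二", "CD"), ("本", "M"), ("書", "NN")]))
    # [('兩', 'CD'), ('本', 'M'), ('書', 'NN')]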

4.2 Phonetic Analysis

Normalized text is converted to phonetic transcription at this stage. Most words can be looked up in a pronunciation lexicon incorporated in the TTS system. For our implementation, we create two lexicons as text files: one in traditional and one in simplified Chinese. From now on, we will refer to the internal pronunciation lexicon of our TTS system as the lexicon, while an external source used for building the lexicon is known as a dictionary. OOVs are broken down into individual characters to be checked in the lexicon. For languages like English, heteronyms can be grouped into categories like noun or verb. In Chinese, ambiguity is mostly resolved by the context, but telling the remaining heteronyms apart requires more syntactic or semantic information. The phonetic representation is called the r-sampa phone set, a variant of SAMPA created by ReadSpeaker. The phone set will be introduced in more detail in 4.3.
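The lookup-with-fallback logic can be sketched as follows. The transcriptions are illustrative r-sampa-like strings, and the function is our own simplification of the process:

    def transcribe(word, lexicon):
        """Look the word up in the lexicon; if it is an OOV, fall back to
        concatenating the transcriptions of its individual characters."""
        if word in lexicon:
            return lexicon[word]
        syllables = []
        for char in word:
            if char not in lexicon:
                return None  # unknown character: nothing we can do
            syllables.append(lexicon[char])
        return " . ".join(syllables)

    # Toy single-character entries with made-up transcriptions:
    lexicon = {"手": "3S ou", "機": "1ts i"}
    print(transcribe("手機", lexicon))  # 3S ou . 1ts i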

4.2.1 Lexicons

Traditional Chinese Lexicon

The base of the traditional Chinese lexicon is the Revised Chinese Dictionary (重編國語辭典修訂本) by the Ministry of Education in Taiwan5. A Python script is written to process the entries, and the pronunciation is created by a function mapping Pinyin to r-sampa symbols. There are two types of entries: zì with a single character and cí containing multiple characters. POS tags are available for most zì entries, but there is usually more than one per character. For example, the character 手 is hand, but in 手機 (‘cellphone’) it is regarded as an adjective that means handy or small. Cí entries do not have any POS tags.

The dictionary contains 166,120 entries, of which 11,933 are individual characters. All variant characters are removed, as many of them are unencoded and their pronunciations would have to be checked manually as well. Fortunately, variant characters are very rare with digital input methods, as users can only choose from the encoded character list. After removing invalid entries, our traditional Chinese lexicon has 164,225 entries in total, which is a reasonable size compared to other languages. However, the dictionary rarely adds new entries despite its authoritativeness: only a dozen new words made their way into the dictionary in the latest update. Many entries are also outdated, but removing them arbitrarily is not a good idea either. Many new words will thus be treated as OOVs. But as our lexicon covers a good number of characters, we should be able to generate the pronunciation of OOVs by breaking the words down, and we can always expand the lexicon by adding new words.

Pinyin is relatively phonemic, so the conversion to our phonetic representation is straightforward. The function takes the initial and final of the Pinyin syllable and maps them to r-sampa. There is however the special case of the rhotic coda. In Beijing Mandarin, the diminutive form is created by adding an “r” to the word; the suffix is realized as the character 兒 (ér), for example in 哪兒 nǎr (‘where’). The rhotic coda is not as common in Taiwan Chinese, and the suffix is often omitted or read as a separate syllable. The r remains as the coda in Pinyin, but we mark it as a separate syllable ér in the transcription, as our speech database does not read it as a rhotic coda.

There are eleven POS tags in the dictionary: noun, pronoun, verb, adjective, adverb, preposition, conjunction, interjection, Chinese particles6, onomatopoeia,

5The exe files of the dictionary are available at http://resources.publicense.moe.edu.tw/
6Chinese makes use of a number of syntactically independent grammatical elements known as particles to indicate mood, aspect, or pragmatic differences (Huang et al., 2014).

and affix. The first eight are found in other lexicons and are converted accordingly. Except for onomatopoeia, which is merged with interjection as in English, new tags are created for the new POS categories. The complete POS list and the lexicon format can be found in Table 4.2. As we do not plan to use most of the POS tags, they are kept mainly for reference. There is however the classifier tag, which we need for the classifier rule in the normalizer. Classifiers are labeled as Chinese particles in the dictionary; we have selected the 70 or so most common classifiers and appended the tag to their entries for further reference.
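The Pinyin-to-r-sampa conversion mentioned above can be sketched as follows. The symbol mappings are read off the sample entries in Figure 4.1 and cover only those syllables; the full chart is in Appendix B:

    # Partial mappings inferred from Figure 4.1 (xing1 -> "1S i N",
    # biao3 -> "3p iau", zuo4 -> "4ts uO"); illustrative only.
    INITIAL_MAP = {"x": "S", "b": "p", "z": "ts"}
    FINAL_MAP = {"ing": "i N", "iao": "iau", "uo": "uO"}

    def pinyin_to_rsampa(initial, final, tone):
        """Map a parsed Pinyin syllable to r-sampa; the tone digit is
        prefixed to the first phone, as in the lexicon samples."""
        phones = FINAL_MAP.get(final, final)
        if initial:
            phones = INITIAL_MAP.get(initial, initial) + " " + phones
        return f"{tone}{phones}"

    print(pinyin_to_rsampa("x", "ing", 1))  # 1S i N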

Simplified Chinese Lexicon

The source of the simplified Chinese lexicon is a text file of the Contemporary Chinese Dictionary (现代汉语词典 Xiàndài hànyǔ cídiǎn)7. The file is not very well formatted. The Pinyin syllables are written run together, which can cause ambiguity for some transcriptions. The original file also contains non-standard letters and symbols that make importing the transcription extremely difficult. We end up using pypinyin8, a Python package that converts words into Pinyin. The Pinyin transcription is then turned into r-sampa.

As in the traditional Chinese lexicon, all variant characters that cannot be displayed are removed. The lexicon has 64,849 entries, 10,442 of which are single characters. The smaller coverage is due to the fact that the traditional lexicon contains many old-fashioned words and idioms that are very unlikely to be used in our system. On the other hand, the number of individual characters is very close to that of the traditional lexicon, which means the system should have no trouble creating the pronunciation of OOVs based on these characters.

Although pypinyin incorporates a dictionary and is capable of providing accurate transcriptions for most words, it does not work well on single-character entries with multiple pronunciations. Without the context, it is impossible to tell how a character should be read, and pypinyin will simply choose the most common reading. By listing the duplicated lines we have found around 900 entries that are possibly assigned the wrong pronunciation. The actual number may be smaller, as identical entries are not necessarily read differently: they may have unrelated meanings and thus be listed as separate words despite sharing the same pronunciation. The correction of these entries can only be carried out manually, but it would be impractical to go through all of them due to time constraints. Most disyllabic words should be found in the lexicon with the right transcription, while the monosyllabic ones have the most common pronunciation given by pypinyin.
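As a rough illustration of this step, pypinyin can supply numbered Pinyin for an entry, which can then be fed to a mapping function like the one sketched in the previous section. This is only a minimal sketch of the idea, not our exact script:

# pypinyin supplies tone-numbered Pinyin for each character of a word.
from pypinyin import pinyin, Style

def entry_pinyin(word):
    # Style.TONE3 appends the tone digit to each syllable, e.g. 'zuo4'.
    return [syls[0] for syls in pinyin(word, style=Style.TONE3)]

print(entry_pinyin("星座"))   # -> ['xing1', 'zuo4']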

7The file is available on Github. 8The documentation for pypinyin is available at https://pypi.python.org/pypi/pypinyin

About 80% of the entries have at least one POS tag. Besides the tags given in the traditional lexicon, the dictionary has two extra labels: numeral and classifier. The POS labels are retrieved by looking at the beginning of the definitions, as this is where the tags are found in a well-organized entry. As the tag characters are also common characters, occurrences elsewhere in the definition are ignored to avoid extracting the wrong labels. Since we only use the classifier tag for our analysis, the other POS tags are not as crucial.
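A simplified sketch of this extraction step is given below. The tag characters and their mapping to our tag set are illustrative assumptions about the file format rather than the exact script we use:

# Only a tag character at the very start of a definition counts as a POS tag.
TAG_MAP = {"名": "NN", "动": "VB", "形": "JJ", "量": "CL", "数": "RG"}

def extract_pos(definition):
    return TAG_MAP.get(definition[:1])

print(extract_pos("名 天空中发光的天体"))   # -> NN
print(extract_pos("天体的名字"))            # -> None (名 is not at the start)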

Formatting the Lexicons

Figure 4.1 shows a number of sample entries from both lexicons. An entry contains the word, the transcription, and optional POS tags (the translations and line numbers are only provided here for reference). The columns are separated by tabs. The first four lines are extracted from the traditional lexicon, while the last two are from the simplified database. Single-character lines usually have a number of POS tags, as their combination with other characters can change the POS category. Multi-character words do not come with POS tags in the traditional Chinese lexicon, but they can be tagged in the simplified one. Lines 3 and 5 in the figure illustrate the difference.

1 星 [ 1S i N ] /NN/JJ/AB/ (‘star’)
2 xing1 [ 1S i N ]
3 星表 [ 1S i N . 3p iau ] (‘star chart’)
4 xing1_biao3 [ 1S i N . 3p iau ]
5 星座 [ 1S i N . 4ts uO ] /NN/ (‘constellation’)
6 xing1_zuo4 [ 1S i N . 4ts uO ] /NN/

Figure 4.1: A sample of entries from traditional and simplified Chinese lexicons

The greatest difference between the Chinese lexicons and those for other languages is that the Pinyin transcription is included, so that people who cannot read or type Chinese can still work with the lexicon. But as our system does not allow any extra columns, the Pinyin transcription has to be separated out. In our lexicon, the Pinyin forms are listed as individual entries, as seen in lines 2, 4, and 6 in Figure 4.1. The POS tags of monosyllabic Pinyin words are removed, as many characters share the same transcription. For other entries, the POS labels are preserved. Entries with the same transcription but different POS tags are imported as separate entries. Underscores are inserted because spaces are not allowed within a word in the other languages. Table 4.2 shows the complete POS tag list for both lexicons.

Speech Assessment Methods Phonetic Alphabet (SAMPA) is a phonetic transcription system based on IPA. SAMPA maps IPA symbols to characters available on the keyboard so that the transcription can be input and processed more easily by computers. SAMPA only covers a number of languages, but the X-SAMPA

/NN/: noun      /AB/: adverb        /JJ/: adjective         /CL/: classifier
/PN/: pronoun   /PP/: preposition   /CP/: Chinese particle  /FX/: affix
/VB/: verb      /KN/: conjunction   /IN/: interjection      /RG/: numeral

Table 4.2: Part of speech tags used in the lexicons.

extension includes the whole IPA chart9. R-sampa is a variant of the SAMPA system created by ReadSpeaker for internal phonetic transcription. As with IPA, the r-sampa symbols are surrounded by square brackets. The complete Chinese phone set can be found in Table 4.3.

4.2.2 Out-of-Vocabulary Words

Out-of-vocabulary words are usually handled by grapheme-to-phoneme rules for alphabetic writing systems. In Chinese, such rules are mostly replaced by looking up the individual characters in a word, as the number of graphemes is too large. The pronunciations of the individual characters are then put together to generate the transcription of the word. This approach can be used in many other languages, for example for compound words (sammansättningsord) in Swedish as well as in other Germanic languages. We therefore make use of the existing script for generating OOV word pronunciations in Swedish10. The OOV word is divided into a number of possible combinations and each part is assigned a cost. The function then seeks the smallest possible total cost for the word. For example, if a word contains three characters, the combination of 2 and 1 has a lower cost than three 1s added together. The idea is to preserve the largest possible chunks found in the lexicon, as they are more likely to reflect the actual pronunciation of the word. This is particularly important in Chinese, as neighboring characters often reveal the pronunciation of heteronyms. This method should be able to predict most OOVs, although characters with multiple pronunciations can be problematic in isolation. Disambiguation is then required to determine the correct pronunciation. To solve the problem at its root, frequent OOVs can be added to the lexicon for further reference. A minimal sketch of the cost-based lookup is given below.
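The sketch illustrates the idea under simplifying assumptions: the cost of a split is just its number of chunks, every chunk must be a lexicon word or a single character, and the lexicon is a plain dict. It is not the actual Swedish decompounding script:

def decompound(word, lexicon):
    best = {0: []}   # best[i] = cheapest segmentation of word[:i]
    for end in range(1, len(word) + 1):
        for start in range(end):
            chunk = word[start:end]
            if start in best and (chunk in lexicon or len(chunk) == 1):
                split = best[start] + [chunk]
                if end not in best or len(split) < len(best[end]):
                    best[end] = split
    return best[len(word)]

# Illustrative transcriptions in the lexicon format of Figure 4.1.
lexicon = {"水彩": "3rs uei . 3tsh ai", "筆": "3p i"}
print(decompound("水彩筆", lexicon))   # -> ['水彩', '筆'] (cost 2 beats three 1s)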

4.2.3 Disambiguation

Disambiguation is the task of determining the correct pronunciation of homographs in a language. Homographs have the same written form but different meanings and sometimes different pronunciations. Homographs of the latter kind are known as heteronyms and are a main concern of speech synthesis. In our rule-based systems for other languages, disambiguation usually relies on language-specific information

9X-SAMPA home page: http://www.phon.ucl.ac.uk/home/sampa/x-sampa.htm 10Thanks to Erik Margaronis for providing the explanation to the script.

like POS tags or the context. The same approaches are not directly applicable to Chinese, but they provide some general ideas about what a disambiguation script for Chinese should look like.

Fortunately, a rather simple way of dealing with heteronym characters in Chinese is to add the related words to the lexicon. For example, 樂 can be read as lè (as in 快樂, ‘happy’) or yuè (in 音樂, ‘music’), but as long as the words are listed in the lexicon, the system should have no difficulty in providing the correct pronunciation. In most cases, a multi-character word does not need any disambiguation even though it may contain heteronym characters. True heteronym cí are extremely rare11. The tricky part of disambiguation lies in OOVs with guessed pronunciations and in individual characters.

Before tackling Chinese disambiguation, we will have a look at how other languages deal with heteronyms. POS tags work relatively well for English. For instance, a word preceded by the is likely to be an adjective or a noun, so project in the project should be read with stress on the first rather than the second syllable. Beyond project, English has a large inventory of heteronyms that can be distinguished by checking POS tags. We do not have a tagger for our system, but the surrounding words are usually useful enough for guessing the POS label of a word. Another approach is to look at the context. This is very similar to telling phone numbers apart from cardinal numbers with keywords. Bass can be the instrument or the fish and both are nouns, but if words like music and play occur in the same sentence, it is probably referring to the instrument. Heteronyms are ordered by their frequencies and listed in a separate lexicon. When the context or the tags provide no information, the more common reading is chosen.

The above approaches are not as feasible in Chinese, especially for OOV words that are broken down into characters. Chinese lacks distinctive POS groups that are related to pronunciation, and characters usually have too many POS tags anyway. This does not mean that such words do not exist12, but we have no means of knowing the POS tags simply by looking at adjacent words. Moreover, the POS tags provided by our dictionaries are incomplete. There is however one category that can be targeted in Chinese. Many Chinese characters are pronounced differently when used as family names. A possible way is to create two lists: one containing the family names and another the common titles. If any of the characters in the first list is followed by an item in the second one, then it should be read as a family name13. This rule does not work for a surname followed by a first

11Around a dozen examples are found in an article on a language forum, but we only agree with one of them. Another example that we can think of is 老子, which can be Lǎozǐ (a Chinese philosopher) or Lǎozi (an old man, or an arrogant way of referring to oneself). 12Tones are sometimes used to distinguish different POS. An example is 鑽 in 鑽洞 (zuān dòng, ‘drill a hole’) and 電鑽 (diànzuàn, ‘electric drill’). 13A Chinese surname comes before the title (for example, 曾教授, literally ‘Zeng Professor’). A typical

name, but it is at least a more generic approach. For other characters, the ranking in the heteronym lexicon should consider the frequency of each pronunciation in isolation. For example, 娜 can be nuó or nà, but it is always nà when used to transliterate foreign names. As foreign names are more likely to be separated out by the tokenizer and the decompounding function, it is safer to prioritize the latter reading. Due to time constraints, we only managed to implement a number of ranking lists and add a few new entries to the lexicons.

We have however found some interesting disambiguation cases in Chinese that may be difficult to handle with a rule-based system. A number of common Chinese verbs have several meanings and are pronounced differently. One example is 倒 (dǎo/dào). The word is dǎo when a person or object falls, but dào when something is emptied from a container. It is not possible to read a sentence like 瓶子倒 (dǎo) 了, 水倒 (dào) 了出來 (‘the bottle fell and the water poured out’) correctly with only POS tags and parsing, as the syntactic structures are identical. Semantic knowledge would be required to process such cases.
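As an illustration of the surname-plus-title rule discussed above, a minimal sketch could look as follows; the lists and readings are small illustrative samples, not our actual data:

SURNAME_READING = {"曾": "zēng", "單": "shàn"}   # readings as family names
DEFAULT_READING = {"曾": "céng", "單": "dān"}    # more common readings otherwise
TITLES = ("教授", "先生", "小姐", "老師")

def read_heteronym(text, i):
    # A family-name character immediately followed by a title gets its
    # surname reading; otherwise fall back to the default reading.
    char = text[i]
    if char in SURNAME_READING and text.startswith(TITLES, i + 1):
        return SURNAME_READING[char]
    return DEFAULT_READING.get(char)

print(read_heteronym("曾教授來了", 0))   # -> zēng (surname before a title)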

4.3 Internal Representation: the Phone Set

In phonetic analysis, the input text is converted into phonetic transcription. The most common notation is the International Phonetic Alphabet (IPA). As many IPA symbols are not immediately accessible during input, the SAMPA mapping and its extension were created to map the IPA chart to characters found on the keyboard. We use the SAMPA variant r-sampa, used exclusively at ReadSpeaker for transcription. Our goal is to design a phone set that is compatible with Pinyin while differentiating some allophones that may cause confusion to non-Chinese speakers who work with our system. Allophonic differences can be resolved by the context in a large speech database, but for our prototype, a balance between phonemic and phonetic seems to work better. The phonological background is mainly based on Duanmu (2007), Lee and Zee (2003), as well as Karen Chung’s lessons on transcribing Mandarin with IPA14.

We identify 22 consonants in total. [t͡ɕʰ], [t͡ɕ], and [ɕ] are listed as separate phonemes in Lee and Zee (2003), but Duanmu (2007) views them as palatal variants of [t͡s], [t͡sʰ], and [s]. We decide to keep them as individual phonemes, as Pinyin also distinguishes between these phones. [ŋ] is only possible as a coda, while all consonants but [n] are limited to initial position. [ɕ] and the retroflex consonants are created based on the Swedish phone set. Aspiration is marked with an h attached to the phone. [h] is often transcribed as [x] by Chinese phonologists, but since Pinyin uses h and there is no contrast between [x] and [h] in Chinese, we have adopted [h].

Chinese name has the family name preceding the given name. 14Available at http://ocw.aca.ntu.edu.tw/ntu-ocw/ocw/cou/101S102

IPA    r-sampa  Pinyin      IPA    r-sampa  Pinyin
p      p        b           i      i        i
pʰ     ph       p           u      u        u
m      m        m           y      y        ü/u
f      f        f           a~ɑ    a        a
t      t        d           ɔ      O        o
tʰ     th       t           ə      @        e
n      n        n           ɚ      @r       er
l      l        l           ai     ai       ai
k      k        g           ei     ei       ei
kʰ     kh       k           ɑu     au       ao
h (x)  h        h           ou     ou       ou
t͡s     ts       z           ia     ia       ia
t͡sʰ    tsh      c           iɛ     iE       ie/-ia(n)
t͡ɕ     tS       j           iɔ     iO       io
t͡ɕʰ    tSh      q           ua     ua       ua
s      s        s           uə     u@       ue
ɕ      S        x           uɔ     uO       uo
ʈ͡ʂ     rts      zh          yɛ     yE       ue
ʈ͡ʂʰ    rtsh     ch          iau    iau      iao
ʂ      rs       sh          iou    iou      i(o)u
ʐ      rz       r           uai    uai      uai
ŋ      N        ng          uei    uei      u(e)i

Table 4.3: Chinese phone set for internal representation. Consonants are listed to the left and vowels to the right.

Most analyses propose at least five vowels for Chinese, although they are sometimes transcribed differently. The -i final in Pinyin is sometimes treated as a syllabic consonant after some dental and retroflex initials (Duanmu, 2007), while others identify it as another high vowel. We consider these variants allophones of [i] here, for simplicity and to keep the syllable structure uniform. The correct variant should be picked according to the context. The other high vowels [u] and [y] are mostly uncontroversial, while the low vowel has been transcribed with symbols ranging between [ɑ] and [a]. We have chosen to represent this phone with the simple [a], as in Pinyin. The mid vowels have a number of variants. [ɤ] is used by some for the syllable e (as in 惡, è), but it tends to merge with the lower [ə] in unstressed syllables (like 的 de). As [ɤ] is limited to stressed open syllables and Pinyin uses e for both cases, we have adopted [ə]. [ɔ] is not listed as a separate phoneme in either of our references. Lee and Zee (2003) consider it an allophone of [u], while Duanmu (2007) argues that it can only occur in isolation in non-standard words like interjections. We however keep it not only for interjections, but also for the

combination written as -ong in Pinyin. The phone may be an allophonic variant, but it would be rather confusing to transcribe the sound with [u] when Pinyin says otherwise. It is also acoustically distinct from the normal [u]. The last vowel is the rhotic vowel, which is written as er in Pinyin. It is not to be confused with [ʐ], represented by r in Pinyin: [ʐ] is only possible in syllable-initial position, while [ɚ] is either a syllable on its own, or attached to the coda to form the diminutive.

We choose to represent glides with their vowel counterparts to reduce the number of phonemes. They are mostly written with vowels in Pinyin, except in syllable-initial position. This gives rise to a number of diphthongs and triphthongs that may not be conventional for some Chinese phonological analyses, but we have found that they work well for our system. The 11 diphthongs are those proposed by Lee and Zee (2003). [iɔ] and [uə] can only occur in closed syllables ending with [ŋ] and [n] respectively. Duanmu (2007) identifies the first four and replaces the [i], [u], and [y] in the rest with the glides [j], [w], and [ɥ]. [ɑ] is used in [ɑu] in IPA, as the low vowel is affected by the following back vowel, but as we do not distinguish the low vowels, the r-sampa transcription remains the same. We also choose [ɛ] and [e] for diphthongs where they are pulled forward by the adjacent front vowels [y] and [i]. -ian in Pinyin is transcribed as [iɛn] for the same reason, although the underlying vowel is [a]. [iɛ] is also a valid final on its own. Duanmu (2007) views the triphthongs as combinations of the four diphthongs and glides. Letters in parentheses are omitted in Pinyin. Appendix B shows the complete Pinyin to r-sampa mapping for all regular syllables listed in Duanmu (2007).

Although tones behave much like phonemes in Chinese, they cannot be marked in the same way, as no speech segment can be assigned solely to a tone mark. We therefore repurpose the stress symbols used for other languages to mark the tones, as in Table 4.4. The marks are only used during segmentation, as tone numbers overlap with some phonemes in other languages and confuse the segmentation system. The conventional tone numbers can still be seen in the lexicons. The symbols are prefixed to the initial consonants and also serve as syllable boundaries. For unmarked neutral tones, a dot is used to break the syllable.

Tone 1              %
Tone 2              %%
Tone 3              ”
Tone 4              ””
Syllable boundary   .

Table 4.4: Tone marks mapping for the Chinese phone set
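A sketch of rewriting the lexicon tone digits into these prefixed marks is shown below. Straight double quotes stand in for the marks printed as ” above, and the input format follows Figure 4.1; this is an illustration of the conversion, not our segmentation tool:

TONE_MARKS = {"1": "%", "2": "%%", "3": '"', "4": '""'}

def mark_tones(transcription):
    out = []
    for phone in transcription.split():
        if phone[0] in TONE_MARKS:
            if out and out[-1] == ".":
                out.pop()            # the tone mark itself marks the boundary
            out.append(TONE_MARKS[phone[0]] + phone[1:])
        else:
            out.append(phone)        # neutral-tone syllables keep the dot
    return " ".join(out)

print(mark_tones("1S i N . 4ts uO"))   # -> %S i N ""ts uO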

4.4 Prosodic Analysis

The goal at this stage is to produce natural-sounding prosody for the input text. This is a challenging task, as prosody is mostly implicit and usually unmarked in the input text. With a large enough speech database, a unit selection system should be able to preserve suprasegmental features at the word or phrase level from the recording, but reading a whole sentence naturally can still be difficult. For our system, punctuation and the tone marks are the two hints we can use to improve the prosody. We will thus concentrate on these aspects.

4.4.1 Prosody Beyond Tones

There is relatively little research on Chinese stress and intonation compared to tones. As all three prosodic features make use of pitch, it is interesting to see how they interact with each other in a tonal language like Chinese. A number of sentence structures have been identified as having distinctive intonation patterns, which affect the range available for the tones (Shen, 1990). The difference between tones lies in their relative values: a question with a rising intonation pattern makes the overall pitch range higher, but the relative pitch heights of the tones remain unchanged (Shen, 1990). Stress is realized in a similar way. A stressed word is reported to have a wider pitch range than normal syllables (Duanmu, 2007). Unstressed words are further marked by the lack of tone (the neutral tone) and shorter duration in Chinese, the most common examples being function words. Unlike in English, native speakers of Chinese are barely aware of regular word stress, nor is there general agreement on the topic among researchers (Duanmu, 2007).

Sentence-level intonation is usually generated based on punctuation. For our implementation, we break the sentence and insert a pause whenever a comma is present, as sketched below. If the text is tokenized correctly, no pauses should occur within words, although some may slip in as automatic segmentation is not completely foolproof. Question marks would serve as a hint for rising intonation, but this would only work if our speech database contained sufficient examples of questions. Pitch range differences are supposedly speaker-specific, and a model of pitch range can only be designed based on observation of the data. On the other hand, it is perhaps better to preserve rather than to generate and modify the pitch range, as we have not yet come up with an efficient prosodic notation for capturing intonation. Modification would also make the output sound less natural. In short, the dynamics of intonation and the lack of a good prosodic representation mean that reproducing natural intonation is difficult for our rule-based system. The acquisition of prosody largely depends on the size of the database. Our prototype is limited in this way and can hardly handle prosodic features beyond the explicit punctuation, although these are arguably less crucial than tones.
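A minimal sketch of the comma rule follows, assuming a pause is represented by a break symbol between phrases; both the fullwidth and the ASCII comma are handled, and the break symbol is a hypothetical placeholder:

import re

def insert_pauses(sentence, break_symbol="<pause>"):
    phrases = [p for p in re.split(r"[，,]", sentence) if p]
    return f" {break_symbol} ".join(phrases)

print(insert_pauses("瓶子倒了，水倒了出來"))   # -> 瓶子倒了 <pause> 水倒了出來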

4.4.2 Third Tone Sandhi

We now return to the more perceivable prosodic feature: tones. Tone sandhi is the change of a tone caused by an adjacent tone (Ladefoged and Johnson, 2014). Tone sandhi is not exclusive to Chinese and has been found in many tonal languages around the world (Gandour, 1978). The change is triggered by certain phonological environments and is language-specific. Two of the most distinctive and well-known tone sandhi rules are the third tone sandhi and the yi- and bu-sandhi, both of which are implemented in our system. Third tone sandhi is the change of a third tone (written T3 for Tone 3) when it is followed by another T3 syllable. An example is given below in sentence 1. Both characters in the greeting are T3 in isolation, but when combined, the first syllable turns into the second tone.

(1) Nǐ + hǎo → Níhǎo
    you + good
    ‘Hello’

Tone sandhi can apply above the word level. When there are more than two T3 syllables in a row, the sandhi continues until no two T3 syllables are adjacent. Word boundaries are used to decide whether a tone change takes place, and there may be more than one level of change. Both examples below consist of three T3 syllables, but the change occurs twice in sentence 3, while only one syllable becomes T2 in sentence 2. This is because shuǐcǎi (‘watercolor’) has already undergone the change at the word level, making it sound like 23. Adding another character to its right creates another 33 sequence, which triggers a second change on the second syllable. On the other hand, the 323 combination in sentence 2 does not require any further breaking apart of the T3s. In short, there are two possible tone patterns for a three-T3 sequence, as shown in the example sentences. Some phrases can be tokenized either way, creating two different tone patterns and meanings. We have no means to determine which meaning is intended from the input text and have to rely on the given word boundaries. The tokenized result is our guideline for the whole tone sandhi process.

(2) Mǎi + shuǐcǎi → Mǎishuícǎi
    buy + watercolor
    ‘Buy watercolor (paint)’

(3) Shuǐcǎi + bǐ → Shuícáibǐ
    watercolor + pen
    ‘Watercolor brush’

Although tone sandhi occurs in even the most common words and expressions, the tone change is not written down. The preliminary test of the phone set reveals that the tone transcription sometimes does not match what we hear. It

is because many syllables are marked with their underlying T3 while they are pronounced with the second tone. So the first step is to make sure that the transcription in the lexicon and in the sound database matches the actual speech. Most of this can be done automatically with a script, but sequences with more than three syllables have to be checked manually for the correct tone pattern. This correction only affects the phonetic transcriptions. The Pinyin is kept unmodified so that it can be used for lookup in other resources. This way, the tone sandhi within a word is solved.

For the tone sandhi process across word boundaries, we use a function that modifies the transcription when the combination of words triggers the change. The same mechanism is also used for dealing with French liaison. The rule simply changes the tone of the first syllable when a 33 pattern is detected in our transcription. Even if the pattern is preceded by another T3 in the same word, that case has already been taken care of in the lexicon. When every syllable in a three-character sequence is an individual word, we have chosen to change the tone of the second syllable, although the 223 pattern is also possible for some speakers (Duanmu, 2007).

The above rule only works across word boundaries, which means that tone sandhi within an OOV word would not be treated. Instead of being looked up in the lexicon, the pronunciation of an OOV is generated by the decompounding function that breaks a word into characters. We have thus included the same rule within the function, applied across syllable boundaries15. As such, an OOV behaves like the words extracted from the lexicon, and the word-level tone sandhi rules can make any further changes.

Third tone sandhi is also found in code-mixing speech. Unstressed syllables in English have been reported to trigger the process, as they are prosodically similar to the low T3 in Chinese (Cheng, 1968). For a bilingual voice, it would be interesting to see how Chinese tones interact with English stress and to create prosodic rules for such cases. Our test of sentences provided by Cheng (1968) on bilingual voices does not exhibit the tone sandhi change found in human speech, hinting that this may be a direction for future research and improvement.
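As an illustration of the cross-word rule, the sketch below assumes each word is a list of (tone, phones) pairs whose word-internal sandhi has already been resolved in the lexicon. Scanning right to left yields the 323 pattern we chose for three monosyllabic T3 words:

def t3_sandhi_across_words(words):
    # Change a word-final T3 to T2 when the next word starts with T3.
    for i in range(len(words) - 2, -1, -1):
        left, right = words[i], words[i + 1]
        if left[-1][0] == 3 and right[0][0] == 3:
            left[-1] = (2, left[-1][1])
    return words

# 水彩 (already 2-3 from the lexicon) + 筆 (T3) -> 2-2-3, as in sentence 3
words = [[(2, "rs uei"), (3, "tsh ai")], [(3, "p i")]]
print(t3_sandhi_across_words(words))
# -> [[(2, 'rs uei'), (2, 'tsh ai')], [(3, 'p i')]]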

4.4.3 Yi- and Bu-Tone Sandhi

The characters 一 (yī) and 不 (bù) also display a tone change phenomenon when combined with other syllables. For some speakers, a similar tone sandhi process can be observed with 七 (qī) and 八 (bā), but the predominant trend is to leave these words unchanged (Duanmu, 2007). Table 4.5 below shows the tone sandhi rules.

15Thanks to Erik Margaronis for helping with the implementation.

          Before 4th tone    Before other tones
yī (T1)   yí (T2)            yì (T4)
bù (T4)   bú (T2)            bù (T4)

Table 4.5: Yī and bù tone sandhi rules
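A direct encoding of Table 4.5 could look like the sketch below, applied only when yī is not a number (see the discussion that follows); the function is illustrative rather than our exact implementation:

def yi_bu_sandhi(char, next_tone):
    """Surface tone of 一/不 given the tone of the next syllable (Table 4.5)."""
    if char not in ("一", "不"):
        raise ValueError("rule only applies to 一 and 不")
    return 2 if next_tone == 4 else 4   # T2 before a 4th tone, T4 otherwise

print(yi_bu_sandhi("一", 4), yi_bu_sandhi("不", 1))   # -> 2 4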

When used as a number, yī does not undergo any change but retains its original pronunciation. This complicates the issue, as our script cannot distinguish between the different meanings of yī. We therefore take inspiration from the context-checking approach in the normalizer for the transcription modification: if other cardinal numbers or the ordinal prefix 第 (dì) are present, then the yī is likely to be a number as well. Again, both lexicons and the speech database have to be fixed to match the actual pronunciation.

The tone sandhi across word boundaries is implemented likewise, but we have restricted the change to yī and bù as individual words. The reason is that yī and bù keep their original tone at the end of a multisyllabic word and are not affected by the following syllable. We also decide to limit the transformation to combinations of lexicon words, as there appears to be some disagreement on whether the tone change should apply within OOV words like names. The pronunciation depends very much on personal preference, and sometimes both readings are acceptable. For example, both Yīyín and Yìyín are acceptable pronunciations of 一銀 (an abbreviation of 第一銀行, ‘First Bank’) among the people we asked, even though yī is a number there. A number also seems to be the most probable meaning of yī in the other OOVs we can think of. As bù is a negation word, it is rather uncommon in OOVs except in words of foreign origin. We therefore decide to keep the original tone in OOV words for now, until we find out more about the tone sandhi patterns.

Our implementation does not consider the cases where yī and bù occur in a row. In our speech, the tone sandhi process appears to be triggered by the following word, and the closer syllable gets the tone change. Table 4.6 illustrates the rules for the two-syllable sequence when yī is not a number. Although this looks rather straightforward, we have found surprisingly little literature discussing such patterns. Whether most speakers agree with them requires more research. Another remaining task is to work on the disambiguation of yī before applying the tone sandhi rules.

        Before 4th tone    Before other tones
yībù    yìbú (42)          yíbù (24)
bùyī    bùyí (42)          búyì (24)

Table 4.6: Yī and bù sequence tone change

4.5 Waveform Synthesis

In this section, we introduce the speech database that we use for testing the text analysis components and briefly describe the remaining steps of waveform synthesis.

4.5.1 Speech Database

We work with a proprietary speech database for the audio output. The files include the text and a Pinyin transcription, which is mapped to our phone set. The recording lasts around 5 hours and contains 3,650 sentences. The voice talent is a female native speaker of Chinese from China. Although the database was completed more than a decade ago, the recording is of acceptable quality and is suitable for our testing purposes.

A reason for adding a voice is that it is easier to spot mistakes by listening than by reading the transcription. The phone set also requires testing on actual speech to make sure that it really works for the language. Triphthongs were introduced because a syllable sounds much longer than normal when it is assembled from a vowel plus a diphthong. The same issue might not occur for a larger database, but as we have relatively few samples in our recording, we have to be more specific in our representation to reduce the chance of unfitting segments being chosen. The size also limits the productivity of our voice: as there are very few question samples, the system is very likely to fail to generate the correct intonation for such input. Nevertheless, after many revisions of the phone set we have successfully produced intelligible output for the 404 Chinese syllables listed by Duanmu (2007), meaning that our system has at least one copy of the required segments for all syllable types.

4.5.2 Segmentation and Generating the Output

Training the voice requires mapping the transcription to the recording so that the correct speech segments can be extracted. The process is known as segmentation and is done by aligning the most probable phonetic representation to the feature vectors of the sound with HMMs (Sjölander, 2003). Manual segmentation may be applied on top to improve the quality, but this would take much longer. For our voice, automatic segmentation is used, and there are some mistakes and glitches created by misplaced boundaries. This unavoidably affects the output quality, but we will leave it as it is.

The concatenation is based on the algorithm mentioned in Section 2.2.5. For our implementation, we do not use any signal processing to alter the frequency or mask the glitches. The concept of unit selection is to inherit

rather than to generate the features of speech. Signal processing is therefore not as common for unit selection systems, although hybrids that prune away the overlaps between segments do exist (Taylor, 2009). The small size of our database means that a lot would be modified if we used signal processing to smooth over the output, making the voice sound less natural.

5 Evaluation

In this chapter, we briefly describe the common evaluation criteria for TTS systems and introduce some commercially available voices on the market. As our prototype only covers a number of aspects of text analysis, standard evaluation metrics are not entirely suitable for our system for now. We will therefore focus on the comparison between our implementation and other products, hoping to find out more about what we can improve in the future. In the meantime, we will test some of the tricky cases we found and see how these voices perform.

5.1 Evaluation Methods

Intelligibility and naturalness are among the most important factors affecting a TTS voice’s quality (Taylor, 2009), but devising ways of measuring them objectively is not always straightforward. Some common standardized evaluation methods for system tests are listed in the following sections. It should be noted that the “tests” we use to examine the functionality of our implementation are more similar to unit or component tests during development. The data for a system test should ideally be designed by people who have no access to the training data. In our case, our tests of the other voices can be viewed as a kind of system test, but for our own system they are unit tests, as we only test what we have implemented rather than using random input.

5.1.1 Intelligibility

Tests of intelligibility try to determine how well the synthesized speech is understood by human listeners. Some tests focus on comprehension, which does not require the listener to know exactly what word is spoken, only the general idea of the audio. The most common intelligibility test is the modified rhyme test (MRT) (Taylor, 2009). The test data includes several sets of similar words, and the listener is asked to identify which one is spoken. Rothauser (1969) provides 72 lists of phonetically balanced sentences in English known as the Harvard Sentences. Such tests may lose their credibility if the upcoming words are predictable, which has led to the construction of test data that is nonsense but syntactically correct.

One example is the Haskins sentences designed by the Haskins Laboratories (Pisoni and Hunnicutt, 1980). In a sense, our test of all Chinese syllables can be viewed as a kind of modified rhyme test, as we have chosen the words alphabetically based on Pinyin, which means they usually differ in only one phoneme. Taylor (2009) questions the validity of these tests by pointing out that the data is nowhere close to real-world text. They should therefore be considered equivalent to unit tests in computer software rather than system tests. It is worth noting that the test methods we mention here were all devised more than two decades ago. The intelligibility of synthesized voices has improved greatly since then, and the focus of TTS research has long shifted to improving naturalness as good intelligibility has become the norm. It is understandable that these evaluation methods, once considered system tests, have become part of development.

5.1.2 Naturalness

Naturalness is more difficult to measure than intelligibility due to its subjectivity. A test of naturalness usually asks the listeners for their impression of the voice on rating scales. The result can be heavily dependent on the listener’s preferences, and the same score given by two people may reflect quite different quality. The same listener does not always rate a system with the same score, either. Often more than one voice is provided for comparison: the same test set is used to generate the output and the listeners are asked which one is superior.

Naturalness is out of reach for our implementation at this stage due to the limitations of our database and the segmentation quality. Some tone sandhi errors, however, can be viewed as a naturalness factor, as the speech remains comprehensible. Although tones are considered phonemic in Chinese, such cases are less likely to be mistaken for another word given the context. In this sense, our tone sandhi rules can be seen as attempts to improve naturalness.

5.2 Existing Chinese TTS voices

Below is a list of Chinese TTS products that we will test and compare with. Note that this is by no means an exhaustive list: we prioritize those which provide an online demo and are explicitly stated to be TTS services rather than speech synthesis in other fields. An exception is Google Translate, which comes with a TTS function to read out the input text. We also include two providers from Taiwan: Cyberon and ITRS (Industrial Technology Research Institute) TTS. If there is more than one voice from the same provider, we only evaluate the first one unless they are of different variants. In the next section, we will test some

of the problems and cases we found during implementation on these voices. We will also compare with our existing voices at ReadSpeaker, which are purchased from other TTS service providers and are not related to our prototype. Except for ITRS, all the voices are female. The name of the voice is included in parentheses where available.

Mandarin/Chinese in China

• Acapela (Lulu)
• Cyberon (ZhiFen)
• Ispeech
• Neospeech (Hui)
• Nuance vocalizer (Tian-tian)
• ReadSpeaker

Taiwan Chinese

• Cyberon (DaiYu)
• ITRS (Bruce)
• Neospeech (Yafang)
• Nuance vocalizer (Mei-jia)
• ReadSpeaker

Not specified

• Google Translate

5.3 Comparing the Voices

The mini-evaluation is divided into four parts, testing the preprocessing, normalization, disambiguation, and tone sandhi rules of the systems. We are also interested in how the rhotic coda sounds in the different variants of synthesized speech. The results for some voices also come with additional comments on speech quality or the system. The complete results and test cases can be found in Appendix C.

The tests for preprocessing are concerned with the system’s capability of handling variant scripts, namely both traditional and simplified Chinese as well as fullwidth numbers. For normalization, we test number strings, the conventional ways of writing time and date, a few common symbols, and a phone number. We also include some English abbreviations and words to see how foreign graphemes are handled. As we mentioned in 4.2.3, monosyllabic heteronyms are very difficult to deal with, as they often require semantic knowledge to disambiguate. We therefore test some of the cases that we brought up. The tone sandhi

rules are well known in Chinese phonology, but depending on the implementation the output may display different tone change patterns. We may be able to find out more about the structure of the rules by comparing the speech. All test data is in traditional Chinese, except for the first task in preprocessing.

All systems managed to read both the traditional and the simplified script as well as fullwidth numbers correctly. The longest number string most voices can handle falls between 10 and 12 digits, while Google Translate and Nuance read up to 15 and 16 digits respectively. Any number longer than the limit is read one digit at a time, but Ispeech and Nuance Mandarin chose yāo instead of yī for 一 (‘one’), which is used in some situations to make the digits more distinguishable. Google Translate sometimes switches to yāo, but we have not figured out under what circumstances. Cyberon splits any number longer than 3 digits unless it is followed by characters; they have made an exception for the word 年 (‘year’), so a number preceding it will still be split. It is hard to say whether this tactic is preferable. The voice is still capable of reading numbers correctly within the contexts we have tested. On the other hand, ITRS seems to reserve the unsplit reading for certain classifiers. The only miss is that 年 is also on that list, which is often incorrect, as the frequency of the year 2017 is much higher than that of 2017 years.

A common normalization problem is that symbols are missing or misinterpreted. The hyphen in the phone number we tested was mistaken for to by Ispeech and Neospeech. The plus sign was absent in 3+5=8 for some voices. ReadSpeaker Mandarin did not recognize the time format 12:30 and interpreted the colon as a break instead. All systems had no trouble reading the dollar sign correctly.

We chose four English words and a short phrase to test how bilingual a TTS system needs to be. All voices managed to read the letters in IBM, but only two got IKEA right. About half of the voices successfully translated AM to its Chinese equivalent, while some did not recognize it as an abbreviation. Neither was Sat identified as Saturday by any voice, which is slightly surprising, as the notation is common in online articles as far as we are aware. The quality of the English pronunciation varies, which affects intelligibility to some extent. We picked the word iPhone for our test as it usually remains untranslated and is probably not in the lexicon. Interestingly, Neospeech and Nuance Mandarin only recognize the word with a capitalized P. Google Translate turned Apple into Chinese in the audio output, but iPhone sounded like the Chinese syllable fēng with the 1st tone. Acapela spelled the word out with an uppercase P, while iphone seemed to sound like [aI"fOŋ.nə] with 4th, 2nd, and neutral tones on the syllables. A possible explanation is that either the voice talent does not know much English, or similar-sounding Chinese words were used to cover some of the English syllables while those not included were spelled out. Google Translate uses a similar

strategy for words that it does not manage to translate, but the syllables are more distinctively Chinese. This ad hoc approach is of course not the solution for a bilingual TTS system, but it raises a number of questions related to the coverage and representation of foreign words in a system. Also, do users prefer standard English pronunciation, or a voice with a local accent? These topics should be of interest to TTS developers and linguists alike.

We created three sentences containing monosyllabic heteronyms. Many voices failed to disambiguate them and picked one pronunciation for all cases. The only voice that got full marks here is Neospeech Taiwan Chinese, while their Mandarin voice missed one sentence. Five other voices got one sentence correct. It would be interesting to learn the approaches they use for disambiguation, but sometimes it may have been coincidence. As two of our sample heteronym pairs differ only in tone, a transcription that does not consider tone sandhi rules may yield different tones despite not having any disambiguation mechanism.

The tone patterns also vary between the systems. Acapela and ITRS have 223 rather than the 323 we propose for “buy watercolor”, suggesting that the changes on different levels may take place from left to right all at once (333→233→223). Most voices use 323 or the incorrect 233 for “watercolor brush”, which we read as 223. The 323 is likely caused by segmentation ambiguity: 水彩筆 can be either “watercolor brush (pen)” or “water color-pen”, like markers. Google Translate and ReadSpeaker’s Taiwan Chinese are the only two that share our pattern. Moreover, the combination of yī and bù in 一不注意 (Yì bú zhùyì as we read it, literally ‘not paying attention’) is read with 22 by most. Our explanation for this is that the lexicon pronunciation of 不 is bù, causing yī to become yí, but when the 4th-tone zhùyì follows, the tone of bù is changed while yí remains the same. As Duanmu (2007) explains, tone sandhi can be analyzed in several ways, leading to different patterns. For a TTS system, we aim at the most widely used or understood version, which could presumably be found out by surveying the users. Tone sandhi rules should also be taken into consideration when creating different speech variations. Incidentally, Google Translate is the only voice that performs exactly like our rules in all tone sandhi tasks, despite not specifying which Chinese variant it is using.

We also tested some of the sentences on our prototype. As the test set is designed according to our implementation, it is not surprising that our prototype scores well on the normalization part. We have omitted the test sentences with heteronyms and English words, as our system is not capable of handling them. The performance on tone sandhi is however surprisingly good: both the T3 and the yi/bu sandhi are realized well. A minor problem is that the second syllable in the second T3 test group is at the end of a sentence and its pitch is much lower, making it sound more like a third rather than a second tone. The transcription is of the second tone,

and with the provided context in the original sentence, the syllable is a clear second tone to us. But when extracted and joined with different segments, the tone is slightly questionable.

We were expecting some tone mismatches that would negatively affect the output quality, for example where the voice talent reads a syllable in a different tone or where we missed some cases when correcting the transcription. However, other than the above issue (which is strictly not caused by a discrepancy between transcription and speech), the general tone performance is satisfactory on our test sentences. Our system even managed to get most combinations of 一 and 不 correct, for which we did not explicitly write rules. Looking at the segments, we see that 一 and 不 sometimes belong to the same larger chunk in the speech database. Thanks to the unit selection algorithm, the output actually inherits the tone sandhi by selecting the largest possible segments. We presume that the algorithm may sometimes fail due to the size of our database, so rules are still needed if we wish to produce consistent patterns in the output. This does however suggest a way of processing tone sandhi implicitly. As was mentioned, tone sandhi patterns may differ among speakers (Duanmu, 2007). We may be able to inherit the tone changes from the voice talent rather than using rules that may deviate from the speaker’s preferred patterns.

Naturalness is closely related to audio quality and intonation. Cyberon’s and ITRS’s voices remind us of the formant synthesis example in 2.3.2, which may hint at extensive signal processing. The pauses between sentences are sometimes too short in Nuance and ReadSpeaker Chinese. Moreover, ReadSpeaker Chinese does not pause between the demo statement and the output, which can be confusing. The usability of a TTS system can also affect the user’s impression. Cyberon takes a bit longer than the others to generate the speech. Ispeech has Taiwan Chinese in their scrolling list, but it did not respond to any of our input or clicks. The audio of ReadSpeaker Mandarin is either followed or preceded by an unrecognizable utterance; it took us quite a while to figure out that it is actually saying this is a demo, spelling out the first two words. The evaluation reveals that although most cases we tested are straightforward for human speakers, the implementations are not always perfect, and the variation would need to be taken into account when designing a system.

6 Conclusion

6.1 Summary

The goal of the implementation is to study the text analysis components for a rule-based Chinese TTS voice. We also hope to improve the robustness and flexibility of the TTS architecture at ReadSpeaker by experimenting with linguistic features that are not found in the other existing voices. Although there is still much to be improved, the prototype is capable of processing the most common NLP cases and provides a phonetic representation of the input for further waveform synthesis. We also identify some challenges in text analysis, especially in the areas of disambiguation and tone sandhi rules. A small evaluation based on our experiment is then used to examine the existing voices on the market, in the hope that the results would help us know what is expected of a Chinese TTS voice. Our comparison brings up a number of issues that are worth researching for future voices, including the choice of tone sandhi rules and the handling of foreign words.

In hindsight, we would propose to run the comparison before the implementation as a preliminary study for the project, although the evaluation would arguably be less discriminative without the experience gained from building our own system. Due to the time constraints we were unable to repeat the cycle of testing and revision for our prototype, but this would be necessary for a system that evolves with the language. Another limitation is the existing framework and the tools we use. While we have successfully handled Chinese characters, phonemes, and tones with the implementation, further incorporation of semantic analysis tools for disambiguation is currently not possible. The semantic features would require additional means of representation and processing, which is a large project in itself. The lack of POS tags and the outdated entries in our lexicons can be potential pitfalls for the development of a full-fledged system. Also, the disadvantage of our rule-based system is that a feature must be explicitly stated to be realized. Although unit selection seeks to learn from the data by increasing the size of the speech database, intonation on a larger scale may still be missed, given that we have no efficient ways of representing and manipulating it.

In conclusion, we have presented our text analysis solution for a Chinese

TTS voice. The components of our implementation include a tokenizer, a normalizer, lexicons, a phone set, and prosodic rules. We also design a small evaluation testing some natural language cases to better understand the quality of other voices. A number of interesting questions arise from our work, which should be of use for improving the voice in the future.

6.2 Future Work

As a prototype, our work has lived up to the initial expectations, but there is still plenty to be improved compared to other commercial voices. The first step would be to update the speech database for better output quality. If possible, it would be ideal to separate the two variants of Chinese into two systems with different speech databases. Another direction for future work is to adjust the architecture of the system to include features such as semantic categories and additional intonation marks. Additionally, as the use of English words becomes more and more frequent, a future TTS voice should be bilingual to some degree. The design of such a system will require extensive research on both languages and on the TTS framework. The diversity of the language means that choosing a speech variation that appeals to most people is not always straightforward. Research on both languages and on users’ perception would benefit the development of future voices.

Text analysis has received much less attention in recent TTS development, as the focus shifts towards improving the naturalness of the speech output. There are, however, a number of challenges that are tricky to tackle, as shown by our evaluation and implementation. An interesting area of future work would be to take inspiration from the experimental methods in Section 2.3.5 to overcome the limitations of rule-based systems. Machine learning also opens the possibility of learning the features we are trying to handle without explicit linguistic research. For example, it may be possible to directly learn the tone sandhi rules used by the voice talent, or to learn which English words are common in Chinese texts. The prospect of introducing new approaches can change the development and research of TTS as we know it.

Bibliography

Arik, S. O., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., Li, X., Miller, J., Raiman, J., Sengupta, S. et al. (2017). Deep voice: Real-time neural text-to-speech, arXiv preprint arXiv:1702.07825 .

Barker, C. (n.d.). How many syllables does English have?, http://web.archive.org/web/20160822211027/http://semarch.linguistics.fas.nyu.edu/barker/Syllables/index.txt. Online; accessed 24-April-2017.

Black, A. W. (2000). Speech synthesis in Festival: A practical course on making computers talk, edition 1.4.1, for Festival version 2.0, http://festvox.org/festtut/notes/festtut_2.html#SEC3. Online; accessed 18-April-2017.

Chao, Y.-R. (1930). A system of tone letters, Le maître phonétique pp. 24–27.

Chen, P. (1999). Modern Chinese: history and sociolinguistics, Cambridge University Press.

Cheng, C.-C. (1968). English stresses and Chinese tones in Chinese sentences, Phonetica 18(2): 77–88.

Coblin, W. S. (2000). A brief history of Mandarin, Journal of the American Oriental Society pp. 537–552.

Cruttenden, A. (1997). Intonation, Cambridge University Press.

Duanmu, S. (2007). The Phonology of Standard Chinese, Oxford University Press.

Emerson, T. (2005). The second international Chinese word segmentation bakeoff, Proceedings of the fourth SIGHAN workshop on Chinese Language Processing, Vol. 133.

Gandour, J. T. (1978). The perception of tone, in V. A. Fromkin (ed.), Tone: A Linguistic Survey, Academic Press.

Huang, C. J., Li, Y. A. and Simpson, A. (2014). The handbook of Chinese linguistics, John Wiley & Sons.

Hunt, A. J. and Black, A. W. (1996). Unit selection in a concatenative speech synthesis system using a large speech database, Acoustics, Speech, and Signal Processing, 1996. ICASSP-96. Conference Proceedings., 1996 IEEE International Conference on, Vol. 1, IEEE, pp. 373–376.

Johnson, K. (2003). Acoustic and Auditory Phonetics, Blackwell Publishing.

Jurafsky, D. and Martin, J. H. (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, second edn, Pearson Education International.

Klatt, D. H. (1980). Software for a cascade/parallel formant synthesizer, The Journal of the Acoustical Society of America 67(3): 971–995.

Klatt, D. H. (1987). Review of text-to-speech conversion for English, The Journal of the Acoustical Society of America 82(3): 737–793.

Ladefoged, P. and Johnson, K. (2014). A course in phonetics, Nelson Education.

Lee, W.-S. and Zee, E. (2003). Standard Chinese (Beijing), International Phonetic Association. Journal of the International Phonetic Association 33(1): 109.

Packard, J. L. (2000). The morphology of Chinese: A linguistic and cognitive approach, Cambridge University Press.

Pisoni, D. and Hunnicutt, S. (1980). Perceptual evaluation of MITalk: The MIT unrestricted text-to-speech system, Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP’80., Vol. 5, IEEE, pp. 572–575.

Ramsey, S. R. (1987). The languages of China, Princeton University Press.

Rao, K., Peng, F., Sak, H. and Beaufays, F. (2015). Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks, Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, IEEE, pp. 4225–4229.

Ronanki, S., Henter, G. E., Wu, Z. and King, S. (2016). A template-based approach for speech synthesis intonation generation using LSTMs, Interspeech 2016 pp. 2463–2467.

Rothauser, E. (1969). IEEE recommended practice for speech quality measurements, IEEE Trans. on Audio and Electroacoustics 17: 225–246.

Shen, X.-n. S. (1990). The prosody of Mandarin Chinese, Vol. 118, Univ of California Press.

Shih, C. and Sproat, R. (1996). Issues in text-to-speech conversion for Mandarin, Computational linguistics and Chinese language processing 1(1): 37–86.

Sjölander, K. (2003). An HMM-based system for automatic segmentation and alignment of speech, Proceedings of Fonetik, Vol. 2003, pp. 93–96.

Sotelo, J., Mehri, S., Kumar, K., Santos, J. F., Kastner, K., Courville, A. and Bengio, Y. (2017). Char2wav: End-to-end speech synthesis, ICLR 2017 .

Spiegel, M. F. (2003). Proper name pronunciations for speech technology applications, International Journal of Speech Technology 6(4): 419–427.

Taylor, P. (2009). Text-to-speech synthesis, Cambridge University Press.

Tokuda, K., Nankaku, Y., Toda, T., Zen, H., Yamagishi, J. and Oura, K. (2013). Speech synthesis based on hidden Markov models, Proceedings of the IEEE 101(5): 1234–1252.

van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. and Kavukcuoglu, K. (2016). Wavenet: A generative model for raw audio, Arxiv. URL: https://arxiv.org/abs/1609.03499

Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S. et al. (2017). Tacotron: A fully end-to-end text-to-speech synthesis model, arXiv preprint arXiv:1703.10135 .

Yarowsky, D. (1997). Homograph disambiguation in text-to-speech synthesis, Progress in speech synthesis, Springer, pp. 157–172.

Zhang, Y. and Clark, S. (2011). Syntactic processing using the generalized perceptron and beam search, Computational linguistics 37(1): 105–151.

A Complete List of Normalization Tasks

Date and Time

• Hyphen in mm/dd-yyyy is removed
• 3- or 4-digit numbers followed by 年 (year) are read as a year, not a cardinal number
• Possible date formats:
  – With day of the week: day, date; day mm/dd; mm/dd day
  – With year: mm/dd year; yy(yy)-mm-dd; yy(yy)/mm/dd; range of dates marked by either hyphen or tilde, separated by hyphen or slash, with date or year first
  – With month: mm/dd to mm/dd; month dd, dd and/or/to dd; month-month; mm/dd when adjacent words reveal that it is a date rather than a fraction
• Possible time formats: hh:mm:ss; mm’ss; hh:mm AM/PM; hh:mm
• For hh:mm AM/PM, AM and PM are converted to morning, noon, afternoon, evening, early morning in Chinese

Currency

• Dictionary lookup for currency symbols and abbreviations
• Handle number ranges ($12-14); see the sketch below
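A minimal sketch of these two steps, under the assumptions that the currency word is moved after the amount and that the range hyphen is read as 到 (‘to’); the helper names and the small dictionary are ours, and the production rules may handle ranges differently:

    import re

    # A few currency symbols for illustration; the real dictionaries are larger.
    CURRENCY = {"$": "美元", "€": "歐元", "£": "英鎊"}

    RANGE_RE = re.compile(r"([$€£])(\d+)-(\d+)")
    SINGLE_RE = re.compile(r"([$€£])(\d+)")

    def normalize_currency(text):
        """Move the currency word after the amount; expand ranges first."""
        text = RANGE_RE.sub(lambda m: m.group(2) + "到" + m.group(3) + CURRENCY[m.group(1)], text)
        text = SINGLE_RE.sub(lambda m: m.group(2) + CURRENCY[m.group(1)], text)
        return text

    print(normalize_currency("$12-14"))  # -> 12到14美元 (digits are expanded later)
    print(normalize_currency("$20"))     # -> 20美元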

Phone and Account number

• Digits in a long number string are read separately, with spaces, when co-occurring with phone/account-related keywords
• Regular expressions for the Taiwan and China phone number formats, with the possibility of adding + and a country code (see the sketch below)
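A simplified version of such a pattern, applied to the +886-800-123-412 test number from Appendix C, might look as follows. The keyword condition from the first bullet is omitted here, and the real Taiwan/China patterns are more elaborate; this is a sketch, not the actual rule set.

    import re

    DIGITS = "零一二三四五六七八九"

    # Simplified pattern: optional + and country code, then hyphen-separated groups.
    PHONE_RE = re.compile(r"\+?\d{1,4}(?:-\d{2,4})+")

    def read_phone(match):
        """Read each digit separately; | marks a pause between groups."""
        s = match.group(0)
        prefix = "加" if s.startswith("+") else ""  # '+' is read as 加 ('plus')
        groups = s.lstrip("+").split("-")
        return prefix + " | ".join("".join(DIGITS[int(c)] for c in g) for g in groups)

    print(PHONE_RE.sub(read_phone, "+886-800-123-412"))
    # -> 加八八六 | 八零零 | 一二三 | 四一二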

Regular Number

• Commas or spaces within large number strings are removed (1 000 or 1,000)
• Fractions N/M and N N/M (largest possible M is 10)

• Negative numbers
• Numbers with decimal points
• Numbers beginning with zero or longer than 12 digits are read as separate digits
• Numbers up to 12 digits are handled correctly
• 2 is normalized as 兩 (liǎng) when the following character is a classifier (see the sketch below)
• The hyphen is ignored in character-hyphen-number cases like 碳-14 (‘carbon-14’)
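The 兩/二 rule above amounts to a one-character lookahead. A minimal sketch, with a hypothetical classifier set standing in for the real, much longer list:

    # A small sample of classifiers for illustration; the real list is much longer.
    CLASSIFIERS = set("個个隻只張张杯本卷袋")

    def normalize_two(following_char):
        """Choose 兩 over 二 when a lone 2 precedes a classifier."""
        return "兩" if following_char in CLASSIFIERS else "二"

    print(normalize_two("杯"))  # -> 兩, as in 兩杯 ('two cups')
    print(normalize_two("月"))  # -> 二, as in 二月 ('February')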

Symbols

• Dictionary lookup for symbols
• Normalize square and cubic represented with a circumflex or superscript (up to 9)
• Reordering % and °C/°F (see the sketch below)
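The reordering step reflects the fact that in Mandarin the percent expression precedes the number (50% is read 百分之五十), and 攝氏 (‘Celsius’) likewise precedes the number while 度 (‘degrees’) follows it. A minimal sketch of the reordering only, leaving digit expansion to a later step:

    import re

    def reorder_percent(text):
        """50% is read 百分之五十, so % moves in front of the number."""
        return re.sub(r"(\d+)%", r"百分之\1", text)

    def reorder_celsius(text):
        """25°C is read 攝氏25度: 攝氏 precedes the number, 度 follows it."""
        return re.sub(r"(\d+)°C", r"攝氏\1度", text)

    print(reorder_percent("50%"))   # -> 百分之50
    print(reorder_celsius("25°C"))  # -> 攝氏25度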

Dictionaries

• Currency symbols and abbreviations: €, £, $, ¥, ¢, USD, EUR, GBP, CNY, TWD, NTD, JPY, ...
• English abbreviations: for months and days of the week, Tel, sec, min, ...
• Math symbols: <, >, =, +, ±, √, ...
• Other symbols: _, /, ~, #, ...

B Pinyin to R-sampa Mapping Chart

Final     | y, w, or no onset | b      | p       | q? — (blank cells: no such Pinyin syllable)

Final     | y, w, or no onset | b     | p      | m      | f     | d      | t       | n      | l
a         | a                 | p a   | ph a   | m a    | f a   | t a    | th a    | n a    | l a
e         | @                 |       |        | m @    |       | t @    | th @    | n @    | l @
o         | O                 | p uO  | ph uO  | m uO   | f uO  |        |         |        |
ai        | ai                | p ai  | ph ai  | m ai   |       | t ai   | th ai   | n ai   | l ai
ei        | ei                | p ei  | ph ei  | m ei   | f ei  | t ei   |         | n ei   | l ei
ao        | au                | p au  | ph au  | m au   |       | t au   | th au   | n au   | l au
ou        | ou                |       | ph ou  | m ou   | f ou  | t ou   | th ou   | n ou   | l ou
an        | a n               | p a n | ph a n | m a n  | f a n | t a n  | th a n  | n a n  | l a n
ang       | a N               | p a N | ph a N | m a N  | f a N | t a N  | th a N  | n a N  | l a N
en        | @ n / u@ n (wen)  | p @ n | ph @ n | m @ n  | f @ n | t @ n  |         | n @ n  |
i         | i                 | p i   | ph i   | m i    |       | t i    | th i    | n i    | l i
ia        | ia                |       |        |        |       |        |         |        |
iao       | iau               | p iau | ph iau | m iau  |       | t iau  | th iau  | n iau  | l iau
ie        | iE                | p iE  | ph iE  | m iE   |       | t iE   | th iE   | n iE   | l iE
iu (iou)  | iou               |       |        | m iou  |       | t iou  |         | n iou  | l iou
ian       | iE n              | p iE n| ph iE n| m iE n |       | t iE n | th iE n | n iE n | l iE n
in        | i n               | p i n | ph i n | m i n  |       | t i n  | th i n  | n i n  | l i n
iang      | ia N              |       |        |        |       |        |         | n ia N | l ia N
ing       | i N               | p i N | ph i N | m i N  |       | t i N  | th i N  | n i N  | l i N
iong      | iO N              |       |        |        |       |        |         |        |
ü         | y                 |       |        |        |       |        |         | n y    | l y
ue        | yE                |       |        |        |       |        |         | n yE   | l yE
uan       | yE n              |       |        |        |       | t ua n | th ua n | n ua n | l ua n
un (uen)  | y n               |       |        |        |       | t u@ n | th u@ n |        | l u@ n
u         | u                 | p u   | ph u   | m u    | f u   | t u    | th u    | n u    | l u
ua        | ua                |       |        |        |       |        |         |        |
uo        | uO                |       |        |        |       | t uO   | th uO   | n uO   | l uO
uai       | uai               |       |        |        |       |        |         |        |
ui (uei)  | uei               |       |        |        |       | t uei  | th uei  |        |
uang      | ua N              |       |        |        |       |        |         |        |
ong       |                   |       |        |        |       | t O N  | th O N  | n O N  | l O N
eng       | uO N              | p O N | ph O N | m O N  | f O N | t @ N  | th @ N  | n @ N  | l @ N

(Blank cells: no such Pinyin syllable.)

Final     | j        | q         | x
i         | tS i     | tSh i     | S i
ia        | tS ia    | tSh ia    | S ia
iao       | tS iau   | tSh iau   | S iau
ie        | tS iE    | tSh iE    | S iE
iu (iou)  | tS iou   | tSh iou   | S iou
ian       | tS iE n  | tSh iE n  | S iE n
in        | tS i n   | tSh i n   | S i n
iang      | tS ia N  | tSh ia N  | S ia N
ing       | tS i N   | tSh i N   | S i N
iong      | tS iO N  | tSh iO N  | S iO N
ü         | tS y     | tSh y     | S y
ue        | tS yE    | tSh yE    | S yE
uan       | tS yE n  | tSh yE n  | S yE n
un (uen)  | tS y n   | tSh y n   | S y n

Final     | g      | k       | h      | z       | c        | s      | zh       | ch        | sh      | r
a         | k a    | kh a    | h a    | ts a    | tsh a    | s a    | rts a    | rtsh a    | rs a    |
e         | k @    | kh @    | h @    | ts @    | tsh @    | s @    | rts @    | rtsh @    | rs @    | r @
ai        | k ai   | kh ai   | h ai   | ts ai   | tsh ai   | s ai   | rts ai   | rtsh ai   | rs ai   |
ei        | k ei   | kh ei   | h ei   | ts ei   |          |        |          |           | rs ei   |
ao        | k au   | kh au   | h au   | ts au   | tsh au   | s au   | rts au   | rtsh au   | rs au   | r au
ou        | k ou   | kh ou   | h ou   | ts ou   | tsh ou   | s ou   | rts ou   | rtsh ou   | rs ou   | r ou
an        | k a n  | kh a n  | h a n  | ts a n  | tsh a n  | s a n  | rts a n  | rtsh a n  | rs a n  | r a n
ang       | k a N  | kh a N  | h a N  | ts a N  | tsh a N  | s a N  | rts a N  | rtsh a N  | rs a N  | r a N
en        | k @ n  | kh @ n  | h @ n  | ts @ n  | tsh @ n  | s @ n  | rts @ n  | rtsh @ n  | rs @ n  | r @ n
i         |        |         |        | ts i    | tsh i    | s i    | rts i    | rtsh i    | rs i    | r i
uan       | k ua n | kh ua n | h ua n | ts ua n | tsh ua n | s ua n | rts ua n | rtsh ua n | rs ua n | r ua n
un (uen)  | k u@ n | kh u@ n | h u@ n | ts u@ n | tsh u@ n | s u@ n | rts u@ n | rtsh u@ n | rs u@ n | r u@ n
u         | k u    | kh u    | h u    | ts u    | tsh u    | s u    | rts u    | rtsh u    | rs u    | r u
ua        | k ua   | kh ua   | h ua   |         |          |        | rts ua   |           | rs ua   |
uo        | k uO   | kh uO   | h uO   | ts uO   | tsh uO   | s uO   | rts uO   | rtsh uO   | rs uO   | r uO
uai       | k uai  | kh uai  | h uai  |         |          |        | rts uai  |           | rs uai  |
ui (uei)  | k uei  | kh uei  | h uei  | ts uei  | tsh uei  | s uei  | rts uei  | rtsh uei  | rs uei  | r uei
uang      | k ua N | kh ua N | h ua N |         |          |        | rts ua N | rtsh ua N | rs ua N |
ong       | k O N  | kh O N  | h O N  | ts O N  | tsh O N  | s O N  | rts O N  | rtsh O N  | rs O N  | r O N
eng       | k @ N  | kh @ N  | h @ N  | ts @ N  | tsh @ N  | s @ N  | rts @ N  | rtsh @ N  | rs @ N  | r @ N

* Syllables beginning with i or ü are written with y or yu in Pinyin, while u is replaced with w. To find the final in the tables, turn the initial y or w back into its vowel counterpart. The middle letter of iou, uen, and uei, shown in parentheses, is omitted in the Pinyin spelling.
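This spelling rule can be undone programmatically before the table lookup. The helper below is a sketch of that conversion and is illustrative only, not the thesis implementation:

    def underlying_final(syllable):
        """Recover the chart's final from a Pinyin spelling with y or w.

        Per the note above: y- spells an i- or ü-initial final, w- a
        u-initial one, and the medials of iou, uen, uei are restored.
        """
        if syllable.startswith("yu"):
            return "ü" + syllable[2:]   # yu -> ü, yue -> üe (the ue row)
        if syllable.startswith("yi"):
            return syllable[1:]         # yi -> i, ying -> ing
        if syllable.startswith("y"):
            return "i" + syllable[1:]   # ya -> ia, you -> iou
        if syllable.startswith("wu"):
            return syllable[1:]         # wu -> u
        if syllable.startswith("w"):
            return "u" + syllable[1:]   # wa -> ua, wen -> uen, wei -> uei
        return syllable

    for s in ("you", "wen", "ying", "yue"):
        print(s, "->", underlying_final(s))
    # you -> iou, wen -> uen, ying -> ing, yue -> üe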

C Test Cases and Results

Test cases and expected output

• Preprocessing
  – 我是一個語音生成系統 (‘I am a speech synthesis system’, traditional)
  – 我是一个语音生成系统 (‘I am a speech synthesis system’, simplified)
  – Fullwidth number: 204
• Normalization
  – Test the largest possible number
  – 12:30 → 十二點三十分
  – 2017/09/19 → 二零一七年九月十九日
  – $20 → 20(美) 元
  – 3+5=8 → 三加五等於八
  – +886-800-123-412 → (加) 八八六 | 八零零 | 一二三 | 四一二
  – 1992 年 → 一九九二年 (‘year 1992’)
• English words and abbreviations
  – IBM → [aI.bi"Em]
  – IKEA → [i"ki.a] or [aI"ki.a]
  – (10:30) AM → [eI"Em] or 上午 (‘morning’)
  – (2017/09/19) Sat → 星期六 (‘Saturday’)
  – Apple 公司的 iPhone → ["æpəl] 公司的 ["aI.fOn] (‘Apple company’s iPhone’)
• Disambiguation
  – 曾 (Zēng) 老師曾 (céng) 經在這所學校任教 (‘Teacher Zeng used to teach at this school.’)
  – 他大吃大喝 (hē) 之後大喝 (hè) 一聲，又喝 (hē) 了一杯。 (‘He shouted after gulping down the food and the drink, then drank another glass.’)
  – 瓶子倒 (dǎo) 了，水倒 (dào) 了出來。 (‘The bottle fell and the water poured out.’)
• Tone Sandhi (the rules behind the expected tone patterns are sketched in code after this list)

  – 買水彩 (323)、水彩筆 (223)、一卷 (43) 膠帶、一袋 (24) 筆記本，不要 (24) 紙袋，不需 (42) 要發票。 (‘Buy some watercolor, a watercolor brush, a roll of tape, a pack of notebooks. No need for the paper bag and the receipt.’)
  – 一不 (24) 小心走錯路，一不 (42) 注意就忘記買了。 (‘(I) happened to take the wrong way and forgot to buy it.’)
  – 不一 (42) 定有空，不一 (24) 直記得。 (‘Might not have time for it or might forget about it.’)
• Rhotic coda: 這兒 (Zhèr) 天氣很好。 (‘The weather is nice here.’)
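For reference, the expected tone patterns in these sentences follow the standard Mandarin sandhi rules. The sketch below states those rules in code; it deliberately ignores prosodic phrasing, which is what makes longer third-tone strings such as 買水彩 surface as 323 rather than 223, and it does not model the contexts (ordinals, final position) where 一 keeps its citation tone.

    def yi_sandhi(next_tone):
        """一 (citation tone 1): tone 2 before a tone-4 syllable,
        tone 4 before tones 1-3."""
        return 2 if next_tone == 4 else 4

    def bu_sandhi(next_tone):
        """不 (citation tone 4): tone 2 before a tone-4 syllable,
        otherwise unchanged."""
        return 2 if next_tone == 4 else 4

    def third_tone_sandhi(tones):
        """Change tone 3 to tone 2 before another tone 3 (left-to-right),
        e.g. 水彩 (3,3) -> (2,3) and 水彩筆 (3,3,3) -> (2,2,3)."""
        out = list(tones)
        for i in range(len(out) - 1):
            if out[i] == 3 and out[i + 1] == 3:
                out[i] = 2
        return out

    print(yi_sandhi(3))                  # -> 4: 一卷 is read 43
    print(bu_sandhi(4))                  # -> 2: 不要 is read 24
    print(third_tone_sandhi([3, 3, 3]))  # -> [2, 2, 3], the 223 pattern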

Mandarin/Chinese in China

Results are listed in the order Acapela; Cyberon; Ispeech; NeoSpeech; Nuance; ReadSpeaker.

digits: 10; 3; 11; 11; 15; 12
12:30: ✓; ✓; ✓; ✓; ✓; 十二 | 三十
2017/09/19: ✓; 二零一七 | 九 | 十九; ✓; ✓; ✓; 二千零一十七 | 零九 | 十九
$20: ✓; ✓; ✓; ✓; ✓; 二十
3+5=8: ✓; 三五 | 八; ✓; ✓; 三五等於號八; 三五等於八
phone no.: ✗; ✗; ✗; ✗; ✓; ✗
1992 年: ✓; ✓; ✓; ✓; ✓; ✓
IBM: ✓; ✓; ✓; ✓; [E.bi"Em]; ✓
IKEA: [aI.keI.i.E]; ✓; [aI.ki]; [aI.keI.eI]; [keI]; [aI.keI.eI]
10:30 AM: 十 | 三十 [7.Em]; ✓; ✓; ✓; ✓; 十 | 三十 [Em]
Sat: [sat]; [sæt]; [sæt]; [sæt]; [sæt]; [Es.eI.ti]
Apple: [a.pə]; ✓; ✓; ✓; ✓; [eI.pi.pi]
iPhone: spelled-out; ✓; ✓; ✓; ✓; spelled-out
iphone: [aI"fOŋ.nə]; ✓; ✓; [I.fOŋ]; [I.fOŋ]; spelled-out
曾: céng; céng; céng; ✓; céng; céng
喝: ✓; hē; hē; hē; ✓; hē
倒: dǎo; dǎo; dǎo; ✓; dào; ✓
3rd tone: 223,323; ✓,323; ✓,323; ✓,223?; ✓,233; ✓,233
一 sandhi: ✓; ✓; ✓; 23,✓; ✓; ✓
不 sandhi: ✓; ✓; ✓; ✓; ✓; ✓
一不: ✓,22; ✓,22; ✓,12; ✓,12; 12,✓; 44,44
不一: ✓; ✓; ✓; ✓; 41,44; 41,44
Rhotic: ✓; ✗; ✓; ✓; ✓; ér is stressed

• Cyberon read numbers longer than 3 digits separately unless followed by any characters other than 年.
• Ispeech read 一 as yāo rather than yī. Nuance used yāo and yī interchangeably.

• NeoSpeech’s output for the second 3rd tone sandhi test sounded somewhere between 223 and 323, hence the question mark.
• Ispeech, NeoSpeech, and Nuance translated AM into Chinese.
• Most voices failed to read the phone number we tested. The two most common mistakes were interpreting the hyphens as ‘to’, or reading the number as a whole.
  – Acapela: 加八百八十六至八百至一百二十三至四百一十二
  – Cyberon: 八百八十六 | 八零零 | 一二三 | 四百一十二
  – Ispeech: 正八百八十六至八百至一百二十三至四百一十二
  – NeoSpeech: 正八百八十六至八百至一百二十三至四百一十二
  – ReadSpeaker: 八百八十六 | 八百 | 一百二十三 | 四百一十二

Taiwan Chinese and Google Translate

Results are listed in the order Cyberon; Google; ITRS; NeoSpeech; Nuance; ReadSpeaker; Prototype.

digits: 3; 16; 11; 11; 15; 12; 12
12:30: ✓; ✓; no output; ✓; ✓; ✓; ✓
2017/09/19: 二零一七 | 零九 | 十九; ✓; ✓; ✓; ✓; ✓; ✓
$20: ✓; ✓; ✓; ✓; ✓; ✓; ✓
3+5=8: 三五 | 八; ✓; 三五等於八; ✓; ✓; ✓; ✓
phone no.: ✗; ✓; ✓; ✗; ✓; ✓; ✓
1992 年: ✓; ✓; 一千九百九十二年; ✓; ✓; ✓; ✓
IBM: ✓; ✓; ✓; ✓; ✓; ✓; -
IKEA: [E"ki.a]; spelled-out; ✓; [ki]; [aI"ki]; [aI.ki]; -
10:30 AM: ✓; ✓; no output; ✓; ✓; ✓; -
Sat: [sat]; 薩特; [sæt]; [sæt]; [sæt]; [sæt]; -
Apple: ✓; 蘋果; ✓; ✓; ✓; ✓; -
iPhone: ✓; fēng; ✓; ✓; ✓; ✓; -
iphone: ✓; fēng; ✓; ✓; ✓; ✓; -
曾: céng; céng; ✓; ✓; ✓; céng; -
喝: hē; hè, hè, hē; hē; ✓; hè, hè, hē; hē; -
倒: dǎo; dào; dào; ✓; dǎo; dào; -
3rd tone: 322,323; ✓; 223,✓; ✓,323; ✓,233; ✓; ✓,233?
一 sandhi: ✓; ✓; ✓; ✓; ✓; ✓; ✓
不 sandhi: ✓; ✓; ✓; ✓; ✓; ✓; ✓
一不: ✓,22; ✓; ✓,22; ✓,22; 12,✓; ✓,22; ✓,22
不一: ✓; ✓; ✓; ✓; ✓; ✓; ✓
Rhotic: ✗; ✗; ✗; ✗; ✓; ✓; -

• Except for Cyberon, all the other voices converted AM to Chinese.
• Google read 一 as yāo, while Nuance changed back to yī, as yāo is less common in Taiwan Chinese. Google also seems to have an interesting mechanism that maps English words into Chinese syllables, as demonstrated by the Sat example: the characters are read sàtè, or [sa.tə], both with the 4th tone.
• Cyberon and NeoSpeech made the same mistake with the phone number as in their Mandarin voices. The other voices read the phone number correctly without misinterpreting the hyphens.
• We failed to get any output from ITRS with the time format we used for testing.

* The pipe (|) represents a pause in the output. The check mark (✓) means that the output is as we expected, or that the phenomenon is present, as in the rhotic coda test; the cross mark (✗) means that the voice failed the item or that the phenomenon is absent. The - mark in the Prototype results means that the test sentences are not applicable to our prototype. We do not include the results for preprocessing here, as all voices gave satisfactory output. In the disambiguation tests, a voice has to read every heteronym correctly to get a check mark; a transcription listed instead means that the voice used that pronunciation in all cases. Every tone sandhi rule has two samples: if a voice only managed to read one of them correctly, it gets a check mark plus the tone pattern that deviates from ours. Stress in the phonetic transcriptions is not marked unless it is really distinctive. The 10:30 and 2017/09/19 preceding AM and Sat in the English tests provide more context for the system, as the words in isolation may be interpreted differently.
