Model in Word
Total Page:16
File Type:pdf, Size:1020Kb
Using the Segmentation Corpus to define an inventory of concatenative units for Cantonese speech synthesis Wai Yi Peggy WONG Shui-duen CHAN Chris BREW Chinese Language Centre, Mary E. BECKMAN Hong Kong Polytechnic University Linguistics Dept., Ohio State University Yuk Choi Road, Hung Hom, Kowloon, 222 Oxley Hall, 1712 Neil Ave. HONG KONG Columbus, OH, 43210-1298 USA [email protected] {pwong, cbrew, mbeckman}@ling.osu.edu One major challenge involves the definition Abstract of the “word” in Cantonese. As in other varieties of Chinese, morphemes in Cantonese are The problem of word segmentation affects typically monosyllabic and syllable structure is all aspects of Chinese language processing, extremely simple, which might suggest the including the development of text-to-speech demi-syllable or even the syllable (Chu & synthesis systems. In synthesizing a Hong Ching, 1997) as an obvious basic unit. At the Kong Cantonese text, for example, words same time, however, there are segmental must be identified in order to model fusion “sandhi” effects that conjoin syllables within a of coda [p] with initial [h], and other similar word. For example, when the morpheme 集 effects that differentiate word-internal zaap6 2 stands as a word alone (meaning ‘to syllable boundaries from syllable edges that collect’), the [p] is a glottalized and unreleased begin or end words. Accurate segmentation coda stop, but when the morpheme occurs in the is necessary also for developing any list of 集合 zaap6hap6 words large enough to identify the word- longer word (‘to assemble’), internal cross-syllable sequences that must the coda [p] often resyllabifies and fuses with be recorded to model such effects using the following [h] to make an initial aspirated concatenated synthesis units. This paper stop. Accurate word segmentation at the text describes our use of the Segmentation analysis level is essential for identifying the Corpus to constrain such units. domain of such sandhi effects in any full-fledged TTS system, whatever method is used for Introduction generating the waveform from the specified pronunciation of the word. A further challenge is What are the best units to use in building a fixed to find a way to capture such sandhi effects in inventory of concatenative units for an unlimited systems that use concatenative methods for vocabulary text-to-speech (TTS) synthesis waveform generation. system for a language? Given a particular choice This paper reports on research aimed at of unit type, how large is the inventory of such defining an inventory of concatenative units for units for the language, and what is the best way Cantonese using the Segmentation Corpus, a to design materials to cover all or most of these lexicon of 33k words extracted from a large units in one recording session? Are there effects corpus of Cantonese newspaper texts. The such as prosodically conditioned allophony that corpus is described further in Section 2 after an cannot be modeled well by the basic unit type? excursus (in Section 1) on the problems posed These are questions that can only be answered language by language, and answering them for standard, and not the older Canton City one. 1 Cantonese poses several interesting challenges. 2 We use the Jyutping romanization developed by the Linguistics Society of Hong Kong in 1993. See 1We use “Cantonese” to mean the newer Hong Kong http://www.cpct92.cityu.edu.hk/lshk. by the Cantonese writing system. Section 3 morpheme. For example, ge3 (genitive particle) outlines facts about Cantonese phonology might be written 的, a character which more relevant to choosing the concatenative unit, and typically writes the morpheme dik1 in Section 4 calculates the number of units that muk6dik1 目的 ‘aim’, but which suggests ge3 would be necessary to cover all theoretically because it also writes a genitive particle in possible syllables and sequences of syllables. Mandarin (de in the Pinyin romanization). Thus, The calculation is done for three models: (1) 的 has a reading dik1 that is etymologically syllables, as in Chu & Ching (1997), (2) Law & related to the Mandarin morpheme, but it also Lee’s (2000) mixed model of onsets, rhymes, has the etymologically independent reading ge3 and cross-syllabic rhyme-onset units, and (3) a because Cantonese readers can read texts written positionally sensitive diphone model. This in Mandarin as if they were Cantonese. Such section closes by reporting how the number of ambiguities of reading make the task of units in the last model is reduced by exploiting developing online wordlists from text corpora the sporadic and systematic phonotactic gaps doubly difficult, since word segmentation is discovered by looking for words exemplifying only half the task. each possible unit in the Segmentation Corpus. 2 The Segmentation Corpus 1 The Cantonese writing system The Segmentation Corpus is an electronic The Cantonese writing system poses unique database of around 33,000 Cantonese word types challenges for developing online lexicons, not extracted from a 1.7 million character corpus of all of which are related to the “foremost Hong Kong newspapers, along with a tokenized problem” of word segmentation. These problems record of the text. It is described in more detail stem from the long and rich socio-political in Chan & Tang (1999). The Cantonese corpus history of the dialect, which makes the writing is part of a larger database of segmented Chinese system even less regular than the Mandarin one, texts, including Mandarin newspapers from both even though Cantonese is written primarily with the PRC and Taiwan. The three databases were the same logographs (“Chinese characters”). created using word-segmentation criteria The main problem is that each character has developed by researchers at the Chinese several readings, and the reading cannot always Language Centre and Department of Chinese be determined based on the immediate context and Bilingual Studies, Hong Kong Polytechnic of the other characters in a multisyllabic word. University. Since these criteria were intended to For some orthographic forms, the variation is be applicable to texts in all three varieties, they 支援 stylistic. For example, the word ‘support’ do not refer to the phonology. can be pronounced zi1jyun4 or zi1wun4. But For our current purpose, the most useful part for other orthographic forms, the variation in of the Segmentation Corpus is the wordlist pronunciation corresponds to different words, proper, a file containing a separate entry for each with different meanings. For example, the word type identified by the segmentation criteria. character sequence 正當 writes both the function Each entry has three fields: the orthographic word zing3dong1 ‘while’ and the content word form(s), the pronunciation(s) in Jyutping, and zing3dong3 ‘proper’. Moreover, some words, the token frequency in the segmented newspaper such as ko1 ‘to page, telephone’, ge3 (genitive corpus. In the original corpus, the first field particle), and laa3 (aspect marker), have no could have multiple entries. For example, there standard written form. In colloquial styles of are two character strings, 泛濫 and 氾濫, in the writing, these forms may be rendered in non- entry for the word faan6laam6 ‘to flood’. standard ways, such as using the English source However, the two readings of 支援 were not call ko1 word to write , or writing the particles listed separately in the pronunciation field for with special characters unique to Cantonese. In that orthographic form (and there was only one more formal writing, however, such forms must entry for the two words written with 正當). be left to the reader to interpolate from a Before we could use the wordlist, therefore, character “borrowed” from some other we had to check the pronunciation field for each entry. The first author, a native speaker of Hong open syllables, and in closed syllables it occurs Kong Cantonese, examined each entry in order only before [k], [], and [j], whereas [i:] occurs to add variant readings not originally listed (as in closed syllables only before [p], [t], [m], [n], in the entry for 支援 ‘support’) and to correct and [w]. The Jyutping transliteration system readings that did not correspond to the modern takes advantage of this kind of complementary Hong Kong pronunciation (as in the entry for 盒 distribution to limit the number of vowel ‘box’). In addition, when the added variant symbols. Thus “i” is used to transcribe both the pronunciation corresponded to an identifiably short mid vowel [e] in the rhymes [ek] and [e], different word (as in zing3dong3 ‘proper’ and the high vowel [i:] in the rhymes [i:], [i:t], versus zing3dong1 ‘while’ for 正當), the [i:p], [i:m], [i:n], and [i:w]. Ignoring tone, there entry was split in two, and all tokens of that are 54 rhyme types in standard usage of character string in the larger corpus were Jyutping. Canonical forms of longer words can examined, in order to allocate the total token be described to a very rough first approximation count for the orthographic form to the two as strings of syllables. However, fluent synthesis separate frequencies for the two different words. cannot be achieved simply by stringing syllables Approximately 90 original entries were split into together without taking into account the effects separate entries by this processing. In this way, of position in the word or phrase. the 32,840 entries in the original word list became 33,037 entries. Once this task was front central back completed, we could use the wordlist to count round round all of the distinct units that would be needed to i: y: u: high synthesize all of the words in the Segmentation e o mid (short) Corpus. To explain what these units are, we : [: ): mid (long) need to describe the phonology of Cantonese ( a: low vowels words and longer utterances. Table 1.