IEEE Transactions on Audio, Speech, and Language Processing, Vol. 20, No. 4, May 2012

A Generative Data Augmentation Model for Enhancing Chinese Dialect Pronunciation Prediction

Chu-Cheng Lin and Richard Tzong-Han Tsai

Abstract—Most spoken Chinese dialects lack comprehensive digital pronunciation databases, which are crucial for speech processing tasks. Given complete pronunciation databases for related dialects, one can use supervised learning techniques to predict a Chinese character's pronunciation in a target dialect based on the character's features and its pronunciation in other related dialects. Unfortunately, Chinese dialect pronunciation databases are far from complete. We propose a novel generative model that makes use of both existing dialect pronunciation data and medieval rime books to discover patterns that exist in multiple dialects. The proposed model can augment missing dialectal pronunciations based on existing dialect pronunciation tables (even if incomplete) and the pronunciation data in rime books. The augmented pronunciation database can then be used in supervised learning settings. We evaluate the prediction accuracy in terms of phonological features, such as tone, initial phoneme, and final phoneme. For each character, the features are also evaluated as a whole, under overall pronunciation feature accuracy (OPFA). Our first experiment shows that adding features from dialectal pronunciation data to our baseline rime-book model dramatically improves OPFA using the support vector machine (SVM) model. In the second experiment, we compare the performance of the SVM model using phonological features from closely related dialects with that of the model using phonological features from non-closely related dialects. The results show that using features from closely related dialects yields higher accuracy. In the third experiment, we show that using our proposed data augmentation model to fill in missing data can increase the SVM model's OPFA by up to 7.6%.

Index Terms—Chinese dialects, data augmentation, generative model, pronunciation database.

Manuscript received October 31, 2010; revised March 14, 2011; accepted July 11, 2011. Date of publication October 17, 2011; date of current version February 10, 2012. This work was supported in part by the National Science Council under Grants NSC 98-2221-E-155-060-MY3 and NSC 99-2628-E-155-004. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Gokhan Tur. C.-C. Lin is with the Department of Computer Science and Information Engineering, National Taiwan University, Taipei 10617, Taiwan (e-mail: chu.cheng.[email protected]). R. T.-H. Tsai is with the Department of Computer Science and Engineering, Yuan Ze University, Zhongli 320, Taiwan (e-mail: [email protected]). Digital Object Identifier 10.1109/TASL.2011.2172424

I. INTRODUCTION

Character pronunciation databases are key resources in speech processing tasks such as speech recognition and synthesis. For official written languages, such databases are rich. For example, English has the CMU pronouncing dictionary [1], while Mandarin has the Unihan database [2]. For spoken languages, however, digitized pronunciation resources are not so plentiful. In China, this is particularly relevant: a 2004 survey of Chinese dialects revealed that more than 86% of the Chinese population can converse in a non-Mandarin dialect, while only 53% can converse in Mandarin [3]. However, there is a serious lack of such databases for non-Mandarin dialects. This situation impedes the development of speech processing technologies and applications for resource-poor dialects. Since compiling such resources is labor-intensive, our goal is to develop a tool to help automate the prediction of character pronunciations for different Chinese dialects.

Currently, most dialect pronunciation databases/dictionaries have been constructed by individual researchers and vary greatly in terms of completeness. If we had complete pronunciation databases for related dialects, we could use standard supervised learning techniques to predict a character's pronunciation in a target dialect. As mentioned above, however, pronunciation databases for most Chinese dialects are far from complete. Therefore, we propose a novel generative model that makes use of both existing dialect pronunciation data and medieval rime books to discover patterns that exist in multiple dialects. Unlike previous work, this model does not assume that language evolves like a branching tree, but only that character pronunciations across related dialects show patterns. The proposed model can augment character pronunciations for a dialect based on existing dialect pronunciation tables (even if incomplete) and the pronunciation data in medieval rime books. After augmentation, a standard classifier-based pronunciation prediction system can be constructed.

II. BACKGROUND OF CHINESE DIALECTS

A. Mutual Intelligibility

It is widely recognized that Chinese dialects are to a great extent mutually unintelligible. All the southern Chinese dialects have mean sentence intelligibility lower than 30% for nonnative speakers [4]. In comparison, Portuguese and Spanish have mutual intelligibility at roughly 60% [5].

Although the mutual intelligibility among Chinese dialects is very low, character pronunciations across dialects show regular correspondence. For example, the pronunciations of "肝" (gan/liver) and "寒" (han/frigid) sound utterly different in Mandarin and in other dialects; but within each dialect, the rhyming is consistent.

B. Rime Books

Other than areal influence, the striking correspondence is largely attributed to historical reasons [6], which can be seen in medieval rime books. Earlier rime books, such as "切韻" (Qieyun) (601 AD), record contemporary character pronunciations with "反切" (fanqie) analyses.


TABLE I
SYMBOLS USED IN SECTION IV

Fanqie represents a character's pronunciation with two other characters, combining the former's onset and the latter's rhyme and tone. An English equivalent would be to combine the onset of "peek" /piːk/ and the rhyme of "cat" /kæt/ to get "pat" /pæt/.

Obviously, there may be multiple combinations of characters that represent a single pronunciation in the fanqie system. In contrast, later rime books such as "韻鏡" (Yunjing) (900–950 AD) carried out finer phonological analysis, using fixed sets of characters to represent the phonological qualities of contemporary pronunciation [6]. A character pronunciation under the new system has six features, each taking its value from a fixed set of characters. The six features are 聲母 (initials), 韻 (rhymes/finals), 攝 (rhyme groups), 聲調 (tones), 呼 (openness), and 等 (grades). For example, the character 含 has 匣 (xia) as its 聲母, 咸 as its 攝, etc. These features cannot be directly employed to reconstruct Middle Chinese pronunciations, as the meaning of some features is still disputed. Nevertheless, modern dialects still bear the correspondence, and thus rime book features can be used to infer phonological correspondence in modern dialects between characters sharing the same rime book feature. For example, the two characters "含" (han) and "站" (zhan) are described with the same rhyme group character "咸" (xian), and they still rhyme in Mandarin and in Amoy, although the pronunciations do not rhyme across dialects. Thus, the rime books are very valuable resources in determining a character's pronunciation.

III. RELATED WORK

There are many modern dictionaries using phonetic alphabets to denote pronunciation for specific dialects, such as 粤音韻彙 (A Chinese Syllabary Pronounced According to the Dialect of Canton). In 1962, the first comprehensive cross-dialectal lexicon, 漢語方音字彙 (Hanyu Fangyin Zihui), was published. The original Zihui consists of approximately 2500 character readings in IPA notation from 17 modern Chinese dialects. In addition, the categorical descriptive features from the rime book 韻鏡 (Yunjing) are also provided.

Soon after its publication, Zihui was digitized under Project DOC (Dictionary on Computer) [7]. The Zihui lexicon is invaluable to the study of diachronic phonology. However, many dialects are still unrecorded. Another problem is that Zihui only contains about 2500 characters, far from the total number of Chinese characters (more than 50 000). These two flaws render the Zihui lexicon unsatisfactory when used as a dialect dictionary. Our work therefore proposes to augment the unseen characters and dialects using the dialects and character readings recorded in the Zihui lexicon.

Augmenting missing data with known information is not a new idea, as practiced by [8] and [9]. Data augmentation is generally done by introducing latent variables to model the training data [10]. In our problem, we need to model dialectal pronunciation data. A model of pronunciations has been proposed for the Romance languages by [11], which allows generation of word forms of both reconstructed and modern languages. A phylogenetic tree of Classical Latin, Vulgar Latin, Spanish, and Italian was built to model the evolutionary relationship among these languages. In this tree, Classical Latin is the root, Vulgar Latin is its child, and Spanish and Italian are Vulgar Latin's descendants. In their approach, the pronunciation of the root language must be given.

However, for Chinese dialects, the applicability of the tree model is disputed. [12] suggested that it may be more appropriate to model the development of Chinese dialects with a network. Even if Chinese dialects are placed into a tree structure after Bouchard-Côté et al.'s model, with Middle Chinese, which influenced the largest number of Chinese dialects, as the root language, we still encounter the following problem. Classical Latin's phonology has been well established [13]; therefore, the actual pronunciation can easily be deduced from the spelling. Unlike Classical Latin, the phonology and character pronunciations of Middle Chinese are still not wholly clear. For example, we know virtually nothing about the actual tones. Current reconstructions depend heavily upon medieval rime books, which are known to be a combination of at least two Middle Chinese dialects [14]. To derive a proper phylogenetic tree, one must first distinguish between the Middle Chinese dialects (at least two, according to Ting) and then correctly assign their respective offspring languages. However, current studies show that certain Wu dialects have at least two substrata, one from northern Middle Chinese and the other from southern Middle Chinese [15]. This directly violates the tree assumption. For a language whose ancestral pronunciations are not given, we cannot use Bouchard-Côté et al.'s model to predict a character's pronunciation in that language.

Some research tries to use the resources of other languages to deal with resource-poor languages. [16] shows that adding unannotated text in more languages can improve unsupervised POS tagging performance. [17] uses multilingual acoustic data to improve recognition performance on a newly seen language, sharing articulatory feature data among languages. These works assume that the linguistic data used during training exhibit patterns which carry over to the newly seen language; our work only assumes that Chinese dialects have consistent phonological correspondence with Middle Chinese and among themselves.

IV. METHODOLOGY

A. Problem Definition

Our task is to augment the pronunciation database of Chinese dialects. For each record, the given pronunciation database lists all existing pronunciations in the 21 dialects from all major dialect groups; some records may be incomplete. Our augmentation model utilizes not only the existing pronunciations, represented by phonemes (which we will refer to as phonological features), but also rime book features.

More formally, let $c$ be the character in a record. Let its categorical rime book features be $r_c$. For example, the rime book features of the character 含 (han) can be encoded as [匣 (xia), 覃 (tan), 咸 (xian), 平 (ping), 開 (kai), 一 (yi)]. The multi-class vector $r_c$ is then converted to a binary vector $b_c$ by concatenating each "flattened" component of $r_c$: a component with three possible values, for instance, is "flattened" to a binary vector of dimension 3 (a short illustration of this flattening appears at the end of this subsection). Since there are six components in $r_c$, $b_c$ is a binary vector whose length is the total number of possible values over the six components. Let there be $D$ modern dialects $d_1, \ldots, d_D$. Each dialect has a fixed number of phonological features.

Take the character 含 as an example: its rime book feature vector is [匣 (xia), 覃 (tan), 咸 (xian), 平 (ping), 開 (kai), 一 (yi)], and its phonological features (see Table II) for the Xiamen dialect would be ["12", "43", /h/, …, /a/, …, false, /m/].

TABLE II
ENCODED PHONOLOGICAL FEATURES OF THE DOC DATASET

The problem can be stated as follows: suppose there are in total $F$ phonological features over all dialects. Given $N$ binary rime book feature vectors $b_1, \ldots, b_N$ and a partially filled phonological feature table of dimension $N$ by $F$ for characters $c_1, \ldots, c_N$, our goal is equivalent to filling that table out. Fig. 1 depicts the scheme of the input under this problem definition. Definitions of the symbols introduced in this section can be found in Table I.

Fig. 1. Scheme of the input data. There are $N$ characters, each of which has its binary rime book feature vector known. Some of the phonological features may be missing. Our goal is to fill in the missing values, and the output is a complete table.
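As a concrete illustration of the flattening step above, the following Python sketch builds $b_c$ for the character 含. The value inventories shown are illustrative subsets, not the actual category sets used in the paper, and the function name is hypothetical.

```python
# A minimal sketch (not the authors' code) of "flattening" the six
# categorical rime book features into one binary vector b_c.
RIME_BOOK_VALUES = {
    "initial":     ["匣", "見", "曉"],       # 聲母 (illustrative subset)
    "final":       ["覃", "咸", "夬", "佳"],  # 韻
    "rhyme_group": ["咸", "深", "山"],        # 攝
    "tone":        ["平", "上", "去", "入"],  # 聲調
    "openness":    ["開", "合"],              # 呼
    "grade":       ["一", "二", "三", "四"],  # 等
}

def flatten(record: dict) -> list[int]:
    """Concatenate one one-hot vector per component into b_c."""
    bits = []
    for feat, values in RIME_BOOK_VALUES.items():
        onehot = [0] * len(values)
        onehot[values.index(record[feat])] = 1
        bits.extend(onehot)
    return bits

# 含 (han): [匣, 覃, 咸, 平, 開, 一]  ->  a 0/1 vector b_c
b_han = flatten({"initial": "匣", "final": "覃", "rhyme_group": "咸",
                 "tone": "平", "openness": "開", "grade": "一"})
```

The resulting vector has one position per possible category value, with exactly six positions set to 1.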

B. Model Considerations

As described in Section II, nearly every Chinese dialect's phonology is highly correlated with both the categorical features described in 廣韻 (Guangyun) and 韻鏡 (Yunjing), and with other Chinese dialects' phonological features. For example, there is a clear correspondence among the rime book feature 深攝 (shen-she), the Cantonese rhyme /am/, and the Xiamen rhyme /im/. While the rime book alone offers much insight into many dialects' phonology, some characters listed under different rime-book rhymes have clear correspondence among dialects. To augment missing phonological features, all of the above phenomena should be taken into consideration.

We propose a model that simultaneously captures phonological similarities across dialects and rime book features, using latent variables which we call superlingual rhymes (SLRs). Our model splits each character's record into two parts. The first part contains its rime book features, while the second consists of its phonological features. Our task is to augment missing values in the second part. We know that rime book features are highly correlated with phonological features in every Chinese dialect. Therefore, we employ rime book features to estimate missing phonological features. In addition, our model also employs the other dialects' phonological features. Our basic idea is to introduce superlingual rhymes as an intermediate layer between rime book features and all dialects' phonological features. The pronunciation of each character can then be represented as a mixture of all superlingual rhymes; that is, for each superlingual rhyme, the character has a proportional value. Since the phonological features are all categorical data, they are naturally modeled with multinomial distributions. As in every Bayesian model, we impose priors on these multinomials. Following many previous works such as [18] and [19], we chose the Dirichlet distribution, which allows analytic expression of the posterior probability. The proportional values of a character follow a Dirichlet distribution whose parameters are decided by log-linear functions of the character's rime book features; this approach is also known as logistic regression. Because of the conjugacy between the Dirichlet and multinomial distributions, we can obtain the posterior distribution of a character over SLRs easily [20], [21]. Mixing a generative model with logistic regression is akin to the paradigm advocated by [22]. Similarly, using the multinomial-Dirichlet conjugacy, we can estimate the distribution of a superlingual rhyme over phonological features. Then, because each character's proportion of each superlingual rhyme and each superlingual rhyme's proportion of each phonological feature are known, missing phonological features can be augmented.

C. Model Description

A plate diagram for our proposed model is shown in Fig. 2. Let an observation $w$ be a tuple of two components, $w = (v, d)$, where $v$ is an observed phonological feature value and $d$ is the dialect of $v$. For every observation $w$ of character $c$, there is a latent SLR $z$; the character is a mixture of SLRs. To simplify the explanation, we assume every dialect has only one phonological feature, namely $v$. In the real model, each observation has multiple phonological features for dialect $d$, but the model's structure is roughly the same.

We describe the model as follows. Let there be $K$ SLRs $s_1, \ldots, s_K$. Each $s_k$ has multinomial distributions $\phi_{k,d}$ over phonological feature values, and a multinomial distribution $\psi_k$ over the dialects. The $\phi$'s and $\psi$'s are given uniform Dirichlet priors $\mathrm{Dirichlet}(\beta)$ and $\mathrm{Dirichlet}(\gamma)$. In our experiments, each component of both $\beta$ and $\gamma$ is set to 0.001, making the priors rather sparse.

Recall that the binary rime book feature vector of character $c$ is $b_c$. Let there be $K$ rime book feature weight vectors $\lambda_1, \ldots, \lambda_K$, each of the same dimension as $b_c$. We then define the prior over all SLRs in character $c$ to be a multinomial distribution $\theta_c$ with Dirichlet prior $\theta_c \sim \mathrm{Dirichlet}(\alpha_c)$. Note that $\alpha_{c,k} = \exp(\lambda_k^\top b_c)$; in other words, the prior probability of SLR $s_k$ is proportional to $\exp(\lambda_k^\top b_c)$, a log-linear function. $\lambda$ is treated as a given value in the generating part; indeed, it is given a Normal prior, but we do not change its value through MCMC steps. Rather, its value is obtained by maximizing the likelihood of the generative model. We will go into details in Section IV-D.

We now describe the generating process; a plate diagram for this model is depicted in Fig. 2, and a short sketch of the process follows at the end of this subsection.

1) For each SLR $s_k$:
   a) draw a multinomial distribution over dialects, $\psi_k \sim \mathrm{Dirichlet}(\gamma)$;
   b) for each dialect $d$, draw a multinomial distribution over phonological feature values, $\phi_{k,d} \sim \mathrm{Dirichlet}(\beta)$.
2) For each character $c$ and its binary rime book feature vector $b_c$:
   a) for each SLR $s_k$, compute $\alpha_{c,k} = \exp(\lambda_k^\top b_c)$;
   b) draw $\theta_c \sim \mathrm{Dirichlet}(\alpha_c)$;
   c) for each observation $w_{c,i} = (v_{c,i}, d_{c,i})$:
      • draw $z_{c,i} \sim \mathrm{Multinomial}(\theta_c)$;
      • draw $d_{c,i} \sim \mathrm{Multinomial}(\psi_{z_{c,i}})$;
      • draw $v_{c,i} \sim \mathrm{Multinomial}(\phi_{z_{c,i}, d_{c,i}})$.

Fig. 2. Plate diagram of our proposed generative model. Shaded nodes are observed data.
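To make the generative story concrete, here is a minimal Python sketch that draws synthetic observations from the process above. The sizes $V$ and the rime-book vector length, the random weights, and all helper names are assumptions for illustration; only $K = 200$, $D = 21$, and $N = 5403$ come from the paper.

```python
# A sketch of the generative process in Section IV-C (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
K, D, V, N, L = 200, 21, 50, 5403, 300    # V and L are assumed sizes
beta, gamma = 0.001, 0.001                # sparse Dirichlet priors
B = rng.integers(0, 2, size=(N, L))       # binary rime book vectors b_c
lam = rng.normal(scale=0.1, size=(K, L))  # weight vectors lambda_k

psi = rng.dirichlet(np.full(D, gamma), size=K)      # SLR -> dialect
phi = rng.dirichlet(np.full(V, beta), size=(K, D))  # SLR, dialect -> value

def generate(c: int, n_obs: int):
    """Draw n_obs (value, dialect) observations for character c."""
    alpha_c = np.exp(B[c] @ lam.T)       # log-linear Dirichlet parameters
    theta_c = rng.dirichlet(alpha_c)     # character's mixture over SLRs
    obs = []
    for _ in range(n_obs):
        z = rng.choice(K, p=theta_c)     # latent SLR
        d = rng.choice(D, p=psi[z])      # dialect
        v = rng.choice(V, p=phi[z, d])   # phonological feature value
        obs.append((v, d, z))
    return obs
```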

D. Inference

Without subscripts, the full joint distribution, expressed as a product of distributions, is

$$p(v, d, z, \theta, \phi, \psi \mid \lambda, b, \beta, \gamma) = p(\theta \mid \lambda, b)\, p(z \mid \theta)\, p(\psi \mid \gamma)\, p(d \mid z, \psi)\, p(\phi \mid \beta)\, p(v \mid z, d, \phi). \quad (1)$$

Written out with subscripts, this becomes

$$p(v, d, z, \theta, \phi, \psi \mid \lambda, b, \beta, \gamma) = \prod_{k} p(\psi_k \mid \gamma) \prod_{k,d} p(\phi_{k,d} \mid \beta) \prod_{c} \Big[ p(\theta_c \mid \alpha_c) \prod_{i} p(z_{c,i} \mid \theta_c)\, p(d_{c,i} \mid \psi_{z_{c,i}})\, p(v_{c,i} \mid \phi_{z_{c,i}, d_{c,i}}) \Big], \quad (2)$$

where $\alpha_{c,k} = \exp(\lambda_k^\top b_c)$. This equation can be rearranged and simplified by grouping the terms that share the same multinomial parameters. The variables $\theta$, $\phi$, and $\psi$ can then be integrated out using the identity

$$\int \prod_{j} \theta_j^{n_j}\, \mathrm{Dirichlet}(\theta; \alpha)\, d\theta = \frac{B(\alpha + n)}{B(\alpha)}, \qquad B(\alpha) = \frac{\prod_j \Gamma(\alpha_j)}{\Gamma\big(\sum_j \alpha_j\big)},$$

and then we have

$$p(v, d, z \mid \lambda, b, \beta, \gamma) = \prod_{c} \frac{B(\alpha_c + n_{c,\cdot})}{B(\alpha_c)} \prod_{k} \frac{B(\gamma + n_{k,\cdot})}{B(\gamma)} \prod_{k,d} \frac{B(\beta + n_{k,d,\cdot})}{B(\beta)}, \quad (3)$$

where $n_{k,d}$ is the number of observations that have dialect $d$ with SLR $s_k$, $n_{k,d,v}$ is the number of observations that have phonological feature value $v$ with SLR $s_k$ and dialect $d$, and $n_{c,k}$ is the number of observations with SLR $s_k$ in character $c$; the dot subscript collects the corresponding counts into a vector. More details are available in Appendix A.

In (3) we have four variables, $v$, $d$, $z$, and $\lambda$, and we cannot sample from $p(z \mid v, d, \lambda)$ directly. However, it can be shown that there exists an efficient Gibbs sampler to infer $z$; we subsequently use optimization methods to compute $\lambda$.

1) The Gibbs Sampler: Gibbs sampling is an MCMC technique for sampling from a complex, multivariate distribution. It can be applied if, given variables $x_1, \ldots, x_n$, sampling from $p(x_1, \ldots, x_n)$ directly is impossible, but sampling from the conditional distributions $p(x_i \mid x_{-i})$ is feasible. Below is the generic Gibbs sampler:

1) randomly assign values to $x_1, \ldots, x_n$;
2) for $t = 1$ to an arbitrarily assigned $T$:
   • for $i = 1$ to $n$,
     a) re-sample a new value of $x_i$ from $p(x_i \mid x_{-i})$.

The variable $x_{-i}$ denotes all of $x_1, \ldots, x_n$ except $x_i$. If $T$ is sufficiently large, the resultant values can be regarded as a sample from $p(x_1, \ldots, x_n)$.

Since the training data already provide us with $v$ and $d$, we do not resample them. Neither do we resample $\lambda$; instead, we use optimization methods to find the most probable $\lambda$. We now only need to collect samples of $z$. Since $z$ is a vector of variables consisting of all observed values' (unobserved) SLRs, $p(z \mid v, d, \lambda)$ is actually multivariate; we use the Gibbs sampling technique here and obtain samples of $z$ by alternately sampling from $p(z_{c,i} \mid z^{-(c,i)}, v, d, \lambda)$. This conditional can be expressed as

$$p(z_{c,i} \mid z^{-(c,i)}, v, d, \lambda) = \frac{p(z, v, d \mid \lambda, \beta, \gamma)}{\sum_{k'} p(z^{-(c,i)}, z_{c,i} = k', v, d \mid \lambda, \beta, \gamma)}, \quad (4)$$

where $z^{-(c,i)}$ denotes $z$ except $z_{c,i}$. After reorganization (the details are in Appendix B), we have

$$p(z_{c,i} = k \mid z^{-(c,i)}, v, d, \lambda) \propto \big(n_{c,k}^{-} + \alpha_{c,k}\big)\; \frac{n_{k,d_{c,i}}^{-} + \gamma_{d_{c,i}}}{n_{k}^{-} + \sum_{d'} \gamma_{d'}}\; \frac{n_{k,d_{c,i},v_{c,i}}^{-} + \beta_{v_{c,i}}}{n_{k,d_{c,i}}^{-} + \sum_{v'} \beta_{v'}}, \quad (5)$$

where the superscript $-$ denotes counts computed with the current assignment of $z_{c,i}$ excluded, and $n_k$ is the total number of observations with SLR $s_k$.

Now we describe the Gibbs sampler for $z$:

1) randomly assign values to $z$;
2) for $t = 1$ to an arbitrarily assigned $T$:
   • for each observation $(c, i)$,
     a) re-sample a new value of $z_{c,i}$ using (5).

2) Computing $\lambda$: Unlike $z$, we do not use MCMC techniques to find $\lambda$, because it is difficult to derive a Gibbs sampler for it. On the other hand, for our purpose a MAP estimate of $\lambda$ suffices. We use L-BFGS to solve this numeric optimization problem. L-BFGS requires the loss function and the gradient for minimization [23]. First, from (3) we can derive the loss function, which is the negative log-likelihood of $\lambda$:

$$\mathcal{L}(\lambda) = -\sum_{c} \Bigg[ \log \Gamma\Big(\sum_{k} \alpha_{c,k}\Big) - \log \Gamma\Big(\sum_{k} \alpha_{c,k} + n_c\Big) + \sum_{k} \Big( \log \Gamma(\alpha_{c,k} + n_{c,k}) - \log \Gamma(\alpha_{c,k}) \Big) \Bigg] + C, \quad (6)$$

where $C$ is a constant, $n_c$ is the number of observations of character $c$, and recall that for character $c$, $\alpha_{c,k} = \exp(\lambda_k^\top b_c)$. Likewise, we derive the gradient $\nabla \mathcal{L}$:

$$\frac{\partial \mathcal{L}}{\partial \lambda_k} = -\sum_{c} \alpha_{c,k}\, b_c \Bigg[ \Psi\Big(\sum_{j} \alpha_{c,j}\Big) - \Psi\Big(\sum_{j} \alpha_{c,j} + n_c\Big) + \Psi(\alpha_{c,k} + n_{c,k}) - \Psi(\alpha_{c,k}) \Bigg],$$

where $\Psi$ is the digamma function. As previously stated, we can minimize $\mathcal{L}$ if we can compute both $\mathcal{L}$ and $\nabla \mathcal{L}$; and minimizing $\mathcal{L}$ in turn maximizes the likelihood $p(z \mid \lambda, b)$.
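The sampler for $z$ can be implemented directly from (5) by keeping running count arrays and excluding the current assignment before evaluating the conditional. A minimal sketch, assuming symmetric scalar priors $\beta$ and $\gamma$ and hypothetical count-array names; this is a reconstruction, not the authors' published code.

```python
# Collapsed Gibbs update of (5) for one observation (v, d) of character c.
import numpy as np

def resample_z(c, v, d, z_old, counts, alpha, beta, gamma, rng):
    """counts: arrays n_ck (N,K), n_kd (K,D), n_kdv (K,D,V), n_k (K,)."""
    n_ck, n_kd, n_kdv, n_k = counts["ck"], counts["kd"], counts["kdv"], counts["k"]
    D, V = n_kd.shape[1], n_kdv.shape[2]

    # remove the current assignment from the counts
    for arr, idx in ((n_ck, (c, z_old)), (n_kd, (z_old, d)),
                     (n_kdv, (z_old, d, v)), (n_k, (z_old,))):
        arr[idx] -= 1

    # p(z = k | z^-, v, d) as in (5), up to a normalizing constant
    p = ((n_ck[c] + alpha[c])
         * (n_kd[:, d] + gamma) / (n_k + D * gamma)
         * (n_kdv[:, d, v] + beta) / (n_kd[:, d] + V * beta))
    z_new = rng.choice(len(p), p=p / p.sum())

    # add the new assignment back
    for arr, idx in ((n_ck, (c, z_new)), (n_kd, (z_new, d)),
                     (n_kdv, (z_new, d, v)), (n_k, (z_new,))):
        arr[idx] += 1
    return z_new
```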

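The $\lambda$ step, and the alternation with Gibbs sweeps formalized in Section IV-E below, can be sketched with SciPy's L-BFGS interface. The loss and gradient follow (6) and its digamma gradient; function names and array shapes are assumptions.

```python
# Sketch of the MAP estimation of lambda (Section IV-D2), assuming
# B (N,L) binary rime book vectors and n_ck (N,K) SLR counts from Gibbs.
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln, digamma

def neg_log_lik_and_grad(lam_flat, B, n_ck, n_c, K):
    """Loss (6) and its gradient w.r.t. the weight vectors lambda_k."""
    N, L = B.shape
    lam = lam_flat.reshape(K, L)
    alpha = np.exp(B @ lam.T)                     # alpha_ck, shape (N, K)
    a0 = alpha.sum(axis=1)
    loss = -np.sum(gammaln(a0) - gammaln(a0 + n_c)
                   + np.sum(gammaln(alpha + n_ck) - gammaln(alpha), axis=1))
    # d loss / d alpha_ck, then chain rule through alpha = exp(B lam^T)
    d_alpha = -(digamma(a0) - digamma(a0 + n_c))[:, None] \
              - (digamma(alpha + n_ck) - digamma(alpha))
    grad = ((d_alpha * alpha).T @ B).ravel()      # shape (K*L,)
    return loss, grad

def fit_lambda(lam, B, n_ck, K):
    n_c = n_ck.sum(axis=1)
    res = minimize(neg_log_lik_and_grad, lam.ravel(), jac=True,
                   args=(B, n_ck, n_c, K), method="L-BFGS-B")
    return res.x.reshape(K, -1)

# EM-like alternation (Section IV-E): run Gibbs sweeps to update the
# counts n_ck, then re-fit lambda; repeat until the likelihood stabilizes.
```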
E. Inference Procedure

In Section IV-D we described a Gibbs sampler that samples from the posterior $p(z \mid v, d, \lambda)$, and in Section IV-D2 we derived $\mathcal{L}$ and $\nabla \mathcal{L}$, which enable us to maximize the likelihood of $\lambda$. We use an EM-like algorithm for inference [24]: in alternating steps, we repeatedly sample $z$ and maximize with respect to $\lambda$. The posterior feature-value distribution can be sampled once the latent SLRs are fixed. To augment a missing phonological feature, we output the mode of the samples over several iterations.

V. DATA AND EVALUATION METRICS

A. Data

The experiments are conducted on the DOC dataset described in Section III. In this dataset, each record corresponds to one pronunciation of a Chinese character. For example, the polyphone "正", with two Mandarin pronunciations (zheng1 and zheng4), has two corresponding records. The number of pronunciations for a character is determined by Guangyun. For each record, the DOC dataset lists all existing pronunciations in 21 dialects from all major dialect groups. In the original DOC, pronunciations are transcribed in IPA notation. [25] represented these IPA transcriptions with eight phonological features, listed in Table II. Given that there are 21 dialects and eight features, each record contains a total of 168 phonological features. Some records are incomplete because certain phonological features do not exist in some dialects. After disambiguation of polyphone characters, we have 5403 records.

B. Evaluation Metrics

Individual pronunciation feature accuracy (IPFA) is measured as the number of correctly predicted phonological features over the number of phonological features in the test set. Overall pronunciation feature accuracy (OPFA) is measured as the number of correctly predicted records over the number of records in the test set.

C. Evaluation Scheme

To evaluate prediction accuracy in a given dialect $d_t$, all phonological features of that dialect are regarded as ground truth labels. Some phonological features of dialects other than $d_t$ may be missing; they are all filled in using either our proposed model or a baseline classifier, depending on which augmentation method is used in that configuration.

Since one of our foci is augmentation (see Section VI-C), in the augmentation experiments we randomly remove phonological features from all dialects except $d_t$. The detailed procedure is as follows: first we create two subsets of the main dataset, with 10% and 20% of fields (phonological features) missing, respectively. The missing fields are then augmented as previously described. Note that the phonological features of dialect $d_t$ are not used for prediction of other phonological features. After the missing pronunciations are augmented, no records have empty fields.

To conduct the statistical significance test, we perform the following procedure 30 times. We randomly split the records 2:1 into training (67%) and test (33%) data. Since each record is associated with multiple labels, we employ multiclass SVMs to learn the labels independently (a sketch of this per-feature setup appears after Section VI-A). The features fed to the SVM classifiers are the binary rime book feature vectors (the $b_c$'s) and the phonological features of all dialects except dialect $d_t$. The corresponding labels are the phonological features of dialect $d_t$, and the output from these classifiers are the predicted labels, i.e., the phonological features of dialect $d_t$.

D. t-Test

We apply two-sample tests to examine whether one configuration is better than another with statistical significance. Two-sample t-tests are applied, since we assume the samples are independent. As the number of samples is large and the samples' standard deviations are known, the following two-sample statistic is appropriate in this case:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n + s_2^2/n}},$$

where $\bar{x}$ is mean accuracy, $s^2$ is the variance of accuracy, and $n$ is the number of samples (in our experiments, $n = 30$). If the resulting score is at most 1.67, with 29 degrees of freedom at the 95% significance level, the null hypothesis is accepted; otherwise it is rejected.

VI. EXPERIMENTS

We designed three experiments on character pronunciations of the Chaozhou dialect, a Min dialect spoken in eastern Guangdong, to evaluate the effects of the following factors.

A. Effect of Dialectal Data on Standard Classifiers

The conventional approach employed by philologists for Chinese dialect pronunciation prediction is to find correspondences between rime book categories and modern pronunciation, often through laborious human inspection. However, a clear correspondence between the two does not always exist. In the Wu dialect, for example, the rime book categories 夬 (guai) and 佳 (jia) are not clearly distinguished, sometimes being rendered as -ua and sometimes as -uo. Introducing dialectal data (other dialects' phonological features) may help distinguish pronunciations in some dialects.

We train the SVM classifiers to predict character pronunciations in Chaozhou. As previously described, we conducted two runs:

1) Rime Book Only (R): In this run, only the rime book features, namely 聲母 (initials), 韻 (rhymes/finals), 攝 (rhyme groups), 聲調 (tones), 呼 (openness), and 等 (grades), are included.

2) Rime Book + Full Dialectal Data (R+F): In addition to rime book features, all dialectal data are used. In cases where there are missing pronunciations, a random guess is supplied for each phonological feature for the SVM classifier.

TABLE III
PREDICTION ACCURACY WITH/WITHOUT DIALECTAL DATA
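As a concrete sketch of the classifier setup of Section V-C, one multiclass SVM per phonological feature of the target dialect can be trained with scikit-learn, along with the IPFA/OPFA metrics of Section V-B. The RBF parameter values here are placeholders, not the paper's settings.

```python
# One SVC per target-dialect phonological feature, plus IPFA/OPFA.
import numpy as np
from sklearn.svm import SVC

def train_predict(X_train, Y_train, X_test):
    """Y_train: (n_records, n_target_features); one SVC per column."""
    preds = []
    for j in range(Y_train.shape[1]):
        clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # assumed parameters
        clf.fit(X_train, Y_train[:, j])
        preds.append(clf.predict(X_test))
    return np.stack(preds, axis=1)

def ipfa(Y_pred, Y_true):
    """Individual pronunciation feature accuracy, per feature column."""
    return (Y_pred == Y_true).mean(axis=0)

def opfa(Y_pred, Y_true):
    """Overall accuracy: a record counts only if every feature is right."""
    return (Y_pred == Y_true).all(axis=1).mean()
```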

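The significance test of Section V-D can be reproduced from the 30 runs' summary statistics; SciPy's ttest_ind_from_stats computes the same unpaired comparison. The numbers below are placeholders, not results from the paper.

```python
# Two-sample test from summary statistics (Section V-D).
from scipy.stats import ttest_ind_from_stats

stat, p_value = ttest_ind_from_stats(
    mean1=0.80, std1=0.02, nobs1=30,   # e.g., mean accuracy of one run
    mean2=0.78, std2=0.02, nobs2=30,   # e.g., mean accuracy of the other
    equal_var=False)                   # Welch's form, matching the formula
```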
TABLE IV
PREDICTION ACCURACY WITH DIFFERENT DIALECT GROUPS

The results are listed in Table III. It is obvious that by including dialectal data, we obtain a significant performance gain.

B. Impact of Proximate Dialects

[26] reported that POS tagging performance can be improved by including more languages, especially closely related languages. We carried out experiments to see whether using rime book features (R) with closely related dialects (+C) is more effective than with distantly related dialects (+D).

We compared the OPFA on the Xi'an and Chaozhou dialects, which belong to the Mandarin and Min dialect groups, respectively. The Mandarin dialects we use in the experiments are Jinan, Taiyuan, and Beijing; for the Min dialects we use Xiamen, Fuzhou, and Jian'ou. For each dialect we conduct two runs, the first using dialects from the same dialect group and the second using dialects from the other dialect group. To make the comparison meaningful, we make the ratio of missing entries the same in every run by randomly removing entries, and the missing entries are then filled in without sophisticated augmentation: each run has 10% of pronunciations removed and augmented with random guesses (the masking protocol is sketched below). OPFA averaged over 30 runs is listed in Table IV. The results show that R+C outperforms R+D for both the Xi'an and Chaozhou dialects by a statistically significant margin.
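A sketch of the random-removal protocol used above (and reused in Section VI-C), assuming the pronunciation table is an object array with None marking missing fields; the function name is hypothetical.

```python
# Remove a fixed fraction of fields uniformly at random for masking runs.
import numpy as np

def mask_fields(table, frac, rng):
    """table: object array (n_records, n_features); returns masked copy."""
    masked = table.copy()
    present = np.argwhere(masked != None)  # elementwise test on object array
    drop = rng.choice(len(present), size=int(frac * len(present)),
                      replace=False)
    for i, j in present[drop]:
        masked[i, j] = None                # now missing, to be augmented
    return masked

# e.g., 10% removal as in Section VI-B:
# masked = mask_fields(table, 0.10, np.random.default_rng(0))
```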

C. Effect of Data Augmentation

As described in Section I, the data for many Chinese dialects are scarce. Our data augmentation model is designed to fill in missing pronunciation information. If our augmentation model is effective, one application would be to use multiple resource-poor dialects to augment missing data in another dialect's pronunciation database. For data augmentation, we use the procedure described in Section V-C to fill in the missing pronunciations in the Chaozhou dialect.

For comparison, we employ three different methods to augment the missing data as baselines:

1) Logistic Regression (-L): Using the rime book features $b_c$, a discriminative model is trained to predict missing phonological values.

2) Naive Bayes (-N): Similar to the logistic regression model, a generative model is trained using $b_c$ to predict missing phonological values.

3) Random (-R): The missing phonological values are guessed randomly.

In this experiment we test two different amounts of removal, 10% and 20%. All the following SVM classifiers use the RBF kernel, with the same parameter settings in all runs. The number of SLRs in our augmentation model is set to 200.

The results and the corresponding $p$ values are listed in Table V. Using our data augmentation model consistently improves OPFA. Interestingly, the margin of improvement appears greater when using closely related dialect data than when using distantly related dialect data, on both datasets.

TABLE V
EFFECTS OF DATA AUGMENTATION WITH CLOSELY (R+C) AND DISTANTLY (R+D) RELATED DIALECT DATA

VII. ANALYSIS AND DISCUSSION

We are interested in how the choice of training dialects affects individual feature predictions. Table VI shows the percentage IPFA improvement over the baseline random augmentation. The R+C run benefits from the augmentation in all features except nasalization, the reason for which is unclear.

TABLE VI
IPFA TABLE

In the R+D run, the tone, initial, and final features show worse IPFA after augmentation. This can be explained by considering the assumptions of our proposed model. We assume that the dialects exhibit correspondence among phonological features across dialects; that is, corresponding phonological features across dialects should be put under the same SLR. Therefore, if the dialects lack such correspondence, the augmented features may be inaccurate. It is evident that phonological features such as tones, initials, and finals do not have good correspondence across different dialect families [27]. Recent research suggests that tones in Min dialects may be related to an innovation of the Wu-Min proto-dialect [27], which Mandarin did not share. As for initials, there is a striking difference between the "heavy" and "light" initial distinction in Mandarin and Min dialects [28]. Finals also lack good correspondence: the Min dialects have preserved most final stops from Middle Chinese, while Mandarin dialects have lost many. Thus, it is difficult to predict final consonants in Min dialects using Mandarin dialects, and vice versa.

The IPFA metric seems to reflect the level of correspondence between the target dialect and other dialects, both closely and distantly related. The possibility of determining dialectal relationships between individual dialects by comparing respective IPFA improvement scores may lead to interesting discoveries.

VIII. CONCLUSION

We propose a novel generative model that makes use of both existing dialect pronunciation data and medieval rime books to discover phonological patterns that exist in multiple dialects, which are referred to as superlingual rhymes (SLRs) in our proposed model. The proposed model can predict character pronunciations for a dialect based on existing dialect pronunciation tables (even if incomplete) and the pronunciation data in rime books. We evaluate the prediction accuracy in terms of phonological features, such as tone and initial phoneme. For each character, the phonological features are also evaluated as a whole, under overall pronunciation feature accuracy (OPFA). Our first experiment shows that adding features from dialectal pronunciation data to our baseline rime-book model dramatically improves OPFA using the support vector machine (SVM) model. In the second experiment, we compare the performance of the SVM model using phonological features from closely related dialects with that of the model using phonological features from non-closely related dialects. The results show that using features from closely related dialects yields higher accuracy. In the third experiment, we show that using our proposed data augmentation model to fill in missing data can increase the SVM model's OPFA by up to 7.6%. We also note that this improvement is greater when using closely related dialect data.

APPENDIX A
INTEGRATION OF $\theta$, $\phi$, AND $\psi$

Since $\theta$, $\phi$, and $\psi$ have Dirichlet priors, the posterior distributions of $\theta$, $\phi$, and $\psi$, which have the form $\prod_j \theta_j^{n_j}\, \mathrm{Dirichlet}(\theta; \alpha)$, are Dirichlet-multinomial distributions, as introduced in [29]. We clarify how we integrate out these variables with $\phi$ as an example. For convenience, (1) is relisted here:

$$p(v, d, z, \theta, \phi, \psi \mid \lambda, b, \beta, \gamma) = p(\theta \mid \lambda, b)\, p(z \mid \theta)\, p(\psi \mid \gamma)\, p(d \mid z, \psi)\, p(\phi \mid \beta)\, p(v \mid z, d, \phi).$$

By fixing $z$ and $d$, the terms involving $\phi$ and $v$ in (1) are

$$p(\phi \mid \beta)\, p(v \mid z, d, \phi) = \prod_{k,d} p(\phi_{k,d} \mid \beta) \prod_{c,i} p(v_{c,i} \mid \phi_{z_{c,i}, d_{c,i}}).$$

The latter terms can be refactored to $\prod_{k,d} \prod_{v} \phi_{k,d,v}^{n_{k,d,v}}$, where $n_{k,d,v}$ is the number of observations with phonological feature value $v$, SLR $s_k$, and dialect $d$. Thus, the expression can be rewritten as $\prod_{k,d} p(\phi_{k,d} \mid \beta) \prod_{v} \phi_{k,d,v}^{n_{k,d,v}}$. Applying the Dirichlet integral identity of Section IV-D, we have

$$\int \prod_{k,d} p(\phi_{k,d} \mid \beta) \prod_{v} \phi_{k,d,v}^{n_{k,d,v}}\, d\phi = \prod_{k,d} \frac{B(\beta + n_{k,d,\cdot})}{B(\beta)}. \quad (7)$$

The variables $\psi$ and $\theta$ can be integrated out in the same fashion.

APPENDIX B
DERIVATION OF THE GIBBS SAMPLER

By the definition of conditional probability, the conditional distribution of a single SLR assignment is

$$p(z_{c,i} = k \mid z^{-(c,i)}, v, d, \lambda) = \frac{p(z^{-(c,i)}, z_{c,i} = k, v, d \mid \lambda, \beta, \gamma)}{\sum_{k'} p(z^{-(c,i)}, z_{c,i} = k', v, d \mid \lambda, \beta, \gamma)}, \quad (8)$$

and, substituting (3) and again using the identity $B(\alpha) = \prod_j \Gamma(\alpha_j) / \Gamma(\sum_j \alpha_j)$, we have

$$p(z_{c,i} = k \mid z^{-(c,i)}, v, d, \lambda) \propto \frac{B(\alpha_c + n_{c,\cdot})}{B(\alpha_c + n^{-}_{c,\cdot})}\, \frac{B(\gamma + n_{k,\cdot})}{B(\gamma + n^{-}_{k,\cdot})}\, \frac{B(\beta + n_{k,d_{c,i},\cdot})}{B(\beta + n^{-}_{k,d_{c,i},\cdot})}. \quad (9)$$

Using the fact that $n_k = \sum_{d} n_{k,d}$, where $n_k$ is the number of observations with SLR $s_k$, and the fact that a character has a fixed number of observations, we can further simplify (9) into

$$p(z_{c,i} = k \mid z^{-(c,i)}, v, d, \lambda) \propto \big(n_{c,k} - \Delta + \alpha_{c,k}\big)\; \frac{n_{k,d_{c,i}} - \Delta + \gamma_{d_{c,i}}}{n_k - \Delta + \sum_{d'} \gamma_{d'}}\; \frac{n_{k,d_{c,i},v_{c,i}} - \Delta + \beta_{v_{c,i}}}{n_{k,d_{c,i}} - \Delta + \sum_{v'} \beta_{v'}}, \quad (10)$$

where $\Delta = 1$ if the current assignment of $z_{c,i}$ is $k$, and $\Delta = 0$ otherwise.

ACKNOWLEDGMENT

The authors would like to thank Prof. C.-C. Cheng for providing them the DOC dataset and the TASLP reviewers for their valuable comments, which helped them improve the quality of the paper.

REFERENCES

[1] "CMUdict, the CMU Pronouncing Dictionary," 1998. [Online]. Available: http://www.speech.cs.cmu.edu/cgi-bin/cmudict
[2] J. H. Jenkins and R. Cook, "Unicode Han Database," Unicode Consortium, Tech. Rep., 2009.
[3] L.-Q. Tong, "Survey on the usage of Chinese languages and script" (in Chinese). Beijing, China: Language and Literature Press, 2006. [Online]. Available: http://www.china-language.gov.cn/LSF/LSFrame.aspx
[4] C. Tang and V. J. van Heuven, "Mutual intelligibility of Chinese dialects experimentally tested," Lingua, vol. 119, no. 5, pp. 709–732, 2009.
[5] J. B. Jensen, "On the mutual intelligibility of Spanish and Portuguese," Hispania, vol. 72, no. 4, pp. 848–852, 1989.
[6] E. G. Pulleyblank, "Qieyun and Yunjing: The essential foundation for Chinese historical linguistics," J. Amer. Oriental Soc., vol. 118, no. 2, pp. 200–216, 1998.
[7] M. Streeter, "DOC, 1971: A Chinese dialect dictionary on computer," Comput. Humanities, vol. 6, no. 5, pp. 259–270, 1972.
[8] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell, "Text classification from labeled and unlabeled documents using EM," Mach. Learn., vol. 39, no. 2–3, pp. 103–134, 2000.
[9] X. Lu, B. Zheng, A. Velivelli, and C. Zhai, "Enhancing text categorization with semantic-enriched representation and training data augmentation," J. Amer. Med. Inform. Assoc., vol. 13, no. 5, pp. 526–535, 2006.
[10] D. van Dyk and X. Meng, "The art of data augmentation," J. Comput. Graph. Statist., vol. 10, no. 1, pp. 1–50, 2001.
[11] A. Bouchard-Côté, P. Liang, T. Griffiths, and D. Klein, "A probabilistic approach to diachronic phonology," in Proc. Empirical Methods in Natural Lang. Process. and Comput. Natural Lang. Learn. (EMNLP/CoNLL), 2007.
[12] M. Ben Hamed and F. Wang, "Stuck in the forest: Trees, networks and Chinese dialects," Diachronica, vol. 23, no. 1, pp. 29–60, 2006.
[13] W. S. Allen, Vox Latina: A Guide to the Pronunciation of Classical Latin. Cambridge, U.K.: Cambridge Univ. Press, 1978.
[14] P.-H. Ting, "Some thoughts on the reconstruction of Middle Chinese," J. Chinese Linguist., vol. 249, no. 6, p. 414, 1995.
[15] T.-L. Mei, "The survival of two pairs of distinctions in Southern Wu dialects," J. Chinese Linguist., vol. 280, no. 1, pp. 1–15, 2001.
[16] B. Snyder, T. Naseem, J. Eisenstein, and R. Barzilay, "Adding more languages improves unsupervised multilingual part-of-speech tagging: A Bayesian non-parametric approach," in Proc. NAACL-HLT, Morristown, NJ, 2009, pp. 83–91.
[17] S. Stüker, F. Metze, T. Schultz, and A. Waibel, "Integrating multilingual articulatory features into speech recognition," in Proc. 8th Eur. Conf. Speech Commun. Technol., 2003.
[18] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993–1022, 2003.
[19] S. Goldwater and T. Griffiths, "A fully Bayesian approach to unsupervised part-of-speech tagging," in Proc. 45th Annu. Meeting Assoc. Comput. Linguist., Prague, Czech Republic, Jun. 2007, pp. 744–751.
[20] G. Heinrich, "Parameter estimation for text analysis," Univ. of Leipzig, Leipzig, Germany, Tech. Rep., 2008. [Online]. Available: http://www.arbylon.net/publications/text-est.pdf
[21] P. Resnik and E. Hardisty, "Gibbs sampling for the uninitiated," Univ. of Maryland, Tech. Rep. CS-TR-4956, UMIACS-TR-2010-04, LAMP-153, 2010.
[22] T. Berg-Kirkpatrick, A. Bouchard-Côté, J. DeNero, and D. Klein, "Painless unsupervised learning with features," in Proc. NAACL-HLT, Los Angeles, CA, Jun. 2010, pp. 582–590.
[23] D. C. Liu and J. Nocedal, "On the limited memory BFGS method for large scale optimization," Math. Program., vol. 45, no. 3, pp. 503–528, 1989.
[24] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. R. Statist. Soc. Ser. B (Methodological), vol. 39, no. 1, pp. 1–38, 1977.
[25] C.-C. Cheng, "Measuring relationship among dialects: DOC and related resources," Comput. Linguist., vol. 2, no. 1, pp. 41–72, 1997.
[26] B. Snyder, T. Naseem, J. Eisenstein, and R. Barzilay, "Unsupervised multilingual learning for POS tagging," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), Morristown, NJ, 2008, pp. 1041–1050.
[27] R.-W. Wu, "A comparative study on the phonologies of Min and Wu dialects," Ph.D. dissertation, National Chengchi Univ., Taipei, Taiwan, 2005.
[28] U.-J. Ang, "On the motivation and typology of aspiration and nasalization" (in Chinese), in Proc. 6th Int. and 17th National Conf. Chinese Phonol., Taipei, Taiwan, May 1999.
[29] T. Minka, "Estimating a Dirichlet distribution," Mass. Inst. of Technol., Cambridge, MA, Tech. Rep., 2000.

Chu-Cheng Lin received the B.S. and M.S. degrees in computer science and information engineering from National Taiwan University, Taipei, in 2008 and 2010, respectively. His current research interests are information retrieval, natural language processing, and computational phonology.

Richard Tzong-Han Tsai received the B.S., M.S., and Ph.D. degrees in computer science and information engineering from National Taiwan University, Taipei, Taiwan, in 1997, 1999, and 2006, respectively. He was a Postdoctoral Fellow at Academia Sinica from 2006 to 2007. He is now an Assistant Professor in the Department of Computer Science and Engineering, Yuan Ze University, Zhongli, Taiwan. His research areas are natural language processing, cross-language information retrieval, biomedical literature mining, and information services on mobile devices.