IEEE Transactions on Audio, Speech, and Language Processing, Vol. 20, No. 4, May 2012

A Generative Data Augmentation Model for Enhancing Chinese Dialect Pronunciation Prediction

Chu-Cheng Lin and Richard Tzong-Han Tsai

Abstract—Most spoken Chinese dialects lack comprehensive digital pronunciation databases, which are crucial for speech processing tasks. Given complete pronunciation databases for related dialects, one can use supervised learning techniques to predict a Chinese character's pronunciation in a target dialect based on the character's features and its pronunciation in other related dialects. Unfortunately, Chinese dialect pronunciation databases are far from complete. We propose a novel generative model that makes use of both existing dialect pronunciation data and medieval rime books to discover patterns that exist in multiple dialects. The proposed model can augment missing dialectal pronunciations based on existing dialect pronunciation tables (even if incomplete) and the pronunciation data in rime books. The augmented pronunciation database can then be used in supervised learning settings. We evaluate the prediction accuracy in terms of phonological features, such as tone, initial phoneme, and final phoneme. For each character, the features are also evaluated as a whole, under overall pronunciation feature accuracy (OPFA). Our first experiment shows that adding features from dialectal pronunciation data to our baseline rime-book model dramatically improves OPFA using the support vector machine (SVM) model. In the second experiment, we compare the performance of the SVM model using phonological features from closely related dialects with that of the model using phonological features from non-closely related dialects. The results show that using features from closely related dialects yields higher accuracy. In the third experiment, we show that using our proposed data augmentation model to fill in missing data can increase the SVM model's OPFA by up to 7.6%.

Index Terms—Chinese dialects, data augmentation, generative model, pronunciation database.

Manuscript received October 31, 2010; revised March 14, 2011; accepted July 11, 2011. Date of publication October 17, 2011; date of current version February 10, 2012. This work was supported in part by the National Science Council under Grants NSC 98-2221-E-155-060-MY3 and NSC 99-2628-E-155-004. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Gokhan Tur. C.-C. Lin is with the Department of Computer Science and Information Engineering, National Taiwan University, Taipei 10617, Taiwan (e-mail: chu.cheng.[email protected]). R. T.-H. Tsai is with the Department of Computer Science and Engineering, Yuan Ze University, Zhongli 320, Taiwan (e-mail: [email protected]). Digital Object Identifier 10.1109/TASL.2011.2172424

I. INTRODUCTION

Character pronunciation databases are key resources in speech processing tasks such as speech recognition and synthesis. For official written languages, such databases are rich. For example, English has the CMU pronouncing dictionary [1], while Mandarin has the Unihan database [2]. For spoken languages, however, digitized pronunciation resources are not so plentiful. In China, this is particularly relevant: a 2004 survey of Chinese dialects revealed that more than 86% of the Chinese population can converse in a non-Mandarin dialect, while only 53% can converse in Mandarin [3]. However, there is a serious lack of such databases for non-Mandarin dialects. This situation impedes the development of speech processing technologies and applications for resource-poor dialects. Since compiling such resources is labor-intensive, our goal is to develop a tool to help automate the prediction of character pronunciations for different Chinese dialects.

Currently, most dialect pronunciation databases/dictionaries have been constructed by individual researchers and vary greatly in terms of completeness. If we had complete pronunciation databases for related dialects, we could use standard supervised learning techniques to predict a character's pronunciation in a target dialect. As mentioned above, however, pronunciation databases for most Chinese dialects are far from complete. Therefore, we propose a novel generative model that makes use of both existing dialect pronunciation data and medieval rime books to discover patterns that exist in multiple dialects. Unlike previous work, this model does not assume that language evolves like a branching tree, but only that character pronunciations across related dialects show patterns. The proposed model can augment character pronunciations for a dialect based on existing dialect pronunciation tables (even if incomplete) and the pronunciation data in medieval rime books. After augmentation, a standard classifier-based pronunciation prediction system can be constructed.

II. BACKGROUND OF CHINESE DIALECTS

A. Mutual Intelligibility

It is widely recognized that Chinese dialects are to a great extent mutually unintelligible. All the southern Chinese dialects have mean sentence intelligibility lower than 30% for nonnative speakers [4]. In comparison, Portuguese and Spanish have mutual intelligibility at roughly 60% [5].

Although the mutual intelligibility among Chinese dialects is very low, character pronunciations across dialects show regular correspondence. For example, the pronunciations of "肝" (gan/liver) and "寒" (han/frigid) sound utterly different in Mandarin and in other dialects; but within each dialect, the rhyming is consistent.

B. Rime Books

Other than areal influence, the striking correspondence is largely attributed to historical reasons [6], which can be seen in medieval rime books. Earlier rime books, such as "切韻" (Qieyun) (601 AD), record contemporary character pronunciations with "反切" (fanqie) analyses.


TABLE I
SYMBOLS USED IN SECTION IV

Fanqie represents a character's pronunciation with two other characters, combining the former's onset and the latter's rhyme and tone. An English equivalent would be to combine the onset of "peek" /piːk/ and the rhyme of "cat" /kæt/ to get "pat" /pæt/.

Obviously, there may be multiple combinations of characters that represent a single pronunciation in the fanqie system. In contrast, later rime books such as "韻鏡" (Yunjing) (900–950 AD) carried out finer phonological analysis, using fixed sets of characters to represent the phonological qualities of contemporary pronunciation [6]. A character pronunciation under the new system has six features, each taking its value from a fixed set of characters. The six features are 聲母 (initials), 韻 (rhymes/finals), 攝 (rhyme groups), 聲調 (tones), 呼 (openness), and 等 (grades). For example, the character 含 has 匣 (xia) as its 聲母, 咸 as its 攝, etc. These features cannot be directly employed to reconstruct Middle Chinese pronunciations, as the meaning of some features is still disputed. Nevertheless, modern dialects still bear the correspondence, and thus rime book features can be used to infer phonological correspondence in modern dialects between characters sharing the same rime book feature. For example, the two characters "含" (han) and "站" (zhan) are described with the same rhyme group character "咸" (xian), and they still rhyme in Mandarin and in Amoy, although the pronunciations do not rhyme across dialects. Thus, the rime books are very valuable resources in determining a character's pronunciation.

III. RELATED WORK

There are many modern dictionaries using phonetic alphabets to denote pronunciation for specific dialects, such as 粤音韻彙 (A Chinese Syllabary Pronounced According to the Dialect of Canton). In 1962, the first comprehensive cross-dialectal lexicon, 漢語方音字彙 (Hanyu Fangyin Zihui), was published. The original Zihui consists of approximately 2500 character readings in IPA notation from 17 modern Chinese dialects. In addition, the categorical descriptive features from the rime book 韻鏡 (Yunjing) are also provided.

Soon after its publication, Zihui was digitized under Project DOC (Dictionary on Computer) [7]. The Zihui lexicon is invaluable to the study of diachronic phonology. However, many dialects are still unrecorded. Another problem is that Zihui only contains about 2500 characters, far from the total number of Chinese characters (more than 50 000). These two flaws render the Zihui lexicon unsatisfactory when used as a dialect dictionary. Our work therefore proposes to augment the unseen characters and dialects using the dialects and character readings recorded in the Zihui lexicon.

Augmenting missing data with known information is not a new idea, as practiced by [8] and [9]. Data augmentation is generally done by introducing latent variables to model the training data [10]. In our problem, we need to model dialectal pronunciation data. A model of pronunciations has been proposed for the Romance languages by [11], which allows generation of word forms of both reconstructed and modern languages. A phylogenetic tree of Classical Latin, Vulgar Latin, Spanish, and Italian was built to model the evolutionary relationship among these languages. In this tree, Classical Latin is the root, Vulgar Latin is its child, and Spanish and Italian are Vulgar Latin's descendants. In their approach, the pronunciation of the root language must be given.

However, for Chinese dialects, the applicability of the tree model is disputed. [12] suggested that it may be more appropriate to model the development of Chinese dialects with a network. Even if Chinese dialects are placed into a tree structure after Bouchard-Côté et al.'s model, with Middle Chinese, which influenced the largest number of Chinese dialects, as the root language, we still encounter the following problem. Classical Latin's phonology has been well established [13]; therefore, the actual pronunciation can easily be deduced from the spelling. Unlike Classical Latin, the phonology and character pronunciations of Middle Chinese are still not wholly clear. For example, we know virtually nothing about the actual tones. Current reconstructions depend heavily upon medieval rime books, which are known to be a combination of at least two Middle Chinese dialects [14]. To derive a proper phylogenetic tree, one must first distinguish between the Middle Chinese dialects (at least two, according to Ting) and then correctly assign their respective offspring languages. However, current studies show that certain Wu dialects have at least two substrata, one from northern Middle Chinese and the other from southern Middle Chinese [15]. This directly violates the tree assumption. For a language whose ancestral pronunciations are not given, we cannot use Bouchard-Côté et al.'s model to predict a character's pronunciation in that language.

Some research tries to use the resources of other languages to deal with resource-poor languages. [16] shows that adding unannotated text in more languages can improve unsupervised POS tagging performance. [17] uses multilingual acoustic data to improve recognition performance on a newly seen language, sharing articulatory feature data among languages. These works assume that the linguistic data used during training exhibit patterns which carry over to the newly seen language; our work only assumes that Chinese dialects have consistent phonological correspondence with Middle Chinese and among themselves.

IV. METHODOLOGY

A. Problem Definition

Our task is to augment the pronunciation database of Chinese dialects. For each record, the given pronunciation database lists all existing pronunciations in the 21 dialects from all major dialect groups; some records may be incomplete. Our augmentation model utilizes not only the existing pronunciations, represented by phonemes (which we will refer to as phonological features), but also rime book features.

More formally, let $c$ be the character in a record. Let its categorical rime book features be $r_c$. For example, the rime book features of the character 含 (han) can be encoded as [匣 (xia), 覃 (tan), 咸 (xian), 平 (ping), 開 (kai), 一 (yi)]. The multi-class vector $r_c$ is then converted to a binary vector $b_c$ by concatenating each "flattened" component of $r_c$: a component with three possible values, for instance, is "flattened" to a binary vector of dimension 3 (a short illustration of this flattening appears at the end of this subsection). Since there are six components in $r_c$, $b_c$ is a binary vector whose length is the total number of possible values over the six components. Let there be $D$ modern dialects $d_1, \ldots, d_D$. Each dialect has a fixed number of phonological features.

Take the character 含 as an example: its rime book feature vector is [匣 (xia), 覃 (tan), 咸 (xian), 平 (ping), 開 (kai), 一 (yi)], and its phonological features (see Table II) for the Xiamen dialect would be ["12", "43", /h/, …, /a/, …, false, /m/].

TABLE II
ENCODED PHONOLOGICAL FEATURES OF THE DOC DATASET

The problem can be stated as follows: suppose there are in total $F$ phonological features over all dialects. Given $N$ binary rime book feature vectors $b_1, \ldots, b_N$ and a partially filled phonological feature table of dimension $N$ by $F$ for characters $c_1, \ldots, c_N$, our goal is equivalent to filling that table out. Fig. 1 depicts the scheme of the input under this problem definition. Definitions of the symbols introduced in this section can be found in Table I.

Fig. 1. Scheme of the input data. There are $N$ characters, each of which has its binary rime book feature vector known. Some of the phonological features may be missing. Our goal is to fill in the missing values, and the output is a complete table.
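As a concrete illustration of the flattening step above, the following Python sketch builds $b_c$ for the character 含. The value inventories shown are illustrative subsets, not the actual category sets used in the paper, and the function name is hypothetical.

```python
# A minimal sketch (not the authors' code) of "flattening" the six
# categorical rime book features into one binary vector b_c.
RIME_BOOK_VALUES = {
    "initial":     ["匣", "見", "曉"],       # 聲母 (illustrative subset)
    "final":       ["覃", "咸", "夬", "佳"],  # 韻
    "rhyme_group": ["咸", "深", "山"],        # 攝
    "tone":        ["平", "上", "去", "入"],  # 聲調
    "openness":    ["開", "合"],              # 呼
    "grade":       ["一", "二", "三", "四"],  # 等
}

def flatten(record: dict) -> list[int]:
    """Concatenate one one-hot vector per component into b_c."""
    bits = []
    for feat, values in RIME_BOOK_VALUES.items():
        onehot = [0] * len(values)
        onehot[values.index(record[feat])] = 1
        bits.extend(onehot)
    return bits

# 含 (han): [匣, 覃, 咸, 平, 開, 一]  ->  a 0/1 vector b_c
b_han = flatten({"initial": "匣", "final": "覃", "rhyme_group": "咸",
                 "tone": "平", "openness": "開", "grade": "一"})
```

The resulting vector has one position per possible category value, with exactly six positions set to 1.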

B. Model Considerations

As described in Section II, nearly every Chinese dialect's phonology is highly correlated with both the categorical features described in 廣韻 (Guangyun) and 韻鏡 (Yunjing), and with other Chinese dialects' phonological features. For example, there is a clear correspondence among the rime book feature 深攝 (shen-she), the Cantonese rhyme /am/, and the Xiamen rhyme /im/. While the rime book alone offers much insight into many dialects' phonology, some characters listed under different rime-book rhymes have clear correspondence among dialects. To augment missing phonological features, all of the above phenomena should be taken into consideration.

We propose a model that simultaneously captures phonological similarities across dialects and rime book features, using latent variables which we call superlingual rhymes (SLRs). Our model splits each character's record into two parts. The first part contains its rime book features, while the second consists of its phonological features. Our task is to augment missing values in the second part. We know that rime book features are highly correlated with phonological features in every Chinese dialect. Therefore, we employ rime book features to estimate missing phonological features. In addition, our model also employs the other dialects' phonological features. Our basic idea is to introduce superlingual rhymes as an intermediate layer between rime book features and all dialects' phonological features. The pronunciation of each character can then be represented as a mixture of all superlingual rhymes; that is, for each superlingual rhyme, the character has a proportional value. Since the phonological features are all categorical data, they are naturally modeled with multinomial distributions. As in every Bayesian model, we impose priors on these multinomials. Following many previous works such as [18] and [19], we chose the Dirichlet distribution, which allows analytic expression of the posterior probability. The proportional values of a character follow a Dirichlet distribution whose parameters are decided by log-linear functions of the character's rime book features; this approach is also known as logistic regression. Because of the conjugacy between the Dirichlet and multinomial distributions, we can obtain the posterior distribution of a character over SLRs easily [20], [21]. Mixing a generative model with logistic regression is akin to the paradigm advocated by [22]. Similarly, using the multinomial-Dirichlet conjugacy, we can estimate the distribution of a superlingual rhyme over phonological features. Then, because each character's proportion of each superlingual rhyme and each superlingual rhyme's proportion of each phonological feature are known, missing phonological features can be augmented.

C. Model Description

A plate diagram for our proposed model is shown in Fig. 2. Let an observation $w$ be a tuple of two components, $w = (v, d)$, where $v$ is an observed phonological feature value and $d$ is the dialect of $v$. For every observation $w$ of character $c$, there is a latent SLR $z$; the character is a mixture of SLRs. To simplify the explanation, we assume every dialect has only one phonological feature, namely $v$. In the real model, each observation has multiple phonological features for dialect $d$, but the model's structure is roughly the same.

We describe the model as follows. Let there be $K$ SLRs $s_1, \ldots, s_K$. Each $s_k$ has multinomial distributions $\phi_{k,d}$ over phonological feature values, and a multinomial distribution $\psi_k$ over the dialects. The $\phi$'s and $\psi$'s are given uniform Dirichlet priors $\mathrm{Dirichlet}(\beta)$ and $\mathrm{Dirichlet}(\gamma)$. In our experiments, each component of both $\beta$ and $\gamma$ is set to 0.001, making the priors rather sparse.

Recall that the binary rime book feature vector of character $c$ is $b_c$. Let there be $K$ rime book feature weight vectors $\lambda_1, \ldots, \lambda_K$, each of the same dimension as $b_c$. We then define the prior over all SLRs in character $c$ to be a multinomial distribution $\theta_c$ with Dirichlet prior $\theta_c \sim \mathrm{Dirichlet}(\alpha_c)$. Note that $\alpha_{c,k} = \exp(\lambda_k^\top b_c)$; in other words, the prior probability of SLR $s_k$ is proportional to $\exp(\lambda_k^\top b_c)$, a log-linear function. $\lambda$ is treated as a given value in the generating part; indeed, it is given a Normal prior, but we do not change its value through MCMC steps. Rather, its value is obtained by maximizing the likelihood of the generative model. We will go into details in Section IV-D.

We now describe the generating process; a plate diagram for this model is depicted in Fig. 2, and a short sketch of the process follows at the end of this subsection.

1) For each SLR $s_k$:
   a) draw a multinomial distribution over dialects, $\psi_k \sim \mathrm{Dirichlet}(\gamma)$;
   b) for each dialect $d$, draw a multinomial distribution over phonological feature values, $\phi_{k,d} \sim \mathrm{Dirichlet}(\beta)$.
2) For each character $c$ and its binary rime book feature vector $b_c$:
   a) for each SLR $s_k$, compute $\alpha_{c,k} = \exp(\lambda_k^\top b_c)$;
   b) draw $\theta_c \sim \mathrm{Dirichlet}(\alpha_c)$;
   c) for each observation $w_{c,i} = (v_{c,i}, d_{c,i})$:
      • draw $z_{c,i} \sim \mathrm{Multinomial}(\theta_c)$;
      • draw $d_{c,i} \sim \mathrm{Multinomial}(\psi_{z_{c,i}})$;
      • draw $v_{c,i} \sim \mathrm{Multinomial}(\phi_{z_{c,i}, d_{c,i}})$.

Fig. 2. Plate diagram of our proposed generative model. Shaded nodes are observed data.
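To make the generative story concrete, here is a minimal Python sketch that draws synthetic observations from the process above. The sizes $V$ and the rime-book vector length, the random weights, and all helper names are assumptions for illustration; only $K = 200$, $D = 21$, and $N = 5403$ come from the paper.

```python
# A sketch of the generative process in Section IV-C (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
K, D, V, N, L = 200, 21, 50, 5403, 300    # V and L are assumed sizes
beta, gamma = 0.001, 0.001                # sparse Dirichlet priors
B = rng.integers(0, 2, size=(N, L))       # binary rime book vectors b_c
lam = rng.normal(scale=0.1, size=(K, L))  # weight vectors lambda_k

psi = rng.dirichlet(np.full(D, gamma), size=K)      # SLR -> dialect
phi = rng.dirichlet(np.full(V, beta), size=(K, D))  # SLR, dialect -> value

def generate(c: int, n_obs: int):
    """Draw n_obs (value, dialect) observations for character c."""
    alpha_c = np.exp(B[c] @ lam.T)       # log-linear Dirichlet parameters
    theta_c = rng.dirichlet(alpha_c)     # character's mixture over SLRs
    obs = []
    for _ in range(n_obs):
        z = rng.choice(K, p=theta_c)     # latent SLR
        d = rng.choice(D, p=psi[z])      # dialect
        v = rng.choice(V, p=phi[z, d])   # phonological feature value
        obs.append((v, d, z))
    return obs
```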

D. Inference

Without subscripts, the full joint distribution, expressed as a product of distributions, is

$$p(v, d, z, \theta, \phi, \psi \mid \lambda, b, \beta, \gamma) = p(\theta \mid \lambda, b)\, p(z \mid \theta)\, p(\psi \mid \gamma)\, p(d \mid z, \psi)\, p(\phi \mid \beta)\, p(v \mid z, d, \phi). \quad (1)$$

Written out with subscripts, this becomes

$$p(v, d, z, \theta, \phi, \psi \mid \lambda, b, \beta, \gamma) = \prod_{k} p(\psi_k \mid \gamma) \prod_{k,d} p(\phi_{k,d} \mid \beta) \prod_{c} \Big[ p(\theta_c \mid \alpha_c) \prod_{i} p(z_{c,i} \mid \theta_c)\, p(d_{c,i} \mid \psi_{z_{c,i}})\, p(v_{c,i} \mid \phi_{z_{c,i}, d_{c,i}}) \Big], \quad (2)$$

where $\alpha_{c,k} = \exp(\lambda_k^\top b_c)$. This equation can be rearranged and simplified by grouping the terms that share the same multinomial parameters. The variables $\theta$, $\phi$, and $\psi$ can then be integrated out using the identity

$$\int \prod_{j} \theta_j^{n_j}\, \mathrm{Dirichlet}(\theta; \alpha)\, d\theta = \frac{B(\alpha + n)}{B(\alpha)}, \qquad B(\alpha) = \frac{\prod_j \Gamma(\alpha_j)}{\Gamma\big(\sum_j \alpha_j\big)},$$

and then we have

$$p(v, d, z \mid \lambda, b, \beta, \gamma) = \prod_{c} \frac{B(\alpha_c + n_{c,\cdot})}{B(\alpha_c)} \prod_{k} \frac{B(\gamma + n_{k,\cdot})}{B(\gamma)} \prod_{k,d} \frac{B(\beta + n_{k,d,\cdot})}{B(\beta)}, \quad (3)$$

where $n_{k,d}$ is the number of observations that have dialect $d$ with SLR $s_k$, $n_{k,d,v}$ is the number of observations that have phonological feature value $v$ with SLR $s_k$ and dialect $d$, and $n_{c,k}$ is the number of observations with SLR $s_k$ in character $c$; the dot subscript collects the corresponding counts into a vector. More details are available in Appendix A.

In (3) we have four variables, $v$, $d$, $z$, and $\lambda$, and we cannot sample from $p(z \mid v, d, \lambda)$ directly. However, it can be shown that there exists an efficient Gibbs sampler to infer $z$; we subsequently use optimization methods to compute $\lambda$.

1) The Gibbs Sampler: Gibbs sampling is an MCMC technique for sampling from a complex, multivariate distribution. It can be applied if, given variables $x_1, \ldots, x_n$, sampling from $p(x_1, \ldots, x_n)$ directly is impossible, but sampling from the conditional distributions $p(x_i \mid x_{-i})$ is feasible. Below is the generic Gibbs sampler:

1) randomly assign values to $x_1, \ldots, x_n$;
2) for $t = 1$ to an arbitrarily assigned $T$:
   • for $i = 1$ to $n$,
     a) re-sample a new value of $x_i$ from $p(x_i \mid x_{-i})$.

The variable $x_{-i}$ denotes all of $x_1, \ldots, x_n$ except $x_i$. If $T$ is sufficiently large, the resultant values can be regarded as a sample from $p(x_1, \ldots, x_n)$.

Since the training data already provide us with $v$ and $d$, we do not resample them. Neither do we resample $\lambda$; instead, we use optimization methods to find the most probable $\lambda$. We now only need to collect samples of $z$. Since $z$ is a vector of variables consisting of all observed values' (unobserved) SLRs, $p(z \mid v, d, \lambda)$ is actually multivariate; we use the Gibbs sampling technique here and obtain samples of $z$ by alternately sampling from $p(z_{c,i} \mid z^{-(c,i)}, v, d, \lambda)$. This conditional can be expressed as

$$p(z_{c,i} \mid z^{-(c,i)}, v, d, \lambda) = \frac{p(z, v, d \mid \lambda, \beta, \gamma)}{\sum_{k'} p(z^{-(c,i)}, z_{c,i} = k', v, d \mid \lambda, \beta, \gamma)}, \quad (4)$$

where $z^{-(c,i)}$ denotes $z$ except $z_{c,i}$. After reorganization (the details are in Appendix B), we have

$$p(z_{c,i} = k \mid z^{-(c,i)}, v, d, \lambda) \propto \big(n_{c,k}^{-} + \alpha_{c,k}\big)\; \frac{n_{k,d_{c,i}}^{-} + \gamma_{d_{c,i}}}{n_{k}^{-} + \sum_{d'} \gamma_{d'}}\; \frac{n_{k,d_{c,i},v_{c,i}}^{-} + \beta_{v_{c,i}}}{n_{k,d_{c,i}}^{-} + \sum_{v'} \beta_{v'}}, \quad (5)$$

where the superscript $-$ denotes counts computed with the current assignment of $z_{c,i}$ excluded, and $n_k$ is the total number of observations with SLR $s_k$.

Now we describe the Gibbs sampler for $z$:

1) randomly assign values to $z$;
2) for $t = 1$ to an arbitrarily assigned $T$:
   • for each observation $(c, i)$,
     a) re-sample a new value of $z_{c,i}$ using (5).

2) Computing $\lambda$: Unlike $z$, we do not use MCMC techniques to find $\lambda$, because it is difficult to derive a Gibbs sampler for it. On the other hand, for our purpose a MAP estimate of $\lambda$ suffices. We use L-BFGS to solve this numeric optimization problem. L-BFGS requires the loss function and the gradient for minimization [23]. First, from (3) we can derive the loss function, which is the negative log-likelihood of $\lambda$:

$$\mathcal{L}(\lambda) = -\sum_{c} \Bigg[ \log \Gamma\Big(\sum_{k} \alpha_{c,k}\Big) - \log \Gamma\Big(\sum_{k} \alpha_{c,k} + n_c\Big) + \sum_{k} \Big( \log \Gamma(\alpha_{c,k} + n_{c,k}) - \log \Gamma(\alpha_{c,k}) \Big) \Bigg] + C, \quad (6)$$

where $C$ is a constant, $n_c$ is the number of observations of character $c$, and recall that for character $c$, $\alpha_{c,k} = \exp(\lambda_k^\top b_c)$. Likewise, we derive the gradient $\nabla \mathcal{L}$:

$$\frac{\partial \mathcal{L}}{\partial \lambda_k} = -\sum_{c} \alpha_{c,k}\, b_c \Bigg[ \Psi\Big(\sum_{j} \alpha_{c,j}\Big) - \Psi\Big(\sum_{j} \alpha_{c,j} + n_c\Big) + \Psi(\alpha_{c,k} + n_{c,k}) - \Psi(\alpha_{c,k}) \Bigg],$$

where $\Psi$ is the digamma function. As previously stated, we can minimize $\mathcal{L}$ if we can compute both $\mathcal{L}$ and $\nabla \mathcal{L}$; and minimizing $\mathcal{L}$ in turn maximizes the likelihood $p(z \mid \lambda, b)$.
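The sampler for $z$ can be implemented directly from (5) by keeping running count arrays and excluding the current assignment before evaluating the conditional. A minimal sketch, assuming symmetric scalar priors $\beta$ and $\gamma$ and hypothetical count-array names; this is a reconstruction, not the authors' published code.

```python
# Collapsed Gibbs update of (5) for one observation (v, d) of character c.
import numpy as np

def resample_z(c, v, d, z_old, counts, alpha, beta, gamma, rng):
    """counts: arrays n_ck (N,K), n_kd (K,D), n_kdv (K,D,V), n_k (K,)."""
    n_ck, n_kd, n_kdv, n_k = counts["ck"], counts["kd"], counts["kdv"], counts["k"]
    D, V = n_kd.shape[1], n_kdv.shape[2]

    # remove the current assignment from the counts
    for arr, idx in ((n_ck, (c, z_old)), (n_kd, (z_old, d)),
                     (n_kdv, (z_old, d, v)), (n_k, (z_old,))):
        arr[idx] -= 1

    # p(z = k | z^-, v, d) as in (5), up to a normalizing constant
    p = ((n_ck[c] + alpha[c])
         * (n_kd[:, d] + gamma) / (n_k + D * gamma)
         * (n_kdv[:, d, v] + beta) / (n_kd[:, d] + V * beta))
    z_new = rng.choice(len(p), p=p / p.sum())

    # add the new assignment back
    for arr, idx in ((n_ck, (c, z_new)), (n_kd, (z_new, d)),
                     (n_kdv, (z_new, d, v)), (n_k, (z_new,))):
        arr[idx] += 1
    return z_new
```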

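The $\lambda$ step, and the alternation with Gibbs sweeps formalized in Section IV-E below, can be sketched with SciPy's L-BFGS interface. The loss and gradient follow (6) and its digamma gradient; function names and array shapes are assumptions.

```python
# Sketch of the MAP estimation of lambda (Section IV-D2), assuming
# B (N,L) binary rime book vectors and n_ck (N,K) SLR counts from Gibbs.
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln, digamma

def neg_log_lik_and_grad(lam_flat, B, n_ck, n_c, K):
    """Loss (6) and its gradient w.r.t. the weight vectors lambda_k."""
    N, L = B.shape
    lam = lam_flat.reshape(K, L)
    alpha = np.exp(B @ lam.T)                     # alpha_ck, shape (N, K)
    a0 = alpha.sum(axis=1)
    loss = -np.sum(gammaln(a0) - gammaln(a0 + n_c)
                   + np.sum(gammaln(alpha + n_ck) - gammaln(alpha), axis=1))
    # d loss / d alpha_ck, then chain rule through alpha = exp(B lam^T)
    d_alpha = -(digamma(a0) - digamma(a0 + n_c))[:, None] \
              - (digamma(alpha + n_ck) - digamma(alpha))
    grad = ((d_alpha * alpha).T @ B).ravel()      # shape (K*L,)
    return loss, grad

def fit_lambda(lam, B, n_ck, K):
    n_c = n_ck.sum(axis=1)
    res = minimize(neg_log_lik_and_grad, lam.ravel(), jac=True,
                   args=(B, n_ck, n_c, K), method="L-BFGS-B")
    return res.x.reshape(K, -1)

# EM-like alternation (Section IV-E): run Gibbs sweeps to update the
# counts n_ck, then re-fit lambda; repeat until the likelihood stabilizes.
```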
E. Inference Procedure

In Section IV-D we described a Gibbs sampler that samples from the posterior $p(z \mid v, d, \lambda)$, and in Section IV-D2 we derived $\mathcal{L}$ and $\nabla \mathcal{L}$, which enable us to maximize the likelihood of $\lambda$. We use an EM-like algorithm for inference [24]: in alternating steps, we repeatedly sample $z$ and maximize with respect to $\lambda$. The posterior feature-value distribution can be sampled once the latent SLRs are fixed. To augment a missing phonological feature, we output the mode of the samples over several iterations.

V. DATA AND EVALUATION METRICS

A. Data

The experiments are conducted on the DOC dataset described in Section III. In this dataset, each record corresponds to one pronunciation of a Chinese character. For example, the polyphone "正", with two Mandarin pronunciations (zheng1 and zheng4), has two corresponding records. The number of pronunciations for a character is determined by Guangyun. For each record, the DOC dataset lists all existing pronunciations in 21 dialects from all major dialect groups. In the original DOC, pronunciations are transcribed in IPA notation. [25] represented these IPA transcriptions with eight phonological features, listed in Table II. Given that there are 21 dialects and eight features, each record contains a total of 168 phonological features. Some records are incomplete because certain phonological features do not exist in some dialects. After disambiguation of polyphone characters, we have 5403 records.

B. Evaluation Metrics

Individual pronunciation feature accuracy (IPFA) is measured as the number of correctly predicted phonological features over the number of phonological features in the test set. Overall pronunciation feature accuracy (OPFA) is measured as the number of correctly predicted records over the number of records in the test set.

C. Evaluation Scheme

To evaluate prediction accuracy in a given dialect $d_t$, all phonological features of that dialect are regarded as ground truth labels. Some phonological features of dialects other than $d_t$ may be missing; they are all filled in using either our proposed model or a baseline classifier, depending on which augmentation method is used in that configuration.

Since one of our foci is augmentation (see Section VI-C), in the augmentation experiments we randomly remove phonological features from all dialects except $d_t$. The detailed procedure is as follows: first we create two subsets of the main dataset, with 10% and 20% of fields (phonological features) missing, respectively. The missing fields are then augmented as previously described. Note that the phonological features of dialect $d_t$ are not used for prediction of other phonological features. After the missing pronunciations are augmented, no records have empty fields.

To conduct the statistical significance test, we perform the following procedure 30 times. We randomly split the records 2:1 into training (67%) and test (33%) data. Since each record is associated with multiple labels, we employ multiclass SVMs to learn the labels independently (a sketch of this per-feature setup appears after Section VI-A). The features fed to the SVM classifiers are the binary rime book feature vectors (the $b_c$'s) and the phonological features of all dialects except dialect $d_t$. The corresponding labels are the phonological features of dialect $d_t$, and the output from these classifiers are the predicted labels, i.e., the phonological features of dialect $d_t$.

D. t-Test

We apply two-sample tests to examine whether one configuration is better than another with statistical significance. Two-sample t-tests are applied, since we assume the samples are independent. As the number of samples is large and the samples' standard deviations are known, the following two-sample statistic is appropriate in this case:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n + s_2^2/n}},$$

where $\bar{x}$ is mean accuracy, $s^2$ is the variance of accuracy, and $n$ is the number of samples (in our experiments, $n = 30$). If the resulting score is at most 1.67, with 29 degrees of freedom at the 95% significance level, the null hypothesis is accepted; otherwise it is rejected.

VI. EXPERIMENTS

We designed three experiments on character pronunciations of the Chaozhou dialect, a Min dialect spoken in eastern Guangdong, to evaluate the effects of the following factors.

A. Effect of Dialectal Data on Standard Classifiers

The conventional approach employed by philologists for Chinese dialect pronunciation prediction is to find correspondences between rime book categories and modern pronunciation, often through laborious human inspection. However, a clear correspondence between the two does not always exist. In the Wu dialect, for example, the rime book categories 夬 (guai) and 佳 (jia) are not clearly distinguished, sometimes being rendered as -ua and sometimes as -uo. Introducing dialectal data (other dialects' phonological features) may help distinguish pronunciations in some dialects.

We train the SVM classifiers to predict character pronunciations in Chaozhou. As previously described, we conducted two runs:

1) Rime Book Only (R): In this run, only the rime book features, namely 聲母 (initials), 韻 (rhymes/finals), 攝 (rhyme groups), 聲調 (tones), 呼 (openness), and 等 (grades), are included.

2) Rime Book + Full Dialectal Data (R+F): In addition to rime book features, all dialectal data are used. In cases where there are missing pronunciations, a random guess is supplied for each phonological feature for the SVM classifier.

TABLE III
PREDICTION ACCURACY WITH/WITHOUT DIALECTAL DATA
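As a concrete sketch of the classifier setup of Section V-C, one multiclass SVM per phonological feature of the target dialect can be trained with scikit-learn, along with the IPFA/OPFA metrics of Section V-B. The RBF parameter values here are placeholders, not the paper's settings.

```python
# One SVC per target-dialect phonological feature, plus IPFA/OPFA.
import numpy as np
from sklearn.svm import SVC

def train_predict(X_train, Y_train, X_test):
    """Y_train: (n_records, n_target_features); one SVC per column."""
    preds = []
    for j in range(Y_train.shape[1]):
        clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # assumed parameters
        clf.fit(X_train, Y_train[:, j])
        preds.append(clf.predict(X_test))
    return np.stack(preds, axis=1)

def ipfa(Y_pred, Y_true):
    """Individual pronunciation feature accuracy, per feature column."""
    return (Y_pred == Y_true).mean(axis=0)

def opfa(Y_pred, Y_true):
    """Overall accuracy: a record counts only if every feature is right."""
    return (Y_pred == Y_true).all(axis=1).mean()
```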

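The significance test of Section V-D can be reproduced from the 30 runs' summary statistics; SciPy's ttest_ind_from_stats computes the same unpaired comparison. The numbers below are placeholders, not results from the paper.

```python
# Two-sample test from summary statistics (Section V-D).
from scipy.stats import ttest_ind_from_stats

stat, p_value = ttest_ind_from_stats(
    mean1=0.80, std1=0.02, nobs1=30,   # e.g., mean accuracy of one run
    mean2=0.78, std2=0.02, nobs2=30,   # e.g., mean accuracy of the other
    equal_var=False)                   # Welch's form, matching the formula
```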
TABLE IV
PREDICTION ACCURACY WITH DIFFERENT DIALECT GROUPS

The results are listed in Table III. It is obvious that by including dialectal data, we obtain a significant performance gain.

B. Impact of Proximate Dialects

[26] reported that POS tagging performance can be improved by including more languages, especially closely related languages. We carried out experiments to see whether using rime book features (R) with closely related dialects (+C) is more effective than with distantly related dialects (+D).

We compared the OPFA on the Xi'an and Chaozhou dialects, which belong to the Mandarin and Min dialect groups, respectively. The Mandarin dialects we use in the experiments are Jinan, Taiyuan, and Beijing; for the Min dialects we use Xiamen, Fuzhou, and Jian'ou. For each dialect we conduct two runs, the first using dialects from the same dialect group and the second using dialects from the other dialect group. To make the comparison meaningful, we make the ratio of missing entries the same in every run by randomly removing entries, and the missing entries are then filled in without sophisticated augmentation: each run has 10% of pronunciations removed and augmented with random guesses (the masking protocol is sketched below). OPFA averaged over 30 runs is listed in Table IV. The results show that R+C outperforms R+D for both the Xi'an and Chaozhou dialects by a statistically significant margin.
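A sketch of the random-removal protocol used above (and reused in Section VI-C), assuming the pronunciation table is an object array with None marking missing fields; the function name is hypothetical.

```python
# Remove a fixed fraction of fields uniformly at random for masking runs.
import numpy as np

def mask_fields(table, frac, rng):
    """table: object array (n_records, n_features); returns masked copy."""
    masked = table.copy()
    present = np.argwhere(masked != None)  # elementwise test on object array
    drop = rng.choice(len(present), size=int(frac * len(present)),
                      replace=False)
    for i, j in present[drop]:
        masked[i, j] = None                # now missing, to be augmented
    return masked

# e.g., 10% removal as in Section VI-B:
# masked = mask_fields(table, 0.10, np.random.default_rng(0))
```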

C. Effect of Data Augmentation

As described in Section I, the data for many Chinese dialects are scarce. Our data augmentation model is designed to fill in missing pronunciation information. If our augmentation model is effective, one application would be to use multiple resource-poor dialects to augment missing data in another dialect's pronunciation database. For data augmentation, we use the procedure described in Section V-C to fill in the missing pronunciations in the Chaozhou dialect.

For comparison, we employ three different methods to augment the missing data as baselines:

1) Logistic Regression (-L): Using the rime book features $b_c$, a discriminative model is trained to predict missing phonological values.

2) Naive Bayes (-N): Similar to the logistic regression model, a generative model is trained using $b_c$ to predict missing phonological values.

3) Random (-R): The missing phonological values are guessed randomly.

In this experiment we test two different amounts of removal, 10% and 20%. All the following SVM classifiers use the RBF kernel, with the same parameter settings in all runs. The number of SLRs in our augmentation model is set to 200.

The results and the corresponding $p$ values are listed in Table V. Using our data augmentation model consistently improves OPFA. Interestingly, the margin of improvement appears greater when using closely related dialect data than when using distantly related dialect data, on both datasets.

TABLE V
EFFECTS OF DATA AUGMENTATION WITH CLOSELY (R+C) AND DISTANTLY (R+D) RELATED DIALECT DATA

VII. ANALYSIS AND DISCUSSION

We are interested in how the choice of training dialects affects individual feature predictions. Table VI shows the percentage IPFA improvement over the baseline random augmentation. The R+C run benefits from the augmentation in all features except nasalization, the reason for which is unclear.

TABLE VI
IPFA TABLE

In the R+D run, the tone, initial, and final features show worse IPFA after augmentation. This can be explained by considering the assumptions of our proposed model. We assume that the dialects exhibit correspondence among phonological features across dialects; that is, corresponding phonological features across dialects should be put under the same SLR. Therefore, if the dialects lack such correspondence, the augmented features may be inaccurate. It is evident that phonological features such as tones, initials, and finals do not have good correspondence across different dialect families [27]. Recent research suggests that tones in Min dialects may be related to an innovation of the Wu-Min proto-dialect [27], which Mandarin did not share. As for initials, there is a striking difference between the "heavy" and "light" initial distinction in Mandarin and Min dialects [28]. Finals also lack good correspondence: the Min dialects have preserved most final stops from Middle Chinese, while Mandarin dialects have lost many. Thus, it is difficult to predict final consonants in Min dialects using Mandarin dialects, and vice versa.

The IPFA metric seems to reflect the level of correspondence between the target dialect and other dialects, both closely and distantly related. The possibility of determining dialectal relationships between individual dialects by comparing respective IPFA improvement scores may lead to interesting discoveries.

VIII. CONCLUSION

We propose a novel generative model that makes use of both existing dialect pronunciation data and medieval rime books to discover phonological patterns that exist in multiple dialects, which are referred to as superlingual rhymes (SLRs) in our proposed model. The proposed model can predict character pronunciations for a dialect based on existing dialect pronunciation tables (even if incomplete) and the pronunciation data in rime books. We evaluate the prediction accuracy in terms of phonological features, such as tone and initial phoneme. For each character, the phonological features are also evaluated as a whole, under overall pronunciation feature accuracy (OPFA). Our first experiment shows that adding features from dialectal pronunciation data to our baseline rime-book model dramatically improves OPFA using the support vector machine (SVM) model. In the second experiment, we compare the performance of the SVM model using phonological features from closely related dialects with that of the model using phonological features from non-closely related dialects. The results show that using features from closely related dialects yields higher accuracy. In the third experiment, we show that using our proposed data augmentation model to fill in missing data can increase the SVM model's OPFA by up to 7.6%. We also note that this improvement is greater when using closely related dialect data.

APPENDIX A
INTEGRATION OF $\theta$, $\phi$, AND $\psi$

Since $\theta$, $\phi$, and $\psi$ have Dirichlet priors, the posterior distributions of $\theta$, $\phi$, and $\psi$, which have the form $\prod_j \theta_j^{n_j}\, \mathrm{Dirichlet}(\theta; \alpha)$, are Dirichlet-multinomial distributions, as introduced in [29]. We clarify how we integrate out these variables with $\phi$ as an example. For convenience, (1) is relisted here:

$$p(v, d, z, \theta, \phi, \psi \mid \lambda, b, \beta, \gamma) = p(\theta \mid \lambda, b)\, p(z \mid \theta)\, p(\psi \mid \gamma)\, p(d \mid z, \psi)\, p(\phi \mid \beta)\, p(v \mid z, d, \phi).$$

By fixing $z$ and $d$, the terms involving $\phi$ and $v$ in (1) are

$$p(\phi \mid \beta)\, p(v \mid z, d, \phi) = \prod_{k,d} p(\phi_{k,d} \mid \beta) \prod_{c,i} p(v_{c,i} \mid \phi_{z_{c,i}, d_{c,i}}).$$

The latter terms can be refactored to $\prod_{k,d} \prod_{v} \phi_{k,d,v}^{n_{k,d,v}}$, where $n_{k,d,v}$ is the number of observations with phonological feature value $v$, SLR $s_k$, and dialect $d$. Thus, the expression can be rewritten as $\prod_{k,d} p(\phi_{k,d} \mid \beta) \prod_{v} \phi_{k,d,v}^{n_{k,d,v}}$. Applying the Dirichlet integral identity of Section IV-D, we have

$$\int \prod_{k,d} p(\phi_{k,d} \mid \beta) \prod_{v} \phi_{k,d,v}^{n_{k,d,v}}\, d\phi = \prod_{k,d} \frac{B(\beta + n_{k,d,\cdot})}{B(\beta)}. \quad (7)$$

The variables $\psi$ and $\theta$ can be integrated out in the same fashion.

APPENDIX B
DERIVATION OF THE GIBBS SAMPLER

By the definition of conditional probability, the conditional distribution of a single SLR assignment is

$$p(z_{c,i} = k \mid z^{-(c,i)}, v, d, \lambda) = \frac{p(z^{-(c,i)}, z_{c,i} = k, v, d \mid \lambda, \beta, \gamma)}{\sum_{k'} p(z^{-(c,i)}, z_{c,i} = k', v, d \mid \lambda, \beta, \gamma)}, \quad (8)$$

and, substituting (3) and again using the identity $B(\alpha) = \prod_j \Gamma(\alpha_j) / \Gamma(\sum_j \alpha_j)$, we have

$$p(z_{c,i} = k \mid z^{-(c,i)}, v, d, \lambda) \propto \frac{B(\alpha_c + n_{c,\cdot})}{B(\alpha_c + n^{-}_{c,\cdot})}\, \frac{B(\gamma + n_{k,\cdot})}{B(\gamma + n^{-}_{k,\cdot})}\, \frac{B(\beta + n_{k,d_{c,i},\cdot})}{B(\beta + n^{-}_{k,d_{c,i},\cdot})}. \quad (9)$$

Using the fact that $n_k = \sum_{d} n_{k,d}$, where $n_k$ is the number of observations with SLR $s_k$, and the fact that a character has a fixed number of observations, we can further simplify (9) into

$$p(z_{c,i} = k \mid z^{-(c,i)}, v, d, \lambda) \propto \big(n_{c,k} - \Delta + \alpha_{c,k}\big)\; \frac{n_{k,d_{c,i}} - \Delta + \gamma_{d_{c,i}}}{n_k - \Delta + \sum_{d'} \gamma_{d'}}\; \frac{n_{k,d_{c,i},v_{c,i}} - \Delta + \beta_{v_{c,i}}}{n_{k,d_{c,i}} - \Delta + \sum_{v'} \beta_{v'}}, \quad (10)$$

where $\Delta = 1$ if the current assignment of $z_{c,i}$ is $k$, and $\Delta = 0$ otherwise.

ACKNOWLEDGMENT

The authors would like to thank Prof. C.-C. Cheng for providing them the DOC dataset and the TASLP reviewers for their valuable comments, which helped them improve the quality of the paper.

REFERENCES

[1] "CMUdict, the CMU Pronouncing Dictionary," 1998. [Online]. Available: http://www.speech.cs.cmu.edu/cgi-bin/cmudict
[2] J. H. Jenkins and R. Cook, "Unicode Han Database," Unicode Consortium, Tech. Rep., 2009.
[3] L.-Q. Tong, "Survey on the usage of Chinese languages and script" (in Chinese). Beijing, China: Language and Literature Press, 2006. [Online]. Available: http://www.china-language.gov.cn/LSF/LSFrame.aspx
[4] C. Tang and V. J. van Heuven, "Mutual intelligibility of Chinese dialects experimentally tested," Lingua, vol. 119, no. 5, pp. 709–732, 2009.
[5] J. B. Jensen, "On the mutual intelligibility of Spanish and Portuguese," Hispania, vol. 72, no. 4, pp. 848–852, 1989.
[6] E. G. Pulleyblank, "Qieyun and Yunjing: The essential foundation for Chinese historical linguistics," J. Amer. Oriental Soc., vol. 118, no. 2, pp. 200–216, 1998.
[7] M. Streeter, "DOC, 1971: A Chinese dialect dictionary on computer," Comput. Humanities, vol. 6, no. 5, pp. 259–270, 1972.
[8] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell, "Text classification from labeled and unlabeled documents using EM," Mach. Learn., vol. 39, no. 2–3, pp. 103–134, 2000.
[9] X. Lu, B. Zheng, A. Velivelli, and C. Zhai, "Enhancing text categorization with semantic-enriched representation and training data augmentation," J. Amer. Med. Inform. Assoc., vol. 13, no. 5, pp. 526–535, 2006.
[10] D. van Dyk and X. Meng, "The art of data augmentation," J. Comput. Graph. Statist., vol. 10, no. 1, pp. 1–50, 2001.
[11] A. Bouchard-Côté, P. Liang, T. Griffiths, and D. Klein, "A probabilistic approach to diachronic phonology," in Proc. Empirical Methods in Natural Lang. Process. and Comput. Natural Lang. Learn. (EMNLP/CoNLL), 2007.
[12] M. Ben Hamed and F. Wang, "Stuck in the forest: Trees, networks and Chinese dialects," Diachronica, vol. 23, no. 1, pp. 29–60, 2006.
[13] W. S. Allen, Vox Latina: A Guide to the Pronunciation of Classical Latin. Cambridge, U.K.: Cambridge Univ. Press, 1978.
[14] P.-H. Ting, "Some thoughts on the reconstruction of Middle Chinese," J. Chinese Linguist., vol. 249, no. 6, p. 414, 1995.
[15] T.-L. Mei, "The survival of two pairs of distinctions in Southern Wu dialects," J. Chinese Linguist., vol. 280, no. 1, pp. 1–15, 2001.
[16] B. Snyder, T. Naseem, J. Eisenstein, and R. Barzilay, "Adding more languages improves unsupervised multilingual part-of-speech tagging: A Bayesian non-parametric approach," in Proc. NAACL-HLT, Morristown, NJ, 2009, pp. 83–91.
[17] S. Stüker, F. Metze, T. Schultz, and A. Waibel, "Integrating multilingual articulatory features into speech recognition," in Proc. 8th Eur. Conf. Speech Commun. Technol., 2003.
[18] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993–1022, 2003.
[19] S. Goldwater and T. Griffiths, "A fully Bayesian approach to unsupervised part-of-speech tagging," in Proc. 45th Annu. Meeting Assoc. Comput. Linguist., Prague, Czech Republic, Jun. 2007, pp. 744–751.
[20] G. Heinrich, "Parameter estimation for text analysis," Univ. of Leipzig, Leipzig, Germany, Tech. Rep., 2008. [Online]. Available: http://www.arbylon.net/publications/text-est.pdf
[21] P. Resnik and E. Hardisty, "Gibbs sampling for the uninitiated," Univ. of Maryland, Tech. Rep. CS-TR-4956, UMIACS-TR-2010-04, LAMP-153, 2010.
[22] T. Berg-Kirkpatrick, A. Bouchard-Côté, J. DeNero, and D. Klein, "Painless unsupervised learning with features," in Proc. NAACL-HLT, Los Angeles, CA, Jun. 2010, pp. 582–590.
[23] D. C. Liu and J. Nocedal, "On the limited memory BFGS method for large scale optimization," Math. Program., vol. 45, no. 3, pp. 503–528, 1989.
[24] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. R. Statist. Soc. Ser. B (Methodological), vol. 39, no. 1, pp. 1–38, 1977.
[25] C.-C. Cheng, "Measuring relationship among dialects: DOC and related resources," Comput. Linguist., vol. 2, no. 1, pp. 41–72, 1997.
[26] B. Snyder, T. Naseem, J. Eisenstein, and R. Barzilay, "Unsupervised multilingual learning for POS tagging," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), Morristown, NJ, 2008, pp. 1041–1050.
[27] R.-W. Wu, "A comparative study on the phonologies of Min and Wu dialects," Ph.D. dissertation, National Chengchi Univ., Taipei, Taiwan, 2005.
[28] U.-J. Ang, "On the motivation and typology of aspiration and nasalization" (in Chinese), in Proc. 6th Int. and 17th National Conf. Chinese Phonol., Taipei, Taiwan, May 1999.
[29] T. Minka, "Estimating a Dirichlet distribution," Mass. Inst. of Technol., Cambridge, MA, Tech. Rep., 2000.

Chu-Cheng Lin received the B.S. and M.S. degrees in computer science and information engineering from National Taiwan University, Taipei, in 2008 and 2010, respectively. His current research interests are information retrieval, natural language processing, and computational phonology.

Richard Tzong-Han Tsai received the B.S., M.S., and Ph.D. degrees in computer science and information engineering from National Taiwan University, Taipei, Taiwan, in 1997, 1999, and 2006, respectively. He was a Postdoctoral Fellow at Academia Sinica from 2006 to 2007. He is now an Assistant Professor in the Department of Computer Science and Engineering, Yuan Ze University, Zhongli, Taiwan. His research areas are natural language processing, cross-language information retrieval, biomedical literature mining, and information services on mobile devices.