Homophones and Tonal Patterns in English-Chinese Transliteration

Oi Yee Kwong
Department of Chinese, Translation and Linguistics
City University of Hong Kong
Tat Chee Avenue, Kowloon, Hong Kong
[email protected]

Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 21–24, Suntec, Singapore, 4 August 2009. © 2009 ACL and AFNLP

Abstract

The abundance of homophones in Chinese significantly increases the number of similarly acceptable candidates in English-to-Chinese transliteration (E2C). The dialectal factor also leads to different transliteration practice. We compare E2C between Mandarin Chinese and Cantonese, and report work in progress for dealing with homophones and tonal patterns despite potentially skewed distributions of individual characters in the training data.

1 Introduction

This paper addresses the problem of automatic English-Chinese forward transliteration (referred to as E2C hereafter).

There are only a few hundred Chinese characters commonly used in names, but their combination is relatively free. Such flexibility, however, is not entirely ungoverned. For instance, while the Brazilian striker Ronaldo is rendered as 朗拿度 long5-naa4-dou6 in Cantonese, other phonetically similar candidates like 朗娜度 long5-naa4-dou6 or 郎拿刀 long4-naa4-dou1 [1] are far less likely. Beyond linguistic and phonetic properties, many other social and cognitive factors such as dialect, gender, domain, meaning, and perception simultaneously influence the naming process and are superimposed on the surface graphemic correspondence.

The abundance of homophones in Chinese further complicates the problem. Past studies on phoneme-based E2C have reported their adverse effects (e.g. Virga and Khudanpur, 2003). Direct orthographic mapping (e.g. Li et al., 2004), making use of individual Chinese graphemes, tends to overcome the problem and model the character choice directly. Meanwhile, Chinese is a typical tonal language and the tone information can help distinguish certain homophones. Phoneme mapping studies seldom make use of tone information. Transliteration is also an open problem, as new names come up every day and there is no absolute or one-to-one transliterated version for any name. Although direct orthographic mapping has implicitly or partially modelled the tone information via individual characters, the model nevertheless depends heavily on the availability of training data. It could be skewed by the distribution of a certain homophone and thus preclude an acceptable transliteration alternative. We therefore propose to model the sound and tone together in E2C. In this way we attempt to deal with homophones more reasonably, especially when the training data is limited. In this paper we report some work in progress and compare E2C in Cantonese and Mandarin Chinese.

Related work will be briefly reviewed in Section 2. Some characteristics of E2C will be discussed in Section 3. Work in progress will be reported in Section 4, followed by a conclusion with future work in Section 5.

[1] Mandarin names are transcribed in Hanyu Pinyin and Cantonese names are transcribed in Jyutping published by the Linguistic Society of Hong Kong.

2 Related Work

There are basically two categories of work on machine transliteration. First, various alignment models are used for acquiring transliteration lexicons from parallel corpora and other resources (e.g. Kuo and Li, 2008). Second, statistical models are built for transliteration. These models could be phoneme-based (e.g. Knight and Graehl, 1998), grapheme-based (e.g. Li et al., 2004), hybrid (Oh and Choi, 2005), or based on phonetic (e.g. Tao et al., 2006) and semantic (e.g. Li et al., 2007) features.

Li et al. (2004) used a Joint Source-Channel Model under the direct orthographic mapping (DOM) framework, skipping the middle phonemic representation in conventional phoneme-based methods, and modelling the segmentation and alignment preferences by means of contextual n-grams of the transliteration units. Although DOM has implicitly modelled the tone choice, since a specific character has a specific tone, it nevertheless relies heavily on the availability of training data. If there happens to be a skewed distribution of a certain Chinese character, the model might preclude other acceptable transliteration alternatives. In view of the abundance of homophones in Chinese, and given that sound-tone combination is important in names (i.e., names which sound "nice" are preferred to those which sound "monotonous"), we propose to model sound-tone combinations in transliteration more explicitly, using pinyin transcriptions to bridge the graphemic representation between English and Chinese. In addition, we also study the dialectal differences between transliteration in Mandarin Chinese and Cantonese, which are seldom addressed in past studies.

3 Some E2C Properties

3.1 Dialectal Differences

English and Chinese have very different phonological properties. A well-cited example is that a syllable-initial /d/ may surface, as in Baghdad 巴格達 ba1-ge2-da2, but the syllable-final /d/ is not represented. This is true for Mandarin Chinese, but since ending stops like -p, -t and -k are allowed in Cantonese, the syllable-final /d/ in Baghdad is already captured in the last syllable of 巴格達 baa1-gaak3-daat6 in Cantonese.

Such phonological difference between Mandarin Chinese and Cantonese might also account for the observation that Cantonese transliterations often do not introduce extra syllables for certain segments in the middle of an English name, as in Dickson, transliterated as 迪克遜 di2-ke4-xun4 in Mandarin Chinese and 迪臣 dik6-san4 in Cantonese.

3.2 Ambiguities from Homophones

The homophone problem is notorious in Chinese. As far as personal names are concerned, the "correctness" of transliteration is not clear-cut at all. For example, to transliterate the name Hilary into Chinese, based on Cantonese pronunciations, the following are possibilities amongst many others: (a) 希拉利 hei1-laai1-lei6, (b) 希拉莉 hei1-laai1-lei6, and (c) 希拉里 hei1-laai1-lei5.

The homophonous third character gives rise to multiple alternative transliterations in this example, where orthographically 利 lei6, 莉 lei6 and 里 lei5 are observed for "ry" in transliteration data. One cannot really say any of the combinations is "right" or "wrong", but perhaps only "better" or "worse". Such judgement is more cognitive than linguistic in nature, and apparently the tonal patterns play an important role in this regard. Hence naming is more of an art than a science, and automatic transliteration should avoid over-reliance on the training data, which would cause it to miss unlikely but good candidates.

4 Work in Progress

4.1 Datasets

A common set of 1,423 source English names and their transliterations [2] in Mandarin Chinese (as used by media in Mainland China) and Cantonese (as used by media in Hong Kong) were collected over the Internet. The names are mostly from soccer, entertainment, and politics. The data size is admittedly small compared to other existing transliteration datasets, but as a preliminary study, we aim at comparing the transliteration practice between Mandarin speakers and Cantonese speakers in a more objective way based on a common set of English names. The transliteration pairs were manually aligned, and the pronunciations for the Chinese characters were automatically looked up.

[2] Some names have more than one transliteration.

4.2 Preliminary Quantitative Analysis

                           Cantonese   Mandarin
Unique name pairs              1,531      1,543
Total English segments         4,186      4,667
Unique English segments          969        727
Unique grapheme pairs          1,618      1,193
Unique seg-sound pairs         1,574      1,141

Table 1. Quantitative Aspects of the Data

As shown in Table 1, the average segment-name ratios (2.73 for Cantonese and 3.02 for Mandarin) suggest that Mandarin transliterations often use more syllables for a name. The much smaller number of unique English segments for Mandarin and the difference in token-type ratio of grapheme pairs (3.91 for Mandarin and 2.59 for Cantonese) further suggest that names are more consistently segmented and transliterated in Mandarin.
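The two ratios quoted for Table 1 can be re-derived from its counts. The sketch below is only an illustration: the dictionary layout and function names are hypothetical, and it assumes that the token-type ratio is the total number of English segments divided by the number of unique grapheme pairs, a reading which reproduces the reported figures.

```python
# Counts copied from Table 1. The dictionary layout and the function
# names are illustrative assumptions, not part of the original study.
TABLE_1 = {
    "Cantonese": {"name_pairs": 1531, "segments": 4186, "grapheme_pairs": 1618},
    "Mandarin": {"name_pairs": 1543, "segments": 4667, "grapheme_pairs": 1193},
}


def segment_name_ratio(stats):
    """Average number of English segments per unique name pair."""
    return stats["segments"] / stats["name_pairs"]


def token_type_ratio(stats):
    """Segment tokens per unique grapheme pair (assumed definition)."""
    return stats["segments"] / stats["grapheme_pairs"]


for dialect, stats in TABLE_1.items():
    print(
        f"{dialect}: segment-name ratio {segment_name_ratio(stats):.2f}, "
        f"token-type ratio {token_type_ratio(stats):.2f}"
    )
```

Under these assumptions the script prints 2.73 and 2.59 for Cantonese and 3.02 and 3.91 for Mandarin, matching the values in the text.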

4.2.1 Graphemic Correspondence

Assume grapheme pair mappings are in the form $\langle e_k, \{c_{k1}, c_{k2}, \ldots, c_{kn}\}\rangle$, where $e_k$ stands for the kth unique English segment from the data, and $\{c_{k1}, c_{k2}, \ldots, c_{kn}\}$ for the set of n unique Chinese segments observed for it. It was found that n varies from 1 to 10 for Mandarin, with 34.9% of the distinct English segments having multiple grapheme mappings, as shown in Table 2. For Cantonese, n varies from 1 to 13, with 31.5% of the distinct English segments having multiple grapheme mappings. The proportion of multiple mappings is similar for Mandarin and Cantonese, but the latter has a higher percentage of English segments with 5 or more Chinese renditions. Thus Mandarin transliterations are relatively more "standardised", whereas Cantonese transliterations are graphemically more ambiguous.

n          Cantonese   Mandarin
>=5             5.3%       3.3%
4               4.0%       4.4%
3               6.2%       7.2%
2              16.0%      20.0%
1              68.5%      65.1%
Example    [...] 雷}>

Table 2. Graphemic Ambiguity of the Data

Table 3 shows the situation with homophones (ignoring tones). For example, all five characters 利莉李里理 correspond to the Jyutping lei. Despite the tone difference, they are considered homophones in this section.

n          Cantonese   Mandarin
>=5             3.3%       1.9%
4               4.0%       2.5%
3               5.8%       5.7%
2              16.3%      20.7%
1              70.5%      69.2%
Example    [...]

Table 3.

[...] sound mappings. Comparing with Table 2 above, the downward shift of the percentages suggests that much of the graphemic ambiguity is a result of the use of homophones, instead of a set of characters with very different pronunciations.

4.2.3 Homophone Ambiguity (Sound-Tone)

Table 4 shows the situation of homophones with both sound and tone taken into account. For example, the characters 利莉 both correspond to lei6 in Cantonese, while 李里理 all correspond to lei5, and they are thus treated as two groups.

Assume grapheme-sound/tone pair mappings are in the form $\langle e_k, \{st_{k1}, st_{k2}, \ldots, st_{kn}\}\rangle$, where $e_k$ stands for the kth unique English segment, and $\{st_{k1}, st_{k2}, \ldots, st_{kn}\}$ for the set of n unique pronunciations (sound-tone combinations). For Mandarin, n varies from 1 to 8, with 33.5% of the distinct English segments corresponding to multiple Chinese homophones. For Cantonese, n varies from 1 to 10, with 30.8% of the distinct English segments having multiple Chinese homophones.

n          Cantonese   Mandarin
>=5             4.1%       2.8%
4              [...]      [...]
3              [...]      [...]
2              15.8%      20.7%
1              69.2%      66.5%
Example    [...] lu4}

Table 4. Homophone Ambiguity (Sound-Tone)

The figures in Table 4 are somewhere between those in Table 2 and Table 3, suggesting that a considerable part of the homophones used in the transliterations could be distinguished by tones. This supports our proposal of modelling tonal combination explicitly in E2C.

4.3 Method and Experiment

The Joint Source-Channel Model in Li et al. (2004) [...]

\[
\begin{aligned}
P(E, ST) &= P(e_1, e_2, \ldots, e_K, st_1, st_2, \ldots, st_K) \\
&= P(\langle e_1, st_1\rangle, \langle e_2, st_2\rangle, \ldots, \langle e_K, st_K\rangle) \\
&= \prod_{k=1}^{K} P(\langle e_k, st_k\rangle \mid \langle e_{k-1}, st_{k-1}\rangle)
\end{aligned}
\]

where E refers to the English source name and ST refers to the sound/tone sequence of the transliteration, while $e_k$ and $st_k$ refer to the kth segment and its Chinese sound respectively, for a name of K segments. Homophones in Chinese are thus captured as a class in the phonetic transcription. For example, the expected Cantonese transliteration for Osborne is 奧斯邦尼 ou3-si1-bong1-nei4. Not only is it ranked first using this method, its homophonous variant 奧施邦尼 is also within the top 5, thus benefitting from the grouping of the homophones, despite the relatively low frequency of 施. This would be particularly useful for transliteration extraction and information retrieval.

Unlike pure phonemic modelling, the tonal factor is modelled in the pronunciation transcription. We do not go for phonemic representation from the source name as the transliteration of foreign names into Chinese is often based on the surface orthographic forms, e.g. the silent h in Beckham is pronounced to give 漢姆 han4-mu3 in Mandarin and 咸 haam4 in Cantonese.

Five sets of 50 test names were randomly extracted from the 1.4K names mentioned above for 5-fold cross validation. Training was done on the remaining data. Results were also compared with DOM. The Mean Reciprocal Rank (MRR) was used for evaluation (Kantor and Voorhees, 2000).

4.4 Preliminary Results

Method     Cantonese   Mandarin
DOM           0.2292     0.3518
SoTo          0.2442     0.3557

Table 5. Average System Performance

Table 5 shows the average results of the two methods. The figures are relatively low compared to state-of-the-art performance, largely due to the small datasets. Errors might have started to propagate as early as the name segmentation step. As a preliminary study, however, the potential of the SoTo method is apparent, particularly for Cantonese. A smaller model thus performs better, and treating homophones as a class could avoid over-reliance on the prior distribution of individual characters. The better performance for Mandarin data is not surprising given the less "standardised" Cantonese transliterations as discussed above. From the research point of view, it suggests that more should be considered in addition to grapheme mapping for handling Cantonese data.

5 Future Work and Conclusion

Thus we have compared E2C between Mandarin Chinese and Cantonese, and discussed work in progress for our proposed SoTo method, which treats homophones more reasonably and models tonal patterns in transliteration better. Future work includes testing on larger datasets, more in-depth error analysis, and developing better methods to deal with Cantonese transliterations.

Acknowledgements

The work described in this paper was substantially supported by a grant from City University of Hong Kong (Project No. 7002203).

References

Kantor, P.B. and Voorhees, E.M. (2000) The TREC-5 Confusion Track: Comparing Retrieval Methods for Scanned Text. Information Retrieval, 2(2-3):165-176.

Knight, K. and Graehl, J. (1998) Machine Transliteration. Computational Linguistics, 24(4):599-612.

Kuo, J-S. and Li, H. (2008) Mining Transliterations from Web Query Results: An Incremental Approach. In Proceedings of SIGHAN-6, Hyderabad, India, pp.16-23.

Li, H., Zhang, M. and Su, J. (2004) A Joint Source-Channel Model for Machine Transliteration. In Proceedings of the 42nd Annual Meeting of ACL, Barcelona, Spain, pp.159-166.

Li, H., Sim, K.C., Kuo, J-S. and Dong, M. (2007) Semantic Transliteration of Personal Names. In Proceedings of the 45th Annual Meeting of ACL, Prague, Czech Republic, pp.120-127.

Oh, J-H. and Choi, K-S. (2005) An Ensemble of Grapheme and Phoneme for Machine Transliteration. In R. Dale et al. (Eds.), Natural Language Processing – IJCNLP 2005. Springer, LNAI Vol. 3651, pp.451-461.

Tao, T., Yoon, S-Y., Fister, A., Sproat, R. and Zhai, C. (2006) Unsupervised Named Entity Transliteration Using Temporal and Phonetic Correlation. In Proceedings of EMNLP 2006, Sydney, Australia, pp.250-257.

Virga, P. and Khudanpur, S. (2003) Transliteration of Proper Names in Cross-lingual Information Retrieval. In Proceedings of the ACL2003 Workshop on Multilingual and Mixed-language Named Entity Recognition.
