Homophones and Tonal Patterns in English-Chinese Transliteration

Oi Yee Kwong
Department of Chinese, Translation and Linguistics
City University of Hong Kong
Tat Chee Avenue, Kowloon, Hong Kong
[email protected]

Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 21–24, Suntec, Singapore, 4 August 2009. © 2009 ACL and AFNLP

Abstract

The abundance of homophones in Chinese significantly increases the number of similarly acceptable candidates in English-to-Chinese transliteration (E2C). The dialectal factor also leads to different transliteration practice. We compare E2C between Mandarin Chinese and Cantonese, and report work in progress for dealing with homophones and tonal patterns despite potential skewed distributions of individual Chinese characters in the training data.

1 Introduction

This paper addresses the problem of automatic English-Chinese forward transliteration (referred to as E2C hereafter).

There are only a few hundred Chinese characters commonly used in names, but their combination is relatively free. Such flexibility, however, is not entirely ungoverned. For instance, while the Brazilian striker Ronaldo is rendered as 朗拿度 long5-naa4-dou6 in Cantonese, other phonetically similar candidates like 朗娜度 long5-naa4-dou6 or 郎拿刀 long4-naa4-dou1 (see Footnote 1) are much less likely. Beyond linguistic and phonetic properties, many other social and cognitive factors, such as dialect, gender, domain, meaning, and perception, simultaneously influence the naming process and are superimposed on the surface graphemic correspondence.

Footnote 1: Mandarin names are transcribed in Hanyu Pinyin and Cantonese names are transcribed in Jyutping published by the Linguistic Society of Hong Kong.

The abundance of homophones in Chinese further complicates the problem. Past studies on phoneme-based E2C have reported their adverse effects (e.g. Virga and Khudanpur, 2003). Direct orthographic mapping (e.g. Li et al., 2004), making use of individual Chinese graphemes, tends to overcome the problem and model the character choice directly. Meanwhile, Chinese is a typical tonal language and the tone information can help distinguish certain homophones. Phoneme mapping studies seldom make use of tone information. Transliteration is also an open problem, as new names come up every day and there is no absolute or one-to-one transliterated version for any name. Although direct orthographic mapping has implicitly or partially modelled the tone information via individual characters, the model nevertheless depends heavily on the availability of training data, and could be skewed by the distribution of a certain homophone and thus preclude an acceptable transliteration alternative. We therefore propose to model the sound and tone together in E2C. In this way we attempt to deal with homophones more reasonably, especially when the training data is limited. In this paper we report some work in progress and compare E2C in Cantonese and Mandarin Chinese.

Related work will be briefly reviewed in Section 2. Some characteristics of E2C will be discussed in Section 3. Work in progress will be reported in Section 4, followed by a conclusion with future work in Section 5.

2 Related Work

There are basically two categories of work on machine transliteration. First, various alignment models are used for acquiring transliteration lexicons from parallel corpora and other resources (e.g. Kuo and Li, 2008). Second, statistical models are built for transliteration. These models could be phoneme-based (e.g. Knight and Graehl, 1998), grapheme-based (e.g. Li et al., 2004), hybrid (Oh and Choi, 2005), or based on phonetic (e.g. Tao et al., 2006) and semantic (e.g. Li et al., 2007) features.

Li et al. (2004) used a Joint Source-Channel Model under the direct orthographic mapping (DOM) framework, skipping the middle phonemic representation in conventional phoneme-based methods, and modelling the segmentation and alignment preferences by means of contextual n-grams of the transliteration units. Although DOM has implicitly modelled the tone choice, since a specific character has a specific tone, it nevertheless relies heavily on the availability of training data. If there happens to be a skewed distribution of a certain Chinese character, the model might preclude other acceptable transliteration alternatives. In view of the abundance of homophones in Chinese, and given that sound-tone combination is important in names (i.e., names which sound "nice" are preferred to those which sound "monotonous"), we propose to model sound-tone combinations in transliteration more explicitly, using pinyin transcriptions to bridge the graphemic representation between English and Chinese. In addition, we also study the dialectal differences between transliteration in Mandarin Chinese and Cantonese, which are seldom addressed in past studies.

3 Some E2C Properties

3.1 Dialectal Differences

English and Chinese have very different phonological properties. A well-cited example is that a syllable-initial /d/ may surface, as in Baghdad 巴格達 ba1-ge2-da2, but the syllable-final /d/ is not represented. This is true for Mandarin Chinese, but since ending stops like -p, -t and -k are allowed in Cantonese syllables, the syllable-final /d/ in Baghdad is already captured in the last syllable of 巴格達 baa1-gaak3-daat6 in Cantonese.

Such phonological difference between Mandarin Chinese and Cantonese might also account for the observation that Cantonese transliterations often do not introduce extra syllables for certain consonant segments in the middle of an English name, as in Dickson, transliterated as 迪克遜 di2-ke4-xun4 in Mandarin Chinese and 迪臣 dik6-san4 in Cantonese.

3.2 Ambiguities from Homophones

The homophone problem is notorious in Chinese. As far as personal names are concerned, the "correctness" of transliteration is not clear-cut at all. For example, to transliterate the name Hilary into Chinese, based on Cantonese pronunciations, the following are possibilities amongst many others: (a) 希拉利 hei1-laai1-lei6, (b) 希拉莉 hei1-laai1-lei6, and (c) 希拉里 hei1-laai1-lei5. The homophonous third character gives rise to multiple alternative transliterations in this example, where orthographically 利 lei6, 莉 lei6 and 里 lei5 are observed for "ry" in transliteration data. One cannot really say any of the combinations is "right" or "wrong", but perhaps only "better" or "worse". Such judgement is more cognitive than linguistic in nature, and apparently the tonal patterns play an important role in this regard. Hence naming is more of an art than a science, and automatic transliteration should avoid over-relying on the training data and thereby missing unlikely but good candidates.

4 Work in Progress

4.1 Datasets

A common set of 1,423 source English names and their transliterations (see Footnote 2) in Mandarin Chinese (as used by media in Mainland China) and Cantonese (as used by media in Hong Kong) were collected over the Internet. The names are mostly from soccer, entertainment, and politics. The data size is admittedly small compared to other existing transliteration datasets, but as a preliminary study, we aim at comparing the transliteration practice between Mandarin speakers and Cantonese speakers in a more objective way based on a common set of English names. The transliteration pairs were manually aligned, and the pronunciations for the Chinese characters were automatically looked up.

Footnote 2: Some names have more than one transliteration.

4.2 Preliminary Quantitative Analysis

                            Cantonese   Mandarin
Unique name pairs               1,531      1,543
Total English segments          4,186      4,667
Unique English segments           969        727
Unique grapheme pairs           1,618      1,193
Unique seg-sound pairs          1,574      1,141

Table 1. Quantitative Aspects of the Data

As shown in Table 1, the average segment-name ratios (2.73 for Cantonese and 3.02 for Mandarin) suggest that Mandarin transliterations often use more syllables for a name. The much smaller number of unique English segments for Mandarin and the difference in token-type ratio of grapheme pairs (3.91 for Mandarin and 2.59 for Cantonese) further suggest that names are more consistently segmented and transliterated in Mandarin.

4.2.1 Graphemic Correspondence

Assume grapheme pair mappings are in the form $\langle e_k, \{c_{k1}, c_{k2}, \ldots, c_{kn}\}\rangle$, where $e_k$ stands for the kth unique English segment from the data, and $\{c_{k1}, c_{k2}, \ldots, c_{kn}\}$ for the set of n unique Chinese segments observed for it. It was found that n varies from 1 to 10 for Mandarin, with 34.9% of the distinct English segments having multiple grapheme mappings, as shown in Table 2. For Cantonese, n varies from 1 to 13, with 31.5% of the distinct English segments having multiple grapheme mappings. The proportion of multiple mappings is similar for Mandarin and Cantonese, but the latter has a higher percentage of English segments with 5 or more Chinese renditions. Thus Mandarin transliterations are relatively more "standardised", whereas Cantonese transliterations are graphemically more ambiguous.

n    Cantonese    Mandarin

[...] sound mappings. Compared with Table 2 above, the downward shift of the percentages suggests that much of the graphemic ambiguity is a result of the use of homophones, instead of a set of characters with very different pronunciations.

4.2.3 Homophone Ambiguity (Sound-Tone)

Table 4 shows the situation of homophones with both sound and tone taken into account. For example, the characters 利 and 莉 both correspond to lei6 in Cantonese, while 李, 里 and 理 all correspond to lei5, and they are thus treated as two groups.

Assume grapheme-sound/tone pair mappings are in the form $\langle e_k, \{st_{k1}, st_{k2}, \ldots, st_{kn}\}\rangle$, where $e_k$ stands for the kth unique English segment, and $\{st_{k1}, st_{k2}, \ldots, st_{kn}\}$ for the set of n unique pronunciations (sound-tone combinations). For Mandarin, n varies from 1 to 8, with 33.5% of the distinct English segments corresponding to multiple Chinese homophones.
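The counts discussed in Sections 4.2.1 and 4.2.3 can be illustrated with a minimal Python sketch. It is not the author's implementation. The toy data reuses the Hilary example from Section 3.2; the segmentation into "hi", "la" and "ry" and all function names are assumptions made for illustration (the text only states that 利, 莉 and 里 are observed for "ry"). The ratio computation assumes that "segment-name ratio" means total English segments divided by unique name pairs, and that "token-type ratio of grapheme pairs" means total English segments divided by unique grapheme pairs; with the figures in Table 1 these readings reproduce the reported values.

```python
from collections import defaultdict

# Toy aligned data: (English segment, Chinese character, Jyutping sound-tone).
# Hypothetical segmentation of "Hilary"; only the "ry" rows are attested in the text.
ALIGNED = [
    ("hi", "希", "hei1"), ("la", "拉", "laai1"), ("ry", "利", "lei6"),
    ("hi", "希", "hei1"), ("la", "拉", "laai1"), ("ry", "莉", "lei6"),
    ("hi", "希", "hei1"), ("la", "拉", "laai1"), ("ry", "里", "lei5"),
]


def mapping_sets(aligned, level):
    """Collect, for each English segment e_k, its set of Chinese renditions.

    level="grapheme" gives {c_k1, ..., c_kn} (characters);
    level="sound_tone" gives {st_k1, ..., st_kn} (pronunciations with tone).
    """
    if level not in ("grapheme", "sound_tone"):
        raise ValueError("level must be 'grapheme' or 'sound_tone'")
    sets = defaultdict(set)
    for segment, char, pron in aligned:
        sets[segment].add(char if level == "grapheme" else pron)
    return dict(sets)


def share_multiple(sets):
    """Percentage of distinct English segments with more than one rendition (n > 1)."""
    multiple = sum(1 for renditions in sets.values() if len(renditions) > 1)
    return 100.0 * multiple / len(sets)


def ratio(total, unique):
    """Simple ratio rounded to two decimals, as reported in the text."""
    return round(total / unique, 2)


if __name__ == "__main__":
    graphemes = mapping_sets(ALIGNED, "grapheme")
    sound_tones = mapping_sets(ALIGNED, "sound_tone")
    # "ry" has three characters but only two sound-tone groups (lei6, lei5).
    print(len(graphemes["ry"]), len(sound_tones["ry"]))  # 3 2
    print(share_multiple(graphemes))

    # Figures from Table 1 (assumed readings of the two ratios, see lead-in).
    print(ratio(4186, 1531), ratio(4667, 1543))  # 2.73 3.02 (segment-name)
    print(ratio(4667, 1193), ratio(4186, 1618))  # 3.91 2.59 (token-type)
```

On the toy data, "ry" maps to three characters but only two sound-tone groups, which mirrors the observation that much of the graphemic ambiguity comes from homophones rather than from characters with very different pronunciations.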
