
Automatic English-Chinese transliteration for development of multilingual resources

Stephen Wan and Cornelia Maria Verspoor
Microsoft Research Institute
Macquarie University
Sydney NSW 2109, Australia
{swan, kversp}@mri.mq.edu.au

Abstract

In this paper, we describe issues in the translation of proper names from English to Chinese which we have faced in constructing a system for multilingual text generation supporting both languages. We introduce an algorithm for mapping from English names to Chinese characters based on (1) heuristics about relationships between English spelling and pronunciation, and (2) consistent relationships between English phonemes and Chinese characters.

1 Introduction

In the context of multilingual natural language processing systems which aim for coverage of both languages using a roman alphabet and languages using other alphabets, the development of lexical resources must include mechanisms for handling words which do not have standard translations. Words falling into this category are words which do not have any obvious semantic content, e.g. most Indo-European personal and place names, and which can therefore not simply be mapped to translation equivalents.

In this paper, we examine the problem of generating Chinese characters which correspond to English personal and place names. Section 2 introduces the basic principles of English-Chinese transliteration, Section 3 identifies issues specific to the domain of name transliteration, and Section 4 introduces a rule-based algorithm for automatically performing the name transliteration. In Section 5 we present an example of the application of the algorithm, and in Section 6 we discuss extensions to improve the robustness of the algorithm.

Our need for automatic transliteration mechanisms stems from a multilingual text generation system which we are currently constructing, on the basis of an English-language database containing descriptive information about museum objects (the POWER system; Verspoor et al 1998). That database includes fields such as manufacturer, with values of personal and place names. Place names and personal names do not fall into a well-defined set, nor do they have semantic content which can be expressed in other languages through words equivalent in meaning. As more objects are added to our database (as will happen as a museum acquires new objects), new names will be introduced, and these must also be added to the lexica for each language in the system. We require an automatic procedure for achieving this, and concentrate here on techniques for the creation of a Chinese lexicon.

2 English-Chinese Transliteration

We use the term transliteration to refer generally to the problem of the identification of a specific textual form in an output language (in our case Chinese characters) which corresponds to a specific textual form in an input language (an English word or phrase). For words with semantic content, this process is essentially equivalent to the translation of individual words. So, the English word "black" is associated with a concept which is expressed as "黑" ([hēi]) in Chinese. In this case, a dictionary search establishes the input-output correspondence.

For words with little or no semantic content, such as personal and place names, dictionary lookup may suffice where standard translations exist, but in general it cannot be assumed that names will be included in the bilingual dictionary. In multilingual systems designed only for languages sharing the roman alphabet, such names pose no problem as they can simply be included unaltered in output texts in any of the languages. They cannot, however, be included in a Chinese text, as the roman characters cannot standardly be realized in the Han character set.

3 Name Transliteration

English-Chinese name transliteration occurs on the basis of pronunciation. That is, the written English word is mapped to the written Chinese character(s) via the spoken form associated with the word. The idealized process consists of:

1. mapping an English word (grapheme) to a phonemic representation
2. mapping each phoneme composing the word to a corresponding Chinese character

In practice, this process is not entirely straightforward. We outline several issues complicating the automation of this process below.

The written form of English is less than normalized. A particular English grapheme (letter or letter group) does not always correspond to a single phoneme (e.g. ea is pronounced differently in eat, threat, heart, etc.), and many English multi-letter combinations are realised as a single phoneme in pronunciation (so f, ff, ph, and gh can all map to /f/) (van den Bosch 1997). An important step in grapheme-phoneme conversion is the segmentation of words into syllables. However, this process is dependent on factors such as morphology. The syllabification of "hothead" divides the letter combination th, while the same combination corresponds to a single phoneme in "bother". Automatic identification of the phonemes in a word is therefore a difficult problem.

Many approaches exist in the literature to solving the grapheme-phoneme conversion problem. Divay and Vitale (1997) review several of these, and introduce a rule-based approach (with 1,500 rules for English) which achieved 94.9% accuracy on one corpus and 64.37% on another. Van den Bosch (1997) evaluates instance-based learning algorithms and a decision tree algorithm, finding that the best of these algorithms can achieve 96.9% accuracy.

Even when a reliable grapheme-to-phoneme conversion module can be constructed, the English-Chinese transliteration process is faced with the task of mapping phonemes in the source language to counterparts in the target language, which is difficult due to phonemic divergence between the two languages. English permits initial and final consonant clusters in syllables. Mandarin Chinese, in contrast, primarily has a consonant-vowel or consonant-vowel-[nasal consonant (/n/ or /ŋ/)] syllable structure. English consonant clusters, when pronounced within the Chinese phonemic system, must either be reduced to a single phoneme or converted to a consonant-vowel-consonant-vowel structure by inserting a vowel between the consonants in the cluster. In addition to these phonotactic constraints, the range of Chinese phonemes is not fully compatible with those of English. For instance, Mandarin does not use the phoneme /v/ and so that phoneme in English words is realized as either /w/ or /f/ in the Chinese counterpart.

We focus on the specific problem of country name transliteration from English into Chinese. The algorithm does not aim to specify general grapheme-phoneme conversion for English, but only for the subset of English words relevant to place name transliteration. This limited domain rarely exhibits complex morphology and thus a robust morphological module is not included. In addition, foreign language morphemes are treated superficially. Thus, the algorithm transliterates the "-istan" (a morpheme having meaning in Persian) of "Afghanistan" in spite of a standard transliteration which omits this morpheme.

The transliteration process is intended to be based purely on phonetic equivalency. On occasion, country names will have some additional meaning in English apart from the referential function, as in "The ". Such names are often translated semantically rather than phonetically in Chinese. However, this is not uniformly true: for example, "Virgin" in "British Virgin Islands" is transliterated. We therefore introduce a dictionary lookup step prior to commencing transliteration, to identify cases which have a standard translation.

The transliteration algorithm results in a string of Han characters, the ideographic script used for Chinese. While the dialects of Chinese share the same orthography, they do not share the same pronunciation. This algorithm is based on the Mandarin dialect.

Because automation of this algorithm is our primary goal, the transliteration starts with a written source and it is assumed that the orthography represents an assimilated pronunciation, even though English has borrowed many country names. This is permitted only because the mapping from English phonemes to Chinese phonemes loses a large degree of variance: English vowel monophthongs are flattened into a smaller number of Chinese monophthongs. However, Chinese has a larger set of diphthongs and triphthongs. This results in approximating a prototypical vowel by the closest match within the set of Chinese vowels.

4 An Algorithm for Auto Transliteration

The algorithm begins with a proper noun phrase (PNP) and returns a transliteration in Chinese characters. The process involves five main stages: Semantic Abstraction, Syllabification, Sub-syllable Divisions, Mapping to Pinyin, and Mapping to Han Characters.

4.1 Semantic Abstraction

The PNP may consist of one or more words. If it is longer than a single word, it is likely that some part of it may have an existing semantic translation. "The" and "of" are omitted by
convention. To ensure that such words as "United" are translated and not transliterated¹, we pass the entire PNP into a dictionary in search of a standard translation. If a match is not immediately successful, we break the PNP into words and pass each word into the dictionary to check for a semantic translation². This portion of the algorithm controls which words in the PNP are translated and which are transliterated.

    Search for PNP in dictionary
    If exact match exists then
        return corresponding characters
    else
        remove article 'The' and preposition 'of'
        For each (remaining) word in PNP
            search for word in dictionary
            If exact match exists
                add matching characters to output string³
            else if the word is not already a Chinese word
                transliterate the word and add to output string

¹ The historical interactions of some European and Asian nations have led to names that include some special meaning. Interaction with the dialects of the South may have produced transliterations based on regional pronunciations which are accepted as standard.
² There is some discrepancy among speakers about the balance between translation and transliteration. For instance, the word 'New' is translated by some and transliterated by others.
³ Identification of syntactic constraints is work-in-progress. Known nouns such as 'island' are moved to the end of the phrase while modifiers (remaining words) maintain their relative order.

4.2 Transliteration 1: Syllabification

Because Chinese characters are monosyllabic, each word to be transliterated must first be divided into syllables. The outcome is a list of syllables, each with at least one vowel part.

We distinguish between a consonant group and a consonant cluster, where a group is an arbitrary collection of consonant phonemes and a cluster is a known collection of consonants. Like Divay and Vitale (1997), we identify syllable boundaries on the basis of consonant clusters and vowels (ignoring morphological considerations). Any consonant group is divided into two parts, by identifying the final consonant cluster or lone consonant in that group and grouping that consonant (cluster) with the following vowel.

The sub-syllabification algorithm then further divides each identified syllable. While this procedure may not always strictly divide a word into standard syllables, it produces syllables of the form consonant-vowel, the common pronunciation of most Chinese characters.

4.2.1 Normalization

Prior to the syllabification process, the input string must be normalized, so that consonant clusters are reduced to a single phoneme represented by a single ASCII character (e.g. ff and ph are both reduced to f). Instances of 'y' as a vowel are also replaced by the vowel 'i'.

    For each pair of identical consonants in the input string
        Reduce the pair to a single instance of the consonant
    For each substring in the input string listed in Appendix A
        Replace substring with the corresponding phoneme (App. A)
    For all instances where 'y' is not followed by a vowel or 'y' follows a consonant
        Replace this instance of 'y' with the vowel 'i'
    When 'e' is followed by a consonant and an 'ia#' (where # is the end of string marker)
        Replace the preceding 'e' with 'i'

4.2.2 Syllabification

    If string begins with a consonant
        Then read/store consonants until next vowel and call this substring initial_consonant_group (or icg)
    Read/store vowels until next consonant and call this substring vowels (or v)
    If more characters, read/store consonants until next vowel and call this final_consonant_cluster (or fcc)
    If length of fcc = 1 and fcc followed by substring 'e#'
        final_vowel (or fv) = 'e'
        syllable = icg + v + fcc + fv
    else if the last two letters of fcc form a substring in Appendix B
        then this string has a double consonant cluster
        next_syllable (or ns) = the last two letters of fcc
        reset fcc to be fcc with ns removed
    else
        next_syllable (or ns) = the last letter of fcc
        reset fcc to be fcc with ns removed
    syllable = icg + v + fcc
    Store syllable in a list
    Call syllabification procedure on substring [ns .. #]

4.3 Transliteration 2: Sub-syllable Divisions

The algorithm then proceeds to find patterns within each syllable of the list. The pattern matching consists of splitting those consonant clusters that cannot be pronounced within the Chinese phonemic set. These separated consonants are generally pronounced by inserting a context-dependent vowel.

The Pinyin romanization consists of elements that can be described as consonants (including three consonant clusters "zh", "ch" and "sh") and vowels, which consist of monophthongs, diphthongs and vowels followed by a nasal /n/ or /ŋ/. Consonants that follow a set of vowels are examined to determine if they "modify" the vowel. Such consonants include the alveolar approximant /r/, the pharyngeal fricative /h/ or the above-mentioned nasal consonants. These are then joined to the vowel to form the "vowel part". The "vowel part" may be divided so as to map onto a Pinyin syllable. Any remaining consonants are then split by inserting a vowel.
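The normalization and syllabification procedures of Sections 4.2.1 and 4.2.2 can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the system: the contents of `UNITARY` and `CONSONANT_PAIRS` are assumed subsets of Appendices A and B, the function names are hypothetical, and keeping word-final consonants with the last syllable is an assumption that the pseudocode does not state.

```python
import re

# Assumed subset of Appendix A (unitary consonant correspondences).
UNITARY = {"ph": "f", "th": "t", "ck": "k", "bh": "b", "dj": "j"}
# Assumed subset of Appendix B (consonant pairs that start a syllable together).
CONSONANT_PAIRS = {"tr", "sh", "ch", "sp", "st", "sw",
                   "bl", "cl", "fl", "kl", "pl", "sl"}


def normalize(word):
    """Apply the four normalization rules of Section 4.2.1."""
    s = word.lower()
    # Rule 1: reduce a pair of identical consonants to a single consonant.
    s = re.sub(r"([b-df-hj-np-tv-z])\1", r"\1", s)
    # Rule 2: replace listed substrings, longest first.
    for src in sorted(UNITARY, key=len, reverse=True):
        s = s.replace(src, UNITARY[src])
    # Rule 3: 'y' after a consonant, or not followed by a vowel, becomes 'i'.
    s = re.sub(r"(?<=[^aeiou])y|y(?![aeiou])", "i", s)
    # Rule 4: 'e' before consonant + 'ia' at the end of the string becomes 'i'.
    s = re.sub(r"e(?=[^aeiou]ia$)", "i", s)
    return s


def syllabify(s):
    """Recursive syllabification following Section 4.2.2."""
    if not s:
        return []
    # icg = leading consonants, v = vowels, fcc = following consonants.
    icg, v, fcc, rest = re.match(
        r"([^aeiou]*)([aeiou]*)([^aeiou]*)(.*)", s).groups()
    if not rest:
        # Assumption: word-final consonants stay with the last syllable.
        return [icg + v + fcc]
    if len(fcc) == 1 and rest == "e":
        # A single consonant followed by a final 'e' closes the syllable.
        return [icg + v + fcc + "e"]
    if fcc[-2:] in CONSONANT_PAIRS:
        ns, fcc = fcc[-2:], fcc[:-2]   # known pair opens the next syllable
    else:
        ns, fcc = fcc[-1:], fcc[:-1]   # lone consonant opens the next syllable
    return [icg + v + fcc] + syllabify(ns + rest)


print(syllabify(normalize("Faeroe")))    # ['fae', 'roe']
print(syllabify(normalize("Pakistan")))  # ['pa', 'ki', 'stan']
```

The first call reproduces the syllable division used for "Faeroe" in Section 5. The second shows a final syllable, "stan", whose initial consonant pair would still have to be split by the sub-syllable procedure.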

    For each syllable s identified above
        Initialize subsyllable_list (or sl) to the empty string
        Identify initial_consonant_group s_icg
        While s_icg is non-null
            If the first two letters of s_icg appear in Appendix C
                then consonant_pair (or cp) = those two letters
                append cp to sl
                reset s_icg to be the remainder of s_icg
            else
                add the first letter of s_icg to sl
                reset s_icg to be the remainder of s_icg
        Identify vowels (v) in s
        append v to last element of sl
        identify final_consonant_cluster (fcc) of s
        if s_fcc is non-null
            if s_fcc is equal to 'n', 'm', 'ng', 'h' or 'r'
                identify final vowels of s (s_fv)
                If s_fv exists and s_fcc = 'n' or 'm'
                    append s_fcc to last element of sl
                else if s_fv exists and s_fcc not = 'n' or 'm'
                    append s_fcc + s_fv to last element of sl
                else if s_fv exists and s_fcc = 'h' or 'r'
                    discard s_fcc + s_fv
            else
                while s_fcc is non-null
                    If the first two letters of s_fcc appear in Appendix C
                        then cp = those two letters
                        append cp to sl
                        reset s_fcc to be the remainder of s_fcc
                    else
                        add the first letter of s_fcc to sl
                        reset s_fcc to be the remainder of s_fcc
        For each element of sl
            If element does not include a vowel
                Insert context-dependent vowel

This procedure will subdivide the syllable into pronounceable sections for mapping to the Chinese phoneme set. Thus each subsection should be of the form cv or cvc_n, where "c" is a single consonant, "v" is a monophthong or diphthong and "c_n" is a nasal consonant.

4.4 Transliteration 3: Mapping to Pinyin

The subsyllables are then mapped to the Pinyin romanization standard equivalents by means of a table (Appendix D). The columns of this table are indexed by the consonants of the subsyllable, and the rows by the vowel part of the subsyllable. When an exact match cannot be found we prioritize aspects of the subsyllable. Often the highest priority is the initial consonant. Of next priority are nasal consonants. This may demand an alternate vowel choice if no such combination of phonemes exists in the table.

4.5 Transliteration 4: Mapping to Han

Once the Pinyin of a word is established, the Han characters are simply extracted from a table specifying the Pinyin-Han character correspondence (Appendix E). In some cases, multiple characters might be possible but the table includes only the most common.

5 An Example

The transliteration of the place name "Faeroe Islands" according to the algorithm will proceed as follows:

1. No match for "Faeroe" is found in the dictionary, so it must be transliterated.
2. Divide Faeroe into two syllables by recognizing that the syllabic break falls before the "r" in the middle consonant group.
3. Map /fae/ and /roe/ onto their Chinese equivalents. Since no vowel form /ae/ exists in Chinese, this is mapped to /ei/. The /r/ of the second syllable is mapped to /l/ and /oe/ is correspondingly mapped to /uo/.
4. Since each syllable is of the form cv, no subsyllabic processing is required.
5. The transliterated phrase "fei luo" is then mapped to Han characters.
6. "Islands" is searched for and found in the dictionary (qun dao).
7. The characters of the translated "Islands" are placed after the transliteration of "Faeroe" (fei luo qun dao).

6 Conclusions and Future Extensions

The algorithm we have outlined is being implemented as a tool for the creation of Chinese lexical resources within a multilingual text generation project from an English-language source database. We focused on the requirements of the domain of English place names. The algorithm is currently being extended to include personal name transliteration as well, which requires a different set of characters. A personal name transliteration standard has been developed and is in use (Canzhong Wu, p.c.). By mapping the Pinyin transliterations arrived at by our algorithm to this different set of characters, we can extend the domain to include personal names.

In its present form, the algorithm will not always generate transliterations matching those which might be produced by a human transliterator, due to the influence of historical factors or individual differences. However, the aim of the algorithm is to produce a transliteration understandable by readers of a Chinese text. While the algorithm mimics the intuitive superimposition of phonemic and phonotactic systems, the ultimate goals of the algorithm are generality and reliability. Indeed, the result from the example above corresponds to a standard transliteration. Thus the algorithm produces results which are recognisable. The degree to which the transliteration is recognised by the human speaker is dependent in part on the length of the original name. Longer names with many syllables are less recognisable than shorter names. The introduced phonemic conversion rules are merely those most common and further work will strengthen the generality of the tool.

Further research will include a more formal analysis of the correspondences between English and Chinese phonemes. Furthermore, the algorithm is far from robust due to its current limited focus, and errors made in earlier stages are propagated and possibly magnified as the algorithm continues. Since place names and people's names originate from many cultures, this algorithm will not produce desirable results unless the written form exhibits some assimilation to English spelling. We are currently investigating the application of lazy learning techniques (as described by van den Bosch 1997) to learning the English naming word-phoneme correspondences from a corpus of names. Such a module could eventually replace our simplistic rule-based procedure, and could feed into the phoneme-Pinyin mapping module, ultimately resulting in greater accuracy.

The applications of such an algorithm are numerous. Currently, the process of finding a less common country, city, or county name is an arduous procedure. Because transliteration uses no semantic content, it is an obvious task for automation. This algorithm could also be applied in the character entry on a Chinese word processor or to index Chinese electronic atlases. When attached to a robust grapheme-to-phoneme module, the transliteration into Chinese characters is ultimately a mapping to Chinese-specific IPA phonetics, raising the possibility of speech synthesis of English names in Chinese, given that Pinyin is a phonemically normalized orthography.

Acknowledgements

Our thanks go to Canzhong Wu for help with identifying Chinese mappings, and the members of the Dynamic Document Delivery project at the Microsoft Research Institute (the POWER team).

References

Divay M. and Vitale A.J. (1997) Algorithms for Grapheme-Phoneme Translation for English and French: Applications. Computational Linguistics, 23/4, pp. 495-524.

Verspoor, C., Dale, R., Green, S., Milosavljevic, M., Paris, C., and Williams, S. (1998) Intelligent Agents for Information Presentation: Dynamic Description of Knowledge Base Objects. In the proceedings of the International Workshop on Intelligent Agents on the Internet and Web, Mexico City, Mexico, 16-20 March 1998, pp. 75-86.

van den Bosch A. (1997) Learning to pronounce written words: A study in inductive language learning. PhD thesis, University of Maastricht, Uitgeverij Phidippides, Cadier en Keer, the Netherlands, 229p.

Appendices A, B, and C. English-Chinese unitary consonant correspondences, consonant pairs, and double consonant correspondences

Appendix A (unitary consonant correspondences):
    bh => b      cqu => k
    ngh => ngh   sc => c
    gh => gh     dj => j
    ph => f      ts => c
    th => t      lk => k
    ck => k      we => w
    r + cons. => cons.

Appendix B (consonant pairs):
    tr   bl
    sh   cl
    ch   fl
    cz   kl
    sp   pl
    st   sl
    sw

Appendix C (double consonant correspondences):
    cz => ch      sp => xi b-
    st => shi d-  sw => ru-
    ch => ch      sh => sh

Appendix D. Portion of English phoneme - Chinese Pinyin Mapping Table

           f-      n-      p-      r-      v
    a      fa      na      ba      la      wa
    ae     fei     nei     bei     lei     wei
    ai     fei     nei     bei     lei     wei
    ai     fai     nai     bai     lai     wai
    ai     fa yi   na yi   ba yi   la yi   wa yi
    ao             nao     bao     lao
    ar#            nuo             luo     wuo
    au             nuo             luo     wuo
    ay     fei     nei     bei     lei     wei
    o      fo              bo              wo
    o#             nuo#            luo#    wuo#
    oa                     bo ya           wo ya
    oe             nuo             luo     wuo
    oi #
    on                             lun
    or#            nuo#            luo#    wuo#
    ou             nuo             luo     wuo

Appendix E. Pinyin-Han table (portion)

[Table of pinyin;Han character pairs. Only the Pinyin keys are legible: a, ai, an, ang, ao, ba, bai, ban, bao, bei, ben, bi, bing, bo, bu, chao, dian, du, dun, duo, e, er, fa, fei, fen, fo, fu, gan, gang, ge, hong, ji, jia, jian, jie, jin, jing, ju, ka, kai, ke, ken, la, lai, lan, lang, lao, le, li, lun, luo, luu, mai, man, mao, men, meng, mi, mian, mo, mu, na, nan, nao, qiu, ri, rui, sa, sai, sang, se, sen, sha, shao, she, shi, song, su, suo, ta, tai, wang, wei, wen, wu, wuo, xi, xiang, xin, xiong, xu, ya, ye, yi, yin, yue.]
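The table lookup of Section 4.4 can be sketched in Python as follows. This is an illustration only: the `ROWS` data is a subset of the Appendix D portion, with cells assigned to columns by their Pinyin initial. The names `ROWS`, `INITIAL_TO_COLUMN`, `TABLE` and `to_pinyin` are hypothetical. The fallback rule is an assumption: it keeps the initial consonant and shortens the vowel part until a table entry is found, which is one way to give the initial consonant highest priority. Nasal finals are not handled.

```python
import re

# Subset of the Appendix D portion: vowel part -> Pinyin cells present in that row.
ROWS = {
    "a":  "fa na ba la wa",
    "ae": "fei nei bei lei wei",
    "ao": "nao bao lao",
    "au": "nuo luo wuo",
    "o":  "fo bo wo",
    "oe": "nuo luo wuo",
    "ou": "nuo luo wuo",
}
# Pinyin initial -> English consonant column (f-, n-, p-, r-, v) it belongs to.
INITIAL_TO_COLUMN = {"f": "f", "n": "n", "b": "p", "l": "r", "w": "v"}

# (English consonant, English vowel part) -> Pinyin syllable.
TABLE = {
    (INITIAL_TO_COLUMN[pinyin[0]], vowel): pinyin
    for vowel, cells in ROWS.items()
    for pinyin in cells.split()
}


def to_pinyin(subsyllable):
    """Map a consonant-vowel subsyllable to Pinyin via the table."""
    m = re.fullmatch(r"([^aeiou]*)([aeiou]+)", subsyllable)
    if m is None:
        raise ValueError(f"not of the form consonant-vowel: {subsyllable!r}")
    consonant, vowel = m.groups()
    if (consonant, vowel) in TABLE:
        return TABLE[(consonant, vowel)]
    # Assumed fallback: keep the initial consonant, try shorter vowel parts.
    for end in range(len(vowel) - 1, 0, -1):
        key = (consonant, vowel[:end])
        if key in TABLE:
            return TABLE[key]
    raise KeyError(f"no table entry for {subsyllable!r}")


print(" ".join(to_pinyin(s) for s in ["fae", "roe"]))  # fei luo
```

With the syllables "fae" and "roe" from the example in Section 5, the lookup returns "fei luo", matching step 5 of that example. A subsyllable such as "fao", which has no cell in the f- column, falls back to the row for "a" and yields "fa".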
