Automatic English-Chinese name transliteration for development of multilingual resources
Stephen Wan and Cornelia Maria Verspoor
Microsoft Research Institute
Macquarie University
Sydney NSW 2109, Australia
{swan, kversp}@mri.mq.edu.au
Abstract

In this paper, we describe issues in the translation of proper names from English to Chinese which we have faced in constructing a system for multilingual text generation supporting both languages. We introduce an algorithm for mapping from English names to Chinese characters based on (1) heuristics about relationships between English spelling and pronunciation, and (2) consistent relationships between English phonemes and Chinese characters.

1 Introduction

In the context of multilingual natural language processing systems which aim for coverage of both languages using a Roman alphabet and languages using other alphabets, the development of lexical resources must include mechanisms for handling words which do not have standard translations. Words falling into this category are words which do not have any obvious semantic content, e.g. most Indo-European personal and place names, and which can therefore not simply be mapped to translation equivalents.

In this paper, we examine the problem of generating Chinese characters which correspond to English personal and place names. Section 2 introduces the basic principles of English-Chinese transliteration, Section 3 identifies issues specific to the domain of name transliteration, and Section 4 introduces a rule-based algorithm for automatically performing the name transliteration. In Section 5 we present an example of the application of the algorithm, and in Section 6 we discuss extensions to improve the robustness of the algorithm.

Our need for automatic transliteration mechanisms stems from a multilingual text generation system which we are currently constructing, on the basis of an English-language database containing descriptive information about museum objects (the POWER system; Verspoor et al. 1998). That database includes fields such as the manufacturer, with values of personal and place names. Place names and personal names do not fall into a well-defined set, nor do they have semantic content which can be expressed in other languages through words equivalent in meaning. As more objects are added to our database (as will happen as a museum acquires new objects), new names will be introduced, and these must also be added to the lexica for each language in the system. We require an automatic procedure for achieving this, and concentrate here on techniques for the creation of a Chinese lexicon.

2 English-Chinese Transliteration

We use the term transliteration to refer generally to the problem of the identification of a specific textual form in an output language (in our case Chinese characters) which corresponds to a specific textual form in an input language (an English word or phrase). For words with semantic content, this process is essentially equivalent to the translation of individual words. So, the English word "black" is associated with a concept which is expressed as "黑" ([hēi]) in Chinese. In this case, a dictionary search establishes the input-output correspondence.

For words with little or no semantic content, such as personal and place names, dictionary lookup may suffice where standard translations exist, but in general it cannot be assumed that names will be included in the bilingual dictionary. In multilingual systems designed only for languages sharing the Roman alphabet, such names pose no problem as they can simply be included unaltered in output texts in any of the languages. They cannot, however, be included in a Chinese text, as the Roman characters cannot standardly be realized in the Han character set.

3 Name Transliteration

English-Chinese name transliteration occurs on the basis of pronunciation. That is, the written English word is mapped to the written Chinese character(s) via the spoken form associated with the word. The idealized process consists of:
1. mapping an English word (grapheme) to a phonemic representation
2. mapping each phoneme composing the word to a corresponding Chinese character

In practice, this process is not entirely straightforward. We outline several issues complicating the automation of this process below.

The written form of English is less than normalized. A particular English grapheme (letter or letter group) does not always correspond to a single phoneme (e.g. ea is pronounced differently in eat, threat, heart, etc.), and many English multi-letter combinations are realised as a single phoneme in pronunciation (so f, ff, ph, and gh can all map to /f/) (van den Bosch 1997). An important step in grapheme-phoneme conversion is the segmentation of words into syllables. However, this process is dependent on factors such as morphology. The syllabification of "hothead" divides the letter combination th, while the same combination corresponds to a single phoneme in "bother". Automatic identification of the phonemes in a word is therefore a difficult problem.

Many approaches exist in the literature to solving the grapheme-phoneme conversion problem. Divay and Vitale (1997) review several of these, and introduce a rule-based approach (with 1,500 rules for English) which achieved 94.9% accuracy on one corpus and 64.37% on another. Van den Bosch (1997) evaluates instance-based learning algorithms and a decision tree algorithm, finding that the best of these algorithms can achieve 96.9% accuracy.

Even when a reliable grapheme-to-phoneme conversion module can be constructed, the English-Chinese transliteration process is faced with the task of mapping phonemes in the source language to counterparts in the target language, which is difficult due to phonemic divergence between the two languages. English permits initial and final consonant clusters in syllables. Mandarin Chinese, in contrast, primarily has a consonant-vowel or consonant-vowel-[nasal consonant (/n/ or /ŋ/)] syllable structure. English consonant clusters, when pronounced within the Chinese phonemic system, must either be reduced to a single phoneme or converted to a consonant-vowel-consonant-vowel structure by inserting a vowel between the consonants in the cluster. In addition to these phonotactic constraints, the range of Chinese phonemes is not fully compatible with those of English. For instance, Mandarin does not use the phoneme /v/ and so that phoneme in English words is realized as either /w/ or /f/ in the Chinese counterpart.

We focus on the specific problem of country name transliteration from English into Chinese. The algorithm does not aim to specify general grapheme-phoneme conversion for English, but only for the subset of English words relevant to place name transliteration. This limited domain rarely exhibits complex morphology and thus a robust morphological module is not included. In addition, foreign language morphemes are treated superficially. Thus, the algorithm transliterates the "-istan" (a morpheme having meaning in Persian) of "Afghanistan" in spite of a standard transliteration which omits this morpheme.

The transliteration process is intended to be based purely on phonetic equivalency. On occasion, country names will have some additional meaning in English apart from the referential function, as in "The United States". Such names are often translated semantically rather than phonetically in Chinese. However, this is not uniformly true: for example, "Virgin" in "British Virgin Islands" is transliterated. We therefore introduce a dictionary lookup step prior to commencing transliteration, to identify cases which have a standard translation.

The transliteration algorithm results in a string of Han characters, the ideographic script used for Chinese. While the dialects of Chinese share the same orthography, they do not share the same pronunciation. This algorithm is based on the Mandarin dialect.

Because automation of this algorithm is our primary goal, the transliteration starts with a written source and it is assumed that the orthography represents an assimilated pronunciation, even though English has borrowed many country names. This is permitted only because the mapping from English phonemes to Chinese phonemes loses a large degree of variance: English vowel monophthongs are flattened into a smaller number of Chinese monophthongs. However, Chinese has a larger set of diphthongs and triphthongs. This results in approximating a prototypical vowel by the closest match within the set of Chinese vowels.

4 An Algorithm for Auto Transliteration

The algorithm begins with a proper noun phrase (PNP) and returns a transliteration in Chinese characters. The process involves five main stages: Semantic Abstraction, Syllabification, Sub-syllable Divisions, Mapping to Pinyin, and Mapping to Han Characters.

4.1 Semantic Abstraction

The PNP may consist of one or more words. If it is longer than a single word, it is likely that some part of it may have an existing semantic translation. "The" and "of" are omitted by convention. To ensure that such words as "United" are translated and not transliterated¹, we pass the entire PNP into a dictionary in search of a standard translation. If a match is not immediately successful, we break the PNP into words and pass each word into the dictionary to check for a semantic translation². This portion of the algorithm controls which words in the PNP are translated and which are transliterated.

    Search for PNP in dictionary
    If exact match exists then
        return corresponding characters
    else
        remove article 'The' and preposition 'of'
        For each (remaining) word in PNP
            search for word in dictionary
            If exact match exists then
                add matching characters to output string³
            else if the word is not already a Chinese word
                transliterate the word and add to output string

¹ The historical interactions of some European and Asian nations have led to names that include some special meaning. Interaction with the dialects of the South may have produced transliterations based on regional pronunciations which are accepted as standard.
² There is some discrepancy among speakers about the balance between translation and transliteration. For instance, the word 'New' is translated by some and transliterated by others.
³ Identification of syntactic constraints is work-in-progress. Known nouns such as 'island' are moved to the end of the phrase while modifiers (remaining words) maintain their relative order.

4.2 Transliteration 1: Syllabification

Because Chinese characters are monosyllabic, each word to be transliterated must first be divided into syllables. The outcome is a list of syllables, each with at least one vowel part.

We distinguish between a consonant group and a consonant cluster, where a group is an arbitrary collection of consonant phonemes and a cluster is a known collection of consonants. Like Divay and Vitale (1997), we identify syllable boundaries on the basis of consonant clusters and vowels (ignoring morphological considerations). Any consonant group is divided into two parts, by identifying the final consonant cluster or lone consonant in that group and grouping that consonant (cluster) with the following vowel.

The sub-syllabification algorithm then further divides each identified syllable. While this procedure may not always strictly divide a word into standard syllables, it produces syllables of the form consonant-vowel, the common pronunciation of most Chinese characters.

4.2.1 Normalization

Prior to the syllabification process, the input string must be normalized, so that consonant clusters are reduced to a single phoneme represented by a single ASCII character (e.g. ff and ph are both reduced to f). Instances of 'y' as a vowel are also replaced by the vowel 'i'.

    For each pair of identical consonants in the input string
        Reduce the pair to a single instance of the consonant
    For each substring in the input string listed in Appendix A
        Replace substring with the corresponding phoneme (App. A)
    For all instances where 'y' is not followed by a vowel or 'y' follows a consonant
        Replace this instance of 'y' with the vowel 'i'
    When 'e' is followed by a consonant and an 'ia#'   ;; (where # is the end of string marker)
        Replace the preceding 'e' with 'i'

4.2.2 Syllabification

    If string begins with a consonant
        Then read/store consonants until next vowel and call this substring initial_consonant_group (or icg)
    Read/store vowels until next consonant and call this substring vowels (or v)
    If more characters, read/store consonants until next vowel and call this final_consonant_cluster (or fcc)
    If length of fcc = 1 and fcc followed by substring 'e#'
        final_vowel (or fv) = 'e'
        syllable = icg + v + fcc + fv
    else if the last two letters of fcc form a substring in Appendix B
        then this string has a double consonant cluster
        next_syllable (or ns) = the last two letters of fcc
        reset fcc to be fcc with ns removed
    else
        next_syllable (or ns) = the last letter of fcc
        reset fcc to be fcc with ns removed
    syllable = icg + v + fcc
    Store syllable in a list
    Call syllabification procedure on substring [ns .. #]

4.3 Transliteration 2: Sub-syllable Divisions

The algorithm then proceeds to find patterns within each syllable of the list. The pattern matching consists of splitting those consonant clusters that cannot be pronounced within the Chinese phonemic set. These separated consonants are generally pronounced by inserting a context-dependent vowel. The Pinyin romanization consists of elements that can be described as consonants (including three consonant clusters "zh", "ch" and "sh") and vowels which consist of monophthongs, diphthongs and vowels followed by a nasal /n/ or /ŋ/.

Consonants that follow a set of vowels are examined to determine if they "modify" the vowel. Such consonants include the alveolar approximant /r/, the pharyngeal fricative /h/ or the above mentioned nasal consonants. These are then joined to the vowel to form the "vowel part". The "vowel part" may be divided so as to map onto a Pinyin syllable. Any remaining consonants are then split by inserting a vowel.
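As an illustration only, the normalization and syllabification procedures of Sections 4.2.1 and 4.2.2 might be sketched in Python as follows. This is not the authors' implementation. The tables `CLUSTER_TO_PHONEME` and `KNOWN_CLUSTERS` are small hypothetical stand-ins for Appendix A and Appendix B, which are not reproduced here, and the function names `normalize` and `syllabify` are likewise assumptions. The sketch also assumes that consonants running to the end of a word stay with the last syllable, a case the pseudocode leaves implicit.

```python
import re

VOWELS = set("aeiou")

# Hypothetical stand-ins: Appendix A (cluster-to-phoneme table) and
# Appendix B (known consonant clusters) are not reproduced in this section.
CLUSTER_TO_PHONEME = {"ph": "f"}
KNOWN_CLUSTERS = {"br", "pl", "st", "tr"}


def normalize(word):
    """Apply the normalization rules of Section 4.2.1 to a single word."""
    s = word.lower()
    # Reduce each pair of identical consonants to a single consonant.
    s = re.sub(r"([^aeiou])\1", r"\1", s)
    # Replace each listed cluster with its single-character phoneme.
    for cluster, phoneme in CLUSTER_TO_PHONEME.items():
        s = s.replace(cluster, phoneme)
    # 'y' becomes 'i' when it is not followed by a vowel or follows a consonant.
    chars = list(s)
    for i, ch in enumerate(s):
        if ch != "y":
            continue
        before_vowel = i + 1 < len(s) and s[i + 1] in VOWELS
        after_consonant = i > 0 and s[i - 1] not in VOWELS
        if not before_vowel or after_consonant:
            chars[i] = "i"
    s = "".join(chars)
    # 'e' + consonant + 'ia' at the end of the string: the 'e' becomes 'i'.
    return re.sub(r"e([^aeiou]ia)$", r"i\1", s)


def syllabify(s):
    """Split a normalized word into syllables, following Section 4.2.2."""
    syllables = []
    while s:
        # Initial consonant group: consonants up to the first vowel.
        i = 0
        while i < len(s) and s[i] not in VOWELS:
            i += 1
        # Vowels: the run of vowels that follows.
        j = i
        while j < len(s) and s[j] in VOWELS:
            j += 1
        # Final consonant cluster: consonants up to the next vowel.
        k = j
        while k < len(s) and s[k] not in VOWELS:
            k += 1
        icg, v, fcc, rest = s[:i], s[i:j], s[j:k], s[k:]
        if not rest:
            # Assumption: trailing consonants stay with the last syllable.
            syllables.append(icg + v + fcc)
            break
        if len(fcc) == 1 and rest == "e":
            # A lone consonant followed by a word-final 'e' closes the syllable.
            syllables.append(icg + v + fcc + "e")
            break
        # The last known cluster (or lone consonant) opens the next syllable.
        ns = fcc[-2:] if fcc[-2:] in KNOWN_CLUSTERS else fcc[-1:]
        syllables.append(icg + v + fcc[: len(fcc) - len(ns)])
        s = ns + rest
    return syllables
```

With these stand-in tables, `syllabify(normalize("Australia"))` yields `["aus", "tra", "lia"]`. The syllables are not yet of consonant-vowel form; that is the job of the sub-syllable division stage.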
For each syllable s identified above specifying the Pinyin