Automatic English-Chinese Name Transliteration for Development of Multilingual Resources
Stephen Wan and Cornelia Maria Verspoor
Microsoft Research Institute
Macquarie University
Sydney NSW 2109, Australia
{swan, kversp}@mri.mq.edu.au

Abstract

In this paper, we describe issues in the translation of proper names from English to Chinese which we have faced in constructing a system for multilingual text generation supporting both languages. We introduce an algorithm for mapping from English names to Chinese characters based on (1) heuristics about relationships between English spelling and pronunciation, and (2) consistent relationships between English phonemes and Chinese characters.

1 Introduction

In the context of multilingual natural language processing systems which aim for coverage of both languages using a roman alphabet and languages using other alphabets, the development of lexical resources must include mechanisms for handling words which do not have standard translations. Words falling into this category are words which do not have any obvious semantic content, e.g. most Indo-European personal and place names, and which can therefore not simply be mapped to translation equivalents.

In this paper, we examine the problem of generating Chinese characters which correspond to English personal and place names. Section 2 introduces the basic principles of English-Chinese transliteration, Section 3 identifies issues specific to the domain of name transliteration, and Section 4 introduces a rule-based algorithm for automatically performing the name transliteration. In Section 5 we present an example of the application of the algorithm, and in Section 6 we discuss extensions to improve the robustness of the algorithm.

Our need for automatic transliteration mechanisms stems from a multilingual text generation system which we are currently constructing, on the basis of an English-language database containing descriptive information about museum objects (the POWER system; Verspoor et al 1998). That database includes fields such as manufacturer, with values of personal and place names. Place names and personal names do not fall into a well-defined set, nor do they have semantic content which can be expressed in other languages through words equivalent in meaning. As more objects are added to our database (as will happen as a museum acquires new objects), new names will be introduced, and these must also be added to the lexica for each language in the system. We require an automatic procedure for achieving this, and concentrate here on techniques for the creation of a Chinese lexicon.

2 English-Chinese Transliteration

We use the term transliteration to refer generally to the problem of the identification of a specific textual form in an output language (in our case Chinese characters) which corresponds to a specific textual form in an input language (an English word or phrase). For words with semantic content, this process is essentially equivalent to the translation of individual words. So, the English word "black" is associated with a concept which is expressed as "黑" ([hēi]) in Chinese. In this case, a dictionary search establishes the input-output correspondence.

For words with little or no semantic content, such as personal and place names, dictionary lookup may suffice where standard translations exist, but in general it cannot be assumed that names will be included in the bilingual dictionary. In multilingual systems designed only for languages sharing the roman alphabet, such names pose no problem as they can simply be included unaltered in output texts in any of the languages. They cannot, however, be included in a Chinese text, as the roman characters cannot standardly be realized in the Han character set.

3 Name Transliteration

English-Chinese name transliteration occurs on the basis of pronunciation. That is, the written English word is mapped to the written Chinese character(s) via the spoken form associated with the word. The idealized process consists of:

1. mapping an English word (grapheme) to a phonemic representation
2. mapping each phoneme composing the word to a corresponding Chinese character

In practice, this process is not entirely straightforward. We outline several issues complicating the automation of this process below.

The written form of English is less than normalized. A particular English grapheme (letter or letter group) does not always correspond to a single phoneme (e.g. ea is pronounced differently in eat, threat, heart, etc.), and many English multi-letter combinations are realised as a single phoneme in pronunciation (so f, ff, ph, and gh can all map to /f/) (van den Bosch 1997). An important step in grapheme-phoneme conversion is the segmentation of words into syllables. However, this process is dependent on factors such as morphology. The syllabification of "hothead" divides the letter combination th, while the same combination corresponds to a single phoneme in "bother". Automatic identification of the phonemes in a word is therefore a difficult problem.

Many approaches exist in the literature to solving the grapheme-phoneme conversion problem. Divay and Vitale (1997) review several of these, and introduce a rule-based approach (with 1,500 rules for English) which achieved 94.9% accuracy on one corpus and 64.37% on another. Van den Bosch (1997) evaluates instance-based learning algorithms and a decision tree algorithm, finding that the best of these algorithms can achieve 96.9% accuracy.

Even when a reliable grapheme-to-phoneme conversion module can be constructed, the English-Chinese transliteration process is faced with the task of mapping phonemes in the source language to counterparts in the target language, difficult due to phonemic divergence between the two languages. English permits initial and final consonant clusters in syllables. Mandarin Chinese, in contrast, primarily has a consonant-vowel or consonant-vowel-[nasal consonant (/n/ or /ŋ/)] syllable structure. English consonant clusters, when pronounced within the Chinese phonemic system, must either be reduced to a single phoneme or converted to a consonant-vowel-consonant-vowel structure by inserting a vowel between the consonants in the cluster. In addition to these phonotactic constraints, the range of Chinese phonemes is not fully compatible with those of English. For instance, Mandarin does not use the phoneme /v/ and so that phoneme in English words is realized as either /w/ or /f/ in the Chinese counterpart.

We focus on the specific problem of country name transliteration from English into Chinese. The algorithm does not aim to specify general grapheme-phoneme conversion for English, but only for the subset of English words relevant to place name transliteration. This limited domain rarely exhibits complex morphology and thus a robust morphological module is not included. In addition, foreign language morphemes are treated superficially. Thus, the algorithm transliterates the "-istan" (a morpheme having meaning in Persian) of "Afghanistan" in spite of a standard transliteration which omits this morpheme.

The transliteration process is intended to be based purely on phonetic equivalency. On occasion, country names will have some additional meaning in English apart from the referential function, as in "The United States". Such names are often translated semantically rather than phonetically in Chinese. However, this is not uniformly true; for example, "Virgin" in "British Virgin Islands" is transliterated. We therefore introduce a dictionary lookup step prior to commencing transliteration, to identify cases which have a standard translation.

The transliteration algorithm results in a string of Han characters, the ideographic script used for Chinese. While the dialects of Chinese share the same orthography, they do not share the same pronunciation. This algorithm is based on the Mandarin dialect.

Because automation of this algorithm is our primary goal, the transliteration starts with a written source and it is assumed that the orthography represents an assimilated pronunciation, even though English has borrowed many country names. This is permitted only because the mapping from English phonemes to Chinese phonemes loses a large degree of variance: English vowel monophthongs are flattened into a fewer number of Chinese monophthongs. However, Chinese has a larger set of diphthongs and triphthongs. This results in approximating a prototypical vowel by the closest match within the set of Chinese vowels.

4 An Algorithm for Auto Transliteration

The algorithm begins with a proper noun phrase (PNP) and returns a transliteration in Chinese characters. The process involves five main stages: Semantic Abstraction, Syllabification, Sub-syllable Divisions, Mapping to Pinyin, and Mapping to Han Characters.

4.1 Semantic Abstraction

The PNP may consist of one or more words. If it is longer than a single word, it is likely that some part of it may have an existing semantic translation. "The" and "of" are omitted by convention. To ensure that such words as "United" are translated and not transliterated, we pass the entire PNP into a dictionary in search of a standard translation. If a match is not immediately successful, we break the PNP into words and pass each word

[...] clusters are reduced to a single phoneme represented by a single ASCII character (e.g. ff and ph are both reduced to f). Instances of 'y' as a vowel are also replaced by the vowel 'i'. For each pair of identical consonants in the input string