Semantic Transliteration of Personal Names
Total Page:16
File Type:pdf, Size:1020Kb
Semantic Transliteration of Personal Names Haizhou Li*, Khe Chai Sim*, Jin-Shea Kuo†, Minghui Dong* *Institute for Infocomm Research †Chung-Hwa Telecom Laboratories Singapore 119613 Taiwan {hli,kcsim,mhdong}@i2r.a-star.edu.sg [email protected] transliteration, we mean rewriting a foreign word Abstract in native grapheme such that its original pronunciation is preserved. For example, London Words of foreign origin are referred to as becomes 伦敦 /Lun-Dun/1 which does not carry borrowed words or loanwords. A loanword is usually imported to Chinese by phonetic any clear connotations. Phonetic transliteration transliteration if a translation is not easily represents the common practice in transliteration. available. Semantic transliteration is seen Phonetic-semantic transliteration, hereafter as a good tradition in introducing foreign referred to as semantic transliteration for short, is words to Chinese. Not only does it preserve an advanced translation technique that is how a word sounds in the source language, considered as a recommended translation practice it also carries forward the word’s original for centuries. It translates a foreign word by preserving both its original pronunciation and semantic attributes. This paper attempts to 2 automate the semantic transliteration meaning. For example, Xu Guangqi translated process for the first time. We conduct an geo- in geometry into Chinese as 几何 /Ji-He/, inquiry into the feasibility of semantic which carries the pronunciation of geo- and transliteration and propose a probabilistic expresses the meaning of “a science concerned model for transliterating personal names in with measuring the earth”. Latin script into Chinese. The results show Many of the loanwords exist in today’s Chinese that semantic transliteration substantially through semantic transliteration, which has been and consistently improves accuracy over well received (Hu and Xu, 2003; Hu, 2004) by the phonetic transliteration in all the people because of many advantages. Here we just experiments. name a few. (1) It brings in not only the sound, but also the meaning that fills in the semantic blank 1 Introduction left by phonetic transliteration. This also reminds people that it is a loanword and avoids misleading; The study of Chinese transliteration dates back to (2) It provides etymological clues that make it easy the seventh century when Buddhist scriptures were to trace back to the root of the words. For example, translated into Chinese. The earliest bit of Chinese a transliterated Japanese name will maintain its translation theory related to transliteration may be Japanese identity in its Chinese appearance; (3) It the principle of “Names should follow their evokes desirable associations, for example, an bearers, while things should follow Chinese.” In English girl’s name is transliterated with Chinese other words, names should be transliterated, while characters that have clear feminine association, things should be translated according to their thus maintaining the gender identity. meanings. The same theory still holds today. Transliteration has been practiced in several 1 Hereafter, Chinese characters are also denoted in Pinyin ro- ways, including phonetic transliteration and manization system, for ease of reference. phonetic-semantic transliteration. By phonetic 2 Xu Quangqi (1562–1633) translated The Original Manu- script of Geometry to Chinese jointly with Matteo Ricci. 120 Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 120–127, Prague, Czech Republic, June 2007. c 2007 Association for Computational Linguistics Unfortunately, most of the reported work in the and Yamamoto Akiko are transliterated into 乔治 area of machine transliteration has not ventured 布什 and 山本 亚喜子 that arouse to the into semantic transliteration yet. The Latin-scripted following associations: 乔治 /Qiao-Zhi/ - male personal names are always assumed to given name, English origin; 布什 /Bu-Shi/ - homogeneously follow the English phonic rules in 山本 automatic transliteration (Li et al., 2004). surname, English origin; /Shan-Ben/ - Therefore, the same transliteration model is surname, Japanese origin; 亚喜子 /Ya-Xi-Zi/ - applied to all the names indiscriminatively. This female given name, Japanese origin. assumption degrades the performance of In Section 2, we summarize the related work. In transliteration because each language has its own Section 3, we discuss the linguistic feasibility of phonic rule and the Chinese characters to be semantic transliteration for personal names. adopted depend on the following semantic Section 4 formulates a probabilistic model for attributes of a foreign name. semantic transliteration. Section 5 reports the (1) Language of origin: An English word is not experiments. Finally, we conclude in Section 6. necessarily of pure English origin. In English news reports about Asian happenings, an English 2 Related Work personal name may have been originated from In general, computational studies of transliteration Chinese, Japanese or Korean. The language origin fall into two categories: transliteration modeling affects the phonic rules and the characters to be 3 and extraction of transliteration pairs. In used in transliteration . For example, a Japanese transliteration modeling, transliteration rules are name Matsumoto should be transliterated as 松本 trained from a large, bilingual transliteration /Song-Ben/, instead of 马茨莫托 /Ma-Ci-Mo-Tuo/ lexicon (Lin and Chen, 2002; Oh and Choi, 2005), as if it were an English name. with the objective of translating unknown words (2) Gender association: A given name typically on the fly in an open, general domain. In the implies a clear gender association in both the extraction of transliterations, data-driven methods source and target languages. For example, the are adopted to extract actual transliteration pairs Chinese transliterations of Alice and Alexandra from a corpus, in an effort to construct a large, up- are 爱丽丝 /Ai-Li-Si/ and 亚历山大 /Ya-Li-Shan- to-date transliteration lexicon (Kuo et al., 2006; Da/ respectively, showing clear feminine and Sproat et al., 2006). masculine characteristics. Transliterating Alice as Phonetic transliteration can be considered as an 埃里斯 /Ai-Li-Si/ is phonetically correct, but extension to the traditional grapheme-to-phoneme semantically inadequate due to an improper gender (G2P) conversion (Galescu and Allen, 2001), association. which has been a much-researched topic in the (3) Surname and given name: The Chinese name field of speech processing. If we view the system is the original pattern of names in Eastern grapheme and phoneme as two symbolic Asia such as China, Korea and Vietnam, in which representations of the same word in two different a limited number of characters 4 are used for languages, then G2P is a transliteration task by surnames while those for given names are less itself. Although G2P and phonetic transliteration restrictive. Even for English names, the character are common in many ways, transliteration has its set for given name transliterations are different unique challenges, especially as far as E-C from that for surnames. transliteration is concerned. E-C transliteration is Here are two examples of semantic the conversion between English graphemes, transliteration for personal names. George Bush phonetically associated English letters, and Chinese graphemes, characters which represent ideas or meanings. As a Chinese transliteration can 3 In the literature (Knight and Graehl,1998; Qu et al., 2003), arouse to certain connotations, the choice of translating romanized Japanese or Chinese names to Chinese Chinese characters becomes a topic of interest (Xu characters is also known as back-transliteration. For simplic- et al., 2006). ity, we consider all conversions from Latin-scripted words to Chinese as transliteration in this paper. Semantic transliteration can be seen as a subtask 4 The 19 most common surnames cover 55.6% percent of the of statistical machine translation (SMT) with Chinese population (Ning and Ning 1995). 121 monotonic word ordering. By treating a Pinyin to character dictionary for Chinese names. letter/character as a word and a group of The entries are classified into surname, male and letters/characters as a phrase or token unit in SMT, female given name categories. The E-C corpus also one can easily apply the traditional SMT models, contains some entries without gender/surname such as the IBM generative model (Brown et al., labels, referred to as unclassified. 1993) or the phrase-based translation model (Crego E-C J-C5 C-C6 et al., 2005) to transliteration. In transliteration, we face similar issues as in SMT, such as lexical Surname (S) 12,490 36,352 569,403 mapping and alignment. However, transliteration is Given name (M) 3,201 35,767 345,044 also different from general SMT in many ways. Given name (F) 4,275 11,817 122,772 Unlike SMT where we aim at optimizing the Unclassified 22,562 - - semantic transfer, semantic transliteration needs to All 42,528 83,936 1,972,851 maintain the phonetic equivalence as well. Table 1: Number of entries in 3 corpora In computational linguistic literature, much Phonetic transliteration has not been a problem effort has been devoted to phonetic transliteration, as Chinese has over 400 unique syllables that are such as English-Arabic, English-Chinese (Li et al., enough to approximately transcribe all syllables in 2004), English-Japanese (Knight and Graehl, other languages. Different Chinese characters may 1998) and English-Korean.