Name Origin Recognition Using Maximum Entropy Model and Diverse Features
Total Page:16
File Type:pdf, Size:1020Kb
Name Origin Recognition Using Maximum Entropy Model and Diverse Features Min Zhang1, Chengjie Sun2, Haizhou Li1, Aiti Aw1, Chew Lim Tan3, Xiaolong Wang2 1Institute for Infocomm 2Harbin Institute of 3National University of Research, Singapore Technology, China Singapore, Singapore {mzhang,hli,aaiti} {cjsun,wangxl} tancl@comp. @i2r.a-star.edu.sg @insun.hit.edu.cn nus.edu.sg Chinese form a pair of transliteration and back- Abstract transliteration. In many natural language process- ing tasks, such as machine translation and cross- Name origin recognition is to identify the lingual information retrieval, automatic name source language of a personal or location transliteration has become an indispensable com- name. Some early work used either rule- ponent. based or statistical methods with single Name origin refers to the source language of a knowledge source. In this paper, we cast the name where it originates from. For example, the name origin recognition as a multi-class origin of the English name “Smith” and its Chi- classification problem and approach the nese transliteration “史密斯 (Shi-Mi-Si)” is Eng- problem using Maximum Entropy method. In doing so, we investigate the use of differ- lish, while both “Tokyo” and “东京 (Dong-Jing)” ent features, including phonetic rules, n- are of Japanese origin. Following are examples of gram statistics and character position infor- different origins of a collection of English-Chinese mation for name origin recognition. Ex- transliterations. periments on a publicly available personal English: Richard-理查德 (Li-Cha-De) name database show that the proposed ap- proach achieves an overall accuracy of Hackensack-哈肯萨克(Ha-Ken- 98.44% for names written in English and Sa-Ke) 98.10% for names written in Chinese, which Chinese: Wen JiaBao-温家宝(Wen-Jia- are significantly and consistently better than Bao) those in reported work. ShenZhen–深圳(Shen-Zhen) Japanese: Matsumoto-松本 (Song-Ben) 1 Introduction Hokkaido-北海道(Bei-Hai-Dao) Korean: Many technical terms and proper names, such as Roh MooHyun-卢武铉(Lu-Wu- personal, location and organization names, are Xuan) translated from one language into another with Taejon-大田(Da-Tian) approximate phonetic equivalents. The phonetic Vietnamese: Phan Van Khai-潘文凯(Pan- translation practice is referred to as transliteration; Wen-Kai) conversely, the process of recovering a word in its Hanoi-河内(He-Nei) native language from a transliteration is called as back-transliteration (Zhang et al, 2004; Knight In the case of machine transliteration, the name and Graehl, 1998). For example, English name origins dictate the way we re-write a foreign word. “Smith” and “史密斯 (Pinyin 1 : Shi-Mi-Si)” in For example, given a name written in English or Chinese for which we do not have a translation in 1 Hanyu Pinyin, or Pinyin in short, is the standard romaniza- tion system of Chinese. In this paper, Pinyin is given next to Chinese characters in round brackets for ease of reading. 56 a English-Chinese dictionary, we first have to de- Section 4 presents our experimental setup and re- cide whether the name is of Chinese, Japanese, ports our experimental results. Finally, we con- Korean or some European/English origins. Then clude the work in section 5. we follow the transliteration rules implied by the origin of the source name. Although all English 2 Related Work personal names are rendered in 26 letters, they Most of previous work focuses mainly on ENOR may come from different romanization systems. although same methods can be extended to CNOR. Each romanization system has its own rewriting We notice that there are two informative clues that rules. English name “Smith” could be directly used in previous work in ENOR. One is the lexical transliterated into Chinese as “史密斯(Shi-Mi-Si)” structure of a romanization system, for example, since it follows the English phonetic rules, while Hanyu Pinyin, Mandarin Wade-Giles, Japanese the Chinese translation of Japanese name “Koi- Hepbrun or Korean Yale, each has a finite set of zumi” becomes “小泉(Xiao-Quan)” following the syllable inventory (Li et al., 2006). Another is the Japanese phonetic rules. The name origins are phonetic and phonotactic structure of a language, equally important in back-transliteration practice. such as phonetic composition, syllable structure. Li et al. (2007) incorporated name origin recogni- For example, English has unique consonant tion to improve the performance of personal name clusters such as /str/ and /ks/ which Chinese, transliteration. Besides multilingual processing, Japanese and Korean (CJK) do not have. the name origin also provides useful semantic in- Considering the NOR solutions by the use of these formation (regional and language information) for two clues, we can roughly group them into two common NLP tasks, such as co-reference resolu- categories: rule-based methods (for solutions tion and name entity recognition. based on lexical structures) and statistical methods Unfortunately, little attention has been given to (for solutions based on phonotactic structures). name origin recognition (NOR) so far in the litera- ture. In this paper, we are interested in two kinds Rule-based Method of name origin recognition: the origin of names Kuo and Yang (2004) proposed using a rule- written in English (ENOR) and the origin of based method to recognize different romanization names written in Chinese (CNOR). For ENOR, system for Chinese only. The left-to-right longest the origins include English (Eng), Japanese (Jap), match-based lexical segmentation was used to Chinese Mandarin Pinyin (Man) and Chinese Can- parse a test word. The romanization system is con- tonese Jyutping (Can). For CNOR, they include firmed if it gives rise to a successful parse of the three origins: Chinese (Chi, for both Mandarin and test word. This kind of approach (Qu and Grefen- Cantonese), Japanese and English (refer to Latin- stette, 2004) is suitable for romanization systems scripted language). that have a finite set of discriminative syllable in- Unlike previous work (Qu and Grefenstette, ventory, such as Pinyin for Chinese Mandarin. For 2004; Li et al., 2006; Li et al., 2007) where NOR the general tasks of identifying the language origin was formulated with a generative model, we re- and romanization system, rule based approach gard the NOR task as a classification problem. We sounds less attractive because not all languages further propose using a discriminative learning have a finite set of discriminative syllable inven- algorithm (Maximum Entropy model: MaxEnt) to tory. solve the problem. To draw direct comparison, we Statistical Method conduct experiments on the same personal name 1) N-gram Sum Method (SUM): Qu and Gre- corpora as that in the previous work by Li et al. fenstette (2004) proposed a NOR identifier using a (2006). We show that the MaxEnt method effec- trigram language model (Cavnar and Trenkle, tively incorporates diverse features and outper- 1994) to distinguish personal names of three lan- forms previous methods consistently across all test guage origins, namely Chinese, Japanese and Eng- cases. lish. In their work, the training set includes 11,416 The rest of the paper is organized as follows: in Chinese name entries, 83,295 Japanese name en- section 2, we review the previous work. Section 3 tries and 88,000 English name entries. However, elaborates our proposed approach and the features. the trigram is defined as the joint probabil- 57 ity pcc()ii−−12 c i for 3-character ccii−−12 c i rather than for both ENOR and CNOR. We explore and inte- the commonly used conditional probabil- grate multiple features into the discriminative clas- sifier and use a common dataset for benchmarking. ity pc(|iii c−−12 c ). Therefore, the so-called trigram in Qu and Grefenstette (2004) is basically a sub- Experimental results show that the MaxEnt model string unigram probability, which we refer to as effectively incorporates diverse features to demon- the n-gram (n-character) sum model (SUM) in this strate competitive performance. paper. Suppose that we have the unigram count 3 MaxEnt Model and Features Ccc()ii−−12 c i for character substring ccii−−12 c i , the unigram is then computed as: 3.1 MaxEnt Model for NOR Ccc()ii−−12 c i pcc()−− c = (1) The principle of maximum entropy (MaxEnt) ii12 i Ccc() c ∑ icc, c ii−−12 i model is that given a collection of facts, choose a ii−−12 i model consistent with all the facts, but otherwise which is the count of character substring cc−− c ii12 i as uniform as possible (Berger et al., 1996). Max- normalized by the sum of all 3-character string Ent model is known to easily combine diverse fea- counts in the name list for the language of interest. tures. For this reason, it has been widely adopted For origin recognition of Japanese names, this in many natural language processing tasks. The method works well with an accuracy of 92%. MaxEnt model is defined as: However, for English and Chinese, the results are K 1 f (,)cx far behind with a reported accuracy of 87% and = α ji pc(|)ij x ∏ (3) 70% respectively. Z j=1 2) N-gram Perplexity Method (PP): Li et al. NNK f (,)cx ==α ji (4) (2006) proposed using n-gram character perplexity Zpcx∑∑(|)ij∏ ii==11j=1 PPc to identify the origin of a Latin-scripted name. where c is the outcome label, x is the given obser- Using bigram, the PPc is defined as: i N − 1 c logpcc ( | ) ∑ = ii1− vation, also referred to as an instance. Z is a nor- PP = 2 Nc i 1 (2) c malization factor. N is the number of outcome where Nc is the total number of characters in the labels, the number of language origins in our case. test name, c is the ith character in the test name.