Back Transliteration from Japanese to English Using Target English Context

Isao Goto†, Naoto Kato††, Terumasa Ehara†††, and Hideki Tanaka†

†NHK Science and Technical Research Laboratories, 1-11-10 Kinuta, Setagaya, Tokyo, 157-8510, Japan
††ATR Spoken Language Translation Research Laboratories, 2-2-2 Hikaridai, Keihanna Science City, Kyoto, 619-0288, Japan
†††Tokyo University of Science, Suwa, 5000-1, Toyohira, Chino, Nagano, 391-0292, Japan

goto.[email protected]  [email protected]  [email protected]  [email protected]

Abstract

This paper proposes a method of automatic back transliteration of proper nouns, in which a Japanese transliterated word is restored to the original English word. The English words are created from a sequence of letters; thus our method can create new English words that are not registered in dictionaries or English word lists. When a katakana character is converted into English letters, there are various candidates of alphabetic characters. To ensure adequate conversion, the proposed method uses the target English context to calculate the probability of an English character or string corresponding to a Japanese katakana character or string. We confirmed the effectiveness of using the target English context by an experiment of personal-name back transliteration.

1 Introduction

In transliteration, a word in one language is converted into a character string of another language expressing how it is pronounced. In the case of transliteration into Japanese, special characters called katakana are used to show how a word is pronounced. For example, a personal name and its transliterated word are shown below.

[Transliteration]
Cunningham → カニンガム (ka ni n ga mu)

Here, the italic alphabets are romanized Japanese katakana characters.

New transliterated words such as personal names or technical terms in katakana are not always listed in dictionaries. It would be useful for cross-language information retrieval if these words could be automatically restored to the original English words.

Back transliteration is the process of restoring transliterated words to the original English words. Here is a problem of back transliteration.

[Back transliteration]
クラッチフィールド (ku ra cchi fi- ru do) → ? (English word)

There are many ambiguities in restoring a transliterated katakana word to its original English word. For example, should "a" in "ku ra cchi fi- ru do" be converted into the English letter "a" or "u" or some other letter or string? Resolving this ambiguity is a difficult problem, which means that back transliteration to the correct English word is also difficult.

Using the pronunciation in a dictionary or limiting output English words to a particular English word list prepared in advance can simplify the problem of back transliteration. However, these methods cannot produce a new English word that is not registered in a dictionary or an English word list. Transliterated words are mainly proper nouns and technical terms, and such words are often not registered. Thus, a back transliteration framework capable of creating new words would be very useful.

A number of back transliteration methods for selecting English words from an English pronunciation dictionary have been proposed. They include Japanese-to-English (Knight and Graehl, 1998)[1], Arabic-to-English (Stalls and Knight, 1998), and Korean-to-English (Lin and Chen, 2002). There are also methods that select English words from an English word list, e.g., Japanese-to-English (Fujii and Ishikawa, 2001) and Chinese-to-English (Chen et al., 1998). Moreover, among back transliteration methods capable of generating new words, there are methods for back transliteration from Korean to English (Jeong et al., 1999; Kang and Choi, 2000).

[1] Their English letter-to-sound WFST does not convert English words that are not registered in a pronunciation dictionary.

These previous works did not take the target English context into account when calculating the plausibility of matching target characters with the source characters.

This paper presents a method that takes the target English context into account to generate an English word from a Japanese katakana word. Our character-based method can produce new English words that are not listed in the learning corpus.

This paper is organized as follows. Section 2 describes our method. Section 3 describes the experimental set-up and results. Section 4 discusses the performance of our method based on the experimental results. Section 5 concludes our research.

2 Proposed Method

2.1 Advantage of using English context

First we explain the difficulty of back transliteration without a pronunciation dictionary. Next, we clarify the reason for the difficulty. Finally, we clarify the effect of using English context in back transliteration.

In back transliteration, an English letter or string is chosen to correspond to a katakana character or string. However, this decision is difficult. For example, there are cases in which the English letter "u" corresponds to "a" of katakana, and there are cases in which the same English letter "u" does not correspond to "a" of katakana: "u" in Cunningham corresponds to "a" in katakana, while "u" in Bush does not. It is difficult to resolve this ambiguity without the pronunciation registered in a dictionary.

The difference in correspondence mainly comes from the difference of the letters around the English letter "u." The correspondence of an English letter or string to a katakana character or string varies depending on the surrounding characters, i.e., on its English context.

Thus, our back transliteration method uses the target English context to calculate the probability of English letters corresponding to a katakana character or string.

2.2 Notation and conversion-candidate lattice

We formulate the word conversion process as a unit conversion process so that new words can be treated. Here, a unit is one or more characters that form a part of the word.

A katakana word, K, is expressed by equation 2.1, with "^" and "$" added to its start and end, respectively:

  K = k_0^{m+1} = k_0 k_1 \cdots k_{m+1},   (2.1)
  k_0 = ^,  k_{m+1} = $,   (2.2)

where k_j is the j-th character in the katakana word, m is the number of characters excluding "^" and "$", and k_0^{m+1} is the character string from k_0 to k_{m+1}.

We use katakana units constructed of one or more katakana characters. We denote a katakana unit as ku. For any ku, many English units, eu, can correspond as conversion candidates. The ku's and eu's are generated from a learning corpus in which bilingual words are separated into units and every ku unit is related to an eu unit.

L{E} denotes the lattice of all eu's corresponding to the ku's covering a Japanese word. Every eu is a node of the lattice, and each node is connected with the next nodes. L{E} has a lattice structure starting from "^" and ending at "$". Figure 1 shows an example of L{E} corresponding to the katakana word キルシュシュタイン (ki ru shu shu ta i n). In the figure, each circle represents one eu.

A character string linking individual character units along a path p_d ∈ (p_1, p_2, .., p_q) between "^" and "$" in L{E} becomes a conversion candidate, where q is the number of paths between "^" and "$" in L{E}. We obtain English word candidates by joining eu's from "^" to "$" in L{E}. We select a certain path, p_d, in L{E}.
[Figure 1 appears here: for each katakana unit of キ(ki), ル(ru), シュ(shu), シュ(shu), タ(ta), タイ(tai), イ(i), and ン(n), a column of candidate English units is shown between the start node "^" and the end node "$".]

Figure 1: Example of the lattice L{E} of conversion-candidate units.
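To make the lattice construction concrete, the sketch below builds L{E} for a katakana word from a small ku-to-eu correspondence table and joins eu's along every path from "^" to "$". The unit table, the function names, and the exhaustive enumeration are illustrative assumptions for this sketch; the actual units are learned from the aligned corpus described in section 2.6, and the paper scores and prunes paths rather than enumerating them all.

```python
from collections import defaultdict

# Hypothetical ku -> eu correspondence table (illustrative only; the real
# candidates are the units learned from the aligned Japanese-English corpus).
UNIT_TABLE = {
    "キ": ["ki", "ky", "chi"],
    "ル": ["l", "le", "ll", "r", "ru"],
    "シュ": ["sh", "sch", "shu", "she"],
    "タ": ["ta", "tta"],
    "タイ": ["ty", "tai"],
    "イ": ["i", "e", "y"],
    "ン": ["n", "nn", "m"],
}

def build_lattice(katakana):
    """arcs[i] holds (j, eu) pairs: English unit eu is a candidate for the
    katakana unit spanning positions i..j-1 of the word."""
    arcs = defaultdict(list)
    for i in range(len(katakana)):
        for ku, eus in UNIT_TABLE.items():
            if katakana.startswith(ku, i):
                arcs[i].extend((i + len(ku), eu) for eu in eus)
    return arcs

def candidate_words(katakana):
    """Join eu's along every path from '^' to '$' to obtain candidate words."""
    arcs, results = build_lattice(katakana), []
    def walk(i, prefix):
        if i == len(katakana):
            results.append(prefix)
        for j, eu in arcs[i]:
            walk(j, prefix + eu)
    walk(0, "")
    return results

print(candidate_words("キルシュシュタイン")[:5])
```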

The number of character units excluding "^" and "$" in p_d is expressed as n(p_d). The character units in p_d are numbered from start to end.

The English word, E, resulting from the conversion of a katakana word, K, along p_d is expressed as follows:

  K = k_0^{m+1} = k_0 k_1 \cdots k_{m+1} = ku_0^{n(p_d)+1} = ku_0 ku_1 \cdots ku_{n(p_d)+1},   (2.3)
  E = e_0^{l(p_d)+1} = e_0 e_1 \cdots e_{l(p_d)+1} = eu_0^{n(p_d)+1} = eu_0 eu_1 \cdots eu_{n(p_d)+1},   (2.4)
  k_0 = e_0 = ku_0 = eu_0 = ^,  k_{m+1} = e_{l(p_d)+1} = ku_{n(p_d)+1} = eu_{n(p_d)+1} = $,   (2.5)

where e_j is the j-th character in the English word and l(p_d) is the number of characters excluding "^" and "$" in the English word. eu_0^{n(p_d)+1} for each p_d in L{E} in equation 2.4 becomes a candidate English word. ku_0^{n(p_d)+1} in equation 2.3 shows the sequence of katakana units.

2.3 Probability models using target English context

To determine the corresponding English word for a katakana word, the following equation 2.6 must be calculated:

  \hat{E} = \arg\max_E P(E | K).   (2.6)

Here, \hat{E} represents an output result.

To use the English context for calculating the matching of an English unit with a katakana unit, the above equation is transformed into equation 2.7 by using Bayes' theorem:

  \hat{E} = \arg\max_E P(E) P(K | E).   (2.7)

Equation 2.7 contains a translation model P(K | E), in which an English word is the condition and katakana is the result. The word in the translation model P(K | E) in equation 2.7 is broken down into character units by using equations 2.3 and 2.4:

  \hat{E} = \arg\max_E P(E) \sum_{eu_0^{n(p_d)+1}} \sum_{ku_0^{n(p_d)+1}} P(K, ku_0^{n(p_d)+1}, eu_0^{n(p_d)+1} | E)
          = \arg\max_E P(E) \sum_{eu_0^{n(p_d)+1}} \sum_{ku_0^{n(p_d)+1}} \{ P(K | ku_0^{n(p_d)+1}, eu_0^{n(p_d)+1}, E)
            \times P(ku_0^{n(p_d)+1} | eu_0^{n(p_d)+1}, E) P(eu_0^{n(p_d)+1} | E) \}.   (2.8)

eu_0^{n(p_d)+1} includes the information of E, and K is affected only by ku_0^{n(p_d)+1}. Thus equation 2.8 can be rewritten as follows:

  \hat{E} = \arg\max_E P(E) \sum_{eu_0^{n(p_d)+1}} \sum_{ku_0^{n(p_d)+1}} \{ P(K | ku_0^{n(p_d)+1})
            \times P(ku_0^{n(p_d)+1} | eu_0^{n(p_d)+1}) P(eu_0^{n(p_d)+1} | E) \}.   (2.9)

P(K | ku_0^{n(p_d)+1}) is 1 when the string of K and ku_0^{n(p_d)+1} are the same, and the string ku_0^{n(p_d)+1} of every path in the lattice is the same as the string of K. Thus, P(K | ku_0^{n(p_d)+1}) is always 1.

We approximate the sum over paths by selecting the maximum path:

  \hat{E} \approx \arg\max_E P(E) P(ku_0^{n(p_d)+1} | eu_0^{n(p_d)+1}) \times P(eu_0^{n(p_d)+1} | E).   (2.10)

We show an instance of each probability model with a concrete value as follows:

P(E):
  P(^Crutchfield$),

P(K | ku_0^{n(p_d)+1}):
  P(^クラッチフィールド$ | ^/ク/ラ/ッチ/フィー/ル/ド/$)
  (ku ra cchi fi- ru do | ku / ra / cchi / fi- / ru / do),

P(ku_0^{n(p_d)+1} | eu_0^{n(p_d)+1}):
  P(^/ク/ラ/ッチ/フィー/ル/ド/$ | ^/C/ru/tch/fie/l/d/$)
  (ku / ra / cchi / fi- / ru / do),

P(eu_0^{n(p_d)+1} | E):
  P(^/C/ru/tch/fie/l/d/$ | ^Crutchfield$).
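For illustration, the sketch below scores one lattice path with the three factors of equation 2.10 in log space, treating each probability model as a black-box function. The toy model functions and their values are placeholders, not the maximum-entropy models actually trained in this paper.

```python
import math

def path_plausibility(english, ku_units, eu_units,
                      language_model, translation_model, chunking_model):
    """log P(E) + log P(ku_0^{n+1} | eu_0^{n+1}) + log P(eu_0^{n+1} | E),
    i.e. the decomposition of equation 2.10 for one candidate path."""
    return (math.log(language_model(english))
            + math.log(translation_model(ku_units, eu_units))
            + math.log(chunking_model(eu_units, english)))

# Toy stand-ins for the three models (arbitrary illustrative values).
lm = lambda e: 1e-9          # P(^Crutchfield$)
tm = lambda ku, eu: 1e-4     # P(^/ク/ラ/ッチ/フィー/ル/ド/$ | ^/C/ru/tch/fie/l/d/$)
cm = lambda eu, e: 1e-2      # P(^/C/ru/tch/fie/l/d/$ | ^Crutchfield$)

score = path_plausibility("^Crutchfield$",
                          ["^", "ク", "ラ", "ッチ", "フィー", "ル", "ド", "$"],
                          ["^", "C", "ru", "tch", "fie", "l", "d", "$"],
                          lm, tm, cm)
print(score)
```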

We broke down the language model P(E) in equation 2.10 into letters:

  P(E) \approx \prod_{j=1}^{l(p_d)+1} P(e_j | e_{j-a}^{j-1}).   (2.11)

Here, a is a constant. Equation 2.11 is an (a+1)-gram model of English letters.

Next, we approximate the translation model P(ku_0^{n(p_d)+1} | eu_0^{n(p_d)+1}) and the chunking model P(eu_0^{n(p_d)+1} | E). For this, we use our previously proposed approximation technique (Goto et al., 2003). The outline of the technique is as follows.

P(ku_0^{n(p_d)+1} | eu_0^{n(p_d)+1}) is approximated by reducing the condition:

  P(ku_0^{n(p_d)+1} | eu_0^{n(p_d)+1}) = \prod_{i=1}^{n(p_d)+1} P(ku_i | ku_0^{i-1}, eu_0^{n(p_d)+1})
                                      \approx \prod_{i=1}^{n(p_d)+1} P(ku_i | e_{start(i)-b}^{start(i)-1}, eu_i, e_{end(i)+1}),   (2.12)

where start(i) is the first position of the i-th character unit eu_i, end(i) is the last position of the i-th character unit eu_i, and b is a constant. Equation 2.12 takes the English context e_{start(i)-b}^{start(i)-1} and e_{end(i)+1} into account.

Next, the chunking model P(eu_0^{n(p_d)+1} | E) is transformed. All chunking patterns of E = e_0^{l(p_d)+1} into eu_0^{n(p_d)+1} are denoted by the l(p_d)+1 points between the l(p_d)+2 characters that serve or do not serve as delimiters. eu_0 and eu_{n(p_d)+1} are determined in advance, so l(p_d)-1 points remain ambiguous. We represent the value that is a delimiter or a non-delimiter between e_j and e_{j+1} by z_j. We call z_j the delimiter distinction:

  z_j = \{ delimiter, non-delimiter \}.   (2.13)

Here, we show an example of English units using z_j:

  English units:                 C / ru / tch / fie / l / d
  Letters (e_1 .. e_11):         C  r  u  t  c  h  f  i  e  l  d
  Values of z_j (z_1 .. z_10):      1  0  1  0  0  1  0  0  1  1

In this example, a delimiter z_j is represented by 1 and a non-delimiter is represented by 0.

The chunking model is transformed into per-character processing by using z_j, and we reduce the condition:

  P(eu_0^{n(p_d)+1} | E) = P(z_0^{l(p_d)-1} | e_0^{l(p_d)+1})
                        = \prod_{j=1}^{l(p_d)-1} P(z_j | z_0^{j-1}, e_0^{l(p_d)+1})
                        \approx \prod_{j=1}^{l(p_d)-1} P(z_j | z_{j-c-1}^{j-1}, e_{j-c}^{j+1}).   (2.14)

The conditional information e_{j-c}^{j+1} is as many as c characters and 1 character before and after z_j, respectively. The conditional information z_{j-c-1}^{j-1} is as many as c+1 delimiter distinctions before z_j.
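As a small check of the delimiter distinction z_j, the sketch below recovers the values z_1 .. z_{l(p_d)-1} from a unit segmentation of an English word (1 for a delimiter, 0 for a non-delimiter); the function is our own illustration of the notation, not part of the proposed models.

```python
def delimiter_distinctions(eu_units):
    """Given the English units between '^' and '$' (e.g. C/ru/tch/fie/l/d),
    return z_1 .. z_{l-1}, where z_j = 1 if a unit boundary lies between
    letters e_j and e_{j+1}, and 0 otherwise."""
    boundaries, pos = set(), 0
    for unit in eu_units:
        pos += len(unit)
        boundaries.add(pos)   # a boundary falls after the last letter of each unit
    return [1 if j in boundaries else 0 for j in range(1, pos)]

# Crutchfield chunked as C/ru/tch/fie/l/d
print(delimiter_distinctions(["C", "ru", "tch", "fie", "l", "d"]))
# -> [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]
```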

By using equations 2.11, 2.12, and 2.14, equation 2.10 becomes as follows:

  \hat{E} \approx \arg\max_E \prod_{j=1}^{l(p_d)+1} P(e_j | e_{j-a}^{j-1})
          \times \prod_{i=1}^{n(p_d)} P(ku_i | e_{start(i)-b}^{start(i)-1}, eu_i, e_{end(i)+1})
          \times \prod_{j=1}^{l(p_d)-1} P(z_j | z_{j-c-1}^{j-1}, e_{j-c}^{j+1}).   (2.15)

Equation 2.15 is the equation of our back transliteration method.

2.4 Beam search solution for context-sensitive grammar

Equation 2.15 includes a context-sensitive grammar. As such, it cannot be carried out efficiently. In decoding from the head of a word to the tail, e_{end(i)+1} in equation 2.15 is context-sensitive. Thus we obtain approximate results by using a beam search solution. To get the results, we use dynamic programming. Every eu node in the lattice keeps the N-best results, evaluated by using the letter of e_{end(i)+1} that gives the maximum probability among the possible next letters. When the results of the next node are evaluated for selecting its N-best, the accurate probabilities from the previous nodes are used.
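The following sketch shows one way to realize the N-best beam search over the conversion-candidate lattice: positions are processed from the head of the word to the tail, each lattice node keeps at most N partial results, and the step score stands in for the per-unit factors of equation 2.15 (the context-sensitive look-ahead on e_{end(i)+1} is folded into that black-box function here). The lattice format and function interfaces are assumptions made for the sketch, not the authors' implementation.

```python
import heapq

def beam_search(arcs, word_length, step_score, n_best=50):
    """arcs[i]: list of (j, eu) pairs, meaning English unit eu covers katakana
    positions i..j-1.  step_score(prefix, eu): log-probability contribution of
    appending eu to the English prefix (language, translation and chunking
    factors of equation 2.15 combined).  Each position keeps its N-best
    hypotheses; hypotheses reaching the end of the word are returned."""
    beams = {0: [(0.0, "^")]}            # position -> [(log_prob, English prefix)]
    for i in range(word_length):
        # prune the hypotheses at this node before extending them
        for log_p, prefix in heapq.nlargest(n_best, beams.get(i, [])):
            for j, eu in arcs.get(i, []):
                beams.setdefault(j, []).append(
                    (log_p + step_score(prefix, eu), prefix + eu))
    return [(p, prefix + "$")
            for p, prefix in heapq.nlargest(n_best, beams.get(word_length, []))]
```

Log probabilities are used so the per-unit factors of equation 2.15 can simply be summed along a path.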
2.5 Learning probability models based on the maximum entropy method

The probability models are learned based on the maximum entropy method. This makes it possible to prevent data sparseness in the models as well as to efficiently utilize many conditions, such as context, simultaneously. We use the Gaussian prior (Chen and Rosenfeld, 1999) smoothing method for the language model. We use one Gaussian variance, and we use the value of the Gaussian variance that minimizes the test set's perplexity.

The feature functions of the models based on the maximum entropy method are defined as combinations of letters. In addition, we use vowel, consonant, and semi-vowel classes for the translation model. We manually define the combinations of letter positions, such as e_j and e_{j-1}. The feature functions consist of the letter combinations that meet the combinations of the letter positions and are observed at least once in the learning data.

2.6 Corpus for learning

A Japanese-English word list aligned by unit was used for learning the translation model and the chunking model and for generating the lattice of conversion candidates. The alignment was done semi-automatically. A romanized katakana character usually corresponds to one of several English letters or strings. For example, a romanized katakana character "k" usually corresponds to an English letter "c," "k," "ch," or "q." With such heuristic rules, the Japanese-English word corpus could be aligned by unit, and the alignment errors were corrected manually.
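To illustrate the kind of heuristic rule used for the semi-automatic alignment, the sketch below checks which English letters or strings a romanized katakana unit may correspond to at a given position of the English word. Only the "k" row is taken from the example above; the other rows and the function itself are illustrative assumptions.

```python
# Heuristic correspondences between the initial romanized katakana consonant
# and English letters/strings (only the "k" row comes from the paper; the
# others are assumed for illustration).
ROMAJI_TO_ENGLISH = {
    "k": ["c", "k", "ch", "q"],
    "s": ["s", "c", "ss"],
    "r": ["r", "l", "rr"],
}

def propose_units(romaji_unit, english_word, position):
    """English substrings at `position` that the heuristic allows as the
    counterpart of this romanized katakana unit."""
    candidates = ROMAJI_TO_ENGLISH.get(romaji_unit[0], [])
    return [e for e in candidates
            if english_word.lower().startswith(e, position)]

print(propose_units("ku", "Crutchfield", 0))   # -> ['c']
```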

3 Experiment

3.1 Learning data and test data

We conducted an experiment on back transliteration using English personal names. The learning data used in the experiment are described below.

The Dictionary of Western Names of 80,000 People[2] was used as the source of the Japanese-English word corpus. We chose the names in alphabet from A to Z and their corresponding katakana. The number of distinct words was 39,830 for English words and 39,562 for katakana words. The number of English-katakana pairs was 83,057[3]. We related the alphabet and katakana character units in those words by using the method described in section 2.6. We then used the corpus to build the translation and chunking models and to generate the lattice of conversion candidates.

The learning of the language model was carried out using a word list created by merging two word lists: an American personal-name list[4] and the English head words of the Dictionary of Western Names of 80,000 People. The American name list contains frequency information for each name; we also used the frequency data for the learning of the language model. A test set for evaluating the value of the Gaussian variance was created using the American name list. The list was split 9:1, and we used the larger part for learning and the smaller part for evaluating the parameter value.

The test data is as follows. The test data contained 333 katakana name words of American Cabinet officials and other high-ranking officials, as well as high-ranking governmental officials of Canada, the United Kingdom, Australia, and New Zealand (listed in the World Yearbook 2002 published by Kyodo News in Japan). The English name words that were listed along with the corresponding katakana names were used as answer words. Words that included characters other than the letters A to Z were excluded from the test data. Family names and first names were not distinguished.

[2] Published by Nichigai Associates in Japan in 1994.
[3] This corpus includes many identical English-katakana word pairs.
[4] Prepared from the 1990 Census conducted by the U.S. Department of Commerce. Available at http://www.census.gov/genealogy/names/ . The list includes 91,910 distinct words.

3.2 Experimental models

We used the following methods to test the individual effects of each factor of our method.

• Method A
Used a model that did not take English context into account. The plausibility is expressed as follows:

  \hat{E} = \arg\max_E \prod_{i=1}^{n(p_d)} P(eu_i | ku_i).   (3.1)

• Method B
Used our language model and a translation model that did not consider English context. The constant a = 3 in the language model. The plausibility is expressed as follows:

  \hat{E} = \arg\max_E \prod_{j=1}^{l(p_d)+1} P(e_j | e_{j-3}^{j-1}) \prod_{i=1}^{n(p_d)} P(ku_i | eu_i).   (3.2)

• Method C
Applied our chunking model to method B, with c = 3 in the chunking model. The plausibility is expressed as follows:

  \hat{E} = \arg\max_E \prod_{j=1}^{l(p_d)+1} P(e_j | e_{j-3}^{j-1}) \prod_{i=1}^{n(p_d)} P(ku_i | eu_i)
            \times \prod_{j=1}^{l(p_d)-1} P(z_j | z_{j-4}^{j-1}, e_{j-3}^{j+1}).   (3.3)

• Method D
Used our translation model that considered English context, but not the chunking model, with b = 3 in the translation model. The plausibility is expressed as follows:

  \hat{E} = \arg\max_E \prod_{j=1}^{l(p_d)+1} P(e_j | e_{j-3}^{j-1})
            \times \prod_{i=1}^{n(p_d)} P(ku_i | e_{start(i)-3}^{start(i)-1}, eu_i, e_{end(i)+1}).   (3.4)

• Method E
Used our language model, translation model, and chunking model. The plausibility is expressed as follows:

  \hat{E} = \arg\max_E \prod_{j=1}^{l(p_d)+1} P(e_j | e_{j-3}^{j-1})
            \times \prod_{i=1}^{n(p_d)} P(ku_i | e_{start(i)-3}^{start(i)-1}, eu_i, e_{end(i)+1})
            \times \prod_{j=1}^{l(p_d)-1} P(z_j | z_{j-4}^{j-1}, e_{j-3}^{j+1}).   (3.5)

3.3 Results

Table 1 shows the results of the experiment on back transliteration from Japanese katakana to English[5]. The conversion was determined to be successful if the generated English word agreed perfectly with the English word in the test data. Table 2 shows examples of back-transliterated words.

  Method   A     B     C     D     E
  Top 1   23.7  57.4  61.6  63.1  66.4
  Top 2   34.8  69.1  72.4  71.8  74.2
  Top 3   42.9  73.6  76.6  75.4  79.3
  Top 5   54.1  77.5  79.9  80.8  83.5
  Top 10  63.4  82.0  85.3  86.5  87.7

Table 1: Ratio (%) of including the answer word among the high-ranking words.

[5] For methods D and E, we used N = 50 for the beam search solution. In addition, we kept paths that represented parts of words existing in the learning data.
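For reference, the measure reported in Table 1 can be computed as in the sketch below: a test item counts as correct at rank N when the answer English word appears among the top N generated candidates. The function and data layout are our own illustration, not the authors' evaluation code.

```python
def top_n_ratio(ranked_outputs, answers, n):
    """ranked_outputs[i]: generated English words for test item i, best first.
    answers[i]: the correct English word.  Returns the percentage of items
    whose answer appears within the top n candidates."""
    hits = sum(1 for ranked, gold in zip(ranked_outputs, answers)
               if gold in ranked[:n])
    return 100.0 * hits / len(answers)

# e.g. top_n_ratio(outputs, answers, 1) and top_n_ratio(outputs, answers, 10)
```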

  Japanese katakana (romanized katakana)       Created English
  アシュクロフト (a shu ku ro fu to)            Ashcroft
  キルシュシュタイン (ki ru shu shu ta i n)     Kirschstein
  スペンサー (su pe n sa -)                     Spencer
  パウエル (pa u e ru)                          Powell
  プリンシピ (pu ri n shi pi)                   Principi

Table 2: Examples of English words produced.

4 Discussion

The correct-match ratio of method E for the first-ranked words was 66%. Its correct-match ratio for words up to the 10th rank was 87%.

Regarding the top 1 ranked words, method B, which used a language model, increased the ratio by 33 points over method A, which did not use a language model. This demonstrates the effectiveness of the language model.

Also for the top 1 ranked words, method C, which adopted the chunking model, increased the ratio by 4 points over method B, which did not adopt the chunking model. This indicates the effectiveness of the chunking model.

Method D, which used a translation model taking English context into account, had a ratio 5 points higher for the top 1 ranked words than that of method B, which used a translation model not taking English context into account. This demonstrates the effectiveness of taking English context into account in the translation model.

Method E gave the best ratio. Its ratio for the top 1 ranked words was 42 points higher than method A's.

These results demonstrate the effectiveness of using English context for back transliteration.

5 Conclusion

This paper described a method for Japanese-to-English back transliteration. Unlike conventional methods, our method uses the target English context to calculate the plausibility of matching between English and katakana. Our method can treat English words that do not exist in the learning data. We confirmed the effectiveness of our method in an experiment using personal names. We will apply this technique to cross-language information retrieval.

References

Hsin-Hsi Chen, Sheng-Jie Huang, Yung-Wei Ding, and Shih-Chung Tsai. 1998. Proper Name Translation in Cross-Language Information Retrieval. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pp. 232-236.

Stanley F. Chen and Ronald Rosenfeld. 1999. A Gaussian Prior for Smoothing Maximum Entropy Models. Technical Report CMU-CS-99-108, Carnegie Mellon University.

Atsushi Fujii and Tetsuya Ishikawa. 2001. Japanese/English Cross-Language Information Retrieval: Exploration of Query Translation and Transliteration. Computers and the Humanities, Vol. 35, No. 4, pp. 389-420.

Isao Goto, Naoto Kato, Noriyoshi Uratani, and Terumasa Ehara. 2003. Transliteration Considering Context Information based on the Maximum Entropy Method. In Proceedings of Machine Translation Summit IX, pp. 125-132.

Kil Soon Jeong, Sung Hyun Myaeng, Jae Sung Lee, and Key-Sun Choi. 1999. Automatic Identification and Back-Transliteration of Foreign Words for Information Retrieval. Information Processing and Management, Vol. 35, No. 4, pp. 523-540.

Byung-Ju Kang and Key-Sun Choi. 2000. Automatic Transliteration and Back-Transliteration by Decision Tree Learning. In Proceedings of the International Conference on Language Resources and Evaluation, pp. 1135-1141.

Kevin Knight and Jonathan Graehl. 1998. Machine Transliteration. Computational Linguistics, Vol. 24, No. 4, pp. 599-612.

Wei-Hao Lin and Hsin-Hsi Chen. 2002. Backward Machine Transliteration by Learning Phonetic Similarity. In Proceedings of the 6th Conference on Natural Language Learning, pp. 139-145.

Bonnie Glover Stalls and Kevin Knight. 1998. Translating Names and Technical Terms in Arabic Text. In Proceedings of the COLING/ACL Workshop on Computational Approaches to Semitic Languages.