Back Transliteration from Japanese to English Using Target English Context
Total Page:16
File Type:pdf, Size:1020Kb
Back Transliteration from Japanese to English Using Target English Context Isao Goto†, Naoto Kato††, Terumasa Ehara†††, and Hideki Tanaka† †NHK Science and Technical ††ATR Spoken Language Trans- †††Tokyo University of Research Laboratories lation Research Laboratories Science, Suwa 1-11-10 Kinuta, Setagaya, 2-2-2 Hikaridai, Keihanna 5000-1, Toyohira, Chino, Tokyo, 157-8510, Japan Science City, Kyoto, 619-0288, Nagano, 391-0292, Japan [email protected] Japan [email protected]. [email protected] [email protected] ac.jp cross-language information retrieval if these Abstract words could be automatically restored to the original English words. This paper proposes a method of automatic Back transliteration is the process of restoring back transliteration of proper nouns, in which transliterated words to the original English words. a Japanese transliterated-word is restored to Here is a problem of back transliteration. the original English word. The English words are created from a sequence of letters; thus [Back transliteration] our method can create new English words that ? クラッチフィールド are not registered in dictionaries or English (English word) (ku ra cchi fi – ru do) word lists. When a katakana character is con- verted into English letters, there are various There are many ambiguities to restoring a candidates of alphabetic characters. To ensure transliterated katakana word to its original Eng- adequate conversion, the proposed method lish word. For example, should "a" in "ku ra cchi uses a target English context to calculate the fi – ru do" be converted into the English letter of probability of an English character or string "a" or "u" or some other letter or string? Trying corresponding to a Japanese katakana charac- to resolve the ambiguity is a difficult problem, ter or string. We confirmed the effectiveness which means that back transliteration to the cor- of using the target English context by an ex- rect English word is also difficult. periment of personal-name back translitera- Using the pronunciation of a dictionary or lim- tion. iting output English words to a particular English word list prepared in advance can simplify the problem of back transliteration. However, these 1 Introduction methods cannot produce a new English word that In transliteration, a word in one language is con- is not registered in a dictionary or an English verted into a character string of another language word list. Transliterated words are mainly proper expressing how it is pronounced. In the case of nouns and technical terms, and such words are transliteration into Japanese, special characters often not registered. Thus, a back transliteration called katakana are used to show how a word is framework for creating new words would be very pronounced. For example, a personal name and useful. its transliterated word are shown below. A number of back transliteration methods for selecting English words from an English pronun- [Transliteration] ciation dictionary have been proposed. They in- Cunningham カニンガム (ka ni n ga mu) clude Japanese-to-English (Knight and Graehl, 1998) 1 , Arabic-to-English (Stalls and Knight, Here, the italic alphabets are romanized Japanese katakana characters. New transliterated words such as personal names or technical terms in katakana are not al- 1 Their English letter-to-sound WFST does not convert Eng- ways listed in dictionaries. It would be useful for lish words that are not registered in a pronunciation diction- ary. 1998), and Korean-to-English (Lin and Chen, the English letter "u." The correspondence of an 2002). English letter or string to a katakana character or There are also methods that select English string varies depending on the surrounding char- words from an English word list, e.g., Japanese- acters, i.e., on its English context. to-English (Fujii and Ishikawa, 2001) and Chi- Thus, our back transliteration method uses the nese-to-English (Chen et al., 1998). target English context to calculate the probability Moreover, there are back transliteration meth- of English letters corresponding to a katakana ods capable of generating new words, there are character or string. some methods for back transliteration from Ko- rean to English (Jeong et al., 1999; Kang and 2.2 Notation and conversion-candidate Choi, 2000). lattice These previous works did not take the target We formulate the word conversion process as a English context into account for calculating the unit conversion process for treating new words. plausibility of matching target characters with the Here, the unit is one or more characters that form source characters. a part of characters of the word. This paper presents a method of taking the tar- A katakana word, K, is expressed by equation get English context into account to generate an 2.1 with "^" and "$" added to its start and end, English word from a Japanese katakana word. respectively. Our character-based method can produce new K ==kkkkm+1 ... (2.1) English words that are not listed in the learning 0011m+ k = ^ k = $ (2.2) corpus. 0 , m+1 This paper is organized as follows. Section 2 where k j is the j-th character in the katakana describes our method. Section 3 describes the word, and m is the number of characters except experimental set-up and results. Section 4 dis- for "^" and "$" and k m+1 is a character string cusses the performance of our method based on 0 the experimental results. Section 5 concludes our from k0 to km+1 . research. We use katakana units constructed of one or more katakana characters. We denote a katakana 2 Proposed Method unit as ku. For any ku, many English units, eu, could be corresponded as conversion-candidates. The ku's and eu's are generated using a learning 2.1 Advantage of using English context corpus in which bilingual words are separated First we explain the difficulty of back translitera- into units and every ku unit is related an eu unit. tion without a pronunciation dictionary. Next, we L{}E denotes the lattice of all eu's correspond- clarify the reason for the difficulty. Finally, we ing to ku's covering a Japanese word. Every eu is clarify the effect using English context in back a node of the lattice and each node is connected transliteration. with next nodes. L{}E has a lattice structure start- In back transliteration, an English letter or ing from "^" and ending at "$." Figure 1 shows an string is chosen to correspond to a katakana char- example of L{}E corresponding to a katakana acter or string. However, this decision is difficult. For example, there are cases that an English letter word "キルシュシュタイン(ki ru shu shu ta i "u" corresponds to "a" of katakana, and there are n)." In the figure, each circle represents one eu. cases that the same English letter "u" does not A character string linking individual character correspond to the same "a" of katakana. "u" in units in the paths ppppdq∈(12 , ,.., ) between "^" Cunningham corresponds to "a" in katakana and and "$" in L{}E becomes a conversion candidate, "u" in Bush does not correspond to "a" in kata- where q is the number of paths between "^" and kana. It is difficult to resolve this ambiguity "$" in L{}E . without the pronunciation registered in a diction- We get English word candidates by joining eu's ary. from "^" to "$" in L{}E . We select a certain path, The difference in correspondence mainly comes from the difference of the letters around pd, in L{}E . The number of character units キ(ki) ル(ru) シュ(shu) シュ(shu) タ(ta) イ(i) ン(n) ^ cchi l ch ch ta e m $ chi ld che che tad hi mon ci le chou chou tag i mp cki les chu chu te ji n cy lew s s ter y ne k ll sc sc tha yeh ng ke lle sch sch ti yi・・・ngh khi llu schu schu to nin ki lou sh sh tta nn kie lu she she tu・・・ nne kii r shu shu nt ky rc su su nw・・・ タイ qui・・・rd sy sy (tai) re sz・・・sz・・・ t rg taj roo tay rou tey rr ti rre tie rt ty lu・・・ tye・・・ Figure 1: Example of lattice L{}E of conversion candidates units. ˆ except for "^" and "$" in pd is expressed as np()d . E =EKarg maxP ( | ) . (2.6) The character units in p are numbered from start E d ˆ to end. Here, E represents an output result. The English word, E, resulting from the con- To use the English context for calculating the matching of an English unit with a katakana unit, version of a katakana word, K, for pd is expressed as follows: the above equation is transformed into Equation m+1 2.7 by using Bayes’ theorem. K ==kkkk0011.. m+ Eˆ = arg maxPP (EKE ) ( | ) (2.7) np()1d + = ku = ku ku.. ku , (2.3) E 001()1npd + lp()1+ Equation 2.7 contains a translation model in E ==eeeed .. 001()1lpd + which an English word is a condition and kata- = eunp()1d + = eu eu.. eu , (2.4) 001()1npd + kana is a result. The word in the translation model P(|)K E in ke00==ku 0 = eu 0 =^ , Equation 2.7 is broken down into character units ke==ku = eu =$ (2.5) mlpnpnp++1 (ddd )1 ( )1 + ( )1 +, by using equations 2.3 and 2.4. where ej is the j-th character in the English word. EEˆ = arg maxP ( ) E lp()d is the number of characters except for "^" np()1dd++ np ()1 np()1d + × P(,Kku00 , eu | E ) and "$" in the English word. eu0 for each pd ∑∑ eunp()1()1dd++ ku np in L{}E in equation 2.4 becomes the candidate 00 = arg maxP (E ) np()1d + English word.