
DIRECTL: a Language-Independent Approach to Transliteration

Sittichai Jiampojamarn, Aditya Bhargava, Qing Dou, Kenneth Dwyer, Grzegorz Kondrak
Department of Computing Science, University of Alberta, Edmonton, AB, T6G 2E8, Canada
{sj,abhargava,qdou,dwyer,kondrak}@cs.ualberta.ca

Abstract

We present DIRECTL: an online discriminative sequence prediction model that employs a many-to-many alignment between target and source. Our system incorporates input segmentation, target character prediction, and sequence modeling in a unified dynamic programming framework. Experimental results suggest that DIRECTL is able to independently discover many of the language-specific regularities in the training data.

1 Introduction

In the transliteration task, it seems intuitively important to take into consideration the specifics of the languages in question. Of particular importance is the relative character length of the source and target names, which varies widely depending on whether the languages employ alphabetic, syllabic, or ideographic scripts. On the other hand, faced with the reality of thousands of potential language pairs that involve transliteration, the idea of a language-independent approach is highly attractive.

In this paper, we present DIRECTL: a transliteration system that, in principle, can be applied to any language pair. DIRECTL treats the transliteration task as a sequence prediction problem: given an input sequence of characters in the source language, it produces the most likely sequence of characters in the target language. In Section 2, we discuss the alignment of character substrings in the source and target languages. Our transcription model, described in Section 3, is based on an online discriminative training algorithm that makes it possible to efficiently learn the weights of a large number of features. In Section 4, we provide details of alternative approaches that incorporate language-specific information. Finally, in Sections 5 and 6, we compare the experimental results of DIRECTL with its variants that incorporate language-specific pre-processing, phonetic alignment, and manual data correction.

2 Transliteration alignment

In the transliteration task, training data consist of word pairs that map source language words to words in the target language. The matching between character substrings in the source word and the target word is not explicitly provided. These hidden relationships are generally known as alignments. In this section, we describe an EM-based many-to-many alignment algorithm employed by DIRECTL. In Section 4, we discuss an alternative phonetic alignment method.

We apply an unsupervised many-to-many alignment algorithm (Jiampojamarn et al., 2007) to the transliteration task. The algorithm follows the expectation maximization (EM) paradigm. In the expectation step, shown in Algorithm 1, partial counts γ of the possible substring alignments are collected from each word pair (x^T, y^V) in the training data; T and V represent the lengths of the words x and y, respectively. The forward probability α is estimated by summing the probabilities of all possible sequences of substring pairings from left to right. The FORWARD-M2M procedure is similar to lines 5 through 12 of Algorithm 1, except that it uses Equation 1 on line 8, Equation 2 on line 12, and initializes α_{0,0} := 1. Likewise, the backward probability β is estimated by summing the probabilities from right to left.

α_{t,v} += δ(x^t_{t−i+1}, ε) α_{t−i,v}    (1)

α_{t,v} += δ(x^t_{t−i+1}, y^v_{v−j+1}) α_{t−i,v−j}    (2)

The maxX and maxY variables specify the maximum length of the substrings that are permitted when creating alignments. Also, for flexibility, we allow a substring in the source word to be aligned with a "null" letter (ε) in the target word.
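To make the recurrence concrete, here is a minimal Python sketch of the forward pass corresponding to Equations 1 and 2. It illustrates the technique rather than reproducing the authors' implementation; the dictionary-based δ table and the use of the empty string for the null letter ε are our assumptions.

```python
from collections import defaultdict

def forward_m2m(x, y, max_x, max_y, delta):
    """Forward probabilities alpha for many-to-many alignment.

    delta[(xs, ys)] holds the current alignment probability of source
    substring xs pairing with target substring ys; the empty string ""
    plays the role of the null letter (epsilon).
    """
    T, V = len(x), len(y)
    alpha = defaultdict(float)
    alpha[0, 0] = 1.0
    for t in range(T + 1):
        for v in range(V + 1):
            # Equation 1: a source substring maps to the null letter.
            if t > 0:
                for i in range(1, min(max_x, t) + 1):
                    alpha[t, v] += delta.get((x[t-i:t], ""), 0.0) * alpha[t-i, v]
            # Equation 2: a source substring maps to a target substring.
            if t > 0 and v > 0:
                for i in range(1, min(max_x, t) + 1):
                    for j in range(1, min(max_y, v) + 1):
                        alpha[t, v] += (delta.get((x[t-i:t], y[v-j:v]), 0.0)
                                        * alpha[t-i, v-j])
    return alpha
```

The backward pass is symmetric, starting from β_{T,V} = 1 and summing from right to left; the expectation step then combines α, β, and δ to accumulate the partial counts γ, as on lines 8 and 12 of Algorithm 1 below.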

Algorithm 1: Expectation-M2M alignment

Input: x^T, y^V, maxX, maxY, γ
Output: γ
1: α := FORWARD-M2M(x^T, y^V, maxX, maxY)
2: β := BACKWARD-M2M(x^T, y^V, maxX, maxY)
3: if α_{T,V} = 0 then
4:   return
5: for t = 0 ... T, v = 0 ... V do
6:   if t > 0 then
7:     for i = 1 ... maxX s.t. t − i ≥ 0 do
8:       γ(x^t_{t−i+1}, ε) += α_{t−i,v} δ(x^t_{t−i+1}, ε) β_{t,v} / α_{T,V}
9:   if v > 0 ∧ t > 0 then
10:    for i = 1 ... maxX s.t. t − i ≥ 0 do
11:      for j = 1 ... maxY s.t. v − j ≥ 0 do
12:        γ(x^t_{t−i+1}, y^v_{v−j+1}) += α_{t−i,v−j} δ(x^t_{t−i+1}, y^v_{v−j+1}) β_{t,v} / α_{T,V}

Algorithm 2: Online discriminative training

Input: Data {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}, number of iterations k, size of n-best list n
Output: Learned weights ψ
1: ψ := 0
2: for k iterations do
3:   for j = 1 ... m do
4:     Ŷ_j = {ŷ_{j1}, ..., ŷ_{jn}} = argmax_y [ψ · Φ(x_j, y)]
5:     update ψ according to Ŷ_j and y_j
6: return ψ

In the maximization step, we normalize the partial counts γ to the alignment probability δ using the conditional probability distribution. The EM steps are repeated until the alignment probability δ converges. Finally, the most likely alignment for each word pair in the training data is computed with the standard Viterbi algorithm.

3 Discriminative training

We adapt the online discriminative training framework described in (Jiampojamarn et al., 2008) to the transliteration task. Once the training data has been aligned, we can hypothesize that the ith letter substring x_i ∈ x in a source language word is transliterated into the ith substring y_i ∈ y in the target language word. Each word pair is represented as a feature vector Φ(x, y). Our feature vector consists of (1) n-gram context features, (2) HMM-like transition features, and (3) linear-chain features. The n-gram context features relate the letter evidence that surrounds each letter x_i to its output y_i. We include all n-grams that fit within a context window of size c. The value of c is determined using a development set. The HMM-like transition features express the cohesion of the output y in the target language. We make a first-order Markov assumption, so that these features are bigrams of the form (y_{i−1}, y_i). The linear-chain features are identical to the context features, except that y_i is replaced with a bigram (y_{i−1}, y_i).

Algorithm 2 trains the model parameters ψ. The values of k and n are determined using a development set. The model parameters are updated according to the correct output y_j and the predicted n-best outputs Ŷ_j, to make the model prefer the correct output over the incorrect ones. Specifically, the feature weight vector ψ is updated by using MIRA, the Margin Infused Relaxed Algorithm (Crammer and Singer, 2003). MIRA modifies the current weight vector ψ_o by finding the smallest changes such that the new weight vector ψ_n separates the correct and incorrect outputs by a margin of at least ℓ(y, ŷ), the loss for a wrong prediction. We define this loss to be 0 if ŷ = y; otherwise it is 1 + d, where d is the Levenshtein distance between y and ŷ. The update operation is stated as a quadratic programming problem in Equation 3. We utilize a function from the SVMlight package (Joachims, 1999) to solve this optimization problem.

min_{ψ_n} ‖ψ_n − ψ_o‖
subject to ∀ŷ ∈ Ŷ: ψ_n · (Φ(x, y) − Φ(x, ŷ)) ≥ ℓ(y, ŷ)    (3)
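The exact update solves the quadratic program in Equation 3 with SVMlight. As a rough illustration only, the following Python sketch applies the well-known closed-form single-constraint MIRA update to each n-best candidate in turn, which approximates the joint optimization; the sparse-dictionary feature representation and all names here are our assumptions, not the system's code.

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance, used in the loss."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def loss(y, y_hat):
    """0 for a correct prediction, otherwise 1 + Levenshtein distance."""
    return 0.0 if y_hat == y else 1.0 + levenshtein(y, y_hat)

def mira_update(psi, phi_gold, nbest, y):
    """Approximate MIRA step: one closed-form update per candidate.

    psi      -- sparse weight vector, dict: feature -> weight
    phi_gold -- Phi(x, y) for the correct output, dict: feature -> count
    nbest    -- list of (y_hat, phi_hat) pairs from the decoder
    """
    for y_hat, phi_hat in nbest:
        diff = {f: phi_gold.get(f, 0.0) - phi_hat.get(f, 0.0)
                for f in set(phi_gold) | set(phi_hat)}
        margin = sum(psi.get(f, 0.0) * d for f, d in diff.items())
        violation = loss(y, y_hat) - margin
        norm_sq = sum(d * d for d in diff.values())
        if violation > 0 and norm_sq > 0:
            tau = violation / norm_sq  # smallest step satisfying the margin
            for f, d in diff.items():
                psi[f] = psi.get(f, 0.0) + tau * d
```

In the actual system the n constraints are handled jointly, which is why the authors call on SVMlight's optimizer rather than iterating over candidates as above.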

The argmax operation is performed by an exact search algorithm based on a phrasal decoder (Zens and Ney, 2004). This decoder simultaneously finds the l most likely substrings of letters in x^T that generate the most probable output y, given the feature weight vector ψ and the input word x^T. The search algorithm is based on the following dynamic programming recurrence:

Q(0, $) = 0
Q(t, p) = max_{p′, t−maxX ≤ t′ < t} {ψ · φ(x^t_{t′+1}, p′, p) + Q(t′, p′)}

Here, p is the sub-output generated during the current step and p′ is a sub-output generated during the previous step. The notation φ(x^t_{t′+1}, p′, p) is a convenient way to describe the components of our feature vector Φ(x, y). The n-best predicted outputs Ŷ can be discovered by backtracking from the end of the table, which is denoted by Q(T + 1, $).
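The following Python sketch illustrates a top-1 version of this recurrence (the actual decoder maintains n-best lists). The helper functions candidates and score, which stand in for the substring inventory and for ψ · φ(x^t_{t′+1}, p′, p), are assumptions for the sake of the example.

```python
def decode(x, max_x, candidates, score, start="$"):
    """Monotone DP search over Q(t, p): best output for input word x.

    candidates(sub)       -- possible target substrings for source substring sub
    score(sub, p_prev, p) -- stand-in for psi . phi(x_{t'+1..t}, p', p)
    """
    T = len(x)
    Q = {0: {start: 0.0}}          # Q(0, $) = 0
    back = {}
    for t in range(1, T + 1):
        Q[t] = {}
        for t_prev in range(max(0, t - max_x), t):
            sub = x[t_prev:t]      # source substring x_{t'+1..t}
            for p_prev, prev_score in Q[t_prev].items():
                for p in candidates(sub):
                    s = prev_score + score(sub, p_prev, p)
                    if s > Q[t].get(p, float("-inf")):
                        Q[t][p] = s
                        back[t, p] = (t_prev, p_prev)
    # Backtrack from the best final sub-output (the table end, Q(T+1, $)).
    p = max(Q[T], key=Q[T].get)
    output, t = [], T
    while t > 0:
        output.append(p)
        t, p = back[t, p]
    return "".join(reversed(output))
```

Keeping only the single best score per (t, p) state is what makes the search exact under first-order Markov features; extending each state to hold the n best scores yields the n-best lists required by the training algorithm.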
4 Beyond DIRECTL

4.1 Intermediate phonetic representation

We experimented with converting the original Chinese characters to Pinyin as an intermediate representation. Pinyin is the most commonly known Romanization system for Standard Mandarin. Its alphabet contains the same 26 letters as English. Each Chinese character can be transcribed phonetically into Pinyin. Many resources for Pinyin conversion are available online.¹ A small percentage of Chinese characters have multiple pronunciations represented by different Pinyin representations. For those characters (about 30 characters in the transliteration data), we manually selected the pronunciations that are normally used for names. This preprocessing step significantly reduces the number of target symbols from 370 distinct Chinese characters to 26 Pinyin symbols, which enables our system to produce better alignments.

In order to verify whether the addition of language-specific knowledge can improve the overall accuracy, we also designed intermediate representations for Russian and Japanese. We focused on symbols that modify the neighboring characters without producing phonetic output themselves: the hard and soft signs in Russian, and the long vowel and gemination signs in Japanese. Those were combined with the neighboring characters, creating new "super-characters."

4.2 Phonetic alignment with ALINE

ALINE (Kondrak, 2000) is an algorithm that performs phonetically-informed alignment of two strings of phonemes. Since our task requires the alignment of characters representing different writing scripts, we need to first replace every character with the phoneme that is the most likely to be produced by that character.

We applied slightly different methods to the test languages. In converting the Cyrillic script into phonemes, we take advantage of the fact that Russian orthography is largely phonemic, which makes it a relatively straightforward task. In Japanese, we replace each katakana character with one or two phonemes using standard transcription tables. For the Latin script, we simply treat every letter as an IPA symbol (International Phonetic Association, 1999). The IPA contains a subset of 26 letter symbols that tend to correspond to the usual phonetic value that the letter represents in the Latin script. The Chinese characters are first converted to Pinyin, which is then handled in the same way as the Latin script.

Similar solutions could be engineered for other scripts. We observed that the transcriptions do not need to be very precise in order for ALINE to produce high-quality alignments.

4.3 System combination

The combination of predictions produced by systems based on different principles may lead to improved prediction accuracy. We adopt the following combination algorithm. First, we rank the individual systems according to their top-1 accuracy on the development set. To obtain the top-1 prediction for each input word, we use simple voting, with ties broken according to the ranking of the systems. We generalize this approach to handle n-best lists by first ordering the candidate transliterations according to the highest rank assigned by any of the systems, and then similarly breaking ties by voting and system ranking.
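A small Python sketch of this combination scheme follows, under our reading that a candidate receives one vote from every system whose n-best list contains it; the data layout is an assumption for illustration.

```python
from collections import Counter

def combine_nbest(system_outputs):
    """Combine n-best lists from systems sorted by dev-set top-1 accuracy.

    system_outputs -- one best-first candidate list per system,
                      with the best-ranked system first.
    Candidates are ordered by the highest (i.e., smallest) rank any
    system assigns them, with ties broken by votes and then by the
    ranking of the best system that proposed the candidate.
    """
    best_rank, votes, best_system = {}, Counter(), {}
    for s, cands in enumerate(system_outputs):
        for r, cand in enumerate(cands):
            votes[cand] += 1
            best_rank[cand] = min(best_rank.get(cand, r), r)
            best_system.setdefault(cand, s)  # systems arrive best-first
    return sorted(votes, key=lambda c: (best_rank[c], -votes[c], best_system[c]))
```

Taking the first element of the returned list yields the combined top-1 prediction; for the top-1-only variant described first, each system would contribute only its single best candidate.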

¹ For example, http://www.chinesetopinyin.com/
² http://www.cjk.org/

5 Evaluation

In the context of the NEWS 2009 Machine Transliteration Shared Task (Li et al., 2009), we tested our system on six data sets: from English to Chinese (EnCh) (Li et al., 2004), Hindi (EnHi), Russian (EnRu) (Kumaran and Kellner, 2007), Japanese Katakana (EnJa), and Korean (EnKo); and from Japanese name to Japanese Kanji (JnJk).² We optimized the models' parameters by training on the training portion of the provided data and measuring performance on the development portion. For the final testing, we trained the models on all the available labeled data (training plus development data). For each data set, we converted any uppercase letters to lowercase. Our system outputs the top 10 candidate answers for each input word.

Table 1 reports the performance of our system on the development and final test sets, measured in terms of top-1 word accuracy (ACC). For certain language pairs, we tested variants of the base system described in Section 4. DIRECTL refers to our language-independent model, which uses many-to-many alignments. The INT abbreviation denotes the models operating on the language-specific intermediate representations described in Section 4.1. The alignment algorithm (ALINE or M2M) is given in parentheses.

Task  Model        Dev   Test
EnCh  DIRECTL      72.4  71.7
      INT(M2M)     73.9  73.4
      INT(ALINE)   73.8  73.2
      COMBINED     74.8  74.6
EnHi  DIRECTL      41.4  49.8
      DIRECTL+MC   42.3  50.9
EnJa  DIRECTL      49.9  50.0
      INT(M2M)*    49.6  49.2
      INT(ALINE)   48.3  51.0
      COMBINED*    50.6  50.5
EnKo  DIRECTL      36.7  38.7
EnRu  DIRECTL      80.2  61.3
      INT(M2M)     80.3  60.8
      INT(ALINE)   80.0  60.7
      COMBINED*    80.3  60.8
JnJk  DIRECTL      53.5  56.0

Table 1: Top-1 word accuracy on the development and test sets. The asterisk denotes results obtained after the test reference sets were released.

In the EnHi set, many names consisted of multiple words: we assumed a one-to-one correspondence between consecutive English words and consecutive Hindi words. In Table 1, the results in the first row (DIRECTL) were obtained with an automatic cleanup script that replaced hyphens with spaces, deleted the remaining punctuation and numerical symbols, and removed 43 transliteration pairs with a disagreement between the number of source and target words. The results in the second row (DIRECTL+MC) were obtained when the cases with a disagreement were individually examined and corrected by a Hindi speaker.

We did not incorporate any external resources into the models presented in Table 1. In order to emphasize the performance of our language-independent approach, we consistently used the DIRECTL model for generating our "standard" runs on all six language pairs, regardless of its relative performance on the development sets.

6 Discussion

DIRECTL, our language-independent approach to transliteration, achieves excellent results, especially on the EnCh, EnRu, and EnHi data sets, which represent a wide range of language pairs and writing scripts. Both the many-to-many and phonetic alignment algorithms produce high-quality alignments. The former can be applied directly to the training data without the need for an intermediate representation, while the latter does not require any training. Surprisingly, the incorporation of language-specific intermediate representations does not consistently improve the performance of our system, which indicates that DIRECTL may be able to discover the structures implicit in the training data without additional guidance. The EnHi results suggest that manual cleaning of noisy data can yield noticeable gains in accuracy. On the other hand, a simple method of combining predictions from different systems produced a clear improvement on the EnCh set, but mixed results on two other sets. More research on this issue is warranted.

Acknowledgments

This research was supported by Alberta Ingenuity, the Informatics Circle of Research Excellence (iCORE), and the Natural Sciences and Engineering Research Council of Canada (NSERC).

References

Koby Crammer and Yoram Singer. 2003. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3:951–991.

International Phonetic Association. 1999. Handbook of the International Phonetic Association. Cambridge University Press.

Sittichai Jiampojamarn, Grzegorz Kondrak, and Tarek Sherif. 2007. Applying many-to-many alignments and Hidden Markov Models to letter-to-phoneme conversion. In Proc. HLT-NAACL, pages 372–379.

Sittichai Jiampojamarn, Colin Cherry, and Grzegorz Kondrak. 2008. Joint processing and discriminative training for letter-to-phoneme conversion. In Proc. ACL, pages 905–913.

Thorsten Joachims. 1999. Making large-scale SVM learning practical. In Advances in Kernel Methods: Support Vector Learning, pages 169–184. MIT Press.

Grzegorz Kondrak. 2000. A new algorithm for the alignment of phonetic sequences. In Proc. NAACL, pages 288–295.

A. Kumaran and Tobias Kellner. 2007. A generic framework for machine transliteration. In Proc. SIGIR, pages 721–722.

Haizhou Li, Min Zhang, and Jian Su. 2004. A joint source channel model for machine transliteration. In Proc. ACL, pages 159–166.

Haizhou Li, A. Kumaran, Min Zhang, and Vladimir Pervouchine. 2009. Whitepaper of NEWS 2009 machine transliteration shared task. In Proc. ACL-IJCNLP 2009 Named Entities Workshop.

Richard Zens and Hermann Ney. 2004. Improvements in phrase-based statistical machine translation. In Proc. HLT-NAACL, pages 257–264.
