
DIRECTL: a Language-Independent Approach to Transliteration

Sittichai Jiampojamarn, Aditya Bhargava, Qing Dou, Kenneth Dwyer, Grzegorz Kondrak
Department of Computing Science, University of Alberta, Edmonton, AB, T6G 2E8, Canada
{sj,abhargava,qdou,dwyer,kondrak}@cs.ualberta.ca

Abstract

We present DIRECTL: an online discriminative sequence prediction model that employs a many-to-many alignment between target and source. Our system incorporates input segmentation, target character prediction, and sequence modeling in a unified dynamic programming framework. Experimental results suggest that DIRECTL is able to independently discover many of the language-specific regularities in the training data.

1 Introduction

In the transliteration task, it seems intuitively important to take into consideration the specifics of the languages in question. Of particular importance is the relative character length of the source and target names, which varies widely depending on whether the languages employ alphabetic, syllabic, or ideographic scripts. On the other hand, faced with the reality of thousands of potential language pairs that involve transliteration, the idea of a language-independent approach is highly attractive.

In this paper, we present DIRECTL: a transliteration system that, in principle, can be applied to any language pair. DIRECTL treats the transliteration task as a sequence prediction problem: given an input sequence of characters in the source language, it produces the most likely sequence of characters in the target language. In Section 2, we discuss the alignment of character substrings in the source and target languages. Our transcription model, described in Section 3, is based on an online discriminative training algorithm that makes it possible to efficiently learn the weights of a large number of features. In Section 4, we provide details of alternative approaches that incorporate language-specific information. Finally, in Sections 5 and 6, we compare the experimental results of DIRECTL with its variants that incorporate language-specific pre-processing, phonetic alignment, and manual data correction.

2 Transliteration alignment

In the transliteration task, training data consist of word pairs that map source language words to words in the target language. The matching between character substrings in the source word and the target word is not explicitly provided. These hidden relationships are generally known as alignments. In this section, we describe an EM-based many-to-many alignment algorithm employed by DIRECTL. In Section 4, we discuss an alternative phonetic alignment method.

We apply an unsupervised many-to-many alignment algorithm (Jiampojamarn et al., 2007) to the transliteration task. The algorithm follows the expectation maximization (EM) paradigm. In the expectation step, shown in Algorithm 1, partial counts γ of the possible substring alignments are collected from each word pair (x^T, y^V) in the training data; T and V represent the lengths of the words x and y, respectively. The forward probability α is estimated by summing the probabilities of all possible sequences of substring pairings from left to right. The FORWARD-M2M procedure is similar to lines 5 through 12 of Algorithm 1, except that it uses Equation 1 on line 8, Equation 2 on line 12, and initializes α_{0,0} := 1. Likewise, the backward probability β is estimated by summing the probabilities from right to left.

α_{t,v} += δ(x^t_{t−i+1}, ε) α_{t−i,v}    (1)

α_{t,v} += δ(x^t_{t−i+1}, y^v_{v−j+1}) α_{t−i,v−j}    (2)

The maxX and maxY variables specify the maximum length of the substrings that are permitted when creating alignments. Also, for flexibility, we allow a substring in the source word to be aligned with a "null" letter (ε) in the target word.
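To make the recurrence concrete, here is a minimal Python sketch of the forward pass corresponding to Equations 1 and 2. It illustrates the technique rather than reproducing the authors' implementation; the dictionary-based δ table and the use of the empty string for the null letter ε are our assumptions.

```python
from collections import defaultdict

def forward_m2m(x, y, max_x, max_y, delta):
    """Forward probabilities alpha for many-to-many alignment.

    delta[(xs, ys)] holds the current alignment probability of source
    substring xs pairing with target substring ys; the empty string ""
    plays the role of the null letter (epsilon).
    """
    T, V = len(x), len(y)
    alpha = defaultdict(float)
    alpha[0, 0] = 1.0
    for t in range(T + 1):
        for v in range(V + 1):
            # Equation 1: a source substring maps to the null letter.
            if t > 0:
                for i in range(1, min(max_x, t) + 1):
                    alpha[t, v] += delta.get((x[t-i:t], ""), 0.0) * alpha[t-i, v]
            # Equation 2: a source substring maps to a target substring.
            if t > 0 and v > 0:
                for i in range(1, min(max_x, t) + 1):
                    for j in range(1, min(max_y, v) + 1):
                        alpha[t, v] += (delta.get((x[t-i:t], y[v-j:v]), 0.0)
                                        * alpha[t-i, v-j])
    return alpha
```

The backward pass is symmetric, starting from β_{T,V} = 1 and summing from right to left; the expectation step then combines α, β, and δ to accumulate the partial counts γ, as on lines 8 and 12 of Algorithm 1 below.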

Algorithm 1: Expectation-M2M alignment

Input: x^T, y^V, maxX, maxY, γ
Output: γ
1: α := FORWARD-M2M(x^T, y^V, maxX, maxY)
2: β := BACKWARD-M2M(x^T, y^V, maxX, maxY)
3: if α_{T,V} = 0 then
4:   return
5: for t = 0 ... T, v = 0 ... V do
6:   if t > 0 then
7:     for i = 1 ... maxX s.t. t − i ≥ 0 do
8:       γ(x^t_{t−i+1}, ε) += α_{t−i,v} δ(x^t_{t−i+1}, ε) β_{t,v} / α_{T,V}
9:   if v > 0 ∧ t > 0 then
10:    for i = 1 ... maxX s.t. t − i ≥ 0 do
11:      for j = 1 ... maxY s.t. v − j ≥ 0 do
12:        γ(x^t_{t−i+1}, y^v_{v−j+1}) += α_{t−i,v−j} δ(x^t_{t−i+1}, y^v_{v−j+1}) β_{t,v} / α_{T,V}

Algorithm 2: Online discriminative training

Input: Data {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}, number of iterations k, size of n-best list n
Output: Learned weights ψ
1: ψ := 0
2: for k iterations do
3:   for j = 1 ... m do
4:     Ŷ_j = {ŷ_{j1}, ..., ŷ_{jn}} = argmax_y [ψ · Φ(x_j, y)]
5:     update ψ according to Ŷ_j and y_j
6: return ψ

In the maximization step, we normalize the partial counts γ to the alignment probability δ using the conditional probability distribution. The EM steps are repeated until the alignment probability δ converges. Finally, the most likely alignment for each word pair in the training data is computed with the standard Viterbi algorithm.

3 Discriminative training

We adapt the online discriminative training framework described in (Jiampojamarn et al., 2008) to the transliteration task. Once the training data has been aligned, we can hypothesize that the ith letter substring x_i ∈ x in a source language word is transliterated into the ith substring y_i ∈ y in the target language word. Each word pair is represented as a feature vector Φ(x, y). Our feature vector consists of (1) n-gram context features, (2) HMM-like transition features, and (3) linear-chain features. The n-gram context features relate the letter evidence that surrounds each letter x_i to its output y_i. We include all n-grams that fit within a context window of size c. The value of c is determined using a development set. The HMM-like transition features express the cohesion of the output y in the target language. We make a first-order Markov assumption, so that these features are bigrams of the form (y_{i−1}, y_i). The linear-chain features are identical to the context features, except that y_i is replaced with a bigram (y_{i−1}, y_i).

Algorithm 2 trains the model parameters ψ. The values of k and n are determined using a development set. The model parameters are updated according to the correct output y_j and the predicted n-best outputs Ŷ_j, to make the model prefer the correct output over the incorrect ones. Specifically, the feature weight vector ψ is updated by using MIRA, the Margin Infused Relaxed Algorithm (Crammer and Singer, 2003). MIRA modifies the current weight vector ψ_o by finding the smallest changes such that the new weight vector ψ_n separates the correct and incorrect outputs by a margin of at least ℓ(y, ŷ), the loss for a wrong prediction. We define this loss to be 0 if ŷ = y; otherwise it is 1 + d, where d is the Levenshtein distance between y and ŷ. The update operation is stated as a quadratic programming problem in Equation 3. We utilize a function from the SVMlight package (Joachims, 1999) to solve this optimization problem.

min_{ψ_n} ‖ψ_n − ψ_o‖
subject to ∀ŷ ∈ Ŷ: ψ_n · (Φ(x, y) − Φ(x, ŷ)) ≥ ℓ(y, ŷ)    (3)
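The exact update solves the quadratic program in Equation 3 with SVMlight. As a rough illustration only, the following Python sketch applies the well-known closed-form single-constraint MIRA update to each n-best candidate in turn, which approximates the joint optimization; the sparse-dictionary feature representation and all names here are our assumptions, not the system's code.

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance, used in the loss."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def loss(y, y_hat):
    """0 for a correct prediction, otherwise 1 + Levenshtein distance."""
    return 0.0 if y_hat == y else 1.0 + levenshtein(y, y_hat)

def mira_update(psi, phi_gold, nbest, y):
    """Approximate MIRA step: one closed-form update per candidate.

    psi      -- sparse weight vector, dict: feature -> weight
    phi_gold -- Phi(x, y) for the correct output, dict: feature -> count
    nbest    -- list of (y_hat, phi_hat) pairs from the decoder
    """
    for y_hat, phi_hat in nbest:
        diff = {f: phi_gold.get(f, 0.0) - phi_hat.get(f, 0.0)
                for f in set(phi_gold) | set(phi_hat)}
        margin = sum(psi.get(f, 0.0) * d for f, d in diff.items())
        violation = loss(y, y_hat) - margin
        norm_sq = sum(d * d for d in diff.values())
        if violation > 0 and norm_sq > 0:
            tau = violation / norm_sq  # smallest step satisfying the margin
            for f, d in diff.items():
                psi[f] = psi.get(f, 0.0) + tau * d
```

In the actual system the n constraints are handled jointly, which is why the authors call on SVMlight's optimizer rather than iterating over candidates as above.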

The argmax operation is performed by an exact search algorithm based on a phrasal decoder (Zens and Ney, 2004). This decoder simultaneously finds the l most likely substrings of letters in x^T that generate the most probable output y, given the feature weight vector ψ and the input word x^T. The search algorithm is based on the following dynamic programming recurrence:

Q(0, $) = 0
Q(t, p) = max_{p′, t−maxX ≤ t′ < t} {ψ · φ(x^t_{t′+1}, p′, p) + Q(t′, p′)}

Here, p is the sub-output generated during the current step and p′ is a sub-output generated during the previous step. The notation φ(x^t_{t′+1}, p′, p) is a convenient way to describe the components of our feature vector Φ(x, y). The n-best predicted outputs Ŷ can be discovered by backtracking from the end of the table, which is denoted by Q(T + 1, $).
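The following Python sketch illustrates a top-1 version of this recurrence (the actual decoder maintains n-best lists). The helper functions candidates and score, which stand in for the substring inventory and for ψ · φ(x^t_{t′+1}, p′, p), are assumptions for the sake of the example.

```python
def decode(x, max_x, candidates, score, start="$"):
    """Monotone DP search over Q(t, p): best output for input word x.

    candidates(sub)       -- possible target substrings for source substring sub
    score(sub, p_prev, p) -- stand-in for psi . phi(x_{t'+1..t}, p', p)
    """
    T = len(x)
    Q = {0: {start: 0.0}}          # Q(0, $) = 0
    back = {}
    for t in range(1, T + 1):
        Q[t] = {}
        for t_prev in range(max(0, t - max_x), t):
            sub = x[t_prev:t]      # source substring x_{t'+1..t}
            for p_prev, prev_score in Q[t_prev].items():
                for p in candidates(sub):
                    s = prev_score + score(sub, p_prev, p)
                    if s > Q[t].get(p, float("-inf")):
                        Q[t][p] = s
                        back[t, p] = (t_prev, p_prev)
    # Backtrack from the best final sub-output (the table end, Q(T+1, $)).
    p = max(Q[T], key=Q[T].get)
    output, t = [], T
    while t > 0:
        output.append(p)
        t, p = back[t, p]
    return "".join(reversed(output))
```

Keeping only the single best score per (t, p) state is what makes the search exact under first-order Markov features; extending each state to hold the n best scores yields the n-best lists required by the training algorithm.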
4 Beyond DIRECTL

4.1 Intermediate phonetic representation

We experimented with converting the original Chinese characters to Pinyin as an intermediate representation. Pinyin is the most commonly known Romanization system for Standard Mandarin. Its alphabet contains the same 26 letters as English. Each Chinese character can be transcribed phonetically into Pinyin. Many resources for Pinyin conversion are available online.¹ A small percentage of Chinese characters have multiple pronunciations represented by different Pinyin representations. For those characters (about 30 characters in the transliteration data), we manually selected the pronunciations that are normally used for names. This preprocessing step significantly reduces the number of target symbols from 370 distinct Chinese characters to 26 Pinyin symbols, which enables our system to produce better alignments.

In order to verify whether the addition of language-specific knowledge can improve the overall accuracy, we also designed intermediate representations for Russian and Japanese. We focused on symbols that modify the neighboring characters without producing phonetic output themselves: the hard and soft signs in Russian, and the long vowel and gemination signs in Japanese. Those were combined with the neighboring characters, creating new "super-characters."

4.2 Phonetic alignment with ALINE

ALINE (Kondrak, 2000) is an algorithm that performs phonetically-informed alignment of two strings of phonemes. Since our task requires the alignment of characters representing different writing scripts, we need to first replace every character with the phoneme that is the most likely to be produced by that character.

We applied slightly different methods to the test languages. In converting the Cyrillic script into phonemes, we take advantage of the fact that Russian orthography is largely phonemic, which makes it a relatively straightforward task. In Japanese, we replace each katakana character with one or two phonemes using standard transcription tables. For the Latin script, we simply treat every letter as an IPA symbol (International Phonetic Association, 1999). The IPA contains a subset of 26 letter symbols that tend to correspond to the usual phonetic value that the letter represents in the Latin script. The Chinese characters are first converted to Pinyin, which is then handled in the same way as the Latin script.

Similar solutions could be engineered for other scripts. We observed that the transcriptions do not need to be very precise in order for ALINE to produce high-quality alignments.

4.3 System combination

The combination of predictions produced by systems based on different principles may lead to improved prediction accuracy. We adopt the following combination algorithm. First, we rank the individual systems according to their top-1 accuracy on the development set. To obtain the top-1 prediction for each input word, we use simple voting, with ties broken according to the ranking of the systems. We generalize this approach to handle n-best lists by first ordering the candidate transliterations according to the highest rank assigned by any of the systems, and then similarly breaking ties by voting and system ranking.
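A small Python sketch of this combination scheme follows, under our reading that a candidate receives one vote from every system whose n-best list contains it; the data layout is an assumption for illustration.

```python
from collections import Counter

def combine_nbest(system_outputs):
    """Combine n-best lists from systems sorted by dev-set top-1 accuracy.

    system_outputs -- one best-first candidate list per system,
                      with the best-ranked system first.
    Candidates are ordered by the highest (i.e., smallest) rank any
    system assigns them, with ties broken by votes and then by the
    ranking of the best system that proposed the candidate.
    """
    best_rank, votes, best_system = {}, Counter(), {}
    for s, cands in enumerate(system_outputs):
        for r, cand in enumerate(cands):
            votes[cand] += 1
            best_rank[cand] = min(best_rank.get(cand, r), r)
            best_system.setdefault(cand, s)  # systems arrive best-first
    return sorted(votes, key=lambda c: (best_rank[c], -votes[c], best_system[c]))
```

Taking the first element of the returned list yields the combined top-1 prediction; for the top-1-only variant described first, each system would contribute only its single best candidate.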

¹ For example, http://www.chinesetopinyin.com/
² http://www.cjk.org/

5 Evaluation

In the context of the NEWS 2009 Machine Transliteration Shared Task (Li et al., 2009), we tested our system on six data sets: from English to Chinese (EnCh) (Li et al., 2004), Hindi (EnHi), Russian (EnRu) (Kumaran and Kellner, 2007), Japanese Katakana (EnJa), and Korean (EnKo); and from Japanese name to Japanese Kanji (JnJk).² We optimized the models' parameters by training on the training portion of the provided data and measuring performance on the development portion. For the final testing, we trained the models on all the available labeled data (training plus development data). For each data set, we converted any uppercase letters to lowercase. Our system outputs the top 10 candidate answers for each input word.

Table 1 reports the performance of our system on the development and final test sets, measured in terms of top-1 word accuracy (ACC). For certain language pairs, we tested variants of the base system described in Section 4. DIRECTL refers to our language-independent model, which uses many-to-many alignments. The INT abbreviation denotes the models operating on the language-specific intermediate representations described in Section 4.1. The alignment algorithm (ALINE or M2M) is given in parentheses.

Task  Model        Dev   Test
EnCh  DIRECTL      72.4  71.7
      INT(M2M)     73.9  73.4
      INT(ALINE)   73.8  73.2
      COMBINED     74.8  74.6
EnHi  DIRECTL      41.4  49.8
      DIRECTL+MC   42.3  50.9
EnJa  DIRECTL      49.9  50.0
      INT(M2M)*    49.6  49.2
      INT(ALINE)   48.3  51.0
      COMBINED*    50.6  50.5
EnKo  DIRECTL      36.7  38.7
EnRu  DIRECTL      80.2  61.3
      INT(M2M)     80.3  60.8
      INT(ALINE)   80.0  60.7
      COMBINED*    80.3  60.8
JnJk  DIRECTL      53.5  56.0

Table 1: Top-1 word accuracy on the development and test sets. The asterisk denotes results obtained after the test reference sets were released.

In the EnHi set, many names consisted of multiple words: we assumed a one-to-one correspondence between consecutive English words and consecutive Hindi words. In Table 1, the results in the first row (DIRECTL) were obtained with an automatic cleanup script that replaced hyphens with spaces, deleted the remaining punctuation and numerical symbols, and removed 43 transliteration pairs with a disagreement between the number of source and target words. The results in the second row (DIRECTL+MC) were obtained when the cases with a disagreement were individually examined and corrected by a Hindi speaker.

We did not incorporate any external resources into the models presented in Table 1. In order to emphasize the performance of our language-independent approach, we consistently used the DIRECTL model for generating our "standard" runs on all six language pairs, regardless of its relative performance on the development sets.

6 Discussion

DIRECTL, our language-independent approach to transliteration, achieves excellent results, especially on the EnCh, EnRu, and EnHi data sets, which represent a wide range of language pairs and writing scripts. Both the many-to-many and phonetic alignment algorithms produce high-quality alignments. The former can be applied directly to the training data without the need for an intermediate representation, while the latter does not require any training. Surprisingly, the incorporation of language-specific intermediate representations does not consistently improve the performance of our system, which indicates that DIRECTL may be able to discover the structures implicit in the training data without additional guidance. The EnHi results suggest that manual cleaning of noisy data can yield noticeable gains in accuracy. On the other hand, a simple method of combining predictions from different systems produced a clear improvement on the EnCh set, but mixed results on two other sets. More research on this issue is warranted.

Acknowledgments

This research was supported by Alberta Ingenuity, the Informatics Circle of Research Excellence (iCORE), and the Natural Sciences and Engineering Research Council of Canada (NSERC).

References

Koby Crammer and Yoram Singer. 2003. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3:951–991.

International Phonetic Association. 1999. Handbook of the International Phonetic Association. Cambridge University Press.

Sittichai Jiampojamarn, Grzegorz Kondrak, and Tarek Sherif. 2007. Applying many-to-many alignments and Hidden Markov Models to letter-to-phoneme conversion. In Proc. HLT-NAACL, pages 372–379.

Sittichai Jiampojamarn, Colin Cherry, and Grzegorz Kondrak. 2008. Joint processing and discriminative training for letter-to-phoneme conversion. In Proc. ACL, pages 905–913.

Thorsten Joachims. 1999. Making large-scale SVM learning practical. In Advances in Kernel Methods: Support Vector Learning, pages 169–184. MIT Press.

Grzegorz Kondrak. 2000. A new algorithm for the alignment of phonetic sequences. In Proc. NAACL, pages 288–295.

A. Kumaran and Tobias Kellner. 2007. A generic framework for machine transliteration. In Proc. SIGIR, pages 721–722.

Haizhou Li, Min Zhang, and Jian Su. 2004. A joint source channel model for machine transliteration. In Proc. ACL, pages 159–166.

Haizhou Li, A. Kumaran, Min Zhang, and Vladimir Pervouchine. 2009. Whitepaper of NEWS 2009 machine transliteration shared task. In Proc. ACL-IJCNLP 2009 Named Entities Workshop.

Richard Zens and Hermann Ney. 2004. Improvements in phrase-based statistical machine translation. In Proc. HLT-NAACL, pages 257–264.
