Beyond Parallel Data: Joint Word Alignment and Decipherment Improves Machine Translation

Qing Dou, Ashish Vaswani, and Kevin Knight
Information Sciences Institute, Department of Computer Science, University of Southern California
{qdou,avaswani,knight}@isi.edu

Abstract

Inspired by previous work, where decipherment is used to improve machine translation, we propose a new idea to combine word alignment and decipherment into a single learning process. We use EM to estimate the model parameters, not only to maximize the probability of the parallel corpus, but also that of the monolingual corpus. We apply our approach to improve Malagasy-English machine translation, where only a small amount of parallel data is available. In our experiments, we observe gains of 0.9 to 2.1 Bleu over a strong baseline.

1 Introduction

State-of-the-art machine translation (MT) systems apply statistical techniques to learn translation rules automatically from parallel data. However, this reliance on parallel data seriously limits the scope of MT application in the real world, as for many languages and domains there is not enough parallel data to train a decent quality MT system.

However, compared with parallel data, there are much larger amounts of non parallel data. The ability to learn a translation lexicon, or even to build a machine translation system, using monolingual data helps address the problem of insufficient parallel data. Ravi and Knight (2011) are among the first to learn a full MT system using only non parallel data through decipherment. However, the performance of such systems is much lower than that of systems trained with parallel data. In another line of work, Klementiev et al. (2012) show that, given a phrase table, it is possible to estimate parameters for a phrase-based MT system from non parallel data.

Given that we often have some parallel data, it is more practical to improve a translation system trained on parallel data by using additional non parallel data. Rapp (1995) shows that with a seed lexicon, it is possible to induce new word level translations from non parallel data. Motivated by the idea that a translation lexicon induced from non parallel data can be used to translate out of vocabulary (OOV) words, a variety of prior research has tried to build a translation lexicon from non parallel or comparable data (Fung and Yee, 1998; Koehn and Knight, 2002; Haghighi et al., 2008; Garera et al., 2009; Bergsma and Van Durme, 2011; Daumé and Jagarlamudi, 2011; Irvine and Callison-Burch, 2013b; Irvine and Callison-Burch, 2013a; Irvine et al., 2013).

Lately, there has been increasing interest in learning translation lexicons from non parallel data with decipherment techniques (Ravi and Knight, 2011; Dou and Knight, 2012; Nuhn et al., 2012; Dou and Knight, 2013). Decipherment views one language as a cipher for another and learns a translation lexicon that produces fluent text in the target (plaintext) language. Previous work has shown that decipherment not only helps find translations for OOVs (Dou and Knight, 2012), but also improves translations of observed words (Dou and Knight, 2013).

[Figure 1: Combining word alignment and decipherment into a single learning process.]

We find that previous work using monolingual or comparable data to improve the quality of machine translation separates two learning tasks: first, translation rules are learned from parallel data, and then the information learned from parallel data is used to bootstrap learning with non parallel data. Inspired by approaches where joint inference reduces the problem of error propagation and improves system performance, we combine the two separate learning processes into a single one, as shown in Figure 1. The contributions of this work are:

• We propose a new objective function for word alignment that combines the process of word alignment and decipherment into a single learning task.

• In experiments, we find that the joint process outperforms the previous pipeline approach, and we observe Bleu gains of 0.9 and 2.1 on two different test sets.

• We release 15.3 million tokens of monolingual Malagasy data from the web, as well as a small Malagasy dependency tree bank containing 20k tokens.

2 Joint Word Alignment and Decipherment

2.1 A New Objective Function

In previous work that uses monolingual data to improve machine translation, a seed translation lexicon learned from parallel data is used to find new translations through either word vector based approaches or decipherment. The seed lexicon therefore needs to be selected carefully, as a poor quality seed lexicon could hurt the downstream process. Evidence from a number of previous studies shows that a joint inference process leads to better performance in both tasks (Jiang et al., 2008; Zhang and Clark, 2008).

In the presence of parallel and monolingual data, we would like the alignment and decipherment models to benefit from each other. Since the decipherment and word alignment models both contain word-to-word translation probabilities t(f | e), having them share these parameters during learning allows us to pool information from both data types. This leads us to develop a new objective function that takes both learning processes into account. Given our parallel data, (E^1, F^1), ..., (E^m, F^m), ..., (E^M, F^M), and monolingual data F^1_{mono}, ..., F^n_{mono}, ..., F^N_{mono}, we seek to maximize the likelihood of both. Our new objective function is defined as:

F_{joint} = \sum_{m=1}^{M} \log P(F^m | E^m) + \alpha \sum_{n=1}^{N} \log P(F^n_{mono})    (1)

The goal of training is to learn the parameters that maximize this objective, that is,

\theta^* = \arg\max_\theta F_{joint}    (2)

In the next two sections, we describe the word alignment and decipherment models, and present how they are combined to perform joint optimization.

2.2 Word Alignment

Given a source sentence F = f_1, ..., f_j, ..., f_J and a target sentence E = e_1, ..., e_i, ..., e_I, word alignment models describe the generative process employed to produce the French sentence from the English sentence through alignments a = a_1, ..., a_j, ..., a_J.

The IBM Models 1-2 (Brown et al., 1993) and the HMM word alignment model (Vogel et al., 1996) use two sets of parameters, distortion probabilities and translation probabilities, to define the joint probability of a target sentence and alignment given a source sentence:

P(F, a | E) = \prod_{j=1}^{J} d(a_j | a_{j-1}, j) \, t(f_j | e_{a_j})    (3)

These alignment models share the same translation probabilities t(f_j | e_{a_j}), but differ in their treatment of the distortion probabilities d(a_j | a_{j-1}, j). Brown et al. (1993) introduce more advanced models for word alignment, such as Model 3 and Model 4, which use more parameters to describe the generative process. We do not go into the details of those models here; the reader is referred to the paper describing them.

Under the Model 1-2 and HMM alignment models, the probability of a target sentence given a source sentence is:

P(F | E) = \sum_{a} \prod_{j=1}^{J} d(a_j | a_{j-1}, j) \, t(f_j | e_{a_j})

Let \theta denote all the parameters of the word alignment model. Given a corpus of sentence pairs (E^1, F^1), ..., (E^m, F^m), ..., (E^M, F^M), the standard approach for training is to learn the maximum likelihood estimate of the parameters, that is,

\theta^* = \arg\max_\theta \sum_{m=1}^{M} \log P(F^m | E^m) = \arg\max_\theta \sum_{m=1}^{M} \log \Big( \sum_{a} P(F^m, a | E^m) \Big)

We typically use the EM algorithm (Dempster et al., 1977) to carry out this optimization.
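To make the E step and M step concrete, the following minimal Python sketch runs one EM iteration for IBM Model 1, i.e. Equation 3 with the distortion term held uniform. The data structures and names are our own illustration, not code from the paper.

```python
from collections import defaultdict

def model1_em_iteration(bitext, t):
    """One EM iteration for IBM Model 1 (eq. 3 with uniform distortion).

    bitext: list of (F, E) pairs, each a list of tokens.
    t: dict (f, e) -> current t(f|e), assumed defined (and nonzero)
       for every co-occurring pair.
    Returns the re-estimated translation table.
    """
    counts = defaultdict(float)   # expected counts for each (f, e)
    totals = defaultdict(float)   # per-e normalizers

    # E step: align each f_j to every e_i with posterior weight
    # proportional to the current t(f_j | e_i).
    for F, E in bitext:
        for f in F:
            z = sum(t[(f, e)] for e in E)
            for e in E:
                p = t[(f, e)] / z
                counts[(f, e)] += p
                totals[e] += p

    # M step: renormalize expected counts into new probabilities.
    return {(f, e): c / totals[e] for (f, e), c in counts.items()}
```

In the joint setting of Section 2.4, this M step additionally receives weighted expected counts computed from non parallel data.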

2.3 Decipherment

Given a corpus of N foreign text sequences (ciphertext), F^1_{mono}, ..., F^n_{mono}, ..., F^N_{mono}, decipherment finds word-to-word translations that best describe the ciphertext.

Knight et al. (2006) are the first to study several natural language decipherment problems with unsupervised learning. Since then, there has been increasing interest in improving decipherment techniques and in their application to machine translation (Ravi and Knight, 2011; Dou and Knight, 2012; Nuhn et al., 2012; Dou and Knight, 2013; Nuhn et al., 2013).

In order to speed up decipherment, Dou and Knight (2012) suggest that a frequency list of bigrams might contain enough information for decipherment. According to them, a monolingual ciphertext bigram F_{mono} is generated through the following generative story:

• Generate a sequence of two plaintext tokens e_1 e_2 with probability P(e_1 e_2), given by a language model built from large numbers of plaintext bigrams.

• Substitute e_1 with f_1 and e_2 with f_2, with probability t(f_1 | e_1) \cdot t(f_2 | e_2).

The probability of any cipher bigram F_{mono} is:

P(F_{mono}) = \sum_{e_1 e_2} P(e_1 e_2) \cdot t(f_1 | e_1) \cdot t(f_2 | e_2)    (4)

And the probability of the corpus is:

P(corpus) = \prod_{n=1}^{N} P(F^n_{mono})    (5)
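Read literally, Equations 4 and 5 amount to the brute-force computation sketched below (our naming, not the authors' code); the double loop over the plaintext vocabulary is exactly the per-bigram cost discussed in the next paragraph.

```python
import math

def corpus_log_likelihood(cipher_bigrams, lm, t, plaintext_vocab):
    """Brute-force log of eq. (5), with eq. (4) as the inner sum.

    cipher_bigrams: dict (f1, f2) -> corpus count;
    lm: dict (e1, e2) -> P(e1 e2); t: dict (f, e) -> t(f|e).
    """
    ll = 0.0
    for (f1, f2), n in cipher_bigrams.items():
        # eq. (4): marginalize over all plaintext bigrams e1 e2.
        p = sum(lm.get((e1, e2), 0.0)
                * t.get((f1, e1), 0.0) * t.get((f2, e2), 0.0)
                for e1 in plaintext_vocab for e2 in plaintext_vocab)
        # eq. (5): bigram tokens are independent; guard against p = 0.
        ll += n * math.log(max(p, 1e-300))
    return ll
```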

Given a plaintext bigram language model, the goal is to manipulate t(f | e) to maximize P(corpus). Theoretically, one can directly apply EM to solve the problem (Knight et al., 2006). However, EM has time complexity O(N \cdot V_e^2) and space complexity O(V_f \cdot V_e), where V_f and V_e are the sizes of the ciphertext and plaintext vocabularies respectively, and N is the number of cipher bigrams.

There have been previous attempts to make decipherment faster. Ravi and Knight (2011) apply Bayesian learning to reduce the space complexity. However, Bayesian decipherment is still very slow with Gibbs sampling (Geman and Geman, 1987). Dou and Knight (2012) make sampling faster by introducing slice sampling (Neal, 2000) to Bayesian decipherment. Besides Bayesian decipherment, Nuhn et al. (2013) show that beam search can be used to solve a very large 1:1 word substitution cipher. In subsection 2.4.1, we describe our approach, which uses slice sampling to compute expected counts for decipherment in the EM algorithm.

2.4 Joint Optimization

[Figure 2: Joint word alignment and decipherment with EM.]

We now describe our EM approach to learn the parameters that maximize F_{joint} (Equation 2), where the distortion probabilities d(a_j | a_{j-1}, j) in the word alignment model are learned only from parallel data, and the translation probabilities t(f | e) are learned using both parallel and non parallel data. The E step and M step are illustrated in Figure 2.

Our algorithm starts with EM learning only on parallel data for a few iterations. When the joint inference starts, we first compute expected counts from parallel data and from non parallel data separately, using parameter values from the last M step. Then, we add the expected counts from parallel data and non parallel data together, with different weights for the two. Finally, we renormalize the translation table and the distortion table to update the parameters in the new M step.

The E step for the parallel part can be computed efficiently using the forward-backward algorithm (Vogel et al., 1996). However, as we pointed out in Section 2.3, the E step for the non parallel part has a time complexity of O(V^2) with the forward-backward algorithm, where V is the size of the English vocabulary, which is usually very large. Previous work has tried to make decipherment scalable (Ravi and Knight, 2011; Dou and Knight, 2012; Nuhn et al., 2013; Ravi, 2013). However, all of these approaches are designed for decipherment with either Bayesian inference or beam search. In contrast, we need an algorithm that makes EM decipherment scalable. To overcome this problem, we modify the slice sampling (Neal, 2000) approach used by Dou and Knight (2012) to compute the expected counts from non parallel data needed for the EM algorithm.
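The count pooling just described can be summarized in a short sketch (ours, under the assumption that the monolingual counts already carry the weight \alpha as computed in Section 2.4.2):

```python
from collections import defaultdict

def joint_m_step(parallel_counts, mono_counts):
    """M step of the joint procedure: pool expected counts for t(f|e).

    parallel_counts: (f, e) -> expected count from the parallel E step
        (forward-backward). mono_counts: (f, e) -> expected count from
        the decipherment E step (Sections 2.4.1-2.4.2), assumed to be
        pre-weighted by alpha. Distortion counts (not shown) are
        re-estimated from parallel data alone.
    """
    combined = defaultdict(float)
    for pair, c in parallel_counts.items():
        combined[pair] += c
    for pair, c in mono_counts.items():
        combined[pair] += c          # pooled monolingual evidence

    totals = defaultdict(float)      # normalizer per English word e
    for (f, e), c in combined.items():
        totals[e] += c
    return {(f, e): c / totals[e] for (f, e), c in combined.items()}
```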

2.4.1 Draw Samples with Slice Sampling

To start the sampling process, we initialize the first sample by performing approximate Viterbi decoding using results from the last EM iteration. For each foreign dependency bigram f_1, f_2, we find the top 50 candidates for f_1 and f_2 ranked by t(e | f), and find the English sequence e_1, e_2 that maximizes t(e_1 | f_1) \cdot t(e_2 | f_2) \cdot P(e_1, e_2).

Suppose the derivation probability for the current sample e_current is P(e_current). We use slice sampling to draw a new sample in two steps:

• Select a threshold T uniformly between 0 and P(e_current).

• Draw a new sample e_new uniformly from the pool of candidates {e_new | P(e_new) > T}.

The first step is straightforward to implement. However, it is not trivial to implement the second step. We adapt the idea from Dou and Knight (2012) for EM learning.

Suppose our current sample e_current contains English tokens e_{i-1}, e_i, and e_{i+1} at positions i-1, i, and i+1 respectively, and let f_i be the foreign token at position i. Using point-wise sampling, we draw a new sample by changing token e_i to a new token e'. Since the rest of the sample remains the same, only the trigram probability P(e_{i-1} e' e_{i+1}) (given by a bigram language model) and the channel model probability t(f_i | e') change. Therefore, the probability of a sample is simplified as shown in Equation 6:

P(e_{i-1} \, e' \, e_{i+1}) \cdot t(f_i | e')    (6)

Remember that in slice sampling, a new sample is drawn in two steps. For the first step, we choose a threshold T uniformly between 0 and P(e_{i-1} e_i e_{i+1}) \cdot t(f_i | e_i). We divide the second step into two cases, based on the observation that two types of samples are more likely to have a probability higher than T (Dou and Knight, 2012): (1) those whose trigram probability is high, and (2) those whose channel model probability is high. To find candidates that have a high trigram probability, Dou and Knight (2012) build top-k sorted lists ranked by P(e_{i-1} e' e_{i+1}), which can be pre-computed offline. Then, they test whether the last item e_k in the list satisfies the following inequality:

P(e_{i-1} \, e_k \, e_{i+1}) \cdot c < T    (7)

where c is a small constant, set to the prior in their work. In contrast, we choose c empirically, as we do not have a prior in our model. When the inequality in Equation 7 is satisfied, a sample is drawn in the following way: let set A = {e' | P(e_{i-1} e' e_{i+1}) \cdot c > T} and set B = {e' | t(f_i | e') > c}. Then we only need to sample e' uniformly from A ∪ B until P(e_{i-1} e' e_{i+1}) \cdot t(f_i | e') is greater than T. It is easy to prove that all other candidates, which are not in the sorted list and have t(f_i | e') ≤ c, have an upper bound on their probability: P(e_{i-1} e_k e_{i+1}) \cdot c. Therefore, they do not need to be considered.

Second, when the last item e_k in the list does not meet the condition in Equation 7, we keep drawing samples e' randomly until the sample's probability is greater than the threshold T.

As we mentioned before, the choice of the small constant c is empirical. A large c reduces the number of items in set B, but makes the condition P(e_{i-1} e_k e_{i+1}) \cdot c < T less likely to be satisfied, which slows down the sampling. On the contrary, a small c increases the number of items in set B significantly, as EM does not encourage a sparse distribution, which also slows down the sampling. In our experiments, we set c to 0.001 based on the speed of decipherment. Furthermore, to reduce the size of set B, we rank all the candidate translations of f_i by t(e | f_i), and add at most the first 1000 candidates whose t(f_i | e') >= c into set B. For the rest of the candidates, we set t(f_i | e') to a value smaller than c (0.00001 in our experiments).

2.4.2 Compute Expected Counts from Samples

With the ability to draw samples efficiently for decipherment using EM, we now describe how to compute expected counts from those samples. Let f_1, f_2 be a specific ciphertext bigram, N be the number of samples we want to use to compute expected counts, and e_1, e_2 be one of the N samples. The expected counts for the pairs (f_1, e_1) and (f_2, e_2) are computed as:

\alpha \cdot \frac{count(f_1, f_2)}{N}

where count(f_1, f_2) is the count of the bigram, and \alpha is the weight for non parallel data, as shown in Equation 1. Expected counts for f_1, f_2 are accumulated from each of its N samples. Finally, we collect expected counts using the same approach from each foreign bigram.
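Putting Sections 2.4.1 and 2.4.2 together, the procedure looks roughly like the following sketch (our naming; the `pool` argument stands for A ∪ B, whose Equation 7 based construction is omitted here). The resulting counts feed the `joint_m_step` sketch shown earlier.

```python
import random
from collections import defaultdict

def slice_sample_position(e_left, e_right, f, e_curr, lm3, t, pool):
    """One point-wise slice sampling move (Section 2.4.1), simplified.

    lm3(a, b, c) returns the trigram probability P(a b c); pool is a
    non-empty list playing the role of A | B and is assumed to contain
    e_curr with nonzero score, so the loop terminates.
    """
    def score(e):                                  # eq. (6)
        return lm3(e_left, e, e_right) * t.get((f, e), 0.0)

    T = random.uniform(0.0, score(e_curr))         # step 1: threshold
    while True:                                    # step 2: uniform redraws
        e_new = random.choice(pool)
        if score(e_new) > T:
            return e_new

def decipherment_expected_counts(cipher_bigrams, samples, alpha):
    """Accumulate expected counts as in Section 2.4.2: each of the N
    samples (e1, e2) drawn for bigram (f1, f2) contributes
    alpha * count(f1, f2) / N to the pairs (f1, e1) and (f2, e2)."""
    counts = defaultdict(float)
    for (f1, f2), n in cipher_bigrams.items():
        drawn = samples[(f1, f2)]                  # the N samples
        for e1, e2 in drawn:
            w = alpha * n / len(drawn)
            counts[(f1, e1)] += w
            counts[(f2, e2)] += w
    return counts
```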

3 Word Alignment Experiments

In this section, we show that joint word alignment and decipherment improves the quality of word alignment. We choose to evaluate word alignment performance for Spanish and English, as manual gold alignments are available. In our experiments, our approach improves alignment F-score by as much as 8 points.

3.1 Experiment Setup

As shown in Table 1, we work with a small amount of parallel, manually aligned Spanish-English data (Lambert et al., 2005), and a much larger amount of monolingual data.

              Spanish      English
Parallel      10.3k        9.9k
Non Parallel  80 million   400 million

Table 1: Size of parallel and non parallel data for word alignment experiments (measured in number of tokens)

The parallel data is extracted from Europarl, which consists of articles from European Parliament plenary sessions. The monolingual data comes from the English and Spanish versions of the Gigaword corpora, which contain news articles from different news agencies.

We view Spanish as a cipher of English, and follow the approach proposed by Dou and Knight (2013) to extract dependency bigrams from parsed Spanish and English monolingual data for decipherment. We only keep bigrams where both tokens appear in the parallel data. Then, we perform Spanish to English (English generating Spanish) word alignment and Spanish to English decipherment simultaneously with the method discussed in Section 2.
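As an illustration of the dependency bigram extraction and filtering just described, a small sketch follows; the (token, head index) input format is our assumption, not a format specified in the paper.

```python
def dependency_bigrams(parsed_sentences, parallel_vocab):
    """Extract (head, modifier) dependency bigrams for decipherment,
    in the spirit of Dou and Knight (2013).

    Each parsed sentence is a list of (token, head_index) pairs, with
    head_index == -1 marking the root. Bigrams containing a token
    unseen in the parallel data are dropped, as in Section 3.1.
    """
    bigrams = []
    for sent in parsed_sentences:
        for token, head in sent:
            if head >= 0:                      # skip the root token
                h = sent[head][0]
                if h in parallel_vocab and token in parallel_vocab:
                    bigrams.append((h, token))
    return bigrams
```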
3.1.1 Results

We align all 500 sentences in the parallel corpus, and tune the decipherment weight (α) for Model 1 and HMM using the last 100 sentences. The best weights are 0.1 for Model 1, and 0.005 for HMM. We start with Model 1 on only parallel data for 5 iterations, then switch to the joint process for another 5 iterations with Model 1 and 5 more iterations of HMM. In the end, we use the first 100 sentence pairs of the corpus for evaluation.

Figure 3 compares the learning curves of alignment F-score for EM without decipherment (the baseline) and for our joint word alignment and decipherment. From the learning curve, we find that at the 6th iteration, 2 iterations after we start the joint process, alignment F-score is improved from 34 to 43, and this improvement is held through the rest of the Model 1 iterations. The alignment model switches to HMM from the 11th iteration, and at the 12th iteration, we see a sudden jump in F-score for both the baseline and the joint approach. We see consistent improvement of F-score until the end of the HMM iterations.

4 Improving Machine Translation for Low Density Languages with Joint Word Alignment and Decipherment

In the previous section, we showed that the joint word alignment and decipherment process significantly improves the quality of word alignment for Spanish and English. In this section, we test our approach in a more challenging setting: improving the quality of machine translation in a real low density language setting.

In this task, our goal is to build a system to translate Malagasy news into English. We have a small amount of parallel data, and larger amounts of monolingual data collected from online websites. We build a dependency parser for Malagasy to parse the monolingual data to perform dependency based decipherment (Dou and Knight, 2013). In the end, we perform joint word alignment and decipherment, and show that the joint learning process improves Bleu scores by up to 2.1 points over a phrase-based MT baseline.

4.1 The Malagasy Language

Malagasy is the official language of Madagascar. It has around 18 million native speakers. Although Madagascar is an African country, Malagasy belongs to the Malayo-Polynesian branch of the Austronesian language family. Malagasy and English have very different word orders. First of all, in contrast to English, which has subject-verb-object (SVO) word order, Malagasy has verb-object-subject (VOS) word order. Besides that, Malagasy is a typical head initial language: determiners precede nouns, while other modifiers and relative clauses follow nouns (e.g. ny "the" boky "book" mena "red"). The significant differences in word order pose great challenges for both machine translation and decipherment.

4.2 Data

Table 2 shows the data available to us in our experiments.

Source         Malagasy      English
Parallel
Global Voices  2.0 million   1.8 million
Web News       2.2k          2.1k
Non Parallel
Gigaword       N/A           2.4 billion
allAfrica      N/A           396 million
Local News     15.3 million  N/A

Table 2: Size of Malagasy and English data used in our experiments (measured in number of tokens)

The majority of parallel text comes from Global Voices¹ (GV). The website contains international news translated into different foreign languages. Besides that, we also have a very small amount of parallel text containing local web news, with English translations provided by native speakers at the University of Texas, Austin. The Malagasy side of this small parallel corpus also has syntactic annotation, which is used to train a very basic Malagasy part of speech tagger and dependency parser.

We also have much larger amounts of non parallel data for both languages. For Malagasy, we spent two months manually collecting 15.3 million tokens of news text from local news websites in Madagascar.² We have released this data for future research use. For English, we have 2.4 billion tokens from the Gigaword corpus. Since the Malagasy monolingual data is collected from local websites, it is reasonable to argue that those data contain a significant amount of information related to Africa. Therefore, we also collect 396 million tokens of African news in English from allAfrica.com.

4.3 Building a Dependency Parser for Malagasy

Since Malagasy and English have very different word orders, we decide to apply dependency based decipherment for the two languages, as suggested by Dou and Knight (2013). To extract dependency relations, we need to parse monolingual data in Malagasy and English. For English, there are already many good parsers available. In our experiments, we use the Turbo parser (Martins et al., 2013), trained on the English Penn Treebank (Marcus et al., 1993), to parse all our English monolingual data. However, there is no existing good parser for Malagasy.

The quality of a dependency parser depends on the amount of training data available. State-of-the-art English parsers are built from the Penn Treebank, which contains over 1 million tokens of annotated syntactic trees.

¹globalvoicesonline.org
²aoraha.com, gazetiko.com, inovaovao.com, expressmada.com, lakroa.com

[Figure 3: Learning curve showing that our joint word alignment and decipherment approach improves word alignment quality over traditional EM without decipherment (Model 1: iterations 1 to 10; HMM: iterations 11 to 15)]

In contrast, the available data for training a Malagasy parser is rather limited, with only 168 sentences and 2.8k tokens, as shown in Table 2. At the very beginning, we use the last 120 sentences as training data to train a part of speech (POS) tagger, using a toolkit provided by Garrette et al. (2013), and a dependency parser, using the Turbo parser. We test the performance of the parser on the first 48 sentences and obtain 72.4% accuracy.

One obvious way to improve tagging and parsing accuracy is to get more annotated data. We find more data with only part of speech tags, containing 465 sentences and 10k tokens, released by Garrette et al. (2013), and add this data as extra training data for the POS tagger. Also, we download an online dictionary that contains POS tags for over 60k Malagasy word types from malagasyword.org. The dictionary is very helpful for tagging words never seen in the training data.

It is natural to think that the creation of annotated data for training a POS tagger and a parser requires a large amount of effort from annotators who understand the language well. However, we find that, with the help of parallel data and dictionaries, we are able to create more annotated data ourselves to improve tagging and parsing accuracy. This idea is inspired by previous work that tries to learn a semi-supervised parser by projecting dependency relations from one language (with good dependency parsers) to another (Yarowsky and Ngai, 2001; Ganchev et al., 2009). However, we find those automatic approaches do not work well for Malagasy.

To further expand our Malagasy training data, we first use a POS tagger and parser with poor performance to parse 788 sentences (20k tokens) on the Malagasy side of the parallel corpus from Global Voices. Then, we correct both the dependency links and the POS tags based on information from dictionaries³ and the English translation of the parsed sentence. We spent 3 months manually projecting English dependencies to Malagasy, and eventually improve test set parsing accuracy from 72.4% to 80.0%. We also make this data available for future research use.

³An online dictionary from malagasyword.org, as well as a lexicon learned from the parallel data.

4.4 Machine Translation Experiments

In this section, we present the data used for our MT experiments, and compare three different systems to justify our joint word alignment and decipherment approach.

4.4.1 Baseline Machine Translation System

We build a state-of-the-art phrase-based MT system, PBMT, using Moses (Koehn et al., 2007). PBMT has 3 models: a translation model, a distortion model, and a language model. We train the first two models using half of the Global Voices parallel data (the rest is reserved for development and testing), and build a 5-gram language model using 834 million tokens from the AFP section of English Gigaword, 396 million tokens from allAfrica, and the English part of the parallel corpus used for training. For alignment, we run 10 iterations of Model 1, and 5 iterations of HMM. We do not run Model 3 and Model 4, as we see no improvements in Bleu scores from running those models. We do word alignment in both directions and use the grow-diag-final-and heuristic to obtain the final alignment.

During decoding, we use the 8 standard features in Moses to score a candidate translation: direct and inverse translation probabilities, direct and inverse lexical weighting, a language model score, a distortion score, a phrase penalty, and a word penalty. The weights for the features are learned on the tuning data using minimum error rate training (MERT) (Och, 2003).

              Malagasy      English
Parallel
Train (GV)    0.9 million   0.8 million
Tune (GV)     22.2k         20.2k
Test (GV)     23k           21k
Test (Web)    2.2k          2.1k
Non Parallel
Gigaword      N/A           834 million
Web           15.3 million  396 million

Table 3: Size of training, tuning, and testing data in number of tokens (GV: Global Voices)

To compare with the previous decipherment approach to improving machine translation, we build a second baseline system. We follow the work by Dou and Knight (2013) to decipher Malagasy into English, and build a translation lexicon T_decipher from decipherment. To improve machine translation, we simply use T_decipher as an additional parallel corpus. First, we filter T_decipher by keeping only translation pairs (f, e) where f is observed in the Malagasy part and e is observed in the English part of the parallel corpus. Then we append all the Malagasy and English words in the filtered T_decipher to the end of the Malagasy part and the English part of the parallel corpus, respectively. The training and tuning process is the same as for the baseline machine translation system PBMT. We call this system Decipher-Pipeline.

4.4.2 Joint Word Alignment and Decipherment for Machine Translation

When deciphering Malagasy to English, we extract Malagasy dependency bigrams using all available Malagasy monolingual data plus the Malagasy part of the Global Voices parallel data, and we extract English dependency bigrams, using 834 million tokens from English Gigaword and 396 million tokens from allAfrica news, to build an English dependency language model. In the other direction, we extract English dependency bigrams from the English part of the entire parallel corpus plus 9.7 million tokens from allAfrica news⁴, and use 17.3 million tokens of Malagasy monolingual data (15.3 million from the web and 2.0 million from Global Voices) to build a Malagasy dependency language model. We require that all dependency bigrams contain only words observed in the parallel data used to train the baseline MT system.

During learning, we run Model 1 without decipherment for 5 iterations. Then we perform joint word alignment and decipherment for another 5 iterations with Model 1, and 5 iterations with HMM. We tune the decipherment weights (α) for Model 1 and HMM using grid search against Bleu score on a development set. In the end, we only extract rules from one direction, P(English | Malagasy), where the decipherment weights for Model 1 and HMM are 0.5 and 0.005 respectively; we chose this because we did not find any benefit from tuning the weights for each direction. We then use the grow-diag-final-end heuristic to form the final alignments. We call this system Decipher-Joint.

⁴We do not find further Bleu gains from using more English monolingual data.

4.5 Results

We tune each system three times with MERT and choose the best weights based on Bleu scores on the tuning set.

Table 4 shows that while using a translation lexicon learned from decipherment does not improve the quality of machine translation significantly, the joint approach improves the Bleu score by 0.9 and 2.1 on the Global Voices test set and the web news test set respectively. The results also show that parsing quality correlates with gains in Bleu scores: scores in brackets in the last row of the table are achieved using a dependency parser with 72.4% attachment accuracy, while scores outside the brackets are obtained using a dependency parser with 80.0% attachment accuracy.

We analyze the results and find that the gain mainly comes from two sources. First, adding expected counts from non parallel data makes the distribution of translation probabilities sparser in the word alignment models. The probabilities of translation pairs favored by both the parallel data and decipherment become higher. This gain is consistent with previous observations where a sparse prior applied to EM helps improve word alignment and machine translation (Vaswani et al., 2012). Second, expected counts from decipherment also help discover new translation pairs in the parallel data for low frequency words, where those words are either aligned to NULL or to wrong translations in the baseline.

5 Conclusion and Future Work

We propose a new objective function for word alignment that combines the processes of word alignment and decipherment into a single task. In experiments, we find that the joint process performs better than the previous pipeline approach, and we observe Bleu gains of 0.9 and 2.1 points on the Global Voices and local web news test sets, respectively. Finally, our research leads to the release of 15.3 million tokens of monolingual Malagasy data from the web, as well as a small Malagasy dependency tree bank containing 20k tokens.

Given the positive results we obtain by using the joint approach to improve word alignment, we are inspired to apply this approach to help find translations for out of vocabulary words, and to explore other possible ways to improve machine translation with decipherment.

Decipherment  System             Tune (GV)    Test (GV)    Test (Web)
None          PBMT (Baseline)    18.5         17.1         7.7
Separate      Decipher-Pipeline  18.5         17.4         7.7
Joint         Decipher-Joint     18.9 (18.7)  18.0 (17.7)  9.8 (8.5)

Table 4: Decipher-Pipeline does not show significant improvement over the baseline system. In contrast, Decipher-Joint, which uses the joint word alignment and decipherment approach, achieves Bleu gains of 0.9 and 2.1 on the Global Voices test set and the web news test set, respectively. The results in brackets are obtained using a parser trained with only 120 sentences. (GV: Global Voices)

Acknowledgments

This work was supported by NSF Grant 0904684 and ARO grant W911NF-10-1-0533. The authors would like to thank David Chiang, Malte Nuhn, Victoria Fossum, Ashish Vaswani, Ulf Hermjakob, Yang Gao, and Hui Zhang (in no particular order) for their comments and suggestions.

References

Shane Bergsma and Benjamin Van Durme. 2011. Learning bilingual lexicons using the visual similarity of labeled web images. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three. AAAI Press.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19:263–311.

Hal Daumé III and Jagadeesh Jagarlamudi. 2011. Domain adaptation for machine translation by mining unseen words. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.

Arthur Dempster, Nan Laird, and Donald Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38.

Qing Dou and Kevin Knight. 2012. Large scale decipherment for out-of-domain machine translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics.

Qing Dou and Kevin Knight. 2013. Dependency-based decipherment for resource-limited machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Pascale Fung and Lo Yuen Yee. 1998. An IR approach for translating new words from nonparallel, comparable texts. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 1. Association for Computational Linguistics.

Kuzman Ganchev, Jennifer Gillenwater, and Ben Taskar. 2009. Dependency grammar induction via bitext projection constraints. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1. Association for Computational Linguistics.

Nikesh Garera, Chris Callison-Burch, and David Yarowsky. 2009. Improving translation lexicon induction from monolingual corpora via dependency contexts and part-of-speech equivalences. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning. Association for Computational Linguistics.

Dan Garrette, Jason Mielens, and Jason Baldridge. 2013. Real-world semi-supervised learning of POS-taggers for low-resource languages. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.

Stuart Geman and Donald Geman. 1987. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. In Readings in Computer Vision: Issues, Problems, Principles, and Paradigms. Morgan Kaufmann Publishers Inc.

Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proceedings of ACL-08: HLT. Association for Computational Linguistics.

Ann Irvine and Chris Callison-Burch. 2013a. Combining bilingual and comparable corpora for low resource machine translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation. Association for Computational Linguistics.

Ann Irvine and Chris Callison-Burch. 2013b. Supervised bilingual lexicon induction with multiple monolingual signals. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.

Ann Irvine, Chris Quirk, and Hal Daumé III. 2013. Monolingual marginal matching for translation model adaptation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Wenbin Jiang, Liang Huang, Qun Liu, and Yajuan Lü. 2008. A cascaded linear model for joint Chinese word segmentation and part-of-speech tagging. In Proceedings of ACL-08: HLT. Association for Computational Linguistics.

Alexandre Klementiev, Ann Irvine, Chris Callison-Burch, and David Yarowsky. 2012. Toward statistical machine translation without parallel corpora. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics.

Kevin Knight, Anish Nair, Nishit Rathod, and Kenji Yamada. 2006. Unsupervised analysis for decipherment problems. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions. Association for Computational Linguistics.

Philipp Koehn and Kevin Knight. 2002. Learning a translation lexicon from monolingual corpora. In Proceedings of the ACL-02 Workshop on Unsupervised Lexical Acquisition. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions. Association for Computational Linguistics.

Patrik Lambert, Adrià De Gispert, Rafael Banchs, and José B. Mariño. 2005. Guidelines for word alignment evaluation and manual alignment. Language Resources and Evaluation, 39(4):267–285.

Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2).

André Martins, Miguel Almeida, and Noah A. Smith. 2013. Turning on the Turbo: Fast third-order non-projective Turbo parsers. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics.

Radford Neal. 2000. Slice sampling. Annals of Statistics, 31.

Malte Nuhn, Arne Mauser, and Hermann Ney. 2012. Deciphering foreign language by combining language models and context vectors. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1. Association for Computational Linguistics.

Malte Nuhn, Julian Schamper, and Hermann Ney. 2013. Beam search for solving substitution ciphers. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Reinhard Rapp. 1995. Identifying word translations in non-parallel texts. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Sujith Ravi and Kevin Knight. 2011. Deciphering foreign language. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.

Sujith Ravi. 2013. Scalable decipherment for machine translation via hash sampling. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Ashish Vaswani, Liang Huang, and David Chiang. 2012. Smaller alignment models for better translations: Unsupervised word alignment with the l0-norm. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1. Association for Computational Linguistics.

Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In Proceedings of the 16th Conference on Computational Linguistics - Volume 2. Association for Computational Linguistics.

David Yarowsky and Grace Ngai. 2001. Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies. Association for Computational Linguistics.

Yue Zhang and Stephen Clark. 2008. Joint word segmentation and POS tagging using a single perceptron. In Proceedings of ACL-08: HLT, Columbus, Ohio. Association for Computational Linguistics.
