Beyond Parallel Data: Joint Word Alignment and Decipherment Improves Machine Translation

Qing Dou, Ashish Vaswani, and Kevin Knight
Information Sciences Institute, Department of Computer Science, University of Southern California
{qdou,avaswani,knight}@isi.edu

Abstract

Inspired by previous work, where decipherment is used to improve machine translation, we propose a new idea to combine word alignment and decipherment into a single learning process. We use EM to estimate the model parameters, not only to maximize the probability of the parallel corpus, but also that of the monolingual corpus. We apply our approach to improve Malagasy-English machine translation, where only a small amount of parallel data is available. In our experiments, we observe gains of 0.9 to 2.1 Bleu over a strong baseline.

1 Introduction

State-of-the-art machine translation (MT) systems apply statistical techniques to learn translation rules automatically from parallel data. However, this reliance on parallel data seriously limits the scope of MT application in the real world, as for many languages and domains there is not enough parallel data to train a decent quality MT system.

However, compared with parallel data, there are much larger amounts of non parallel data. The ability to learn a translation lexicon, or even to build a machine translation system, using monolingual data helps address the problem of insufficient parallel data. Ravi and Knight (2011) are among the first to learn a full MT system using only non parallel data through decipherment. However, the performance of such systems is much lower than that of systems trained with parallel data. In another line of work, Klementiev et al. (2012) show that, given a phrase table, it is possible to estimate parameters for a phrase-based MT system from non parallel data.

Given that we often have some parallel data, it is more practical to improve a translation system trained on parallel data by using additional non parallel data. Rapp (1995) shows that with a seed lexicon, it is possible to induce new word level translations from non parallel data. Motivated by the idea that a translation lexicon induced from non parallel data can be used to translate out of vocabulary (OOV) words, a variety of prior research has tried to build a translation lexicon from non parallel or comparable data (Fung and Yee, 1998; Koehn and Knight, 2002; Haghighi et al., 2008; Garera et al., 2009; Bergsma and Van Durme, 2011; Daumé and Jagarlamudi, 2011; Irvine and Callison-Burch, 2013b; Irvine and Callison-Burch, 2013a; Irvine et al., 2013).

Lately, there has been increasing interest in learning translation lexicons from non parallel data with decipherment techniques (Ravi and Knight, 2011; Dou and Knight, 2012; Nuhn et al., 2012; Dou and Knight, 2013). Decipherment views one language as a cipher for another and learns a translation lexicon that produces fluent text in the target (plaintext) language. Previous work has shown that decipherment not only helps find translations for OOVs (Dou and Knight, 2012), but also improves translations of observed words (Dou and Knight, 2013).

[Figure 1: Combining word alignment and decipherment into a single learning process.]

We find that previous work using monolingual or comparable data to improve the quality of machine translation separates two learning tasks: first, translation rules are learned from parallel data, and then the information learned from parallel data is used to bootstrap learning with non parallel data. Inspired by approaches where joint inference reduces the problem of error propagation and improves system performance, we combine the two separate learning processes into a single one, as shown in Figure 1. The contributions of this work are:

• We propose a new objective function for word alignment that combines the process of word alignment and decipherment into a single learning task.

• In experiments, we find that the joint process outperforms the previous pipeline approach, and we observe Bleu gains of 0.9 and 2.1 on two different test sets.

• We release 15.3 million tokens of monolingual Malagasy data from the web, as well as a small Malagasy dependency tree bank containing 20k tokens.

2 Joint Word Alignment and Decipherment

2.1 A New Objective Function

In previous work that uses monolingual data to improve machine translation, a seed translation lexicon learned from parallel data is used to find new translations through either word vector based approaches or decipherment. The seed lexicon therefore needs to be selected carefully, as a poor quality seed lexicon could hurt the downstream process. Evidence from a number of previous studies shows that a joint inference process leads to better performance in both tasks (Jiang et al., 2008; Zhang and Clark, 2008).

In the presence of parallel and monolingual data, we would like the alignment and decipherment models to benefit from each other. Since the decipherment and word alignment models both contain word-to-word translation probabilities t(f | e), having them share these parameters during learning allows us to pool information from both data types. This leads us to develop a new objective function that takes both learning processes into account. Given our parallel data, (E^1, F^1), ..., (E^m, F^m), ..., (E^M, F^M), and monolingual data F^1_{mono}, ..., F^n_{mono}, ..., F^N_{mono}, we seek to maximize the likelihood of both. Our new objective function is defined as:

F_{joint} = \sum_{m=1}^{M} \log P(F^m | E^m) + \alpha \sum_{n=1}^{N} \log P(F^n_{mono})    (1)

The goal of training is to learn the parameters that maximize this objective, that is,

\theta^* = \arg\max_\theta F_{joint}    (2)

In the next two sections, we describe the word alignment and decipherment models, and present how they are combined to perform joint optimization.

2.2 Word Alignment

Given a source sentence F = f_1, ..., f_j, ..., f_J and a target sentence E = e_1, ..., e_i, ..., e_I, word alignment models describe the generative process employed to produce the French sentence from the English sentence through alignments a = a_1, ..., a_j, ..., a_J.

The IBM Models 1-2 (Brown et al., 1993) and the HMM word alignment model (Vogel et al., 1996) use two sets of parameters, distortion probabilities and translation probabilities, to define the joint probability of a target sentence and alignment given a source sentence:

P(F, a | E) = \prod_{j=1}^{J} d(a_j | a_{j-1}, j) \, t(f_j | e_{a_j})    (3)

These alignment models share the same translation probabilities t(f_j | e_{a_j}), but differ in their treatment of the distortion probabilities d(a_j | a_{j-1}, j). Brown et al. (1993) introduce more advanced models for word alignment, such as Model 3 and Model 4, which use more parameters to describe the generative process. We do not go into the details of those models here; the reader is referred to the paper describing them.

Under the Model 1-2 and HMM alignment models, the probability of a target sentence given a source sentence is:

P(F | E) = \sum_{a} \prod_{j=1}^{J} d(a_j | a_{j-1}, j) \, t(f_j | e_{a_j})

Let \theta denote all the parameters of the word alignment model. Given a corpus of sentence pairs (E^1, F^1), ..., (E^m, F^m), ..., (E^M, F^M), the standard approach for training is to learn the maximum likelihood estimate of the parameters, that is,

\theta^* = \arg\max_\theta \sum_{m=1}^{M} \log P(F^m | E^m) = \arg\max_\theta \sum_{m=1}^{M} \log \Big( \sum_{a} P(F^m, a | E^m) \Big)

We typically use the EM algorithm (Dempster et al., 1977) to carry out this optimization.
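To make the E step and M step concrete, the following minimal Python sketch runs one EM iteration for IBM Model 1, i.e. Equation 3 with the distortion term held uniform. The data structures and names are our own illustration, not code from the paper.

```python
from collections import defaultdict

def model1_em_iteration(bitext, t):
    """One EM iteration for IBM Model 1 (eq. 3 with uniform distortion).

    bitext: list of (F, E) pairs, each a list of tokens.
    t: dict (f, e) -> current t(f|e), assumed defined (and nonzero)
       for every co-occurring pair.
    Returns the re-estimated translation table.
    """
    counts = defaultdict(float)   # expected counts for each (f, e)
    totals = defaultdict(float)   # per-e normalizers

    # E step: align each f_j to every e_i with posterior weight
    # proportional to the current t(f_j | e_i).
    for F, E in bitext:
        for f in F:
            z = sum(t[(f, e)] for e in E)
            for e in E:
                p = t[(f, e)] / z
                counts[(f, e)] += p
                totals[e] += p

    # M step: renormalize expected counts into new probabilities.
    return {(f, e): c / totals[e] for (f, e), c in counts.items()}
```

In the joint setting of Section 2.4, this M step additionally receives weighted expected counts computed from non parallel data.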

2.3 Decipherment

Given a corpus of N foreign text sequences (ciphertext), F^1_{mono}, ..., F^n_{mono}, ..., F^N_{mono}, decipherment finds word-to-word translations that best describe the ciphertext.

Knight et al. (2006) are the first to study several natural language decipherment problems with unsupervised learning. Since then, there has been increasing interest in improving decipherment techniques and in their application to machine translation (Ravi and Knight, 2011; Dou and Knight, 2012; Nuhn et al., 2012; Dou and Knight, 2013; Nuhn et al., 2013).

In order to speed up decipherment, Dou and Knight (2012) suggest that a frequency list of bigrams might contain enough information for decipherment. According to them, a monolingual ciphertext bigram F_{mono} is generated through the following generative story:

• Generate a sequence of two plaintext tokens e_1 e_2 with probability P(e_1 e_2), given by a language model built from large numbers of plaintext bigrams.

• Substitute e_1 with f_1 and e_2 with f_2, with probability t(f_1 | e_1) \cdot t(f_2 | e_2).

The probability of any cipher bigram F_{mono} is:

P(F_{mono}) = \sum_{e_1 e_2} P(e_1 e_2) \cdot t(f_1 | e_1) \cdot t(f_2 | e_2)    (4)

And the probability of the corpus is:

P(corpus) = \prod_{n=1}^{N} P(F^n_{mono})    (5)
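Read literally, Equations 4 and 5 amount to the brute-force computation sketched below (our naming, not the authors' code); the double loop over the plaintext vocabulary is exactly the per-bigram cost discussed in the next paragraph.

```python
import math

def corpus_log_likelihood(cipher_bigrams, lm, t, plaintext_vocab):
    """Brute-force log of eq. (5), with eq. (4) as the inner sum.

    cipher_bigrams: dict (f1, f2) -> corpus count;
    lm: dict (e1, e2) -> P(e1 e2); t: dict (f, e) -> t(f|e).
    """
    ll = 0.0
    for (f1, f2), n in cipher_bigrams.items():
        # eq. (4): marginalize over all plaintext bigrams e1 e2.
        p = sum(lm.get((e1, e2), 0.0)
                * t.get((f1, e1), 0.0) * t.get((f2, e2), 0.0)
                for e1 in plaintext_vocab for e2 in plaintext_vocab)
        # eq. (5): bigram tokens are independent; guard against p = 0.
        ll += n * math.log(max(p, 1e-300))
    return ll
```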

Given a plaintext bigram language model, the goal is to manipulate t(f | e) to maximize P(corpus). Theoretically, one can directly apply EM to solve the problem (Knight et al., 2006). However, EM has time complexity O(N \cdot V_e^2) and space complexity O(V_f \cdot V_e), where V_f and V_e are the sizes of the ciphertext and plaintext vocabularies respectively, and N is the number of cipher bigrams.

There have been previous attempts to make decipherment faster. Ravi and Knight (2011) apply Bayesian learning to reduce the space complexity. However, Bayesian decipherment is still very slow with Gibbs sampling (Geman and Geman, 1987). Dou and Knight (2012) make sampling faster by introducing slice sampling (Neal, 2000) to Bayesian decipherment. Besides Bayesian decipherment, Nuhn et al. (2013) show that beam search can be used to solve a very large 1:1 word substitution cipher. In subsection 2.4.1, we describe our approach, which uses slice sampling to compute expected counts for decipherment in the EM algorithm.

2.4 Joint Optimization

[Figure 2: Joint word alignment and decipherment with EM.]

We now describe our EM approach to learn the parameters that maximize F_{joint} (Equation 2), where the distortion probabilities d(a_j | a_{j-1}, j) in the word alignment model are learned only from parallel data, and the translation probabilities t(f | e) are learned using both parallel and non parallel data. The E step and M step are illustrated in Figure 2.

Our algorithm starts with EM learning only on parallel data for a few iterations. When the joint inference starts, we first compute expected counts from parallel data and from non parallel data separately, using parameter values from the last M step. Then, we add the expected counts from parallel data and non parallel data together, with different weights for the two. Finally, we renormalize the translation table and the distortion table to update the parameters in the new M step.

The E step for the parallel part can be computed efficiently using the forward-backward algorithm (Vogel et al., 1996). However, as we pointed out in Section 2.3, the E step for the non parallel part has a time complexity of O(V^2) with the forward-backward algorithm, where V is the size of the English vocabulary, which is usually very large. Previous work has tried to make decipherment scalable (Ravi and Knight, 2011; Dou and Knight, 2012; Nuhn et al., 2013; Ravi, 2013). However, all of these approaches are designed for decipherment with either Bayesian inference or beam search. In contrast, we need an algorithm that makes EM decipherment scalable. To overcome this problem, we modify the slice sampling (Neal, 2000) approach used by Dou and Knight (2012) to compute the expected counts from non parallel data needed for the EM algorithm.
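The count pooling just described can be summarized in a short sketch (ours, under the assumption that the monolingual counts already carry the weight \alpha as computed in Section 2.4.2):

```python
from collections import defaultdict

def joint_m_step(parallel_counts, mono_counts):
    """M step of the joint procedure: pool expected counts for t(f|e).

    parallel_counts: (f, e) -> expected count from the parallel E step
        (forward-backward). mono_counts: (f, e) -> expected count from
        the decipherment E step (Sections 2.4.1-2.4.2), assumed to be
        pre-weighted by alpha. Distortion counts (not shown) are
        re-estimated from parallel data alone.
    """
    combined = defaultdict(float)
    for pair, c in parallel_counts.items():
        combined[pair] += c
    for pair, c in mono_counts.items():
        combined[pair] += c          # pooled monolingual evidence

    totals = defaultdict(float)      # normalizer per English word e
    for (f, e), c in combined.items():
        totals[e] += c
    return {(f, e): c / totals[e] for (f, e), c in combined.items()}
```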

2.4.1 Draw Samples with Slice Sampling

To start the sampling process, we initialize the first sample by performing approximate Viterbi decoding using results from the last EM iteration. For each foreign dependency bigram f_1, f_2, we find the top 50 candidates for f_1 and f_2 ranked by t(e | f), and find the English sequence e_1, e_2 that maximizes t(e_1 | f_1) \cdot t(e_2 | f_2) \cdot P(e_1, e_2).

Suppose the derivation probability for the current sample e_current is P(e_current). We use slice sampling to draw a new sample in two steps:

• Select a threshold T uniformly between 0 and P(e_current).

• Draw a new sample e_new uniformly from the pool of candidates {e_new | P(e_new) > T}.

The first step is straightforward to implement. However, it is not trivial to implement the second step. We adapt the idea from Dou and Knight (2012) for EM learning.

Suppose our current sample e_current contains English tokens e_{i-1}, e_i, and e_{i+1} at positions i-1, i, and i+1 respectively, and let f_i be the foreign token at position i. Using point-wise sampling, we draw a new sample by changing token e_i to a new token e'. Since the rest of the sample remains the same, only the trigram probability P(e_{i-1} e' e_{i+1}) (given by a bigram language model) and the channel model probability t(f_i | e') change. Therefore, the probability of a sample is simplified as shown in Equation 6:

P(e_{i-1} \, e' \, e_{i+1}) \cdot t(f_i | e')    (6)

Remember that in slice sampling, a new sample is drawn in two steps. For the first step, we choose a threshold T uniformly between 0 and P(e_{i-1} e_i e_{i+1}) \cdot t(f_i | e_i). We divide the second step into two cases, based on the observation that two types of samples are more likely to have a probability higher than T (Dou and Knight, 2012): (1) those whose trigram probability is high, and (2) those whose channel model probability is high. To find candidates that have a high trigram probability, Dou and Knight (2012) build top-k sorted lists ranked by P(e_{i-1} e' e_{i+1}), which can be pre-computed offline. Then, they test whether the last item e_k in the list satisfies the following inequality:

P(e_{i-1} \, e_k \, e_{i+1}) \cdot c < T    (7)

where c is a small constant, set to the prior in their work. In contrast, we choose c empirically, as we do not have a prior in our model. When the inequality in Equation 7 is satisfied, a sample is drawn in the following way: let set A = {e' | P(e_{i-1} e' e_{i+1}) \cdot c > T} and set B = {e' | t(f_i | e') > c}. Then we only need to sample e' uniformly from A ∪ B until P(e_{i-1} e' e_{i+1}) \cdot t(f_i | e') is greater than T. It is easy to prove that all other candidates, which are not in the sorted list and have t(f_i | e') ≤ c, have an upper bound on their probability: P(e_{i-1} e_k e_{i+1}) \cdot c. Therefore, they do not need to be considered.

Second, when the last item e_k in the list does not meet the condition in Equation 7, we keep drawing samples e' randomly until the sample's probability is greater than the threshold T.

As we mentioned before, the choice of the small constant c is empirical. A large c reduces the number of items in set B, but makes the condition P(e_{i-1} e_k e_{i+1}) \cdot c < T less likely to be satisfied, which slows down the sampling. On the contrary, a small c increases the number of items in set B significantly, as EM does not encourage a sparse distribution, which also slows down the sampling. In our experiments, we set c to 0.001 based on the speed of decipherment. Furthermore, to reduce the size of set B, we rank all the candidate translations of f_i by t(e | f_i), and add at most the first 1000 candidates whose t(f_i | e') >= c into set B. For the rest of the candidates, we set t(f_i | e') to a value smaller than c (0.00001 in our experiments).

2.4.2 Compute Expected Counts from Samples

With the ability to draw samples efficiently for decipherment using EM, we now describe how to compute expected counts from those samples. Let f_1, f_2 be a specific ciphertext bigram, N be the number of samples we want to use to compute expected counts, and e_1, e_2 be one of the N samples. The expected counts for the pairs (f_1, e_1) and (f_2, e_2) are computed as:

\alpha \cdot \frac{count(f_1, f_2)}{N}

where count(f_1, f_2) is the count of the bigram, and \alpha is the weight for non parallel data, as shown in Equation 1. Expected counts for f_1, f_2 are accumulated from each of its N samples. Finally, we collect expected counts using the same approach from each foreign bigram.
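Putting Sections 2.4.1 and 2.4.2 together, the procedure looks roughly like the following sketch (our naming; the `pool` argument stands for A ∪ B, whose Equation 7 based construction is omitted here). The resulting counts feed the `joint_m_step` sketch shown earlier.

```python
import random
from collections import defaultdict

def slice_sample_position(e_left, e_right, f, e_curr, lm3, t, pool):
    """One point-wise slice sampling move (Section 2.4.1), simplified.

    lm3(a, b, c) returns the trigram probability P(a b c); pool is a
    non-empty list playing the role of A | B and is assumed to contain
    e_curr with nonzero score, so the loop terminates.
    """
    def score(e):                                  # eq. (6)
        return lm3(e_left, e, e_right) * t.get((f, e), 0.0)

    T = random.uniform(0.0, score(e_curr))         # step 1: threshold
    while True:                                    # step 2: uniform redraws
        e_new = random.choice(pool)
        if score(e_new) > T:
            return e_new

def decipherment_expected_counts(cipher_bigrams, samples, alpha):
    """Accumulate expected counts as in Section 2.4.2: each of the N
    samples (e1, e2) drawn for bigram (f1, f2) contributes
    alpha * count(f1, f2) / N to the pairs (f1, e1) and (f2, e2)."""
    counts = defaultdict(float)
    for (f1, f2), n in cipher_bigrams.items():
        drawn = samples[(f1, f2)]                  # the N samples
        for e1, e2 in drawn:
            w = alpha * n / len(drawn)
            counts[(f1, e1)] += w
            counts[(f2, e2)] += w
    return counts
```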

3 Word Alignment Experiments

In this section, we show that joint word alignment and decipherment improves the quality of word alignment. We choose to evaluate word alignment performance for Spanish and English, as manual gold alignments are available. In our experiments, our approach improves alignment F-score by as much as 8 points.

3.1 Experiment Setup

As shown in Table 1, we work with a small amount of parallel, manually aligned Spanish-English data (Lambert et al., 2005), and a much larger amount of monolingual data.

              Spanish      English
Parallel      10.3k        9.9k
Non Parallel  80 million   400 million

Table 1: Size of parallel and non parallel data for word alignment experiments (measured in number of tokens)

The parallel data is extracted from Europarl, which consists of articles from European Parliament plenary sessions. The monolingual data comes from the English and Spanish versions of the Gigaword corpora, which contain news articles from different news agencies.

We view Spanish as a cipher of English, and follow the approach proposed by Dou and Knight (2013) to extract dependency bigrams from parsed Spanish and English monolingual data for decipherment. We only keep bigrams where both tokens appear in the parallel data. Then, we perform Spanish to English (English generating Spanish) word alignment and Spanish to English decipherment simultaneously with the method discussed in Section 2.
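As an illustration of the dependency bigram extraction and filtering just described, a small sketch follows; the (token, head index) input format is our assumption, not a format specified in the paper.

```python
def dependency_bigrams(parsed_sentences, parallel_vocab):
    """Extract (head, modifier) dependency bigrams for decipherment,
    in the spirit of Dou and Knight (2013).

    Each parsed sentence is a list of (token, head_index) pairs, with
    head_index == -1 marking the root. Bigrams containing a token
    unseen in the parallel data are dropped, as in Section 3.1.
    """
    bigrams = []
    for sent in parsed_sentences:
        for token, head in sent:
            if head >= 0:                      # skip the root token
                h = sent[head][0]
                if h in parallel_vocab and token in parallel_vocab:
                    bigrams.append((h, token))
    return bigrams
```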
3.1.1 Results

We align all 500 sentences in the parallel corpus, and tune the decipherment weight (α) for Model 1 and HMM using the last 100 sentences. The best weights are 0.1 for Model 1, and 0.005 for HMM. We start with Model 1 on only parallel data for 5 iterations, then switch to the joint process for another 5 iterations with Model 1 and 5 more iterations of HMM. In the end, we use the first 100 sentence pairs of the corpus for evaluation.

Figure 3 compares the learning curves of alignment F-score for EM without decipherment (the baseline) and for our joint word alignment and decipherment. From the learning curve, we find that at the 6th iteration, 2 iterations after we start the joint process, alignment F-score is improved from 34 to 43, and this improvement is held through the rest of the Model 1 iterations. The alignment model switches to HMM from the 11th iteration, and at the 12th iteration, we see a sudden jump in F-score for both the baseline and the joint approach. We see consistent improvement of F-score until the end of the HMM iterations.

4 Improving Machine Translation for Low Density Languages with Joint Word Alignment and Decipherment

In the previous section, we showed that the joint word alignment and decipherment process significantly improves the quality of word alignment for Spanish and English. In this section, we test our approach in a more challenging setting: improving the quality of machine translation in a real low density language setting.

In this task, our goal is to build a system to translate Malagasy news into English. We have a small amount of parallel data, and larger amounts of monolingual data collected from online websites. We build a dependency parser for Malagasy to parse the monolingual data to perform dependency based decipherment (Dou and Knight, 2013). In the end, we perform joint word alignment and decipherment, and show that the joint learning process improves Bleu scores by up to 2.1 points over a phrase-based MT baseline.

4.1 The Malagasy Language

Malagasy is the official language of Madagascar. It has around 18 million native speakers. Although Madagascar is an African country, Malagasy belongs to the Malayo-Polynesian branch of the Austronesian language family. Malagasy and English have very different word orders. First of all, in contrast to English, which has subject-verb-object (SVO) word order, Malagasy has verb-object-subject (VOS) word order. Besides that, Malagasy is a typical head initial language: determiners precede nouns, while other modifiers and relative clauses follow nouns (e.g. ny "the" boky "book" mena "red"). The significant differences in word order pose great challenges for both machine translation and decipherment.

4.2 Data

Table 2 shows the data available to us in our experiments.

Source         Malagasy      English
Parallel
Global Voices  2.0 million   1.8 million
Web News       2.2k          2.1k
Non Parallel
Gigaword       N/A           2.4 billion
allAfrica      N/A           396 million
Local News     15.3 million  N/A

Table 2: Size of Malagasy and English data used in our experiments (measured in number of tokens)

The majority of parallel text comes from Global Voices¹ (GV). The website contains international news translated into different foreign languages. Besides that, we also have a very small amount of parallel text containing local web news, with English translations provided by native speakers at the University of Texas, Austin. The Malagasy side of this small parallel corpus also has syntactic annotation, which is used to train a very basic Malagasy part of speech tagger and dependency parser.

We also have much larger amounts of non parallel data for both languages. For Malagasy, we spent two months manually collecting 15.3 million tokens of news text from local news websites in Madagascar.² We have released this data for future research use. For English, we have 2.4 billion tokens from the Gigaword corpus. Since the Malagasy monolingual data is collected from local websites, it is reasonable to argue that those data contain a significant amount of information related to Africa. Therefore, we also collect 396 million tokens of African news in English from allAfrica.com.

4.3 Building a Dependency Parser for Malagasy

Since Malagasy and English have very different word orders, we decide to apply dependency based decipherment for the two languages, as suggested by Dou and Knight (2013). To extract dependency relations, we need to parse monolingual data in Malagasy and English. For English, there are already many good parsers available. In our experiments, we use the Turbo parser (Martins et al., 2013), trained on the English Penn Treebank (Marcus et al., 1993), to parse all our English monolingual data. However, there is no existing good parser for Malagasy.

The quality of a dependency parser depends on the amount of training data available. State-of-the-art English parsers are built from the Penn Treebank, which contains over 1 million tokens of annotated syntactic trees.

¹globalvoicesonline.org
²aoraha.com, gazetiko.com, inovaovao.com, expressmada.com, lakroa.com

[Figure 3: Learning curve showing that our joint word alignment and decipherment approach improves word alignment quality over traditional EM without decipherment (Model 1: iterations 1 to 10; HMM: iterations 11 to 15)]

In contrast, the available data for training a Malagasy parser is rather limited, with only 168 sentences and 2.8k tokens, as shown in Table 2. At the very beginning, we use the last 120 sentences as training data to train a part of speech (POS) tagger, using a toolkit provided by Garrette et al. (2013), and a dependency parser, using the Turbo parser. We test the performance of the parser on the first 48 sentences and obtain 72.4% accuracy.

One obvious way to improve tagging and parsing accuracy is to get more annotated data. We find more data with only part of speech tags, containing 465 sentences and 10k tokens, released by Garrette et al. (2013), and add this data as extra training data for the POS tagger. Also, we download an online dictionary that contains POS tags for over 60k Malagasy word types from malagasyword.org. The dictionary is very helpful for tagging words never seen in the training data.

It is natural to think that the creation of annotated data for training a POS tagger and a parser requires a large amount of effort from annotators who understand the language well. However, we find that, with the help of parallel data and dictionaries, we are able to create more annotated data ourselves to improve tagging and parsing accuracy. This idea is inspired by previous work that tries to learn a semi-supervised parser by projecting dependency relations from one language (with good dependency parsers) to another (Yarowsky and Ngai, 2001; Ganchev et al., 2009). However, we find those automatic approaches do not work well for Malagasy.

To further expand our Malagasy training data, we first use a POS tagger and parser with poor performance to parse 788 sentences (20k tokens) on the Malagasy side of the parallel corpus from Global Voices. Then, we correct both the dependency links and the POS tags based on information from dictionaries³ and the English translation of the parsed sentence. We spent 3 months manually projecting English dependencies to Malagasy, and eventually improve test set parsing accuracy from 72.4% to 80.0%. We also make this data available for future research use.

³An online dictionary from malagasyword.org, as well as a lexicon learned from the parallel data.

4.4 Machine Translation Experiments

In this section, we present the data used for our MT experiments, and compare three different systems to justify our joint word alignment and decipherment approach.

4.4.1 Baseline Machine Translation System

We build a state-of-the-art phrase-based MT system, PBMT, using Moses (Koehn et al., 2007). PBMT has 3 models: a translation model, a distortion model, and a language model. We train the first two models using half of the Global Voices parallel data (the rest is reserved for development and testing), and build a 5-gram language model using 834 million tokens from the AFP section of English Gigaword, 396 million tokens from allAfrica, and the English part of the parallel corpus used for training. For alignment, we run 10 iterations of Model 1, and 5 iterations of HMM. We do not run Model 3 and Model 4, as we see no improvements in Bleu scores from running those models. We do word alignment in both directions and use the grow-diag-final-and heuristic to obtain the final alignment.

During decoding, we use the 8 standard features in Moses to score a candidate translation: direct and inverse translation probabilities, direct and inverse lexical weighting, a language model score, a distortion score, a phrase penalty, and a word penalty. The weights for the features are learned on the tuning data using minimum error rate training (MERT) (Och, 2003).

              Malagasy      English
Parallel
Train (GV)    0.9 million   0.8 million
Tune (GV)     22.2k         20.2k
Test (GV)     23k           21k
Test (Web)    2.2k          2.1k
Non Parallel
Gigaword      N/A           834 million
Web           15.3 million  396 million

Table 3: Size of training, tuning, and testing data in number of tokens (GV: Global Voices)

To compare with the previous decipherment approach to improving machine translation, we build a second baseline system. We follow the work by Dou and Knight (2013) to decipher Malagasy into English, and build a translation lexicon T_decipher from decipherment. To improve machine translation, we simply use T_decipher as an additional parallel corpus. First, we filter T_decipher by keeping only translation pairs (f, e) where f is observed in the Malagasy part and e is observed in the English part of the parallel corpus. Then we append all the Malagasy and English words in the filtered T_decipher to the end of the Malagasy part and the English part of the parallel corpus, respectively. The training and tuning process is the same as for the baseline machine translation system PBMT. We call this system Decipher-Pipeline.

4.4.2 Joint Word Alignment and Decipherment for Machine Translation

When deciphering Malagasy to English, we extract Malagasy dependency bigrams using all available Malagasy monolingual data plus the Malagasy part of the Global Voices parallel data, and we extract English dependency bigrams, using 834 million tokens from English Gigaword and 396 million tokens from allAfrica news, to build an English dependency language model. In the other direction, we extract English dependency bigrams from the English part of the entire parallel corpus plus 9.7 million tokens from allAfrica news⁴, and use 17.3 million tokens of Malagasy monolingual data (15.3 million from the web and 2.0 million from Global Voices) to build a Malagasy dependency language model. We require that all dependency bigrams contain only words observed in the parallel data used to train the baseline MT system.

During learning, we run Model 1 without decipherment for 5 iterations. Then we perform joint word alignment and decipherment for another 5 iterations with Model 1, and 5 iterations with HMM. We tune the decipherment weights (α) for Model 1 and HMM using grid search against Bleu score on a development set. In the end, we only extract rules from one direction, P(English | Malagasy), where the decipherment weights for Model 1 and HMM are 0.5 and 0.005 respectively; we chose this because we did not find any benefit from tuning the weights for each direction. We then use the grow-diag-final-end heuristic to form the final alignments. We call this system Decipher-Joint.

⁴We do not find further Bleu gains from using more English monolingual data.

4.5 Results

We tune each system three times with MERT and choose the best weights based on Bleu scores on the tuning set.

Table 4 shows that while using a translation lexicon learned from decipherment does not improve the quality of machine translation significantly, the joint approach improves the Bleu score by 0.9 and 2.1 on the Global Voices test set and the web news test set respectively. The results also show that parsing quality correlates with gains in Bleu scores: scores in brackets in the last row of the table are achieved using a dependency parser with 72.4% attachment accuracy, while scores outside the brackets are obtained using a dependency parser with 80.0% attachment accuracy.

We analyze the results and find that the gain mainly comes from two sources. First, adding expected counts from non parallel data makes the distribution of translation probabilities sparser in the word alignment models. The probabilities of translation pairs favored by both the parallel data and decipherment become higher. This gain is consistent with previous observations where a sparse prior applied to EM helps improve word alignment and machine translation (Vaswani et al., 2012). Second, expected counts from decipherment also help discover new translation pairs in the parallel data for low frequency words, where those words are either aligned to NULL or to wrong translations in the baseline.

5 Conclusion and Future Work

We propose a new objective function for word alignment that combines the processes of word alignment and decipherment into a single task. In experiments, we find that the joint process performs better than the previous pipeline approach, and we observe Bleu gains of 0.9 and 2.1 points on the Global Voices and local web news test sets, respectively. Finally, our research leads to the release of 15.3 million tokens of monolingual Malagasy data from the web, as well as a small Malagasy dependency tree bank containing 20k tokens.

Given the positive results we obtain by using the joint approach to improve word alignment, we are inspired to apply this approach to help find translations for out of vocabulary words, and to explore other possible ways to improve machine translation with decipherment.

Decipherment  System             Tune (GV)    Test (GV)    Test (Web)
None          PBMT (Baseline)    18.5         17.1         7.7
Separate      Decipher-Pipeline  18.5         17.4         7.7
Joint         Decipher-Joint     18.9 (18.7)  18.0 (17.7)  9.8 (8.5)

Table 4: Decipher-Pipeline does not show significant improvement over the baseline system. In contrast, Decipher-Joint, which uses the joint word alignment and decipherment approach, achieves Bleu gains of 0.9 and 2.1 on the Global Voices test set and the web news test set, respectively. The results in brackets are obtained using a parser trained with only 120 sentences. (GV: Global Voices)

Acknowledgments

This work was supported by NSF Grant 0904684 and ARO grant W911NF-10-1-0533. The authors would like to thank David Chiang, Malte Nuhn, Victoria Fossum, Ashish Vaswani, Ulf Hermjakob, Yang Gao, and Hui Zhang (in no particular order) for their comments and suggestions.

References

Shane Bergsma and Benjamin Van Durme. 2011. Learning bilingual lexicons using the visual similarity of labeled web images. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three. AAAI Press.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19:263–311.

Hal Daumé III and Jagadeesh Jagarlamudi. 2011. Domain adaptation for machine translation by mining unseen words. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.

Arthur Dempster, Nan Laird, and Donald Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38.

Qing Dou and Kevin Knight. 2012. Large scale decipherment for out-of-domain machine translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics.

Qing Dou and Kevin Knight. 2013. Dependency-based decipherment for resource-limited machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Pascale Fung and Lo Yuen Yee. 1998. An IR approach for translating new words from nonparallel, comparable texts. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 1. Association for Computational Linguistics.

Kuzman Ganchev, Jennifer Gillenwater, and Ben Taskar. 2009. Dependency grammar induction via bitext projection constraints. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1. Association for Computational Linguistics.

Nikesh Garera, Chris Callison-Burch, and David Yarowsky. 2009. Improving translation lexicon induction from monolingual corpora via dependency contexts and part-of-speech equivalences. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning. Association for Computational Linguistics.

Dan Garrette, Jason Mielens, and Jason Baldridge. 2013. Real-world semi-supervised learning of POS-taggers for low-resource languages. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.

Stuart Geman and Donald Geman. 1987. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. In Readings in Computer Vision: Issues, Problems, Principles, and Paradigms. Morgan Kaufmann Publishers Inc.

Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proceedings of ACL-08: HLT. Association for Computational Linguistics.

Ann Irvine and Chris Callison-Burch. 2013a. Combining bilingual and comparable corpora for low resource machine translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation. Association for Computational Linguistics.

Ann Irvine and Chris Callison-Burch. 2013b. Supervised bilingual lexicon induction with multiple monolingual signals. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.

Ann Irvine, Chris Quirk, and Hal Daumé III. 2013. Monolingual marginal matching for translation model adaptation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Wenbin Jiang, Liang Huang, Qun Liu, and Yajuan Lü. 2008. A cascaded linear model for joint Chinese word segmentation and part-of-speech tagging. In Proceedings of ACL-08: HLT. Association for Computational Linguistics.

Alexandre Klementiev, Ann Irvine, Chris Callison-Burch, and David Yarowsky. 2012. Toward statistical machine translation without parallel corpora. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics.

Kevin Knight, Anish Nair, Nishit Rathod, and Kenji Yamada. 2006. Unsupervised analysis for decipherment problems. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions. Association for Computational Linguistics.

Philipp Koehn and Kevin Knight. 2002. Learning a translation lexicon from monolingual corpora. In Proceedings of the ACL-02 Workshop on Unsupervised Lexical Acquisition. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions. Association for Computational Linguistics.

Patrik Lambert, Adrià De Gispert, Rafael Banchs, and José B. Mariño. 2005. Guidelines for word alignment evaluation and manual alignment. Language Resources and Evaluation, 39(4):267–285.

Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2).

André Martins, Miguel Almeida, and Noah A. Smith. 2013. Turning on the Turbo: Fast third-order non-projective Turbo parsers. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics.

Radford Neal. 2000. Slice sampling. Annals of Statistics, 31.

Malte Nuhn, Arne Mauser, and Hermann Ney. 2012. Deciphering foreign language by combining language models and context vectors. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1. Association for Computational Linguistics.

Malte Nuhn, Julian Schamper, and Hermann Ney. 2013. Beam search for solving substitution ciphers. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Reinhard Rapp. 1995. Identifying word translations in non-parallel texts. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Sujith Ravi and Kevin Knight. 2011. Deciphering foreign language. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.

Sujith Ravi. 2013. Scalable decipherment for machine translation via hash sampling. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Ashish Vaswani, Liang Huang, and David Chiang. 2012. Smaller alignment models for better translations: Unsupervised word alignment with the l0-norm. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1. Association for Computational Linguistics.

Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In Proceedings of the 16th Conference on Computational Linguistics - Volume 2. Association for Computational Linguistics.

David Yarowsky and Grace Ngai. 2001. Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies. Association for Computational Linguistics.

Yue Zhang and Stephen Clark. 2008. Joint word segmentation and POS tagging using a single perceptron. In Proceedings of ACL-08: HLT, Columbus, Ohio. Association for Computational Linguistics.
