Dependency-Based Decipherment for Resource-Limited Machine Translation
Qing Dou and Kevin Knight
Information Sciences Institute, Department of Computer Science
University of Southern California
{qdou,knight}@isi.edu

Abstract

We introduce dependency relations into deciphering foreign languages and show that dependency relations help improve the state-of-the-art deciphering accuracy by over 500%. We learn a translation lexicon from large amounts of genuinely non parallel data with decipherment to improve a phrase-based machine translation system trained with limited parallel data. In experiments, we observe BLEU gains of 1.2 to 1.8 across three different test sets.

1 Introduction

State-of-the-art machine translation (MT) systems apply statistical techniques to learn translation rules from large amounts of parallel data. However, parallel data is limited for many language pairs and domains.

[Figure 1: Improving machine translation with decipherment (grey boxes represent new data and processes). Mono: monolingual; LM: language model; LEX: translation lexicon; TM: translation model.]

In general, it is easier to obtain non parallel data. The ability to build a machine translation system using monolingual data could alleviate problems caused by insufficient parallel data. Towards building a machine translation system without a parallel corpus, Klementiev et al. (2012) use non parallel data to estimate parameters for a large-scale MT system. Other work tries to learn full MT systems using only non parallel data through decipherment (Ravi and Knight, 2011; Ravi, 2013). However, the performance of such systems is poor compared with those trained with parallel data.

Given that we often have some parallel data, it is more practical to improve a translation system trained on parallel corpora with non parallel data. Dou and Knight (2012) successfully apply decipherment to learn a domain-specific translation lexicon from monolingual data to improve out-of-domain machine translation. Although their approach works well for Spanish/French, they do not show whether their approach works for other language pairs. Moreover, the non parallel data used in their experiments is created from a parallel corpus. Such highly comparable data is difficult to obtain in reality.

In this work, we improve previous work by Dou and Knight (2012) using genuinely non parallel data, and propose a framework to improve a machine translation system trained with a small amount of parallel data. As shown in Figure 1, we use a lexicon learned from decipherment to improve translations of both observed and out-of-vocabulary (OOV) words. The main contributions of this work are:

• We extract bigrams based on dependency relations for decipherment, which improves the state-of-the-art deciphering accuracy by over 500%.
• We demonstrate how to improve translations of words observed in parallel data by using a translation lexicon obtained from large amounts of non parallel data.
• We show that decipherment is able to find correct translations for OOV words.
• We use a translation lexicon learned by deciphering large amounts of non parallel data to improve a phrase-based MT system trained with limited amounts of parallel data. In experiments, we observe 1.2 to 1.8 BLEU gains across three different test sets.

2 Previous Work

Motivated by the idea that a translation lexicon induced from non parallel data can be applied to MT, a variety of prior research has tried to build a translation lexicon from non parallel or comparable data (Rapp, 1995; Fung and Yee, 1998; Koehn and Knight, 2002; Haghighi et al., 2008; Garera et al., 2009; Bergsma and Van Durme, 2011; Daumé and Jagarlamudi, 2011; Irvine and Callison-Burch, 2013b; Irvine and Callison-Burch, 2013a). Although previous work is able to build a translation lexicon without parallel data, little has used the lexicon to improve machine translation.

There has been increasing interest in learning translation lexicons from non parallel data with decipherment techniques (Ravi and Knight, 2011; Dou and Knight, 2012; Nuhn et al., 2012). Decipherment views one language as a cipher for another and learns a translation lexicon that produces a good decipherment.

In an effort to build an MT system without a parallel corpus, Ravi and Knight (2011) view Spanish as a cipher for English and apply Bayesian learning to directly decipher Spanish into English. Unfortunately, their approach can only work on small data with limited vocabulary. Dou and Knight (2012) propose two techniques to make Bayesian decipherment scalable.

First, unlike Ravi and Knight (2011), who decipher whole sentences, Dou and Knight (2012) decipher bigrams. Reducing a ciphertext to a set of bigrams with counts significantly reduces the amount of cipher data. According to Dou and Knight (2012), a ciphertext bigram F is generated through the following generative story:

• Generate a sequence of two plaintext tokens e_1 e_2 with probability P(e_1 e_2), given by a language model built from large numbers of plaintext bigrams.
• Substitute e_1 with f_1 and e_2 with f_2 with probability P(f_1 | e_1) · P(f_2 | e_2).

The probability of any cipher bigram F is:

P(F) = \sum_{e_1 e_2} P(e_1 e_2) \prod_{i=1}^{2} P(f_i \mid e_i)

Given a corpus of N cipher bigrams F_1 ... F_N, the probability of the corpus is:

P(\text{corpus}) = \prod_{j=1}^{N} P(F_j)

Given a plaintext bigram language model, the goal is to manipulate P(f | e) to maximize P(corpus). Theoretically, one can directly apply EM to solve the problem (Knight et al., 2006). However, EM has time complexity O(N \cdot V_e^2) and space complexity O(V_f \cdot V_e), where V_f and V_e are the sizes of the ciphertext and plaintext vocabularies respectively, and N is the number of cipher bigrams.
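To make the generative story and its cost concrete, the following is a minimal sketch (not the authors' implementation) that scores one cipher bigram by brute force, summing over all plaintext bigrams exactly as in the equation for P(F) above. The dictionary-based language model, the channel table, and the toy numbers are assumptions made purely for illustration; the nested loop over the plaintext vocabulary is what makes plain EM scale as O(N · V_e^2).

def score_cipher_bigram(f1, f2, lm, channel, plaintext_vocab):
    # P(F) = sum over e1, e2 of P(e1 e2) * P(f1|e1) * P(f2|e2).
    # lm[(e1, e2)] holds the plaintext bigram probability P(e1 e2);
    # channel[(f, e)] holds the substitution probability P(f|e).
    total = 0.0
    for e1 in plaintext_vocab:
        for e2 in plaintext_vocab:
            total += (lm.get((e1, e2), 0.0)
                      * channel.get((f1, e1), 0.0)
                      * channel.get((f2, e2), 0.0))
    return total

# Toy usage with hypothetical numbers.
lm = {("united", "nations"): 0.4, ("nations", "united"): 0.01}
channel = {("naciones", "nations"): 0.5, ("unidas", "united"): 0.5}
vocab = ["united", "nations"]
print(score_cipher_bigram("naciones", "unidas", lm, channel, vocab))  # 0.0025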
Ravi and Knight (2011) apply Bayesian learning to reduce the space complexity. Instead of estimating probabilities P(f | e), Bayesian learning tries to draw samples of plaintext sequences given the ciphertext bigrams. During sampling, the probability of any possible plaintext sample e_1 e_2 is given as:

P_{\text{sample}}(e_1 e_2) = P(e_1 e_2) \prod_{i=1}^{2} P_{\text{bayes}}(f_i \mid e_i)

with P_bayes(f_i | e_i) defined as:

P_{\text{bayes}}(f_i \mid e_i) = \frac{\alpha P_0(f_i \mid e_i) + \text{count}(f_i, e_i)}{\alpha + \text{count}(e_i)}

where P_0 is a base distribution, and α is a parameter that controls how much we trust P_0. count(f_i, e_i) and count(e_i) record the number of times the pair f_i, e_i and the token e_i appear in previously generated samples, respectively.

At the end of sampling, P(f_i | e_i) is estimated by:

P(f_i \mid e_i) = \frac{\text{count}(f_i, e_i)}{\text{count}(e_i)}

However, Bayesian decipherment is still very slow with Gibbs sampling (Geman and Geman, 1987), as each sampling step requires considering V_e possibilities. Dou and Knight (2012) solve the problem by introducing slice sampling (Neal, 2000) to Bayesian decipherment.
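The sketch below shows how the cache probability P_bayes and the final estimate of P(f|e) fall out of the formulas above, assuming the sampler's state is kept in simple count dictionaries. It is an illustration of the equations only, not the slice-sampling implementation the authors use; the variable names and toy counts are assumptions.

from collections import defaultdict

def p_bayes(f, e, cooc, totals, base, alpha):
    # (alpha * P0(f|e) + count(f, e)) / (alpha + count(e)),
    # where the counts come from previously generated samples.
    return (alpha * base.get((f, e), 0.0) + cooc[(f, e)]) / (alpha + totals[e])

def final_estimate(f, e, cooc, totals):
    # At the end of sampling, P(f|e) = count(f, e) / count(e).
    return cooc[(f, e)] / totals[e] if totals[e] else 0.0

# Hypothetical cache state after a few sampled plaintext bigrams.
cooc = defaultdict(int, {("unidas", "united"): 3})
totals = defaultdict(int, {"united": 4})
base = {("unidas", "united"): 0.1}
print(p_bayes("unidas", "united", cooc, totals, base, alpha=1.0))  # (0.1 + 3) / 5 = 0.62
print(final_estimate("unidas", "united", cooc, totals))            # 3 / 4 = 0.75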
3 From Adjacent Bigrams to Dependency Bigrams

A major limitation of the work by Dou and Knight (2012) is their monotonic generative story for deciphering adjacent bigrams. While the generation process works well for deciphering similar languages (e.g. Spanish and French) without considering reordering, it does not work well for languages that are more different in grammar and word order (e.g. Spanish and English). In this section, we first look at why adjacent bigrams are bad for decipherment.

misión de naciones unidas en oriente medio
Adjacent bigrams    Dependency bigrams
misión de           misión naciones
de naciones         naciones unidas
naciones unidas     misión en
unidas en           en oriente
en oriente          oriente medio
oriente medio

Table 1: Comparison of adjacent bigrams (left) and dependency bigrams (right) extracted from the same Spanish text.

The left column of Table 1 lists the adjacent bigrams extracted from the Spanish phrase "misión de naciones unidas en oriente medio". The correct decipherment for the bigram "naciones unidas" should be "united nations". Since the deciphering model described by Dou and Knight (2012) does not consider word reordering, it needs to decipher the bigram into "nations united" in order to get the right word translations "naciones"→"nations" and "unidas"→"united". However, the English language model used for decipherment is built from English adjacent bigrams, so it strongly disprefers "nations united" and is not likely to produce a sensible decipherment for "naciones unidas". The Spanish bigram "oriente medio" poses the same problem. Thus, without considering word reordering, the model described by Dou and Knight (2012) is not a good fit for deciphering Spanish into English.

However, if we extract bigrams based on dependency relations for both languages, the model fits better. To extract such bigrams, we first use dependency parsers to parse both languages, and extract bigrams by putting the head word first, followed by the modifier. We call these dependency bigrams. The right column in Table 1 lists examples of Spanish dependency bigrams extracted from the same Spanish phrase. With a language model built from English dependency bigrams, the same model used for deciphering adjacent bigrams is able to decipher the Spanish dependency bigram "naciones(head) unidas(modifier)" into "nations(head) united(modifier)".

We might instead propose to consider word reordering when deciphering adjacent bigrams (e.g. add an operation to swap tokens in a bigram). However, using dependency bigrams has the following advantages:

• First, using dependency bigrams avoids complicating the model, keeping deciphering efficient and scalable.
• Second, it addresses the problem of long-distance reordering, which cannot be modeled by swapping tokens in bigrams.
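As a rough sketch of the extraction step described above, the function below turns a dependency parse, given here as tokens plus 1-based head indices (the kind of structure a dependency parser outputs), into head-first bigrams. The input format and the toy English sentence are assumptions made for illustration; the paper's actual parsers and any filtering of relation types are not reproduced here.

def dependency_bigrams(tokens, heads):
    # tokens: word forms of one sentence; heads[i]: 1-based index of the
    # head of tokens[i], or 0 if tokens[i] is the root.
    # Each bigram puts the head word first, followed by its modifier.
    bigrams = []
    for i, h in enumerate(heads):
        if h == 0:          # the root has no head word, so it yields no bigram
            continue
        bigrams.append((tokens[h - 1], tokens[i]))
    return bigrams

# Hypothetical parse of a toy English sentence.
tokens = ["the", "mission", "reached", "the", "region"]
heads  = [2, 3, 0, 5, 3]
print(dependency_bigrams(tokens, heads))
# [('mission', 'the'), ('reached', 'mission'), ('region', 'the'), ('reached', 'region')]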