Toward Statistical Machine Translation Without Parallel Corpora

Toward Statistical Machine Translation without Parallel Corpora Alexandre Klementiev Ann Irvine Chris Callison-Burch David Yarowsky Center for Language and Speech Processing Johns Hopkins University Abstract novel algorithm to estimate reordering features from monolingual data alone, and we report the We estimate the parameters of a phrase- performance of a phrase-based statistical model based statistical machine translation sys- (Koehn et al., 2003) estimated using these mono- tem from monolingual corpora instead of a lingual features. bilingual parallel corpus. We extend exist- ing research on bilingual lexicon induction Most of the prior work on lexicon induction to estimate both lexical and phrasal trans- is motivated by the idea that it could be applied lation probabilities for MT-scale phrase- to machine translation but stops short of actu- tables. We propose a novel algorithm to es- ally doing so. Lexicon induction holds the po- timate reordering probabilities from mono- tential to create machine translation systems for lingual data. We report translation results languages which do not have extensive parallel for an end-to-end translation system us- corpora. Training would only require two large ing these monolingual features alone. Our monolingual corpora and a small bilingual dictio- method only requires monolingual corpora in source and target languages, a small nary, if one is available. The idea is that intrin- bilingual dictionary, and a small bitext for sic properties of monolingual data (possibly along tuning feature weights. In this paper, we ex- with a handful of bilingual pairs to act as exam- amine an idealization where a phrase-table ple mappings) can provide independent but infor- is given. We examine the degradation in mative cues to learn translations because words translation performance when bilingually (and phrases) behave similarly across languages. estimated translation probabilities are re- This work is the first attempt to extend and apply moved and show that 80%+ of the loss can be recovered with monolingually estimated these ideas to an end-to-end machine translation features alone. We further show that our pipeline. While we make an explicit assumption monolingual features add 1.5 BLEU points that a table of phrasal translations is given a priori, when combined with standard bilingually we induce every other parameter of a full phrase- estimated phrase table features. based translation system from monolingual data alone. The contributions of this work are: 1 Introduction • In Section 2.2 we analyze the challenges The parameters of statistical models of transla- of using bilingual lexicon induction for sta- tion are typically estimated from large bilingual tistical MT (performance on low frequency parallel corpora (Brown et al., 1993). However, items, and moving from words to phrases). these resources are not available for most language pairs, and they are expensive to produce in • In Sections 3.1 and 3.2 we use multiple cues quantities sufficient for building a good transla- present in monolingual data to estimate lexi- tion system (Germann, 2001). We attempt an en- cal and phrasal translation scores. tirely different approach; we use cheap and plen- • In Section 3.3 we propose a novel algo- tiful monolingual resources to induce an end-to- rithm for estimating phrase reordering fea- end statistical machine translation system. In par- tures from monolingual texts. ticular, we extend the long line of work on in- ducing translation lexicons (beginning with Rapp • Finally, in Section 5 we systematically drop (1995)) and propose to use multiple independent feature functions from a phrase table and cues present in monolingual texts to estimate lex- then replace them with monolingually es- ical and phrasal translation probabilities for large, timated equivalents, reporting end-to-end MT-scale phrase-tables. We then introduce a translation quality. 130 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 130–140, Avignon, France, April 23 - 27 2012. c 2012 Association for Computational Linguistics 2 Background ls We begin with a brief overview of the stan- Wieviel Profi in man sollte seines Facebook verdienen dard phrase-based statistical machine translation aufrgund How model. Here, we define the parameters which we later replace with monolingual alternatives. much We continue with a discussion of bilingual lex- should m icon induction; we extend these methods to es- you m timate the monolingual parameters in Section 3. charge d This approach allows us to replace expensive/rare bilingual parallel training data with two large for d monolingual corpora, a small bilingual dictionary, your m and ≈2,000 sentence bilingual development set, Facebook d which are comparatively plentiful/inexpensive. profile s 2.1 Parameters of phrase-based SMT Statistical machine translation (SMT) was first Figure 1: The reordering probabilities from the phrase- based models are estimated from bilingual data by cal- formulated as a series of probabilistic mod- culating how often in the parallel corpus a phrase pair els that learn word-to-word correspondences (f, e) is orientated with the preceding phrase pair in from sentence-aligned bilingual parallel corpora the 3 types of orientations (monotone, swapped, and (Brown et al., 1993). Current methods, includ- discontinuous). ing phrase-based (Och, 2002; Koehn et al., 2003) and hierarchical models (Chiang, 2005), typically start by word-aligning a bilingual parallel cor- age word translation probabilities, w(ei|fj), pus (Och and Ney, 2003). They extract multi- are calculated via phrase-pair-internal word word phrases that are consistent with the Viterbi alignments. word alignments and use these phrases to build new translations. A variety of parameters are es- • Reordering model. Each phrase pair (e, f) timated using the bitexts. Here we review the pa- also has associated reordering parameters, rameters of the standard phrase-based translation po(orientation|f, e), which indicate the dis- model (Koehn et al., 2007). Later we will show tribution of its orientation with respect to the how to estimate them using monolingual texts in- previously translated phrase. Orientations stead. These parameters are: are monotone, swap, discontinuous (Tillman, 2004; Kumar and Byrne, 2004), see Figure 1. • Phrase pairs. Phrase extraction heuristics (Venugopal et al., 2003; Tillmann, 2003; • Other features. Other typical features are Och and Ney, 2004) produce a set of phrase n-gram language model scores and a phrase pairs (e, f) that are consistent with the word penalty, which governs whether to use fewer alignments. In this paper we assume that the longer phrases or more shorter phrases. phrase pairs are given (without any scores), These are not bilingually estimated, so we and we induce every other parameter of the can re-use them directly without modifica- phrase-based model from monolingual data. tion. • Phrase translation probabilities. Each phrase pair has a list of associated fea- The features are combined in a log linear model, ture functions (FFs). These include phrase and their weights are set through minimum error translation probabilities, φ(e|f) and φ(f|e), rate training (Och, 2003). We use the same log which are typically calculated via maximum linear formulation and MERT but propose alterna- likelihood estimation. tives derived directly from monolingual data for all parameters except for the phrase pairs them- • Lexical weighting. Since MLE overestimates selves. Our pipeline still requires a small bitext of φ for phrase pairs with sparse counts, lexi- approximately 2,000 sentences to use as a devel- cal weighting FFs are used to smooth. Aver- opment set for MERT parameter tuning. 131 2.2 Bilingual lexicon induction for SMT Bilingual lexicon induction describes the class of algorithms that attempt to learn translations from 40 monolingual corpora. Rapp (1995) was the first to propose using non-parallel texts to learn the 30 translations of words. Using large, unrelated En- ● 20 ● ● glish and German corpora (with 163m and 135m ● Accuracy, % Accuracy, ● ● words) and a small German-English bilingual dic- ● ● tionary (with 22k entires), Rapp (1999) demon- 10 ● strated that reasonably accurate translations could ● Top 1 Top 10 be learned for 100 German nouns that were not 0 contained in the seed bilingual dictionary. His al- 0 100 200 300 400 500 600 gorithm worked by (1) building a context vector Corpus Frequency representing an unknown German word by count- ing its co-occurrence with all the other words Figure 2: Accuracy of single-word translations in- in the German monolingual corpus, (2) project- duced using contextual similarity as a function of the source word corpus frequency. Accuracy is the pro- ing this German vector onto the vector space of portion of the source words with at least one correct English using the seed bilingual dictionary, (3) (bilingual dictionary) translation in the top 1 and top calculating the similarity of this sparse projected 10 candidate lists. vector to vectors for English words that were con- structed using the English monolingual corpus, nouns in Rapp (1995), 1,000 most frequent words and (4) outputting the English words with the in Koehn and Knight (2002), or 2,000 most fre- highest similarity as the most likely translations. quent nouns in Haghighi et al. (2008)). Although A variety of subsequent work has extended the previous work reported high translation accuracy, original idea either by exploring different mea- it may be misleading to extrapolate the results to sures of vector similarity (Fung and Yee, 1998) SMT, where it is necessary to translate a much or by proposing other ways of measuring simi- larger set of words and phrases, including many larity beyond co-occurence within a context win- low frequency items. dow. For instance, Schafer and Yarowsky (2002) In a preliminary study, we plotted the accuracy demonstrated that word translations tend to co- of translations against the frequency of the source occur in time across languages. Koehn and Knight words in the monolingual corpus.

Toward Statistical Machine Translation Without Parallel Corpora

(Meta-) Evaluation of Machine Translation

Public Project Presentation and Updates

8.14: Annual Public Report

Statistical and Hybrid Machine Translation Between All European Languages

Abstract the Circle of Meaning: from Translation

Improving Statistical Machine Translation Efficiency by Triangulation

Factored Translation Models and Discriminative Training For

Convergence of Translation Memory and Statistical Machine Translation

Statistical Machine Translation

Edinburgh's Statistical Machine Translation Systems for WMT16

Improved Learning for Machine Translation

A Survey of Statistical Machine Translation