Toward Statistical Machine Translation without Parallel Corpora

Alexandre Klementiev, Ann Irvine, Chris Callison-Burch, David Yarowsky
Center for Language and Speech Processing, Johns Hopkins University

Abstract

We estimate the parameters of a phrase-based statistical machine translation system from monolingual corpora instead of a bilingual parallel corpus. We extend existing research on bilingual lexicon induction to estimate both lexical and phrasal translation probabilities for MT-scale phrase-tables. We propose a novel algorithm to estimate reordering probabilities from monolingual data. We report translation results for an end-to-end translation system using these monolingual features alone. Our method only requires monolingual corpora in source and target languages, a small bilingual dictionary, and a small bitext for tuning feature weights. In this paper, we examine an idealization where a phrase-table is given. We examine the degradation in translation performance when bilingually estimated translation probabilities are removed and show that 80%+ of the loss can be recovered with monolingually estimated features alone. We further show that our monolingual features add 1.5 BLEU points when combined with standard bilingually estimated phrase table features.

1 Introduction

The parameters of statistical models of translation are typically estimated from large bilingual parallel corpora (Brown et al., 1993). However, these resources are not available for most language pairs, and they are expensive to produce in quantities sufficient for building a good translation system (Germann, 2001). We attempt an entirely different approach; we use cheap and plentiful monolingual resources to induce an end-to-end statistical machine translation system. In particular, we extend the long line of work on inducing translation lexicons (beginning with Rapp (1995)) and propose to use multiple independent cues present in monolingual texts to estimate lexical and phrasal translation probabilities for large, MT-scale phrase-tables. We then introduce a novel algorithm to estimate reordering features from monolingual data alone, and we report the performance of a phrase-based statistical model (Koehn et al., 2003) estimated using these monolingual features.

Most of the prior work on lexicon induction is motivated by the idea that it could be applied to machine translation but stops short of actually doing so. Lexicon induction holds the potential to create machine translation systems for languages which do not have extensive parallel corpora. Training would only require two large monolingual corpora and a small bilingual dictionary, if one is available. The idea is that intrinsic properties of monolingual data (possibly along with a handful of bilingual pairs to act as example mappings) can provide independent but informative cues to learn translations because words (and phrases) behave similarly across languages. This work is the first attempt to extend and apply these ideas to an end-to-end machine translation pipeline. While we make an explicit assumption that a table of phrasal translations is given a priori, we induce every other parameter of a full phrase-based translation system from monolingual data alone. The contributions of this work are:

• In Section 2.2 we analyze the challenges of using bilingual lexicon induction for statistical MT (performance on low frequency items, and moving from words to phrases).

• In Sections 3.1 and 3.2 we use multiple cues present in monolingual data to estimate lexical and phrasal translation scores.

• In Section 3.3 we propose a novel algorithm for estimating phrase reordering features from monolingual texts.

• Finally, in Section 5 we systematically drop feature functions from a phrase table and then replace them with monolingually estimated equivalents, reporting end-to-end translation quality.

2 Background

We begin with a brief overview of the standard phrase-based statistical machine translation model. Here, we define the parameters which we later replace with monolingual alternatives. We continue with a discussion of bilingual lexicon induction; we extend these methods to estimate the monolingual parameters in Section 3. This approach allows us to replace expensive/rare bilingual parallel training data with two large monolingual corpora, a small bilingual dictionary, and a ≈2,000 sentence bilingual development set, which are comparatively plentiful/inexpensive.

[Figure 1: The reordering probabilities from the phrase-based models are estimated from bilingual data by calculating how often in the parallel corpus a phrase pair (f, e) is orientated with the preceding phrase pair in the 3 types of orientations (monotone, swapped, and discontinuous). The figure shows an alignment grid for the sentence pair "Wieviel sollte man aufgrund seines Facebook Profils verdienen" / "How much should you charge for your Facebook profile", with each aligned phrase marked m, s, or d.]

2.1 Parameters of phrase-based SMT

Statistical machine translation (SMT) was first formulated as a series of probabilistic models that learn word-to-word correspondences from sentence-aligned bilingual parallel corpora (Brown et al., 1993). Current methods, including phrase-based (Och, 2002; Koehn et al., 2003) and hierarchical models (Chiang, 2005), typically start by word-aligning a bilingual parallel corpus (Och and Ney, 2003). They extract multi-word phrases that are consistent with the Viterbi word alignments and use these phrases to build new translations. A variety of parameters are estimated using the bitexts. Here we review the parameters of the standard phrase-based translation model (Koehn et al., 2007). Later we will show how to estimate them using monolingual texts instead. These parameters are:

• Phrase pairs. Phrase extraction heuristics (Venugopal et al., 2003; Tillmann, 2003; Och and Ney, 2004) produce a set of phrase pairs (e, f) that are consistent with the word alignments. In this paper we assume that the phrase pairs are given (without any scores), and we induce every other parameter of the phrase-based model from monolingual data.

• Phrase translation probabilities. Each phrase pair has a list of associated feature functions (FFs). These include phrase translation probabilities, φ(e|f) and φ(f|e), which are typically calculated via maximum likelihood estimation.

• Lexical weighting. Since MLE overestimates φ for phrase pairs with sparse counts, lexical weighting FFs are used to smooth. Average word translation probabilities, w(e_i|f_j), are calculated via phrase-pair-internal word alignments.

• Reordering model. Each phrase pair (e, f) also has associated reordering parameters, p_o(orientation|f, e), which indicate the distribution of its orientation with respect to the previously translated phrase. Orientations are monotone, swap, and discontinuous (Tillmann, 2004; Kumar and Byrne, 2004); see Figure 1.

• Other features. Other typical features are n-gram language model scores and a phrase penalty, which governs whether to use fewer longer phrases or more shorter phrases. These are not bilingually estimated, so we can re-use them directly without modification.

The features are combined in a log linear model, and their weights are set through minimum error rate training (Och, 2003). We use the same log linear formulation and MERT but propose alternatives derived directly from monolingual data for all parameters except for the phrase pairs themselves. Our pipeline still requires a small bitext of approximately 2,000 sentences to use as a development set for MERT parameter tuning.
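For reference, the standard log linear formulation referred to above (Och, 2003) can be written out as follows, although the original text does not display it; the h_i are the feature functions just listed and the λ_i their MERT-tuned weights:

    \hat{e} = \arg\max_{e} \; p(e \mid f) = \arg\max_{e} \; \sum_{i} \lambda_i \, h_i(e, f)

The monolingual features proposed in Section 3 simply take the place of some of the h_i; the decoder and tuning procedure are unchanged.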

2.2 Bilingual lexicon induction for SMT

Bilingual lexicon induction describes the class of algorithms that attempt to learn translations from monolingual corpora. Rapp (1995) was the first to propose using non-parallel texts to learn the translations of words. Using large, unrelated English and German corpora (with 163m and 135m words) and a small German-English bilingual dictionary (with 22k entries), Rapp (1999) demonstrated that reasonably accurate translations could be learned for 100 German nouns that were not contained in the seed bilingual dictionary. His algorithm worked by (1) building a context vector representing an unknown German word by counting its co-occurrence with all the other words in the German monolingual corpus, (2) projecting this German vector onto the vector space of English using the seed bilingual dictionary, (3) calculating the similarity of this sparse projected vector to vectors for English words that were constructed using the English monolingual corpus, and (4) outputting the English words with the highest similarity as the most likely translations.

[Figure 2: Accuracy of single-word translations induced using contextual similarity as a function of the source word corpus frequency. Accuracy is the proportion of the source words with at least one correct (bilingual dictionary) translation in the top 1 and top 10 candidate lists.]

A variety of subsequent work has extended the original idea either by exploring different measures of vector similarity (Fung and Yee, 1998) or by proposing other ways of measuring similarity beyond co-occurrence within a context window. For instance, Schafer and Yarowsky (2002) demonstrated that word translations tend to co-occur in time across languages. Koehn and Knight (2002) used similarity in spelling as another kind of cue that a pair of words may be translations of one another. Garera et al. (2009) defined context vectors using dependency relations rather than adjacent words. Bergsma and Van Durme (2011) used the visual similarity of labeled web images to learn translations of nouns. Additional related work on learning translations from monolingual corpora is discussed in Section 6.

In this paper, we apply bilingual lexicon induction methods to statistical machine translation. Given the obvious benefits of not having to rely on scarce bilingual parallel training data, it is surprising that bilingual lexicon induction has not been used for SMT before now. There are several open questions that make its applicability to SMT uncertain. Previous research on bilingual lexicon induction learned translations only for a small number of high frequency words (e.g. 100 nouns in Rapp (1995), 1,000 most frequent words in Koehn and Knight (2002), or 2,000 most frequent nouns in Haghighi et al. (2008)). Although previous work reported high translation accuracy, it may be misleading to extrapolate the results to SMT, where it is necessary to translate a much larger set of words and phrases, including many low frequency items.

In a preliminary study, we plotted the accuracy of translations against the frequency of the source words in the monolingual corpus. Figure 2 shows the result for translations induced using contextual similarity (defined in Section 3.1). Unsurprisingly, frequent terms have a substantially better chance of being paired with a correct translation, with words that only occur once having a low chance of being translated accurately.1 This problem is exacerbated when we move to multi-token phrases. As with phrase translation features estimated from parallel data, longer phrases are more sparse, making similarity scores less reliable than for single words.

Another impediment (not addressed in this paper) to using lexicon induction for SMT is the number of translations that must be learned. Learning translations for all words in the source language requires n^2 vector comparisons, since each word in the source language vocabulary must be compared against the vectors for all words in the target language vocabulary. The number of comparisons increases hugely if we compare vectors for multi-word phrases instead of just words. In this work, we avoid this problem by assuming that a limited set of phrase pairs is given a priori (but without scores). By limiting ourselves to phrases in a phrase table, we vastly limit the search space of possible translations. This is an idealization, because high quality translations are guaranteed to be present. However, as our lesion experiments in Section 5.1 show, a phrase table without accurate translation probability estimates is insufficient to produce high quality translations. We show that lexicon induction methods can be used to replace bilingual estimation of phrase- and lexical-translation probabilities, making a significant step towards SMT without parallel corpora.

1 For a description of the experimental setup used to produce these translations, see Experiment 8 in Section 5.2.

[Figure 3: Scoring contextual similarity of phrases: first, contextual vectors are projected using a small seed dictionary and then compared with the target language candidates. The figure shows a Spanish context vector over (s_1 ... s_N) for terrorista being projected, via dictionary entries such as economico/economic and extranjero/foreign, onto the English vector space (t_1 ... t_M) and compared with the context vector for terrorist.]

[Figure 4: Temporal histograms of the English phrase terrorist, its Spanish translation terrorista, and riqueza (wealth) collected from monolingual texts spanning a 13 year period. While the correct translation has a good temporal match, the non-translation riqueza has a distinctly different signature.]

3 Monolingual Parameter Estimation

We use bilingual lexicon induction methods to estimate the parameters of a phrase-based translation model from monolingual data. Instead of scores estimated from bilingual parallel data, we make use of cues present in monolingual data to provide multiple orthogonal estimates of similarity between a pair of phrases.

3.1 Phrasal similarity features

Contextual similarity. We extend the vector space approach of Rapp (1999) to compute similarity between phrases in the source and target languages. More formally, assume that (s_1, s_2, ..., s_N) and (t_1, t_2, ..., t_M) are (arbitrarily indexed) source and target vocabularies, respectively. A source phrase f is represented with an N-dimensional vector and a target phrase e with an M-dimensional vector (see Figure 3). The component values of the vector representing a phrase correspond to how often each of the words in that vocabulary appears within a two word window on either side of the phrase. These counts are collected using monolingual corpora. After the values have been computed, a contextual vector f is projected onto the English vector space using translations in a seed bilingual dictionary to map the component values into their appropriate English vector positions. This sparse projected vector is compared to the vectors representing all English phrases e. Each phrase pair in the phrase table is assigned a contextual similarity score c(f, e) based on the similarity between e and the projection of f.

Various means of computing the component values and vector similarity measures have been proposed in the literature (e.g. Rapp (1999), Fung and Yee (1998)). Following Fung and Yee (1998), we compute the value of the k-th component of f's contextual vector as follows:

    w_k = n_{f,k} × (log(n / n_k) + 1)

where n_{f,k} and n_k are the number of times s_k appears in the context of f and in the entire corpus, and n is the maximum number of occurrences of any word in the data. Intuitively, the more frequently s_k appears with f and the less common it is in the corpus in general, the higher its component value. Similarity between two vectors is measured as the cosine of the angle between them.
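As an illustration of this computation, the following Python sketch builds weighted context vectors, projects the source vector through a seed dictionary, and compares the result with cosine similarity. It is our own minimal rendering of the procedure described above; function names and data structures are illustrative, not the authors' released code.

    import math
    from collections import Counter

    def context_vector(phrase, sentences, window=2):
        # Count words within a two-word window on either side of each
        # occurrence of `phrase` (a tuple of tokens).
        counts, n = Counter(), len(phrase)
        for sent in sentences:
            for i in range(len(sent) - n + 1):
                if tuple(sent[i:i + n]) == phrase:
                    counts.update(sent[max(0, i - window):i])
                    counts.update(sent[i + n:i + n + window])
        return counts

    def weight_vector(counts, corpus_freq, max_freq):
        # w_k = n_{f,k} * (log(n / n_k) + 1), where n_k is the corpus
        # frequency of word k and n that of the most frequent word.
        return {w: c * (math.log(max_freq / corpus_freq[w]) + 1)
                for w, c in counts.items() if w in corpus_freq}

    def project(vector, seed_dict):
        # Map source components onto target positions via the seed
        # bilingual dictionary (source word -> target word).
        proj = Counter()
        for w, v in vector.items():
            if w in seed_dict:
                proj[seed_dict[w]] += v
        return proj

    def cosine(u, v):
        dot = sum(u[k] * v.get(k, 0.0) for k in u)
        den = (math.sqrt(sum(x * x for x in u.values())) *
               math.sqrt(sum(x * x for x in v.values())))
        return dot / den if den else 0.0

    # c(f, e): cosine between the projected, weighted source vector for f
    # and the weighted target-language vector for the candidate phrase e.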

Temporal similarity. In addition to contextual similarity, phrases in two languages may be scored in terms of their temporal similarity (Schafer and Yarowsky, 2002; Klementiev and Roth, 2006; Alfonseca et al., 2009). The intuition is that news stories in different languages will tend to discuss the same world events on the same day. The frequencies of translated phrases over time give them particular signatures that will tend to spike on the same dates. For instance, if the phrase asian tsunami is used frequently during a particular time span, the Spanish translation maremoto asiático is likely to also be used frequently during that time. Figure 4 illustrates how the temporal distribution of terrorist is more similar to Spanish terrorista than to other Spanish phrases. We calculate the temporal similarity between a pair of phrases, t(f, e), using the method defined by Klementiev and Roth (2006). We generate a temporal signature for each phrase by sorting the set of (time-stamped) documents in the monolingual corpus into a sequence of equally sized temporal bins and then counting the number of phrase occurrences in each bin. In our experiments, we set the window size to 1 day, so the size of temporal signatures is equal to the number of days spanned by our corpus. We use cosine distance to compare the normalized temporal signatures for a pair of phrases (f, e).
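The temporal signature computation lends itself to a similarly small sketch; the code below is again our own illustration of the described method (1-day bins, normalized signatures, cosine comparison), not the authors' implementation:

    from collections import Counter

    def temporal_signature(phrase, dated_docs, all_days):
        # dated_docs: iterable of (day, tokens) pairs; all_days: the sorted
        # sequence of days spanned by the corpus (1-day bins).
        sig, n = Counter(), len(phrase)
        for day, tokens in dated_docs:
            sig[day] += sum(1 for i in range(len(tokens) - n + 1)
                            if tuple(tokens[i:i + n]) == phrase)
        total = float(sum(sig.values())) or 1.0
        return [sig[d] / total for d in all_days]

    def temporal_similarity(sig_f, sig_e):
        # t(f, e): cosine between the two normalized daily signatures.
        dot = sum(a * b for a, b in zip(sig_f, sig_e))
        den = (sum(a * a for a in sig_f) ** 0.5 *
               sum(b * b for b in sig_e) ** 0.5)
        return dot / den if den else 0.0

The Wikipedia topic score w(f, e) introduced next is computed analogously, with one signature component per interlingually linked article pair instead of per day.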
Topic similarity. Phrases and their translations are likely to appear in articles written about the same topic in two languages. Thus, topic or category information associated with monolingual data can also be used to indicate similarity between a phrase and its candidate translation. In order to score a pair of phrases, we collect their topic signatures by counting their occurrences in each topic and then comparing the resulting vectors. We again use the cosine similarity measure on the normalized topic signatures. In our experiments, we use interlingual links between Wikipedia articles to estimate topic similarity. We treat each linked article pair as a topic and collect counts for each phrase across all articles in its corresponding language. Thus, the size of a phrase topic signature is the number of article pairs with interlingual links in Wikipedia, and each component contains the number of times the phrase appears in (the appropriate side of) the corresponding pair. Our Wikipedia-based topic similarity feature, w(f, e), is similar in spirit to polylingual topic models (Mimno et al., 2009), but it is scalable to full bilingual lexicon induction.

3.2 Lexical similarity features

In addition to the three phrase similarity features used in our model – c(f, e), t(f, e) and w(f, e) – we include four additional lexical similarity features for each phrase pair. The first three lexical features, c_lex(f, e), t_lex(f, e) and w_lex(f, e), are the lexical equivalents of the phrase-level contextual, temporal and Wikipedia topic similarity scores. They score the similarity of individual words within the phrases. To compute these lexical similarity features, we average similarity scores over all possible word alignments across the two phrases. Because individual words are more frequent than multiword phrases, the accuracy of c_lex, t_lex, and w_lex tends to be higher than that of their phrasal equivalents (this is similar to the effect observed in Figure 2).

Orthographic / phonetic similarity. The final lexical similarity feature that we incorporate is o(f, e), which measures the orthographic similarity between words in a phrase pair. Etymologically related words often retain similar spelling across languages with the same writing system, and low string edit distance sometimes signals translation equivalency. Berg-Kirkpatrick and Klein (2011) present methods for learning correspondences between the alphabets of two languages. We can also extend this idea to language pairs not sharing the same writing system, since many cognates, borrowed words, and names remain phonetically similar. Transliterations can be generated for tokens in a source phrase (Knight and Graehl, 1997), with o(f, e) calculating phonetic similarity rather than orthographic.

The three phrasal and four lexical similarity scores are incorporated into the log linear translation model as feature functions, replacing the bilingually estimated phrase translation probabilities φ and lexical weighting probabilities w. Our seven similarity scores are not the only ones that could be incorporated into the translation model. Various other similarity scores can be computed depending on the available monolingual data and its associated metadata (see, e.g., Schafer and Yarowsky (2002)).
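To make the lexical-level scores of Section 3.2 concrete, here is a sketch of one plausible implementation: a phrase-level lexical score that averages a word-level similarity over all source/target word pairs, instantiated with a length-normalized edit distance for o(f, e). The normalization choice is ours; the text above does not pin it down.

    def edit_distance(a, b):
        # Standard Levenshtein distance via dynamic programming.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,          # deletion
                               cur[-1] + 1,          # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]

    def ortho_word_sim(wf, we):
        # 1 minus length-normalized edit distance, in [0, 1].
        return 1.0 - edit_distance(wf, we) / max(len(wf), len(we), 1)

    def lexical_score(f_words, e_words, word_sim):
        # Average the word-level similarity over all cross-phrase word
        # pairs, i.e. over all possible single-word alignments.
        sims = [word_sim(wf, we) for wf in f_words for we in e_words]
        return sum(sims) / len(sims) if sims else 0.0

    # Example for o(f, e) on a hypothetical phrase pair:
    # lexical_score("nueva york".split(), "new york".split(), ortho_word_sim)

The same lexical_score helper, given a word-level contextual, temporal, or topic similarity function, would yield c_lex, t_lex, and w_lex respectively.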

3.3 Reordering

The remaining component of the phrase-based SMT model is the reordering model. We introduce a novel algorithm for estimating p_o(orientation|f, e) from two monolingual corpora instead of a bitext.

Figure 1 illustrates how the phrase pair orientation statistics are estimated in the standard phrase-based SMT pipeline. For a phrase pair like (f = "Profils", e = "profile"), we count its orientation with the previously translated phrase pair (f′ = "in Facebook", e′ = "Facebook") across all translated sentence pairs in the bitext.

In our pipeline we do not have translated sentence pairs. Instead, we look for monolingual sentences in the source corpus which contain the source phrase that we are interested in, like f = "Profils", and at least one other phrase that we have a translation for, like f′ = "in Facebook". We then look for all target language sentences in the target monolingual corpus that contain the translation of f (here e = "profile") and any translation of f′. Figure 6 illustrates that it is possible to find evidence for p_o(swapped|Profils, profile), even from the non-parallel, non-translated sentences drawn from two independent monolingual corpora. By looking for foreign sentences containing pairs of adjacent foreign phrases (f, f′) and English sentences containing their corresponding translations (e, e′), we are able to increment orientation counts for (f, e) by looking at whether e and e′ are adjacent, swapped, or discontinuous. The orientations correspond directly to those shown in Figure 1.

One subtlety of our method is that shorter and more frequent phrases (e.g. punctuation) are more likely to appear in multiple orientations with a given phrase, and therefore provide poor evidence of reordering. Therefore, we (a) collect the longest contextual phrases (which also appear in the phrase table) for reordering feature estimation, and (b) prune the set of sentences so that we only keep a small set of least frequent contextual phrases (this has the effect of dropping many function words and punctuation marks and relying more heavily on multi-word content phrases to estimate the reordering).2

Our algorithm for learning the reordering parameters is given in Figure 5. The algorithm estimates a probability distribution over monotone, swap, and discontinuous orientations (p_m, p_s, p_d) for a phrase pair (f, e) from two monolingual corpora C_f and C_e. It begins by calling CollectOccurs to collect the longest matching phrase table phrases that precede f in source monolingual data (B_f), as well as those that precede (B_e), follow (A_e), and are discontinuous (D_e) with e in the target language data. For each unique phrase f′ preceding f, we look up its translations in the phrase table T. Next, we count how many translations e′ of f′ appeared before, after, or were discontinuous with e in the target language data.3 Finally, the counts are normalized and returned. These normalized counts are the values we use as estimates of p_o(orientation|f, e).

[Figure 5: Algorithm for estimating reordering probabilities from monolingual data.]

    Input:  Source and target phrases f and e,
            source and target monolingual corpora C_f and C_e,
            phrase table pairs T = {(f^(i), e^(i))}, i = 1..N.
    Output: Orientation features (p_m, p_s, p_d).

    S_f ← sentences containing f in C_f
    S_e ← sentences containing e in C_e
    (B_f, −, −)     ← CollectOccurs(f, ∪_i f^(i), S_f)
    (B_e, A_e, D_e) ← CollectOccurs(e, ∪_i e^(i), S_e)
    c_m = c_s = c_d = 0
    foreach unique f′ in B_f do
        foreach translation e′ of f′ in T do
            c_m = c_m + #B_e(e′)
            c_s = c_s + #A_e(e′)
            c_d = c_d + #D_e(e′)
    c ← c_m + c_s + c_d
    return (c_m/c, c_s/c, c_d/c)

    CollectOccurs(r, R, S):
        B ← (); A ← (); D ← ()
        foreach sentence s ∈ S do
            foreach occurrence of phrase r in s do
                B ← B + (longest phrase preceding r and in R)
                A ← A + (longest phrase following r and in R)
                D ← D + (longest phrase discontinuous with r and in R)
        return (B, A, D)

[Figure 6: Collecting phrase orientation statistics for an English-German phrase pair ("profile", "Profils") from non-parallel sentences: the English sentence "What does your Facebook profile reveal" is paired with the German sentence "Das Anlegen eines Facebook Profils ist einfach", which translates as "Creating a Facebook profile is easy".]

2 The pruning step has an additional benefit of minimizing the memory needed for orientation feature estimations.
3 #L(x) returns the count of object x in list L.
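A runnable Python rendering of the Figure 5 algorithm follows. It is a sketch under simplifying assumptions: sentences are token lists, the phrase table is a dict from source-phrase tuples to sets of target-phrase tuples, the pruning of frequent contextual phrases discussed above is omitted, "discontinuous" is read as "separated by at least one token", and the uniform fallback for zero counts is our own choice; all names are illustrative.

    from collections import Counter

    def occurrences(phrase, sent):
        n = len(phrase)
        return [i for i in range(len(sent) - n + 1)
                if tuple(sent[i:i + n]) == phrase]

    def longest_adjacent(sent, pos, phrases, before):
        # Longest phrase from `phrases` ending exactly at `pos`
        # (before=True) or starting exactly at `pos` (before=False).
        limit = pos if before else len(sent) - pos
        for ln in range(limit, 0, -1):
            cand = (tuple(sent[pos - ln:pos]) if before
                    else tuple(sent[pos:pos + ln]))
            if cand in phrases:
                return cand
        return None

    def longest_discontinuous(sent, i, j, phrases):
        # Longest phrase from `phrases` separated from the span [i, j)
        # by at least one token.
        best = None
        for s in range(len(sent)):
            for t in range(s + 1, len(sent) + 1):
                if (t < i or s > j) and tuple(sent[s:t]) in phrases:
                    if best is None or t - s > len(best):
                        best = tuple(sent[s:t])
        return best

    def collect_occurs(phrase, phrases, sentences):
        # CollectOccurs from Figure 5: counters of the longest in-table
        # phrases occurring Before, After, and Discontinuous with `phrase`.
        B, A, D = Counter(), Counter(), Counter()
        for sent in sentences:
            for i in occurrences(phrase, sent):
                j = i + len(phrase)
                for ctr, m in ((B, longest_adjacent(sent, i, phrases, True)),
                               (A, longest_adjacent(sent, j, phrases, False)),
                               (D, longest_discontinuous(sent, i, j, phrases))):
                    if m is not None:
                        ctr[m] += 1
        return B, A, D

    def orientation_features(f, e, table, sents_f, sents_e):
        # table: dict mapping source phrase tuples to sets of target tuples.
        src = set(table)
        tgt = {t for ts in table.values() for t in ts}
        Bf, _, _ = collect_occurs(f, src, sents_f)
        Be, Ae, De = collect_occurs(e, tgt, sents_e)
        cm = cs = cd = 0
        for f2 in Bf:                      # phrases preceding f ...
            for e2 in table.get(f2, ()):   # ... and their translations
                cm += Be[e2]               # e2 also precedes e -> monotone
                cs += Ae[e2]               # e2 follows e       -> swapped
                cd += De[e2]               # e2 elsewhere       -> discontinuous
        c = cm + cs + cd
        return (cm / c, cs / c, cd / c) if c else (1 / 3, 1 / 3, 1 / 3)

The quadratic scan in longest_discontinuous is adequate for a sketch; an indexed implementation would be needed at the scale of the experiments below.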

Monolingual training corpora:

                        Europarl      Gigaword         Wikipedia
    date range          4/96-10/09    5/94-12/08       n/a
    uniq shared dates   829           5,249            n/a
    Spanish articles    n/a           3,727,954        59,463
    English articles    n/a           4,862,876        59,463
    Spanish lines       1,307,339     22,862,835       2,598,269
    English lines       1,307,339     67,341,030       3,630,041
    Spanish words       28,248,930    774,813,847      39,738,084
    English words       27,335,006    1,827,065,374    61,656,646

Spanish-English phrase table:

    Phrase pairs                    3,093,228
    Spanish phrases                 89,386
    English phrases                 926,138
    Spanish unigrams                13,216
      Avg # translations            98.7
    Spanish bigrams                 41,426
      Avg # translations            31.9
    Spanish trigrams                34,744
      Avg # translations            13.5

Table 1: Statistics about the monolingual training data and the phrase table that was used in all of the experiments.

4 Experimental Setup

We use the Spanish-English language pair to test our method for estimating the parameters of an SMT system from monolingual corpora. This allows us to compare our method against the normal bilingual training procedure. We expect bilingual training to result in higher translation quality because it is a more direct method for learning translation probabilities. We systematically remove different parameters from the standard phrase-based model, and then replace them with our monolingual equivalents. Our goal is to recover as much of the loss as possible for each of the deleted bilingual components.

The standard phrase-based model that we use as our top-line is the Moses system (Koehn et al., 2007) trained over the full Europarl v5 parallel corpus (Koehn, 2005). With the exception of maximum phrase length (set to 3 in our experiments), we used default values for all of the parameters. All experiments use a trigram language model trained on the English side of the Europarl corpus using SRILM with Kneser-Ney smoothing. To tune feature weights in minimum error rate training, we use a development bitext of 2,553 sentence pairs, and we evaluate performance on a test set of 2,525 single-reference translated newswire articles. These development and test datasets were distributed in the WMT shared task (Callison-Burch et al., 2010).4 MERT was re-run for every experiment.

We estimate the parameters of our model from two sets of monolingual data, detailed in Table 1:

• First, we treat the two sides of the Europarl parallel corpus as independent, monolingual corpora. Haghighi et al. (2008) also used this method to show how well translations could be learned from monolingual corpora under ideal conditions, where the contextual and temporal distributions of words in the two monolingual corpora are nearly identical.

• Next, we estimate the features from truly monolingual corpora. To estimate the contextual and temporal similarity features, we use the Spanish and English Gigaword corpora.5 These corpora are substantially larger than the Europarl corpora, providing 27x as much Spanish and 67x as much English for contextual similarity, and 6x as many paired dates for temporal similarity. Topical similarity is estimated using Spanish and English Wikipedia articles that are paired with interlanguage links.

To project context vectors from Spanish to English, we use a bilingual dictionary containing entries for 49,795 Spanish words. Note that end-to-end translation quality is robust to substantially reducing the dictionary size, but we omit these experiments due to space constraints. The context vectors for words and phrases incorporate co-occurrence counts using a two-word window on either side.

The title of our paper uses the word toward because we assume that an inventory of phrase pairs is given. Future work will explore inducing the phrase table itself from monolingual texts. Across all of our experiments, we use the phrase table that the bilingual model learned from the Europarl parallel corpus. We keep all of its phrase pairs, but we drop all of its scores. Table 1 gives details of its phrase pairs. In our experiments, we estimated the similarity and reordering scores for more than 3 million phrase pairs. For each source phrase, the set of possible translations was constrained and likely to contain good translations. However, the average number of possible translations was high (ranging from nearly 100 translations for each unigram to 14 for each trigram). These contain a lot of noise and result in low end-to-end translation quality without good estimates of translation quality, as the lesion experiments in Section 5.1 show.

Software. Because many details of our estimation procedures must be omitted for space, we distribute our full set of code along with scripts for running our experiments and output translations. These may be downloaded from http://www.cs.jhu.edu/~anni/papers/lowresmt/

4 Specifically, news-test2008 plus news-syscomb2009 for dev and newstest2009 for test.
5 We use the afp, apw and xin sections of the corpora.

5 Experimental Results

Figures 7 and 8 give experimental results. Figure 7 shows the performance of the standard phrase-based model when each of the bilingually estimated features is removed, and how much of the performance loss can be recovered when they are replaced with our monolingual features estimated from the Europarl training corpus, treating each side as an independent monolingual corpus. Figure 8 shows the recovery when truly monolingual corpora are used to estimate the parameters.

Both figures plot BLEU for the experimental conditions below. Each label gives the source of the phrase table scores followed by the source of the reordering scores (B = bilingual, M = all monolingual, t/o/c/w = the individual temporal, orthographic, contextual, and Wikipedia topical features, – = none; dropped reordering scores are replaced by a simple distortion model):

    Exp     Label   Phrase scores / reordering scores
    1       B/B     bilingual / bilingual (Moses)
    2       B/–     bilingual / distortion
    3       –/B     none / bilingual
    4       –/–     none / distortion
    5, 12   –/M     none / mono
    6, 13   t/–     temporal (mono) / distortion
    7, 14   o/–     orthographic (mono) / distortion
    8, 15   c/–     contextual (mono) / distortion
    16      w/–     Wikipedia topical (mono) / distortion
    9, 17   M/–     all mono / distortion
    10, 18  M/M     all mono / mono
    11, 19  BM/B    all mono + bilingual / bilingual

[Figure 7: Much of the loss in BLEU score when bilingually estimated parameters of the Spanish-English translation system are removed (experiments 1-4) can be recovered when they are replaced with monolingual equivalents estimated from the Europarl data (experiments 5-11). The first part of each label indicates how the phrase-table features are estimated, the second how the reordering probabilities are. The Moses baseline (B/B) scores 21.87 BLEU.]

[Figure 8: Performance of the monolingual features derived from truly monolingual corpora (experiments 12-19). Over 82% of the BLEU score loss can be recovered.]

5.1 Lesion experiments

Experiments 1-4 remove bilingually estimated parameters from the standard model. For Spanish-English, the relative contribution of the phrase-table features (which include the phrase translation probabilities φ and the lexical weights w) is greater than that of the reordering probabilities. When the reordering probability p_o(orientation|f, e) is eliminated and replaced with a simple distance-based distortion feature that does not require a bitext to estimate, the score dips only marginally, since word order in English and Spanish is similar. However, when both the reordering features and the phrase table features are dropped, leaving only the LM feature and the phrase penalty, the resulting translation quality is abysmal, with the score dropping a total of over 17 BLEU points.

5.2 Adding equivalent monolingual features

Experiments 5-10 show how much our monolingual equivalents could recover when the monolingual corpora are drawn from the two sides of the bitext.

For instance, our algorithm for estimating reordering probabilities from monolingual data (–/M) adds 5 BLEU points, which is 73% of the potential recovery going from the model (–/–) to the model with bilingual reordering features (–/B). Of the temporal, orthographic, and contextual monolingual features, the temporal feature performs the best. Together (M/–), they recover more than each does individually. Combining monolingually estimated reordering and phrase table features (M/M) yields a total gain of 13.5 BLEU points, or over 75% of the BLEU score loss that occurred when we dropped all features from the phrase table. However, these results use "monolingual" corpora which have practically identical phrasal and temporal distributions.

5.3 Estimating features using truly monolingual corpora

Experiments 12-18 estimate all of the features from truly monolingual corpora. Our novel algorithm for estimating reordering holds up well and recovers 69% of the loss, only 0.4 BLEU points less than when estimated from the Europarl monolingual texts. The temporal similarity feature does not perform as well as when it was estimated using Europarl data, but the contextual feature does. The topic similarity using Wikipedia performs the strongest of the individual features.

Combining the monolingually estimated reordering features with the monolingually estimated similarity features (M/M) yields a total gain of 14.8 BLEU points, or over 82% of the BLEU point loss that occurred when we dropped all features from the phrase table. This is equivalent to training the standard system on a bitext with roughly 60,000 lines or nearly 2 million words (learning curve omitted for space).

Finally, we supplement the standard bilingually estimated model parameters with our monolingual features (BM/B), and we see a 1.5 BLEU point increase over the standard model. Therefore, our monolingually estimated scores capture some novel information not contained in the standard feature set.

6 Additional Related Work

Carbonell et al. (2006) described a data-driven MT system that used no parallel text. It produced translation lattices using a bilingual dictionary and scored them using an n-gram language model. Their method has no notion of translation similarity aside from a bilingual dictionary. Similarly, Sánchez-Cartagena et al. (2011) supplement an SMT phrase table with translation pairs extracted from a bilingual dictionary and give each a frequency of one for computing translation scores. Ravi and Knight (2011) treat MT without parallel training data as a decipherment task and learn a translation model from monolingual text. They translate corpora of Spanish time expressions and subtitles, which both have a limited vocabulary, into English. Their method has not been applied to broader domains of text.

Most work on learning translations from monolingual texts only examines small numbers of frequent words. Huang et al. (2005) and Daumé and Jagarlamudi (2011) are exceptions that improve MT by mining translations for OOV items.

A variety of past research has focused on mining parallel or comparable corpora from the web (Munteanu and Marcu, 2006; Smith et al., 2010; Uszkoreit et al., 2010). Others use an existing SMT system to discover parallel sentences within independent monolingual texts, and use them to re-train and enhance the system (Schwenk, 2008; Chen et al., 2008; Schwenk and Senellart, 2009; Rauf and Schwenk, 2009; Lambert et al., 2011). These are complementary but orthogonal to our research goals.

7 Conclusion

This paper has demonstrated a novel set of techniques for successfully estimating phrase-based SMT parameters from monolingual corpora, potentially circumventing the need for large bitexts, which are expensive to obtain for new languages and domains. We evaluated the performance of our algorithms in a full end-to-end translation system. Assuming that a bilingual-corpus-derived phrase table is available, we were able to utilize our monolingually estimated features to recover over 82% of the BLEU loss that resulted from removing the bilingual-corpus-derived phrase-table probabilities. We also showed that our monolingual features add 1.5 BLEU points when combined with standard bilingually estimated features. Thus our techniques have stand-alone efficacy when large bilingual corpora are not available, and they also make a significant contribution to combined ensemble performance when they are.

References

Enrique Alfonseca, Massimiliano Ciaramita, and Keith Hall. 2009. Gazpacho and summer rash: lexical relationships from temporal patterns of web search queries. In Proceedings of EMNLP.

Taylor Berg-Kirkpatrick and Dan Klein. 2011. Simple effective decipherment via combinatorial optimization. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP-2011), Edinburgh, Scotland, UK.

Shane Bergsma and Benjamin Van Durme. 2011. Learning bilingual lexicons using the visual similarity of labeled web images. In Proceedings of the International Joint Conference on Artificial Intelligence.

Peter Brown, John Cocke, Stephen Della Pietra, Vincent Della Pietra, Frederick Jelinek, Robert Mercer, and Paul Roossin. 1988. A statistical approach to language translation. In 12th International Conference on Computational Linguistics (CoLing-1988).

Peter Brown, Stephen Della Pietra, Vincent Della Pietra, and Robert Mercer. 1993. The mathematics of machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311, June.

Chris Callison-Burch, Philipp Koehn, Christof Monz, Kay Peterson, Mark Przybocki, and Omar Zaidan. 2010. Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In Proceedings of the Workshop on Statistical Machine Translation.

Jaime Carbonell, Steve Klein, David Miller, Michael Steinbaum, Tomer Grassiany, and Jochen Frey. 2006. Context-based machine translation. In Proceedings of AMTA.

Boxing Chen, Min Zhang, Aiti Aw, and Haizhou Li. 2008. Exploiting n-best hypotheses for SMT self-enhancement. In Proceedings of ACL/HLT, pages 157–160.

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of ACL.

Hal Daumé and Jagadeesh Jagarlamudi. 2011. Domain adaptation for machine translation by mining unseen words. In Proceedings of ACL/HLT.

Pascale Fung and Lo Yuen Yee. 1998. An IR approach for translating new words from nonparallel, comparable texts. In Proceedings of ACL/CoLing.

Nikesh Garera, Chris Callison-Burch, and David Yarowsky. 2009. Improving translation lexicon induction from monolingual corpora via dependency contexts and part-of-speech equivalences. In Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), Boulder, Colorado.

Ulrich Germann. 2001. Building a statistical machine translation system from scratch: How much bang for the buck can we expect? In ACL 2001 Workshop on Data-Driven Machine Translation, Toulouse, France.

Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proceedings of ACL/HLT.

Fei Huang, Ying Zhang, and Stephan Vogel. 2005. Mining key phrase translations from web corpora. In Proceedings of EMNLP.

Alexandre Klementiev and Dan Roth. 2006. Weakly supervised named entity transliteration and discovery from multilingual comparable corpora. In Proceedings of ACL/Coling.

Kevin Knight and Jonathan Graehl. 1997. Machine transliteration. In Proceedings of ACL.

Philipp Koehn and Kevin Knight. 2002. Learning a translation lexicon from monolingual corpora. In ACL Workshop on Unsupervised Lexical Acquisition.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of HLT/NAACL.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the ACL-2007 Demo and Poster Sessions.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Machine Translation Summit.

Shankar Kumar and William Byrne. 2004. Local phrase reordering models for statistical machine translation. In Proceedings of HLT/NAACL.

Patrik Lambert, Holger Schwenk, Christophe Servan, and Sadaf Abdul-Rauf. 2011. Investigations on translation model adaptation using monolingual data. In Proceedings of the Workshop on Statistical Machine Translation, pages 284–293, Edinburgh, Scotland, UK.

David Mimno, Hanna Wallach, Jason Naradowsky, David Smith, and Andrew McCallum. 2009. Polylingual topic models. In Proceedings of EMNLP.

Dragos Stefan Munteanu and Daniel Marcu. 2006. Extracting parallel sub-sentential fragments from non-parallel corpora. In Proceedings of ACL/Coling.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Franz Josef Och and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30(4):417–449.

Franz Josef Och. 2002. Statistical Machine Translation: From Single-Word Models to Alignment Templates. Ph.D. thesis, RWTH Aachen.

Franz Josef Och. 2003. Minimum error rate training for statistical machine translation. In Proceedings of ACL.

Reinhard Rapp. 1995. Identifying word translations in non-parallel texts. In Proceedings of ACL.

Reinhard Rapp. 1999. Automatic identification of word translations from unrelated English and German corpora. In Proceedings of ACL.

Sadaf Abdul Rauf and Holger Schwenk. 2009. On the use of comparable corpora to improve SMT performance. In Proceedings of EACL.

Sujith Ravi and Kevin Knight. 2011. Deciphering foreign language. In Proceedings of ACL/HLT.

Víctor M. Sánchez-Cartagena, Felipe Sánchez-Martínez, and Juan Antonio Pérez-Ortiz. 2011. Integrating shallow-transfer rules into phrase-based statistical machine translation. In Proceedings of the XIII Machine Translation Summit.

Charles Schafer and David Yarowsky. 2002. Inducing translation lexicons via diverse similarity measures and bridge languages. In Proceedings of CoNLL.

Holger Schwenk and Jean Senellart. 2009. Translation model adaptation for an Arabic/French news translation system by lightly-supervised training. In MT Summit.

Holger Schwenk. 2008. Investigations on large-scale lightly-supervised training for statistical machine translation. In Proceedings of IWSLT.

Jason R. Smith, Chris Quirk, and Kristina Toutanova. 2010. Extracting parallel sentences from comparable corpora using document level alignment. In Proceedings of HLT/NAACL.

Christoph Tillmann. 2004. A unigram orientation model for statistical machine translation. In Proceedings of HLT/NAACL.

Christoph Tillmann. 2003. A projection extension algorithm for statistical machine translation. In Proceedings of EMNLP.

Jakob Uszkoreit, Jay M. Ponte, Ashok C. Popat, and Moshe Dubiner. 2010. Large scale parallel document mining for machine translation. In Proceedings of CoLing.

Ashish Venugopal, Stephan Vogel, and Alex Waibel. 2003. Effective phrase translation extraction from alignment models. In Proceedings of ACL.
