
Model Combination for Machine Translation

John DeNero (UC Berkeley); Shankar Kumar, Ciprian Chelba, and Franz Och (Google, Inc.)
[email protected]  {shankarkumar,ciprianchelba,och}@google.com

Abstract

Machine translation benefits from two types of decoding techniques: consensus decoding over multiple hypotheses under a single model and system combination over hypotheses from different models. We present model combination, a method that integrates consensus decoding and system combination into a unified, forest-based technique. Our approach makes few assumptions about the underlying component models, enabling us to combine systems with heterogeneous structure. Unlike most system combination techniques, we reuse the search space of component models, which entirely avoids the need to align translation hypotheses. Despite its relative simplicity, model combination improves translation quality over a pipelined approach of first applying consensus decoding to individual systems, and then applying system combination to their output. We demonstrate BLEU improvements across data sets and language pairs in large-scale experiments.

1 Introduction

Once statistical translation models are trained, a decoding approach determines what translations are finally selected. Two parallel lines of research have shown consistent improvements over the standard max-derivation decoding objective, which selects the highest-probability derivation. Consensus decoding procedures select translations for a single system by optimizing for model predictions about n-grams, motivated either as minimizing Bayes risk (Kumar and Byrne, 2004), maximizing sentence similarity (DeNero et al., 2009), or approximating a max-translation objective (Li et al., 2009b). System combination procedures, on the other hand, generate translations from the output of multiple component systems (Frederking and Nirenburg, 1994). In this paper, we present model combination, a technique that unifies these two approaches by learning a consensus model over the n-gram features of multiple underlying component models.

Model combination operates over the component models' posterior distributions over translation derivations, encoded as a forest of derivations.¹ We combine these components by constructing a linear consensus model that includes features from each component. We then optimize this consensus model over the space of all translation derivations in the support of all component models' posterior distributions. By reusing the components' search spaces, we entirely avoid the hypothesis alignment problem that is central to standard system combination approaches (Rosti et al., 2007).

Forest-based consensus decoding techniques differ in whether they capture model predictions through n-gram posteriors (Tromble et al., 2008; Kumar et al., 2009) or expected n-gram counts (DeNero et al., 2009; Li et al., 2009b). We evaluate both in controlled experiments, demonstrating their empirical similarity. We also describe algorithms for expanding translation forests to ensure that n-grams are local to a forest's hyperedges, and for exactly computing n-gram posteriors efficiently.

Model combination assumes only that each translation model can produce expectations of n-gram features; the latent derivation structures of component systems can differ arbitrarily. This flexibility allows us to combine phrase-based, hierarchical, and syntax-augmented translation models. We evaluate by combining three large-scale systems on Chinese-English and Arabic-English NIST data sets, demonstrating improvements of up to 1.4 BLEU over the best single-system max-derivation baseline, and consistent improvements over a more complex multi-system pipeline that includes independent consensus decoding and system combination.

¹ In this paper, we use the terms translation forest and hypergraph interchangeably.
[Figure 1: An example translation forest encoding two synchronous derivations for a Spanish sentence ("Yo vi al hombre con el telescopio"): one solid and one dotted. Nodes are annotated with their left and right unigram contexts, and hyperedges are annotated with scores \theta \cdot \phi(r) and the bigrams they introduce.]

[Figure 2: The model combination pipeline, illustrated for a phrase-based and a hierarchical component system. Step 1: Compute Combination Features; Step 2: Construct a Search Space; Step 3: Add Features for the Combination Model; Step 4: Model Training and Inference, with w = \arg\max_w \mathrm{BLEU}(\arg\max_{d \in D(f)} s_w(d); e) and d^* = \arg\max_{d \in D} s_w(d).]

2 Model Combination

Model combination is a model-based approach to selecting translations using information from multiple component systems. Each system provides its posterior distribution over derivations P_i(d|f), encoded as a weighted translation forest (i.e., translation hypergraph) in which hyperedges correspond to translation rule applications r.² The conditional distribution over derivations takes the form:

P_i(d|f) = \frac{\exp \sum_{r \in d} \theta_i \cdot \phi_i(r)}{\sum_{d' \in D(f)} \exp \sum_{r \in d'} \theta_i \cdot \phi_i(r)}

where D(f) is the set of synchronous derivations encoded in the forest, r iterates over rule applications in d, and \theta_i is the parameter vector for system i. The feature vector \phi_i is system specific and includes both translation model and language model features. Figure 1 depicts an example forest.

Model combination includes four steps, described below. The entire sequence is illustrated in Figure 2.

² Phrase-based systems produce phrase lattices, which are instances of forests with arity 1.

2.1 Computing Combination Features

The first step in model combination is to compute n-gram expectations from component system posteriors—the same quantities found in MBR, consensus, and variational decoding techniques. For an n-gram g and system i, the expectation

v_i^n(g) = E_{P_i(d|f)}[h(d, g)]

can be either an n-gram expected count, if h(d, g) is the count of g in d, or the posterior probability that d contains g, if h(d, g) is an indicator function. Section 3 describes how to compute these features efficiently.

2.2 Constructing a Search Space

The second step in model combination constructs a hypothesis space of translation derivations, which includes all derivations present in the forests contributed by each component system. This search space D is also a translation forest, and consists of the conjoined union of the component forests. Let R_i be the root node of component hypergraph D_i. For all i, we include all of D_i in D, along with an edge from R_i to R, the root of D. D may contain derivations from different types of translation systems. However, D only contains derivations (and therefore translations) that appeared in the hypothesis space of some component system. We do not intermingle the component search spaces in any way.
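To make the conjoined union concrete, the following minimal sketch builds the combined search space D from a list of component forests by adding one bridging edge from each component root R_i to a new global root R. The data structures (Node, Hyperedge, Hypergraph) and the conjoin function are illustrative assumptions, not code from the paper or any released system; any hypergraph representation with a distinguished root would work the same way.

```python
# Sketch of Section 2.2: conjoining component forests into a single search
# space D.  The classes below are hypothetical stand-ins for a real forest API.

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    label: str              # e.g., a source span plus boundary n-gram context


@dataclass
class Hyperedge:
    head: Node               # node produced by applying a rule
    tails: list              # child nodes consumed by the rule
    features: dict            # hyperedge-local features


@dataclass
class Hypergraph:
    root: Node
    edges: list = field(default_factory=list)


def conjoin(components: list) -> Hypergraph:
    """Include every component forest unchanged, then add one unary edge from
    each component root R_i to a new global root R.  Component derivations are
    never intermingled: every derivation of D lies entirely within one D_i."""
    global_root = Node(label="<ROOT>")
    combined = Hypergraph(root=global_root)
    for i, forest in enumerate(components):
        combined.edges.extend(forest.edges)      # reuse the search space as-is
        combined.edges.append(
            Hyperedge(
                head=global_root,
                tails=[forest.root],
                # the system indicator feature alpha_i fires on this bridge edge
                features={f"alpha_{i}": 1.0},
            )
        )
    return combined
```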
2.3 Features for the Combination Model

The third step defines a new combination model over all of the derivations in the search space D, and then annotates D with features that allow for efficient model inference. We use a linear model over four types of feature functions of a derivation:

1. Combination feature functions on n-grams, v_i^n(d) = \sum_{g \in \mathrm{Ngrams}(d)} v_i^n(g), score a derivation according to the n-grams it contains.

2. Model score feature function b gives the model score \theta_i \cdot \phi_i(d) of a derivation d under the system i that d is from.

3. A length feature \ell computes the word length of the target-side yield of a derivation.

4. A system indicator feature \alpha_i is 1 if the derivation came from system i, and 0 otherwise.

All of these features are local to rule applications (hyperedges) in D. The combination features provide information sharing across the derivations of different systems, but are functions of n-grams, and so can be scored.
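Although these features decompose over hyperedges in practice, the linear form of the combination model is easiest to see on a complete derivation. The sketch below scores one derivation using the four feature types listed above; all names (combination_score, the weight keys, the ngram_features layout) are illustrative assumptions rather than the paper's implementation, and the n-gram values v_i^n(g) are assumed to have been precomputed from each component forest as in Section 2.1.

```python
# Hypothetical sketch of the linear combination model of Section 2.3,
# evaluated on a complete derivation rather than edge by edge.


def ngrams(tokens, n):
    """All n-grams of order n in a token sequence."""
    return [tuple(tokens[j:j + n]) for j in range(len(tokens) - n + 1)]


def combination_score(derivation, weights, ngram_features, max_order=4):
    """derivation: dict with 'tokens' (target-side yield), 'model_score'
    (theta_i . phi_i(d) under its own system), and 'system' (index i).
    ngram_features: ngram_features[i][n][g] = v_i^n(g), an expected count or
    posterior probability computed from system i's forest (Section 2.1).
    weights: the consensus model's weight vector, keyed by feature name."""
    score = 0.0
    tokens = derivation["tokens"]

    # 1. Combination features: v_i^n(d) = sum of v_i^n(g) over n-grams g in d
    for i, per_order in enumerate(ngram_features):
        for n in range(1, max_order + 1):
            v = sum(per_order.get(n, {}).get(g, 0.0) for g in ngrams(tokens, n))
            score += weights.get(f"v_{i}_{n}", 0.0) * v

    # 2. Model score feature b: the originating system's own score of d
    score += weights.get("b", 0.0) * derivation["model_score"]

    # 3. Length feature: number of target-side words
    score += weights.get("length", 0.0) * len(tokens)

    # 4. System indicator feature alpha_i for the originating system
    score += weights.get(f"alpha_{derivation['system']}", 0.0) * 1.0

    return score
```

Because every term above is a sum over n-grams or a single per-derivation quantity, the same score can be accumulated hyperedge by hyperedge once the n-grams are local to hyperedges, which is what makes inference over the conjoined forest efficient.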