
Model Combination for Machine Translation

John DeNero (UC Berkeley); Shankar Kumar, Ciprian Chelba, and Franz Och (Google, Inc.)
[email protected]  {shankarkumar,ciprianchelba,och}@google.com

Abstract

Machine translation benefits from two types of decoding techniques: consensus decoding over multiple hypotheses under a single model and system combination over hypotheses from different models. We present model combination, a method that integrates consensus decoding and system combination into a unified, forest-based technique. Our approach makes few assumptions about the underlying component models, enabling us to combine systems with heterogeneous structure. Unlike most system combination techniques, we reuse the search space of component models, which entirely avoids the need to align translation hypotheses. Despite its relative simplicity, model combination improves translation quality over a pipelined approach of first applying consensus decoding to individual systems, and then applying system combination to their output. We demonstrate BLEU improvements across data sets and language pairs in large-scale experiments.

1 Introduction

Once statistical translation models are trained, a decoding approach determines what translations are finally selected. Two parallel lines of research have shown consistent improvements over the standard max-derivation decoding objective, which selects the highest-probability derivation. Consensus decoding procedures select translations for a single system by optimizing for model predictions about n-grams, motivated either as minimizing Bayes risk (Kumar and Byrne, 2004), maximizing sentence similarity (DeNero et al., 2009), or approximating a max-translation objective (Li et al., 2009b). System combination procedures, on the other hand, generate translations from the output of multiple component systems (Frederking and Nirenburg, 1994). In this paper, we present model combination, a technique that unifies these two approaches by learning a consensus model over the n-gram features of multiple underlying component models.

Model combination operates over the component models' posterior distributions over translation derivations, encoded as a forest of derivations.¹ We combine these components by constructing a linear consensus model that includes features from each component. We then optimize this consensus model over the space of all translation derivations in the support of all component models' posterior distributions. By reusing the components' search spaces, we entirely avoid the hypothesis alignment problem that is central to standard system combination approaches (Rosti et al., 2007).

Forest-based consensus decoding techniques differ in whether they capture model predictions through n-gram posteriors (Tromble et al., 2008; Kumar et al., 2009) or expected n-gram counts (DeNero et al., 2009; Li et al., 2009b). We evaluate both in controlled experiments, demonstrating their empirical similarity. We also describe algorithms for expanding translation forests to ensure that n-grams are local to a forest's hyperedges, and for exactly computing n-gram posteriors efficiently.

Model combination assumes only that each translation model can produce expectations of n-gram features; the latent derivation structures of component systems can differ arbitrarily. This flexibility allows us to combine phrase-based, hierarchical, and syntax-augmented translation models. We evaluate by combining three large-scale systems on Chinese-English and Arabic-English NIST data sets, demonstrating improvements of up to 1.4 BLEU over the best single-system max-derivation baseline, and consistent improvements over a more complex multi-system pipeline that includes independent consensus decoding and system combination.

¹ In this paper, we use the terms translation forest and hypergraph interchangeably.
[Figure 1: An example translation forest encoding two synchronous derivations for a Spanish sentence ("Yo vi al hombre con el telescopio"): one solid and one dotted. Nodes are annotated with their left and right unigram contexts, and hyperedges are annotated with scores \theta \cdot \phi(r) and the bigrams they introduce.]

[Figure 2: The model combination pipeline, illustrated for a phrase-based and a hierarchical component system. Step 1: Compute Combination Features; Step 2: Construct a Search Space; Step 3: Add Features for the Combination Model; Step 4: Model Training and Inference, with w = \arg\max_w \mathrm{BLEU}(\arg\max_{d \in D(f)} s_w(d); e) and d^* = \arg\max_{d \in D} s_w(d).]

2 Model Combination

Model combination is a model-based approach to selecting translations using information from multiple component systems. Each system provides its posterior distribution over derivations P_i(d|f), encoded as a weighted translation forest (i.e., translation hypergraph) in which hyperedges correspond to translation rule applications r.² The conditional distribution over derivations takes the form:

P_i(d|f) = \frac{\exp \sum_{r \in d} \theta_i \cdot \phi_i(r)}{\sum_{d' \in D(f)} \exp \sum_{r \in d'} \theta_i \cdot \phi_i(r)}

where D(f) is the set of synchronous derivations encoded in the forest, r iterates over rule applications in d, and \theta_i is the parameter vector for system i. The feature vector \phi_i is system specific and includes both translation model and language model features. Figure 1 depicts an example forest.

Model combination includes four steps, described below. The entire sequence is illustrated in Figure 2.

² Phrase-based systems produce phrase lattices, which are instances of forests with arity 1.

2.1 Computing Combination Features

The first step in model combination is to compute n-gram expectations from component system posteriors—the same quantities found in MBR, consensus, and variational decoding techniques. For an n-gram g and system i, the expectation

v_i^n(g) = E_{P_i(d|f)}[h(d, g)]

can be either an n-gram expected count, if h(d, g) is the count of g in d, or the posterior probability that d contains g, if h(d, g) is an indicator function. Section 3 describes how to compute these features efficiently.

2.2 Constructing a Search Space

The second step in model combination constructs a hypothesis space of translation derivations, which includes all derivations present in the forests contributed by each component system. This search space D is also a translation forest, and consists of the conjoined union of the component forests. Let R_i be the root node of component hypergraph D_i. For all i, we include all of D_i in D, along with an edge from R_i to R, the root of D. D may contain derivations from different types of translation systems. However, D only contains derivations (and therefore translations) that appeared in the hypothesis space of some component system. We do not intermingle the component search spaces in any way.
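To make the conjoined union concrete, the following minimal sketch builds the combined search space D from a list of component forests by adding one bridging edge from each component root R_i to a new global root R. The data structures (Node, Hyperedge, Hypergraph) and the conjoin function are illustrative assumptions, not code from the paper or any released system; any hypergraph representation with a distinguished root would work the same way.

```python
# Sketch of Section 2.2: conjoining component forests into a single search
# space D.  The classes below are hypothetical stand-ins for a real forest API.

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    label: str              # e.g., a source span plus boundary n-gram context


@dataclass
class Hyperedge:
    head: Node               # node produced by applying a rule
    tails: list              # child nodes consumed by the rule
    features: dict            # hyperedge-local features


@dataclass
class Hypergraph:
    root: Node
    edges: list = field(default_factory=list)


def conjoin(components: list) -> Hypergraph:
    """Include every component forest unchanged, then add one unary edge from
    each component root R_i to a new global root R.  Component derivations are
    never intermingled: every derivation of D lies entirely within one D_i."""
    global_root = Node(label="<ROOT>")
    combined = Hypergraph(root=global_root)
    for i, forest in enumerate(components):
        combined.edges.extend(forest.edges)      # reuse the search space as-is
        combined.edges.append(
            Hyperedge(
                head=global_root,
                tails=[forest.root],
                # the system indicator feature alpha_i fires on this bridge edge
                features={f"alpha_{i}": 1.0},
            )
        )
    return combined
```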
2.3 Features for the Combination Model

The third step defines a new combination model over all of the derivations in the search space D, and then annotates D with features that allow for efficient model inference. We use a linear model over four types of feature functions of a derivation:

1. Combination feature functions on n-grams, v_i^n(d) = \sum_{g \in \mathrm{Ngrams}(d)} v_i^n(g), score a derivation according to the n-grams it contains.

2. Model score feature function b gives the model score \theta_i \cdot \phi_i(d) of a derivation d under the system i that d is from.

3. A length feature \ell computes the word length of the target-side yield of a derivation.

4. A system indicator feature \alpha_i is 1 if the derivation came from system i, and 0 otherwise.

All of these features are local to rule applications (hyperedges) in D. The combination features provide information sharing across the derivations of different systems, but are functions of n-grams, and so can be scored.
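Although these features decompose over hyperedges in practice, the linear form of the combination model is easiest to see on a complete derivation. The sketch below scores one derivation using the four feature types listed above; all names (combination_score, the weight keys, the ngram_features layout) are illustrative assumptions rather than the paper's implementation, and the n-gram values v_i^n(g) are assumed to have been precomputed from each component forest as in Section 2.1.

```python
# Hypothetical sketch of the linear combination model of Section 2.3,
# evaluated on a complete derivation rather than edge by edge.


def ngrams(tokens, n):
    """All n-grams of order n in a token sequence."""
    return [tuple(tokens[j:j + n]) for j in range(len(tokens) - n + 1)]


def combination_score(derivation, weights, ngram_features, max_order=4):
    """derivation: dict with 'tokens' (target-side yield), 'model_score'
    (theta_i . phi_i(d) under its own system), and 'system' (index i).
    ngram_features: ngram_features[i][n][g] = v_i^n(g), an expected count or
    posterior probability computed from system i's forest (Section 2.1).
    weights: the consensus model's weight vector, keyed by feature name."""
    score = 0.0
    tokens = derivation["tokens"]

    # 1. Combination features: v_i^n(d) = sum of v_i^n(g) over n-grams g in d
    for i, per_order in enumerate(ngram_features):
        for n in range(1, max_order + 1):
            v = sum(per_order.get(n, {}).get(g, 0.0) for g in ngrams(tokens, n))
            score += weights.get(f"v_{i}_{n}", 0.0) * v

    # 2. Model score feature b: the originating system's own score of d
    score += weights.get("b", 0.0) * derivation["model_score"]

    # 3. Length feature: number of target-side words
    score += weights.get("length", 0.0) * len(tokens)

    # 4. System indicator feature alpha_i for the originating system
    score += weights.get(f"alpha_{derivation['system']}", 0.0) * 1.0

    return score
```

Because every term above is a sum over n-grams or a single per-derivation quantity, the same score can be accumulated hyperedge by hyperedge once the n-grams are local to hyperedges, which is what makes inference over the conjoined forest efficient.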