
Model Combination for Machine Translation

John DeNero (UC Berkeley) and Shankar Kumar, Ciprian Chelba, and Franz Och (Google, Inc.)
[email protected] {shankarkumar,ciprianchelba,och}@google.com

Abstract

Machine translation benefits from two types of decoding techniques: consensus decoding over multiple hypotheses under a single model and system combination over hypotheses from different models. We present model combination, a method that integrates consensus decoding and system combination into a unified, forest-based technique. Our approach makes few assumptions about the underlying component models, enabling us to combine systems with heterogenous structure. Unlike most system combination techniques, we reuse the search space of component models, which entirely avoids the need to align translation hypotheses. Despite its relative simplicity, model combination improves translation quality over a pipelined approach of first applying consensus decoding to individual systems, and then applying system combination to their output. We demonstrate BLEU improvements across data sets and language pairs in large-scale experiments.

1 Introduction

Once statistical translation models are trained, a decoding approach determines what translations are finally selected. Two parallel lines of research have shown consistent improvements over the standard max-derivation decoding objective, which selects the highest scoring derivation. Consensus decoding procedures select translations for a single system by optimizing for model predictions about n-grams, motivated either as minimizing Bayes risk (Kumar and Byrne, 2004), maximizing sentence similarity (DeNero et al., 2009), or approximating a max-translation objective (Li et al., 2009b). System combination procedures, on the other hand, generate translations from the output of multiple component systems (Frederking and Nirenburg, 1994). In this paper, we present model combination, a technique that unifies these two approaches by learning a consensus model over the n-gram features of multiple underlying component models.

Model combination operates over the component models' posterior distributions over translation derivations, encoded as a forest of derivations.[1] We combine these components by constructing a linear consensus model that includes features from each component. We then optimize this consensus model over the space of all translation derivations in the support of all component models' posterior distributions. By reusing the components' search spaces, we entirely avoid the hypothesis alignment problem that is central to standard system combination approaches (Rosti et al., 2007).

[1] In this paper, we use the terms translation forest and hypergraph interchangeably.

Forest-based consensus decoding techniques differ in whether they capture model predictions through n-gram posteriors (Tromble et al., 2008; Kumar et al., 2009) or expected n-gram counts (DeNero et al., 2009; Li et al., 2009b). We evaluate both in controlled experiments, demonstrating their empirical similarity. We also describe algorithms for expanding translation forests to ensure that n-grams are local to a forest's hyperedges, and for exactly computing n-gram posteriors efficiently.

Model combination assumes only that each translation model can produce expectations of n-gram features; the latent derivation structures of component systems can differ arbitrarily. This flexibility allows us to combine phrase-based, hierarchical, and syntax-augmented translation models. We evaluate by combining three large-scale systems on Chinese-English and Arabic-English NIST data sets, demonstrating improvements of up to 1.4 BLEU over the best single system max-derivation baseline, and consistent improvements over a more complex multi-system pipeline that includes independent consensus decoding and system combination.


2 Model Combination

Model combination is a model-based approach to selecting translations using information from multiple component systems. Each system provides its posterior distribution over derivations P_i(d|f), encoded as a weighted translation forest (i.e., translation hypergraph) in which hyperedges correspond to translation rule applications r.[2] The conditional distribution over derivations takes the form:

P_i(d|f) = exp( Σ_{r∈d} θ_i · φ_i(r) ) / Σ_{d′∈D(f)} exp( Σ_{r∈d′} θ_i · φ_i(r) )

where D(f) is the set of synchronous derivations encoded in the forest, r iterates over rule applications in d, and θ_i is the parameter vector for system i. The feature vector φ_i is system specific and includes both translation model and language model features. Figure 1 depicts an example forest.

Figure 1: An example translation forest encoding two synchronous derivations for a Spanish sentence: one solid and one dotted. Nodes are annotated with their left and right unigram contexts, and hyperedges are annotated with scores θ · φ(r) and the n-grams they introduce.

[2] Phrase-based systems produce phrase lattices, which are instances of forests with arity 1.

Model combination includes four steps, described below. The entire sequence is illustrated in Figure 2.

2.1 Computing Combination Features

The first step in model combination is to compute n-gram expectations from component system posteriors: the same quantities found in MBR, consensus, and variational decoding techniques. For an n-gram g and system i, the expectation

v_i^n(g) = E_{P_i(d|f)}[ h(d, g) ]

can be either an n-gram expected count, if h(d, g) is the count of g in d, or the posterior probability that d contains g, if h(d, g) is an indicator function. Section 3 describes how to compute these features efficiently.
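As a concrete illustration of these two choices of h(d, g) (not from the paper; the derivations and probabilities below are invented for the example), the following Python sketch computes both quantities by brute-force enumeration over a toy posterior distribution.

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy posterior P_i(d|f): each derivation is represented only by its
# target-side yield; probabilities are illustrative and sum to 1.
posterior = {
    ("I", "saw", "the", "man", "with", "the", "telescope"): 0.6,
    ("I", "saw", "with", "the", "telescope", "the", "man"): 0.4,
}

def expected_count(posterior, g):
    """E[c(d, g)]: h(d, g) is the count of g in d."""
    n = len(g)
    return sum(p * Counter(ngrams(d, n))[g] for d, p in posterior.items())

def posterior_probability(posterior, g):
    """E[delta(d, g)]: h(d, g) indicates whether d contains g at least once."""
    n = len(g)
    return sum(p for d, p in posterior.items() if g in ngrams(d, n))

print(expected_count(posterior, ("the",)))             # 2.0: "the" occurs twice in both yields
print(posterior_probability(posterior, ("the",)))      # 1.0: every derivation contains "the"
print(expected_count(posterior, ("man", "with")))      # 0.6
print(posterior_probability(posterior, ("man", "with")))  # 0.6
```

In practice these expectations are computed from forests rather than from explicit lists of translations; Section 3 covers the forest-based computation.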

2.2 Constructing a Search Space

The second step in model combination constructs a hypothesis space of translation derivations, which includes all derivations present in the forests contributed by each component system. This search space D is also a translation forest, and consists of the conjoined union of the component forests. Let R_i be the root node of component hypergraph D_i. For all i, we include all of D_i in D, along with an edge from R_i to R, the root of D. D may contain derivations from different types of translation systems. However, D only contains derivations (and therefore translations) that appeared in the hypothesis space of some component system. We do not intermingle the component search spaces in any way.

2.3 Features for the Combination Model

The third step defines a new combination model over all of the derivations in the search space D, and then annotates D with features that allow for efficient model inference. We use a linear model over four types of feature functions of a derivation:

1. Combination feature functions on n-grams, v_i^n(d) = Σ_{g∈Ngrams(d)} v_i^n(g), score a derivation according to the n-grams it contains.

2. The model score feature function b gives the model score θ_i · φ_i(d) of a derivation d under the system i that d is from.

3. A length feature ℓ computes the length of the target-side yield of a derivation.

4. A system indicator feature α_i is 1 if the derivation came from system i, and 0 otherwise.


All of these features are local to rule applications (hyperedges) in D. The combination features provide information sharing across the derivations of different systems, but are functions of n-grams, and so can be scored on any translation forest. Model score features are already local to rule applications. The length feature is scored in the standard way. System indicator features are scored only on the hyperedges from R_i to R that link each component forest to the common root.

Scoring the joint search space D with these features involves annotating each rule application r (i.e., hyperedge) with the value of each feature.
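The following Python sketch (not from the paper; the data structures, names, and tiny example forests are all invented for illustration) shows the conjoined union of Section 2.2 together with the annotation just described: component forests are attached to a fresh shared root, and only the linking hyperedges carry the system indicator features.

```python
from dataclasses import dataclass, field

@dataclass
class Hyperedge:
    head: str                                       # node this rule application roots
    tails: list                                     # leaf nodes of the rule application
    features: dict = field(default_factory=dict)    # local feature values

@dataclass
class Forest:
    root: str
    nodes: set
    edges: list

def conjoin(forests):
    """Union the component forests under a fresh root node R.

    Component forests are kept disjoint (their derivations are never
    intermingled); each linking edge R_i -> R carries the system
    indicator feature for its component.
    """
    conjoined = Forest(root="R", nodes={"R"}, edges=[])
    for name, f in forests.items():
        conjoined.nodes |= {f"{name}:{n}" for n in f.nodes}
        for e in f.edges:
            conjoined.edges.append(
                Hyperedge(head=f"{name}:{e.head}",
                          tails=[f"{name}:{t}" for t in e.tails],
                          features=dict(e.features)))
        # Linking hyperedge from the component root to the common root.
        conjoined.edges.append(
            Hyperedge(head="R", tails=[f"{name}:{f.root}"],
                      features={f"alpha_{name}": 1.0}))
    return conjoined

# Two stand-in component forests; "b" stands in for a base model score feature.
pb = Forest(root="R_pb", nodes={"R_pb", "pb1"},
            edges=[Hyperedge("R_pb", ["pb1"], {"b": -1.2})])
hiero = Forest(root="R_h", nodes={"R_h", "h1"},
               edges=[Hyperedge("R_h", ["h1"], {"b": -0.8})])
D = conjoin({"pb": pb, "h": hiero})
print([e.features for e in D.edges if e.head == "R"])
# [{'alpha_pb': 1.0}, {'alpha_h': 1.0}]
```

The real components contribute full weighted forests with their own internal hyperedges, model score, length, and n-gram features; only the linking edges carry the system indicators, matching the scoring described above.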

2.4 Model Training and Inference

We have defined the following combination model s_w(d) with weights w over derivations d from I different component models:

s_w(d) = Σ_{i=1}^{I} ( Σ_{n=1}^{4} w_i^n · v_i^n(d) + w_i^α · α_i(d) ) + w^b · b(d) + w^ℓ · ℓ(d)
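As a purely illustrative calculation (the weights and feature values are invented, not taken from the paper, and only bigram terms are shown), suppose a derivation d from the phrase-based forest has v_pb^2(d) = 0.7, v_h^2(d) = 1.0, α_pb(d) = 1, α_h(d) = 0, b(d) = −2.1, and ℓ(d) = 7, with weights w_pb^2 = w_h^2 = 0.5, w_pb^α = 0.2, w_h^α = 0.1, w^b = 1.0, and w^ℓ = 0.05. Exactly one system indicator fires, and the combination score is

s_w(d) = 0.5 · 0.7 + 0.5 · 1.0 + 0.2 · 1 + 0.1 · 0 + 1.0 · (−2.1) + 0.05 · 7 = −0.70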

Because we have assessed all of these features on local rule applications, we can find the highest scoring derivation d* = argmax_{d∈D} s_w(d) using standard max-sum (Viterbi) inference over D.

We learn the weights of this consensus model using hypergraph-based minimum-error-rate training (Kumar et al., 2009). This procedure maximizes the translation quality of d* on a held-out set, according to a corpus-level evaluation metric B(·; e) that compares to a reference set e. We used BLEU, choosing w to maximize the BLEU score of the set of translations predicted by the combination model.

Figure 2: Model combination applied to a phrase-based (pb) and a hierarchical model (h) includes four steps. (1) shows an excerpt of the feature function for each component, (2) depicts the result of conjoining a phrase lattice with a hierarchical forest, (3) shows example hyperedge features of the combination model, including bigram features v_i^n and system indicators α_i, and (4) gives training and decoding objectives.

3 Computing Combination Features

The combination features v_i^n(d) score derivations from each model with the n-gram predictions of the others. These predictions sum over all derivations under a single component model to compute a posterior belief about each n-gram. In this paper, we compare two kinds of combination features, posterior probabilities and expected counts.[3]

[3] The model combination framework could incorporate arbitrary features on the common output space of the models, but we focus on features that have previously proven useful for consensus decoding.

Posterior probabilities represent a model's belief that the translation will contain a particular n-gram at least once. They can be expressed as E_{P(d|f)}[δ(d, g)] for an indicator function δ(d, g) that is 1 if n-gram g appears in derivation d. These quantities arise in approximating BLEU for lattice-based and hypergraph-based minimum Bayes risk decoding (Tromble et al., 2008; Kumar et al., 2009).

Expected n-gram counts E_{P(d|f)}[c(d, g)] represent the model's belief of how many times an n-gram g will appear in the translation. These quantities appear in forest-based consensus decoding (DeNero et al., 2009) and variational decoding (Li et al., 2009b).

Methods for computing both of these quantities appear in the literature. However, we address two outstanding issues below. In Section 5, we also compare the two quantities experimentally.

3.1 Computing N-gram Posteriors Exactly

Kumar et al. (2009) describes an efficient approximate algorithm for computing n-gram posterior probabilities. Algorithm 1 is an exact algorithm that computes all n-gram posteriors from a forest in a single inside pass. The algorithm tracks two quantities at each node n: regular inside scores β(n) and n-gram inside scores β̂(n, g) that sum the scores of all derivations rooted at n that contain n-gram g.

For each hyperedge, we compute b̄(g), the sum of scores for derivations that do not contain g (Lines 8-11). We then use that quantity to compute the score of derivations that do contain g (Line 17).

Algorithm 1 Computing n-gram posteriors
1: for n ∈ N in topological order do
2:   β(n) ← 0
3:   β̂(n, g) ← 0, ∀g ∈ Ngrams(n)
4:   for r ∈ Rules(n) do
5:     w ← exp[θ · φ(r)]
6:     b ← w
7:     b̄(g) ← w, ∀g ∈ Ngrams(n)
8:     for ℓ ∈ Leaves(r) do
9:       b ← b × β(ℓ)
10:      for g ∈ Ngrams(n) do
11:        b̄(g) ← b̄(g) × (β(ℓ) − β̂(ℓ, g))
12:    β(n) ← β(n) + b
13:    for g ∈ Ngrams(n) do
14:      if g ∈ Ngrams(r) then
15:        β̂(n, g) ← β̂(n, g) + b
16:      else
17:        β̂(n, g) ← β̂(n, g) + b − b̄(g)
18: for g ∈ Ngrams(root) (all g in the HG) do
19:   P(g|f) ← β̂(root, g) / β(root)

This algorithm can in principle compute the posterior probability of any indicator function on local features of a derivation. More generally, this algorithm demonstrates how vector-backed inside passes can compute quantities beyond expectations of local features (Li and Eisner, 2009).[4] Chelba and Mahajan (2009) developed a similar algorithm for lattices.

[4] Indicator functions on derivations are not locally additive over the rules of a derivation, even if the features they indicate are local. Therefore, Algorithm 1 is not an instance of an expectation semiring computation.

3.2 Ensuring N-gram Locality

DeNero et al. (2009) describes an efficient algorithm for computing n-gram expected counts from a translation forest. This method assumes n-gram locality of the forest, the property that any n-gram introduced by a hyperedge appears in all derivations that include the hyperedge. However, decoders may recombine forest nodes whenever the language model does not distinguish between n-grams due to back-off (Li and Khudanpur, 2008). In this case, a forest encoding of a posterior distribution may not exhibit n-gram locality in all regions of the search space. Figure 3 shows a hypergraph which contains non-local n-grams, along with its local expansion.

Algorithm 2 expands a forest to ensure n-gram locality while preserving the encoded distribution over derivations. Let a forest (N, R) consist of nodes N and hyperedges R, which correspond to rule applications. Let Rules(n) be the subset of R rooted by n, and Leaves(r) be the leaf nodes of rule application r. The expanded forest (N_e, R_e) is constructed by a function Reapply(r, L) that applies the rule of r to a new set of leaves L ⊂ N_e, forming a pair (r′, n′) consisting of a new rule application r′ rooted by n′. P is a map from nodes in N to subsets of N_e which tracks how N projects to N_e. Two nodes in N_e are identical if they have the same (n−1)-gram left and right contexts and are projections of the same node in N. The symbol ⊗ denotes a set cross-product.

Algorithm 2 Expanding for n-gram locality
1: N_e ← {}; R_e ← {}
2: for n ∈ N in topological order do
3:   P(n) ← {}
4:   for r ∈ Rules(n) do
5:     for L ∈ ⊗_{ℓ∈Leaves(r)} P(ℓ) do
6:       r′, n′ ← Reapply(r, L)
7:       P(n) ← P(n) ∪ {n′}
8:       N_e ← N_e ∪ {n′}
9:       R_e ← R_e ∪ {r′}

This transformation preserves the original distribution over derivations by splitting states, while maintaining continuations from those split states by duplicating rule applications. The process is analogous to expanding bigram lattices to encode a trigram history at each lattice node (Weng et al., 1998).
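To make the bookkeeping of Algorithm 1 (Section 3.1) concrete, here is a minimal Python sketch; it is not the authors' code, and the forest encoding, names, and toy example are assumptions for illustration. It follows the pseudocode: β(n) accumulates inside scores, β̂(n, g) accumulates scores of derivations containing g, and b̄(g) tracks, per rule application, the mass of derivations that avoid g. It assumes the n-gram locality property that Algorithm 2 guarantees.

```python
def rule(weight, leaves, ngrams):
    """One rule application: weight exp(theta . phi(r)), tail nodes, Ngrams(r)."""
    return {"w": weight, "leaves": leaves, "ngrams": set(ngrams)}

def ngram_posteriors(topo_order, rules):
    """Exact n-gram posteriors from a forest in a single inside pass.

    `topo_order` lists node ids bottom-up (every tail before its head);
    `rules[n]` gives the rule applications rooted at node n.
    """
    beta = {}        # inside scores beta(n)
    beta_hat = {}    # n-gram inside scores beta_hat(n, g)
    ngrams_of = {}   # Ngrams(n): n-grams appearing in derivations rooted at n
    for n in topo_order:
        ngrams_of[n] = set()
        for r in rules[n]:
            ngrams_of[n] |= r["ngrams"]
            for leaf in r["leaves"]:
                ngrams_of[n] |= ngrams_of[leaf]
        beta[n] = 0.0
        beta_hat[n] = {g: 0.0 for g in ngrams_of[n]}
        for r in rules[n]:
            b = r["w"]
            # b_bar(g): mass of derivations through r that do not contain g
            b_bar = {g: r["w"] for g in ngrams_of[n]}
            for leaf in r["leaves"]:
                b *= beta[leaf]
                for g in ngrams_of[n]:
                    b_bar[g] *= beta[leaf] - beta_hat[leaf].get(g, 0.0)
            beta[n] += b
            for g in ngrams_of[n]:
                if g in r["ngrams"]:
                    beta_hat[n][g] += b
                else:
                    beta_hat[n][g] += b - b_bar[g]
    root = topo_order[-1]
    return {g: beta_hat[root][g] / beta[root] for g in ngrams_of[root]}

# Toy forest: node A rewrites to "x" (weight 1.0) or "y" (weight 3.0);
# the root S applies one rule over A that introduces "z" (weight 2.0).
forest = {
    "A": [rule(1.0, [], {"x"}), rule(3.0, [], {"y"})],
    "S": [rule(2.0, ["A"], {"z"})],
}
print(ngram_posteriors(["A", "S"], forest))
# expected: {'x': 0.25, 'y': 0.75, 'z': 1.0} (key order may differ)
```

On this toy forest the two derivations have scores 2.0 and 6.0, so the exact posteriors are 2/8, 6/8, and 8/8, which the single inside pass reproduces.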

Figure 3: Hypergraph expansion ensures n-gram locality without affecting the distribution over derivations. In the left example, trigrams "green witch was" and "blue witch was" are non-local due to language model back-off. On the right, states are split to enforce locality.

4 Relationship to Prior Work

Model combination is a multi-system generalization of consensus or minimum Bayes risk decoding. When only one component system is included, model combination is identical to minimum Bayes risk decoding over hypergraphs, as described in Kumar et al. (2009).[5]

[5] We do not refer to model combination as a minimum Bayes risk decoding procedure despite this similarity because risk implies a belief distribution over outputs, and we now have multiple output distributions that are not necessarily calibrated. Moreover, our generalized, multi-model objective (Section 2.4) is motivated by BLEU, but not a direct approximation to it.

4.1 System Combination

System combination techniques in machine translation take as input the outputs {e_1, ..., e_k} of k translation systems, where e_i is a structured translation object (or k-best lists thereof), typically viewed as a sequence of words. The dominant approach in the field chooses a primary translation e_p as a backbone, then finds an alignment a_i to the backbone for each e_i. A new search space is constructed from these backbone-aligned outputs, and then a voting procedure or feature-based model predicts a final consensus translation (Rosti et al., 2007). Model combination entirely avoids this alignment problem by viewing hypotheses as n-gram occurrence vectors rather than word sequences.

Model combination also requires less total computation than applying system combination to consensus-decoded outputs. The best consensus decoding methods for individual systems already require the computation-intensive steps of model combination: producing lattices or forests, computing n-gram feature expectations, and re-decoding to maximize a secondary consensus objective. Hence, to maximize the performance of system combination, these steps must be performed for each system, whereas model combination requires only one forest rescoring pass over all systems.

Model combination also leverages aggregate statistics from the components' posteriors, whereas system combiners typically do not. Zhao and He (2009) showed that n-gram posterior features are useful in the context of a system combination model, even when computed from k-best lists.

Despite these advantages, system combination may be more appropriate in some settings. In particular, model combination is designed primarily for statistical systems that generate hypergraph outputs. Model combination can in principle integrate a non-statistical system that generates either a single hypothesis or an unweighted forest.[6] Likewise, the procedure could be applied to statistical systems that only generate k-best lists. However, we would not expect the same strong performance from model combination in these constrained settings.

[6] A single hypothesis can be represented as a forest, while an unweighted forest could be assigned a uniform distribution.

4.2 Joint Decoding and Collaborative Decoding

Liu et al. (2009) describes two techniques for combining multiple synchronous grammars, which the authors characterize as joint decoding. Joint decoding does not involve a consensus or minimum-Bayes-risk decoding objective; indeed, their best results come from standard max-derivation decoding (with a multi-system grammar). More importantly, their computations rely on a correspondence between nodes in the hypergraph outputs of different systems, and so they can only joint decode over models with similar search strategies. We combine a phrase-based model that uses left-to-right decoding with two hierarchical systems that use bottom-up decoding, a scenario to which joint decoding is not applicable. Though Liu et al. (2009) rightly point out that most models can be decoded either left-to-right or bottom-up, such changes can have substantial implications for search efficiency and search error. We prefer to maintain the flexibility of using different search strategies in each component system.

Li et al. (2009a) is another related technique for combining translation systems by leveraging model predictions of n-gram features. K-best lists of partial translations are iteratively reranked using n-gram features from the predictions of other models (which are also iteratively updated). Our technique differs in that we use no k-best approximations, have fewer parameters to learn (one consensus weight vector rather than one for each collaborating decoder), and produce only one output, avoiding an additional system combination step at the end.

5 Experiments

We report results on the constrained data track of the NIST 2008 Arabic-to-English (ar-en) and Chinese-to-English (zh-en) translation tasks.[7] We train on all parallel and monolingual data allowed in the track. We use the NIST 2004 eval set (dev) for optimizing parameters in model combination and test on the NIST 2008 evaluation set. We report results using the IBM implementation of the BLEU score, which computes the brevity penalty using the closest reference translation for each segment (Papineni et al., 2002). We measure statistical significance using 95% confidence intervals computed using paired bootstrap resampling. In all table cells (except for Table 3), systems without statistically significant differences are marked with the same superscript.

[7] http://www.nist.gov/speech/tests/mt

5.1 Base Systems

We combine outputs from three systems. Our phrase-based system is similar to the alignment template system described by Och and Ney (2004). Translation is performed using a standard left-to-right beam-search decoder. Our hierarchical systems consist of a syntax-augmented system (SAMT) that includes target-language syntactic categories (Zollmann and Venugopal, 2006) and a Hiero-style system with a single non-terminal (Chiang, 2007). Each base system yields state-of-the-art translation performance, summarized in Table 1.

For each system, we report the performance of max-derivation decoding (MAX), hypergraph-based MBR (Kumar et al., 2009), and a linear version of forest-based consensus decoding (CON) (DeNero et al., 2009). MBR and CON differ only in that the first uses n-gram posteriors, while the second uses expected n-gram counts. The two consensus decoding approaches yield comparable performance. Hence, we report performance for hypergraph-based MBR in our comparison to model combination below.

Table 1: Performance of baseline systems (BLEU %).
Sys    Base   ar-en dev   ar-en nist08   zh-en dev   zh-en nist08
PB     MAX    51.6        43.9           37.7        25.4
PB     MBR    52.4*       44.6*          38.6*       27.3*
PB     CON    52.4*       44.6*          38.7*       27.2*
Hiero  MAX    50.9        43.3           40.0        27.2
Hiero  MBR    51.4*       43.8*          40.6*       27.8
Hiero  CON    51.5*       43.8*          40.5*       28.2
SAMT   MAX    51.7        43.8           40.8*       28.4
SAMT   MBR    52.7*       44.5*          41.1*       28.8*
SAMT   CON    52.6*       44.4*          41.1*       28.7*

5.2 Experimental Results

Table 2 compares model combination (MC) to the best MAX and MBR systems. Model combination uses a conjoined search space wherein each hyperedge is annotated with 21 features: 12 n-gram posterior features v_i^n computed from the PB/Hiero/SAMT forests for n ≤ 4; 4 n-gram posterior features v^n computed from the conjoined forest; 1 length feature ℓ; 1 feature b for the score assigned by the base model; and 3 system indicator (SI) features α_i that select which base system a derivation came from. We refer to this model combination approach as MC Conjoin/SI.

Table 2: Performance, in BLEU (%), from the best single system for each language pair without consensus decoding (Best MAX system), the best system with minimum Bayes risk decoding (Best MBR system), and model combination across three systems.
Approach          ar-en dev   ar-en nist08   zh-en dev   zh-en nist08
Best MAX system   51.7        43.9           40.8        28.4
Best MBR system   52.7        44.5           41.1        28.8*
MC Conjoin/SI     53.5        45.3           41.6        29.0*

Model combination improves over the single best MAX system by 1.4 BLEU in ar-en and 0.6 BLEU in zh-en, and always improves over MBR. This improvement could arise due to multiple reasons: a bigger search space, the consensus features from constituent systems, or the system indicator features. Table 3 teases apart these contributions.

Table 3: Model Combination experiments (BLEU %).
Strategy                 ar-en dev   ar-en nist08   zh-en dev   zh-en nist08
Best MBR system          52.7        44.5           41.1        28.8
MBR Conjoin              52.3        44.5           40.5        28.3
MBR Conjoin/feats-best   52.7        44.9           41.2        28.8
MBR Conjoin/SI           53.1        44.9           41.2        28.9
MC 1-best HG             52.7        44.6           41.1        28.7
MC Conjoin               52.9        44.6           40.3        28.1
MC Conjoin/base/SI       53.5        45.1           41.2        28.9
MC Conjoin/SI            53.5        45.3           41.6        29.0

We first perform MBR on the conjoined hypergraph (MBR Conjoin). In this case, each edge is tagged with 4 conjoined n-gram features v^n, along with length and base model features. MBR Conjoin is worse than MBR on the hypergraph from the single best system. This could imply that either the larger search space introduces poor hypotheses or that the n-gram posteriors obtained are weaker. When we now restrict the n-gram features to those from the best system (MBR Conjoin/feats-best), BLEU scores increase relative to MBR Conjoin. This implies that the n-gram features computed over the conjoined hypergraph are weaker than the corresponding features from the best system.

Adding system indicator features (MBR Conjoin/SI) helps the MBR Conjoin system considerably; the resulting system is better than the best MBR system. This could mean that the SI features guide search towards stronger parts of the larger search space. In addition, these features provide a normalization of scores across systems.

We next do several model-combination experiments. We perform model combination using the search space of only the best MBR system (MC 1-best HG). Here, the hypergraph is annotated with n-gram features from the 3 base systems, as well as length and base model features. A total of 3 × 4 + 1 + 1 = 14 features are added to each edge. Surprisingly, n-gram features from the additional systems did not help select a better hypothesis within the search space of a single system.

When we expand the search space to the conjoined hypergraph (MC Conjoin), it performs worse relative to MC 1-best. Since these two systems are identical in their feature set, we hypothesize that the larger search space has introduced erroneous hypotheses. This is similar to the scenario where MBR Conjoin is worse than MBR 1-best. As in the MBR case, adding system indicator features helps (MC Conjoin/base/SI). The result is comparable to MBR on the conjoined hypergraph with SI features.

We finally add extra n-gram features which are computed from the conjoined hypergraph (MC Conjoin/SI). This gives the best performance, although the gains over MC Conjoin/base/SI are quite small. Note that these added features are the same n-gram features used in MBR Conjoin. Although they are not strong by themselves, they provide additional discriminative power by providing a consensus score across all 3 base systems.

5.3 Comparison to System Combination

Table 4 compares model combination to two system combination algorithms. The first, which we call sentence-level combination, chooses among the base systems' three translations the sentence that has the highest consensus score. The second, word-level combination, builds a "word sausage" from the outputs of the three systems and chooses a path through the sausage with the highest score under a similar model (Macherey and Och, 2007).

Table 4: BLEU performance for different system and model combination approaches. Sentence-level and word-level system combination operate over the sentence output of the base systems, which are either decoded to maximize derivation score (MAX) or to minimize Bayes risk (MBR).
Approach        Base   ar-en dev   ar-en nist08   zh-en dev   zh-en nist08
Sent-level      MAX    51.8*       44.4*          40.8*       28.2*
Word-level      MAX    52.0*       44.4*          40.8*       28.1*
Sent-level      MBR    52.7+       44.6*          41.2        28.8+
Word-level      MBR    52.5+       44.7*          40.9        28.8+
MC-conjoin-SI   -      53.5        45.3           41.6        29.0+

Neither system combination technique provides much benefit, presumably because the underlying systems all share the same data, pre-processing, language model, alignments, and code base.

Comparing system combination when no consensus (i.e., minimum Bayes risk) decoding is utilized at all, we find that model combination improves upon the result by up to 1.1 BLEU points. Model combination also performs slightly better relative to system combination over MBR-decoded systems. In the latter case, system combination actually requires more computation compared to model combination; consensus decoding is performed for each system rather than only once for model combination. This experiment validates our approach. Model combination outperforms system combination while avoiding the challenge of aligning translation hypotheses.

5.4 Algorithmic Improvements

Section 3 describes two improvements to computing n-gram posteriors: hypergraph expansion for n-gram locality and exact posterior computation. Table 5 shows MBR decoding with and without expansion (Algorithm 2) in a decoder that collapses nodes due to language model back-off. These results show that while expansion is necessary for correctness, it does not affect performance.

Table 5: MBR decoding on the syntax-augmented system, with and without hypergraph expansion (BLEU %).
Approach      ar-en dev   ar-en nist08   zh-en dev   zh-en nist08
HG-expand     52.7*       44.5*          41.1*       28.8*
HG-noexpand   52.7*       44.5*          41.1*       28.8*

Table 6 compares exact n-gram posterior computation (Algorithm 1) to the approximation described by Kumar et al. (2009). Both methods yield identical results. Again, while the exact method guarantees correctness of the computation, the approximation suffices in practice.

Table 6: MBR decoding on the phrase-based system with either exact or approximate posteriors (BLEU %).
Posteriors    ar-en dev   ar-en nist08   zh-en dev   zh-en nist08
Exact         52.4*       44.6*          38.6*       27.3*
Approximate   52.5*       44.6*          38.6*       27.2*

6 Conclusion

Model combination is a consensus decoding strategy over a collection of forests produced by multiple machine translation systems. These systems can have varied decoding strategies; we only require that each system produce a forest (or a lattice) of translations. This flexibility allows the technique to be applied quite broadly. For instance, de Gispert et al. (2009) describe combining systems based on multiple source representations using minimum Bayes risk decoding; likewise, they could be combined via model combination.

Model combination has two significant advantages over current approaches to system combination. First, it does not rely on hypothesis alignment between outputs of individual systems. Aligning translation hypotheses accurately can be challenging, and has a substantial effect on combination performance (He et al., 2008). Instead of aligning hypotheses, we compute expectations of local features of n-grams. This is analogous to how the BLEU score is computed, which also views sentences as vectors of n-gram counts (Papineni et al., 2002). Second, we do not need to pick a backbone system for combination. Choosing a backbone system can also be challenging, and also affects system combination performance (He and Toutanova, 2009). Model combination sidesteps this issue by working with the conjoined forest produced by the union of the component forests, and allows the consensus model to express system preferences via weights on system indicator features.

Despite its simplicity, model combination provides strong performance by leveraging existing consensus, search, and training techniques. The technique outperforms MBR and consensus decoding on each of the component systems. In addition, it performs better than standard sentence-based or word-based system combination techniques applied to either max-derivation or MBR outputs of the individual systems. In sum, it is a natural and effective model-based approach to multi-system decoding.

References

Ciprian Chelba and M. Mahajan. 2009. A dynamic programming algorithm for computing the posterior probability of n-gram occurrences in automatic speech recognition lattices. Personal communication.

David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics.

A. de Gispert, S. Virpioja, M. Kurimo, and W. Byrne. 2009. Minimum Bayes risk combination of translation hypotheses from alternative morphological decompositions. In Proceedings of the North American Chapter of the Association for Computational Linguistics.

John DeNero, David Chiang, and Kevin Knight. 2009. Fast consensus decoding over translation forests. In Proceedings of the Association for Computational Linguistics and IJCNLP.

Robert Frederking and Sergei Nirenburg. 1994. Three heads are better than one. In Proceedings of the Conference on Applied Natural Language Processing.

Xiaodong He and Kristina Toutanova. 2009. Joint optimization for machine translation system combination. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Xiaodong He, Mei Yang, Jianfeng Gao, Patrick Nguyen, and Robert Moore. 2008. Indirect-HMM-based hypothesis alignment for combining outputs from machine translation systems. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Shankar Kumar and William Byrne. 2004. Minimum Bayes-risk decoding for statistical machine translation. In Proceedings of the North American Chapter of the Association for Computational Linguistics.

Shankar Kumar, Wolfgang Macherey, Chris Dyer, and Franz Och. 2009. Efficient minimum error rate training and minimum Bayes-risk decoding for translation hypergraphs and lattices. In Proceedings of the Association for Computational Linguistics and IJCNLP.

Zhifei Li and Jason Eisner. 2009. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Zhifei Li and Sanjeev Khudanpur. 2008. A scalable decoder for parsing-based machine translation with equivalent language model state maintenance. In ACL Workshop on Syntax and Structure in Statistical Translation.

Mu Li, Nan Duan, Dongdong Zhang, Chi-Ho Li, and Ming Zhou. 2009a. Collaborative decoding: Partial hypothesis re-ranking using translation consensus between decoders. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Zhifei Li, Jason Eisner, and Sanjeev Khudanpur. 2009b. Variational decoding for statistical machine translation. In Proceedings of the Association for Computational Linguistics and IJCNLP.

Yang Liu, Haitao Mi, Yang Feng, and Qun Liu. 2009. Joint decoding with multiple translation models. In Proceedings of the Association for Computational Linguistics and IJCNLP.

Wolfgang Macherey and Franz Och. 2007. An empirical study on computing consensus translations from multiple machine translation systems. In EMNLP, Prague, Czech Republic.

Franz J. Och and Hermann Ney. 2004. The Alignment Template Approach to Statistical Machine Translation. Computational Linguistics, 30(4):417-449.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the Association for Computational Linguistics.

Antti-Veikko I. Rosti, Necip Fazil Ayan, Bing Xiang, Spyros Matsoukas, Richard Schwartz, and Bonnie J. Dorr. 2007. Combining outputs from multiple machine translation systems. In Proceedings of the North American Chapter of the Association for Computational Linguistics.

Roy Tromble, Shankar Kumar, Franz J. Och, and Wolfgang Macherey. 2008. Lattice minimum Bayes-risk decoding for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Fuliang Weng, Andreas Stolcke, and Ananth Sankar. 1998. Efficient lattice representation and generation. In Intl. Conf. on Spoken Language Processing.

Yong Zhao and Xiaodong He. 2009. Using n-gram based features for machine translation system combination. In Proceedings of the North American Chapter of the Association for Computational Linguistics.

Andreas Zollmann and Ashish Venugopal. 2006. Syntax augmented machine translation via chart parsing. In Proceedings of the NAACL 2006 Workshop on Statistical Machine Translation.
