
Training and Evaluating Error Minimization Rules for Statistical Machine Translation

Ashish Venugopal, Andreas Zollmann, Alex Waibel
School of Computer Science, Carnegie Mellon University
[email protected], [email protected], [email protected]

Abstract

Decision rules that explicitly account for non-probabilistic evaluation metrics in machine translation typically require special training, often to estimate parameters in exponential models that govern the search space and the selection of candidate translations. While the traditional Maximum A Posteriori (MAP) decision rule can be optimized as a piecewise linear function in a greedy search of the parameter space, the Minimum Bayes Risk (MBR) decision rule is not well suited to this technique, a condition that makes past results difficult to compare. We present a training approach for non-tractable decision rules, allowing us to compare and evaluate these and other decision rules on a large scale translation task, taking advantage of the high dimensional parameter space available to the phrase based decoder. This comparison is timely and important, as decoders evolve to represent more complex search space decisions and are evaluated against innovative evaluation metrics of translation quality.

1 Introduction

State of the art statistical machine translation takes advantage of exponential models to incorporate a large set of potentially overlapping features to select translations from a set of potential candidates. As discussed in (Och, 2003), the direct translation model represents the probability of a target sentence 'English' e = e_1 ... e_I being the translation for a source sentence 'French' f = f_1 ... f_J through an exponential, or log-linear, model

$$p_\lambda(e \mid f) = \frac{\exp\left(\sum_{k=1}^{m} \lambda_k \, h_k(e, f)\right)}{\sum_{e' \in E} \exp\left(\sum_{k=1}^{m} \lambda_k \, h_k(e', f)\right)} \qquad (1)$$

where e is a single candidate translation for f from the set of all English translations E, λ is the parameter vector for the model, and each h_k is a feature function of e and f. In practice, we restrict E to the set Gen(f), a set of highly likely translations discovered by a decoder (Vogel et al., 2003). Selecting a translation from this model under the Maximum A Posteriori (MAP) criterion yields

$$\mathrm{transl}_\lambda(f) = \arg\max_{e} \; p_\lambda(e \mid f) \,. \qquad (2)$$

This decision rule is optimal under the zero-one loss function, minimizing the Sentence Error Rate (Mangu et al., 2000). Using the log-linear form to model p_λ(e|f) gives us the flexibility to introduce overlapping features that can represent global context while decoding (searching the space of candidate translations) and rescoring (ranking a set of candidate translations before performing the arg max operation), albeit at the cost of the traditional source-channel generative model of translation proposed in (Brown et al., 1993).
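As a concrete illustration of equations (1) and (2), the following is a minimal sketch of the log-linear model renormalized over an n-best list. It is not the paper's released code; the (translation, feature-vector) container format and the function names are our own assumptions, and the max-subtraction is a standard numerical-stability guard.

```python
import math

def nbest_probabilities(nbest, weights):
    """Renormalize the log-linear model (1) over the n-best list Gen(f).

    nbest   : list of (translation, features) pairs, features = [h_1 .. h_m]
    weights : model parameters [lambda_1 .. lambda_m]
    """
    scores = [sum(w * h for w, h in zip(weights, feats)) for _, feats in nbest]
    top = max(scores)                    # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [(e, x / z) for (e, _), x in zip(nbest, exps)]

def map_translation(nbest, weights):
    """MAP decision rule (2): the candidate with the highest model probability."""
    return max(nbest_probabilities(nbest, weights), key=lambda ep: ep[1])[0]
```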

A significant impact of this paradigm shift, however, has been the movement to leverage the flexibility of the exponential model to maximize performance with respect to automatic evaluation metrics. Each evaluation metric considers different aspects of translation quality, both at the sentence and corpus level, often achieving high correlation to human evaluation (Doddington, 2002). It is clear that the decision rule stated in (2) does not reflect the choice of evaluation metric, and substantial work has been done to correct this mismatch in criteria. Approaches include integrating the metric into the decision rule, and learning λ to optimize the performance of the decision rule. In this paper we will compare and evaluate several aspects of these techniques, focusing on Minimum Error Rate (MER) training (Och, 2003) and Minimum Bayes Risk (MBR) decision rules, within a novel training environment that isolates the impact of each component of these methods.

2 Addressing Evaluation Metrics

We now describe competing strategies to address the problem of modeling the evaluation metric within the decoding and rescoring process, and introduce our contribution towards training non-tractable error surfaces. The methods discussed below make use of Gen(f), the approximation to the complete candidate translation space E, referred to as an n-best list. Details regarding n-best list generation from decoder output can be found in (Ueffing et al., 2002).

2.1 Minimum Error Rate Training

The predominant approach to reconciling the mismatch between the MAP decision rule and the evaluation metric has been to train the parameters λ of the exponential model to correlate the MAP choice with the maximum score as indicated by the evaluation metric on a development set with known references (Och, 2003). We differentiate between the decision rule

$$\mathrm{transl}_\lambda(f) = \arg\max_{e \in \mathrm{Gen}(f)} p_\lambda(e \mid f) \qquad (3a)$$

and the training criterion

$$\hat{\lambda} = \arg\min_{\lambda} \mathrm{Loss}(\mathrm{transl}_\lambda(\vec{f}), \vec{r}) \qquad (3b)$$

where the Loss function returns an evaluation result quantifying the difference between the English candidate translation transl_λ(f) and its corresponding reference r for a source sentence f. We indicate with the vector notation that this loss function operates on a sequence of sentences. To avoid overfitting, and since MT researchers are generally blessed with an abundance of data, these sentences are from a separate development set.

The optimization problem (3b) is hard since the arg max of (3a) causes the error surface to change in steps in R^m, precluding the use of gradient based optimization methods. Smoothed error counts can be used to approximate the arg max operator, but the resulting function still contains local minima. Grid-based line search approaches like Powell's algorithm could be applied, but we can expect difficulty when choosing the appropriate grid size and starting parameters. In the following, we summarize the optimization algorithm for the unsmoothed error counts presented in (Och, 2003) and the implementation detailed in (Venugopal and Vogel, 2005); a sketch of the per-sentence step follows the list.

• Regard Loss(transl_λ(f⃗), r⃗) as defined in (3b) as a function of the parameter vector λ to optimize, and take the arg max to compute transl_λ(f⃗) over the translations Gen(f) according to the n-best list generated with an initial estimate λ^0.

• The error surface defined by Loss (as a function of λ) is piecewise linear with respect to a single model parameter λ_k, hence we can determine exactly where it would be useful (values that change the result of the arg max) to evaluate λ_k for a given sentence using a simple line intersection method.

• Merge the list of useful evaluation points for λ_k and evaluate the corpus level Loss(transl_λ(f⃗), r⃗) at each one.

• Select the model parameter value that represents the lowest Loss as λ_k varies, set λ_k, and consider the parameter λ_j for another dimension j.

This training algorithm, referred to as minimum error rate (MER) training, is a greedy search in each dimension of λ, made efficient by realizing that within each dimension, we can compute the points at which changes in λ actually have an impact on Loss. The appropriate considerations for termination and initial starting points relevant to any greedy search procedure must be accounted for.
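The line intersection method of the second bullet can be sketched as follows. For a fixed sentence and a single free parameter λ_k, each candidate's model score is a line in λ_k; the arg max of (3a) follows the upper envelope of these lines, and only the envelope's intersection points can change the chosen translation. This is a sketch under our own data layout, not the implementation of (Venugopal and Vogel, 2005).

```python
def argmax_boundaries(lines):
    """Upper envelope sweep for one sentence and one free parameter lambda_k.

    lines: list of (slope, offset, candidate) triples; the candidate's model
           score as a function of lambda_k is slope * lambda_k + offset, where
           slope is its feature value h_k and offset collects the contributions
           of the frozen parameters.
    Returns [(start, candidate), ...]: candidate wins the arg max from 'start'
    up to the next entry's start; the first start is -infinity.
    """
    lines = sorted(lines, key=lambda l: (l[0], l[1]))
    dedup = []                               # equal slopes: keep the best offset
    for l in lines:
        if dedup and dedup[-1][0] == l[0]:
            dedup[-1] = l
        else:
            dedup.append(l)
    hull, starts = [], []
    for slope, offset, cand in dedup:
        start = float('-inf')
        while hull:
            prev_slope, prev_offset, _ = hull[-1]
            x = (prev_offset - offset) / (slope - prev_slope)   # intersection
            if x <= starts[-1]:
                hull.pop(); starts.pop()     # previous line is never optimal
            else:
                start = x
                break
        hull.append((slope, offset, cand))
        starts.append(start)
    return [(s, c) for s, (_, _, c) in zip(starts, hull)]
```

Merging the returned boundaries across all development sentences yields a finite set of intervals on which the corpus-level Loss is constant; evaluating Loss once per interval and keeping a point inside the best interval implements the third and fourth bullets.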

From the nature of the training procedure and the MAP decision rule, we can expect that the parameters selected by MER training will strongly favor a few translations in the n-best list, namely for each source sentence the one resulting in the best score, moving most of the probability mass towards the translation that it believes should be selected. This is due to the decision rule, rather than the training procedure, as we will see when we consider alternative decision rules.

2.2 The Minimum Bayes Risk Decision Rule

The Minimum Bayes Risk decision rule, as proposed by (Mangu et al., 2000) for the Word Error Rate metric in speech recognition, and by (Kumar and Byrne, 2004) when applied to translation, changes the decision rule in (2) to select the translation that has the lowest expected loss E[Loss(e, r)], which can be estimated by considering a weighted Loss between e and the elements of the n-best list, the approximation to E, as described in (Mangu et al., 2000). The resulting decision rule is:

$$\mathrm{transl}_\lambda(f) = \arg\min_{e \in \mathrm{Gen}(f)} \sum_{e' \in \mathrm{Gen}(f)} \mathrm{Loss}(e, e') \, p_\lambda(e' \mid f) \,. \qquad (4)$$

(Kumar and Byrne, 2004) explicitly consider selecting both e and a, an alignment between the English and French sentences. Under a phrase based translation model (Koehn et al., 2003; Marcu and Wong, 2002), this distinction is important and will be discussed in more detail. The representation of the evaluation metric, or Loss function, is in the decision rule rather than in the training criterion for the exponential model. This criterion is hard to optimize for the same reason as the criterion in (3b): the objective function is not continuous in λ. To make things worse, it is more expensive to evaluate the function at a given λ, since the decision rule involves a sum over all translations. A sketch of this computation on an n-best list is given at the end of Section 2.3.

2.3 MBR and the Exponential Model

Previous work has reported the success of the MBR decision rule with fixed parameters relating independent underlying models, typically including only the language model and the translation model as features in the exponential model.

We extend the MBR approach by developing a training method to optimize the parameters λ in the exponential model as an explicit form for the conditional distribution in equation (1). The training task under the MBR criterion is

$$\lambda^{*} = \arg\min_{\lambda} \mathrm{Loss}(\mathrm{transl}_\lambda(\vec{f}), \vec{r}) \qquad (5a)$$

where

$$\mathrm{transl}_\lambda(f) = \arg\min_{e \in \mathrm{Gen}(f)} \sum_{e' \in \mathrm{Gen}(f)} \mathrm{Loss}(e, e') \, p_\lambda(e' \mid f) \,. \qquad (5b)$$

We begin with several observations about this optimization criterion.

• The MAP optimal λ* are not the optimal parameters for this training criterion.

• We can expect the error surface of the MBR training criterion to contain larger sections of similar altitude, since the decision rule emphasizes consensus.

• The piecewise linearity observation made in (Papineni et al., 2002) is no longer applicable since we cannot move the log operation into the expected value.
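Restricted to an n-best list, the decision rules (4) and (5b) amount to a quadratic-cost rescoring pass over the candidates. Below is a minimal sketch, assuming the renormalized probabilities from the earlier snippet and some pairwise loss such as one minus a sentence-level BLEU approximation; the names are ours, not the paper's code.

```python
def mbr_translation(nbest_probs, pairwise_loss):
    """MBR decision rule (4)/(5b): minimize expected loss over Gen(f).

    nbest_probs   : list of (translation, probability) pairs, probabilities
                    renormalized over the n-best list
    pairwise_loss : function Loss(e, e_prime) between two candidates
    """
    def expected_loss(e):
        # E[Loss(e, .)] under the model distribution over the n-best list.
        return sum(p * pairwise_loss(e, e2) for e2, p in nbest_probs)
    return min((e for e, _ in nbest_probs), key=expected_loss)
```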

3 Score Sampling

Motivated by the challenges that the MBR training criterion presents, we present a training method that is based on the assumption that the error surface is locally non-smooth but consists of local regions of similar Loss values. We would like to focus the search within regions of the parameter space that result in low Loss values, simulating the effect that the MER training process achieves when it determines the merged error boundaries across a set of sentences.

Let Score(λ) be some function of Loss(transl_λ(f⃗), r⃗) that is greater than or equal to zero, decreases monotonically with Loss, and for which $\int (\mathrm{Score}(\lambda) - \min_{\lambda'} \mathrm{Score}(\lambda'))\, d\lambda$ is finite; e.g., 1 − Loss(transl_λ(f⃗), r⃗) for the word-error rate (WER) loss and a bounded parameter space. While sampling parameter vectors λ and estimating Loss at these points, we will constantly refine our estimate of the error surface and thereby of the Score function. The main idea in our score sampling algorithm is to make use of this Score estimate by constructing a probability distribution over the parameter space that depends on the Score estimate in the current iteration step i, and to sample the parameter vector λ^{i+1} for the next iteration from that distribution. More precisely, let $\widehat{Sc}^{(i)}$ be the estimate of Score in iteration i (we will explain how to obtain this estimate below). Then the probability distribution from which we sample the parameter vector to test in the next iteration is given by:

$$p(\lambda) = \frac{\widehat{Sc}^{(i)}(\lambda) - \min_{\lambda'} \widehat{Sc}^{(i)}(\lambda')}{\int \left(\widehat{Sc}^{(i)}(\lambda) - \min_{\lambda'} \widehat{Sc}^{(i)}(\lambda')\right) d\lambda} \,. \qquad (6)$$

This distribution produces a sequence λ^1, ..., λ^n of parameter vectors that are more concentrated in areas that result in a high Score. We can select the value from this sequence that generates the highest Score, just as in the MER training process.

The exact method of obtaining the Score estimate $\widehat{Sc}$ is crucial: if we are not careful enough and guess too low values of $\widehat{Sc}$ for parameter regions that are still unknown to us, the resulting sampling distribution p might be zero in those regions, and thus potentially optimal parameters might never be sampled. Rather than aiming for a consistent estimator of Score (i.e., an estimator that converges to Score when the sample size goes to infinity), we design $\widehat{Sc}$ with regard to yielding a suitable sampling distribution p.

Assume that the parameter space is bounded such that min_k ≤ λ_k ≤ max_k for each dimension k. We then define a set of pivots P, forming a grid of points in R^m that are evenly spaced between min_k and max_k for each dimension k. Each pivot represents a region of the parameter space where we expect generally consistent values of Score. We do not restrict the sampled values of λ to be at these pivot points as a grid search would do; rather, we treat the pivots as landmarks within the search space.

We approximate the distribution p(λ) with the discrete distribution p(λ ∈ P), leaving the problem of estimating |P| parameters. Initially, we set p to be uniform, i.e., p^(0)(λ) = 1/|P|. For subsequent iterations, we now need an estimate of Score(λ) for each pivot λ ∈ P in the discrete version of equation (6) to obtain the new sampling distribution p. Each iteration i proceeds as follows; a sketch of the full loop is given after the weighting scheme below.

• Sample λ̃^i from the discrete distribution p^(i−1)(λ ∈ P) obtained in the previous iteration.

• Sample the new parameter vector λ^i by choosing, for each k ∈ {1, ..., m}, λ^i_k := λ̃^i_k + ε_k, where ε_k is sampled uniformly from the interval (−d_k/2, d_k/2) and d_k is the distance between neighboring pivot points along dimension k. Thus, λ^i is sampled from a region around the sampled pivot.

• Evaluate Score(λ^i) and distribute this score to obtain new estimates $\widehat{Sc}^{(i)}(\lambda)$ for all pivots λ ∈ P as described below.

• Use the updated estimates $\widehat{Sc}^{(i)}$ to generate the sampling distribution p^(i) for the next iteration according to

$$p^{(i)}(\lambda) = \frac{\widehat{Sc}^{(i)}(\lambda) - \min_{\lambda'} \widehat{Sc}^{(i)}(\lambda')}{\sum_{\lambda \in P} \left(\widehat{Sc}^{(i)}(\lambda) - \min_{\lambda'} \widehat{Sc}^{(i)}(\lambda')\right)} \,.$$

The score Score(λ^i) of the currently evaluated parameter vector does not only influence the score estimate at the pivot point of the respective region, but the estimates at all pivot points. The closest pivots are influenced most strongly. More precisely, for each pivot λ ∈ P, $\widehat{Sc}^{(i)}(\lambda)$ is a weighted average of Score(λ^1), ..., Score(λ^i), where the weights w^(i)(λ) are chosen according to

$$w^{(i)}(\lambda) = \mathrm{infl}^{(i)}(\lambda) \times \mathrm{corr}^{(i)}(\lambda) \quad \text{with}$$
$$\mathrm{infl}^{(i)}(\lambda) = \mathrm{mvnpdf}(\lambda, \lambda^{i}, \Sigma) \quad \text{and} \quad \mathrm{corr}^{(i)}(\lambda) = 1 / p^{(i-1)}(\lambda) \,.$$

Here, mvnpdf(x, µ, Σ) denotes the m-dimensional multivariate-normal probability density function with mean µ and covariance matrix Σ, evaluated at point x. We chose the covariance matrix Σ = diag(d_1^2, ..., d_m^2), where again d_k is the distance between neighboring grid points along dimension k. The term infl^(i)(λ) quantifies the influence of the evaluated point λ^i on the pivot λ, while corr^(i)(λ) is a correction term for the bias introduced by having sampled λ^i from p^(i−1).
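The loop below sketches one possible implementation of the sampling iteration and the weighted score estimates, with the diagonal-covariance mvnpdf written out explicitly. The incremental weighted averages and the uniform-mixing step are our own choices; the smoothing of uncertain regions discussed next addresses the zero-probability problem more carefully.

```python
import math
import random

def mvnpdf_diag(x, mu, d):
    """m-dimensional normal density with covariance diag(d_1^2 .. d_m^2)."""
    return math.prod(
        math.exp(-0.5 * ((xk - mk) / dk) ** 2) / (dk * math.sqrt(2.0 * math.pi))
        for xk, mk, dk in zip(x, mu, d))

def score_sampling(pivots, d, score_fn, iterations=150, smooth=0.1):
    """Score sampling over a pivot grid P.

    pivots   : list of pivot vectors (the grid P)
    d        : per-dimension pivot spacing [d_1 .. d_m]
    score_fn : evaluates Score(lambda) on the development set (expensive)
    """
    n = len(pivots)
    p = [1.0 / n] * n                 # p^(0): uniform over the pivots
    w_sum = [0.0] * n                 # accumulated weights per pivot
    ws_sum = [0.0] * n                # accumulated weight * Score per pivot
    best, best_score = None, float('-inf')
    for _ in range(iterations):
        # Sample a pivot from p^(i-1), then perturb within its grid cell.
        j = random.choices(range(n), weights=p)[0]
        lam = [c + random.uniform(-dk / 2.0, dk / 2.0)
               for c, dk in zip(pivots[j], d)]
        s = score_fn(lam)
        if s > best_score:
            best, best_score = lam, s
        # Distribute the score to every pivot: influence mvnpdf(pivot, lam,
        # Sigma) times the bias-correction term 1 / p^(i-1)(pivot).
        for i, piv in enumerate(pivots):
            w = mvnpdf_diag(piv, lam, d) / p[i]
            w_sum[i] += w
            ws_sum[i] += w * s
        # Sc^(i) as weighted averages, then the next distribution p^(i).
        est = [ws_sum[i] / w_sum[i] for i in range(n)]
        lo = min(est)
        mass = [e - lo for e in est]
        total = sum(mass)
        # Mix with the uniform distribution so unexplored pivots stay
        # reachable (a crude stand-in for the smoothing described below).
        p = [(1.0 - smooth) * (m / total if total > 0 else 1.0 / n)
             + smooth / n for m in mass]
    return best, best_score
```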

Smoothing uncertain regions. In the beginning of the optimization process, there will be pivot regions that have not yet been sampled from, and for which not even close-by regions have been sampled yet. This will be reflected in a low sum of influence terms infl^(1)(λ) + ... + infl^(i)(λ) for the respective pivot points λ. It is therefore advisable to discount some probability mass from p^(i) and distribute it over pivots with low influence sums (reflecting low confidence in the respective score estimates) according to some smoothing procedure.

4 N-Best Lists in Phrase Based Decoding

The methods described above make extensive use of n-best lists to approximate the search space of candidate translations. In phrase based decoding we often interpret the MAP decision rule to select the top scoring path in the translation lattice. Selecting a particular path in fact means selecting the pair ⟨e, s⟩, where s is a segmentation of the source sentence f into phrases and alignments onto their translations in e. Kumar and Byrne (2004) represent this decision explicitly, since the Loss metrics considered in their work evaluate alignment information as well as lexical (word) level output. When considering lexical scores as we do here, the decision rule minimizing 0/1 loss actually needs to take the sum over all potential segmentations that can generate the same word sequence. In practice, we only consider the high probability segmentation decisions, namely the ones that were found in the n-best list. This gives the 0/1 loss criterion shown below.

$$\mathrm{transl}_\lambda(f) = \arg\max_{e} \sum_{s} p_\lambda(e, s \mid f) \qquad (7)$$

The 0/1 loss criterion favors translations that are supported by several segmentation decisions. In the context of phrase-based translations, this is a useful criterion, since a given lexical target word sequence can be correctly segmented in several different ways, all of which would be scored equally by an evaluation metric that only considers the word sequence. A sketch of this marginalization over an n-best list follows.
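On an n-best list that records segmentations, criterion (7) is a simple marginalization. A minimal sketch, assuming entries carry their renormalized joint probabilities p_λ(e, s | f); the container format is our own illustration:

```python
from collections import defaultdict

def zero_one_translation(nbest):
    """0/1 loss criterion (7): marginalize segmentations out of the n-best list.

    nbest: list of (target_words, segmentation, probability) triples, where
           several entries may share target_words but differ in segmentation.
    """
    mass = defaultdict(float)
    for words, _seg, prob in nbest:
        mass[tuple(words)] += prob   # sum over segmentations of the same string
    return max(mass, key=mass.get)
```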
5 Experimental Framework

Our goal is to evaluate the impact of the three decision rules discussed above on a large scale translation task that takes advantage of multidimensional features in the exponential model. In this section we describe the experimental framework used in this evaluation.

5.1 Data Sets and Resources

We perform our analysis on the data provided by the 2005 ACL Workshop in Exploiting Parallel Texts for Statistical Machine Translation, working with the French-English Europarl corpus. This corpus consists of 688,031 sentence pairs, with approximately 156 million words on the French side and 138 million words on the English side. We use the data as provided by the workshop and run lower casing as our only preprocessing step. We use the 15.5 million entry phrase translation table as provided for the shared workshop task for the French-English data set. Each translation pair has a set of 5 associated phrase translation scores that represent the maximum likelihood estimate of the phrase as well as internal alignment probabilities. We also use the English language model as provided for the shared task. Since each of these decision rules has its respective training process, we split the workshop test set of 2000 sentences into a development and test set using random splitting. We tried two decoders for translating these sets. The first system is the Pharaoh decoder provided by (Koehn et al., 2003) for the shared data task. The Pharaoh decoder has support for multiple translation and language model scores as well as simple phrase distortion and word length models. The pruning and distortion limit parameters remain the same as in the provided initialization scripts, i.e., DistortionLimit = 4, BeamThreshold = 0.1, Stack = 100. For further information on these parameter settings, confer (Koehn et al., 2003). Pharaoh is interesting for our optimization task because its eight different models lead to a search space with seven free parameters. Here, a principled optimization procedure is crucial. The second decoder we tried is the CMU Statistical Translation System (Vogel et al., 2003) augmented with the four translation models provided by the Pharaoh system, in the following called CMU-Pharaoh. This system also leads to a search space with seven free parameters.

5.2 N-Best Lists

As mentioned earlier, the model parameters λ play a large role in the search space explored by a pruning beam search decoder. These parameters affect the histogram and beam pruning as well as the future cost estimation used in the Pharaoh and CMU decoders. The initial parameter file for Pharaoh provided by the workshop gave a very poor estimate of λ, resulting in an n-best list of limited potential. To account for this condition, we ran Minimum Error Rate training on the development data to determine scaling factors that can generate an n-best list with high quality translations. We realize that this step biases the n-best list towards the MAP criteria, since its parameters will likely cause more aggressive pruning. However, since we have chosen a large N = 1000, and retrain the MBR, MAP, and 0/1 loss parameters separately, we do not feel that the bias has a strong impact on the evaluation.

5.3 Evaluation Metric

This paper focuses on the BLEU metric as presented in (Papineni et al., 2002). The BLEU metric is defined on a corpus level as follows:

$$\mathrm{Score}(\vec{e}, \vec{r}) = BP(\vec{e}, \vec{r}) \cdot \exp\left(\frac{1}{N} \sum_{n=1}^{N} \log p_n\right)$$

where the p_n represent the precision of n-grams suggested in e⃗ and BP is a brevity penalty measuring the relative shortness of e⃗ over the whole corpus. To use the BLEU metric in the candidate pairwise loss calculation in (4), we need to make a decision regarding cases where higher order n-gram matches are not found between two candidates. Kumar and Byrne (2004) suggest that if any n-grams are not matched then the pairwise BLEU score is set to zero. As an alternative, we first estimate corpus-wide n-gram counts on the development set. When the pairwise counts are collected between sentence pairs, they are added onto the baseline corpus counts and scored by BLEU. This scoring simulates the process of scoring additional sentences after seeing a whole corpus; a sketch follows below.
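The following is a sketch of this smoothed pairwise BLEU as we read the description above: baseline per-order match and candidate counts are collected once from the development set, and each sentence pair's clipped counts are added on top before computing BLEU, so a missing high-order match no longer zeroes the score. The helper names, the exact bookkeeping, and the use of 1 − BLEU as the pairwise Loss in (4) are our own assumptions, not the paper's released code.

```python
import math
from collections import Counter

def ngram_counts(hyp, ref, max_n=4):
    """Clipped n-gram matches and candidate totals for orders 1..max_n."""
    matches, totals = [], []
    for n in range(1, max_n + 1):
        h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        matches.append(sum(min(c, r[g]) for g, c in h.items()))
        totals.append(sum(h.values()))
    return matches, totals

def bleu(matches, totals, hyp_len, ref_len):
    """Corpus-level BLEU from aggregate statistics: geometric mean of the
    n-gram precisions times the brevity penalty."""
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals))
    log_bp = min(0.0, 1.0 - ref_len / hyp_len)   # log of the brevity penalty
    return math.exp(log_bp + log_prec / len(matches))

def pairwise_bleu_loss(hyp, pseudo_ref, base):
    """Loss(e, e') for rule (4): 1 - BLEU of the pair on top of baseline counts.

    base: (matches, totals, hyp_len, ref_len) accumulated over the development
          corpus; these baseline counts keep every precision strictly positive.
    """
    bm, bt, bh, br = base
    m, t = ngram_counts(hyp, pseudo_ref)
    return 1.0 - bleu([a + b for a, b in zip(bm, m)],
                      [a + b for a, b in zip(bt, t)],
                      bh + len(hyp), br + len(pseudo_ref))
```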
5.4 Training Environment

It is important to separate the impact of the decision rule from the success of the training procedure. To appropriately compare the MAP, 0/1 loss, and MBR decision rules, they must all be trained with the same training method; here we use the score sampling training method described above. We also report MAP scores using the MER training described above to determine the impact of the training algorithm for MAP. Note that the MER training approach cannot be performed on the MBR decision rule, as explained in Section 2.3. MER training is initialized at random values of λ and run (successive greedy search over the parameters) until there is no change in the error for three complete cycles through the parameter set. This process is repeated with new starting parameters as well as permutations of the parameter search order to ensure that there is no bias in the search towards a particular parameter. To improve efficiency, pairwise scores are cached across requests for the score at different values of λ, and for MBR only the E[Loss(e, r)] for the top twenty hypotheses as ranked by the model are computed.

6 Results

The results in Table 1 compare the BLEU scores achieved by each training method on the development and test data for both Pharaoh and CMU-Pharaoh. Score-sampling training was run for 150 iterations to find λ for each decision rule. The MAP-MER training was performed to evaluate the effect of the greedy search method on the generalization of the development set results. Each row represents an alternative training method described in this paper, while the test set columns indicate the criteria used to select the final translation output e⃗. The bold face scores are the scores for matching training and testing methods. The underlined score is the highest test set score, achieved by MBR decoding using the CMU-Pharaoh system trained for the MBR decision rule with the score-sampling algorithm. When comparing MER training for MAP decoding with score-sampling training for MAP decoding, score-sampling surprisingly outperforms MER training for both Pharaoh and CMU-Pharaoh, although MER training is specifically tailored to the MAP metric. Note, however, that our score-sampling algorithm has a considerably longer running time (several hours) than the MER algorithm (several minutes).

training method              dev. set   test set (MAP)   test set (0/1 loss)   test set (MBR)
MAP MER (Pharaoh)            29.08      29.30            29.42                 29.36
MAP score-sampl. (Pharaoh)   29.08      29.41            29.24                 29.30
0/1 loss sc.-s. (Pharaoh)    29.08      29.16            29.28                 29.30
MBR sc.-s. (Pharaoh)         29.00      29.11            29.08                 29.17
MAP MER (CMU-Pharaoh)        28.80      29.02            29.41                 29.60
MAP sc.-s. (CMU-Ph.)         29.10      29.85            29.75                 29.55
0/1 loss sc.-s. (CMU-Ph.)    28.36      29.97            29.91                 29.72
MBR sc.-s. (CMU-Ph.)         28.36      30.18            30.16                 30.28

Table 1: Comparing BLEU scores generated by alternative training methods and decision rules.

Interestingly, within MER training for Pharaoh, the 0/1 loss metric is the top performer; we believe the reason for this disparity between training and test methods is the impact of phrasal consistency as a valuable measure within the n-best list.

The relative performance of MBR score-sampling w.r.t. MAP and 0/1-loss score sampling is quite different between Pharaoh and CMU-Pharaoh: while MBR score-sampling performs worse than MAP and 0/1-loss score sampling for Pharaoh, it yields the best test scores across the board for CMU-Pharaoh. A possible reason is that the n-best lists generated by Pharaoh have a large percentage of lexically identical translations, differing only in their segmentations. As a result, the 1000-best lists generated by Pharaoh contain only a small percentage of unique translations, a condition that reduces the potential of the Minimum Bayes Risk methods. The CMU decoder, contrariwise, prunes away alternatives below a certain score-threshold during decoding and does not recover them when generating the n-best list. The n-best lists of this system are therefore typically more diverse and in particular contain far more unique translations.

7 Conclusions and Further Work

This work describes a general algorithm for the efficient optimization of error counts for an arbitrary Loss function, allowing us to compare and evaluate the impact of alternative decision rules for statistical machine translation. Our results suggest the value and sensitivity of the translation process to the Loss function at the decoding and reordering stages of the process. As phrase-based translation and reordering models begin to dominate the state of the art in machine translation, it will become increasingly important to understand the nature and consistency of n-best list training approaches. Our results are reported on a complete package of translation tools and resources, allowing the reader to easily recreate and build upon our framework. Further research might lie in finding efficient representations of Bayes Risk loss functions within the decoding process (rather than just using MBR to rescore n-best lists), as well as analyses on different language pairs from the available Europarl data. We have shown score sampling to be an effective training method to conduct these experiments, and we hope to establish its use in the changing landscape of automatic translation evaluation. The source code is available at: www.cs.cmu.edu/~zollmann/scoresampling/

8 Acknowledgments

We thank Stephan Vogel, Ying Zhang, and the anonymous reviewers for their valuable comments and suggestions.

References

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2):263–311.

George Doddington. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proc. ARPA Workshop on Human Language Technology.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the Human Language Technology and North American Association for Computational Linguistics Conference (HLT/NAACL), Edmonton, Canada, May 27-June 1.

Shankar Kumar and William Byrne. 2004. Minimum Bayes-risk decoding for statistical machine translation. In Proceedings of the Human Language Technology and North American Association for Computational Linguistics Conference (HLT/NAACL), Boston, MA, May 27-June 1.

Lidia Mangu, Eric Brill, and Andreas Stolcke. 2000. Finding consensus in speech recognition: word error minimization and other applications of confusion networks. CoRR, cs.CL/0010012.

Daniel Marcu and William Wong. 2002. A phrase-based, joint probability model for statistical machine translation. In Proc. of the Conference on Empirical Methods in Natural Language Processing, Philadelphia, PA, July 6-7.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proc. of the Association for Computational Linguistics, Sapporo, Japan, July 6-7.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the Association for Computational Linguistics, pages 311-318.

Nicola Ueffing, Franz Josef Och, and Hermann Ney. 2002. Generation of word graphs in statistical machine translation. In Proc. of the Conference on Empirical Methods in Natural Language Processing, Philadelphia, PA, July 6-7.

Ashish Venugopal and Stephan Vogel. 2005. Considerations in MCE and MMI training for statistical machine translation. In Proceedings of the Tenth Conference of the European Association for Machine Translation (EAMT-05), Budapest, Hungary, May.

Stephan Vogel, Ying Zhang, Fei Huang, Alicia Tribble, Ashish Venugopal, Bing Zhao, and Alex Waibel. 2003. The CMU statistical translation system. In Proceedings of MT Summit IX, New Orleans, LA, September.
