Training and Evaluating Error Minimization Rules for Statistical Machine Translation

Ashish Venugopal    Andreas Zollmann    Alex Waibel
School of Computer Science, Carnegie Mellon University
[email protected]  [email protected]  [email protected]

Abstract

Decision rules that explicitly account for non-probabilistic evaluation metrics in machine translation typically require special training, often to estimate parameters in exponential models that govern the search space and the selection of candidate translations. While the traditional Maximum A Posteriori (MAP) decision rule can be optimized as a piecewise linear function in a greedy search of the parameter space, the Minimum Bayes Risk (MBR) decision rule is not well suited to this technique, a condition that makes past results difficult to compare. We present a novel training approach for non-tractable decision rules, allowing us to compare and evaluate these and other decision rules on a large scale translation task, taking advantage of the high dimensional parameter space available to the phrase based Pharaoh decoder. This comparison is timely and important as decoders evolve to represent more complex search space decisions and are evaluated against innovative evaluation metrics of translation quality.

1 Introduction

State-of-the-art statistical machine translation takes advantage of exponential models to incorporate a large set of potentially overlapping features to select translations from a set of potential candidates. As discussed in (Och, 2003), the direct translation model represents the probability of a target sentence 'English' e = e_1 ... e_I being the translation of a source sentence 'French' f = f_1 ... f_J through an exponential, or log-linear, model

\[
p_\lambda(e \mid f) = \frac{\exp\bigl(\sum_{k=1}^{m} \lambda_k \, h_k(e, f)\bigr)}{\sum_{e' \in E} \exp\bigl(\sum_{k=1}^{m} \lambda_k \, h_k(e', f)\bigr)} \tag{1}
\]

where e is a single candidate translation for f from the set of all English translations E, λ is the parameter vector for the model, and each h_k is a feature function of e and f. In practice, we restrict E to the set Gen(f), a set of highly likely translations discovered by a decoder (Vogel et al., 2003). Selecting a translation from this model under the Maximum A Posteriori (MAP) criterion yields

\[
\mathrm{transl}_\lambda(f) = \arg\max_{e} \; p_\lambda(e \mid f). \tag{2}
\]

This decision rule is optimal under the zero-one loss function, minimizing the Sentence Error Rate (Mangu et al., 2000). Using the log-linear form to model p_λ(e|f) gives us the flexibility to introduce overlapping features that can represent global context while decoding (searching the space of candidate translations) and rescoring (ranking a set of candidate translations before performing the arg max operation), albeit at the cost of the traditional source-channel generative model of translation proposed in (Brown et al., 1993).
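To make the model in (1) and the MAP rule in (2) concrete, the following is a minimal Python sketch of scoring an n-best list under a log-linear model and selecting the arg max translation. It is an illustration rather than the system described here: the feature names (tm, lm), the weights, and the n-best entries are invented for the example.

```python
import math

def map_translation(nbest, weights):
    """Select the MAP translation from an n-best list under a log-linear model.

    nbest   : list of (translation_string, feature_dict) pairs approximating Gen(f)
    weights : dict mapping feature name k -> lambda_k
    Returns the arg max translation together with the full distribution p_lambda(.|f)
    as in equations (1) and (2).
    """
    # Unnormalized log-scores: sum_k lambda_k * h_k(e, f)
    scores = [sum(weights[k] * h for k, h in feats.items()) for _, feats in nbest]

    # Normalize into p_lambda(e|f); the denominator is shared across candidates,
    # so it does not change the arg max, but we compute it to expose the distribution.
    max_s = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - max_s) for s in scores]
    z = sum(exps)
    probs = [x / z for x in exps]

    best = max(range(len(nbest)), key=lambda i: probs[i])
    return nbest[best][0], probs

# Illustrative (made-up) 3-best list with two features: translation model
# and language model log-scores.
nbest = [
    ("the house is small", {"tm": -2.1, "lm": -3.0}),
    ("the house is little", {"tm": -2.4, "lm": -2.8}),
    ("small is the house", {"tm": -2.0, "lm": -4.5}),
]
weights = {"tm": 1.0, "lm": 0.6}
print(map_translation(nbest, weights))
```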
A significant impact of this paradigm shift, however, has been the movement to leverage the flexibility of the exponential model to maximize performance with respect to automatic evaluation metrics. Each evaluation metric considers different aspects of translation quality, both at the sentence and corpus level, often achieving high correlation to human evaluation (Doddington, 2002). It is clear that the decision rule stated in (2) does not reflect the choice of evaluation metric, and substantial work has been done to correct this mismatch in criteria. Approaches include integrating the metric into the decision rule, and learning λ to optimize the performance of the decision rule. In this paper we compare and evaluate several aspects of these techniques, focusing on Minimum Error Rate (MER) training (Och, 2003) and Minimum Bayes Risk (MBR) decision rules, within a novel training environment that isolates the impact of each component of these methods.

2 Addressing Evaluation Metrics

We now describe competing strategies to address the problem of modeling the evaluation metric within the decoding and rescoring process, and introduce our contribution towards training non-tractable error surfaces. The methods discussed below make use of Gen(f), the approximation to the complete candidate translation space E, referred to as an n-best list. Details regarding n-best list generation from decoder output can be found in (Ueffing et al., 2002).

2.1 Minimum Error Rate Training

The predominant approach to reconciling the mismatch between the MAP decision rule and the evaluation metric has been to train the parameters λ of the exponential model to correlate the MAP choice with the maximum score as indicated by the evaluation metric on a development set with known references (Och, 2003). We differentiate between the decision rule

\[
\mathrm{transl}_\lambda(f) = \arg\max_{e \in \mathrm{Gen}(f)} \; p_\lambda(e \mid f) \tag{3a}
\]

and the training criterion

\[
\hat{\lambda} = \arg\min_{\lambda} \; \mathrm{Loss}\bigl(\mathrm{transl}_\lambda(\vec{f}\,), \vec{r}\,\bigr) \tag{3b}
\]

where the Loss function returns an evaluation result quantifying the difference between the English candidate translation transl_λ(f) and its corresponding reference r for a source sentence f. We indicate with the vector notation that this loss function operates on a sequence of sentences. To avoid overfitting, and since MT researchers are generally blessed with an abundance of data, these sentences are taken from a separate development set.

The optimization problem (3b) is hard since the arg max of (3a) causes the error surface to change in steps in R^m, precluding the use of gradient based optimization methods. Smoothed error counts can be used to approximate the arg max operator, but the resulting function still contains local minima. Grid-based line search approaches like Powell's algorithm could be applied, but we can expect difficulty when choosing the appropriate grid size and starting parameters. In the following, we summarize the optimization algorithm for the unsmoothed error counts presented in (Och, 2003) and the implementation detailed in (Venugopal and Vogel, 2005); a sketch of the per-dimension line search is given at the end of this subsection.

• Regard Loss(transl_λ(f⃗), r⃗) as defined in (3b) as a function of the parameter vector λ to optimize, and take the arg max to compute transl_λ(f⃗) over the translations Gen(f) according to the n-best list generated with an initial estimate λ⁰.

• The error surface defined by Loss (as a function of λ) is piecewise linear with respect to a single model parameter λ_k, hence we can determine exactly where it would be useful (values that change the result of the arg max) to evaluate λ_k for a given sentence, using a simple line intersection method.

• Merge the lists of useful evaluation points for λ_k and evaluate the corpus level Loss(transl_λ(f⃗), r⃗) at each one.

• Select the value of λ_k that yields the lowest Loss as λ_k varies, fix λ_k, and consider the parameter λ_j for another dimension j.

This training algorithm, referred to as minimum error rate (MER) training, is a greedy search in each dimension of λ, made efficient by realizing that within each dimension we can compute the points at which changes in λ actually have an impact on Loss. The appropriate considerations for termination and initial starting points relevant to any greedy search procedure must be accounted for. From the nature of the training procedure and the MAP decision rule, we can expect that the parameters selected by MER training will strongly favor a few translations in the n-best list, namely, for each source sentence, the one resulting in the best score, moving most of the probability mass towards the translation that it believes should be selected. This is due to the decision rule rather than the training procedure, as we will see when we consider alternative decision rules.
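As noted above, the following is a minimal sketch of the per-dimension line search, an assumption-laden illustration rather than the authors' implementation. For a fixed dimension k, the score of each hypothesis e is linear in λ_k, namely a_e + b_e·λ_k, where b_e = h_k(e, f) and a_e collects the contribution of all other (fixed) features. The sketch enumerates pairwise line intersections as candidate λ_k values, evaluates a summed sentence-level loss at probe points between them, and returns the best value; a full implementation would keep only the upper envelope of the lines and, for corpus-level metrics such as BLEU, accumulate sufficient statistics rather than summing per-sentence losses.

```python
def critical_points(sent_lines):
    """Candidate lambda_k values where the arg max hypothesis of one sentence can change.

    sent_lines: list of (a, b) pairs, one per hypothesis, so that its score is a + b * lam.
    Uses simple pairwise line intersections (adequate for a sketch).
    """
    pts = []
    for i in range(len(sent_lines)):
        for j in range(i + 1, len(sent_lines)):
            (a1, b1), (a2, b2) = sent_lines[i], sent_lines[j]
            if b1 != b2:
                pts.append((a2 - a1) / (b1 - b2))
    return pts

def best_hyp(sent_lines, lam):
    """Index of the arg max hypothesis at a given lambda_k value."""
    return max(range(len(sent_lines)),
               key=lambda i: sent_lines[i][0] + sent_lines[i][1] * lam)

def line_search_dimension(corpus_lines, corpus_losses):
    """Greedy MER-style search in one dimension of lambda.

    corpus_lines : per sentence, the (a, b) score lines of its n-best hypotheses
    corpus_losses: per sentence, the loss of each hypothesis against its reference
    Returns (best lambda_k value, corpus loss at that value).
    """
    # Merge the per-sentence critical points and probe between consecutive ones.
    pts = sorted(p for lines in corpus_lines for p in critical_points(lines))
    if not pts:
        return 0.0, sum(l[best_hyp(s, 0.0)] for s, l in zip(corpus_lines, corpus_losses))
    probes = [pts[0] - 1.0] + [(x + y) / 2.0 for x, y in zip(pts, pts[1:])] + [pts[-1] + 1.0]

    def corpus_loss(lam):
        # Corpus-level loss is a summed sentence loss here; BLEU-style metrics
        # would instead accumulate sufficient statistics per sentence.
        return sum(losses[best_hyp(lines, lam)]
                   for lines, losses in zip(corpus_lines, corpus_losses))

    return min(((lam, corpus_loss(lam)) for lam in probes), key=lambda t: t[1])
```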
2.2 The Minimum Bayes Risk Decision Rule

The Minimum Bayes Risk decision rule, proposed by (Mangu et al., 2000) for the Word Error Rate metric in speech recognition and applied to translation by (Kumar and Byrne, 2004), changes the decision rule in (2) to select the translation that has the lowest expected loss E[Loss(e, r)], which can be estimated by considering a weighted Loss between e and the elements of the n-best list, the approximation to E, as described in (Mangu et al., 2000). The resulting decision rule is:

\[
\mathrm{transl}_\lambda(f) = \arg\min_{e \in \mathrm{Gen}(f)} \; \sum_{e' \in \mathrm{Gen}(f)} \mathrm{Loss}(e, e') \, p_\lambda(e' \mid f). \tag{4}
\]

(Kumar and Byrne, 2004) explicitly consider selecting both e and a, an alignment between the English and French sentences. Under a phrase based translation model (Koehn et al., 2003; Marcu and Wong, 2002), this distinction is important and will be discussed in more detail.

[...] training method to optimize the parameters λ in the exponential model as an explicit form for the conditional distribution in equation (1). The training task under the MBR criterion is

\[
\lambda^{*} = \arg\min_{\lambda} \; \mathrm{Loss}\bigl(\mathrm{transl}_\lambda(\vec{f}\,), \vec{r}\,\bigr) \tag{5a}
\]

where

\[
\mathrm{transl}_\lambda(f) = \arg\min_{e \in \mathrm{Gen}(f)} \; \sum_{e' \in \mathrm{Gen}(f)} \mathrm{Loss}(e, e') \, p_\lambda(e' \mid f). \tag{5b}
\]

We begin with several observations about this optimization criterion.

• The MAP optimal λ* are not the optimal parameters for this training criterion.

• We can expect the error surface of the MBR training criterion to contain larger sections of similar altitude, since the decision rule emphasizes consensus.

• The piecewise linearity observation made in (Papineni et al., 2002) is no longer applicable, since we cannot move the log operation into the expected value.

3 Score Sampling

Motivated by the challenges that the MBR training criterion presents, we present a training method that is based on the assumption that the error surface is [...]
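To make the expected-loss selection in (4) and (5b) concrete, here is a minimal sketch of MBR decoding over an n-best list. The posterior is the normalized log-linear score from (1); the sentence-level loss used here (one minus the fraction of a candidate's tokens that appear in the other candidate) is only a stand-in for the actual evaluation metric, and all names and values below are illustrative assumptions rather than the paper's implementation.

```python
import math

def posterior(nbest_scores):
    """Turn unnormalized log-linear scores over Gen(f) into p_lambda(e|f)."""
    m = max(nbest_scores)
    exps = [math.exp(s - m) for s in nbest_scores]
    z = sum(exps)
    return [x / z for x in exps]

def unigram_loss(e, e_prime):
    """Stand-in sentence-level loss: 1 - fraction of e's tokens found in e_prime."""
    e_toks, ref = e.split(), e_prime.split()
    if not e_toks:
        return 1.0
    matches = sum(1 for t in e_toks if t in ref)
    return 1.0 - matches / len(e_toks)

def mbr_translation(nbest, scores, loss=unigram_loss):
    """Select the hypothesis with minimum expected loss, as in equations (4) and (5b).

    nbest  : list of candidate translation strings approximating Gen(f)
    scores : unnormalized log-linear score of each candidate
    """
    p = posterior(scores)
    expected = [sum(loss(e, e2) * p_e2 for e2, p_e2 in zip(nbest, p)) for e in nbest]
    best = min(range(len(nbest)), key=lambda i: expected[i])
    return nbest[best]

# Illustrative usage on a made-up 3-best list.
nbest = ["the house is small", "the house is little", "small is the house"]
scores = [-4.1, -4.3, -4.8]
print(mbr_translation(nbest, scores))
```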