Minimum Error Rate Training in Statistical Machine Translation
Franz Josef Och
Information Sciences Institute
University of Southern California
4676 Admiralty Way, Suite 1001
Marina del Rey, CA 90292
[email protected]

Abstract

Often, the training procedure for statistical machine translation models is based on maximum likelihood or related criteria. A general problem of this approach is that there is only a loose relation to the final translation quality on unseen text. In this paper, we analyze various training criteria which directly optimize translation quality. These training criteria make use of recently proposed automatic evaluation metrics. We describe a new algorithm for efficient training of an unsmoothed error count. We show that significantly better results can often be obtained if the final evaluation criterion is taken directly into account as part of the training procedure.

1 Introduction

Many tasks in natural language processing have evaluation criteria that go beyond simply counting the number of wrong decisions the system makes. Some often-used criteria are, for example, F-Measure for parsing, mean average precision for ranked retrieval, and BLEU or multi-reference word error rate for statistical machine translation. The use of statistical techniques in natural language processing often starts out with the simplifying (often implicit) assumption that the final scoring is based on simply counting the number of wrong decisions, for instance, the number of sentences incorrectly translated in machine translation. Hence, there is a mismatch between the basic assumptions of the used statistical approach and the final evaluation criterion used to measure success in a task.

Ideally, we would like to train our model parameters such that the end-to-end performance in some application is optimal. In this paper, we investigate methods to efficiently optimize model parameters with respect to machine translation quality as measured by automatic evaluation criteria such as word error rate and BLEU.

2 Statistical Machine Translation with Log-linear Models

Let us assume that we are given a source ('French') sentence f = f_1^J = f_1, ..., f_j, ..., f_J, which is to be translated into a target ('English') sentence e = e_1^I = e_1, ..., e_i, ..., e_I. Among all possible target sentences, we will choose the sentence with the highest probability:[1]

    ê = argmax_e Pr(e | f)    (1)

The argmax operation denotes the search problem, i.e. the generation of the output sentence in the target language.

The decision in Eq. 1 minimizes the number of decision errors. Hence, under a so-called zero-one loss function this decision rule is optimal (Duda and Hart, 1973). Note that under a different loss function (for example, one induced by the BLEU metric) a different decision rule would be optimal.

[1] The notational convention will be as follows. We use the symbol Pr(·) to denote general probability distributions with (nearly) no specific assumptions. In contrast, for model-based probability distributions, we use the generic symbol p(·).

As the true probability distribution Pr(e | f) is unknown, we have to develop a model p(e | f) that approximates Pr(e | f). We directly model the posterior probability Pr(e | f) by using a log-linear model. In this framework, we have a set of M feature functions h_m(e, f), m = 1, ..., M. For each feature function, there exists a model parameter λ_m, m = 1, ..., M. The direct translation probability is given by:

    Pr(e | f) ≈ p_λ(e | f)    (2)
              = exp[ Σ_{m=1}^M λ_m h_m(e, f) ] / Σ_{e'} exp[ Σ_{m=1}^M λ_m h_m(e', f) ]    (3)

In this framework, the modeling problem amounts to developing suitable feature functions that capture the relevant properties of the translation task. The training problem amounts to obtaining suitable parameter values λ_1, ..., λ_M. A standard criterion for log-linear models is the MMI (maximum mutual information) criterion, which can be derived from the maximum entropy principle:

    λ̂ = argmax_λ Σ_{s=1}^S log p_λ(e_s | f_s)    (4)

The optimization problem under this criterion has very nice properties: there is one unique global optimum, and there are algorithms (e.g. gradient descent) that are guaranteed to converge to the global optimum. Yet, the ultimate goal is to obtain good translation quality on unseen test data. Experience shows that good results can be obtained using this approach, yet there is no reason to assume that an optimization of the model parameters using Eq. 4 yields parameters that are optimal with respect to translation quality.

The goal of this paper is to investigate alternative training criteria and corresponding training algorithms, which are directly related to translation quality measured with automatic evaluation criteria. In Section 3, we review various automatic evaluation criteria used in statistical machine translation. In Section 4, we present two different training criteria which try to directly optimize an error count. In Section 5, we sketch a new training algorithm which efficiently optimizes an unsmoothed error count. In Section 6, we describe the used feature functions and our approach to compute the candidate translations that are the basis for our training procedure. In Section 7, we evaluate the different training criteria in the context of several MT experiments.

3 Automatic Assessment of Translation Quality

In recent years, various methods have been proposed to automatically evaluate machine translation quality by comparing hypothesis translations with reference translations. Examples of such methods are word error rate, position-independent word error rate (Tillmann et al., 1997), generation string accuracy (Bangalore et al., 2000), multi-reference word error rate (Nießen et al., 2000), BLEU score (Papineni et al., 2001), and NIST score (Doddington, 2002). All these criteria try to approximate human assessment and often achieve an astonishing degree of correlation to human subjective evaluation of fluency and adequacy (Papineni et al., 2001; Doddington, 2002).

In this paper, we use the following methods:

- multi-reference word error rate (mWER): When this method is used, the hypothesis translation is compared to various reference translations by computing the edit distance (minimum number of substitutions, insertions, and deletions) between the hypothesis and the closest of the given reference translations.

- multi-reference position-independent error rate (mPER): This criterion ignores the word order by treating a sentence as a bag of words and computing the minimum number of substitutions, insertions, and deletions needed to transform the hypothesis into the closest of the given reference translations.

- BLEU score: This criterion computes the geometric mean of the precisions of n-grams of various lengths between a hypothesis and a set of reference translations, multiplied by a factor BP that penalizes short sentences:

      BLEU = BP · exp( (1/N) Σ_{n=1}^N log p_n )

  Here p_n denotes the precision of n-grams in the hypothesis translation. We use N = 4.

- NIST score: This criterion computes a weighted precision of n-grams between a hypothesis and a set of reference translations, multiplied by a factor BP' that penalizes short sentences:

      NIST = BP' · Σ_{n=1}^N W_n

  Here W_n denotes the weighted precision of n-grams in the translation. We use N = 4.

Both NIST and BLEU are accuracy measures, and thus larger values reflect better translation quality. Note that NIST and BLEU scores are not additive for different sentences, i.e. the score for a document cannot be obtained by simply summing over the scores for individual sentences.

4 Training Criteria for Minimum Error Rate Training

In the following, we assume that we can measure the number of errors in a sentence e by comparing it with a reference sentence r using a function E(r, e). However, the following exposition can be easily adapted to accuracy metrics and to metrics that make use of multiple references.

We assume that the number of errors for a set of sentences e_1^S is obtained by summing the errors for the individual sentences: E(r_1^S, e_1^S) = Σ_{s=1}^S E(r_s, e_s).

Our goal is to obtain a minimal error count on a representative corpus f_1^S with given reference translations r_1^S and a set of K different candidate translations C_s = {e_{s,1}, ..., e_{s,K}} for each input sentence f_s:

    λ̂ = argmin_λ Σ_{s=1}^S E(r_s, ê(f_s; λ))    (5)

    ê(f_s; λ) = argmax_{e ∈ C_s} Σ_{m=1}^M λ_m h_m(e, f_s)    (6)

This optimization criterion is not easy to handle:

- It includes an argmax operation (Eq. 6). Therefore, it is not possible to compute a gradient, and we cannot use gradient descent methods to perform optimization.

- The objective function has many different local optima. The optimization algorithm must handle this.

In addition, even if we manage to solve the optimization problem, we might face the problem of overfitting the training data. In Section 5, we describe an efficient optimization algorithm.

To be able to compute a gradient and to make the objective function smoother, we can use the following error criterion, which is essentially a smoothed error count with a parameter α to adjust the smoothness:

    λ̂ = argmin_λ Σ_{s=1}^S Σ_{k=1}^K E(r_s, e_{s,k}) · p_λ(e_{s,k} | f_s)^α / Σ_{k'=1}^K p_λ(e_{s,k'} | f_s)^α    (7)

In the extreme case, for α → ∞, Eq. 7 converges to the unsmoothed criterion of Eq. 5 (except in the case of ties). Note that the resulting objective function might still have local optima, which makes the optimization hard compared to using the objective function of Eq. 4, which does not have different local optima. The use of this type of smoothed error count is a common approach in the speech community (Juang et al., 1995; Schlüter and Ney, 2001). Figure 1 shows the actual shape of the smoothed and the unsmoothed error count for two parameters in our translation system.
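The log-linear decision rule of Eqs. 1 and 2-3 can be sketched as follows. This is only an illustration: the candidate strings, feature values, and weights are invented, and the normalization of Eq. 3 is taken over the given candidate list rather than over all possible target sentences (an n-best approximation).

```python
import math

def posterior(weights, candidates):
    # p_lambda(e | f) of Eqs. 2-3, normalized over the candidate list only.
    scores = {e: sum(w * h for w, h in zip(weights, feats))
              for e, feats in candidates.items()}
    z = sum(math.exp(s) for s in scores.values())
    return {e: math.exp(s) / z for e, s in scores.items()}

def decide(weights, candidates):
    # Decision rule of Eq. 1: choose the highest-posterior candidate.
    post = posterior(weights, candidates)
    return max(post, key=post.get)

# Two hypothetical feature functions (say, translation model and
# language model log-scores) for two hypothetical candidates.
candidates = {"the house": (-1.0, -0.5), "a house": (-1.2, -0.3)}
weights = (1.0, 2.0)
print(decide(weights, candidates))
```

Because exp is monotone, the decision itself depends only on the weighted feature sums; the posterior is needed later for the smoothed criterion.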
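The mWER criterion described in Section 3 amounts to a word-level Levenshtein distance taken against the closest reference. A minimal sketch (the example sentences are invented):

```python
def edit_distance(hyp, ref):
    # Minimum number of substitutions, insertions, and deletions
    # turning hyp into ref (word-level Levenshtein distance).
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[m][n]

def mwer(hyp, refs):
    # mWER: edit distance between the hypothesis and the closest reference.
    return min(edit_distance(hyp, ref) for ref in refs)

hyp = "the man saw the dog".split()
refs = ["the man saw a dog".split(), "a man watched the dog".split()]
print(mwer(hyp, refs))  # 1: one substitution against the first reference
```

mPER would be analogous, but computed on bags of words instead of word sequences, so reorderings are not penalized.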
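The BLEU formula of Section 3 can be sketched for a single sentence as below. Treat this purely as an illustration of the formula: real implementations aggregate n-gram statistics over the whole corpus before taking the geometric mean and may smooth zero counts, and the tie-breaking rule for the closest reference length is an assumption here.

```python
import math
from collections import Counter

def ngram_counts(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def bleu(hyp, refs, max_n=4):
    # Geometric mean of modified n-gram precisions p_1..p_N times a
    # brevity penalty BP. No smoothing: any zero precision (including
    # hypotheses shorter than max_n words) gives a score of 0.
    if not hyp:
        return 0.0
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        max_ref = Counter()
        for ref in refs:
            for ng, c in ngram_counts(ref, n).items():
                max_ref[ng] = max(max_ref[ng], c)
        clipped = sum(min(c, max_ref[ng]) for ng, c in hyp_counts.items())
        if clipped == 0:
            return 0.0
        log_prec_sum += math.log(clipped / sum(hyp_counts.values()))
    c = len(hyp)
    r = min((len(ref) for ref in refs), key=lambda L: (abs(L - c), L))
    bp = 1.0 if c > r else math.exp(1.0 - r / c)  # brevity penalty
    return bp * math.exp(log_prec_sum / max_n)

print(bleu("the cat is on the mat".split(),
           ["the cat is on the mat".split()]))  # 1.0 for an exact match
```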
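The contrast between the smoothed error count of Eq. 7 and the unsmoothed count can be illustrated numerically. All candidate lists, feature values, and error counts below are invented; as α grows, the posterior mass concentrates on the argmax candidate and the smoothed count approaches the unsmoothed one, matching the α → ∞ limit stated above.

```python
import math

def weighted_score(weights, feats):
    # Sum_m lambda_m * h_m(e, f)
    return sum(w * h for w, h in zip(weights, feats))

def smoothed_error(weights, data, alpha):
    # Smoothed error count of Eq. 7: posterior-weighted average of the
    # per-candidate errors E(r_s, e_sk); alpha controls the smoothness.
    total = 0.0
    for candidates, errors in data:
        scaled = {e: alpha * weighted_score(weights, feats)
                  for e, feats in candidates.items()}
        m = max(scaled.values())  # subtract the max for numerical stability
        z = sum(math.exp(s - m) for s in scaled.values())
        total += sum(errors[e] * math.exp(scaled[e] - m) / z for e in scaled)
    return total

def unsmoothed_error(weights, data):
    # Error count of the argmax candidate (Eq. 5, restricted to the n-best list).
    return sum(errors[max(cands, key=lambda e: weighted_score(weights, cands[e]))]
               for cands, errors in data)

# One source sentence with two hypothetical candidates: e1 scores higher
# under the model but contains more errors than e2.
data = [({"e1": (0.5,), "e2": (0.4,)}, {"e1": 2, "e2": 1})]
w = (1.0,)
for alpha in (1.0, 10.0, 100.0):
    print(round(smoothed_error(w, data, alpha), 3))
print(unsmoothed_error(w, data))  # 2
```

Unlike the unsmoothed count, the smoothed count changes continuously with the weights, which is what makes a gradient available.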