Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 311-318.

BLEU: a Method for Automatic Evaluation of Machine Translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu
IBM T. J. Watson Research Center
Yorktown Heights, NY 10598, USA
{papineni,roukos,toddward,weijing}@us.ibm.com

Abstract

Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that can not be reused. We propose a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run. We present this method as an automated understudy to skilled human judges which substitutes for them when there is need for quick or frequent evaluations.[1]

[1] So we call our method the bilingual evaluation understudy, BLEU.

1 Introduction

1.1 Rationale

Human evaluations of machine translation (MT) weigh many aspects of translation, including adequacy, fidelity, and fluency of the translation (Hovy, 1999; White and O'Connell, 1994). A comprehensive catalog of MT evaluation techniques and their rich literature is given by Reeder (2001). For the most part, these various human evaluation approaches are quite expensive (Hovy, 1999). Moreover, they can take weeks or months to finish. This is a big problem because developers of machine translation systems need to monitor the effect of daily changes to their systems in order to weed out bad ideas from good ideas. We believe that MT progress stems from evaluation and that there is a logjam of fruitful research ideas waiting to be released from the evaluation bottleneck. Developers would benefit from an inexpensive automatic evaluation that is quick, language-independent, and correlates highly with human evaluation. We propose such an evaluation method in this paper.

1.2 Viewpoint

How does one measure translation performance? The closer a machine translation is to a professional human translation, the better it is. This is the central idea behind our proposal. To judge the quality of a machine translation, one measures its closeness to one or more reference human translations according to a numerical metric. Thus, our MT evaluation system requires two ingredients:

1. a numerical "translation closeness" metric
2. a corpus of good quality human reference translations

We fashion our closeness metric after the highly successful word error rate metric used by the speech recognition community, appropriately modified for multiple reference translations and allowing for legitimate differences in word choice and word order. The main idea is to use a weighted average of variable length phrase matches against the reference translations. This view gives rise to a family of metrics using various weighting schemes. We have selected a promising baseline metric from this family.

In Section 2, we describe the baseline metric in detail. In Section 3, we evaluate the performance of BLEU. In Section 4, we describe a human evaluation experiment. In Section 5, we compare our baseline metric performance with human evaluations.

2 The Baseline BLEU Metric

Typically, there are many "perfect" translations of a given source sentence. These translations may vary in word choice or in word order even when they use the same words. And yet humans can clearly distinguish a good translation from a bad one. For example, consider these two candidate translations of a Chinese source sentence:

Example 1.
Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party.
Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.

Although they appear to be on the same subject, they differ markedly in quality. For comparison, we provide three reference human translations of the same sentence below.

Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.

It is clear that the good translation, Candidate 1, shares many words and phrases with these three reference translations, while Candidate 2 does not. We will shortly quantify this notion of sharing in Section 2.1. But first observe that Candidate 1 shares "It is a guide to action" with Reference 1, "which" with Reference 2, "ensures that the military" with Reference 1, "always" with References 2 and 3, "commands" with Reference 1, and finally "of the party" with Reference 2 (all ignoring capitalization). In contrast, Candidate 2 exhibits far fewer matches, and their extent is less.

It is clear that a program can rank Candidate 1 higher than Candidate 2 simply by comparing n-gram matches between each candidate translation and the reference translations. Experiments over large collections of translations presented in Section 5 show that this ranking ability is a general phenomenon, and not an artifact of a few toy examples.

The primary programming task for a BLEU implementor is to compare n-grams of the candidate with the n-grams of the reference translation and count the number of matches. These matches are position-independent. The more the matches, the better the candidate translation is. For simplicity, we first focus on computing unigram matches.
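To make this counting step concrete, the following Python sketch (ours, not the paper's; the function name and the naive whitespace tokenization with lowercasing are assumptions) counts position-independent unigram matches between a candidate and a pool of references. Because punctuation is not split off, the counts will differ slightly from the precision figures reported in Section 2.1, and the clipping refinement introduced there is deliberately omitted at this stage.

    # Illustrative sketch (not the authors' code): position-independent
    # unigram matching of a candidate against a set of references.
    # Tokenization is naive whitespace splitting with lowercasing.

    def unigram_matches(candidate, references):
        """Count candidate words that occur in at least one reference."""
        ref_vocab = set()
        for ref in references:
            ref_vocab.update(ref.lower().split())
        return sum(1 for word in candidate.lower().split() if word in ref_vocab)

    candidate1 = ("It is a guide to action which ensures that the military "
                  "always obeys the commands of the party.")
    candidate2 = ("It is to insure the troops forever hearing the activity "
                  "guidebook that party direct.")
    references = [
        "It is a guide to action that ensures that the military will forever heed Party commands.",
        "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
        "It is the practical guide for the army always to heed the directions of the party.",
    ]

    print(unigram_matches(candidate1, references))  # Candidate 1 matches many reference unigrams
    print(unigram_matches(candidate2, references))  # Candidate 2 matches far fewer

Candidate 1 from Example 1 matches far more reference unigrams than Candidate 2, which is exactly the ranking behaviour described above; Section 2.1 refines this raw count with clipping.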
2.1 Modified n-gram precision

The cornerstone of our metric is the familiar precision measure. To compute precision, one simply counts up the number of candidate translation words (unigrams) which occur in any reference translation and then divides by the total number of words in the candidate translation. Unfortunately, MT systems can overgenerate "reasonable" words, resulting in improbable, but high-precision, translations like that of example 2 below. Intuitively the problem is clear: a reference word should be considered exhausted after a matching candidate word is identified. We formalize this intuition as the modified unigram precision. To compute this, one first counts the maximum number of times a word occurs in any single reference translation. Next, one clips the total count of each candidate word by its maximum reference count,[2] adds these clipped counts up, and divides by the total (unclipped) number of candidate words.

[2] Count_clip = min(Count, Max_Ref_Count). In other words, one truncates each word's count, if necessary, to not exceed the largest count observed in any single reference for that word.

Example 2.
Candidate: the the the the the the the.
Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.
Modified Unigram Precision = 2/7.[3]

[3] As a guide to the eye, we have underlined the important words for computing modified precision.

In Example 1, Candidate 1 achieves a modified unigram precision of 17/18, whereas Candidate 2 achieves a modified unigram precision of 8/14. Similarly, the modified unigram precision in Example 2 is 2/7, even though its standard unigram precision is 7/7.
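The clipping rule of footnote [2] is straightforward to express in code. The following Python sketch (ours, not the paper's implementation; it assumes whitespace tokenization, lowercasing, and punctuation already stripped) computes the modified unigram precision and reproduces the 2/7 of Example 2.

    # Illustrative sketch (not the authors' code): modified unigram precision
    # with reference clipping, as defined above.  Whitespace tokenization and
    # lowercasing are simplifying assumptions made here for readability.
    from collections import Counter

    def modified_unigram_precision(candidate, references):
        cand_counts = Counter(candidate.lower().split())
        # Maximum count of each word in any single reference translation.
        max_ref_counts = Counter()
        for ref in references:
            for word, count in Counter(ref.lower().split()).items():
                max_ref_counts[word] = max(max_ref_counts[word], count)
        # Clip each candidate count by its maximum reference count, then sum.
        clipped = sum(min(count, max_ref_counts[word])
                      for word, count in cand_counts.items())
        return clipped / sum(cand_counts.values())

    # Example 2 from the text: the candidate gets credit for only two "the"s,
    # the maximum number observed in any single reference.
    candidate = "the the the the the the the"
    references = ["The cat is on the mat", "There is a cat on the mat"]
    print(modified_unigram_precision(candidate, references))  # 2/7, about 0.2857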
Modified n-gram precision is computed similarly for any n: all candidate n-gram counts and their corresponding maximum reference counts are collected. The candidate counts are clipped by their corresponding reference maximum value, summed, and divided by the total number of candidate n-grams. In Example 1, Candidate 1 achieves a modified bigram precision of 10/17, whereas the lower quality Candidate 2 achieves a modified bigram precision of 1/13. In Example 2, the (implausible) candidate achieves a modified bigram precision of 0. This sort of modified n-gram precision scoring captures two aspects of translation: adequacy and fluency. A translation using the same words (1-grams) as in the references tends to satisfy adequacy. The longer n-gram matches account for fluency.

2.1.1 Modified n-gram precision on blocks of text

How do we compute modified n-gram precision on a multi-sentence test set? Although one typically evaluates MT systems on a corpus of entire documents, our basic unit of evaluation is the sentence. A source sentence may translate to many target sentences, in which case we abuse terminology and refer to the corresponding target sentences as a "sentence." We first compute the n-gram matches sentence by sentence. Next, we add the clipped n-gram counts for all the candidate sentences and divide by the number of candidate n-grams in the test corpus to compute a modified precision score, p_n, for the entire test corpus.

2.1.2 Ranking systems using only modified n-gram precision

To verify that modified n-gram precision distinguishes between very good translations and bad translations, we computed the modified precision numbers on the output of a (good) human translator and a standard (poor) machine translation system using 4 reference translations for each of 127 source sentences. The average precision results are shown in Figure 1.

[Figure 1: Distinguishing Human from Machine. Average modified precision (y-axis, 0 to 0.9) versus phrase (n-gram) length from 1 to 4 (x-axis) for the human translator and the machine translation system.]

The strong signal differentiating human (high precision) from machine (low precision) is striking. The difference becomes stronger as we go from unigram precision to 4-gram precision. It appears that any single n-gram precision score can distinguish between a good translation and a bad translation. To be useful, however, the metric must also reliably distinguish between translations that do not differ so greatly in quality.
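As a sketch of the corpus-level computation described in Section 2.1.1 (again ours, not the paper's implementation; the function names, the whitespace tokenizer, and the aligned list-of-lists input format are assumptions), the clipped n-gram counts are accumulated sentence by sentence and divided once at the end:

    # Illustrative sketch (not the authors' code): corpus-level modified
    # n-gram precision, i.e. clipped n-gram counts summed over all candidate
    # sentences divided by the total number of candidate n-grams.
    # Tokenization (whitespace, lowercase) is a simplifying assumption.
    from collections import Counter

    def ngrams(tokens, n):
        """All n-grams of a token list, as tuples."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def corpus_modified_precision(candidates, references_per_candidate, n):
        """candidates: list of candidate sentences (strings).
        references_per_candidate: list of reference-sentence lists, aligned
        with candidates.  Returns the corpus-level modified n-gram precision."""
        clipped_total = 0
        candidate_total = 0
        for cand, refs in zip(candidates, references_per_candidate):
            cand_counts = Counter(ngrams(cand.lower().split(), n))
            # Maximum count of each n-gram in any single reference.
            max_ref_counts = Counter()
            for ref in refs:
                for gram, count in Counter(ngrams(ref.lower().split(), n)).items():
                    max_ref_counts[gram] = max(max_ref_counts[gram], count)
            clipped_total += sum(min(count, max_ref_counts[gram])
                                 for gram, count in cand_counts.items())
            candidate_total += sum(cand_counts.values())
        return clipped_total / candidate_total if candidate_total else 0.0

Calling this function with different values of n gives the per-order precisions; the paper reports them up to 4-grams, as in Figure 1.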