Bleu: a Method for Automatic Evaluation of Machine Translation
IBM Research Report RC22176 (W0109-022), Computer Science, September 17, 2001

Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu
IBM T. J. Watson Research Center
P.O. Box 218, Yorktown Heights, NY 10598, USA
{papineni,roukos,toddward,weijing}@us.ibm.com

Abstract

Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that cannot be reused. We propose a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run. We present this method as an automated understudy to skilled human judges which substitutes for them when there is need for quick or frequent evaluations.¹

¹ So we call our method the bilingual evaluation understudy, Bleu.

1 Introduction

1.1 Rationale

Human evaluations of machine translation (MT) weigh many aspects of translation, including adequacy, fidelity, and fluency of the translation [1][2]. Such evaluations are extensive, but also expensive [1]. Moreover, they can take weeks or months to finish. This is a big problem because developers of machine translation systems need to monitor the effect of daily changes to their systems in order to weed out bad ideas from good ideas. We believe that MT progress stems from evaluation and that there is a logjam of fruitful research ideas waiting to be released from the evaluation bottleneck. Developers would benefit from an inexpensive automatic evaluation that is quick, language-independent, and correlates highly with human evaluation. We propose such an evaluation method in this note.

1.2 Viewpoint

How does one measure translation performance? The closer a machine translation is to a professional human translation, the better it is. This is the central idea behind our proposal. To judge the quality of a machine translation, one measures its closeness by a numerical metric to one or more reference human translations. Thus, our MT evaluation system requires two ingredients:

1. a numerical "translation closeness" metric
2. a corpus of good-quality human reference translations

These reference translations can be reused over and over again and incur only a one-time startup expense. Each evaluation can be accomplished in seconds.

We fashion our closeness metric after the highly successful word error rate metric used by the speech recognition community, appropriately modified for multiple reference translations and allowing for legitimate differences in word choice and word order. The main idea is to use a weighted average of variable-length phrase matches against the reference translations.
This view gives rise to a family of metrics using various weighting schemes. We have selected a promising baseline metric from this family.

Although our baseline metric correlates very highly with human judgments, we do know that there are subtleties and stylistic variations that are better appreciated by humans than machines. For the foreseeable future, we believe these subtleties will remain relatively small effects compared with other MT phenomena. We present our method as a virtual apprentice or understudy to skilled human judges. We call this method Bleu, for BiLingual Evaluation Understudy.

In Section 2, we describe the baseline metric in detail. In Section 3, we evaluate the performance of Bleu. In Section 4, we describe a human evaluation experiment. In Section 5, we compare our baseline metric performance with human evaluations.

2 The Baseline Bleu Metric

Typically, there are many "perfect" translations of a given source sentence. These translations may vary in word choice or in word order even when they use the same words. And yet humans can clearly distinguish a good translation from a bad one. For example, consider these two candidate translations of a Chinese source sentence.

Example 1:

Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party.

Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.

Although they appear to be about the same subject, they differ markedly in quality. For comparison, we provide three reference human translations of the same sentence below.

Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.

Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.

Reference 3: It is the practical guide for the army always to heed the directions of the party.

It is clear that the good translation, Candidate 1, shares many words and phrases with these three reference translations, while Candidate 2 does not. We will shortly quantify this notion of sharing. But first observe that Candidate 1 shares "It is a guide to action" with Reference 1, "which" with Reference 2, "ensures that the military" with Reference 1, "always" with References 2 and 3, "commands" with Reference 1, and finally "of the party" with Reference 2 (all ignoring capitalization). In contrast, Candidate 2 exhibits far fewer matches, and their extent is less.

For this example, it is at once clear that a program can rank Candidate 1 higher than Candidate 2 simply by comparing n-gram matches between each candidate translation and the reference translations. Experiments over large collections of translations presented in Section 5 show that this ranking ability is a general phenomenon, and not an artifact of a few examples.

The primary programming task in a Bleu implementation is to compare n-grams of the candidate with the n-grams of the reference translation and count the number of matches. These matches are position-independent. The more the matches, the better the candidate translation. For simplicity, we first focus on computing unigram matches.
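The comparison just described is straightforward to mechanize. Below is a minimal sketch, not the implementation described in this report, of position-independent n-gram matching against multiple references; the helper names ngrams and match_count are illustrative only.

```python
def ngrams(tokens, n):
    """All n-grams of length n in a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def match_count(candidate, references, n):
    """Count candidate n-grams that occur anywhere in any reference
    (position-independent, no clipping yet)."""
    ref_ngrams = set()
    for ref in references:
        ref_ngrams.update(ngrams(ref.lower().split(), n))
    return sum(1 for g in ngrams(candidate.lower().split(), n) if g in ref_ngrams)

references = [
    "it is a guide to action that ensures that the military will forever heed party commands",
    "it is the guiding principle which guarantees the military forces always being under the command of the party",
    "it is the practical guide for the army always to heed the directions of the party",
]
candidate1 = "it is a guide to action which ensures that the military always obeys the commands of the party"
candidate2 = "it is to insure the troops forever hearing the activity guidebook that party direct"

# Candidate 1 shares many more unigrams with the references than Candidate 2,
# so even this crude count ranks it higher.
print(match_count(candidate1, references, 1))  # 17 of 18 unigrams match
print(match_count(candidate2, references, 1))  # 8 of 14 unigrams match
```

Clipping, introduced next, keeps such a count from rewarding pathological repetition of reference words.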
2.1 Modified n-gram precision

The cornerstone of our metric is the familiar precision measure. To compute precision, one simply counts up the number of candidate translation words (unigrams) which occur in any reference translation and then divides by the total number of words in the candidate translation. Unfortunately, MT systems can overgenerate "reasonable" words, resulting in improbable, but high-precision, translations like that of Example 2 below. Intuitively the problem is clear: a reference word should be considered exhausted after a matching candidate word is identified. We formalize this intuition as the modified unigram precision. To compute this, one first counts the maximum number of times a word occurs in any single reference translation. Next, one clips the total count of each candidate word by its maximum reference count, adds these clipped counts up, and divides by the total (unclipped) number of candidate words.

Example 2:

Candidate: the the the the the the the.

Reference 1: The cat is on the mat.

Reference 2: There is a cat on the mat.

Modified Unigram Precision = 2/7.

In Example 1, Candidate 1 achieves a modified unigram precision of 17/18, whereas Candidate 2 achieves a modified unigram precision of 8/14. Similarly, the modified unigram precision in Example 2 is 2/7, even though its standard unigram precision is 7/7.

Modified n-gram precision is computed similarly for any n: all candidate n-gram counts and their corresponding maximum reference counts are collected. The candidate counts are clipped by their corresponding reference maximum value, summed, and divided by the total number of candidate n-grams. In Example 1, Candidate 1 achieves a modified bigram precision of 10/17, whereas the lower-quality Candidate 2 achieves a modified bigram precision of 1/13.

2.1.1 Modified n-gram precision on blocks of text

To score a multi-sentence test set, we compute the n-gram matches sentence by sentence, then add the clipped n-gram counts for all the candidate sentences and divide by the number of candidate n-grams in the test corpus to compute a modified precision score, p_n, for the entire test corpus:

    p_n = \frac{\sum_{C \in \{\text{Candidates}\}} \sum_{\text{n-gram} \in C} \text{Count}_{\text{clip}}(\text{n-gram})}{\sum_{C' \in \{\text{Candidates}\}} \sum_{\text{n-gram}' \in C'} \text{Count}(\text{n-gram}')}

In other words, we use a word-weighted average of the sentence-level modified precisions rather than a sentence-weighted average. As an example, we compute word matches at the sentence level, but the modified unigram precision is the fraction of words matched in the entire test corpus.

2.1.2 Ranking systems using only modified n-gram precision

To verify that modified n-gram precision distinguishes between very good translations and bad translations, we computed the modified precision numbers on the output of a (good) human translator and a standard (poor) machine translation system against 4 reference translations for each of 127 source sentences.
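As a concrete check of modified n-gram precision and the corpus-level score p_n defined above, here is a minimal sketch under the stated definitions, not the implementation used in this report; the helper names ngram_counts and modified_precision are illustrative only.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Counter of n-grams of length n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidates, reference_lists, n):
    """Corpus-level modified n-gram precision p_n: clipped candidate n-gram
    counts summed over all candidate sentences, divided by the total number
    of candidate n-grams in the test corpus."""
    clipped_total, total = 0, 0
    for cand, refs in zip(candidates, reference_lists):
        cand_counts = ngram_counts(cand.lower().split(), n)
        # Maximum number of times each n-gram occurs in any single reference.
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in ngram_counts(ref.lower().split(), n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        # Clip each candidate count by its maximum reference count.
        clipped_total += sum(min(count, max_ref_counts[gram])
                             for gram, count in cand_counts.items())
        total += sum(cand_counts.values())
    return clipped_total / total

# Example 2: the pathological candidate scores 2/7 rather than 7/7.
candidates = ["the the the the the the the"]
references = [["the cat is on the mat", "there is a cat on the mat"]]
print(modified_precision(candidates, references, 1))  # 0.2857... = 2/7
```

Passing several candidate/reference pairs at once sums the clipped counts before dividing, which gives the word-weighted (rather than sentence-weighted) average described above.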