Re-Evaluating the Role of BLEU in Machine Translation Research

Re-evaluating the Role of BLEU in Machine Translation Research Chris Callison-Burch Miles Osborne Philipp Koehn School on Informatics University of Edinburgh 2 Buccleuch Place Edinburgh, EH8 9LW [email protected] Abstract Bleu to the extent that it currently is. In this We argue that the machine translation paper we give a number of counterexamples for community is overly reliant on the Bleu Bleu’s correlation with human judgments. We machine translation evaluation metric. We show that under some circumstances an improve- show that an improved Bleu score is nei- ment in Bleu is not sufficient to reflect a genuine ther necessary nor sufficient for achieving improvement in translation quality, and in other an actual improvement in translation qual- circumstances that it is not necessary to improve ity, and give two significant counterex- Bleu in order to achieve a noticeable improvement amples to Bleu’s correlation with human in translation quality. judgments of quality. This offers new po- We argue that Bleu is insufficient by showing tential for research which was previously that Bleu admits a huge amount of variation for deemed unpromising by an inability to im- identically scored hypotheses. Typically there are prove upon Bleu scores. millions of variations on a hypothesis translation that receive the same Bleu score. Because not all 1 Introduction these variations are equally grammatically or se- Over the past five years progress in machine trans- mantically plausible there are translations which lation, and to a lesser extent progress in natural have the same Bleu score but a worse human eval- language generation tasks such as summarization, uation. We further illustrate that in practice a has been driven by optimizing against n-gram- higher Bleu score is not necessarily indicative of based evaluation metrics such as Bleu (Papineni better translation quality by giving two substantial et al., 2002). The statistical machine translation examples of Bleu vastly underestimating the trans- community relies on the Bleu metric for the pur- lation quality of systems. Finally, we discuss ap- poses of evaluating incremental system changes propriate uses for Bleu and suggest that for some and optimizing systems through minimum er- research projects it may be preferable to use a fo- ror rate training (Och, 2003). Conference pa- cused, manual evaluation instead. pers routinely claim improvements in translation 2 BLEU Detailed quality by reporting improved Bleu scores, while neglecting to show any actual example transla- The rationale behind the development of Bleu (Pa- tions. Workshops commonly compare systems us- pineni et al., 2002) is that human evaluation of ma- ing Bleu scores, often without confirming these chine translation can be time consuming and ex- rankings through manual evaluation. All these pensive. An automatic evaluation metric, on the uses of Bleu are predicated on the assumption that other hand, can be used for frequent tasks like it correlates with human judgments of translation monitoring incremental system changes during de- quality, which has been shown to hold in many velopment, which are seemingly infeasible in a cases (Doddington, 2002; Coughlin, 2003). manual evaluation setting. However, there is a question as to whether min- The way that Bleu and other automatic evalu- imizing the error rate with respect to Bleu does in- ation metrics work is to compare the output of a deed guarantee genuine translation improvements. machine translation system against reference hu- If Bleu’s correlation with human judgments has man translations. Machine translation evaluation been overestimated, then the field needs to ask it- metrics differ from other metrics that use a refer- self whether it should continue to be driven by ence, like the word error rate metric that is used Orejuela appeared calm as he was led to the 1-grams: American, Florida, Miami, Orejuela, appeared, as, being, calm, carry, escorted, he, him, in, led, American plane which will take him to Mi- plane, quite, seemed, take, that, the, to, to, to, was , was, ami, Florida. which, while, will, would, ,, . Orejuela appeared calm while being escorted 2-grams: American plane, Florida ., Miami ,, Miami in, Orejuela appeared, Orejuela seemed, appeared calm, to the plane that would take him to Miami, as he, being escorted, being led, calm as, calm while, carry Florida. him, escorted to, he was, him to, in Florida, led to, plane Orejuela appeared calm as he was being led that, plane which, quite calm, seemed quite, take him, that was, that would, the American, the plane, to Miami, to to the American plane that was to carry him carry, to the, was being, was led, was to, which will, while to Miami in Florida. being, will take, would take, , Florida Orejuela seemed quite calm as he was being 3-grams: American plane that, American plane which, Miami , Florida, Miami in Florida, Orejuela appeared led to the American plane that would take calm, Orejuela seemed quite, appeared calm as, appeared him to Miami in Florida. calm while, as he was, being escorted to, being led to, calm as he, calm while being, carry him to, escorted to the, he Appeared calm when he was taken to was being, he was led, him to Miami, in Florida ., led to the American plane, which will to Miami, the, plane that was, plane that would, plane which will, quite calm as, seemed quite calm, take him to, that was to, Florida. that would take, the American plane, the plane that, to Miami ,, to Miami in, to carry him, to the American, to the plane, was being led, was led to, was to carry, which Table 1: A set of four reference translations, and will take, while being escorted, will take him, would take a hypothesis translation from the 2005 NIST MT him, , Florida . Evaluation Table 2: The n-grams extracted from the refer- in speech recognition, because translations have a ence translations, with matches from the hypoth- degree of variation in terms of word choice and in esis translation in bold terms of variant ordering of some phrases. Bleu attempts to capture allowable variation in then the modified precisions would be p1 = .83, word choice through the use of multiple reference p2 = .59, p3 = .31, and p4 = .2. Each pn is com- translations (as proposed in Thompson (1991)). bined and can be weighted by specifying a weight In order to overcome the problem of variation in wn. In practice each pn is generally assigned an phrase order, Bleu uses modified n-gram precision equal weight. instead of WER’s more strict string edit distance. Because Bleu is precision based, and because Bleu’s n-gram precision is modified to elimi- recall is difficult to formulate over multiple refer- nate repetitions that occur across sentences. For ence translations, a brevity penalty is introduced to example, even though the bigram “to Miami” is compensate for the possibility of proposing high- repeated across all four reference translations in precision hypothesis translations which are too Table 1, it is counted only once in a hypothesis short. The brevity penalty is calculated as: translation. Table 2 shows the n-gram sets created ( 1 if c > r from the reference translations. BP = e1−r/c if c ≤ r Papineni et al. (2002) calculate their modified precision score, pn, for each n-gram length by where c is the length of the corpus of hypothesis summing over the matches for every hypothesis translations, and r is the effective reference corpus sentence S in the complete corpus C as: length.1 P P Thus, the Bleu score is calculated as S∈C ngram∈S Countmatched(ngram) pn = P P N S∈C ngram∈S Count(ngram) X Bleu = BP ∗ exp( wn logpn) Counting punctuation marks as separate tokens, n=1 the hypothesis translation given in Table 1 has 15 A Bleu score can range from 0 to 1, where unigram matches, 10 bigram matches, 5 trigram higher scores indicate closer matches to the ref- matches (these are shown in bold in Table 2), and erence translations, and where a score of 1 is as- three 4-gram matches (not shown). The hypoth- signed to a hypothesis translation which exactly esis translation contains a total of 18 unigrams, 1The effective reference corpus length is calculated as the 17 bigrams, 16 trigrams, and 15 4-grams. If the sum of the single reference translation from each set which is complete corpus consisted of this single sentence closest to the hypothesis translation. matches one of the reference translations. A score 3.1 Permuting phrases of 1 is also assigned to a hypothesis translation One way in which variation can be introduced is which has matches for all its n-grams (up to the by permuting phrases within a hypothesis trans- maximum n measured by Bleu) in the clipped ref- lation. A simple way of estimating a lower bound erence n-grams, and which has no brevity penalty. on the number of ways that phrases in a hypothesis The primary reason that Bleu is viewed as a use- translation can be reordered is to examine bigram ful stand-in for manual evaluation is that it has mismatches. Phrases that are bracketed by these been shown to correlate with human judgments of bigram mismatch sites can be freely permuted be- translation quality. Papineni et al. (2002) showed cause reordering a hypothesis translation at these that Bleu correlated with human judgments in points will not reduce the number of matching n- its rankings of five Chinese-to-English machine grams and thus will not reduce the overall Bleu translation systems, and in its ability to distinguish score. between human and machine translations. Bleu’s Here we denote bigram mismatches for the hy- correlation with human judgments has been fur- pothesis translation given in Table 1 with vertical ther tested in the annual NIST Machine Transla- bars: tion Evaluation exercise wherein Bleu’s rankings Appeared calm | when | he was | taken | of Arabic-to-English and Chinese-to-English sys- to the American plane | , | which will | tems is verified by manual evaluation.

Re-Evaluating the Role of BLEU in Machine Translation Research

BLEU Might Be Guilty but References Are Not Innocent

A Visualisation Tool for Inspecting and Evaluating Metric Scores of Machine Translation Output

Why Word Error Rate Is Not a Good Metric for Speech Recognizer Training for the Speech Translation Task?

Machine Translation Evaluation

Arxiv:2006.14799V2 [Cs.CL] 18 May 2021 the Same User Input

A Survey of Evaluation Metrics Used for NLG Systems

English and Arabic Speech Translation

Automatic Evaluation Measures for Statistical Machine Translation System Optimization

Introduction to Natural Language Processing Computer Science 585—Fall 2009 University of Massachusetts Amherst

Minimum Error Rate Training in Statistical Machine Translation

LEPOR: an Augmented Machine Translation Evaluation Metric Lifeng

Bertscore: Evaluating Text Generation with Bert