BLEU might be Guilty but References are not Innocent

Markus Freitag, David Grangier, Isaac Caswell
Google Research
{freitag,grangier,icaswell}@google.com

Abstract

The quality of automatic metrics for machine translation has been increasingly called into question, especially for high-quality systems. This paper demonstrates that, while choice of metric is important, the nature of the references is also critical. We study different methods to collect references and compare their value in automated evaluation by reporting correlation with human evaluation for a variety of systems and metrics. Motivated by the finding that typical references exhibit poor diversity, concentrating around translationese language, we develop a paraphrasing task for linguists to perform on existing reference translations, which counteracts this bias. Our method yields higher correlation with human judgment not only for the submissions of WMT 2019 English→German, but also for back-translation and APE-augmented MT output, which have been shown to have low correlation with automatic metrics using standard references. We demonstrate that our methodology improves correlation with all modern evaluation metrics we look at, including embedding-based methods. To complete this picture, we reveal that multi-reference BLEU does not improve the correlation for high-quality output, and present an alternative multi-reference formulation that is more effective.

1 Introduction

Machine Translation (MT) quality has greatly improved in recent years (Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017). This progress has cast doubt on the reliability of automated metrics, especially in the high-accuracy regime. For instance, the WMT English→German evaluation in the last two years had a different top system when looking at automated or human evaluation (Bojar et al., 2018; Barrault et al., 2019). Such discrepancies have also been observed in the past, especially when comparing rule-based and statistical systems (Bojar et al., 2016b; Koehn and Monz, 2006; Callison-Burch et al., 2006).

Automated evaluations are however of crucial importance, especially for system development. Most decisions for architecture selection, hyper-parameter search and data filtering rely on automated evaluation at a pace and scale that would not be sustainable with human evaluations. Automated evaluation (Koehn, 2010; Papineni et al., 2002) typically relies on two crucial ingredients: a metric and a reference translation. Metrics generally measure the quality of a translation by assessing the overlap between the system output and the reference translation. Different overlap metrics have been proposed, aiming to improve correlation between human and automated evaluations. Such metrics range from n-gram matching, e.g. BLEU (Papineni et al., 2002), to accounting for synonyms, e.g. METEOR (Banerjee and Lavie, 2005), to considering distributed word representations, e.g. BERTScore (Zhang et al., 2019). Orthogonal to metric quality (Ma et al., 2019), reference quality is also essential in improving correlation between human and automated evaluation.

This work studies how different reference collection methods impact the reliability of automatic evaluation. It also highlights that the reference sentences typically collected with current (human) translation methodology are biased to assign higher automatic scores to MT output that shares a similar style with the reference. Human translators tend to generate translations which exhibit translationese language, i.e. sentences with source artifacts (Koppel and Ordan, 2011). This is problematic because collecting only a single style of references fails to reward systems that might produce alternative but equally accurate translations (Popović, 2019). Because of this lack of diversity, multi-reference evaluations like multi-reference BLEU are also biased to prefer that specific style of translation.

As a better solution, we show that paraphrasing translations, when done carefully, can improve the quality of automated evaluations more broadly. Paraphrased translations increase diversity and steer evaluation away from rewarding translation artifacts. Experiments with the official submissions of WMT 2019 English→German for a variety of different metrics demonstrate the increased correlation with human judgement. Further, we run additional experiments for MT systems that are known to have low correlation with automatic metrics calculated with standard references. In particular, we investigated MT systems augmented with either back-translation or automatic post-editing (APE). We show that paraphrased references overcome the problems of automatic metrics and produce the same ordering as human ratings.

Our contributions are four-fold: (i) We collect different types of references on the same test set and show that it is possible to report strong correlation between automated and human evaluation, even for high-accuracy systems. (ii) We gather more natural and diverse valid translations by collecting human paraphrases of reference translations. We show that (human) paraphrases correlate well with human judgments when used as references in automatic evaluations. (iii) We present an alternative multi-reference formulation that is more effective than multi-reference BLEU for high-quality output. (iv) We release a rich set of diverse references (https://github.com/google/wmt19-paraphrased-references) to encourage research in systems producing other types of translations, and to reward a wider range of generated language.

2 Related Work

Evaluation of machine translation is of crucial importance for system development and deployment decisions (Moorkens et al., 2018). Human evaluation typically reports the adequacy of translations, often complemented with fluency scores (White, 1994; Graham et al., 2013). Evaluation by human raters can be conducted through system comparisons, rankings (Bojar et al., 2016a), or absolute judgments, i.e. direct assessments (Graham et al., 2013). Absolute judgments allow one to efficiently compare a large number of systems. The evaluation of translations as isolated sentences, full paragraphs or documents is also an important factor in the cost/quality trade-offs (Carpuat and Simard, 2012). Isolated sentence evaluation is generally more efficient but fails to penalize contextual mistakes (Tu et al., 2018; Hardmeier et al., 2015).

Automatic evaluation typically collects human reference translations and relies on an automatic metric to compare human references to system outputs. Automatic metrics typically measure the overlap between references and system outputs. A wide variety of metrics has been proposed, and automated metrics remain an active area of research. BLEU (Papineni et al., 2002) is the most common metric. It measures the geometric average of the precision over hypothesis n-grams, with an additional penalty to discourage short translations. NIST (Doddington, 2002) is similar but up-weights rare, informative n-grams. TER (Snover et al., 2006) measures an edit distance, as a way to estimate the amount of work required to post-edit the hypothesis into the reference. METEOR (Banerjee and Lavie, 2005) rewards n-gram matches beyond exact matches by considering synonyms. Others propose to use contextualized word embeddings, like BERTScore (Zhang et al., 2019).
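For concreteness, BLEU as defined by Papineni et al. (2002) combines modified n-gram precisions p_n (typically up to N = 4, with uniform weights w_n = 1/N) with a brevity penalty that depends on the total hypothesis length c and reference length r:

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\Big(\sum_{n=1}^{N} w_n \log p_n\Big),
\qquad
\mathrm{BP} =
\begin{cases}
  1 & \text{if } c > r,\\
  e^{\,1 - r/c} & \text{if } c \le r.
\end{cases}
```

Because the p_n count exact surface n-gram matches against the reference, BLEU inherently rewards hypotheses that reproduce the particular wording of that reference, which is the bias examined in the rest of this paper.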
Rewarding multiple alternative formulations is also the primary motivation behind multi-reference evaluation (Nießen et al., 2000). Dreyer and Marcu (2012) introduced an annotation tool and process that can be used to create meaning-equivalent networks encoding an exponential number of translations for a given sentence. Orthogonal to the number of references, the quality of the reference translations is also essential to the reliability of automated evaluation (Zbib et al., 2013). This topic itself raises the question of human translation assessment, which is beyond the scope of this paper (Moorkens et al., 2018).

Meta-evaluation studies the correlation between human assessments and automatic evaluations (Callison-Burch et al., 2006, 2008; Callison-Burch, 2009). Indeed, automatic evaluation is useful only if it rewards hypotheses perceived as fluent and adequate by a human. Interestingly, previous work (Bojar et al., 2016a) has shown that a higher correlation can be achieved when comparing similar systems than when comparing different types of systems, e.g. phrase-based vs. neural vs. rule-based. In particular, rule-based systems can be penalized as they produce less common translations, even when such translations are fluent and adequate. Similarly, recent benchmark results comparing neural systems on high-resource languages (Bojar et al., 2018; Barrault et al., 2019) have shown mismatches between the systems with the highest BLEU score and the systems faring best in human evaluations.

Freitag et al. (2019) and Edunov et al. (2019) study this mismatch in the context of systems trained with back-translation (Sennrich et al., 2016) and noisy back-translation (Edunov et al., 2018). They observe that systems trained with or without back-translation (BT) can reach a similar level of overlap (BLEU) with the reference, but hypotheses from BT systems are more fluent, as measured both by humans and by a language model (LM). They suggest considering LM scores in addition to BLEU.

Freitag et al. (2019) and Edunov et al. (2019) point at translationese as a major source of mismatch between BLEU and human evaluation. Translationese refers to artifacts from the source language present in the translations, i.e. human translations are often less fluent than natural target sentences due to word order and lexical choices influenced by the source language (Koppel and Ordan, 2011). The impact of translationese on evaluation has recently received attention (Toral et al., 2018; Zhang and Toral, 2019; Graham et al., 2019). In the present work, we are specifically concerned that the presence of translationese in the references might cause overlap-based metrics to reward hypotheses with translationese language more than hypotheses using more natural language. The question of bias towards a specific reference has also been raised in the case of monolingual human evaluation (Fomicheva and Specia, 2016; Ma et al., 2017). The impact of translationese in test sets is related to, but different from, the impact of translationese in the training data (Kurokawa et al., 2009; Lembersky et al., 2012; Bogoychev and Sennrich, 2019; Riley et al., 2019).

In this work, we explore collecting a single reference translation, using human paraphrases to steer away as much as possible from biases in the reference translation that lead automatic metrics to prefer MT output with the same style (e.g. translationese). Automatic methods to extract paraphrase n-grams (Zhou et al., 2006) or full-sentence paraphrases (Kauchak and Barzilay, 2006; Bawden et al., 2020; Thompson and Post, 2020) have been used to provide multiple references. In contrast, we generate a single unbiased reference translation produced by humans instead of trying to cover a wider space of possible translations. Moreover, in contrast to human paraphrasing (our instructions asked for the most diverse paraphrases), automatic paraphrasing is still far from perfect (Roy and Grangier, 2019) and mostly generates local changes that do not steer away from biases by, e.g., introducing different sentence structures.

3 Collecting High Quality and Diverse References

We acquired two types of new reference translations: first, we asked a professional translation service to provide an additional reference translation. Second, we used the same service to paraphrase existing references, asking a different set of linguists.

3.1 Additional Standard References

We asked a professional translation service to create additional high-quality references to measure the effect of different reference translations. The work was shared equally by 10 professional linguists. The use of CAT tools (dictionaries, translation memory, MT) was specifically disallowed, and the translation service employed a tool to disable copying from the source field and pasting anything into the target field. The translations were produced by experienced linguists who are native speakers of the target language. The original WMT English→German newstest2019 reference translations were generated in sequence while keeping a 1-1 alignment between sentences, which should help the linguists use some document context. We instead shuffled the sentences to also get translations from different linguists within a document and to avoid systematic biases within a document. The collection of additional references not only may yield better references, but also allows us to conduct various types of multi-reference evaluation. In addition to applying multi-reference BLEU, it also allows us to select the most adequate option among the alternative references for each sentence, composing a higher-quality set.

3.2 Diversified Paraphrased References

The product of human translation is assumed to be ontologically different from natural texts (Koppel and Ordan, 2011) and is therefore often called translationese (Gellerstam, 1986). Translationese includes the effects of interference, the process by which the source language leaves distinct marks in the translation, e.g. in word order, sentence structure (monotonic translation) or lexical choices. It also often brings simplification (Laviosa, 1997), as the translator might impoverish the message, the language, or both.

Task: Paraphrase the sentence as much as possible.

To paraphrase a source, you have to rewrite a sentence without changing the meaning of the original sentence.
1. Read the sentence several times to fully understand the meaning.
2. Note down key concepts.
3. Write your version of the text without looking at the original.
4. Compare your paraphrased text with the original and make minor adjustments to phrases that remain too similar.

Please try to change as much as you can without changing the meaning of the original sentence. Some suggestions:
1. Start your first sentence at a different point from that of the original source (if possible).
2. Use as many synonyms as possible.
3. Change the sentence structure (if possible).

Figure 1: Instructions used to paraphrase an existing translation as much as possible.

Source:      The Bells of St. Martin's Fall Silent as Churches in Harlem Struggle.
Translation: Die Glocken von St. Martin verstummen, da Kirchen in Harlem Probleme haben.
Paraphrase:  Die Probleme in Harlems Kirchen lassen die Glocken von St. Martin verstummen.
Paraphrase:  Die Kirchen in Harlem kämpfen mit Problemen, und so läuten die Glocken von St. Martin nicht mehr.

Table 1: Reference examples of a typical translation and two different paraphrases of this translation. The paraphrases are not only very different from the source sentence (e.g. in sentence structure), but also differ a lot when compared to each other.

The troubling implication is that a reference set of translationese sentences is biased to assign higher word-overlap scores to MT outputs that produce a similar translationese style, and penalizes MT output with more natural targets (Freitag et al., 2019). Collecting a different type of reference could uncover alternative high-quality systems producing different styles of output.

We explore collecting diverse references using paraphrasing to steer away from translationese, with the ultimate goal of generating a natural-to-natural test set, where neither the source sentences nor the reference sentences contain translationese artifacts. In an initial experiment on a sample of 100 sentences, we asked linguists to paraphrase (translated) sentences. The paraphrased references had only minor changes and consequently only minor impact on the automatic metrics. Therefore, we changed the instructions and asked linguists to paraphrase the sentence as much as possible, while also suggesting the use of synonyms and different sentence structures. The paraphrase instructions are shown in Figure 1. These instructions not only satisfy our goal of generating an unbiased sentence, but also have the side effect that two paraphrases of the same sentence are quite different. All our paraphrase experiments in this paper are done with these instructions. One might be concerned that paraphrasing "as much as possible" might yield excessive reformulation at the expense of adequacy in some cases. To compensate for this in the present paper, we collect adequacy ratings for all produced paraphrases. These ratings allow us to select the most adequate paraphrase from among the available alternatives for the same sentence, which results in a composite high-quality paraphrase set with strong adequacy ratings (see Table 2). A paraphrase example is given in Table 1. Even without speaking any German, one can easily see that the paraphrases have a different sentence structure than the source sentence, and that the two paraphrases are quite different from each other.

4 Experimental Set-up

4.1 Data and Models

We use the official submissions of the WMT 2019 English→German news translation task (Barrault et al., 2019) to measure automatic scores for different kinds of references. We then report correlations with the WMT human ratings from the same evaluation campaign. We chose English→German as this track had the most submissions and the outputs with the highest adequacy ratings.

4.2 Human Evaluation

We use the same direct assessment template as was used in the WMT 2019 evaluation campaign. Human raters are asked to assess a given translation by how adequately it expresses the meaning of the corresponding source sentence on an absolute 0-100 rating scale. We acquire 3 ratings per sentence and take the average as the final sentence score. In contrast to WMT, we do not normalize the scores, and we report the average absolute ratings.

5 Experiments

We generate three additional references for the WMT 2019 English→German news translation task. In addition to acquiring an additional reference (AR), we also asked linguists to paraphrase the existing WMT reference and the AR reference (see Section 3 for details). We refer to these paraphrases as WMT.p and AR.p.

5.1 Human Evaluation of References

It is often believed that the most accurate translations should also yield the highest correlation with human ratings when used as references for an automatic metric. For that reason, we run a human evaluation (Section 4.2) for all reference translations to test this hypothesis (Table 2). While all reference translations yield high scores, the paraphrased references are rated as slightly less accurate. We suspect that this may at least in part be an artifact of the rating methodology. Specifically, translations whose word order matches that of the source (i.e. translationese) are easier to rate than translations that use very different sentence structures and phrasing than the source sentence. We generated our paraphrased reference translations with the instruction to modify the translations as much as possible. Therefore, the non-translationese, perhaps more natural, nature of the paraphrased translations makes it more demanding to assign an accurate rating.

As a by-product of these ratings, we consider selecting the best-rated reference among the alternatives for each sentence. Representing this method of combining reference sets with the HQ() function, we generate 3 new reference sets. These are (a) HQ(WMT, AR), abbreviated as HQ(R); (b) HQ(WMT.p, AR.p), abbreviated as HQ(P); and (c) HQ(WMT, AR, AR.p, WMT.p), abbreviated as HQ(all 4). Interestingly, the combined paraphrased reference HQ(P) has a higher human rating than WMT or AR alone.

                              adequacy rating
  WMT                                    85.3
  WMT.p                                  81.8
  AR                                     86.7
  AR.p                                   80.8
  HQ(R)     [WMT + AR]                   92.8
  HQ(P)     [WMT.p + AR.p]               89.1
  HQ(all 4) [all 4]                      95.3

Table 2: Human adequacy assessments for different kinds of references, over the full set of 1997 sentences.
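The HQ() combination step can be made concrete with a short sketch. The data structures below (parallel lists of references and their per-sentence adequacy scores) are illustrative assumptions, not the authors' actual tooling:

```python
from typing import Dict, List

def hq(reference_sets: Dict[str, List[str]],
       adequacy: Dict[str, List[float]]) -> List[str]:
    """Compose a new reference set by picking, for every sentence,
    the reference with the highest human adequacy rating."""
    names = list(reference_sets)
    num_sents = len(reference_sets[names[0]])
    combined = []
    for i in range(num_sents):
        best = max(names, key=lambda name: adequacy[name][i])
        combined.append(reference_sets[best][i])
    return combined

# Example: HQ(R) combines the WMT and AR reference sets.
# hq_r = hq({"WMT": wmt_refs, "AR": ar_refs},
#           {"WMT": wmt_adequacy, "AR": ar_adequacy})
```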
5.2 Correlation with Human Judgement

Table 3 provides the system-level rank correlations (Spearman's ρ and Kendall's τ; we used the scipy implementations in all our experiments: https://docs.scipy.org/doc/scipy/reference/stats.html) of BLEU, calculated with sacreBLEU (Post, 2018) (signature: BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt19+tok.intl+version.1.4.2), evaluating translations of newstest2019 against different references. On the full set of 22 submissions, all 3 new references (AR, WMT.p, AR.p) show higher correlation with human judgment than the original WMT reference, with the paraphrased reference WMT.p coming out on top. Furthermore, each paraphrased reference set shows higher correlation when compared to the reference set that it was paraphrased from.

  Full Set (22)   Reference      ρ      τ
  single ref      WMT            0.88   0.72
                  AR             0.89   0.76
                  WMT.p          0.91   0.79
                  AR.p           0.89   0.77
  single ref      HQ(R)          0.91   0.78
                  HQ(P)          0.91   0.78
                  HQ(all 4)      0.91   0.79
  multi ref       AR+WMT         0.90   0.75
                  AR.p+WMT.p     0.90   0.79
                  all 4          0.90   0.75

Table 3: Spearman's ρ and Kendall's τ for the WMT 2019 English→German official submissions, with human ratings conducted by the WMT organizers.
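A sketch of how such system-level correlations can be computed, assuming one hypothesis list per submission and a human score per system (sacrebleu and scipy provide the BLEU and rank-correlation implementations referenced above; variable names are illustrative):

```python
import sacrebleu
from scipy.stats import spearmanr, kendalltau

def system_level_correlation(system_outputs, references, human_scores):
    """system_outputs: dict name -> list of hypothesis sentences.
    references: list of reference sentences (one reference set).
    human_scores: dict name -> human rating of that system."""
    names = sorted(system_outputs)
    bleu_scores = [
        sacrebleu.corpus_bleu(system_outputs[name], [references]).score
        for name in names
    ]
    human = [human_scores[name] for name in names]
    rho, _ = spearmanr(bleu_scores, human)
    tau, _ = kendalltau(bleu_scores, human)
    return rho, tau
```

Swapping the reference set (WMT, AR, WMT.p, ...) only changes the `references` argument, which is exactly the comparison reported in Table 3.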

Although the combined reference HQ(R) (Section 5.1) improves correlation when compared to the non-paraphrased reference sets (WMT and AR), not one of the three combined references HQ(R), HQ(P), HQ(all 4) shows higher correlation than the paraphrased reference set WMT.p. This result casts doubt on the belief that if references are rated as more adequate, they will necessarily yield more reliable automated scores.

We further find that multi-reference BLEU (calculated with sacreBLEU) does not exhibit better correlation with human judgments than either single-reference BLEU or the composed reference sets HQ(x). It is generally assumed that multi-reference BLEU yields higher correlation with human judgments due to the increased diversity in the reference translations. However, combining two translated reference sets that likely share the same systematic translationese biases still prefers translationese translations. Interestingly, multi-reference BLEU with multiple paraphrases also does not show higher correlation than single-reference BLEU. Combining all 4 references with multi-reference BLEU shows the same correlation numbers as the combination of AR+WMT. As we will see later, the BLEU scores calculated with paraphrased references are much lower than those calculated with standard references. They have fewer n-gram matches, which are mostly only a subset of the n-gram matches of the standard references. Adding paraphrased references to a mix of standard references therefore has a small effect on the total number of n-gram matches, and as a consequence the scores are not much affected.

Note that the correlation numbers already appear relatively high for the full set of systems. This is because both Kendall's τ and Spearman's ρ rank correlation operate over all possible pairs of systems. Since the submissions to WMT 2019 covered a wide range of translation qualities, any metric able to distinguish the highest-scoring and lowest-scoring systems will already have a high correlation. Therefore, small numeric increases as demonstrated in Table 3 can correspond to much larger improvements in the local ranking of systems.

As a consequence, we looked deeper into the correlation for the subset of systems that performed best in human evaluation, where correlation for metrics calculated on the standard reference is known to break down. Kendall's τ rank correlation as a function of the top k systems can be seen in Figure 2. During the WMT 2019 metrics task (Ma et al., 2019), all official submissions (using the original WMT reference) had low correlation scores with human ratings. The paraphrased references improve especially on high-quality system output, and every paraphrased reference set (dotted line) outperforms its corresponding unparaphrased set (same-color solid line).

Figure 2: Kendall's τ correlation of BLEU for the best k systems (based on human ratings).
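The analysis behind Figure 2 amounts to recomputing Kendall's τ on the k systems rated best by humans. A compact sketch, with `metric_scores` and `human_scores` as illustrative dicts mapping system name to score:

```python
from scipy.stats import kendalltau

def topk_kendall(metric_scores, human_scores, k):
    """Kendall's tau restricted to the k systems rated best by humans."""
    top = sorted(human_scores, key=human_scores.get, reverse=True)[:k]
    tau, _ = kendalltau([metric_scores[s] for s in top],
                        [human_scores[s] for s in top])
    return tau

# e.g. [topk_kendall(bleu_by_system, human_by_system, k)
#       for k in range(4, len(human_by_system) + 1)]
```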
These improvements in ranking can be seen in Table 4, which reports the actual BLEU scores of the top seven submissions with four different references. Since we asked humans to paraphrase the WMT reference as much as possible (Section 3) in order to get very different sentences, the paraphrased BLEU scores are much lower than what one would expect for a high-quality system. Nevertheless, the system outputs are better ranked and show the highest correlation of any references explored in this paper.

             WMT    HQ(R)   WMT.p   HQ(P)   human
  FB         43.6   42.3    15.1    15.0    0.347
  Micr.sd    44.8   42.1    14.9    14.9    0.311
  Micr.dl    44.8   42.2    14.9    14.9    0.296
  MSRA       46.0   42.1    14.2    14.1    0.214
  UCAM       44.1   40.4    14.2    14.2    0.213
  NEU        44.6   40.8    14.0    14.1    0.208
  MLLP       42.4   38.3    13.3    13.4    0.189

Table 4: BLEU scores of the best submissions of WMT 2019 English→German.

5.3 Alternative Metrics

Any reference-based metric can be used with our new reference translations. In addition to BLEU, we consider TER (Snover et al., 2006), METEOR (Banerjee and Lavie, 2005), chrF (Popović, 2015), the F-score variant of BERTScore (Zhang et al., 2019) and Yisi-1 (Lo, 2019), the winning system of the WMT 2019 English→German metrics task. Table 5 compares these metrics.
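Most of these metrics have public implementations, so swapping in a different reference set is a small change. A sketch for two of them, assuming the sacrebleu and bert-score packages (TER, METEOR and YiSi-1 ship with their own tooling and are omitted here):

```python
import sacrebleu
from bert_score import score as bert_score

def score_system(hypotheses, references):
    """Corpus-level chrF and BERTScore-F1 for one system against one reference set."""
    chrf = sacrebleu.corpus_chrf(hypotheses, [references]).score
    # bert-score returns per-sentence precision/recall/F1 tensors; we average F1.
    _, _, f1 = bert_score(hypotheses, references, lang="de")
    return {"chrF": chrf, "BERTScore-F1": f1.mean().item()}
```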

As we saw in Figure 2, the paraphrased version of each reference set yields higher correlation with human evaluation across all evaluated metrics than the corresponding original references, with the only exception of TER for HQ(P). Comparing the two paraphrased references, we see that HQ(P) shows higher correlation for chrF and Yisi than WMT.p. In particular, Yisi (which is based on word embeddings) seems to benefit from the higher accuracy of the reference translation.

  metric      WMT    HQ(R)   WMT.p   HQ(P)   HQ(all)
  BLEU        0.72   0.78    0.79    0.79    0.79
  1 - TER     0.71   0.74    0.71    0.67    0.74
  chrF        0.74   0.81    0.78    0.82    0.78
  METEOR      0.74   0.81    0.81    0.81    0.80
  BERTScore   0.78   0.82    0.82    0.82    0.81
  Yisi-1      0.78   0.84    0.84    0.86    0.84

Table 5: WMT 2019 English→German: correlations (Kendall's τ) of alternative metrics: BLEU, 1.0 - TER, chrF, METEOR, BERTScore, and Yisi-1.

5.4 WMT18

We acquired a paraphrased as-much-as-possible reference (WMT.p) for newstest2018 English→German with the same instructions as used before (Figure 1). The test set newstest2018 is a joint test set, which means that half of the sentences were originally written in English and translated into German, and vice versa. We paraphrased the reference sentences for the forward-translated half only, as we want a natural English source sentence. Correlations with human rankings from the WMT18 evaluation campaign are summarized in Table 6. The paraphrased reference WMT.p shows higher correlations with human judgement for all metrics.

  ref      BLEU   chrF   METEOR   BERTScore   Yisi-1
  WMT      0.75   0.76   0.75     0.80        0.82
  WMT.p    0.91   0.82   0.84     0.90        0.91

Table 6: WMT 2018 English→German: Kendall's τ.

6 Why Paraphrases?

While the top WMT submissions use very similar approaches, there are some techniques in MT that are known to produce more natural (less translationese) output than others. We run experiments with a variety of models whose actual quality has been shown to correlate poorly with automatic metrics. In particular, we focus on back-translation (Sennrich et al., 2016) and Automatic Post-Editing (APE; Freitag et al., 2019) augmented systems trained on WMT 2014 English→German. All these systems have in common that they generate less translationese output, and thus BLEU with translationese references under-estimates their quality. The experiment in this section follows the setup described in Freitag et al. (2019).

We run adequacy evaluation on WMT newstest2019 for the 3 systems, as described in Section 4.2. Both the APE and the BT models, which use additional target-side monolingual data, are rated higher by humans than the system relying only on bitext. Table 7 summarizes the BLEU scores for our different reference translations. All references generated with human translations (WMT, HQ(R) and HQ(all 4)) show negative correlation with human ratings for these extreme cases and produce the wrong order. On the other hand, all references that rely purely on paraphrased references do produce the correct ranking of these three systems. This further suggests that reference translations based on human translations bias the metrics to generate higher scores for translationese outputs. By paraphrasing the reference translations, we undo this bias, and the metric can measure the true quality of the underlying systems with greater accuracy.

  Reference    bitext   APE    BT     correct?
  human        84.5     86.1   87.8   yes
  WMT          39.4     34.6   37.9   no
  WMT.p        12.5     12.7   12.9   yes
  HQ(R)        35.0     32.1   34.9   no
  HQ(P)        12.4     12.8   13.0   yes
  HQ(all 4)    27.2     25.8   27.5   no

Table 7: BLEU scores for WMT newstest2019 English→German for MT systems trained on bitext, augmented with BT, or using APE as a text naturalizer (first row: human adequacy ratings). The "correct?" column indicates whether the model ranking agrees with human judgments.

This finding, that existing reference translation methodology may systematically bias against modelling techniques known to improve human-judged quality, raises the question of whether previous research has incorrectly discarded approaches that actually improved the quality of MT. Releasing all reference translations gives the community a chance to revisit some of their decisions and measure quality differences for high-quality systems.

7 Characterizing Paraphrases

7.1 Alignment

One typical characteristic of translationese is that humans prefer to translate a sentence phrase by phrase instead of coming up with a different sentence structure, resulting in 'monotonic' translations. To measure the monotonicity of the different reference translations, we compute an alignment with fast-align (Dyer et al., 2013) on the WMT 2014 English-German parallel data and compare the alignments of all four references. Table 8 summarizes, for each reference, the average absolute distance per alignment point. The paraphrased translations are less monotonic and use a different sentence structure than a pure human translation.

  WMT    AR     WMT.p   AR.p
  5.17   5.27   6.43    6.88

Table 8: Average absolute distance per alignment point, as a proxy for word-by-word ('monotonic') translation. Lower scores indicate more monotonic translation.
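Table 8 can be approximated from fast_align output in the usual Pharaoh "i-j" format, one line per sentence. The exact normalization behind "average absolute distance per alignment point" is not spelled out, so the sketch below is one plausible reading (mean |i - j| over all alignment links) rather than the authors' exact script:

```python
def average_alignment_distance(alignment_lines):
    """alignment_lines: iterable of fast_align output lines, e.g. '0-0 1-2 2-1'.
    Returns the mean absolute distance |i - j| over all alignment links."""
    total, links = 0, 0
    for line in alignment_lines:
        for pair in line.split():
            i, j = map(int, pair.split("-"))
            total += abs(i - j)
            links += 1
    return total / links if links else 0.0

# with open("wmt19.ende.align") as f:   # hypothetical file name
#     print(average_alignment_distance(f))
```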
7.2 Matched n-grams

The actual BLEU scores calculated with the paraphrased references are much lower compared to BLEU scores calculated with standard references (Table 4). Nevertheless, the paraphrased references show higher correlation with human judgment, which motivates us to investigate which n-grams of the MT output are actually matching the paraphrased references during BLEU calculation. The n-grams responsible for the most overlap with standard references are generic, common n-grams. In the winning submission of the WMT 2019 English→German evaluation campaign from Facebook, the 4-grams with the highest number of matches are:

• , sagte er . → 28 times (, he said.)
• " , sagte er → 14 times (", he said)
• fügte hinzu , dass → 8 times (added that)

These matches are crucial to reach high (>40) BLEU scores, and appear in translations that use the same sentence structure as the source sentence. On the other hand, the n-grams overlapping with the paraphrased references show a different picture: they usually reward n-grams that express the semantic meaning of the sentence. The 4-grams with the highest number of matches with the paraphrased references for the same system are:

• Wheeling , West Virginia → 3 times (Wheeling, West Virginia)
• von Christine Blasey Ford → 3 times (from Christine Blasey Ford)
• Erdbeben der Stärke 7,5 → 3 times (7.5 magnitude earthquake)
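This matched n-gram analysis can be reproduced with a simple clipped n-gram count, mirroring what BLEU does internally. The sketch below glosses over tokenization (which in practice should match the BLEU tokenizer), and the function name is illustrative:

```python
from collections import Counter

def matched_ngrams(hypotheses, references, n=4):
    """Count clipped n-gram matches between system output and references,
    aggregated over the corpus; returns n-grams sorted by match count."""
    matches = Counter()
    for hyp, ref in zip(hypotheses, references):
        hyp_ngrams = Counter(zip(*[hyp.split()[i:] for i in range(n)]))
        ref_ngrams = Counter(zip(*[ref.split()[i:] for i in range(n)]))
        for ngram, count in hyp_ngrams.items():
            matches[ngram] += min(count, ref_ngrams[ngram])
    return matches.most_common()

# e.g. matched_ngrams(system_output, wmt_reference)[:3]
```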

8 Conclusions

This work presents a study on the impact of reference quality on the reliability of automated evaluation of machine translation. We consider collecting additional human translations as well as generating more diverse and natural references through paraphrasing. We observe that the paraphrased references result in more reliable automated evaluations, i.e. stronger correlation with human evaluation for the submissions of the WMT 2019 English→German evaluation campaign. These findings are confirmed across a wide range of automated metrics, including BLEU, chrF, METEOR, BERTScore and Yisi. We further demonstrate that the paraphrased references correlate especially well for the top submissions of WMT, and additionally are able to correctly distinguish baselines from systems known to produce more natural output (those augmented with either BT or APE), whose quality tends to be underestimated by references with translationese artifacts.

We explore two different approaches to multi-reference evaluation: (a) standard multi-reference BLEU, and (b) selecting the best-rated reference for each sentence. Contrary to conventional wisdom, we find that multi-reference BLEU does not exhibit better correlation with human judgments than single-reference BLEU. Combining two standard reference translations by selecting the best-rated reference, on the other hand, did increase correlation for the standard reference translations. Nevertheless, the combined paraphrased references are of higher quality for all techniques when compared to their standard reference counterparts.

We suggest using a single paraphrased reference for more reliable automatic evaluation going forward. Although a combined paraphrased reference shows slightly higher correlation for embedding-based metrics, it is over twice as expensive to construct such a reference set. To drive this point home, our experiments suggest that standard reference translations may systematically bias against modelling techniques known to improve human-judged quality, raising the question of whether previous research has incorrectly discarded approaches that actually improved the quality of MT. Releasing all reference translations gives the community a chance to revisit some of their decisions and measure quality differences for high-quality systems and modelling techniques that produce more natural or fluent output.

As a closing note, we would like to emphasize that it is more difficult for a human rater to rate a paraphrased translation than a translationese sentence, because the latter may share a similar structure and lexical choices with the source. We suspect that human evaluation is also less reliable for complex translations. Future work can investigate whether finer-grained ratings could correct the bias in favor of lower-effort ratings, and how this may interact with document-level evaluation.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 Conference on Machine Translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.

Rachel Bawden, Biao Zhang, Lisa Yankovskaya, Andre Tättar, and Matt Post. 2020. Explicit representation of the translation space: Automatic paraphrasing for machine translation evaluation.

Nikolay Bogoychev and Rico Sennrich. 2019. Domain, Translationese and Noise in Synthetic Data for Neural Machine Translation. arXiv preprint arXiv:1911.03362.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016a. Findings of the 2016 Conference on Machine Translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 131–198, Berlin, Germany. Association for Computational Linguistics.

Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 Conference on Machine Translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 272–303, Brussels, Belgium. Association for Computational Linguistics.

Ondřej Bojar, Christian Federmann, Barry Haddow, Philipp Koehn, Matt Post, and Lucia Specia. 2016b. Ten Years of WMT Evaluation Campaigns: Lessons Learnt. In Translation Evaluation: From Fragmented Tools and Data Sets to an Integrated Ecosystem, page 27.

Chris Callison-Burch. 2009. Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon's Mechanical Turk. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 286–295, Singapore. Association for Computational Linguistics.

Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2008. Further Meta-evaluation of Machine Translation. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 70–106, Stroudsburg, PA, USA. Association for Computational Linguistics.

Chris Callison-Burch, Miles Osborne, and Philipp Koehn. 2006. Re-evaluating the Role of Bleu in Machine Translation Research. In 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento, Italy. Association for Computational Linguistics.

Marine Carpuat and Michel Simard. 2012. The Trouble with SMT Consistency. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 442–449. Association for Computational Linguistics.

George Doddington. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research, pages 138–145, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Markus Dreyer and Daniel Marcu. 2012. HyTER: Meaning-equivalent semantics for translation evaluation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 162–171. Association for Computational Linguistics.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A Simple, Fast, and Effective Reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding Back-Translation at Scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500, Brussels, Belgium. Association for Computational Linguistics.

Sergey Edunov, Myle Ott, Marc'Aurelio Ranzato, and Michael Auli. 2019. On The Evaluation of Machine Translation Systems Trained With Back-Translation. arXiv preprint arXiv:1908.05204.

Marina Fomicheva and Lucia Specia. 2016. Reference Bias in Monolingual Machine Translation Evaluation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 77–82, Berlin, Germany. Association for Computational Linguistics.

Markus Freitag, Isaac Caswell, and Scott Roy. 2019. APE at Scale and Its Implications on MT Evaluation Biases. In Proceedings of the Fourth Conference on Machine Translation, pages 34–44, Florence, Italy. Association for Computational Linguistics.

Jonas Gehring, Michael Auli, David Grangier, and Yann N. Dauphin. 2017. A Convolutional Encoder Model for Neural Machine Translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 123–135, Vancouver, Canada.

Martin Gellerstam. 1986. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88–95. CWK Gleerup.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2013. Continuous Measurement Scales in Human Evaluation of Machine Translation. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 33–41, Sofia, Bulgaria. Association for Computational Linguistics.

Yvette Graham, Barry Haddow, and Philipp Koehn. 2019. Translationese in Machine Translation Evaluation. CoRR, abs/1906.09833.

Christian Hardmeier, Preslav Nakov, Sara Stymne, Jörg Tiedemann, Yannick Versley, and Mauro Cettolo. 2015. Pronoun-focused MT and cross-lingual pronoun prediction: Findings of the 2015 DiscoMT shared task on pronoun translation. In Proceedings of the Second Workshop on Discourse in Machine Translation, pages 1–16, Lisbon, Portugal. Association for Computational Linguistics.

David Kauchak and Regina Barzilay. 2006. Paraphrasing for automatic evaluation. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 455–462, New York City, USA. Association for Computational Linguistics.

Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press.

Philipp Koehn and Christof Monz. 2006. Manual and automatic evaluation of machine translation between European languages. In Proceedings on the Workshop on Statistical Machine Translation, pages 102–121, New York City. Association for Computational Linguistics.

Moshe Koppel and Noam Ordan. 2011. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 1318–1326, Stroudsburg, PA, USA. Association for Computational Linguistics.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. 2009. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, pages 81–88.

Sara Laviosa. 1997. How comparable can 'comparable corpora' be? Target. International Journal of Translation Studies, 9(2):289–319.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. 2012. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255–265, Stroudsburg, PA, USA. Association for Computational Linguistics.

Chi-kiu Lo. 2019. YiSi - a unified semantic MT quality evaluation and estimation metric for languages with different levels of available resources. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 507–513.

Qingsong Ma, Yvette Graham, Timothy Baldwin, and Qun Liu. 2017. Further investigation into reference bias in monolingual evaluation of machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2476–2485, Copenhagen, Denmark. Association for Computational Linguistics.

Qingsong Ma, Johnny Wei, Ondřej Bojar, and Yvette Graham. 2019. Results of the WMT19 metrics shared task: Segment-level and strong MT systems pose big challenges. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 62–90, Florence, Italy. Association for Computational Linguistics.

Joss Moorkens, Sheila Castilho, Federico Gaspari, and Stephen Doherty. 2018. Translation Quality Assessment: From Principles to Practice. Springer.

Sonja Nießen, Franz Josef Och, Gregor Leusch, and Hermann Ney. 2000. An evaluation tool for machine translation: Fast evaluation for MT research. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC'00), Athens, Greece. European Language Resources Association (ELRA).

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395.

Maja Popović. 2019. On reducing translation shifts in translations intended for MT evaluation. In Proceedings of Machine Translation Summit XVII Volume 2: Translator, Project and User Tracks, pages 80–87.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Parker Riley, Isaac Caswell, Markus Freitag, and David Grangier. 2019. Translationese as a language in "multilingual" NMT.

Aurko Roy and David Grangier. 2019. Unsupervised paraphrasing without translation. In ACL (1), pages 6033–6039. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany. Association for Computational Linguistics.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, pages 223–231.

Brian Thompson and Matt Post. 2020. Automatic machine translation evaluation in many languages via zero-shot paraphrasing. arXiv preprint arXiv:2004.14564.

Antonio Toral, Sheila Castilho, Ke Hu, and Andy Way. 2018. Attaining the unattainable? Reassessing claims of human parity in neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 113–123, Brussels, Belgium. Association for Computational Linguistics.

Zhaopeng Tu, Yang Liu, Shuming Shi, and Tong Zhang. 2018. Learning to remember translation history with a continuous cache. Transactions of the Association for Computational Linguistics, 6:407–420.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems, pages 5998–6008.

John S. White. 1994. The ARPA MT Evaluation Methodologies: Evolution, Lessons, and Further Approaches. In Proceedings of the 1994 Conference of the Association for Machine Translation in the Americas, pages 193–205.

Rabih Zbib, Gretchen Markiewicz, Spyros Matsoukas, Richard Schwartz, and John Makhoul. 2013. Systematic comparison of professional and crowd-sourced reference translations for machine translation. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 612–616, Atlanta, Georgia. Association for Computational Linguistics.

Mike Zhang and Antonio Toral. 2019. The effect of translationese in machine translation test sets. CoRR, abs/1906.08069.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.

Liang Zhou, Chin-Yew Lin, and Eduard Hovy. 2006. Re-evaluating Machine Translation Results with Paraphrase Support. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 77–84, Sydney, Australia. Association for Computational Linguistics.