arXiv:2104.05700v1 [cs.CL] 12 Apr 2021

Macro-Average: Rare Types Are Important Too

Thamme Gowda
Information Sciences Institute, University of Southern California
[email protected]

Weiqiu You
Dept of Computer and Information Science, University of Pennsylvania
[email protected]

Constantine Lignos
Michtom School of Computer Science, Brandeis University
[email protected]

Jonathan May
Information Sciences Institute, University of Southern California
[email protected]

Abstract

While traditional corpus-level evaluation metrics for machine translation (MT) correlate well with fluency, they struggle to reflect adequacy. Model-based MT metrics trained on segment-level human judgments have emerged as an attractive replacement due to strong correlation results. These models, however, require potentially expensive re-training for new domains and languages. Furthermore, their decisions are inherently non-transparent and appear to reflect unwelcome biases. We explore the simple type-based classifier metric, MACROF1, and study its applicability to MT evaluation. We find that MACROF1 is competitive on direct assessment, and outperforms others in indicating downstream cross-lingual information retrieval task performance. Further, we show that MACROF1 can be used to effectively compare supervised and unsupervised neural machine translation, and reveal significant qualitative differences in the methods' outputs.¹

¹ Tools and analysis are available at https://github.com/thammegowda/007-mt-eval-macro. MT evaluation metrics are at https://github.com/isi-nlp/sacrebleu/tree/macroavg-naacl21.

1 Introduction

Model-based metrics for evaluating machine translation such as BLEURT (Sellam et al., 2020), ESIM (Mathur et al., 2019), and YiSi (Lo, 2019) have recently attracted attention due to their superior correlation with human judgments (Ma et al., 2019). However, BLEU (Papineni et al., 2002) remains the most widely used corpus-level MT metric. It correlates reasonably well with human judgments, and moreover is easy to understand and cheap to calculate, requiring only reference translations in the target language. By contrast, model-based metrics require tuning on thousands of examples of human evaluation for every new target language or domain (Sellam et al., 2020). Model-based metric scores are also opaque and can hide undesirable biases, as can be seen in Table 1.

  Reference:  You must be a doctor.
  Hypothesis: ____ must be a doctor.
    He        -0.735
    Joe       -0.975
    Sue       -1.043
    She       -1.100

  Reference:  It is the greatest country in the world.
  Hypothesis: ____ is the greatest country in the world.
    France    -0.022
    America   -0.060
    Russia    -0.161
    Canada    -0.309

Table 1: A demonstration of BLEURT's internal biases; model-free metrics like BLEU would consider each of the errors above to be equally wrong.

The source of model-based metrics' (e.g. BLEURT) correlative superiority over model-free metrics (e.g. BLEU) appears to be the former's ability to focus evaluation on adequacy, while the latter are overly focused on fluency. BLEU and most other generation metrics consider each output token equally. Since natural language is dominated by a few high-count types, an MT model that concentrates on getting its ifs, ands and buts right will benefit from BLEU in the long run more than one that gets its xylophones, peripatetics, and defenestrates right. Can we derive a metric with the discriminating power of BLEURT that does not share its bias or expense and is as interpretable as BLEU?

As it turns out, the metric may already exist and be in common use. Information extraction and other areas concerned with classification have long used both micro averaging, which treats each token equally, and macro averaging, which instead treats each type equally, when evaluating. The latter in particular is useful when seeking to avoid results dominated by overly frequent types. In this work we take a classification-based approach to evaluating machine translation in order to obtain an easy-to-calculate metric that focuses on adequacy as much as BLEURT but does not have the expensive overhead, opacity, or bias of model-based methods.

Our contributions are as follows: We consider MT as a classification task, and thus admit MACROF1 as a legitimate approach to evaluation (Section 2). We show that MACROF1 is competitive with other popular methods at tracking human judgments in translation (Section 3.2). We offer an additional justification of MACROF1 as a performance indicator on adequacy-focused downstream tasks such as cross-lingual information retrieval (Section 3.3). Finally, we demonstrate that MACROF1 is just as good as the expensive BLEURT at discriminating between structurally different MT approaches in a way BLEU cannot, especially regarding the adequacy of generated text, and provide a novel approach to qualitative analysis of the effect of metrics choice on quantitative evaluation (Section 4).

2 NMT as Classification

Neural machine translation (NMT) models are often viewed as pairs of encoder-decoder networks. Viewing NMT as such is useful in practice for implementation; however, such a view is inadequate for theoretical analysis. Gowda and May (2020) provide a high-level view of NMT as two fundamental ML components: an autoregressor and a classifier. Specifically, NMT is viewed as a multi-class classifier that operates on representations from an autoregressor. We may thus consider classifier-based evaluation metrics.

Consider a test corpus, $T = \{(x^{(i)}, h^{(i)}, y^{(i)}) \mid i = 1, 2, 3, \ldots, m\}$ where $x^{(i)}$, $h^{(i)}$, and $y^{(i)}$ are source, system hypothesis, and reference translation, respectively. Let $x = \{x^{(i)} \; \forall i\}$ and similar for $h$ and $y$. Let $V_h$, $V_y$, $V_{h \cap y}$, and $V$ be the vocabulary of $h$, the vocabulary of $y$, $V_h \cap V_y$, and $V_h \cup V_y$, respectively. For each class $c \in V$,

\[
\mathrm{PREDS}(c) = \sum_{i=1}^{m} C(c, h^{(i)})
\]
\[
\mathrm{REFS}(c) = \sum_{i=1}^{m} C(c, y^{(i)})
\]
\[
\mathrm{MATCH}(c) = \sum_{i=1}^{m} \min\{C(c, h^{(i)}),\, C(c, y^{(i)})\}
\]

where $C(c, a)$ counts the number of tokens of type $c$ in sequence $a$ (Papineni et al., 2002). For each class $c \in V_{h \cap y}$, precision ($P_c$), recall ($R_c$), and $F_\beta$ measure ($F_{\beta;c}$) are computed as follows:²

\[
P_c = \frac{\mathrm{MATCH}(c)}{\mathrm{PREDS}(c)}; \qquad R_c = \frac{\mathrm{MATCH}(c)}{\mathrm{REFS}(c)}
\]
\[
F_{\beta;c} = (1 + \beta^2)\, \frac{P_c \times R_c}{\beta^2 \times P_c + R_c}
\]

The macro-average consolidates individual performance by averaging by type, while the micro-average averages by token:

\[
\mathrm{MACROF}_\beta = \frac{\sum_{c \in V} F_{\beta;c}}{|V|}
\]
\[
\mathrm{MICROF}_\beta = \frac{\sum_{c \in V} f(c) \times F_{\beta;c}}{\sum_{c' \in V} f(c')}
\]

where $f(c) = \mathrm{REFS}(c) + k$ for smoothing factor $k$.³ We scale $\mathrm{MACROF}_\beta$ and $\mathrm{MICROF}_\beta$ values to percentile, similar to BLEU, for the sake of easier readability.

² We consider $F_{\beta;c}$ for $c \notin V_{h \cap y}$ to be 0.
³ We use $k = 1$. When $k \to \infty$, $\mathrm{MICROF}_\beta \to \mathrm{MACROF}_\beta$.

3 Justification for MACROF1

In the following sections, we verify and justify the utility of MACROF1 while also offering a comparison with popular alternatives such as MICROF1, BLEU, CHRF1, and BLEURT.⁴ We use Kendall's rank correlation coefficient, $\tau$, to compute the association between metrics and human judgments. Correlations with p-values smaller than $\alpha = 0.05$ are considered to be statistically significant.

⁴ BLEU and CHRF1 scores reported in this work are computed with SACREBLEU; see the Appendix for details. BLEURT scores are from the base model (Sellam et al., 2020). We consider two varieties of averaging to obtain a corpus-level metric from the segment-level BLEURT: mean and median of segment-level scores per corpus.

3.1 Data-to-Text: WebNLG

We use the 2017 WebNLG Challenge dataset (Gardent et al., 2017; Shimorina, 2018)⁵ to analyze the differences between micro- and macro-averaging. WebNLG is a task of generating English text for sets of triples extracted from DBPedia. Human annotations are available for a sample of 223 records each from nine NLG systems. The human judgments provided have three linguistic aspects (fluency, grammar, and semantics⁶), which enable us to perform a fine grained analysis of our metrics. We compute Kendall's $\tau$ between metrics and human judgments, which are reported in Table 2.

⁵ https://gitlab.com/webnlg/webnlg-human-evaluation

  Name            Fluency & Grammar   Semantics
  BLEU            .444 ×              .500 ×
  CHRF1           .278 ×              .778
  MACROF1         .222 ×              .722
  MICROF1         .333 ×              .611
  BLEURTmean      .444 ×              .833
  BLEURTmedian    .611                .667

Table 2: WebNLG data-to-text task: Kendall's $\tau$ between system-level MT metric scores and human judgments. Fluency and grammar are correlated identically by all metrics. Values that are not significant at $\alpha = 0.05$ are indicated by ×.

As seen in Table 2, the metrics exhibit much variance in agreements with human judgments. For instance, BLEURTmedian is the best indicator of fluency and grammar, however BLEURTmean is best on semantics.

[...] 2019).⁷ We first compute scores from each MT metric, and then calculate the correlation $\tau$ with human judgments.

As there are many language pairs and translation directions in each year, we report only the mean and median of $\tau$, and number of wins per metric for each year in Table 3. We have excluded BLEURT from comparison in this section since the BLEURT models are fine-tuned on the same datasets on which we are evaluating the other methods.⁸ CHRF1 has the strongest mean and median agreement with human judgments across the years. In 2018 and 2019, both MACROF1 and MICROF1 mean and median agreements outperform BLEU whereas in 2017 BLEU was better than MACROF1 and MICROF1.

As seen in Section 3.1, MACROF1 weighs towards semantics whereas MICROF1 and BLEU weigh towards fluency and grammar. This indicates that recent MT systems are mostly fluent, and [...]
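
The Section 2 definitions (PREDS, REFS, MATCH, the per-type F-measure, and the macro and micro averages) can be illustrated with a minimal Python sketch. It rests on several assumptions that are not in the paper: tokens are obtained by whitespace splitting, the function names `type_stats`, `f_beta`, and `macro_micro_f` are hypothetical, and "scaling to percentile" is read as multiplying by 100. The authors' own implementation is in their SACREBLEU fork and may differ in tokenization and other details.

```python
from collections import Counter


def type_stats(hyps, refs):
    """Accumulate PREDS(c), REFS(c) and MATCH(c) over all segment pairs."""
    preds, ref_counts, match = Counter(), Counter(), Counter()
    for hyp, ref in zip(hyps, refs):
        h = Counter(hyp.split())  # C(c, h^(i)); whitespace tokens are an assumption
        y = Counter(ref.split())  # C(c, y^(i))
        preds.update(h)
        ref_counts.update(y)
        match.update(h & y)       # Counter intersection keeps the per-type minimum
    return preds, ref_counts, match


def f_beta(c, preds, ref_counts, match, beta=1.0):
    """Per-type F-measure; 0 when the type is never matched."""
    if match[c] == 0:
        # Covers types outside the shared vocabulary (footnote 2) and
        # guards the 0/0 case when a shared type never matches in a segment.
        return 0.0
    p = match[c] / preds[c]
    r = match[c] / ref_counts[c]
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


def macro_micro_f(hyps, refs, beta=1.0, k=1.0):
    """Return (MacroF, MicroF), both multiplied by 100 (assumed scaling)."""
    preds, ref_counts, match = type_stats(hyps, refs)
    vocab = set(preds) | set(ref_counts)  # V = V_h union V_y
    scores = {c: f_beta(c, preds, ref_counts, match, beta) for c in vocab}
    macro = sum(scores.values()) / len(vocab)
    weights = {c: ref_counts[c] + k for c in vocab}  # f(c) = REFS(c) + k
    micro = sum(weights[c] * scores[c] for c in vocab) / sum(weights.values())
    return 100 * macro, 100 * micro


if __name__ == "__main__":
    hyps = ["the cat sat", "a dog"]
    refs = ["the cat sat down", "the dog"]
    macro, micro = macro_micro_f(hyps, refs)
    print(f"MacroF1 = {macro:.2f}, MicroF1 = {micro:.2f}")
```

In this toy corpus the unmatched types "a" and "down" each contribute a zero to the macro average with the same weight as any other type, while the micro average weights each type by its smoothed reference count, so the frequent type "the" counts for more.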
