Macro-Average: Rare Types Are Important Too Thamme Gowda Weiqiu You Information Sciences Institute Dept of Computer and Information Science University of Southern California University of Pennsylvania [email protected] [email protected]

Constantine Lignos Jonathan May Michtom School of Computer Science Information Sciences Institute Brandeis University University of Southern California [email protected] [email protected]

Abstract (Sellam et al., 2020). Model-based metric scores are also opaque and can hide undesirable biases, as While traditional corpus-level evaluation met- can be seen in Table1. rics for machine translation (MT) correlate well with fluency, they struggle to reflect ad- Reference: You must be a doctor. equacy. Model-based MT metrics trained on Hypothesis: must be a doctor. segment-level human judgments have emerged He -0.735 as an attractive replacement due to strong cor- Joe -0.975 relation results. These models, however, re- Sue -1.043 She -1.100 quire potentially expensive re-training for new Reference: It is the greatest country in the world. domains and languages. Furthermore, their Hypothesis: is the greatest country in the world. decisions are inherently non-transparent and France -0.022 appear to reflect unwelcome biases. We ex- America -0.060 plore the simple type-based classifier metric, Russia -0.161 Canada -0.309 MACROF1, and study its applicability to MT evaluation. We find that MACROF1 is com- Table 1: A demonstration of BLEURT’s internal bi- petitive on direct assessment, and outperforms ases; model-free metrics like BLEU would consider others in indicating downstream cross-lingual each of the errors above to be equally wrong. information retrieval task performance. Fur- ther, we show that MACROF1 can be used to The source of model-based metrics’ (e.g. effectively compare supervised and unsuper- BLEURT) correlative superiority over model-free vised neural machine translation, and reveal significant qualitative differences in the meth- metrics (e.g. BLEU) appears to be the former’s ods’ outputs.1 ability to focus evaluation on adequacy, while the latter are overly focused on fluency. BLEU and 1 Introduction most other generation metrics consider each output token Model-based metrics for evaluating machine trans- equally. Since natural language is dominated lation such as BLEURT (Sellam et al., 2020), ESIM by a few high-count types, an MT model that con- if and but (Mathur et al., 2019), and YiSi (Lo, 2019) have re- centrates on getting its s, s and s right will cently attracted attention due to their superior cor- benefit from BLEU in the long run more than one xylophone peripatetic defen- relation with human judgments (Ma et al., 2019). that gets its s, s, and estrates right. Can we derive a metric with the However, BLEU (Papineni et al., 2002) remains the most widely used corpus-level MT metric. It corre- discriminating power of BLEURT that does not lates reasonably well with human judgments, and share its bias or expense and is as interpretable as moreover is easy to understand and cheap to cal- BLEU? culate, requiring only reference translations in the As it turns out, the metric may already exist and target language. By contrast, model-based metrics be in common use. Information extraction and require tuning on thousands of examples of human other areas concerned with classification have long evaluation for every new target language or domain used both micro averaging, which treats each to- ken equally, and macro averaging, which instead 1 Tools and analysis are available at https://github. treats each type equally, when evaluating. The lat- com/thammegowda/007-mt-eval-macro. MT evaluation met- rics are at https://github.com/isi-nlp/sacrebleu/tree/ ter in particular is useful when seeking to avoid macroavg-naacl21. results dominated by overly frequent types. In this 1138

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1138–1157 June 6–11, 2021. ©2021 Association for Computational Linguistics work we take a classification-based approach to where C(c,a) counts the number of tokens of type evaluating machine translation in order to obtain an c in sequence a (Papineni et al., 2002). For each easy-to-calculate metric that focuses on adequacy class c ∈ Vh∩y, precision (Pc), recall (Rc), and Fβ 2 as much as BLEURT but does not have the ex- measure (Fβ;c) are computed as follows: pensive overhead, opacity, or bias of model-based methods. MATCH(c) MATCH(c) Pc = ; Rc = Our contributions are as follows: We con- PREDS(c) REFS(c) sider MT as a classification task, and thus ad- P ×R F = (1+ 2) c c β;c β 2 mit MACROF1 as a legitimate approach to eval- β ×Pc +Rc uation (Section2). We show that MACROF1 is competitive with other popular methods at track- The macro-average consolidates individual per- ing human judgments in translation (Section 3.2). formance by averaging by type, while the micro- We offer an additional justification of MACROF1 average averages by token: as a performance indicator on adequacy-focused downstream tasks such as cross-lingual informa- ∑c∈V Fβ;c MACROF = tion retrieval (Section 3.3). Finally, we demonstrate β SVS that MACROF1 is just as good as the expensive ∑c∈V f (c)×Fβ;c ICRO = BLEURT at discriminating between structurally M Fβ ′ ∑c′∈V f (c ) different MT approaches in a way BLEU cannot, especially regarding the adequacy of generated text, where f (c) = REFS(c)+k for smoothing factor k.3 and provide a novel approach to qualitative analy- We scale MACROFβ and MICROFβ values to per- sis of the effect of metrics choice on quantitative centile, similar to BLEU, for the sake of easier evaluation (Section4). readability.

2 NMT as Classification 3 Justification for MACROF1 Neural machine translation (NMT) models are of- In the following sections, we verify and justify the ten viewed as pairs of encoder-decoder networks. utility of MACROF1 while also offering a compar- Viewing NMT as such is useful in practice for ison with popular alternatives such as MICROF1, 4 implementation; however, such a view is inade- BLEU, CHRF1, and BLEURT. We use Kendall’s quate for theoretical analysis. Gowda and May rank correlation coefficient, τ, to compute the as- (2020) provide a high-level view of NMT as two sociation between metrics and human judgments. fundamental ML components: an autoregressor Correlations with p-values smaller than α = 0.05 and a classifier. Specifically, NMT is viewed as a are considered to be statistically significant. multi-class classifier that operates on representa- tions from an autoregressor. We may thus consider 3.1 Data-to-Text: WebNLG classifier-based evaluation metrics. We use the 2017 WebNLG Challenge dataset (Gar- (i) (i) (i) Consider a test corpus, T = {(x ,h ,y )Si = dent et al., 2017; Shimorina, 2018)5 to analyze (i) (i) (i) 1,2,3...m} where x , h , and y are source, sys- the differences between micro- and macro- aver- tem hypothesis, and reference translation, respec- aging. WebNLG is a task of generating English (i) tively. Let x = {x ∀i} and similar for h and y. Let text for sets of triples extracted from DBPedia. Hu- Vh,Vy,Vh∩y, and V be the vocabulary of h, the vo- man annotations are available for a sample of 223 cabulary of y, Vh ∩Vy, and Vh ∪Vy, respectively. For records each from nine NLG systems. The human each class c ∈ V, 2 We consider Fβ;c for c ∈~ Vh∩y to be 0. m 3 (i) We use k = 1. When k → ∞,MICROFβ → MACROFβ . PREDS(c) = QC(c,h ) 4 BLEU and CHRF1 scores reported in this work are i= 1 computed with SACREBLEU; see the Appendix for details. m (i) BLEURT scores are from the base model (Sellam et al., 2020). REFS(c) = QC(c,y ) We consider two varieties of averaging to obtain a corpus-level i=1 metric from the segment-level BLEURT: mean and median of m segment-level scores per corpus. (i) (i) 5 MATCH(c) = Qmin{C(c,h ),C(c,y )} https://gitlab.com/webnlg/ i=1 webnlg-human-evaluation 1139 Name Fluency & Grammar Semantics 2019).7 We first compute scores from each MT × × BLEU .444 .500 × metric, and then calculate the correlation τ with CHRF1 .278 .778 × human judgments. MACROF1 .222 .722 × MICROF1 .333 .611 As there are many language pairs and transla- × BLEURTmean .444 .833 tion directions in each year, we report only the BLEURTmedian .611 .667 mean and median of τ, and number of wins per Table 2: WebNLG data-to-text task: Kendall’s τ between metric for each year in Table3. We have excluded system-level MT metric scores and human judgments. Fluency and grammar are correlated identically by all metrics. Values BLEURT from comparison in this section since that are not significant at α = 0.05 are indicated by ×. the BLEURT models are fine-tuned on the same datasets on which we are evaluating the other meth- ods.8 CHRF has the strongest mean and median judgments provided have three linguistic aspects— 1 agreement with human judgments across the years. fluency, grammar, and semantics6—which enable In 2018 and 2019, both MACROF and MICROF us to perform a fine grained analysis of our met- 1 1 mean and median agreements outperform BLEU rics. We compute Kendall’s τ between metrics and whereas in 2017 BLEU was better than MACROF human judgments, which are reported in Table2. 1 and MICROF . As seen in Table2, the metrics exhibit much 1 As seen in Section 3.1, MACROF weighs to- variance in agreements with human judgments. For 1 wards semantics whereas MICROF1 and BLEU instance, BLEURTmedian is the best indicator of weigh towards fluency and grammar. This indi- fluency and grammar, however BLEURTmean is cates that recent MT systems are mostly fluent, and best on semantics. BLEURT, being a model-based adequacy is the key discriminating factor amongst measure that is directly trained on human judg- them. BLEU served well in the early era of sta- ments, scores relatively higher than others. Consid- tistical MT when fluency was a harder objective. ering the model-free metrics, CHRF does well on 1 Recent advancements in neural MT models such semantics but poorly on fluency and grammar com- as Transformers (Vaswani et al., 2017) produce flu- pared to BLEU. Not surprisingly, both MICROF 1 ent outputs, and have brought us to an era where and MACROF , which rely solely on unigrams, are 1 semantic adequacy is the focus. poor indicators of fluency and grammar compared to BLEU, however MACROF1 is clearly a better in- 3.3 Cross-Lingual Information Retrieval dicator of semantics than BLEU. The discrepancy In this section, we determine correlation between between MICROF1 and MACROF1 regarding their agreement with fluency, grammar, and semantics MT metrics and downstream cross-lingual infor- is expected: micro-averaging pays more attention mation retrieval (CLIR) tasks. CLIR is a kind of to function words (as they are frequent types) that information retrieval (IR) task in which documents contribute to fluency and grammar whereas macro- in one language are retrieved given queries in an- averaging pays relatively more attention to the con- other (Grefenstette, 2012). A practical solution tent words that contribute to semantic adequacy. to CLIR is to translate source documents into the The take away from this analysis is as follows: query language using an MT model, then use a monolingual IR system to match queries with trans- MACROF1 is a strong indicator of semantic ade- quacy, however, it is a poor indicator of fluency. lated documents. Correlation between MT and IR metrics is accomplished in the following steps: We recommend using either MACROF1 or CHRF1 when semantic adequacy and not fluency is a de- 1. Build a set of MT models and measure their sired goal. performance using MT metrics. 2. Using each MT model in the set, translate all 3.2 Machine Translation: WMT Metrics source documents to the target language, build an IR model, and measure IR performance on In this section, we verify how well the metrics translated documents. agree with human judgments using Workshop on 3. For each MT metric, find the correlation be- Machine Translation (WMT) metrics task datasets tween the set of MT scores and their correspond- for 2017–2019 (Bojar et al., 2017; Ma et al., 2018, ing set of IR scores. The MT metric that has a 6Fluency and grammar, which are elicited with nearly 7 identical directions (Gardent et al., 2017), are identically cor- http://www.statmt.org/wmt19/metrics-task.html 8 related. https://github.com/google-research/bleurt 1140 Year Pairs ⋆BLEU BLEU MACROF1 MICROF1 CHRF1 Mean .751 .771 .821 .818 .841 2019 18 Median .782 .752 .844 .844 .875 Wins 3 3 6 3 5 Mean .858 .857 .875 .873 .902 2018 14 Median .868 .868 .901 .879 .919 Wins 1 2 3 2 6 Mean .752 .713 .714 .742 .804 2017 13 Median .758 .733 .735 .728 .791 Wins 5 4 2 2 6 Table 3: WMT 2017–19 Metrics task: Mean and median Kendall’s τ between MT metrics and human judgments. Correlations that are not significant at α = 0.05 are excluded from the calculation of mean, and median, and wins. See Appendix Tables9, 10, and 11 for full details. ⋆BLEU is pre-computed scores available in the metrics packages. In 2018 and 2019, both MACROF1 and MICROF1 outperform BLEU,MACROF1 outperforms MICROF1.CHRF1 has strongest mean and median agreements across the years. Judging based on the number of wins, MACROF1 has steady progress over the years, and outperforms others in 2019.

stronger correlation with the IR metric(s) is more CLIR task performance as well as MACROF1, as useful than the ones with weaker correlations. CLIR tasks require faithful meaning equivalence 4. Repeat the above steps on many languages to across the language boundary, and human transla- verify the generalizability of findings. tors can mistake fluent output for proper transla- An essential resource of this analysis is a dataset tions (Callison-Burch et al., 2007). with human annotations for computing MT and 3.3.2 Europarl Datasets IR performances. We conduct experiments on two datasets: firstly, on data from the 2020 workshop We perform a similar analysis to Section 3.3.1 but on Cross-Language Search and Summarization of on another cross-lingual task set up by Lignos et al. Text and Speech (CLSSTS) (Zavorin et al., 2020), (2019) for Czech → English (CS-EN) and German and secondly, on data originally from Europarl, → English (DE-EN), using publicly available data prepared by Lignos et al.(2019) (Europarl). from the Europarl v7 corpus (Koehn, 2005). This task differs from the CLSSTS task (Section 3.3.1) 3.3.1 CLSSTS Datasets in several ways. Firstly, MT metrics are computed CLSSTS datasets contain queries in English (EN), on test sets from the news domain, whereas IR and documents in many source languages along metrics are from the Europarl domain. The do- with their human translations, as well as query- mains are thus intentionally mismatched between document relevance judgments. We use three MT and IR tests. Secondly, since there are no source languages: Lithuanian (LT), Pashto (PS), queries specifically created for the Europarl do- and Bulgarian (BG). The performance of this CLIR main, GOV2 TREC topics 701–850 are used as task is evaluated using two IR measures: Actual domain-relevant English queries. And lastly, since Query Weighted Value (AQWV) and Mean Av- there are no query-document relevance human judg- erage Precision (MAP). AQWV9 is derived from ments for the chosen query and document sets, the Actual Term Weighted Value (ATWV) metric (Weg- documents retrieved by BM25 (Jones et al., 2000) mann et al., 2013). on the English set for each query are treated as We use a single CLIR system (Boschee et al., relevant documents for computing the performance 2019) with the same IR settings for all MT models of the CS-EN and DE-EN CLIR setup. As a result, in the set, and measure Kendall’s τ between MT IR metrics that rely on boolean query-document and IR measures. The results, in Table4, show that relevance judgments as ground truth are less infor- MACROF1 is the strongest indicator of CLIR down- mative, and we use Rank-Based Overlap (RBO; stream task performance in five out of six settings. p = 0.98) (Webber et al., 2010) as our IR metric. AQWV and MAP have a similar trend in agree- We perform our analysis on the same experi- 10 ment to the MT metrics. CHRF1 and BLEURT, ments as Lignos et al.(2019). NMT models which are strong contenders when generated text for CS-EN and DE-EN translation are trained us- is directly evaluated by humans, do not indicate ing a convolutional NMT architecture (Gehring

9 10 https://www.nist.gov/system/files/documents- https://github.com/ConstantineLignos/ /2017/10/26/aqwv_derivation.pdf mt-clir-emnlp-2019 1141 Domain IR Score BLEU MACROF1 MICROF1 CHRF1 BLEURTmean BLEURTmedian AQWV .429 ×.363 .508 ×.385 .451 .420 In MAP .495 .429 .575 .451 .473 .486 LT-EN AQWV ×.345 .527 .491 .491 .491 .477 In+Ext MAP ×.273 ×.455 ×.418 ×.418 ×.418 ×.404 AQWV .559 .653 .574 .581 .584 .581 In MAP .493 .632 .487 .494 .558 .554 PS-EN AQWV .589 .682 .593 .583 .581 .571 In+Ext MAP .519 .637 .523 .482 .536 .526 AQWV ×.455 .550 .527 ×.382 ×.418 .418 In MAP .491 .661 .564 .491 .527 .527 BG-EN AQWV ×.257 .500 ×.330 ×.404 ×.367 ×.367 In+ext MAP ×.183 ×.426 ×.257 ×.330 ×.294 ×.294 Table 4: CLSSTS CLIR task: Kendall’s τ between IR and MT metrics under study. The rows with Domain=In are where MT and IR scores are computed on the same set of documents, whereas Domain=In+Ext are where IR scores are computed on a larger set of documents that is a superset of segments on which MT scores are computed. Bold values are the best correlations achieved in a row-wise setting; values with × are not significant at α = 0.05.

È BLEU MACROF1 MICROF1 CHRF1 BT BT tion (SNMT) systems. In this section we leverage CS-EN .850 .867 .850 .850 .900 .867 MACROF to investigate differences in the transla- DE-EN .900 .900 .900 .912 .917 .900 1 tions from UNMT and SNMT systems that have Table 5: Europarl CLIR task: Kendall’s between τ similar BLEU. MT metrics and RBO. BT and BTÈ are short for BLEURTmean and BLEURTmedian. All correlations We compare UNMT and SNMT for English ↔ are significant at α = 0.05. German (EN-DE, DE-EN), English ↔ French (EN- FR, FR-EN), and English ↔ Romanian (EN-RO, RO-EN). All our UNMT models are based on XLM et al., 2017) implemented in the FAIRSeq (Ott et al., (Conneau and Lample, 2019), pretrained by Yang 2019) toolkit. For each of CS-EN and DE-EN, a (2020). We choose SNMT models with similar total of 16 NMT models that are based on different BLEU on common test sets by either selecting from quantities of training data and BPE hyperparame- systems submitted to previous WMT News Trans- ter values are used. The results in Table5 show lation shared tasks (Bojar et al., 2014, 2016) or by that BLEURT has the highest correlation in both building such systems.12 Specific SNMT models cases. Apart from the trained BLEURTmedian met- chosen are in the Appendix (Table 12). ric, MACROF1 scores higher than the others on Table6 shows performance for these three lan- CS-EN, and is competitive on CS-EN. MACROF1 is not the metric with highest IR task correlation guage pairs using a variety of metrics. Despite in this setting, unlike in Section 3.3.1, however it comparable scores in BLEU and only minor dif- ferences in MICROF1 and CHRF1, SNMT models is competitive with BLEU and CHRF1, and thus a safe choice as a downstream task performance have consistently higher MACROF1 and BLEURT indicator. than the UNMT models for all six translation direc- tions. 4 Spotting Qualitative Differences In the following section, we use a pairwise maxi- Between Supervised and Unsupervised mum difference discriminator approach to compare NMT with MACROF1 corpus-level metrics BLEU and MACROF1 on a seg- ment level. Qualitatively, we take a closer look at Unsupervised neural machine translation (UNMT) the behavior of the two metrics when comparing systems trained on massive monolingual data a translation with altered meaning to a translation without parallel corpora have made significant with differing word choices using the metric. progress recently (Artetxe et al., 2018; Lample et al., 2018a,b; Conneau and Lample, 2019; Song et al., 2019; Liu et al., 2020). In some cases, 12We were unable to find EN-DE and DE-EN systems with UNMT yields a BLEU score that is comparable comparable BLEU in WMT submissions so we built standard 11 Transformer-base (Vaswani et al., 2017) models for these using with strong supervised neural machine transla- appropriate quantity of training data to reach the desired BLEU performance. We report EN-RO results with diacritic removed 11though not, generally, the strongest to match the output of UNMT. 1142 BLEU MACROF1 MICROF1 CHRF1 BLEURTmean BLEURTmedian SN UN ∆ SN UN ∆ SN UN ∆ SN UN ∆ SN UN ∆ SN UN ∆ DE-EN 32.7 33.9 -1.2 38.5 33.6 4.9 58.7 57.9 0.8 59.9 58.0 1.9 .211 -.026 .24 .285 .067 .22 EN-DE 24.0 24.0 0.0 24.0 23.5 0.5 47.7 48.1 -0.4 53.3 52.0 1.3 -.134 -.204 .07 -.112 -.197 .09 FR-EN 31.1 31.2 -0.1 41.6 33.6 8.0 60.5 58.3 2.2 59.1 57.3 1.8 .182 .066 .17 .243 .154 .09 EN-FR 25.6 27.1 -1.5 31.9 27.3 4.6 53.0 52.3 0.7 56.0 57.7 -1.7 .104 .042 .06 .096 .063 .03 RO-EN 30.8 29.6 1.2 40.3 33.0 7.3 59.8 56.5 3.3 58.0 54.7 3.3 .004 -.058 .06 .045 -.004 .04 EN-RO 31.2 31.0 0.2 34.6 31.0 3.6 55.4 53.4 2.0 59.3 56.7 2.6 .030 -.046 .08 .027 -.038 .07 Table 6: For each language direction, UNMT (UN) models have similar BLEU to SNMT (SN) models, and CHRF1 and MICROF1 have small differences. However, MACROF1 scores differ significantly, consistently in favor of SNMT. Both corpus-level interpretations of BLEURT support the trend reflected by MACROF1, but the value differences are difficult to interpret.

δMACROF1 Fav Analysis δBLEU Fav Analysis .071 S S: synonym; U: untranslation, noun .048 S S: word order; U: word order, untranslation, ending .064 S S: synonym; U: untranslation .046 S S: spelling variation; U: synonym, word order, punctuation -.055 U U: no issues; S: untranslation .044 S S: extra determiner; U: paraphrase, synonym, number, untranslation .052 S S: synonym; U: untranslation, noun .042 S S: synonym; U: synonym, punctuation, extra adverb -.045 U U: no issues; S: untranslation -.039 U U: no issues; S: noun, verb .044 S S: synonym, word order; U: subject, trunca- -.037 U U: no issues; S: punctuation tion, word order .044 S S: synonym, tense; U: untranslation -.034 U U: no issues; S: symbol .043 S S: inflection, word order; U: number -.032 U U: no issues; S: adjective, noun -.041 U U: adjective, verb; S: omitted verb, untransla- -.032 U U: untranslation; S: tense, word order, mean- tion ing, active/passive voice .041 S S: time, word order; U: time, nouns -.031 U U: untranslation; S: word order, synonym, ex- tra_conj Table 7: Analysis of the ten DE-EN test set segments with the most favoritism in SNMT (S) or UNMT (U), according to MACROF1 (left) and BLEU (right). Fav is the favored system by metrics. The complete text of the sentences is in the Appendix, Tables 15 and 16.

4.1 Pairwise Maximum Difference level score. We define the favoritism of M toward i Discriminator as δM(i;hS,hU ): We consider cases where a metric has a strong (i;h ,h ) = (i;h )− (i;h ) (1) opinion of one translation system over another, and δM S U δM S δM U analyze whether the opinion is well justified. In If δ (i;h ,h ) > 0 then M favors the translation of order to obtain this analysis, we employ a pairwise M S U x(i) by system S over that in system U. segment-level discriminator from within a corpus- Table7 reflects the results of a manual examina- level metric, which we call favoritism. tion of the ten sentences in the DE-EN test set with We extend the definition of T from Section2 greatest magnitude favoritism; complete results to T = {x,h ,h ,y} where each of h and h is a S U S U are in the Appendix, Tables 15 and 16. Meaning- separate system’s hypothesis set for x.13 Let M altering changes such as ‘untranslation’, (wrong) be a corpus-level measure such that M(h,y) ∈ R ‘time’, and (wrong) ‘translation’ are marked in ital- and a higher value implies better translation quality. ics, while changes that do not fundamentally alter M(h(−i),y(−i)) is the corpus-level score obtained the meaning, such as ‘synonym,’ (different) ‘inflec- by excluding h(i) and y(i) from h and y, respectively. tion,’ and (different) ‘word order’ are marked in We define the benefit of segment i, δ (i;h): M plain text.14 (−i) (−i) The results indicate that MACROF generally δM(i;h) = M(h,y)−M(h ,y ) 1 favors SNMT, and with good reasons, as the fa- If δM(i;h) > 0, then i is beneficial to h with respect vored translation does not generally alter sentence to M, as the inclusion of h(i) increases the corpus- meaning, while the disfavored translation does. On

13The subscripts represent SNMT and UNMT in this case, 14Some changes, such as ‘word order’ may change meaning; though the definition is general. these are italicized or not on a case-by-case basis. 1143 th 6 δMACROF1(i,hS,hU ): .044, δBLEU(i,hS,hU ): -.00087, δBLEURT (i,hS,hU ): .97 Ref Ever since I joined Labour 32 years ago as a school pupil, provoked by the Thatcher government’s neglect that had left my comprehensive school classroom literally falling down, I’ve sought to champion better public services for those who need them most - whether as a local councillor or government minister. SNMT 32 years ago, I joined Labour as a student because of the neglect of the Thatcher government, which had led to my classroom literally collapsed, and as a result I tried to promote better public services for those who need it most, whether as a local council or ministers. UNMT Last 32 years ago, as a student, because of the disdain for the Thatcher-era government, Labour joined Labour. Problems SNMT: synonym, word_order UNMT: subject, truncation , word_order

Table 8: An example of favoritism that illustrates the differences between MACROF1 and BLEU. Translations of the DE-EN test sentence with sixth largest magnitude favoritism according to MACROF1, along with the favoritism according to BLEU (not in the top ten). UNMT’s translation does not include the second half of the sentence. MACROF1 favors SNMT, but BLEU favors UNMT. the other hand, for the ten most favored sentences set with references. according to BLEU, four do not contain meaning- Mathur et al.(2020) have recently evaluated the altering divergences in the disfavored translation. utility of popular metrics and recommend the use Importantly, none of the sentences with greatest of either CHRF1 or a model-based metric instead of favoritism according to MACROF1, all of which BLEU. We compare our MACROF1 and MICROF1 having meaning altering changes in the disfavored metrics with BLEU, CHRF1, and BLEURT (Sel- alternatives, appears in the list for BLEU. This indi- lam et al., 2020). While Mathur et al.(2020) use cates relatively bad judgment on the part of BLEU. Pearson’s correlation coefficient (r) to quantify the One case of good judgment from MACROF1 and correlation between automatic evaluation metrics bad judgment from BLEU regarding truncation is and human judgements, we instead use Kendall’s shown in Table8. rank coefficient (τ), since τ is more robust to out- From our qualitative examinations, MACROF1 liers than r (Croux and Dehon, 2010). is better than BLEU at discriminating against un- translations and trucations in UNMT. The case is 5.2 Rare Words are Important similar for FR-EN and RO-EN, except that RO- That natural language word types roughly follow a EN has more untranslations for both SNMT and Zipfian distribution is a well known phenomenon UNMT, possibly due to the smaller training data. (Zipf, 1949; Powers, 1998). The frequent types are Complete tables and annotated sentences are in the mainly so-called “stop words,” function words, and Appendix, in SectionC. other low-information types, while most content words are infrequent types. To counter this natu- 5 Related Work ral frequency-based imbalance, statistics such as inverted document frequency (IDF) are commonly 5.1 MT Metrics used to weigh the input words in applications such Many metrics have been proposed for MT eval- as information retrieval (Jones, 1972). In NLG uation, which we broadly categorize into model- tasks such as MT, where words are the output of free or model-based. Model-free metrics compute a classifier, there has been scant effort to address scores based on translations but have no signifi- the imbalance. Doddington(2002) is the only work cant parameters or hyperparameters that must be we know of in which the ‘information’ of an n- tuned a priori; these include BLEU (Papineni et al., gram is used as its weight, such that rare n-grams 2002), NIST (Doddington, 2002), TER (Snover attain relatively more importance than in BLEU. et al., 2006), and CHRF1 (Popovic´, 2015). Model- We abandon this direction for two reasons: Firstly, based metrics have a significant number of pa- as noted in that work, large amounts of data are rameters and, sometimes, external resources that required to estimate n-gram statistics. Secondly, must be set prior to use. These include METEOR unequal weighing is a bias that is best suited to (Banerjee and Lavie, 2005), BLEURT (Sellam datasets where the weights are derived from, and et al., 2020), YiSi (Lo, 2019), ESIM (Mathur et al., such biases often do not generalize to other datasets. 2019), and BEER (Stanojevic´ and Sima’an, 2014). Therefore, unlike Doddington(2002), we assign Model-based metrics require significant effort and equal weights to all n-gram classes, and in this resources when adapting to a new language or do- work we limit our scope to unigrams only. main, while model-free metrics require only a test While BLEU is a precision-oriented measure, 1144 METEOR (Banerjee and Lavie, 2005) and CHRF tic adequacy is a key discriminating factor, macro- (Popovic´, 2015) include both precision and recall, averaged measures such as MACROF1 are better similar to our methods. However, neither of these at judging the generation quality of MT models measures try to address the natural imbalance of (Section 3.2). We have found that another popular class distribution. BEER (Stanojevic´ and Sima’an, metric, CHRF1, also performs well on direct assess- 2014) and METEOR (Denkowski and Lavie, 2011) ment, however, being an implicitly micro-averaged make an explicit distinction between function and measure, it does not perform as well as MACROF1 content words; such a distinction inherently cap- on downstream CLIR tasks (Section 3.3.1). Un- tures frequency differences since function words like BLEURT, which is also adequacy-oriented, are often frequent and content words are often in- MACROF1 is directly interpretable, does not re- frequent types. However, doing so requires the quire retuning on expensive human evaluations construction of potentially expensive linguistic re- when changing language or domain, and does not sources. This work does not make any explicit dis- appear to have uncontrollable biases resulting from tinction and uses naturally occurring type counts to data effects. It is both easy to understand and to effect a similar result. calculate, and is inspectable, enabling fine-grained analysis at the level of individual word types. These 5.3 F-measure as an Evaluation Metric attributes make it a useful metric for understand- F-measure (Rijsbergen, 1979; Chinchor, 1992) is ing and addressing the flaws of current models. extensively used as an evaluation metric in classifi- For instance, we have used MACROF1 to compare cation tasks such as part-of-speech tagging, named supervised and unsupervised NMT models at the entity recognition, and sentiment analysis (Der- same operating point measured in BLEU, and deter- czynski, 2016). Viewing MT as a multi-class clas- mined that supervised models have better adequacy sifier is a relatively new paradigm (Gowda and than the current unsupervised models (Section4). May, 2020), and evaluating MT solely as a multi- Macro-average is a useful technique for address- class classifier as proposed in this work is not an ing the importance of the long tail of language, and established practice. However, we find that the MACROF1 is our first step in that direction; we an- F1 measure is sometimes used for various analy- ticipate the development of more advanced macro- ses when BLEU and others are inadequate: The averaged metrics that take advantage of higher- compare-mt tool (Neubig et al., 2019) supports order and character n-grams in the future. comparison of MT models based on F1 measure of individual types. Gowda and May(2020) use F1 of 7 Ethical Consideration individual types to uncover frequency-based bias in MT models. Sennrich et al.(2016) use corpus- Since many ML models including NMT are them- selves opaque and known to possess data-induced level unigram F1 in addition to BLEU and CHRF, biases (Prates et al., 2019), using opaque and bi- however, corpus-level F1 is computed as MICROF1. To the best of our knowledge, there is no previ- ased evaluation metrics in concurrence makes it ous work that clearly formulates the differences even harder to discover and address the flaws in between micro- and macro- averages, and justifies modeling. Hence, we have raised concerns about the opaque nature of the current model-based evalu- the use of MACROF1 for MT evaluation. ation metrics, and demonstrated examples display- 6 Discussion and Conclusion ing unwelcome biases in evaluation. We advocate the use of the MACROF1 metric, as it is easily in- We have evaluated NLG in general and MT specifi- terpretable and offers the explanation of score as cally as a multi-class classifier, and illustrated the a composition of individual type performances. In differences between micro- and macro- averages addition, MACROF1 treats all types equally, and using MICROF1 and MACROF1 as examples (Sec- has no parameters that are directly or indirectly esti- tion2). MACROF1 captures semantic adequacy mated from data sets. Unlike MACROF1, MICROF1 better than MICROF1 (Section 3.1). BLEU, being and other implicitly or explicitly micro-averaged a micro-averaged measure, served well in an era metrics assign lower importance to rare concepts when generating fluent text was at least as difficult and their associated rare types. The use of micro- as generating adequate text. Since we are now in an averaged metrics in real world evaluation could era in which fluency is taken for granted and seman- lead to marginalization of rare types. 1145 Failure Modes: The proposed MACROF1 metric References is not the best measure of fluency of text. Hence Mikel Artetxe, Gorka Labaka, Eneko Agirre, and we suggest caution while using MACROF1 to draw Kyunghyun Cho. 2018. Unsupervised neural ma- fluency related decisions. MACROF1 is inherently chine translation. In 6th International Conference concerned with words, and assumes the output lan- on Learning Representations, ICLR 2018, Vancou- ver, BC, Canada, April 30 - May 3, 2018, Confer- guage is easily segmentable into word tokens. Us- ence Track Proceedings. OpenReview.net. ing MACROF1 to evaluate translation into alphabet- ical languages such as Thai, Lao, and Khmer, that Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with im- do not use white space to segment words, requires proved correlation with human judgments. In Pro- an effective tokenizer. Absent this the method may ceedings of the ACL Workshop on Intrinsic and Ex- be ineffective; we have not tested it on languages trinsic Evaluation Measures for Machine Transla- beyond those listed in SectionB. tion and/or Summarization, pages 65–72, Ann Ar- bor, Michigan. Association for Computational Lin- Reproducibility: Our implementation of guistics. MACROF and MICROF has the same user 1 1 Ondrej Bojar, Christian Buck, Christian Federmann, experience as BLEU as implemented in SACRE- Barry Haddow, Philipp Koehn, Johannes Leveling, BLEU; signatures are provided in SectionA. In Christof Monz, Pavel Pecina, Matt Post, Herve addition, our implementation is computationally Saint-Amand, Radu Soricut, Lucia Specia, and Aleš efficient, and has the same (minimal) software Tamchyna. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the and hardware requirements as BLEU. All data Ninth Workshop on Statistical Machine Translation, for MT and NLG human correlation studies is pages 12–58, Baltimore, Maryland, USA. Associa- publicly available and documented. Data for tion for Computational Linguistics. reproducing the IR experiments in Section 3.3.2 is Ondrejˇ Bojar, Yvette Graham, and Amir Kamran. 2017. also publicly available and documented. The data Results of the WMT17 metrics shared task. In for reproducing the IR experiments in Section 3.3.1 Proceedings of the Second Conference on Machine is only available to participants in the CLSSTS Translation, pages 489–513, Copenhagen, Denmark. Association for Computational Linguistics. shared task. Ondrejˇ Bojar, Rajen Chatterjee, Christian Federmann, Climate Impact: Our proposed metrics are on par Yvette Graham, Barry Haddow, Matthias Huck, with BLEU and such model-free methods, which Antonio Jimeno Yepes, Philipp Koehn, Varvara consume significantly less energy than most model- Logacheva, Christof Monz, Matteo Negri, Aure- based evaluation metrics. lie Neveol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Spe- cia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings of the 2016 conference Acknowledgements on machine translation. In Proceedings of the First Conference on Machine Translation, pages 131–198, Berlin, Germany. Association for Computational The authors thank Shantanu Agarwal, Joel Barry, Linguistics. and Scott Miller for their help with CLSSTS CLIR experiments, and Daniel Cohen for discussions on Elizabeth Boschee, Joel Barry, Jayadev Billa, Mar- jorie Freedman, Thamme Gowda, Constantine Lig- IR evaluation methods. This research is based upon nos, Chester Palen-Michel, Michael Pust, Ban- work supported by the Office of the Director of Na- riskhem Kayang Khonglah, Srikanth Madikeri, tional Intelligence (ODNI), Intelligence Advanced Jonathan May, and Scott Miller. 2019. SARAL: A Research Projects Activity (IARPA), via AFRL low-resource cross-lingual domain-focused informa- tion retrieval system for effective rapid document Contract FA8650-17-C-9116. The views and con- triage. In Proceedings of the 57th Annual Meeting of clusions contained herein are those of the authors the Association for Computational Linguistics: Sys- and should not be interpreted as necessarily repre- tem Demonstrations, pages 19–24, Florence, Italy. senting the official policies or endorsements, either Association for Computational Linguistics. expressed or implied, of the ODNI, IARPA, or the Chris Callison-Burch, Cameron Fordyce, Philipp U.S. Government. The U.S. Government is autho- Koehn, Christof Monz, and Josh Schroeder. 2007. rized to reproduce and distribute reprints for Gov- (Meta-) evaluation of machine translation. In Pro- ceedings of the Second Workshop on Statistical Ma- ernmental purposes notwithstanding any copyright chine Translation, pages 136–158, Prague, Czech annotation thereon. Republic. Association for Computational Linguis- tics. 1146 Nancy Chinchor. 1992. MUC-4 Evaluation Metrics. In Karen Spärck Jones, Steve Walker, and Stephen E. Proceedings of the 4th Conference on Message Un- Robertson. 2000. A probabilistic model of informa- derstanding, MUC4 ’92, page 22–29, USA. Associ- tion retrieval: development and comparative exper- ation for Computational Linguistics. iments: Part 2. Information processing & manage- ment, 36(6):809–840. Alexis Conneau and Guillaume Lample. 2019. Cross- lingual language model pretraining. In H. Wal- Philipp Koehn. 2005. Europarl: A parallel corpus for lach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, statistical machine translation. Proc. 10th Machine E. Fox, and R. Garnett, editors, Advances in Neu- Translation Summit (MT Summit), 2005, pages 79– ral Information Processing Systems 32, pages 7059– 86. 7069. Curran Associates, Inc. Guillaume Lample, Alexis Conneau, Ludovic Denoyer, Christophe Croux and Catherine Dehon. 2010. Influ- and Marc’Aurelio Ranzato. 2018a. Unsupervised ence functions of the spearman and kendall correla- machine translation using monolingual corpora only. tion measures. Statistical methods & applications, In 6th International Conference on Learning Rep- 19(4):497–515. resentations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceed- Michael Denkowski and Alon Lavie. 2011. Meteor 1.3: ings. OpenReview.net. Automatic metric for reliable optimization and eval- uation of machine translation systems. In Proceed- Guillaume Lample, Myle Ott, Alexis Conneau, Lu- ings of the Sixth Workshop on Statistical Machine dovic Denoyer, and Marc’Aurelio Ranzato. 2018b. Translation, pages 85–91, Edinburgh, Scotland. As- Phrase-based & neural unsupervised machine trans- sociation for Computational Linguistics. lation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Leon Derczynski. 2016. Complementarity, F-score, pages 5039–5049, Brussels, Belgium. Association and NLP evaluation. In Proceedings of the Tenth In- for Computational Linguistics. ternational Conference on Language Resources and Evaluation (LREC’16), pages 261–266, Portorož, Constantine Lignos, Daniel Cohen, Yen-Chieh Lien, Slovenia. European Language Resources Associa- Pratik Mehta, W. Bruce Croft, and Scott Miller. tion (ELRA). 2019. The challenges of optimizing machine trans- lation for low resource cross-language information George Doddington. 2002. Automatic evaluation retrieval. In Proceedings of the 2019 Conference on of machine translation quality using n-gram co- Empirical Methods in Natural Language Processing occurrence statistics. In Proceedings of the Second and the 9th International Joint Conference on Natu- International Conference on Human Language Tech- ral Language Processing (EMNLP-IJCNLP), pages nology Research, HLT ’02, page 138–145, San Fran- 3497–3502, Hong Kong, China. Association for cisco, CA, USA. Morgan Kaufmann Publishers Inc. Computational Linguistics. Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating train- Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey ing corpora for NLG micro-planners. In Proceed- Edunov, Marjan Ghazvininejad, Mike Lewis, and ings of the 55th Annual Meeting of the Associa- Luke Zettlemoyer. 2020. Multilingual denoising tion for Computational Linguistics (Volume 1: Long pre-training for neural machine translation. Papers), pages 179–188. Association for Computa- tional Linguistics. Chi-kiu Lo. 2019. YiSi - a unified semantic MT quality evaluation and estimation metric for languages with Jonas Gehring, Michael Auli, David Grangier, De- different levels of available resources. In Proceed- nis Yarats, and Yann N Dauphin. 2017. Convolu- ings of the Fourth Conference on Machine Transla- tional sequence to sequence learning. In Proceed- tion (Volume 2: Shared Task Papers, Day 1), pages ings of the 34th International Conference on Ma- 507–513, Florence, Italy. Association for Computa- chine Learning-Volume 70, pages 1243–1252. tional Linguistics.

Thamme Gowda and Jonathan May. 2020. Finding the Qingsong Ma, Ondrejˇ Bojar, and Yvette Graham. 2018. optimal vocabulary size for neural machine transla- Results of the WMT18 metrics shared task: Both tion. In Findings of the Association for Computa- characters and embeddings achieve good perfor- tional Linguistics: EMNLP 2020, pages 3955–3964, mance. In Proceedings of the Third Conference on Online. Association for Computational Linguistics. Machine Translation: Shared Task Papers, pages 671–688, Belgium, Brussels. Association for Com- Gregory Grefenstette. 2012. Cross-language informa- putational Linguistics. tion retrieval, volume 2. Springer Science & Busi- ness Media. Qingsong Ma, Johnny Wei, Ondrejˇ Bojar, and Yvette Graham. 2019. Results of the WMT19 metrics Karen Spärck Jones. 1972. A statistical interpretation shared task: Segment-level and strong MT sys- of term specificity and its application in retrieval. tems pose big challenges. In Proceedings of the Journal of Documentation, 28:11–21. Fourth Conference on Machine Translation (Volume 1147 2: Shared Task Papers, Day 1), pages 62–90, Flo- pages 7881–7892, Online. Association for Computa- rence, Italy. Association for Computational Linguis- tional Linguistics. tics. Rico Sennrich, Barry Haddow, and Alexandra Birch. Nitika Mathur, Timothy Baldwin, and Trevor Cohn. 2016. Neural machine translation of rare words 2019. Putting evaluation in context: Contextual with subword units. In Proceedings of the 54th An- embeddings improve machine translation evaluation. nual Meeting of the Association for Computational In Proceedings of the 57th Annual Meeting of the Linguistics (Volume 1: Long Papers), pages 1715– Association for Computational Linguistics, pages 1725, Berlin, Germany. Association for Computa- 2799–2808, Florence, Italy. Association for Compu- tional Linguistics. tational Linguistics. Anastasia Shimorina. 2018. Human vs automatic met- Nitika Mathur, Timothy Baldwin, and Trevor Cohn. rics: on the importance of correlation design. CoRR, 2020. Tangled up in BLEU: Reevaluating the eval- abs/1805.11474. uation of automatic machine translation evaluation Matthew Snover, Bonnie Dorr, Richard Schwartz, Lin- Proceedings of the 58th Annual Meet- metrics. In nea Micciulla, and John Makhoul. 2006. A study of ing of the Association for Computational Linguistics , translation edit rate with targeted human annotation. pages 4984–4997, Online. Association for Computa- In Proceedings of association for machine transla- tional Linguistics. tion in the Americas, volume 200. Cambridge, MA. Graham Neubig, Zi-Yi Dou, Junjie Hu, Paul Michel, Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie- Danish Pruthi, and Xinyi Wang. 2019. compare-mt: Yan Liu. 2019. MASS: Masked sequence to se- A tool for holistic comparison of language genera- quence pre-training for language generation. In Pro- tion systems. In Proceedings of the 2019 Confer- ceedings of Machine Learning Research, volume 97, ence of the North American Chapter of the Asso- pages 5926–5936, Long Beach, California, USA. ciation for Computational Linguistics (Demonstra- PMLR. tions), pages 35–41, Minneapolis, Minnesota. Asso- ciation for Computational Linguistics. Miloš Stanojevic´ and Khalil Sima’an. 2014. BEER: BEtter evaluation as ranking. In Proceedings of the Myle Ott, Sergey Edunov, Alexei Baevski, Angela Ninth Workshop on Statistical Machine Translation, Fan, Sam Gross, Nathan Ng, David Grangier, and pages 414–419, Baltimore, Maryland, USA. Associ- Michael Auli. 2019. fairseq: A fast, extensible ation for Computational Linguistics. toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kishore Papineni, Salim Roukos, Todd Ward, and Wei- Kaiser, and Illia Polosukhin. 2017. Attention is all Jing Zhu. 2002. Bleu: a method for automatic eval- you need. In Advances in neural information pro- uation of machine translation. In Proceedings of cessing systems, pages 5998–6008. the 40th Annual Meeting of the Association for Com- putational Linguistics, pages 311–318, Philadelphia, William Webber, Alistair Moffat, and Justin Zobel. Pennsylvania, USA. Association for Computational 2010. A similarity measure for indefinite rankings. Linguistics. ACM Transactions on Information Systems (TOIS), 28(4):1–38. Maja Popovic.´ 2015. chrF: character n-gram f-score for automatic MT evaluation. In Proceedings of the Steven Wegmann, Arlo Faria, Adam Janin, Korbinian Tenth Workshop on Statistical Machine Translation, Riedhammer, and Nelson Morgan. 2013. The TAO pages 392–395, Lisbon, Portugal. Association for of ATWV: Probing the mysteries of keyword search Computational Linguistics. performance. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pages 192– David M. W. Powers. 1998. Applications and explana- 197. IEEE. tions of Zipf’s law. In New Methods in Language Hans Yang. 2020. XLM-UNMT-Models. https:// Processing and Computational Natural Language github.com/Hansxsourse/XLM-UNMT-Models. Learning. Ilya Zavorin, Aric Bills, Cassian Corey, Michelle Mor- Marcelo O.R. Prates, Pedro H. Avelar, and Luís C. rison, Audrey Tong, and Richard Tong. 2020. Cor- Lamb. 2019. Assessing gender bias in machine pora for cross-language information retrieval in six translation: a case study with google translate. Neu- less-resourced languages. In Proceedings of the ral Computing and Applications, pages 1–19. workshop on Cross-Language Search and Summa- C. J. Van Rijsbergen. 1979. Information Retrieval, 2nd rization of Text and Speech (CLSSTS2020), pages edition. Butterworth-Heinemann, USA. 7–13, Marseille, France. European Language Re- sources Association. Thibault Sellam, Dipanjan Das, and Ankur Parikh. George Kingsley Zipf. 1949. Human behaviour and 2020. BLEURT: Learning robust metrics for text the principle of least-effort. Cambridge MA edn. generation. In Proceedings of the 58th Annual Meet- Addison-Wesley. ing of the Association for Computational Linguistics, 1148 A Metrics Reproducibility ⋆BLEU BLEU MACROF1 MICROF1 CHRF1 DE-CS .855 .745 .964 .917 .982 BLEU scores reported in this work are com- DE-EN .571 .655 .723 .695 .742 DE-FR .782 .881 .927 .844 .915 puted with the SACREBLEU library and have EN-CS .709 .954 .927 .927 .908 signature BLEU+case.mixed+lang.-+numrefs.1 EN-DE .540 .752 .741 .773 .824 +smooth.exp+tok.+version.1.4.13, where EN-FI .879 .818 .879 .848 .923 EN-GU .709 .709 .600 .734 .709 is zh for Chinese, and 13a for all other languages. EN-KK .491 .527 .685 .636 .661 MACROF1 and MICROF1 use the same tokenizer as EN-LT .879 .848 .970 .939 .881 BLEU. CHRF is also obtained using SACREBLEU EN-RU .870 .848 .939 .879 .930 1 FI-EN .788 .809 .909 .901 .875 and has signature chrF1+lang.-+numchars.6 FR-DE .822 .733 .733 .764 .815 +space.false +version.1.4.13. BLUERT scores are GU-EN .782 .709 .855 .891 .945 from the base model of Sellam et al.(2020), KK-EN .891 .844 .796 .844 .881 LT-EN .818 .855 .844 .855 .833 which is fine-tuned on WMT Metrics ratings data RU-EN .692 .729 .714 .780 .757 from 2015-2018. The BLEURT model is re- ZH-EN .695 .695 .752 .676 .715 Median .782 .752 .844 .844 .875 trieved from https://storage.googleapis.com/ Mean .751 .771 .821 .818 .841 bleurt-oss/bleurt-base-128.zip. SD .124 .101 .112 .093 .095 × MACROF1 and MICROF1 are computed using EN-ZH .606 .606 .424 .595 .594 Wins 3 3 6 3 5 our fork of SACREBLEU as: sacrebleu $REF -m macrof microf < $HYP. Table 9: WMT19 Metrics task: Kendall’s τ between metrics and human judgments. B Agreement with WMT Human Judgments C UNMT and SNMT Models Tables9, 10, and 11 provide τ between MT metrics and human judgments on WMT Metrics task 2017– The UNMT models follow XLM’s standard archi- 2019. ⋆BLEU is based on pre-computed scores in tecture and are trained with 5 million monolin- WMT metrics package, whereas BLEU is based gual sentences for each language using a vocab- on our recalculation using SACREBLEU. Values ulary size of 60,000. We train SNMT models for marked with ×are not significant at α = 0.05, and EN↔DE and select models with the most similar hence corresponding rows are excluded from the (or a slightly lower) BLEU as their UNMT counter- calculation of mean, median, and standard devia- parts on newstest2019. The DE-EN model selected tion. is trained with 1 million sentences of parallel data Since MACROF1 is the only metric that does not and a vocabulary size of 64,000, and the EN-DE achieve statistical significance in the WMT 2019 model selected is trained with 250,000 sentences of EN-ZH setting, we carefully inspected it. Human parallel data and a vocabulary size of 48,000. For scores for this setting are obtained without look- EN↔FR and EN↔RO, we select SNMT models ing at the references by bilingual speakers (Ma from submitted systems to WMT shared tasks that et al., 2019), but the ZH references are found to have similar or slightly lower BLEU scores to corre- have a large number of bracketed EN phrases, es- sponding UNMT models, based on NewsTest2014 pecially proper nouns that are rare types. When for EN↔FR and NewsTest2016 for EN↔RO. the text inside these brackets is not generated by an Figure1, which is a visualization of MACROF1 MT system, MACROF1 naturally penalizes heavily for SNMT and UNMT models, shows that UNMT due to the poor recall. Since other metrics assign is generally better than SNMT on frequent types, lower importance to poor recall of such rare types, however, SNMT outperforms UNMT on the rest they achieve relatively better correlation to human leading to a crossover point in MACROF1 curves. scores than MACROF1. However, since the τ val- Since MACROF1 assigns relatively higher weights ues for EN-ZH are relatively lower than the other to infrequent types than in BLEU, SNMT gains language pairs, we conclude that poor correlation higher MACROF1 than UNMT while both have ap- of MACROF1 in EN-ZH is due to poor quality ref- proximately the same BLEU, as reported in Table6. erences. Some settings did not achieve statistical A complete comparison of UNMT vs SNMT in significance due to a smaller sample set as there different languages is in Table 12. A manual analy- were fewer MT systems submitted, e.g. 2017 CS- sis of the ten sentences with the largest magnitude EN. favoritism according to MACROF1 and BLEU in 1149 ⋆BLEU BLEU MACROF MICROF CHRF 1 1 1 SNMT vs UNMT: FREN 90 DE-EN .828 .845 .917 .883 .919 SNMT EN-DE .778 .750 .850 .783 .848 85 UNMT EN-ET .868 .868 .934 .906 .949 80 e l i EN-FI .901 .848 .901 .879 .945 t n

e 75 c EN-RU .889 .889 .944 .889 .930 r e P

70 EN-ZH .736 .729 .685 .833 .827 1 F o r

ET-EN .884 .900 .884 .878 .904 c 65 a

FI-EN .944 .944 .889 .915 .957 M 60 RU-EN .786 .786 .929 .857 .869 ZH-EN .824 .872 .738 .780 .820 55 EN-CS 1.000 1.000 .949 1.000 .949 50 the do too put Median .868 .868 .901 .879 .919 investigation SNMT vs TypeUNMT: ENFR Mean .858 .857 .875 .873 .902 90 SD .077 .080 .087 .062 .052 SNMT 85 TR-EN ×.200 ×.738 ×.400 ×.316 ×.632 UNMT × × × 80 e l

EN-TR .571 .400 .837 .571 .849 i t

× × × × × n

e 75

CS-EN .800 .800 .600 .800 .738 c r e P

Wins 1 2 3 2 6 70 1 F o r

c 65

Table 10: WMT18 Metrics task: Kendall’s τ between a metrics and human judgments. M 60 55

⋆BLEU BLEU MACROF1 MICROF1 CHRF1 50 DE-EN .564 .564 .734 .661 .744 de lors ça 1 filles SNMT vs TypeUNMT: DEEN EN-CS .758 .751 .767 .758 .878 90 EN-DE .714 .767 .562 .593 .720 SNMT 85 UNMT EN-FI .667 .697 .769 .718 .782 80 e l

EN-RU .556 .556 .778 .648 .669 i t n

e 75

EN-ZH .911 .911 .600 .854 .899 c r e P

LV-EN .905 .714 .905 .905 .905 70 1 F

RU-EN .778 .611 .611 .722 .800 o r

c 65 TR-EN .911 .778 .674 .733 .907 a ZH-EN .758 .780 .736 .824 .732 M 60 Median .758 .733 .735 .728 .791 55 Mean .752 .713 .714 .742 .804 50 SD .132 .110 .103 .097 .088 the those american attack virginia × FI-EN .867 .867 .733 .867 .867 SNMT vs TypeUNMT: ENDE × 90 EN-TR .857 .714 .571 .643 .849 SNMT CS-EN ×1.000 ×1.000 ×.667 ×.667 ×.913 85 UNMT 80

Wins 5 4 2 2 6 e l i t n

e 75 c

Table 11: WMT17 Metrics task: Kendall’s τ between r e P

70 metrics and human judgments. 1 F o r

c 65 a Translation SNMT UNMT SNMT Name M 60

DE-EN NewsTest2019 32.7 33.9 Our Transformer 55 Our Transformer EN-DE NewsTest2019 24.0 24.0 50 , FR-EN NewsTest2014 31.1 31.2 OnlineA.0 seiner obersten aufgrund britische SNMT vs TypeUNMT: ROEN EN-FR NewsTest2014 25.6 27.1 PROMT-Rule-based.3083 90 RO-EN NewsTest2016 30.8 29.6 Online-A.0 SNMT EN-RO NewsTest2016 31.2 31.0 uedin-pbmt.4362 85 UNMT 80 e l i

Table 12: SNMT systems are selected such that their t n

e 75 c LEU r B scores are approximately the same as the avail- e P

70 1

able pretrained UNMT models. F o r

c 65 a

M 60 the FR-EN and RO-EN test sets is in Table 13 and 55 Table 14. The complete texts of these sentences, 50 go the according state early their reference translations, and the system transla- Type tions (including DE-EN mentioned in Sec4), are Figure 1: SNMT vs UNMT MACROF1 on the most shown in Tables 15, 16, 17, 18, 19, and 20. frequent 500 types. UNMT outperforms SNMT on fre- quent types that are weighed heavily by BLEU how- ever, SNMT is generally better than UNMT on rare types; hence, SNMT has a higher MACROF1.

1150 δMACROF1 Fav Analysis δBLEU Fav Analysis .044 S S: synonym; U: untranslation, synonym -.026 U U: synonym; S: omitted adv, word order -.038 U U: no issues; S: synonym .025 S S: no issues; U: determiner, word order .035 S S: synonym; U: untranslation, synonym .024 S S: no issues; U: repetition, form -.034 U U: no issues; S: synonym; word_order .021 S S: verb, synonym; U: untranslation, noun, time, synonym -.034 U U: synonym; S: word order, verb_ref .021 S S: synonym; U: synonym -.033 U U: no issues; S: synonym -.021 U U: omitted NER; S: synonym, word order .033 S S: word order; U: untranslation, NER, word -.021 U U: untranslation; S: verb, word order order .032 S S: synonym; U: number, omitted noun, un- -.021 U U: synonym; S: extra preposition, synonym, translation, verb word order .030 S S: adj; U: untranslation .021 S S: no issues; U: NER .030 S S: noun, synonym; U: noun, synonym -.020 U U: synonym; S: synonym, word order Table 13: Analysis of the ten FR-EN test set segments with the most favoritism in SNMT (S) or UNMT (U), according to MACROF1 (left) and BLEU (right). Fav is the favored system by metrics. Actual examples are shown in Appendix Tables 17 and 18.

δMACROF1 Fav Analysis δBLEU Fav Analysis .131 S S: word order; U: repetition, word order .114 S S: word order; U: repetition, word order .063 S S: noun, word order; U: repetition, untrans- .089 S S: no issues; U: omitted noun, omitted time, lation, noun NER .062 S S: extra, untranslation; U: untranslation, -.072 U U: country, untranslation; S: noun, word copy order -.052 U U: untranslation x 3, synonym; S: untrans- -.045 U U: synonym; S: synonym, word order lation, synonym -.052 U U: untranslation, NER, synonym; S: NER, -.041 U U: untranslation; S: word order, subject synonym -.052 U U: extra; S: untranslation -.040 U U: no issues; S: number, omitted preposition -.050 U U: adv; S: incoherent, adv .039 S S: extra, untranslation; U: untranslation, copy -.050 U U: active/passive voice, name; S: name .036 S S: no issues; U: extra verb -.049 U U: untranslation; S: untranslation, word or- -.035 U U: repetition, untranslation, S: verb, syn- der onym, word order .048 S S: no issues; U: NER .034 S S: synonym; U: untranslation Table 14: Analysis of the ten RO-EN test set segments with the most favoritism in SNMT (S) or UNMT (U), according to MACROF1 (left) and BLEU (right). Fav is the favored system by metrics. Actual examples are shown in Appendix Tables 19 and 20.

1151 δMACROF1 Source Reference SNMT UNMT .071 Es wird davon ausgegan- It is understood they will It is assumed that they have It is understood they have a gen, dass sie über eine feature a powerful cannon, a powerful cannon, a series powerful canon, a number of leistungsstarke Kanone, eine an array of anti-aircraft and of anti-aircraft and anti-ship fluke and ship fire systems Reihe von Flugabwehr- und anti-ship missiles as well missiles, as well as some and some stealth-controlled Schiffsabwehrraketen sowie as some stealth technologies, steam technologies, such as technologies, such as re- einige Stealth-Technologien such as reduced radar, in- reduced radar, infrared and duced radar, infrarot and verfügen, wie z. B. re- frared and acoustic signa- acoustic signatures. akustic signposts. duzierte Radar-, Infrarot- und tures. akustische Signaturen. .064 Eine Gruppe maskierter Pro- A group of masked pro- A group of masked pro- A group of masked pro- Separatisten, die von der Bere- separatists held back by riot separatists held hostage by independence separforces, itschaftspolizei zurückgehal- police pelted them with eggs the riot police brought them who were kept away by ten wurden, bewarfen sie mit and hurled powder paint, to eggs and ignited pow- the Bereitschaftpolice, be- Eiern und schleuderte Pulver- creating dark clouds of dust der paint and produced dark headed them with evocative farbe und erzeugte in den in streets that would usually clouds in the streets that paint and poured pulver- Straßen, die normalerweise be thronged with tourists. were usually crowded by force paint and created von Touristen überfüllt waren, tourists. dark Staubations in the dunkle Staubwolken. streets normally clogged by tourists. -.055 Il faut bien le faire. Il faut bien le faire. La gentillesse du personnel Il faut bien faire. et la disponibil. .052 In einem Abschnitt gibt es In one section there is an In a section there is an im- In one section, there is a ein Bild eines Schlafsaals, in image of a dorm room, age of a bedroom where stu- picture of a sleeping sauna dem die Studenten auf Kaf- where students click on cof- dents click on coffee cups, where students click on cof- feetassen, Vorhänge, Trainer fee cups, curtains, trainers curtains, trainers, and books fee cups, forecourts, coaches und Bücher klicken, um über and books to be told about to be informed about the ef- and books to be educated die Auswirkungen von Kof- the effects of caffeine and fects of caffeine and light, about the impact of Koffein fein und Licht informiert zu light and how athletic perfor- and about how sporting per- and light and about how ath- werden und darüber, wie sich mance is impacted by sleep formance is affected by lack letic performance is influ- die sportliche Leistung durch deficiency, and the impor- of sleep and the importance enced by sleep loss and the Schlafmangel und die Bedeu- tance of a bedtime routine. of sleeping time routine. importance of a sleep day tung einer Schlafenszeitrou- routine. tine beeinflusst. -.045 Nickelbergbau ist auch in Nickel mining is also impor- Nickelergbau is also impor- Nickel mining is also im- der Provinz wichtig, wird tant in the province, but is tant in the province, but is portant in the province, but aber hauptsächlich in Mo- mostly concentrated in Mo- mainly operated in More- is mostly operated in Mo- rowali betrieben, an der rowali, on the opposite coast wali, on the opposite coast rowali, on the opposite coast gegenüberliegenden Küste of Sulawesi. of Sulawesi. of Sulawesi. von Sulawesi. .044 Vor 32 Jahren schloss ich Ever since I joined Labour 32 years ago, I joined Last 32 years ago, as a stu- mich als Schüler, wegen 32 years ago as a school Labour as a student be- dent, because of the disdain der Vernachlässigung der pupil, provoked by the cause of the neglect of the for the Thatcher-era govern- Thatcher-Regierung, Labour Thatcher government’s Thatcher government, which ment, Labour joined Labour. an. Diese Vernachlässigung neglect that had left my had led to my classroom lit- hatte dazu geführt, dass comprehensive school class- erally collapsed, and as a re- mein Klassenzimmer buch- room literally falling down, sult I tried to promote bet- stäblich zusammengebrochen I’ve sought to champion ter public services for those war. Infolgedessen habe ich better public services for who need it most, whether as versucht, mich für bessere those who need them most - a local council or ministers. öffentliche Dienstleistungen whether as a local councillor für diejenigen einzusetzen, or government minister. die sie am meisten brauchen. Egal ob als Gemeinderat oder Minister. .044 UN-Gesandter Staffan de UN envoy Staffan de Mis- UN envoy Staffan de Mis- U.N. Secretary General Mistura hofft, bald die er- tura is hoping to soon con- tura hopes to convene soon Staffan de Mistura hopes sten Treffen eines neuen vene the first meetings of the first meetings of a new to soon join the first meet- Ausschusses aus Regierungs- a new committee comprised committee of government ings of a new committee und Oppositionsmitgliedern of government and opposi- and opposition members to of government and oppo- einzuberufen, um eine tion members to draft a post- draw up a post-war constitu- sition leaders to design Nachkriegsverfassung für war constitution for Syria tion for Syria and pave the a Nachkriegsrewrite for Syrien zu entwerfen und den and pave the way to elec- way for elections. Syria and clear the path to Weg zu Wahlen zu ebnen. tions. elections. .043 CBS hatte 3,1 Millio- CBS had 3.1 million, NBC CBS had 3.1 million, NBC CBS had 3.8 million, NBC nen, NBC 2,94 Millionen, had 2.94 million, MSNBC 2.94 million, MSNBC 2.89 3.94 million, MSNBC 3.89 MSNBC 2,89 Millionen und had 2.89 million and CNN million and CNN 2.52 mil- million and CNN 3.52 mil- CNN 2,52 Millionen, so had 2.52 million, Nielsen lion, says Nielsen. lion, Nielsen said. Nielsen. said. -.041 Den Rangers gelangen nur Rangers managed just two The Ranners only reach two The Rangers managed only zwei Schüsse in der ersten first-half shots on target shots in the first half, but two shots in the first half Hälfte, aber der ehemalige but former Ibrox goalkeeper the former Ibrox-Torkeeper but former Ibrox goalkeeper Ibrox-Torhüter Liam Kelly Liam Kelly was barely trou- Liam Kelly was hardly the Liam Kelly was unlikely war kaum von Lassana bled by Lassana Coulibaly’s head of Lassanna Coulibys to be helped by Lassana Coulibalys Kopfsprung und header and a tame Ovie and the hit of a bissloze Ovi Coulibaly’s headfirst tackle dem Treffer eines bisslosen Ejaria strike. Ejaria. and the goal of a bisected Ovie Ejaria aus der Ruhe zu Ovie Ejaria. bringen. .041 Liverpool tritt am MIttwoch Liverpool battles Napoli in Liverpool will take place Liverpool v Napoli at the um 15.00 Uhr im Stadio the group stage of the Cham- at 3 p.m. at the Stadio MItch Stadium at 15.00 pm San Paolo in Neapel, Italien, pions League at 3 p.m. on San Paolo in Naples, Italy, in Neapel, Italy, on MItch. gegen Napoli an. Wednesday at Stadio San against Napoli. Paolo in Naples, Italy.

Table 15: Top 10 segments by SδMACROF1 (i,hS,hU )S on DE-EN.

1152 δBLEU Source Reference SNMT UNMT .048 In der letzten Woche wurden mittlere Medium concentrations in or Last week, average concen- In the last week, moder- Konzentrationen in Küstennähe und offshore of Pinellas County trations were reported on the ate to high Konzentrof lead auf offener See in Pinellas County have been reported in the coast and open seas in Pinel- in Küstas County were re- gemeldet, geringe bis hohe Konzen- past week, low to high las County, low to high lev- ported in Pinellas County, trationen auf offener See in Hillsbor- concentrations offshore of els at open sea in Hillsbor- low to high Konzentrof lead ough County, Hintergrund- bis hohe Hillsborough County, back- ough County, background to levels on open water in Konzentrationen in Manatee County, ground to high concentra- high concentrations in Man- Hillsborough County, Hinter- Hintergrund- bis hohe Konzentratio- tions in Manatee County, tee County, high concentra- grundto high levels in Man- nen in Küstennähe und auf offener background to high concen- tions in coastal and open atee County, Hintergrundto See in Sarasota County, Hintergrund- trations in or offshore of seas in Sarsota County, back- high to high Konzentrin Küs- bis mittlere Konzentrationen in Char- Sarasota County, background ground to medium concen- tas and on open water in Sara- lotte County, Hintergrund- bis hohe to medium concentrations trations in Charlotte County, sota County and low Konzen- Konzentrationen in Küstennähe und in Charlotte County, back- background to high shore and trationen in Charlotte County, auf hoher See in Lee County sowie ground to high concentra- high sea levels in Lee County, Hintergrundto high to high geringe Konzentrationen in Collier tions in or offshore of Lee and low concentrations in Konzentrin Küstennähe and County. County, and low concentra- Collier County. on open water in Sarasota tions in Collier County. County. .046 Moskau hat wiederholt betont, dass Moscow has repeatedly Moscow has repeatedly Moscow has repeatedly die 11-Milliarden-Dollar-Pipeline stressed that the $11 billion stressed that the $11 billion insisted that the 11-billion Nord Stream 2, die die bestehende Nord Stream 2 pipeline, Nord Stream 2 pipeline, pipeline, Nord Stream 2, Pipeline-Kapazität auf 110 Milliar- which is set to double the which is supposed to double which will double the exist- den Kubikmeter verdoppeln soll, ein existing pipeline capacity to the existing pipeline capacity ing Pipeline-capacity to 110 rein wirtschaftliches Projekt ist. 110 billion cubic meters, is a to 110 billion cubic metres, billion cubic feet, is a purely purely economic project. is a purely economic project. commercial project. .044 Der NTS, der für die Betreuung von The NTS, which is respon- The NTS, which is responsi- The NTS, responsible for mehr als 270 historischen Gebäu- sible for the care of more ble for the care of more than managing more than 270 his- den, 38 wichtigen Gärten und 76.000 than 270 historical build- 270 historic buildings, 38 im- toric buildings, 38 key gar- Hektar Land rund um das Land ve- ings, 38 important gardens portant gardens and 76,000 dens and 74,000 acres of land rantwortlich ist, nimmt die Fleder- and 76,000 hectares of land hectares of land around the around the country, said the mäuse sehr ernst. around the country, takes bats country, takes the bats very Fledermäuse are very impor- very seriously. seriously. tant. .042 George W. Bush telefonierte mit Sen- George W. Bush has been George W. Bush contacted George W. Bush spoke to sen- atoren, um diese zu überreden, Herrn picking up the phone to call senators to persuade them to ators to help him overture Kavanaugh zu unterstützen, der im Senators, lobbying them to support Mr Kavanaugh, who to support Mr. Kavanaugh, Weißen Haus für Herrn Bush gear- support Mr Kavanaugh, who worked in the White House who had worked in the White beitet hatte und durch ihn seine worked in the White House for Mr Bush and met his wife House for Mr. Bush and met Frau Ashley traf, die die persönliche for Mr Bush and through him Ashley, who was Mr Bush’s through him his wife, Ashley, Sekretärin von Herrn Bush war. met his wife Ashley, who personal secretary. who was the personal secre- was Mr Bush’s personal sec- tary to Mr. Bush. retary. -.039 Eine Woche nachdem eine of- A week after an official Chi- A week after an official Chi- A week after an official Chi- fizielle chinesische Zeitung eine nese newspaper ran a four- nese newspaper published a nese newspaper published a vierseitige Anzeige in einer US- page ad in a U.S. daily tout- four-page display in a US four-page ad on the mutual amerikanischen Tageszeitung auf ing the mutual benefits of daily on the mutual benefit of benefit of the US-China trade, den gegenseitigen Nutzen des US- U.S.-China trade, the U.S. US China trade, the US am- the U.S. ambassador to China China-Handels gestellt hatte, warf ambassador to China accused bassador to China published accused Beijing of using the der US-amerikanische Botschafter Beijing of using the Amer- in Beijing to use the Ameri- American press to spread pro- in China Peking vor, die amerikanis- ican press to spread propa- can press for propaganda. paganda. che Presse zur Verbreitung von ganda. Propaganda zu verwenden. -.037 Sie kümmern sich nicht darum, wen They don’t care who they They do not care about who They don’t care who they sie verletzen, wen sie überfahren hurt, who they have to run they hurt whom they must hurt, who they have to pass to müssen, um Macht und Kontrolle zu over in order to get power pass over to gain power and get power and control, that’s bekommen, das ist, was sie wollen, and control, that’s what they control, that is what they what they want, power and Macht und Kontrolle, wir werden es want is power and control, want, power and control, we control, we won’t give it to ihnen nicht geben. we’re not going to give it to will not give them. them. them." -.034 Mayorga behauptet, Ronaldo sei Mayorga claims Ronaldo fell Mayorga claims that Ronaldo Mayorga claims Ronaldo fell nach dem angeblichen Vorfall auf to his knees after the alleged fell to the knees after the al- on his knee after the alleged die Knie gefallen und habe ihr incident and told her he was leged incident, saying that he incident and told her he was gesagt, er sei „zu 99 Prozent“ ein "99 percent" a "good guy" let was “99% ” a“ good guy ” "to 99 percent" a "good guy" „guter Kerl“, der von den „ein down by the "one percent." left in the lurch by the “one who was left in the dark by Prozent“ im Stich gelassen wurde. percent ”. the "one percent." -.032 Palin, 29, aus Wasilla, Alaska, Palin, 29, of Wasilla, Alaska, Palin, 29, from Wasilla, Palin, 29, of Wasilla, Alaska, wurde wegen des Verdachts auf häus- was arrested on suspicion of Alaska, was arrested for was arrested on charges of do- liche Gewalt verhaftet. Gegen ihn domestic violence, interfer- alleged domestic violence, mestic violence. – Against liegt bereits ein Bericht über häus- ing with a report of domes- and a report on domestic him, a report of domestic vi- liche Gewalt und Widerstand bei der tic violence and resisting ar- violence and opposition olence and resistance in ar- Festnahme vor, so eine Meldung, die rest, according to a report to arrest has already been rest was already released Sat- am Samstag von den Alaska State released Saturday by Alaska published on Saturday by the urday, according to a report Troopers veröffentlicht wurde. State Troopers. Alaska State Trooperator. released Saturday by Alaska State Troopers. -.032 "Ich habe [...] nicht versteckt "I did not hide Dr. Ford’s "I have [...] not hidden Ford’s "I did not hide [Forman’s Fords Behauptungen, ich habe ihre allegations, I did not leak claims that I have not lived claims, I didn’t geleast Geschichte nicht geleakt“, erzählte her story," Feinstein told their history," told Finestein her story," Feinstein told Feinstein dem Komitee, berichtete the committee, The Hill re- the committee, reported The the committee, The Hill The Hill. ported. Hill. reported. -.031 Briefings werden immer noch stat- Briefings will still happen, briefing is still going to take Briefings will still take place, tfinden, sagte Sanders, aber sollte Sanders said, but "if the press place, Sanders said, but the Sanders said, but if the press die Presse die Chance haben, dem has the chance to ask the press should have the oppor- has the chance to ask the pres- Präsidenten der Vereinigten Staaten president of the United States tunity to put the questions di- ident of the United States di- die Fragen direkt zu stellen, so sei questions directly, that’s in- rectly to the President of the rectly, so that is unendlich das unendlich besser, als mit ihr zu finitely better than talking to United States, if that is in- better than talking to her. sprechen. me. finitely better than to talk to her.

Table 16: Top 10 segments by SδBLEU(i,hS,hU )S on DE-EN. 1153 δMACROF1 Source Reference SNMT UNMT .044 Il ne fallait qu’en déployer les ac- All it took was to highlight its It should deploy the accidents, It only took to deploy the ac- cidents, et l’affaire, jacobinisme mistakes and, in keeping with and the case, Jacobinism cidents, and the matter, jacobi- oblige, était confiée aux préfets et Jacobinism, the issue would obliges, was entrusted to the nite oblige, was handed to the aux sous-préfets, interprètes au- be entrusted to prefects and prefects and the sub-prefects, préfets and the sous-préfets, torisés. sub-prefects - the authorised authorized interpreters. authorized interprètes. interpreters. -.038 Les spécialistes disent que les Experts say confessions are The experts say that people Experts say people are rou- personnes sont systématique- still routinely coerced, despite are systematically forced to tinely forced to make their ment contraintes à faire leurs a change in the law earlier make their confessions, de- confessions, despite a change aveux, malgré un changement this year banning the authori- spite a change in the law in the law that was passed ear- dans la loi qui a été voté plus ties from forcing anyone to in- which was passed earlier this lier in the year banning offi- tôt dans l’année interdisant aux criminate themselves. year prohibiting the authori- cials from forcing anyone to autorités de forcer quiconque à ties to force anyone to incrim- incriminate themselves. s’incriminer lui-même. inating himself. .035 Ils sont intersexués, They are intersex, part of a They are intersex, intersex They are intersexuzed, with l’intersexualité faisant partie group of about 60 conditions forming part of the Group of intersexuality making up the du groupe de la soixantaine de that fall under the diagno- 60 diseases diagnosed as dis- group of the soixantaine of maladies diagnostiquées comme sis of disorders of sexual de- orders of sexual development, diseases diagnosed as disor- désordres du développement velopment, an umbrella term a generic term for people with dered sexual development, a sexuel, un terme générique for those with atypical chro- chromosomes or atypical go- generic term dissignant peo- désignant les personnes possé- mosomes, gonads (ovaries or nads (ovaries or testes) or ab- ple possessing chromosomes dant des chromosomes ou des testes), or unusually deve- normally developed sexual or- or gonades (ovaries or testic- gonades (ovaires ou testicules) loped genitalia. gans. ules) atypiques or anormally atypiques ou des organes sexuels developed sexual organs. anormalement développés. -.034 Ces violences sont de plus The violence is becoming Such violence are more lethal Those violence is increasingly en plus meurtrières en dépit more and more deadly in spite despite measures enhanced se- deadly in the face of increased de mesures de sécurité renfor- of reinforced security mea- curity and large-scale military security measures and major cées et d’opérations militaires sures and large-scale military operations launched by the military operations launched d’envergure lancées depuis des operations undertaken in re- Government of Nouri Al Ma- for months by Nouri Al Ma- mois par le gouvernement de cent months by Nouri Al Ma- liki, the Shia-dominated for liki’s government, dominated Nouri Al Maliki, dominé par les liki’s government, which is months. by Shiites. chiites. dominated by Shiites. -.034 Du côté du gouvernement, on On the government side, it On the side of the Govern- On the government side, one estime que 29 954 membres said 29,954 are members of ment, it is estimated that 29 estimate says 29,954 mem- des forces armées du prési- President Bashar Assad’s 954 members of the armed bers of President Bachar al- dent Bachar el-Assad ont trouvé armed forces, 18,678 are forces of president Bachar Al- Assad’s armed forces have la mort, dont 18 678 étaient pro-government fighters and Assad died, whose 18 678 found their way, including des combattants des forces pro- 187 are Lebanese Hezbollah were 187 Lebanese Hezbollah 18,678 were fighters from pro- gouvernementales et 187 des mil- militants. militants and fighters of the government forces and 187 itants du Hezbollah libanais. pro-Government forces. from Lebanese Hezbollah mil- itants. -.033 Mercredi, le Centre américain de On Wednesday, the Centers Wednesday, the US Centre of Wednesday, the U.S. Centers contrôle et de prévention des mal- for Disease Control and Pre- disease prevention and control for Disease Control and Pre- adies a publié une série de di- vention released a set of guide- issued a set of guidelines in- vention issued a series of di- rectives indiquant comment gérer lines to manage children’s dicating how to manage food rectives indicating how to han- les allergies alimentaires des en- food allergies at school. allergies of children at the dle children’s food allergies at fants à l’école. school. school. .033 N’est-il pas surprenant de lire And is it not surprising to read Is it not surprising to read Isn’t it surprising to read in the dans les colonnes du Monde à in the pages of Le Monde, in the columns of the world Times’ pages just weeks apart quelques semaines d’intervalle on the one hand, a reproduc- a few weeks apart on the of one side’s reproduction of d’une part la reproduction de tion of diplomatic correspon- one hand the reproduction the American diplomatic cor- la correspondance diplomatique dence with the US and, on of American diplomatic cor- respondance and of another a américaine et d’autre part une the other, condemnation of the respondence and on the other condamnation of the Quai d’ condamnation des écoutes du NSA’s spying on the Ministry hand a condemnation of the Orsay’s écoutes by the NSA? Quai d’Orsay par la NSA ? of Foreign Affairs on the Quai Quai d’Orsay by the NSA lis- d’Orsay, within a matter of tens? weeks? .032 Les ministères appellent à The ministries are currently Departments now call peo- The at present call on people présent les personnes qui au- asking anyone who might ple who have been bitten, who may have been morbid, raient été mordues, griffées, have been bitten, clawed, scratched, scratched or licked griffon, egregious or layed on égratignées, ou léchées sur une scratched or licked on a mu- on mucous membranes or skin a mug or on a skin léché muqueuse ou sur une peau lésée cous membrane or on dam- injured by this kitten or where by this chateau or whose an- par ce chaton ou dont l’animal aged skin by the kitten, or who the animal would have been imal may have been in con- aurait été en contact avec ce own an animal that may have in contact with this kitten be- tact with that chateau between chaton entre le 8 et le 28 octobre been in contact with the kit- tween 8 and 28 October to 8 and 28 October to contact à contacter le 08.11.00.06.95 ten between 08 to 28 October, contact the 08.11.00.06.95 be- 08.11.00.095 between 10 and entre 10 heures et 18 heures à to contact them on 08 11 00 tween 10 a.m. and 6 p.m. 18 November. partir du 1er novembre. 06 95 between 10am and 6pm from November 1. from 01 November. .030 A cette IIIe République, moment Pierre Nora has shown has This third Republic, while At this IIIe République, central et créateur, Pierre Nora shown great interest and even central and creator, Pierre central and creator moment, a montré beaucoup d’intérêt et tenderness for this Third Re- Nora has shown great interest Pierre Nora showed much même de tendresse: saluant ceux public: he salutes those who and even tenderness: saluting interest and even tendresse: qui se sont alors employés à ré- tried at the time to repair the those who then worked to re- praising those who then parer la fracture révolutionnaire, divide created by the Rev- pair the revolutionary divide, helped to repair the revolu- en enseignant aux écoliers tout olution by teaching students by teaching students what in tionary fracture, teaching ce qui dans l’ancienne France about everything in the for- the former France preparing schoolchildren everything in préparait obscurément la France mer France that obscurely darkly modern France and of- the former French Republic moderne et en leur proposant une paved the way for the modern fering them a version unified that obscurantly prepared version unifiée de leur histoire. France, and by offering them in their history. modern France and offering a unified version of their his- them a unifying version of tory. their history. .030 La théorie dominante sur la façon The prevailing theory on how The prevailing theory about The dominant theory about de traiter les enfants pourvus to treat children with ambigu- how to treat children with how to treat children armed d’organes sexuels ambigus a été ous genitalia was put forward ambiguous sexual organs was with ambigüous sex organs lancée par le Dr John Money, de by Dr. John Money at Johns launched by Dr. John Money was launched by Dr. John l’université Johns-Hopkins, qui Hopkins University, who held of the Johns Hopkins Univer- Money, of Johns-Hopkins considérait que le genre est mal- that gender was malleable. sity, who considered that the University, who considered léable. genre is malleable. the genre maudlin.

Table 17: Top 10 segments by SδMACROF1 (i,hS,hU )S on FR-EN. 1154 δBLEU Source Reference SNMT UNMT -.026 Mais le représentant Bill Shuster But Rep. Bill Shuster (R-Pa.), But Congressman Bill Shus- But Rep. Bill Shuster (R-Pa.), (R-Pa.), président du Comité chairman of the House Trans- ter (R - PA.), Chairman chairman of the House Trans- des transports de la Chambre portation Committee, has said of the House of representa- portation Committee, said he des représentants, a déclaré he, too, sees it as the most vi- tives Transportation Commit- also considered it the most vi- qu’il le considérait aussi comme able long-term alternative. tee, said that he considered as able long-term alternative. l’alternative la plus viable à long the most viable alternative in terme. the long term. .025 Les neuf premiers épisodes de The first nine episodes of Sher- The first nine episodes of Sher- Sheriff first nine episodes of Sheriff Callie’s Wild West seront iff Callie’s Wild West will iff Callie’s Wild West will Sheriff Callie’ s Wild West disponibles à partir du 24 novem- be available from November be available from November will be available as of Novem- bre sur le site watchdisneyju- 24 on the site watchdisneyju- 24 on the site watchdisneyju- ber 24 on the watchdisneyju- nior.com ou via son application nior.com or via its application nior.com or via its application nior.com website or via its ap- pour téléphones et tablettes. for mobile phones and tablets. for phones and tablets. plication for phones and com- puters. .024 Le président Xi Jinping, qui a President Xi Jinping, who President Xi Jinping, who President Xi Jinping, who pris ses fonctions en mars dernier, took office last March, has took office in March, has took office in March, has a fait de la lutte contre la cor- made the fight against corrup- made the fight against corrup- made fighting corruption a na- ruption une priorité nationale, es- tion a national priority, believ- tion a national priority, be- tional priority, saying the phe- timant que le phénomène consti- ing that the phenomenon is a lieving that the phenomenon nomenon posed a threat to the tuait une menace à l’existence- threat to the very existence of posed a threat to the very exis- Communist Party’s existence- même du Parti communiste. the Communist Party. tence of the Communist Party. free existence. .021 Un peu plus tôt, sur la route A little earlier, on the road Earlier, on the road leading A day earlier, on the road menant à Bunagana, poste- to Bunagana, the frontier post to Bunagana border post with leading to Bunagana, a post- frontière avec l’Ouganda, with Uganda, soldiers assisted Uganda, soldiers helped civil- code with Uganda, military des militaires aidés de civils by civilians loaded up a mul- ians loaded a multiple rocket personnel aided by civilians chargeaient un lance-roquettes tiple rocket launcher mounted launcher mounted on a truck were loading a multiple rocket multiple monté sur un camion on a brand new truck belong- brand new FARDC, to ensure lance-roquettes fire on a flam- flambant neuf des FARDC, de- ing to the FARDC, intended to succession of another engine bant neuf FARDC truck, ex- vant assurer la relève d’un autre take over from another device pounding the positions of the pected to provide the lead for engin pilonnant les positions du pounding the positions of the M23 in the hills. another device pilfering M23 M23 sur les collines. M23 in the hills. positions on the hills. .021 Il y a, avec la crémation, "une vi- With cremation, there is a There, with the cremation, "a There is, with cremation, "vio- olence faite au corps aimé", qui sense of "violence committed violence made to the beloved lence done to the loved one," va être "réduit à un tas de cen- against the body of a loved body", which will be "reduced which is going to be "reduced dres" en très peu de temps, et one", which will be "reduced to a pile of ashes" in a very to a tas of ashes" in very lit- non après un processus de décom- to a pile of ashes" in a very short time, and not after a pro- tle time, and not after a pro- position, qui "accompagnerait les short time instead of after a cess of decomposition, which cess of disablement, which phases du deuil". process of decomposition that "would accompany the phases would "accompany the phases "would accompany the stages of mourning". of grief." of grief". .021 Scott Brown, le capitaine du Scott Brown, Glasgow Celtic Scott Brown, the captain of Scott Brown, the Celtic cap- Celtic Glasgow, a vu son appel captain, has had his appeal re- the Glasgow Celtic, saw his tain, has had his appeal re- rejeté et sera bien suspendu pour jected and will miss his club’s appeal dismissed and will be jected and will be well sus- les deux prochains matches de next two Champion’s League well suspended for the next pended for his club’s next two Ligue des champions de son club, matches, against Ajax and AC two matches of the champions Champions League matches, contre l’Ajax et l’AC Milan. Milan. League for his club against against Ajax and AC Milan. Ajax and AC Milan. -.021 Les irréductibles du M23, soit The diehards of the M23, The irreducible m23, or a few The irréductibles M23, or quelques centaines de combat- who are several hundreds in hundred fighters, were cut off some hundred fighters, were tants, étaient retranchés à près number, had entrenched them- to nearly 2000 metres above retranchés at nearly 2000 feet de 2000 mètres d’altitude sur selves at an altitude of almost sea level on the hills agricul- of altitude on the agricultural les collines agricoles de Chanzu, 2,000 metres in the farmland tural Chanzu, Runyonyi and hills of Chanzu, Runyonyi Runyonyi et Mbuzi, proches de hills of Chanzu, Runyonyi Mbuzi, near Bunagana and and Mbuzi, close to Buna- Bunagana et Jomba, deux local- and Mbuzi, close to Buna- Jomba, located about 80 km gana and Jomba, two towns ités situées à environ 80 km au gana and Jomba, two towns north of Goma, the capital of located about 80 miles north nord de Goma, la capitale de la located around 80km north of the province of North Kivu. of Goma, the capital of North province du Nord-Kivu. Goma, the capital of North Kivu province. Kivu province. -.021 Il a indiqué que le nouveau tri- He said the new media tri- He said as the new media tri- He said the new media tri- bunal des médias « sera toujours bunal "will always be bi- bunal ’ will be always par- bunal "will always be partial partial car il s’agit d’un prolonge- ased because it’s an exten- tial because it is an exten- because it is a extension of ment du gouvernement » et que sion of the government," and sion of the Government "and the government" and that re- les restrictions relatives au con- that restrictions on content content and advertising restric- strictions relating to content tenu et à la publicité nuiraient à la and advertising would damage tions hurt instead of the Kenya and advertising would hurt place du Kenya dans l’économie Kenya’s place in the global into the world economy. Kenya’s place in the global mondiale. economy. economy. .021 Dans "Les Fous de Benghazi", In "Les Fous de Benghazi", he In "Les Fous de Benghazi", he In "The Facts of Libya," he il avait été le premier à révéler was the first to reveal the ex- was the first to reveal the ex- had been the first to reveal the l’existence d’un centre de com- istence of a secret CIA com- istence of a secret CIA com- existence of a secret CIA com- mandement secret de la CIA dans mand centre in the city, the mand center in this town, cra- mand center in that city, the cette ville, berceau de la révolte cradle of the Libyan revolt. dle of the Libyan revolt. birthplace of the Libyan upris- libyenne. ing. -.020 Le Sénat américain a approuvé The U.S. Senate approved a The US Senate has approved The U.S. Senate approved a un projet pilote de 90 M$ l’année $90-million pilot project last a pilot project of 90 M$ last $90 million pilot project last dernière qui aurait porté sur envi- year that would have involved year which would have cov- year that would have focused ron 10 000 voitures. about 10,000 cars. ered about 10,000 cars. on about 10,000 cars.

Table 18: Top 10 segments by SδBLEU(i,hS,hU )S on FR-EN.

1155 δMACROF1 Source Reference SNMT UNMT .131 David Grimal are o cariera interna- David Grimal has an international David Grimal has an international David Grimal has an international tionala de violonist solo, de-a lun- career as a solo violinist. Dur- career as a solo violinist, along solo violinist, Peter Csaba, Heinrich gul careia a sustinut regulat con- ing the last 20 years he held reg- which claimed concerts regularly Schiff, , Emmanuel certe în ultimii 20 de ani pe princi- ular concerts on the main clas- in the past 20 years in key scenes Krivine, , Mikhail palele scene de muzica clasica ale sical music stages of the world of classical music of the world and Fruyhbeck Orchestra of Burgos and lumii s, i cu orchestre prestigioase with prestigious orchestras such as with prestigious orchestras such Peter Eotvos, Andris Philharmonique cum ar fi , the Orchestre de Paris, Orchestre as the Orchestre de Paris, Or- de Radio France, Russian National Orchestre Philharmonique de Ra- Philharmonique de Radio France, chestre Philharmonique de Radio Orchestra, the French National Or- dio France, Russian National Or- Russian National Orchestra, Or- France, Russian National Orches- chestra of Lyon, the Berliner Sym- chestra, Orchestre National de chestre National de Lyon, Cham- tra, Orchestre National de Lyon, phoniker, the New Japan Philharmonic, Lyon, Chamber Orchestra of Eu- ber Orchestra of Europe, Berliner the Chamber Orchestra of Eu- the Orchestre de l’Opera of Lyon, the rope, Berliner Symphoniker, New Symphoniker, New Japan Phil- rope, the Berliner Symphoniker, Mozarteum Orchestra of Lyon, the Japan Philharmonic, Orchestre de harmonic, Orchestre de l’Opera the New Japan Philharmonic, Or- Jerusalem Symphony Orchestra and l’Opera de Lyon, Mozarteum Or- de Lyon, Salzburg Mozarteum chestre de l’Opera de Lyon The Sinfonia of Paris, under the direction chestra Salzburg, Jerusalem Sym- Orchestra, Jerusalem Symphony Salzburg Mozarteum Orchestra, of conductor like Orchestra: the Na- phony Orchestra s, i Sinfonia Varso- Orchestra and Sinfonia Varsovia the Jerusalem Symphony Orches- tional Orchestra, the National Orches- via, sub conducerea unor diri- under the direction of conduc- tra, and Sinfonia Varsovia, un- tra, the French Chamber Orchestra, the jori precum Christoph Eschen- tors such as Christoph Eschen- der the direction of conductors Berliner Symphoniology Orchestra and bach, , Michael bach, Michel Plasson, Michael such as , Sinfonia Sinfonia and Sinfonia Sin- Schnwandt, Peter Csaba, Hein- Schnwandt, Peter Csaba, Hein- Michel Plasson, Michael Schn- fonia Sinfonia, the Orchestra of the rich Schiff, Lawrence Foster, Em- rich Schiff, Lawrence Foster, Em- wandt, Peter Csaba, Heinrich Paris Symphony Orchestra, the Russian manuel Krivine, Mikhail Pletnev, manuel Krivine, Mikhail Pletnev, Schiff, Lawrence Foster, Em- National Orchestra, the National Or- Rafael Fruhbeck de Burgos s, i Pe- Rafael Fruhbeck de Burgos and manuel Krivine, Mikhail Pletnev, chestra, the National Opera of Europe, ter Eotvos, Andris Nelsons, Chris- Peter Eotvos, Andris Nelsons, Rafael Fruhbeck de Burgos, and the Berliner Symphoniker, the Berliner tian Arming. Christian Arming. Peter Eotvos, Andris Nelsons, Symphoniker Orchestra, the Orchestra Christian Arming. of Lyon, the Mozarteum and Sinfonia Sinfonia Sinfonia. .063 O parte importanta˘ a magazinului A good portion of the store An important part of the store A significant part of the store is devoted este dedicata˘ raionului urias, is given over to the massive is devoted to giant rayon of to the huge frozen fried chicken county, de congelatoare, unde vet,i gasi˘ freezer section, where you’ll find freezers, where you’ll find curry where you will find curry leaves, peach frunze de curry, pepene s, i ghimbir frozen curry leaves, bitter melon leaves, melon and Ginger frozen and frozen ghrelish, whole fish, fish, congelate, rat,e întregi, pes, te, and galangal, whole ducks, fish, whole fish, ducks, beef bile and blood and bone stock, pork carcass, tra- sânge s, i bila˘ de vita,˘ carcase de beef blood and bile, pork cas- blood, carcasses of pork, fish balls, ditional fish carnouts, semi-cooked car- porc, chiftele de pes, te, cârnat,i ings, fish balls, regional sausages, sausages, traditional dishes and nacies and many others. tradit,ionali, semi-preparate s, i commercially-prepared foods and more. multe altele. more. .062 Persoanele interesate vor putea sa˘ Those interested will find out how Persons interested will be able to People interested will be able to learn afle cum sa˘ realizeze creat,ii sculp- to make harmonious sculptural flo- learn how to achieve harmonious how to complete armonious creative turale florale armonioase (ike- ral creations (ikebana), how to sculptural floral design (ikebana), contrasting (grafic, drawing, painting) bana), cum sa˘ surprinda˘ sufle- capture the soul of the surround- how to capture the soul of the sur- Creations, how to capture the hearts of tul elementelor înconjuratoare˘ în ing elements in plastic compo- rounding elements in plastic com- surrounding elements in artistic com- compozit,ii plastice folosind arta sitions using the Japanese ink positions using the Japanese art positions using the Japanese art of japoneza˘ a pictarii˘ în tus, (Sumie) painting art (Sumie) or how to of painting in India ink (Sumie) painting in tus (Sumie) or how to ex- sau cum sa˘ îs, i exprime propria express their own individuality or how to express their own in- press their own individualistic iden- individualitate s, i creativitate prin and creativity through the mul- dividuality and creativity through tity through the plurivalent universe of universul plurivalent al artelor fru- tifaceted universe of Fine Arts the universe plurivalent of fine beautiful arts (grafic, drawing, painting, moase (grafic, desen, pictura)˘ , a (graphic, drawing, painting) said arts (painting, drawing, graphics), painting), said Mr. Mazilu, the profes- declarat Sorin Mazilu, profesorul Sorin Mazilu, the teacher of these said Sorin Mazilu, teacher of such sor of these courses. acestor cursuri. courses. courses. -.052 Umorul îmbrat˘ ,is, arilor˘ frecvente The humour of Joey and Chan- His humor frequently asked Ummaline frequent imbratisations of ale lui Joey s, i Chandler, mo- dler’s frequent hugs, moments îmbrat˘ ,is, arilor˘ of Joey and Chan- Joey and Chandler, the moments when mentele în care se uita˘ la fotbal watching football on the comfy dler, the times when looking at he looks at football in his comfy fo- în fotoliile comode s, i pasiunea lui chairs, and Ross’s pining for football in comfortable armchairs toland Ross’s passion for Rachel came Ross pentru Rachel au venit din Rachel, came from the knowledge and Ross’s passion for Rachel from knowing that men can assimilate faptul ca˘ se s, tie ca˘ barbat˘ ,ii se pot that yes, men can all relate to this, came from the fact that it is known with the situation, even though often asocia cu aceasta˘ situat,ie, chiar even if they often hold back from that men may be associated with they are shy about fully exploring their daca˘ adesea se feresc sa˘ îs, i ex- fully exploring their feelings. this case, even if often keep out to feelings. ploreze complet sentimentele. explore fully the sentiments. -.052 În plus, zilele urmatoare˘ vom face In addition, in the coming days In addition, the coming days we In addition, next week we will make recept,ia lucrarilor˘ de la pasarela we will accept the works from will make the reception of the the receptia work on the pasarela link- care leaga˘ UPU de clinicile de the bridge linking the ER to the works from the footbridge link- ing the UPU to the cardiology, hepatol- cardiologie, hepatologie s, i gas- cardiology, hepatology and gas- ing the UPU of Cardiology clin- ogy and gastroenterology clinicians to troenterologie pentru a asigura troenterology clinics to ensure the ics, Hepatology and gastroenterol- ensure the transport in decent manner transportul în condit,ii decente a proper transfer of patients from ogy to ensure decent conditions of of the patient from the UPU to neigh- bolnavului din UPU în clinicile the ER to the neighbouring clinics transportation of the patient from bouring clinicians, the hospital’s chief vecine, a mai explicat managerul explained the manager of the med- the UPU in neighboring clinics, executive explained. unitat˘ ,ii medicale. ical unit. explained Manager medical unit. -.052 Desi presa araba anunta ca despar- Although Arab media announced Although the Arabic press an- While the Arab media say San- tirea lui Sanmartean de Ittihad e that the separation between San- nounces that the breakup of his martean’s departure from Ittihad is im- iminenta, impresarul antreorului martean and Ittihad is imminent, impending Sanmartean Ittihad e, minent, impressions of coach Ladislau Ladislau Boloni, Arcadie Zaporo- coach Ladislau Boloni’s agent, Ar- impresario, Arcadie antreorului Boloni’s agent Arcadie Zaporojanu say janu, spune ca acest lucru nu se va cadia Zaporojanu, says this will no Baghery Isaac says that this will this will never happen. mai intampla. longer happen. no longer happen. -.05 Cei mai putini sunt cei care au fi- The least numerous are those who Most are few people who have Most are those who completed their de- nalizat doctoratul - 0,67 - s, i tinerii completed the doctorate courses completed their Ph.d.-0.67-and gree - 0.67 - and young people who fin- care au terminat liceul ori scoala - 0.67 - and young people who young people who have completed ished college or high school - 6.20 per- generala - 6,20%. have finished high school or mid- grade school-high school times cent. dle school - 6.20%. 6,20%. -.05 Politica clubului a facut din Vi- The club’s policy made Viitorul The Club’s policy has made the The club’s policy made the Viitorul a itorul o echipa de urmarit pentru a team to be watched by foreign team watched for foreign teams, team to watch for overseas teams, so echipele din strainatate, astfel ca teams, and so Hagi has sold this such as Hagi has sold in this sum- Hagi sold him for £1.5 million - Ianis Hagi a vandut în aceasta vara de summer in the amount of 1.5 mil- mer of 1.5 million euros-Ibarra Hagi (Fiorentina - £1 million) and John 1,5 milioane de euro - Ianis Hagi lion Euros - Ianis Hagi (Fiorentina H (Fiorentina-EUR 1 million) and Mitrita (Bari - £500 million). (Fiorentina - 1 milion de euro) s, i - 1 million euro) and Alexandru Alexander Duane (Bari-500 thou- Alexandru Mitrita (Bari - 500 mii Mitrita (Bari - 500 thousand euro). sand euros). euro). -.049 Au inclus amestecul de pungas, i They included the medley of tax- They included a mixture of tax rev- They included the mixture of unpaid neplatitori˘ de taxe, straini˘ care shy rascals, phone-hacking for- enue, pungas, i foreign fraudau tele- tax pungas, foreigners who telephone fraudau telefonic s, i pseudo eigners and pseudo non-doms who phone and non-pseudo domi who and pseudo non-doms who own most non-domi care det,in majoritatea own most of our great newspapers. hold most major newspapers. of the key newspapers. ziarelor importante. .048 Constantin Brâncus, i se nas, te la 19 Constantin Brancusi was born on Constantin Brâncus, i February 19, Mr. Brancusi was born Feb. 19, 1876, februarie 1876, în Hobit,a, un mic 19 February 1876 in Hobit,a, a was born in 1876, a small Certi- in Hobson, a small village in the town sat din comuna Pes, tis, ani, judet,ul small village from Pes, tis, ani, Gorj fied village in Pes, tis, ani commune, of Peston, Pa., at the foot of the moun- Gorj, la poalele Carpat,ilor. county, at the foot of the Carpathi- Gorj County, at the foot of the tains. ans. Carpathians.

Table 19: Top 10 segments by SδMACROF1 (i,hS,hU )S on RO-EN. 1156 δBLEU Source Reference SNMT UNMT .114 David Grimal are o cariera interna- David Grimal has an international ca- David Grimal has an international ca- David Grimal has an international solo tionala de violonist solo, de-a lun- reer as a solo violinist. During the reer as a solo violinist, along which violinist, Peter Csaba, Heinrich Schiff, gul careia a sustinut regulat concerte last 20 years he held regular con- claimed concerts regularly in the past Lawrence Foster, , în ultimii 20 de ani pe principalele certs on the main classical music 20 years in key scenes of classical Mikhail Pletnev, Mikhail Fruyhbeck Or- scene de muzica clasica ale lumii s, i stages of the world with prestigious music of the world and with pres- chestra of Burgos and Peter Eotvos, An- cu orchestre prestigioase cum ar fi orchestras such as the Orchestre tigious orchestras such as the Or- dris Philharmonique de Radio France, Orchestre de Paris, Orchestre Phil- de Paris, Orchestre Philharmonique chestre de Paris, Orchestre Philhar- Russian National Orchestra, the French harmonique de Radio France, Rus- de Radio France, Russian National monique de Radio France, Russian National Orchestra of Lyon, the Berliner sian National Orchestra, Orchestre Orchestra, Orchestre National de National Orchestra, Orchestre Na- Symphoniker, the New Japan Philhar- National de Lyon, Chamber Orches- Lyon, Chamber Orchestra of Europe, tional de Lyon, the Chamber Or- monic, the Orchestre de l’Opera of Lyon, tra of Europe, Berliner Symphoniker, Berliner Symphoniker, New Japan chestra of Europe, the Berliner Sym- the Mozarteum Orchestra of Lyon, the New Japan Philharmonic, Orchestre Philharmonic, Orchestre de l’Opera phoniker, the New Japan Philhar- Jerusalem Symphony Orchestra and Sinfo- de l’Opera de Lyon, Mozarteum Or- de Lyon, Salzburg Mozarteum Or- monic, Orchestre de l’Opera de Lyon nia of Paris, under the direction of conduc- chestra Salzburg, Jerusalem Sym- chestra, Jerusalem Symphony Or- The Salzburg Mozarteum Orches- tor like Orchestra: the National Orchestra, phony Orchestra s, i Sinfonia Varso- chestra and Sinfonia Varsovia un- tra, the Jerusalem Symphony Or- the National Orchestra, the French Cham- via, sub conducerea unor dirijori pre- der the direction of conductors such chestra, and Sinfonia Varsovia, un- ber Orchestra, the Berliner Symphoniol- cum Christoph Eschenbach, Michel as Christoph Eschenbach, Michel der the direction of conductors such ogy Orchestra and Sinfonia Sinfonia and Plasson, Michael Schnwandt, Peter Plasson, Michael Schnwandt, Peter as Christoph Eschenbach, Michel Sinfonia Sinfonia Sinfonia, the Orchestra Csaba, Heinrich Schiff, Lawrence Csaba, Heinrich Schiff, Lawrence Plasson, Michael Schnwandt, Peter of the Paris Symphony Orchestra, the Rus- Foster, Emmanuel Krivine, Mikhail Foster, Emmanuel Krivine, Mikhail Csaba, Heinrich Schiff, Lawrence sian National Orchestra, the National Or- Pletnev, Rafael Fruhbeck de Bur- Pletnev, Rafael Fruhbeck de Burgos Foster, Emmanuel Krivine, Mikhail chestra, the National Opera of Europe, the gos s, i Peter Eotvos, Andris Nelsons, and Peter Eotvos, Andris Nelsons, Pletnev, Rafael Fruhbeck de Burgos, Berliner Symphoniker, the Berliner Sym- Christian Arming. Christian Arming. and Peter Eotvos, Andris Nelsons, phoniker Orchestra, the Orchestra of Lyon, Christian Arming. the Mozarteum and Sinfonia Sinfonia Sin- fonia. .089 Editia de toamna va avea loc la The autumn edition will take place Autumn edition will take place in The fall season will be in Washington Chisinau (Casa Armatei, 2-3 oc- in Chisinau (Army House 2-3 Octo- Chisinau (Army House, 2-3 Oc- (House, 2-3), Washington (Palas, Oct. 6- tombrie), Iasi (Palas, 6-7 octombrie), ber), Iasi (Palas 6-7 October), Brasov tober), Iasi (Palas, 6-7 October), 7), Brasov (House, Oct. 14-15), San Fran- Brasov (Casa Armatei, 14-15 oc- (Army House 14-15 October), Cluj Brasov (Army House, 14-16 Octo- cisco Global (East Coast, Oct. 20-21), tombrie), Cluj Global (Sala Poli- Global (Polyvalent Hall 20-21 Octo- ber), Global (Sala Polivalenta, 20-21 San Francisco IT (East Coast, Oct. 22- valenta, 20-21 octombrie), Cluj IT ber) Cluj IT (Polyvalent Hall 22-23 October), Cluj (Sala Polivalenta, 22- 23), Sibiu (Center for Business, Oct. 28- (Sala Polivalenta, 22-23 octombrie), October), Sibiu (Business Centre 28- 23 October), Sibiu (Business Cen- 29) and Targu-Mures (the National Center, Sibiu (Centrul de Afaceri, 28-29 oc- 29 October) and Targu-Mures (Na- tre, 28-29 October) and Targu-Mures Nov. 5). tombrie) s, i Targu-Mures (Teatrul Na- tional Theatre, November 5). (National Theatre, November 5). tional, 5 noiembrie). -.072 Realizatorii studiului mai transmit The study’s conductors transmit that The filmmakers may study trans- Reporters also deliver that "Americans ca "romanii simt nevoie de ceva "Romanians feel the need for a little mitted as "Romanians feel in need feel need for some more adventure in their mai multa aventura în viata lor more adventure in their lives (24%), of something more adventure in lives (24%), followed by affection (21%), (24%), urmat de afectiune (21%), followed by affection (21%), money their lives (24%), followed by affec- money (21%), safety (20%), new (19%), bani (21%), siguranta (20%), nou (21%), safety (20%), new things tion (21%), money (21%), security sex (19%), respect 18%, confidence 17%, (19%), sex (19%), respect 18%, in- (19%), sex ( 19%) respect 18%, con- (20%), new (19%), sex (19%), re- pleasure 17%, connection 17%, knowl- credere 17%, placere 17%, conectare fidence 17%, pleasure 17%, connec- spect, trust and 18% 17% 17% plea- edge 14%, protection 14%, importance 17%, cunoastere 16%, protectie 14%, tion 17%, knowledge 16%, protec- sure, 17%, 16%, knowledge protec- 14%, learning 12%, freedom 11%, auto- importanta 14%, invatare 12%, liber- tion 14%, importance 14%, learning tion 14%, 14%, 12%, liberty learning cunoastere 10% and control 7%." tate 11%, autocunoastere 10% s, i con- 12%, freedom 11%, self-awareness 11%, self-awareness and control 10% trol 7%". 10% and control 7% ". 7%". -.045 Fed ar trebui sa˘ puna˘ problema The Fed should put financial stabil- The Fed should put the issue of finan- The Fed should put financial stability first stabilitat˘ ,ii financiare pe primul loc ity concerns first only during a major cial stability in the first place only in only in a major crisis, such as the 2008 doar în cazul unei crize majore, cum crisis, such as the 2008 market melt- the event of major crises such as the credit crunch, said Adam S. Posen, a for- a fost cutremurul de pe piat,a˘ din down, said Adam S. Posen, a former earthquake in the market since 2008, mer member of the Bank of England’s sta- 2008, a declarat Adam S. Posen, fost member of the Bank of England’s "said Adam s. Posen, a former mem- bilisation committee. membru al comisiei de stabilire a rate-setting committee. ber of the Commission for determin- ratei dobânzii din cadrul Bank of ing the interest rate within the Bank England. of England. -.041 Atunci când sust,inatorilor˘ lui Clinton When Clinton’s supporters are asked When Clinton supporters are asked When Clinton supporters are asking an li se adreseaza˘ o întrebare deschisa˘ in an open-ended question why they to answer an open question regarding open question as to why she would like to referitor la motivul pentru care ar want her to be the nominee, the top why you want her to win the race, the win the race, the majoritar answer is that dori ca ea sa˘ câs, tige cursa, raspunsul˘ answer is that she has the right expe- answer is that majority holds appro- she holds the right experience (16 percent), majoritar este ca˘ det,ine experient,a rience (16 percent), followed by it’s priate experience (16%), followed by followed by the fact that it has come time adecvata˘ (16%), urmat de faptul ca˘ time for a woman president (13 per- the fact that the time has come for for a woman to be president (13 percent) a venit momentul ca o femeie sa˘ fie cent), and that she is the best candi- a woman to be President (13%), and and that she is the best candidate for the pres, edinte (13%) s, i ca˘ este cel mai date for the job (10 percent). that is the best candidate for this po- job (10 percent). bun candidat pentru aceasta˘ funct,ie sition (10%). (10%). -.040 Clinton este acum sust,inuta˘ de 47% Clinton now has the backing of 47 Clinton is now supported by 47% of Clinton is now backed by 47 percent of din alegatorii˘ democrat,i (în scadere˘ percent of Democratic primary vot- the voters Democrats (down 62%), Democratic voters (down from 58 per- de la 58%), în timp ce Sanders este ers (down from 58 percent), while while Sanders is in second place with cent), while Sanders is second with 27 per- pe locul doi, cu 27% (în urcare fat,a˘ Sanders comes in second, with 27 27% (in relation to climb 17%). cent (up from 17 percent). de 17%). percent (up from 17 percent). .039 Persoanele interesate vor putea Those interested will find out how Persons interested will be able to People interested will be able to learn how sa˘ afle cum sa˘ realizeze creat,ii to make harmonious sculptural flo- learn how to achieve harmonious to complete armonious creative contrast- sculpturale florale armonioase ral creations (ikebana), how to cap- sculptural floral design (ikebana), ing (grafic, drawing, painting) Creations, (ikebana), cum sa˘ surprinda˘ sufle- ture the soul of the surrounding ele- how to capture the soul of the sur- how to capture the hearts of surrounding tul elementelor înconjuratoare˘ în ments in plastic compositions using rounding elements in plastic com- elements in artistic compositions using the compozit,ii plastice folosind arta the Japanese ink painting art (Sumie) positions using the Japanese art of Japanese art of painting in tus (Sumie) or japoneza˘ a pictarii˘ în tus, (Sumie) or how to express their own indi- painting in India ink (Sumie) or how to express their own individualistic sau cum sa˘ îs, i exprime propria viduality and creativity through the how to express their own individu- identity through the plurivalent universe individualitate s, i creativitate prin multifaceted universe of Fine Arts ality and creativity through the uni- of beautiful arts (grafic, drawing, painting, universul plurivalent al artelor fru- (graphic, drawing, painting) said verse plurivalent of fine arts (paint- painting), said Mr. Mazilu, the professor moase (grafic, desen, pictura)˘ , a Sorin Mazilu, the teacher of these ing, drawing, graphics), said Sorin of these courses. declarat Sorin Mazilu, profesorul courses. Mazilu, teacher of such courses. acestor cursuri. .036 Companiile au vacante cateva mii Companies have thousands of vacant Companies have thousands of vacant Companies have filled several thou- de locuri de munca, oportunitati jobs, internship opportunities and in- jobs, internship opportunities, and in- sand jobs, internships and training de internship s, i stagii de prac- ternships, and some of them are al- ternships, and some of them are al- trips, and a number of them are al- tica, iar o parte dintre ele sunt ready posted on the official website ready announced on the official web- ready announced on the official website deja anuntate pe site-ul oficial www.targuldecariere.ro. site www.targuldecariere.ro. www.targuldecarier.com. www.targuldecariere.ro. -.035 Antrenorul celor de la Brisbane Brisbane Broncos coach Wayne Ben- The coach of the Brisbane Broncos, Brisbane Broncos coach Wayne Bennett Broncos, Wayne Bennett, a facut˘ o nett made a thinly veiled reference Wayne Bennett, made a slight refer- made a slight reference to the Storm af- us, oara˘ referire la Storm dupa˘ victo- to the Storm after his side’s qualify- ence to the Storm after his team’s ter their team’s victory in the qualifying ria echipei sale în meciul de califi- ing final win over North Queensland victory in the qualifying match with match against the North Queensland Cow- care cu North Queensland Cowboys, Cowboys on Saturday night when North Queensland Cowboys, played boys on Saturday night when he called jucat sâmbat˘ a˘ seara,˘ când a numit he called that game a "showcase" of Saturday night, when he called the the game a "demonstration" of the rugby meciul respectiv o „demonstrat,ie” a the rugby league and said the two match a "demonstration" of rugby league league league and said the two ligii de rugby s, i a declarat ca˘ cele Queensland weren’t "too big" into league and said that the two teams Queensland sides were not really pricep to doua˘ echipe din Queensland nu se wrestling. from Queensland is not too good at wrestling. prea pricep la wrestling. wrestling. .034 David Grimal recunoaste ca Les Dis- David Grimal admits that Les Dis- David Grimal admits that Les Dis- David Grimal admits Les Dissonaces rep- sonaces reprezinta "un lux", pre- sonaces is "a luxury", adding that sonaces represents "a luxury", stat- resent "a luxury," saying none of the musi- cizand ca niciunul dintre muzicienii none of the musicians who are part ing that none of the musicians who cians who make up the orchestra depinde care fac parte din orchestra nu de- of the orchestra does not depend on are part of the orchestra does not de- of the success of this project to live. pinde de succesul acestui proiect pen- the success of this project to make a pend on the success of this project for tru a trai. living. living.

Table 20: Top 10 segments by SδBLEU(i,hS,hU )S on RO-EN. 1157