arXiv:1901.09115v1 [cs.CL] 25 Jan 2019

Context in Neural Machine Translation: A Review of Models and Evaluations

Andrei Popescu-Belis
School of Management and Engineering Vaud (HEIG-VD)
University of Applied Sciences of Western Switzerland (HES-SO)
Route de Cheseaux 1, CP 521
1401 Yverdon-les-Bains, Switzerland
[email protected]

January 29, 2019

Abstract

This review paper discusses how context has been used in neural machine translation (NMT) in the past two years (2017-2018). Starting with a brief retrospect on the rapid evolution of NMT models, the paper then reviews studies that evaluate NMT output from various perspectives, with emphasis on those analyzing limitations of the translation of contextual phenomena. In a subsequent version, the paper will then present the main methods that were proposed to leverage context for improving translation quality, and distinguishes methods that aim to improve the translation of specific phenomena from those that consider a wider unstructured context.

1 Looking Back on the Past Two Years

Neural network architectures have become mainstream for machine translation (MT) in the past three years (2016-2018). This paradigm shift took a considerably shorter time than the previous one, from rule-based to phrase-based statistical MT models. Neural machine translation (NMT) was adopted thanks to its superior performance, and despite its higher computational cost (which has been mitigated by optimized hardware and software) and its need for very large training datasets (which has been addressed through back-translation of monolingual data and character-level back-off). The NMT revolution is apparent in the burst of scientific publications related to NMT since 2017, as well as in the increased attention MT receives from the general media, related to the visible improvements in the quality of online MT systems.

While much remains to be done, especially for low-resource language pairs or specific domains, the quality of English-French or German-English news translation has reached unprecedented levels, leading in the most favorable cases to claims that it achieves human parity. A remaining bottleneck, however, is the capacity to leverage contextual features when translating entire texts. Taking textual context into account is especially vital for correct translation.1 When it comes to the capacity to model context, translation may improve certain translation decisions, e.g. by favoring a lexical choice thanks to better document-level topical information, or by constraining pronoun choice thanks to knowledge about antecedents.

1 In this review, 'context' refers to the sentences of a document being translated, and not to extra-textual context such as images. Multimodal MT is an active research problem, but is outside our present scope.

This review paper puts into perspective the significant amount of studies devoted in 2017 and 2018 to improving and measuring the use of context in NMT. We start with a brief recap of the mainstream neural models and toolkits that have revolutionized MT (Section 2). We then organize our perspective based on the observation that most MT studies design and implement models, run them on evaluation data, and apply metrics to obtain scores, i.e. Results = Models + Data + Metrics. Novelty claims are generally about one or more of these terms, claiming improved results of the left-hand side in comparison to previous ones.

Existing MT models can be tested on new metrics and/or datasets, to highlight previously unobserved properties of these models. Therefore, in Section 3 we review evaluation studies of NMT, which either apply existing metrics (going beyond n-gram matching) or devise new ones. We discuss these studies by increasing order of complexity of the evaluated aspects: first grammatical ones, and then discourse-level ones, including word sense disambiguation (WSD) and pronoun translation. Most often, however, new models are tested on existing data and metrics, to enable controlled comparisons with competing models. In an upcoming version of this paper, we will discuss new NMT models that extend the context span considered during translation. We will distinguish those that use unstructured text spans from those that perform structured analyses requiring context, in particular lexical disambiguation and anaphora resolution.

2 Neural MT Models and Toolkits

2.1 Mainstream Models

Early attempts to use neural networks in MT aimed to replace n-gram language models with neural network ones (Bengio et al., 2003; Schwenk et al., 2006). Later, feed-forward neural networks were used to enhance phrase-based systems by rescoring the translation probability of phrases (Devlin et al., 2014). Variable-length input was accommodated by using recurrent neural networks (RNNs), which offered a principled way to represent sequences thanks to hidden states. One of the first "continuous" models, i.e. not using explicit memories of aligned phrases, was proposed by Kalchbrenner and Blunsom (2013), with RNNs for the target language model, and a convolutional source sentence model (or an n-gram one). To address the vanishing gradient problem with RNNs, long short-term memory (LSTM) units (Hochreiter and Schmidhuber, 1997) were used in sequence-to-sequence models (Sutskever et al., 2014), and further simplified as gated recurrent units (GRU) (Cho et al., 2014; Chung et al., 2014). Such units allowed the networks to capture longer-term dependencies between words thanks to specialized gates enabling them to remember vs. forget past inputs.

Such sequence-to-sequence models were applied to MT with an encoder and a decoder RNN (Cho et al., 2014), but had serious difficulties in representing long sentences as a single vector (Pouget-Abadie et al., 2014), although using bi-directional RNNs and concatenating their representations for each word could partly address this limitation. The key innovation, however, was the attention mechanism introduced by Bahdanau et al. (2015), which allows the decoder to select at each step which part of the source sentence is more useful to consider for predicting the next word.2 Attention is a context vector - a weighted sum over all hidden states of the encoder - that can be seen as modeling the alignment between input and output positions. The efficiency of the model was further improved, with small effects on translation quality (Luong et al., 2015; Wiseman and Rush, 2016). The proposal for distinguishing local vs. global attention models by Luong et al. (2015) has yet to be incorporated in mainstream models.

The demonstration that NMT with attention-based encoder-decoder RNNs outperformed phrase-based SMT came at the 2016 news translation task of the WMT evaluations (Bojar et al., 2016). The system presented by the University of Edinburgh (Sennrich et al., 2016c) obtained the highest ranking thanks particularly to two additional improvements of the generic model. The first one was to use back-translation of monolingual target data from a state-of-the-art phrase-based SMT engine to increase the amount of parallel data available for training (Sennrich et al., 2016a). The second one was to use byte-pair encoding, allowing translation of character n-grams and thus overcoming the limited vocabulary of the encoder and decoder embeddings (Sennrich et al., 2016b). Low-level linguistic labels were shown to bring small additional benefits to translation quality (Sennrich and Haddow, 2016). The Edinburgh system was soon afterward open-sourced under the name of Nematus (Junczys-Dowmunt et al., 2016).

Research and commercial MT systems alike were quick to adopt NMT, starting with the best-resourced language pairs, such as English vs. other European languages and Chinese. Around the end of 2016, online MT offered by Bing, DeepL, or Systran was powered by deeper and deeper RNNs (as far as information is available). In the case of DeepL, although little information about the systems is published, its visible quality3 could be partly explained by the use of the high-quality Linguee parallel data.

An interesting development has been the claims for "bridging the gap between human and machine translation" from the Google NMT team in September 2016 on EN/FR and EN/DE (Wu et al., 2016), and for "achieving human parity on ... news translation" from the Microsoft NMT team in March 2018 on EN/ZH (Hassan et al., 2018). These claims have raised attention from the media, but have also been disputed by deeper evaluations (see Section 3.2).

RNNs with attention allow top performance to be reached, but at the price of a large computational cost. For instance, the largest Google NMT system from 2016 (Wu et al., 2016), with its 8 encoder and decoder layers of 1,024 LSTM nodes each, required training on 96 nVidia K80 GPUs for 6 days, in spite of massive parallelization (e.g. running each layer on a separate GPU). A more promising approach to decrease computational complexity is the use of convolutional neural networks for sequence-to-sequence modeling, as proposed by Gehring et al. (2017) in the ConvS2S model from Facebook AI Research. This model outperformed Wu et al.'s system on WMT 2014 EN/DE and EN/FR translation "at an order of magnitude faster speed, both on GPU and CPU". Posted in May 2017, the model was outperformed the next month by the Transformer (Vaswani et al., 2017).

The Transformer NMT model (Vaswani et al., 2017) removes sequential dependencies (recurrence) in the encoder and decoder networks, as well as the need for convolutions, and makes use of self-attention networks.
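The attention mechanism described above - score each encoder hidden state against the current decoder state, normalize the scores with a softmax, and take the weighted sum of encoder states as the context vector - can be sketched in a few lines. The following is an illustrative pure-Python sketch with made-up toy parameter matrices; it is not the implementation of any of the systems reviewed here:

```python
import math

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def additive_attention(dec_state, enc_states, Wd, We, v):
    """Bahdanau-style additive attention: score each encoder hidden state
    against the decoder state with a small feed-forward network, normalize
    the scores with a softmax, and return the context vector (a weighted
    sum of encoder states) together with the attention weights."""
    scores = []
    for h in enc_states:
        u = [math.tanh(a + b) for a, b in zip(matvec(Wd, dec_state), matvec(We, h))]
        scores.append(sum(vi * ui for vi, ui in zip(v, u)))
    m = max(scores)                                  # stabilize the softmax
    exp = [math.exp(s - m) for s in scores]
    z = sum(exp)
    weights = [e / z for e in exp]                   # one weight per source position
    context = [sum(w * h[j] for w, h in zip(weights, enc_states))
               for j in range(len(enc_states[0]))]   # weighted sum of encoder states
    return context, weights

# Toy example: three 2-dimensional encoder states, identity parameter matrices.
context, weights = additive_attention(
    dec_state=[0.5, -0.5],
    enc_states=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    Wd=[[1.0, 0.0], [0.0, 1.0]],
    We=[[1.0, 0.0], [0.0, 1.0]],
    v=[1.0, 1.0])
```

The weights form a distribution over source positions, which is why they can be read as a soft alignment between input and output positions.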

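The self-attention at the core of the Transformer can likewise be sketched. The toy function below implements single-head scaled dot-product self-attention over a sequence of vectors; for brevity the learned query/key/value projection matrices are taken to be the identity, which is a simplifying assumption made here, not how the actual model is parameterized:

```python
import math

def self_attention(X):
    """Single-head scaled dot-product self-attention over a sequence X of
    d-dimensional vectors. Each position attends to every position of the
    same sequence; queries, keys and values are all X itself (identity
    projections, a simplification for illustration)."""
    d = len(X[0])
    out = []
    for q in X:
        # Dot-product scores against every position, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(scores)
        e = [math.exp(s - m) for s in scores]
        z = sum(e)
        w = [x / z for x in e]                       # softmax over positions
        out.append([sum(wi * v[j] for wi, v in zip(w, X)) for j in range(d)])
    return out

# Toy example: a two-position sequence of 2-dimensional vectors.
out = self_attention([[1.0, 0.0], [0.0, 1.0]])
```

Because every position is computed independently of the others, all positions can be processed in parallel, which is the source of the training speed-up discussed below.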
2This paper was first posted on Arxiv in September 2014, while 3See https://www.deepl.com/en/quality.html and the one by Cho et al. (2014) was posted in June of the same year. the beginning of Section 3.2 for an estimate. for positional encoding. For instance, the encoder is • DL4MT: Deep Learning for Machine Translation. composed of six pairs of 512-dimensional layers; in https://github.com/nyu-dl/dl4mt-tutorial. each pair, the first layer implements multi-head self- • FairSeq: Facebook AI Research attention, while the second is a fully-connected feed- Sequence-to-Sequence Toolkit. forward layer. In the decoder, an additional layer in https://github.com/pytorch/fairseq. each pair implements multi-head attention over the en- coder’s output. As a result, training on GPUs can be • Marian: Fast Neural Machine Translation fully parallelized, thus substantially reducing training in C++ (Junczys-Dowmunt et al., 2018). time, and slightly outperforms RNN models. https://marian-nmt.github.io/. For these reasons, the Transformer was quickly • Nematus: Open-Source Neural Machine Trans- adopted by the research community: it was used by lation in TensorFlow (Sennrich et al., 2017). virtually all systems for the WMT 2018 news task https://github.com/EdinburghNLP/nematus. (Bojar et al., 2018). The model is now implemented • Neural Monkey (Helcl and Libovicky`, 2017). in most NMT toolkits (see next section). While the https://github.com/ufal/neuralmonkey. Transformer remains the state-of-the-art at the time of writing, several of its authors have shown that RNN ar- • OpenNMT (Klein et al., 2017). chitectures could be improved beyond the Transformer http://opennmt.net/. using some of its insights, and that hybrid architectures • Sockeye (Hieber et al., 2017) based on RNN, CNN and the Transformer pushed the https://github.com/awslabs/sockeye. scores on WMT’14 EN/DE and EN/FR datasets even Tensor2Tensor (Vaswani et al., 2018). 
higher (Chen et al., 2018).4 A deeper attention model • https://github.com/tensorflow/tensor2tensor. for MT has been presented at the end of 2018, filtering attention from lower to higher levels over five layers 3 How Good is NMT? Fine-grained (Zhang et al., 2018), with encouraging results. The findings of the WMT 2018 news translation task Evaluation Studies (Bojar et al., 2018) confirmed the merits of the Trans- Most proponents of novel NMT models evaluate former, though certain improvements in the architec- them using the BLEU metric (Papineni et al., 2002) ture allowed a late-coming submission from Facebook on parallel data from the WMT conferences (e.g. (Ott et al., 2018) to be ranked first on EN/DE. This http://statmt.org/wmt18). While this is a system trained the Transformer using reduced numeric fairly common and accepted procedure, the signifi- precision, thus accelerating “training by nearly 5x on cance6 of small increases in BLEU is not entirely clear, a single 8-GPU machine”. Results were further im- especially in terms of perceived quality, given the mul- proved by the team using advances in back-translation tiplicity of quality aspects that are actually relevant to to generate synthetic source sentences (Edunov et al., users (Hovy et al., 2002). Human rating of quality, 2018), with training sets reaching hundreds of millions e.g. through direct assessment (Graham et al., 2017), is of sentences; the system also achieved state-of-the-art generally carried out only yearly, at dedicated evalua- performance on WMT ’14 EN/DE test sets. tion campaigns such as WMT or IWSLT, often without 2.2 NMT Toolkits delving into specific quality attributes any further. Therefore, a rich set of evaluation studies have at- The above NMT models are often available as open- tempted to shed light on the various improvements source implementations in MT toolkits, which are built brought by NMT, often compared to SMT. 
These stud- upon general-purpose machine learning frameworks ies applied existing metrics, or devised new ones, using supporting neural networks. Machine learning frame- new or existing data sets, to assess fine-grained quality works are evolving at a rapid pace and so do NMT im- aspects of NMT output from various systems. 5 plementations. Most NMT toolkits are now built over The studies presented in this section evaluate exist- TensorFlow and Torch (Lua or Python), though others ing NMT systems, without propose new techniques ad- are built from scratch or over other frameworks. dressing the observed limitations (such proposals are The NMT toolkits most frequently used in research discussed in Section 4 below)). We organize this sec- studies, including submissions to shared tasks, are the tion according to the linguistic complexity of the qual- following ones (in alphabetical order). ity aspects (or attributes) of NMT output, from words 4Furthermore, inspiration from the Transformer can be even to texts.After a preliminary discussion of two studies found in the BERT language modeling technique (Bidirectional En- using BLEU in various conditions, we consider evalu- coder Representations from Transformers), also from Google, which ations of morphology, the lexicon, verb phrases, and reached new state-of-the-art results on 11 NLP benchmarks, includ- word order (Section 3.2). When then turn to eval- ing question answering or inference (Devlin et al., 2018). 5Among the most recent changes in the ML ecosystem, one can uations of contextual factors, from semantic proper- cite the merger of Caffe into PyTorch, the growth of PyTorch itself in ties including word sense disambiguation and lexical comparison to the earlier LuaTorch, the end of Theano development by the University of Montreal, and the integration of Keras into the 6‘Significance’ meaning here ‘importance’ or ‘relevance’, and not core of TensorFlow. 
statistical significance, which is often duly tested (Koehn, 2004). choice (3.3.1), to discourse-related or document-level but lexical errors were not further subcategorized (see quality aspects, in particular the translation of pro- Section 3.3.1 for such attempts). nouns (3.3.2).7 The comparisons showed that NMT made about 20% fewer lexical or morphological mistakes than 3.1 Studies Using BLEU SMT, and up to 50% fewer word-order errors (espe- Claims about performance can be thoroughly analyzed cially on verb placement, which is essential in Ger- even when BLEU is used as a metric. Recently, man), thus demonstrating better flexibility than SMT Toral et al. (2018) examined again the claim for human for language pairs with different word orders. How- parity on EN/ZH translation from the Microsoft NMT ever, NMT sometimes failed to translate all source team (Hassan et al., 2018) by inspecting more closely words, such as negations, which is detrimental to ad- the test sets. One finding is that a significant portion equacy and difficult to spot by the user. The authors of the data was originally written in English, so the also found that the Translation Error Rate (TER) of system’s Chinese source was “translationese”, i.e. in- all systems increased similarly with sentence length, fluenced by the target language.8 If evaluation is re- and NMT outperformed PBSMT, though by a smaller stricted to original English source sentences, then hu- margin on longer sentences. An extended version of man parity is not reached. Moreover, many reference the study (Bentivogli et al., 2018) added an analysis of have fluency problems, contain grammati- IWSLT EN/FR data which confirmed the above con- cal errors or mistranslated nouns. The authors also con- clusions, and found that NMT had a better capacity to firm the finding of Läubli et al. 
(2018) that professional reorder nouns in EN/FR translation than PBSMT, but human judges, who also have higher inter-rater agree- made more errors on proper nouns. ment, still find a gap between human translations and Popovic´ (2017) performed error analysis on 267 NMT. EN/DE and 204 DE/EN sentences from WMT 2016 Koehn and Knowles (2017) analyze evaluation re- News Test, and comparedthe output of an NMT system sults of NMT and PBSMT with the BLEU metric and (Sennrich et al., 2016c) with one from PBSMT (based make several observations: the quality of NMT de- on Moses), both obtained from WMT submissions. A creases quickly out of the training domain, and with variety of grammatical aspects were evaluated, show- long sentences; the attention model is not necessarily ing that morphology (particularly word forms), English a true alignment model, and the beam search leads to noun collocations, word order, and fluency are better acceptable results for narrow beams only. The authors for NMT than PBSMT. Still, the tested RNN-based infer six challenges for NMT, but discourse-level met- NMT system had difficulties with polysemous English rics point to additional challenges, in particular related source words, and with English continuous verb tenses to the use of context (see Section 3.3). (on the target side). In an extended version, Popovic´ (2018) confirmed these conclusions, and added analy- 3.2 Grammatical and Lexical Qualities of NMT ses of English-Serbian translation on 267 sentences. Shortly after the NMT approach became state-of-the- In an early comparison of PBSMT to NMT, art, several finer-grained evaluations than those based Castilho et al. (2017b) required professional transla- on BLEU were applied to it. These studies differ tors to post-edit MT output, namely 100 English sen- widely in the granularity of error classifications, and tences from MOOCs, translated into German, Por- in how error metrics are applied and on what data, as tuguese, Russian and Greek. 
Translators ranked out- we now discuss. puts from the two systems, and counted the time and 3.2.1 Human and Automatic Ratings number of operations used for post-editing; fluency and adequacy were also rated. Specific error annota- One of the first detailed analyses of the output of NMT tion was performed as well, dividing errors into sev- (encoder-decoder RNNs with attention) in comparison eral classes: inflectional morphology, word order, and with SMT was presented by Bentivogli et al. (2016). omission / addition / mistranslation. NMT globally out- Using high-quality post-edits by professional transla- performed PBSMT on these metrics, except for omis- tors on system outputs from the IWSLT EN/DE 2015 sion and mistranslation. It also outperformed PBSMT task, errors were automatically detected and classified on post-editing time, as NMT errors were more difficult according to several categories: morphological errors to grasp, although fewer sentences needed correction. (correct lemma but wrong form), lexical errors (wrong These findings were confirmed in a subsequent article lemma), and word order errors. The latter type was fur- (Castilho et al., 2017a) which added two additional use ther analyzed according to POS and dependency tags; cases beyond MOOCs: EN/DE translation of product 7Studies of other system qualities such as efficiency, adaptabil- ads, and ZH/EN patent translation. NMT thus appeared ity, or usability have been comparatively less frequent and are not superior in fluency, but superiority in adequacy or post- included here. Properties such as the ability to handle multilingual – editing effort was not observed. The use of NMT as an as opposed to bilingual – models and to perform zero-shot translation have been examined, e.g. by Lakew et al. (2018).) assistance tool for professional translators appeared as 8This is additional empirical evidence for the need of properly uncertain.9 constructed directional corpora, e.g. 
extracted from Europarl with additional speaker information (Cartoni and Meyer, 2012). 9In the meanwhile, the switch of virtually all online MT offerings Given the multiplicity of translation direc- target sentences, then it can also find the correct trans- tions and domains that can be tested, it may be lation using beam search when only the source is given. of no surprise that other studies followed suit. Sennrich (2017) designed LingEval97, a test set Toral and Sánchez-Cartagena (2017) evaluated PB- of 97k contrastive pairs, built from reference EN/DE SMT and NMT submissions to WMT 2016 for 9 translations from WMT. A reference translation can be translation directions (EN to/from CS, DE, FI, RO, modified in five different ways to generate an incorrect RU, except FI/EN) and confirmed that NMT is more counterpart, using editing rules to automate the process fluent (measured with an edit distance) and has better for a large set. Incorrect sentences are generated by inflected forms, but struggles with sentences longer (1) changing the gender of a singular determiner; (2) than 40 words. In a larger journal article submitted changing the number of a verb; (3) changing a verb in August 2017, Klubickaˇ et al. (2018) apply a mul- particle; (4) changing aspects of sentence polarity, e.g. tidimensional quality metric (MQM) and study the inserting or deleting a negation particle; (5) swapping statistical significance of differences between MT characters in unseen names. The main findings are that systems, for English to Croatian, a morphologically character-based NMT systems (RNN-based) are bet- rich language. MQM is applied by two human raters ter than byte-pair encoding ones on type 5 errors, but over 100 sentences, with outputs from 3 systems, with worse on types 1, 2 and 3. 
As for polarity, while spu- a large taxonomy of error types, such as word order, rious insertions of negations are well detected by all agreement, spelling, along with omission and mis- studied systems, spurious deletions are less well de- translation. The authors found that their best system tected, echoing he fact that negations are sometimes (NMT Nematus) reduced the error of their weakest one omitted in NMT output. (PBSMT Moses) by about 50%, and was especially LingEval97 was recently reused in a compari- better for long-distance grammatical agreement. son of RNN, CNN and Transformer architectures by Burchardt et al. (2017) created a large test suite of Tang et al. (2018a), along with ContraWSD set pre- around 5000 EN/DE segments to evaluate MT out- sented below for a semantic evaluation. Performance put for 120 phenomena grouped in 15 categories (e.g. on LingEval97 appeared to be quite similar across ar- ‘ambiguity’, ‘function words’, or ‘long-distance depen- chitectures, with RNNs being particularly competitive dency’). They used about 800 items for a compari- for modeling long-distance agreement between sub- son of 7 MT systems (rule-based, PBSMT, or neural), jects and verbs (in fact, detecting wrong agreements). and reached somehow surprising conclusions, likely Isabelle et al. (2017) proposed a linguistically- because scores were micro-averaged across categories motivated test suite or more exactly a challenge set, as of test sentences with very different sizes (e.g. 529 out the sentences are not accompanied by a reference trans- of 777 test verb tense / aspect / mood). Had macro- lation – instead, human judges are required to evaluate averaging been used, the likely winner would have whether each challenge sentence was translated cor- been the Google NMT system (Wu et al., 2016), which rectly or not. The application cost remains moderate performed best on most error categories. 
due to the small amount of sentences: 108 for EN/FR Another error taxonomy was proposed by translation. The sentences are divided into three cat- Esperanca-Rodier et al. (2017) and was applied to egories: morpho-syntactic (including agreement and PBSMT and NMT output over the BTEC-corpus, to subjunctive mood), lexico-syntactic (including double- compare reference-based metrics with explicit error object and manner-of-movement verbs), and syntactic annotation and study the translators’ perception of the (e.g. yes-no and tag questions, and placement of clitic output. Again, NMT outperformed PBSMT, albeit pronouns). At the end of 2016, the challenge set was slightly. Similarly, Brussel et al. (2018) compared applied to PBSMT and NMT (Sennrich et al., 2016c; the outputs of online systems on EN/NL translation. Wu et al., 2016). Later on, it was also applied to the NMT was found to be particularly fluent, although online DeepL Translator10 showing a 50% error reduc- omissions remained a problem, and made fewer WSD tion with respect to the best NMT system of the initial errors but more mistranslations, which may be harder article. to post-edit. 3.3 Evaluation of Semantic and Discourse 3.2.2 Contrastive Pairs and Challenge Sets Phenomena in NMT Output Evaluation methods based on contrastive pairs require Categorizing MT errors as ‘semantic’ or ‘discourse’ is access to the probability estimates of pairs of source not always clear-cut, as it often involves an hypothesis and target sentences from the evaluated system. These on the cause of an error. For instance, is outputting a probabilities are easy to obtain from NMT systems that wrong pronoun a morpho-syntactic or a discourse er- are not used as black boxes, but impossible to get from ror? If only its gender is wrong, then this may be at- online systems. 
Moreover, these methods do not guar- tributed to ignorance of its antecedent, whereas if the antee that if a systems correctly scores two candidate case is wrong (subject vs. object), then the error can be considered as grammatical. In this section, we present to NMT tends to indicate a consensus on the advantages of NMT for web translation. 10https://medium.com/@pisabell/ NMT evaluation studies focusing on errors that can be Therefore, non-WSD lexical errors may violate cohe- attributed to insufficient knowledge or modeling of se- sion, coherence, sense frequency distributions, collo- mantic and discourse properties, which often require cations, terminology, or grammatical constraints. considering a context made of multiple sentences.11 For instance, the test set designed by Bawden et al. We group studies into three categories: those deal- (2018) for ambiguous source words (like wf above) ing with lexical choice (including WSD and lexical equates WSD errors to coherence errors, because they coherence), those dealing with referential phenomena generally make the output incoherent. Conversely, (anaphora and coreference), and finally those dealing non-WSD errors are equated to cohesion errors, al- with discourse structure, though no study among the though the authors concede that “these types are not latter deals with NMT. mutually exclusive and the distinction is not always so clear.” While cohesion errors per se can in principle 3.3.1 Evaluation of Lexical Choice: WSD and be counted automatically (Wong and Kit, 2012), WSD non-WSD Errors errors, as well as non-WSD errors not related to cohe- Notations. Word ambiguity is often cited as an ob- sion (e.g. due to collocations or terminology) are more vious difficulty for translation. In reality, “ambigu- difficult to spot without human intervention. One solu- ity” is a complex notion, and we will focus in this tion is the recent trend – though not without remote an- section on content words (open class). 
Let us sup- cestors (King and Falkedal, 1990) – to use contrastive pose ideally that a word w may convey one or more pairs containing ambiguous source words, which we language-independent senses s1,s2,..., as listed for discuss hereafter. instance in WordNet, and that a given occurrence of w conveys only one sense at a time. Let us now con- Test suites with contrastive pairs. Based on the sider independently a word wf in the source language, same principle as Lingeval97 (Sennrich, 2017) men- e e and three words w1 to w3 in the target language, with tioned above, ContraWSD is a set of contrastive pairs f e the following senses: w : {s1,s2,s3}, w1 : {s1,s2}, intended to evaluate the capacity of an MT system e e w2 : {s2,s3}, w3 : {s3,s5}. to translate the correct sense of a polysemous word e e in context (Rios Gonzales et al., 2017). About 80 All these words except w2 and w3 can be said to be ambiguous, as they may convey more than one sense. word senses were selected automatically, by observing However, for translation, only the ambiguity of wf ac- target-side variation, for each of the DE/EN and FR/EN 13 tually matters. If an occurrence of wf conveys sense pairs. For each sense, 90 sentences are available on e average, and for each reference translation, an average s1 but is translated with w2 (which cannot convey this sense), this is called a word sense disambiguation of 3.5 and 2.2 wrong translations are generated by re- e placing the target word with other observed translations (WSD) error, regardless of how w2 was actually found, i.e. whether or not WSD was explicitly performed on of the word. As with Lingeval97,a system can be tested the source side. If, however, the occurrence of wf con- with ContraWSD only if it can output the probability e e of a source/target sentence pair, which excludes black veys sense s2, then both w1 and w2 can be used, in principle. 
Then, regardless of what a reference trans- box systems, and a good answer is a case where the e e system ranks a correct translation higher than an incor- lation may contain, using one of w1 or w2 cannot be a WSD error.12 rect one, given the source. The authors found that a This representation does not account for the addi- baseline NMT system (Nematus, Sennrich et al., 2017) tional constraints that may distinguish between trans- reached about 70% accuracy, compared to 93–95% for e e f a human. The sense-aware systems proposed in their lations by w1 and w2 when w conveys sense s2, and which may lead to non-WSD lexical errors. For in- study remained in the same accuracy range on average (see below), but scored higher when disambiguating stance, it may happen that s2 is an infrequent sense of e e e frequent words. w1, or if they are verbs, w1 and w2 may have different sub-categorization frames. If a previous occurrence of This approach was pursued and proposed as a sup- f e plementary test suite at WMT18 (Rios et al., 2018), w was translated by w2, it may be the case that word repetition is necessary for cohesion, or for understand- where it was formulated as a classic translation task, ing a repeated reference, ruling out a subsequent trans- with a test set of 3,249 DE/EN sentence pairs (from e several corpora on OPUS) which contained ambiguous lation by w1. Other constraints may come from col- locations (MWEs) or terminology. Some of these fac- German words identified in ContraWSD. The scoring tors are mere preferences, while others are strong con- is automatic in most cases, by observing the presence straints, leading to genuine mistakes if not respected. of a known correct vs. incorrect translation of each pol- ysemous source word. All systems submitted to the 11This contrasts with the local view of context adopted e.g. 
This approach was pursued and proposed as a supplementary test suite at WMT18 (Rios et al., 2018), where it was formulated as a classic translation task, with a test set of 3,249 DE/EN sentence pairs (from several corpora on OPUS) which contained ambiguous German words identified in ContraWSD. The scoring is automatic in most cases, by observing the presence of a known correct vs. incorrect translation of each polysemous source word. All systems submitted to the WMT18 news translation task (Bojar et al., 2018) were also evaluated for WSD, and compared to certain 2016 systems, finding that the accuracy of the best system progressed from 81% to 93%, and that the correlation with BLEU scores was strong but not perfect.

Another contrastive test set was made available in November 2017 by Bawden et al. (2018, Section 2.1) to assess lexical choice in English/French translation, but also pronoun choice (see next section). The set is thus composed of two equally sized subsets, each consisting of 'blocks' based on modified movie subtitles. There are 100 blocks testing lexical choice capabilities (WSD and non-WSD) (see also Bawden, 2018, Section 7.1). Formally, let us denote a block as ((C^f_1, S^f, C^e_1, S^e_α, S^e_β), (C^f_2, S^f, C^e_2, S^e_β, S^e_α)). Each block is based on a source sentence S^f containing a polysemous word w^f. Two different source sentences C^f_1 and C^f_2 are provided as context, i.e. preceding S^f. Their role is to modify the sense conveyed by the occurrence of w^f in sentence S^f. For each context, the block provides a correct translation of S^f (S^e_α in the first case, S^e_β in the second case), along with an incorrect one (S^e_β and, respectively, S^e_α). The reference translations of the context sentences (C^e_1 and C^e_2) are also included. Because the source sentence is kept constant for the two contexts, a non-contextual system would provide the same answer for both contexts (i.e. the same ranking of true/false candidates) and obtain 50% accuracy. Among the 100 blocks provided by Bawden et al. (2018), some are designed to test WSD capabilities, and include contexts such that C^f_1 indicates that w^f conveys sense s1 (with the notations above), so the correct translation is w^e_1 and the incorrect one is w^e_2; then, context C^f_2 indicates that w^f conveys sense s3, and reverses the correctness of the w^e_1/w^e_2 translations. Other blocks test non-WSD related lexical choices, which rely more on the target contexts C^e_1 and C^e_2 for deciding which translation is correct, e.g. the need to repeat the same word.
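The 50% floor for non-contextual systems follows directly from the block structure, as a minimal sketch shows (hypothetical scorer and data; only the pairing of two contexts with one constant source sentence follows the description above).

```python
def block_accuracy(score, blocks):
    """Accuracy over Bawden-style blocks: each block repeats one source
    sentence under two contexts that swap which candidate is correct."""
    decisions = [
        score(ctx, src, good) > score(ctx, src, bad)
        for half_pair in blocks
        for (ctx, src, good, bad) in half_pair
    ]
    return sum(decisions) / len(decisions)

# A context-blind scorer ignores ctx entirely, so within a block it
# ranks the two candidates the same way under both contexts and is
# right in exactly one of the two halves: 50% accuracy overall.
def context_blind(ctx, src, tgt):
    return len(tgt)  # arbitrary but fixed preference between candidates

blocks = [
    (("context-1", "src", "short", "longer"),
     ("context-2", "src", "longer", "short")),
]
print(block_accuracy(context_blind, blocks))  # → 0.5
```

Only a system whose score actually changes with the context sentence can rise above that floor, which is what the accuracy figures reported for these blocks measure.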
Exploring attention to context for NMT of polysemous words. A quantitative analysis of the WSD capacities of NMT (encoder-decoder RNN with attention) was provided by Liu, Lu, and Neubig (2018) in August 2017, who opted for straightforward criteria to identify polysemous words and assess their translations. The number of senses of each EN source word (for EN/DE, EN/FR and EN/ZH NMT) was extracted from the online Cambridge Dictionary, and correct translation meant identity to the reference. Further on, to demonstrate the benefits of their proposed NMT improvements (see below), they restricted the list of polysemous words to a list of 171 English homographs found on Wikipedia.

The findings presented by Marvin and Koehn (2018) may explain why the capabilities of baseline NMT systems (RNN-based, built with OpenNMT-py for EN/FR translation over Europarl and NewsComments) for WSD remain quite limited. They examined the representations of occurrences of polysemous words at various levels of the NMT encoding layers. Specifically, the tests involved 426 sentences and four polysemous words (right, like, last, and case), and showed that the encoded context seems insufficient to enable WSD in most cases.

This is confirmed by Tang et al. (2018b), who directly looked at how attention is distributed when translating polysemous words from ContraWSD. They compared the RNN encoder-decoder with the Transformer model, with two ways to compute translation accuracy on polysemous words: either by comparing directly with a word-aligned reference, or by scoring the contrastive pairs as in (Rios Gonzales et al., 2017). In both cases, the Transformer clearly outperforms the RNN, though performance appears to be lower with reference-based scoring. The main findings are that attention weights are more concentrated on the "ambiguous noun itself rather than context tokens" and that "attention is not the main mechanism used by NMT models to incorporate contextual information for WSD."

ContraWSD was again put to use by Tang et al. (2018a) for a quantitative evaluation of WSD for DE/EN and DE/FR translation. The comparison of RNNs, CNNs and the Transformer showed that the latter is significantly better than the other two, likely because the network "connects distant words via shorter network paths", but no further explanation or analysis of WSD is provided.

3.3.2 Evaluation of Pronouns and Coreference

A revival of the interest in improving discourse-level phenomena in MT has led since 2010 to several initiatives and studies to improve the evaluation of pronoun translation, i.e. to make it more accurate but also more efficient, and if possible, to automate it. With the advent of NMT, the new architectures have been submitted to the same tests and compared with PBSMT.

ParCor is a parallel EN/DE corpus first annotated with anaphoric relations, and then with coreference ones (Guillou et al., 2014; Lapshinova-Koltunski et al., 2018). It includes TED talks and EU Bookshop publications. The annotation pays special attention to the status of pronouns, and distinguishes several cases of referential vs. non-referential uses. Using similar annotation guidelines, the authors designed the PROTEST test suite, which contains 250 pronouns along with their reference translations (Guillou and Hardmeier, 2016). Identity between a candidate and reference pronoun translation is scored automatically, but differences are submitted to human judgment. Indeed, depending on the pronoun systems of the source and target languages (often EN/FR and EN/DE in these evaluations), but also and crucially on the lexical choice for a pronoun's antecedent, a variety of translations can be acceptable for pronouns. This limits the accuracy of automatic reference-based metrics such as APT (Miculicich Werlen and Popescu-Belis, 2017), as recently discussed by Guillou and Hardmeier (2018), and requires alternative strategies when evaluations must be quick, large-scale and cost-effective, e.g. for pronoun-oriented shared tasks.

Several shared tasks have been organized to assess the quality of pronoun translation, but due to evaluation difficulties, protocols have evolved from year to year. Two main approaches have been tried: (1) evaluate the accuracy of pronoun translation, though this cannot be done automatically with sufficient confidence for a shared task, and requires some form of human evaluation; (2) evaluate the accuracy of pronoun prediction given the source text and a lemmatized version of the reference translation with deleted pronouns, which can be done (semi-)automatically.[14] Both approaches have been tried at the DiscoMT 2015 shared task (Hardmeier et al., 2015), but only the second one was continued in the following years (Guillou et al., 2016; Loáiciga et al., 2017).

[14] Lemmatization prevents non-MT strategies, such as powerful language models, from attaining high scores, as happened at the 2015 DiscoMT shared task (Hardmeier et al., 2015).
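The second protocol can be illustrated by sketching how one such test item might be constructed (illustrative only: the placeholder naming, the toy French lemmas and the function below are assumptions made for the example, not the shared tasks' exact data format). Pronouns are deleted from a lemmatized reference and kept aside as the solution the system must predict from the source and the remaining target context.

```python
TARGET_PRONOUNS = {"il", "elle", "ils", "elles"}  # French forms to predict

def make_prediction_item(source, reference_lemmas):
    """Turn a lemmatized reference into a pronoun-prediction item:
    each pronoun is replaced by a numbered placeholder, and the deleted
    forms are kept aside as the solution."""
    target, solutions = [], []
    for lemma in reference_lemmas:
        if lemma in TARGET_PRONOUNS:
            target.append("REPLACE_%d" % len(solutions))
            solutions.append(lemma)
        else:
            target.append(lemma)
    return source, " ".join(target), solutions

item = make_prediction_item(
    "she said that it works .",
    ["elle", "dire", "que", "cela", "marcher", "."],
)
print(item)
# → ('she said that it works .', 'REPLACE_0 dire que cela marcher .', ['elle'])
```

Because the surrounding target words are reduced to lemmas, a system cannot solve such items with target-side language modelling alone and must actually consult the source.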
At WMT 2018, pronoun translation was evaluated for all 16 systems participating in the EN/DE news translation task using an external test suite (Guillou et al., 2018) in the PROTEST style, with 200 occurrences of it and they on the source side. These pronouns have multiple possible translations into German. Evaluation was semi-automatic, with candidates matching the reference (1,150) being 'approved' and the others being submitted to human judges (2,050). Seven out of 15 systems (all NMT) translate correctly more than 145 pronouns out of 200, with the best one reaching 157 (Microsoft's Marian, Junczys-Dowmunt et al., 2018). Pronoun accuracy is highly correlated with BLEU (r = 0.91) and APT (r = 0.89). Event references reach 81% accuracy and pleonastic it 93% on average. Intra-sentential anaphoric occurrences of it are better translated than inter-sentential ones (58% vs. 45%).

A similar method using PROTEST was applied by the same authors to EN/FR MT with PBSMT and NMT systems (Hardmeier and Guillou, 2018), with 250 occurrences of it and they. The best system, which is the Transformer-based context-aware system from Voita et al. (2018), translated 199 pronouns correctly, while the average over the 9 tested systems is 160 (64%). The study shows that the context-aware system is highly accurate on pleonastic (non-referential) pronouns (27 out of 30) and intra-sentential anaphoric it and they (35/40 and 21/25), but still struggles with inter-sentential ones (15/30 and 11/25).

The evaluation approach adopted by Voita, Serdyukov, Sennrich, and Titov (2018) for their context-aware NMT architecture is quite exemplary. Their goal is to demonstrate the improvement of pronoun translation, and this is evaluated without the use of specific test suites or contrastive pairs (see below). The authors use Stanford's CoreNLP coreference resolution system[15] to identify sentences with pronouns that have their antecedent in the previous sentence; for such sentences, BLEU improves more (+1.2) than on average. Moreover, for sentences with it and a feminine antecedent, BLEU increases by 4.8 points. The attention weights of the system are compared to CoreNLP results in two ways, first by identifying the token that receives maximal attention when the pronoun is translated. This token coincides with the antecedent found by CoreNLP more often (+6 points) than for baseline methods (random, first, or last noun). The second evaluation has human raters identify the actual antecedent in 500 sentences with it where more than one candidate antecedent exists in the previous sentence. Here, CoreNLP is correct in only 79% of the cases, while using NMT attention is 72% correct, well above the best heuristic at 54%. These arguments strongly indicate that NMT learns to perform inter-sentential anaphora resolution to some extent.[16]

Moving away from evaluations performed on test suites with reference translations, as well as from those requiring coreference resolution, contrastive pairs have also been designed for pronoun translation. The above-mentioned test set by Bawden, Sennrich, Birch, and Haddow (2018) also contains 100 blocks that aim to test the translation of personal and possessive pronouns. As for WSD, the context and source sentences are kept constant (e.g. with a pronoun it), but four alternative translations of the context sentence are generated, varying the translation of the antecedent: (a) the reference translation; (b) a possible translation with the opposite gender; (c) and (d), inaccurate translations, feminine and masculine. For each translation of the context sentence, the contrastive pair differs only in the translation of it, with a masculine vs. feminine French pronoun (il or elle). In situations (c) and (d), the system is expected to prefer the 'contextually correct' translation, agreeing with the gender of the inaccurate translation of the antecedent. The best system designed by Bawden et al. (2018) achieves 72.5% accuracy versus 50% for a non-contextual NMT system.

Finally, a much larger but less structured set of contrastive pairs for pronouns has been presented by Müller, Rios, Voita, and Sennrich (2018).[17] The EN/DE pairs contain only source sentences occurring in the OpenSubtitles corpus, without editing of the context sentences. The key to ensuring high-quality automatic data selection is to focus on the English source pronoun it and its possible German translations into er, sie or es, with several constraints: automatic anaphora resolution on both the EN and DE sides (with CoreNLP (Manning et al., 2014)[18] and CorZu (Tuggener, 2016)) must find an antecedent; the antecedents on the EN and DE sides must be word-aligned (with fast-align (Dyer et al., 2013)); and the EN/DE pronouns must also be aligned. With these constraints in mind, the set includes the source and target sentences containing it and its translation, and as much context as needed before the sentence. To generate the wrong alternative in the contrastive pair, the correct translation is randomly replaced with one of the two incorrect ones. The set contains 12,000 occurrences of it, with 4,000 for each possible translation; most antecedents (58%) are in the previous sentence. Using these contrastive pairs, the authors find that context-aware models outperform baselines by up to 20 percentage points, especially on the sentences where the antecedent is in the preceding sentence, while BLEU scores are only marginally improved.

[15] An unspecified component of CoreNLP (Manning et al., 2014), possibly the deterministic one (Lee et al., 2013).
[16] This analysis of performance in situations that are genuinely ambiguous is reminiscent of Linzen, Dupoux, and Goldberg (2016), who assess the capacity of LSTM networks to model syntactic dependencies such as noun-verb agreement.
[17] See https://github.com/ZurichNLP/ContraPro.
[18] Unspecified coreference component.
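The generation of the wrong alternative in each contrastive pair can be sketched as follows (a simplified illustration: the function and the toy sentence are invented, and the real selection pipeline additionally enforces the alignment and anaphora-resolution constraints described above):

```python
import random

IT_TRANSLATIONS = ("er", "sie", "es")  # possible German renderings of "it"

def make_contrastive_pair(target_tokens, pronoun_index, rng):
    """Return (correct, wrong) target sentences, where the wrong one
    replaces the correct pronoun with one of the two other candidates,
    chosen at random."""
    correct_pronoun = target_tokens[pronoun_index]
    assert correct_pronoun in IT_TRANSLATIONS
    wrong_pronoun = rng.choice(
        [p for p in IT_TRANSLATIONS if p != correct_pronoun])
    wrong_tokens = list(target_tokens)
    wrong_tokens[pronoun_index] = wrong_pronoun
    return " ".join(target_tokens), " ".join(wrong_tokens)

rng = random.Random(42)
correct, wrong = make_contrastive_pair(
    ["Ich", "kaufte", "einen", "Tisch", "und", "er", "war", "billig", "."],
    5, rng)
print(correct)
print(wrong)
```

Since the two sentences differ only in the pronoun, scoring them both lets the evaluation isolate pronoun choice from every other aspect of translation quality.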
3.3.3 Evaluation of Discourse Structure

Several metrics have been proposed to assess the ability to correctly translate discourse structure, but none of the studies applied them to NMT systems, as they pre-dated their advent. Discourse structure results from argumentation relations between sentences, often made explicit through discourse connectives. Although strategies to improve connective translation by PBSMT systems have been designed (Meyer and Popescu-Belis, 2012; Meyer et al., 2015), along with metrics to assess the improvements (Hajlaoui and Popescu-Belis, 2013), they have not been recently applied to NMT systems.

Similarly, metrics involving discourse structure (sentence-level RST parse trees), such as DiscoTK-party (Joty et al., 2017), have been shown to correlate positively with human judgments (for PBSMT). However, this study mostly refers to data from the WMT 2014 shared task on (meta-)evaluation of metrics, which did not include any NMT output at that time. Sim Smith and Specia (2018) designed a discourse-aware MT evaluation metric that compares embeddings of source and target connectives, which is validated on EN/FR MT outputs from 2014 and earlier, accompanied by human ratings.

A manual analysis of discourse phenomena in SMT, with quality estimation as the background objective, was presented by Scarton and Specia (2015), while other taxonomies of discourse-related errors, applied by manual analysts, have been inspired by contrastive linguistics at the discourse level, allowing the comparison of cross-lingual contrasts in human and machine translation and concluding to NMT superiority (Lapshinova-Koltunski and Hardmeier, 2017; Šoštarić et al., 2018).

Indirect evidence on the capability of NMT to translate inter-sentential dependencies comes from the recent study by Läubli, Sennrich, and Volk (2018), re-assessing Hassan et al.'s claim that the Bing Translator achieved human parity on ZH/EN news translation. Without examining detailed quality attributes such as word order, lexical choice, or pronouns, the authors asked human judges to rate translations at the text level rather than the sentence level, and they showed that when entire texts are considered by professional translators, the difference between human and NMT translations becomes statistically significant. One can therefore infer that there are perceptible imperfections in the NMT translation of text-level properties such as cohesion and coherence.

3.4 Synthesis

When a new system is presented in a publication, it cannot be expected from the authors that they apply a large array of existing metrics. Evaluation studies that deepen the analysis of MT output are thus welcome. As reviewed in this section, studies of NMT models from 2017–2018 have revealed significant improvements in output quality brought by NMT models, confirmed from a variety of perspectives:

• type of metric: automated (e.g. using the TER distance to a reference translation) vs. human (e.g. judges who may post-edit, or compare, or rate absolutely one or more translations, with or without knowledge of the source language);

• type of comparison: absolute score or comparative score (often pitting NMT against SMT);

• type of system output: 1-best, n-best, or probabilities over contrastive pairs (which require access to a system's internals);

• test data: large corpora from WMT, excerpts from them, domain-specific data, or test suites aimed at one or more linguistic phenomena.

One of the main observations of this review is the rather large number of assessments of document-level quality, which frequently support the need for discourse-aware MT. However, these studies also indicate that significant progress remains to be made on several discourse-level phenomena: lexical coherence, anaphora resolution, and discourse structure.

4 Increasing Context Spans in NMT

In a subsequent version of the paper, this section will review studies from 2017–2018 that attempted to improve the translation of discourse-level phenomena, and/or attempted to use larger spans of context when translating. The section will be divided into three parts: NMT systems using wider contexts; NMT models for improving WSD and lexical choice; and the processing of discourse-level phenomena, particularly pronouns, in NMT.

Acknowledgments

The author is grateful to the Swiss National Science Foundation (SNSF) for support through the DOMAT project (n. 175693): On-demand Knowledge for Document-level Machine Translation.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 2015. URL https://arxiv.org/pdf/1409.0473.

Rachel Bawden. Contextual machine translation of dialogue: going beyond the sentence. PhD thesis, Université Paris-Saclay, LIMSI-CNRS, Orsay, France, November 2018.

Rachel Bawden, Rico Sennrich, Alexandra Birch, and Barry Haddow. Evaluating discourse phenomena in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1304–1313. Association for Computational Linguistics, 2018. doi: 10.18653/v1/N18-1118. URL http://aclweb.org/anthology/N18-1118.
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.

Luisa Bentivogli, Arianna Bisazza, Mauro Cettolo, and Marcello Federico. Neural versus phrase-based machine translation quality: a case study. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 257–267. Association for Computational Linguistics, 2016. doi: 10.18653/v1/D16-1025. URL http://aclweb.org/anthology/D16-1025.

Luisa Bentivogli, Arianna Bisazza, Mauro Cettolo, and Marcello Federico. Neural versus phrase-based MT quality: An in-depth analysis on English-German and English-French. Computer Speech & Language, 49:52–70, 2018. doi: 10.1016/j.csl.2017.11.004. URL http://www.sciencedirect.com/science/article/pii/S0885230817301079.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurelie Neveol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 131–198. Association for Computational Linguistics, 2016. doi: 10.18653/v1/W16-2301. URL http://aclweb.org/anthology/W16-2301.

Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Philipp Koehn, and Christof Monz. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 272–303. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/W18-6401.

Laura Van Brussel, Arda Tezcan, and Lieve Macken. A fine-grained error analysis of NMT, SMT and RBMT output for English-to-Dutch. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 2018. European Language Resources Association (ELRA).

Aljoscha Burchardt, Vivien Macketanz, Jon Dehdari, Georg Heigold, Jan-Thorsten Peter, and Philip Williams. A linguistic evaluation of rule-based, phrase-based, and neural MT engines. The Prague Bulletin of Mathematical Linguistics, 108(1):159–170, 2017. URL https://doi.org/10.1515/pralin-2017-0017.

Bruno Cartoni and Thomas Meyer. Extracting directional and comparable corpora from a multilingual corpus for translation studies. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012. European Language Resources Association (ELRA).

Sheila Castilho, Joss Moorkens, Federico Gaspari, Iacer Calixto, John Tinsley, and Andy Way. Is neural machine translation the new state of the art? The Prague Bulletin of Mathematical Linguistics, 108(1):109–120, 2017a.

Sheila Castilho, Joss Moorkens, Federico Gaspari, Rico Sennrich, Vilelmini Sosoni, Panayota Georgakopoulou, Pintu Lohar, Andy Way, Antonio Valerio Miceli Barone, and Maria Gialama. A comparative quality evaluation of PBSMT and NMT using professional translators. In Proceedings of Machine Translation Summit XVI, Nagoya, Japan, 2017b.

Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Mike Schuster, Noam Shazeer, Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. The best of both worlds: Combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 76–86. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/P18-1008.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734. Association for Computational Linguistics, 2014. doi: 10.3115/v1/D14-1179. URL http://aclweb.org/anthology/D14-1179.

Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Deep Learning and Representation Learning Workshop, Montreal, QC, Canada, 2014. URL https://arxiv.org/abs/1412.3555.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. Fast and robust neural network joint models for statistical machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1370–1380. Association for Computational Linguistics, 2014. doi: 10.3115/v1/P14-1129. URL http://aclweb.org/anthology/P14-1129.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018. URL http://arxiv.org/abs/1810.04805.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648. Association for Computational Linguistics, 2013. URL http://aclweb.org/anthology/N13-1073.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/D18-1045.

Emmanuelle Esperança-Rodier, Caroline Rossi, Alexandre Bérard, and Laurent Besacier. Evaluation of NMT and SMT: A study on uses and perceptions. In Proceedings of the 39th Conference on Translating and the Computer, pages 11–24. AsLing, 2017. URL https://www.asling.org/tc39/wp-content/uploads/TC39-proceedings-final-1Nov-4.20pm.pdf.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1243–1252, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/gehring17a.html.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. Can machine translation systems be evaluated by the crowd alone. Natural Language Engineering, 23(1):3–30, 2017.

Liane Guillou and Christian Hardmeier. PROTEST: A test suite for evaluating pronouns in machine translation. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France, May 2016. European Language Resources Association (ELRA). ISBN 978-2-9517408-9-1.

Liane Guillou and Christian Hardmeier. Automatic reference-based evaluation of pronoun translation misses the point. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4797–4802. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/D18-1513.

Liane Guillou, Christian Hardmeier, Aaron Smith, Jörg Tiedemann, and Bonnie Webber. ParCor 1.0: A parallel pronoun-coreference corpus to support statistical MT. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), Reykjavik, Iceland, May 2014. European Language Resources Association (ELRA).

Liane Guillou, Christian Hardmeier, Preslav Nakov, Sara Stymne, Jörg Tiedemann, Yannick Versley, Mauro Cettolo, Bonnie Webber, and Andrei Popescu-Belis. Findings of the 2016 WMT shared task on cross-lingual pronoun prediction. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 525–542. Association for Computational Linguistics, 2016. doi: 10.18653/v1/W16-2345. URL http://aclweb.org/anthology/W16-2345.

Liane Guillou, Christian Hardmeier, Ekaterina Lapshinova-Koltunski, and Sharid Loáiciga. A pronoun test suite evaluation of the English–German MT systems at WMT 2018. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 570–577. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/W18-6435.

Najeh Hajlaoui and Andrei Popescu-Belis. Assessing the accuracy of discourse connective translations: Validation of an automatic metric. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 236–247. Springer, 2013.

Christian Hardmeier and Liane Guillou. Pronoun translation in English-French machine translation: An analysis of error types. CoRR, abs/1808.10196, 2018. URL http://arxiv.org/abs/1808.10196.

Christian Hardmeier, Preslav Nakov, Sara Stymne, Jörg Tiedemann, Yannick Versley, and Mauro Cettolo. Pronoun-focused MT and cross-lingual pronoun prediction: Findings of the 2015 DiscoMT shared task on pronoun translation. In Proceedings of the Second Workshop on Discourse in Machine Translation, pages 1–16. Association for Computational Linguistics, 2015. doi: 10.18653/v1/W15-2501. URL http://aclweb.org/anthology/W15-2501.

Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou. Achieving human parity on automatic Chinese to English news translation. CoRR, abs/1803.05567, 2018. URL http://arxiv.org/abs/1803.05567.

Jindřich Helcl and Jindřich Libovický. Neural Monkey: An open-source tool for sequence learning. The Prague Bulletin of Mathematical Linguistics, 107(1):5–17, 2017.

Felix Hieber, Tobias Domhan, Michael Denkowski, David Vilar, Artem Sokolov, Ann Clifton, and Matt Post. Sockeye: A toolkit for neural machine translation. arXiv preprint arXiv:1712.05690, 2017. URL https://arxiv.org/abs/1712.05690.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Eduard Hovy, Margaret King, and Andrei Popescu-Belis. Principles of context-based machine translation evaluation. Machine Translation, 17(1):43–75, 2002.

Pierre Isabelle, Colin Cherry, and George Foster. A challenge set approach to evaluating machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2486–2496. Association for Computational Linguistics, 2017. doi: 10.18653/v1/D17-1263. URL http://aclweb.org/anthology/D17-1263.

Shafiq Joty, Francisco Guzmán, Lluís Màrquez, and Preslav Nakov. Discourse structure in machine translation evaluation. Computational Linguistics, 43(4):683–722, 2017. doi: 10.1162/COLI_a_00298. URL http://aclweb.org/anthology/J17-4001.

Marcin Junczys-Dowmunt, Tomasz Dwojak, and Hieu Hoang. Is neural machine translation ready for deployment? A case study on 30 translation directions. In Proceedings of IWSLT (13th International Workshop on Spoken Language Technology), Seattle, WA, USA, 2016. URL http://arxiv.org/abs/1610.01108.

Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, and Alexandra Birch. Marian: Fast neural machine translation in C++. In Proceedings of ACL 2018, System Demonstrations, pages 116–121. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/P18-4020.

Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709. Association for Computational Linguistics, 2013. URL http://aclweb.org/anthology/D13-1176.

Margaret King and Kirsten Falkedal. Using test suites in evaluation of machine translation systems. In Proceedings of the 13th International Conference on Computational Linguistics (COLING), volume 2, Helsinki, Finland, 1990. URL http://aclweb.org/anthology/C90-2037.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pages 67–72. Association for Computational Linguistics, 2017. URL http://aclweb.org/anthology/P17-4012.

Filip Klubička, Antonio Toral, and Víctor M. Sánchez-Cartagena. Quantitative fine-grained human evaluation of machine translation systems: a case study on English to Croatian. Machine Translation, 32(3):195–215, September 2018. doi: 10.1007/s10590-018-9214-x. URL https://doi.org/10.1007/s10590-018-9214-x.

Rebecca Knowles and Philipp Koehn. Context and copying in neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3034–3041. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/D18-1339.

Philipp Koehn. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, 2004. URL http://aclweb.org/anthology/W04-3250.

Philipp Koehn and Rebecca Knowles. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39. Association for Computational Linguistics, 2017. doi: 10.18653/v1/W17-3204. URL http://aclweb.org/anthology/W17-3204.

Surafel Melaku Lakew, Mauro Cettolo, and Marcello Federico. A comparison of transformer and recurrent neural networks on multilingual neural machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, pages 641–652. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/C18-1054.

Ekaterina Lapshinova-Koltunski and Christian Hardmeier. Discovery of discourse-related language contrasts through alignment discrepancies in English-German translation. In Proceedings of the Third Workshop on Discourse in Machine Translation, pages 73–81. Association for Computational Linguistics, 2017. doi: 10.18653/v1/W17-4810. URL http://aclweb.org/anthology/W17-4810.

Ekaterina Lapshinova-Koltunski, Christian Hardmeier, and Pauline Krielke. ParCorFull: a parallel corpus annotated with full coreference. In Proceedings of the 11th Language Resources and Evaluation Conference, Miyazaki, Japan, 2018.

Samuel Läubli, Rico Sennrich, and Martin Volk. Has machine translation achieved human parity? A case for document-level evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4791–4796. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/D18-1512.

Heeyoung Lee, Angel Chang, Yves Peirsman, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. Deterministic coreference resolution based on entity-centric, precision-ranked rules. Computational Linguistics, 39(4), 2013. doi: 10.1162/COLI_a_00152. URL http://aclweb.org/anthology/J13-4004.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535, 2016. URL http://aclweb.org/anthology/Q16-1037.

Frederick Liu, Han Lu, and Graham Neubig. Handling homographs in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1336–1345. Association for Computational Linguistics, 2018. doi: 10.18653/v1/N18-1121. URL http://aclweb.org/anthology/N18-1121.

Sharid Loáiciga, Sara Stymne, Preslav Nakov, Christian Hardmeier, Jörg Tiedemann, Mauro Cettolo, and Yannick Versley. Findings of the 2017 DiscoMT shared task on cross-lingual pronoun prediction. In Proceedings of the Third Workshop on Discourse in Machine Translation, pages 1–16. Association for Computational Linguistics, 2017. doi: 10.18653/v1/W17-4801. URL http://aclweb.org/anthology/W17-4801.

Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1412–1421. Association for Computational Linguistics, 2015. doi: 10.18653/v1/D15-1166. URL http://aclweb.org/anthology/D15-1166.

Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60. Association for Computational Linguistics, 2014. doi: 10.3115/v1/P14-5010. URL http://aclweb.org/anthology/P14-5010.

Rebecca Marvin and Philipp Koehn. Exploring word sense disambiguation abilities of neural machine translation systems (non-archival extended abstract). In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers), pages 125–131. Association for Machine Translation in the Americas, 2018. URL http://aclweb.org/anthology/W18-1812.

Thomas Meyer and Andrei Popescu-Belis. Using sense-labeled discourse connectives for statistical machine translation. In Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra), pages 129–138. Association for Computational Linguistics, 2012. URL http://aclweb.org/anthology/W12-0117.

… and Structure in Statistical Translation, pages 78–85. Association for Computational Linguistics, 2014. doi: 10.3115/v1/W14-4009. URL http://aclweb.org/anthology/W14-4009.

Thomas Meyer, Najeh Hajlaoui, and Andrei Popescu- Annette Rios, Mathias Müller, and Rico Sen- Belis. Disambiguating discourse connectives for sta- nrich. The word sense disambiguation test tistical machine translation. IEEE/ACM Transac- suite at WMT18. In Proceedings of the tions on Audio, Speech, and Language Processing, Third Conference on Machine Translation: 23(7):1184–1197, 2015. Shared Task Papers, pages 588–596. Associa- tion for Computational Linguistics, 2018. URL Lesly Miculicich Werlen and Andrei Popescu- http://aclweb.org/anthology/W18-6437. Belis. Validation of an automatic metric for the accuracy of pronoun translation (apt). In Annette Rios Gonzales, Laura Mascarell, and Proceedings of the Third Workshop on Dis- Rico Sennrich. Improving word sense disam- course in Machine Translation, pages 17–25. biguation in neural machine translation with Association for Computational Linguistics, sense embeddings. In Proceedings of the Sec- 2017. doi: 10.18653/v1/W17-4802. URL ond Conference on Machine Translation, pages http://aclweb.org/anthology/W17-4802. 11–19. Association for Computational Linguis- tics, 2017. doi: 10.18653/v1/W17-4702. URL Mathias Müller, Annette Rios, Elena Voita, and http://aclweb.org/anthology/W17-4702. Rico Sennrich. A large-scale test set for the evaluation of context-aware pronoun translation Carolina Scarton and Lucia Specia. A quan- in neural machine translation. In Proceedings titative analysis of discourse phenom- of the Third Conference on Machine Transla- ena in machine translation. Discours, tion: Research Papers, pages 61–72. Associa- 2015. doi: 10.4000/discours.9047. URL tion for Computational Linguistics, 2018. URL http://journals.openedition.org/discours/9047. http://aclweb.org/anthology/W18-6307. Holger Schwenk, Daniel Dechelotte, and Jean-Luc Myle Ott, Sergey Edunov, David Grangier, and Gauvain. Continuous space language models Michael Auli. Scaling neural machine translation. for statistical machine translation. 
In Proceed- In Proceedings of the Third Conference on Machine ings of the COLING/ACL 2006 Main Conference Translation: Research Papers, pages 1–9. Associ- Poster Sessions, pages 723–730. Association ation for Computational Linguistics, 2018. URL for Computational Linguistics, 2006. URL http://aclweb.org/anthology/W18-6301. http://aclweb.org/anthology/P06-2093.

Kishore Papineni, Salim Roukos, Todd Ward, Rico Sennrich. How grammatical is character-level and Wei-Jing Zhu. Bleu: a method for au- neural machine translation? assessing MT quality tomatic evaluation of machine translation. In with contrastive translation pairs. In Proceedings Proceedings of the 40th Annual Meeting of of the 15th Conference of the European Chapter the Association for Computational Linguis- of the Association for Computational Linguistics: tics, Philadelphia, PA, USA, 2002. URL Volume 2, Short Papers, pages 376–382. Associ- http://aclweb.org/anthology/P02-1040. ation for Computational Linguistics, 2017. URL http://aclweb.org/anthology/E17-2060. Maja Popovic.´ Comparing language related is- sues for NMT and PBMT between German Rico Sennrich and Barry Haddow. Linguistic in- and English. The Prague Bulletin of Mathe- put features improve neural machine translation. matical Linguistics, 108(1):209–220, 2017. URL In Proceedings of the First Conference on Ma- https://doi.org/10.1515/pralin-2017-0021chine. Translation: Volume 1, Research Papers, pages 83–91. Association for Computational Lin- Maja Popovic.´ Language-related issues for NMT and guistics, 2016. doi: 10.18653/v1/W16-2209. URL PBMT for English–German and English–Serbian. http://aclweb.org/anthology/W16-2209. Machine Translation, pages 1–17, 2018. URL https://doi.org/10.1007/s10590-018-9219-5Rico. Sennrich, Barry Haddow, and Alexandra Birch. Improving neural machine translation models with Jean Pouget-Abadie, Dzmitry Bahdanau, Bart monolingual data. In Proceedings of the 54th van Merrienboer, Kyunghyun Cho, and Yoshua Annual Meeting of the Association for Compu- Bengio. Overcoming the curse of sentence tational Linguistics (Volume 1: Long Papers), length for neural machine translation using pages 86–96. Association for Computational Lin- automatic segmentation. In Proceedings of guistics, 2016a. doi: 10.18653/v1/P16-1009. 
URL SSST-8, Eighth Workshop on Syntax, Semantics http://aclweb.org/anthology/P16-1009. Rico Sennrich, Barry Haddow, and Alexandra Birch. In Proceedings of the Third Conference on Machine Neural machine translation of rare words with Translation: Research Papers, pages 26–35. Asso- subword units. In Proceedings of the 54th Annual ciation for Computational Linguistics, 2018b. URL Meeting of the Association for Computational http://aclweb.org/anthology/W18-6304. Linguistics (Volume 1: Long Papers), pages 1715–1725. Association for Computational Lin- Antonio Toral and Víctor M. Sánchez-Cartagena. A guistics, 2016b. doi: 10.18653/v1/P16-1162. URL multifaceted evaluation of neural versus phrase- http://aclweb.org/anthology/P16-1162. based machine translation for 9 language directions. In Proceedings of the 15th Conference of the Rico Sennrich, Barry Haddow, and Alexandra European Chapter of the Association for Com- Birch. Edinburgh neural machine translation putational Linguistics: Volume 1, Long Papers, systems for wmt 16. In Proceedings of the pages 1063–1073, Valencia, Spain, April 2017. First Conference on Machine Translation: Vol- Association for Computational Linguistics. URL ume 2, Shared Task Papers, pages 371–376. http://www.aclweb.org/anthology/E17-1100. Association for Computational Linguistics, 2016c. doi: 10.18653/v1/W16-2323. URL Antonio Toral, Sheila Castilho, Ke Hu, and Andy Way. http://aclweb.org/anthology/W16-2323. Attaining the unattainable? Reassessing claims of human parity in neural machine translation. In Pro- Rico Sennrich, Orhan Firat, Kyunghyun Cho, ceedings of the Third Conference on Machine Trans- Alexandra Birch, Barry Haddow, Julian Hitschler, lation (WMT). Association for Computational Lin- Marcin Junczys-Dowmunt, Samuel Läubli, An- guistics, 2018. tonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. Nematus: a toolkit for neu- Don Tuggener. Incremental Coreference Resolution ral machine translation. 
In Proceedings of the for German. PhD thesis, University of Zurich, Software Demonstrations of the 15th Confer- Zurich, Switzerland, 2016. ence of the European Chapter of the Association Ashish Vaswani, Noam Shazeer, Niki Parmar, for Computational Linguistics, pages 65–68. As- Jakob Uszkoreit, Llion Jones, Aidan N Gomez, sociation for Computational Linguistics, 2017. URL Lukasz Kaiser, and Illia Polosukhin. Attention http://aclweb.org/anthology/E17-3017. is all you need. In I. Guyon, U. V. Luxburg, Karin Sim Smith and Lucia Specia. Assessing S. Bengio, H. Wallach, R. Fergus, S. Vish- crosslingual discourse relations in machine transla- wanathan, and R. Garnett, editors, Advances in tion. arXiv preprint arXiv:1810.03148, 2018. URL Neural Information Processing Systems 30, pages https://arxiv.org/abs/1810.03148. 5998–6008. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf. Margita Šoštaric,´ Christian Hardmeier, and Sara Stymne. Discourse-related language contrasts in Ashish Vaswani, Samy Bengio, Eugene Brevdo, english-croatian human and machine translation. In Francois Chollet, Aidan Gomez, Stephan Gouws, Proceedings of the Third Conference on Machine Llion Jones, Łukasz Kaiser, Nal Kalchbren- Translation: Research Papers, pages 36–48. As- ner, Niki Parmar, Ryan Sepassi, Noam Shazeer, sociation for Computational Linguistics, 2018. URL and Jakob Uszkoreit. Tensor2tensor for neu- http://aclweb.org/anthology/W18-6305. ral machine translation. In Proceedings of the 13th Conference of the Association for Machine Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Se- Translation in the Americas (Volume 1: Re- quence to sequence learning with neural networks. search Papers), pages 193–199. Association for In Advances in Neural Information Process- Machine Translation in the Americas, 2018. URL ing Systems (NIPS), pages 3104–3112, 2014. URL http://aclweb.org/anthology/W18-1819. 
https://papers.nips.cc/paper/5346-sequence-to-sequence-learning\-with-neural-networks.pdf. Elena Voita, Pavel Serdyukov, Rico Sennrich, Gongbo Tang, Mathias Müller, Annette Rios, and Ivan Titov. Context-aware neural machine and Rico Sennrich. Why self-attention? a translation learns anaphora resolution. In Pro- targeted evaluation of neural machine transla- ceedings of the 56th Annual Meeting of the tion architectures. In Proceedings of the 2018 Association for Computational Linguistics (Volume Conference on Empirical Methods in Natural 1: Long Papers), pages 1264–1274. Associa- Language Processing, pages 4263–4272. Associ- tion for Computational Linguistics, 2018. URL ation for Computational Linguistics, 2018a. URL http://aclweb.org/anthology/P18-1117. http://aclweb.org/anthology/D18-1458. Sam Wiseman and Alexander M. Rush. Sequence-to- Gongbo Tang, Rico Sennrich, and Joakim Nivre. An sequence learning as beam-search optimization. In analysis of attention mechanisms: The case of word Proceedings of the 2016 Conference on Empirical sense disambiguation in neural machine translation. Methods in Natural Language Processing, pages 1296–1306. Association for Computational Lin- guistics, 2016. doi: 10.18653/v1/D16-1137. URL http://aclweb.org/anthology/D16-1137. Billy T. M. Wong and Chunyu Kit. Extending ma- chine translation evaluation metrics with lexical cohesion to document level. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1060–1068. As- sociation for Computational Linguistics, 2012. URL http://aclweb.org/anthology/D12-1097. Yonghui Wu et al. Google’s neural ma- chine translation system: Bridging the gap between human and machine transla- tion. CoRR, abs/1609.08144, 2016. URL http://arxiv.org/abs/1609.08144. Biao Zhang, Deyi Xiong, and Jinsong Su. Neural ma- chine translation with deep attention. 
IEEE Transac- tions on Pattern Analysis and Machine Intelligence, pages 1–1, October 2018. doi: 10.1109/TPAMI. 2018.2876404.