Context in Neural Machine Translation: A Review of Models and Evaluations

Andrei Popescu-Belis
University of Applied Sciences of Western Switzerland (HES–SO)
School of Management and Engineering Vaud (HEIG–VD)
Route de Cheseaux 1, CP 521
1401 Yverdon-les-Bains, Switzerland
[email protected]

January 29, 2019

arXiv:1901.09115v1 [cs.CL] 25 Jan 2019

Abstract

This review paper discusses how context has been used in neural machine translation (NMT) in the past two years (2017–2018). Starting with a brief retrospect on the rapid evolution of NMT models, the paper then reviews studies that evaluate NMT output from various perspectives, with emphasis on those analyzing limitations of the translation of contextual phenomena. In a subsequent version, the paper will then present the main methods that were proposed to leverage context for improving translation quality, and distinguishes methods that aim to improve the translation of specific phenomena from those that consider a wider unstructured context.

1 Looking Back on the Past Two Years

Neural network architectures have become mainstream for machine translation (MT) in the past three years (2016–2018). This paradigm shift took a considerably shorter time than the previous one, from rule-based to phrase-based statistical MT models. Neural machine translation (NMT) was adopted thanks to its superior performance, and despite its higher computational cost (which has been mitigated by optimized hardware and software) and its need for very large training datasets (which has been addressed through back-translation of monolingual data and character-level translation as back-off). The NMT revolution is apparent in the burst in the number of related scientific publications since 2017, as well as in the increased attention MT receives from the general media, often related to visible improvements in the quality of online MT systems.

While much remains to be done, especially for low-resource language pairs or for specific domains, the quality of the most favorable cases such as English-French or German-English news translation has reached unprecedented levels, leading to claims that it achieves human parity. A remaining bottleneck, however, is the capacity to leverage contextual features when translating entire texts, especially when this is vital for correct translation. Taking textual context into account [1] means modeling long-range dependencies between words, phrases, or sentences, which are typically studied by linguistics under the topics of discourse and pragmatics. When it comes to translation, the capacity to model context may improve certain translation decisions, e.g. by favoring a better lexical choice thanks to document-level topical information, or by constraining pronoun choice thanks to knowledge about antecedents.

This review paper puts into perspective the significant amount of studies devoted in 2017 and 2018 to improving the use of context in NMT and to measuring these improvements. We start with a brief recap of the mainstream neural models and toolkits that have revolutionized MT (Section 2). We then organize our perspective based on the observation that most MT studies design and implement models, run them on data, and apply evaluation metrics to obtain scores, i.e. Models + Data + Metrics = Results. Novelty claims are generally made about one or more of the left-hand side terms, claiming improved results in comparison to previous ones.

Existing MT models can be tested on new metrics and/or datasets, to highlight previously unobserved properties of these models. Therefore, in Section 3, we review evaluation studies of NMT, which either apply existing metrics (going beyond n-gram matching) or devise new ones. We discuss these studies in increasing order of complexity of the evaluated aspects: first grammatical ones, and then semantic and discourse-level ones, including word sense disambiguation (WSD) and pronoun translation.

Most often, however, new models are tested on existing data and metrics, to enable controlled comparisons with competing models. In an upcoming version of this paper, we will discuss new NMT models that extend the context span considered during translation. We will distinguish those that use unstructured text spans from those that perform structured analyses requiring context, in particular lexical disambiguation and anaphora resolution.

[1] In this review, 'context' refers to the sentences of a document being translated, and not to extra-textual context such as associated images. Multimodal MT is an active research problem, but is outside our present scope.

2 Neural MT Models and Toolkits

2.1 Mainstream Models

Early attempts to use neural networks in MT aimed to replace n-gram language models with neural network ones (Bengio et al., 2003; Schwenk et al., 2006). Later, feed-forward neural networks were used to enhance phrase-based systems by rescoring the translation probability of phrases (Devlin et al., 2014). Variable-length input was accommodated by using recurrent neural networks (RNNs), which offered a principled way to represent sequences thanks to hidden states. One of the first "continuous" models, i.e. not using explicit memories of aligned phrases, was proposed by Kalchbrenner and Blunsom (2013), with RNNs for the target language model and a convolutional source sentence model (or an n-gram one). To address the vanishing gradient problem with RNNs, long short-term memory (LSTM) units (Hochreiter and Schmidhuber, 1997) were used in sequence-to-sequence models (Sutskever et al., 2014), and further simplified as gated recurrent units (GRU) (Cho et al., 2014; Chung et al., 2014). Such units allowed the networks to capture longer-term dependencies between words thanks to specialized gates enabling them to remember vs. forget past inputs.

Such sequence-to-sequence models were applied to MT with an encoder and a decoder RNN (Cho et al., 2014), but had serious difficulties in representing long sentences as a single vector (Pouget-Abadie et al., 2014), although using bi-directional RNNs and concatenating their representations for each word could partly address this limitation. The key innovation, however, was the attention mechanism introduced by Bahdanau et al. (2015), which allows the decoder to select at each step which part of the source sentence is most useful to consider for predicting the next word. Attention is a context vector (a weighted sum over all hidden states of the encoder) that can be seen as modeling the alignment between input and output positions. The efficiency of the model was further improved, with small effects on translation quality (Luong et al., 2015; Wiseman and Rush, 2016). The proposal for distinguishing local vs. global attention models by Luong et al. (2015) has yet to be incorporated into mainstream models.
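To make this concrete, the context vector of Bahdanau et al. (2015) can be written as follows; this is a minimal formulation in slightly simplified notation, and the exact parameterization of the scoring function varies across implementations. For target position i, given encoder hidden states h_1, ..., h_J and the previous decoder state s_{i-1}:

\[
e_{ij} = a(s_{i-1}, h_j), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{J} \exp(e_{ik})}, \qquad
c_i = \sum_{j=1}^{J} \alpha_{ij}\, h_j
\]

Here a(·,·) is a small feed-forward network scoring how relevant source position j is to producing the i-th target word, so the weights \alpha_{ij} act as a soft alignment between input and output positions, and the context vector c_i is fed to the decoder together with s_{i-1} and the previously generated word.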
The demonstration that NMT with attention-based encoder-decoder RNNs outperformed phrase-based SMT came at the 2016 news translation task of the WMT evaluations (Bojar et al., 2016). The system presented by the University of Edinburgh (Sennrich et al., 2016c) obtained the highest ranking thanks in particular to two additional improvements of the generic model. The first one was to use back-translation of monolingual target data with a state-of-the-art phrase-based SMT engine to increase the amount of parallel data available for training (Sennrich et al., 2016a). The second one was to use byte-pair encoding, allowing translation of character n-grams and thus overcoming the limited vocabulary of the encoder and decoder embeddings (Sennrich et al., 2016b). Low-level linguistic labels were shown to bring small additional benefits to translation quality (Sennrich and Haddow, 2016). The Edinburgh system was soon afterward open-sourced under the name of Nematus (Junczys-Dowmunt et al., 2016).
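As an illustration of the second improvement, the following is a minimal sketch of how byte-pair encoding merges can be learned from a word-frequency vocabulary, along the lines of the algorithm described by Sennrich et al. (2016b); the toy vocabulary, the number of merges, and the function names are illustrative choices, and a practical system would also apply the learned merges to segment new text.

    import collections
    import re

    def get_stats(vocab):
        """Count frequencies of adjacent symbol pairs over the vocabulary."""
        pairs = collections.defaultdict(int)
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[(symbols[i], symbols[i + 1])] += freq
        return pairs

    def merge_vocab(pair, vocab):
        """Replace every occurrence of the chosen pair by a single merged symbol."""
        bigram = re.escape(' '.join(pair))
        pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
        return {pattern.sub(''.join(pair), word): freq
                for word, freq in vocab.items()}

    # Toy vocabulary: words are space-separated symbols with an end-of-word marker.
    vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
             'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

    for _ in range(10):                    # number of merge operations (a hyper-parameter)
        pairs = get_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent symbol pair
        vocab = merge_vocab(best, vocab)
        print(best)                        # learned merges, e.g. ('e', 's'), then ('es', 't'), ...

At translation time, the learned merge operations are applied greedily to segment unseen words into known subword units, so the vocabulary stays fixed while rare words remain translatable.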
Research and commercial MT systems alike were quick to adopt NMT, starting with the best-resourced language pairs, such as English vs. other European languages and Chinese. Around the end of 2016, online MT offered by Bing, DeepL, Google or Systran was powered by deeper and deeper RNNs (as far as information is available). In the case of DeepL, although little information about the systems is published, its visible quality could be partly explained by the use of the high-quality Linguee parallel data.

An interesting development has been the claims for "bridging the gap between human and machine translation" from the Google NMT team in September 2016 on EN/FR and EN/DE (Wu et al., 2016), and for "achieving human parity on ... news translation" from the Microsoft NMT team in March 2018 on EN/ZH (Hassan et al., 2018). These claims have attracted attention from the media, but have also been disputed by deeper evaluations (see Section 3.2).

RNNs with attention allow top performance to be reached, but at the price of a large computational cost. For instance, the largest Google NMT system from 2016 (Wu et al., 2016), with its 8 encoder and decoder layers of 1,024 LSTM nodes each, required training on 96 NVIDIA K80 GPUs for 6 days, in spite of massive parallelization (e.g. running each layer on a separate GPU). A more promising approach to decreasing computational complexity is the use of convolutional neural networks for sequence-to-sequence modeling, as proposed by Gehring et al. (2017) in the ConvS2S model from Facebook AI Research. This model outperformed Wu et al.'s system on WMT 2014 EN/DE and EN/FR translation "at an order of magnitude faster speed, both on GPU and CPU". Posted in May 2017, the model was outperformed the next month by the Transformer (Vaswani et al., 2017).

The Transformer NMT model (Vaswani et al., 2017) removes sequential dependencies (recurrence) in the encoder and decoder networks, as well as the need for