

Towards Making the Most of Context in Neural Machine Translation

Zaixiang Zheng1∗, Xiang Yue1∗, Shujian Huang1, Jiajun Chen1 and Alexandra Birch2
1National Key Laboratory for Novel Software Technology, Nanjing University
2ILCC, School of Informatics, University of Edinburgh
{zhengzx,xiangyue}@smail.nju.edu.cn, {huangsj,chenjj}@nju.edu.cn, [email protected]

∗Equal contribution. This work was done when Zaixiang was visiting the University of Edinburgh.

Abstract

Document-level machine translation manages to outperform sentence-level models by a small margin, but has failed to be widely adopted. We argue that previous research did not make clear use of the global context, and propose a new document-level NMT framework that deliberately models the local context of each sentence with awareness of the global context of the document in both source and target languages. We specifically design the model to be able to deal with documents containing any number of sentences, including single sentences. This unified approach allows our model to be trained elegantly on standard datasets without needing to train on sentence- and document-level data separately. Experimental results demonstrate that our model outperforms Transformer baselines and previous document-level NMT models with substantial margins of up to 2.1 BLEU over state-of-the-art baselines. We also provide analyses which show the benefit of context far beyond the neighboring two or three sentences, which previous studies have typically incorporated.

Figure 1: Illustration of typical Transformer-based context-aware approaches (some of them do not consider target context (grey line)).

1 Introduction

Recent studies suggest that neural machine translation (NMT) [Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017] has achieved human parity, especially on resource-rich language pairs [Hassan et al., 2018]. However, standard NMT systems are designed for sentence-level translation, and so cannot consider the dependencies among sentences and translate entire documents. To address this challenge, various document-level NMT models, viz., context-aware models, have been proposed to leverage context beyond a single sentence [Wang et al., 2017; Miculicich et al., 2018; Zhang et al., 2018; Yang et al., 2019] and have achieved substantial improvements over their context-agnostic counterparts.

Figure 1 briefly illustrates typical context-aware models, where the source and/or target document contexts are regarded as an additional input stream parallel to the current sentence, and are incorporated into each layer of the encoder and/or decoder [Zhang et al., 2018; Tan et al., 2019]. More specifically, the representation of each word in the current sentence is a deep hybrid of both global document context and local sentence context in every layer. We notice that these hybrid encoding approaches have two main weaknesses:

• Models are context-aware, but do not fully exploit the context. The deep hybrid makes the model more sensitive to noise in the context, especially when the context is enlarged. This could explain why previous studies show that enlarging the context leads to performance degradation. Therefore, these approaches have not taken the best advantage of the entire document context.

• Models translate documents, but cannot translate single sentences. Because the deep hybrid requires global document context as additional input, these models are no longer compatible with sentence-level translation based solely on the local sentence context. As a result, these approaches usually translate single-sentence documents poorly when no document-level context is available.

In this paper, we mitigate the aforementioned two weaknesses by designing a general-purpose NMT architecture which can fully exploit the context in documents with an arbitrary number of sentences. To avoid the deep hybrid, our architecture balances local context and global context in a more deliberate way. More specifically, our architecture independently encodes local context in the source sentence, instead of mixing it with global context from the beginning, so it is robust when the global context is large and noisy. Furthermore, our architecture translates in a sentence-by-sentence manner with access to the partially generated document translation as the target global context, which allows the local context to govern the translation process for single-sentence documents.

We highlight our contributions in three aspects:

• We propose a new NMT framework that is able to deal with documents containing any number of sentences, including single-sentence documents, making training and deployment simpler and more flexible.

• We conduct experiments on four document-level translation benchmark datasets, which show that the proposed unified approach outperforms Transformer baselines and previous state-of-the-art document-level NMT models both for sentence-level and document-level translation.

• Based on thorough analyses, we demonstrate that the document context really matters; and the more context provided, the better our model translates. This finding is in contrast to the prevailing consensus that a wider context deteriorates translation quality.

2 Related Work

Context beyond the current sentence is crucial for machine translation. Bawden et al. [2018], Läubli et al. [2018], Müller et al. [2018], Voita et al. [2018] and Voita et al. [2019b] show that without access to the document-level context, NMT is likely to fail to maintain lexical, tense, deixis and ellipsis consistencies, and to resolve anaphoric pronouns and other discourse characteristics, and they propose corresponding testsets for evaluating discourse phenomena in NMT.

Most of the current document-level NMT models can be classified into two main categories: context-aware models and post-processing models. The post-processing models introduce an additional module that learns to refine the translations produced by context-agnostic NMT systems so as to be more discourse-coherent [Xiong et al., 2019; Voita et al., 2019a]. While this kind of approach is easy to deploy, the two-stage generation process may result in error accumulation. In this paper, we pay attention mainly to context-aware models. In contrast to previous context-aware approaches, our model can consider the entire, arbitrarily long document and simultaneously exploit contexts in both source and target languages. Furthermore, most of these document-level models cannot be applied to sentence-level translation, lacking both simplicity and flexibility in practice. They rely on variants of components specifically designed for document context (e.g., encoder/decoder-to-context attention embedded in all layers [Zhang et al., 2018; Miculicich et al., 2018; Tan et al., 2019]), being limited to the scenario where the document context must be an additional input stream. Thanks to our general-purpose modeling, the proposed model manages to do general translation regardless of the number of sentences of the input text.

3 Background

Sentence-level NMT. Standard NMT models usually model sentence-level translation (SENTNMT) within an encoder-decoder framework [Bahdanau et al., 2015]. SENTNMT models aim to maximize the conditional log-likelihood log p(y|x; θ) over a target sentence y = ⟨y_1, …, y_T⟩ given a source sentence x = ⟨x_1, …, x_I⟩ from abundant parallel bilingual data D_s = {x^{(m)}, y^{(m)}}_{m=1}^{M} of i.i.d. observations:

L(D_s; \theta) = \sum_{m=1}^{M} \log p(y^{(m)} \mid x^{(m)}; \theta).

Document-level NMT. Given a document-level parallel dataset D_d = {X^{(m)}, Y^{(m)}}_{m=1}^{M}, where X^{(m)} = ⟨x_k^{(m)}⟩_{k=1}^{n} is a source document containing n sentences while Y^{(m)} = ⟨y_k^{(m)}⟩_{k=1}^{n} is a target document with n sentences, the training criterion for a document-level NMT model (DOCNMT) is to maximize the conditional log-likelihood over pairs of parallel documents, decomposed sentence by sentence:

L(D_d; \theta) = \sum_{m=1}^{M} \log p(Y^{(m)} \mid X^{(m)}; \theta) = \sum_{m=1}^{M} \sum_{k=1}^{n} \log p(y_k^{(m)} \mid y_{<k}^{(m)}, x_k^{(m)}, X^{(m)}; \theta).
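To make the decomposition above concrete, here is a minimal sketch of the document-level objective as a sum of per-sentence, per-token negative log-likelihoods. How the log-probabilities are produced is model-specific (Section 4); the function and variable names below are ours, not the authors' code.

```python
import torch

def doc_nll(per_sentence_logprobs):
    """Document-level NLL: negative sum over sentences k and tokens t of
    log p(y_{k,t} | y_{k,<t}, y_{<k}, x_k, X), given as one 1-D tensor per sentence."""
    return -sum(lp.sum() for lp in per_sentence_logprobs)

# toy usage: a 2-sentence "document" with random per-token log-probabilities
doc = [torch.log(torch.rand(5)), torch.log(torch.rand(7))]
print(float(doc_nll(doc)))
```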


Figure 2: Illustration of the proposed model. The local encoding is complete and independent, which also allows context-agnostic generation.

local context can directly flow through to the decoder to dominate translation. (Section 4.1)

• Once the local and global understanding of the source document is constructed, the decoder generates the target document on a sentence-by-sentence basis, based on source representations of the current sentence as well as target global context from the previously translated history and local context from the partial translation so far. (Section 4.2)

This general-purpose modeling allows the proposed model to fully utilize the bilingual, entire-document context and to go beyond the restricted scenario where models must have document context as an additional input stream and fail to translate single sentences. These two advantages meet our expectation of a unified and general NMT framework.

4.1 Encoder

Lexical and Positional Encoding. The source input is transformed into lexical and positional representations. We use the word position embedding of the Transformer [Vaswani et al., 2017] to represent the order of words. Note that we reset word positions for each sentence, i.e., the i-th word in every sentence shares the word position embedding E_i^w. Besides, we introduce a segment embedding E_k^s to represent the k-th sentence. Therefore, the representation of the i-th word in the k-th sentence is given by x̃_{k,i} = E[x_{k,i}] + E_k^s + E_i^w, where E[x_{k,i}] denotes the word embedding of x_{k,i}.

Local Context Encoding. We construct the local context for each sentence with a stack of standard Transformer layers [Vaswani et al., 2017]. Given the k-th source sentence x_k, the local encoder leverages N − 1 stacked layers to map it into encoded representations:

\hat{h}_k^l = MultiHead(SelfAttn(h_k^{l-1}, h_k^{l-1}, h_k^{l-1})),
h_k^l = LayerNorm(FeedForward(\hat{h}_k^l) + \hat{h}_k^l),

where SelfAttn(Q, K, V) denotes self-attention, with Q, K, V indicating queries, keys, and values, respectively, and MultiHead(·) means the attention is performed in a multi-headed fashion [Vaswani et al., 2017]. We let the input representations x̃_k be the 0-th layer representations h_k^0, and we take the (N − 1)-th layer of the local encoder as the local context of each sentence, i.e., h_k^L = h_k^{N-1}.

Global Context Encoding. We add an additional layer on top of the local context encoding layers, which retrieves global context from the entire document by a segment-level relative attention, and outputs final representations based on hybrid local and global context via a gated context fusion mechanism.

Segment-level Relative Attention. Given the local representations of each sentence, we propose to extend relative attention [Shaw et al., 2018] from the token level to the segment level to model the inter-sentence global context:

h^G = MultiHead(Seg-Attn(h^L, h^L, h^L)),

where Seg-Attn(Q, K, V) denotes the proposed segment-level relative attention. Taking x_{k,i} as a query for example, its contextual representation z_{k,i} under the proposed attention is computed over all words (e.g., x_{κ,j}) in the document with respect to the sentence (segment) they belong to:

z_{k,i} = \sum_{\kappa=1}^{n} \sum_{j=1}^{|x_\kappa|} \alpha_{k,i}^{\kappa,j} (W^V x_{\kappa,j} + \gamma_{k-\kappa}^V),
\alpha_{k,i}^{\kappa,j} = softmax(e_{k,i}^{\kappa,j}),

where \alpha_{k,i}^{\kappa,j} is the attention weight of x_{k,i} to x_{\kappa,j}. The corresponding attention logit e_{k,i}^{\kappa,j} is computed with respect to the relative sentence distance by:

e_{k,i}^{\kappa,j} = (W^Q x_{k,i})(W^K x_{\kappa,j} + \gamma_{k-\kappa}^K)^\top / \sqrt{d_z},   (1)

where \gamma_{k-\kappa}^{*} is a parameter vector corresponding to the relative distance between the k-th and κ-th sentences, providing inter-sentential clues, and W^Q, W^K, and W^V are linear projection matrices for the queries, keys and values, respectively.
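The following is a minimal, single-head sketch of segment-level relative attention as we read Eq. (1): per-sentence-distance bias vectors are added to the keys and values. The tensor layout, the clipping of the relative distance, and all names are our assumptions rather than the authors' implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentRelativeAttention(nn.Module):
    """Single-head sketch of Eq. (1): relative-distance vectors gamma_{k-kappa}
    are added to keys and values, with the distance measured between sentences.
    `max_dist` (distance clipping) is an assumption, not from the paper."""
    def __init__(self, d_model: int, max_dist: int = 20):
        super().__init__()
        self.wq = nn.Linear(d_model, d_model, bias=False)
        self.wk = nn.Linear(d_model, d_model, bias=False)
        self.wv = nn.Linear(d_model, d_model, bias=False)
        self.gamma_k = nn.Embedding(2 * max_dist + 1, d_model)
        self.gamma_v = nn.Embedding(2 * max_dist + 1, d_model)
        self.max_dist = max_dist
        self.scale = math.sqrt(d_model)

    def forward(self, h_local: torch.Tensor, sent_id: torch.Tensor) -> torch.Tensor:
        # h_local: [T, d] local-context token states of the whole document
        # sent_id: [T] index of the sentence each token belongs to
        q, k, v = self.wq(h_local), self.wk(h_local), self.wv(h_local)
        rel = sent_id[:, None] - sent_id[None, :]               # [T, T], k - kappa
        rel = rel.clamp(-self.max_dist, self.max_dist) + self.max_dist
        gk, gv = self.gamma_k(rel), self.gamma_v(rel)           # [T, T, d]
        logits = (q[:, None, :] * (k[None, :, :] + gk)).sum(-1) / self.scale
        attn = F.softmax(logits, dim=-1)                        # weights over all tokens
        return (attn[:, :, None] * (v[None, :, :] + gv)).sum(dim=1)

# toy usage: a 3-sentence document of 10 tokens, d_model = 8
h = torch.randn(10, 8)
sid = torch.tensor([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
print(SegmentRelativeAttention(8)(h, sid).shape)  # torch.Size([10, 8])
```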

Gated Context Fusion. After the global context is retrieved, we adopt a gating mechanism to obtain the final encoder representations h by fusing local and global context:

g = \sigma(W_g [h^L; h^G]),
h = LayerNorm((1 - g) \odot h^L + g \odot h^G),

where W_g is a learnable linear transformation, [·; ·] denotes the concatenation operation, \sigma(·) is the sigmoid activation, which keeps the value of the fusion gate between 0 and 1, and \odot indicates element-wise multiplication.
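A compact sketch of the fusion gate as we understand it (module and parameter names are ours; the paper does not spell out implementation details):

```python
import torch
import torch.nn as nn

class GatedContextFusion(nn.Module):
    """Fuse local (h^L) and global (h^G) states with a sigmoid gate."""
    def __init__(self, d_model: int):
        super().__init__()
        self.w_g = nn.Linear(2 * d_model, d_model)   # W_g over [h^L; h^G]
        self.norm = nn.LayerNorm(d_model)

    def forward(self, h_local: torch.Tensor, h_global: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.w_g(torch.cat([h_local, h_global], dim=-1)))
        return self.norm((1.0 - gate) * h_local + gate * h_global)

# toy usage
fusion = GatedContextFusion(8)
print(fusion(torch.randn(10, 8), torch.randn(10, 8)).shape)  # torch.Size([10, 8])
```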
4.2 Decoder

The goal of the decoder is to generate the target document sentence by sentence, considering the previously generated sentences as target global context. A natural idea is to store the hidden states of previous target translations and allow the self-attention of the decoder to access these hidden states as extended history context.

To that purpose, we leverage and extend Transformer-XL [Dai et al., 2019] as the decoder. Transformer-XL is a Transformer variant designed to cache and reuse the hidden states computed for the previous segment as extended context, so that long-term dependency information occurring many words back can propagate through the recurrence connections between segments, which meets our requirement of generating document-long text. We cast each sentence as a "segment" in the translation task and equip the Transformer-XL based decoder with cross-attention to retrieve time-dependent source context for the current sentence. Formally, given two consecutive sentences y_k and y_{k-1}, the l-th layer of our decoder first employs self-attention over the extended history context:

\tilde{s}_k^{l-1} = [SG(s_{k-1}^{l-1}); s_k^{l-1}],
\bar{s}_k^{l} = MultiHead(Rel-SelfAttn(s_k^{l-1}, \tilde{s}_k^{l-1}, \tilde{s}_k^{l-1})),
\bar{s}_k^{l} = LayerNorm(\bar{s}_k^{l} + s_k^{l-1}),

where the function SG(·) stands for stop-gradient, and Rel-SelfAttn(Q, K, V) is a variant of self-attention with word-level relative position encoding. For more specific details, please refer to [Dai et al., 2019]. After that, the cross-attention module fetches the source context for the current sentence from the encoder representations h_k:

\hat{s}_k^{l} = MultiHead(CrossAttn(\bar{s}_k^{l}, h_k, h_k)),
s_k^{l} = LayerNorm(FeedForward(\hat{s}_k^{l}) + \hat{s}_k^{l}).

Given the final representations s_k^N of the last decoder layer, the generation probabilities of the current target sentence y_k are computed as:

p(y_k \mid y_{<k}, X) = \prod_{t} softmax(E[y_{k,t}]^\top s_{k,t}^N).
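To illustrate the decoder-side recurrence, here is a schematic sketch of prepending a gradient-detached cache of the previous sentence's states to the keys and values of self-attention, in the spirit of Transformer-XL. We use standard multi-head attention rather than the relative-position Rel-SelfAttn, omit the causal mask over the current sentence, and all names are ours.

```python
import torch
import torch.nn as nn

class CachedSelfAttention(nn.Module):
    """Self-attention over [SG(previous-sentence states); current states].

    Mirrors the s~ = [SG(s_{k-1}); s_k] step; the word-level relative position
    encoding and the causal mask are left out to keep the sketch short."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, s_curr: torch.Tensor, s_prev: torch.Tensor) -> torch.Tensor:
        # s_curr: [B, T_k, d] states of sentence y_k; s_prev: [B, T_{k-1}, d]
        memory = torch.cat([s_prev.detach(), s_curr], dim=1)   # detach == SG(.)
        out, _ = self.attn(query=s_curr, key=memory, value=memory)
        return self.norm(out + s_curr)

# toy usage
layer = CachedSelfAttention(8)
print(layer(torch.randn(2, 5, 8), torch.randn(2, 7, 8)).shape)  # torch.Size([2, 5, 8])
```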


5 Experiment

We experiment on four widely used document-level parallel datasets in two language pairs for machine translation:

• TED (ZH-EN/EN-DE). The Chinese-English and English-German TED datasets are from the IWSLT 2015 and 2017 evaluation campaigns, respectively. We mainly explore and develop our approach on TED ZH-EN, where we take dev2010 as the development set and tst2010-2013 as the testset. For TED EN-DE, we use tst2016-2017 as our testset and the rest as the development set.

• News (EN-DE). We take News Commentary v11 as our training set. The WMT newstest2015 and newstest2016 are used as development and testsets, respectively.

• Europarl (EN-DE). The corpus is extracted from Europarl v7 according to the method described in Maruf et al. [2019].¹

We applied byte pair encoding [Sennrich et al., 2016, BPE] to segment all sentences with 32K merge operations. We split each document into chunks of 20 sentences to alleviate memory consumption when training our proposed models. We used the Transformer architecture as our sentence-level, context-agnostic baseline and developed our proposed model on top of it. For models on TED ZH-EN, we used a configuration smaller than transformer base [Vaswani et al., 2017], with model dimension d_z = 256, feed-forward dimension d_ffn = 512 and number of layers N = 4. For models on the remaining datasets, we change the dimensions to 512/2048. We used the Adam optimizer [Kingma and Ba, 2014] and the same learning rate schedule strategy as Vaswani et al. [2017] with 8,000 warmup steps. The training batch consisted of approximately 2048 source tokens and 2048 target tokens. Label smoothing [Szegedy et al., 2016] with value 0.1 was used for training. For inference, we used beam search with a width of 5 and a length penalty of 0.6. The evaluation metric is BLEU [Papineni et al., 2002]. We did not apply checkpoint averaging [Vaswani et al., 2017] to the parameters for evaluation.

¹ The last two corpora are from Maruf et al. [2019].
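For reference, the hyperparameters reported above can be summarized as a small configuration sketch; the dictionary keys and the inverse-square-root schedule helper below use our own naming, following the schedule of Vaswani et al. [2017], and are not a released configuration file.

```python
# Hyperparameters as reported in Section 5 (TED ZH-EN setting); key names are ours.
TED_ZH_EN_CONFIG = {
    "d_model": 256,          # 512 for TED EN-DE / News / Europarl
    "d_ffn": 512,            # 2048 for the larger setting
    "num_layers": 4,
    "bpe_merges": 32_000,
    "doc_chunk_sentences": 20,
    "optimizer": "adam",
    "warmup_steps": 8_000,
    "batch_tokens": 2_048,   # per side (source and target)
    "label_smoothing": 0.1,
    "beam_size": 5,
    "length_penalty": 0.6,
}

def transformer_lr(step: int, d_model: int = 256, warmup: int = 8_000) -> float:
    """Inverse-square-root learning-rate schedule of Vaswani et al. [2017]."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

print(transformer_lr(8_000))  # peak learning rate at the end of warmup
```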

5.1 Main Results

Document-level Translation. We list the results of our experiments in Table 1, comparing four context-aware NMT models: the Document-aware Transformer [Zhang et al., 2018, DocT], Hierarchical Attention NMT [Miculicich et al., 2018, HAN], Selective Attention NMT [Maruf et al., 2019, SAN] and the Query-guided Capsule Network [Yang et al., 2019, QCN]. As shown in Table 1, by leveraging document context, our proposed model obtains gains of 2.1, 2.0, 2.5, and 1.0 BLEU over the sentence-level Transformer baselines on the TED ZH-EN, TED EN-DE, News and Europarl datasets, respectively. Our model achieves new state-of-the-art results on TED ZH-EN and Europarl, showing the superiority of exploiting the whole document context. Though our model is not the best on the TED EN-DE and News tasks, it is still comparable with QCN and HAN, and it achieves the best average performance on the English-German benchmarks, by at least 0.47 BLEU over the best previous model.

Model                              ∆|θ|    v_train   v_test   TED (ZH-EN)   TED (EN-DE)   News (EN-DE)   Europarl (EN-DE)   avg. (EN-DE)
SENTNMT [Vaswani et al., 2017]     0.0m    1.0×      1.0×     17.0          23.10         22.40          29.40              24.96
DocT [Zhang et al., 2018]          9.5m    0.65×     0.98×    n/a           24.00         23.08          29.32              25.46
HAN [Miculicich et al., 2018]      4.8m    0.32×     0.89×    17.9          24.58         25.03          28.60              26.07
SAN [Maruf et al., 2019]           4.2m    0.51×     0.86×    n/a           24.42         24.84          29.75              26.33
QCN [Yang et al., 2019]            n/a     n/a       n/a      n/a           25.19         22.37          29.82              25.79
OURS                               4.7m    0.22×     1.08×    19.1          25.10         24.91          30.40              26.80

Table 1: Experimental results of our model in comparison with several baselines, including the increase in the number of parameters over the Transformer baseline (∆|θ|), training/testing speeds (v_train/v_test, some of which are derived from Maruf et al. [2019]), and translation results on the testsets in BLEU.

The stronger QCN result on TED EN-DE may benefit from the regularizations introduced in Yang et al. [2019]. In addition, while sacrificing training speed, the parameter increment and decoding speed of our model remain manageable.

Model                                   Test
SENTNMT                                 17.0
DOCNMT (documents as input/output)      14.2
HAN [Miculicich et al., 2018]           15.6
OURS                                    17.8

Table 2: Results of sentence-level translation on TED ZH-EN.

Sentence-level Translation. We compare performance on single-sentence translation in Table 2, which demonstrates the good compatibility of our proposed model with both document and sentence translation, whereas the performance of the other approaches lags far behind the sentence-level baseline. The reason is that the previous approaches require document context as a separate input stream, while our proposed model does not. This difference ensures feasibility for both document- and sentence-level translation in this unified framework. Therefore, our proposed model can be directly used in general translation tasks with input text of any number of sentences, which is more deployment-friendly.

Model                                         BLEU (BLEU_doc)
SENTNMT [Vaswani et al., 2017]                11.4 (21.0)
DOCNMT (documents as input/output)            n/a (17.0)
Modeling source context
  Doc2Sent                                    6.8
  + reset word positions for each sentence    10.0
  + segment embedding                         10.5
  + segment-level relative attention          12.2
  + context fusion gate                       12.4
Modeling target context
  Transformer-XL decoder [Sent2Doc]           12.4
Final model [OURS]                            12.9 (24.4)

Table 3: Ablation study on modeling context on the TED ZH-EN development set. "Doc" means using an entire document as a single sequence for input or output. BLEU_doc indicates the document-level BLEU score calculated on the concatenation of all output sentences.

5.2 Analysis and Discussion

Does Bilingual Context Really Matter? Yes. To investigate how important the bilingual context is and the corresponding contributions of each component, we summarize the ablation study in Table 3. First of all, using the entire document directly as input and output cannot even generate a document translation with the same number of sentences as the source document, and it is much worse than the sentence-level baseline and our model in terms of document-level BLEU. For source context modeling, merely casting the whole source document as one input sequence (Doc2Sent) does not work. Meanwhile, resetting word positions and introducing segment embeddings for each sentence alleviate this problem, which verifies one of our motivations that we should focus more on local sentences. Moreover, the gains from the segment-level relative attention and the gated context fusion mechanism demonstrate that retrieving and integrating source global context are useful for document translation. As for the target context, employing the Transformer-XL decoder to exploit the target historical global context also leads to better performance on document translation. This is somewhat in contrast to Zhang et al. [2018], who claim that using target context leads to error propagation. In the end, by jointly modeling both source and target contexts, our final model obtains the best performance.
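As a side note, the document-level BLEU reported in Table 3 (BLEU_doc, computed on the concatenation of all output sentences) can be obtained with an off-the-shelf scorer; the sketch below uses sacrebleu and assumes one hypothesis and reference string per sentence, which is our reading of the setup rather than the authors' exact evaluation script.

```python
import sacrebleu

def document_bleu(hyp_sents, ref_sents):
    """BLEU_doc: concatenate all output sentences into one long segment and
    score it against the concatenated references."""
    hyp_doc = " ".join(hyp_sents)
    ref_doc = " ".join(ref_sents)
    return sacrebleu.corpus_bleu([hyp_doc], [[ref_doc]]).score

# toy usage
print(document_bleu(["the cat sat .", "it slept ."],
                    ["the cat sat .", "it slept ."]))  # 100.0
```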
Effect of Quantity of Context: the More, the Better. We also experiment to show how the quantity of context affects our model in document translation. As shown in Figure 3, we find that providing only one adjacent sentence as context already helps performance on document translation, and that the more context is given, the better the translation quality is, although there does seem to be an upper limit of 20 sentences. Successfully incorporating context of this size is something related work has not achieved [Zhang et al., 2018; Miculicich et al., 2018; Yang et al., 2019]. We attribute this advantage to our hierarchical model design, which yields more gains than pains from the increasingly noisy global context, guided by the well-formed, uncorrupted local context.

Effect of Transfer Learning: Data Hunger Remains a Problem for Document-level Translation. Due to the limited amount of document-level parallel data, exploiting sentence-level parallel corpora or monolingual document-level corpora is drawing more attention. We investigate transfer learning (TL) approaches on TED ZH-EN. We pretrain our model on the WMT18 ZH-EN sentence-level parallel corpus with 7m sentence pairs, where every single sentence is regarded as a document. We then continue to finetune the pretrained model on TED ZH-EN document-level parallel data (source & target TL). We also compare to a variant in which only the encoder is initialized (source TL). As shown in Table 4, the transfer learning approach can help alleviate the need for document-level data in source and target languages to some extent. However, the scarcity of document-level parallel data still prevents document-level NMT from being extended at scale.
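The two transfer-learning variants can be realized by selectively loading pretrained weights. A rough sketch, under the assumption of a PyTorch model whose encoder parameters are prefixed with "encoder." (our naming, not the authors' code):

```python
import torch

def load_pretrained(model, ckpt_path: str, source_only: bool = False):
    """source TL: initialize only encoder.* weights; source & target TL: all weights."""
    state = torch.load(ckpt_path, map_location="cpu")
    if source_only:
        state = {k: v for k, v in state.items() if k.startswith("encoder.")}
    # strict=False leaves any parameters missing from `state` at their random init
    missing, unexpected = model.load_state_dict(state, strict=False)
    return missing, unexpected
```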


Figure 3: BLEU score w.r.t. #sent. of context on TED ZH-EN.

Model                               Dev     Test
Transformer [Vaswani et al., 2017]  11.4    17.0
BERT+MLM [Li et al., 2019]          n/a     20.7
OURS                                12.9    19.1
OURS + source TL                    13.9    19.7
OURS + source & target TL           14.9    21.3

Table 4: Effect of transfer learning (TL).

Figure 4: Visualization of sentence-to-sentence attention based on segment-level relative attention. Each row represents a sentence while each column represents another sentence to be attended. The weights of each row sum to 1.

Model                   deixis   lex.c.   ell.infl.   ell.VP
SENTNMT                 50.0     45.9     52.2        24.2
OURS                    61.3     46.1     61.0        35.6
Voita et al. [2019b]∗   81.6     58.1     72.2        80.0

Table 5: Accuracy (%) of discourse phenomena. ∗Different data and system conditions; only for reference.

What Does the Model Learn about Context? A Case Study. Furthermore, we are interested in what the proposed model learns about context. In Figure 4, we visualize the sentence-to-sentence attention weights of a source document based on the segment-level relative attention. Formally, the weight of the k-th sentence attending to the κ-th sentence is computed by \alpha_k^{\kappa} = \frac{1}{|x_k|} \sum_i \sum_j \alpha_{k,i}^{\kappa,j}, where \alpha_{k,i}^{\kappa,j} is defined by Eq. (1). As shown in Figure 4, we find very interesting patterns (which are also prevalent in other cases): 1) the first two sentences (blue frame), which contain the main topic and idea of a document, seem to be a very useful context for all sentences; 2) the previous and subsequent adjacent sentences (red and purple diagonals, respectively) draw dense attention, which indicates the importance of the surrounding context; 3) although surrounding contexts are crucial, the subsequent sentence significantly outweighs the previous one. This may imply that the lack of target future information, but the availability of past information, in the decoder forces the encoder to retrieve more knowledge about the next sentence than about the previous one; 4) the model seems not to care that much about the current sentence, probably because the local context can flow through the context fusion gate, so the segment-level relative attention just focuses on fetching useful global context; 5) the 6-th sentence also attracts attention from all the others (brown frame), and may play a special role in the inspected document.
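The aggregation used for this visualization is straightforward to reproduce from the token-level attention weights; a small sketch (tensor layout assumed, not taken from the authors' code):

```python
import torch

def sentence_to_sentence_attention(attn: torch.Tensor, sent_id: torch.Tensor,
                                   n_sents: int) -> torch.Tensor:
    """alpha[k, kappa] = (1 / |x_k|) * sum over query tokens i of sentence k and
    key tokens j of sentence kappa of the token-level weights alpha_{k,i}^{kappa,j}.

    attn:    [T, T] token-to-token weights (each row sums to 1), e.g. from the
             segment-level relative attention sketch above.
    sent_id: [T] sentence index of every token.
    """
    alpha = torch.zeros(n_sents, n_sents)
    for k in range(n_sents):
        rows = attn[sent_id == k]                       # queries of sentence k
        for kappa in range(n_sents):
            alpha[k, kappa] = rows[:, sent_id == kappa].sum() / rows.shape[0]
    return alpha                                        # each row sums to 1

# toy usage: uniform attention over 6 tokens in 3 sentences
sid = torch.tensor([0, 0, 1, 1, 2, 2])
print(sentence_to_sentence_attention(torch.full((6, 6), 1 / 6), sid, 3))
```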
Analysis on Discourse Phenomena. We also want to examine whether the proposed model actually learns to utilize document context to resolve discourse inconsistencies that context-agnostic models cannot handle. We use the contrastive test sets for the evaluation of discourse phenomena in English-Russian by Voita et al. [2019b]. There are four test sets in the suite, regarding deixis, lexical cohesion, ellipsis (inflection), and ellipsis (verb phrase). Each testset contains groups of contrastive examples consisting of a positive translation with the correct discourse phenomenon and negative translations with incorrect phenomena. The goal is to determine whether a model is more likely to generate the correct translation than the incorrect variants. We summarize the results in Table 5. Our model is better at resolving discourse consistencies than the context-agnostic baseline. Voita et al. [2019b] use a context-agnostic baseline, trained on 4× larger data, to generate first-pass drafts, and then perform post-processing; this is not directly comparable, but could easily be combined with our model to achieve better results.

6 Conclusion

In this paper, we propose a unified local and global NMT framework, which can successfully exploit context regardless of how many sentences are in the input. Extensive experimentation and analysis show that our model has indeed learned to leverage a larger context. In future work we will investigate the feasibility of extending our approach to other document-level NLP tasks, e.g., summarization.

Acknowledgements

Shujian Huang is the corresponding author. This work was supported by the National Science Foundation of China (No. U1836221, 61772261, 61672277). Zaixiang Zheng was also supported by the China Scholarship Council (No. 201906190162). Alexandra Birch was supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 825299 (GoURMET) and also by the UK EPSRC fellowship grant EP/S001271/1 (MTStretch).


References

[Bahdanau et al., 2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
[Bawden et al., 2018] Rachel Bawden, Rico Sennrich, Alexandra Birch, and Barry Haddow. Evaluating discourse phenomena in neural machine translation. In NAACL-HLT, 2018.
[Dai et al., 2019] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan R. Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In ACL, 2019.
[Hassan et al., 2018] Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567, 2018.
[Jean et al., 2017] Sébastien Jean, Stanislas Lauly, Orhan Firat, and Kyunghyun Cho. Does neural machine translation benefit from larger context? CoRR, abs/1704.05135, 2017.
[Junczys-Dowmunt, 2019] Marcin Junczys-Dowmunt. Microsoft Translator at WMT 2019: Towards large-scale document-level neural machine translation. In WMT, 2019.
[Kingma and Ba, 2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2014.
[Kuang et al., 2018] Shaohui Kuang, Deyi Xiong, Weihua Luo, and Guodong Zhou. Modeling coherence for neural machine translation with dynamic and topic caches. In COLING, 2018.
[Läubli et al., 2018] Samuel Läubli, Rico Sennrich, and Martin Volk. Has machine translation achieved human parity? A case for document-level evaluation. In EMNLP, 2018.
[Li et al., 2019] Liangyou Li, Xin Jiang, and Qun Liu. Pretrained language models for document-level neural machine translation. arXiv preprint, 2019.
[Maruf and Haffari, 2018] Sameen Maruf and Gholamreza Haffari. Document context neural machine translation with memory networks. In ACL, 2018.
[Maruf et al., 2019] Sameen Maruf, André F. T. Martins, and Gholamreza Haffari. Selective attention for context-aware neural machine translation. In NAACL-HLT, 2019.
[Miculicich et al., 2018] Lesly Miculicich, Dhananjay Ram, Nikolaos Pappas, and James Henderson. Document-level neural machine translation with hierarchical attention networks. In EMNLP, 2018.
[Müller et al., 2018] Mathias Müller, Annette Rios, Elena Voita, and Rico Sennrich. A large-scale test set for the evaluation of context-aware pronoun translation in neural machine translation. In WMT, 2018.
[Papineni et al., 2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, 2002.
[Sennrich et al., 2016] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL, 2016.
[Shaw et al., 2018] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In NAACL-HLT, 2018.
[Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
[Szegedy et al., 2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In CVPR, 2016.
[Tan et al., 2019] Xin Tan, Longyin Zhang, Deyi Xiong, and Guodong Zhou. Hierarchical modeling of global context for document-level neural machine translation. In EMNLP-IJCNLP, 2019.
[Tiedemann and Scherrer, 2017] Jörg Tiedemann and Yves Scherrer. Neural machine translation with extended context. In DiscoMT, 2017.
[Tu et al., 2018] Zhaopeng Tu, Yang Liu, Shuming Shi, and Tong Zhang. Learning to remember translation history with a continuous cache. TACL, 2018.
[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
[Voita et al., 2018] Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. Context-aware neural machine translation learns anaphora resolution. In ACL, 2018.
[Voita et al., 2019a] Elena Voita, Rico Sennrich, and Ivan Titov. Context-aware monolingual repair for neural machine translation. In EMNLP-IJCNLP, 2019.
[Voita et al., 2019b] Elena Voita, Rico Sennrich, and Ivan Titov. When a good translation is wrong in context: Context-aware machine translation improves on deixis, ellipsis, and lexical cohesion. In ACL, 2019.
[Wang et al., 2017] Longyue Wang, Zhaopeng Tu, Andy Way, and Qun Liu. Exploiting cross-sentence context for neural machine translation. In EMNLP, 2017.
[Xiong et al., 2019] Hao Xiong, Zhongjun He, Hua Wu, and Haifeng Wang. Modeling coherence for discourse neural machine translation. In AAAI, 2019.
[Yang et al., 2019] Zhengxin Yang, Jinchao Zhang, Fandong Meng, Shuhao Gu, Yang Feng, and Jie Zhou. Enhancing context modeling with a query-guided capsule network for document-level translation. In EMNLP-IJCNLP, 2019.
[Zhang et al., 2018] Jiacheng Zhang, Huanbo Luan, Maosong Sun, Feifei Zhai, Jingfang Xu, Min Zhang, and Yang Liu. Improving the Transformer translation model with document-level context. In EMNLP, 2018.
