

Towards Making the Most of Context in Neural Machine Translation

Zaixiang Zheng1∗, Xiang Yue1∗, Shujian Huang1, Jiajun Chen1 and Alexandra Birch2
1National Key Laboratory for Novel Software Technology, Nanjing University
2ILCC, School of Informatics, University of Edinburgh
{zhengzx,xiangyue}@smail.nju.edu.cn, {huangsj,chenjj}@nju.edu.cn, [email protected]

∗Equal contribution. This work was done when Zaixiang was visiting the University of Edinburgh.

Abstract

Document-level machine translation manages to outperform sentence-level models by a small margin, but has failed to be widely adopted. We argue that previous research did not make clear use of the global context, and propose a new document-level NMT framework that deliberately models the local context of each sentence with awareness of the global context of the document in both source and target languages. We specifically design the model to be able to deal with documents containing any number of sentences, including single sentences. This unified approach allows our model to be trained elegantly on standard datasets without needing to train on sentence- and document-level data separately. Experimental results demonstrate that our model outperforms Transformer baselines and previous document-level NMT models with substantial margins of up to 2.1 BLEU over state-of-the-art baselines. We also provide analyses which show the benefit of context far beyond the neighboring two or three sentences, which previous studies have typically incorporated.

Figure 1: Illustration of typical Transformer-based context-aware approaches (some of them do not consider target context (grey line)).

1 Introduction

Recent studies suggest that neural machine translation (NMT) [Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017] has achieved human parity, especially on resource-rich language pairs [Hassan et al., 2018]. However, standard NMT systems are designed for sentence-level translation, and so cannot consider the dependencies among sentences and translate entire documents. To address this challenge, various document-level NMT models, viz., context-aware models, have been proposed to leverage context beyond a single sentence [Wang et al., 2017; Miculicich et al., 2018; Zhang et al., 2018; Yang et al., 2019] and have achieved substantial improvements over their context-agnostic counterparts.

Figure 1 briefly illustrates typical context-aware models, where the source and/or target document contexts are regarded as an additional input stream parallel to the current sentence, and are incorporated into each layer of the encoder and/or decoder [Zhang et al., 2018; Tan et al., 2019]. More specifically, the representation of each word in the current sentence is a deep hybrid of both global document context and local sentence context in every layer. We notice that these hybrid encoding approaches have two main weaknesses:

• Models are context-aware, but do not fully exploit the context. The deep hybrid makes the model more sensitive to noise in the context, especially when the context is enlarged. This could explain why previous studies show that enlarging the context leads to performance degradation. Therefore, these approaches have not taken the best advantage of the entire document context.

• Models translate documents, but cannot translate single sentences. Because the deep hybrid requires global document context as additional input, these models are no longer compatible with sentence-level translation based solely on the local sentence context. As a result, these approaches usually translate single-sentence documents poorly when no document-level context is available.

In this paper, we mitigate the aforementioned two weaknesses by designing a general-purpose NMT architecture which can fully exploit the context in documents with an arbitrary number of sentences. To avoid the deep hybrid, our architecture balances local context and global context in a more deliberate way. More specifically, our architecture independently encodes local context in the source sentence, instead of mixing it with global context from the beginning, so it is robust when the global context is large and noisy. Furthermore, our architecture translates in a sentence-by-sentence manner with access to the partially generated document translation as the target global context, which allows the local context to govern the translation process for single-sentence documents.

We highlight our contributions in three aspects:

• We propose a new NMT framework that is able to deal with documents containing any number of sentences, including single-sentence documents, making training and deployment simpler and more flexible.

• We conduct experiments on four document-level translation benchmark datasets, which show that the proposed unified approach outperforms Transformer baselines and previous state-of-the-art document-level NMT models both for sentence-level and document-level translation.

• Based on thorough analyses, we demonstrate that the document context really matters; and the more context provided, the better our model translates. This finding is in contrast to the prevailing consensus that a wider context deteriorates translation quality.

2 Related Work

Context beyond the current sentence is crucial for machine translation. Bawden et al. [2018], Läubli et al. [2018], Müller et al. [2018], Voita et al. [2018] and Voita et al. [2019b] show that without access to the document-level context, NMT is likely to fail to maintain lexical, tense, deixis and ellipsis consistencies, and to resolve anaphoric pronouns and other discourse characteristics, and they propose corresponding testsets for evaluating discourse phenomena in NMT.

Most of the current document-level NMT models can be classified into two main categories: context-aware models and post-processing models. The post-processing models introduce an additional module that learns to refine the translations produced by context-agnostic NMT systems so as to be more discourse-coherent [Xiong et al., 2019; Voita et al., 2019a]. While this kind of approach is easy to deploy, the two-stage generation process may result in error accumulation. In this paper, we pay attention mainly to context-aware models. In contrast to previous context-aware approaches, our model can consider the entire, arbitrarily long document and simultaneously exploit contexts in both source and target languages. Furthermore, most of these document-level models cannot be applied to sentence-level translation, lacking both simplicity and flexibility in practice. They rely on variants of components specifically designed for document context (e.g., encoder/decoder-to-context attention embedded in all layers [Zhang et al., 2018; Miculicich et al., 2018; Tan et al., 2019]), being limited to the scenario where the document context must be an additional input stream. Thanks to our general-purpose modeling, the proposed model manages to do general translation regardless of the number of sentences of the input text.

3 Background

Sentence-level NMT. Standard NMT models usually model sentence-level translation (SENTNMT) within an encoder-decoder framework [Bahdanau et al., 2015]. SENTNMT models aim to maximize the conditional log-likelihood log p(y|x; θ) over a target sentence y = ⟨y_1, …, y_T⟩ given a source sentence x = ⟨x_1, …, x_I⟩ from abundant parallel bilingual data D_s = {x^{(m)}, y^{(m)}}_{m=1}^{M} of i.i.d. observations:

L(D_s; \theta) = \sum_{m=1}^{M} \log p(y^{(m)} \mid x^{(m)}; \theta).

Document-level NMT. Given a document-level parallel dataset D_d = {X^{(m)}, Y^{(m)}}_{m=1}^{M}, where X^{(m)} = ⟨x_k^{(m)}⟩_{k=1}^{n} is a source document containing n sentences while Y^{(m)} = ⟨y_k^{(m)}⟩_{k=1}^{n} is a target document with n sentences, the training criterion for a document-level NMT model (DOCNMT) is to maximize the conditional log-likelihood over pairs of parallel documents, decomposed sentence by sentence:

L(D_d; \theta) = \sum_{m=1}^{M} \log p(Y^{(m)} \mid X^{(m)}; \theta) = \sum_{m=1}^{M} \sum_{k=1}^{n} \log p(y_k^{(m)} \mid y_{<k}^{(m)}, x_k^{(m)}, X^{(m)}; \theta).
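To make the decomposition above concrete, here is a minimal sketch of the document-level objective as a sum of per-sentence, per-token negative log-likelihoods. How the log-probabilities are produced is model-specific (Section 4); the function and variable names below are ours, not the authors' code.

```python
import torch

def doc_nll(per_sentence_logprobs):
    """Document-level NLL: negative sum over sentences k and tokens t of
    log p(y_{k,t} | y_{k,<t}, y_{<k}, x_k, X), given as one 1-D tensor per sentence."""
    return -sum(lp.sum() for lp in per_sentence_logprobs)

# toy usage: a 2-sentence "document" with random per-token log-probabilities
doc = [torch.log(torch.rand(5)), torch.log(torch.rand(7))]
print(float(doc_nll(doc)))
```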


Figure 2: Illustration of the proposed model. The local encoding is complete and independent, which also allows context-agnostic generation.

local context can directly flow through to the decoder to dominate translation. (Section 4.1)

• Once the local and global understanding of the source document is constructed, the decoder generates the target document on a sentence-by-sentence basis, based on source representations of the current sentence as well as target global context from the previously translated history and local context from the partial translation so far. (Section 4.2)

This general-purpose modeling allows the proposed model to fully utilize the bilingual, entire-document context and to go beyond the restricted scenario where models must have document context as an additional input stream and fail to translate single sentences. These two advantages meet our expectation of a unified and general NMT framework.

4.1 Encoder

Lexical and Positional Encoding. The source input is transformed into lexical and positional representations. We use the word position embedding of the Transformer [Vaswani et al., 2017] to represent the order of words. Note that we reset word positions for each sentence, i.e., the i-th word in every sentence shares the word position embedding E_i^w. Besides, we introduce a segment embedding E_k^s to represent the k-th sentence. Therefore, the representation of the i-th word in the k-th sentence is given by x̃_{k,i} = E[x_{k,i}] + E_k^s + E_i^w, where E[x_{k,i}] denotes the word embedding of x_{k,i}.

Local Context Encoding. We construct the local context for each sentence with a stack of standard Transformer layers [Vaswani et al., 2017]. Given the k-th source sentence x_k, the local encoder leverages N − 1 stacked layers to map it into encoded representations:

\hat{h}_k^l = MultiHead(SelfAttn(h_k^{l-1}, h_k^{l-1}, h_k^{l-1})),
h_k^l = LayerNorm(FeedForward(\hat{h}_k^l) + \hat{h}_k^l),

where SelfAttn(Q, K, V) denotes self-attention, with Q, K, V indicating queries, keys, and values, respectively, and MultiHead(·) means the attention is performed in a multi-headed fashion [Vaswani et al., 2017]. We let the input representations x̃_k be the 0-th layer representations h_k^0, and we take the (N − 1)-th layer of the local encoder as the local context of each sentence, i.e., h_k^L = h_k^{N-1}.

Global Context Encoding. We add an additional layer on top of the local context encoding layers, which retrieves global context from the entire document by a segment-level relative attention, and outputs final representations based on hybrid local and global context via a gated context fusion mechanism.

Segment-level Relative Attention. Given the local representations of each sentence, we propose to extend relative attention [Shaw et al., 2018] from the token level to the segment level to model the inter-sentence global context:

h^G = MultiHead(Seg-Attn(h^L, h^L, h^L)),

where Seg-Attn(Q, K, V) denotes the proposed segment-level relative attention. Taking x_{k,i} as a query for example, its contextual representation z_{k,i} under the proposed attention is computed over all words (e.g., x_{κ,j}) in the document with respect to the sentence (segment) they belong to:

z_{k,i} = \sum_{\kappa=1}^{n} \sum_{j=1}^{|x_\kappa|} \alpha_{k,i}^{\kappa,j} (W^V x_{\kappa,j} + \gamma_{k-\kappa}^V),
\alpha_{k,i}^{\kappa,j} = softmax(e_{k,i}^{\kappa,j}),

where \alpha_{k,i}^{\kappa,j} is the attention weight of x_{k,i} to x_{\kappa,j}. The corresponding attention logit e_{k,i}^{\kappa,j} is computed with respect to the relative sentence distance by:

e_{k,i}^{\kappa,j} = (W^Q x_{k,i})(W^K x_{\kappa,j} + \gamma_{k-\kappa}^K)^\top / \sqrt{d_z},   (1)

where \gamma_{k-\kappa}^{*} is a parameter vector corresponding to the relative distance between the k-th and κ-th sentences, providing inter-sentential clues, and W^Q, W^K, and W^V are linear projection matrices for the queries, keys and values, respectively.
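The following is a minimal, single-head sketch of segment-level relative attention as we read Eq. (1): per-sentence-distance bias vectors are added to the keys and values. The tensor layout, the clipping of the relative distance, and all names are our assumptions rather than the authors' implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentRelativeAttention(nn.Module):
    """Single-head sketch of Eq. (1): relative-distance vectors gamma_{k-kappa}
    are added to keys and values, with the distance measured between sentences.
    `max_dist` (distance clipping) is an assumption, not from the paper."""
    def __init__(self, d_model: int, max_dist: int = 20):
        super().__init__()
        self.wq = nn.Linear(d_model, d_model, bias=False)
        self.wk = nn.Linear(d_model, d_model, bias=False)
        self.wv = nn.Linear(d_model, d_model, bias=False)
        self.gamma_k = nn.Embedding(2 * max_dist + 1, d_model)
        self.gamma_v = nn.Embedding(2 * max_dist + 1, d_model)
        self.max_dist = max_dist
        self.scale = math.sqrt(d_model)

    def forward(self, h_local: torch.Tensor, sent_id: torch.Tensor) -> torch.Tensor:
        # h_local: [T, d] local-context token states of the whole document
        # sent_id: [T] index of the sentence each token belongs to
        q, k, v = self.wq(h_local), self.wk(h_local), self.wv(h_local)
        rel = sent_id[:, None] - sent_id[None, :]               # [T, T], k - kappa
        rel = rel.clamp(-self.max_dist, self.max_dist) + self.max_dist
        gk, gv = self.gamma_k(rel), self.gamma_v(rel)           # [T, T, d]
        logits = (q[:, None, :] * (k[None, :, :] + gk)).sum(-1) / self.scale
        attn = F.softmax(logits, dim=-1)                        # weights over all tokens
        return (attn[:, :, None] * (v[None, :, :] + gv)).sum(dim=1)

# toy usage: a 3-sentence document of 10 tokens, d_model = 8
h = torch.randn(10, 8)
sid = torch.tensor([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
print(SegmentRelativeAttention(8)(h, sid).shape)  # torch.Size([10, 8])
```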

Gated Context Fusion. After the global context is retrieved, we adopt a gating mechanism to obtain the final encoder representations h by fusing local and global context:

g = \sigma(W_g [h^L; h^G]),
h = LayerNorm((1 - g) \odot h^L + g \odot h^G),

where W_g is a learnable linear transformation, [·; ·] denotes the concatenation operation, \sigma(·) is the sigmoid activation, which keeps the value of the fusion gate between 0 and 1, and \odot indicates element-wise multiplication.
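A compact sketch of the fusion gate as we understand it (module and parameter names are ours; the paper does not spell out implementation details):

```python
import torch
import torch.nn as nn

class GatedContextFusion(nn.Module):
    """Fuse local (h^L) and global (h^G) states with a sigmoid gate."""
    def __init__(self, d_model: int):
        super().__init__()
        self.w_g = nn.Linear(2 * d_model, d_model)   # W_g over [h^L; h^G]
        self.norm = nn.LayerNorm(d_model)

    def forward(self, h_local: torch.Tensor, h_global: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.w_g(torch.cat([h_local, h_global], dim=-1)))
        return self.norm((1.0 - gate) * h_local + gate * h_global)

# toy usage
fusion = GatedContextFusion(8)
print(fusion(torch.randn(10, 8), torch.randn(10, 8)).shape)  # torch.Size([10, 8])
```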
4.2 Decoder

The goal of the decoder is to generate the target document sentence by sentence, considering the previously generated sentences as target global context. A natural idea is to store the hidden states of previous target translations and allow the self-attention of the decoder to access these hidden states as extended history context.

To that purpose, we leverage and extend Transformer-XL [Dai et al., 2019] as the decoder. Transformer-XL is a Transformer variant designed to cache and reuse the hidden states computed for the previous segment as extended context, so that long-term dependency information occurring many words back can propagate through the recurrence connections between segments, which meets our requirement of generating document-long text. We cast each sentence as a "segment" in the translation task and equip the Transformer-XL based decoder with cross-attention to retrieve time-dependent source context for the current sentence. Formally, given two consecutive sentences y_k and y_{k-1}, the l-th layer of our decoder first employs self-attention over the extended history context:

\tilde{s}_k^{l-1} = [SG(s_{k-1}^{l-1}); s_k^{l-1}],
\bar{s}_k^{l} = MultiHead(Rel-SelfAttn(s_k^{l-1}, \tilde{s}_k^{l-1}, \tilde{s}_k^{l-1})),
\bar{s}_k^{l} = LayerNorm(\bar{s}_k^{l} + s_k^{l-1}),

where the function SG(·) stands for stop-gradient, and Rel-SelfAttn(Q, K, V) is a variant of self-attention with word-level relative position encoding. For more specific details, please refer to [Dai et al., 2019]. After that, the cross-attention module fetches the source context for the current sentence from the encoder representations h_k:

\hat{s}_k^{l} = MultiHead(CrossAttn(\bar{s}_k^{l}, h_k, h_k)),
s_k^{l} = LayerNorm(FeedForward(\hat{s}_k^{l}) + \hat{s}_k^{l}).

Given the final representations s_k^N of the last decoder layer, the generation probabilities of the current target sentence y_k are computed as:

p(y_k \mid y_{<k}, X) = \prod_{t} softmax(E[y_{k,t}]^\top s_{k,t}^N).
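To illustrate the decoder-side recurrence, here is a schematic sketch of prepending a gradient-detached cache of the previous sentence's states to the keys and values of self-attention, in the spirit of Transformer-XL. We use standard multi-head attention rather than the relative-position Rel-SelfAttn, omit the causal mask over the current sentence, and all names are ours.

```python
import torch
import torch.nn as nn

class CachedSelfAttention(nn.Module):
    """Self-attention over [SG(previous-sentence states); current states].

    Mirrors the s~ = [SG(s_{k-1}); s_k] step; the word-level relative position
    encoding and the causal mask are left out to keep the sketch short."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, s_curr: torch.Tensor, s_prev: torch.Tensor) -> torch.Tensor:
        # s_curr: [B, T_k, d] states of sentence y_k; s_prev: [B, T_{k-1}, d]
        memory = torch.cat([s_prev.detach(), s_curr], dim=1)   # detach == SG(.)
        out, _ = self.attn(query=s_curr, key=memory, value=memory)
        return self.norm(out + s_curr)

# toy usage
layer = CachedSelfAttention(8)
print(layer(torch.randn(2, 5, 8), torch.randn(2, 7, 8)).shape)  # torch.Size([2, 5, 8])
```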


5 Experiment

We experiment on four widely used document-level parallel datasets in two language pairs for machine translation:

• TED (ZH-EN/EN-DE). The Chinese-English and English-German TED datasets are from the IWSLT 2015 and 2017 evaluation campaigns, respectively. We mainly explore and develop our approach on TED ZH-EN, where we take dev2010 as the development set and tst2010-2013 as the testset. For TED EN-DE, we use tst2016-2017 as our testset and the rest as the development set.

• News (EN-DE). We take News Commentary v11 as our training set. The WMT newstest2015 and newstest2016 are used as development and testsets, respectively.

• Europarl (EN-DE). The corpus is extracted from Europarl v7 according to the method described in Maruf et al. [2019].¹

We applied byte pair encoding [Sennrich et al., 2016, BPE] to segment all sentences with 32K merge operations. We split each document into chunks of 20 sentences to alleviate memory consumption when training our proposed models. We used the Transformer architecture as our sentence-level, context-agnostic baseline and developed our proposed model on top of it. For models on TED ZH-EN, we used a configuration smaller than transformer base [Vaswani et al., 2017], with model dimension d_z = 256, feed-forward dimension d_ffn = 512 and number of layers N = 4. For models on the remaining datasets, we change the dimensions to 512/2048. We used the Adam optimizer [Kingma and Ba, 2014] and the same learning rate schedule strategy as Vaswani et al. [2017] with 8,000 warmup steps. The training batch consisted of approximately 2048 source tokens and 2048 target tokens. Label smoothing [Szegedy et al., 2016] with value 0.1 was used for training. For inference, we used beam search with a width of 5 and a length penalty of 0.6. The evaluation metric is BLEU [Papineni et al., 2002]. We did not apply checkpoint averaging [Vaswani et al., 2017] to the parameters for evaluation.

¹ The last two corpora are from Maruf et al. [2019].
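For reference, the hyperparameters reported above can be summarized as a small configuration sketch; the dictionary keys and the inverse-square-root schedule helper below use our own naming, following the schedule of Vaswani et al. [2017], and are not a released configuration file.

```python
# Hyperparameters as reported in Section 5 (TED ZH-EN setting); key names are ours.
TED_ZH_EN_CONFIG = {
    "d_model": 256,          # 512 for TED EN-DE / News / Europarl
    "d_ffn": 512,            # 2048 for the larger setting
    "num_layers": 4,
    "bpe_merges": 32_000,
    "doc_chunk_sentences": 20,
    "optimizer": "adam",
    "warmup_steps": 8_000,
    "batch_tokens": 2_048,   # per side (source and target)
    "label_smoothing": 0.1,
    "beam_size": 5,
    "length_penalty": 0.6,
}

def transformer_lr(step: int, d_model: int = 256, warmup: int = 8_000) -> float:
    """Inverse-square-root learning-rate schedule of Vaswani et al. [2017]."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

print(transformer_lr(8_000))  # peak learning rate at the end of warmup
```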

5.1 Main Results

Document-level Translation. We list the results of our experiments in Table 1, comparing four context-aware NMT models: the Document-aware Transformer [Zhang et al., 2018, DocT], Hierarchical Attention NMT [Miculicich et al., 2018, HAN], Selective Attention NMT [Maruf et al., 2019, SAN] and the Query-guided Capsule Network [Yang et al., 2019, QCN]. As shown in Table 1, by leveraging document context, our proposed model obtains gains of 2.1, 2.0, 2.5, and 1.0 BLEU over the sentence-level Transformer baselines on the TED ZH-EN, TED EN-DE, News and Europarl datasets, respectively. Our model achieves new state-of-the-art results on TED ZH-EN and Europarl, showing the superiority of exploiting the whole document context. Though our model is not the best on the TED EN-DE and News tasks, it is still comparable with QCN and HAN, and it achieves the best average performance on the English-German benchmarks, by at least 0.47 BLEU over the best previous model.

Model                              ∆|θ|    v_train   v_test   TED (ZH-EN)   TED (EN-DE)   News (EN-DE)   Europarl (EN-DE)   avg. (EN-DE)
SENTNMT [Vaswani et al., 2017]     0.0m    1.0×      1.0×     17.0          23.10         22.40          29.40              24.96
DocT [Zhang et al., 2018]          9.5m    0.65×     0.98×    n/a           24.00         23.08          29.32              25.46
HAN [Miculicich et al., 2018]      4.8m    0.32×     0.89×    17.9          24.58         25.03          28.60              26.07
SAN [Maruf et al., 2019]           4.2m    0.51×     0.86×    n/a           24.42         24.84          29.75              26.33
QCN [Yang et al., 2019]            n/a     n/a       n/a      n/a           25.19         22.37          29.82              25.79
OURS                               4.7m    0.22×     1.08×    19.1          25.10         24.91          30.40              26.80

Table 1: Experimental results of our model in comparison with several baselines, including the increase in the number of parameters over the Transformer baseline (∆|θ|), training/testing speeds (v_train/v_test, some of which are derived from Maruf et al. [2019]), and translation results on the testsets in BLEU.

The stronger QCN result on TED EN-DE may benefit from the regularizations introduced in Yang et al. [2019]. In addition, while sacrificing training speed, the parameter increment and decoding speed of our model remain manageable.

Model                                   Test
SENTNMT                                 17.0
DOCNMT (documents as input/output)      14.2
HAN [Miculicich et al., 2018]           15.6
OURS                                    17.8

Table 2: Results of sentence-level translation on TED ZH-EN.

Sentence-level Translation. We compare performance on single-sentence translation in Table 2, which demonstrates the good compatibility of our proposed model with both document and sentence translation, whereas the performance of the other approaches lags far behind the sentence-level baseline. The reason is that the previous approaches require document context as a separate input stream, while our proposed model does not. This difference ensures feasibility for both document- and sentence-level translation in this unified framework. Therefore, our proposed model can be directly used in general translation tasks with input text of any number of sentences, which is more deployment-friendly.

Model                                         BLEU (BLEU_doc)
SENTNMT [Vaswani et al., 2017]                11.4 (21.0)
DOCNMT (documents as input/output)            n/a (17.0)
Modeling source context
  Doc2Sent                                    6.8
  + reset word positions for each sentence    10.0
  + segment embedding                         10.5
  + segment-level relative attention          12.2
  + context fusion gate                       12.4
Modeling target context
  Transformer-XL decoder [Sent2Doc]           12.4
Final model [OURS]                            12.9 (24.4)

Table 3: Ablation study on modeling context on the TED ZH-EN development set. "Doc" means using an entire document as a single sequence for input or output. BLEU_doc indicates the document-level BLEU score calculated on the concatenation of all output sentences.

5.2 Analysis and Discussion

Does Bilingual Context Really Matter? Yes. To investigate how important the bilingual context is and the corresponding contributions of each component, we summarize the ablation study in Table 3. First of all, using the entire document directly as input and output cannot even generate a document translation with the same number of sentences as the source document, and it is much worse than the sentence-level baseline and our model in terms of document-level BLEU. For source context modeling, merely casting the whole source document as one input sequence (Doc2Sent) does not work. Meanwhile, resetting word positions and introducing segment embeddings for each sentence alleviate this problem, which verifies one of our motivations that we should focus more on local sentences. Moreover, the gains from the segment-level relative attention and the gated context fusion mechanism demonstrate that retrieving and integrating source global context are useful for document translation. As for the target context, employing the Transformer-XL decoder to exploit the target historical global context also leads to better performance on document translation. This is somewhat in contrast to Zhang et al. [2018], who claim that using target context leads to error propagation. In the end, by jointly modeling both source and target contexts, our final model obtains the best performance.
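As a side note, the document-level BLEU reported in Table 3 (BLEU_doc, computed on the concatenation of all output sentences) can be obtained with an off-the-shelf scorer; the sketch below uses sacrebleu and assumes one hypothesis and reference string per sentence, which is our reading of the setup rather than the authors' exact evaluation script.

```python
import sacrebleu

def document_bleu(hyp_sents, ref_sents):
    """BLEU_doc: concatenate all output sentences into one long segment and
    score it against the concatenated references."""
    hyp_doc = " ".join(hyp_sents)
    ref_doc = " ".join(ref_sents)
    return sacrebleu.corpus_bleu([hyp_doc], [[ref_doc]]).score

# toy usage
print(document_bleu(["the cat sat .", "it slept ."],
                    ["the cat sat .", "it slept ."]))  # 100.0
```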
Effect of Quantity of Context: the More, the Better. We also experiment to show how the quantity of context affects our model in document translation. As shown in Figure 3, we find that providing only one adjacent sentence as context already helps performance on document translation, and that the more context is given, the better the translation quality is, although there does seem to be an upper limit of 20 sentences. Successfully incorporating context of this size is something related work has not achieved [Zhang et al., 2018; Miculicich et al., 2018; Yang et al., 2019]. We attribute this advantage to our hierarchical model design, which yields more gains than pains from the increasingly noisy global context, guided by the well-formed, uncorrupted local context.

Effect of Transfer Learning: Data Hunger Remains a Problem for Document-level Translation. Due to the limited amount of document-level parallel data, exploiting sentence-level parallel corpora or monolingual document-level corpora is drawing more attention. We investigate transfer learning (TL) approaches on TED ZH-EN. We pretrain our model on the WMT18 ZH-EN sentence-level parallel corpus with 7m sentence pairs, where every single sentence is regarded as a document. We then continue to finetune the pretrained model on TED ZH-EN document-level parallel data (source & target TL). We also compare to a variant in which only the encoder is initialized (source TL). As shown in Table 4, the transfer learning approach can help alleviate the need for document-level data in source and target languages to some extent. However, the scarcity of document-level parallel data still prevents document-level NMT from being extended at scale.
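The two transfer-learning variants can be realized by selectively loading pretrained weights. A rough sketch, under the assumption of a PyTorch model whose encoder parameters are prefixed with "encoder." (our naming, not the authors' code):

```python
import torch

def load_pretrained(model, ckpt_path: str, source_only: bool = False):
    """source TL: initialize only encoder.* weights; source & target TL: all weights."""
    state = torch.load(ckpt_path, map_location="cpu")
    if source_only:
        state = {k: v for k, v in state.items() if k.startswith("encoder.")}
    # strict=False leaves any parameters missing from `state` at their random init
    missing, unexpected = model.load_state_dict(state, strict=False)
    return missing, unexpected
```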


Figure 3: BLEU score w.r.t. #sent. of context on TED ZH-EN.

Model                               Dev     Test
Transformer [Vaswani et al., 2017]  11.4    17.0
BERT+MLM [Li et al., 2019]          n/a     20.7
OURS                                12.9    19.1
OURS + source TL                    13.9    19.7
OURS + source & target TL           14.9    21.3

Table 4: Effect of transfer learning (TL).

Figure 4: Visualization of sentence-to-sentence attention based on segment-level relative attention. Each row represents a sentence while each column represents another sentence to be attended. The weights of each row sum to 1.

Model                   deixis   lex.c.   ell.infl.   ell.VP
SENTNMT                 50.0     45.9     52.2        24.2
OURS                    61.3     46.1     61.0        35.6
Voita et al. [2019b]∗   81.6     58.1     72.2        80.0

Table 5: Accuracy (%) of discourse phenomena. ∗Different data and system conditions; only for reference.

What Does the Model Learn about Context? A Case Study. Furthermore, we are interested in what the proposed model learns about context. In Figure 4, we visualize the sentence-to-sentence attention weights of a source document based on the segment-level relative attention. Formally, the weight of the k-th sentence attending to the κ-th sentence is computed by \alpha_k^{\kappa} = \frac{1}{|x_k|} \sum_i \sum_j \alpha_{k,i}^{\kappa,j}, where \alpha_{k,i}^{\kappa,j} is defined by Eq. (1). As shown in Figure 4, we find very interesting patterns (which are also prevalent in other cases): 1) the first two sentences (blue frame), which contain the main topic and idea of a document, seem to be a very useful context for all sentences; 2) the previous and subsequent adjacent sentences (red and purple diagonals, respectively) draw dense attention, which indicates the importance of the surrounding context; 3) although surrounding contexts are crucial, the subsequent sentence significantly outweighs the previous one. This may imply that the lack of target future information, but the availability of past information, in the decoder forces the encoder to retrieve more knowledge about the next sentence than about the previous one; 4) the model seems not to care that much about the current sentence, probably because the local context can flow through the context fusion gate, so the segment-level relative attention just focuses on fetching useful global context; 5) the 6-th sentence also attracts attention from all the others (brown frame), and may play a special role in the inspected document.
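The aggregation used for this visualization is straightforward to reproduce from the token-level attention weights; a small sketch (tensor layout assumed, not taken from the authors' code):

```python
import torch

def sentence_to_sentence_attention(attn: torch.Tensor, sent_id: torch.Tensor,
                                   n_sents: int) -> torch.Tensor:
    """alpha[k, kappa] = (1 / |x_k|) * sum over query tokens i of sentence k and
    key tokens j of sentence kappa of the token-level weights alpha_{k,i}^{kappa,j}.

    attn:    [T, T] token-to-token weights (each row sums to 1), e.g. from the
             segment-level relative attention sketch above.
    sent_id: [T] sentence index of every token.
    """
    alpha = torch.zeros(n_sents, n_sents)
    for k in range(n_sents):
        rows = attn[sent_id == k]                       # queries of sentence k
        for kappa in range(n_sents):
            alpha[k, kappa] = rows[:, sent_id == kappa].sum() / rows.shape[0]
    return alpha                                        # each row sums to 1

# toy usage: uniform attention over 6 tokens in 3 sentences
sid = torch.tensor([0, 0, 1, 1, 2, 2])
print(sentence_to_sentence_attention(torch.full((6, 6), 1 / 6), sid, 3))
```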
Analysis on Discourse Phenomena. We also want to examine whether the proposed model actually learns to utilize document context to resolve discourse inconsistencies that context-agnostic models cannot handle. We use the contrastive test sets for the evaluation of discourse phenomena in English-Russian by Voita et al. [2019b]. There are four test sets in the suite, regarding deixis, lexical cohesion, ellipsis (inflection), and ellipsis (verb phrase). Each testset contains groups of contrastive examples consisting of a positive translation with the correct discourse phenomenon and negative translations with incorrect phenomena. The goal is to determine whether a model is more likely to generate the correct translation than the incorrect variants. We summarize the results in Table 5. Our model is better at resolving discourse consistencies than the context-agnostic baseline. Voita et al. [2019b] use a context-agnostic baseline, trained on 4× larger data, to generate first-pass drafts, and then perform post-processing; this is not directly comparable, but could easily be combined with our model to achieve better results.

6 Conclusion

In this paper, we propose a unified local and global NMT framework, which can successfully exploit context regardless of how many sentences are in the input. Extensive experimentation and analysis show that our model has indeed learned to leverage a larger context. In future work we will investigate the feasibility of extending our approach to other document-level NLP tasks, e.g., summarization.

Acknowledgements

Shujian Huang is the corresponding author. This work was supported by the National Science Foundation of China (No. U1836221, 61772261, 61672277). Zaixiang Zheng was also supported by the China Scholarship Council (No. 201906190162). Alexandra Birch was supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 825299 (GoURMET) and also by the UK EPSRC fellowship grant EP/S001271/1 (MTStretch).


References

[Bahdanau et al., 2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
[Bawden et al., 2018] Rachel Bawden, Rico Sennrich, Alexandra Birch, and Barry Haddow. Evaluating discourse phenomena in neural machine translation. In NAACL-HLT, 2018.
[Dai et al., 2019] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan R. Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In ACL, 2019.
[Hassan et al., 2018] Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567, 2018.
[Jean et al., 2017] Sébastien Jean, Stanislas Lauly, Orhan Firat, and Kyunghyun Cho. Does neural machine translation benefit from larger context? CoRR, abs/1704.05135, 2017.
[Junczys-Dowmunt, 2019] Marcin Junczys-Dowmunt. Microsoft Translator at WMT 2019: Towards large-scale document-level neural machine translation. In WMT, 2019.
[Kingma and Ba, 2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2014.
[Kuang et al., 2018] Shaohui Kuang, Deyi Xiong, Weihua Luo, and Guodong Zhou. Modeling coherence for neural machine translation with dynamic and topic caches. In COLING, 2018.
[Läubli et al., 2018] Samuel Läubli, Rico Sennrich, and Martin Volk. Has machine translation achieved human parity? A case for document-level evaluation. In EMNLP, 2018.
[Li et al., 2019] Liangyou Li, Xin Jiang, and Qun Liu. Pretrained language models for document-level neural machine translation. arXiv preprint, 2019.
[Maruf and Haffari, 2018] Sameen Maruf and Gholamreza Haffari. Document context neural machine translation with memory networks. In ACL, 2018.
[Maruf et al., 2019] Sameen Maruf, André F. T. Martins, and Gholamreza Haffari. Selective attention for context-aware neural machine translation. In NAACL-HLT, 2019.
[Miculicich et al., 2018] Lesly Miculicich, Dhananjay Ram, Nikolaos Pappas, and James Henderson. Document-level neural machine translation with hierarchical attention networks. In EMNLP, 2018.
[Müller et al., 2018] Mathias Müller, Annette Rios, Elena Voita, and Rico Sennrich. A large-scale test set for the evaluation of context-aware pronoun translation in neural machine translation. In WMT, 2018.
[Papineni et al., 2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, 2002.
[Sennrich et al., 2016] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL, 2016.
[Shaw et al., 2018] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In NAACL-HLT, 2018.
[Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
[Szegedy et al., 2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In CVPR, 2016.
[Tan et al., 2019] Xin Tan, Longyin Zhang, Deyi Xiong, and Guodong Zhou. Hierarchical modeling of global context for document-level neural machine translation. In EMNLP-IJCNLP, 2019.
[Tiedemann and Scherrer, 2017] Jörg Tiedemann and Yves Scherrer. Neural machine translation with extended context. In DiscoMT, 2017.
[Tu et al., 2018] Zhaopeng Tu, Yang Liu, Shuming Shi, and Tong Zhang. Learning to remember translation history with a continuous cache. TACL, 2018.
[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
[Voita et al., 2018] Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. Context-aware neural machine translation learns anaphora resolution. In ACL, 2018.
[Voita et al., 2019a] Elena Voita, Rico Sennrich, and Ivan Titov. Context-aware monolingual repair for neural machine translation. In EMNLP-IJCNLP, 2019.
[Voita et al., 2019b] Elena Voita, Rico Sennrich, and Ivan Titov. When a good translation is wrong in context: Context-aware machine translation improves on deixis, ellipsis, and lexical cohesion. In ACL, 2019.
[Wang et al., 2017] Longyue Wang, Zhaopeng Tu, Andy Way, and Qun Liu. Exploiting cross-sentence context for neural machine translation. In EMNLP, 2017.
[Xiong et al., 2019] Hao Xiong, Zhongjun He, Hua Wu, and Haifeng Wang. Modeling coherence for discourse neural machine translation. In AAAI, 2019.
[Yang et al., 2019] Zhengxin Yang, Jinchao Zhang, Fandong Meng, Shuhao Gu, Yang Feng, and Jie Zhou. Enhancing context modeling with a query-guided capsule network for document-level translation. In EMNLP-IJCNLP, 2019.
[Zhang et al., 2018] Jiacheng Zhang, Huanbo Luan, Maosong Sun, Feifei Zhai, Jingfang Xu, Min Zhang, and Yang Liu. Improving the Transformer translation model with document-level context. In EMNLP, 2018.
