Learning to Remember Translation History with a Continuous Cache

Zhaopeng Tu, Tencent AI Lab ([email protected])
Yang Liu, Tsinghua University ([email protected])

Shuming Shi, Tencent AI Lab ([email protected])
Tong Zhang, Tencent AI Lab ([email protected])

Abstract

Existing neural machine translation (NMT) models generally translate sentences in isolation, missing the opportunity to take advantage of document-level information. In this work, we propose to augment NMT models with a very light-weight cache-like memory network, which stores recent hidden representations as translation history. The probability distribution over generated words is updated online depending on the translation history retrieved from the memory, endowing NMT models with the capability to dynamically adapt over time. Experiments on multiple domains with different topics and styles show the effectiveness of the proposed approach with negligible impact on the computational cost.

1 Introduction

Neural machine translation (NMT) has advanced the state of the art in recent years (Kalchbrenner et al., 2014; Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015). However, existing models generally treat documents as a list of independent sentence pairs and ignore cross-sentence information, which leads to translation inconsistency and ambiguity arising from a single source sentence.

There have been a few recent attempts to model cross-sentence context for NMT: Wang et al. (2017a) use a hierarchical RNN to summarize the previous K source sentences, while Jean et al. (2017) use an additional set of an encoder and attention model to dynamically select part of the previous source sentence. While these approaches have proven their ability to represent cross-sentence context, they generate the context from discrete lexicons and would thus suffer from errors propagated from generated translations. Accordingly, they only take into account source sentences but fail to make use of target-side information.1 Another potential limitation is that they are computationally expensive, which limits the scale of cross-sentence context.

In this work, we propose a very light-weight alternative that can both cover large-scale cross-sentence context and exploit bilingual translation history. Our work is inspired by recent successes of memory-augmented neural networks on multiple NLP tasks (Weston et al., 2015; Sukhbaatar et al., 2015; Miller et al., 2016; Gu et al., 2017), especially the efficient cache-like memory networks for language modeling (Grave et al., 2017; Daniluk et al., 2017). Specifically, the proposed approach augments NMT models with a continuous cache (CACHE), which stores recent hidden representations as history context. By minimizing the computation burden of the cache-like memory, we are able to use a larger memory and scale to longer translation history. Since we leverage internal representations instead of output words, our approach is more robust to the error propagation problem, and thus can incorporate useful target-side context.

Experimental results show that the proposed approach significantly and consistently improves translation performance over a strong NMT baseline on multiple domains with different topics and styles. We found that the introduced cache is able to remember translation patterns at different levels of matching and granularity, ranging from exactly matched lexical patterns to fuzzily matched patterns, and from word-level patterns to phrase-level patterns.

1 Wang et al. (2017a) indicate that "considering target-side history inversely harms translation performance, since it suffers from serious error propagation problems."
Src: ... 开始 都 觉得 ... 大家 觉得 这 也是 一次 机遇 , 一次 挑战 。 ...
Ref: initially they all felt that . . . everyone felt that this was also an opportunity and a challenge .
NMT: ... felt that . . . we feel that it is also a challenge and a challenge .

(a) The translation of "机遇" ("opportunity") suffers from the ambiguity problem, while the translation of "觉得" ("feel") suffers from the tense inconsistency problem. The former problem is not caused by the attention model, as shown by the attention matrix in (b).

(b) Attention matrix.

Figure 1: An example translation.

2 Neural Machine Translation

Suppose that x = x_1, ..., x_j, ..., x_J represents a source sentence and y = y_1, ..., y_t, ..., y_T a target sentence. NMT directly models the probability of translation from the source sentence to the target sentence word by word:

    P(y|x) = \prod_{t=1}^{T} P(y_t | y_{<t}, x)    (1)

The probability of generating the t-th target word y_t is computed by

    P(y_t | y_{<t}, x) = softmax(g(y_{t-1}, s_t, c_t))    (2)

where g(·) is a non-linear function that outputs the probability of y_t, and s_t is the decoder hidden state at step t:

    s_t = f(y_{t-1}, s_{t-1}, c_t)    (3)

where f(·) is an activation function, which is implemented as a GRU (Cho et al., 2014) in this work. c_t is a dynamic vector that selectively summarizes certain parts of the source sentence at each decoding step:

    c_t = \sum_{j=1}^{J} \alpha_{t,j} h_j    (4)

where \alpha_{t,j} is the alignment probability calculated by an attention model (Bahdanau et al., 2015; Luong et al., 2015a), and h_j is the encoder hidden state of the j-th source word x_j.
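For concreteness, one decoding step of Equations 2-4 can be sketched as follows. This is our own illustrative NumPy sketch, not code from the paper; `align`, `f`, and `g` are placeholders for the attention, recurrent, and output layers.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def decode_step(y_prev, s_prev, enc_states, align, f, g):
    """One decoding step following Eqs. 2-4 (illustrative only).

    y_prev:     embedding of the previously generated word y_{t-1}
    s_prev:     previous decoder state s_{t-1}
    enc_states: J x d matrix of encoder hidden states h_1 ... h_J
    align:      attention model returning unnormalized alignment scores
    f:          recurrent update of Eq. 3 (a GRU in the paper)
    g:          output layer of Eq. 2, returning vocabulary logits
    """
    alpha = softmax(align(s_prev, enc_states))   # alignment probabilities alpha_{t,j}
    c_t = alpha @ enc_states                     # Eq. 4: attention context vector
    s_t = f(y_prev, s_prev, c_t)                 # Eq. 3: decoder state update
    p_t = softmax(g(y_prev, s_t, c_t))           # Eq. 2: distribution over target words
    return p_t, s_t, c_t
```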

Since the continuous representation of a symbol (e.g., h_j and s_t) encodes multiple meanings of a word, NMT models need to spend a substantial amount of their capacity on disambiguating source and target words based on the context defined by a source sentence (Choi et al., 2016). Consistency is another critical issue in document-level translation, where a repeated term should keep the same translation throughout the whole document (Xiao et al., 2011). Nevertheless, current NMT models still process a document by translating each sentence alone, suffering from inconsistency and ambiguity arising from a single source sentence, as shown in Figure 1. These problems can be alleviated by the proposed approach via modeling translation history, as described below.

3 Approach

3.1 Architecture

The proposed approach augments neural machine translation models with a cache-like memory, which has proven useful for capturing longer history for the language modeling task (Grave et al., 2017; Daniluk et al., 2017). The cache-like memory is essentially a key-value memory (Miller et al., 2016), which is an array of slots in the form of (key, value) pairs. The matching stage is based on the key records, while the reading stage uses the value records. From here on, we use cache to denote the cache-like memory.



Figure 2: Architectures of (a) standard NMT and (b) NMT augmented with an external cache to exploit translation history. At each decoding step, the current attention context ct that represents source-side content serves as a query to retrieve the cache (key matching) and an output vector mt that represents target-side information in the past translations is returned (value reading), which is combined with the current decoder state st (representation combining) to subsequently produce the target word yt.

Since modern NMT models generate translations in a word-by-word manner, translation information is generally stored at the word level, including source-side context that embeds the content being translated and target-side context that corresponds to the generated word. With the goal of remembering translation history in mind, the key should be designed with features that help match it to the source-side context, while the value should be designed with features that help match it to the target-side context. To this end, we define the cache slots as pairs of vectors {(c_1, s_1), ..., (c_i, s_i), ..., (c_I, s_I)}, where c_i and s_i are the attention context vector and its corresponding decoder state at time step i from the previous translations. The two types of representation vectors correspond well to the source- and target-side contexts (Tu et al., 2017a).

Figure 2(b) illustrates the model architecture. At each decoding step t, the current attention context c_t serves as a query, which is used to match and read from the cache looking for relevant information to generate the target word. The retrieved vector m_t, which embeds target-side contexts of generating similar words in the translation history, is combined with the current decoder state s_t to subsequently produce the target word y_t (Section 3.2). When the full translation is generated, the decoding contexts are stored in the cache as history for future translations (Section 3.3).

3.2 Reading from Cache

Cache reading involves the following three steps:

Key Matching: The goal of key matching is to retrieve similar records in the cache. To this end, we exploit the attention context representations c_t to define a probability distribution over the records in the cache. Using context representations as keys in the cache, the cache lookup operator can be implemented with simple dot products between the stored representations and the current one:

    P_m(c_i | c_t) = \frac{\exp(c_t^\top c_i)}{\sum_{i'=1}^{I} \exp(c_t^\top c_{i'})}    (5)

where c_t is the attention context representation at the current step t, c_i is the stored representation at the i-th slot of the cache, and I is the number of slots in the cache. In contrast to existing memory-augmented neural networks, the proposed cache avoids the need to learn memory matching parameters, such as those related to parametric attention models (Sukhbaatar et al., 2015; Daniluk et al., 2017), transformations between the query and keys (Miller et al., 2016; Gu et al., 2017), or human-defined scalars to control the flatness of the distribution (Grave et al., 2017).2

Value Reading: The values of the cache are read by taking a sum over the stored values s_i, weighted by the matching probabilities from the keys, and the retrieved vector m_t is returned:

    m_t = \sum_{(c_i, s_i) \in cache} P_m(c_i | c_t) s_i

From the view of memory-augmented neural networks, the matching probability P_m(c_i | c_t) can be interpreted as the probability to retrieve similar target-side information m_t from the cache given the source-side context c_t, where the desired answer is the contexts related to similar target words generated in past translations.

Representation Combining: The final decoder state that is used to generate the next-word distribution is computed from a linear combination of the original decoder state s_t and the output vector m_t retrieved from the cache:3

    \tilde{s}_t = (1 - \lambda_t) \otimes s_t + \lambda_t \otimes m_t    (6)
    P(y_t | y_{<t}, x) = softmax(g(y_{t-1}, \tilde{s}_t, c_t))    (7)

where \otimes denotes element-wise multiplication and \lambda_t is a dynamic weight vector computed at each decoding step:

    \lambda_t = \sigma(U s_t + V c_t + W m_t)    (8)

Here \sigma(·) is a logistic sigmoid function, and {U \in R^{d \times d}, V \in R^{d \times l}, W \in R^{d \times d}} are the newly introduced parameter matrices, with d and l being the number of units of the decoder state and the attention context vector, respectively. Note that \lambda_t has the same dimensionality as s_t and m_t, and thus each element in the two vectors has a distinct interpolation weight. In this way, we offer more precise control over combining the representations, since different elements retain different information.
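A minimal NumPy sketch of these three steps (our own illustration, not the authors' code; the matrix layout and all names are assumptions) keeps the cache as an array of stored keys c_i and an array of stored values s_i:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def read_cache(c_t, s_t, cache_keys, cache_values, U, V, W):
    """Cache read following Eqs. 5-8.

    c_t:          (l,)   current attention context (the query)
    s_t:          (d,)   current decoder state
    cache_keys:   (I, l) stored attention contexts c_i
    cache_values: (I, d) stored decoder states s_i
    U, V, W:      (d, d), (d, l), (d, d) gate parameters of Eq. 8
    """
    # Eq. 5: key matching with simple dot products, no learned matching parameters
    p_match = softmax(cache_keys @ c_t)              # (I,)
    # Value reading: weighted sum of the stored decoder states
    m_t = p_match @ cache_values                     # (d,)
    # Eq. 8: element-wise interpolation gate
    lam = sigmoid(U @ s_t + V @ c_t + W @ m_t)       # (d,)
    # Eq. 6: combine the original state with the retrieved vector
    s_tilde = (1.0 - lam) * s_t + lam * m_t
    return s_tilde, m_t, p_match
```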

The addition of the continuous cache to an NMT model inherits the advantages of cache-like memories: the probability distribution over generated words is updated online depending on the translation history, and consistent translations can be generated when they have been seen in the history. The neural cache also inherits the ability of the decoder hidden states to model longer-term cross-sentence contexts beyond intra-sentence context, and thus allows for a finer modeling of the document-level context.

3.3 Writing to Cache

The cache component is an external key-value memory structure which stores I elements of recent histories, where the key at position i ∈ [1, I] is k_i and its value is v_i. For each key-value pair, we also store the corresponding target word y_i generated at that step.
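As a rough sketch only, one plausible write policy consistent with the fixed cache size and the same-target-word overwrite mechanism evaluated in Table 3 could look as follows; the class and method names, and the recency-based eviction, are our own assumptions rather than the paper's specification:

```python
from collections import OrderedDict

class ContinuousCacheWriter:
    """Illustrative cache writer: keys are attention contexts c_i, values are
    decoder states s_i, and each slot remembers the target word it produced."""

    def __init__(self, size):
        self.size = size
        self.slots = OrderedDict()   # target word -> (key c_i, value s_i)

    def write_sentence(self, contexts, states, words):
        """Store the decoding contexts of a finished translation (Section 3.3)."""
        for c_i, s_i, w in zip(contexts, states, words):
            if w in self.slots:
                del self.slots[w]               # overwrite the slot of the same word
            elif len(self.slots) >= self.size:
                self.slots.popitem(last=False)  # evict the oldest slot
            self.slots[w] = (c_i, s_i)

    def keys_values(self):
        """Return the stored keys and values for the read step."""
        if not self.slots:
            return [], []
        keys, values = zip(*self.slots.values())
        return list(keys), list(values)
```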

Table 1: Statistics of sentences (|S|) and words (|W |). K stands for thousands and M for millions.

• News: The corpora are formal articles with syntactic structures such as complicated conjuncted phrases, which make textual translation very difficult. We choose the NIST 2002 (MT02) dataset as the tuning set, and the NIST 2003-2008 (MT03-08) datasets as test sets.

• Subtitle: The subtitles are extracted from TV episodes, which are usually simple and short (Wang et al., 2018). Most of the translations of subtitles do not preserve the syntactic structures of their original sentences at all. We randomly select two episodes as the tuning set, and another two episodes as the test set.6

• TED: The corpora are from the MT track on TED Talks of IWSLT 2015 (Cettolo et al., 2012).7 Koehn and Knowles (2017) point out that NMT systems have a steeper learning curve with respect to the amount of training data, resulting in worse quality in low-resource settings. The TED talks are difficult to translate due to their variety of topics and small-scale training data. We choose the "dev2010" dataset as the tuning set, and the combination of the "tst2010-2013" datasets as the test set.

6 The corpora are available at https://github.com/longyuewangdcu/tvsub.
7 https://wit3.fbk.eu/mt.php?release=2015-01

The statistics of the corpora are listed in Table 1. As seen, the average lengths of the source sentences in the News, Subtitle, and TED domains are 22.3, 5.6, and 19.5 words, respectively. We use the case-insensitive 4-gram NIST BLEU score (Papineni et al., 2002) as the evaluation metric, and the sign-test (Collins et al., 2005) for statistical significance testing.

Models: The baseline is a re-implemented attention-based NMT system, RNNSEARCH, which incorporates dropout (Hinton et al., 2012) on the output layer and improves the attention model by feeding the lastly generated word. For training RNNSEARCH, we limited the source and target vocabularies to the most frequent 30K words in Chinese and English, and employed an unknown-word replacement post-processing technique (Jean et al., 2015; Luong et al., 2015b). We trained each model with the sentences of length up to 80 words in the training data. We shuffled mini-batches as we proceeded, and the mini-batch size is 80. The word embedding dimension is 620 and the hidden layer dimension is 1000. We trained for 15 epochs using Adadelta (Zeiler, 2012), and selected the model that yields the best performance on the validation set.

For our model, we used the same settings as RNNSEARCH where applicable. The parameters of our model that are related to the standard encoder and decoder were initialized by the baseline RNNSEARCH model and were fixed in the following step. We further trained the new parameters related to the cache for another 5 epochs. Again, the model that performs best on the tuning set was selected as the final model.

4.2 Effect of Cache Size

Inspired by the recent success of the continuous cache on language modeling (Grave et al., 2017), we thought it likely that a large cache would benefit from long-range context and thus outperform a small one. This turned out to be false. Table 2 lists the translation performances of different cache sizes on the tuning sets. As seen, small caches (e.g., size=25) generally achieve performances similar to larger caches (e.g., size=500). At the very start, we attributed this to the strength of the cache overwrite mechanism for slots that correspond to the same target word, which implicitly models long-range contexts by combining different context representations of the same target word in the translation history. As shown in Table 3, however, the overwrite mechanism contributes little to the good performance of the smaller caches.
Cache   News    Subtitle   TED    Ave
0       38.36   27.54      8.45   24.78
25      39.34   28.36      9.24   25.65
50      39.36   28.32      9.18   25.62
100     39.48   28.15      9.23   25.62
200     39.39   28.39      8.98   25.59
500     39.56   28.10      9.07   25.58
1000    39.37   27.90      8.89   25.39

Table 2: Translation performances of different cache sizes on the tuning sets.

Overwrite   News    Subtitle   TED    Ave
✗           39.42   28.26      9.12   25.60
✓           39.34   28.36      9.24   25.65

Table 3: Effect of the cache overwrite mechanism for slots that correspond to the same target word.

Figure 3: Average cache matching probability distribution on the tuning sets, where the leftmost positions represent the most recent history. Panels: (a) size=25, (b) size=50, (c) size=100, (d) size=500.

There are several more possible reasons. First, a larger cache is able to remember longer translation history (i.e., cache capacity) but poses difficulty in matching related records in the cache (i.e., matching accuracy). Second, the current caching mechanism fails to model long-range context well, which suggests better modeling of long-term dependencies as future work. Finally, neighbouring sentences are more correlated than long-distance sentences, and thus modeling short-range context properly already works well (Daniluk et al., 2017). In the following experiment, we try to validate the last hypothesis by visualizing which positions in the cache are attended most by the proposed model.

Cache Matching Probability Distribution: Following Daniluk et al. (2017), we plot in Figure 3 the average matching probability the proposed model pays to specific positions in the history. As seen, the proposed approach indeed pays more attention to the most recent history (e.g., the leftmost positions) in all domains. Specifically, the larger the cache, the more attention the model pays to the most recent history.

Notably, there are still considerable differences among domains. For example, the proposed model attends to records further in the past more often in the Subtitle and TED domains than in the News domain. This may be because a talk in the TED test set contains many more words than an article in the News test set (1.9K vs. 0.6K words). Though a scene in the Subtitle test set contains the fewest words (i.e., 0.3K words), repetitive words and phrases are observed in neighbouring scenes of the same episode, which are generally related to a specific topic. Given that larger caches do not lead to any performance improvement, it seems notoriously hard to judge whether long-range contexts are not modelled well, or whether they are simply less useful than short-range contexts. We leave this validation for future work.

For the following experiments, the cache size is set to 25 unless otherwise stated.

4.3 Main Results

                      News            Subtitle        TED             Ave
Model                 BLEU    ∆       BLEU    ∆       BLEU    ∆       BLEU    ∆
BASE                  35.39   –       32.92   –       11.69   –       26.70   –
(Wang et al., 2017a)  36.52*  +3.2%   33.34   +1.0%   12.43*  +6.3%   27.43   +2.7%
(Jean et al., 2017)   36.11*  +2.0%   33.00   0%      12.46*  +6.6%   27.19   +1.8%
OURS                  36.48*  +3.1%   34.30*  +3.9%   12.68*  +8.5%   27.82   +4.2%

Table 4: Translation qualities on multiple domains. "*" indicates a statistically significant difference (p < 0.01) from BASE, and "∆" denotes the relative improvement over BASE.

Table 4 shows the translation performances on multiple domains with different textual styles. As seen, the proposed approach significantly outperforms the baseline system (i.e., BASE) in all cases, demonstrating the effectiveness and universality of our model. We reimplemented the models in (Wang et al., 2017a) and (Jean et al., 2017) on top of the baseline system, which also exploit cross-sentence context in terms of source-side sentences. Both approaches achieve significant improvements in the News and TED domains, while achieving marginal or no improvement in the Subtitle domain. Compared with these two approaches, the proposed model consistently outperforms the baseline system in all domains, which confirms the robustness of our approach. We attribute the superior translation quality of our approach in the Subtitle domain to the exploitation of target-side information, since most of the translations of dialogues in this domain do not preserve the syntactic structure of their original sentences at all. They are completely paraphrased in the target language and seem very hard to improve with only source-side cross-sentence contexts.

                                Speed
Model                 # Para.   Train     Test
BASE                  84.2M     1469.1    21.1
(Wang et al., 2017a)  103.0M    300.2     20.8
(Jean et al., 2017)   104.2M    933.8     19.4
OURS                  88.2M     1163.9    21.1

Table 5: Model complexity. "Speed" is measured in words/second for both training and testing. We employ a beam search with a beam size of 10 for testing.

Table 5 shows the model complexity. The cache model only introduces 4M additional parameters (i.e., those related to Equation 8), which is small compared to both the number of parameters in the existing model (i.e., 84.2M) and those newly introduced by Wang et al. (2017a) (i.e., 18.8M) and Jean et al. (2017) (i.e., 20M). Our model is more efficient in training, which benefits from training the cache-related parameters only. To minimize the waste of computation, the other models sort 20 mini-batches by their lengths before parameter updating (Bahdanau et al., 2015), while our model cannot enjoy this benefit since it depends on the hidden states of preceding sentences.8 Concerning decoding with additional attention models, our approach does not slow down decoding, while Jean et al. (2017) decreases the decoding speed by 8.1%. We attribute this to the efficient strategy for cache key matching without any additional parameters.

8 To make a fair comparison (i.e., our model is required to train all the parameters and the other models cannot use mini-batch sorting), the training speeds for the models listed in Table 5 are 728.8, 159.8, 572.4, and 627.3 words/second, respectively.

4.4 Deep Fusion vs. Shallow Fusion

Some researchers would expect that storing words directly may be a better way to encourage lexical consistency, as done in (Grave et al., 2017). Following Gu et al. (2017), we call this shallow fusion at the word level, in contrast to deep fusion at the representation level (i.e., our approach). We follow Grave et al. (2017) to calculate the probability of generating y_t in shallow fusion as

    P(y_t) = (1 - \lambda_t) P_{vocab}(y_t) + \lambda_t P_{cache}(y_t)
    P_{cache}(y_t) = \sum_i 1_{\{y_t = y_i\}} P_m(c_i | c_t)

in which P_vocab(y_t) is the probability from the NMT model (Equation 2) and P_m(c_i | c_t) is the cache matching probability (Equation 5). We compute the interpolation weight \lambda_t in the same way as in Equation 8, except that \lambda_t is a scalar instead of a vector.
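As a minimal sketch of this word-level interpolation (our own illustration, not the authors' code; `p_vocab` stands for the NMT distribution of Equation 2, `p_match` for the matching probabilities of Equation 5, and `cache_words` for the target words stored alongside the slots):

```python
import numpy as np

def shallow_fusion(p_vocab, p_match, cache_words, lam):
    """Interpolate the NMT word distribution with a word-level cache distribution.

    p_vocab:     (V,) probability over the target vocabulary (Equation 2)
    p_match:     (I,) matching probabilities over the cache slots (Equation 5)
    cache_words: list of I vocabulary indices y_i stored with the slots
    lam:         scalar interpolation weight (Equation 8, reduced to a scalar)
    """
    p_cache = np.zeros_like(p_vocab)
    for prob, y_i in zip(p_match, cache_words):
        p_cache[y_i] += prob        # accumulate P_m(c_i | c_t) for slots with y_i = y_t
    return (1.0 - lam) * p_vocab + lam * p_cache
```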
Table 6 lists the results of comparing shallow fusion and deep fusion on the widely evaluated News domain (Tu et al., 2016; Li et al., 2017; Zhou et al., 2017; Wang et al., 2017c). As seen, deep fusion significantly outperforms its shallow counterpart, which is consistent with the results in (Gu et al., 2017). Different from Gu et al. (2017), shallow fusion does not achieve an improvement over the baseline system. One possible reason is that the generated words are less repetitive with those in the translation history than with similar sentences retrieved from the training corpus. Accordingly, storing words in the cache encourages lexical consistency at the cost of introducing noise, while storing continuous vectors alleviates this problem by performing the fusion in a soft way. In addition, the continuous vectors can store useful information beyond a single word, which we will show later.

Model            Tune    Test
BASE             38.36   35.39
Shallow Fusion   38.34   35.18
Deep Fusion      39.34   36.48

Table 6: Comparison of shallow fusion (i.e., words as cache values) and deep fusion (i.e., continuous vectors as cache values) in the News domain.

4.5 Translation Patterns Stored in the Cache

In this experiment, we present an analysis to gain insight into what kinds of translation patterns are captured by the cache to potentially improve translation performance, as shown in Figure 4.

Tense Consistency: Consistency is a critical issue in document-level translation, where a repeated term should keep the same translation throughout the whole document (Xiao et al., 2011; Carpuat and Simard, 2012). Among all consistency cases, we are interested in verb tense consistency. We found that our model works well on improving tense consistency. For example, the baseline model translated the word "觉得" into the present tense "feel" (Figure 4(a)), while from the translation history (Figure 1) we can learn that it should be translated into "felt" in the past tense. The cache model can improve tense consistency by exploring document-level context. As shown in the left panel of Figure 4(b), the proposed model generates the correct word "felt" by attending to the desired slot in the cache. It should be emphasized that our approach is still likely to generate the correct word even without the cache slot "felt", since the previously generated word "everyone" already attended to a slot "should", which also contains information of past tense. The improvement of tense consistency may not lead to a significant increase in BLEU score, but it is very important for the user experience.

Fuzzily Matched Patterns: Besides exactly matched lexical patterns (e.g., the slot "felt"), we found that the cache also stores useful "fuzzy match" patterns, which can improve translation performance by acting as some kind of "indicator" context. Take the generation of "opportunity" in the left panel of Figure 4(b) as an example: although the attended slots "courses", "training", "pressure", and "tasks" do not match "opportunity" at the lexical level, they are still helpful for generating the correct word "opportunity" when working together with the attended source vector centering on "机遇".

Patterns Beyond Word Level: By visualizing the cache during the translation process, we found that the proposed cache is able to remember not only word-level translation patterns, but also phrase-level translation patterns, as shown in the right panel of Figure 4(b). The latter is especially encouraging to us, since phrases play an important role in machine translation while it is difficult to integrate them into current NMT models (Zhou et al., 2017; Wang et al., 2017c; Huang et al., 2017). We attribute this to the fact that the decoder states, which serve as cache values, store phrasal information due to the strength of the decoder RNN in memorizing short-term history (e.g., the previous few words).
Input: 大家 觉得 这 也是 一次 机遇 , 一次 挑战 。
Reference: everyone felt that this was also an opportunity and a challenge .
BASE: we feel that it is also a challenge and a challenge .
OURS: everyone felt that this was an opportunity and a challenge .

Input: 这 确实 是 中国队 不能 “ 善终 ” 的 一 个 原因 。
Reference: this is indeed a reason why the chinese team could not have a “ good ending . ”
BASE: this is indeed the reason why china can not be “ hospice . ”
OURS: this is indeed a reason why the chinese team cannot be “ hospice . ”

(a) We italicize some mis-translated errors and highlight the correct ones in bold. Our approach is able to correct the errors with the target-side context retrieved from the cache, as shown below.

(b) Visualization of the cache matching matrix, in which the x-axis shows the generated words and the y-axis shows the cache slots, indicated by the corresponding words in the translation history. Our approach improves performance by retrieving useful information (e.g., verb tense for “felt” and phrase-level patterns for “the chinese team”, in boxes with red frames) from the cache.

... 觉得 新 课程 ...  /  ... felt that new courses ...
在 听完 了 所有 培训课 程 后 ...  /  after listening to all training courses ...
... , 然后 中国队 将 比分 ...  /  ... , the chinese team won the ...
中国队 经常 是 在 形势 大好 ...  /  the chinese team often does not have a ...

(c) Bilingual snippets in the translation history that correspond to the slots in boxes with red frames.

Figure 4: Translation examples in which the proposed approach shows its ability to remember translation patterns at different levels of granularity and matching.

5 Related Work

Our research builds on previous work in the fields of memory-augmented neural networks, exploitation of cross-sentence context, and caches in NLP.

Memory-Augmented Neural Networks: Neural Turing Machines (Graves et al., 2014) and Memory Networks (Weston et al., 2015; Sukhbaatar et al., 2015) are early models that augment neural networks with a possibly large external memory. Our work is based on Memory Networks, which have proven useful for question answering and document reading tasks (Weston et al., 2016; Hill et al., 2016). Specifically, we use the Key-Value Memory Network (Miller et al., 2016), which is a simplified version of Memory Networks with better interpretability and has yielded encouraging results in document reading (Miller et al., 2016), question answering (Pritzel et al., 2017), and language modeling (Tran et al., 2016; Grave et al., 2017; Daniluk et al., 2017). We use the memory to store information specific to the translation history so that this information is available to influence future translations. Closely related to our approach, Grave et al. (2017) use a continuous cache to improve language modeling by capturing longer history. We generalize from the original model and adapt it to machine translation: we use the cache to store bilingual information rather than monolingual information, and we release the hand-tuned parameters for cache matching.

In the context of neural machine translation, Kaiser et al. (2017) use an external key-value memory to remember rare training events at test time, and Gu et al. (2017) use a memory to store a set of sentence pairs retrieved from the training corpus given the source sentence. This is similar to our approach in exploiting more information than the current source sentence with a key-value memory. Unlike their approaches, ours aims at learning to remember translation history rather than incorporating arbitrary meta-data, which results in different sources of the auxiliary information (e.g., previous translations vs. similar training examples).
Accordingly, due to the different availability of target symbols in the two scenarios, different strategies of incorporating the retrieved values from the key-value memory are adopted: hidden state interpolation (Gulcehre et al., 2016) performs better in our task, while word probability interpolation (Gu et al., 2016) works better in (Gu et al., 2017).

Exploitation of Cross-Sentence Context: Cross-sentence context, which is generally encoded into a continuous space using a neural network, has a noticeable effect in various deep-learning-based NLP tasks, such as language modeling (Ji et al., 2015; Wang and Cho, 2016), query suggestion (Sordoni et al., 2015), dialogue modeling (Vinyals and Le, 2015; Serban et al., 2016), and machine translation (Wang et al., 2017a; Jean et al., 2017).

In statistical machine translation, cross-sentence context has proven useful for alleviating inconsistency and ambiguity arising from a single source sentence. Wide-range context was first exploited to improve statistical machine translation models (Gong et al., 2011; Xiao et al., 2012; Hardmeier et al., 2012; Hasler et al., 2014). Closely related to our approach, Gong et al. (2011) deploy a discrete cache to store bilingual phrases from the best translation hypotheses of previous sentences. In contrast, we use a continuous cache to store bilingual representations, which are more suitable for neural machine translation models.

Concerning neural machine translation, Wang et al. (2017a) and Jean et al. (2017) are two early attempts to model cross-sentence context. Wang et al. (2017a) use a hierarchical RNN to summarize the previous K (e.g., K = 3) source sentences, while Jean et al. (2017) use an additional set of an encoder and attention model to encode and select part of the previous source sentence for generating each target word. While their approaches only exploit source-side cross-sentence contexts, the proposed approach is able to take advantage of bilingual contexts by directly leveraging continuous vectors to represent translation history. As shown in Tables 4 and 5, compared with their approaches, the proposed approach is more robust in improving translation performance across different domains, and is more efficient in both training and testing.

Cache in NLP: In the NLP community, the concept of a "cache" was first introduced by Kuhn and Mori (1990), who augment a statistical language model with a cache component and assign relatively high probabilities to words that occur elsewhere in a given text. The success of the cache language model in improving word prediction rests on capturing the "burstiness" of word usage in a local context. It has been shown that caching is by far the most useful technique for perplexity reduction over the standard n-gram approach (Goodman, 2001), and it has become a standard component in most LM toolkits, such as IRSTLM (Federico et al., 2008). Inspired by the great success of caching on language modeling, Nepveu et al. (2004) propose to use a cache model to adapt language and translation models for SMT systems, and Tiedemann (2010) applies an exponentially decaying cache for the domain adaptation task. In this work, we have generalized and adapted the original discrete cache model, and integrate a "continuous" variant into NMT models.

6 Conclusion

We propose to augment NMT models with a cache-like memory network, which stores translation history in terms of bilingual hidden representations at decoding steps of previous sentences. The cache component is an external key-value memory structure with the keys being attention vectors and the values being decoder states collected from the translation history. At each decoding step, the probability distribution over generated words is updated online depending on the history information retrieved from the cache with a query of the current attention vector. Using simply a dot product for key matching, this history information is quite cheap to store and can be accessed efficiently.

In our future work, we expect several developments that will shed more light on utilizing long-range contexts, e.g., designing novel architectures, and employing discourse relations instead of directly using decoder states as cache values.

Acknowledgments

Yang Liu is supported by the National Key R&D Program of China (No. 2017YFB0202204) and the National Natural Science Foundation of China (No. 61432013, No. 61522204).

References

[Bahdanau et al.2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR 2015.
[Carpuat and Simard2012] Marine Carpuat and Michel Simard. 2012. The trouble with SMT consistency. In the Workshop on Statistical Machine Translation.
[Cettolo et al.2012] Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: web inventory of transcribed and translated talks. In EAMT 2012.
[Cho et al.2014] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP 2014.
[Choi et al.2016] Heeyoul Choi, Kyunghyun Cho, and Yoshua Bengio. 2016. Context-dependent word representation for neural machine translation. arXiv.
[Collins et al.2005] M. Collins, P. Koehn, and I. Kučerová. 2005. Clause restructuring for statistical machine translation. In ACL 2005.
[Daniluk et al.2017] Michał Daniluk, Tim Rocktäschel, Johannes Welbl, and Sebastian Riedel. 2017. Frustratingly short attention spans in neural language modeling. In ICLR 2017.
[Federico et al.2008] Marcello Federico, Nicola Bertoldi, and Mauro Cettolo. 2008. IRSTLM: an open source toolkit for handling large scale language models. In Interspeech, pages 1618-1621.
[Gong et al.2011] Zhengxian Gong, Min Zhang, and Guodong Zhou. 2011. Cache-based document-level statistical machine translation. In EMNLP 2011.
[Goodman2001] Joshua T. Goodman. 2001. A bit of progress in language modeling. Computer Speech and Language, 15(4):403-434, October.
[Grave et al.2017] Edouard Grave, Armand Joulin, and Nicolas Usunier. 2017. Improving neural language models with a continuous cache. In ICLR 2017.
[Graves et al.2014] Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. arXiv.
[Gu et al.2016] Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In ACL 2016.
[Gu et al.2017] Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor O.K. Li. 2017. Search engine guided non-parametric neural machine translation. arXiv.
[Gulcehre et al.2016] Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In ACL 2016.
[Hardmeier et al.2012] Christian Hardmeier, Joakim Nivre, and Jörg Tiedemann. 2012. Document-wide decoding for phrase-based statistical machine translation. In EMNLP 2012.
[Hasler et al.2014] Eva Hasler, Phil Blunsom, Philipp Koehn, and Barry Haddow. 2014. Dynamic topic adaptation for phrase-based MT. In EACL 2014.
[Hill et al.2016] Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2016. The goldilocks principle: Reading children's books with explicit memory representations. In ICLR 2016.
[Hinton et al.2012] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv.
[Huang et al.2017] Po-Sen Huang, Chong Wang, Dengyong Zhou, and Li Deng. 2017. Neural phrase-based machine translation. arXiv.
[Jean et al.2015] Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In ACL 2015.
[Jean et al.2017] Sébastien Jean, Stanislas Lauly, Orhan Firat, and Kyunghyun Cho. 2017. Does neural machine translation benefit from larger context? arXiv.
[Ji et al.2015] Yangfeng Ji, Trevor Cohn, Lingpeng Kong, Chris Dyer, and Jacob Eisenstein. 2015. Document context language models. In ICLR 2015.
[Kaiser et al.2017] Łukasz Kaiser, Ofir Nachum, Aurko Roy, and Samy Bengio. 2017. Learning to remember rare events. In ICLR 2017.
[Kalchbrenner et al.2014] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In ACL 2014.
[Kawakami et al.2017] Kazuya Kawakami, Chris Dyer, and Phil Blunsom. 2017. Learning to create and reuse words in open-vocabulary neural language modeling. In ACL 2017.
[Koehn and Knowles2017] Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the 1st Workshop on Neural Machine Translation.
[Kuhn and Mori1990] Roland Kuhn and Renato De Mori. 1990. A cache-based natural language model for speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(6):570-583.
[Li et al.2017] Junhui Li, Deyi Xiong, Zhaopeng Tu, Muhua Zhu, Min Zhang, and Guodong Zhou. 2017. Modeling source syntax for neural machine translation. In ACL 2017.
[Luong et al.2015a] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015a. Effective approaches to attention-based neural machine translation. In EMNLP 2015.
[Luong et al.2015b] Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2015b. Addressing the rare word problem in neural machine translation. In ACL 2015.
[Miller et al.2016] Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. In EMNLP 2016.
[Nepveu et al.2004] Laurent Nepveu, Guy Lapalme, Philippe Langlais, and George Foster. 2004. Adaptive language and translation models for interactive machine translation. In EMNLP 2004.
[Papineni et al.2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL 2002.
[Pritzel et al.2017] Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adrià Puigdomènech, Oriol Vinyals, Demis Hassabis, Daan Wierstra, and Charles Blundell. 2017. Neural episodic control. arXiv.
[Serban et al.2016] Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 3776-3783, Phoenix, Arizona.
[Shen et al.2016] Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum risk training for neural machine translation. In ACL 2016.
[Sordoni et al.2015] Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and Jian-Yun Nie. 2015. A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management.
[Sukhbaatar et al.2015] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-to-end memory networks. In NIPS 2015.
[Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS 2014.
[Tiedemann2010] Jörg Tiedemann. 2010. Context adaptation in statistical machine translation using models with exponentially decaying cache. In Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing.
[Tran et al.2016] Ke Tran, Arianna Bisazza, and Christof Monz. 2016. Recurrent memory networks for language modeling. In NAACL 2016.
[Tu et al.2016] Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In ACL 2016.
[Tu et al.2017a] Zhaopeng Tu, Yang Liu, Zhengdong Lu, Xiaohua Liu, and Hang Li. 2017a. Context gates for neural machine translation. In TACL 2017.
[Tu et al.2017b] Zhaopeng Tu, Yang Liu, Lifeng Shang, Xiaohua Liu, and Hang Li. 2017b. Neural machine translation with reconstruction. In AAAI 2017.
[Vinyals and Le2015] Oriol Vinyals and Quoc Le. 2015. A neural conversational model. In Proceedings of the International Conference on Machine Learning, Deep Learning Workshop, pages 1-8.
[Wang and Cho2016] Tian Wang and Kyunghyun Cho. 2016. Larger-context language modelling with recurrent neural network. In ACL 2016.
[Wang et al.2017a] Longyue Wang, Zhaopeng Tu, Andy Way, and Qun Liu. 2017a. Exploiting cross-sentence context for neural machine translation. In EMNLP 2017.
[Wang et al.2017b] Xing Wang, Zhengdong Lu, Zhaopeng Tu, Hang Li, Deyi Xiong, and Min Zhang. 2017b. Neural machine translation advised by statistical machine translation. In AAAI 2017.
[Wang et al.2017c] Xing Wang, Zhaopeng Tu, Deyi Xiong, and Min Zhang. 2017c. Translating phrases in neural machine translation. In EMNLP 2017.
[Wang et al.2018] Longyue Wang, Zhaopeng Tu, Shuming Shi, Tong Zhang, Yvette Graham, and Qun Liu. 2018. Translating pro-drop languages with reconstruction models. In AAAI 2018.
[Weston et al.2015] Jason Weston, Sumit Chopra, and Antoine Bordes. 2015. Memory networks. In ICLR 2015.
[Weston et al.2016] Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. 2016. Towards AI-complete question answering: a set of prerequisite toy tasks. In ICLR 2016.
[Xiao et al.2011] Tong Xiao, Jingbo Zhu, Shujie Yao, and Hao Zhang. 2011. Document-level consistency verification in machine translation. In Machine Translation Summit.
[Xiao et al.2012] Xinyan Xiao, Deyi Xiong, Min Zhang, Qun Liu, and Shouxun Lin. 2012. A topic similarity model for hierarchical phrase-based translation. In ACL 2012.
[Zeiler2012] Matthew D. Zeiler. 2012. ADADELTA: an adaptive learning rate method. arXiv.
[Zhou et al.2017] Hao Zhou, Zhaopeng Tu, Shujian Huang, Xiaohua Liu, Hang Li, and Jiajun Chen. 2017. Chunk-based bi-scale decoder for neural machine translation. In ACL 2017.