Neural Associative Memory for Dual-Sequence Modeling

Neural Associative Memory for Dual-Sequence Modeling Dirk Weissenborn Language Technology Lab, DFKI Alt-Moabit 91c Berlin, Germany [email protected] Abstract problems can (at least partially) be modeled by this paradigm (Li and Hovy, 2015). These Many important NLP problems can be models operate on two distinct sequences, the posed as dual-sequence or sequence-to- source and the target sequence. Some tasks sequence modeling tasks. Recent ad- require the generation of the target based on vances in building end-to-end neural ar- the source (sequence-to-sequence modeling), chitectures have been highly successful in e.g., machine translation, whereas other tasks solving such tasks. In this work we pro- involve making predictions about a given source pose a new architecture for dual-sequence and target sequence (dual-sequence modeling), modeling that is based on associative e.g., recognizing textual entailment. Existing memory. We derive AM-RNNs, a recur- state-of-the-art, end-to-end differentiable models rent associative memory (AM) which aug- for both tasks exploit the same architectural ideas. ments generic recurrent neural networks The ability of such models to carry information (RNN). This architecture is extended to over long distances is a key enabling factor for the Dual AM-RNN which operates on their performance. Typically this can be achieved two AMs at once. Our models achieve by employing recurrent neural networks (RNN) very competitive results on textual en- that convey information over time through an in- tailment. A qualitative analysis demon- ternal memory state. Most famous is the LSTM strates that long range dependencies be- (Hochreiter and Schmidhuber, 1997) that accu- tween source and target-sequence can be mulates information at every time step additively bridged effectively using Dual AM-RNNs. into its memory state, which avoids the prob- However, an initial experiment on auto- lem of vanishing gradients that hindered previous encoding reveals that these benefits are RNN architectures from learning long range de- not exploited by the system when learn- pendencies. For example, Sutskever et al. (2014) ing to solve sequence-to-sequence tasks connected two LSTMs conditionally for machine which indicates that additional supervision translation where the memory state after process- or regularization is needed. ing the source was used as initialization for the memory state of the target LSTM. This very sim- 1 Introduction ple architecture achieved competitive results com- Dual-sequence modeling and sequence-to- pared to existing, very elaborate and feature-rich sequence modeling are important paradigms models. However, learning the inherent long range that are used in many applications involving dependencies between source and target requires natural language, including machine translation extensive training on large datasets. Bahdanau et (Bahdanau et al., 2015; Sutskever et al., 2014), al. (2015) proposed an architecture that resolved recognizing textual entailment (Cheng et al., this issue by allowing the model to attend over 2016; Rocktaschel¨ et al., 2016; Wang and Jiang, all positions in the source sentence when predict- 2016), auto-encoding (Li et al., 2015), syntactical ing the target sentence, which enabled the model parsing (Vinyals et al., 2015) or document-level to automatically learn alignments of words and question answering (Hermann et al., 2015). We phrases of the source with the target sentence. The might even argue that most, if not all, NLP important difference is that previous long range 258 Proceedings of the 1st Workshop on Representation Learning for NLP, pages 258–266, Berlin, Germany, August 11th, 2016. c 2016 Association for Computational Linguistics dependencies could be bridged directly via atten- content to. This form of memory is addressable tion. However, this architecture requires a larger via content or position shifts. Neural Turing Ma- number of operations that scales with the product chines inspired subsequent work on using different of the lengths of the source- and target sequence kinds of external memory, like queues or stacks and a memory that scales with the length of the (Grefenstette et al., 2015). Operations on these source sequence. memories are calculated via a recurrent controller In this work we introduce a novel architecture which is decoupled from the memory whereas for dual-sequence modeling that is based on as- AM-RNNs apply the RNN cell-function directly sociative memories (AM). AMs are fixed sized upon the content of the associative memory. memory arrays used to read and write content via Danihelka et al. (2016) introduced Associative an associated keys. Holographic Reduced Rep- LSTMs which extends standard LSTMs directly resentations (HRR) (Plate, 1995)) enable the ro- by reading and writing operations on an associa- bust and efficient retrieval of previously written tive memory. This architecture is closely related content from redundant memory arrays. Our ap- to ours. However, there are crucial differences that proach is inspired by the works of Danihelka et are due to the fact that we decouple the associative al. (2016) who recently demonstrated the bene- array from the original cell-function. Danihelka et fits of exchanging the memory cell of an LSTM al. (2016) directly include operations on the AM with an associative memory on various sequence in the definition of their Associative LSTM. This modeling tasks. In contrast to their architecture might cause problems, since some operations, e.g., which directly adapts the LSTM architecture we forget, are directly applied to the entire memory propose an augmentation to generic RNNs (AM- array although this can affect all elements stored RNNs, 3.2). Similar in spirit to Neural Turing § in the memory. We believe that only reading and Machines (Graves et al., 2014) we decouple the writing operations with respect to a calculated key AM from the RNN and restrict the interaction with should be performed on the associative memory. the AM to read and write operations which we Further operations should therefore only be ap- believe to be important. Based on this architec- plied on the stored elements. ture we derive the Dual AM-RNN ( 4) that oper- § Neural attention is another important mecha- ates on two associative memories simultaneously nism that realizes a form of content addressable for dual-sequence modeling. We conduct exper- memory. Most famously it has been applied to iments on the task of recognizing textual entail- machine translation (MT) where attention models ment ( 5). Our results and qualitative analysis § automatically learn soft word alignments between demonstrate that AMs can be used to bridge long source and translation (Bahdanau et al., 2015). At- range dependencies similar to the attention mech- tention requires memory that stores states of its in- anism while preserving the computational bene- dividual entries, separately, e.g., states for every fits of conveying information through a single, word in the source sentence of MT or textual en- fixed-size memory state. Finally, an initial in- tailment (Rocktaschel¨ et al., 2016), or entire sen- spection into sequence-to-sequence modeling with tence states as in Sukhbaatar et al. (2015) which Dual AM-RNNs shows that there are open prob- is an end-to-end memory network (Weston et al., lems that need to be resolved to make this ap- 2015) for question answering. Attention weights proach applicable to these kinds of tasks. are computed based on a provided input and the A TensorFlow (Abadi et al., 2015) im- stored elements. The thereby weighted memory plementation of (Dual)-AM RNNs can states are summed and the result is retrieved to be be found at https://github.com/ used as input to a down-stream neural network. dirkweissenborn/dual_am_rnn. Architectures based on attention require a larger amount of memory and a larger number of oper- 2 Related Work ations which scales with the usually dynamically Augmenting RNNs by the use of memory is not growing memory. In contrast to attention Dual novel. Graves et al. (2014) introduced Neural Tur- AM-RNNs utilize fixed size memories and a con- ing Machines which augment RNNs with exter- stant number of operations. nal memory that can be written to and read from. AM-RNNs also have an interesting connection It contains a predefined number of slots to write to LSTM-Networks (Cheng et al., 2016) which re- 259 cently demonstrated impressive results on various with m (Equation (3)). text modeling tasks. LSTM-Networks (LSTMN) select a previous hidden state via attention on a N x˜ = r m = r r x memory tape of past states (intra-attention) op- k k ~ k ~ k0 ~ k0 posed to using the hidden state of the previous kX0=1 N time step. The same idea is implicitly present = xk + rk ~ rk ~ xk in our architecture by retrieving a previous state 0 0 k =1=k via a computed key from the associative memory 0X6 = x + noise (Equation (6)). The main difference lies in the k (3) used memory architecture. We use a fixed size To reduce noise Danihelka et al. (2016) intro- memory array in contrast to a dynamically grow- duce permuted, redundant copies m of m (Equa- ing memory tape which requires growing compu- s tion (4)). This results in uncorrelated retrieval tational and memory resources. The drawback of noises which effectively reduces the overall re- our approach, however, is the potential loss of ex- trieval noise when computing their mean. Con- plicit memories due to retrieval noise or overwrit- sider N permutations represented by permutation ing. c matrices Ps. The retrieval equation becomes the following. 3 Associative Memory RNN N 3.1 Redundant Associative Memory ms = (Psrk) ~ xk (4) In the following, we use the terminology of Dani- Xk=1 N N helka et al. (2016) to introduce Redundant Asso- 1 c x˜ = (P r ) m ciative Memories and Holographic Reduced Rep- k N s k ~ s c s=1 resentations (HRR) (Plate, 1995).

Load more