Neural Associative Memory for Dual-Sequence Modeling

Dirk Weissenborn
Language Technology Lab, DFKI
Alt-Moabit 91c, [email protected]

Proceedings of the 1st Workshop on Representation Learning for NLP, pages 258–266, Berlin, Germany, August 11th, 2016. © 2016 Association for Computational Linguistics

Abstract

Many important NLP problems can be posed as dual-sequence or sequence-to-sequence modeling tasks. Recent advances in building end-to-end neural architectures have been highly successful in solving such tasks. In this work we propose a new architecture for dual-sequence modeling that is based on associative memory. We derive AM-RNNs, a recurrent associative memory (AM) which augments generic recurrent neural networks (RNN). This architecture is extended to the Dual AM-RNN which operates on two AMs at once. Our models achieve very competitive results on textual entailment. A qualitative analysis demonstrates that long range dependencies between source and target sequence can be bridged effectively using Dual AM-RNNs. However, an initial experiment on auto-encoding reveals that these benefits are not exploited by the system when learning to solve sequence-to-sequence tasks, which indicates that additional supervision or regularization is needed.

1 Introduction

Dual-sequence modeling and sequence-to-sequence modeling are important paradigms that are used in many applications involving natural language, including machine translation (Bahdanau et al., 2015; Sutskever et al., 2014), recognizing textual entailment (Cheng et al., 2016; Rocktäschel et al., 2016; Wang and Jiang, 2016), auto-encoding (Li et al., 2015), syntactical parsing (Vinyals et al., 2015) or document-level question answering (Hermann et al., 2015). We might even argue that most, if not all, NLP problems can (at least partially) be modeled by this paradigm (Li and Hovy, 2015). These models operate on two distinct sequences, the source and the target sequence. Some tasks require the generation of the target based on the source (sequence-to-sequence modeling), e.g., machine translation, whereas other tasks involve making predictions about a given source and target sequence (dual-sequence modeling), e.g., recognizing textual entailment. Existing state-of-the-art, end-to-end differentiable models for both tasks exploit the same architectural ideas.

The ability of such models to carry information over long distances is a key enabling factor for their performance. Typically this can be achieved by employing recurrent neural networks (RNN) that convey information over time through an internal memory state. Most famous is the LSTM (Hochreiter and Schmidhuber, 1997) that accumulates information at every time step additively into its memory state, which avoids the problem of vanishing gradients that hindered previous RNN architectures from learning long range dependencies. For example, Sutskever et al. (2014) connected two LSTMs conditionally for machine translation where the memory state after processing the source was used as initialization for the memory state of the target LSTM. This very simple architecture achieved competitive results compared to existing, very elaborate and feature-rich models. However, learning the inherent long range dependencies between source and target requires extensive training on large datasets. Bahdanau et al. (2015) proposed an architecture that resolved this issue by allowing the model to attend over all positions in the source sentence when predicting the target sentence, which enabled the model to automatically learn alignments of words and phrases of the source with the target sentence. The important difference is that previous long range dependencies could be bridged directly via attention. However, this architecture requires a larger number of operations that scales with the product of the lengths of the source and target sequence and a memory that scales with the length of the source sequence.

In this work we introduce a novel architecture for dual-sequence modeling that is based on associative memories (AM). AMs are fixed-size memory arrays used to read and write content via associated keys. Holographic Reduced Representations (HRR) (Plate, 1995) enable the robust and efficient retrieval of previously written content from redundant memory arrays. Our approach is inspired by the work of Danihelka et al. (2016), who recently demonstrated the benefits of exchanging the memory cell of an LSTM with an associative memory on various sequence modeling tasks. In contrast to their architecture, which directly adapts the LSTM architecture, we propose an augmentation to generic RNNs (AM-RNNs, §3.2). Similar in spirit to Neural Turing Machines (Graves et al., 2014) we decouple the AM from the RNN and restrict the interaction with the AM to read and write operations, which we believe to be important. Based on this architecture we derive the Dual AM-RNN (§4) that operates on two associative memories simultaneously for dual-sequence modeling. We conduct experiments on the task of recognizing textual entailment (§5). Our results and qualitative analysis demonstrate that AMs can be used to bridge long range dependencies similar to the attention mechanism while preserving the computational benefits of conveying information through a single, fixed-size memory state. Finally, an initial inspection into sequence-to-sequence modeling with Dual AM-RNNs shows that there are open problems that need to be resolved to make this approach applicable to these kinds of tasks.

A TensorFlow (Abadi et al., 2015) implementation of (Dual)-AM RNNs can be found at https://github.com/dirkweissenborn/dual_am_rnn.

2 Related Work

Augmenting RNNs by the use of memory is not novel. Graves et al. (2014) introduced Neural Turing Machines which augment RNNs with external memory that can be written to and read from. It contains a predefined number of slots to write content to. This form of memory is addressable via content or position shifts. Neural Turing Machines inspired subsequent work on using different kinds of external memory, like queues or stacks (Grefenstette et al., 2015). Operations on these memories are calculated via a recurrent controller which is decoupled from the memory, whereas AM-RNNs apply the RNN cell-function directly upon the content of the associative memory.

Danihelka et al. (2016) introduced Associative LSTMs which extend standard LSTMs directly by reading and writing operations on an associative memory. This architecture is closely related to ours. However, there are crucial differences that are due to the fact that we decouple the associative array from the original cell-function. Danihelka et al. (2016) directly include operations on the AM in the definition of their Associative LSTM. This might cause problems, since some operations, e.g., forget, are directly applied to the entire memory array although this can affect all elements stored in the memory. We believe that only reading and writing operations with respect to a calculated key should be performed on the associative memory. Further operations should therefore only be applied on the stored elements.

Neural attention is another important mechanism that realizes a form of content addressable memory. Most famously it has been applied to machine translation (MT) where attention models automatically learn soft word alignments between source and translation (Bahdanau et al., 2015). Attention requires memory that stores states of its individual entries separately, e.g., states for every word in the source sentence of MT or textual entailment (Rocktäschel et al., 2016), or entire sentence states as in Sukhbaatar et al. (2015), which is an end-to-end memory network (Weston et al., 2015) for question answering. Attention weights are computed based on a provided input and the stored elements. The thereby weighted memory states are summed and the result is retrieved to be used as input to a down-stream neural network. Architectures based on attention require a larger amount of memory and a larger number of operations which scales with the usually dynamically growing memory. In contrast to attention, Dual AM-RNNs utilize fixed size memories and a constant number of operations.

AM-RNNs also have an interesting connection to LSTM-Networks (Cheng et al., 2016) which recently demonstrated impressive results on various text modeling tasks. LSTM-Networks (LSTMN) select a previous hidden state via attention on a memory tape of past states (intra-attention) as opposed to using the hidden state of the previous time step. The same idea is implicitly present in our architecture by retrieving a previous state via a computed key from the associative memory (Equation (6)). The main difference lies in the used memory architecture. We use a fixed size memory array in contrast to a dynamically growing memory tape which requires growing computational and memory resources. The drawback of our approach, however, is the potential loss of explicit memories due to retrieval noise or overwriting.

3 Associative Memory RNN

3.1 Redundant Associative Memory

In the following, we use the terminology of Danihelka et al. (2016) to introduce Redundant Associative Memories and Holographic Reduced Representations (HRR) (Plate, 1995). HRRs provide a mechanism to encode an item $x$ with a key $r$ that can be written to a fixed size memory array $m$ and that can be retrieved from $m$ via $r$.

In HRR, keys $r$ and values $x$ refer to complex vectors that consist of a real and imaginary part: $r = r_{re} + i \cdot r_{im}$, $x = x_{re} + i \cdot x_{im}$, where $i$ is the imaginary unit. We represent these complex vectors as concatenations of their respective real and imaginary parts, e.g., $r = [r_{re}; r_{im}]$. The encoding- and retrieval-operation proposed by Plate (1995) and utilized by Danihelka et al. (2016) is the complex multiplication (Equation (1)) of a key $r$ with its value $x$ (encoding), and the complex conjugate of the key $\bar{r} = r_{re} - i \cdot r_{im}$ with the memory (retrieval), respectively. Note that this requires the modulus of the key to be equal to one, i.e., $\sqrt{r_{re} \odot r_{re} + r_{im} \odot r_{im}} = 1$, such that $\bar{r} = r^{-1}$. Consider a single memory array $m$ containing $N$ elements $x_k$ with respective keys $r_k$ (Equation (2)).

$$r \circledast x = \begin{bmatrix} r_{re} \odot x_{re} - r_{im} \odot x_{im} \\ r_{re} \odot x_{im} + r_{im} \odot x_{re} \end{bmatrix} \tag{1}$$

$$m = \sum_{k=1}^{N} r_k \circledast x_k \tag{2}$$

We retrieve an element $x_k$ by multiplying $\bar{r}_k$ with $m$ (Equation (3)).

$$\begin{aligned} \tilde{x}_k = \bar{r}_k \circledast m &= \bar{r}_k \circledast \sum_{k'=1}^{N} r_{k'} \circledast x_{k'} \\ &= x_k + \sum_{k'=1,\, k' \neq k}^{N} \bar{r}_k \circledast r_{k'} \circledast x_{k'} \\ &= x_k + \text{noise} \end{aligned} \tag{3}$$

To reduce noise, Danihelka et al. (2016) introduce permuted, redundant copies $m_s$ of $m$ (Equation (4)). This results in uncorrelated retrieval noises, which effectively reduces the overall retrieval noise when computing their mean. Consider $N_c$ permutations represented by permutation matrices $P_s$. The retrieval equation becomes the following.

$$\begin{aligned} m_s &= \sum_{k=1}^{N} (P_s r_k) \circledast x_k \\ \tilde{x}_k &= \frac{1}{N_c} \sum_{s=1}^{N_c} \overline{(P_s r_k)} \circledast m_s \\ &= x_k + \sum_{k'=1,\, k' \neq k}^{N} x_{k'} \circledast \frac{1}{N_c} \sum_{s=1}^{N_c} P_s (\bar{r}_k \circledast r_{k'}) \\ &= x_k + \text{noise} \end{aligned} \tag{4}$$

The resulting retrieval noise becomes smaller because the mean of the permuted, complex key products tends towards zero with increasing $N_c$ if the key dimensions are uncorrelated (see Danihelka et al. (2016) for more information).

3.2 Augmenting RNNs with Associative Memory

A recurrent neural network (RNN) can be defined by a parametrized cell-function $f_\theta : \mathbb{R}^N \times \mathbb{R}^M \to \mathbb{R}^M \times \mathbb{R}^H$ that is recurrently applied to an input sequence $X = (x_1, ..., x_T)$. At each time step $t$ it emits an output $h_t$ and a state $s_t$, which is used as additional input in the following time step (Equation (5)).

$$f_\theta(x_t, s_{t-1}) = (s_t, h_t), \qquad x \in \mathbb{R}^N,\ s \in \mathbb{R}^M,\ h \in \mathbb{R}^H \tag{5}$$

In this work we augment RNNs, or more specifically their cell-function $f_\theta$, with associative memory to form Associative Memory RNNs (AM-RNN) $\tilde{f}_\theta$.
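As background for the construction of the AM-RNN, the redundant associative memory of Section 3.1 (Equations (1)–(4)) can be sketched numerically. The following is a minimal illustration in plain Python and not the author's TensorFlow implementation. It assumes unit-modulus random keys and Gaussian random values, and it uses Python's built-in complex type instead of the concatenated real and imaginary representation. The helper names (`random_unit_key`, `write`, `read`, `retrieval_error`) are hypothetical.

```python
import cmath
import random


def random_unit_key(dim, rng):
    """Complex key with unit-modulus entries, so its conjugate acts as its inverse."""
    return [cmath.exp(1j * rng.uniform(0.0, 2.0 * cmath.pi)) for _ in range(dim)]


def permute(vec, perm):
    """Apply a permutation P_s, given as a list of indices, to a vector."""
    return [vec[i] for i in perm]


def write(memories, perms, key, value):
    """Add (P_s r) * x to every redundant copy m_s (first line of Equation (4))."""
    for memory, perm in zip(memories, perms):
        permuted_key = permute(key, perm)
        for i, (k, x) in enumerate(zip(permuted_key, value)):
            memory[i] += k * x


def read(memories, perms, key):
    """Average conj(P_s r) * m_s over all copies (second line of Equation (4))."""
    dim = len(key)
    total = [0j] * dim
    for memory, perm in zip(memories, perms):
        permuted_key = permute(key, perm)
        for i in range(dim):
            total[i] += permuted_key[i].conjugate() * memory[i]
    return [t / len(memories) for t in total]


def retrieval_error(num_copies, dim=64, num_items=4, seed=0):
    """Mean squared error when reading back the first of num_items stored values."""
    item_rng = random.Random(seed)      # same keys and values for every num_copies
    perm_rng = random.Random(seed + 1)  # separate stream for the permutations
    keys = [random_unit_key(dim, item_rng) for _ in range(num_items)]
    values = [
        [complex(item_rng.gauss(0, 1), item_rng.gauss(0, 1)) for _ in range(dim)]
        for _ in range(num_items)
    ]
    perms = [perm_rng.sample(range(dim), dim) for _ in range(num_copies)]
    memories = [[0j] * dim for _ in range(num_copies)]
    for key, value in zip(keys, values):
        write(memories, perms, key, value)
    retrieved = read(memories, perms, keys[0])
    return sum(abs(a - b) ** 2 for a, b in zip(retrieved, values[0])) / dim


if __name__ == "__main__":
    for copies in (1, 8, 64):
        print(copies, retrieval_error(copies))
```

With a single stored item the retrieval is exact up to rounding, because the key has modulus one. With several items, the printed error should shrink as the number of copies grows, which is the noise reduction the section describes.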

Let $s_t = [c_t; n_t]$ be the concatenation of a memory state $c_t$ and, optionally, some remainder $n_t$ that might additionally be used in $f$, e.g., the output of an LSTM. For brevity, we neglect $n_t$ in the following, and thus $s_t = c_t$. At first, we compute a key given the previous output and the current input, which is in turn used to read from the associative memory array $m$ to retrieve a memory state $s$ for the specified key (Equation (6)).

$$r_t = \text{bound}\left(W_r \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix}\right), \qquad s_{t-1} = r_t \circledast m_{t-1} \tag{6}$$

The bound-operation (Danihelka et al., 2016) (Equation (7)) guarantees that the modulus of $r_t$ is not greater than 1. This is an important necessity as mentioned in §3.1.

$$\text{bound}(r') = \begin{bmatrix} r'_{re} \oslash d \\ r'_{im} \oslash d \end{bmatrix}, \qquad d = \max\left(1, \sqrt{r'_{re} \odot r'_{re} + r'_{im} \odot r'_{im}}\right) \tag{7}$$

Next, we apply the original cell-function $f_\theta$ to the retrieved memory state (Equation (8)) and the concatenation of the current input and last output, which serves as input to the internal RNN. We update the associative memory array with the updated state using the conjugate key of the retrieval key (Equation (9)).

$$s_t, h_t = f_\theta\left(\begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix}, s_{t-1}\right) \tag{8}$$

$$m_t = m_{t-1} + \bar{r}_t \circledast (s_t - s_{t-1}), \qquad \tilde{f}_\theta(x_t, m_{t-1}) = (m_t, h_t) \tag{9}$$

The entire computation workflow is illustrated in Figure 1a.

4 Associative Memory RNNs for Dual Sequence Modeling

Important NLP tasks such as machine translation (MT) or detecting textual entailment (TE) involve two distinct sequences as input, a source and a target sequence. In MT a system predicts the target sequence based on the source, whereas in TE source and target are given and an entailment-class should be predicted. Recently, both tasks were successfully modelled using an attention mechanism that can attend over positions in the source sentence at any time step in the target sentence (Bahdanau et al., 2015; Rocktäschel et al., 2016; Cheng et al., 2016). These models are able to learn important task specific correlations between words or phrases of the two sentences, like word/phrase translation, or word-/phrase-level entailment or contradiction. The success of these models is mainly due to the fact that long range dependencies can be bridged directly via attention, instead of keeping information over long distances in a memory state that can get overwritten.

The same can be achieved through associative memory. Given the correct key, a state that was written at any time step in the source sentence can be retrieved from an AM with minor noise that can efficiently be reduced by redundancy. AMs can therefore bridge long range dependencies and be used as an alternative to attention. The trade-off for using an AM is that memorized states cannot be used for their retrieval. However, the retrieval operation is constant in time and memory, whereas the computational and memory complexity of attention based architectures grow linearly with the length of the source sequence.

We propose two different architectures for solving dual sequence problems. Both approaches use at least one AM-RNN for processing the source and another for the target sequence. The first approach reads the source sequence $X = (x_1, ..., x_{T_x})$ and uses the final associative memory array $m^x$ ($:= m^x_{T_x}$) to initialize the memory array $m^y_0 = m^x$ of the AM-RNN that processes the target sequence $Y = (y_1, ..., y_{T_y})$. Note that this is basically the conditional encoding architecture of Rocktäschel et al. (2016).

The second approach uses the final AM array of the source sequence $m^x$ in addition to an independent target AM array $m^y_t$. At each time step $t$ the Dual AM-RNN computes another key $r'_t$ that is used to read from $m^x$ and feeds the retrieved value as additional input to $y_t$ to the inner RNN of the target AM-RNN.
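The AM-RNN step of Section 3.2 (Equations (6)–(9)) and the additional read-only lookup of the second approach (Equation (10)) can be sketched together as follows. This is a minimal illustration, not the author's implementation. It assumes a single memory copy without redundancy, Python's built-in complex numbers, and the choice $r'_t = r_t$ for the source key. Following the body text, the key is used directly for reading and its conjugate for writing. `toy_key_fn` and `toy_cell` are hypothetical stand-ins for the learned key projection $W_r$ and the cell-function $f_\theta$.

```python
def bound(key):
    """Equation (7): divide each complex entry by max(1, modulus)."""
    return [z / max(1.0, abs(z)) for z in key]


def multiply(a, b):
    """Element-wise complex multiplication (Equation (1))."""
    return [u * v for u, v in zip(a, b)]


def conjugate(key):
    """Element-wise complex conjugate of a key."""
    return [z.conjugate() for z in key]


def am_rnn_step(cell, key_fn, x, h_prev, m_prev, source_memory=None):
    """One AM-RNN step; passing source_memory turns it into a Dual AM-RNN step.

    cell(x, h_prev, phi, s_prev) -> (s, h) stands in for f_theta, and
    key_fn(x, h_prev) -> list of complex stands in for W_r [x; h_prev].
    Both are hypothetical placeholders supplied by the caller.
    """
    r = bound(key_fn(x, h_prev))      # Eq. (6): compute the key
    s_prev = multiply(r, m_prev)      # Eq. (6): read the previous state
    # Dual variant: read-only lookup in the source memory, assuming r' = r.
    phi = multiply(r, source_memory) if source_memory is not None else []
    s, h = cell(x, h_prev, phi, s_prev)  # Eq. (8), or Eq. (10) with phi
    delta = [a - b for a, b in zip(s, s_prev)]
    update = multiply(conjugate(r), delta)
    m = [mi + ui for mi, ui in zip(m_prev, update)]  # Eq. (9): write back
    return m, h


def toy_key_fn(x, h_prev):
    """Hypothetical fixed key with moduli >= 1, so bound() normalizes it."""
    return [complex(3.0, 4.0), complex(0.0, 2.0)]


def toy_cell(x, h_prev, phi, s_prev):
    """Hypothetical cell: decay the old state, add the input and any source read."""
    extra = phi if phi else [0j] * len(s_prev)
    s = [0.5 * sp + x[0] + e for sp, e in zip(s_prev, extra)]
    return s, [abs(z) for z in s]


if __name__ == "__main__":
    empty = [0j, 0j]
    m1, h1 = am_rnn_step(toy_cell, toy_key_fn, [1.0], [0.0, 0.0], empty)
    r = bound(toy_key_fn([1.0], h1))
    print(multiply(r, m1))  # reading with the same key recovers the written state
    m2, h2 = am_rnn_step(toy_cell, toy_key_fn, [1.0], h1, empty, source_memory=m1)
    print(h2)
```

The source memory is only read and never updated inside the step, which mirrors the read-only role of the premise memory. When the bounded key has modulus one, reading with the same key right after a write returns the newly written state, since the old content for that key is subtracted in the update.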

These changes are reflected in Equation (10) (compared to Equation (8)) and illustrated in Figure 1b.

$$\begin{aligned} r'_t &= \text{bound}\left(W_{r'} \begin{bmatrix} y_t \\ h^y_{t-1} \end{bmatrix}\right) \\ \phi_t &= r'_t \circledast m^x \\ s_t, h^y_t &= f_\theta\left(\begin{bmatrix} y_t \\ h^y_{t-1} \\ \phi_t \end{bmatrix}, s_{t-1}\right) \end{aligned} \tag{10}$$

Figure 1: Illustration of the computation workflow in AM-RNNs and Dual AM-RNNs. (a) Illustration of AM-RNN for input $x_t$ at time step $t$. (b) Illustration of a Dual AM-RNN that extends the AM-RNN with the utilization of the final memory array $m^x_{T_x}$ of source sequence $X$. •∗ refers to the complex multiplication with the (complex) conjugate of $r_t$ and can be interpreted as the retrieval operation. Similarly, $\circledast$ can be interpreted as the encoding operation.

5 Experiments

5.1 Setup

Dataset. We conducted experiments on the Stanford Natural Language Inference (SNLI) Corpus (Bowman et al., 2015) that consists of roughly 500k sentence pairs (premise-hypothesis). They are annotated with textual entailment labels. The task is to predict whether a premise entails, contradicts or is neutral to a given hypothesis.

Training. We perform mini-batch ($B = 50$) stochastic gradient descent using ADAM (Kingma and Ba, 2015) with $\beta_1 = 0$, $\beta_2 = 0.999$ and an initial learning rate of $10^{-3}$ for small models ($H \approx 100$) and $10^{-4}$ for our large model ($H = 500$). The learning rate was halved whenever accuracy dropped over the period of one epoch. Performance on the development set was checked every 1000 mini-batches and the best model is used for testing. We employ dropout with a probability of 0.1 or 0.2 for the small and large models, respectively. Following Cheng et al. (2016), word embeddings are initialized with Glove (Pennington et al., 2014) or randomly for unknown words. Glove initialized embeddings are tuned only after an initial epoch through the training set.

Model. In this experiment we compare the traditional GRU with the (Dual) AM-GRU using conditional encoding (Rocktäschel et al., 2016) with shared parameters between source and target RNNs. Associative memory is implemented with 8 redundant memory copies. For the Dual AM-GRU we define $r'_t = r_t$ (see §4), i.e., we use the same key for interacting with the premise and hypothesis associative memory array while processing the hypothesis. The rationale behind this is that we want to retrieve text passages from the premise that are similar to text passages of the target sequence.

All of our models consist of 2 layers with a GRU as top-layer which is intended to summarize outputs of the bottom layer. The bottom layer corresponds to our different architectures. We concatenate the final output of the premise and hypothesis together with their absolute difference to form the final representation that is used as input to a two-layer perceptron with rectifier-activations for classification.

5.2 Results

The results are presented in Table 1.

Model                                        H / |θ|−E     Accuracy
LSTM (Rocktäschel et al., 2016)              116 / 252k    80.9
LSTM shared (Rocktäschel et al., 2016)       159 / 252k    81.4
LSTM-Attention (Rocktäschel et al., 2016)    100 / 252k    83.5
GRU shared                                   126 / 321k    81.9
AM-GRU shared                                108 / 329k    82.9
Dual AM-GRU shared                           100 / 321k    84.4
Dual AM-GRU shared                           500 / 5.6m    85.4
LSTM Network (Cheng et al., 2016)            450 / 3.4m    86.3

Table 1: Accuracies of different RNN-based architectures on SNLI dataset. We also report the respective hidden dimension H and number of parameters |θ|−E for each architecture without taking word embeddings E into account.

They show that the H=100-dimensional Dual AM-GRU and conditional AM-GRU outperform our baseline GRU system significantly. Especially the Dual AM-GRU does very well on this task, achieving 84.4% accuracy, which shows that it is important to utilize the associative memory of the premise separately for reading only. Most notably, it achieves even better results than a comparable LSTM architecture with two-way attention between all premise and hypothesis words (LSTM-Attention). This indicates that our Dual AM-GRU architecture is able to perform at least similarly to, or even better than, an attention-based model in this setup.

We investigated this finding qualitatively from sampled examples by plotting heatmaps of cosine similarities between the content that has been written to memory at every time step in the premise and what has been retrieved from it while the Dual AM-GRU processes the hypothesis. Random examples are shown in Figure 2, where we can see that the Dual AM-GRU is indeed able to retrieve the content from the premise memory that is most related to the respective hypothesis words, which makes it possible to bridge important long-range dependencies for solving this task, similar to attention. We observe that content for related words and phrases is retrieved from the premise memory when processing the hypothesis, e.g., “play” and “video game” or “artist” and “sculptor”.

Increasing the size of the hidden dimension to 500 improves accuracy by another percentage point. The recently proposed LSTM Network achieves slightly better results. However, its number of operations scales with the square of the summed source and target sequence, which is even larger than traditional attention.

5.3 Sequence-to-Sequence Modeling

End-to-end differentiable sequence-to-sequence models consist of an encoder that encodes the source sequence and a decoder which produces the target sequence based on the encoded source. In a preliminary experiment we applied the Dual AM-GRU without shared parameters to the task of auto-encoding where source and target sequence are the same. Intuitively we would like the AM-GRU to write phrase-level information with different keys to the associative memory. However, we found that the encoder AM-GRU learned very quickly to write everything with the same key to memory, which makes it work very similar to a standard RNN based encoder-decoder architecture where the encoder state is simply used to initialize the decoder state.

This finding is illustrated in Figure 3. The presented heatmap shows similarities between content that has been retrieved while predicting the target sequence and what has been written by the encoder to memory. We observe that the similarities between retrieved content and written content are horizontally slightly increasing, i.e., towards the end of the encoded source sentence. This indicates that the encoder overwrites the associative memory while processing the source with the same key.

5.4 Discussion

Our experiments on entailment show that the idea of using associative memory to bridge long term dependencies for dual-sequence modeling can work very well.

Figure 2: Heatmaps of cosine similarity between content that has been written to the associative memory at each time step of the premise (x-axis) and what has been retrieved from it by the Dual AM-GRU while processing the hypothesis (y-axis).
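The similarity matrices behind heatmaps such as those in Figures 2 and 3 can be computed along the following lines. This is a minimal sketch in plain Python. It assumes that the written and retrieved contents are available as real-valued vectors (for instance the concatenated real and imaginary parts), and the function names are hypothetical. How exactly the vectors were extracted from the model is not specified in the text, so the inputs below are toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity of two real vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(written, retrieved):
    """Rows follow the retrieved vectors (y-axis), columns the written ones (x-axis)."""
    return [[cosine(r, w) for w in written] for r in retrieved]


if __name__ == "__main__":
    written = [[1.0, 0.0], [0.0, 1.0]]    # toy content written per source step
    retrieved = [[2.0, 0.0], [1.0, 1.0]]  # toy content retrieved per target step
    for row in similarity_matrix(written, retrieved):
        print(row)
```

Each row can then be rendered as one line of the heatmap. A row whose largest entry sits at a related source position corresponds to the alignment-like behaviour described for Figure 2, while rows that look alike across all target steps would match the behaviour reported for Figure 3.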

However, this architecture is not naively transferable to the task of sequence-to-sequence modeling. We believe that the main difficulty lies in the computation of an appropriate key at every time step in the target sequence to retrieve related content. Furthermore, the encoder should be forced not to always use the same key. For example, keys could be based on syntactic and semantic cues, which might ultimately result in capturing some form of Frame Semantics (Fillmore and Baker, 2001). This could facilitate decoding significantly. We believe that this might be achieved via regularization or by curriculum learning (Bengio et al., 2009).

Figure 3: Heatmap of cosine similarity between content that has been written to the associative memory at each time step by the encoder (x-axis) and what has been retrieved from it by the Dual AM-GRU while decoding (y-axis).

6 Conclusion

We introduced the Dual AM-RNN, a recurrent neural architecture that operates on associative memories. The AM-RNN augments traditional RNNs generically with associative memory. The Dual AM-RNN extends AM-RNNs with a second read-only memory. Its ability to capture long range dependencies enables effective learning of dual-sequence modeling tasks such as recognizing textual entailment. Our models achieve very competitive results and outperform a comparable attention-based model while preserving constant computational and memory resources.

Applying the Dual AM-RNN to a sequence-to-sequence modeling task revealed that the benefits of bridging long range dependencies cannot yet be achieved for this kind of problem. However, quantitative as well as qualitative results on textual entailment are very promising and therefore we believe that the Dual AM-RNN can be an important building block for NLP tasks involving two sequences.

Acknowledgments

We thank Sebastian Krause, Tim Rocktäschel and Leonhard Hennig for comments on an early draft of this work. This research was supported by the German Federal Ministry of Education and Research (BMBF) through the projects ALL SIDES (01IW14002), BBDC (01IS14013E), and Software Campus (01IS12050, sub-project GeNIE).

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In The International Conference on Learning Representations (ICLR).

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733.

Ivo Danihelka, Greg Wayne, Benigno Uria, Nal Kalchbrenner, and Alex Graves. 2016. Associative long short-term memory. arXiv preprint arXiv:1602.03032.

Charles J Fillmore and Collin F Baker. 2001. Frame semantics for text understanding. In Proceedings of WordNet and Other Lexical Resources Workshop, NAACL.

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural turing machines. arXiv preprint arXiv:1410.5401.

Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. 2015. Learning to transduce with unbounded memory. In Advances in Neural Information Processing Systems, pages 1819–1827.

Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems (NIPS).

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In The International Conference on Learning Representations (ICLR).

Jiwei Li and Eduard Hovy. 2015. The NLP engine: A universal turing machine for nlp. arXiv preprint arXiv:1503.00168.

Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. 2015. A hierarchical neural autoencoder for paragraphs and documents. In 53rd Annual Meeting of the Association for Computational Linguistics (ACL).

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543.

Tony A Plate. 1995. Holographic reduced representations. Neural networks, IEEE transactions on, 6(3):623–641.

Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. 2016. Reasoning about entailment with neural attention. The International Conference on Learning Representations (ICLR).

Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2431–2439.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.

Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In Advances in Neural Information Processing Systems, pages 2755–2763.

Shuohang Wang and Jing Jiang. 2016. Learning natural language inference with lstm. In Proceedings of the 2016 Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics.

Jason Weston, Sumit Chopra, and Antoine Bordes. 2015. Memory networks. In The International Conference on Learning Representations (ICLR).
