Long Short-Term Memory as a Dynamically Computed Element-wise Weighted Sum

Omer Levy∗  Kenton Lee∗  Nicholas FitzGerald  Luke Zettlemoyer
Paul G. Allen School, University of Washington, Seattle, WA
{omerlevy,kentonl,nfitz,lsz}@cs.washington.edu

∗ The first two authors contributed equally to this paper.

Abstract

LSTMs were introduced to combat vanishing gradients in simple RNNs by augmenting them with gated additive recurrent connections. We present an alternative view to explain the success of LSTMs: the gates themselves are versatile recurrent models that provide more representational power than previously appreciated. We do this by decoupling the LSTM's gates from the embedded simple RNN, producing a new class of RNNs where the recurrence computes an element-wise weighted sum of context-independent functions of the input. Ablations on a range of problems demonstrate that the gating mechanism alone performs as well as an LSTM in most settings, strongly suggesting that the gates are doing much more in practice than just alleviating vanishing gradients.

1 Introduction

Long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) has become the de-facto recurrent neural network (RNN) for learning representations of sequences in NLP. Like simple recurrent neural networks (S-RNNs) (Elman, 1990), LSTMs are able to learn non-linear functions of arbitrary-length input sequences. However, they also introduce an additional memory cell to mitigate the vanishing gradient problem (Hochreiter, 1991; Bengio et al., 1994). This memory is controlled by a mechanism of gates, whose additive connections allow long-distance dependencies to be learned more easily during backpropagation. While this view is mathematically accurate, in this paper we argue that it does not provide a complete picture of why LSTMs work in practice.

We present an alternative view to explain the success of LSTMs: the gates themselves are powerful recurrent models that provide more representational power than previously realized. To demonstrate this, we first show that LSTMs can be seen as a combination of two recurrent models: (1) an S-RNN, and (2) an element-wise weighted sum of the S-RNN's outputs over time, which is implicitly computed by the gates. We hypothesize that, for many practical NLP problems, the weighted sum serves as the main modeling component. The S-RNN, while theoretically expressive, is in practice only a minor contributor that clouds the mathematical clarity of the model. By replacing the S-RNN with a context-independent function of the input, we arrive at a much more restricted class of RNNs, where the main recurrence is via the element-wise weighted sums that the gates are computing.

We test our hypothesis on NLP problems, where LSTMs are wildly popular at least in part due to their ability to model crucial phenomena such as word order (Adi et al., 2017), syntactic structure (Linzen et al., 2016), and even long-range semantic dependencies (He et al., 2017). We consider four challenging tasks: language modeling, question answering, dependency parsing, and machine translation. Experiments show that while removing the gates from an LSTM can severely hurt performance, replacing the S-RNN with a simple linear transformation of the input results in minimal or no loss in model performance. We also show that, in many cases, LSTMs can be further simplified by removing the output gate, arriving at an even more transparent architecture, where the output is a context-independent function of the weighted sum. Together, these results suggest that the gates' ability to compute an element-wise weighted sum, rather than the non-linear transition dynamics of S-RNNs, is the driving force behind the LSTM's success.

2 What Do Memory Cells Compute?

LSTMs are typically motivated as an augmentation of simple RNNs (S-RNNs), defined as:

    h_t = \tanh(W_{hh} h_{t-1} + W_{hx} x_t + b_h)    (1)

S-RNNs suffer from the vanishing gradient problem (Hochreiter, 1991; Bengio et al., 1994) due to compounding multiplicative updates of the hidden state. By introducing a memory cell and an output layer controlled by gates, LSTMs enable shortcuts through which gradients can flow when learning with backpropagation. This mechanism enables learning of long-distance dependencies while preserving the expressive power of recurrent non-linear transformations provided by S-RNNs.

Rather than viewing the gates as simply an auxiliary mechanism to address a learning problem, we present an alternative view that emphasizes their modeling strengths. We argue that the LSTM should be interpreted as a hybrid of two distinct recurrent architectures: (1) the S-RNN, which provides multiplicative connections across timesteps, and (2) the memory cell, which provides additive connections across timesteps. On top of these recurrences, an output layer is included that simply squashes and filters the memory cell at each step.

Throughout this paper, let {x_1, ..., x_n} be the sequence of input vectors, {h_1, ..., h_n} be the sequence of output vectors, and {c_1, ..., c_n} be the memory cell's states. Then, given the basic LSTM definition below, we can formally identify three sub-components:

    \tilde{c}_t = \tanh(W_{ch} h_{t-1} + W_{cx} x_t + b_c)    (2)
    i_t = \sigma(W_{ih} h_{t-1} + W_{ix} x_t + b_i)    (3)
    f_t = \sigma(W_{fh} h_{t-1} + W_{fx} x_t + b_f)    (4)
    c_t = i_t \circ \tilde{c}_t + f_t \circ c_{t-1}    (5)
    o_t = \sigma(W_{oh} h_{t-1} + W_{ox} x_t + b_o)    (6)
    h_t = o_t \circ \tanh(c_t)    (7)

Content Layer (Equation 2)  We refer to c̃_t as the content layer, which is the output of an S-RNN. Evaluating the need for multiplicative recurrent connections in the content layer is the focus of this work. The content layer is passed to the memory cell, which decides which parts of it to store.

Memory Cell (Equations 3-5)  The memory cell c_t is controlled by two gates. The input gate i_t controls what part of the content (c̃_t) is written to the memory, while the forget gate f_t controls what part of the memory is deleted by filtering the previous state of the memory (c_{t-1}). Writing to the memory is done by adding the filtered content (i_t ∘ c̃_t) to the retained memory (f_t ∘ c_{t-1}).

Output Layer (Equations 6-7)  The output layer h_t passes the memory cell through a tanh activation function and uses an output gate o_t to read selectively from the squashed memory cell.

Our goal is to study how much each of these components contributes to the empirical performance of LSTMs. In particular, it is worth considering the memory cell in more detail to reveal why it could serve as a standalone, powerful model of long-distance context. It is possible to show that it implicitly computes an element-wise weighted sum of all the previous content layers by expanding the recurrence relation in Equation 5:

    c_t = i_t \circ \tilde{c}_t + f_t \circ c_{t-1}
        = \sum_{j=0}^{t} \Big( i_j \circ \prod_{k=j+1}^{t} f_k \Big) \circ \tilde{c}_j
        = \sum_{j=0}^{t} w_j^t \circ \tilde{c}_j    (8)

Each weight w_j^t is a product of the input gate i_j (when its respective input c̃_j was read) and every subsequent forget gate f_k. An interesting property of these weights is that, like the gates, they are also soft element-wise binary filters.
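To make the decomposition in Equation 8 concrete, the following NumPy sketch (a toy configuration with random parameters, for illustration only; not code from our experiments) runs a standard LSTM cell and checks numerically that the final memory state equals the element-wise weighted sum of the content layers, with weights w_j^t = i_j ∘ ∏_{k=j+1}^t f_k:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 6  # toy hidden size and sequence length (illustrative values)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Random LSTM parameters; each component sees [h_{t-1}; x_t], with input size d for brevity.
W = {name: rng.normal(scale=0.5, size=(d, 2 * d)) for name in "cifo"}
b = {name: rng.normal(scale=0.1, size=d) for name in "cifo"}

x = rng.normal(size=(T, d))
h, c = np.zeros(d), np.zeros(d)
contents, i_gates, f_gates, cells = [], [], [], []

for t in range(T):
    hx = np.concatenate([h, x[t]])
    c_tilde = np.tanh(W["c"] @ hx + b["c"])   # content layer (Eq. 2)
    i = sigmoid(W["i"] @ hx + b["i"])         # input gate    (Eq. 3)
    f = sigmoid(W["f"] @ hx + b["f"])         # forget gate   (Eq. 4)
    c = i * c_tilde + f * c                   # memory cell   (Eq. 5)
    o = sigmoid(W["o"] @ hx + b["o"])         # output gate   (Eq. 6)
    h = o * np.tanh(c)                        # output layer  (Eq. 7)
    contents.append(c_tilde)
    i_gates.append(i)
    f_gates.append(f)
    cells.append(c)

# Equation 8: c_t = sum_j (i_j o prod_{k=j+1}^t f_k) o c~_j, element-wise.
t = T - 1
weighted_sum = np.zeros(d)
for j in range(t + 1):
    later_forgets = f_gates[j + 1:t + 1]
    w_jt = i_gates[j] * (np.prod(later_forgets, axis=0) if later_forgets else np.ones(d))
    weighted_sum += w_jt * contents[j]

print(np.allclose(cells[t], weighted_sum))  # True: the memory cell is the weighted sum
```

Because the initial memory state is zero, the unrolled sum matches c_t exactly, up to floating-point error.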

3 Standalone Memory Cells are Powerful

The restricted space of element-wise weighted sums allows for easier mathematical analysis, visualization, and perhaps even learnability. However, constrained function spaces are also less expressive, and a natural question is whether these models will work well for NLP problems that involve understanding context. We hypothesize that the memory cell (which computes weighted sums) can function as a standalone contextualizer. To test this hypothesis, we present several simplifications of the LSTM's architecture (Section 3.1), and show on a variety of NLP benchmarks that there is a qualitative performance difference between models that contain a memory cell and those that do not (Section 3.2). We conclude that the content and output layers are relatively minor contributors, and that the space of element-wise weighted sums is sufficiently powerful to compete with fully parameterized LSTMs (Section 3.3).

3.1 Simplified Models

The modeling power of LSTMs is commonly assumed to derive from the S-RNN in the content layer, with the rest of the model acting as a learning aid to bypass the vanishing gradient problem. We first isolate the S-RNN by ablating the gates (denoted as LSTM – GATES for consistency).

To test whether the memory cell has enough modeling power of its own, we take an LSTM and replace the S-RNN in the content layer from Equation 2 with a simple linear transformation (c̃_t = W_{cx} x_t), creating the LSTM – S-RNN model. We further simplify the LSTM by removing the output gate from Equation 7 (h_t = \tanh(c_t)), leaving only the activation function in the output layer (LSTM – S-RNN – OUT). After removing the S-RNN and the output gate from the LSTM, the entire ablated model can be written in a modular, compact form:

    h_t = \mathrm{OUTPUT}\Big( \sum_{j=0}^{t} w_j^t \circ \mathrm{CONTENT}(x_j) \Big)    (9)

where the content layer CONTENT(·) and the output layer OUTPUT(·) are both context-independent functions, making the entire model highly constrained and mathematically simpler. The complexity of modeling contextual information is needed only for computing the weights w_j^t. As we will see in Section 3.2, both of these ablations perform on par with LSTMs on several tasks.

Finally, we ablate the hidden state from the gates as well, by computing each gate g_t via \sigma(W_{gx} x_t + b_g). In this model, the only recurrence is the additive connection in the memory cell; it has no multiplicative recurrent connections at all. It can be seen as a type of QRNN (Bradbury et al., 2016) or SRU (Lei et al., 2017b), but for consistency we label it LSTM – S-RNN – HIDDEN.
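These ablations are small enough to express directly in code. The following is a minimal NumPy sketch (class name, shapes, and initialization are illustrative choices, not the released implementations used in our experiments) of LSTM – S-RNN – HIDDEN, where the content layer and all gates are context-independent functions of x_t; conditioning the gates on h_{t-1} as well recovers LSTM – S-RNN, and dropping the output gate yields the – OUT variants.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class GatesOnlyRNN:
    """Sketch of the LSTM - S-RNN - HIDDEN ablation from Section 3.1.

    The content layer is a linear map of x_t (no S-RNN) and every gate is
    computed from x_t alone, so the only recurrence left is the additive
    connection in the memory cell. Shapes and initialization are illustrative.
    """

    def __init__(self, d_in, d_hid, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(d_in)
        self.W_cx = rng.normal(scale=scale, size=(d_hid, d_in))  # CONTENT(x_t) = W_cx x_t
        self.W_ix = rng.normal(scale=scale, size=(d_hid, d_in))
        self.W_fx = rng.normal(scale=scale, size=(d_hid, d_in))
        self.W_ox = rng.normal(scale=scale, size=(d_hid, d_in))
        self.b_i = np.zeros(d_hid)
        self.b_f = np.ones(d_hid)   # bias the forget gate toward remembering (our choice)
        self.b_o = np.zeros(d_hid)

    def forward(self, xs):
        c = np.zeros(self.W_cx.shape[0])
        hs = []
        for x in xs:
            content = self.W_cx @ x                 # context-independent content layer
            i = sigmoid(self.W_ix @ x + self.b_i)   # input gate, no h_{t-1}
            f = sigmoid(self.W_fx @ x + self.b_f)   # forget gate, no h_{t-1}
            o = sigmoid(self.W_ox @ x + self.b_o)   # output gate, no h_{t-1}
            c = i * content + f * c                 # element-wise weighted sum of contents
            hs.append(o * np.tanh(c))               # drop o for the "- OUT" variants
        return np.stack(hs)

# Usage: encode a toy sequence of 10 input vectors of dimension 8.
layer = GatesOnlyRNN(d_in=8, d_hid=16)
print(layer.forward(np.random.default_rng(1).normal(size=(10, 8))).shape)  # (10, 16)
```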

3.2 Experiments

We compare model performance on four NLP tasks, with an experimental setup that is lenient towards LSTMs and harsh towards their simplifications. In each case, we use existing implementations and previously reported hyperparameter settings. Since these settings were tuned for LSTMs, any simplification that performs as well as (or better than) LSTMs under these LSTM-friendly settings provides strong evidence that the ablated component is not a contributing factor. For each task we also report the mean and standard deviation of 5 runs of the LSTM settings to demonstrate the typical variance observed due to training with different random initializations.

Language Modeling  We evaluate the models on the Penn Treebank (PTB) (Marcus et al., 1993) language modeling benchmark. We use the implementation of Zaremba et al. (2014) from TensorFlow's tutorial, replacing any invocation of LSTMs with simpler models. We test two of their configurations: medium and large (Table 1).

Question Answering  For question answering, we use two different QA systems on the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016): the Bidirectional Attention Flow model (BiDAF) (Seo et al., 2016) and DrQA (Chen et al., 2017). BiDAF contains 3 LSTMs, which are referred to as the phrase layer, the modeling layer, and the span end encoder. Our experiments replace each of these LSTMs with their simplified counterparts. We directly use the implementation of BiDAF from AllenNLP (Gardner et al., 2017), and all experiments reuse the existing hyperparameters that were tuned for LSTMs. Likewise, we use an open-source implementation of DrQA¹ and replace only the LSTMs, while leaving everything else intact. Table 2 shows the results.

Dependency Parsing  For dependency parsing, we use the Deep Biaffine Dependency Parser (Dozat and Manning, 2016), which relies on stacked bidirectional LSTMs to learn context-sensitive word embeddings for determining arcs between a pair of words. We directly use their released implementation, which is evaluated on the Universal Dependencies English Web Treebank v1.3 (Silveira et al., 2014). In our experiments, we use the existing hyperparameters and only replace the LSTMs with the simplified architectures. Table 3 shows the results.

Machine Translation  For machine translation, we used OpenNMT (Klein et al., 2017) to train English-to-German translation models on the multimodal benchmarks from WMT 2016 (used in OpenNMT's readme file). We use OpenNMT's default model and hyperparameters, replacing the stacked bidirectional LSTM encoder with the simplified architectures.² Table 4 shows the results.

¹ https://github.com/hitvoice/DrQA
² For the S-RNN baseline (LSTM – GATES), we had to tune the learning rate to 0.1 because the default value (1.0) resulted in exploding gradients. This is the only case where hyperparameters were modified in all of our experiments.

Configuration   Model              Perplexity
PTB (Medium)    LSTM               83.9 ± 0.3
                – S-RNN            80.5
                – S-RNN – OUT      81.6
                – S-RNN – HIDDEN   83.3
                – GATES            140.9
PTB (Large)     LSTM               78.8 ± 0.2
                – S-RNN            76.0
                – S-RNN – OUT      78.5
                – S-RNN – HIDDEN   82.9
                – GATES            126.1

Table 1: Performance on language modeling benchmarks, measured by perplexity.

System   Model              EM           F1
BiDAF    LSTM               67.9 ± 0.3   77.5 ± 0.2
         – S-RNN            68.4         78.2
         – S-RNN – OUT      67.4         77.2
         – S-RNN – HIDDEN   66.5         76.6
         – GATES            62.9         73.3
DrQA     LSTM               68.8 ± 0.2   78.2 ± 0.2
         – S-RNN            68.0         77.2
         – S-RNN – OUT      68.7         77.9
         – S-RNN – HIDDEN   67.9         77.2
         – GATES            56.4         66.5

Table 2: Performance on SQuAD, measured by exact match (EM) and span overlap (F1).

Model              UAS            LAS
LSTM               90.60 ± 0.21   88.05 ± 0.33
– S-RNN            90.77          88.49
– S-RNN – OUT      90.70          88.31
– S-RNN – HIDDEN   90.53          87.96
– GATES            87.75          84.61

Table 3: Performance on the Universal Dependencies parsing benchmark, measured by unlabeled (UAS) and labeled (LAS) attachment score.

Model              BLEU
LSTM               38.19 ± 0.1
– S-RNN            37.84
– S-RNN – OUT      38.36
– S-RNN – HIDDEN   36.98
– GATES            26.52

Table 4: Performance on the WMT 2016 multimodal English to German benchmark.

3.3 Discussion

We showed four major ablations of the LSTM. In the S-RNN experiments (LSTM – GATES), we ablate the memory cell and the output layer. In the LSTM – S-RNN and LSTM – S-RNN – OUT experiments, we ablate the S-RNN. In LSTM – S-RNN – HIDDEN, we remove not only the S-RNN in the content layer, but also the S-RNNs in the gates, resulting in a model whose sole recurrence is the memory cell's additive connection.

Consistent with previous literature, removing the memory cell degrades performance drastically. In contrast, removing the S-RNN makes little to no difference in the final performance, suggesting that the memory cell alone is largely responsible for the success of LSTMs in NLP.

Even after removing every multiplicative recurrence from the memory cell itself, the model's performance remains well above the vanilla S-RNN's, and falls within the standard deviation of an LSTM's on some tasks (see Table 3). This latter result indicates that the additive recurrent connection in the memory cell, and not the multiplicative recurrent connections in the content layer or in the gates, is the most important computational element in an LSTM. As a corollary, this result also suggests that a weighted sum of context words, while mathematically simple, is a powerful model of contextual information.

4 LSTM as Self-Attention

Attention mechanisms are widely used in the NLP literature to aggregate over a sequence (Cho et al., 2014; Bahdanau et al., 2015) or contextualize tokens within a sequence (Cheng et al., 2016; Parikh et al., 2016) by explicitly computing weighted sums. In the previous sections, we demonstrated that LSTMs implicitly compute weighted sums as well, and that this computation is central to their success. How, then, are these two computations related, and in what ways do they differ?

After simplifying the content layer and removing the output gate (LSTM – S-RNN – OUT), the model's computation can be expressed as a weighted sum of context-independent functions of the inputs (Equation 9 in Section 3.1). This formula abstracts over both the simplified LSTM and the family of attention mechanisms, and through this lens, the memory cell's computation can be seen as a "cousin" of self-attention. In fact, we can also leverage this abstraction to visualize the simplified LSTM's weights as is commonly done with attention (see Appendix A for visualization).
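To make this shared form concrete, here is a minimal NumPy sketch (toy dimensions, an identity OUTPUT function, and standard scaled dot-product attention; an illustration of the abstraction, not code from the paper) of causal single-head self-attention written in the shape of Equation 9, with scalar softmax weights playing the role of the gates' vector-valued weights w_j^t:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8  # toy sequence length and model dimension (illustrative values)
x = rng.normal(size=(T, d))

# Context-independent projections; W_v plays the role of CONTENT in Equation 9.
W_q = rng.normal(scale=d ** -0.5, size=(d, d))
W_k = rng.normal(scale=d ** -0.5, size=(d, d))
W_v = rng.normal(scale=d ** -0.5, size=(d, d))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def causal_self_attention(x):
    """Dot-product attention in the OUTPUT(sum_j w_j^t * CONTENT(x_j)) form."""
    outputs = []
    for t in range(len(x)):
        q = W_q @ x[t]
        scores = np.array([q @ (W_k @ x[j]) for j in range(t + 1)]) / np.sqrt(d)
        w = softmax(scores)                           # scalar weights, globally normalized
        summed = sum(w[j] * (W_v @ x[j]) for j in range(t + 1))
        outputs.append(summed)                        # OUTPUT is the identity here
    return np.stack(outputs)

print(causal_self_attention(x).shape)  # (5, 8)
```

The simplified LSTM instantiates the same skeleton, except that the weights w_j^t are vectors produced incrementally by the gates rather than by a softmax over scores.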

However, there are three major differences in how the weights w_j^t are computed.

First, the LSTM's weights are vectors, while attention typically computes scalar weights; i.e., a separate weighted sum is computed for every dimension of the LSTM's memory cell. Multi-headed self-attention (Vaswani et al., 2017) can be seen as a middle ground between the two approaches, allocating a scalar weight to different subsets of the dimensions.

Second, the weighted sum is accumulated with a dynamic program. This enables a linear rather than quadratic complexity in comparison to self-attention, but reduces the amount of parallel computation. This accumulation also creates an inductive bias of attending to nearby words, since the weights can only decrease over time.

Finally, attention has a probabilistic interpretation due to the softmax normalization, while the sum of weights in LSTMs can grow up to the sequence length. In variants of the LSTM that tie the input and forget gates, such as coupled-gate LSTMs (Greff et al., 2016) and GRUs (Cho et al., 2014), the memory cell instead computes a weighted average with a probabilistic interpretation. These variants compute locally normalized distributions via a product of sigmoids, rather than globally normalized distributions via a single softmax.

5 Related Work

Many variants of LSTMs (Hochreiter and Schmidhuber, 1997) have been explored previously. These typically consist of a different parameterization of the gates, such as LSTMs with peephole connections (Gers and Schmidhuber, 2000), or a rewiring of the connections, such as GRUs (Cho et al., 2014). However, these modifications invariably maintain the recurrent content layer. Even more systematic explorations (Józefowicz et al., 2015; Greff et al., 2016; Zoph and Le, 2017) do not question the importance of the embedded S-RNN. This is the first study to provide apples-to-apples comparisons between LSTMs with and without the recurrent content layer.

Several other recent works have also reported promising results with recurrent models that are vastly simpler than LSTMs, such as quasi-recurrent neural networks (Bradbury et al., 2016), strongly-typed recurrent neural networks (Balduzzi and Ghifary, 2016), recurrent additive networks (Lee et al., 2017), kernel neural networks (Lei et al., 2017a), and simple recurrent units (Lei et al., 2017b), making it increasingly apparent that LSTMs are over-parameterized. While these works indicate an obvious trend, they do not focus on explaining what LSTMs are learning. In our carefully controlled ablation studies, we propose and evaluate the minimal changes required to test our hypothesis that LSTMs are powerful because they dynamically compute element-wise weighted sums of content layers.

6 Conclusion

We presented an alternative view of LSTMs: they are a hybrid of S-RNNs and a gated model that dynamically computes weighted sums of the S-RNN outputs. Our experiments investigated whether the S-RNN is a necessary component of LSTMs. In other words, are the gates alone as powerful a model as an LSTM? Results across four major NLP tasks (language modeling, question answering, dependency parsing, and machine translation) indicate that LSTMs suffer little to no performance loss when removing the S-RNN. This provides evidence that the gating mechanism is doing the heavy lifting in modeling context. We further ablate the recurrence in each gate and find that this incurs only a modest drop in performance, indicating that the real modeling power of LSTMs stems from their ability to compute element-wise weighted sums of context-independent functions of their inputs.

This realization allows us to mathematically relate LSTMs and other gated RNNs to attention-based models. Casting an LSTM as a dynamically computed attention mechanism enables the visualization of how context is used at every timestep, shedding light on the inner workings of the relatively opaque LSTM.

Acknowledgements

The research was supported in part by DARPA under the DEFT program (FA8750-13-2-0019), the ARO (W911NF-16-1-0121), the NSF (IIS-1252835, IIS-1562364), gifts from Google, Tencent, and Nvidia, and an Allen Distinguished Investigator Award. We also thank Yoav Goldberg, Benjamin Heinzerling, Tao Lei, and the UW NLP group for helpful conversations and comments on the work.

References

Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In ICLR.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

David Balduzzi and Muhammad Ghifary. 2016. Strongly-typed recurrent neural networks. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), pages 1292–1300. http://jmlr.org/proceedings/papers/v48/balduzzi16.html.

Yoshua Bengio, Patrice Y. Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5(2):157–166.

James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. 2016. Quasi-recurrent neural networks. CoRR abs/1611.01576.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879. http://aclweb.org/anthology/P17-1171.

Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long short-term memory-networks for machine reading. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 551–561. https://aclweb.org/anthology/D16-1053.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734. http://www.aclweb.org/anthology/D14-1179.

Timothy Dozat and Christopher D. Manning. 2016. Deep biaffine attention for neural dependency parsing. CoRR abs/1611.01734.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science 14:179–211.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2017. AllenNLP: A deep semantic natural language processing platform. http://allennlp.org/papers/AllenNLP_white_paper.pdf.

Felix A. Gers and Jürgen Schmidhuber. 2000. Recurrent nets that time and count. In IJCNN.

Klaus Greff, Rupesh K. Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. 2016. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems.

Luheng He, Kenton Lee, Mike Lewis, and Luke Zettlemoyer. 2017. Deep semantic role labeling: What works and what's next. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Sepp Hochreiter. 1991. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Technische Universität München.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

Rafal Józefowicz, Wojciech Zaremba, and Ilya Sutskever. 2015. An empirical exploration of recurrent network architectures. In ICML.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL. https://doi.org/10.18653/v1/P17-4012.

Kenton Lee, Omer Levy, and Luke Zettlemoyer. 2017. Recurrent additive networks. arXiv preprint arXiv:1705.07393.

Tao Lei, Wengong Jin, Regina Barzilay, and Tommi Jaakkola. 2017a. Deriving neural architectures from sequence and graph kernels. In ICML.

Tao Lei, Yu Zhang, and Yoav Artzi. 2017b. Training RNNs as fast as CNNs. arXiv preprint arXiv:1709.02755.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. TACL 4:521–535.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19:313–330.

Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2249–2255. https://aclweb.org/anthology/D16-1244.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.

Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. CoRR abs/1611.01603.

Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, John Bauer, and Christopher D. Manning. 2014. A gold standard dependency corpus for English. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014).

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.

Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.

Barret Zoph and Quoc V. Le. 2017. Neural architecture search with reinforcement learning. In ICLR.

A Weight Visualization

Given the empirical evidence that LSTMs are effectively learning weighted sums of the content layers, it is natural to investigate what weights the model learns in practice. Using the more mathematically transparent simplification of LSTMs, we can visualize the weights w_j^t that are placed on every input j at every timestep t (see Equation 9).

Unlike attention mechanisms, these weights are vectors rather than scalar values. Therefore, we can only provide a coarse-grained visualization of the weights by rendering their L2-norm, as shown in Table 5. In the visualization, each column indicates the word represented by the weighted sum, and each row indicates the word over which the weighted sum is computed. Dark horizontal streaks indicate the duration for which a word was remembered. Unsurprisingly, the weights on the diagonal are always the largest, since they indicate the weight of the current word. More interesting task-specific patterns emerge when inspecting the off-diagonals that represent the weights on the context words.

The first visualization uses the language model. Due to the language modeling setup, there are only non-zero weights on the current or previous words. We find that the common function words are quickly forgotten, while infrequent words that signal the topic are remembered over very long distances.

The second visualization uses the dependency parser. In this setting, since the recurrent architectures are bidirectional, there are non-zero weights on all words in the sentence. The top-right triangle indicates weights from the forward direction, and the bottom-left triangle indicates weights from the backward direction. For syntax, we see a significantly different pattern. Function words that are useful for determining syntax are more likely to be remembered. Weights on head words are also likely to persist until the end of a constituent.

This illustration provides only a glimpse into what the model is capturing, and perhaps future, more detailed visualizations that take the individual dimensions into account can provide further insight into what LSTMs are learning in practice.
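Concretely, given the input and forget gate activations recorded while running the simplified model over a sentence, the weight matrix rendered in Table 5 can be computed as in the following sketch (a hypothetical helper with random gate activations standing in for a trained model's; the matplotlib rendering choices are illustrative, not the ones used for the figures):

```python
import numpy as np
import matplotlib.pyplot as plt

def weight_norm_matrix(i_gates, f_gates):
    """Build the T x T matrix of L2 norms of w_j^t = i_j * prod_{k=j+1}^{t} f_k.

    i_gates, f_gates: arrays of shape (T, d) holding the input/forget gate
    activations recorded over a sentence. Entry [t, j] is ||w_j^t||_2 for j <= t,
    and 0 otherwise.
    """
    T = len(i_gates)
    norms = np.zeros((T, T))
    for t in range(T):
        running = np.ones_like(i_gates[0])
        for j in range(t, -1, -1):            # walk backwards, accumulating forget gates
            norms[t, j] = np.linalg.norm(i_gates[j] * running)
            running = running * f_gates[j]    # now holds prod over k = j .. t
    return norms

# Toy usage with random gate activations standing in for a trained model's.
rng = np.random.default_rng(0)
words = "The hymn was sung at my first inaugural church service as governor".split()
T, d = len(words), 32
gates_i = rng.uniform(size=(T, d))
gates_f = rng.uniform(size=(T, d))

norms = weight_norm_matrix(gates_i, gates_f)
plt.imshow(norms.T, cmap="Greys")             # rows: context word j, columns: current word t
plt.xticks(range(T), words, rotation=90)
plt.yticks(range(T), words)
plt.tight_layout()
plt.savefig("weights.png")
```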

[Table 5: heatmap figures not reproduced. Left panels show language model weights; right panels show dependency parser weights, for the sentences: "The hymn was sung at my first inaugural church service as governor"; "US troops there clashed with guerrillas in a fight that left one Iraqi dead"; "He did comment on what he meant by the phrase ."; "I spoke to Bruce Garcey at NiMo regarding their RFP".]

Table 5: Visualization of the weights on context words learned by the memory cell. Each column represents the current word t, and each row represents a context word j. The gating mechanism implicitly computes element-wise weighted sums over each column. The darkness of each square indicates the L2-norm of the vector weights w_j^t from Equation 9. Figures on the left show weights learned by a language model. Figures on the right show weights learned by a dependency parser.