Long Short-Term Memory as a Dynamically Computed Element-wise Weighted Sum

Omer Levy∗  Kenton Lee∗  Nicholas FitzGerald  Luke Zettlemoyer
Paul G. Allen School, University of Washington, Seattle, WA
{omerlevy,kentonl,nfitz,lsz}@cs.washington.edu

∗ The first two authors contributed equally to this paper.

Abstract

LSTMs were introduced to combat vanishing gradients in simple RNNs by augmenting them with gated additive recurrent connections. We present an alternative view to explain the success of LSTMs: the gates themselves are versatile recurrent models that provide more representational power than previously appreciated. We do this by decoupling the LSTM's gates from the embedded simple RNN, producing a new class of RNNs where the recurrence computes an element-wise weighted sum of context-independent functions of the input. Ablations on a range of problems demonstrate that the gating mechanism alone performs as well as an LSTM in most settings, strongly suggesting that the gates are doing much more in practice than just alleviating vanishing gradients.

1 Introduction

Long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) has become the de-facto recurrent neural network (RNN) for learning representations of sequences in NLP. Like simple recurrent neural networks (S-RNNs) (Elman, 1990), LSTMs are able to learn non-linear functions of arbitrary-length input sequences. However, they also introduce an additional memory cell to mitigate the vanishing gradient problem (Hochreiter, 1991; Bengio et al., 1994). This memory is controlled by a mechanism of gates, whose additive connections allow long-distance dependencies to be learned more easily during backpropagation. While this view is mathematically accurate, in this paper we argue that it does not provide a complete picture of why LSTMs work in practice.

We present an alternative view to explain the success of LSTMs: the gates themselves are powerful recurrent models that provide more representational power than previously realized. To demonstrate this, we first show that LSTMs can be seen as a combination of two recurrent models: (1) an S-RNN, and (2) an element-wise weighted sum of the S-RNN's outputs over time, which is implicitly computed by the gates. We hypothesize that, for many practical NLP problems, the weighted sum serves as the main modeling component. The S-RNN, while theoretically expressive, is in practice only a minor contributor that clouds the mathematical clarity of the model. By replacing the S-RNN with a context-independent function of the input, we arrive at a much more restricted class of RNNs, where the main recurrence is via the element-wise weighted sums that the gates are computing.

We test our hypothesis on NLP problems, where LSTMs are wildly popular at least in part due to their ability to model crucial phenomena such as word order (Adi et al., 2017), syntactic structure (Linzen et al., 2016), and even long-range semantic dependencies (He et al., 2017). We consider four challenging tasks: language modeling, question answering, dependency parsing, and machine translation. Experiments show that while removing the gates from an LSTM can severely hurt performance, replacing the S-RNN with a simple linear transformation of the input results in minimal or no loss in model performance. We also show that, in many cases, LSTMs can be further simplified by removing the output gate, arriving at an even more transparent architecture, where the output is a context-independent function of the weighted sum. Together, these results suggest that the gates' ability to compute an element-wise weighted sum, rather than the non-linear transition dynamics of S-RNNs, is the driving force behind the LSTM's success.

2 What Do Memory Cells Compute?

LSTMs are typically motivated as an augmentation of simple RNNs (S-RNNs), defined as:

    h_t = \tanh(W_{hh} h_{t-1} + W_{hx} x_t + b_h)    (1)

S-RNNs suffer from the vanishing gradient problem (Hochreiter, 1991; Bengio et al., 1994) due to compounding multiplicative updates of the hidden state. By introducing a memory cell and an output layer controlled by gates, LSTMs enable shortcuts through which gradients can flow when learning with backpropagation. This mechanism enables learning of long-distance dependencies while preserving the expressive power of recurrent non-linear transformations provided by S-RNNs.

Rather than viewing the gates as simply an auxiliary mechanism to address a learning problem, we present an alternative view that emphasizes their modeling strengths. We argue that the LSTM should be interpreted as a hybrid of two distinct recurrent architectures: (1) the S-RNN, which provides multiplicative connections across timesteps, and (2) the memory cell, which provides additive connections across timesteps. On top of these recurrences, an output layer is included that simply squashes and filters the memory cell at each step.

Throughout this paper, let {x_1, ..., x_n} be the sequence of input vectors, {h_1, ..., h_n} be the sequence of output vectors, and {c_1, ..., c_n} be the memory cell's states. Then, given the basic LSTM definition below, we can formally identify three sub-components:

    \tilde{c}_t = \tanh(W_{ch} h_{t-1} + W_{cx} x_t + b_c)    (2)
    i_t = \sigma(W_{ih} h_{t-1} + W_{ix} x_t + b_i)    (3)
    f_t = \sigma(W_{fh} h_{t-1} + W_{fx} x_t + b_f)    (4)
    c_t = i_t \circ \tilde{c}_t + f_t \circ c_{t-1}    (5)
    o_t = \sigma(W_{oh} h_{t-1} + W_{ox} x_t + b_o)    (6)
    h_t = o_t \circ \tanh(c_t)    (7)

Content Layer (Equation 2)  We refer to c̃_t as the content layer, which is the output of an S-RNN. Evaluating the need for multiplicative recurrent connections in the content layer is the focus of this work. The content layer is passed to the memory cell, which decides which parts of it to store.

Memory Cell (Equations 3-5)  The memory cell c_t is controlled by two gates. The input gate i_t controls what part of the content (c̃_t) is written to the memory, while the forget gate f_t controls what part of the memory is deleted by filtering the previous state of the memory (c_{t-1}). Writing to the memory is done by adding the filtered content (i_t ∘ c̃_t) to the retained memory (f_t ∘ c_{t-1}).

Output Layer (Equations 6-7)  The output layer h_t passes the memory cell through a tanh activation function and uses an output gate o_t to read selectively from the squashed memory cell.

Our goal is to study how much each of these components contributes to the empirical performance of LSTMs. In particular, it is worth considering the memory cell in more detail to reveal why it could serve as a standalone, powerful model of long-distance context. It is possible to show that it implicitly computes an element-wise weighted sum of all the previous content layers by expanding the recurrence relation in Equation 5:

    c_t = i_t \circ \tilde{c}_t + f_t \circ c_{t-1}
        = \sum_{j=0}^{t} \Big( i_j \circ \prod_{k=j+1}^{t} f_k \Big) \circ \tilde{c}_j
        = \sum_{j=0}^{t} w_j^t \circ \tilde{c}_j    (8)

Each weight w_j^t is a product of the input gate i_j (when its respective input c̃_j was read) and every subsequent forget gate f_k. An interesting property of these weights is that, like the gates, they are also soft element-wise binary filters.
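To make the decomposition in Equation 8 concrete, the following NumPy sketch (a toy configuration with random parameters, for illustration only; not code from our experiments) runs a standard LSTM cell and checks numerically that the final memory state equals the element-wise weighted sum of the content layers, with weights w_j^t = i_j ∘ ∏_{k=j+1}^t f_k:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 6  # toy hidden size and sequence length (illustrative values)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Random LSTM parameters; each component sees [h_{t-1}; x_t], with input size d for brevity.
W = {name: rng.normal(scale=0.5, size=(d, 2 * d)) for name in "cifo"}
b = {name: rng.normal(scale=0.1, size=d) for name in "cifo"}

x = rng.normal(size=(T, d))
h, c = np.zeros(d), np.zeros(d)
contents, i_gates, f_gates, cells = [], [], [], []

for t in range(T):
    hx = np.concatenate([h, x[t]])
    c_tilde = np.tanh(W["c"] @ hx + b["c"])   # content layer (Eq. 2)
    i = sigmoid(W["i"] @ hx + b["i"])         # input gate    (Eq. 3)
    f = sigmoid(W["f"] @ hx + b["f"])         # forget gate   (Eq. 4)
    c = i * c_tilde + f * c                   # memory cell   (Eq. 5)
    o = sigmoid(W["o"] @ hx + b["o"])         # output gate   (Eq. 6)
    h = o * np.tanh(c)                        # output layer  (Eq. 7)
    contents.append(c_tilde)
    i_gates.append(i)
    f_gates.append(f)
    cells.append(c)

# Equation 8: c_t = sum_j (i_j o prod_{k=j+1}^t f_k) o c~_j, element-wise.
t = T - 1
weighted_sum = np.zeros(d)
for j in range(t + 1):
    later_forgets = f_gates[j + 1:t + 1]
    w_jt = i_gates[j] * (np.prod(later_forgets, axis=0) if later_forgets else np.ones(d))
    weighted_sum += w_jt * contents[j]

print(np.allclose(cells[t], weighted_sum))  # True: the memory cell is the weighted sum
```

Because the initial memory state is zero, the unrolled sum matches c_t exactly, up to floating-point error.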

3 Standalone Memory Cells are Powerful

The restricted space of element-wise weighted sums allows for easier mathematical analysis, visualization, and perhaps even learnability. However, constrained function spaces are also less expressive, and a natural question is whether these models will work well for NLP problems that involve understanding context. We hypothesize that the memory cell (which computes weighted sums) can function as a standalone contextualizer. To test this hypothesis, we present several simplifications of the LSTM's architecture (Section 3.1), and show on a variety of NLP benchmarks that there is a qualitative performance difference between models that contain a memory cell and those that do not (Section 3.2). We conclude that the content and output layers are relatively minor contributors, and that the space of element-wise weighted sums is sufficiently powerful to compete with fully parameterized LSTMs (Section 3.3).

3.1 Simplified Models

The modeling power of LSTMs is commonly assumed to derive from the S-RNN in the content layer, with the rest of the model acting as a learning aid to bypass the vanishing gradient problem. We first isolate the S-RNN by ablating the gates (denoted as LSTM – GATES for consistency).

To test whether the memory cell has enough modeling power of its own, we take an LSTM and replace the S-RNN in the content layer from Equation 2 with a simple linear transformation (c̃_t = W_{cx} x_t), creating the LSTM – S-RNN model. We further simplify the LSTM by removing the output gate from Equation 7 (h_t = \tanh(c_t)), leaving only the activation function in the output layer (LSTM – S-RNN – OUT). After removing the S-RNN and the output gate from the LSTM, the entire ablated model can be written in a modular, compact form:

    h_t = \mathrm{OUTPUT}\Big( \sum_{j=0}^{t} w_j^t \circ \mathrm{CONTENT}(x_j) \Big)    (9)

where the content layer CONTENT(·) and the output layer OUTPUT(·) are both context-independent functions, making the entire model highly constrained and mathematically simpler. The complexity of modeling contextual information is needed only for computing the weights w_j^t. As we will see in Section 3.2, both of these ablations perform on par with LSTMs on several tasks.

Finally, we ablate the hidden state from the gates as well, by computing each gate g_t via \sigma(W_{gx} x_t + b_g). In this model, the only recurrence is the additive connection in the memory cell; it has no multiplicative recurrent connections at all. It can be seen as a type of QRNN (Bradbury et al., 2016) or SRU (Lei et al., 2017b), but for consistency we label it LSTM – S-RNN – HIDDEN.
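These ablations are small enough to express directly in code. The following is a minimal NumPy sketch (class name, shapes, and initialization are illustrative choices, not the released implementations used in our experiments) of LSTM – S-RNN – HIDDEN, where the content layer and all gates are context-independent functions of x_t; conditioning the gates on h_{t-1} as well recovers LSTM – S-RNN, and dropping the output gate yields the – OUT variants.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class GatesOnlyRNN:
    """Sketch of the LSTM - S-RNN - HIDDEN ablation from Section 3.1.

    The content layer is a linear map of x_t (no S-RNN) and every gate is
    computed from x_t alone, so the only recurrence left is the additive
    connection in the memory cell. Shapes and initialization are illustrative.
    """

    def __init__(self, d_in, d_hid, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(d_in)
        self.W_cx = rng.normal(scale=scale, size=(d_hid, d_in))  # CONTENT(x_t) = W_cx x_t
        self.W_ix = rng.normal(scale=scale, size=(d_hid, d_in))
        self.W_fx = rng.normal(scale=scale, size=(d_hid, d_in))
        self.W_ox = rng.normal(scale=scale, size=(d_hid, d_in))
        self.b_i = np.zeros(d_hid)
        self.b_f = np.ones(d_hid)   # bias the forget gate toward remembering (our choice)
        self.b_o = np.zeros(d_hid)

    def forward(self, xs):
        c = np.zeros(self.W_cx.shape[0])
        hs = []
        for x in xs:
            content = self.W_cx @ x                 # context-independent content layer
            i = sigmoid(self.W_ix @ x + self.b_i)   # input gate, no h_{t-1}
            f = sigmoid(self.W_fx @ x + self.b_f)   # forget gate, no h_{t-1}
            o = sigmoid(self.W_ox @ x + self.b_o)   # output gate, no h_{t-1}
            c = i * content + f * c                 # element-wise weighted sum of contents
            hs.append(o * np.tanh(c))               # drop o for the "- OUT" variants
        return np.stack(hs)

# Usage: encode a toy sequence of 10 input vectors of dimension 8.
layer = GatesOnlyRNN(d_in=8, d_hid=16)
print(layer.forward(np.random.default_rng(1).normal(size=(10, 8))).shape)  # (10, 16)
```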

3.2 Experiments

We compare model performance on four NLP tasks, with an experimental setup that is lenient towards LSTMs and harsh towards their simplifications. In each case, we use existing implementations and previously reported hyperparameter settings. Since these settings were tuned for LSTMs, any simplification that performs as well as (or better than) LSTMs under these LSTM-friendly settings provides strong evidence that the ablated component is not a contributing factor. For each task we also report the mean and standard deviation of 5 runs of the LSTM settings to demonstrate the typical variance observed due to training with different random initializations.

Language Modeling  We evaluate the models on the Penn Treebank (PTB) (Marcus et al., 1993) language modeling benchmark. We use the implementation of Zaremba et al. (2014) from TensorFlow's tutorial, replacing any invocation of LSTMs with simpler models. We test two of their configurations: medium and large (Table 1).

Question Answering  For question answering, we use two different QA systems on the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016): the Bidirectional Attention Flow model (BiDAF) (Seo et al., 2016) and DrQA (Chen et al., 2017). BiDAF contains 3 LSTMs, which are referred to as the phrase layer, the modeling layer, and the span end encoder. Our experiments replace each of these LSTMs with their simplified counterparts. We directly use the implementation of BiDAF from AllenNLP (Gardner et al., 2017), and all experiments reuse the existing hyperparameters that were tuned for LSTMs. Likewise, we use an open-source implementation of DrQA¹ and replace only the LSTMs, while leaving everything else intact. Table 2 shows the results.

Dependency Parsing  For dependency parsing, we use the Deep Biaffine Dependency Parser (Dozat and Manning, 2016), which relies on stacked bidirectional LSTMs to learn context-sensitive word embeddings for determining arcs between a pair of words. We directly use their released implementation, which is evaluated on the Universal Dependencies English Web Treebank v1.3 (Silveira et al., 2014). In our experiments, we use the existing hyperparameters and only replace the LSTMs with the simplified architectures. Table 3 shows the results.

Machine Translation  For machine translation, we used OpenNMT (Klein et al., 2017) to train English-to-German translation models on the multimodal benchmarks from WMT 2016 (used in OpenNMT's readme file). We use OpenNMT's default model and hyperparameters, replacing the stacked bidirectional LSTM encoder with the simplified architectures.² Table 4 shows the results.

¹ https://github.com/hitvoice/DrQA
² For the S-RNN baseline (LSTM – GATES), we had to tune the learning rate to 0.1 because the default value (1.0) resulted in exploding gradients. This is the only case where hyperparameters were modified in all of our experiments.

Configuration   Model              Perplexity
PTB (Medium)    LSTM               83.9 ± 0.3
                – S-RNN            80.5
                – S-RNN – OUT      81.6
                – S-RNN – HIDDEN   83.3
                – GATES            140.9
PTB (Large)     LSTM               78.8 ± 0.2
                – S-RNN            76.0
                – S-RNN – OUT      78.5
                – S-RNN – HIDDEN   82.9
                – GATES            126.1

Table 1: Performance on language modeling benchmarks, measured by perplexity.

System   Model              EM           F1
BiDAF    LSTM               67.9 ± 0.3   77.5 ± 0.2
         – S-RNN            68.4         78.2
         – S-RNN – OUT      67.4         77.2
         – S-RNN – HIDDEN   66.5         76.6
         – GATES            62.9         73.3
DrQA     LSTM               68.8 ± 0.2   78.2 ± 0.2
         – S-RNN            68.0         77.2
         – S-RNN – OUT      68.7         77.9
         – S-RNN – HIDDEN   67.9         77.2
         – GATES            56.4         66.5

Table 2: Performance on SQuAD, measured by exact match (EM) and span overlap (F1).

Model              UAS            LAS
LSTM               90.60 ± 0.21   88.05 ± 0.33
– S-RNN            90.77          88.49
– S-RNN – OUT      90.70          88.31
– S-RNN – HIDDEN   90.53          87.96
– GATES            87.75          84.61

Table 3: Performance on the Universal Dependencies parsing benchmark, measured by unlabeled (UAS) and labeled (LAS) attachment score.

Model              BLEU
LSTM               38.19 ± 0.1
– S-RNN            37.84
– S-RNN – OUT      38.36
– S-RNN – HIDDEN   36.98
– GATES            26.52

Table 4: Performance on the WMT 2016 multimodal English to German benchmark.

3.3 Discussion

We showed four major ablations of the LSTM. In the S-RNN experiments (LSTM – GATES), we ablate the memory cell and the output layer. In the LSTM – S-RNN and LSTM – S-RNN – OUT experiments, we ablate the S-RNN. In LSTM – S-RNN – HIDDEN, we remove not only the S-RNN in the content layer, but also the S-RNNs in the gates, resulting in a model whose sole recurrence is the memory cell's additive connection.

Consistent with previous literature, removing the memory cell degrades performance drastically. In contrast, removing the S-RNN makes little to no difference in the final performance, suggesting that the memory cell alone is largely responsible for the success of LSTMs in NLP.

Even after removing every multiplicative recurrence from the memory cell itself, the model's performance remains well above the vanilla S-RNN's, and falls within the standard deviation of an LSTM's on some tasks (see Table 3). This latter result indicates that the additive recurrent connection in the memory cell, and not the multiplicative recurrent connections in the content layer or in the gates, is the most important computational element in an LSTM. As a corollary, this result also suggests that a weighted sum of context words, while mathematically simple, is a powerful model of contextual information.

4 LSTM as Self-Attention

Attention mechanisms are widely used in the NLP literature to aggregate over a sequence (Cho et al., 2014; Bahdanau et al., 2015) or contextualize tokens within a sequence (Cheng et al., 2016; Parikh et al., 2016) by explicitly computing weighted sums. In the previous sections, we demonstrated that LSTMs implicitly compute weighted sums as well, and that this computation is central to their success. How, then, are these two computations related, and in what ways do they differ?

After simplifying the content layer and removing the output gate (LSTM – S-RNN – OUT), the model's computation can be expressed as a weighted sum of context-independent functions of the inputs (Equation 9 in Section 3.1). This formula abstracts over both the simplified LSTM and the family of attention mechanisms, and through this lens, the memory cell's computation can be seen as a "cousin" of self-attention. In fact, we can also leverage this abstraction to visualize the simplified LSTM's weights as is commonly done with attention (see Appendix A for visualization).
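To make this shared form concrete, here is a minimal NumPy sketch (toy dimensions, an identity OUTPUT function, and standard scaled dot-product attention; an illustration of the abstraction, not code from the paper) of causal single-head self-attention written in the shape of Equation 9, with scalar softmax weights playing the role of the gates' vector-valued weights w_j^t:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8  # toy sequence length and model dimension (illustrative values)
x = rng.normal(size=(T, d))

# Context-independent projections; W_v plays the role of CONTENT in Equation 9.
W_q = rng.normal(scale=d ** -0.5, size=(d, d))
W_k = rng.normal(scale=d ** -0.5, size=(d, d))
W_v = rng.normal(scale=d ** -0.5, size=(d, d))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def causal_self_attention(x):
    """Dot-product attention in the OUTPUT(sum_j w_j^t * CONTENT(x_j)) form."""
    outputs = []
    for t in range(len(x)):
        q = W_q @ x[t]
        scores = np.array([q @ (W_k @ x[j]) for j in range(t + 1)]) / np.sqrt(d)
        w = softmax(scores)                           # scalar weights, globally normalized
        summed = sum(w[j] * (W_v @ x[j]) for j in range(t + 1))
        outputs.append(summed)                        # OUTPUT is the identity here
    return np.stack(outputs)

print(causal_self_attention(x).shape)  # (5, 8)
```

The simplified LSTM instantiates the same skeleton, except that the weights w_j^t are vectors produced incrementally by the gates rather than by a softmax over scores.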

However, there are three major differences in how the weights w_j^t are computed.

First, the LSTM's weights are vectors, while attention typically computes scalar weights; i.e., a separate weighted sum is computed for every dimension of the LSTM's memory cell. Multi-headed self-attention (Vaswani et al., 2017) can be seen as a middle ground between the two approaches, allocating a scalar weight to different subsets of the dimensions.

Second, the weighted sum is accumulated with a dynamic program. This enables a linear rather than quadratic complexity in comparison to self-attention, but reduces the amount of parallel computation. This accumulation also creates an inductive bias of attending to nearby words, since the weights can only decrease over time.

Finally, attention has a probabilistic interpretation due to the softmax normalization, while the sum of weights in LSTMs can grow up to the sequence length. In variants of the LSTM that tie the input and forget gates, such as coupled-gate LSTMs (Greff et al., 2016) and GRUs (Cho et al., 2014), the memory cell instead computes a weighted average with a probabilistic interpretation. These variants compute locally normalized distributions via a product of sigmoids, rather than globally normalized distributions via a single softmax.

5 Related Work

Many variants of LSTMs (Hochreiter and Schmidhuber, 1997) have been explored previously. These typically consist of a different parameterization of the gates, such as LSTMs with peephole connections (Gers and Schmidhuber, 2000), or a rewiring of the connections, such as GRUs (Cho et al., 2014). However, these modifications invariably maintain the recurrent content layer. Even more systematic explorations (Józefowicz et al., 2015; Greff et al., 2016; Zoph and Le, 2017) do not question the importance of the embedded S-RNN. This is the first study to provide apples-to-apples comparisons between LSTMs with and without the recurrent content layer.

Several other recent works have also reported promising results with recurrent models that are vastly simpler than LSTMs, such as quasi-recurrent neural networks (Bradbury et al., 2016), strongly-typed recurrent neural networks (Balduzzi and Ghifary, 2016), recurrent additive networks (Lee et al., 2017), kernel neural networks (Lei et al., 2017a), and simple recurrent units (Lei et al., 2017b), making it increasingly apparent that LSTMs are over-parameterized. While these works indicate an obvious trend, they do not focus on explaining what LSTMs are learning. In our carefully controlled ablation studies, we propose and evaluate the minimal changes required to test our hypothesis that LSTMs are powerful because they dynamically compute element-wise weighted sums of content layers.

6 Conclusion

We presented an alternative view of LSTMs: they are a hybrid of S-RNNs and a gated model that dynamically computes weighted sums of the S-RNN outputs. Our experiments investigated whether the S-RNN is a necessary component of LSTMs. In other words, are the gates alone as powerful a model as an LSTM? Results across four major NLP tasks (language modeling, question answering, dependency parsing, and machine translation) indicate that LSTMs suffer little to no performance loss when removing the S-RNN. This provides evidence that the gating mechanism is doing the heavy lifting in modeling context. We further ablate the recurrence in each gate and find that this incurs only a modest drop in performance, indicating that the real modeling power of LSTMs stems from their ability to compute element-wise weighted sums of context-independent functions of their inputs.

This realization allows us to mathematically relate LSTMs and other gated RNNs to attention-based models. Casting an LSTM as a dynamically computed attention mechanism enables the visualization of how context is used at every timestep, shedding light on the inner workings of the relatively opaque LSTM.

Acknowledgements

The research was supported in part by DARPA under the DEFT program (FA8750-13-2-0019), the ARO (W911NF-16-1-0121), the NSF (IIS-1252835, IIS-1562364), gifts from Google, Tencent, and Nvidia, and an Allen Distinguished Investigator Award. We also thank Yoav Goldberg, Benjamin Heinzerling, Tao Lei, and the UW NLP group for helpful conversations and comments on the work.

References

Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In ICLR.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

David Balduzzi and Muhammad Ghifary. 2016. Strongly-typed recurrent neural networks. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), pages 1292–1300. http://jmlr.org/proceedings/papers/v48/balduzzi16.html.

Yoshua Bengio, Patrice Y. Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5(2):157–166.

James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. 2016. Quasi-recurrent neural networks. CoRR abs/1611.01576.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879. http://aclweb.org/anthology/P17-1171.

Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long short-term memory-networks for machine reading. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 551–561. https://aclweb.org/anthology/D16-1053.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734. http://www.aclweb.org/anthology/D14-1179.

Timothy Dozat and Christopher D. Manning. 2016. Deep biaffine attention for neural dependency parsing. CoRR abs/1611.01734.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science 14:179–211.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2017. AllenNLP: A deep semantic natural language processing platform. http://allennlp.org/papers/AllenNLP_white_paper.pdf.

Felix A. Gers and Jürgen Schmidhuber. 2000. Recurrent nets that time and count. In IJCNN.

Klaus Greff, Rupesh K. Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. 2016. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems.

Luheng He, Kenton Lee, Mike Lewis, and Luke Zettlemoyer. 2017. Deep semantic role labeling: What works and what's next. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Sepp Hochreiter. 1991. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Technische Universität München.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

Rafal Józefowicz, Wojciech Zaremba, and Ilya Sutskever. 2015. An empirical exploration of recurrent network architectures. In ICML.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL. https://doi.org/10.18653/v1/P17-4012.

Kenton Lee, Omer Levy, and Luke Zettlemoyer. 2017. Recurrent additive networks. arXiv preprint arXiv:1705.07393.

Tao Lei, Wengong Jin, Regina Barzilay, and Tommi Jaakkola. 2017a. Deriving neural architectures from sequence and graph kernels. In ICML.

Tao Lei, Yu Zhang, and Yoav Artzi. 2017b. Training RNNs as fast as CNNs. arXiv preprint arXiv:1709.02755.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. TACL 4:521–535.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19:313–330.

Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2249–2255. https://aclweb.org/anthology/D16-1244.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.

Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. CoRR abs/1611.01603.

Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, John Bauer, and Christopher D. Manning. 2014. A gold standard dependency corpus for English. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014).

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.

Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.

Barret Zoph and Quoc V. Le. 2017. Neural architecture search with reinforcement learning. In ICLR.

A Weight Visualization

Given the empirical evidence that LSTMs are effectively learning weighted sums of the content layers, it is natural to investigate what weights the model learns in practice. Using the more mathematically transparent simplification of LSTMs, we can visualize the weights w_j^t that are placed on every input j at every timestep t (see Equation 9).

Unlike attention mechanisms, these weights are vectors rather than scalar values. Therefore, we can only provide a coarse-grained visualization of the weights by rendering their L2-norm, as shown in Table 5. In the visualization, each column indicates the word represented by the weighted sum, and each row indicates the word over which the weighted sum is computed. Dark horizontal streaks indicate the duration for which a word was remembered. Unsurprisingly, the weights on the diagonal are always the largest, since they indicate the weight of the current word. More interesting task-specific patterns emerge when inspecting the off-diagonals that represent the weights on the context words.

The first visualization uses the language model. Due to the language modeling setup, there are only non-zero weights on the current or previous words. We find that the common function words are quickly forgotten, while infrequent words that signal the topic are remembered over very long distances.

The second visualization uses the dependency parser. In this setting, since the recurrent architectures are bidirectional, there are non-zero weights on all words in the sentence. The top-right triangle indicates weights from the forward direction, and the bottom-left triangle indicates weights from the backward direction. For syntax, we see a significantly different pattern. Function words that are useful for determining syntax are more likely to be remembered. Weights on head words are also likely to persist until the end of a constituent.

This illustration provides only a glimpse into what the model is capturing, and perhaps future, more detailed visualizations that take the individual dimensions into account can provide further insight into what LSTMs are learning in practice.
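Concretely, given the input and forget gate activations recorded while running the simplified model over a sentence, the weight matrix rendered in Table 5 can be computed as in the following sketch (a hypothetical helper with random gate activations standing in for a trained model's; the matplotlib rendering choices are illustrative, not the ones used for the figures):

```python
import numpy as np
import matplotlib.pyplot as plt

def weight_norm_matrix(i_gates, f_gates):
    """Build the T x T matrix of L2 norms of w_j^t = i_j * prod_{k=j+1}^{t} f_k.

    i_gates, f_gates: arrays of shape (T, d) holding the input/forget gate
    activations recorded over a sentence. Entry [t, j] is ||w_j^t||_2 for j <= t,
    and 0 otherwise.
    """
    T = len(i_gates)
    norms = np.zeros((T, T))
    for t in range(T):
        running = np.ones_like(i_gates[0])
        for j in range(t, -1, -1):            # walk backwards, accumulating forget gates
            norms[t, j] = np.linalg.norm(i_gates[j] * running)
            running = running * f_gates[j]    # now holds prod over k = j .. t
    return norms

# Toy usage with random gate activations standing in for a trained model's.
rng = np.random.default_rng(0)
words = "The hymn was sung at my first inaugural church service as governor".split()
T, d = len(words), 32
gates_i = rng.uniform(size=(T, d))
gates_f = rng.uniform(size=(T, d))

norms = weight_norm_matrix(gates_i, gates_f)
plt.imshow(norms.T, cmap="Greys")             # rows: context word j, columns: current word t
plt.xticks(range(T), words, rotation=90)
plt.yticks(range(T), words)
plt.tight_layout()
plt.savefig("weights.png")
```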

[Table 5: heatmap figures not reproduced. Left panels show language model weights; right panels show dependency parser weights, for the sentences: "The hymn was sung at my first inaugural church service as governor"; "US troops there clashed with guerrillas in a fight that left one Iraqi dead"; "He did comment on what he meant by the phrase ."; "I spoke to Bruce Garcey at NiMo regarding their RFP".]

Table 5: Visualization of the weights on context words learned by the memory cell. Each column represents the current word t, and each row represents a context word j. The gating mechanism implicitly computes element-wise weighted sums over each column. The darkness of each square indicates the L2-norm of the vector weights w_j^t from Equation 9. Figures on the left show weights learned by a language model. Figures on the right show weights learned by a dependency parser.