LONG SHORT-TERM MEMORY

Neural Computation

Jürgen Schmidhuber
IDSIA
Corso Elvezia
Lugano, Switzerland
juergen@idsia.ch
http://www.idsia.ch/~juergen

Sepp Hochreiter
Fakultät für Informatik
Technische Universität München
München, Germany
hochreit@informatik.tu-muenchen.de
http://www.informatik.tu-muenchen.de/~hochreit

Abstract

Learning to store information over extended time intervals via recurrent backpropagation takes a very long time, mostly due to insufficient, decaying error back flow. We briefly review Hochreiter's analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called "Long Short-Term Memory" (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow through "constant error carrousels" within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with RTRL, BPTT, Recurrent Cascade-Correlation, Elman nets, and Neural Sequence Chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long time lag tasks that have never been solved by previous recurrent network algorithms.

1 INTRODUCTION

Recurrent networks can in principle use their feedback connections to store representations of recent input events in form of activations ("short-term memory", as opposed to "long-term memory" embodied by slowly changing weights). This is potentially significant for many applications, including speech processing, non-Markovian control, and music composition (e.g., Mozer). The most widely used algorithms for learning what to put in short-term memory, however, take too much time or do not work well at all, especially when minimal time lags between inputs and corresponding teacher signals are long. Although theoretically fascinating, existing methods do not provide clear practical advantages over, say, backprop in feedforward nets with limited time windows. This paper will review an analysis of the problem and suggest a remedy.

The problem. With conventional "Back-Propagation Through Time" (BPTT, e.g., Williams and Zipser, Werbos) or "Real-Time Recurrent Learning" (RTRL, e.g., Robinson and Fallside), error signals "flowing backwards in time" tend to either (1) blow up or (2) vanish: the temporal evolution of the backpropagated error exponentially depends on the size of the weights (Hochreiter). Case (1) may lead to oscillating weights, while in case (2) learning to bridge long time lags takes a prohibitive amount of time, or does not work at all (see Section 3).
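To make the exponential dependence concrete, here is a minimal sketch (not part of the original paper) of a single self-recurrent logistic unit with weight w and a frozen net input: each backward step through the loop multiplies the error signal by w f'(net), so the error shrinks geometrically whenever |w f'(net)| < 1 and grows geometrically whenever |w f'(net)| > 1. The weight and net-input values below are purely illustrative.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def error_scale_factor(w, net):
    """w * f'(net) for a logistic unit: the factor by which an error
    signal is scaled at each backward step through a single
    self-recurrent connection with weight w."""
    y = logistic(net)
    return w * y * (1.0 - y)          # f'(net) = y * (1 - y) for the logistic

def backpropagated_error(w, net, steps, initial_error=1.0):
    """Error signal after propagating `steps` time steps backwards."""
    return initial_error * error_scale_factor(w, net) ** steps

# Factor < 1: the error vanishes; factor > 1: it blows up.
for w, net in ((1.0, 0.0), (0.5, 1.0), (6.0, 0.0)):
    print(f"w={w}, net={net}: factor={error_scale_factor(w, net):.3f}, "
          f"error after 100 steps={backpropagated_error(w, net, 100):.3e}")
```

Since the logistic derivative never exceeds 0.25, the factor can exceed 1.0 only for weight magnitudes above 4.0 combined with net inputs near zero; for most weight settings it stays below 1.0, which is the exponentially decaying case reviewed in Section 3.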
The remedy. This paper presents "Long Short-Term Memory" (LSTM), a novel recurrent network architecture in conjunction with an appropriate gradient-based learning algorithm. LSTM is designed to overcome these error back flow problems. It can learn to bridge time intervals in excess of 1000 steps even in case of noisy, incompressible input sequences, without loss of short time lag capabilities. This is achieved by an efficient, gradient-based algorithm for an architecture enforcing constant (thus neither exploding nor vanishing) error flow through internal states of special units (provided the gradient computation is truncated at certain architecture-specific points; this does not affect long-term error flow, though).

Outline of paper. Section 2 will briefly review previous work. Section 3 begins with an outline of the detailed analysis of vanishing errors due to Hochreiter. It will then introduce a naive approach to constant error backprop for didactic purposes, and highlight its problems concerning information storage and retrieval. These problems will lead to the LSTM architecture as described in Section 4. Section 5 will present numerous experiments and comparisons with competing methods: LSTM outperforms them, and also learns to solve complex, artificial tasks no other recurrent net algorithm has solved. Section 6 will discuss LSTM's limitations and advantages. The appendix contains a detailed description of the algorithm (A.1), and explicit error flow formulae (A.2).

2 PREVIOUS WORK

This section will focus on recurrent nets with time-varying inputs (as opposed to nets with stationary inputs and fixpoint-based gradient calculations, e.g., Almeida, Pineda).

Gradient-descent variants. The approaches of Elman, Fahlman, Williams, Schmidhuber, Pearlmutter, and many of the related algorithms in Pearlmutter's comprehensive overview suffer from the same problems as BPTT and RTRL (see Sections 1 and 3).

Time-delays. Other methods that seem practical for short time lags only are Time-Delay Neural Networks (Lang et al.) and Plate's method (Plate), which updates unit activations based on a weighted sum of old activations (see also de Vries and Principe). Lin et al. propose variants of time-delay networks called NARX networks.

Time constants. To deal with long time lags, Mozer uses time constants influencing changes of unit activations (de Vries and Principe's above-mentioned approach may in fact be viewed as a mixture of TDNN and time constants). For long time lags, however, the time constants need external fine tuning (Mozer). Sun et al.'s alternative approach updates the activation of a recurrent unit by adding the old activation and the (scaled) current net input. The net input, however, tends to perturb the stored information, which makes long-term storage impractical.
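The difficulty with simply integrating the net input can be seen in a minimal sketch (not from the paper; the update rule, weight, and input values are illustrative stand-ins for the integration-style approaches above): whatever the unit has accumulated is shifted by every later, irrelevant input, because nothing shields the stored activation from new net input.

```python
def integrating_unit(inputs, input_weight=0.1):
    """Recurrent unit in the spirit of the integration approaches above:
    new activation = old activation + scaled net input."""
    activation = 0.0
    for x in inputs:
        activation += input_weight * x    # later inputs keep shifting the store
    return activation

to_store = [10.0]                               # value to be remembered at t = 0
irrelevant = [0.3, -0.8, 0.5, 1.2, -0.4] * 20   # later, task-irrelevant inputs

print(f"{integrating_unit(to_store):.2f}")               # 1.00 (stored value)
print(f"{integrating_unit(to_store + irrelevant):.2f}")  # 2.60 (store perturbed)
```

LSTM's multiplicative gate units (Section 4) address exactly this kind of perturbation: access to the stored value is opened and closed explicitly, so irrelevant net inputs can be shut out.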
Ring's approach. Ring also proposed a method for bridging long time lags. Whenever a unit in his network receives conflicting error signals, he adds a higher order unit influencing appropriate connections. Although his approach can sometimes be extremely fast, bridging a long time lag may require the addition of many such units. Also, Ring's net does not generalize to unseen lag durations.

Bengio et al.'s approaches. Bengio et al. investigate methods such as simulated annealing, multi-grid random search, time-weighted pseudo-Newton optimization, and discrete error propagation. Their "latch" and "2-sequence" problems are very similar to problem (a) with minimal time lag (see the experiments in Section 5). Bengio and Frasconi also propose an EM approach for propagating targets. With n so-called "state networks", at a given time, their system can be in one of only n different states. See also the beginning of Section 5. But to solve continuous problems such as the "adding problem" (Section 5), their system would require an unacceptable number of states (i.e., state networks).

Kalman filters. Puskorius and Feldkamp use Kalman filter techniques to improve recurrent net performance. Since they use "a derivative discount factor imposed to decay exponentially the effects of past dynamic derivatives," there is no reason to believe that their Kalman Filter Trained Recurrent Networks will be useful for very long minimal time lags.

Second order nets. We will see that LSTM uses multiplicative units (MUs) to protect error flow from unwanted perturbations. It is not the first recurrent net method using MUs, though. For instance, Watrous and Kuhn use MUs in second order nets. Some differences to LSTM are: (1) Watrous and Kuhn's architecture does not enforce constant error flow and is not designed to solve long time lag problems. (2) It has fully connected second-order sigma-pi units, while the LSTM architecture's MUs are used only to gate access to constant error flow. (3) Watrous and Kuhn's algorithm costs O(W^2) operations per time step, ours only O(W), where W is the number of weights. See also Miller and Giles for additional work on MUs.

Simple weight guessing. To avoid long time lag problems of gradient-based approaches, we may simply randomly initialize all network weights until the resulting net happens to classify all training sequences correctly. In fact, recently we discovered (Schmidhuber and Hochreiter; Hochreiter and Schmidhuber) that simple weight guessing solves many of the problems in (Bengio; Bengio and Frasconi; Miller and Giles; Lin et al.) faster than the algorithms proposed therein. This does not mean that weight guessing is a good algorithm. It just means that the problems are very simple. More realistic tasks require either many free parameters (e.g., input weights) or high weight precision (e.g., for continuous-valued parameters), such that guessing becomes completely infeasible.

Adaptive sequence chunkers. Schmidhuber's hierarchical chunker systems do have a capability to bridge arbitrary time lags, but only if there is local predictability across the subsequences causing the time lags (see also Mozer). For instance, in his postdoctoral thesis, Schmidhuber uses hierarchical recurrent nets to rapidly solve certain grammar learning tasks involving very long minimal time lags. The performance of chunker systems, however, deteriorates as the noise level increases and the input sequences become less compressible. LSTM does not suffer from this problem.

3 CONSTANT ERROR BACKPROP

3.1 EXPONENTIALLY DECAYING ERROR

Conventional BPTT (e.g., Williams and Zipser). Output unit k's target at time t is denoted by d_k(t). Using mean squared error, k's error signal is

$$\vartheta_k(t) = f'_k(net_k(t))\,\bigl(d_k(t) - y^k(t)\bigr),$$

where

$$y^i(t) = f_i(net_i(t))$$

is the activation of a non-input unit i with differentiable activation function f_i,

$$net_i(t) = \sum_j w_{ij}\, y^j(t-1)$$

is unit i's current net input, and w_{ij} is the weight on the connection from unit j to i. Some non-output unit j's backpropagated error signal is

$$\vartheta_j(t) = f'_j(net_j(t)) \sum_i w_{ij}\, \vartheta_i(t+1).$$

The corresponding contribution to w_{jl}'s total weight update is $\alpha\, \vartheta_j(t)\, y^l(t-1)$, where $\alpha$ is the learning rate, and l stands for an arbitrary unit connected to unit j.

Outline of