LONG SHORT-TERM MEMORY

Neural Computation

Jürgen Schmidhuber
IDSIA
Corso Elvezia
Lugano, Switzerland
juergen@idsia.ch
http://www.idsia.ch/~juergen

Sepp Hochreiter
Fakultät für Informatik
Technische Universität München
München, Germany
hochreit@informatik.tu-muenchen.de
http://www.informatik.tu-muenchen.de/~hochreit

Abstract

Learning to store information over extended time intervals via recurrent backpropagation takes a very long time, mostly due to insufficient, decaying error back flow. We briefly review Hochreiter's 1991 analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called "Long Short-Term Memory" (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow through "constant error carrousels" within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with RTRL, BPTT, Recurrent Cascade-Correlation, Elman nets, and Neural Sequence Chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long time lag tasks that have never been solved by previous recurrent network algorithms.

1 INTRODUCTION

Recurrent networks can in principle use their feedback connections to store representations of recent input events in form of activations ("short-term memory", as opposed to "long-term memory" embodied by slowly changing weights). This is potentially significant for many applications, including speech processing, non-Markovian control, and music composition (e.g., Mozer). The most widely used algorithms for learning what to put in short-term memory, however, take too much time or do not work well at all, especially when minimal time lags between inputs and corresponding teacher signals are long. Although theoretically fascinating, existing methods do not provide clear practical advantages over, say, backprop in feedforward nets with limited time windows. This paper will review an analysis of the problem and suggest a remedy.

The problem. With conventional "Back-Propagation Through Time" (BPTT, e.g., Williams and Zipser, Werbos) or "Real-Time Recurrent Learning" (RTRL, e.g., Robinson and Fallside), error signals "flowing backwards in time" tend to either (1) blow up or (2) vanish: the temporal evolution of the backpropagated error exponentially depends on the size of the weights (Hochreiter 1991). Case (1) may lead to oscillating weights, while in case (2) learning to bridge long time lags takes a prohibitive amount of time, or does not work at all (see Section 3).

The remedy. This paper presents "Long Short-Term Memory" (LSTM), a novel recurrent network architecture in conjunction with an appropriate gradient-based learning algorithm. LSTM is designed to overcome these error back-flow problems. It can learn to bridge time intervals in excess of 1000 steps even in case of noisy, incompressible input sequences, without loss of short time lag capabilities. This is achieved by an efficient, gradient-based algorithm for an architecture enforcing constant (thus neither exploding nor vanishing) error flow through internal states of special units, provided the gradient computation is truncated at certain architecture-specific points (this does not affect long-term error flow, though).

Outline of paper. Section 2 will briefly review previous work. Section 3 begins with an outline of the detailed analysis of vanishing errors due to Hochreiter (1991). It will then introduce a naive approach to constant error backprop for didactic purposes, and highlight its problems concerning information storage and retrieval. These problems will lead to the LSTM architecture as described in Section 4. Section 5 will present numerous experiments and comparisons with competing methods. LSTM outperforms them, and also learns to solve complex, artificial tasks no other recurrent net algorithm has solved. Section 6 will discuss LSTM's limitations and advantages. The appendix contains a detailed description of the algorithm (A.1) and explicit error flow formulae (A.2).

2 PREVIOUS WORK

This section will focus on recurrent nets with time-varying inputs (as opposed to nets with stationary inputs and fixed-point-based gradient calculations, e.g., Almeida, Pineda).

Gradient-descent variants. The approaches of Elman, Fahlman, Williams, Schmidhuber, Pearlmutter, and many of the related algorithms in Pearlmutter's comprehensive overview suffer from the same problems as BPTT and RTRL (see Sections 1 and 3).

Time-delays. Other methods that seem practical for short time lags only are Time-Delay Neural Networks (Lang et al.) and Plate's method (Plate), which updates unit activations based on a weighted sum of old activations (see also de Vries and Principe). Lin et al. propose variants of time-delay networks called NARX networks.

Time constants. To deal with long time lags, Mozer uses time constants influencing changes of unit activations (de Vries and Principe's above-mentioned approach may in fact be viewed as a mixture of TDNN and time constants). For long time lags, however, the time constants need external fine tuning (Mozer). Sun et al.'s alternative approach updates the activation of a recurrent unit by adding the old activation and the (scaled) current net input. The net input, however, tends to perturb the stored information, which makes long-term storage impractical.

Ring's approach. Ring also proposed a method for bridging long time lags. Whenever a unit in his network receives conflicting error signals, he adds a higher-order unit influencing appropriate connections. Although his approach can sometimes be extremely fast, to bridge a time lag involving 100 steps may require the addition of 100 units. Also, Ring's net does not generalize to unseen lag durations.

Bengio et al.'s approaches. Bengio et al. investigate methods such as simulated annealing, multi-grid random search, time-weighted pseudo-Newton optimization, and discrete error propagation. Their "latch" and "2-sequence" problems are very similar to problem 3a with minimal time lag 100 (see Experiment 3). Bengio and Frasconi also propose an EM approach for propagating targets. With n so-called "state networks", at a given time, their system can be in one of only n different states. See also the beginning of Section 5. But to solve continuous problems such as the "adding problem" (Section 5.4), their system would require an unacceptable number of states (i.e., state networks).

Kalman filters. Puskorius and Feldkamp use Kalman filter techniques to improve recurrent net performance. Since they use "a derivative discount factor imposed to decay exponentially the effects of past dynamic derivative information", there is no reason to believe that their Kalman Filter Trained Recurrent Networks will be useful for very long minimal time lags.

Second order nets. We will see that LSTM uses multiplicative units (MUs) to protect error flow from unwanted perturbations. It is not the first recurrent net method using MUs, though. For instance, Watrous and Kuhn use MUs in second order nets. Some differences to LSTM are: (1) Watrous and Kuhn's architecture does not enforce constant error flow and is not designed to solve long time lag problems. (2) It has fully connected second-order sigma-pi units, while the LSTM architecture's MUs are used only to gate access to constant error flow. (3) Watrous and Kuhn's algorithm costs O(W^2) operations per time step, ours only O(W), where W is the number of weights. See also Miller and Giles for additional work on MUs.

Simple weight guessing. To avoid long time lag problems of gradient-based approaches, we may simply randomly initialize all network weights until the resulting net happens to classify all training sequences correctly. In fact, recently we discovered (Schmidhuber and Hochreiter, Hochreiter and Schmidhuber) that simple weight guessing solves many of the problems in (Bengio, Bengio and Frasconi, Miller and Giles, Lin et al.) faster than the algorithms proposed therein. This does not mean that weight guessing is a good algorithm. It just means that the problems are very simple. More realistic tasks require either many free parameters (e.g., input weights) or high weight precision (e.g., for continuous-valued parameters), such that guessing becomes completely infeasible.

Adaptive sequence chunkers. Schmidhuber's hierarchical chunker systems do have a capability to bridge arbitrary time lags, but only if there is local predictability across the subsequences causing the time lags (see also Mozer). For instance, in his postdoctoral thesis, Schmidhuber uses hierarchical recurrent nets to rapidly solve certain grammar learning tasks involving minimal time lags in excess of 1000 steps. The performance of chunker systems, however, deteriorates as the noise level increases and the input sequences become less compressible. LSTM does not suffer from this problem.

3 CONSTANT ERROR BACKPROP

3.1 EXPONENTIALLY DECAYING ERROR

Conventional BPTT (e.g., Williams and Zipser). Output unit k's target at time t is denoted by d_k(t). Using mean squared error, k's error signal is

$$\vartheta_k(t) = f'_k(net_k(t))\,\big(d_k(t) - y^k(t)\big),$$

where

$$y^i(t) = f_i(net_i(t))$$

is the activation of a non-input unit i with differentiable activation function f_i,

$$net_i(t) = \sum_j w_{ij}\, y^j(t-1)$$

is unit i's current net input, and w_{ij} is the weight on the connection from unit j to i. Some non-output unit j's backpropagated error signal is

$$\vartheta_j(t) = f'_j(net_j(t)) \sum_i w_{ij}\, \vartheta_i(t+1).$$

The corresponding contribution to w_{jl}'s total weight update is $\alpha\, \vartheta_j(t)\, y^l(t-1)$, where $\alpha$ is the learning rate, and l stands for an arbitrary unit connected to unit j.

Outline of Hochreiter's analysis (1991). Suppose we have a fully connected net whose non-input unit indices range from 1 to n. Let us focus on local error flow from unit u to unit v (later we will see that the analysis immediately extends to global error flow). The error occurring at an arbitrary unit u at time step t is propagated "back into time" for q time steps, to an arbitrary unit v. This will scale the error by the following factor:

$$\frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} =
\begin{cases}
f'_v(net_v(t-1))\, w_{uv} & q = 1 \\[4pt]
f'_v(net_v(t-q)) \displaystyle\sum_{l=1}^{n} \frac{\partial \vartheta_l(t-q+1)}{\partial \vartheta_u(t)}\, w_{lv} & q > 1 .
\end{cases} \qquad (1)$$

With $l_q = v$ and $l_0 = u$, we obtain

$$\frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} =
\sum_{l_1=1}^{n} \cdots \sum_{l_{q-1}=1}^{n}\ \prod_{m=1}^{q} f'_{l_m}\!\big(net_{l_m}(t-m)\big)\, w_{l_m l_{m-1}} \qquad (2)$$

(proof by induction). The sum of the $n^{q-1}$ terms $\prod_{m=1}^{q} f'_{l_m}(net_{l_m}(t-m))\, w_{l_m l_{m-1}}$ determines the total error back flow (note that since the summation terms may have different signs, increasing the number of units n does not necessarily increase error flow).

Intuitive explanation of equation (2). If

$$\big|f'_{l_m}(net_{l_m}(t-m))\, w_{l_m l_{m-1}}\big| > 1.0$$

for all m (as can happen, e.g., with linear $f_{l_m}$), then the largest product increases exponentially with q. That is, the error blows up, and conflicting error signals arriving at unit v can lead to oscillating weights and unstable learning (for error blow-ups or bifurcations see also Pineda, Baldi and Pineda, Doya). On the other hand, if

$$\big|f'_{l_m}(net_{l_m}(t-m))\, w_{l_m l_{m-1}}\big| < 1.0$$

for all m, then the largest product decreases exponentially with q. That is, the error vanishes, and nothing can be learned in acceptable time.

If $f_{l_m}$ is the logistic sigmoid, then the maximal value of $f'_{l_m}$ is 0.25. If $y^{l_{m-1}}$ is constant and not equal to zero, then $|f'_{l_m}(net_{l_m})\, w_{l_m l_{m-1}}|$ takes on maximal values where

$$w_{l_m l_{m-1}} = \frac{1}{y^{l_{m-1}}} \coth\!\left(\frac{1}{2}\, net_{l_m}\right),$$

goes to zero for $|w_{l_m l_{m-1}}| \to \infty$, and is less than 1.0 for $|w_{l_m l_{m-1}}| < 4.0$ (e.g., if the absolute maximal weight value $w_{max}$ is smaller than 4.0). Hence with conventional logistic sigmoid activation functions, the error flow tends to vanish as long as the weights have absolute values below 4.0, especially in the beginning of the training phase. In general, the use of larger initial weights will not help, though: as seen above, for $|w_{l_m l_{m-1}}| \to \infty$ the relevant derivative goes to zero faster than the absolute weight can grow (also, some weights will have to change their signs by crossing zero). Likewise, increasing the learning rate does not help either; it will not change the ratio of long-range error flow and short-range error flow. BPTT is too sensitive to recent distractions. (A very similar, more recent analysis was presented by Bengio et al.)
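As a purely illustrative aside (not part of the original analysis), the following short sketch multiplies the per-step factors |f'(net) w| along one back-flow path for a logistic unit; the weight values and net inputs are arbitrary choices meant only to show the two regimes of equation (2).

```python
import numpy as np

def logistic_prime(net):
    """Derivative of the logistic sigmoid f(x) = 1/(1+exp(-x)); its maximum is 0.25 at net = 0."""
    f = 1.0 / (1.0 + np.exp(-net))
    return f * (1.0 - f)

def error_scaling(w, net, q):
    """Product of q identical per-step factors f'(net) * w along one back-flow path."""
    return (logistic_prime(net) * w) ** q

for q in (10, 50, 100):
    # |w| < 4.0: even the largest possible factor 0.25 * w is below 1, so the error vanishes.
    print(q,
          error_scaling(w=3.0, net=0.0, q=q),   # factor 0.75 -> decays exponentially with q
          error_scaling(w=6.0, net=0.0, q=q))   # factor 1.5  -> blows up exponentially with q
```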

Global error flow. The local error flow analysis above immediately shows that global error flow vanishes too. To see this, compute

$$\sum_{u:\ u\ \text{output unit}} \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} .$$

Weak upper bound for scaling factor. The following, slightly extended vanishing error analysis also takes n, the number of units, into account. For q > 1, formula (2) can be rewritten as

$$(W_u^T)^T F'(t-1) \left[\prod_{m=2}^{q-1} \big(W\, F'(t-m)\big)\right] W_v\, f'_v(net_v(t-q)),$$

where the weight matrix W is defined by $[W]_{ij} := w_{ij}$, v's outgoing weight vector $W_v$ is defined by $[W_v]_i := [W]_{iv} = w_{iv}$, u's incoming weight vector $W_u^T$ is defined by $[W_u^T]_i := [W]_{ui} = w_{ui}$, and for $m = 1, \ldots, q$, $F'(t-m)$ is the diagonal matrix of first-order derivatives defined as $[F'(t-m)]_{ij} := 0$ if $i \neq j$, and $[F'(t-m)]_{ij} := f'_i(net_i(t-m))$ otherwise. Here T is the transposition operator, $[A]_{ij}$ is the element in the i-th column and j-th row of matrix A, and $[x]_i$ is the i-th component of vector x.

Using a matrix norm $\|\cdot\|_A$ compatible with vector norm $\|\cdot\|_x$, we define

$$f'_{max} := \max_{m=1,\ldots,q} \big\{ \| F'(t-m) \|_A \big\}.$$

For $\max_{i=1,\ldots,n} \{|x_i|\} \leq \|x\|_x$ we get $|x^T y| \leq n\, \|x\|_x\, \|y\|_x$. Since

$$|f'_v(net_v(t-q))| \leq \| F'(t-q) \|_A \leq f'_{max},$$

we obtain the following inequality:

$$\left| \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} \right|
\leq n\, (f'_{max})^q\, \|W_v\|_x\, \|W_u^T\|_x\, \|W\|_A^{q-2}
\leq n\, \big(f'_{max}\, \|W\|_A\big)^q .$$

This inequality results from

$$\|W_v\|_x = \|W e_v\|_x \leq \|W\|_A\, \|e_v\|_x \leq \|W\|_A$$

and

$$\|W_u^T\|_x = \|e_u W\|_x \leq \|W\|_A\, \|e_u\|_x \leq \|W\|_A ,$$

where $e_k$ is the unit vector whose components are 0 except for the k-th component, which is 1. Note that this is a weak, extreme case upper bound; it will be reached only if all $\|F'(t-m)\|_A$ take on maximal values, and if the contributions of all paths across which error flows back from unit u to unit v have the same sign. Large $\|W\|_A$, however, typically result in small values of $\|F'(t-m)\|_A$, as confirmed by experiments (see, e.g., Hochreiter 1991).

For example, with norms

$$\|W\|_A := \max_r \sum_s |w_{rs}|$$

and

$$\|x\|_x := \max_r |x_r| ,$$

we have $f'_{max} = 0.25$ for the logistic sigmoid. We observe that if

$$|w_{ij}| \leq w_{max} < \frac{4.0}{n} \quad \forall i, j,$$

then $\|W\|_A \leq n\, w_{max} < 4.0$ will result in exponential decay: by setting $\tau := (n\, w_{max} / 4.0) < 1.0$, we obtain

$$\left| \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} \right| \leq n\, \tau^q .$$

We refer to Hochreiter's 1991 thesis for additional results.
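As a quick numerical illustration of the bound (an addition of ours, not from the paper), the sketch below draws a random weight matrix with |w_ij| <= w_max < 4.0/n and evaluates n (f'_max ||W||_A)^q with the row-sum matrix norm defined above; the bound decays geometrically in q. The sizes and seed are arbitrary.

```python
import numpy as np

n, w_max = 10, 0.3          # w_max < 4.0/n = 0.4, so the bound must decay
rng = np.random.default_rng(0)
W = rng.uniform(-w_max, w_max, size=(n, n))

norm_A = np.max(np.sum(np.abs(W), axis=1))   # ||W||_A = max_r sum_s |w_rs|
f_prime_max = 0.25                           # maximal derivative of the logistic sigmoid

for q in (10, 50, 100):
    bound = n * (f_prime_max * norm_A) ** q  # weak upper bound on the error scaling factor
    print(q, bound)
```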

3.2 CONSTANT ERROR FLOW: NAIVE APPROACH

A single unit. To avoid vanishing error signals, how can we achieve constant error flow through a single unit j with a single connection to itself? According to the rules above, at time t, j's local error back flow is $\vartheta_j(t) = f'_j(net_j(t))\, \vartheta_j(t+1)\, w_{jj}$. To enforce constant error flow through j, we require

$$f'_j(net_j(t))\, w_{jj} = 1.0 .$$

Note the similarity to Mozer's fixed time constant system: a time constant of 1.0 is appropriate for potentially infinite time lags.

The constant error carrousel. Integrating the differential equation above, we obtain $f_j(net_j(t)) = \frac{net_j(t)}{w_{jj}}$ for arbitrary $net_j(t)$. This means: $f_j$ has to be linear, and unit j's activation has to remain constant:

$$y^j(t+1) = f_j(net_j(t+1)) = f_j(w_{jj}\, y^j(t)) = y^j(t) .$$

In the experiments, this will be ensured by using the identity function $f_j$: $f_j(x) = x, \forall x$, and by setting $w_{jj} = 1.0$. We refer to this as the constant error carrousel (CEC). CEC will be LSTM's central feature (see Section 4).

[1] We do not use the expression "time constant" in the differential sense, as, e.g., Pearlmutter does.
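The following minimal sketch (ours, added for illustration) checks the two CEC properties just stated: with the identity activation and a self-weight of 1.0, the unit's activation stays constant in the forward pass, and the local error-scaling factor f'(net) w_jj stays exactly 1.0 in the backward pass.

```python
def cec_step(y_prev, w_jj=1.0):
    """One forward step of a self-connected linear unit: f(x) = x, so y(t+1) = w_jj * y(t)."""
    net = w_jj * y_prev
    return net  # identity activation

y = 0.7
for t in range(1000):
    y = cec_step(y)          # activation never changes: still 0.7 after 1000 steps
backflow_factor = 1.0 * 1.0  # f'(net) * w_jj = 1 * 1: error is neither amplified nor damped
print(y, backflow_factor)
```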

Of course unit j will not only be connected to itself but also to other units. This invokes two obvious, related problems (also inherent in all other gradient-based approaches):

1. Input weight conflict: for simplicity, let us focus on a single additional input weight $w_{ji}$. Assume that the total error can be reduced by switching on unit j in response to a certain input, and keeping it active for a long time (until it helps to compute a desired output). Provided i is non-zero, since the same incoming weight has to be used for both storing certain inputs and ignoring others, $w_{ji}$ will often receive conflicting weight update signals during this time (recall that j is linear): these signals will attempt to make $w_{ji}$ participate in (1) storing the input (by switching on j) and (2) protecting the input (by preventing j from being switched off by irrelevant later inputs). This conflict makes learning difficult, and calls for a more context-sensitive mechanism for controlling "write operations" through input weights.

2. Output weight conflict: assume j is switched on and currently stores some previous input. For simplicity, let us focus on a single additional outgoing weight $w_{kj}$. The same $w_{kj}$ has to be used for both retrieving j's content at certain times and preventing j from disturbing k at other times. As long as unit j is non-zero, $w_{kj}$ will attract conflicting weight update signals generated during sequence processing: these signals will attempt to make $w_{kj}$ participate in (1) accessing the information stored in j and, at different times, (2) protecting unit k from being perturbed by j. For instance, with many tasks there are certain "short time lag errors" that can be reduced in early training stages. However, at later training stages, j may suddenly start to cause avoidable errors in situations that already seemed under control, by attempting to participate in reducing more difficult "long time lag errors". Again, this conflict makes learning difficult, and calls for a more context-sensitive mechanism for controlling "read operations" through output weights.

Of course, input and output weight conflicts are not specific for long time lags, but occur for short time lags as well. Their effects, however, become particularly pronounced in the long time lag case: as the time lag increases, (1) stored information must be protected against perturbation for longer and longer periods, and (2) especially in advanced stages of learning, more and more already correct outputs also require protection against perturbation.

Due to the problems above, the naive approach does not work well except in case of certain simple problems involving local input/output representations and non-repeating input patterns (see Hochreiter 1991 and Silva et al.). The next section shows how to do it right.

4 LONG SHORT-TERM MEMORY

Memory cells and gate units. To construct an architecture that allows for constant error flow through special, self-connected units without the disadvantages of the naive approach, we extend the constant error carrousel (CEC) embodied by the self-connected, linear unit j from Section 3.2 by introducing additional features. A multiplicative input gate unit is introduced to protect the memory contents stored in j from perturbation by irrelevant inputs. Likewise, a multiplicative output gate unit is introduced which protects other units from perturbation by currently irrelevant memory contents stored in j.

The resulting, more complex unit is called a memory cell (see Figure 1). The j-th memory cell is denoted $c_j$. Each memory cell is built around a central linear unit with a fixed self-connection (the CEC). In addition to $net_{c_j}$, $c_j$ gets input from a multiplicative unit $out_j$ (the "output gate") and from another multiplicative unit $in_j$ (the "input gate"). $in_j$'s activation at time t is denoted by $y^{in_j}(t)$, $out_j$'s by $y^{out_j}(t)$. We have

$$y^{out_j}(t) = f_{out_j}(net_{out_j}(t)); \qquad y^{in_j}(t) = f_{in_j}(net_{in_j}(t));$$

where

$$net_{out_j}(t) = \sum_u w_{out_j u}\, y^u(t-1),$$

and

$$net_{in_j}(t) = \sum_u w_{in_j u}\, y^u(t-1).$$

We also have

$$net_{c_j}(t) = \sum_u w_{c_j u}\, y^u(t-1).$$

The summation indices u may stand for input units, gate units, memory cells, or even conventional hidden units if there are any (see also the paragraph on "network topology" below). All these different types of units may convey useful information about the current state of the net. For instance, an input gate (output gate) may use inputs from other memory cells to decide whether to store (access) certain information in "its" memory cell. There even may be recurrent self-connections like $w_{c_j c_j}$. It is up to the user to define the network topology. See Figure 2 for an example.

At time t, $c_j$'s output $y^{c_j}(t)$ is computed as

$$y^{c_j}(t) = y^{out_j}(t)\, h\!\big(s_{c_j}(t)\big),$$

where the "internal state" $s_{c_j}(t)$ is

$$s_{c_j}(0) = 0, \qquad s_{c_j}(t) = s_{c_j}(t-1) + y^{in_j}(t)\, g\!\big(net_{c_j}(t)\big) \ \ \text{for } t > 0.$$

The differentiable function g squashes $net_{c_j}$; the differentiable function h scales memory cell outputs computed from the internal state $s_{c_j}$.
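For concreteness, here is a minimal NumPy sketch (ours) of the forward pass of a single memory cell, following the equations above. The logistic gates and the ranges chosen for g and h ([-2, 2] and [-1, 1]) mirror settings used in the experiments later; the weight vectors are arbitrary, and there is no forget gate in this original formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def g(x):  # squashes the cell input, range (-2, 2)
    return 4.0 * sigmoid(x) - 2.0

def h(x):  # scales the cell output, range (-1, 1)
    return 2.0 * sigmoid(x) - 1.0

def memory_cell_step(y_prev, s_prev, w_in, w_out, w_c):
    """One time step of a single memory cell.

    y_prev: activations y^u(t-1) of all source units u; s_prev: internal state s_c(t-1).
    """
    y_in  = sigmoid(np.dot(w_in,  y_prev))   # input gate activation
    y_out = sigmoid(np.dot(w_out, y_prev))   # output gate activation
    net_c = np.dot(w_c, y_prev)              # cell net input
    s     = s_prev + y_in * g(net_c)         # CEC: additive internal state update
    return y_out * h(s), s                   # cell output y^c(t) and new internal state

rng = np.random.default_rng(1)
w_in, w_out, w_c = (rng.uniform(-0.2, 0.2, 5) for _ in range(3))
y_c, s = memory_cell_step(rng.uniform(-1, 1, 5), 0.0, w_in, w_out, w_c)
print(y_c, s)
```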


Figure 1: Architecture of memory cell $c_j$ (the box) and its gate units $in_j$, $out_j$. The self-recurrent connection (with weight 1.0) indicates feedback with a delay of 1 time step. It builds the basis of the "constant error carrousel" CEC. The gate units open and close access to CEC. See text and appendix A.1 for details.

Why gate units? To avoid input weight conflicts, $in_j$ controls the error flow to memory cell $c_j$'s input connections $w_{c_j i}$. To circumvent $c_j$'s output weight conflicts, $out_j$ controls the error flow from unit j's output connections. In other words, the net can use $in_j$ to decide when to keep or override information in memory cell $c_j$, and $out_j$ to decide when to access memory cell $c_j$ and when to prevent other units from being perturbed by $c_j$ (see Figure 1).

Error signals trapped within a memory cell's CEC cannot change, but different error signals flowing into the cell (at different times) via its output gate may get superimposed. The output gate will have to learn which errors to trap in its CEC, by appropriately scaling them. The input gate will have to learn when to release errors, again by appropriately scaling them. Essentially, the multiplicative gate units open and close access to constant error flow through CEC.

Distributed output representations typically do require output gates. Both gate types are not always necessary, though; one may be sufficient. For instance, in Experiments 2a and 2b in Section 5, it will be possible to use input gates only. In fact, output gates are not required in case of local output encoding: preventing memory cells from perturbing already learned outputs can be done by simply setting the corresponding weights to zero. Even in this case, however, output gates can be beneficial: they prevent the net's attempts at storing long time lag memories (which are usually hard to learn) from perturbing activations representing easily learnable short time lag memories. (This will prove quite useful in Experiment 1, for instance.)

Network topology. We use networks with one input layer, one hidden layer, and one output layer. The (fully) self-connected hidden layer contains memory cells and corresponding gate units (for convenience, we refer to both memory cells and gate units as being located in the hidden layer). The hidden layer may also contain "conventional" hidden units providing inputs to gate units and memory cells. All units (except for gate units) in all layers have directed connections (serve as inputs) to all units in the layer above (or to all higher layers; Experiments 2a and 2b).

Memory cell blocks. S memory cells sharing the same input gate and the same output gate form a structure called a "memory cell block of size S". Memory cell blocks facilitate information storage; as with conventional neural nets, it is not so easy to code a distributed input within a single cell. Since each memory cell block has as many gate units as a single memory cell (namely two), the block architecture can be even slightly more efficient (see paragraph "computational complexity" below). A memory cell block of size 1 is just a simple memory cell. In the experiments (Section 5), we will use memory cell blocks of various sizes.

Learning. We use a variant of RTRL (e.g., Robinson and Fallside) which properly takes into account the altered, multiplicative dynamics caused by input and output gates. However, to ensure non-decaying error backprop through internal states of memory cells, as with truncated BPTT (e.g., Williams and Peng), errors arriving at "memory cell net inputs" (for cell $c_j$, this includes $net_{c_j}$, $net_{in_j}$, $net_{out_j}$) do not get propagated back further in time (although they do serve to change the incoming weights). Only within memory cells are errors propagated back through previous internal states $s_{c_j}$. To visualize this: once an error signal arrives at a memory cell output, it gets scaled by the output gate activation and h'. Then it is within the memory cell's CEC, where it can flow back indefinitely without ever being scaled. Only when it leaves the memory cell through the input gate and g is it scaled once more, by the input gate activation and g'. It then serves to change the incoming weights before it is truncated (see appendix for explicit formulae).
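To illustrate why one stored partial derivative per weight suffices, the sketch below (our paraphrase of the idea, not the appendix's exact formulae) accumulates approximations of the partials of the internal state with respect to a cell-input weight and an input-gate weight under the truncation assumption, i.e., treating the source activations y^m(t-1) as constants with respect to the weights. The particular activation functions in the usage example are arbitrary choices.

```python
import math

def accumulate_partials(ds_dw_c, ds_dw_in, y_prev, y_in, net_c, net_in,
                        g, g_prime, f_in_prime):
    """Truncated, RTRL-style accumulation for one memory cell.

    ds_dw_c[m]  ~ d s_c(t) / d w_{c,m};  ds_dw_in[m] ~ d s_c(t) / d w_{in,m}.
    Both follow from s_c(t) = s_c(t-1) + y_in(t) g(net_c(t)) with y^m(t-1) held fixed.
    """
    for m, y_m in enumerate(y_prev):
        ds_dw_c[m] += g_prime(net_c) * y_in * y_m           # path through the cell input weight
        ds_dw_in[m] += g(net_c) * f_in_prime(net_in) * y_m  # path through the input gate weight
    return ds_dw_c, ds_dw_in

sig = lambda x: 1.0 / (1.0 + math.exp(-x))
ds_c, ds_in = accumulate_partials(
    [0.0] * 3, [0.0] * 3, y_prev=[0.5, -0.2, 1.0],
    y_in=sig(0.1), net_c=0.3, net_in=0.1,
    g=lambda x: 4 * sig(x) - 2, g_prime=lambda x: 4 * sig(x) * (1 - sig(x)),
    f_in_prime=lambda x: sig(x) * (1 - sig(x)))
print(ds_c, ds_in)
```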

Computational complexity. As with Mozer's focused recurrent backprop algorithm (Mozer), only the derivatives $\partial s_{c_j} / \partial w_{il}$ need to be stored and updated. Hence the LSTM algorithm is very efficient, with an excellent update complexity of O(W), where W is the number of weights (see details in appendix A.1). Hence LSTM and BPTT for fully recurrent nets have the same update complexity per time step (while RTRL's is much worse). Unlike full BPTT, however, LSTM is local in space and time: there is no need to store activation values observed during sequence processing in a stack with potentially unlimited size.

Abuse problem and solutions. In the beginning of the learning phase, error reduction may be possible without storing information over time. The network will thus tend to abuse memory cells, e.g., as bias cells (i.e., it might make their activations constant and use the outgoing connections as adaptive thresholds for other units). The potential difficulty is: it may take a long time to release abused memory cells and make them available for further learning. A similar "abuse problem" appears if two memory cells store the same (redundant) information. There are at least two solutions to the abuse problem: (1) Sequential network construction (e.g., Fahlman): a memory cell and the corresponding gate units are added to the network whenever the

[2] For intracellular backprop in a quite different context, see also Doya and Yoshizawa.

[3] Following Schmidhuber, we say that a recurrent net algorithm is local in space if the update complexity per time step and weight does not depend on network size. We say that a method is local in time if its storage requirements do not depend on input sequence length. For instance, RTRL is local in time but not in space; BPTT is local in space but not in time.


Figure 2: Example of a net with 8 input units, 4 output units, and 2 memory cell blocks of size 2. in_1 marks the input gate, out_1 marks the output gate, and cell1/block1 marks the first memory cell of block 1. cell1/block1's architecture is identical to the one in Figure 1, with gate units in_1 and out_1 (note that by rotating Figure 1 by 90 degrees anti-clockwise, it will match with the corresponding parts of Figure 2). The example assumes dense connectivity: each gate unit and each memory cell see all non-output units. For simplicity, however, outgoing weights of only one type of unit are shown for each layer. With the efficient, truncated update rule, error flows only through connections to output units, and through fixed self-connections within cell blocks (not shown here; see Figure 1). Error flow is truncated once it "wants" to leave memory cells or gate units. Therefore, no connection shown above serves to propagate error back to the unit from which the connection originates (except for connections to output units), although the connections themselves are modifiable. That is why the truncated LSTM algorithm is so efficient, despite its ability to bridge very long time lags. See text and appendix A.1 for details. Figure 2 actually shows the architecture used for Experiment 6a (only the bias of the non-input units is omitted).

error stops decreasing (see Section 5). (2) Output gate bias: each output gate gets a negative initial bias, to push initial memory cell activations towards zero. Memory cells with more negative bias automatically get "allocated" later (see Section 5).

Internal state drift and remedies. If memory cell $c_j$'s inputs are mostly positive or mostly negative, then its internal state $s_j$ will tend to drift away over time. This is potentially dangerous, for the $h'(s_j)$ will then adopt very small values, and the gradient will vanish. One way to circumvent this problem is to choose an appropriate function h. But h(x) = x, for instance, has the disadvantage of unrestricted memory cell output range. Our simple but effective way of solving drift problems at the beginning of learning is to initially bias the input gate $in_j$ towards zero. Although there is a tradeoff between the magnitudes of $h'(s_j)$ on the one hand and of $y^{in_j}$ and $f'_{in_j}$ on the other, the potential negative effect of input gate bias is negligible compared to the one of the drifting effect. With logistic sigmoid activation functions, there appears to be no need for fine-tuning the initial bias, as confirmed by Experiments 4 and 5 in Section 5.
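To see the remedy at work, the short sketch below (ours, with arbitrary numbers) feeds a cell a stream of constant, positive squashed inputs: with zero input-gate bias the internal state drifts quickly, while a negative initial bias keeps the gate nearly closed, so the state stays small early in learning.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def drift(input_gate_bias, steps=100, g_of_net=1.5):
    """Internal state after repeatedly adding y_in * g(net_c) with a constant positive g(net_c)."""
    s = 0.0
    for _ in range(steps):
        s += sigmoid(input_gate_bias) * g_of_net   # s(t) = s(t-1) + y_in(t) * g(net_c(t))
    return s

print(drift(0.0))    # gate about half open: state drifts to roughly 75
print(drift(-6.0))   # negatively biased gate nearly closed: state stays below 0.4
```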

5 EXPERIMENTS

Introduction. Which tasks are appropriate to demonstrate the quality of a novel long time lag algorithm? First of all, minimal time lags between relevant input signals and corresponding teacher signals must be long for all training sequences. In fact, many previous recurrent net algorithms sometimes manage to generalize from very short training sequences to very long test sequences (see, e.g., Pollack). But a real long time lag problem does not have any short time lag exemplars in the training set. For instance, Elman's training procedure, BPTT, offline RTRL, online RTRL, etc., fail miserably on real long time lag problems (see, e.g., Hochreiter 1991 and Mozer). A second important requirement is that the tasks should be complex enough such that they cannot be solved quickly by simple-minded strategies such as random weight guessing.

Guessing can outperform many long time lag algorithms. Recently we discovered (Schmidhuber and Hochreiter, Hochreiter and Schmidhuber) that many long time lag tasks used in previous work can be solved more quickly by simple random weight guessing than by the proposed algorithms. For instance, guessing solved a variant of Bengio and Frasconi's "parity problem" much faster than the seven methods tested by Bengio et al. and Bengio and Frasconi. Similarly for some of Miller and Giles' problems. Of course, this does not mean that guessing is a good algorithm. It just means that some previously used problems are not extremely appropriate to demonstrate the quality of previously proposed algorithms.

What's common to Experiments 1-6. All our experiments (except for Experiment 1) involve long minimal time lags; there are no short time lag training exemplars facilitating learning. Solutions to most of our tasks are sparse in weight space. They require either many parameters/inputs or high weight precision, such that random weight guessing becomes infeasible.

We always use on-line learning (as opposed to batch learning), and logistic sigmoids as activation functions. For Experiments 1 and 2, initial weights are chosen in the range [-0.2, 0.2], for the other experiments in [-0.1, 0.1]. Training sequences are generated randomly according to the various task descriptions. In slight deviation from the notation in Appendix A.1, each discrete time step of each input sequence involves three processing steps: (1) use the current input to set the input units, (2) compute activations of hidden units (including input gates, output gates, memory cells), (3) compute output unit activations. Except for Experiments 1, 2a, and 2b, sequence elements are randomly generated on-line, and error signals are generated only at sequence ends. Net activations are reset after each processed input sequence.

For comparisons with recurrent nets taught by gradient descent, we give results only for RTRL (except for comparison 2a, which also includes BPTT). Note, however, that untruncated BPTT (see, e.g., Williams and Peng) computes exactly the same gradient as offline RTRL. With long time lag problems, offline RTRL (or BPTT) and the online version of RTRL (no activation resets, online weight changes) lead to almost identical, negative results (as confirmed by additional simulations in Hochreiter 1991; see also Mozer). This is because offline RTRL, online RTRL, and full BPTT all suffer badly from exponential error decay.

Our LSTM architectures are selected quite arbitrarily. If nothing is known about the complexity of a given problem, a more systematic approach would be: start with a very small net consisting of one memory cell. If this does not work, try two cells, etc. Alternatively, use sequential network construction (e.g., Fahlman).

Outline of experiments.

Experiment 1 focuses on a standard benchmark test for recurrent nets: the embedded Reber grammar. Since it allows for training sequences with short time lags, it is not a long time lag problem. We include it because (1) it provides a nice example where LSTM's output gates are truly beneficial, and (2) it is a popular benchmark for recurrent nets that has been used by many authors; we want to include at least one experiment where conventional BPTT and RTRL do not fail completely (LSTM, however, clearly outperforms them). The embedded Reber grammar's minimal time lags represent a border case in the sense that it is still possible to learn to bridge them with conventional algorithms. Only slightly longer minimal time lags would make this almost impossible. The more interesting tasks in our paper, however, are those that RTRL, BPTT, etc. cannot solve at all.

[4] It should be mentioned, however, that different input representations and different types of noise may lead to worse guessing performance (personal communication).

Experiment 2 focuses on noise-free and noisy sequences involving numerous input symbols distracting from the few important ones. The most difficult task (Task 2c) involves hundreds of distractor symbols at random positions, and minimal time lags of 1000 steps. LSTM solves it, while BPTT and RTRL already fail in case of 10-step minimal time lags (see also, e.g., Hochreiter 1991 and Mozer). For this reason RTRL and BPTT are omitted in the remaining, more complex experiments, all of which involve much longer time lags.

Experiment 3 addresses long time lag problems with noise and signal on the same input line. Experiments 3a and 3b focus on Bengio et al.'s "2-sequence problem". Because this problem actually can be solved quickly by random weight guessing, we also include a far more difficult 2-sequence problem (3c), which requires learning real-valued, conditional expectations of noisy targets, given the inputs.

Experiments 4 and 5 involve distributed, continuous-valued input representations and require learning to store precise, real values for very long time periods. Relevant input signals can occur at quite different positions in input sequences. Again, minimal time lags involve hundreds of steps. Similar tasks never have been solved by other recurrent net algorithms.

Experiment 6 involves tasks of a different complex type that also has not been solved by other recurrent net algorithms. Again, relevant input signals can occur at quite different positions in input sequences. The experiment shows that LSTM can extract information conveyed by the temporal order of widely separated inputs.

A later subsection will provide a detailed summary of experimental conditions in two tables, for reference.

5.1 EXPERIMENT 1: EMBEDDED REBER GRAMMAR

Task. Our first task is to learn the "embedded Reber grammar", e.g., Smith and Zipser, Cleeremans et al., and Fahlman. Since it allows for training sequences with short time lags (of as few as 9 steps), it is not a long time lag problem. We include it for two reasons: (1) it is a popular recurrent net benchmark used by many authors; we wanted to have at least one experiment where RTRL and BPTT do not fail completely, and (2) it shows nicely how output gates can be beneficial.

Figure 3: Transition diagram for the Reber grammar.

Figure 4: Transition diagram for the embedded Reber grammar. Each box represents a copy of the Reber grammar (see Figure 3).

Starting at the leftmost node of the directed graph in Figure 4, symbol strings are generated sequentially (beginning with the empty string) by following edges, and appending the associated symbols to the current string, until the rightmost node is reached. Edges are chosen randomly if there is a choice (probability 0.5). The net's task is to read strings, one symbol at a time, and to permanently predict the next symbol (error signals occur at every time step). To correctly predict the symbol before last, the net has to remember the second symbol.
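Because the transition diagrams live in the figures and are not reproduced in the text, the sketch below (ours) uses the Reber grammar transitions as they are commonly published; treat that transition table as an assumption. It generates embedded Reber strings and makes explicit why the second symbol must be remembered until the end.

```python
import random

# Commonly published Reber grammar: state -> [(symbol, next_state), ...]; state 5 is terminal.
REBER = {
    0: [('T', 1), ('P', 2)],
    1: [('S', 1), ('X', 3)],
    2: [('T', 2), ('V', 4)],
    3: [('X', 2), ('S', 5)],
    4: [('P', 3), ('V', 5)],
    5: [],
}

def reber_string(rng):
    state, out = 0, ['B']
    while REBER[state]:
        symbol, state = rng.choice(REBER[state])   # each available edge chosen with probability 0.5
        out.append(symbol)
    return out + ['E']

def embedded_reber_string(rng):
    # B, then T or P, then a full Reber string, then the *same* T or P, then E:
    # predicting the symbol before last requires remembering the second symbol.
    second = rng.choice(['T', 'P'])
    return ['B', second] + reber_string(rng) + [second, 'E']

print(''.join(embedded_reber_string(random.Random(0))))
```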

Comparison. We compare LSTM to Elman nets trained by Elman's training procedure (ELM; results taken from Cleeremans et al.), Fahlman's "Recurrent Cascade-Correlation" (RCC; results taken from Fahlman), and RTRL (results taken from Smith and Zipser, where only the few successful trials are listed). It should be mentioned that Smith and Zipser actually make the task easier by increasing the probability of short time lag exemplars. We did not do this for LSTM.

Training/Testing. We use a local input/output representation (7 input units, 7 output units). Following Fahlman, we use 256 training strings and 256 separate test strings. The training set is generated randomly; training exemplars are picked randomly from the training set. Test sequences are generated randomly, too, but sequences already used in the training set are not used for testing. After string presentation, all activations are reinitialized with zeros. A trial is considered successful if all string symbols of all sequences in both test set and training set are predicted correctly, that is, if the output unit(s) corresponding to the possible next symbol(s) is (are) always the most active ones.

Architectures. Architectures for RTRL, ELM, and RCC are reported in the references listed above. For LSTM, we use 3 (or 4) memory cell blocks, each with 2 (respectively 1) memory cells. The output layer's only incoming connections originate at memory cells. Each memory cell and each gate unit receives incoming connections from all memory cells and gate units (the hidden layer is fully connected; less connectivity may work as well). The input layer has forward connections to all units in the hidden layer. The gate units are biased. These architecture parameters make it easy to store at least 3 input signals (the two block configurations are employed to obtain comparable numbers of weights for both architectures). Other parameters may be appropriate as well, however. All sigmoid functions are logistic with output range [0, 1], except for h, whose range is [-1, 1], and g, whose range is [-2, 2]. All weights are initialized in [-0.2, 0.2], except for the output gate biases, which are initialized to -1, -2, and -3, respectively (see abuse problem, solution (2) of Section 4). We tried learning rates of 0.1, 0.2, and 0.5.

Results. We use several different, randomly generated pairs of training and test sets, and run multiple trials with different initial weights for each such pair. See Table 1 for results (means over all trials). Unlike the other methods, LSTM always learns to solve the task. Even when we ignore the unsuccessful trials of the other approaches, LSTM learns much faster.

Importance of output gates. The experiment provides a nice example where the output gate is truly beneficial. Learning to store the first T or P should not perturb activations representing the more easily learnable transitions of the original Reber grammar. This is the job of the output gates. Without output gates, we did not achieve fast learning.

5.2 EXPERIMENT 2: NOISE-FREE AND NOISY SEQUENCES

Task 2a: noise-free sequences with long time lags. There are p+1 possible input symbols, denoted $a_1, \ldots, a_{p-1}, a_p = x, a_{p+1} = y$. $a_i$ is locally represented by the (p+1)-dimensional vector whose i-th component is 1 (all other components are 0). A net with p+1 input units and p+1 output units sequentially observes input symbol sequences, one at a time, permanently trying to predict the next symbol; error signals occur at every single time step. To emphasize the "long time lag problem", we use a training set consisting of only two very similar sequences: $(y, a_1, a_2, \ldots, a_{p-1}, y)$ and $(x, a_1, a_2, \ldots, a_{p-1}, x)$. Each is selected with probability 0.5. To predict the final element, the net has to learn to store a representation of the first element for p time steps.
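A generator for the two training sequences of Task 2a might look as follows (a sketch of ours following the encoding just described; the particular one-hot index layout is an assumption about ordering only).

```python
import numpy as np

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def task_2a_sequence(p, rng):
    """Return one of the two Task 2a training sequences as a list of one-hot vectors.

    Symbols a_1..a_{p-1} get indices 0..p-2, x gets index p-1, y gets index p.
    """
    x_idx, y_idx = p - 1, p
    first_and_last = y_idx if rng.random() < 0.5 else x_idx   # each class with probability 0.5
    middle = list(range(p - 1))                               # deterministic a_1, ..., a_{p-1}
    symbol_ids = [first_and_last] + middle + [first_and_last]
    return [one_hot(i, p + 1) for i in symbol_ids]

seq = task_2a_sequence(p=100, rng=np.random.default_rng(0))
print(len(seq))   # p + 1 symbols; the first must be stored for p steps to predict the last
```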

We compare "Real-Time Recurrent Learning" for fully recurrent nets (RTRL), "Back-Propagation Through Time" (BPTT), the sometimes very successful 2-net "Neural Sequence Chunker" (CH, Schmidhuber), and our new method (LSTM). In all cases, weights are initialized in [-0.2, 0.2]. Due to limited computation time, training is stopped after 5 million sequence presentations.

[Table 1 layout: method | hidden units | # weights | learning rate | % of success | success after. Rows: RTRL (two configurations, "% of success" given as "some fraction"), ELM, RCC, and five LSTM configurations with different numbers of blocks, block sizes, and learning rates.]

Table 1: EXPERIMENT 1: Embedded Reber grammar: percentage of successful trials and number of sequence presentations until success, for RTRL (results taken from Smith and Zipser), Elman nets trained by Elman's procedure (results taken from Cleeremans et al.), Recurrent Cascade-Correlation (results taken from Fahlman), and our new approach (LSTM). Weight numbers in the first 4 rows are estimates; the corresponding papers do not provide all the technical details. Only LSTM almost always learns to solve the task (only two failures across all trials). Even when we ignore the unsuccessful trials of the other approaches, LSTM learns much faster; the number of required training examples in the bottom row varies from trial to trial.

A "successful run" is one that fulfills the following criterion: after training, during 10,000 successive, randomly chosen input sequences, the maximal absolute error of all output units is always below 0.25.

Architectures. RTRL: one self-recurrent hidden unit, p+1 non-recurrent output units. Each layer has connections from all layers below. All units use the logistic sigmoid in [0, 1].

BPTT: same architecture as the one trained by RTRL.

CH: both net architectures like RTRL's, but one has an additional output for predicting the hidden unit of the other one (see Schmidhuber for details).

LSTM: like with RTRL, but the hidden unit is replaced by a memory cell and an input gate (no output gate required). g is the logistic sigmoid, and h is the identity function h: h(x) = x, for all x. Memory cell and input gate are added once the error has stopped decreasing (see abuse problem, solution (1) of Section 4).

Results. Using RTRL and a short time lag (p = 4), some trials were successful. No trial was successful with p = 10. With long time lags, only the neural sequence chunker and LSTM achieved successful trials, while BPTT and RTRL failed. With p = 100, the 2-net sequence chunker solved the task in only a fraction of all trials. LSTM, however, always learned to solve the task. Comparing successful trials only, LSTM learned much faster. See Table 2 for details. It should be mentioned, however, that a hierarchical chunker can also always quickly solve this task (Schmidhuber).

Task 2b: no local regularities. With the task above, the chunker sometimes learns to correctly predict the final element, but only because of predictable local regularities in the input stream that allow for compressing the sequence. In an additional, more difficult task (involving many more different possible sequences), we remove compressibility by replacing the deterministic subsequence $(a_1, a_2, \ldots, a_{p-1})$ by a random subsequence (of length p-1) over the alphabet $a_1, a_2, \ldots, a_{p-1}$. We obtain 2 classes (two sets of sequences) $\{(y, a_{i_1}, a_{i_2}, \ldots, a_{i_{p-1}}, y) \mid 1 \leq i_1, i_2, \ldots, i_{p-1} \leq p-1\}$ and $\{(x, a_{i_1}, a_{i_2}, \ldots, a_{i_{p-1}}, x) \mid 1 \leq i_1, i_2, \ldots, i_{p-1} \leq p-1\}$. Again, every next sequence element has to be predicted. The only totally predictable targets, however, are x and y, which occur at sequence ends. Training exemplars are chosen randomly from the 2 classes. Architectures and parameters are the same as in Experiment 2a. A "successful run" is one that fulfills the following criterion: after training, during 10,000 successive, randomly chosen input sequences, the maximal absolute error of all output units is below 0.25 at sequence end.

[Table 2 layout: method | delay p | learning rate | # weights | % successful trials | success after. Rows: RTRL (five configurations), BPTT, CH, LSTM.]

Table 2: Task 2a: percentage of successful trials and number of training sequences until success, for "Real-Time Recurrent Learning" (RTRL), "Back-Propagation Through Time" (BPTT), neural sequence chunking (CH), and the new method (LSTM). Table entries refer to means over all trials. With 100 time step delays, only CH and LSTM achieve successful trials. Even when we ignore the unsuccessful trials of the other approaches, LSTM learns much faster.

Results. As expected, the chunker failed to solve this task (so did BPTT and RTRL, of course). LSTM, however, was always successful; averaged over all trials, success for p = 100 was achieved after relatively few sequence presentations. This demonstrates that LSTM does not require sequence regularities to work well.

Task 2c: very long time lags, no local regularities. This is the most difficult task in this subsection. To our knowledge, no other recurrent net algorithm can solve it. Now there are p+4 possible input symbols, denoted $a_1, \ldots, a_{p-1}, a_p, a_{p+1} = e, a_{p+2} = b, a_{p+3} = x, a_{p+4} = y$. $a_1, \ldots, a_p$ are also called "distractor symbols". Again, $a_i$ is locally represented by the (p+4)-dimensional vector whose i-th component is 1 (all other components are 0). A net with p+4 input units and 2 output units sequentially observes input symbol sequences, one at a time. Training sequences are randomly chosen from the union of two very similar subsets of sequences: $\{(b, y, a_{i_1}, a_{i_2}, \ldots, a_{i_{q+k}}, e, y) \mid 1 \leq i_1, i_2, \ldots, i_{q+k} \leq q\}$ and $\{(b, x, a_{i_1}, a_{i_2}, \ldots, a_{i_{q+k}}, e, x) \mid 1 \leq i_1, i_2, \ldots, i_{q+k} \leq q\}$. To produce a training sequence, we (1) randomly generate a sequence prefix of length q+2, (2) randomly generate a sequence suffix of additional elements ($\neq b, e, x, y$) with probability 9/10 or, alternatively, an e with probability 1/10. In the latter case, we conclude the sequence with x or y, depending on the second element.

For a given k, this leads to a uniform distribution on the possible sequences with length q+k+4. The minimal sequence length is q+4; the expected length is

$$4 + \sum_{k=1}^{\infty} \left(\frac{1}{10}\right)\left(\frac{9}{10}\right)^{k-1} (q+k) = q + 14 .$$

The expected number of occurrences of element $a_i$, $1 \leq i \leq p$, in a sequence is $\frac{q+10}{p} \approx \frac{q}{p}$.

The goal is to predict the last symbol, which always occurs after the "trigger symbol" e. Error signals are generated only at sequence ends. To predict the final element, the net has to learn to store a representation of the second element for at least q time steps (until it sees the trigger symbol e). Success is defined as "prediction error for final sequence element of both output units always below 0.2, for 10,000 successive, randomly chosen input sequences".
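A sketch of a Task 2c sequence generator, following the construction above; the 9/10 continuation probability is taken from that description, while drawing distractor indices uniformly from a_1..a_p and the exact prefix layout are simplifying assumptions of ours.

```python
import random

def task_2c_sequence(p, q, rng):
    """Generate one Task 2c training sequence as a list of symbol names.

    Distractors are 'a1'..'ap'; special symbols are 'e', 'b', 'x', 'y'.
    """
    distractor = lambda: 'a%d' % rng.randint(1, p)           # uniform over the p distractors
    second = rng.choice(['x', 'y'])
    seq = ['b', second] + [distractor() for _ in range(q)]   # prefix of length q + 2
    while rng.random() < 0.9:                                # extend the suffix with probability 9/10
        seq.append(distractor())
    seq += ['e', second]                                     # trigger symbol, then the target symbol
    return seq                                               # minimal length q + 4

s = task_2c_sequence(p=100, q=100, rng=random.Random(0))
print(len(s), s[:4], s[-2:])
```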

Architecture/Learning. The net has p+4 input units and 2 output units. Weights are initialized in [-0.2, 0.2]. To avoid too much learning time variance due to different weight initializations, the hidden layer gets two memory cells (two cell blocks of size 1, although one would be sufficient). There are no other hidden units. The output layer receives connections only from memory cells. Memory cells and gate units receive connections from input units, memory cells, and gate units (i.e., the hidden layer is fully connected). No bias weights are used. h and g are logistic sigmoids with output ranges [-1, 1] and [-2, 2], respectively. The learning rate is 0.01.

[Table 3 layout: q (time lag) | p (number of random inputs) | # weights | q/p | success after. Rows list the tested (q, p) pairs in two blocks.]

Table 3: Task 2c: LSTM with very long minimal time lags q and a lot of noise. p is the number of available distractor symbols (p+4 is the number of input units). q/p is the expected number of occurrences of a given distractor symbol in a sequence. The rightmost column lists the number of training sequences required by LSTM; BPTT, RTRL, and the other competitors have no chance of solving this task. If we let the number of distractor symbols (and weights) increase in proportion to the time lag, learning time increases very slowly. The lower block illustrates the expected slowdown due to increased frequency of distractor symbols.

Note that the minimal time lag is q; the net never sees short training sequences facilitating the classification of long test sequences.

Results. Trials were made for all tested pairs (p, q). Table 3 lists the mean number of training sequences required by LSTM to achieve success. BPTT and RTRL have no chance of solving non-trivial tasks with minimal time lags of 1000 steps.

Scaling. Table 3 shows that if we let the number of input symbols (and weights) increase in proportion to the time lag, learning time increases very slowly. This is another remarkable property of LSTM not shared by any other method we are aware of. Indeed, RTRL and BPTT are far from scaling reasonably; instead, they appear to scale exponentially, and appear quite useless when the time lags exceed as few as 10 steps.

Distractor influence. In Table 3, the column headed by q/p gives the expected frequency of distractor symbols. Increasing this frequency decreases learning speed, an effect due to weight oscillations caused by frequently observed input symbols.

5.3 EXPERIMENT 3: NOISE AND SIGNAL ON SAME CHANNEL

This experiment serves to illustrate that LSTM does not encounter fundamental problems if noise and signal are mixed on the same input line. We initially focus on Bengio et al.'s simple "2-sequence problem"; in Experiment 3c we will then pose a more challenging 2-sequence problem.

Task 3a (2-sequence problem). The task is to observe and then classify input sequences. There are two classes, each occurring with probability 0.5. There is only one input line. Only the first N real-valued sequence elements convey relevant information about the class. Sequence elements at positions t > N are generated by a Gaussian with mean zero and variance 0.2. Case N = 1: the first sequence element is 1.0 for class 1, and -1.0 for class 2. Case N = 3: the first three elements are 1.0 for class 1 and -1.0 for class 2. The target at the sequence end is 1.0 for class 1 and 0.0 for class 2. Correct classification is defined as "absolute output error at sequence end below 0.2". Given a constant T, the sequence length is randomly selected between T and T + T/10 (a difference to Bengio et al.'s problem is that they also permit shorter sequences of length T/2).
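A data generator for Task 3a might look like this (a sketch of ours following the description above; T and N are the parameters named there, and the random seed is arbitrary).

```python
import numpy as np

def two_sequence_example(T, N, rng):
    """One (sequence, target) pair for the 2-sequence problem.

    The first N elements are +1.0 (class 1) or -1.0 (class 2); the rest is Gaussian noise.
    """
    length = rng.integers(T, T + T // 10 + 1)          # random length in [T, T + T/10]
    cls = rng.integers(1, 3)                           # class 1 or 2, each with probability 0.5
    seq = rng.normal(0.0, np.sqrt(0.2), size=length)   # noise with mean 0 and variance 0.2
    seq[:N] = 1.0 if cls == 1 else -1.0
    target = 1.0 if cls == 1 else 0.0                  # target given at the sequence end only
    return seq, target

seq, target = two_sequence_example(T=100, N=1, rng=np.random.default_rng(0))
print(len(seq), seq[:3], target)
```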

Guessing. Bengio et al. and Bengio and Frasconi tested different methods on the 2-sequence problem. We discovered, however, that random weight guessing easily outper-

[Table 4 layout: T | N | stop: ST1 | stop: ST2 | # weights | ST2: fraction misclassified. Rows list the tested (T, N) pairs.]

Table 4: Task 3a: Bengio et al.'s 2-sequence problem. T is the minimal sequence length, N is the number of information-conveying elements at the sequence begin. The columns headed by ST1 (ST2) give the number of sequence presentations required to achieve stopping criterion ST1 (ST2). The rightmost column lists the fraction of misclassified post-training sequences (absolute error at least 0.2) from a test set consisting of 2560 sequences, tested after ST2 was achieved. All values are means over all trials. We discovered, however, that this problem is so simple that random weight guessing solves it faster than LSTM and any other method for which there are published results.

forms them all, because the problem is so simple. See the reports by Schmidhuber and Hochreiter, and by Hochreiter and Schmidhuber, for additional results in this vein.

LSTM architecture. We use a 3-layer net with 1 input unit, 1 output unit, and 3 cell blocks of size 1. The output layer receives connections only from memory cells. Memory cells and gate units receive inputs from input units, memory cells, and gate units, and have bias weights. Gate units and output unit are logistic sigmoid in [0, 1], h is in [-1, 1], and g is in [-2, 2].

Training/Testing. All weights (except the bias weights to gate units) are randomly initialized in the range [-0.1, 0.1]. The first input gate bias is initialized with -1.0, the second with -3.0, and the third with -5.0. The first output gate bias is initialized with -2.0, the second with -4.0, and the third with -6.0. (The precise initialization values hardly matter, though, as confirmed by additional experiments.) The learning rate is 1.0. All activations are reset to zero at the beginning of a new sequence.

We stop training (and judge the task as being solved) according to the following criteria: ST1: none of 256 sequences from a randomly chosen test set is misclassified; ST2: ST1 is satisfied, and mean absolute test set error is below 0.01. In case of ST2, an additional test set consisting of 2560 randomly chosen sequences is used to determine the fraction of misclassified sequences.

Results. See Table 4. The results are means over 10 trials with different weight initializations in the range [-0.1, 0.1]. LSTM is able to solve this problem, though by far not as fast as random weight guessing (see paragraph "Guessing" above). Clearly, this trivial problem does not provide a very good testbed to compare the performance of various non-trivial algorithms. Still, it demonstrates that LSTM does not encounter fundamental problems when faced with signal and noise on the same channel.

Task 3b. Architecture, parameters, etc. like in Task 3a, but now with Gaussian noise (mean 0 and variance 0.2) added to the information-conveying elements (t <= N). We stop training (and judge the task as being solved) according to the following, slightly redefined criteria: ST1: less than 6 out of 256 sequences from a randomly chosen test set are misclassified; ST2: ST1 is satisfied, and mean absolute test set error is below 0.04. In case of ST2, an additional test set consisting of 2560 randomly chosen sequences is used to determine the fraction of misclassified sequences.

Results. See Table 5. The results represent means over 10 trials with different weight initializations. LSTM easily solves the problem.

Task 3c. Architecture, parameters, etc. like in Task 3a, but with a few essential changes that make the task non-trivial: the targets are 0.2 and 0.8 for class 1 and class 2, respectively, and there is Gaussian noise on the targets (mean 0 and variance 0.1, i.e., standard deviation approximately 0.32). To minimize mean squared error, the system has to learn the conditional expectations of the targets given the inputs. Misclassification is defined as "absolute difference between output and noise-free target (0.2 for class 1 and 0.8 for class 2) of at least 0.1".

[5] It should be mentioned, however, that different input representations and different types of noise may lead to worse guessing performance (Yoshua Bengio, personal communication).

[Table 5 layout: same columns as Table 4 (T, N, stop: ST1, stop: ST2, # weights, ST2: fraction misclassified).]

Table 5: Task 3b, modified 2-sequence problem. Same as in Table 4, but now the information-conveying elements are also perturbed by noise.

[Table 6 layout: T | N | stop | # weights | fraction misclassified | average difference to mean.]

Table 6: Task 3c, modified, more challenging 2-sequence problem. Same as above, but with noisy, real-valued targets. The system has to learn the conditional expectations of the targets given the inputs. The rightmost column provides the average difference between network output and expected target. Unlike Tasks 3a and 3b, this task cannot be solved quickly by random weight guessing.

The network output is considered acceptable if the mean absolute difference between noise-free target and output is below 0.015. Since this requires high weight precision, Task 3c (unlike Tasks 3a and 3b) cannot be solved quickly by random guessing.

Training/Testing. We stop training according to the following criterion: none of 256 sequences from a randomly chosen test set is misclassified, and the mean absolute difference between noise-free target and output is below 0.015. An additional test set consisting of 2560 randomly chosen sequences is used to determine the fraction of misclassified sequences.

Results. See Table 6. The results represent means over 10 trials with different weight initializations. Despite the noisy targets, LSTM still can solve the problem by learning the expected target values.

5.4 EXPERIMENT 4: ADDING PROBLEM

The difficult task in this section is of a type that has never been solved by other recurrent net algorithms. It shows that LSTM can solve long time lag problems involving distributed, continuous-valued representations.

Task. Each element of each input sequence is a pair of components. The first component is a real value randomly chosen from the interval [-1, 1]; the second is either 1.0, 0.0, or -1.0, and is used as a marker: at the end of each sequence, the task is to output the sum of the first components of those pairs that are marked by second components equal to 1.0. Sequences have random lengths between the minimal sequence length T and T + T/10. In a given sequence, exactly two pairs are marked, as follows: we first randomly select and mark one of the first ten pairs (whose first component we call $X_1$). Then we randomly select and mark one of the first T/2 - 1 still unmarked pairs (whose first component we call $X_2$). The second components of all remaining pairs are zero, except for the first and final pair, whose second components are -1. (In the rare case where the first pair of the sequence gets marked, we set $X_1$ to zero.) An error signal is generated only at the sequence end: the target is $0.5 + \frac{X_1 + X_2}{4.0}$ (the sum $X_1 + X_2$ scaled to the interval [0, 1]). A sequence is processed correctly if the absolute error at the sequence end is below 0.04.
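The adding-problem data can be generated as follows (a sketch of ours following the construction above; representing each sequence as a 2-column array is an assumption about encoding only, and the code assumes T is large enough that the first ten pairs fall within the first T/2 - 1 positions).

```python
import numpy as np

def adding_example(T, rng):
    """One (sequence, target) pair for the adding problem; the sequence has shape (length, 2)."""
    length = rng.integers(T, T + T // 10 + 1)
    values = rng.uniform(-1.0, 1.0, size=length)       # first components
    markers = np.zeros(length)
    markers[0] = markers[-1] = -1.0                    # first and final pair carry marker -1
    i = rng.integers(0, 10)                            # one of the first ten pairs
    j = i
    while j == i:
        j = rng.integers(0, T // 2 - 1)                # one of the first T/2 - 1 pairs, still unmarked
    markers[i] = markers[j] = 1.0
    x1 = 0.0 if i == 0 else values[i]                  # rare case: first pair marked -> X1 := 0
    x2 = values[j]
    target = 0.5 + (x1 + x2) / 4.0                     # sum scaled to [0, 1]
    return np.stack([values, markers], axis=1), target

seq, target = adding_example(T=100, rng=np.random.default_rng(0))
print(seq.shape, target)
```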

Architecture. We use a 3-layer net with 2 input units, 1 output unit, and 2 cell blocks of size 2. The output layer receives connections only from memory cells. Memory cells and gate units receive inputs from memory cells and gate units (i.e., the hidden layer is fully connected; less connectivity may work as well). The input layer has forward connections to all units in the hidden

[Table 7 layout: T | minimal time lag | # weights | # wrong predictions ("out of" the test set size) | success after. Rows list the three tested values of T.]

Table 7: EXPERIMENT 4: Results for the Adding Problem. T is the minimal sequence length, T/2 the minimal time lag. "# wrong predictions" is the number of incorrectly processed sequences (error above 0.04) from a test set containing 2560 sequences. The rightmost column gives the number of training sequences required to achieve the stopping criterion. All values are means over several trials. For the largest T, the number of required training examples varies considerably from trial to trial.

All non-input units have bias weights. These architecture parameters make it easy to store at least 2 input signals (a cell block size of 1 works well, too). All activation functions are logistic with output range [0, 1], except for h, whose range is [-1, 1], and g, whose range is [-2, 2].

State drift versus initial bias: Note that the task requires storing the precise values of real numbers for long durations; the system must learn to protect memory cell contents against even minor internal state drift (see the discussion of internal state drift earlier in the paper). To study the significance of the drift problem, we make the task even more difficult by biasing all non-input units, thus artificially inducing internal state drift. All weights (including the bias weights) are randomly initialized in the range [-0.1, 0.1]. Following the remedy for state drifts discussed earlier, the first input gate bias is initialized with -3.0, the second with -6.0 (though the precise values hardly matter, as confirmed by additional experiments).

Training/Testing: The learning rate is listed in the summary of experimental conditions below. Training is stopped once the average training error is below 0.01 and the 2000 most recent sequences were processed correctly.

Results: With a test set consisting of 2560 randomly chosen sequences, the average test set error was always below 0.01, and there were never more than 3 incorrectly processed sequences. The table above shows details.

The experiment demonstrates: (1) LSTM is able to work well with distributed representations; (2) LSTM is able to learn to perform calculations involving continuous values. Since the system manages to store continuous values without deterioration for minimal delays of T/2 time steps, there is no significant, harmful internal state drift.

EXPERIMENT 5: MULTIPLICATION PROBLEM

One may argue that LSTM is a bit biased towards tasks such as the Adding Problem from the previous subsection. Solutions to the Adding Problem may exploit the CEC's built-in integration capabilities. Although this CEC property may be viewed as a feature rather than a disadvantage (integration seems to be a natural subtask of many tasks occurring in the real world), the question arises whether LSTM can also solve tasks with inherently non-integrative solutions. To test this, we change the problem by requiring the final target to equal the product (instead of the sum) of earlier marked inputs.

Task: Like the task in the previous subsection, except that the first component of each pair is a real value randomly chosen from the interval [0, 1]. In the rare case where the first pair of the input sequence gets marked, we set X1 to 1.0. The target at sequence end is the product X1 * X2.
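Relative to the adding-problem generator sketched above, only the input interval and the target computation change; a minimal sketch of the modified target (our own illustration) is:

```python
def multiplication_target(values, i1, i2):
    """Target for the Multiplication Problem variant (sketch): first components
    are drawn from [0, 1]; a marked first pair contributes the neutral factor 1.0."""
    x1 = 1.0 if i1 == 0 else values[i1]
    x2 = 1.0 if i2 == 0 else values[i2]
    return x1 * x2
```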

Architecture: Like in the previous subsection. All weights (including the bias weights) are randomly initialized in the range [-0.1, 0.1].

Training/Testing: The learning rate is listed in the summary of experimental conditions below. We test performance twice: as soon as less than n_seq of the most recent training sequences lead to absolute errors exceeding the acceptance bound, once for a larger and once for a smaller value of n_seq. Why these values? The larger n_seq is sufficient to learn storage of the relevant inputs; it is not enough, though, to fine-tune the precise final outputs. The smaller n_seq, however, leads to quite satisfactory results (see below).

T minimal lag weights n wrong predictions MSE Success after

seq

out of

out of

Table: EXPERIMENT 5, results for the Multiplication Problem. T is the minimal sequence length, T/2 the minimal time lag. We test on a test set containing 2560 sequences as soon as less than n_seq of the most recent training sequences lead to excessive error. "Wrong predictions" is the number of test sequences with excessive error. MSE is the mean squared error on the test set. The rightmost column lists numbers of training sequences required to achieve the stopping criterion. All values are means over the trials.

Results: For both values of n_seq, with a test set consisting of 2560 randomly chosen sequences, the average test set error was always small, and there were never more than a few incorrectly processed sequences. The table above shows details. A net with additional standard hidden units, or with a hidden layer above the memory cells, may learn the fine-tuning part more quickly.

The experiment demonstrates: LSTM can solve tasks involving both continuous-valued representations and non-integrative information processing.

EXPERIMENT 6: TEMPORAL ORDER

In this subsection, LSTM solves other difficult (but artificial) tasks that have never been solved by previous recurrent net algorithms. The experiment shows that LSTM is able to extract information conveyed by the temporal order of widely separated inputs.

Task 6a, two relevant, widely separated symbols: The goal is to classify sequences. Elements and targets are represented locally (input vectors with only one nonzero bit). The sequence starts with an E, ends with a B (the "trigger symbol"), and otherwise consists of randomly chosen symbols from the set {a, b, c, d}, except for two elements at positions t1 and t2 that are either X or Y. The sequence length is randomly chosen between 100 and 110, t1 is randomly chosen between 10 and 20, and t2 is randomly chosen between 50 and 60. There are 4 sequence classes, Q, R, S, U, which depend on the temporal order of X and Y. The rules are: X, X -> Q; X, Y -> R; Y, X -> S; Y, Y -> U.

Task 6b, three relevant, widely separated symbols: Again, the goal is to classify sequences. Elements/targets are represented locally. The sequence starts with an E, ends with a B (the "trigger symbol"), and otherwise consists of randomly chosen symbols from the set {a, b, c, d}, except for three elements at positions t1, t2 and t3 that are either X or Y. The sequence length is randomly chosen between 100 and 110, t1 is randomly chosen between 10 and 20, t2 is randomly chosen between 33 and 43, and t3 is randomly chosen between 66 and 76. There are 8 sequence classes, Q, R, S, U, V, A, B, C, which depend on the temporal order of the Xs and Ys. The rules are: X, X, X -> Q; X, X, Y -> R; X, Y, X -> S; X, Y, Y -> U; Y, X, X -> V; Y, X, Y -> A; Y, Y, X -> B; Y, Y, Y -> C.

There are as many output units as there are classes. Each class is locally represented by a binary target vector with one nonzero component. With both tasks, error signals occur only at the end of a sequence. The sequence is classified correctly if the final absolute error of all output units is below 0.3.
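A small Python sketch of a Task 6a sequence generator may make the setup concrete (our own illustration; it uses the symbol set, position ranges and class rules stated above):

```python
import random

SYMBOLS = ['a', 'b', 'c', 'd']

def temporal_order_sequence():
    """One Task 6a sequence (sketch).

    Returns (sequence, label): the sequence starts with 'E', ends with the
    trigger 'B', and contains 'X' or 'Y' at two random positions t1 < t2;
    the class label encodes the temporal order of the two relevant symbols.
    """
    length = random.randint(100, 110)
    seq = ['E'] + [random.choice(SYMBOLS) for _ in range(length - 2)] + ['B']
    t1 = random.randint(10, 20)
    t2 = random.randint(50, 60)
    seq[t1] = random.choice(['X', 'Y'])
    seq[t2] = random.choice(['X', 'Y'])
    label = {('X', 'X'): 'Q', ('X', 'Y'): 'R',
             ('Y', 'X'): 'S', ('Y', 'Y'): 'U'}[(seq[t1], seq[t2])]
    return seq, label
```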

Architecture: We use a 3-layer net with 8 input units, 2 (3) cell blocks of size 2, and 4 (8) output units for Task 6a (6b). Again, all non-input units have bias weights, and the output layer receives connections from memory cells only. Memory cells and gate units receive inputs from input units, memory cells and gate units (i.e., the hidden layer is fully connected; less connectivity may work as well). The architecture parameters for Task 6a (6b) make it easy to store at least 2 (3) input signals. All activation functions are logistic with output range [0, 1], except for h, whose range is [-1, 1], and g, whose range is [-2, 2].

Training/Testing: The learning rates for Experiments 6a and 6b are listed in the summary of experimental conditions below. Training is stopped once the average training error falls below the stopping threshold and the most recent sequences were classified correctly. All weights are initialized in the range [-0.1, 0.1]. The first input gate bias is initialized with a negative value, the second with a more negative one, and (for Experiment 6b) the third with a still more negative one (again, we confirmed by additional experiments that the precise values hardly matter).

Results: With a test set consisting of 2560 randomly chosen sequences, the average test set error was always below 0.1, and there were never more than a few incorrectly classified sequences. The table below shows details.

The experiment shows that LSTM is able to extract information conveyed by the temporal order of widely separated inputs. In Task 6a, for instance, the delays between first and second relevant input, and between second relevant input and sequence end, are at least 30 time steps.

task weights wrong predictions Success after

Task a out of

Task b out of

Table: EXPERIMENT 6, results for the Temporal Order Problem. "Wrong predictions" is the number of incorrectly classified sequences (error above 0.3 for at least one output unit) from a test set containing 2560 sequences. The rightmost column gives the number of training sequences required to achieve the stopping criterion. The results for Task 6a and Task 6b are means over the respective trials.

Typical solutions: In Experiment 6a, how does LSTM distinguish between the temporal orders (X, Y) and (Y, X)? One of many possible solutions is to store the first X or Y in cell block 1 and the second X/Y in cell block 2. Before the first X/Y occurs, block 1 can see that it is still "empty" by means of its recurrent connections. After the first X/Y, block 1 can close its input gate. Once block 1 is filled and closed, this fact will become visible to block 2 (recall that all gate units and all memory cells receive connections from all non-output units).

Typical solutions, however, require only one memory cell block. The block stores the first X or Y; once the second X/Y occurs, it changes its state depending on the first stored symbol. Solution type 1 exploits the connection between memory cell output and input gate unit: the following events cause different input gate activations: "X occurs in conjunction with a filled block" and "X occurs in conjunction with an empty block". Solution type 2 is based on a strong positive connection between memory cell output and memory cell input. The previous occurrence of X (Y) is represented by a positive (negative) internal state. Once the input gate opens for the second time, so does the output gate, and the memory cell output is fed back to its own input. This causes (X, Y) to be represented by a positive internal state, because X contributes to the new internal state twice (via the current internal state and the cell output feedback). Similarly, (Y, X) gets represented by a negative internal state.

SUMMARY OF EXPERIMENTAL CONDITIONS

The two tables in this subsection provide an overview of the most important LSTM parameters and architectural details for the experiments above. The conditions of the simple experiments 2a and 2b differ slightly from those of the other, more systematic experiments (due to historical reasons).

Task p lag b s in out w c ogb igb bias h g

F r ga h g

F r ga h g


F r ga h g

F r ga h g

F r ga h g

a B no og none none id g

b B no og none none id g

c F none none none h g

c F none none none h g

c F none none none h g

c F none none none h g

c F none none none h g

c F none none none h g

c F none none none h g

c F none none none h g

c F none none none h g

a F b h g

b F b h g

c F b h g

F r all h g

F r all h g

F r all h g

F r r all h g

a F r all h g

b F r all h g

Table: Summary of experimental conditions for LSTM, Part I. 1st column: task number. 2nd column: minimal sequence length p. 3rd column: minimal number of steps between most recent relevant input information and teacher signal. 4th column: number of cell blocks b. 5th column: block size s. 6th column: number of input units (in). 7th column: number of output units (out). 8th column: number of weights w. 9th column: c describes connectivity; "F" means the output layer receives connections from memory cells, and memory cells and gate units receive connections from input units, memory cells and gate units; "B" means each layer receives connections from all layers below. 10th column: initial output gate bias og b, where "r" stands for "randomly chosen from the initialization interval" and "no og" means no output gate used. 11th column: initial input gate bias ig b (see 10th column). 12th column: which units have bias weights; "b" stands for all hidden units, "ga" for only gate units, and "all" for all non-input units. 13th column: the function h, where "id" is the identity function and the other entry is the logistic sigmoid variant defined in the text. 14th column: the function g (a sigmoid variant defined in the text). 15th column: learning rate.

Task select interval test set size stopping criterion success

t training test correctly pred see text

a t no test set after million exemplars ABS

b t after million exemplars ABS

c t after million exemplars ABS

a t ST and ST see text ABS

b t ST and ST see text ABS

c t ST and ST see text see text

t ST ABS

t see text ABS

a t ST ABS

b t ST ABS

Table: Summary of experimental conditions for LSTM, Part II. 1st column: task number. 2nd column: training exemplar selection, where "t1" stands for "randomly chosen from training set", "t2" for "randomly chosen from the task's classes", and "t3" for "randomly generated on-line". 3rd column: weight initialization interval. 4th column: test set size. 5th column: stopping criterion for training, where "ST" stands for "average training error below the given threshold and the most recent sequences were processed correctly". 6th column: success (correct classification) criterion, where "ABS" stands for "absolute error of all output units at sequence end is below the given threshold".

DISCUSSION

Limitations of LSTM

The particularly efficient truncated backprop version of the LSTM algorithm will not easily solve problems similar to "strongly delayed XOR problems", where the goal is to compute the XOR of two widely separated inputs that previously occurred somewhere in a noisy sequence. The reason is that storing only one of the inputs will not help to reduce the expected error: the task is non-decomposable, in the sense that it is impossible to incrementally reduce the error by first solving an easier subgoal.

In theory, this limitation can be circumvented by using the full gradient (perhaps with additional conventional hidden units receiving input from the memory cells). But we do not recommend computing the full gradient, for the following reasons: (1) It increases computational complexity. (2) Constant error flow through CECs can be shown only for truncated LSTM. (3) We actually did conduct a few experiments with non-truncated LSTM. There was no significant difference to truncated LSTM, exactly because outside the CECs error flow tends to vanish quickly. For the same reason, full BPTT does not outperform truncated BPTT.

Each memory cell block needs two additional units (input and output gate). In comparison to standard recurrent nets, however, this does not increase the number of weights by more than a factor of 9: each conventional hidden unit is replaced by at most 3 units in the LSTM architecture, increasing the number of weights by a factor of 9 in the fully connected case. Note, however, that our experiments use quite comparable weight numbers for the architectures of LSTM and the competing approaches.

Generally speaking, due to its constant error flow through CECs within memory cells, LSTM runs into problems similar to those of feedforward nets seeing the entire input string at once. For instance, there are tasks that can be quickly solved by random weight guessing but not by the truncated LSTM algorithm with small weight initializations, such as the long-lag parity problem (see the introduction to the experiments section). Here LSTM's problems are similar to the ones of a feedforward net with as many inputs as there are time steps trying to solve the corresponding bit-parity problem. Indeed, LSTM typically behaves much like a feedforward net trained by backprop that sees the entire input. But that is also precisely why it so clearly outperforms previous approaches on many non-trivial tasks with significant search spaces.

LSTM does not have any problems with the notion of "recency" that go beyond those of other approaches. All gradient-based approaches, however, suffer from a practical inability to precisely count discrete time steps. If it makes a difference whether a certain signal occurred, say, 99 or 100 steps ago, then an additional counting mechanism seems necessary. Easier tasks, however, such as one that only requires distinguishing between widely different delays, do not pose any problems to LSTM. For instance, by generating an appropriate negative connection between memory cell output and input, LSTM can give more weight to recent inputs and learn decays where necessary.

Advantages of LSTM

The constant error backpropagation within memory cells results in LSTM's ability to bridge very long time lags in case of problems similar to those discussed above.

For long time lag problems such as those discussed in this paper, LSTM can handle noise, distributed representations, and continuous values. In contrast to finite state automata or hidden Markov models, LSTM does not require an a priori choice of a finite number of states. In principle, it can deal with unlimited state numbers.

For problems discussed in this paper, LSTM generalizes well, even if the positions of widely separated, relevant inputs in the input sequence do not matter. Unlike previous approaches, ours quickly learns to distinguish between two or more widely separated occurrences of a particular element in an input sequence, without depending on appropriate short-time-lag training exemplars.

There appears to be no need for parameter fine tuning. LSTM works well over a broad range of parameters such as learning rate, input gate bias, and output gate bias. For instance, to some readers the learning rates used in our experiments may seem large. However, a large learning rate pushes the output gates towards zero, thus automatically countermanding its own negative effects.

The LSTM algorithm's update complexity per weight and time step is essentially that of BPTT, namely O(1). This is excellent in comparison to other approaches such as RTRL. Unlike full BPTT, however, LSTM is local in both space and time.

CONCLUSION

Each memory cell's internal architecture guarantees constant error flow within its constant error carrousel (CEC), provided that truncated backprop cuts off error flow trying to leak out of memory cells. This represents the basis for bridging very long time lags. Two gate units learn to open and close access to error flow within each memory cell's CEC. The multiplicative input gate affords protection of the CEC from perturbation by irrelevant inputs. Likewise, the multiplicative output gate protects other units from perturbation by currently irrelevant memory contents.

Future work: To find out about LSTM's practical limitations, we intend to apply it to real world data. Application areas will include time series prediction, music composition, and speech processing. It will also be interesting to augment sequence chunkers (Schmidhuber, 1992b) by LSTM, to combine the advantages of both.

ACKNOWLEDGMENTS

Thanks to Mike Mozer, Wilfried Brauer, Nic Schraudolph, and several anonymous referees for valuable comments and suggestions that helped to improve a previous version of this paper (Hochreiter and Schmidhuber, 1995). This work was supported by DFG grant SCHM from Deutsche Forschungsgemeinschaft.

APPENDIX

A.1 ALGORITHM DETAILS

In what follows, the index k ranges over output units, i ranges over hidden units, c_j stands for the j-th memory cell block, c_j^v denotes the v-th unit of memory cell block c_j, u, l, m stand for arbitrary units, and t ranges over all time steps of a given input sequence.

The gate unit logistic sigmoid (with range [0, 1]) used in the experiments is
\[ f(x) = \frac{1}{1 + \exp(-x)} . \]
The function h (with range [-1, 1]) used in the experiments is
\[ h(x) = \frac{2}{1 + \exp(-x)} - 1 . \]
The function g (with range [-2, 2]) used in the experiments is
\[ g(x) = \frac{4}{1 + \exp(-x)} - 2 . \]

Forward pass:

The net input and the activation of hidden unit i are
\[ net_i(t) = \sum_u w_{iu}\, y^u(t-1), \qquad y^i(t) = f_i(net_i(t)). \]
The net input and the activation of the input gate in_j are
\[ net_{in_j}(t) = \sum_u w_{in_j u}\, y^u(t-1), \qquad y^{in_j}(t) = f_{in_j}(net_{in_j}(t)). \]
The net input and the activation of the output gate out_j are
\[ net_{out_j}(t) = \sum_u w_{out_j u}\, y^u(t-1), \qquad y^{out_j}(t) = f_{out_j}(net_{out_j}(t)). \]
The net input \( net_{c_j^v} \), the internal state \( s_{c_j^v} \), and the output activation \( y^{c_j^v} \) of the v-th memory cell of memory cell block c_j are
\[ net_{c_j^v}(t) = \sum_u w_{c_j^v u}\, y^u(t-1), \]
\[ s_{c_j^v}(t) = s_{c_j^v}(t-1) + y^{in_j}(t)\, g\!\left(net_{c_j^v}(t)\right), \]
\[ y^{c_j^v}(t) = y^{out_j}(t)\, h\!\left(s_{c_j^v}(t)\right). \]
The net input and the activation of output unit k are
\[ net_k(t) = \sum_{u:\ u \text{ not a gate}} w_{ku}\, y^u(t-1), \qquad y^k(t) = f_k(net_k(t)). \]
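The following Python/NumPy sketch (our own illustration, not code from the paper) implements one forward-pass time step for a single memory cell block exactly as defined by the equations above; the weight shapes and variable names are our conventions:

```python
import numpy as np

def f(x):  return 1.0 / (1.0 + np.exp(-x))         # gate sigmoid, range (0, 1)
def h(x):  return 2.0 / (1.0 + np.exp(-x)) - 1.0   # cell output squashing, range (-1, 1)
def g(x):  return 4.0 / (1.0 + np.exp(-x)) - 2.0   # cell input squashing, range (-2, 2)

def block_forward_step(y_prev, s_prev, w_in, w_out, W_cell):
    """One forward-pass time step for a single memory cell block (sketch).

    y_prev : y^u(t-1), activations of all source units feeding the block
    s_prev : s_{c_j^v}(t-1), the block's internal states
    w_in, w_out : weight vectors of the input and output gate
    W_cell : weight matrix of the block's cells (one row per cell)
    """
    y_in   = f(w_in  @ y_prev)                    # y^{in_j}(t)
    y_out  = f(w_out @ y_prev)                    # y^{out_j}(t)
    s      = s_prev + y_in * g(W_cell @ y_prev)   # CEC: s(t) = s(t-1) + y^{in} g(net_c)
    y_cell = y_out * h(s)                         # y^{c_j^v}(t) = y^{out_j}(t) h(s(t))
    return y_cell, s
```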

The backward pass (to be described later) is based on the following truncated backprop formulae.

Approximate derivatives for truncated backprop: The truncated version (see the truncation discussion in the main text) only approximates the partial derivatives, which is reflected by the "\(\approx_{tr}\)" signs in the notation below. It truncates error flow once it leaves memory cells or gate units. Truncation ensures that there are no loops across which an error that left some memory cell through its input or input gate can re-enter the cell through its output or output gate. This, in turn, ensures constant error flow through the memory cell's CEC.

In the truncated backprop version, the following derivatives are replaced by zero:
\[ \frac{\partial net_{in_j}(t)}{\partial y^u(t-1)} \approx_{tr} 0, \qquad \frac{\partial net_{out_j}(t)}{\partial y^u(t-1)} \approx_{tr} 0, \qquad \frac{\partial net_{c_j^v}(t)}{\partial y^u(t-1)} \approx_{tr} 0. \]
Therefore we get
\[ \frac{\partial y^{in_j}(t)}{\partial y^u(t-1)} = f'_{in_j}(net_{in_j}(t))\, \frac{\partial net_{in_j}(t)}{\partial y^u(t-1)} \approx_{tr} 0, \]
\[ \frac{\partial y^{out_j}(t)}{\partial y^u(t-1)} = f'_{out_j}(net_{out_j}(t))\, \frac{\partial net_{out_j}(t)}{\partial y^u(t-1)} \approx_{tr} 0, \]
and
\[ \frac{\partial y^{c_j^v}(t)}{\partial y^u(t-1)} = \frac{\partial y^{c_j^v}(t)}{\partial net_{out_j}(t)}\frac{\partial net_{out_j}(t)}{\partial y^u(t-1)} + \frac{\partial y^{c_j^v}(t)}{\partial net_{in_j}(t)}\frac{\partial net_{in_j}(t)}{\partial y^u(t-1)} + \frac{\partial y^{c_j^v}(t)}{\partial net_{c_j^v}(t)}\frac{\partial net_{c_j^v}(t)}{\partial y^u(t-1)} \approx_{tr} 0. \]
This implies, for all w_{lm} not on connections to c_j^v, in_j, out_j (that is, \( l \notin \{c_j^v, in_j, out_j\} \)):
\[ \frac{\partial y^{c_j^v}(t)}{\partial w_{lm}} \approx_{tr} \sum_u \frac{\partial y^{c_j^v}(t)}{\partial y^u(t-1)}\, \frac{\partial y^u(t-1)}{\partial w_{lm}} \approx_{tr} 0. \]

The truncated derivatives of output unit k are
\[ \frac{\partial y^k(t)}{\partial w_{lm}} = f'_k(net_k(t)) \left( \sum_{u:\ u \text{ not a gate}} w_{ku}\, \frac{\partial y^u(t-1)}{\partial w_{lm}} + \delta_{kl}\, y^m(t-1) \right) \]
\[ \approx_{tr} f'_k(net_k(t)) \cdot \begin{cases} y^m(t-1) & l = k \\[4pt] w_{k c_j^v}\, \dfrac{\partial y^{c_j^v}(t-1)}{\partial w_{lm}} & l = c_j^v \\[4pt] \displaystyle\sum_{v=1}^{S_j} w_{k c_j^v}\, \frac{\partial y^{c_j^v}(t-1)}{\partial w_{lm}} & l = in_j \ \text{or} \ l = out_j \\[4pt] \displaystyle\sum_{i:\ i \text{ hidden unit}} w_{ki}\, \frac{\partial y^i(t-1)}{\partial w_{lm}} & \text{otherwise,} \end{cases} \]
where \( \delta_{ab} \) is the Kronecker delta (\( \delta_{ab} = 1 \) if a = b and 0 otherwise), and S_j is the size of memory cell block c_j. The truncated derivatives of a hidden unit i that is not part of a memory cell are
\[ \frac{\partial y^i(t)}{\partial w_{lm}} = f'_i(net_i(t))\, \frac{\partial net_i(t)}{\partial w_{lm}} \approx_{tr} \delta_{li}\, f'_i(net_i(t))\, y^m(t-1). \]
(Note: here it would be possible to use the full gradient without affecting constant error flow through the internal states of memory cells.)

Cell block c_j's truncated derivatives are
\[ \frac{\partial y^{in_j}(t)}{\partial w_{lm}} = f'_{in_j}(net_{in_j}(t))\, \frac{\partial net_{in_j}(t)}{\partial w_{lm}} \approx_{tr} \delta_{l\, in_j}\, f'_{in_j}(net_{in_j}(t))\, y^m(t-1), \]
\[ \frac{\partial y^{out_j}(t)}{\partial w_{lm}} = f'_{out_j}(net_{out_j}(t))\, \frac{\partial net_{out_j}(t)}{\partial w_{lm}} \approx_{tr} \delta_{l\, out_j}\, f'_{out_j}(net_{out_j}(t))\, y^m(t-1), \]
\[ \frac{\partial s_{c_j^v}(t)}{\partial w_{lm}} = \frac{\partial s_{c_j^v}(t-1)}{\partial w_{lm}} + g(net_{c_j^v}(t))\, \frac{\partial y^{in_j}(t)}{\partial w_{lm}} + y^{in_j}(t)\, g'(net_{c_j^v}(t))\, \frac{\partial net_{c_j^v}(t)}{\partial w_{lm}} \]
\[ \approx_{tr} \left(\delta_{l\, in_j} + \delta_{l\, c_j^v}\right) \frac{\partial s_{c_j^v}(t-1)}{\partial w_{lm}} + \delta_{l\, in_j}\, g(net_{c_j^v}(t))\, f'_{in_j}(net_{in_j}(t))\, y^m(t-1) + \delta_{l\, c_j^v}\, y^{in_j}(t)\, g'(net_{c_j^v}(t))\, y^m(t-1), \]
\[ \frac{\partial y^{c_j^v}(t)}{\partial w_{lm}} = h(s_{c_j^v}(t))\, \frac{\partial y^{out_j}(t)}{\partial w_{lm}} + y^{out_j}(t)\, h'(s_{c_j^v}(t))\, \frac{\partial s_{c_j^v}(t)}{\partial w_{lm}} \]
\[ \approx_{tr} \delta_{l\, out_j}\, h(s_{c_j^v}(t))\, f'_{out_j}(net_{out_j}(t))\, y^m(t-1) + \left(\delta_{l\, in_j} + \delta_{l\, c_j^v}\right) y^{out_j}(t)\, h'(s_{c_j^v}(t))\, \frac{\partial s_{c_j^v}(t)}{\partial w_{lm}}. \]
To efficiently update the system at time t, the only (truncated) derivatives that need to be stored at time t are the \( \partial s_{c_j^v}(t) / \partial w_{lm} \), where l = c_j^v or l = in_j.

Backward pass: We will describe the backward pass only for the particularly efficient "truncated gradient version" of the LSTM algorithm. For simplicity we will use equal signs even where approximations are made according to the truncated backprop equations above. The squared error at time t is given by
\[ E(t) = \sum_{k:\ k \text{ output unit}} \left( t^k(t) - y^k(t) \right)^2, \]
where \( t^k(t) \) is output unit k's target at time t.

Time t's contribution to w_{lm}'s gradient-based update with learning rate \( \alpha \) is
\[ \Delta w_{lm}(t) = -\alpha\, \frac{\partial E(t)}{\partial w_{lm}}. \]
We define some unit l's error at time step t by
\[ e_l(t) := -\frac{\partial E(t)}{\partial net_l(t)}. \]
Using (almost) standard backprop, we first compute updates for weights to output units (l = k), weights to hidden units (l = i), and weights to output gates (l = out_j). We obtain (compare the truncated derivatives above):
\[ l = k \ \text{(output)}: \qquad e_k(t) = f'_k(net_k(t))\, \left( t^k(t) - y^k(t) \right), \]
\[ l = i \ \text{(hidden)}: \qquad e_i(t) = f'_i(net_i(t)) \sum_{k:\ k \text{ output unit}} w_{ki}\, e_k(t), \]
\[ l = out_j \ \text{(output gates)}: \qquad e_{out_j}(t) = f'_{out_j}(net_{out_j}(t)) \left( \sum_{v=1}^{S_j} h(s_{c_j^v}(t)) \sum_{k:\ k \text{ output unit}} w_{k c_j^v}\, e_k(t) \right). \]
For all possible l, time t's contribution to w_{lm}'s update is
\[ \Delta w_{lm}(t) = \alpha\, e_l(t)\, y^m(t-1). \]

The remaining updates, for weights to input gates (l = in_j) and to cell units (l = c_j^v), are less conventional. We define some internal state \( s_{c_j^v} \)'s error
\[ e_{s_{c_j^v}}(t) := -\frac{\partial E(t)}{\partial s_{c_j^v}(t)} = f_{out_j}(net_{out_j}(t))\, h'(s_{c_j^v}(t)) \sum_{k:\ k \text{ output unit}} w_{k c_j^v}\, e_k(t). \]
We obtain, for l = in_j or l = c_j^v (v = 1, ..., S_j),
\[ -\frac{\partial E(t)}{\partial w_{lm}} = \sum_{v=1}^{S_j} e_{s_{c_j^v}}(t)\, \frac{\partial s_{c_j^v}(t)}{\partial w_{lm}}. \]
The derivatives of the internal states with respect to weights, and the corresponding weight updates, are as follows (compare the truncated derivatives above):
\[ l = in_j \ \text{(input gates)}: \qquad \frac{\partial s_{c_j^v}(t)}{\partial w_{in_j m}} = \frac{\partial s_{c_j^v}(t-1)}{\partial w_{in_j m}} + g(net_{c_j^v}(t))\, f'_{in_j}(net_{in_j}(t))\, y^m(t-1); \]
therefore time t's contribution to \( w_{in_j m} \)'s update is
\[ \Delta w_{in_j m}(t) = \alpha \sum_{v=1}^{S_j} e_{s_{c_j^v}}(t)\, \frac{\partial s_{c_j^v}(t)}{\partial w_{in_j m}}. \]
Similarly we get
\[ l = c_j^v \ \text{(memory cells)}: \qquad \frac{\partial s_{c_j^v}(t)}{\partial w_{c_j^v m}} = \frac{\partial s_{c_j^v}(t-1)}{\partial w_{c_j^v m}} + g'(net_{c_j^v}(t))\, f_{in_j}(net_{in_j}(t))\, y^m(t-1); \]
therefore time t's contribution to \( w_{c_j^v m} \)'s update is
\[ \Delta w_{c_j^v m}(t) = \alpha\, e_{s_{c_j^v}}(t)\, \frac{\partial s_{c_j^v}(t)}{\partial w_{c_j^v m}}. \]
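As a compact illustration of how these formulae fit together, here is a Python/NumPy sketch (our own, not the paper's code) of the per-time-step error signals, derivative traces, and weight-update contributions for a single memory cell block; the array shapes and names are our conventions, and weights to output units, hidden units, and output gates use the standard rule Delta w_lm(t) = alpha e_l(t) y^m(t-1):

```python
import numpy as np

def block_backward_step(alpha, target, y_k, y_out, y_in, h_s, g_net, y_prev,
                        W_kc, dS_in, dS_cell):
    """Truncated per-time-step quantities for one memory cell block (sketch).

    y_k    : output unit activations y^k(t), shape (K,)
    y_out, y_in : output / input gate activations at time t (scalars)
    h_s    : h(s_{c^v}(t)) for the block's cells, shape (S,)
    g_net  : g(net_{c^v}(t)), shape (S,)
    y_prev : source activations y^m(t-1), shape (I,)
    W_kc   : weights from the block's cells to the output units, shape (K, S)
    dS_in, dS_cell : stored traces ds/dw for input-gate and cell weights,
                     shape (S, I); updated in place.
    """
    dsig = lambda y: y * (1.0 - y)            # f' expressed via the activation f
    dh   = lambda hv: (1.0 - hv * hv) / 2.0   # h' expressed via h
    dg   = lambda gv: (4.0 - gv * gv) / 4.0   # g' expressed via g

    e_k   = dsig(y_k) * (target - y_k)        # output unit errors e_k(t)
    back  = W_kc.T @ e_k                      # sum_k w_{k c^v} e_k(t), one value per cell
    e_out = dsig(y_out) * np.dot(h_s, back)   # output gate error e_{out_j}(t)
    e_s   = y_out * dh(h_s) * back            # internal state errors e_{s_{c^v}}(t)

    # derivative traces (the ds/dw recursions above)
    dS_in   += np.outer(g_net * dsig(y_in), y_prev)   # += g(net_c) f'(net_in) y^m(t-1)
    dS_cell += np.outer(dg(g_net) * y_in, y_prev)     # += g'(net_c) f(net_in) y^m(t-1)

    # weight-update contributions for input gate and cell weights
    dW_in   = alpha * (dS_in.T @ e_s)                 # Delta w_{in_j m}(t)
    dW_cell = alpha * e_s[:, None] * dS_cell          # Delta w_{c^v m}(t)

    # weights to output units, hidden units and output gates use
    # Delta w_lm(t) = alpha * e_l(t) * y^m(t-1) with e_k, e_i, e_out.
    return e_k, e_out, dW_in, dW_cell
```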

All we need to implement for the backward pass are the error and update equations above. Each weight's total update is the sum of the contributions of all time steps.

Computational complexity: LSTM's update complexity per time step is
\[ O(KH + KCS + HI + CSI) = O(W), \]
where K is the number of output units, C is the number of memory cell blocks, S is the size of the memory cell blocks, H is the number of hidden units, I is the (maximal) number of units forward-connected to memory cells, gate units and hidden units, and
\[ W = KH + KCS + CSI + 2CI + HI = O(KH + KCS + CSI + HI) \]
is the number of weights. This expression is obtained by adding up the computations of the backward pass equations (the error signals for output units, hidden units and output gates, the internal state errors, the stored derivative traces, and the weight-update contributions); the total is O(KH + KSC + HI + CSI) steps. We conclude: the LSTM algorithm's update complexity per time step is just like BPTT's for a fully recurrent net.

At a given time step, only the 2CSI most recent \( \partial s_{c_j^v} / \partial w_{lm} \) values from the equations above need to be stored. Hence LSTM's storage complexity also is O(W); it does not depend on the input sequence length.

A.2 ERROR FLOW

We compute how much an error signal is scaled while flowing back through a memory cell for q time steps. As a by-product, this analysis reconfirms that the error flow within a memory cell's CEC is indeed constant, provided that truncated backprop cuts off error flow trying to leave memory cells (see also the truncation discussion above). The analysis also highlights a potential for undesirable long-term drifts of \( s_{c_j} \) (see point (2) below), as well as the beneficial, countermanding influence of negatively biased input gates (see points (2) and (3) below).

Using truncated backprop, we obtain
\[ \frac{\partial s_{c_j}(t-k)}{\partial s_{c_j}(t-k-1)} = 1 + \frac{\partial y^{in_j}(t-k)}{\partial s_{c_j}(t-k-1)}\, g(net_{c_j}(t-k)) + y^{in_j}(t-k)\, \frac{\partial g(net_{c_j}(t-k))}{\partial s_{c_j}(t-k-1)} \]
\[ = 1 + g(net_{c_j}(t-k)) \sum_u \frac{\partial y^{in_j}(t-k)}{\partial y^u(t-k-1)}\, \frac{\partial y^u(t-k-1)}{\partial s_{c_j}(t-k-1)} + y^{in_j}(t-k)\, g'(net_{c_j}(t-k)) \sum_u \frac{\partial net_{c_j}(t-k)}{\partial y^u(t-k-1)}\, \frac{\partial y^u(t-k-1)}{\partial s_{c_j}(t-k-1)} \approx_{tr} 1. \]
The "\(\approx_{tr}\)" sign indicates equality due to the fact that truncated backprop replaces the following derivatives by zero: \( \partial net_{in_j}(t-k) / \partial y^u(t-k-1) \) and \( \partial net_{c_j}(t-k) / \partial y^u(t-k-1) \).

In what follows, an error \( \vartheta_j(t) \) starts flowing back at c_j's output. We redefine
\[ \vartheta_j(t) := \sum_i w_{i c_j}\, \vartheta_i(t). \]
Following the definitions and conventions of the error flow analysis in the main text, we compute error flow for the truncated backprop learning rule. The error occurring at the output gate is
\[ \vartheta_{out_j}(t) \approx_{tr} \frac{\partial y^{c_j}(t)}{\partial y^{out_j}(t)}\, \frac{\partial y^{out_j}(t)}{\partial net_{out_j}(t)}\, \vartheta_j(t). \]
The error occurring at the internal state is
\[ \vartheta_{s_{c_j}}(t) = \frac{\partial s_{c_j}(t+1)}{\partial s_{c_j}(t)}\, \vartheta_{s_{c_j}}(t+1) + \frac{\partial y^{c_j}(t)}{\partial s_{c_j}(t)}\, \vartheta_j(t). \]
Since we use truncated backprop, we have \( \vartheta_j(t) = \sum_{i:\ i \text{ no gate and no memory cell}} w_{i c_j}\, \vartheta_i(t) \); therefore we get
\[ \frac{\partial \vartheta_j(t)}{\partial \vartheta_{s_{c_j}}(t)} = \sum_i w_{i c_j}\, \frac{\partial \vartheta_i(t)}{\partial \vartheta_{s_{c_j}}(t)} \approx_{tr} 0. \]
The previous equations imply constant error flow through the internal states of memory cells:
\[ \frac{\partial \vartheta_{s_{c_j}}(t)}{\partial \vartheta_{s_{c_j}}(t+1)} = \frac{\partial s_{c_j}(t+1)}{\partial s_{c_j}(t)} \approx_{tr} 1. \]
The error occurring at the memory cell input is
\[ \vartheta_{c_j}(t) = \frac{\partial g(net_{c_j}(t))}{\partial net_{c_j}(t)}\, \frac{\partial s_{c_j}(t)}{\partial g(net_{c_j}(t))}\, \vartheta_{s_{c_j}}(t). \]
The error occurring at the input gate is
\[ \vartheta_{in_j}(t) \approx_{tr} \frac{\partial y^{in_j}(t)}{\partial net_{in_j}(t)}\, \frac{\partial s_{c_j}(t)}{\partial y^{in_j}(t)}\, \vartheta_{s_{c_j}}(t). \]

No external error flow: Errors are propagated back from units l to unit v along outgoing connections with weights w_{lv}. This "external error" (note that for conventional units there is nothing but external error) at time t is
\[ \vartheta_v^{e}(t) = \frac{\partial y^v(t)}{\partial net_v(t)} \sum_l \frac{\partial net_l(t+1)}{\partial y^v(t)}\, \vartheta_l(t+1). \]
We obtain
\[ \vartheta_v^{e}(t-1) = \frac{\partial y^v(t-1)}{\partial net_v(t-1)} \left( \frac{\partial net_{out_j}(t)}{\partial y^v(t-1)}\, \vartheta_{out_j}(t) + \frac{\partial net_{in_j}(t)}{\partial y^v(t-1)}\, \vartheta_{in_j}(t) + \frac{\partial net_{c_j}(t)}{\partial y^v(t-1)}\, \vartheta_{c_j}(t) \right) \approx_{tr} 0. \]
We observe: the error \( \vartheta_j \) arriving at the memory cell output is not backpropagated to units v via external connections to in_j, out_j, c_j.

Error flow within memory cells: We now focus on the error back flow within a memory cell's CEC. This is actually the only type of error flow that can bridge several time steps. Suppose error \( \vartheta_j(t) \) arrives at c_j's output at time t and is propagated back for q steps until it reaches in_j or the memory cell input \( g(net_{c_j}) \). It is scaled by a factor of \( \partial \vartheta_v(t-q) / \partial \vartheta_j(t) \), where \( v \in \{in_j, c_j\} \). We first compute
\[ \frac{\partial \vartheta_{s_{c_j}}(t-q)}{\partial \vartheta_j(t)} \approx_{tr} \begin{cases} \dfrac{\partial y^{c_j}(t)}{\partial s_{c_j}(t)} & q = 0 \\[6pt] \dfrac{\partial s_{c_j}(t-q+1)}{\partial s_{c_j}(t-q)}\, \dfrac{\partial \vartheta_{s_{c_j}}(t-q+1)}{\partial \vartheta_j(t)} & q > 0. \end{cases} \]
Expanding this equation, we obtain
\[ \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_j(t)} \approx_{tr} \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_{s_{c_j}}(t-q)}\, \frac{\partial \vartheta_{s_{c_j}}(t-q)}{\partial \vartheta_j(t)} \approx_{tr} \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_{s_{c_j}}(t-q)} \left( \prod_{m=1}^{q} \frac{\partial s_{c_j}(t-m+1)}{\partial s_{c_j}(t-m)} \right) \frac{\partial y^{c_j}(t)}{\partial s_{c_j}(t)} \]
\[ \approx_{tr} y^{out_j}(t)\, h'(s_{c_j}(t)) \cdot \begin{cases} g'(net_{c_j}(t-q))\, y^{in_j}(t-q) & v = c_j \\[4pt] g(net_{c_j}(t-q))\, f'_{in_j}(net_{in_j}(t-q)) & v = in_j, \end{cases} \]
where the product over the CEC factors equals 1 under truncation.

Consider the factors in the previous equation's last expression. Obviously, error flow is scaled only at times t (when it enters the cell) and t-q (when it leaves the cell), but not in between (constant error flow through the CEC). We observe:

(1) The output gate's effect: \( y^{out_j}(t) \) scales down those errors that can be reduced early during training without using the memory cell. Likewise, it scales down those errors resulting from using (activating/deactivating) the memory cell at later training stages; without the output gate, the memory cell might, for instance, suddenly start causing avoidable errors in situations that already seemed under control (because it was easy to reduce the corresponding errors without memory cells). See "output weight conflict" and "abuse problem" in the main text.

(2) If there are large positive or negative \( s_{c_j}(t) \) values (because \( s_{c_j} \) has drifted since time step t-q), then \( h'(s_{c_j}(t)) \) may be small (assuming that h is a logistic sigmoid). Drifts of the memory cell's internal state \( s_{c_j} \) can be countermanded by negatively biasing the input gate in_j (see the main text and point (3)). Recall that the precise bias value does not matter much.

(3) \( y^{in_j}(t-q) \) and \( f'_{in_j}(net_{in_j}(t-q)) \) are small if the input gate is negatively biased (assume \( f_{in_j} \) is a logistic sigmoid). However, the potential significance of this is negligible compared to the potential significance of drifts of the internal state \( s_{c_j} \).

Some of the factors above may scale down LSTM's overall error flow, but not in a manner that depends on the length of the time lag. The flow will still be much more effective than an exponentially (of order q) decaying flow without memory cells.
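To make the q-independence concrete, the following toy Python computation (our own illustration, not from the paper) contrasts a product of q CEC factors, each approximately 1 under truncation, with a product of q conventional factors f'(net) w, which decays exponentially whenever |f'(net) w| < 1 (for a logistic unit, |f'(net)| is at most 0.25):

```python
import numpy as np

def cec_scaling(q):
    """Error scaling across q steps inside the CEC: a product of q factors,
    each approximately 1 under truncated backprop."""
    return float(np.prod(np.ones(q)))

def conventional_scaling(q, w=1.0):
    """Error scaling across q steps through an ordinary logistic unit with
    recurrent weight w: each factor |f'(net) w| is at most 0.25 |w| (illustrative)."""
    return (0.25 * w) ** q

for q in (10, 100, 1000):
    print(q, cec_scaling(q), conventional_scaling(q))
```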

References

Almeida, L. B. (1987). A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In IEEE 1st International Conference on Neural Networks, San Diego.

Baldi, P. and Pineda, F. (1991). Contrastive learning and neural oscillations. Neural Computation.

Bengio, Y. and Frasconi, P. (1994). Credit assignment through time: Alternatives to backpropagation. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6. San Mateo, CA: Morgan Kaufmann.

Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks.

Cleeremans, A., Servan-Schreiber, D., and McClelland, J. L. (1989). Finite-state automata and simple recurrent networks. Neural Computation.

de Vries, B. and Principe, J. C. (1991). A theory for neural networks with time delays. In Lippmann, R. P., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems 3. San Mateo, CA: Morgan Kaufmann.

Doya, K. (1992). Bifurcations in the learning of recurrent neural networks. In Proceedings of the IEEE International Symposium on Circuits and Systems.

Doya, K. and Yoshizawa, S. (1989). Adaptive neural oscillator using continuous-time back-propagation learning. Neural Networks.

Elman, J. L. (1988). Finding structure in time. Technical Report CRL 8801, Center for Research in Language, University of California, San Diego.

Fahlman, S. E. (1991). The recurrent cascade-correlation learning algorithm. In Lippmann, R. P., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems 3. San Mateo, CA: Morgan Kaufmann.

Hochreiter, J. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München. See www.informatik.tu-muenchen.de/~hochreit.

Hochreiter, S. and Schmidhuber, J. (1995). Long short-term memory. Technical Report FKI-207-95, Fakultät für Informatik, Technische Universität München.

Hochreiter, S. and Schmidhuber, J. (1996). Bridging long time lags by weight guessing and "Long Short-Term Memory". In Silva, F. L., Principe, J. C., and Almeida, L. B., editors, Spatiotemporal Models in Biological and Artificial Systems. Amsterdam: IOS Press. Series: Frontiers in Artificial Intelligence and Applications.

Hochreiter, S. and Schmidhuber, J. (1997). LSTM can solve hard long time lag problems. In Advances in Neural Information Processing Systems 9. Cambridge, MA: MIT Press. Presented at NIPS 96.

Lang, K., Waibel, A., and Hinton, G. E. (1990). A time-delay neural network architecture for isolated word recognition. Neural Networks.

Miller, C. B. and Giles, C. L. (1993). Experimental comparison of the effect of order in recurrent neural networks. International Journal of Pattern Recognition and Artificial Intelligence.

Mozer, M. C. (1989). A focused backpropagation algorithm for temporal sequence recognition. Complex Systems.

Mozer, M. C. (1992). Induction of multiscale temporal structure. In Moody, J. E., Hanson, S. J., and Lippmann, R. P., editors, Advances in Neural Information Processing Systems 4. San Mateo, CA: Morgan Kaufmann.

Pearlmutter, B. A. (1989). Learning state space trajectories in recurrent neural networks. Neural Computation.

Pearlmutter, B. A. (1995). Gradient calculations for dynamic recurrent neural networks: A survey. IEEE Transactions on Neural Networks.

Pineda, F. J. (1987). Generalization of backpropagation to recurrent neural networks. Physical Review Letters.

Pineda, F. J. (1988). Dynamics and architecture for neural computation. Journal of Complexity.

Plate, T. A. (1993). Holographic recurrent networks. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5. San Mateo, CA: Morgan Kaufmann.

Pollack, J. B. (1991). Language induction by phase transition in dynamical recognizers. In Lippmann, R. P., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems 3. San Mateo, CA: Morgan Kaufmann.

Puskorius, G. V. and Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks.

Ring, M. B. (1993). Learning sequential tasks by incrementally adding higher orders. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5. Morgan Kaufmann.

Robinson, A. J. and Fallside, F. (1987). The utility driven dynamic error propagation network. Technical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department.

Schmidhuber, J. (1989). The Neural Bucket Brigade: A local learning algorithm for dynamic feedforward and recurrent networks. Connection Science.

Schmidhuber, J. (1992a). A fixed size storage O(n^3) time complexity learning algorithm for fully recurrent continually running networks. Neural Computation.

Schmidhuber, J. (1992b). Learning complex, extended sequences using the principle of history compression. Neural Computation.

Schmidhuber, J. (1992c). Learning unambiguous reduced sequence descriptions. In Moody, J. E., Hanson, S. J., and Lippmann, R. P., editors, Advances in Neural Information Processing Systems 4. San Mateo, CA: Morgan Kaufmann.

Schmidhuber, J. (1993). Netzwerkarchitekturen, Zielfunktionen und Kettenregel. Habilitationsschrift, Institut für Informatik, Technische Universität München.

Schmidhuber, J. and Hochreiter, S. (1996). Guessing can outperform many long time lag algorithms. Technical Report, IDSIA.

Silva, G. X., Amaral, J. D., Langlois, T., and Almeida, L. B. (1996). Faster training of recurrent networks. In Silva, F. L., Principe, J. C., and Almeida, L. B., editors, Spatiotemporal Models in Biological and Artificial Systems. Amsterdam: IOS Press. Series: Frontiers in Artificial Intelligence and Applications.

Smith, A. W. and Zipser, D. (1989). Learning sequential structures with the real-time recurrent learning algorithm. International Journal of Neural Systems.

Sun, G., Chen, H., and Lee, Y. (1993). Time warping invariant neural networks. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5. San Mateo, CA: Morgan Kaufmann.

Watrous, R. L. and Kuhn, G. M. (1992). Induction of finite-state languages using second-order recurrent networks. Neural Computation.

Werbos, P. J. (1988). Generalization of backpropagation with application to a recurrent gas market model. Neural Networks.

Williams, R. J. (1989). Complexity of exact gradient computation algorithms for recurrent neural networks. Technical Report NU-CCS-89-27, Boston: Northeastern University, College of Computer Science.

Williams, R. J. and Peng, J. (1990). An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation.

Williams, R. J. and Zipser, D. (1992). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Backpropagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum.