
Learning Recurrent Neural Networks with Hessian-Free Optimization

James Martens [email protected]
Ilya Sutskever [email protected]
University of Toronto, Canada

Abstract

In this work we resolve the long-outstanding problem of how to effectively train recurrent neural networks (RNNs) on complex and difficult sequence modeling problems which may contain long-term dependencies. Utilizing recent advances in the Hessian-free optimization approach (Martens, 2010), together with a novel damping scheme, we successfully train RNNs on two sets of challenging problems. First, a collection of pathological synthetic datasets which are known to be impossible for standard optimization approaches (due to their extremely long-term dependencies), and second, on three natural and highly complex real-world sequence datasets where we find that our method significantly outperforms the previous state-of-the-art method for training neural sequence models: the Long Short-term Memory approach of Hochreiter and Schmidhuber (1997). Additionally, we offer a new interpretation of the generalized Gauss-Newton matrix of Schraudolph (2002) which is used within the HF approach of Martens.

1. Introduction

A Recurrent Neural Network (RNN) is a neural network that operates in time. At each timestep, it accepts an input vector, updates its (possibly high-dimensional) hidden state via non-linear activation functions, and uses it to make a prediction of its output. RNNs form a rich model class because their hidden state can store information as high-dimensional distributed representations (as opposed to a Hidden Markov Model, whose hidden state is essentially log n-dimensional) and their nonlinear dynamics can implement rich and powerful computations, allowing the RNN to perform modeling and prediction tasks for sequences with highly complex structure.

Figure 1. The architecture of a recurrent neural network (RNN).

Gradient-based training of RNNs might appear straightforward because, unlike many rich probabilistic sequence models (Murphy, 2002), the exact gradient can be cheaply computed by the Backpropagation Through Time algorithm (BPTT) (Rumelhart et al., 1986). Unfortunately, gradient descent and other 1st-order methods completely fail to properly train RNNs on large families of seemingly simple yet pathological synthetic problems that separate a target output from its relevant input by many time steps (Bengio et al., 1994; Hochreiter and Schmidhuber, 1997). In fact, 1st-order approaches struggle even when the separation is only 10 timesteps (Bengio et al., 1994). An unfortunate consequence of these failures is that these highly expressive and potentially very powerful time-series models are seldom used in practice.

The extreme difficulty associated with training RNNs is likely due to the highly volatile relationship between the parameters and the hidden states. One way that this volatility manifests itself, which has a direct impact on the performance of gradient descent, is in the so-called "vanishing/exploding gradients" phenomenon (Bengio et al., 1994; Hochreiter, 1991), where the error signals exhibit exponential decay/growth as they are back-propagated through time. In the case of decay, this leads to the long-term error signals being effectively lost as they are overwhelmed by un-decayed short-term signals, and in the case of exponential growth there is the opposite problem that the short-term error signals are overwhelmed by the long-term ones.

During the 90's there was intensive research by the machine learning community into identifying the source of difficulty in training RNNs as well as proposing methods to address it.
However, none of these methods became widely adopted, and an analysis by Hochreiter and Schmidhuber (1996) showed that they were often no better than random guessing. In an attempt to sidestep the difficulty of training RNNs on problems exhibiting long-term dependencies, Hochreiter and Schmidhuber (1997) proposed a modified architecture called the Long Short-term Memory (LSTM) and successfully applied it to speech and handwritten text recognition (Graves and Schmidhuber, 2009; 2005) and robotic control (Mayer et al., 2007). The LSTM consists of a standard RNN augmented with "memory-units" which specialize in transmitting long-term information, along with a set of "gating" units which allow the memory units to selectively interact with the usual RNN hidden state. Because the memory units are forced to have fixed linear dynamics with a self-connection of value 1, the error signal neither decays nor explodes as it is backpropagated through them. However, it is not clear if the approach of training with gradient descent and enabling long-term memorization with the specialized LSTM architecture is harnessing the true power of recurrent neural computation. In particular, gradient descent may be deficient for RNN learning in ways that are not compensated for by using LSTMs. Another recent attempt to resolve the problem of RNN training which has received attention is the Echo-State-Network (ESN) of Jaeger and Haas (2004), which gives up on learning the problematic hidden-to-hidden weights altogether in favor of using fixed sparse connections which are generated randomly. However, since this approach cannot learn new non-linear dynamics, instead relying on those present in the random "reservoir" which is effectively created by the random initialization, its power is clearly limited.

Recently, a special variant of the Hessian-Free (HF) optimization approach (aka truncated-Newton or Newton-CG) was successfully applied to learning deep multilayered neural networks from random initializations (Martens, 2010), an optimization problem for which gradient descent and even quasi-Newton methods like L-BFGS have never been demonstrated to be effective. Martens examines the problem of learning deep auto-encoders and is able to obtain the best-known results for these problems, significantly surpassing the benchmark set by the seminal work of Hinton and Salakhutdinov (2006).

Inspired by the success of the HF approach on deep neural networks, in this paper we revisit and resolve the long-standing open problem of RNN training. In particular, our results show that the HF optimization approach of Martens, augmented with a novel "structural damping" scheme which we develop, can effectively train RNNs on the aforementioned pathological long-term dependency problems (adapted directly from Hochreiter and Schmidhuber (1997)), thus overcoming the main objection made against using RNNs. From there we go on to address the question of whether these advances generalize to real-world sequence modeling problems by considering a high-dimensional motion video prediction task, a MIDI-music modeling task, and a speech modelling task. We find that RNNs trained using our method are highly effective on these tasks and significantly outperform similarly sized LSTMs.

Other contributions we make include the development of the aforementioned structural damping scheme which, as we demonstrate through experiments, can significantly improve robustness of the HF optimizer in the setting of RNN training, and a new interpretation of the generalized Gauss-Newton matrix of Schraudolph (2002) which forms a key component of the HF approach of Martens.

2. Recurrent Neural Networks

We now formally define the standard RNN (Rumelhart et al., 1986) which forms the focus of this work. Given a sequence of inputs x_1, x_2, ..., x_T, each in ℝ^n, the network computes a sequence of hidden states h_1, h_2, ..., h_T, each in ℝ^m, and a sequence of predictions ŷ_1, ŷ_2, ..., ŷ_T, each in ℝ^k, by iterating the equations

t_i = W_hx x_i + W_hh h_{i−1} + b_h   (1)
h_i = e(t_i)   (2)
s_i = W_yh h_i + b_y   (3)
ŷ_i = g(s_i)   (4)

where W_yh, W_hx, W_hh are the weight matrices and b_h, b_y are the biases; t_1, ..., t_T, each in ℝ^m, and s_1, ..., s_T, each in ℝ^k, are the inputs to the hidden and output units (resp.), and e and g are pre-defined vector-valued functions which are typically non-linear and applied coordinate-wise (although the only formal requirement is that they have a well-defined Jacobian). The RNN also has a special initial bias b_h^init ∈ ℝ^m which replaces the formally undefined expression W_hh h_0 at time i = 1. For a subscripted variable z_i, the variable without a subscript, z, will refer to all of the z_i's from i = 1 to T concatenated into a large vector, and similarly we will use θ to refer to all of the parameters as one large vector.

The objective associated with RNNs for a single training pair (x, y) is defined as f(θ) = L(ŷ; y), where L is a distance function¹ which measures the deviation of the predictions ŷ from the target outputs y. Examples of L include the squared error Σ_i ‖ŷ_i − y_i‖²/2 and the cross-entropy error −Σ_i Σ_j y_ij log(ŷ_ij) + (1 − y_ij) log(1 − ŷ_ij). The overall objective function for the whole training set is simply given by the average of the individual objectives associated with each training example.

¹Not necessarily symmetric or sub-additive.
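For concreteness, the following is a minimal numpy sketch of the forward pass defined by equations (1)-(4) and the per-sequence squared-error objective. The function and variable names (e.g. rnn_forward) are our own, and the tanh hidden non-linearity with an identity output non-linearity is just one admissible choice of e and g.

```python
import numpy as np

def rnn_forward(params, X):
    """Iterate equations (1)-(4) for one input sequence X of shape (T, n).

    params: dict with W_hx (m,n), W_hh (m,m), W_yh (k,m), b_h (m,), b_y (k,), b_init (m,).
    Returns the sequences t, h, s, y_hat as lists of length T.
    """
    W_hx, W_hh, W_yh = params["W_hx"], params["W_hh"], params["W_yh"]
    b_h, b_y, b_init = params["b_h"], params["b_y"], params["b_init"]
    e = np.tanh                      # hidden non-linearity e
    g = lambda s: s                  # output non-linearity g (identity here)
    ts, hs, ss, ys = [], [], [], []
    h_prev = None
    for i, x_i in enumerate(X):
        # eq. (1): t_i = W_hx x_i + W_hh h_{i-1} + b_h, with b_init replacing W_hh h_0
        t_i = W_hx @ x_i + (b_init if i == 0 else W_hh @ h_prev) + b_h
        h_i = e(t_i)                 # eq. (2)
        s_i = W_yh @ h_i + b_y       # eq. (3)
        y_i = g(s_i)                 # eq. (4)
        ts.append(t_i); hs.append(h_i); ss.append(s_i); ys.append(y_i)
        h_prev = h_i
    return ts, hs, ss, ys

def squared_error_objective(params, X, Y):
    """f(theta) = L(y_hat; y), with L the squared error from Section 2."""
    _, _, _, y_hat = rnn_forward(params, X)
    return 0.5 * sum(np.sum((yh - y) ** 2) for yh, y in zip(y_hat, Y))
```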
3. Hessian-Free Optimization

Hessian-Free optimization is concerned with the minimization of an objective f : ℝ^N → ℝ with respect to the N-dimensional input vector θ. Its operation is viewed as the iterative minimization of simpler sub-objectives based on local approximations to f(θ). Specifically, given a parameter setting θ_n, the next one, θ_{n+1}, is found by partially optimizing the sub-objective

q_{θ_n}(θ) ≡ M_{θ_n}(θ) + λ R_{θ_n}(θ),   (5)

In this equation, M_{θ_n}(θ) is a θ_n-dependent "local" quadratic approximation to f(θ) given by

M_{θ_n}(θ) = f(θ_n) + f′(θ_n)^⊤ δ_n + δ_n^⊤ B δ_n / 2,   (6)

where B is an approximation to the curvature of f, the term δ_n is given by δ_n = θ − θ_n, and R_{θ_n}(θ) is a damping function that penalizes the solution according to the difference between θ and θ_n, thus encouraging θ to remain close to θ_n. The use of a damping function is generally necessary because the accuracy of the local approximation M_{θ_n}(θ) degrades as θ moves further from θ_n.

In standard Hessian-free optimization (or 'truncated-Newton', as it is also known), M_{θ_n} is chosen to be the same quadratic as is optimized in Newton's method: the full 2nd-order Taylor series approximation to f, which is obtained by taking B = f″(θ_n) in eq. 6, and the damping function R_{θ_n}(θ) is chosen to be ‖δ_n‖²/2. Unlike with quasi-Newton approaches like L-BFGS there is no low-rank or diagonal approximation involved, and as a consequence, fully optimizing the sub-objective q can require a matrix inversion or some similarly expensive operation. The HF approach circumvents this problem by performing a much cheaper partial optimization of q_{θ_n} using the linear conjugate gradient algorithm (CG). By utilizing the complete curvature information given by B, CG running within HF can take large steps in directions of low reduction and curvature which are effectively invisible to gradient descent. It turns out that HF's ability to find and strongly pursue these directions is the critical property which allows it to successfully optimize deep neural networks (Martens, 2010).

Although the HF approach has been known and studied for decades within the optimization literature, the shortcomings of existing versions of the approach made them impractical or even completely ineffective for neural net training (Martens, 2010). The version of HF developed by Martens makes a series of important modifications and design choices to the basic approach, which include (among others): using the positive semi-definite Gauss-Newton curvature matrix in place of the possibly indefinite Hessian, using a quadratic damping function R_{θ_n}(θ) = ‖δ_n‖²/2 with the damping parameter λ adjusted by Levenberg-Marquardt style heuristics (Nocedal and Wright, 1999), using a criterion based directly on the value of q_{θ_n} in order to terminate CG (as opposed to the usual residual-error based criterion), and computing the curvature matrix-vector products Bv using mini-batches (as opposed to the whole training set) and the gradients with much larger mini-batches.

In applying HF to RNNs we deviate from the approach of Martens in two significant ways. First, we use a different damping function for R (developed in section 3.2), which we found improves the overall robustness of the HF algorithm on RNNs by more accurately identifying regions where M is likely to deviate significantly from f. Second, we do not use diagonal preconditioning because we found that it gave no substantive benefit for RNNs².

²The probable reason for this is that RNNs have no large axis-aligned (i.e. associated with specific parameters) scale differences, since the gradients and curvatures for the various parameters are formed from the summed contributions of both decayed and un-decayed error signals. In contrast, deep nets exhibit substantial layer-dependent (and thus axis-aligned) scale variation due to the fact that all signals which reach a given parameter go through the same number of layers and are thus decayed by roughly the same amount.
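The following is a minimal sketch, not the authors' implementation, of the damped sub-objective of eqs. (5)-(6) being partially minimized with linear conjugate gradients. It assumes a user-supplied function curvature_vec_product(v) returning Bv (e.g. the Gauss-Newton product described in section 3.1), and the simple residual-based stopping rule stands in for the q-value based criterion that Martens (2010) actually uses; the constants in the λ update are illustrative.

```python
import numpy as np

def partially_minimize_subobjective(grad, curvature_vec_product, lam, max_iters=50, tol=1e-10):
    """Approximately minimize q(delta) = grad^T delta + 0.5 delta^T (B + lam*I) delta
    with linear conjugate gradients, starting from delta = 0."""
    def damped_Bv(v):
        return curvature_vec_product(v) + lam * v   # Tikhonov damping adds lam*I to B

    delta = np.zeros_like(grad)
    r = -grad                       # residual of (B + lam*I) delta = -grad at delta = 0
    p = r.copy()
    rr = r @ r
    for _ in range(max_iters):
        Ap = damped_Bv(p)
        alpha = rr / (p @ Ap)
        delta += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        if rr_new < tol:            # residual-based stop; Martens uses a q-value criterion
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return delta

def update_lambda(lam, rho, drop=2.0 / 3, boost=1.5):
    """Levenberg-Marquardt style adjustment of lambda from the reduction ratio
    rho = (f(theta_n + delta) - f(theta_n)) / (q(delta) - q(0)); constants are illustrative."""
    if rho > 0.75:
        return lam * drop
    if rho < 0.25:
        return lam * boost
    return lam
```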
3.1. Multiplication by the Generalized Gauss-Newton Matrix

We now define the aforementioned Gauss-Newton matrix G_f, discuss its properties, and describe the algorithm for computing the curvature matrix-vector product G_f v (which was first applied to neural networks by Schraudolph (2002), extending the work of Pearlmutter (1994)).

The Gauss-Newton matrix for a single training example (x, y) is given by

G_f ≡ J_{s,θ}^⊤ (L∘g)″ J_{s,θ} |_{θ=θ_n}   (7)

where J_{s,θ} is the Jacobian of s w.r.t. θ, and (L∘g)″ is the Hessian of the loss L(g(s); y) with respect to s. To compute the curvature matrix-vector product G_f v we first compute J_{s,θ} v = Rs using a single forward pass of the "R{}-method" described in Pearlmutter (1994). Next, we run standard backpropagation through time with the vector (L∘g)″ J_{s,θ} v, which effectively accomplishes the multiplication by J_{s,θ}^⊤, giving

J_{s,θ}^⊤ ((L∘g)″ J_{s,θ} v) |_{θ=θ_n} = (J_{s,θ}^⊤ (L∘g)″ J_{s,θ}) |_{θ=θ_n} v = G_f v

If s is precomputed then the total cost of this operation is the same as a standard forward pass and backpropagation through time operation, i.e. it is essentially linear in the number of parameters.

A very important property of the Gauss-Newton matrix is that it is positive semi-definite when (L∘g)″ is, which happens when L(ŷ; y) is convex w.r.t. s. Moreover, it can be shown that G_f approaches the Hessian H when the loss L(ŷ(θ); y) approaches 0.

The usual view of the Gauss-Newton matrix is that it is simply a positive semi-definite approximation to the Hessian. There is an alternative view, which is well known in the context of non-linear least-squares problems, and which we now extend to encompass Schraudolph's generalized Gauss-Newton formulation. Consider the approximation f̃ of f obtained by "linearizing" the network up to the activations of the output units s, after which the final nonlinearity g and the usual error function are applied. To be precise, we replace s(θ) with s(θ_n) + J_{s,θ}|_{θ=θ_n} δ_n (recall δ_n ≡ θ − θ_n), where J_{s,θ} is the Jacobian of s w.r.t. θ, giving:

f̃_{θ_n}(θ) = L( g( s(θ_n) + J_{s,θ}|_{θ=θ_n} δ_n ); y )   (8)
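To illustrate this section's matrix-vector product, here is a rough numpy sketch for the RNN of Section 2 with an identity output non-linearity g and squared-error loss (so that (L∘g)″ is the identity). The R-forward pass computes Rs = J_{s,θ}v for a direction v given as per-parameter arrays, and the subsequent backward pass applies J_{s,θ}^⊤. The function names and the restriction to this particular loss are our own simplifying assumptions, not the authors' Algorithm 1.

```python
import numpy as np

def gauss_newton_vec(params, v, X, ts, hs):
    """Sketch of G_f v for the RNN of Section 2 with identity g and squared-error L,
    so that (L∘g)'' = I.  `v` is a dict of perturbations with the same shapes as
    `params`; `ts`, `hs` are the sequences t_i, h_i saved from a forward pass at theta_n."""
    W_hx, W_hh, W_yh = params["W_hx"], params["W_hh"], params["W_yh"]
    T = len(X)
    de = lambda t: 1.0 - np.tanh(t) ** 2          # e'(t) for e = tanh

    # R-forward pass: Rs_i = (J_{s,theta} v)_i  (Pearlmutter's R{}-method).
    Rh_prev = np.zeros_like(hs[0])
    Rs = []
    for i in range(T):
        Rt_i = v["W_hx"] @ X[i] + v["b_h"]
        Rt_i += v["b_init"] if i == 0 else v["W_hh"] @ hs[i - 1] + W_hh @ Rh_prev
        Rh_i = de(ts[i]) * Rt_i
        Rs.append(v["W_yh"] @ hs[i] + W_yh @ Rh_i + v["b_y"])
        Rh_prev = Rh_i

    # Backward pass (BPTT) applied to (L∘g)'' Rs_i, which multiplies by J_{s,theta}^T.
    Gv = {k: np.zeros_like(p) for k, p in params.items()}
    dt_next = np.zeros_like(ts[0])
    for i in reversed(range(T)):
        ds_i = Rs[i]                               # (L∘g)'' = I for this loss
        Gv["W_yh"] += np.outer(ds_i, hs[i])
        Gv["b_y"] += ds_i
        dh_i = W_yh.T @ ds_i + W_hh.T @ dt_next
        dt_i = de(ts[i]) * dh_i
        Gv["b_h"] += dt_i
        Gv["W_hx"] += np.outer(dt_i, X[i])
        if i > 0:
            Gv["W_hh"] += np.outer(dt_i, hs[i - 1])
        else:
            Gv["b_init"] += dt_i
        dt_next = dt_i
    return Gv
```

For a matching pair such as the logistic g with the cross-entropy L, the identity factor above would simply be replaced by the corresponding diagonal (L∘g)″.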

Because the function J_{s,θ}δ_n is affine in θ, it follows that f̃ is convex when L∘g is (since it will be the composition of a convex and an affine function), and so its Hessian will be positive semi-definite. A sufficient condition for L∘g being convex is when the output non-linearity function g 'matches' L (see Schraudolph, 2002), as is the case when g is the logistic sigmoid and L the cross-entropy error. Note that the choice to linearize up to s is arbitrary in a sense, and we could instead linearize up to ŷ (for example), as long as L is convex w.r.t. ŷ.

In the case of non-linear least-squares, once we linearize the network, L∘g is simply the squared L2 norm and so f̃ is already quadratic in θ. However, in the more general setting we only require that L∘g be convex and twice differentiable. To obtain a (convex) quadratic we can simply compute the 2nd-order Taylor series approximation of f̃. The gradient of f̃(θ) is given by

∇_s L( g( s(θ_n) + J_{s,θ}|_{θ=θ_n} δ_n ); y ) J_{s,θ}|_{θ=θ_n}   (9)

The 1st-order term in the quadratic approximation of f̃ centered at θ_n is the gradient of f̃ as computed above, evaluated at θ = θ_n (i.e. at δ_n = 0), and is given by

∇_s L( g( s(θ_n) ); y ) J_{s,θ}|_{θ=θ_n}

which is equal to the gradient of the exact f evaluated at θ = θ_n. Thus the 1st-order terms of the quadratic approximations to f and f̃ are identical.

Computing the Hessian of f̃ we obtain:

J_{s,θ}^⊤|_{θ=θ_n} (L∘g)″( s(θ_n) + J_{s,θ}|_{θ=θ_n} δ_n ; y ) J_{s,θ}|_{θ=θ_n}

The 2nd-order term of the Taylor-series approximation of f̃ is this quantity evaluated at θ = θ_n. Unlike the first-order term, it is not the same as the corresponding term in the Taylor series of f, but is in fact the Gauss-Newton matrix G_f of f. Thus we have shown that the quadratic model of f which uses the Gauss-Newton matrix in place of the Hessian is in fact the Taylor-series approximation to f̃.

Martens (2010) found that using the Gauss-Newton matrix G_f instead of H within the quadratic model for f was highly preferable for several reasons. Firstly, because H is in general indefinite, the sub-objective q_{θ_n} may not even be bounded below when B = H + λI. In particular, infinite reduction will be possible along any directions of negative curvature that have a non-zero inner product with the gradient. While this problem can be remedied by choosing λ to be larger in magnitude than the most-negative eigenvalue of H, there is no efficient way to compute this in general, and moreover, using large values of λ can have a highly detrimental effect on the performance of HF as it effectively masks out the important low-curvature directions. Secondly, Martens (2010) made the qualitative observation that, even when putting the issues of negative curvature aside, the search directions produced by using G_f in place of H performed much better in practice in the context of neural network learning.
3.2. Structural Damping

While we have found that applying the HF approach of Martens (2010) to RNNs achieves robust performance for most of the pathological long-term dependency problems (section 4) without any significant modification, for truly robust performance on all of these problems for 100 timesteps and beyond we found that it was necessary to incorporate an additional idea which we call "structural damping" (named so since it is a damping strategy that makes use of the specific structure of the objective).

In general, 2nd-order optimization methods, and HF in particular, perform poorly when the quadratic approximation M becomes highly inaccurate as δ_n approaches the optimum of the quadratic sub-objective q_{θ_n}. Damping, which is critical to the success of HF, encourages the optimal point to be close to δ_n = 0, where the approximation becomes exact but trivial. The basic damping strategy used by Martens is the classic Tikhonov approach, which penalizes δ_n by λR(δ_n) for R(δ_n) = ½‖δ_n‖². This is accomplished by adding λI to the curvature matrix B.

Our initial experience training RNNs with HF on long-term dependency tasks was that when λ was small there would be a rapid decrease in the accuracy of the quadratic approximation M as δ_n was driven by CG towards the optimum of q_{θ_n}. This would cause the Levenberg-Marquardt heuristics to (rightly) compensate by adjusting λ to be much larger. Unfortunately, we found that such a scenario was associated with poor training outcomes on the more difficult long-term dependency tasks of Hochreiter and Schmidhuber (1997). In particular, we saw that the parameters seemed to approach a bad local minimum where little-to-no long-term information was being propagated through the hidden states.

One way to explain this observation is that for large values of λ any Tikhonov-damped 2nd-order optimization approach (such as HF) will behave similarly to a 1st-order approach³, and crucial low-curvature directions whose associated curvature is significantly smaller than λ will be effectively masked out. Martens (2010) found that the Tikhonov damping strategy was very effective for training deep auto-encoders, presumably because, during any point in the optimization, there was a value of λ which would allow M to remain an accurate approximation of f (at q_{θ_n}'s optimum) without being so high as to reduce the method to a mostly 1st-order approach. Unfortunately our experience seems to indicate that this is less the case with RNNs.

³It can be easily shown that as λ → ∞, minimizing q_{θ_n} is equivalent to a small step of gradient descent.

Our hypothesis is that for certain small changes in θ there can be large and highly non-linear changes in the hidden state sequence h (given that W_hh is applied iteratively to potentially hundreds of temporal 'layers'), and these will not be accurately approximated by M. These ideas motivate the development of a damping function R that can penalize directions in parameter space which, despite not being large in magnitude, can nonetheless lead to large changes in the hidden-state sequence. With such a damping scheme we can penalize said directions without requiring λ to be so high as to prevent the strong pursuit of other low-curvature directions whose effect on f is modeled more accurately. To this end we define the "structural damping function" as

R_{θ_n}(θ) = ½‖δ_n‖² + µ S_{θ_n}(θ)   (10)

where µ > 0 is a weighting constant and S_{θ_n}(θ) is a function which quantifies change in the hidden units as a function of the change in parameters. It is given by

S_{θ_n}(θ) ≡ D( h(θ); h(θ_n) ),

where D is a distance function similar to L. Note that we still include the usual Tikhonov term within this new damping function and so, just as before, if λ becomes too high, all of the low-curvature directions become essentially "masked out". However, with the inclusion of the µS_{θ_n}(θ) term we hope that the quadratic model will tend to be more accurate at its optimum even for smaller values of λ, and fortunately this is what we observed in practice, in addition to improved performance on the pathological long-term dependency problems.
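As a concrete (if naive) reading of eq. (10), the sketch below evaluates the structural damping function directly by running the network at θ_n and at θ_n + δ_n and comparing the two hidden-state sequences with a squared-error distance D. The helper names and the choice of D are our own assumptions, and the next paragraph explains why this direct, non-quadratic form cannot be handed to CG as-is.

```python
import numpy as np

def structural_damping_penalty(forward_hidden, theta_n, delta, mu):
    """R(delta) = 0.5*||delta||^2 + mu * D(h(theta_n + delta), h(theta_n)), as in eq. (10),
    with D taken to be the squared error between the two hidden-state sequences.

    forward_hidden(theta) should run the RNN of Section 2 and return the stacked
    hidden states h as a single array; theta and delta are flat parameter vectors.
    """
    h_old = forward_hidden(theta_n)             # h(theta_n)
    h_new = forward_hidden(theta_n + delta)     # h(theta_n + delta_n)
    tikhonov = 0.5 * np.dot(delta, delta)       # usual Tikhonov term
    structural = 0.5 * np.sum((h_new - h_old) ** 2)   # D(h(theta); h(theta_n))
    return tikhonov + mu * structural
```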

When the damping function R_{θ_n}(θ) is given by ‖δ_n‖²/2, the function q_{θ_n} is a quadratic with respect to δ_n and thus CG can be directly applied with B = G_f + λI in order to minimize it. However, CG will not work with the structural damping function because S is not generally quadratic in θ. Thus, just as we approximated f by the quadratic function M, so too must we approximate S.

One option is to use the standard Taylor-series approximation:

S(θ_n) + (θ − θ_n)^⊤ dS/dθ|_{θ=θ_n} + ½ (θ − θ_n)^⊤ H_S (θ − θ_n)

where H_S is the Hessian of S at θ_n. Note that because D is a distance function, D(h(θ), h(θ_n)) is minimized at θ = θ_n with value 0. This implies that

D(h(θ), h(θ_n))|_{θ=θ_n} = 0   and   dD(h(θ), h(θ_n))/dh(θ)|_{θ=θ_n} = 0.

Using these facts we can eliminate the constant and linear terms from the quadratic approximation of S, since S(θ_n) = D(h(θ_n), h(θ_n)) = 0 and

dS/dθ|_{θ=θ_n} = dD(h(θ), h(θ_n))/dθ|_{θ=θ_n} = ( dD(h(θ), h(θ_n))/dh(θ) · dh(θ)/dθ )|_{θ=θ_n} = 0.

The fact that the gradient of the damping term is zero is important since otherwise including it with q_{θ_n} would result in a biased optimization process that could no longer be said to be optimizing the objective f.

Moreover, just as we did when creating a quadratic approximation of f, we may use the Gauss-Newton matrix G_S of S in place of the true Hessian. In particular, we define:

G_S ≡ J_{t,θ}^⊤ (D∘e)″ J_{t,θ} |_{θ=θ_n}

where J_{t,θ} is the Jacobian of t(θ) w.r.t. θ (recall eq. 2: h_i = e(t_i)) and (D∘e)″ is the Hessian of D(e(t); h(θ_n)) w.r.t. t. This matrix will be positive semi-definite if the distance function D matches the non-linearity e in the same way that we required L to match g in order to obtain the Gauss-Newton matrix for f.

And just as with G_f, we may view the Gauss-Newton matrix for S as resulting from the Taylor-series approximation of the following convex approximation S̃ to S:

S̃ = D( e( t(θ_n) + J_{t,θ}|_{θ=θ_n} δ_n ); h(θ_n) )

What remains is to find an efficient method for computing the contribution of the Gauss-Newton matrix for S to the usual curvature matrix-vector product to be used within the HF optimizer. While it almost certainly would require a significant amount of additional work to compute both G_f v and λµG_S v separately, it turns out that we can compute their sum G_f v + λµG_S v = (G_f + λµG_S)v, which is what we actually need, for essentially the same cost as just computing G_f v by itself.

We can formally include λµS as part of the objective, treating h = e(t) as an additional output of the network so that we linearize up to t in addition to s. This gives the combined Gauss-Newton matrix:

J_{(s,t),θ}^⊤ (L∘g + D∘e)″ J_{(s,t),θ} |_{θ=θ_n}

where the inner Hessian is w.r.t. the concatenation of s and t (denoted (s, t)). Because the linearized prediction of t is already computed as an intermediate step when computing the linearized prediction of s, the required multiplication by the Jacobian J_{(s,t),θ}|_{θ=θ_n} requires no extra work. Moreover, as long as we perform backpropagation starting from both s and t together (as we would when applying reverse-mode automatic differentiation), the required multiplication by the transposed Jacobian term doesn't either. The only extra computation required will be the multiplication by (D∘e)″, but this is very cheap compared to the weight-matrix multiplications. Equivalently, we may just straightforwardly apply Pearlmutter's technique (involving a forward and a backwards pass of the R-operator) for computing Hessian-vector products to any algorithm which efficiently computes f̃ + λµS̃. And said algorithm for computing f̃ + λµS̃ can itself be generated by applying a forward pass of the R-operator to a function which computes s and t (jointly, thus saving work), feeding the results into L∘g and D∘e (respectively), and finally taking the sum. The result of applying either of these techniques is given as Algorithm 1 in the supplementary materials for specific choices of neural activation functions and error functions.

Note that the required modification to the usual algorithm for computing the curvature matrix-vector product is simple and elegant and requires virtually no additional computation. However, it does require the extended storage of the Rt_i's, which could otherwise be discarded immediately after they are used to compute Rh_i.

While we have derived the method of 'structural damping' specifically for the hidden unit inputs t in the RNN model, it is easy to see that the idea and derivation generalize to any intermediate quantity z in any parameterized function we wish to learn. Moreover, the required modification to the associated matrix-vector product algorithm for computing the damped Gauss-Newton matrix would similarly be to add λµ(D∘e)″Rz to z*, where D is the distance function that penalizes change in z.
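Continuing the earlier numpy sketch of the Gauss-Newton product, the following shows the kind of modification structural damping asks for: during the backward pass we add λµ(D∘e)″Rt_i to the signal back-propagated at t_i. This is our own illustrative code for one specific choice (squared-error L with identity g, squared-error D with e = tanh), not the authors' Algorithm 1 from the supplementary materials.

```python
import numpy as np

def damped_gauss_newton_vec(params, v, X, ts, hs, lam, mu):
    """(G_f + lam*mu*G_S) v for the RNN sketch: identical to gauss_newton_vec above,
    except that lam*mu*(D∘e)'' Rt_i is added to the back-propagated signal at t_i.
    Here (L∘g)'' = I and the Gauss-Newton Hessian of D∘e is diag(e'(t_i)^2)."""
    W_hx, W_hh, W_yh = params["W_hx"], params["W_hh"], params["W_yh"]
    T = len(X)
    de = lambda t: 1.0 - np.tanh(t) ** 2

    # R-forward pass, this time keeping the Rt_i (the extra storage noted in the text).
    Rt, Rs = [], []
    Rh_prev = np.zeros_like(hs[0])
    for i in range(T):
        Rt_i = v["W_hx"] @ X[i] + v["b_h"]
        Rt_i += v["b_init"] if i == 0 else v["W_hh"] @ hs[i - 1] + W_hh @ Rh_prev
        Rh_i = de(ts[i]) * Rt_i
        Rs.append(v["W_yh"] @ hs[i] + W_yh @ Rh_i + v["b_y"])
        Rt.append(Rt_i)
        Rh_prev = Rh_i

    # Backward pass with the structural-damping contribution injected at each t_i.
    Gv = {k: np.zeros_like(p) for k, p in params.items()}
    dt_next = np.zeros_like(ts[0])
    for i in reversed(range(T)):
        ds_i = Rs[i]                                   # (L∘g)'' Rs_i
        Gv["W_yh"] += np.outer(ds_i, hs[i])
        Gv["b_y"] += ds_i
        dh_i = W_yh.T @ ds_i + W_hh.T @ dt_next
        dt_i = de(ts[i]) * dh_i
        dt_i += lam * mu * (de(ts[i]) ** 2) * Rt[i]    # add lam*mu*(D∘e)'' Rt_i
        Gv["b_h"] += dt_i
        Gv["W_hx"] += np.outer(dt_i, X[i])
        if i > 0:
            Gv["W_hh"] += np.outer(dt_i, hs[i - 1])
        else:
            Gv["b_init"] += dt_i
        dt_next = dt_i
    return Gv
```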

There is a valid concern that one can raise regarding the potential effectiveness of the structural damping approach. Because S is a highly non-linear function of δ_n it could be the case that the quadratic model will occasionally underestimate (or overestimate) the amount of change, perhaps severely. Moreover, if the linear model for the hidden units h is accurate for a particular δ_n then we probably don't need to penalize change in h, and if it's not accurate then structural damping may not help because, since it relies on the linear model, it could fail to detect any large changes in h. Fortunately, we can (mostly) address these concerns. First, even when the changes in the hidden states (which are high-dimensional vectors) are not accurately predicted by the linear model, it could still be the case that the overall magnitude of the change is, since it's just a single scalar. Second, while the amount of change may be over- or under-estimated for any particular training case, we are averaging our computations over many (usually 1000s of) training cases and thus there may be some beneficial averaging effects. And third, the linear model detecting a large change in the hidden state sequence, while not a necessary condition for the break-down of the approximation, is at least a sufficient one (since we can't trust such large changes), and one which will be detected by the structural damping term.

4. Experiments

In our first set of experiments we trained RNNs, using our HF-based approach, on various pathological synthetic problems known to be effectively impossible for gradient descent (e.g., Hochreiter et al., 2001). These problems were taken directly from the original LSTM paper (Hochreiter and Schmidhuber, 1997).

In our second set of experiments we trained both RNNs (using HF) and LSTMs⁴ (using backprop-through-time) on more realistic high-dimensional time-series prediction tasks. In doing so we examine the hypothesis that the specialized LSTM approach, while being effective at tasks where the main difficulty is the need for long-term memorization and recall, may not be fully exploiting the power of recurrent computation in RNNs, and that our more general HF approach can do significantly better.

With a few exceptions we used the same meta-parameters and initializations over all of the experiments. See the paper supplement for details.

⁴Specifically, the architecture that incorporates 'forget gates', which is described in Gers et al. (1999).

Figure 2. An illustration of the addition problem, a typical problem with pathological long term dependencies. The target is the sum of the two marked numbers (indicated with the black arrows). The "..." represent a large number of timesteps.

4.1. Pathological synthetic problems

In this section we describe the experiments we performed training RNNs with HF on seven different problems that all exhibit pathological long term dependencies. Each of these problems has a similar form, where the inputs are long sequences whose start (and possibly middle) are relevant, and the goal of the network is to output a particular function of the relevant inputs at the final few timesteps. The irrelevant part of the sequence is usually random, which makes the problem harder. A typical problem is illustrated in figure 2. The difficulty of these problems increases with T (the sequence length), since longer sequences exhibit longer range dependencies. Problems 1-6 are taken from Hochreiter and Schmidhuber (1997), so we will forgo giving a detailed description of them here.⁵

⁵For a more formal description of these problems and of the learning setup, refer to the supplementary materials.

In our experiments, we trained RNNs with 100 hidden units using HF, with and without structural damping, on problems 1-7 for T = 30, 50, 100, 200. Our goal is to determine the amount of work the optimizer requires to train the RNN to solve each problem. To decide if the optimizer has succeeded we use the same criterion as Hochreiter and Schmidhuber.

1. THE ADDITION PROBLEM

In this problem, the input to the RNN is a sequence of random numbers, and its target output, which is located at the end of the sequence, is the sum of the two "marked" numbers.

[Figure 3: eight panels, one per problem (add, mult, temporal2, temporal3, random perm, xor, 5 bit mem, 20 bit mem); top-row y-axis: minibatches to success (0 to 4.0e4); bottom-row y-axis: failure rate (0.0 to 1.2); x-axis: T from 0 to 250.]

Figure 3. For each problem, the plots in the top row give the mean number of minibatches processed before the success condition was achieved (amongst trials that didn’t fail) for various values of T (x-axis), with the min and max given by the vertical bars. Here, a minibatch is a set of 1000 sequences of length T that may be used for evaluating the gradient or a curvature matrix-vector product. A run using over 50,000 minibatches is defined as a failure. The failure rates are shown in the bottom row. Red is HF with only standard Tikhonov damping and blue is HF with structural damping included.

The marked numbers are far from the sequence's end, their position varies, and they are distant from each other. We refer to Hochreiter and Schmidhuber (1997) for the formal description of the addition problem, and to fig. 2 for an intuitive illustration.

2. THE MULTIPLICATION PROBLEM

The multiplication problem is precisely analogous to the addition problem except for the different operation.

3. THE XOR PROBLEM

This problem is similar to the addition problem, but the noisy inputs are binary and the operation is the Xor. This task is difficult for both our method and for LSTMs (Hochreiter and Schmidhuber, 1997, sec. 5.6) because the RNN cannot obtain a partial solution by discovering a relationship between one of the marked inputs and the target.

4. THE TEMPORAL ORDER PROBLEM

See Hochreiter and Schmidhuber (1997, task 6a).

5. THE 3-BIT TEMPORAL ORDER PROBLEM

See Hochreiter and Schmidhuber (1997, task 6b).

6. THE RANDOM PERMUTATION PROBLEM

See Hochreiter and Schmidhuber (1997, task 2b).

7. NOISELESS MEMORIZATION

The goal of this problem is to learn to memorize and reproduce long sequences of bits. The input sequence starts with a string of 5 bits followed by T occurrences of a constant input. There is a target in every timestep, which is constant, except in the last 5 timesteps, which are the original 5 bits. There is also a special input in the 5th timestep before last signaling that the output needs to be presented. We also experimented with a harder variant of this problem, where the goal is to memorize and sequentially reproduce 10 random integers from {1, ..., 5}, which contain over 20 bits of information.

RESULTS AND DISCUSSION

The results from our experiments training RNNs with HF on the pathological synthetic problems are shown in figure 3. We refer the reader to Hochreiter and Schmidhuber (1997) for a detailed analysis of the performance of LSTMs on these tasks. Our results establish that our HF approach is capable of training RNNs to solve problems with very long-term dependencies that are known to confound approaches such as gradient descent. Moreover, the inclusion of structural damping adds some additional speed and robustness (especially in the 5 and 20 bit memorization tasks, where it appears to be required when the minimal time-lag is larger than 50 steps). For some random seeds the exploding/vanishing gradient problem seemed to be too extreme for the HF optimizer to handle and the parameters would converge to a bad local minimum (these are reported as failures in figure 3). This is a problem that could potentially be addressed with a more careful random initialization than the one we used, as evidenced by the significant amount of problem-specific hand tweaking done by Hochreiter and Schmidhuber (1997, Table 10).

It should be noted that the total amount of computation required by our HF approach in order to successfully train an RNN is considerably higher than what is required to train the LSTM using its specialized variant of stochastic gradient descent. However, the LSTM architecture was designed specifically so that long-term memorization and recall can occur naturally via the "memory units", so representationally it is a model ideally suited to such tasks. In contrast, the problem of training a standard RNN to succeed at these tasks using only its general-purpose units, which lack the convenient "gating" connections (a type of 3-way neural connection), is a much harder one, so the extra computation required by the HF approach seems justified. And it should also be noted that the algorithm used to train LSTMs is fully online and thus doesn't benefit from parallel computation nearly as much as a semi-online approach such as HF (where the training cases in a given mini-batch may be processed all in parallel).
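To make the task family concrete, here is a small sketch of how an addition-problem training pair might be generated. The exact input encoding (how the "marks" are represented and where the target is placed) is our own guess at a reasonable instantiation; the supplementary materials give the authors' formal specification.

```python
import numpy as np

def make_addition_example(T, rng=np.random.default_rng()):
    """One input/target pair for the addition problem (fig. 2), sequence length T.

    Inputs have two channels per timestep: a random number in [0, 1] and a 0/1 mark.
    Exactly two timesteps (far from the end and from each other) are marked, and the
    target, produced only at the last timestep, is the sum of the two marked numbers.
    """
    values = rng.random(T)                  # random "noise" numbers
    marks = np.zeros(T)
    i = rng.integers(0, T // 10)            # first marked position, near the start
    j = rng.integers(T // 10, T // 2)       # second marked position, before the middle
    marks[i] = marks[j] = 1.0
    X = np.stack([values, marks], axis=1)   # shape (T, 2)
    target = values[i] + values[j]          # to be output at the final timestep
    return X, target
```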

4.2. Natural problems

For these problems, the output sequence is equal to the input sequence shifted forward by one time-step, so the goal of training is to have the model predict the next timestep by either minimizing the squared error, a KL-divergence, or by maximizing the log likelihood. We trained an RNN with HF and an LSTM with backpropagation-through-time (verified for correctness by numerical differentiation). The RNN had 300 hidden units, and the LSTM had 30 "memory blocks" (each of which has 10 "memory cells"), resulting in nearly identically sized parameterizations for both models (the LSTM had slightly more). Full details of the datasets and of the training are given in the supplementary material.
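The next-step prediction setup above amounts to feeding the sequence as input and using the same sequence, shifted by one timestep, as the target. The short sketch below illustrates this for a binary piano-roll with a log-likelihood objective; the helper names and this particular choice among the admissible losses are our own.

```python
import numpy as np

def next_step_prediction_pair(sequence):
    """Inputs are timesteps 1..T-1 and targets are timesteps 2..T of the same sequence."""
    return sequence[:-1], sequence[1:]

def piano_roll_log_likelihood(probs, targets):
    """Per-sequence log-likelihood of binary piano-roll targets under predicted
    note probabilities (both arrays of shape (T-1, 128))."""
    eps = 1e-12
    return np.sum(targets * np.log(probs + eps) + (1 - targets) * np.log(1 - probs + eps))
```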
1. BOUNCING BALL MOTION VIDEO PREDICTION

The bouncing balls problem consists of synthetic low-resolution (15×15) videos of three balls bouncing in a box. We trained the models on sequences of length 30.

2. MUSIC PREDICTION

We downloaded a large number of MIDI files and converted each into a "piano-roll". A piano-roll is a sequence of 128-dimensional binary vectors, each describing the set of notes played in the current timestep (which is 0.05 seconds long). The models were trained on sequences of length 200 due to the significant long-term dependencies exhibited by music.

3. SPEECH PREDICTION

We used a speech dataset to obtain sequences of 39-dimensional vectors representing the underlying speech signal, and used sequences of length 100.

RESULTS

The results are shown in table 1. The HF-trained RNNs outperform the LSTMs trained with gradient descent by a large margin. It should be noted that training the LSTMs took just as long or longer than the RNNs.

Table 1. Average test-set performance for the HF-trained RNN and the LSTM on the three natural sequence modeling problems.

Dataset (measure)         RNN+HF   LSTM+GD
Bouncing Balls (error)        22        35
MIDI (log-likelihood)       -569      -683
Speech (error)                22        41

5. Conclusions

In this paper, we demonstrated that the HF optimizer of Martens (2010), augmented with our structural damping approach, is capable of robustly training RNNs to solve tasks that exhibit long term dependencies. Unlike previous approaches like LSTMs, the method is not specifically designed to address the underlying source of difficulty present in these problems (but succeeds nonetheless), which makes it more likely to improve RNN training for a wider variety of sequence modeling tasks that are difficult, not only for their long term dependencies, but also for their highly complex shorter-term evolution, as our experiments with the natural problems suggest. Given these successes we feel that RNNs, which are very powerful and general sequence models in principle, but have always been very hard to train, may now begin to realize their true potential and see more wide-spread application to challenging sequence modeling problems.

Acknowledgments

The authors would like to thank Richard Zemel and Geoffrey Hinton for their helpful suggestions. This work was supported by NSERC and a Fellowship.

REFERENCES

Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5:157-166, 1994.

F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: continual prediction with LSTM. Neural Computation, 12:2451-2471, 1999.

A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18:602-610, 2005.

A. Graves and J. Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. In Advances in Neural Information Processing Systems, 2009.

G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313:504-507, 2006.

S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Technische Universität München, 1991.

S. Hochreiter and J. Schmidhuber. Bridging long time lags by weight guessing and "long short term memory". In Spatiotemporal models in biological and artificial systems, 1996.

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.

H. Jaeger and H. Haas. Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication. Science, 304:78-80, 2004.

J. Martens. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.

H. Mayer, F. Gomez, D. Wierstra, I. Nagy, A. Knoll, and J. Schmidhuber. A system for robotic heart surgery that learns to tie knots using recurrent neural networks. In 2006 International Conference on Intelligent Robots and Systems, 2007.

K. P. Murphy. Dynamic Bayesian networks: representation, inference and learning. PhD thesis, UC Berkeley, 2002.

J. Nocedal and S. J. Wright. Numerical Optimization. Springer Verlag, 1999.

B. A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147-160, 1994.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533-536, 1986.

N. Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7):1723-1738, 2002.