
On the difficulty of training recurrent neural networks

Razvan Pascanu [email protected]
Université de Montréal, 2920, chemin de la Tour, Montréal, Québec, Canada, H3T 1J8

Tomas Mikolov [email protected]
Speech@FIT, Brno University of Technology, Brno, Czech Republic

Yoshua Bengio [email protected]
Université de Montréal, 2920, chemin de la Tour, Montréal, Québec, Canada, H3T 1J8

Abstract

There are two widely known issues with properly training recurrent neural networks, the vanishing and the exploding gradient problems detailed in Bengio et al. (1994). In this paper we attempt to improve the understanding of the underlying issues by exploring these problems from an analytical, a geometric and a dynamical systems perspective. Our analysis is used to justify a simple yet effective solution. We propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem. We validate empirically our hypothesis and proposed solutions in the experimental section.

Figure 1. Schematic of a recurrent neural network. The recurrent connections in the hidden layer allow information to persist from one input to another.

1. Introduction

A recurrent neural network (RNN), e.g. Fig. 1, is a neural network model proposed in the 80's (Rumelhart et al., 1986; Elman, 1990; Werbos, 1988) for modeling time series. The structure of the network is similar to that of a standard multilayer perceptron, with the distinction that we allow connections among hidden units associated with a time delay. Through these connections the model can retain information about the past, enabling it to discover temporal correlations between events that are far away from each other in the data.

While in principle the recurrent network is a simple and powerful model, in practice it is hard to train properly. Among the main reasons why this model is so unwieldy are the vanishing gradient and exploding gradient problems described in Bengio et al. (1994).

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

1.1. Training recurrent networks

A generic recurrent neural network, with input u_t and state x_t for time step t, is given by:

x_t = F(x_{t-1}, u_t, θ)    (1)

In the theoretical section of this paper we will sometimes make use of the specific parametrization¹:

x_t = W_rec σ(x_{t-1}) + W_in u_t + b    (2)

In this case, the parameters of the model are given by the recurrent weight matrix W_rec, the bias b and the input weight matrix W_in, collected in θ for the general case. x_0 is provided by the user, set to zero or learned, and σ is an element-wise nonlinearity. A cost E = Σ_{1≤t≤T} E_t measures the performance of the network on some given task, where E_t = L(x_t).

¹ For any model respecting eq. (2) we can construct another model following the more widely known equation x_t = σ(W_rec x_{t-1} + W_in u_t + b) that behaves the same (e.g. by redefining E_t := E_t(σ(x_t))) and vice versa. Therefore the two formulations are equivalent. We chose eq. (2) for convenience.
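To make the parametrization of eq. (2) concrete, the snippet below is a minimal numpy sketch of the forward recurrence (our own illustration, not code from the paper; the sizes, the input sequence and the per-step loss are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    n_hid, n_in, T = 20, 5, 30                    # hypothetical sizes and sequence length
    W_rec = 0.1 * rng.standard_normal((n_hid, n_hid))
    W_in = 0.1 * rng.standard_normal((n_hid, n_in))
    b = np.zeros(n_hid)

    x = np.zeros(n_hid)                           # x_0 set to zero, one of the options mentioned above
    E = 0.0
    for t in range(T):
        u_t = rng.standard_normal(n_in)           # placeholder input at step t
        x = W_rec @ np.tanh(x) + W_in @ u_t + b   # eq. (2) with sigma = tanh
        E += 0.5 * np.sum(x ** 2)                 # a toy per-step cost E_t = L(x_t)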

One approach for computing the necessary gradients is backpropagation through time (BPTT), where the recurrent model is represented as a multi-layer one (with an unbounded number of layers) and backpropagation is applied on the unrolled model (see Fig. 2).

Figure 2. Unrolling recurrent neural networks in time by creating a copy of the model for each time step. We denote by x_t the hidden state of the network at time t, by u_t the input of the network at time t and by E_t the error obtained from the output at time t.

We will diverge from the classical BPTT equations at this point and re-write the gradients in order to better highlight the exploding gradients problem:

∂E/∂θ = Σ_{1≤t≤T} ∂E_t/∂θ    (3)

∂E_t/∂θ = Σ_{1≤k≤t} (∂E_t/∂x_t) (∂x_t/∂x_k) (∂⁺x_k/∂θ)    (4)

∂x_t/∂x_k = Π_{t≥i>k} ∂x_i/∂x_{i-1} = Π_{t≥i>k} W_rec^T diag(σ'(x_{i-1}))    (5)

These equations were obtained by writing the gradients in a sum-of-products form. ∂⁺x_k/∂θ refers to the "immediate" partial derivative of the state x_k with respect to θ, where x_{k-1} is taken as a constant with respect to θ.² Specifically, considering eq. (2), the value of any row i of the matrix ∂⁺x_k/∂W_rec is just σ(x_{k-1}). Eq. (5) also provides the form of the Jacobian matrix ∂x_i/∂x_{i-1} for the specific parametrization given in eq. (2), where diag converts a vector into a diagonal matrix, and σ' computes element-wise the derivative of σ.

² We use "immediate" partial derivative in order to avoid confusion, though one can use the concept of total derivative and the proper meaning of partial derivative to express the same property.

Any gradient component ∂E_t/∂θ is also a sum (see eq. (4)), whose terms we refer to as temporal contributions or temporal components. One can see that each such temporal contribution (∂E_t/∂x_t)(∂x_t/∂x_k)(∂⁺x_k/∂θ) measures how θ at step k affects the cost at step t > k. The factors ∂x_t/∂x_k (eq. (5)) transport the error "in time" from step t back to step k. We would further loosely distinguish between long term and short term contributions, where long term refers to components for which k ≪ t and short term to everything else.

2. Exploding and vanishing gradients

Introduced in Bengio et al. (1994), the exploding gradients problem refers to the large increase in the norm of the gradient during training. Such events are due to the explosion of the long term components, which can grow exponentially more than short term ones. The vanishing gradients problem refers to the opposite behaviour, when long term components go exponentially fast to norm 0, making it impossible for the model to learn correlations between temporally distant events.

2.1. The mechanics

To understand this phenomenon we need to look at the form of each temporal component, and in particular at the matrix factors ∂x_t/∂x_k (see eq. (5)) that take the form of a product of t−k Jacobian matrices. In the same way a product of t−k real numbers can shrink to zero or explode to infinity, so does this product of matrices (along some direction v).

In what follows we will try to formalize these intuitions (extending similar derivations done in Bengio et al. (1994) where a single hidden unit case was considered).

If we consider a linear version of the model (i.e. set σ to the identity function in eq. (2)) we can use the power iteration method to formally analyze this product of Jacobian matrices and obtain tight conditions for when the gradients explode or vanish (see the supplementary materials). It is sufficient for ρ < 1, where ρ is the spectral radius of the recurrent weight matrix W_rec, for long term components to vanish (as t → ∞) and necessary for ρ > 1 for them to explode.

We generalize this result to nonlinear functions σ where |σ'(x)| is bounded, ‖diag(σ'(x_k))‖ ≤ γ ∈ R, by relying on singular values.

We first prove that it is sufficient for λ_1 < 1/γ, where λ_1 is the largest singular value of W_rec, for the vanishing gradient problem to occur. Note that we assume the parametrization given by eq. (2). The Jacobian matrix ∂x_{k+1}/∂x_k is given by W_rec^T diag(σ'(x_k)). The 2-norm of this Jacobian is bounded by the product of the norms of the two matrices (see eq. (6)). Due to our assumption, this implies that it is smaller than 1.

∀k, ‖∂x_{k+1}/∂x_k‖ ≤ ‖W_rec^T‖ ‖diag(σ'(x_k))‖ < (1/γ) γ < 1    (6)

Let η ∈ R be such that ∀k, ‖∂x_{k+1}/∂x_k‖ ≤ η < 1. The existence of η is given by eq. (6). By induction over i, we can show that

‖(∂E_t/∂x_t) (Π_{i=k}^{t-1} ∂x_{i+1}/∂x_i)‖ ≤ η^{t-k} ‖∂E_t/∂x_t‖    (7)

As η < 1, it follows that, according to eq. (7), long term contributions (for which t−k is large) go to 0 exponentially fast with t−k.

By inverting this proof we get the necessary condition for exploding gradients, namely that the largest singular value λ_1 is larger than 1/γ (otherwise the long term components would vanish instead of exploding). For tanh we have γ = 1, while for the sigmoid we have γ = 1/4.
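The bound of eqs. (6)–(7) can be checked numerically. The following sketch (ours, not from the paper; sizes and seed are arbitrary) builds the product of Jacobians from eq. (5) along a random trajectory of eq. (2) with σ = tanh (so γ = 1) and a recurrent matrix rescaled so that λ_1 = 0.9 < 1/γ; the 2-norm of the product stays below 0.9^(t−k) and vanishes exponentially:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 50
    W_rec = rng.standard_normal((n, n))
    W_rec *= 0.9 / np.linalg.norm(W_rec, 2)      # largest singular value lambda_1 = 0.9 < 1/gamma (gamma = 1 for tanh)
    W_in = 0.1 * rng.standard_normal((n, n))

    x = np.zeros(n)
    prod = np.eye(n)                             # running product of Jacobians, eq. (5), with k = 0
    for t in range(1, 51):
        x_prev = x
        x = W_rec @ np.tanh(x_prev) + W_in @ rng.standard_normal(n)   # eq. (2), b = 0, random input
        J = W_rec.T * (1.0 - np.tanh(x_prev) ** 2)                    # W_rec^T diag(sigma'(x_{t-1}))
        prod = J @ prod
        if t % 10 == 0:
            print(t, np.linalg.norm(prod, 2), 0.9 ** t)               # product norm vs. the bound eta^t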
2.2. Drawing similarities with dynamical systems

We can improve our understanding of the exploding gradients and vanishing gradients problems by employing a dynamical systems perspective, as it was done before in Doya (1993); Bengio et al. (1993). We recommend reading Strogatz (1994) for a formal and detailed treatment of dynamical systems theory.

For any parameter assignment θ, depending on the initial state x_0, the state x_t of an autonomous dynamical system converges, under the repeated application of the map F, to one of several possible different attractor states. The model could also find itself in a chaotic regime, a case in which some of the following observations may not hold, but that is not treated in depth here. Attractors describe the asymptotic behaviour of the model. The state space is divided into basins of attraction, one for each attractor. If the model is started in one basin of attraction, the model will converge to the corresponding attractor as t grows.

Dynamical systems theory says that as θ changes, the asymptotic behaviour changes smoothly almost everywhere except for certain crucial points where drastic changes occur (the new asymptotic behaviour ceases to be topologically equivalent to the old one). These points form bifurcation boundaries and are caused by attractors that appear, disappear or change shape.

Doya (1993) hypothesizes that such bifurcation crossings could cause the gradients to explode. We would like to extend this observation into a sufficient condition for gradients to explode, by addressing the issue of crossing boundaries between basins of attraction.

We argue that bifurcations are global events that can have locally no effect and therefore crossing them is neither sufficient nor necessary for the gradients to explode. For example, due to a bifurcation one attractor can disappear, but if the model's state is not found in the basin of attraction of said attractor, learning will not be affected by it. However, when either a change in the state or in the position of the boundary between basins of attraction (which can be caused by a change in θ) is such that the model falls in a different basin than before, the state will be attracted in a different direction, resulting in the explosion of the gradients. This crossing of boundaries between basins of attraction is a local event and it is sufficient for the gradients to explode. By assuming that crossing into an emerging attractor or from a disappearing one qualifies as crossing a boundary between attractors, this term encapsulates also the observations from Doya (1993).

We will re-use the one-hidden-unit model (and plot) from Doya (1993) (see Fig. 3) to depict our extension, though, as the original hypothesis, our extension does not depend on the dimensionality of the model. The x-axis covers the parameter b and the y-axis the asymptotic state x_∞. The bold line follows the movement of the final point attractor, x_∞, as b changes. At b_1 we have a bifurcation boundary where a new attractor emerges (when b decreases from ∞), while at b_2 we have another that results in the disappearance of one of the two attractors. In the interval (b_1, b_2) we are in a rich regime, where there are two attractors and the change in position of the boundary between them, as we change b, is traced out by a dashed line. The gray dashed arrows describe the evolution of the state x if the network is initialized in that region. Crossing the boundary between basins of attraction is depicted with unfilled circles, where a small change in the state at time 0 results in a sudden large change in x_t. Crossing a bifurcation (Doya (1993)'s original hypothesis) is shown with filled circles. Note how in the figure there are only two values of b with a bifurcation, but a whole range of values for which there can be a boundary crossing.

Figure 3. Bifurcation diagram of a single hidden unit RNN (with fixed recurrent weight of 5.0 and adjustable bias b; example introduced in Doya (1993)). See text.

Another limitation of the previous analysis is that it considers autonomous systems and assumes the observations hold for input-driven models as well. In (Bengio et al., 1994) input is dealt with by assuming it is bounded noise. The downside of this approach is that it limits how one can reason about the input. In practice, the input is supposed to drive the dynamical system, being able to leave the model in some attractor state, or kick it out of the basin of attraction when certain triggering patterns present themselves.

We propose to extend our analysis to input-driven models by folding the input into the map. We consider the family of maps F_t, where we apply a different F_t at each step. Intuitively, for the gradients to explode we require the same behaviour as before, where (at least in some direction) the maps F_1, .., F_t agree and change direction. Fig. 4 describes this behaviour.

Figure 4. This diagram illustrates how the change in x_t, Δx_t, can be large for a small Δx_0. The blue vs red (left vs right) trajectories are generated by the same maps F_1, F_2, ... for two different initial states.

For the specific parametrization in eq. (2) we can take the analogy one step further by decomposing the maps F_t into a fixed map F̃ and a time-varying one U_t. F̃(x) = W_rec σ(x) + b corresponds to an input-less recurrent network, while U_t(x) = x + W_in u_t describes the effect of the input. This is depicted in Fig. 5. Since U_t changes with time, it can not be analyzed using standard dynamical systems tools, but F̃ can. This means that when a boundary between basins of attraction is crossed for F̃, the state will move towards a different attractor, which for large t could lead (unless the input maps U_t are opposing this) to a large discrepancy in x_t. Therefore studying the asymptotic behaviour of F̃ can provide useful information about where such events are likely to happen.

Figure 5. Illustrates how one can break apart the maps F_1, .., F_t into a constant map F̃ and the maps U_1, .., U_t. The dotted vertical line represents the boundary between basins of attraction, and the straight dashed arrows the direction of the map F̃ on each side of the boundary. This diagram is an extension of Fig. 4.

One interesting observation from the dynamical systems perspective with respect to vanishing gradients is the following. If the factors ∂x_t/∂x_k go to zero (for t−k large), it means that x_t does not depend on x_k (if we change x_k by some Δ, x_t stays the same). This translates into the model at x_t being close to convergence towards some attractor (which it would reach from anywhere in the neighbourhood of x_k). Therefore avoiding the vanishing gradient means staying close to the boundaries between basins of attraction.

2.3. The geometrical interpretation

Let us consider a simple one hidden unit model (eq. (8)) where we provide an initial state x_0 and train the model to have a specific target value after 50 steps. Note that for simplicity we assume no input.

x_t = w σ(x_{t-1}) + b    (8)

Fig. 6 shows the error surface E_50 = (σ(x_50) − 0.7)², where x_0 = 0.5 and σ is the sigmoid function.

Figure 6. We plot the error surface of a single hidden unit recurrent network, highlighting the existence of high curvature walls. The solid lines depict standard trajectories that gradient descent might follow. Using dashed arrows the diagram shows what would happen if the gradient is rescaled to a fixed size when its norm is above a threshold.

We can easily analyze the behavior of the model in the linear case, with b = 0, i.e. x_t = x_0 w^t. We have that ∂x_t/∂w = t x_0 w^{t−1} and ∂²x_t/∂w² = t(t−1) x_0 w^{t−2}, implying that when the first derivative explodes, so does the second derivative.

In the general case, when the gradients explode they do so along some directions v. There exists, in such situations, a v such that (∂E_t/∂θ) v ≥ C α^t, where C, α ∈ R and α > 1 (e.g., in the linear case, v is the eigenvector corresponding to the largest eigenvalue of W_rec). If this bound is tight, we hypothesize that when gradients explode so does the curvature along v, leading to a wall in the error surface, like the one seen in Fig. 6. Based on this hypothesis we devise a simple solution to the exploding gradient problem (see Fig. 6).

If both the gradient and the leading eigenvector of the curvature are aligned with the exploding direction v, it follows that the error surface has a steep wall perpendicular to v (and consequently to the gradient). This means that when stochastic gradient descent (SGD) reaches the wall and does a gradient descent step, it will be forced to jump across the valley, moving perpendicular to the steep walls, possibly leaving the valley and disrupting the learning process.
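As a rough illustration of this cliff-like structure (our own sketch, not an experiment from the paper), the snippet below evaluates E_50 and dE_50/dw for the one-unit model of eq. (8) at two bias values: b = 0 and a bias chosen (our own construction) so that x_0 = 0.5 sits exactly on an unstable fixed point of the map. The two gradients differ by many orders of magnitude, and rescaling the norm of the large one to a fixed threshold keeps the update bounded, as in the clipping strategy of Sec. 3.2:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def error_and_grad(w, b, x0=0.5, T=50, target=0.7):
        # One hidden unit RNN from eq. (8): x_t = w * sigmoid(x_{t-1}) + b.
        # Returns E_50 = (sigmoid(x_50) - target)^2 and dE_50/dw, accumulated in forward mode.
        x, dx_dw = x0, 0.0
        for _ in range(T):
            s = sigmoid(x)
            dx_dw = s + w * s * (1.0 - s) * dx_dw   # chain rule through one recurrence step
            x = w * s + b
        y = sigmoid(x)
        return (y - target) ** 2, 2.0 * (y - target) * y * (1.0 - y) * dx_dw

    w = 8.0
    b_wall = 0.5 - w * sigmoid(0.5)       # places x_0 = 0.5 on an unstable fixed point (our construction)
    for b in [0.0, b_wall]:
        E, g = error_and_grad(w, b)
        threshold = 1.0
        step = g if abs(g) <= threshold else threshold * np.sign(g)   # rescale the norm above a threshold
        print(f"b={b:8.4f}  E50={E:.4f}  dE/dw={g: .3e}  rescaled dE/dw={step: .3e}")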

The dashed arrows in Fig. 6 correspond to ignoring the norm of this large step, ensuring that the model stays close to the wall. One key insight is that all the steps taken when the gradient explodes are aligned with v and ignore other descent directions. At the wall, a bounded-norm step in the direction of the gradient therefore merely pushes us back inside the smoother low-curvature region beside the wall, whereas a regular gradient step would bring us very far, thus slowing or preventing further training.

The important assumption in this scenario, compared to the classical high curvature valley, is that the valley is wide, so that we have a large region around the wall where, if we land, we can rely on first order methods to move towards a local minimum. The effectiveness of clipping provides indirect evidence for this, as otherwise, even with clipping, SGD would not be able to explore descent directions of low curvature and therefore further minimize the error.

Our hypothesis, when it holds, could also provide another argument in favor of the Hessian-Free approach compared to other second order methods (among other existing arguments). Hessian-Free, and truncated Newton methods in general, compute a new estimate of the Hessian before each update step and can take into account abrupt changes in curvature (such as the ones suggested by our hypothesis), while other approaches use a smoothness assumption, averaging 2nd order signals over many steps.

3. Dealing with the exploding and vanishing gradient

3.1. Previous solutions

Using an L1 or L2 penalty on the recurrent weights can help with exploding gradients. Assuming weights are initialized to small values, the largest singular value λ_1 of W_rec is probably smaller than 1. The L1/L2 term can ensure that during training λ_1 stays smaller than 1, and in this regime gradients can not explode (see sec. 2.1). This approach limits the model to a single point attractor at the origin, where any information inserted in the model dies out exponentially fast. This prevents the model from learning generator networks, nor can it exhibit long term memory traces.

Doya (1993) proposes to pre-program the model (to initialize the model in the right regime) or to use teacher forcing. The first proposal assumes that if the model exhibits from the beginning the same kind of asymptotic behaviour as the one required by the target, then there is no need to cross a bifurcation boundary. The downside is that one can not always know the required asymptotic behaviour, and, even if it is known, it might not be trivial to initialize the model accordingly. Also, such initialization does not prevent crossing the boundary between basins of attraction.

Teacher forcing refers to using targets for some or all hidden units. When computing the state at time t, we use the targets at t−1 as the value of all the hidden units in x_{t−1} which have a target defined. It has been shown that in practice it can reduce the chance that gradients explode, and even allow training generator models or models that work with unbounded amounts of memory (Pascanu and Jaeger, 2011; Doya and Yoshizawa, 1991). One important downside is that it requires a target to be defined at every time step.

Hochreiter and Schmidhuber (1997); Graves et al. (2009) propose the LSTM model to deal with the vanishing gradients problem. It relies on a special type of linear unit with a self connection of value 1. The flow of information into and out of the unit is guarded by learned input and output gates. There are several variations of this basic structure. This solution does not explicitly address the exploding gradients problem.

Sutskever et al. (2011) use the Hessian-Free optimizer in conjunction with structural damping. This approach seems able to address the vanishing gradient problem, though a more detailed analysis is missing. Presumably this method works because in high dimensional spaces there is a high probability for long term components to be orthogonal to short term ones. This would allow the Hessian to rescale these components independently. In practice, one can not guarantee that this property holds. The method addresses the exploding gradient as well, as it takes curvature into account. Structural damping is an enhancement that forces the Jacobian matrices ∂x_t/∂θ to have small norm, hence further helping with the exploding gradients problem. The need for this extra term when solving the pathological problems might suggest that second order derivatives do not always grow at the same rate as first order ones.

Echo State Networks (Jaeger and Haas, 2004) avoid the exploding and vanishing gradients problem by not learning W_rec and W_in. They are sampled from hand-crafted distributions. Because the spectral radius of W_rec is, by construction, smaller than 1, information fed into the model dies out exponentially fast. An extension to the model is given by leaky integration units (Jaeger et al., 2007), where

x_k = α x_{k−1} + (1 − α) σ(W_rec x_{k−1} + W_in u_k + b).

These units can be used to solve the standard benchmark proposed by Hochreiter and Schmidhuber (1997) for learning long term dependencies (Jaeger, 2012).
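A small sketch of the leaky-integration update quoted above, under the usual ESN assumptions (fixed, randomly sampled weights rescaled to a spectral radius below 1; the sizes, leak rate and input sequence here are hypothetical, and only a readout on top of the state would be trained):

    import numpy as np

    rng = np.random.default_rng(1)
    n, alpha = 100, 0.9                                   # reservoir size and leak rate (hypothetical)
    W_rec = rng.standard_normal((n, n))
    W_rec *= 0.95 / max(abs(np.linalg.eigvals(W_rec)))    # fix the spectral radius below 1
    W_in = rng.uniform(-0.1, 0.1, size=n)
    b = np.zeros(n)

    x = np.zeros(n)
    for u in rng.standard_normal(20):                     # drive the reservoir with a toy scalar input
        x = alpha * x + (1 - alpha) * np.tanh(W_rec @ x + W_in * u + b)
    # x now holds the leaky-integrated reservoir state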

We would make a final note about the approach proposed by Tomas Mikolov in his PhD thesis (Mikolov, 2012) (and implicitly used in the state of the art results on language modelling (Mikolov et al., 2011)). It involves clipping the gradient's temporal components element-wise (clipping an entry when it exceeds in absolute value a fixed threshold).

3.2. Scaling down the gradients

As suggested in section 2.3, one mechanism to deal with the exploding gradient problem is to rescale the gradient's norm whenever it goes over a threshold:

Algorithm 1 Pseudo-code for norm clipping
    ĝ ← ∂E/∂θ
    if ‖ĝ‖ ≥ threshold then
        ĝ ← (threshold / ‖ĝ‖) ĝ
    end if

This algorithm is similar to the one proposed by Tomas Mikolov, and we only diverged from the original proposal in an attempt to provide a better theoretical justification (see section 2.3; we also move in a descent direction for the current mini-batch). We regard this work as an investigation of why clipping works. In practice both variants behave similarly.

The proposed clipping is simple and computationally efficient, but it does introduce an additional hyper-parameter, namely the threshold. One good heuristic for setting this threshold is to look at statistics on the average norm over a sufficiently large number of updates. In our experience, values from half to ten times this average can still yield convergence, though convergence speed can be affected.

3.3. Vanishing gradient regularization

We opt to address the vanishing gradients problem using a regularization term that represents a preference for parameter values such that back-propagated gradients neither increase nor decrease in magnitude. Our intuition is that increasing the norm of ∂x_t/∂x_k means the error at time t is more sensitive to all inputs u_t, .., u_k (∂x_t/∂x_k is a factor in ∂E_t/∂u_k). In practice some of these inputs will be irrelevant for the prediction at time t and will behave like noise that the network needs to learn to ignore. The network can not learn to ignore these irrelevant inputs unless there is an error signal. These two issues can not be solved in parallel, and it seems natural to expect that we might need to force the network to increase the norm of ∂x_t/∂x_k at the expense of larger errors (caused by the irrelevant input entries) and then wait for it to learn to ignore these input entries. This suggests that moving towards increasing the norm of ∂x_t/∂x_k can not always be done while following a descent direction of the error (which is, e.g., what a second order method would do), and a more natural choice might be a regularization term.

The regularizer we propose prefers solutions for which the error signal preserves norm as it travels back in time:

Ω = Σ_k Ω_k = Σ_k ( ‖(∂E/∂x_{k+1}) (∂x_{k+1}/∂x_k)‖ / ‖∂E/∂x_{k+1}‖ − 1 )²    (9)

In order to be computationally efficient, we only use the "immediate" partial derivative of Ω with respect to W_rec (we consider that x_k and ∂E/∂x_{k+1} are constant with respect to W_rec when computing the derivative of Ω_k), as depicted in eq. (10). This can be done efficiently because we get the values of ∂E/∂x_k from BPTT. We use Theano to compute these gradients (Bergstra et al., 2010; Bastien et al., 2012).

∂⁺Ω/∂W_rec = Σ_k ∂⁺Ω_k/∂W_rec = Σ_k ∂⁺/∂W_rec ( ‖(∂E/∂x_{k+1}) W_rec^T diag(σ'(x_k))‖ / ‖∂E/∂x_{k+1}‖ − 1 )²    (10)

Note that our regularization term only forces the Jacobian matrices ∂x_{k+1}/∂x_k to preserve norm in the relevant direction of the error ∂E/∂x_{k+1}, not in any direction (we do not enforce that all eigenvalues are close to 1). The second observation is that we are using a soft constraint, therefore we are not ensured that the norm of the error signal is preserved. This means that we still need to deal with the exploding gradient problem, as a single step induced by it can disrupt learning. From the dynamical systems perspective we can see that preventing the vanishing gradient problem implies that we are pushing the model towards the boundary of the current basin of attraction (such that during the N steps it does not have time to converge), making it more probable for the gradients to explode.

³ Code: https://github.com/pascanur/trainingRNNs
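The two ingredients just described can be sketched in a few lines of numpy (our own sketch with hypothetical helper names, not the Theano code used in the paper): the norm-clipping rule of Algorithm 1 and the penalty Ω of eq. (9), computed from the per-step error signals ∂E/∂x_{k+1} obtained during BPTT.

    import numpy as np

    def clip_gradient_norm(grad, threshold):
        # Algorithm 1: rescale the whole gradient vector when its norm exceeds the threshold.
        norm = np.linalg.norm(grad)
        if norm >= threshold:
            grad = (threshold / norm) * grad
        return grad

    def omega_penalty(dE_dx_next, xs, W_rec, sigma_prime):
        # Eq. (9): sum over k of (||dE/dx_{k+1} . dx_{k+1}/dx_k|| / ||dE/dx_{k+1}|| - 1)^2,
        # where dE_dx_next[k] is dE/dx_{k+1} (a row vector from BPTT) and xs[k] is x_k.
        omega = 0.0
        for delta, x_k in zip(dE_dx_next, xs):
            back = delta @ (W_rec.T * sigma_prime(x_k))   # dE/dx_{k+1} W_rec^T diag(sigma'(x_k))
            ratio = np.linalg.norm(back) / (np.linalg.norm(delta) + 1e-12)
            omega += (ratio - 1.0) ** 2
        return omega

In the paper the derivative of Ω is taken with Theano while treating x_k and ∂E/∂x_{k+1} as constants (eq. (10)); in a modern autodiff framework the analogous trick would be to stop gradients through those quantities before differentiating the penalty.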

[Figure 7: three panels titled sigmoid, basic tanh and smart tanh; x-axis: sequence length (50–250); y-axis: rate of success; curves: MSGD, MSGD-C, MSGD-CR.]
Figure 7. Rate of success for solving the temporal order problem versus sequence length for different initializations (from left to right: sigmoid, basic tanh and smart tanh). See text.

4. Experiments and results

4.1. Pathological synthetic problems

As done in Martens and Sutskever (2011), we address some of the pathological problems from Hochreiter and Schmidhuber (1997) that require learning long term correlations. We refer the reader to the original paper for a detailed description of the tasks and to the supplementary materials for the complete description of the setup for any of the following experiments.³

4.1.1. The temporal order problem

We consider the temporal order problem as the prototypical pathological problem, extending our results to the other proposed tasks afterwards. The input is a long stream of discrete symbols. At the beginning and middle of the sequence a symbol within {A, B} is emitted. The task consists in classifying the order (either AA, AB, BA, BB) at the end of the sequence.
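The paper does not spell out the exact encoding, so the generator below is only a plausible sketch of the task as described above; the distractor alphabet size and the position ranges of the two relevant symbols are our own assumptions.

    import numpy as np

    def temporal_order_example(T, rng, n_distractors=4):
        # One example of the temporal order task (details assumed, see lead-in).
        # Returns a sequence of T symbol ids and a label in {0,1,2,3} encoding AA, AB, BA, BB.
        # Symbols 0..n_distractors-1 are distractors; n_distractors and n_distractors+1 stand for A and B.
        A = n_distractors
        seq = rng.integers(0, n_distractors, size=T)
        p1 = rng.integers(int(0.1 * T), int(0.2 * T))   # position of the first relevant symbol (assumed range)
        p2 = rng.integers(int(0.5 * T), int(0.6 * T))   # position of the second relevant symbol (assumed range)
        s1, s2 = rng.integers(0, 2, size=2)             # 0 -> A, 1 -> B
        seq[p1], seq[p2] = A + s1, A + s2
        return seq, int(2 * s1 + s2)                    # AA=0, AB=1, BA=2, BB=3

    rng = np.random.default_rng(0)
    seq, label = temporal_order_example(T=100, rng=rng)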

Fig. 7 shows the success rate of standard mini-batch stochastic gradient descent (MSGD), MSGD-C (MSGD enhanced with our clipping strategy) and MSGD-CR (MSGD with the clipping strategy and the regularization term).⁴ It was previously shown (Sutskever, 2012) that initialization plays an important role for training RNNs. We consider three different initializations. "sigmoid" is the most adversarial initialization, where we use a sigmoid unit network with W_rec, W_in, W_out ∼ N(0, 0.01). "basic tanh" uses a tanh unit network with W_rec, W_in, W_out ∼ N(0, 0.1). "smart tanh" also uses tanh units, with W_in, W_out ∼ N(0, 0.01); W_rec is sparse (each unit has only 15 non-zero incoming connections) with the spectral radius fixed to 0.95. In all cases b = b_out = 0.

⁴ When using just the regularization term, without clipping, learning usually fails due to exploding gradients.

The graph shows the success rate out of 5 runs (with different random seeds) for a 50 hidden unit model, where the x-axis contains the length of the sequence. We use a constant learning rate of 0.01 with no momentum. When clipping the gradients, we used a threshold of 1, and the regularization weight was fixed to 4. A run is successful if the number of misclassified sequences was under 1% out of 10000 freshly randomly generated sequences. We allowed a maximum number of 5M updates, and used minibatches of 20 examples.

This task provides empirical evidence that exploding gradients are linked with tasks that require long memory traces. As the length of the sequence increases, using clipping becomes more important to achieve a better success rate. More memory implies a larger spectral radius, which leads to rich regimes where gradients are likely to explode.

Furthermore, we can train a single model to deal with any sequence of length 50 up to 200 (by providing sequences of different random lengths in this interval for different MSGD steps). We achieve a success rate of 100% over 5 seeds in this regime as well (all runs had 0 misclassified sequences in a set of 10000 randomly generated sequences of different lengths). We used an RNN of 50 tanh units initialized in the "basic tanh" regime. The same trained model can address sequences of length up to 5000 steps, lengths never seen during training. Specifically, the same model produced 0 misclassified sequences (out of 10000 sequences of the same length) for lengths of either 50, 100, 150, 200, 250, 500, 1000, 2000, 3000 and 5000. This provides some evidence that the model might use attractors to form some kind of long term memory, rather than relying on transient dynamics (as, for example, ESN networks probably do in Jaeger (2012)).

4.1.2. Other pathological tasks

MSGD-CR is able to solve (100% success on the lengths listed below, for all but one task) other pathological problems from Hochreiter and Schmidhuber (1997), namely the addition problem, the multiplication problem, the 3-bit temporal order problem, the random permutation problem and the noiseless memorization problem in two variants (when the pattern needed to be memorized is 5 bits in length and when it contains over 20 bits of information; see Martens and Sutskever (2011)). For every task we used 5 different runs (with different random seeds). For the first 4 problems we used a single model for lengths up to 200, while for the noiseless memorization we used a different model for each sequence length (50, 100, 150 and 200). The hardest problem, for which only two runs out of 5 succeeded, was the random permutation problem. For the addition and multiplication tasks we observe successful generalization to sequences up to 500 steps (we notice an increase in error with sequence length, though it stays below 1%). Note that for the addition and multiplication problems a sequence is misclassified when the squared error is larger than .04. In most cases, these results outperform Martens and Sutskever (2011) in terms of success rate, they deal with longer sequences than in Hochreiter and Schmidhuber (1997), and compared to (Jaeger, 2012) they generalize to longer sequences. For details see the supplementary materials.

Table 1. Results on polyphonic music prediction, in negative log-likelihood (nll) per time step, and on the natural language task, in bits per character. Lower is better. Bold face shows state of the art for RNN models.

Data set | Data fold | MSGD | MSGD+C | MSGD+CR | State of the art for RNN | State of the art
Piano-midi.de (nll) | train | 6.87 | 6.81 | 7.01 | 7.04 | 6.32
Piano-midi.de (nll) | test | 7.56 | 7.53 | 7.46 | 7.57 | 7.05
Nottingham (nll) | train | 3.67 | 3.21 | 2.95 | 3.20 | 1.81
Nottingham (nll) | test | 3.80 | 3.48 | 3.36 | 3.43 | 2.31
MuseData (nll) | train | 8.25 | 6.54 | 6.43 | 6.47 | 5.20
MuseData (nll) | test | 7.11 | 7.00 | 6.97 | 6.99 | 5.60
Penn Treebank, 1 step (bits/char) | train | 1.46 | 1.34 | 1.36 | N/A | N/A
Penn Treebank, 1 step (bits/char) | test | 1.50 | 1.42 | 1.41 | 1.41 | 1.37
Penn Treebank, 5 steps (bits/char) | train | N/A | 3.76 | 3.70 | N/A | N/A
Penn Treebank, 5 steps (bits/char) | test | N/A | 3.89 | 3.74 | N/A | N/A

4.2. Natural problems

We address the task of polyphonic music prediction, using the datasets Piano-midi.de, Nottingham and MuseData described in Boulanger-Lewandowski et al. (2012), and language modelling at the character level on the Penn Treebank dataset (Mikolov et al., 2012). We also explore a modified version of the latter task, where we require the model to predict the 5th character in the future (instead of the next one). We assume that for solving this modified task long term correlations are more important than short term ones, and hence our regularization term should be more helpful.

The training and test scores for all natural problems are reported in Table 1, as well as the state of the art for these tasks. See the additional material for hyper-parameters and experimental setup. We note that keeping the regularization weight fixed (as for the previous tasks) seems to harm learning. One needs to use a 1/t decreasing schedule for this term. We hypothesize that minimizing the regularization term affects the ability of the model to learn short term correlations which are important for these tasks, though a more careful investigation is lacking.

These results suggest that clipping solves an optimization issue and does not act as a regularizer, as both the training and test error improve in general. Results on Penn Treebank reach the state of the art for RNNs achieved by Mikolov et al. (2012), who used a different clipping algorithm similar to ours, thus providing evidence that both behave similarly. A maximum entropy model with up to 15-gram features holds the state of the art on Penn Treebank (character level). However, RNN models with some clipping strategy hold the state of the art for word-level and character-level modelling on larger datasets (Mikolov et al., 2011; 2012).

For polyphonic music prediction we obtain the state of the art for RNN models, though RNN-NADE, a probabilistic recurrent model, trained with Hessian-Free (Bengio et al., 2012) does better.

5. Summary and conclusions

We provided different perspectives through which one can gain more insight into the exploding and vanishing gradients issue. We put forward a hypothesis stating that when gradients explode we have a cliff-like structure in the error surface, and devise a simple solution based on this hypothesis, clipping the norm of the exploded gradients. The effectiveness of our proposed solutions provides some indirect empirical evidence towards the validity of our hypothesis, though further investigations are required. In order to deal with the vanishing gradient problem we use a regularization term that forces the error signal not to vanish as it travels back in time. This regularization term forces the Jacobian matrices ∂x_i/∂x_{i−1} to preserve norm only in relevant directions. In practice we show that these solutions improve performance of RNNs on the pathological synthetic datasets considered, polyphonic music prediction and language modelling.

Acknowledgements

We would like to thank the Theano development team. We acknowledge RQCHP, Compute Canada, NSERC, FQRNT and CIFAR for the resources they provide.

References

Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. Submitted to the Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.

Bengio, Y., Frasconi, P., and Simard, P. (1993). The problem of learning long-term dependencies in recurrent networks. Pages 1183–1195, San Francisco. IEEE Press. (Invited paper).

Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.

Bengio, Y., Boulanger-Lewandowski, N., and Pascanu, R. (2012). Advances in optimizing recurrent networks. Technical Report arXiv:1212.0901, U. Montreal.

Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral presentation.

Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2012). Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In Proceedings of the Twenty-ninth International Conference on Machine Learning (ICML'12). ACM.

Doya, K. (1993). Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on Neural Networks, 1, 75–80.

Doya, K. and Yoshizawa, S. (1991). Adaptive synchronization of neural and physical oscillators. In J. E. Moody, S. J. Hanson, and R. Lippmann, editors, NIPS, pages 109–116. Morgan Kaufmann.

Elman, J. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211.

Graves, A., Liwicki, M., Fernandez, S., Bertolami, R., Bunke, H., and Schmidhuber, J. (2009). A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5), 855–868.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.

Jaeger, H. (2012). Long short-term memory in echo state networks: Details of a simulation study. Technical report, Jacobs University Bremen.

Jaeger, H. and Haas, H. (2004). Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless telecommunication. Science, 304(5667), 78–80.

Jaeger, H., Lukosevicius, M., Popovici, D., and Siewert, U. (2007). Optimization and applications of echo state networks with leaky-integrator neurons. Neural Networks, 20(3), 335–352.

Martens, J. and Sutskever, I. (2011). Learning recurrent neural networks with Hessian-free optimization. In Proc. ICML'2011. ACM.

Mikolov, T. (2012). Statistical Language Models based on Neural Networks. Ph.D. thesis, Brno University of Technology.

Mikolov, T., Deoras, A., Kombrink, S., Burget, L., and Cernocky, J. (2011). Empirical evaluation and combination of advanced language modeling techniques. In Proc. 12th Annual Conference of the International Speech Communication Association (INTERSPEECH 2011).

Mikolov, T., Sutskever, I., Deoras, A., Le, H.-S., Kombrink, S., and Cernocky, J. (2012). Subword language modeling with neural networks. Preprint (http://www.fit.vutbr.cz/~imikolov/rnnlm/char.pdf).

Pascanu, R. and Jaeger, H. (2011). A neurodynamical model for working memory. Neural Networks, 24, 199–207.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.

Strogatz, S. (1994). Nonlinear Dynamics And Chaos: With Applications To Physics, Biology, Chemistry, And Engineering (Studies in Nonlinearity). Perseus Books Group, 1st edition.

Sutskever, I. (2012). Training Recurrent Neural Networks. Ph.D. thesis, University of Toronto.

Sutskever, I., Martens, J., and Hinton, G. (2011). Generating text with recurrent neural networks. In L. Getoor and T. Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, New York, NY, USA. ACM.

Werbos, P. J. (1988). Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4), 339–356.