
On the difficulty of training recurrent neural networks

Razvan Pascanu [email protected]
Université de Montréal, 2920, chemin de la Tour, Montréal, Québec, Canada, H3T 1J8

Tomas Mikolov [email protected]
Speech@FIT, Brno University of Technology, Brno, Czech Republic

Yoshua Bengio [email protected]
Université de Montréal, 2920, chemin de la Tour, Montréal, Québec, Canada, H3T 1J8

Abstract

There are two widely known issues with properly training recurrent neural networks, the vanishing and the exploding gradient problems detailed in Bengio et al. (1994). In this paper we attempt to improve the understanding of the underlying issues by exploring these problems from an analytical, a geometric and a dynamical systems perspective. Our analysis is used to justify a simple yet effective solution. We propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem. We validate empirically our hypothesis and proposed solutions in the experimental section.

Figure 1. Schematic of a recurrent neural network. The recurrent connections in the hidden layer allow information to persist from one input to another.

1. Introduction

A recurrent neural network (RNN), e.g. Fig. 1, is a neural network model proposed in the 80's (Rumelhart et al., 1986; Elman, 1990; Werbos, 1988) for modeling time series. The structure of the network is similar to that of a standard multilayer perceptron, with the distinction that we allow connections among hidden units associated with a time delay. Through these connections the model can retain information about the past, enabling it to discover temporal correlations between events that are far away from each other in the data.

While in principle the recurrent network is a simple and powerful model, in practice it is hard to train properly. Among the main reasons why this model is so unwieldy are the vanishing gradient and exploding gradient problems described in Bengio et al. (1994).

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

1.1. Training recurrent networks

A generic recurrent neural network, with input u_t and state x_t for time step t, is given by:

x_t = F(x_{t-1}, u_t, θ)    (1)

In the theoretical section of this paper we will sometimes make use of the specific parametrization¹:

x_t = W_rec σ(x_{t-1}) + W_in u_t + b    (2)

In this case, the parameters of the model are given by the recurrent weight matrix W_rec, the bias b and the input weight matrix W_in, collected in θ for the general case. x_0 is provided by the user, set to zero or learned, and σ is an element-wise nonlinearity. A cost E = Σ_{1≤t≤T} E_t measures the performance of the network on some given task, where E_t = L(x_t).

¹ For any model respecting eq. (2) we can construct another model following the more widely known equation x_t = σ(W_rec x_{t-1} + W_in u_t + b) that behaves the same (e.g. by redefining E_t := E_t(σ(x_t))) and vice versa. Therefore the two formulations are equivalent. We chose eq. (2) for convenience.
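To make the parametrization of eq. (2) concrete, the snippet below is a minimal numpy sketch of the forward recurrence (our own illustration, not code from the paper; the sizes, the input sequence and the per-step loss are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    n_hid, n_in, T = 20, 5, 30                    # hypothetical sizes and sequence length
    W_rec = 0.1 * rng.standard_normal((n_hid, n_hid))
    W_in = 0.1 * rng.standard_normal((n_hid, n_in))
    b = np.zeros(n_hid)

    x = np.zeros(n_hid)                           # x_0 set to zero, one of the options mentioned above
    E = 0.0
    for t in range(T):
        u_t = rng.standard_normal(n_in)           # placeholder input at step t
        x = W_rec @ np.tanh(x) + W_in @ u_t + b   # eq. (2) with sigma = tanh
        E += 0.5 * np.sum(x ** 2)                 # a toy per-step cost E_t = L(x_t)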

One approach for computing the necessary gradients is backpropagation through time (BPTT), where the recurrent model is represented as a multi-layer one (with an unbounded number of layers) and backpropagation is applied on the unrolled model (see Fig. 2).

Figure 2. Unrolling recurrent neural networks in time by creating a copy of the model for each time step. We denote by x_t the hidden state of the network at time t, by u_t the input of the network at time t and by E_t the error obtained from the output at time t.

We will diverge from the classical BPTT equations at this point and re-write the gradients in order to better highlight the exploding gradients problem:

∂E/∂θ = Σ_{1≤t≤T} ∂E_t/∂θ    (3)

∂E_t/∂θ = Σ_{1≤k≤t} (∂E_t/∂x_t) (∂x_t/∂x_k) (∂⁺x_k/∂θ)    (4)

∂x_t/∂x_k = Π_{t≥i>k} ∂x_i/∂x_{i-1} = Π_{t≥i>k} W_rec^T diag(σ'(x_{i-1}))    (5)

These equations were obtained by writing the gradients in a sum-of-products form. ∂⁺x_k/∂θ refers to the "immediate" partial derivative of the state x_k with respect to θ, where x_{k-1} is taken as a constant with respect to θ.² Specifically, considering eq. (2), the value of any row i of the matrix ∂⁺x_k/∂W_rec is just σ(x_{k-1}). Eq. (5) also provides the form of the Jacobian matrix ∂x_i/∂x_{i-1} for the specific parametrization given in eq. (2), where diag converts a vector into a diagonal matrix, and σ' computes element-wise the derivative of σ.

² We use "immediate" partial derivative in order to avoid confusion, though one can use the concept of total derivative and the proper meaning of partial derivative to express the same property.

Any gradient component ∂E_t/∂θ is also a sum (see eq. (4)), whose terms we refer to as temporal contributions or temporal components. One can see that each such temporal contribution (∂E_t/∂x_t)(∂x_t/∂x_k)(∂⁺x_k/∂θ) measures how θ at step k affects the cost at step t > k. The factors ∂x_t/∂x_k (eq. (5)) transport the error "in time" from step t back to step k. We would further loosely distinguish between long term and short term contributions, where long term refers to components for which k ≪ t and short term to everything else.

2. Exploding and vanishing gradients

Introduced in Bengio et al. (1994), the exploding gradients problem refers to the large increase in the norm of the gradient during training. Such events are due to the explosion of the long term components, which can grow exponentially more than short term ones. The vanishing gradients problem refers to the opposite behaviour, when long term components go exponentially fast to norm 0, making it impossible for the model to learn correlations between temporally distant events.

2.1. The mechanics

To understand this phenomenon we need to look at the form of each temporal component, and in particular at the matrix factors ∂x_t/∂x_k (see eq. (5)) that take the form of a product of t−k Jacobian matrices. In the same way a product of t−k real numbers can shrink to zero or explode to infinity, so does this product of matrices (along some direction v).

In what follows we will try to formalize these intuitions (extending similar derivations done in Bengio et al. (1994) where a single hidden unit case was considered).

If we consider a linear version of the model (i.e. set σ to the identity function in eq. (2)) we can use the power iteration method to formally analyze this product of Jacobian matrices and obtain tight conditions for when the gradients explode or vanish (see the supplementary materials). It is sufficient for ρ < 1, where ρ is the spectral radius of the recurrent weight matrix W_rec, for long term components to vanish (as t → ∞) and necessary for ρ > 1 for them to explode.

We generalize this result to nonlinear functions σ where |σ'(x)| is bounded, ‖diag(σ'(x_k))‖ ≤ γ ∈ R, by relying on singular values.

We first prove that it is sufficient for λ_1 < 1/γ, where λ_1 is the largest singular value of W_rec, for the vanishing gradient problem to occur. Note that we assume the parametrization given by eq. (2). The Jacobian matrix ∂x_{k+1}/∂x_k is given by W_rec^T diag(σ'(x_k)). The 2-norm of this Jacobian is bounded by the product of the norms of the two matrices (see eq. (6)). Due to our assumption, this implies that it is smaller than 1.

∀k, ‖∂x_{k+1}/∂x_k‖ ≤ ‖W_rec^T‖ ‖diag(σ'(x_k))‖ < (1/γ) γ < 1    (6)

Let η ∈ R be such that ∀k, ‖∂x_{k+1}/∂x_k‖ ≤ η < 1. The existence of η is given by eq. (6). By induction over i, we can show that

‖(∂E_t/∂x_t) (Π_{i=k}^{t-1} ∂x_{i+1}/∂x_i)‖ ≤ η^{t-k} ‖∂E_t/∂x_t‖    (7)

As η < 1, it follows that, according to eq. (7), long term contributions (for which t−k is large) go to 0 exponentially fast with t−k.

By inverting this proof we get the necessary condition for exploding gradients, namely that the largest singular value λ_1 is larger than 1/γ (otherwise the long term components would vanish instead of exploding). For tanh we have γ = 1, while for the sigmoid we have γ = 1/4.
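The bound of eqs. (6)–(7) can be checked numerically. The following sketch (ours, not from the paper; sizes and seed are arbitrary) builds the product of Jacobians from eq. (5) along a random trajectory of eq. (2) with σ = tanh (so γ = 1) and a recurrent matrix rescaled so that λ_1 = 0.9 < 1/γ; the 2-norm of the product stays below 0.9^(t−k) and vanishes exponentially:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 50
    W_rec = rng.standard_normal((n, n))
    W_rec *= 0.9 / np.linalg.norm(W_rec, 2)      # largest singular value lambda_1 = 0.9 < 1/gamma (gamma = 1 for tanh)
    W_in = 0.1 * rng.standard_normal((n, n))

    x = np.zeros(n)
    prod = np.eye(n)                             # running product of Jacobians, eq. (5), with k = 0
    for t in range(1, 51):
        x_prev = x
        x = W_rec @ np.tanh(x_prev) + W_in @ rng.standard_normal(n)   # eq. (2), b = 0, random input
        J = W_rec.T * (1.0 - np.tanh(x_prev) ** 2)                    # W_rec^T diag(sigma'(x_{t-1}))
        prod = J @ prod
        if t % 10 == 0:
            print(t, np.linalg.norm(prod, 2), 0.9 ** t)               # product norm vs. the bound eta^t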
2.2. Drawing similarities with dynamical systems

We can improve our understanding of the exploding gradients and vanishing gradients problems by employing a dynamical systems perspective, as it was done before in Doya (1993); Bengio et al. (1993). We recommend reading Strogatz (1994) for a formal and detailed treatment of dynamical systems theory.

For any parameter assignment θ, depending on the initial state x_0, the state x_t of an autonomous dynamical system converges, under the repeated application of the map F, to one of several possible different attractor states. The model could also find itself in a chaotic regime, a case in which some of the following observations may not hold, but that is not treated in depth here. Attractors describe the asymptotic behaviour of the model. The state space is divided into basins of attraction, one for each attractor. If the model is started in one basin of attraction, the model will converge to the corresponding attractor as t grows.

Dynamical systems theory says that as θ changes, the asymptotic behaviour changes smoothly almost everywhere except for certain crucial points where drastic changes occur (the new asymptotic behaviour ceases to be topologically equivalent to the old one). These points form bifurcation boundaries and are caused by attractors that appear, disappear or change shape.

Doya (1993) hypothesizes that such bifurcation crossings could cause the gradients to explode. We would like to extend this observation into a sufficient condition for gradients to explode, by addressing the issue of crossing boundaries between basins of attraction.

We argue that bifurcations are global events that can have locally no effect and therefore crossing them is neither sufficient nor necessary for the gradients to explode. For example, due to a bifurcation one attractor can disappear, but if the model's state is not found in the basin of attraction of said attractor, learning will not be affected by it. However, when either a change in the state or in the position of the boundary between basins of attraction (which can be caused by a change in θ) is such that the model falls in a different basin than before, the state will be attracted in a different direction, resulting in the explosion of the gradients. This crossing of boundaries between basins of attraction is a local event and it is sufficient for the gradients to explode. By assuming that crossing into an emerging attractor or from a disappearing one qualifies as crossing a boundary between attractors, this term encapsulates also the observations from Doya (1993).

We will re-use the one-hidden-unit model (and plot) from Doya (1993) (see Fig. 3) to depict our extension, though, as the original hypothesis, our extension does not depend on the dimensionality of the model. The x-axis covers the parameter b and the y-axis the asymptotic state x_∞. The bold line follows the movement of the final point attractor, x_∞, as b changes. At b_1 we have a bifurcation boundary where a new attractor emerges (when b decreases from ∞), while at b_2 we have another that results in the disappearance of one of the two attractors. In the interval (b_1, b_2) we are in a rich regime, where there are two attractors and the change in position of the boundary between them, as we change b, is traced out by a dashed line. The gray dashed arrows describe the evolution of the state x if the network is initialized in that region. Crossing the boundary between basins of attraction is depicted with unfilled circles, where a small change in the state at time 0 results in a sudden large change in x_t. Crossing a bifurcation (Doya (1993)'s original hypothesis) is shown with filled circles. Note how in the figure there are only two values of b with a bifurcation, but a whole range of values for which there can be a boundary crossing.

Figure 3. Bifurcation diagram of a single hidden unit RNN (with fixed recurrent weight of 5.0 and adjustable bias b; example introduced in Doya (1993)). See text.

Another limitation of the previous analysis is that it considers autonomous systems and assumes the observations hold for input-driven models as well. In (Bengio et al., 1994) input is dealt with by assuming it is bounded noise. The downside of this approach is that it limits how one can reason about the input. In practice, the input is supposed to drive the dynamical system, being able to leave the model in some attractor state, or kick it out of the basin of attraction when certain triggering patterns present themselves.

We propose to extend our analysis to input-driven models by folding the input into the map. We consider the family of maps F_t, where we apply a different F_t at each step. Intuitively, for the gradients to explode we require the same behaviour as before, where (at least in some direction) the maps F_1, .., F_t agree and change direction. Fig. 4 describes this behaviour.

Figure 4. This diagram illustrates how the change in x_t, Δx_t, can be large for a small Δx_0. The blue vs red (left vs right) trajectories are generated by the same maps F_1, F_2, ... for two different initial states.

For the specific parametrization in eq. (2) we can take the analogy one step further by decomposing the maps F_t into a fixed map F̃ and a time-varying one U_t. F̃(x) = W_rec σ(x) + b corresponds to an input-less recurrent network, while U_t(x) = x + W_in u_t describes the effect of the input. This is depicted in Fig. 5. Since U_t changes with time, it can not be analyzed using standard dynamical systems tools, but F̃ can. This means that when a boundary between basins of attraction is crossed for F̃, the state will move towards a different attractor, which for large t could lead (unless the input maps U_t are opposing this) to a large discrepancy in x_t. Therefore studying the asymptotic behaviour of F̃ can provide useful information about where such events are likely to happen.

Figure 5. Illustrates how one can break apart the maps F_1, .., F_t into a constant map F̃ and the maps U_1, .., U_t. The dotted vertical line represents the boundary between basins of attraction, and the straight dashed arrows the direction of the map F̃ on each side of the boundary. This diagram is an extension of Fig. 4.

One interesting observation from the dynamical systems perspective with respect to vanishing gradients is the following. If the factors ∂x_t/∂x_k go to zero (for t−k large), it means that x_t does not depend on x_k (if we change x_k by some Δ, x_t stays the same). This translates into the model at x_t being close to convergence towards some attractor (which it would reach from anywhere in the neighbourhood of x_k). Therefore avoiding the vanishing gradient means staying close to the boundaries between basins of attraction.

2.3. The geometrical interpretation

Let us consider a simple one hidden unit model (eq. (8)) where we provide an initial state x_0 and train the model to have a specific target value after 50 steps. Note that for simplicity we assume no input.

x_t = w σ(x_{t-1}) + b    (8)

Fig. 6 shows the error surface E_50 = (σ(x_50) − 0.7)², where x_0 = 0.5 and σ is the sigmoid function.

Figure 6. We plot the error surface of a single hidden unit recurrent network, highlighting the existence of high curvature walls. The solid lines depict standard trajectories that gradient descent might follow. Using dashed arrows the diagram shows what would happen if the gradient is rescaled to a fixed size when its norm is above a threshold.

We can easily analyze the behavior of the model in the linear case, with b = 0, i.e. x_t = x_0 w^t. We have that ∂x_t/∂w = t x_0 w^{t−1} and ∂²x_t/∂w² = t(t−1) x_0 w^{t−2}, implying that when the first derivative explodes, so does the second derivative.

In the general case, when the gradients explode they do so along some directions v. There exists, in such situations, a v such that (∂E_t/∂θ) v ≥ C α^t, where C, α ∈ R and α > 1 (e.g., in the linear case, v is the eigenvector corresponding to the largest eigenvalue of W_rec). If this bound is tight, we hypothesize that when gradients explode so does the curvature along v, leading to a wall in the error surface, like the one seen in Fig. 6. Based on this hypothesis we devise a simple solution to the exploding gradient problem (see Fig. 6).

If both the gradient and the leading eigenvector of the curvature are aligned with the exploding direction v, it follows that the error surface has a steep wall perpendicular to v (and consequently to the gradient). This means that when stochastic gradient descent (SGD) reaches the wall and does a gradient descent step, it will be forced to jump across the valley, moving perpendicular to the steep walls, possibly leaving the valley and disrupting the learning process.
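As a rough illustration of this cliff-like structure (our own sketch, not an experiment from the paper), the snippet below evaluates E_50 and dE_50/dw for the one-unit model of eq. (8) at two bias values: b = 0 and a bias chosen (our own construction) so that x_0 = 0.5 sits exactly on an unstable fixed point of the map. The two gradients differ by many orders of magnitude, and rescaling the norm of the large one to a fixed threshold keeps the update bounded, as in the clipping strategy of Sec. 3.2:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def error_and_grad(w, b, x0=0.5, T=50, target=0.7):
        # One hidden unit RNN from eq. (8): x_t = w * sigmoid(x_{t-1}) + b.
        # Returns E_50 = (sigmoid(x_50) - target)^2 and dE_50/dw, accumulated in forward mode.
        x, dx_dw = x0, 0.0
        for _ in range(T):
            s = sigmoid(x)
            dx_dw = s + w * s * (1.0 - s) * dx_dw   # chain rule through one recurrence step
            x = w * s + b
        y = sigmoid(x)
        return (y - target) ** 2, 2.0 * (y - target) * y * (1.0 - y) * dx_dw

    w = 8.0
    b_wall = 0.5 - w * sigmoid(0.5)       # places x_0 = 0.5 on an unstable fixed point (our construction)
    for b in [0.0, b_wall]:
        E, g = error_and_grad(w, b)
        threshold = 1.0
        step = g if abs(g) <= threshold else threshold * np.sign(g)   # rescale the norm above a threshold
        print(f"b={b:8.4f}  E50={E:.4f}  dE/dw={g: .3e}  rescaled dE/dw={step: .3e}")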

The dashed arrows in Fig. 6 correspond to ignoring the norm of this large step, ensuring that the model stays close to the wall. One key insight is that all the steps taken when the gradient explodes are aligned with v and ignore other descent directions. At the wall, a bounded-norm step in the direction of the gradient therefore merely pushes us back inside the smoother low-curvature region beside the wall, whereas a regular gradient step would bring us very far, thus slowing or preventing further training.

The important assumption in this scenario, compared to the classical high curvature valley, is that the valley is wide, so that we have a large region around the wall where, if we land, we can rely on first order methods to move towards a local minimum. The effectiveness of clipping provides indirect evidence for this, as otherwise, even with clipping, SGD would not be able to explore descent directions of low curvature and therefore further minimize the error.

Our hypothesis, when it holds, could also provide another argument in favor of the Hessian-Free approach compared to other second order methods (among other existing arguments). Hessian-Free, and truncated Newton methods in general, compute a new estimate of the Hessian before each update step and can take into account abrupt changes in curvature (such as the ones suggested by our hypothesis), while other approaches use a smoothness assumption, averaging 2nd order signals over many steps.

3. Dealing with the exploding and vanishing gradient

3.1. Previous solutions

Using an L1 or L2 penalty on the recurrent weights can help with exploding gradients. Assuming weights are initialized to small values, the largest singular value λ_1 of W_rec is probably smaller than 1. The L1/L2 term can ensure that during training λ_1 stays smaller than 1, and in this regime gradients can not explode (see sec. 2.1). This approach limits the model to a single point attractor at the origin, where any information inserted in the model dies out exponentially fast. This prevents the model from learning generator networks, nor can it exhibit long term memory traces.

Doya (1993) proposes to pre-program the model (to initialize the model in the right regime) or to use teacher forcing. The first proposal assumes that if the model exhibits from the beginning the same kind of asymptotic behaviour as the one required by the target, then there is no need to cross a bifurcation boundary. The downside is that one can not always know the required asymptotic behaviour, and, even if it is known, it might not be trivial to initialize the model accordingly. Also, such initialization does not prevent crossing the boundary between basins of attraction.

Teacher forcing refers to using targets for some or all hidden units. When computing the state at time t, we use the targets at t−1 as the value of all the hidden units in x_{t−1} which have a target defined. It has been shown that in practice it can reduce the chance that gradients explode, and even allow training generator models or models that work with unbounded amounts of memory (Pascanu and Jaeger, 2011; Doya and Yoshizawa, 1991). One important downside is that it requires a target to be defined at every time step.

Hochreiter and Schmidhuber (1997); Graves et al. (2009) propose the LSTM model to deal with the vanishing gradients problem. It relies on a special type of linear unit with a self connection of value 1. The flow of information into and out of the unit is guarded by learned input and output gates. There are several variations of this basic structure. This solution does not explicitly address the exploding gradients problem.

Sutskever et al. (2011) use the Hessian-Free optimizer in conjunction with structural damping. This approach seems able to address the vanishing gradient problem, though a more detailed analysis is missing. Presumably this method works because in high dimensional spaces there is a high probability for long term components to be orthogonal to short term ones. This would allow the Hessian to rescale these components independently. In practice, one can not guarantee that this property holds. The method addresses the exploding gradient as well, as it takes curvature into account. Structural damping is an enhancement that forces the Jacobian matrices ∂x_t/∂θ to have small norm, hence further helping with the exploding gradients problem. The need for this extra term when solving the pathological problems might suggest that second order derivatives do not always grow at the same rate as first order ones.

Echo State Networks (Jaeger and Haas, 2004) avoid the exploding and vanishing gradients problem by not learning W_rec and W_in. They are sampled from hand-crafted distributions. Because the spectral radius of W_rec is, by construction, smaller than 1, information fed into the model dies out exponentially fast. An extension to the model is given by leaky integration units (Jaeger et al., 2007), where

x_k = α x_{k−1} + (1 − α) σ(W_rec x_{k−1} + W_in u_k + b).

These units can be used to solve the standard benchmark proposed by Hochreiter and Schmidhuber (1997) for learning long term dependencies (Jaeger, 2012).
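A small sketch of the leaky-integration update quoted above, under the usual ESN assumptions (fixed, randomly sampled weights rescaled to a spectral radius below 1; the sizes, leak rate and input sequence here are hypothetical, and only a readout on top of the state would be trained):

    import numpy as np

    rng = np.random.default_rng(1)
    n, alpha = 100, 0.9                                   # reservoir size and leak rate (hypothetical)
    W_rec = rng.standard_normal((n, n))
    W_rec *= 0.95 / max(abs(np.linalg.eigvals(W_rec)))    # fix the spectral radius below 1
    W_in = rng.uniform(-0.1, 0.1, size=n)
    b = np.zeros(n)

    x = np.zeros(n)
    for u in rng.standard_normal(20):                     # drive the reservoir with a toy scalar input
        x = alpha * x + (1 - alpha) * np.tanh(W_rec @ x + W_in * u + b)
    # x now holds the leaky-integrated reservoir state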

We would make a final note about the approach proposed by Tomas Mikolov in his PhD thesis (Mikolov, 2012) (and implicitly used in the state of the art results on language modelling (Mikolov et al., 2011)). It involves clipping the gradient's temporal components element-wise (clipping an entry when it exceeds in absolute value a fixed threshold).

3.2. Scaling down the gradients

As suggested in section 2.3, one mechanism to deal with the exploding gradient problem is to rescale the gradient's norm whenever it goes over a threshold:

Algorithm 1 Pseudo-code for norm clipping
    ĝ ← ∂E/∂θ
    if ‖ĝ‖ ≥ threshold then
        ĝ ← (threshold / ‖ĝ‖) ĝ
    end if

This algorithm is similar to the one proposed by Tomas Mikolov, and we only diverged from the original proposal in an attempt to provide a better theoretical justification (see section 2.3; we also move in a descent direction for the current mini-batch). We regard this work as an investigation of why clipping works. In practice both variants behave similarly.

The proposed clipping is simple and computationally efficient, but it does introduce an additional hyper-parameter, namely the threshold. One good heuristic for setting this threshold is to look at statistics on the average norm over a sufficiently large number of updates. In our experience, values from half to ten times this average can still yield convergence, though convergence speed can be affected.

3.3. Vanishing gradient regularization

We opt to address the vanishing gradients problem using a regularization term that represents a preference for parameter values such that back-propagated gradients neither increase nor decrease in magnitude. Our intuition is that increasing the norm of ∂x_t/∂x_k means the error at time t is more sensitive to all inputs u_t, .., u_k (∂x_t/∂x_k is a factor in ∂E_t/∂u_k). In practice some of these inputs will be irrelevant for the prediction at time t and will behave like noise that the network needs to learn to ignore. The network can not learn to ignore these irrelevant inputs unless there is an error signal. These two issues can not be solved in parallel, and it seems natural to expect that we might need to force the network to increase the norm of ∂x_t/∂x_k at the expense of larger errors (caused by the irrelevant input entries) and then wait for it to learn to ignore these input entries. This suggests that moving towards increasing the norm of ∂x_t/∂x_k can not always be done while following a descent direction of the error (which is, e.g., what a second order method would do), and a more natural choice might be a regularization term.

The regularizer we propose prefers solutions for which the error signal preserves norm as it travels back in time:

Ω = Σ_k Ω_k = Σ_k ( ‖(∂E/∂x_{k+1}) (∂x_{k+1}/∂x_k)‖ / ‖∂E/∂x_{k+1}‖ − 1 )²    (9)

In order to be computationally efficient, we only use the "immediate" partial derivative of Ω with respect to W_rec (we consider that x_k and ∂E/∂x_{k+1} are constant with respect to W_rec when computing the derivative of Ω_k), as depicted in eq. (10). This can be done efficiently because we get the values of ∂E/∂x_k from BPTT. We use Theano to compute these gradients (Bergstra et al., 2010; Bastien et al., 2012).

∂⁺Ω/∂W_rec = Σ_k ∂⁺Ω_k/∂W_rec = Σ_k ∂⁺/∂W_rec ( ‖(∂E/∂x_{k+1}) W_rec^T diag(σ'(x_k))‖ / ‖∂E/∂x_{k+1}‖ − 1 )²    (10)

Note that our regularization term only forces the Jacobian matrices ∂x_{k+1}/∂x_k to preserve norm in the relevant direction of the error ∂E/∂x_{k+1}, not in any direction (we do not enforce that all eigenvalues are close to 1). The second observation is that we are using a soft constraint, therefore we are not ensured that the norm of the error signal is preserved. This means that we still need to deal with the exploding gradient problem, as a single step induced by it can disrupt learning. From the dynamical systems perspective we can see that preventing the vanishing gradient problem implies that we are pushing the model towards the boundary of the current basin of attraction (such that during the N steps it does not have time to converge), making it more probable for the gradients to explode.

³ Code: https://github.com/pascanur/trainingRNNs
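The two ingredients just described can be sketched in a few lines of numpy (our own sketch with hypothetical helper names, not the Theano code used in the paper): the norm-clipping rule of Algorithm 1 and the penalty Ω of eq. (9), computed from the per-step error signals ∂E/∂x_{k+1} obtained during BPTT.

    import numpy as np

    def clip_gradient_norm(grad, threshold):
        # Algorithm 1: rescale the whole gradient vector when its norm exceeds the threshold.
        norm = np.linalg.norm(grad)
        if norm >= threshold:
            grad = (threshold / norm) * grad
        return grad

    def omega_penalty(dE_dx_next, xs, W_rec, sigma_prime):
        # Eq. (9): sum over k of (||dE/dx_{k+1} . dx_{k+1}/dx_k|| / ||dE/dx_{k+1}|| - 1)^2,
        # where dE_dx_next[k] is dE/dx_{k+1} (a row vector from BPTT) and xs[k] is x_k.
        omega = 0.0
        for delta, x_k in zip(dE_dx_next, xs):
            back = delta @ (W_rec.T * sigma_prime(x_k))   # dE/dx_{k+1} W_rec^T diag(sigma'(x_k))
            ratio = np.linalg.norm(back) / (np.linalg.norm(delta) + 1e-12)
            omega += (ratio - 1.0) ** 2
        return omega

In the paper the derivative of Ω is taken with Theano while treating x_k and ∂E/∂x_{k+1} as constants (eq. (10)); in a modern autodiff framework the analogous trick would be to stop gradients through those quantities before differentiating the penalty.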

[Figure 7: three panels titled sigmoid, basic tanh and smart tanh; x-axis: sequence length (50–250); y-axis: rate of success; curves: MSGD, MSGD-C, MSGD-CR.]
Figure 7. Rate of success for solving the temporal order problem versus sequence length for different initializations (from left to right: sigmoid, basic tanh and smart tanh). See text.

4. Experiments and results

4.1. Pathological synthetic problems

As done in Martens and Sutskever (2011), we address some of the pathological problems from Hochreiter and Schmidhuber (1997) that require learning long term correlations. We refer the reader to the original paper for a detailed description of the tasks and to the supplementary materials for the complete description of the setup for any of the following experiments.³

4.1.1. The temporal order problem

We consider the temporal order problem as the prototypical pathological problem, extending our results to the other proposed tasks afterwards. The input is a long stream of discrete symbols. At the beginning and middle of the sequence a symbol within {A, B} is emitted. The task consists in classifying the order (either AA, AB, BA, BB) at the end of the sequence.
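The paper does not spell out the exact encoding, so the generator below is only a plausible sketch of the task as described above; the distractor alphabet size and the position ranges of the two relevant symbols are our own assumptions.

    import numpy as np

    def temporal_order_example(T, rng, n_distractors=4):
        # One example of the temporal order task (details assumed, see lead-in).
        # Returns a sequence of T symbol ids and a label in {0,1,2,3} encoding AA, AB, BA, BB.
        # Symbols 0..n_distractors-1 are distractors; n_distractors and n_distractors+1 stand for A and B.
        A = n_distractors
        seq = rng.integers(0, n_distractors, size=T)
        p1 = rng.integers(int(0.1 * T), int(0.2 * T))   # position of the first relevant symbol (assumed range)
        p2 = rng.integers(int(0.5 * T), int(0.6 * T))   # position of the second relevant symbol (assumed range)
        s1, s2 = rng.integers(0, 2, size=2)             # 0 -> A, 1 -> B
        seq[p1], seq[p2] = A + s1, A + s2
        return seq, int(2 * s1 + s2)                    # AA=0, AB=1, BA=2, BB=3

    rng = np.random.default_rng(0)
    seq, label = temporal_order_example(T=100, rng=rng)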

Fig. 7 shows the success rate of standard mini-batch stochastic gradient descent (MSGD), MSGD-C (MSGD enhanced with our clipping strategy) and MSGD-CR (MSGD with the clipping strategy and the regularization term).⁴ It was previously shown (Sutskever, 2012) that initialization plays an important role for training RNNs. We consider three different initializations. "sigmoid" is the most adversarial initialization, where we use a sigmoid unit network with W_rec, W_in, W_out ∼ N(0, 0.01). "basic tanh" uses a tanh unit network with W_rec, W_in, W_out ∼ N(0, 0.1). "smart tanh" also uses tanh units, with W_in, W_out ∼ N(0, 0.01); W_rec is sparse (each unit has only 15 non-zero incoming connections) with the spectral radius fixed to 0.95. In all cases b = b_out = 0.

⁴ When using just the regularization term, without clipping, learning usually fails due to exploding gradients.

The graph shows the success rate out of 5 runs (with different random seeds) for a 50 hidden unit model, where the x-axis contains the length of the sequence. We use a constant learning rate of 0.01 with no momentum. When clipping the gradients, we used a threshold of 1, and the regularization weight was fixed to 4. A run is successful if the number of misclassified sequences was under 1% out of 10000 freshly randomly generated sequences. We allowed a maximum number of 5M updates, and used minibatches of 20 examples.

This task provides empirical evidence that exploding gradients are linked with tasks that require long memory traces. As the length of the sequence increases, using clipping becomes more important to achieve a better success rate. More memory implies a larger spectral radius, which leads to rich regimes where gradients are likely to explode.

Furthermore, we can train a single model to deal with any sequence of length 50 up to 200 (by providing sequences of different random lengths in this interval for different MSGD steps). We achieve a success rate of 100% over 5 seeds in this regime as well (all runs had 0 misclassified sequences in a set of 10000 randomly generated sequences of different lengths). We used an RNN of 50 tanh units initialized in the "basic tanh" regime. The same trained model can address sequences of length up to 5000 steps, lengths never seen during training. Specifically, the same model produced 0 misclassified sequences (out of 10000 sequences of the same length) for lengths of either 50, 100, 150, 200, 250, 500, 1000, 2000, 3000 and 5000. This provides some evidence that the model might use attractors to form some kind of long term memory, rather than relying on transient dynamics (as, for example, ESN networks probably do in Jaeger (2012)).

4.1.2. Other pathological tasks

MSGD-CR is able to solve (100% success on the lengths listed below, for all but one task) other pathological problems from Hochreiter and Schmidhuber (1997), namely the addition problem, the multiplication problem, the 3-bit temporal order problem, the random permutation problem and the noiseless memorization problem in two variants (when the pattern needed to be memorized is 5 bits in length and when it contains over 20 bits of information; see Martens and Sutskever (2011)). For every task we used 5 different runs (with different random seeds). For the first 4 problems we used a single model for lengths up to 200, while for the noiseless memorization we used a different model for each sequence length (50, 100, 150 and 200). The hardest problem, for which only two runs out of 5 succeeded, was the random permutation problem. For the addition and multiplication tasks we observe successful generalization to sequences up to 500 steps (we notice an increase in error with sequence length, though it stays below 1%). Note that for the addition and multiplication problems a sequence is misclassified when the squared error is larger than .04. In most cases, these results outperform Martens and Sutskever (2011) in terms of success rate, they deal with longer sequences than in Hochreiter and Schmidhuber (1997), and compared to (Jaeger, 2012) they generalize to longer sequences. For details see the supplementary materials.

Table 1. Results on polyphonic music prediction, in negative log-likelihood (nll) per time step, and on the natural language task, in bits per character. Lower is better. Bold face shows state of the art for RNN models.

Data set | Data fold | MSGD | MSGD+C | MSGD+CR | State of the art for RNN | State of the art
Piano-midi.de (nll) | train | 6.87 | 6.81 | 7.01 | 7.04 | 6.32
Piano-midi.de (nll) | test | 7.56 | 7.53 | 7.46 | 7.57 | 7.05
Nottingham (nll) | train | 3.67 | 3.21 | 2.95 | 3.20 | 1.81
Nottingham (nll) | test | 3.80 | 3.48 | 3.36 | 3.43 | 2.31
MuseData (nll) | train | 8.25 | 6.54 | 6.43 | 6.47 | 5.20
MuseData (nll) | test | 7.11 | 7.00 | 6.97 | 6.99 | 5.60
Penn Treebank, 1 step (bits/char) | train | 1.46 | 1.34 | 1.36 | N/A | N/A
Penn Treebank, 1 step (bits/char) | test | 1.50 | 1.42 | 1.41 | 1.41 | 1.37
Penn Treebank, 5 steps (bits/char) | train | N/A | 3.76 | 3.70 | N/A | N/A
Penn Treebank, 5 steps (bits/char) | test | N/A | 3.89 | 3.74 | N/A | N/A

4.2. Natural problems

We address the task of polyphonic music prediction, using the datasets Piano-midi.de, Nottingham and MuseData described in Boulanger-Lewandowski et al. (2012), and language modelling at the character level on the Penn Treebank dataset (Mikolov et al., 2012). We also explore a modified version of the latter task, where we require the model to predict the 5th character in the future (instead of the next one). We assume that for solving this modified task long term correlations are more important than short term ones, and hence our regularization term should be more helpful.

The training and test scores for all natural problems are reported in Table 1, as well as the state of the art for these tasks. See the additional material for hyper-parameters and experimental setup. We note that keeping the regularization weight fixed (as for the previous tasks) seems to harm learning. One needs to use a 1/t decreasing schedule for this term. We hypothesize that minimizing the regularization term affects the ability of the model to learn short term correlations which are important for these tasks, though a more careful investigation is lacking.

These results suggest that clipping solves an optimization issue and does not act as a regularizer, as both the training and test error improve in general. Results on Penn Treebank reach the state of the art for RNNs achieved by Mikolov et al. (2012), who used a different clipping algorithm similar to ours, thus providing evidence that both behave similarly. A maximum entropy model with up to 15-gram features holds the state of the art on Penn Treebank (character level). However, RNN models with some clipping strategy hold the state of the art for word-level and character-level modelling on larger datasets (Mikolov et al., 2011; 2012).

For polyphonic music prediction we obtain the state of the art for RNN models, though RNN-NADE, a probabilistic recurrent model, trained with Hessian-Free (Bengio et al., 2012) does better.

5. Summary and conclusions

We provided different perspectives through which one can gain more insight into the exploding and vanishing gradients issue. We put forward a hypothesis stating that when gradients explode we have a cliff-like structure in the error surface, and devise a simple solution based on this hypothesis, clipping the norm of the exploded gradients. The effectiveness of our proposed solutions provides some indirect empirical evidence towards the validity of our hypothesis, though further investigations are required. In order to deal with the vanishing gradient problem we use a regularization term that forces the error signal not to vanish as it travels back in time. This regularization term forces the Jacobian matrices ∂x_i/∂x_{i−1} to preserve norm only in relevant directions. In practice we show that these solutions improve performance of RNNs on the pathological synthetic datasets considered, polyphonic music prediction and language modelling.

Acknowledgements

We would like to thank the Theano development team. We acknowledge RQCHP, Compute Canada, NSERC, FQRNT and CIFAR for the resources they provide.

References

Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. Submitted to the Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.

Bengio, Y., Frasconi, P., and Simard, P. (1993). The problem of learning long-term dependencies in recurrent networks. Pages 1183–1195, San Francisco. IEEE Press. (Invited paper).

Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.

Bengio, Y., Boulanger-Lewandowski, N., and Pascanu, R. (2012). Advances in optimizing recurrent networks. Technical Report arXiv:1212.0901, U. Montreal.

Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral presentation.

Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2012). Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In Proceedings of the Twenty-ninth International Conference on Machine Learning (ICML'12). ACM.

Doya, K. (1993). Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on Neural Networks, 1, 75–80.

Doya, K. and Yoshizawa, S. (1991). Adaptive synchronization of neural and physical oscillators. In J. E. Moody, S. J. Hanson, and R. Lippmann, editors, NIPS, pages 109–116. Morgan Kaufmann.

Elman, J. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211.

Graves, A., Liwicki, M., Fernandez, S., Bertolami, R., Bunke, H., and Schmidhuber, J. (2009). A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5), 855–868.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.

Jaeger, H. (2012). Long short-term memory in echo state networks: Details of a simulation study. Technical report, Jacobs University Bremen.

Jaeger, H. and Haas, H. (2004). Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless telecommunication. Science, 304(5667), 78–80.

Jaeger, H., Lukosevicius, M., Popovici, D., and Siewert, U. (2007). Optimization and applications of echo state networks with leaky-integrator neurons. Neural Networks, 20(3), 335–352.

Martens, J. and Sutskever, I. (2011). Learning recurrent neural networks with Hessian-free optimization. In Proc. ICML'2011. ACM.

Mikolov, T. (2012). Statistical Language Models based on Neural Networks. Ph.D. thesis, Brno University of Technology.

Mikolov, T., Deoras, A., Kombrink, S., Burget, L., and Cernocky, J. (2011). Empirical evaluation and combination of advanced language modeling techniques. In Proc. 12th Annual Conference of the International Speech Communication Association (INTERSPEECH 2011).

Mikolov, T., Sutskever, I., Deoras, A., Le, H.-S., Kombrink, S., and Cernocky, J. (2012). Subword language modeling with neural networks. Preprint (http://www.fit.vutbr.cz/~imikolov/rnnlm/char.pdf).

Pascanu, R. and Jaeger, H. (2011). A neurodynamical model for working memory. Neural Networks, 24, 199–207.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.

Strogatz, S. (1994). Nonlinear Dynamics And Chaos: With Applications To Physics, Biology, Chemistry, And Engineering (Studies in Nonlinearity). Perseus Books Group, 1st edition.

Sutskever, I. (2012). Training Recurrent Neural Networks. Ph.D. thesis, University of Toronto.

Sutskever, I., Martens, J., and Hinton, G. (2011). Generating text with recurrent neural networks. In L. Getoor and T. Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, New York, NY, USA. ACM.

Werbos, P. J. (1988). Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4), 339–356.