Preprints of the 21st IFAC World Congress (Virtual) Berlin, Germany, July 12-17, 2020

On the vanishing and exploding gradient problem in Gated Recurrent Units

Alexander Rehmer, Andreas Kroll

Department of Measurement and Control, Institute for System Analytics and Control, Faculty of Mechanical Engineering, University of Kassel, Germany (e-mail: {alexander.rehmer, andreas.kroll}@mrt.uni-kassel.de)

Abstract: Recurrent Neural Networks are applied in areas such as speech recognition, natural language and video processing, and the identification of nonlinear state space models. Conventional Recurrent Neural Networks (RNNs), e.g. the Elman Network, are hard to train. A more recently developed class of recurrent neural networks, so-called Gated Units, outperforms its conventional counterparts on virtually every task. This paper aims to provide additional insights into the differences between RNNs and Gated Units in order to explain the superior performance of gated recurrent units. It is argued that Gated Units are easier to optimize not because they solve the vanishing gradient problem, but because they circumvent the emergence of large local gradients.

Keywords: Nonlinear system identification, Recurrent Neural Networks, Gated Recurrent Units.

1. INTRODUCTION

Gated Units, such as the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU), were originally developed to overcome the vanishing gradient problem, which occurs in the Elman network (RNN) (Pascanu et al., 2012). They have since outperformed RNNs on a number of tasks, such as natural language, speech and video processing (Jordan et al., 2019), and recently also on a nonlinear system identification task (Rehmer and Kroll, 2019). However, it will be shown that the gradient also vanishes in Gated Units, so other, still unaccounted-for mechanisms have to be responsible for their success. Pascanu et al. (2012) show that small changes in the parameters θ of the RNN can lead to drastic changes in the dynamic behavior of the system when crossing certain critical bifurcation points. This in turn results in a huge change in the evolution of the hidden state x̂_k, which leads to a locally large, or exploding, gradient of the loss function. In this paper the GRU will be examined and compared to the RNN with the purpose of providing an alternative explanation for why GRUs outperform RNNs. First, it will be shown that the gradient of the GRU is in fact smaller than that of the RNN, at least for the parameterizations considered in this paper, although GRUs were originally designed to solve the vanishing gradient problem. Secondly, it will be shown that GRUs are not only capable of representing highly nonlinear dynamics, but are also able to represent approximately linear dynamics via a number of different parameterizations. Since a linear model is always a good first guess, the easy accessibility of different parameterizations that produce linear dynamics makes the GRU less sensitive to its initial choice of parameters and thus simplifies the optimization problem. In the end, the GRU and the RNN will be compared on a simple academic example and on a real nonlinear identification task.

2. RECURRENT NEURAL NETWORKS

In this section the Simple Recurrent Neural Network (RNN), also known as the Elman Network, and the Gated Recurrent Unit (GRU) will be introduced.

2.1 Simple Recurrent Neural Network

The RNN as depicted in figure 1 is a straightforward realization of a nonlinear state space model (Nelles, 2001). It consists of one hidden recurrent layer with nonlinear activation function f_h, which aims to approximate the state equation, as well as one hidden feedforward layer f_g and one linear output layer, which together aim to approximate the output equation. For simplicity of notation, the linear output layer is omitted in the following equations and figures:

    \hat{x}_{k+1} = f_h(W_x \hat{x}_k + W_u u_k + b_h),
    \hat{y}_k     = f_g(W_y \hat{x}_k + b_g),                                  (1)

with x̂_k ∈ R^{n×1}, ŷ_k ∈ R^{m×1}, u_k ∈ R^{l×1}, W_x ∈ R^{n×n}, W_u ∈ R^{n×l}, b_h ∈ R^{n×1}, W_y ∈ R^{m×n}, b_g ∈ R^{m×1} and f_h : R^{n×1} → R^{n×1}, f_g : R^{m×1} → R^{m×1}. Usually tanh(·) is employed as the nonlinear activation function.

Fig. 1. Representation of the Elman Network: Layers of neurons are represented as rectangles, connections between layers represent fully connected layers.
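To make the structure of (1) concrete, the following minimal NumPy sketch implements a single time step of the Elman network, including the linear output layer that is omitted in (1). The dimensions follow the definitions above, while the weight values are random placeholders.

```python
import numpy as np

def rnn_step(x_hat, u, Wx, Wu, bh, Wy, bg, Wlin, blin):
    """One time step of the Elman network, cf. (1)."""
    y_hid = np.tanh(Wy @ x_hat + bg)            # hidden feedforward layer f_g
    y_hat = Wlin @ y_hid + blin                 # linear output layer (omitted in (1))
    x_next = np.tanh(Wx @ x_hat + Wu @ u + bh)  # recurrent hidden layer f_h
    return x_next, y_hat

# Placeholder dimensions: n = 2 states, l = 1 input, m = 1 output.
rng = np.random.default_rng(0)
n, l, m = 2, 1, 1
shapes = [(n, n), (n, l), (n,), (m, n), (m,), (m, m), (m,)]
Wx, Wu, bh, Wy, bg, Wlin, blin = (rng.normal(size=s) for s in shapes)

x_hat, y_hat = rnn_step(np.zeros(n), np.ones(l), Wx, Wu, bh, Wy, bg, Wlin, blin)
```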


When training an RNN, the recurrent model is unfolded over the whole training sequence of length N, and the gradient of the loss function L with respect to the model parameters θ is calculated. As a consequence of the feedback, the gradient of the error

    e_k = \hat{y}_k - y_k                                                      (2)

at time step k with respect to the model parameters θ = {W_x, W_u, W_y, b_h, b_g} depends on the previous state x̂_{k-1}, which in turn depends on the model parameters:

    \frac{\partial e_k}{\partial \theta} = \frac{\partial e_k}{\partial \hat{y}_k} \frac{\partial \hat{y}_k}{\partial \hat{x}_k} \left( \frac{\partial \hat{x}_k}{\partial \theta} + \frac{\partial \hat{x}_k}{\partial \hat{x}_{k-1}} \frac{\partial \hat{x}_{k-1}}{\partial \theta} \right).            (3)

For example, the gradient of the hidden state x̂_k with respect to W_x is

    \frac{\partial \hat{x}_k}{\partial W_x} = \sum_{\tau=1}^{N} \hat{x}_{k-\tau}\, f_h'^{(k-\tau+1)}(\cdot) \prod_{\beta} f_h'^{(k-\beta)}(\cdot)\, W_x, \qquad \beta = \tau-2, \tau-3, \ldots \;\; \forall\, \beta \geq 0.    (4)

High indices in brackets indicate the particular time step. The product term in (4), which also appears when computing the gradient with respect to the other parameters, decreases exponentially with τ if |f_h'(·)| · ρ(W_x) < 1, where ρ(W_x) is the spectral radius of W_x. Essentially, backpropagating an error one time step involves a multiplication of the state with a derivative that is possibly smaller than one and a matrix whose spectral radius is possibly smaller than one. Hence, the gradient vanishes after a certain number of time steps. In the community it is argued that the vanishing gradient prevents learning of so-called long-term dependencies in acceptable time (Hochreiter and Schmidhuber, 1997; Goodfellow et al., 2016), i.e. when huge time lags exist between input u_k and output ŷ_k. Gated recurrent units, like the LSTM and the GRU, were developed to solve this problem and have since then outperformed classical RNNs on virtually any task. However, it can be shown that the gradient also vanishes in gated recurrent units. Additionally, the vanishing of the gradient over time is a desirable property: in most systems, the influence of a previous state x_{k-τ} on the current state x̂_k decreases over time. Unless one wants to design a marginally stable or unstable system, e.g. when performing tasks like unbounded counting, or when dealing with large dead times, the vanishing gradient has no negative effect on the optimization procedure.

2.2 The Gated Recurrent Unit

The Gated Recurrent Unit (GRU) (Cho et al., 2014) is, besides the LSTM, the most often applied gated recurrent unit architecture. The general concept of gated recurrent units is to manipulate the state x̂_k through the addition or multiplication of the activations of so-called gates, see figure 2. Gates are almost exclusively one-layered neural networks with nonlinear sigmoid activation functions. The state equation of the GRU is

    \hat{x}_{k+1} = f_z \odot \hat{x}_k + (1 - f_z) \odot f_c(\tilde{x}_k)     (5)

with x̃_k = f_r ⊙ x̂_k. The operator ⊙ denotes the Hadamard product. The activations of the so-called reset gate f_r, update gate f_z and output gate f_c are given by

    f_r = \sigma(W_r [\hat{x}_k, u_k] + b_r),
    f_z = \sigma(W_z [\hat{x}_k, u_k] + b_z),                                  (6)
    f_c = \tanh(W_c [\tilde{x}_k, u_k] + b_c),

where W_r, W_z, W_c ∈ R^{n×(n+l)}, b_r, b_z, b_c ∈ R^{n×1} and f_r, f_z, f_c : R^{n×1} → R^{n×1}. σ(·) denotes the logistic function. In order to map the states estimated by the GRU to the output, the GRU has to be equipped either with an output layer, as the RNN, or with an output gate, as the LSTM.

Fig. 2. The Gated Recurrent Unit (GRU). Gates are depicted as rectangles with their respective activation functions.
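The GRU update (5)-(6) can be written compactly in a few lines of NumPy. This is a minimal sketch with placeholder weights; the output layer or output gate mentioned above is omitted.

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))               # logistic function

def gru_step(x_hat, u, Wr, br, Wz, bz, Wc, bc):
    """One step of the GRU state equation (5) with gates (6)."""
    xu = np.concatenate([x_hat, u])               # [x_hat_k, u_k]
    f_r = sigma(Wr @ xu + br)                     # reset gate
    f_z = sigma(Wz @ xu + bz)                     # update gate
    x_tilde = f_r * x_hat                         # x~_k = f_r (Hadamard) x_hat_k
    f_c = np.tanh(Wc @ np.concatenate([x_tilde, u]) + bc)   # output gate
    return f_z * x_hat + (1.0 - f_z) * f_c        # (5): element-wise blend

# Placeholder dimensions: n = 2 states, l = 1 input.
rng = np.random.default_rng(0)
n, l = 2, 1
Wr, Wz, Wc = (rng.normal(size=(n, n + l)) for _ in range(3))
br, bz, bc = (np.zeros(n) for _ in range(3))
x_hat = gru_step(np.zeros(n), np.ones(l), Wr, br, Wz, bz, Wc, bc)
```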

3. GRADIENT OF THE STATE EQUATIONS

In this section, the gradients of the state equations of the RNN (1) and the GRU (5) w.r.t. their parameters will be compared to each other. In the cases examined, the gradient of the GRU is, somewhat surprisingly, at most as large as that of the RNN, but usually smaller.

In order to allow for an easily interpretable visualization, the analysis will be restricted to one dimensional and autonomous systems. Also, the GRU will be simplified by eliminating the reset gate f_r from (6), such that x̃_k = x̂_k. Taking the gradient of the RNN's state equation in (1) w.r.t. w_x yields

    \frac{\partial \hat{x}_{k+1}}{\partial w_x} = \left( \hat{x}_k + w_x \frac{\partial \hat{x}_k}{\partial w_x} \right) \cdot \tanh'(w_x \hat{x}_k + b_x).          (7)

The gradient of the GRU's state equation (5) with respect to w_z is

    \frac{\partial \hat{x}_{k+1}}{\partial w_z} = (1 - \tanh(\hat{x}_k; \theta_c))\, \sigma'(\hat{x}_k; \theta_z) \left( \hat{x}_k + w_z \frac{\partial \hat{x}_k}{\partial w_z} \right) + (1 - \sigma(\hat{x}_k; \theta_z))\, \tanh'(\hat{x}_k; \theta_c)\, \frac{\partial \hat{x}_k}{\partial w_z},          (8)

and with respect to w_c

    \frac{\partial \hat{x}_{k+1}}{\partial w_c} = \sigma'(\hat{x}_k; \theta_z)\, (\hat{x}_k - \tanh(\hat{x}_k; \theta_c))\, \frac{\partial \hat{x}_k}{\partial w_c} + \sigma(\hat{x}_k; \theta_z)\, \frac{\partial \hat{x}_k}{\partial w_c} + (1 - \sigma(\hat{x}_k; \theta_z))\, \tanh'(\hat{x}_k; \theta_c) \left( \hat{x}_k + w_c \frac{\partial \hat{x}_k}{\partial w_c} \right).          (9)

For convenience of notation, θ_z and θ_c denote the parameters of f_z and f_c, respectively, i.e. θ_z = [w_z, b_z] and θ_c = [w_c, b_c].
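The single-step expressions (7)-(9) can be iterated along a trajectory to follow how the gradients evolve over time. The sketch below does this for the scalar, autonomous case with the reset gate removed, using (8) and (9) exactly as they are stated above rather than an independent derivation; the parameter values are arbitrary illustrations.

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

def scalar_gradients(wx, bx, wz, bz, wc, bc, x0=1.0, N=50):
    """Iterate the RNN recursion (7) and the GRU recursions (8)-(9)."""
    x_r = x_g = x0
    dx_dwx = dx_dwz = dx_dwc = 0.0
    for _ in range(N):
        # RNN, recursion (7)
        dtanh_r = 1.0 - np.tanh(wx * x_r + bx) ** 2
        dx_dwx = (x_r + wx * dx_dwx) * dtanh_r
        x_r = np.tanh(wx * x_r + bx)
        # GRU without reset gate, recursions (8) and (9) as given above
        fz, fc = sigma(wz * x_g + bz), np.tanh(wc * x_g + bc)
        dsig, dtanh = fz * (1.0 - fz), 1.0 - fc ** 2
        dx_dwz = (1.0 - fc) * dsig * (x_g + wz * dx_dwz) + (1.0 - fz) * dtanh * dx_dwz
        dx_dwc = (dsig * (x_g - fc) * dx_dwc + fz * dx_dwc
                  + (1.0 - fz) * dtanh * (x_g + wc * dx_dwc))
        x_g = fz * x_g + (1.0 - fz) * fc
    return dx_dwx, dx_dwz, dx_dwc

print(scalar_gradients(wx=1.0, bx=0.0, wz=1.0, bz=0.0, wc=1.0, bc=0.0))
```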


It is cumbersome to write down the gradients in (8) and (9) for an arbitrary number of time steps, as done for the RNN in (4). However, it already becomes apparent from (8) and (9) that each step of backpropagation involves the multiplication of the previous state with the output of a sigmoid function, either σ(·) or tanh(·), and with one of its derivatives. The outputs of the sigmoid functions are always smaller than one. The derivative of the tanh(·) is smaller than one if ‖W_c‖_1 < 1, and the derivative of the logistic function is smaller than one if ‖W_z‖_1 < 4. This means the gradient of the GRU is at least as likely to vanish as the gradient of the RNN.

In order to examine the gradients further, it is assumed that |w_x|, |w_z|, |w_c| ≤ 1. The derivative of the logistic function σ(·) is therefore bounded in the interval [0, 1/4] and the derivative of the tanh(·) is bounded in the interval [0, 1]. Under these assumptions the gradient of the RNN is at most

    \frac{\partial \hat{x}_{k+1}}{\partial w_x} \leq \hat{x}_k + w_x \frac{\partial \hat{x}_k}{\partial w_x}.          (10)

For the gradient of the GRU, two extreme cases will be examined and compared to (10). If the update gate is fully closed, i.e. σ(x̂_k; θ_z) = 0, (8) and (9) become

    \frac{\partial \hat{x}_{k+1}}{\partial w_z} = \tanh'(\hat{x}_k; \theta_c)\, \frac{\partial \hat{x}_k}{\partial w_z} \leq \frac{\partial \hat{x}_k}{\partial w_z},
    \frac{\partial \hat{x}_{k+1}}{\partial w_c} = \tanh'(\hat{x}_k; \theta_c) \left( \hat{x}_k + w_c \frac{\partial \hat{x}_k}{\partial w_c} \right) \leq \hat{x}_k + w_c \frac{\partial \hat{x}_k}{\partial w_c}.          (11)

It follows that, if the update gate is fully closed, the GRU becomes an ordinary RNN with the same properties regarding its gradient. If the update gate is fully open on the other hand, i.e. σ(x̂_k; θ_z) = 1, (8) and (9) become

    \frac{\partial \hat{x}_{k+1}}{\partial w_z} = 0, \qquad \frac{\partial \hat{x}_{k+1}}{\partial w_c} = \frac{\partial \hat{x}_k}{\partial w_c}.          (12)

Essentially, the previous state x̂_k is just passed over to the new state x̂_{k+1} without any modification, so there is no gradient w.r.t. the parameters but that from the previous time step.

One can examine the behavior of (8) and (9) in between these extreme cases by replacing tanh'(·) = 1 − tanh²(·) and σ'(·) = σ(·)(1 − σ(·)). (8) and (9) then each become a cubic polynomial in σ(x̂_k, θ_z) and tanh(x̂_k, θ_c). Their graphs are plotted in figure 3 for w_z = w_c = 1, which produces the largest gradient possible under the assumptions made. Also we set ∂x̂_k/∂w_z = ∂x̂_k/∂w_c = 1 and x̂_k = 1.

Fig. 3. Gradients of the GRU w.r.t. w_z and w_c, depending on the activation of f_z = σ(x̂_k, θ_z) and f_c = tanh(x̂_k, θ_c), for w_z = w_c = ∂x̂_k/∂w_z = ∂x̂_k/∂w_c = x̂_k = 1: (a) ∂x̂_{k+1}/∂w_z, (b) ∂x̂_{k+1}/∂w_c.
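Under the substitutions tanh'(·) = 1 − tanh²(·) and σ'(·) = σ(·)(1 − σ(·)), and with w_z = w_c = ∂x̂_k/∂w_z = ∂x̂_k/∂w_c = x̂_k = 1 as in figure 3, the reconstructed expressions (8) and (9) reduce to polynomials in f_z and f_c. The following sketch simply tabulates them on a grid; it is an illustration under the stated assumptions, not the original plotting code.

```python
import numpy as np

# Grid over the gate activations: f_z = sigma(.) in [0, 1], f_c = tanh(.) in [-1, 1].
fz = np.linspace(0.0, 1.0, 101)[:, None]
fc = np.linspace(-1.0, 1.0, 201)[None, :]

# (8) and (9) with w_z = w_c = dx/dw_z = dx/dw_c = x_k = 1,
# sigma' = f_z (1 - f_z) and tanh' = 1 - f_c**2.
grad_wz = 2 * fz * (1 - fz) * (1 - fc) + (1 - fz) * (1 - fc**2)
grad_wc = fz * (1 - fz) * (1 - fc) + fz + 2 * (1 - fz) * (1 - fc**2)

for name, g in (("dx_{k+1}/dw_z", grad_wz), ("dx_{k+1}/dw_c", grad_wc)):
    print(name, "max:", round(float(g.max()), 2))
```

Neither surface exceeds the value of 2 that the RNN bound (10) yields under the same assumptions, and the overall maximum, attained by ∂x̂_{k+1}/∂w_c, occurs at f_z = 0, i.e. with the update gate fully closed.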

Figure 3 confirms that the largest gradients occur when the GRU's update gate is fully closed, i.e. σ(x̂_k, θ_z) = 0, and the GRU hence becomes an RNN. For all other configurations, the gradient is in fact smaller. In contrast to the point made by the vanishing gradient argument, namely that optimization might take prohibitively long if the gradient vanishes after some time steps, we argue that a smaller gradient of the state equation w.r.t. its parameters is beneficial when training recurrent neural networks. If the gradient is large, as is the case with the RNN, a small change in parameters will lead to a huge change in the state trajectory. Since in every time step the previous state is again fed back to the recurrent network, these changes accumulate quickly, resulting in vastly different evolutions of the state. The effect on the loss function is locally huge gradients, which make gradient-based optimization very difficult. Compared to the RNN, the GRU's gradient of the state equation is almost always smaller, which means similar parameters produce similar trajectories of the state, leading to a smoother loss function without huge local gradients. This will be illustrated via simple examples in section 5.

4. EFFECT OF DIFFERENT PARAMETERIZATIONS ON THE STATE EQUATION

In addition to the smaller gradient of the GRU, the rather special structure of its state equation also benefits gradient-based optimization, especially when identifying technical systems. In this section, the same assumptions and restrictions apply as in the previous section, i.e. only one dimensional and autonomous systems are investigated and the GRU's reset gate f_r is neglected.

Without the reset gate, (5) is the sum of the identity function, weighted with a logistic function σ(·), and a tanh(·), weighted by the residual 1 − σ(·). Fig. 4 visually decomposes (5) into its constituent parts.

Fig. 4. Graphical decomposition of the GRU's state equation: (a) f_z · x̂_k, (b) (1 − f_z) · f_c, (c) f_z · x̂_k + (1 − f_z) · f_c.

This superposition of multiple functions enables the GRU to approximate quite complex nonlinearities with comparatively few parameters. For example, if b_z → −∞, (5) converges to a tanh parameterized by w_c and b_c. For w_z, b_c → 0, w_c → +∞ and b_z → −∞, (5) converges to the binary step function, while for small negative b_z it represents a kind of leaky binary step, where the slope of the tails is b̃_z = σ(b_z), compare Figure 5.
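As a quick numerical companion to the decomposition in figure 4, the scalar state map (5) without reset gate can be split into its two summands; the parameter values below are arbitrary illustrations.

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

# Scalar, autonomous GRU without reset gate; illustrative parameter values.
wz, bz, wc, bc = 1.0, 0.0, 1.0, 0.0
x = np.linspace(-1.0, 1.0, 201)

fz = sigma(wz * x + bz)                 # update gate activation
fc = np.tanh(wc * x + bc)               # output gate activation
identity_part = fz * x                  # f_z * x_k        (Fig. 4a)
tanh_part = (1.0 - fz) * fc             # (1 - f_z) * f_c  (Fig. 4b)
x_next = identity_part + tanh_part      # full state map   (Fig. 4c)

print(np.round(x_next[::50], 3))
```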


Fig. 5. Parameterizations for which the GRU converges towards a tanh, leaky binary step or binary step activation function.

(5) also converges to the ReLu activation function for w_z → ±∞ and w_c, b_z, b_c → 0, and is also able to represent a leaky ReLu with tail slope w̃_c = tanh(w_c) for small w_c, as well as the swish activation for small w_z. These configurations are depicted in Figure 6.

Fig. 6. Parameterizations for which the GRU converges to variants of the ReLu activation function.

The fact that the GRU's state equation converges to some of these functions when certain parameters become large has an important effect on the loss function: Assume one tries to learn a state equation in the form of a tanh. Although the RNN would be the right choice, because it is in the right model class, the optimization problem is rather difficult. The only solution lies in a steep valley, which means small deviations from that solution produce large losses. It would take infinitely small steps (as soon as one is in the valley, which one cannot know in advance) to reach that solution. The GRU on the other hand offers an infinite number of solutions for large b_z, which corresponds to an infinite plane with small slope in the parameter space. One only has to move along that plane to approximate the system increasingly well, which is easier than diving into a steep valley.

Additionally, the GRU is able to approximate the identity function, and linear functions in general, via a number of different parameterizations: For b_z → +∞, (5) converges to the identity function, since f_z approaches one. For w_z, w_c → 0 the result is also a linear function, but its slope is determined by b̃_z and its axis intercept by b̃_c = tanh(b_c), see Figure 7.

Fig. 7. Parameterizations for which the GRU converges to linear activation functions: (a) b_z → +∞, (b) w_z, w_c → 0.

The ability to converge to a linear model when approaching different limits in the parameter space is especially important when identifying technical systems. Even for highly nonlinear processes, the one-step prediction surface becomes more and more linear with increasing sampling rate (Nelles, 2001). Therefore, even when estimating a model for a nonlinear process, it is important that the model has the capability to represent a linear dynamical system. The fact that (5) represents a linear model for an infinite number of parameter combinations, and even when approaching different limits, means that there are huge regions in the parameter space which already represent a formidable solution. Representing the identity function with a one dimensional Elman Network is only possible by letting w_x and b_x approach zero, which corresponds to only a single point in the parameter space.
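The limiting behavior described above is easy to check numerically. The sketch below compares the scalar state map against a tanh, the identity and an affine function for large |b_z| and small w_z, w_c, respectively; the specific parameter values are illustrative only.

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_map(x, wz, bz, wc, bc):
    fz = sigma(wz * x + bz)
    return fz * x + (1.0 - fz) * np.tanh(wc * x + bc)

x = np.linspace(-1.0, 1.0, 201)

# b_z -> -infinity: the map approaches tanh(w_c x + b_c).
err_tanh = np.max(np.abs(gru_map(x, 1.0, -20.0, 1.0, 0.0) - np.tanh(x)))
# b_z -> +infinity: the map approaches the identity.
err_id = np.max(np.abs(gru_map(x, 1.0, 20.0, 1.0, 0.0) - x))
# w_z, w_c -> 0: the map approaches an affine function with slope sigma(b_z).
bz, bc = 0.5, 0.3
affine = sigma(bz) * x + (1.0 - sigma(bz)) * np.tanh(bc)
err_lin = np.max(np.abs(gru_map(x, 1e-6, bz, 1e-6, bc) - affine))

print(err_tanh, err_id, err_lin)   # all three errors are close to zero
```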

5. ACADEMIC EXAMPLE

Based on the idea that representing an approximately linear system is important even when identifying a nonlinear system, the loss functions and their gradients produced by the RNN and the GRU will be examined on the following linear discrete-time autonomous system with one state:

    x_{k+1} = -0.9 \cdot x_k.                                                  (13)

The initial state is x_0 = 1 and the system is simulated for N = 100 time steps. The RNN is equipped with one neuron in the recurrent layer and no output layer:

    \hat{x}_{k+1} = \tanh(w_x \hat{x}_k + b_x).                                (14)

Since the GRU has two gates with two parameters each, the parameters of the update gate f_z are fixed to the reasonable values w_z = 1 and b_z = 0 when examining the loss function and gradient with respect to the output gate's parameters. Conversely, the parameters of the output gate f_c are fixed to the values w_c = 1 and b_c = 0 when examining the update gate:

    \hat{x}_{k+1} = \sigma(w_z \hat{x}_k + b_z)\, \hat{x}_k + [1 - \sigma(w_z \hat{x}_k + b_z)] \cdot \tanh(w_c \hat{x}_k + b_c).          (15)

Figures 8 and 9 show the resulting loss functions of the GRU and the RNN as well as their respective gradients. It is apparent that the optimization problem posed by the RNN is rather difficult to solve with gradient-based optimization techniques: the local optimum lies in a narrow valley surrounded by large gradients. The loss function of the GRU, on the other hand, does not possess such a narrow valley, compare figure 9a. The reason for this is that the update gate f_z dominates the behavior of the model for x̂ > 0 (which is where the training data is available in this case study) by multiplying the output of the output gate f_c with 1 − f_z and thereby diminishing its influence. The effect is a smooth loss function without large gradients.

Fig. 8. The RNN's loss function and the magnitude of its gradient on the linear identification task: (a) RNN loss function, (b) RNN gradient.

Fig. 9. The GRU's loss function and the magnitude of its gradient on the linear identification task: (a) GRU loss function for w_z = 1, b_z = 0, (b) GRU gradient for w_z = 1, b_z = 0, (c) GRU loss function for w_c = 1, b_c = 0, (d) GRU gradient for w_c = 1, b_c = 0.
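A compact sketch of the academic example: it simulates (13), evaluates (14) and (15) on parameter grids and compares how rough the two loss surfaces are. A quadratic (mean squared error) loss and the grid range [-2, 2] are assumptions made here for illustration; the paper does not state the exact loss or plotting ranges.

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

def simulate(step, N=100, x0=1.0):
    x, traj = x0, []
    for _ in range(N):
        x = step(x)
        traj.append(x)
    return np.array(traj)

# Linear reference system (13): x_{k+1} = -0.9 x_k with x_0 = 1, N = 100.
y = simulate(lambda x: -0.9 * x)

def loss_rnn(wx, bx):
    """Mean squared simulation error of the scalar RNN (14)."""
    return np.mean((simulate(lambda x: np.tanh(wx * x + bx)) - y) ** 2)

def loss_gru(wc, bc, wz=1.0, bz=0.0):
    """Mean squared simulation error of the scalar GRU (15), update gate fixed."""
    def step(x):
        fz = sigma(wz * x + bz)
        return fz * x + (1.0 - fz) * np.tanh(wc * x + bc)
    return np.mean((simulate(step) - y) ** 2)

# Coarse grids over the parameter planes of figures 8 and 9a (ranges assumed).
w_grid = b_grid = np.linspace(-2.0, 2.0, 41)
L_rnn = np.array([[loss_rnn(w, b) for b in b_grid] for w in w_grid])
L_gru = np.array([[loss_gru(w, b) for b in b_grid] for w in w_grid])

# Rough measure of surface "spikiness": the largest finite-difference gradient.
for name, L in (("RNN", L_rnn), ("GRU", L_gru)):
    gw, gb = np.gradient(L, w_grid, b_grid)
    print(name, "max loss:", round(float(L.max()), 2),
          " max |grad L|:", round(float(np.hypot(gw, gb).max()), 2))
```

The intent of the comparison is to mirror qualitatively the difference between figures 8 and 9, i.e. a sharply peaked RNN surface versus a smoother GRU surface.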


6. CASE STUDY: ELECTRO-MECHANICAL THROTTLE

To test whether the properties of the GRU also prove beneficial in real-life applications, it was compared to an RNN on a real nonlinear dynamical system.

6.1 The test system

The system to be identified is an industrial electro-mechanical throttle, as employed in combustion engines, which is operated load-free in a laboratory test stand as depicted in figure 10. The input signal u is the duty cycle of the pulse width modulator (PWM). The output signal y is the voltage of the sensor measuring the angle of the throttle plate. The throttle is actuated by a DC motor. Even though the setup is relatively simple, the modeling task is rather difficult due to the characteristics of the system (Gringard and Kroll, 2016): a lower and an upper hard stop, state-dependent friction, and a nonlinear return spring.

Fig. 10. Technology schematic of the electro-mechanical throttle.

6.2 Excitation Signals

One multisine signal and two Amplitude Modulated Pseudo-Random Binary Sequences (APRBS 1 and APRBS 2) have been used to excite the system. The multisine signal has a length of 10 s or ≈ 10³ instances, and the APRBS signals have a length of 25 s or ≈ 2500 instances each. For the multisine signal, an upper frequency of f_u = 7.5 Hz has been used. For the APRBS signals, the holding time is T_H = 0.1 s. See (Gringard and Kroll, 2016) for more information on the test signal design.

6.3 Data preprocessing

APRBS 1 and its response signal were scaled to the interval [−1, 1]; all other signals were scaled accordingly. The data was then divided into training, validation and test datasets in the following way:

• Training dataset: Consists of two batches. The first batch comprises 80 % of all instances of the multisine signal and the corresponding system response. The second batch consists of 70 % of APRBS 1 and its corresponding response signal.
• Validation dataset: Consists of two batches. The first batch comprises the remaining 20 % of the multisine signal and the corresponding system response. The second batch consists of the remaining 30 % of APRBS 1 and its response signal.
• Test dataset: One batch, consisting of APRBS 2 and its corresponding system response.

This division was chosen because the multisine's response signal almost exclusively covers the medium operating range, while the APRBS response signals also cover the lower and upper hard stops.
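A sketch of the preprocessing just described. The arrays standing in for the recorded signals (u_mult, y_mult, u_aprbs1, ...) are hypothetical placeholders, and taking the leading 80 %/70 % of samples for training is an assumption; the paper only states the percentages.

```python
import numpy as np

def make_scaler(ref):
    """Scale to [-1, 1] based on the range of a reference signal (APRBS 1 here)."""
    lo, hi = ref.min(), ref.max()
    return lambda s: 2.0 * (s - lo) / (hi - lo) - 1.0

# Hypothetical recorded input/response signals standing in for the real data.
rng = np.random.default_rng(0)
u_mult, y_mult = rng.uniform(size=1000), rng.uniform(size=1000)
u_aprbs1, y_aprbs1 = rng.uniform(size=2500), rng.uniform(size=2500)
u_aprbs2, y_aprbs2 = rng.uniform(size=2500), rng.uniform(size=2500)

scale_u, scale_y = make_scaler(u_aprbs1), make_scaler(y_aprbs1)
u_mult, u_aprbs1, u_aprbs2 = map(scale_u, (u_mult, u_aprbs1, u_aprbs2))
y_mult, y_aprbs1, y_aprbs2 = map(scale_y, (y_mult, y_aprbs1, y_aprbs2))

n_m, n_a = int(0.8 * len(u_mult)), int(0.7 * len(u_aprbs1))
train = [(u_mult[:n_m], y_mult[:n_m]), (u_aprbs1[:n_a], y_aprbs1[:n_a])]
val   = [(u_mult[n_m:], y_mult[n_m:]), (u_aprbs1[n_a:], y_aprbs1[n_a:])]
test  = [(u_aprbs2, y_aprbs2)]
```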

6.4 Model Architectures

The different model architectures used for identification are listed in Table 1. Each model architecture was initialized randomly 20 times and trained until convergence in order to examine the sensitivity to initial parameters. The number of states dim(x) was varied from three to ten. Model architectures with an equal number of states were designed to have the same number of hidden neurons in the output layer. Due to its many gates, a GRU has significantly more parameters than an RNN with an equal number of states. This should be considered when comparing both architectures.
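The paper does not state which software was used for identification. Purely as an illustration, architectures of the kind listed in Table 1 (recurrent state layer, tanh hidden layer, linear output) could be set up in TensorFlow/Keras as sketched below; Keras' GRU differs in some details from (5)-(6) (e.g. bias handling), so the parameter counts will not match Table 1 exactly.

```python
import tensorflow as tf

def build_model(cell, n_states, n_neurons):
    """Recurrent state layer + tanh hidden layer + linear output, cf. Table 1."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(None, 1)),                      # sequences of the scalar input u
        cell(n_states, return_sequences=True),                # GRU or SimpleRNN state layer
        tf.keras.layers.Dense(n_neurons, activation="tanh"),  # hidden output layer f_g
        tf.keras.layers.Dense(1, activation="linear"),        # linear output layer
    ])

gru = build_model(tf.keras.layers.GRU, n_states=3, n_neurons=4)
rnn = build_model(tf.keras.layers.SimpleRNN, n_states=3, n_neurons=4)

# Optimizer settings as reported in Section 6.5.
for model in (gru, rnn):
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01,
                                                     beta_1=0.9, beta_2=0.999),
                  loss="mse")
```

Training would then proceed with model.fit on the training batches described above.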

Table 1: Network architectures

GRU:   1st layer: Gated Unit (GRU),   dim(x):      3    4    5    6    8    10
       2nd layer: f_g(·) = tanh,      #(Neurons):  4    5    6    7    8    10
       dim(θ):                                     66   103  149  201  321  481

RNN:   1st layer: f_h(·) = tanh,      dim(x):      3    4    5    6    8    10
       2nd layer: f_g(·) = tanh,      #(Neurons):  4    5    6    7    8    10
       dim(θ):                                     36   55   78   105  151  205

6.5 Model Training

Each of the models in Table 1 was initialized randomly 20 times and trained for 800 epochs. Between batches, the models' initial states were set to zero. Parameters were estimated based on the training dataset using the ADAM optimizer (Kingma and Ba, 2015) with its default parameter configuration (α = 0.01, β_1 = 0.9, β_2 = 0.999).

6.6 Results

At the end of the optimization procedure, the model with the parameter configuration which performed best on the validation dataset was selected and evaluated on the test dataset. The performance of the models is measured in terms of their best fit rate (BFR):

    BFR = 100\% \cdot \max\left(1 - \frac{\|y_k - \hat{y}_k\|_2}{\|y_k - \bar{y}\|_2},\; 0\right).          (16)

Figure 11 shows the BFR of the RNN and the GRU on the test dataset. As expected, nonlinear optimization of the GRU consistently yields high-performing models, while the performance of the RNNs fluctuates strongly. It should be noted that there are cases where the RNN's performance matches or even exceeds the performance of the GRU (e.g. for dim(x) = 10). This shows that the RNN is in general able to represent the test system; it just seems very unlikely to arrive at such a parameterization during the optimization process, arguably because of the issues discussed in Section 3.

Fig. 11. Boxplot of the BFR of RNN and GRU on the test dataset. Each model was initialized 20 times and trained for 800 epochs.

7. CONCLUSIONS & OUTLOOK

It was shown that the gradient of the GRU's state equation w.r.t. its parameters is at most as large as, but usually smaller than, that of the RNN, provided the L1 norm of all weights is smaller than or equal to one. This finding contradicts the argument that a vanishing gradient is responsible for the RNN's poor performance on various tasks. The first point made in this paper is that the smaller gradient produced by the GRU helps gradient-based optimization, since small changes in the parameter space correspond to small changes in the evolution of the state, which in turn produces a smooth loss function without large gradients. The second argument made is that the GRU's state equation converges to different functions when certain parameters become large. This corresponds to producing large planes in the loss function along which the solution improves steadily, rather than narrow valleys, as they are produced by the RNN. The analyses provided in this paper have yet to be generalized to state space networks with arbitrary dimensions and to the whole parameter space.

REFERENCES

Cho, K. et al. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1724–1734.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
Gringard, M. and Kroll, A. (2016). On the systematic analysis of the impact of the parametrization of standard test signals. In IEEE Symposium Series on Computational Intelligence 2016. IEEE, Athens, Greece.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Jordan, I.D., Sokol, P.A., and Park, I.M. (2019). Gated recurrent units viewed through the lens of continuous time dynamical systems. arXiv preprint arXiv:1906.01005.
Kingma, D. and Ba, J. (2015). Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR 2015).
Nelles, O. (2001). Nonlinear System Identification: From Classical Approaches to Neural Networks and Fuzzy Models. Springer, Berlin Heidelberg, Germany.
Pascanu, R., Mikolov, T., and Bengio, Y. (2012). Understanding the exploding gradient problem. CoRR, abs/1211.5063.
Rehmer, A. and Kroll, A. (2019). On using gated recurrent units for nonlinear system identification. In Preprints of the 18th European Control Conference (ECC), 2504–2509. IFAC, Naples, Italy.
