Recurrent Highway Networks
Julian Georg Zilly*¹, Rupesh Kumar Srivastava*², Jan Koutník², Jürgen Schmidhuber²

*Equal contribution. ¹ETH Zürich, Switzerland. ²The Swiss AI Lab IDSIA (USI-SUPSI) & NNAISENSE, Switzerland. Correspondence to: Julian Zilly <[email protected]>, Rupesh Srivastava <[email protected]>.

Abstract

Many sequential processing tasks require complex nonlinear transition functions from one step to the next. However, recurrent neural networks with "deep" transition functions remain difficult to train, even when using Long Short-Term Memory (LSTM) networks. We introduce a novel theoretical analysis of recurrent networks based on Geršgorin's circle theorem that illuminates several modeling and optimization issues and improves our understanding of the LSTM cell. Based on this analysis we propose Recurrent Highway Networks, which extend the LSTM architecture to allow step-to-step transition depths larger than one. Several language modeling experiments demonstrate that the proposed architecture results in powerful and efficient models. On the Penn Treebank corpus, solely increasing the transition depth from 1 to 10 improves word-level perplexity from 90.6 to 65.4 using the same number of parameters. On the larger Wikipedia datasets for character prediction (text8 and enwik8), RHNs outperform all previous results and achieve an entropy of 1.27 bits per character.

1. Introduction

Network depth is of central importance in the resurgence of neural networks as a powerful machine learning paradigm (Schmidhuber, 2015). Theoretical evidence indicates that deeper networks can be exponentially more efficient at representing certain function classes (see e.g. Bengio & LeCun (2007); Bianchini & Scarselli (2014) and references therein). Due to their sequential nature, Recurrent Neural Networks (RNNs; Robinson & Fallside, 1987; Werbos, 1988; Williams, 1989) have long credit assignment paths and so are deep in time. However, certain internal function mappings in modern RNNs composed of units grouped in layers usually do not take advantage of depth (Pascanu et al., 2013). For example, the state update from one time step to the next is typically modeled using a single trainable linear transformation followed by a non-linearity.

Unfortunately, increased depth represents a challenge when neural network parameters are optimized by means of error backpropagation (Linnainmaa, 1970; 1976; Werbos, 1982). Deep networks suffer from what are commonly referred to as the vanishing and exploding gradient problems (Hochreiter, 1991; Bengio et al., 1994; Hochreiter et al., 2001), since the magnitude of the gradients may shrink or explode exponentially during backpropagation. These training difficulties were first studied in the context of standard RNNs, where the depth through time is proportional to the length of the input sequence, which may have arbitrary size. The widely used Long Short-Term Memory (LSTM; Hochreiter & Schmidhuber, 1997; Gers et al., 2000) architecture was introduced specifically to address the problem of vanishing/exploding gradients for recurrent networks.

The vanishing gradient problem also becomes a limitation when training very deep feedforward networks. Highway layers (Srivastava et al., 2015b), based on the LSTM cell, addressed this limitation by enabling the training of networks with even hundreds of stacked layers. Used as feedforward connections, these layers were used to improve performance in many domains such as speech recognition (Zhang et al., 2016) and language modeling (Kim et al., 2015; Jozefowicz et al., 2016), and their variants called Residual networks have been widely useful for many computer vision problems (He et al., 2015).

In this paper we first provide a new mathematical analysis of RNNs which offers a deeper understanding of the strengths of the LSTM cell. Based on these insights, we introduce LSTM networks that have long credit assignment paths not just in time but also in space (per time step), called Recurrent Highway Networks or RHNs. Unlike previous work on deep RNNs, this model incorporates Highway layers inside the recurrent transition, which we argue is a superior method of increasing depth. This enables the use of substantially more powerful and trainable sequential models efficiently, significantly outperforming existing architectures on widely used benchmarks.
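For intuition about the building block referenced above: a feedforward Highway layer as described by Srivastava et al. (2015b) combines a candidate transformation H with a learned transform gate T, computing y = H(x)·T(x) + x·(1 − T(x)), so that each layer can pass its input through largely unchanged. The sketch below is a minimal NumPy illustration of that feedforward idea only, not the recurrent formulation introduced later in this paper; the parameter names and the negative gate-bias initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """One feedforward Highway layer: y = H(x) * T(x) + x * (1 - T(x)).

    H is a candidate (here tanh) transformation and T is a sigmoid
    "transform gate" that interpolates between H(x) and the identity
    mapping, letting gradients flow through many stacked layers.
    """
    h = np.tanh(W_H @ x + b_H)       # candidate transformation H(x)
    t = sigmoid(W_T @ x + b_T)       # transform gate T(x) in (0, 1)
    return h * t + x * (1.0 - t)     # carry the input where the gate is closed

# Tiny usage example with random parameters; input and output sizes must
# match because the carry path adds x directly to the output.
n = 8
rng = np.random.default_rng(0)
x = rng.standard_normal(n)
y = highway_layer(x,
                  W_H=rng.standard_normal((n, n)) * 0.1, b_H=np.zeros(n),
                  W_T=rng.standard_normal((n, n)) * 0.1, b_T=-2.0 * np.ones(n))
```

The negative transform-gate bias used here is a commonly used initialization choice that starts each layer close to the identity mapping.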
2. Related Work on Deep Recurrent Transitions

In recent years, a common method of utilizing the computational advantages of depth in recurrent networks is stacking recurrent layers (Schmidhuber, 1992), which is analogous to using multiple hidden layers in feedforward networks. Training stacked RNNs naturally requires credit assignment across both space and time, which is difficult in practice. These problems have recently been addressed by architectures utilizing LSTM-based transformations for stacking (Zhang et al., 2016; Kalchbrenner et al., 2015).

A general method to increase the depth of the step-to-step recurrent state transition (the recurrence depth) is to let an RNN tick for several micro time steps per step of the sequence (Schmidhuber, 1991; Srivastava et al., 2013; Graves, 2016). This method can adapt the recurrence depth to the problem, but the RNN has to learn by itself which parameters to use for memories of previous events and which for standard deep nonlinear processing. It is notable that while Graves (2016) reported improvements on simple algorithmic tasks using this method, no performance improvements were obtained on real world data.

Pascanu et al. (2013) proposed to increase the recurrence depth by adding multiple non-linear layers to the recurrent transition, resulting in Deep Transition RNNs (DT-RNNs) and Deep Transition RNNs with Skip connections (DT(S)-RNNs). While powerful in principle, these architectures are seldom used due to exacerbated gradient propagation issues resulting from extremely long credit assignment paths (training of our proposed architecture is compared to these models in subsection 5.1). In related work, Chung et al. (2015) added extra connections between all states across consecutive time steps in a stacked RNN, which also increases recurrence depth. However, their model requires many extra connections with increasing depth, gives only a fraction of states access to the largest depth, and still faces gradient propagation issues along the longest paths.

Compared to stacking recurrent layers, increasing the recurrence depth can add significantly higher modeling power to an RNN. Figure 1 illustrates that stacking d RNN layers allows a maximum credit assignment path length (number of non-linear transformations) of d + T − 1 between hidden states which are T time steps apart, while a recurrence depth of d enables a maximum path length of d × T. While this allows greater power and efficiency using larger depths, it also explains why such architectures are much more difficult to train compared to stacked RNNs. In the next sections, we address this problem head-on by focusing on the key mechanisms of the LSTM and using those to design RHNs, which do not suffer from the above difficulties.

Figure 1: Comparison of (a) a stacked RNN with depth d and (b) a Deep Transition RNN of recurrence depth d, both operating on a sequence of T time steps. The longest credit assignment path between hidden states T time steps apart is d × T for Deep Transition RNNs.

3. Revisiting Gradient Flow in Recurrent Networks

Let L denote the total loss for an input sequence of length T. Let x^[t] ∈ ℝ^m and y^[t] ∈ ℝ^n represent the input and output of a standard RNN at time t, W ∈ ℝ^(n×m) and R ∈ ℝ^(n×n) the input and recurrent weight matrices, b ∈ ℝ^n a bias vector and f a point-wise non-linearity. Then y^[t] = f(Wx^[t] + Ry^[t−1] + b) describes the dynamics of a standard RNN.
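As a concrete reference point, here is a minimal NumPy sketch of this standard RNN, i.e. the depth-1 transition discussed above. The names m, n, W, R, b and f = tanh follow the definitions in the text; the particular sizes and random initialization below are illustrative assumptions only.

```python
import numpy as np

def rnn_forward(xs, W, R, b, f=np.tanh):
    """Unroll y[t] = f(W x[t] + R y[t-1] + b) over an input sequence.

    xs : (T, m) array, the input sequence x[1..T]
    W  : (n, m) input weight matrix
    R  : (n, n) recurrent weight matrix
    b  : (n,)   bias vector
    Returns the outputs y[1..T] as a (T, n) array.
    """
    n = R.shape[0]
    y = np.zeros(n)                   # initial state y[0]
    ys = []
    for x_t in xs:                    # one single-layer update per time step
        y = f(W @ x_t + R @ y + b)
        ys.append(y)
    return np.stack(ys)

# Illustrative sizes only; the whole per-step transition is a single
# linear map followed by a non-linearity, i.e. transition depth 1.
rng = np.random.default_rng(0)
T_steps, m, n = 5, 3, 4
outputs = rnn_forward(rng.standard_normal((T_steps, m)),
                      W=rng.standard_normal((n, m)) * 0.1,
                      R=rng.standard_normal((n, n)) * 0.1,
                      b=np.zeros(n))
```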
The derivative of the loss L with respect to parameters θ of a network can be expanded using the chain rule:

\frac{dL}{d\theta} = \sum_{1 \le t_2 \le T} \frac{dL^{[t_2]}}{d\theta} = \sum_{1 \le t_2 \le T} \; \sum_{1 \le t_1 \le t_2} \frac{\partial L^{[t_2]}}{\partial y^{[t_2]}} \, \frac{\partial y^{[t_2]}}{\partial y^{[t_1]}} \, \frac{\partial y^{[t_1]}}{\partial \theta} .    (1)

The Jacobian matrix ∂y^[t₂]/∂y^[t₁], the key factor for the transport of the error from time step t₂ to time step t₁, is obtained by chaining the derivatives across all time steps:

\frac{\partial y^{[t_2]}}{\partial y^{[t_1]}} := \prod_{t_1 < t \le t_2} \frac{\partial y^{[t]}}{\partial y^{[t-1]}} = \prod_{t_1 < t \le t_2} R^{\top} \, \mathrm{diag}\!\left[ f'\!\left( R y^{[t-1]} \right) \right] ,    (2)

where the input and bias have been omitted for simplicity. We can now obtain conditions for the gradients to vanish or explode. Let A := ∂y^[t]/∂y^[t−1] be the temporal Jacobian, γ a maximal bound on f′(Ry^[t−1]) and σ_max the largest singular value of R^⊤. Then the norm of the Jacobian is bounded by ‖A‖ ≤ ‖R^⊤‖ ‖diag[f′(Ry^[t−1])]‖ ≤ γσ_max, which gives the condition γσ_max < 1 for gradients to vanish. The Geršgorin circle theorem (GCT) further constrains the spectrum of A: every eigenvalue of A lies within at least one of the circles centered around the diagonal values a_ii of A with radius Σ_{j=1, j≠i}^{n} |a_ij| equal to the sum of the absolute values of the non-diagonal entries in each row of A. Two example Geršgorin circles referring to differently initialized RNNs are depicted in Figure 2.

Figure 2: Illustration of the Geršgorin circle theorem. Two Geršgorin circles are centered around their diagonal entries a_ii. The corresponding eigenvalues lie within the radius of the sum of absolute values of non-diagonal entries a_ij. Circle (1) represents an exemplar Geršgorin circle for an RNN initialized with small random values.

Using GCT we can understand the relationship between the entries of R and the possible locations of the eigenvalues of the Jacobian. Shifting the diagonal values a_ii shifts the possible locations of eigenvalues. Having large off-diagonal entries will allow for a large spread of eigenvalues. Small off-diagonal entries yield smaller radii and thus a more confined distribution of eigenvalues around the diagonal entries a_ii (a small numerical sketch of the temporal Jacobian and its Geršgorin circles is given after the list below).

Let us assume that matrix R is initialized with a zero-mean Gaussian distribution. We can then infer the following:

• If the values of R are initialized with a standard deviation
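To make the quantities above concrete, the following sketch (an illustration under assumed sizes and initialization, not code from the paper) forms the one-step temporal Jacobian A = R^⊤ diag[f′(Ry^[t−1])] for f = tanh, computes the Geršgorin circles of A, and checks that every eigenvalue falls inside at least one circle. With a small initialization standard deviation, both the circle centers and radii are close to zero, so the spectrum of A is confined near the origin and repeated products of A shrink gradients.

```python
import numpy as np

def temporal_jacobian(R, y_prev, fprime=lambda z: 1.0 - np.tanh(z) ** 2):
    """A = d y[t] / d y[t-1] = R^T diag(f'(R y[t-1])), input and bias omitted."""
    return R.T @ np.diag(fprime(R @ y_prev))

def gershgorin_circles(A):
    """Return (centers, radii): circle i is centered at a_ii with radius
    equal to the sum of absolute off-diagonal entries in row i of A."""
    centers = np.diag(A)
    radii = np.sum(np.abs(A), axis=1) - np.abs(centers)
    return centers, radii

rng = np.random.default_rng(0)
n = 6
sigma = 0.1                          # std. dev. of the zero-mean Gaussian init
R = rng.normal(0.0, sigma, (n, n))
y_prev = rng.standard_normal(n)

A = temporal_jacobian(R, y_prev)
centers, radii = gershgorin_circles(A)
eigvals = np.linalg.eigvals(A)

# GCT: every eigenvalue lies in at least one circle. With small sigma the
# circles hug the origin, confining the spectrum of A near zero.
for lam in eigvals:
    assert np.any(np.abs(lam - centers) <= radii + 1e-12)
print("spectral radius:", np.max(np.abs(eigvals)),
      "max Geršgorin bound:", np.max(np.abs(centers) + radii))
```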