Recurrent Highway Networks


Julian Georg Zilly*¹, Rupesh Kumar Srivastava*², Jan Koutník², Jürgen Schmidhuber²

arXiv:1607.03474v5 [cs.LG] 4 Jul 2017

*Equal contribution. ¹ETH Zürich, Switzerland. ²The Swiss AI Lab IDSIA (USI-SUPSI) & NNAISENSE, Switzerland. Correspondence to: Julian Zilly <[email protected]>, Rupesh Srivastava <[email protected]>.

Abstract

Many sequential processing tasks require complex nonlinear transition functions from one step to the next. However, recurrent neural networks with "deep" transition functions remain difficult to train, even when using Long Short-Term Memory (LSTM) networks. We introduce a novel theoretical analysis of recurrent networks based on Geršgorin's circle theorem that illuminates several modeling and optimization issues and improves our understanding of the LSTM cell. Based on this analysis we propose Recurrent Highway Networks, which extend the LSTM architecture to allow step-to-step transition depths larger than one. Several language modeling experiments demonstrate that the proposed architecture results in powerful and efficient models. On the Penn Treebank corpus, solely increasing the transition depth from 1 to 10 improves word-level perplexity from 90.6 to 65.4 using the same number of parameters. On the larger Wikipedia datasets for character prediction (text8 and enwik8), RHNs outperform all previous results and achieve an entropy of 1.27 bits per character.

1. Introduction

Network depth is of central importance in the resurgence of neural networks as a powerful machine learning paradigm (Schmidhuber, 2015). Theoretical evidence indicates that deeper networks can be exponentially more efficient at representing certain function classes (see e.g. Bengio & LeCun (2007); Bianchini & Scarselli (2014) and references therein). Due to their sequential nature, Recurrent Neural Networks (RNNs; Robinson & Fallside, 1987; Werbos, 1988; Williams, 1989) have long credit assignment paths and so are deep in time. However, certain internal function mappings in modern RNNs composed of units grouped in layers usually do not take advantage of depth (Pascanu et al., 2013). For example, the state update from one time step to the next is typically modeled using a single trainable linear transformation followed by a non-linearity.

Unfortunately, increased depth represents a challenge when neural network parameters are optimized by means of error backpropagation (Linnainmaa, 1970; 1976; Werbos, 1982). Deep networks suffer from what are commonly referred to as the vanishing and exploding gradient problems (Hochreiter, 1991; Bengio et al., 1994; Hochreiter et al., 2001), since the magnitude of the gradients may shrink or explode exponentially during backpropagation. These training difficulties were first studied in the context of standard RNNs where the depth through time is proportional to the length of the input sequence, which may have arbitrary size. The widely used Long Short-Term Memory (LSTM; Hochreiter & Schmidhuber, 1997; Gers et al., 2000) architecture was introduced to specifically address the problem of vanishing/exploding gradients for recurrent networks.

The vanishing gradient problem also becomes a limitation when training very deep feedforward networks. Highway Layers (Srivastava et al., 2015b) based on the LSTM cell addressed this limitation, enabling the training of networks even with hundreds of stacked layers. Used as feedforward connections, these layers were used to improve performance in many domains such as speech recognition (Zhang et al., 2016) and language modeling (Kim et al., 2015; Jozefowicz et al., 2016), and their variants called Residual networks have been widely useful for many computer vision problems (He et al., 2015).

In this paper we first provide a new mathematical analysis of RNNs which offers a deeper understanding of the strengths of the LSTM cell. Based on these insights, we introduce LSTM networks that have long credit assignment paths not just in time but also in space (per time step), called Recurrent Highway Networks or RHNs. Unlike previous work on deep RNNs, this model incorporates Highway layers inside the recurrent transition, which we argue is a superior method of increasing depth. This enables the use of substantially more powerful and trainable sequential models efficiently, significantly outperforming existing architectures on widely used benchmarks.
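To make the two building blocks above concrete — the single-nonlinearity RNN state update and the feedforward highway layer — here is a minimal NumPy sketch. The gating form h·t + x·(1−t) follows the published highway formulation; the dimensions, initialization, and function names are illustrative assumptions, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 5                               # hidden size and input size (illustrative)

def rnn_step(x, y_prev, W, R, b):
    """Standard RNN transition: one trainable linear map followed by a nonlinearity."""
    return np.tanh(W @ x + R @ y_prev + b)

def highway_layer(x, W_h, b_h, W_t, b_t):
    """Feedforward highway layer: the output interpolates between a transformed
    input h and the untouched input x, controlled by a transform gate t."""
    h = np.tanh(W_h @ x + b_h)                         # candidate transformation
    t = 1.0 / (1.0 + np.exp(-(W_t @ x + b_t)))         # transform gate in (0, 1)
    return h * t + x * (1.0 - t)                       # carry the rest of x through

# Illustrative usage with random parameters (hypothetical names and shapes).
W, R, b = rng.normal(size=(n, m)), rng.normal(size=(n, n)), np.zeros(n)
x, y_prev = rng.normal(size=m), np.zeros(n)
y = rnn_step(x, y_prev, W, R, b)

W_h, b_h = rng.normal(size=(n, n)), np.zeros(n)
W_t, b_t = rng.normal(size=(n, n)), np.full(n, -2.0)   # negative gate bias favors carrying
print(y.shape, highway_layer(y, W_h, b_h, W_t, b_t).shape)
```

Initializing the transform-gate bias to a negative value biases the layer toward carrying its input through unchanged, which is what makes very deep stacks of such layers trainable in the first place.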
2. Related Work on Deep Recurrent Transitions

In recent years, a common method of utilizing the computational advantages of depth in recurrent networks is stacking recurrent layers (Schmidhuber, 1992), which is analogous to using multiple hidden layers in feedforward networks. Training stacked RNNs naturally requires credit assignment across both space and time, which is difficult in practice. These problems have been recently addressed by architectures utilizing LSTM-based transformations for stacking (Zhang et al., 2016; Kalchbrenner et al., 2015).

A general method to increase the depth of the step-to-step recurrent state transition (the recurrence depth) is to let an RNN tick for several micro time steps per step of the sequence (Schmidhuber, 1991; Srivastava et al., 2013; Graves, 2016). This method can adapt the recurrence depth to the problem, but the RNN has to learn by itself which parameters to use for memories of previous events and which for standard deep nonlinear processing. It is notable that while Graves (2016) reported improvements on simple algorithmic tasks using this method, no performance improvements were obtained on real world data.

Pascanu et al. (2013) proposed to increase the recurrence depth by adding multiple non-linear layers to the recurrent transition, resulting in Deep Transition RNNs (DT-RNNs) and Deep Transition RNNs with Skip connections (DT(S)-RNNs). While being powerful in principle, these architectures are seldom used due to exacerbated gradient propagation issues resulting from extremely long credit assignment paths¹. In related work, Chung et al. (2015) added extra connections between all states across consecutive time steps in a stacked RNN, which also increases recurrence depth. However, their model requires many extra connections with increasing depth, gives only a fraction of states access to the largest depth, and still faces gradient propagation issues along the longest paths.

Compared to stacking recurrent layers, increasing the recurrence depth can add significantly higher modeling power to an RNN. Figure 1 illustrates that stacking $d$ RNN layers allows a maximum credit assignment path length (number of non-linear transformations) of $d + T - 1$ between hidden states which are $T$ time steps apart, while a recurrence depth of $d$ enables a maximum path length of $d \times T$. While this allows greater power and efficiency using larger depths, it also explains why such architectures are much more difficult to train compared to stacked RNNs. In the next sections, we address this problem head on by focusing on the key mechanisms of the LSTM and using those to design RHNs, which do not suffer from the above difficulties.

Figure 1: Comparison of (a) stacked RNN with depth $d$ and (b) Deep Transition RNN of recurrence depth $d$, both operating on a sequence of $T$ time steps. The longest credit assignment path between hidden states $T$ time steps apart is $d \times T$ for Deep Transition RNNs.

¹Training of our proposed architecture is compared to these models in subsection 5.1.

3. Revisiting Gradient Flow in Recurrent Networks

Let $L$ denote the total loss for an input sequence of length $T$. Let $x^{[t]} \in \mathbb{R}^m$ and $y^{[t]} \in \mathbb{R}^n$ represent the input and output of a standard RNN at time $t$, $W \in \mathbb{R}^{n \times m}$ and $R \in \mathbb{R}^{n \times n}$ the input and recurrent weight matrices, $b \in \mathbb{R}^n$ a bias vector and $f$ a point-wise non-linearity. Then $y^{[t]} = f(Wx^{[t]} + Ry^{[t-1]} + b)$ describes the dynamics of a standard RNN.
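As a sketch of the dynamics just defined, the loop below unrolls $y^{[t]} = f(Wx^{[t]} + Ry^{[t-1]} + b)$ over a short input sequence; the sizes, the tanh choice for $f$, and the random initialization are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
T, m, n = 6, 4, 3                          # sequence length, input dim, state dim (illustrative)

W = rng.normal(scale=0.5, size=(n, m))     # input weight matrix  W in R^{n x m}
R = rng.normal(scale=0.5, size=(n, n))     # recurrent weight matrix R in R^{n x n}
b = np.zeros(n)                            # bias vector b in R^n
f = np.tanh                                # point-wise non-linearity

x = rng.normal(size=(T, m))                # input sequence x^[1], ..., x^[T]
y = np.zeros((T + 1, n))                   # y[0] holds the initial state y^[0]

for t in range(1, T + 1):
    # y^[t] = f(W x^[t] + R y^[t-1] + b)
    y[t] = f(W @ x[t - 1] + R @ y[t - 1] + b)

print(y[1:])                               # RNN outputs for t = 1..T
```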
The derivative of the loss $L$ with respect to parameters $\theta$ of a network can be expanded using the chain rule:

$$\frac{dL}{d\theta} = \sum_{1 \le t_2 \le T} \frac{dL^{[t_2]}}{d\theta} = \sum_{1 \le t_2 \le T} \sum_{1 \le t_1 \le t_2} \frac{\partial L^{[t_2]}}{\partial y^{[t_2]}} \frac{\partial y^{[t_2]}}{\partial y^{[t_1]}} \frac{\partial y^{[t_1]}}{\partial \theta}. \qquad (1)$$

The Jacobian matrix $\frac{\partial y^{[t_2]}}{\partial y^{[t_1]}}$, the key factor for the transport of the error from time step $t_2$ to time step $t_1$, is obtained by chaining the derivatives across all time steps:

$$\frac{\partial y^{[t_2]}}{\partial y^{[t_1]}} := \prod_{t_1 < t \le t_2} \frac{\partial y^{[t]}}{\partial y^{[t-1]}} = \prod_{t_1 < t \le t_2} R^{\top} \, \mathrm{diag}\!\left[f'\!\left(Ry^{[t-1]}\right)\right], \qquad (2)$$

where the input and bias have been omitted for simplicity. We can now obtain conditions for the gradients to vanish or explode. Let $A := \frac{\partial y^{[t]}}{\partial y^{[t-1]}}$ be the temporal Jacobian, $\gamma$ be a maximal bound on $f'(Ry^{[t-1]})$ and $\sigma_{\max}$ be the largest singular value of $R^{\top}$. Then the norm of the Jacobian satisfies $\|A\| \le \|R^{\top}\| \, \|\mathrm{diag}[f'(Ry^{[t-1]})]\| \le \gamma \sigma_{\max}$, which together with (2) provides a sufficient condition for vanishing gradients ($\gamma \sigma_{\max} < 1$).

By Geršgorin's circle theorem (GCT), the eigenvalues of $A$ lie within the union of circles centered around the diagonal values $a_{ii}$ of $A$ with radius $\sum_{j=1, j \ne i}^{n} |a_{ij}|$ equal to the sum of the absolute values of the non-diagonal entries in each row of $A$. Two example Geršgorin circles referring to differently initialized RNNs are depicted in Figure 2.

Figure 2: Illustration of the Geršgorin circle theorem. Two Geršgorin circles are centered around their diagonal entries $a_{ii}$. The corresponding eigenvalues lie within the radius of the sum of absolute values of non-diagonal entries $a_{ij}$. Circle (1) represents an exemplar Geršgorin circle for an RNN initialized with small random values.

Using GCT we can understand the relationship between the entries of $R$ and the possible locations of the eigenvalues of the Jacobian. Shifting the diagonal values $a_{ii}$ shifts the possible locations of eigenvalues. Having large off-diagonal entries will allow for a large spread of eigenvalues. Small off-diagonal entries yield smaller radii and thus a more confined distribution of eigenvalues around the diagonal entries $a_{ii}$.

Let us assume that matrix $R$ is initialized with a zero-mean Gaussian distribution. We can then infer the following:

• If the values of $R$ are initialized with a standard deviation close to 0, the spectrum of $A$ is also initially centered around 0 (circle (1) in Figure 2 is an example of a corresponding Geršgorin circle), so most eigenvalues are initially much smaller than 1 in magnitude and gradients tend to vanish.
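A small numerical sketch of the analysis above (illustrative sizes, a tanh non-linearity, and an arbitrary initialization scale, none of which come from the paper): it forms the temporal Jacobian $A = R^{\top}\mathrm{diag}[f'(Ry^{[t-1]})]$ for one step and compares its eigenvalues with the Geršgorin disc centers and radii.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
sigma = 0.1                                 # std of the zero-mean Gaussian init (arbitrary)

R = rng.normal(scale=sigma, size=(n, n))    # recurrent weight matrix
y_prev = rng.normal(size=n)                 # previous state y^[t-1] (arbitrary)

# Temporal Jacobian for f = tanh:  A = R^T diag[f'(R y^[t-1])],  f'(z) = 1 - tanh(z)^2.
fprime = 1.0 - np.tanh(R @ y_prev) ** 2
A = R.T @ np.diag(fprime)

# Geršgorin discs: centers are the diagonal entries a_ii, radii are the sums of
# the absolute off-diagonal entries in each row.
centers = np.diag(A)
radii = np.abs(A).sum(axis=1) - np.abs(centers)

eigvals = np.linalg.eigvals(A)
print("disc centers: ", np.round(centers, 3))
print("disc radii:   ", np.round(radii, 3))
print("|eigenvalues|:", np.round(np.abs(eigvals), 3))
# With a small sigma, both centers and radii are close to zero, so the spectrum is
# confined near the origin (circle (1) in Figure 2) and gradients tend to vanish.
```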
Recommended publications
  • A Simple RNN-Plus-Highway Network for Statistical Parametric Speech Synthesis
    NII Technical Report NII-2017-003E (ISSN 1346-5597), Apr. 2017. Xin Wang, Shinji Takaki, Junichi Yamagishi. Abstract: In this report, we propose a neural network structure that combines a recurrent neural network (RNN) and a deep highway network. Compared with the highway RNN structures proposed in other studies, the one proposed in this study is simpler since it only concatenates a highway network after a pre-trained RNN. The main idea is to use the 'iterative unrolled estimation' of a highway network to finely change the output from the RNN. The experiments on the proposed network structure with a baseline RNN and 7 highway blocks demonstrated that this network performed relatively better than a deep RNN network with a similar model size. Furthermore, it took less than half the training time of the deep RNN. 1 Introduction: Statistical parametric speech synthesis is widely used in text-to-speech (TTS) synthesis. We assume a TTS system using a pipeline structure in which a front-end derives the information on the pronunciation and prosody of the input text, then the back-end generates the acoustic features based on the output of the front-end module. While various approaches can be used for the back-end acoustic modeling, we focus on the neural-network (NN)-based acoustic models. Suppose the output from the front-end is a sequence of textual feature vectors x_{1:T} = [x_1, ..., x_T] in T frames, where each x_t may include the phoneme identity, pitch accent, and other binary or continuous-valued linguistic features.
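A rough sketch of the structure described above — a stack of highway blocks concatenated after the output of a pre-trained RNN — is given below. The Elman-style recurrence standing in for the pre-trained RNN, the dimensions, the 7-block count taken from the summary, and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(3)
T, d_in, d_h = 10, 12, 16          # frames, linguistic-feature dim, hidden size (illustrative)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_outputs(x_seq, W, R, b):
    """Stand-in for the pre-trained RNN: returns one hidden vector per frame."""
    h = np.zeros(d_h)
    out = []
    for x in x_seq:
        h = np.tanh(W @ x + R @ h + b)
        out.append(h)
    return np.stack(out)

def highway_block(h, W_h, b_h, W_t, b_t):
    """One highway block that makes a small, gated refinement of its input."""
    cand = np.tanh(W_h @ h + b_h)
    gate = sigmoid(W_t @ h + b_t)
    return cand * gate + h * (1.0 - gate)

x_seq = rng.normal(size=(T, d_in))
W, R, b = rng.normal(size=(d_h, d_in)) * 0.3, rng.normal(size=(d_h, d_h)) * 0.3, np.zeros(d_h)
h_seq = rnn_outputs(x_seq, W, R, b)

# Append a stack of highway blocks (7 in the report) after the RNN outputs;
# each block iteratively refines the frame-wise representation.
blocks = [(rng.normal(size=(d_h, d_h)) * 0.3, np.zeros(d_h),
           rng.normal(size=(d_h, d_h)) * 0.3, np.full(d_h, -1.0)) for _ in range(7)]
refined = h_seq
for W_h, b_h, W_t, b_t in blocks:
    refined = np.stack([highway_block(h, W_h, b_h, W_t, b_t) for h in refined])

print(refined.shape)   # (T, d_h): per-frame features to be mapped to acoustic outputs
```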
  • Residual LSTM: Design of a Deep Recurrent Architecture for Distant Speech Recognition
    INTERSPEECH 2017, August 20–24, 2017, Stockholm, Sweden. Jaeyoung Kim, Mostafa El-Khamy, Jungwon Lee — Samsung Semiconductor, Inc., 4921 Directors Place, San Diego, CA, USA ([email protected], [email protected], [email protected]). Abstract: In this paper, a novel architecture for a deep recurrent neural network, residual LSTM, is introduced. A plain LSTM has an internal memory cell that can learn long-term dependencies of sequential data. It also provides a temporal shortcut path to avoid vanishing or exploding gradients in the temporal domain. The residual LSTM provides an additional spatial shortcut path from lower layers for efficient training of deep networks with multiple LSTM layers. Compared with the previous work, highway LSTM, residual LSTM separates the spatial shortcut path from the temporal one by using output layers, which can help to avoid a conflict between spatial and temporal-domain gradients. …train more than 100 convolutional layers for image classification and detection. The key insight in the residual network is to provide a shortcut path between layers that can be used for an additional gradient path. The highway network [18] is another way of implementing a shortcut path in a feed-forward neural network; [18] presented successful MNIST training results with 100 layers. Shortcut paths have also been investigated for RNN and LSTM networks. The maximum entropy RNN (ME-RNN) model [19] has direct connections between the input and output layers of an RNN layer. Although limited to RNN networks with a single hidden layer, the perplexity improved by training the direct connections as part of the whole network.
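The spatial-shortcut idea can be sketched roughly as follows: each stacked layer runs an ordinary LSTM step, and the layer's output adds the output of the layer below as a depth-wise residual path. This is a loose illustration under assumed shapes and gating details, not the paper's exact residual LSTM formulation.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 8   # shared input/hidden size so the residual addition is well-defined (illustrative)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, P):
    """Plain LSTM step; P holds the four gate weight matrices and biases."""
    z = np.concatenate([x, h])
    i = sigmoid(P["Wi"] @ z + P["bi"])          # input gate
    f = sigmoid(P["Wf"] @ z + P["bf"])          # forget gate (temporal shortcut)
    o = sigmoid(P["Wo"] @ z + P["bo"])          # output gate
    g = np.tanh(P["Wg"] @ z + P["bg"])          # cell candidate
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def make_params():
    p = {k: rng.normal(scale=0.3, size=(d, 2 * d)) for k in ("Wi", "Wf", "Wo", "Wg")}
    p.update({k: np.zeros(d) for k in ("bi", "bf", "bo", "bg")})
    return p

layers = [make_params() for _ in range(3)]       # a small stack of LSTM layers
h = [np.zeros(d) for _ in layers]
c = [np.zeros(d) for _ in layers]

x_t = rng.normal(size=d)
inp = x_t
for l, P in enumerate(layers):
    h[l], c[l] = lstm_step(inp, h[l], c[l], P)
    out = h[l] + inp          # spatial (depth-wise) residual shortcut from the layer below
    inp = out                 # feed the residual output to the next layer up

print(out.shape)
```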
  • Dense Recurrent Neural Network with Attention Gate
    Under review as a conference paper at ICLR 2018. Anonymous authors, paper under double-blind review. Abstract: We propose the dense RNN, which has full connections from each hidden state directly to multiple preceding hidden states of all layers. As the density of the connections increases, the number of paths through which the gradient flows is increased. The magnitude of the gradients is also increased, which helps to prevent the vanishing gradient problem in time. Larger gradients, however, may cause the exploding gradient problem. To complement the trade-off between the two problems, we propose an attention gate, which controls the amounts of gradient flows. We describe the relation between the attention gate and the gradient flows by approximation. The experiment on language modeling using the Penn Treebank corpus shows that dense connections with the attention gate improve the RNN's performance. 1 Introduction: In order to analyze sequential data, it is important to choose an appropriate model to represent the data. The recurrent neural network (RNN), as one of the models capturing sequential data, has been applied to many problems such as natural language (Mikolov et al., 2013), machine translation (Bahdanau et al., 2014), and speech recognition (Graves et al., 2013). There are two main research issues to improve the RNN's performance: 1) the vanishing and exploding gradient problems and 2) regularization. The vanishing and exploding gradient problems occur as the sequential data has long-term dependency (Hochreiter, 1998; Pascanu et al., 2013). One of the solutions is to add gate functions such as the long short-term memory (LSTM) and gated recurrent unit (GRU).
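A loose sketch of the dense-connection idea with a gate on each path is shown below: every new hidden state aggregates gated contributions from several preceding hidden states of all layers. The lag set, the element-wise gating form, and all parameter names are assumptions made for illustration; the paper's actual attention-gate equations may differ.

```python
import numpy as np

rng = np.random.default_rng(5)
d, T, L = 6, 5, 2          # hidden size, time steps, layers (illustrative)
lags = (1, 2)              # each state connects to these preceding time steps

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix and one gate parameter vector per (target layer, lag, source layer).
Wx = rng.normal(scale=0.3, size=(L, d, d))
Wr = rng.normal(scale=0.3, size=(L, len(lags), L, d, d))
g = rng.normal(scale=0.3, size=(L, len(lags), L, d))

x = rng.normal(size=(T, d))
h = np.zeros((T + max(lags), L, d))        # padded history of hidden states

for t in range(max(lags), T + max(lags)):
    below = x[t - max(lags)]               # input to the lowest layer at this step
    for l in range(L):
        acc = Wx[l] @ below
        for k, lag in enumerate(lags):
            for l2 in range(L):            # dense connections to all layers' past states
                prev = h[t - lag, l2]
                gate = sigmoid(g[l, k, l2] * prev)      # attention gate per connection
                acc += Wr[l, k, l2] @ (gate * prev)     # gated recurrent contribution
        h[t, l] = np.tanh(acc)
        below = h[t, l]                    # the layer above sees this layer's output

print(h[max(lags):].shape)                 # (T, L, d) hidden states
```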
  • LSTM, GRU, Highway and a Bit of Attention: an Empirical Overview for Language Modeling in Speech Recognition
    INTERSPEECH 2016, September 8–12, 2016, San Francisco, USA. Kazuki Irie, Zoltán Tüske, Tamer Alkhouli, Ralf Schlüter, Hermann Ney — Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen, Germany (firie,tuske,alkhouli,schlueter,[email protected]). Abstract: Popularized by the long short-term memory (LSTM), multiplicative gates have become a standard means to design artificial neural networks with intentionally organized information flow. Notable examples of such architectures include gated recurrent units (GRU) and highway networks. In this work, we first focus on the evaluation of each of the classical gated architectures for language modeling for large vocabulary speech recognition. Namely, we evaluate the highway network, lateral network, LSTM and GRU. Furthermore, the motivation underlying the highway network also applies to LSTM and GRU. An extension specific to the LSTM has been recently proposed with an additional highway connection between the memory cells of… The success of the LSTM has opened room for creativity in designing neural networks with multiplicative gates. While earlier works [12] motivated multiplications to make higher-order neural networks, recent works use them as a means to control the information flow inside the network. As a result, many concepts have emerged. The gated recurrent unit [13] was proposed as a simpler alternative to the LSTM. The highway network [14, 15] has gates to ensure the unobstructed information flow across the depth of the network. Also, more elementary gating can be found in tensor networks [16], also known as the lateral network [17]. The objective of this paper is to evaluate the effectiveness of these concepts for language modeling with application to large vocabulary speech recognition.
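For reference, the two gating patterns discussed above can be written in a few lines: the GRU's update gate interpolates between the previous state and a candidate state across time, while the highway transform gate interpolates between a transformed and an untouched input across depth. A minimal NumPy sketch with assumed shapes and initializations:

```python
import numpy as np

rng = np.random.default_rng(6)
d = 5
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, P):
    """GRU: gates interpolate between the previous state and a candidate (across time)."""
    z = sigmoid(P["Wz"] @ x + P["Uz"] @ h_prev)          # update gate
    r = sigmoid(P["Wr"] @ x + P["Ur"] @ h_prev)          # reset gate
    h_tilde = np.tanh(P["Wh"] @ x + P["Uh"] @ (r * h_prev))
    return (1.0 - z) * h_prev + z * h_tilde

def highway_layer(x, P):
    """Highway: a gate interpolates between a transformed and untouched input (across depth)."""
    t = sigmoid(P["Wt"] @ x + P["bt"])
    return t * np.tanh(P["Wc"] @ x) + (1.0 - t) * x

P = {k: rng.normal(scale=0.4, size=(d, d)) for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh", "Wt", "Wc")}
P["bt"] = np.full(d, -1.0)                               # bias the gate toward carrying
x, h = rng.normal(size=d), np.zeros(d)
print(gru_step(x, h, P), highway_layer(x, P))
```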
  • Long Short-Term Memory Recurrent Neural Network for Short-Term Freeway Traffic Speed Prediction
    Junxiong Huang. A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science, University of Wisconsin–Madison Graduate School, Department of Civil and Environmental Engineering, April 2018. Executive summary: Traffic congestion on the Riverside Freeway (SR 91), California, usually occurs on weekdays; the study data cover 0:00 to 23:55 of each day from January 3, 2011 to March 16, 2012. Congestion can be detected from average mainline speed: the free-flow speed fluctuates around 60 to 70 mph, while the congested speed fluctuates around 30 to 40 mph. This research aims at predicting the average mainline speed of the segment downstream of the merge of a mainline and a ramp. The first objective is to determine the independent traffic features which can be used in the experiments, the second is to process the data in order to form the data instances for the experiments, the third is to determine the predicted average speed with different time lags and to do the prediction, and the fourth is to evaluate the performance of the predicted average mainline speed. The data source is the Caltrans Performance Measurement System (PeMS), which is collected in real time from over 39,000 individual detectors. These sensors span the freeway system across all major metropolitan areas of the State of California. PeMS is also an Archived Data User Service (ADUS) that provides over ten years of data for historical analysis. It integrates a wide variety of information from Caltrans and other local agency systems, including traffic detectors, census traffic counts, incidents, vehicle classification, lane closures, weigh-in-motion, toll tags and roadway inventory.
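The data-preparation step summarized above — turning an archived speed series into (lagged-input, future-speed) instances for a chosen prediction horizon — might look roughly like the sliding-window sketch below; the 5-minute interval, the lag and horizon values, and the synthetic series are assumptions for illustration, not PeMS specifics.

```python
import numpy as np

def make_instances(speed, n_lags, horizon):
    """Build (X, y) pairs: X holds the n_lags most recent 5-minute speed readings,
    y is the average mainline speed `horizon` steps ahead."""
    X, y = [], []
    for i in range(n_lags, len(speed) - horizon):
        X.append(speed[i - n_lags:i])
        y.append(speed[i + horizon - 1])
    return np.array(X), np.array(y)

# Synthetic illustrative series: free flow around 65 mph with a congested dip near 35 mph.
rng = np.random.default_rng(7)
speed = np.concatenate([rng.normal(65, 3, 100), rng.normal(35, 5, 50), rng.normal(65, 3, 100)])

X, y = make_instances(speed, n_lags=12, horizon=3)   # 1 hour of history, 15 minutes ahead
print(X.shape, y.shape)                              # (n_instances, 12), (n_instances,)
```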
  • University of California, Los Angeles: Neural Network Based Representation Learning and Modeling for Speech and Speaker Recognition
    University of California, Los Angeles. A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Electrical and Computer Engineering, by Jinxi Guo, 2019. © Copyright by Jinxi Guo 2019. Abstract of the dissertation (Professor Abeer A. H. Alwan, Chair): Deep learning and neural network research has grown significantly in the fields of automatic speech recognition (ASR) and speaker recognition. Compared to traditional methods, deep learning-based approaches are more powerful in learning representations from data and building complex models. In this dissertation, we focus on representation learning and modeling using neural network-based approaches for speech and speaker recognition. In the first part of the dissertation, we present two novel neural network-based methods to learn speaker-specific and phoneme-invariant features for short-utterance speaker verification. We first propose to learn a spectral feature mapping from each speech signal to the corresponding subglottal acoustic signal, which has less phoneme variation, using deep neural networks (DNNs). The estimated subglottal features show better speaker-separation ability and provide complementary information when combined with traditional speech features on speaker verification tasks. Additionally, we propose another DNN-based mapping model, which maps the speaker representation extracted from short utterances to the speaker representation extracted from long utterances of the same speaker. Two non-linear regression models using an autoencoder are proposed to learn this mapping, and they both improve speaker verification performance significantly.
  • A Review of Recurrent Neural Networks: LSTM Cells and Network Architectures
    REVIEW, communicated by Terrence Sejnowski. Yong Yu ([email protected]), Department of Automation, Xi'an Institute of High-Technology, Xi'an 710025, China, and Institute No. 25, Second Academy of China, Aerospace Science and Industry Corporation, Beijing 100854, China; Xiaosheng Si ([email protected]), Changhua Hu ([email protected]), Jianxun Zhang ([email protected]), Department of Automation, Xi'an Institute of High-Technology, Xi'an 710025, China. Recurrent neural networks (RNNs) have been widely adopted in research areas concerned with sequential data, such as text, audio, and video. However, RNNs consisting of sigma cells or tanh cells are unable to learn the relevant information of input data when the input gap is large. By introducing gate functions into the cell structure, the long short-term memory (LSTM) could handle the problem of long-term dependencies well. Since its introduction, almost all the exciting results based on RNNs have been achieved by the LSTM. The LSTM has become the focus of deep learning. We review the LSTM cell and its variants to explore the learning capacity of the LSTM cell. Furthermore, the LSTM networks are divided into two broad categories: LSTM-dominated networks and integrated LSTM networks. In addition, their various applications are discussed. Finally, future research directions are presented for LSTM networks. 1 Introduction: Over the past few years, deep learning techniques have been well developed and widely adopted to extract information from various kinds of data (Ivakhnenko & Lapa, 1965; Ivakhnenko, 1971; Bengio, 2009; Carrio, Sampedro, Rodriguez-Ramos, & Campoy, 2017; Khan & Yairi, 2018).
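As a toy numerical illustration of why gate functions help with large input gaps (an assumption-laden scalar example, not taken from the review): multiplying the per-step gradient factors of a plain tanh cell over many steps shrinks them exponentially, whereas the gradient along the LSTM cell state is scaled by the forget gate itself, which can stay close to 1.

```python
import numpy as np

rng = np.random.default_rng(8)
steps = 100

# Plain tanh cell (scalar toy): the per-step gradient factor is r * (1 - h_t^2).
r, h = 0.9, 0.0
rnn_factor = 1.0
for _ in range(steps):
    a = r * h + rng.normal(scale=0.1)      # pre-activation with a small input
    h = np.tanh(a)
    rnn_factor *= r * (1.0 - h ** 2)       # d h_t / d h_{t-1} for the scalar tanh cell

# LSTM cell-state path (scalar toy): the per-step factor along c_t is the forget gate.
forget_gate = 0.99                         # a forget gate kept close to 1
lstm_factor = forget_gate ** steps

print(f"tanh-cell gradient factor after {steps} steps: {rnn_factor:.3e}")
print(f"LSTM cell-path factor after {steps} steps:     {lstm_factor:.3e}")
# The plain cell's factor shrinks exponentially toward zero, while the gated
# cell-state path decays only as the product of forget-gate values.
```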