Learning Recurrent Neural Networks with Hessian-Free Optimization
James Martens [email protected]
Ilya Sutskever [email protected]
University of Toronto, Canada

Abstract

In this work we resolve the long-outstanding problem of how to effectively train recurrent neural networks (RNNs) on complex and difficult sequence modeling problems which may contain long-term data dependencies. Utilizing recent advances in the Hessian-free optimization approach (Martens, 2010), together with a novel damping scheme, we successfully train RNNs on two sets of challenging problems: first, a collection of pathological synthetic datasets which are known to be impossible for standard optimization approaches (due to their extremely long-term dependencies), and second, three natural and highly complex real-world sequence datasets, where we find that our method significantly outperforms the previous state-of-the-art method for training neural sequence models: the Long Short-Term Memory approach of Hochreiter and Schmidhuber (1997). Additionally, we offer a new interpretation of the generalized Gauss-Newton matrix of Schraudolph (2002) which is used within the HF approach of Martens.

1. Introduction

A Recurrent Neural Network (RNN) is a neural network that operates in time. At each timestep, it accepts an input vector, updates its (possibly high-dimensional) hidden state via non-linear activation functions, and uses it to make a prediction of its output. RNNs form a rich model class because their hidden state can store information as high-dimensional distributed representations (as opposed to a Hidden Markov Model, whose hidden state is essentially log n-dimensional) and their nonlinear dynamics can implement rich and powerful computations, allowing the RNN to perform modeling and prediction tasks for sequences with highly complex structure.

Figure 1. The architecture of a recurrent neural network.

Gradient-based training of RNNs might appear straightforward because, unlike many rich probabilistic sequence models (Murphy, 2002), the exact gradients can be cheaply computed by the Backpropagation Through Time (BPTT) algorithm (Rumelhart et al., 1986). Unfortunately, gradient descent and other 1st-order methods completely fail to properly train RNNs on large families of seemingly simple yet pathological synthetic problems that separate a target output from its relevant input by many time steps (Bengio et al., 1994; Hochreiter and Schmidhuber, 1997). In fact, 1st-order approaches struggle even when the separation is only 10 timesteps (Bengio et al., 1994). An unfortunate consequence of these failures is that these highly expressive and potentially very powerful time-series models are seldom used in practice.

The extreme difficulty associated with training RNNs is likely due to the highly volatile relationship between the parameters and the hidden states. One way that this volatility manifests itself, which has a direct impact on the performance of gradient descent, is the so-called "vanishing/exploding gradients" phenomenon (Bengio et al., 1994; Hochreiter, 1991), where the error signals exhibit exponential decay or growth as they are back-propagated through time. In the case of decay, the long-term error signals are effectively lost as they are overwhelmed by un-decayed short-term signals; in the case of exponential growth, the opposite problem occurs and the short-term error signals are overwhelmed by the long-term ones.
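To make the decay/growth rates concrete, the following toy numerical sketch (ours, not from the paper) backpropagates a linearized error signal through T timesteps: the signal is repeatedly multiplied by the transpose of the recurrent weight matrix, so its norm scales roughly like sigma^T, where sigma is the spectral norm of that matrix. The rescaled orthogonal matrix used here is purely illustrative.

```python
import numpy as np

# Toy illustration of vanishing/exploding backpropagated error signals.
rng = np.random.default_rng(0)
T, m = 50, 20
delta = rng.standard_normal(m)          # error signal at the final timestep

for scale in (0.9, 1.1):                # contracting vs. expanding recurrent weights
    # Orthogonal matrix rescaled so every singular value equals `scale`.
    W_hh = scale * np.linalg.qr(rng.standard_normal((m, m)))[0]
    d = delta.copy()
    for _ in range(T):                  # crude linearized backward pass through time
        d = W_hh.T @ d
    # Prints roughly 0.9**50 ~ 0.005 and 1.1**50 ~ 117, respectively.
    print(scale, np.linalg.norm(d) / np.linalg.norm(delta))
```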
During the 90's there was intensive research by the machine learning community into identifying the source of the difficulty in training RNNs, as well as proposing methods to address it. However, none of these methods became widely adopted, and an analysis by Hochreiter and Schmidhuber (1996) showed that they were often no better than random guessing. In an attempt to sidestep the difficulty of training RNNs on problems exhibiting long-term dependencies, Hochreiter and Schmidhuber (1997) proposed a modified architecture called the Long Short-Term Memory (LSTM) and successfully applied it to speech and handwritten text recognition (Graves and Schmidhuber, 2009; 2005) and robotic control (Mayer et al., 2007). The LSTM consists of a standard RNN augmented with "memory units" which specialize in transmitting long-term information, along with a set of "gating" units which allow the memory units to selectively interact with the usual RNN hidden state. Because the memory units are forced to have fixed linear dynamics with a self-connection of value 1, the error signal neither decays nor explodes as it is backpropagated through them. However, it is not clear whether the approach of training with gradient descent and enabling long-term memorization with the specialized LSTM architecture harnesses the true power of recurrent neural computation. In particular, gradient descent may be deficient for RNN learning in ways that are not compensated for by using LSTMs.

Another recent attempt to resolve the problem of RNN training which has received attention is the Echo State Network (ESN) of Jaeger and Haas (2004), which gives up on learning the problematic hidden-to-hidden weights altogether in favor of using fixed sparse connections which are generated randomly. However, since this approach cannot learn new non-linear dynamics, instead relying on those present in the random "reservoir" which is effectively created by the random initialization, its power is clearly limited.

Recently, a special variant of the Hessian-Free (HF) optimization approach (aka truncated Newton or Newton-CG) was successfully applied to learning deep multilayered neural networks from random initializations (Martens, 2010), an optimization problem for which gradient descent and even quasi-Newton methods like L-BFGS have never been demonstrated to be effective. Martens examines the problem of learning deep auto-encoders and is able to obtain the best-known results for these problems, significantly surpassing the benchmark set by the seminal deep learning work of Hinton and Salakhutdinov (2006).

Inspired by the success of the HF approach on deep neural networks, in this paper we revisit and resolve the long-standing open problem of RNN training. In particular, our results show that the HF optimization approach of Martens, augmented with a novel "structural damping" scheme which we develop, can effectively train RNNs on the aforementioned pathological long-term dependency problems (adapted directly from Hochreiter and Schmidhuber (1997)), thus overcoming the main objection made against using RNNs. From there we go on to address the question of whether these advances generalize to real-world sequence modeling problems by considering a high-dimensional motion video prediction task, a MIDI music modeling task, and a speech modeling task. We find that RNNs trained using our method are highly effective on these tasks and significantly outperform similarly sized LSTMs.

Other contributions we make include the development of the aforementioned structural damping scheme which, as we demonstrate through experiments, can significantly improve the robustness of the HF optimizer in the setting of RNN training, and a new interpretation of the generalized Gauss-Newton matrix of Schraudolph (2002) which forms a key component of the HF approach of Martens.

2. Recurrent Neural Networks

We now formally define the standard RNN (Rumelhart et al., 1986) which forms the focus of this work. Given a sequence of inputs x_1, x_2, \dots, x_T, each in \mathbb{R}^n, the network computes a sequence of hidden states h_1, h_2, \dots, h_T, each in \mathbb{R}^m, and a sequence of predictions \hat{y}_1, \hat{y}_2, \dots, \hat{y}_T, each in \mathbb{R}^k, by iterating the equations

    t_i = W_{hx} x_i + W_{hh} h_{i-1} + b_h        (1)
    h_i = e(t_i)                                   (2)
    s_i = W_{yh} h_i + b_y                         (3)
    \hat{y}_i = g(s_i)                             (4)

where W_{yh}, W_{hx}, W_{hh} are the weight matrices and b_h, b_y are the biases; t_1, t_2, \dots, t_T, each in \mathbb{R}^m, and s_1, s_2, \dots, s_T, each in \mathbb{R}^k, are the inputs to the hidden and output units (resp.); and e and g are pre-defined vector-valued functions which are typically non-linear and applied coordinate-wise (although the only formal requirement is that they have a well-defined Jacobian). The RNN also has a special initial bias b_h^{init} \in \mathbb{R}^m which replaces the formally undefined expression W_{hh} h_0 at time i = 1. For a subscripted variable z_i, the variable without a subscript, z, will refer to all of the z_i's from i = 1 to T concatenated into one large vector, and similarly we will use \theta to refer to all of the parameters as one large vector.
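As a concrete illustration, here is a minimal sketch of the forward pass defined by Equations (1)-(4). It assumes tanh hidden units for e, the identity for g, and an explicit initial hidden state h0 in place of the initial bias term; the function name and the NumPy implementation are ours, not the paper's.

```python
import numpy as np

def rnn_forward(xs, W_hx, W_hh, W_yh, b_h, b_y, h0):
    """Run the recurrence of Eqs. (1)-(4) over a list of input vectors xs
    and return the sequence of predictions."""
    h = h0
    preds = []
    for x in xs:
        t = W_hx @ x + W_hh @ h + b_h   # Eq. (1): hidden pre-activation t_i
        h = np.tanh(t)                  # Eq. (2): h_i = e(t_i), with e = tanh here
        s = W_yh @ h + b_y              # Eq. (3): output pre-activation s_i
        preds.append(s)                 # Eq. (4): y_hat_i = g(s_i), with g = identity here
    return preds

# Toy usage with n = 3 inputs, m = 5 hidden units, k = 2 outputs, T = 4 timesteps.
rng = np.random.default_rng(0)
n, m, k, T = 3, 5, 2, 4
xs = [rng.standard_normal(n) for _ in range(T)]
params = dict(W_hx=0.1 * rng.standard_normal((m, n)),
              W_hh=0.1 * rng.standard_normal((m, m)),
              W_yh=0.1 * rng.standard_normal((k, m)),
              b_h=np.zeros(m), b_y=np.zeros(k), h0=np.zeros(m))
y_hat = rnn_forward(xs, **params)
```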
The objective function associated with RNNs for a single training pair (x, y) is defined as f(\theta) = L(\hat{y}, y), where L is a distance function which measures the deviation of the predictions \hat{y} from the target outputs y. Examples of L include the squared error \sum_i \|\hat{y}_i - y_i\|^2 / 2 and the cross-entropy error -\sum_i \sum_j \left[ y_{ij} \log \hat{y}_{ij} + (1 - y_{ij}) \log(1 - \hat{y}_{ij}) \right]. The overall objective function for the whole training set is simply given by the average of the individual objectives associated with each training example.

3. Hessian-Free Optimization

Hessian-Free optimization is concerned with the minimization of an objective f : \mathbb{R}^N \to \mathbb{R} with respect to an N-dimensional input vector \theta. Its operation is viewed as the iterative minimization of simpler sub-objectives based on local approximations to f(\theta). Specifically, given a parameter setting \theta_n, the next one, \theta_{n+1}, is found by partially optimizing the sub-objective

    q_{\theta_n}(\theta) \equiv M_{\theta_n}(\theta) + \lambda R_{\theta_n}(\theta)        (5)
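The paper goes on to specify M and R. As a generic illustration of the kind of update Eq. (5) implies, the sketch below partially minimizes a quadratic model of f with Tikhonov-style damping R(\theta) = \|\theta - \theta_n\|^2 / 2 using linear conjugate gradient, which reduces to approximately solving (B + \lambda I)\delta = -\nabla f(\theta_n). This is a standard truncated-Newton step under our stated assumptions, not the paper's exact algorithm; the gradient and curvature-vector product are placeholders (in an actual RNN implementation they would come from BPTT and Gauss-Newton-vector products).

```python
import numpy as np

def truncated_newton_step(grad, curv_vec, lam, max_iters=50, tol=1e-10):
    """Approximately solve (B + lam*I) delta = -grad with conjugate gradient,
    where B is accessed only through the matrix-vector product curv_vec."""
    delta = np.zeros_like(grad)
    r = -grad - (curv_vec(delta) + lam * delta)   # residual b - A*delta with b = -grad
    d = r.copy()
    rr = r @ r
    for _ in range(max_iters):                    # "partial" optimization: capped CG iterations
        Ad = curv_vec(d) + lam * d
        alpha = rr / (d @ Ad)
        delta = delta + alpha * d
        r = r - alpha * Ad
        rr_new = r @ r
        if rr_new < tol:
            break
        d = r + (rr_new / rr) * d
        rr = rr_new
    return delta

# Toy usage on a random positive-definite matrix standing in for the curvature.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 10))
B_mat = A @ A.T                                   # placeholder for a Gauss-Newton matrix
g = rng.standard_normal(10)                       # placeholder for the gradient at theta_n
step = truncated_newton_step(g, lambda v: B_mat @ v, lam=1.0)
```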