
Unitary Evolution Recurrent Neural Networks

Martin Arjovsky* [email protected]    Amar Shah* [email protected]    Yoshua Bengio
Universidad de Buenos Aires, University of Cambridge, Université de Montréal. Yoshua Bengio is a CIFAR Senior Fellow. *Indicates first authors. Ordering determined by coin flip.

Abstract

Recurrent neural networks (RNNs) are notoriously difficult to train. When the eigenvalues of the hidden to hidden weight matrix deviate from absolute value 1, optimization becomes difficult due to the well studied issue of vanishing and exploding gradients, especially when trying to learn long-term dependencies. To circumvent this problem, we propose a new architecture that learns a unitary weight matrix, with eigenvalues of absolute value exactly 1. The challenge we address is that of parametrizing unitary matrices in a way that does not require expensive computations (such as eigendecomposition) after each weight update. We construct an expressive unitary weight matrix by composing several structured matrices that act as building blocks with parameters to be learned. Optimization with this parameterization becomes feasible only when considering hidden states in the complex domain. We demonstrate the potential of this architecture by achieving state of the art results in several hard tasks involving very long-term dependencies.

1. Introduction

Deep Neural Networks have shown remarkably good performance on a wide range of complex data problems, including speech recognition (Hinton et al., 2012), image recognition (Krizhevsky et al., 2012) and natural language processing (Collobert et al., 2011). However, training very deep models remains a difficult task.

The main issue surrounding the training of deep networks is the vanishing and exploding gradients problem, introduced by Hochreiter (1991) and shown by Bengio et al. (1994) to necessarily arise when trying to learn to reliably store bits of information in any parametrized dynamical system. If gradients propagated back through a network vanish, the credit assignment role of backpropagation is lost, as information about small changes in states in the far past has no influence on future states. If gradients explode, gradient-based optimization algorithms struggle to traverse down a cost surface, because gradient-based optimization assumes that small changes in parameters yield small changes in the objective function. As the number of time steps considered in the sequence of states grows, the shrinking or expanding effects associated with the state-to-state transformation at individual time steps can grow exponentially, yielding respectively vanishing or exploding gradients. See Pascanu et al. (2010) for a review.

Although the long-term dependencies problem appears intractable in the absolute (Bengio et al., 1994) for parametrized dynamical systems, several heuristics have recently been found to help reduce its effect, such as the use of self-loops and gating units in the LSTM (Hochreiter & Schmidhuber, 1997) and GRU (Cho et al., 2014) recurrent architectures. Recent work also supports the idea of using orthogonal weight matrices to assist optimization (Saxe et al., 2014; Le et al., 2015).

In this paper, we explore the use of orthogonal and unitary matrices in recurrent neural networks. We start in Section 2 by showing a novel bound on the propagated gradients in recurrent nets when the recurrent matrix is orthogonal. Section 3 discusses the difficulties of parameterizing real valued orthogonal matrices and how they can be alleviated by moving to the complex domain.
We discuss a novel approach to constructing expressive unitary matrices as the composition of simple unitary matrices which require at most $O(n \log n)$ computation and $O(n)$ memory, when the state vector has dimension $n$. These are unlike general matrices, which require $O(n^2)$ computation and memory. Complex valued representations have been considered for neural networks in the past, but with limited success and adoption (Hirose, 2003; Zimmermann et al., 2011). We hope our findings will change this.

Whilst our model uses complex valued matrices and parameters, all implementation and optimization is possible with real numbers and has been done in Theano (Bergstra et al., 2010). This, along with other implementation details, is discussed in Section 4, and the code used for the experiments is available online. The potential of the developed model for learning long term dependencies with relatively few parameters is explored in Section 5. We find that the proposed architecture generally outperforms LSTMs and previous approaches based on orthogonal initialization.

2. Orthogonal Weights and Bounding the Long-Term Gradient

A matrix, $\mathbf{W}$, is orthogonal if $\mathbf{W}^\top\mathbf{W} = \mathbf{W}\mathbf{W}^\top = \mathbf{I}$. Orthogonal matrices have the property that they preserve norm (i.e. $\|\mathbf{W}\mathbf{h}\|_2 = \|\mathbf{h}\|_2$), and hence repeated iterative multiplication of a vector by an orthogonal matrix leaves the norm of the vector unchanged.

Let $\mathbf{h}_T$ and $\mathbf{h}_t$ be the hidden unit vectors for hidden layers $T$ and $t$ of a neural network with $T$ hidden layers and $T \gg t$. If $C$ is the objective we are trying to minimize, then the vanishing and exploding gradient problems refer to the decay or growth of $\partial C / \partial \mathbf{h}_t$ as the number of layers, $T$, grows.

Let $\sigma$ be a pointwise nonlinearity function, and

$$\mathbf{z}_{t+1} = \mathbf{W}_t \mathbf{h}_t + \mathbf{V}_t \mathbf{x}_{t+1}, \qquad \mathbf{h}_{t+1} = \sigma(\mathbf{z}_{t+1}); \quad (1)$$

then by the chain rule

$$\frac{\partial C}{\partial \mathbf{h}_t} = \frac{\partial C}{\partial \mathbf{h}_T}\frac{\partial \mathbf{h}_T}{\partial \mathbf{h}_t} = \frac{\partial C}{\partial \mathbf{h}_T} \prod_{k=t}^{T-1} \frac{\partial \mathbf{h}_{k+1}}{\partial \mathbf{h}_k} = \frac{\partial C}{\partial \mathbf{h}_T} \prod_{k=t}^{T-1} \mathbf{D}_{k+1}\mathbf{W}_k^\top, \quad (2)$$

where $\mathbf{D}_{k+1} = \mathrm{diag}(\sigma'(\mathbf{z}_{k+1}))$ is the Jacobian matrix of the pointwise nonlinearity.

In the following we take the norm of a matrix to mean the spectral norm (operator 2-norm) and the norm of a vector to mean the $L_2$-norm. By definition of the operator norm, for any matrices $\mathbf{A}, \mathbf{B}$ and vector $\mathbf{v}$ we have $\|\mathbf{A}\mathbf{v}\| \le \|\mathbf{A}\|\,\|\mathbf{v}\|$ and $\|\mathbf{A}\mathbf{B}\| \le \|\mathbf{A}\|\,\|\mathbf{B}\|$. If the weight matrices $\mathbf{W}_k$ are norm preserving (i.e. orthogonal), then we prove

$$\left\|\frac{\partial C}{\partial \mathbf{h}_t}\right\| = \left\|\frac{\partial C}{\partial \mathbf{h}_T} \prod_{k=t}^{T-1} \mathbf{D}_{k+1}\mathbf{W}_k^\top\right\| \le \left\|\frac{\partial C}{\partial \mathbf{h}_T}\right\| \prod_{k=t}^{T-1} \left\|\mathbf{D}_{k+1}\mathbf{W}_k^\top\right\| = \left\|\frac{\partial C}{\partial \mathbf{h}_T}\right\| \prod_{k=t}^{T-1} \|\mathbf{D}_{k+1}\|. \quad (3)$$

Since $\mathbf{D}_k$ is diagonal, $\|\mathbf{D}_k\| = \max_{j=1,\dots,n} |\sigma'(z_k^{(j)})|$, with $z_k^{(j)}$ the $j$-th pre-activation of the $k$-th hidden layer. If the absolute value of the derivative $\sigma'$ can take some value $\tau > 1$, then this bound is useless, since $\|\partial C/\partial \mathbf{h}_t\| \le \|\partial C/\partial \mathbf{h}_T\|\, \tau^{T-t}$, which grows exponentially in $T$. We therefore cannot effectively bound $\partial C/\partial \mathbf{h}_t$ for deep networks, potentially resulting in exploding gradients.

In the case $|\sigma'| < \tau < 1$, equation (3) proves that $\partial C/\partial \mathbf{h}_t$ tends to 0 exponentially fast as $T$ grows, resulting in guaranteed vanishing gradients. This argument makes the rectified linear unit (ReLU) nonlinearity an attractive choice (Glorot et al., 2011; Nair & Hinton, 2010). Unless all the activations are killed at one layer, the maximum entry of $\mathbf{D}_k$ is 1, resulting in $\|\mathbf{D}_k\| = 1$ for all layers $k$. With ReLU nonlinearities, we thus have

$$\left\|\frac{\partial C}{\partial \mathbf{h}_t}\right\| \le \left\|\frac{\partial C}{\partial \mathbf{h}_T}\right\| \prod_{k=t}^{T-1} \|\mathbf{D}_{k+1}\| = \left\|\frac{\partial C}{\partial \mathbf{h}_T}\right\|. \quad (4)$$

Most notably, this result holds for a network of arbitrary depth and renders engineering tricks like gradient clipping unnecessary (Pascanu et al., 2010).

To the best of our knowledge, this analysis is a novel contribution and the first time a neural network architecture has been mathematically proven to avoid exploding gradients.
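As a quick numerical sanity check of this bound (our own sketch, not part of the paper), the NumPy snippet below builds a random orthogonal matrix via a QR decomposition, runs a bias-free ReLU network forward, and verifies that the spectral norm of the accumulated Jacobian $\partial \mathbf{h}_T / \partial \mathbf{h}_t$ stays at or below 1, which is what keeps the backpropagated gradient norm from growing (equation 4).

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 64, 50

# A random orthogonal matrix from the QR decomposition of a Gaussian matrix.
W, _ = np.linalg.qr(rng.standard_normal((n, n)))

h = rng.standard_normal(n)
J = np.eye(n)  # accumulates the Jacobian dh_T / dh_0

for _ in range(T):
    z = W @ h
    h = np.maximum(z, 0.0)              # ReLU forward pass (no input or bias, for simplicity)
    D = np.diag((z > 0).astype(float))  # diagonal of ReLU derivatives
    J = D @ W @ J                       # chain rule: multiply by the per-step Jacobian

# The spectral norm of the accumulated Jacobian never exceeds 1, so the
# backpropagated gradient norm cannot grow with depth (equation 4).
print(np.linalg.norm(J, 2))
```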
3. Unitary Evolution RNNs

Unitary matrices generalize orthogonal matrices to the complex domain. A complex valued, norm preserving matrix, $\mathbf{U}$, is called a unitary matrix and is such that $\mathbf{U}^*\mathbf{U} = \mathbf{U}\mathbf{U}^* = \mathbf{I}$, where $\mathbf{U}^*$ is the conjugate transpose of $\mathbf{U}$. Directly parametrizing the set of unitary matrices in such a way that gradient-based optimization can be applied is not straightforward, because a gradient step will typically yield a matrix that is not unitary, and projecting onto the set of unitary matrices (e.g., by performing an eigendecomposition) generally costs $O(n^3)$ computation when $\mathbf{U}$ is $n \times n$.

The most important feature of unitary and orthogonal matrices for our purpose is that they have eigenvalues $\lambda_j$ with absolute value 1. The following lemma, proved in (Hoffman & Kunze, 1971), may shed light on a method which can be used to efficiently span a large set of unitary matrices.

Lemma 1. A complex square matrix $\mathbf{W}$ is unitary if and only if it has an eigendecomposition of the form $\mathbf{W} = \mathbf{V}\mathbf{D}\mathbf{V}^*$, where $^*$ denotes the conjugate transpose. Here, $\mathbf{V}, \mathbf{D} \in \mathbb{C}^{n \times n}$ are complex matrices, where $\mathbf{V}$ is unitary, and $\mathbf{D}$ is a diagonal such that $|\mathbf{D}_{j,j}| = 1$. Furthermore, $\mathbf{W}$ is a real orthogonal matrix if and only if for every eigenvalue $\mathbf{D}_{j,j} = \lambda_j$ with eigenvector $\mathbf{v}_j$, there is also a complex conjugate eigenvalue $\lambda_k = \bar{\lambda}_j$ with corresponding eigenvector $\mathbf{v}_k = \bar{\mathbf{v}}_j$.

Writing $\lambda_j = e^{i w_j}$ with $w_j \in \mathbb{R}$, a naive method to learn a unitary matrix would be to fix a basis of eigenvectors $\mathbf{V} \in \mathbb{C}^{n \times n}$ and set

$$\mathbf{W} = \mathbf{V}\mathbf{D}\mathbf{V}^*, \quad (5)$$

where $\mathbf{D}$ is a diagonal such that $\mathbf{D}_{j,j} = \lambda_j$.

Lemma 1 informs us how to construct a real orthogonal matrix, $\mathbf{W}$. We must (i) ensure the columns of $\mathbf{V}$ come in complex conjugate pairs, $\mathbf{v}_k = \bar{\mathbf{v}}_j$, and (ii) tie weights $w_k = -w_j$ in order to achieve $e^{i w_k} = \overline{e^{i w_j}}$. Most neural network objective functions are differentiable with respect to the weight matrices, and consequently $w_j$ may be learned by gradient descent.

Unfortunately the above approach has undesirable properties. Fixing $\mathbf{V}$ and learning $w$ requires $O(n^2)$ memory, which is unacceptable given that the number of learned parameters is $O(n)$. Further note that calculating $\mathbf{V}\mathbf{u}$ for an arbitrary vector $\mathbf{u}$ requires $O(n^2)$ computation. Setting $\mathbf{V}$ to the identity would satisfy the conditions of the lemma, whilst reducing memory and computation requirements to $O(n)$; however, $\mathbf{W}$ would remain diagonal and have poor representation capacity.

We propose an alternative strategy to parameterize unitary matrices. Since the product of unitary matrices is itself a unitary matrix, we compose several simple, parametric, unitary matrices to construct a single, expressive unitary matrix. The four unitary building blocks considered are

• $\mathbf{D}$, a diagonal matrix with $\mathbf{D}_{j,j} = e^{i w_j}$, with parameters $w_j \in \mathbb{R}$,

• $\mathbf{R} = \mathbf{I} - 2\frac{\mathbf{v}\mathbf{v}^*}{\|\mathbf{v}\|^2}$, a reflection matrix in the complex vector $\mathbf{v} \in \mathbb{C}^n$,

• $\mathbf{\Pi}$, a fixed random index permutation matrix, and

• $\mathcal{F}$ and $\mathcal{F}^{-1}$, the Fourier and inverse Fourier transforms.

Appealingly, $\mathbf{D}$, $\mathbf{R}$ and $\mathbf{\Pi}$ all permit $O(n)$ storage and $O(n)$ computation for matrix vector products. $\mathcal{F}$ and $\mathcal{F}^{-1}$ require no storage and $O(n \log n)$ matrix vector multiplication using the Fast Fourier Transform algorithm. A major advantage of composing unitary matrices of the form listed above is that the number of parameters, memory and computational cost increase almost linearly in the size of the hidden layer. With such a weight matrix, immensely large hidden layers are feasible to train, whilst being impossible in traditional neural networks.

With this in mind, in this work we choose to consider recurrent neural networks with unitary hidden to hidden weight matrices. Our claim is that the ability to have large hidden layers where hidden state norms are preserved provides a powerful tool for modeling long term dependencies in sequence data. (Bengio et al., 1994) suggest that having a large memory may be crucial for solving difficult tasks with long ranging dependencies: the smaller the state dimension, the more information necessarily has to be eliminated when mapping a long sequence to a fixed-dimension state.

We call any RNN architecture which uses a unitary hidden to hidden matrix a unitary evolution RNN (uRNN). After experimenting with several structures, we settled on the following composition:

$$\mathbf{W} = \mathbf{D}_3 \mathbf{R}_2 \mathcal{F}^{-1} \mathbf{D}_2 \mathbf{\Pi} \mathbf{R}_1 \mathcal{F} \mathbf{D}_1. \quad (6)$$

Whilst all but the permutation matrix are complex, we parameterize and represent them with real numbers for implementation purposes. When the final cost is real and differentiable, we may perform gradient descent optimization to learn the parameters. (Yang et al., 2015) construct a real valued, non-orthogonal matrix using a similar parameterization, with the motivation of reducing parameters by an order of magnitude on an industrial sized network. This, combined with earlier work (Le et al., 2010), suggests that it is possible to create highly expressive matrices by composing simple matrices with few parameters. In the following section, we explain how to implement our model and illustrate how we bypass the potential difficulties of working in the complex domain.
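As a brief illustration before moving on, the product $\mathbf{W}\mathbf{h}$ of equation (6) can be computed in $O(n \log n)$ with a few lines of NumPy. The sketch below is our own, with hypothetical parameter names, and works directly with complex arrays rather than the real-valued representation described in the next section; it is not the authors' released code.

```python
import numpy as np

def reflect(v, h):
    """Apply the complex reflection (I - 2 v v* / ||v||^2) to h."""
    return h - 2.0 * v * (np.vdot(v, h) / np.vdot(v, v))

def urnn_matvec(h, w1, w2, w3, v1, v2, perm):
    """Multiply a complex hidden state h by W = D3 R2 F^{-1} D2 Pi R1 F D1 (equation 6).

    w1, w2, w3 are real phase vectors parameterizing the diagonals D1, D2, D3;
    v1, v2 are complex reflection vectors; perm is a fixed index permutation.
    Every step costs O(n) except the FFTs, which cost O(n log n).
    """
    h = np.exp(1j * w1) * h              # D1
    h = np.fft.fft(h, norm="ortho")      # F (unitary scaling)
    h = reflect(v1, h)                   # R1
    h = h[perm]                          # Pi
    h = np.exp(1j * w2) * h              # D2
    h = np.fft.ifft(h, norm="ortho")     # F^{-1}
    h = reflect(v2, h)                   # R2
    return np.exp(1j * w3) * h           # D3

# Sanity check: the composition preserves the norm of the hidden state.
rng = np.random.default_rng(0)
n = 128
h = rng.standard_normal(n) + 1j * rng.standard_normal(n)
args = (rng.uniform(-np.pi, np.pi, n), rng.uniform(-np.pi, np.pi, n),
        rng.uniform(-np.pi, np.pi, n),
        rng.standard_normal(n) + 1j * rng.standard_normal(n),
        rng.standard_normal(n) + 1j * rng.standard_normal(n),
        rng.permutation(n))
print(np.linalg.norm(urnn_matvec(h, *args)) / np.linalg.norm(h))  # ~1.0
```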

4. Architecture details

In this section, we describe the nonlinearity we used, how we incorporate real valued inputs with complex valued hidden units, and how we map from complex hidden states to real outputs.

4.1. Complex hidden units

Our implementation represents all complex numbers using real values, in terms of their real and imaginary parts. Under this framework, we sidestep the lack of support for complex numbers by most frameworks.

Consider multiplying the complex weight matrix $\mathbf{W} = \mathbf{A} + i\mathbf{B}$ by the complex hidden vector $\mathbf{h} = \mathbf{x} + i\mathbf{y}$, where $\mathbf{A}, \mathbf{B}, \mathbf{x}, \mathbf{y}$ are real. It is trivially true that $\mathbf{W}\mathbf{h} = (\mathbf{A}\mathbf{x} - \mathbf{B}\mathbf{y}) + i(\mathbf{A}\mathbf{y} + \mathbf{B}\mathbf{x})$. When we represent $\mathbf{v} \in \mathbb{C}^n$ as $\left(\mathrm{Re}(\mathbf{v})^\top, \mathrm{Im}(\mathbf{v})^\top\right)^\top \in \mathbb{R}^{2n}$, we compute complex matrix vector products with real numbers as follows:

$$\begin{bmatrix} \mathrm{Re}(\mathbf{W}\mathbf{h}) \\ \mathrm{Im}(\mathbf{W}\mathbf{h}) \end{bmatrix} = \begin{bmatrix} \mathbf{A} & -\mathbf{B} \\ \mathbf{B} & \mathbf{A} \end{bmatrix} \begin{bmatrix} \mathrm{Re}(\mathbf{h}) \\ \mathrm{Im}(\mathbf{h}) \end{bmatrix}. \quad (7)$$
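For readers implementing this, equation (7) amounts to the following few lines (our own illustration; the function and variable names are ours, not from the paper's released code):

```python
import numpy as np

def complex_matvec_real(A, B, re_h, im_h):
    """Compute W h for W = A + iB and h = re_h + i*im_h using only real
    arithmetic, returning the real and imaginary parts (equation 7)."""
    return A @ re_h - B @ im_h, B @ re_h + A @ im_h

# Check against NumPy's native complex arithmetic.
rng = np.random.default_rng(0)
n = 8
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
x, y = rng.standard_normal(n), rng.standard_normal(n)
re, im = complex_matvec_real(A, B, x, y)
assert np.allclose(re + 1j * im, (A + 1j * B) @ (x + 1j * y))
```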

More generally, let $f : \mathbb{C}^n \to \mathbb{C}^n$ be any complex function and $\mathbf{z} = \mathbf{x} + i\mathbf{y}$ any complex vector. We may write $f(\mathbf{z}) = \alpha(\mathbf{x}, \mathbf{y}) + i\beta(\mathbf{x}, \mathbf{y})$, where $\alpha, \beta : \mathbb{R}^n \to \mathbb{R}^n$. This allows us to implement everything using real valued operations, compatible with any deep learning framework with automatic differentiation such as Theano.

4.2. Input to Hidden, Nonlinearity, Hidden to Output

As is the case with most recurrent networks, our uRNN follows the same hidden to hidden mapping as equation (1) with $\mathbf{V}_t = \mathbf{V}$ and $\mathbf{W}_t = \mathbf{W}$. Denote the size of the complex valued hidden states by $n_h$. The input to hidden matrix is complex valued, $\mathbf{V} \in \mathbb{C}^{n_h \times n_{in}}$. We learn the initial hidden state $\mathbf{h}_0 \in \mathbb{C}^{n_h}$ as a parameter of the model.

Choosing an appropriate nonlinearity is not trivial in the complex domain. As discussed in the introduction, a ReLU is a natural choice in combination with a norm preserving weight matrix. We first experimented with placing separate ReLU activations on the real and imaginary parts of the hidden states. However, we found that such a nonlinearity usually performed poorly. Our intuition is that applying separate ReLU nonlinearities to the real and imaginary parts brutally impacts the phase of a complex number, making it difficult to learn structure.

We speculate that maintaining the phase of hidden states may be important for storing information across a large number of time steps, and our experiments supported this claim. A variation of the ReLU that we name modReLU is what we finally chose. It is a pointwise nonlinearity, $\sigma_{\mathrm{modReLU}}(z) : \mathbb{C} \to \mathbb{C}$, which affects only the absolute value of a complex number, defined as

$$\sigma_{\mathrm{modReLU}}(z) = \begin{cases} (|z| + b)\,\frac{z}{|z|} & \text{if } |z| + b \ge 0, \\ 0 & \text{if } |z| + b < 0, \end{cases} \quad (8)$$

where $b \in \mathbb{R}$ is a bias parameter of the nonlinearity. For an $n_h$ dimensional hidden space we learn $n_h$ nonlinearity bias parameters, one per dimension. Note that the modReLU is similar to the ReLU in spirit; in fact, more concretely, $\sigma_{\mathrm{modReLU}}(z) = \sigma_{\mathrm{ReLU}}(|z| + b)\,\frac{z}{|z|}$.

To map hidden states to output, we define a matrix $\mathbf{U} \in \mathbb{R}^{n_o \times 2n_h}$, where $n_o$ is the output dimension. We calculate a linear output as

$$\mathbf{o}_t = \mathbf{U} \begin{bmatrix} \mathrm{Re}(\mathbf{h}_t) \\ \mathrm{Im}(\mathbf{h}_t) \end{bmatrix} + \mathbf{b}_o, \quad (9)$$

where $\mathbf{b}_o \in \mathbb{R}^{n_o}$ is the output bias. The linear output is real valued ($\mathbf{o}_t \in \mathbb{R}^{n_o}$) and can be used for prediction and loss calculation akin to typical neural networks (e.g. it may be passed through a softmax which is used for cross entropy calculation for classification tasks).
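A minimal NumPy sketch of the modReLU of equation (8) follows; it is our own illustration rather than the released implementation, and the small `eps` guard against division by zero is our addition.

```python
import numpy as np

def modrelu(z, b, eps=1e-8):
    """modReLU of equation (8): pass |z| + b through a ReLU while keeping the
    phase z / |z|; b is a per-dimension learned bias vector."""
    mag = np.abs(z)
    scaled = np.maximum(mag + b, 0.0)   # ReLU applied to |z| + b
    return scaled * z / (mag + eps)     # preserve the phase of z

# Example: entries with |z| + b < 0 are zeroed, the others keep their phase.
z = np.array([1.0 + 1.0j, 0.1 - 0.2j, -2.0 + 0.5j])
print(modrelu(z, b=np.array([-1.0, -1.0, -1.0])))
```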

4.3. Initialization

Due to the stability of the norm preserving operations of our network, we found that performance was not very sensitive to the initialization of parameters. For full disclosure and reproducibility, we explain our initialization strategy for each parameter below.

• We initialize $\mathbf{V}$ and $\mathbf{U}$ (the input and output matrices) as in (Glorot & Bengio, 2010), with weights sampled independently from uniforms, $U\!\left[-\frac{\sqrt{6}}{\sqrt{n_{in}+n_{out}}}, \frac{\sqrt{6}}{\sqrt{n_{in}+n_{out}}}\right]$.

• The biases, $\mathbf{b}$ and $\mathbf{b}_o$, are initialized to 0. This implies that at initialization, the network is linear with unitary weights, which seems to help early optimization (Saxe et al., 2014).

• The reflection vectors for $\mathbf{R}_1$ and $\mathbf{R}_2$ are initialized coordinate-wise from a uniform $U[-1, 1]$. Note that the reflection matrices are invariant to scalar multiplication of the parameter vector, hence the width of the uniform initialization is unimportant.

• The diagonal weights for $\mathbf{D}_1$, $\mathbf{D}_2$ and $\mathbf{D}_3$ are sampled from a uniform, $U[-\pi, \pi]$. This ensures that the diagonal entries $\mathbf{D}_{j,j}$ are sampled uniformly over the complex unit circle.

• We initialize $\mathbf{h}_0$ with a uniform, $U\!\left[-\sqrt{\tfrac{3}{2n_h}}, \sqrt{\tfrac{3}{2n_h}}\right]$, which results in $\mathbb{E}\!\left[\|\mathbf{h}_0\|^2\right] = 1$. Since the norm of the hidden units is roughly preserved through unitary evolution and inputs are typically whitened to have norm 1, we have hidden states, inputs and linear outputs of the same order of magnitude, which seems to help optimization.
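Collecting the bullets above, one possible initialization routine is sketched below. The parameter names and the real/imaginary packing conventions are our own assumptions, not the authors' code; only the sampling distributions follow the list above.

```python
import numpy as np

def init_urnn_params(n_in, n_h, n_out, seed=0):
    """One possible uRNN initialization following Section 4.3 (a sketch).

    Complex quantities are stored as complex arrays here for brevity; an
    actual implementation would keep separate real/imaginary parts (Section 4.1).
    """
    rng = np.random.default_rng(seed)

    def glorot(rows, cols):
        bound = np.sqrt(6.0 / (rows + cols))
        return rng.uniform(-bound, bound, (rows, cols))

    b0 = np.sqrt(3.0 / (2 * n_h))
    return {
        "V": glorot(2 * n_h, n_in),               # input matrix (real/imag rows stacked)
        "U": glorot(n_out, 2 * n_h),              # output matrix
        "b": np.zeros(n_h),                       # modReLU biases
        "b_o": np.zeros(n_out),                   # output bias
        "w1": rng.uniform(-np.pi, np.pi, n_h),    # phases of D1
        "w2": rng.uniform(-np.pi, np.pi, n_h),    # phases of D2
        "w3": rng.uniform(-np.pi, np.pi, n_h),    # phases of D3
        "v1": rng.uniform(-1, 1, n_h) + 1j * rng.uniform(-1, 1, n_h),  # reflection R1
        "v2": rng.uniform(-1, 1, n_h) + 1j * rng.uniform(-1, 1, n_h),  # reflection R2
        "perm": rng.permutation(n_h),             # fixed permutation Pi
        "h0": rng.uniform(-b0, b0, n_h) + 1j * rng.uniform(-b0, b0, n_h),  # E[||h0||^2] = 1
    }
```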

5. Experiments

In this section we explore the performance of our uRNN in relation to (a) an RNN with tanh activations, (b) the IRNN (Le et al., 2015), that is, an RNN with ReLU activations and with the recurrent weight matrix initialized to the identity, and (c) the LSTM (Hochreiter & Schmidhuber, 1997). We show that the uRNN shines quantitatively when it comes to modeling long term dependencies and exhibits qualitatively different learning properties to the other models.

We chose a handful of tasks to evaluate the performance of the various models. The tasks were especially created to be pathologically hard, and have been used as benchmarks for testing the ability of a model to capture long-term memory (Hochreiter & Schmidhuber, 1997; Le et al., 2015; Graves et al., 2014; Martens & Sutskever, 2011).

Of the handful of optimization algorithms we tried on the various models, RMSProp (Tieleman & Hinton, 2012) led to the fastest convergence and is what we stuck to for all experiments from here on. However, we found the IRNN to be particularly unstable; it only ran without blowing up with incredibly low learning rates and gradient clipping. Since its performance was so poor relative to the other models we compare against, we do not show IRNN curves in the figures. In each experiment we use a learning rate of $10^{-3}$ and a decay rate of 0.9. For the LSTM and RNN models, we had to clip gradients at 1 to avoid exploding gradients. Gradient clipping was unnecessary for the uRNN.

Figure 1. Results of the copying memory problem for time lags of 100, 200, 300, 500. The LSTM is able to beat the baseline only for 100 time steps. Conversely, the uRNN is able to completely solve each time length in very few training iterations, without getting stuck at the baseline.

5.1. Copying memory problem

Recurrent networks have been known to have trouble remembering information about inputs seen many time steps previously (Bengio et al., 1994; Pascanu et al., 2010). We therefore want to test the uRNN's ability to recall exactly data seen a long time ago.

Following a similar setup to (Hochreiter & Schmidhuber, 1997), we outline the copy memory task. Consider 10 categories, $\{a_i\}_{i=0}^{9}$. The input takes the form of a $T+20$ length vector of categories, where we test over a range of values of $T$. The first 10 entries are sampled uniformly, independently and with replacement from $\{a_i\}_{i=0}^{7}$, and represent the sequence which will need to be remembered. The next $T-1$ entries are set to $a_8$, which can be thought of as the 'blank' category. The next single entry is $a_9$, which represents a delimiter, indicating to the algorithm that it is now required to reproduce the initial 10 categories in the output. The remaining 10 entries are set to $a_8$. The required output sequence consists of $T+10$ repeated entries of $a_8$, followed by the first 10 categories of the input sequence in exactly the same order. The goal is to minimize the average cross entropy of category predictions at each time step of the sequence. The task amounts to having to remember a categorical sequence of length 10 for $T$ time steps.
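For concreteness, a data generator for the copying task as described above might look as follows. This is our own sketch, not the paper's code; the categories are encoded as integers 0-9, with 8 the blank and 9 the delimiter.

```python
import numpy as np

def copying_task_batch(T, batch_size, rng=None):
    """Generate inputs and targets for the copying memory task (Section 5.1).

    Inputs have length T + 20: 10 random categories from {0,...,7}, then T - 1
    blanks (category 8), one delimiter (category 9), and 10 more blanks.
    Targets are T + 10 blanks followed by the initial 10 categories.
    """
    if rng is None:
        rng = np.random.default_rng()
    blank, delim = 8, 9
    seq = rng.integers(0, 8, size=(batch_size, 10))   # data to remember
    x = np.full((batch_size, T + 20), blank, dtype=np.int64)
    y = np.full((batch_size, T + 20), blank, dtype=np.int64)
    x[:, :10] = seq
    x[:, T + 9] = delim       # delimiter right before the recall window
    y[:, T + 10:] = seq       # the model must reproduce the sequence at the end
    return x, y
```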

Figure 2. Results of the adding problem for T = 100, 200, 400, 750. The RNN with tanh is not able to beat the baseline for any time length. The LSTM and the uRNN show similar performance across time lengths, consistently beating the baseline.

A simple baseline can be established by considering an optimal strategy when no memory is available, which we deem the memoryless strategy. The memoryless strategy would be to predict $a_8$ for $T+10$ entries and then predict each of the final 10 categories from the set $\{a_i\}_{i=0}^{7}$ independently and uniformly at random. The categorical cross entropy of this strategy is $\frac{10 \log(8)}{T+20}$.

We ran experiments where the RNN with tanh activations, IRNN, LSTM and uRNN had hidden layers of size 80, 80, 40 and 128 respectively. This equates to roughly 6500 parameters per model. In Figure 1, we see that aside from the simplest case, both the RNN with tanh and, more surprisingly, the LSTM get almost exactly the same cost as the memoryless strategy. This behaviour is consistent with the results of (Graves et al., 2014), in which poor performance is reported for the LSTM on a very similar long term memory problem.

The uRNN consistently achieves perfect performance in relatively few iterations, even when having to recall sequences after 500 time steps. What is remarkable is that the uRNN does not get stuck at the baseline at all, whilst the LSTM and RNN do. This behaviour suggests that the representations learned by the uRNN have qualitatively different properties from both the LSTM and classical RNNs.

5.2. Adding Problem

We closely follow the adding problem defined in (Hochreiter & Schmidhuber, 1997) to explain the task at hand. Each input consists of two sequences of length $T$. The first sequence, which we denote $x$, consists of numbers sampled uniformly at random from $U[0, 1]$. The second sequence is an indicator sequence consisting of exactly two entries of 1 and remaining entries 0. The first 1 entry is located uniformly at random in the first half of the sequence, whilst the second 1 entry is located uniformly at random in the second half. The output is the sum of the two entries of the first sequence corresponding to where the 1 entries are located in the second sequence. A naive strategy of predicting 1 as the output regardless of the input sequence gives an expected mean squared error of 0.167, the variance of the sum of two independent uniform distributions. This is our baseline to beat.
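A matching sketch of an adding-problem generator (again our own illustration, not the paper's code):

```python
import numpy as np

def adding_task_batch(T, batch_size, rng=None):
    """Generate inputs and targets for the adding problem (Section 5.2).

    Each input pairs a length-T sequence of uniform values in [0, 1] with a
    binary indicator marking one position in each half. The target is the sum
    of the two marked values; always predicting 1 gives an MSE of about 0.167.
    """
    if rng is None:
        rng = np.random.default_rng()
    values = rng.uniform(0.0, 1.0, size=(batch_size, T))
    indicators = np.zeros((batch_size, T))
    first = rng.integers(0, T // 2, size=batch_size)    # marker in the first half
    second = rng.integers(T // 2, T, size=batch_size)   # marker in the second half
    rows = np.arange(batch_size)
    indicators[rows, first] = 1.0
    indicators[rows, second] = 1.0
    x = np.stack([values, indicators], axis=-1)         # shape (batch, T, 2)
    y = values[rows, first] + values[rows, second]      # shape (batch,)
    return x, y
```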

Figure 3. Results on pixel by pixel MNIST classification tasks. The uRNN is able to converge in a fraction of the iterations that the LSTM requires. The LSTM performs better on MNIST classification, but the uRNN outperforms on the more complicated task of permuted pixels.

We chose to use 128 hidden units for the RNN with tanh, IRNN and LSTM, and 512 for the uRNN. This equates to roughly 16K parameters for the RNN with tanh and IRNN, 60K for the LSTM and almost 9K for the uRNN. All models were trained using batch sizes of 20 and 50, with the best results being reported. Our results are shown in Figure 2. The LSTM and uRNN models are able to convincingly beat the baseline up to $T = 400$ time steps. Both models do well when $T = 750$, but the mean squared error does not reach close to 0. The uRNN achieves lower test error, but its curve is noisier. Despite the LSTM having vastly more parameters, we monitored its performance to ensure there was no overfitting.

The RNN with tanh and IRNN were not able to beat the baseline for any number of time steps. (Le et al., 2015) report that their RNN solves the problem for $T = 150$ and the IRNN for $T = 300$, but they require over a million iterations before they start learning. Neither of the two models came close to either the uRNN or the LSTM in performance. The stark difference in our findings is best explained by our use of RMSProp with significantly higher learning rates ($10^{-3}$ as opposed to $10^{-8}$) than (Le et al., 2015) use for SGD with momentum.

5.3. Pixel-by-pixel MNIST

In this task, suggested by (Le et al., 2015), algorithms are fed the pixels of MNIST (LeCun et al., 1998) sequentially and required to output a class label at the end. We consider two tasks: one where the pixels are read in order (from left to right, bottom to top) and one where the pixels are all randomly permuted using the same randomly generated permutation matrix. The same model architectures as for the adding problem were used for this task, except we now use a softmax for category classification. We ran the optimization algorithms until convergence of the mean categorical cross entropy on test data, and plot test accuracy in Figure 3.

Both the uRNN and LSTM perform applaudably well here. On the correct unpermuted MNIST pixels, the LSTM performs better, achieving 98.2% test accuracy versus 95.1% for the uRNN. However, when we permute the ordering of the pixels, the uRNN dominates with 91.4% accuracy in contrast to the 88% of the LSTM, despite having less than a quarter of the parameters. This result is state of the art on this task, beating the IRNN (Le et al., 2015), which reaches close to 82% after 1 million training iterations. Notice that the uRNN reaches convergence in less than 20 thousand iterations, while it takes the LSTM 5 to 10 times as many to finish learning.

Permuting the pixels of MNIST images creates many longer term dependencies across pixels than in the original pixel ordering, where a lot of structure is local. This makes it necessary for a network to learn and remember more complicated dependencies across varying time scales. The results suggest that the uRNN is better able to deal with such structure in the data, whereas the LSTM is better suited to tasks with more local sequence structure.
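To make the setup concrete, the short sketch below (ours; a random array stands in for actual MNIST images) shows how each image becomes a length-784 pixel sequence and how a single fixed permutation is shared across all examples for the permuted task.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((4, 28, 28))             # stand-in for a batch of MNIST images

# Flatten each image into a length-784 sequence, fed one pixel per time step.
sequences = images.reshape(len(images), -1)  # shape (batch, 784)

# For the permuted task, one fixed permutation is drawn once and applied to
# every example, turning local pixel structure into long-range dependencies.
perm = rng.permutation(28 * 28)
permuted_sequences = sequences[:, perm]
```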

5.4. Exploratory experiments

Figure 4. From left to right: norms of the gradients with respect to the hidden states, i.e. $\|\partial C / \partial \mathbf{h}_t\|$, at (i) the beginning of training and (ii) after 100 iterations; (iii) norms of the hidden states and (iv) $L_2$ distance between hidden states and the final hidden state. The gradient norms of uRNNs do not decay as fast as for other models as training progresses. uRNN hidden state norms stay much more consistent over time than the LSTM's. LSTM hidden states stay almost the same after a number of time steps, suggesting that it is not able to use new input information.

Norms of hidden state gradients. As discussed in Section 2, key to being able to learn long term dependencies is controlling $\partial C / \partial \mathbf{h}_t$. With this in mind, we explored how each model propagated gradients by examining $\|\partial C / \partial \mathbf{h}_t\|$ as a function of $t$. Gradient norms were computed at the beginning of training and again after 100 iterations of training on the adding problem. The curves are plotted in Figure 4. It is clear that at first the uRNN propagates gradients perfectly, while every other model has exponentially vanishing gradients. After 100 iterations of training, each model experiences vanishing gradients, but the uRNN is best able to propagate information, having much less decay.

Hidden state saturation. We claim that typical recurrent architectures saturate, in the sense that after they acquire some information, it becomes much more difficult to acquire further information pertaining to longer dependencies. We took the uRNN and LSTM models trained on the adding problem with $T = 200$, and computed a forward pass with newly generated data for the adding problem with $T = 1000$. In order to show saturation effects, we plot the norms of the hidden states and the $L_2$ distance between each state and the last in Figure 4.

In our experiments, it is clear that the uRNN does not suffer as much as the other models do. Notice that whilst the norms of hidden states in the uRNN grow very steadily over time, in the LSTM they grow very fast and then stay constant after about 500 time steps. This behaviour may suggest that the LSTM hidden states saturate in their ability to incorporate new information, which is vital for modeling long complicated sequences. It is interesting to see that the LSTM hidden state at $t = 500$ is close to that at $t = 1000$, whilst this is far from the case in the uRNN.
Again, this suggests that the LSTM's capacity to use new information to alter its hidden state severely degrades with sequence length. The uRNN does not suffer from this difficulty nearly as badly.

A clear example of this phenomenon was observed in the adding problem with $T = 750$. We found that the Pearson correlation between the LSTM output prediction and the first of the two uniform samples (whose sum is the target output) was $\rho = 0.991$. This suggests that the LSTM learnt to simply find and store the first sample, as it was unable to incorporate any more information by the time it reached the second, due to saturation of the hidden states.

6. Discussion

There is a plethora of further ideas that may be explored from our findings, both with regards to learning representations and efficient implementation. For example, one hurdle of modeling long sequences with recurrent networks is the requirement of storing all hidden state values for the purpose of gradient backpropagation. This can be prohibitive, since GPU memory is typically a limiting factor of neural network optimization. However, since our weight matrix is unitary, its inverse is its conjugate transpose, which is just as easy to operate with. If, further, we were to use an invertible nonlinearity function, we would no longer need to store hidden states, since they could be recomputed in the backward pass. This could have potentially huge implications, as we would be able to reduce memory usage by a factor of $T$, the number of time steps. This would make having immensely large hidden layers possible, perhaps enabling vast memory representations.

In this paper we demonstrate state of the art performance on hard problems requiring long term reasoning and memory. These results are based on a novel parameterization of unitary matrices which permits efficient matrix computations and parameter optimization. Whilst complex domain modeling has been widely successful in the signal processing community (e.g. Fourier transforms, wavelets), we have yet to exploit the power of complex valued representation in the deep learning community. Our hope is that this work will be a step forward in this direction. We motivate the idea of unitary evolution as a novel way to mitigate the problems of vanishing and exploding gradients. Empirical evidence suggests that our uRNN is better able to pass gradient information through long sequences and does not suffer from saturating hidden states as much as LSTMs, typical RNNs, or RNNs initialized with the identity weight matrix (IRNNs).

Acknowledgments: We thank the developers of Theano (Bergstra et al., 2010) for their great work. We thank NSERC, Compute Canada, Canada Research Chairs and CIFAR for their support. We would also like to thank Çağlar Gülçehre, David Krueger, Soroush Mehri, Marcin Moczulski, Mohammad Pezeshki and Saizheng Zhang for helpful discussions, comments and code sharing.

References

Bengio, Yoshua, Simard, Patrice, and Frasconi, Paolo. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5, 1994.

Bergstra, James, Breuleux, Olivier, Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Desjardins, Guillaume, Turian, Joseph, Warde-Farley, David, and Bengio, Yoshua. Theano: a CPU and GPU math expression compiler. Proceedings of the Python for Scientific Computing Conference (SciPy), 2010.

Cho, Kyunghyun, van Merriënboer, Bart, Bahdanau, Dzmitry, and Bengio, Yoshua. On the properties of neural machine translation: Encoder–Decoder approaches. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, October 2014.

Collobert, Ronan, Weston, Jason, Bottou, Léon, Karlen, Michael, Kavukcuoglu, Koray, and Kuksa, Pavel. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537, 2011.

Glorot, Xavier and Bengio, Yoshua. Understanding the difficulty of training deep feedforward neural networks. International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

Glorot, Xavier, Bordes, Antoine, and Bengio, Yoshua. Deep sparse rectifier neural networks. International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.

Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara, and Kingsbury, Brian. Deep neural networks for acoustic modeling in speech recognition. Signal Processing Magazine, 2012.

Hirose, Akira. Complex-valued neural networks: theories and applications, volume 5. World Scientific Publishing Company Incorporated, 2003.

Hochreiter, S. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, T.U. München, 1991.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Hoffman, Kenneth and Kunze, Ray. Linear Algebra. Pearson, second edition, 1971.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. Neural Information Processing Systems, 2012.

Le, Quoc, Sarlós, Tamás, and Smola, Alex. Fastfood – approximating kernel expansions in loglinear time. International Conference on Machine Learning, 2010.

Le, Quoc V., Jaitly, Navdeep, and Hinton, Geoffrey E. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.

LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.

Martens, James and Sutskever, Ilya. Learning recurrent neural networks with hessian-free optimization. International Conference on Machine Learning, 2011.

Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted boltzmann machines. International Conference on Machine Learning, 2010.

Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. On the difficulty of training recurrent neural networks. International Conference on Machine Learning, 2010.

Saxe, Andrew M., McClelland, James L., and Ganguli, Surya. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. International Conference on Learning Representations, 2014.

Tieleman, Tijmen and Hinton, Geoffrey. Lecture 6.5 – rmsprop: Divide the gradient by a running average of its recent magnitude. Coursera: Neural Networks for Machine Learning, 2012.

Yang, Zichao, Moczulski, Marcin, Denil, Misha, de Freitas, Nando, Smola, Alex, Song, Le, and Wang, Ziyu. Deep fried convnets. International Conference on Computer Vision (ICCV), 2015.

Zimmermann, Hans-Georg, Minin, Alexey, and Kusherbaeva, Victoria. Comparison of the complex valued and real valued neural networks trained with gradient descent and random search algorithms. In ESANN, 2011.