Orthogonal Recurrent Neural Networks with Scaled Cayley Transform

Orthogonal Recurrent Neural Networks with Scaled Cayley Transform Kyle E. Helfrich * 1 Devin Willmott * 1 Qiang Ye 1 Abstract oversteps the local minimum. This issue significantly di- minishes RNNs’ ability to learn time-based dependencies, Recurrent Neural Networks (RNNs) are designed particularly in problems with long input sequences. to handle sequential data but suffer from vanishing or exploding gradients. Recent work on Uni- A variety of architectures have been introduced to overcome tary Recurrent Neural Networks (uRNNs) have this difficulty. The current preferred RNN architectures are been used to address this issue and in some cases, those that introduce gating mechanisms to control when in- exceed the capabilities of Long Short-Term Mem- formation is retained or discarded, such as LSTMs (Hochre- ory networks (LSTMs). We propose a simpler iter & Schmidhuber, 1997) and GRUs (Cho et al., 2014), and novel update scheme to maintain orthogonal at the cost of additional trainable parameters. More re- recurrent weight matrices without using complex cently, the unitary evolution RNN (uRNN) (Arjovsky et al., valued matrices. This is done by parametrizing 2016) uses a parametrization that forces the recurrent weight with a skew-symmetric matrix using the Cayley matrix to remain unitary throughout training, and exhibits transform; such a parametrization is unable to rep- superior performance to LSTMs on a variety of testing prob- resent matrices with negative one eigenvalues, but lems. For clarity, we follow the convention of Wisdom et al. this limitation is overcome by scaling the recur- (2016) and refer to this network as the restricted-capacity rent weight matrix by a diagonal matrix consisting uRNN. of ones and negative ones. The proposed training Since the introduction of uRNNs, orthogonal and unitary scheme involves a straightforward gradient calcu- RNN schemes have increased in both popularity and com- lation and update step. In several experiments, the plexity. Wisdom et al.(2016) use a multiplicative update proposed scaled Cayley orthogonal recurrent neu- method detailed in Tagare(2011) and Wen & Yin(2013) ral network (scoRNN) achieves superior results to expand uRNNs’ capacity to include all unitary matrices. with fewer trainable parameters than other unitary These networks are referred to as full-capacity uRNNs. Jing RNNs. et al.(2016) and Mhammedi et al.(2017) parametrize the space of unitary/orthogonal matrices with Givens rotations and Householder reflections, respectively, but typically opti- 1. Introduction mize over a subset of this space by restricting the number of parameters. Another complex parametrization has also been Deep neural networks have been used to solve numerical explored in Hyland & Gunnar(2017). There is also work problems of varying complexity. RNNs have parameters using unitary matrices in GRUs (i.e. in GORU of Jing et al. that are reused at each time step of a sequential data point (2017)) or near-unitary matrices in RNNs by restricting the and have achieved state of the art performance on many se- singular values of the recurrent matrix to an interval around quential learning tasks. Nearly all optimization algorithms 1 (see Vorontsov et al.(2017)). For other work in addressing for neural networks involve some variant of gradient descent. the vanishing and exploding gradient problem, see Henaff One major obstacle to training RNNs with gradient descent arXiv:1707.09520v3 [stat.ML] 19 Jun 2018 et al.(2017) and Le et al.(2015). is due to vanishing or exploding gradients, as described in Bengio et al.(1993) and Pascanu et al.(2013). This prob- In this paper, we consider RNNs with a recurrent weight lem refers to the tendency of gradients to grow or decay matrix taken from the set of all orthogonal matrices. To exponentially in size, resulting in gradient descent steps that construct the orthogonal weight matrix, we parametrize are too small to be effective or so large that the network it with a skew-symmetric matrix through a scaled Cayley transform. This scaling allows us to avoid the singularity * 1 Equal contribution Department of Mathematics, Univer- issue occurring for −1 eigenvalues that may arise in the sity of Kentucky, Lexington, Kentucky, USA. Correspondence standard Cayley transform. By tuning this scaling matrix, to: Kyle Helfrich <[email protected]>, Devin Willmott <[email protected]>. the network can reach an appropriate orthogonal matrix using a relatively simple gradient descent update step. The Proceedings of the 35 th International Conference on Machine resulting method achieves superior performance on various Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 sequential data tasks. by the author(s). Orthogonal Recurrent Neural Networks with Scaled Cayley Transform The method we present in this paper works entirely with real Wisdom et al.(2016) note that this representation has only matrices, and as such, our results deal only with orthogonal 7n parameters, which is insufficient to represent all unitary and skew-symmetric matrices. However, the method and all matrices for n > 7. In response, they present the full- related theory remain valid for unitary and skew-Hermitian capacity uRNN, which uses a multiplicative update step that matrices in the complex case. The experimental results in is able to reach all unitary matrices of order n. this paper indicate that state of the art performance can be The full-capacity uRNN aims to construct a unitary ma- achieved without using complex matrices to optimize along trix W (k+1) from W (k) by moving along a curve on the the Stiefel manifold. Stiefel manifold fW 2 Cn×n j W ∗W = Ig. For the network optimization, it is necessary to use a curve that is 2. Background in a descent direction of the cost function L := L(W ). In Tagare(2011), Wen & Yin(2013), Wisdom et al.(2016), and 2.1. Recurrent Neural Networks Vorontsov et al.(2017), a descent direction is constructed A recurrent neural network (RNN) is a function with input as B(k)W (k), which is a representation of the derivative (k) parameters U 2 Rn×m, recurrent parameters W 2 Rn×n, operator DL(W ) in the tangent space of the Stiefel man- (k) (k) (k) recurrent bias b 2 Rn, output parameters V 2 Rp×n, and ifold at W . Then, with B W defining the direction output bias c 2 Rp where m is the data input size, n is the of a descent curve, an update along the Stiefel manifold is number of hidden units, and p is the output data size. From obtained using the Cayley transform as m an input sequence x = (x1; x2; :::; xT ) where xi 2 R , the λ −1 λ RNN returns an output sequence y = (y1; y2; :::; yT ) where W (k+1) = I + B(k) I − B(k) W (k) (1) p each yi 2 R is given recursively by 2 2 ht = σ (Uxt + W ht−1 + b) where λ is the learning rate. yt = V ht + c 3. Scaled Cayley Orthogonal RNN n where h = (h0; : : : ; hT −1), hi 2 R is the hidden layer state at time i and σ(·) is the activation function, which is 3.1. Cayley Transform often a pointwise nonlinearity such as a hyperbolic tangent The Cayley transform gives a representation of orthogonal function or rectified linear unit (Nair & Hinton, 2010). matrices without −1 eigenvalues using skew-symmetric matrices (i.e., matrices where AT = −A): 2.2. Unitary RNNs W = (I + A)−1 (I − A) ;A = (I + W )−1 (I − W ) : A real matrix W is orthogonal if it satisfies W T W = I. The complex analog of orthogonal matrices are unitary matrices, This bijection parametrizes the set of orthogonal matri- ∗ which satisfy W W = I, where ∗ denotes the conjugate ces without −1 eigenvalues with skew-symmetric matrices. transpose. Orthogonal and unitary matrices have the desir- This direct and simple parametrization is attractive from a able property that kW xk2 = kxk2 for any vector x. This machine learning perspective because it is closed under addi- property motivates the use of orthogonal or unitary matrices tion: the sum or difference of two skew-symmetric matrices in RNNs to avoid vanishing and exploding gradients, as is also skew-symmetric, so we can use gradient descent detailed in Arjovsky et al.(2016). algorithms like RMSprop (Tieleman & Hinton, 2012) or Arjovsky et al.(2016) follow the framework of the previous Adam (Kingma & Ba, 2014) to train parameters. section for their restricted-capacity uRNN, but introduce a However, this parametrization cannot represent orthogonal parametrization of the recurrent matrix W using a product matrices with −1 eigenvalues, since in this case I +W is not of simpler matrices. This parameterization is given by a invertible. Theoretically, we can still represent matrices with product consisting of diagonal matrices with complex norm eigenvalues that are arbitrarily close to −1; however, it can 1, complex Householder reflection matrices, discrete Fourier require large entries of A. For example, a 2x2 orthogonal transform matrices, and a fixed permutation matrix with the matrix W with eigenvalues ≈ −0:99999 ± 0:00447i and its resulting product being unitary. The Efficient Unitary RNN parametrization A by the Cayley transform is given below, (EURNN) by Jing et al.(2016) and orthogonal RNN (oRNN) where α = 0:99999. by Mhammedi et al.(2017) parametrize in a similar manner p 2 with products of Givens rotation matrices and Householder W = p −α − 1 − α ;A ≈ 0 447:212 reflection matrices, respectively. This can also be seen in 1 − α2 −α −447:212 0 the parametrization through matrix exponentials in Hyland & Gunnar(2017), which does not appear to perform as well Gradient descent algorithms will learn this A matrix very as the restricted-capacity uRNN. slowly, if at all. This difficulty can be overcome through a Orthogonal Recurrent Neural Networks with Scaled Cayley Transform suitable diagonal scaling according to results from Kahan @L(W (A(k))) A(k+1) = A(k) − λ (2006).

Load more