Cheap Orthogonal Constraints in Neural Networks: A Simple Parametrization of the Orthogonal and Unitary Group
Mario Lezcano-Casado¹  David Martínez-Rubio²

¹Mathematical Institute, University of Oxford, Oxford, United Kingdom. ²Department of Computer Science, University of Oxford, Oxford, United Kingdom. Correspondence to: Mario Lezcano-Casado <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

arXiv:1901.08428v3 [cs.LG] 30 May 2019

Abstract

We introduce a novel approach to perform first-order optimization with orthogonal and unitary constraints. This approach is based on a parametrization stemming from Lie group theory through the exponential map. The parametrization transforms the constrained optimization problem into an unconstrained one over a Euclidean space, for which common first-order optimization methods can be used. The theoretical results presented are general enough to cover the special orthogonal group, the unitary group and, in general, any connected compact Lie group. We discuss how this and other parametrizations can be computed efficiently through an implementation trick, making numerically complex parametrizations usable at a negligible runtime cost in neural networks. In particular, we apply our results to RNNs with orthogonal recurrent weights, yielding a new architecture called EXPRNN. We demonstrate how our method constitutes a more robust approach to optimization with orthogonal constraints, showing faster, more accurate, and more stable convergence in several tasks designed to test RNNs.¹

¹Implementation can be found at https://github.com/Lezcano/expRNN

1. Introduction

Training deep neural networks presents many difficulties. One of the most important is the exploding and vanishing gradient problem, as first observed and studied in (Bengio et al., 1994). This problem arises from the ill-conditioning of the function defined by a neural network as the number of layers increases. This issue is particularly problematic in Recurrent Neural Networks (RNNs). In RNNs, the eigenvalues of the gradient of the recurrent kernel explode or vanish exponentially fast with the number of time-steps whenever the recurrent kernel does not have unitary eigenvalues (Arjovsky et al., 2016). This behavior is the same as the one encountered when computing the powers of a matrix, and it results in very slow convergence (vanishing gradient) or a lack of convergence (exploding gradient).

In the seminal paper (Arjovsky et al., 2016), the authors note that unitary matrices have properties that would solve the exploding and vanishing gradient problems. These matrices form a group called the unitary group, and they have been studied extensively in the fields of Lie group theory and Riemannian geometry. Optimization methods over the unitary and orthogonal groups have found rather fruitful applications in RNNs in recent years (cf. Section 2).
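To make the point about matrix powers concrete, the following toy PyTorch snippet (our own illustration, not code from the paper; the dimension, scaling, and number of steps are arbitrary choices) repeatedly applies two matrices to a vector: a generic matrix whose spectral radius is not one, and an orthogonal matrix.

```python
# Toy illustration: powers of a generic matrix shrink (or blow up) a vector,
# while powers of an orthogonal matrix preserve its norm.
import torch

torch.manual_seed(0)
n, steps = 64, 200
v = torch.randn(n)

W = 0.1 * torch.randn(n, n)                # spectral radius below 1 here, so norms vanish;
                                           # a radius above 1 would make them explode instead
Q, _ = torch.linalg.qr(torch.randn(n, n))  # a random orthogonal matrix

v_w, v_q = v.clone(), v.clone()
for _ in range(steps):
    v_w = W @ v_w
    v_q = Q @ v_q

print(v.norm())    # ~8
print(v_w.norm())  # essentially 0: the signal has vanished
print(v_q.norm())  # still ~8: orthogonal maps are isometries
```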
In parallel to the work on unitary RNNs, there has been an increasing interest in optimization over the orthogonal group and the Stiefel manifold in neural networks (Harandi & Fernando, 2016; Ozay & Okatani, 2016; Huang et al., 2017; Bansal et al., 2018). As shown in these papers, orthogonal constraints in linear and CNN layers can be rather beneficial for the generalization of the network, as they act as a form of implicit regularization. The main problem encountered while using these methods in practice was that optimization with orthogonality constraints was neither simple nor computationally cheap. We aim to close that gap.

In this paper we present a simple yet effective way to approach problems that present orthogonality or unitary constraints. We build on results from Riemannian geometry and Lie group theory to introduce a parametrization of these groups, together with theoretical guarantees for it.

This parametrization has several advantages, both theoretical and practical:

1. It can be used with general purpose optimizers.

2. The parametrization does not create additional minima or saddle points in the main parametrization region.

3. It is possible to use a structured initializer to take advantage of the structure of the eigenvalues of the orthogonal matrix.

4. Other approaches need to enforce hard orthogonality constraints; ours does not.

Most previous approaches fail to satisfy one or many of these points. The parametrizations in (Helfrich et al., 2018) and (Maduranga et al., 2018) comply with most of these points, but they suffer from degeneracies that ours solves (cf. Remark in Section 4.2). We compare our architecture with other methods to optimize over SO(n) and U(n) in the remarks in Sections 3 and 4.

High-level idea. The matrix exponential maps skew-symmetric matrices to orthogonal matrices, transforming an optimization problem with orthogonal constraints into an unconstrained one. We use Padé approximants and the scaling-and-squaring trick to compute machine-precision approximations of the matrix exponential and its gradient. We can implement the parametrization with negligible overhead by observing that its cost does not depend on the batch size.
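The sketch below shows this idea in PyTorch. It is a minimal illustration under our own naming and design choices, not the authors' expRNN code: the trainable parameter is an unconstrained square matrix, the skew-symmetric matrix is formed from it, and a differentiable matrix exponential (torch.linalg.matrix_exp) stands in for the machine-precision approximation described above.

```python
# Minimal sketch (assumptions: PyTorch; class and variable names are ours).
import torch
import torch.nn as nn

class OrthogonalLinear(nn.Module):
    """Linear layer whose weight is exp(a - a^T), hence always orthogonal."""

    def __init__(self, n):
        super().__init__()
        # Unconstrained Euclidean parameter; zero init makes the weight start at I.
        self.a = nn.Parameter(torch.zeros(n, n))

    def weight(self):
        A = self.a - self.a.transpose(-2, -1)   # skew-symmetric: A^T = -A
        return torch.linalg.matrix_exp(A)       # lies in SO(n)

    def forward(self, x):
        return x @ self.weight().transpose(-2, -1)

layer = OrthogonalLinear(8)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)   # any general purpose optimizer

x, y = torch.randn(4, 8), torch.randn(4, 8)
loss = ((layer(x) - y) ** 2).mean()
loss.backward()                                     # gradients flow through matrix_exp
opt.step()

Q = layer.weight()
print(torch.allclose(Q @ Q.T, torch.eye(8), atol=1e-5))  # True: orthogonality holds after the update
```

Since exp(A) is computed once per forward pass and reused for the whole batch, the extra cost is independent of the batch size.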
Structure of the Paper. In Section 3, we introduce the parametrization and present the theoretical results that support the efficiency of the exponential parametrization. In Section 4, we explain the implementation details of the layer. Finally, in Section 5, we present the numerical experiments confirming the numerical advantages of this parametrization.

2. Related Work

Riemannian gradient descent. There is a vast literature on optimization methods on Riemannian manifolds, and in particular on matrix manifolds, both in the deterministic and the stochastic setting. Most of the classical convergence results from the Euclidean setting have been adapted to the Riemannian one (Absil et al., 2009; Bonnabel, 2013; Boumal et al., 2016; Zhang et al., 2016; Sato et al., 2017). On the other hand, the problem of adapting popular optimization algorithms like RMSPROP (Tieleman & Hinton, 2012), ADAM (Kingma & Ba, 2014) or ADAGRAD (Duchi et al., 2011) to the Riemannian setting is a topic of current research (Kumar Roy et al., 2018; Becigneul & Ganea, 2019).

Optimization over the Orthogonal and Unitary groups. The first formal study of optimization methods on manifolds with orthogonal constraints (Stiefel manifolds) is found in the thesis (Smith, 1993). These ideas were later simplified in the seminal paper (Edelman et al., 1998), where they were generalized to Grassmannian manifolds and extended to give the formulation of the conjugate gradient algorithm and the Newton method for these manifolds. After that, optimization with orthogonal constraints has been a central topic of study in the optimization community. A rather in-depth literature review of existing methods for optimization with orthogonality constraints can be found in (Jiang & Dai, 2015). When it comes to the unitary case, the algorithms used in practice are similar to those used in the real case, cf. (Manton, 2002; Abrudan et al., 2008).

Unitary RNNs. The idea of parametrizing the matrix that defines an RNN by a unitary matrix was first proposed in (Arjovsky et al., 2016). Their parametrization centers on a matrix-based fast Fourier transform-like (FFT) approach. As pointed out in (Jing et al., 2017), this representation, although efficient in memory, does not span the whole space of unitary matrices, giving the model reduced expressiveness. This second paper solves the issue in the same way it is solved when computing the FFT: using log(n) iterated butterfly operations. A different approach to perform this optimization was presented in (Wisdom et al., 2016; Vorontsov et al., 2017). Although not mentioned explicitly in either of the papers, this second approach consists of a retraction-based Riemannian gradient descent via the Cayley transform. The paper (Hyland & Rätsch, 2017) proposes to use the exponential map in the complex case, but the authors do not perform an analysis of the algorithm or provide a way to approximate the map or its gradients. A third approach was presented in (Mhammedi et al., 2017) via the use of Householder reflections. Finally, in (Helfrich et al., 2018) and the follow-up (Maduranga et al., 2018), a parametrization of the orthogonal group via the Cayley transform is proposed. We will have a closer look at these methods and their properties in Sections 3 and 4.

3. Parametrization of Compact Lie Groups

For a much broader introduction to Riemannian geometry and Lie group theory, see Appendix A. We will restrict our attention to the special orthogonal and unitary case,² but the results in this section can be generalized to any connected compact matrix Lie group equipped with a bi-invariant metric. We prove the results for general connected compact matrix Lie groups in Appendix C.

²Note that we consider just matrices with determinant equal to one, since the full group of orthogonal matrices O(n) is not connected, and hence not amenable to gradient descent algorithms.

3.1. The Lie algebras of SO(n) and U(n)

We are interested in the study of parametrizations of the special orthogonal group

$$\mathrm{SO}(n) = \{ B \in \mathbb{R}^{n \times n} \mid B^\top B = I,\ \det(B) = 1 \}$$

and the unitary group

$$\mathrm{U}(n) = \{ B \in \mathbb{C}^{n \times n} \mid B^\ast B = I \}.$$

These two sets are compact and connected Lie groups. Furthermore, when seen as submanifolds of $\mathbb{R}^{n \times n}$ (resp. $\mathbb{C}^{n \times n}$) equipped with the metric induced from the ambient space,

$$\langle X, Y \rangle = \operatorname{tr}(X^\top Y) \quad (\text{resp. } \operatorname{tr}(X^\ast Y)),$$

they inherit a bi-invariant metric, meaning that the metric is invariant with respect to left and right multiplication by matrices of the group. This is clear given that the matrices of the two groups

3.3. From Riemannian to Euclidean optimization

In this section we describe some properties of the exponential parametrization which make it a sound choice for optimization with orthogonal constraints in neural networks.
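As a quick numerical check of the definitions in Section 3.1 (again a PyTorch sketch of our own, not code from the paper; names and tolerances are arbitrary), the exponential of a skew-symmetric matrix lands in SO(n) and the exponential of a skew-Hermitian matrix lands in U(n); this correspondence between the Lie algebras and the groups is what the parametrization exploits.

```python
# Sanity check: exp(skew-symmetric) is in SO(n), exp(skew-Hermitian) is in U(n).
import torch

torch.manual_seed(0)
n = 5

a = 0.3 * torch.randn(n, n)
A = a - a.T                                  # A^T = -A
Q = torch.linalg.matrix_exp(A)
print(torch.allclose(Q.T @ Q, torch.eye(n), atol=1e-5))   # True: Q^T Q = I
print(torch.linalg.det(Q))                                 # ~1.0, so Q is in SO(n)

b = 0.3 * torch.randn(n, n, dtype=torch.cfloat)
B = b - b.conj().T                           # B^* = -B
U = torch.linalg.matrix_exp(B)
print(torch.allclose(U.conj().T @ U, torch.eye(n, dtype=torch.cfloat), atol=1e-4))  # True: U^* U = I
```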