
Geometry of Neural Network Loss Surfaces via Random Matrix Theory

Jeffrey Pennington¹   Yasaman Bahri¹

¹Google Brain. Correspondence to: Jeffrey Pennington.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract

Understanding the geometry of neural network loss surfaces is important for the development of improved optimization algorithms and for building a theoretical understanding of why deep learning works. In this paper, we study the geometry in terms of the distribution of eigenvalues of the Hessian matrix at critical points of varying energy. We introduce an analytical framework and a set of tools from random matrix theory that allow us to compute an approximation of this distribution under a set of simplifying assumptions. The shape of the spectrum depends strongly on the energy and another key parameter, φ, which measures the ratio of parameters to data points. Our analysis predicts and numerical simulations support that for critical points of small index, the number of negative eigenvalues scales like the 3/2 power of the energy. We leave as an open problem an explanation for our observation that, in the context of a certain memorization task, the energy of minimizers is well-approximated by the function ½(1 − φ)².

1. Introduction

Neural networks have witnessed a resurgence in recent years, with a smorgasbord of architectures and configurations designed to accomplish ever more impressive tasks. Yet for all the successes won with these methods, we have managed only a rudimentary understanding of why and in what contexts they work well. One difficulty in extending our understanding stems from the fact that neural network objectives are generically non-convex functions in high-dimensional parameter spaces, and understanding their loss surfaces is a challenging task. Nevertheless, an improved understanding of the loss surface could have a large impact on optimization (Saxe et al.; Dauphin et al., 2014; Choromanska et al., 2015; Neyshabur et al., 2015), architecture design, and generalization (Keskar et al.).

1.1. Related work

There is no shortage of prior work focused on the loss surfaces of neural networks. Choromanska et al. (2015) and Dauphin et al. (2014) highlighted the prevalence of saddle points as dominant critical points that plague optimization, as well as the existence of many local minima at low loss values. Dauphin et al. (2014) studied the distribution of critical points as a function of the loss value empirically and found a trend which is qualitatively similar to predictions for random Gaussian landscapes (Bray & Dean, 2007). Choromanska et al. (2015) argued that the loss function is well-approximated by a spin-glass model studied in statistical physics, thereby predicting the existence of local minima at low loss values and saddle points at high loss values as the network increases in size. Goodfellow et al. observed that loss surfaces arising in practice tend to be smooth and seemingly convex along low-dimensional slices. Subsequent works (Kawaguchi, 2016; Safran & Shamir, 2016; Freeman & Bruna, 2016) have furthered these and related ideas empirically or analytically, but it remains safe to say that we have a long way to go before a full understanding is achieved.

1.2. Our contributions

One shortcoming of prior theoretical results is that they are often derived in contexts far removed from practical neural network settings – for example, some work relies on results for generic random landscapes unrelated to neural networks, and other work draws on rather tenuous connections to spin-glass models. While there is a lot to be gained from this type of analysis, it leaves open the possibility that characteristics of loss surfaces specific to neural networks may be lost in the more general setting. In this paper, we focus narrowly on the setting of neural network loss surfaces and propose an analytical framework for studying the spectral density of the Hessian matrix in this context.

Our bottom-up construction assembles an approximation of the Hessian in terms of blocks derived from the weights, the data, and the error signals, all of which we assume to be random variables². From this viewpoint, the Hessian may be understood as a structured random matrix and we study its eigenvalues in the context of random matrix theory, using tools from free probability. We focus on single-hidden-layer networks, but in principle the framework can accommodate any network architecture. After establishing our methodology, we compute approximations to the Hessian at several levels of refinement. One result is a prediction that for critical points of small index, the index scales like the energy to the 3/2 power.

² Although we additionally assume the random variables are independent, our framework does not explicitly require this assumption, and in principle it could be relaxed in exchange for more technical computations.

2. Preliminaries

Let W^{(1)} ∈ R^{n₁×n₀} and W^{(2)} ∈ R^{n₂×n₁} be weight matrices of a single-hidden-layer network without biases. Denote by x ∈ R^{n₀×m} the input data and by y ∈ R^{n₂×m} the targets. We will write z^{(1)} = W^{(1)}x for the pre-activations. We use [·]₊ = max(·, 0) to denote the ReLU activation, which will be our primary focus. The network output is

    ŷ_{iµ} = Σ_{k=1}^{n₁} W^{(2)}_{ik} [z^{(1)}_{kµ}]₊ ,   (1)

and the residuals are e_{iµ} = ŷ_{iµ} − y_{iµ}. We use Latin indices for features, hidden units, and outputs, and µ to index examples. We consider squared error, so that the loss (or energy) can be written as,

    L = n₂ ε = (1/2m) Σ_{i,µ=1}^{n₂,m} e_{iµ}² .   (2)

The Hessian matrix is the matrix of second derivatives of the loss with respect to the parameters, i.e. H_{αβ} = ∂²L/(∂θ_α ∂θ_β), where θ_α ∈ {W^{(1)}, W^{(2)}}. It decomposes into two pieces, H = H₀ + H₁, where H₀ is positive semi-definite and where H₁ comes from second derivatives and contains all of the explicit dependence on the residuals. Specifically,

    [H₀]_{αβ} ≡ (1/m) Σ_{i,µ=1}^{n₂,m} (∂ŷ_{iµ}/∂θ_α)(∂ŷ_{iµ}/∂θ_β) ≡ (1/m) [JJᵀ]_{αβ} ,   (3)

where we have introduced the Jacobian matrix, J, and,

    [H₁]_{αβ} ≡ (1/m) Σ_{i,µ=1}^{n₂,m} e_{iµ} ∂²ŷ_{iµ}/(∂θ_α ∂θ_β) .   (4)

We will almost exclusively consider square networks with n ≡ n₀ = n₁ = n₂. We are interested in the limit of large networks and datasets, and in practice they are typically of the same order of magnitude. A useful characterization of the network capacity is the ratio of the number of parameters to the effective number of examples³, φ ≡ 2n²/mn = 2n/m. As we will see, φ is a critical parameter which governs the shape of the distribution. For instance, eqn. (3) shows that H₀ has the form of a covariance matrix computed from the Jacobian, with φ governing the rank of the result.

³ In our context of squared error, each of the n₂ targets may be considered an effective example.

We will be making a variety of assumptions throughout this work. We distinguish primary assumptions, which we use to establish the foundations of our analytical framework, from secondary assumptions, which we invoke to simplify computations. To begin with, we establish our primary assumptions:

1. The matrices H₀ and H₁ are freely independent, a property we discuss in sec. 3.
2. The residuals are i.i.d. normal random variables with tunable variance governed by ε, e_{iµ} ∼ N(0, 2ε). This assumption allows the gradient to vanish in the large m limit, specializing our analysis to critical points.
3. The data features are i.i.d. normal random variables.
4. The weights are i.i.d. normal random variables.

We note that normality may not be strictly necessary. We will discuss these assumptions and their validity in sec. 5.

3. Primer on Random Matrix Theory

In this section we highlight a selection of results from the theory of random matrices that will be useful for our analysis. We only aim to convey the main ideas and do not attempt a rigorous exposition. For a more thorough introduction, see, e.g., (Tao, 2012).

3.1. Random matrix ensembles

The theory of random matrices is concerned with properties of matrices M whose entries M_{ij} are random variables. The degree of independence and the manner in which the M_{ij} are distributed determine the type of random matrix ensemble to which M belongs. Here we are interested primarily in two ensembles: the real Wigner ensemble, for which M is symmetric but otherwise the M_{ij} are i.i.d.; and the real Wishart ensemble, for which M = XXᵀ where the X_{ij} are i.i.d. Moreover, we will restrict our attention to studying the limiting spectral density of M. For a random matrix Mₙ ∈ R^{n×n}, the empirical spectral density is defined as,

    ρ_{Mₙ}(λ) = (1/n) Σ_{j=1}^{n} δ(λ − λ_j(Mₙ)) ,   (5)

where the λ_j(Mₙ), j = 1, …, n, denote the n eigenvalues of Mₙ, including multiplicity, and δ(z) is the Dirac delta function centered at z. The limiting spectral density is defined as the limit of eqn. (5) as n → ∞, if it exists.

For a matrix M of the real Wigner ensemble whose entries M_{ij} ∼ N(0, σ²), Wigner (1955) computed its limiting spectral density and found the semi-circle law,

    ρ_SC(λ; σ, φ) = √(4σ² − λ²)/(2πσ²)  if |λ| ≤ 2σ, and 0 otherwise.   (6)

Similarly, if M = XXᵀ is a real Wishart matrix with X ∈ R^{n×p} and X_{ij} ∼ N(0, σ²/p), then the limiting spectral density can be shown to be the Marchenko-Pastur distribution (Marčenko & Pastur, 1967),

    ρ_MP(λ; σ, φ) = ρ(λ)  if φ < 1, and (1 − φ⁻¹)δ(λ) + ρ(λ) otherwise,   (7)

where φ = n/p and,

    ρ(λ) = √((λ − λ₋)(λ₊ − λ))/(2πλσφ) ,   λ± = σ(1 ± √φ)² .   (8)

The Wishart matrix M is low rank if φ > 1, which explains the delta function density at 0 in that case. Notice that there is an eigenvalue gap equal to λ₋, which depends on φ.

3.2. Free probability theory

Suppose A and B are two random matrices. Under what conditions can we use information about their individual spectra to deduce the spectrum of A + B? One way of analyzing this question is with free probability theory, and the answer turns out to be exactly when A and B are freely independent. Two matrices are freely independent if

    E[f₁(A)g₁(B) ⋯ f_k(A)g_k(B)] = 0   (9)

whenever f_i and g_i are such that

    E[f_i(A)] = 0 ,   E[g_i(B)] = 0 .   (10)

Notice that when k = 1, this is equivalent to the definition of classical independence. Intuitively, the eigenspaces of two freely independent matrices are in "generic position" (Speicher, 2009), i.e. they are not aligned in any special way. Before we can describe how to compute the spectrum of A + B, we must first introduce two new concepts, the Stieltjes transform and the R transform.

3.2.1. THE STIELTJES TRANSFORM

For z ∈ C \ R, the Stieltjes transform G of a probability distribution ρ is defined as,

    G(z) = ∫_R ρ(t)/(z − t) dt ,   (11)

from which ρ can be recovered using the inversion formula,

    ρ(λ) = −(1/π) lim_{ε→0⁺} Im G(λ + iε) .   (12)

3.2.2. THE R TRANSFORM

Given the Stieltjes transform G of a probability distribution ρ, the R transform is defined as the solution to the functional equation,

    R(G(z)) + 1/G(z) = z .   (13)

The benefit of the R transform is that it linearizes free convolution, in the sense that,

    R_{A+B} = R_A + R_B ,   (14)

if A and B are freely independent. It plays a role in free probability analogous to that of the log of the Fourier transform in commutative probability theory.

The prescription for computing the spectrum of A + B is as follows: 1) compute the Stieltjes transforms of ρ_A and ρ_B; 2) from the Stieltjes transforms, deduce the R transforms R_A and R_B; 3) from R_{A+B} = R_A + R_B, compute the Stieltjes transform G_{A+B}; and 4) invert the Stieltjes transform to obtain ρ_{A+B}.

4. Warm Up: Wishart plus Wigner

Having established some basic tools from random matrix theory, let us now turn to applying them to computing the limiting spectral density of the Hessian of a neural network at critical points. Recall from above that we can decompose the Hessian into two parts, H = H₀ + H₁, and that H₀ = JJᵀ/m. Let us make the secondary assumption that at critical points, the elements of J and H₁ are i.i.d. normal random variables. In this case, we may take H₀ to be a real Wishart matrix and H₁ to be a real Wigner matrix. We acknowledge that these are strong requirements, and we will discuss the validity of these and other assumptions in sec. 5.

4.1. Spectral distribution

We now turn our attention to the spectral distribution of H and how that distribution evolves as the energy changes. For this purpose, only the relative scaling between H₀ and H₁ is relevant, and we may for simplicity take σ_{H₀} = 1 and σ_{H₁} = √(2ε). Then we have,

    ρ_{H₀}(λ) = ρ_MP(λ; 1, φ) ,   ρ_{H₁}(λ) = ρ_SC(λ; √(2ε), φ) ,   (15)

which can be integrated using eqn. (11) to obtain G_{H₀} and G_{H₁}. Solving eqn. (13) for the R transforms then gives,

    R_{H₀}(z) = 1/(1 − zφ) ,   R_{H₁}(z) = 2εz .   (16)
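To make the decomposition H = H₀ + H₁ concrete, the following sketch (our own illustration, not part of the paper) builds a tiny single-hidden-layer ReLU network, computes H₀ = JJᵀ/m from the analytic Jacobian, fills in the mixed W^{(1)}-W^{(2)} second-derivative block of H₁ by hand, and checks the sum against a finite-difference Hessian of the loss. All sizes and the random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 4                                   # width (n0 = n1 = n2 = n) and examples
x, y = rng.normal(size=(n, m)), rng.normal(size=(n, m))
th0 = rng.normal(size=2 * n * n)              # flattened parameters (W1, W2)

def unpack(th):
    return th[:n * n].reshape(n, n), th[n * n:].reshape(n, n)

def grad(th):
    """Analytic gradient of L = (1/2m) sum_{i,mu} e_{i,mu}^2."""
    W1, W2 = unpack(th)
    z1 = W1 @ x
    a1 = np.maximum(z1, 0.0)                  # ReLU activations
    e = W2 @ a1 - y                           # residuals
    gW1 = ((W2.T @ e) * (z1 > 0)) @ x.T / m
    gW2 = e @ a1.T / m
    return np.concatenate([gW1.ravel(), gW2.ravel()])

# Full Hessian by central finite differences of the analytic gradient.
P, h = 2 * n * n, 1e-5
H = np.zeros((P, P))
for k in range(P):
    d = np.zeros(P)
    d[k] = h
    H[:, k] = (grad(th0 + d) - grad(th0 - d)) / (2 * h)
H = (H + H.T) / 2

# H0 = J J^T / m, with J the Jacobian of the outputs w.r.t. the parameters.
W1, W2 = unpack(th0)
z1 = W1 @ x
a1 = np.maximum(z1, 0.0)
e = W2 @ a1 - y
J = np.zeros((P, n * m))
for i in range(n):
    for mu in range(m):
        dW1 = np.outer(W2[i] * (z1[:, mu] > 0), x[:, mu])  # d yhat_{i,mu} / d W1
        dW2 = np.zeros((n, n))
        dW2[i] = a1[:, mu]                                 # d yhat_{i,mu} / d W2
        J[:, i * m + mu] = np.concatenate([dW1.ravel(), dW2.ravel()])
H0 = J @ J.T / m

# For ReLU, H1 has only mixed W1-W2 blocks: residual-weighted second derivatives.
H1 = np.zeros((P, P))
for a in range(n):
    for b in range(n):
        for c in range(n):
            val = np.sum(e[c] * (z1[a] > 0) * x[b]) / m    # nonzero only when d = a
            r, s = a * n + b, n * n + c * n + a
            H1[r, s] = H1[s, r] = val

print("max |H - (H0 + H1)| =", np.abs(H - (H0 + H1)).max())  # ~ finite-difference error
```

The diagonal blocks of H₁ vanish here because the second derivative of the ReLU is zero almost everywhere, so the residual dependence of the Hessian lives entirely in the mixed blocks.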

Figure 1. Spectral distributions of the Wishart plus Wigner approximation of the Hessian for three different ratios of parameters to data points: (a) φ = 1/4, (b) φ = 1/2, (c) φ = 3/4. Curves show ρ(λ) for ε = 0.0, 0.1, 0.5, and 1.0. As the energy ε of the critical point increases, the spectrum becomes more semicircular and negative eigenvalues emerge.
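Curves like those in fig. 1 can be reproduced numerically. The sketch below (our own illustration; the grid and parameter choices are arbitrary) solves the cubic equation satisfied by G_H (eqn. 18) at each real λ, takes the root with negative imaginary part as the Stieltjes branch, recovers the density via the inversion formula (eqn. 12), and integrates the negative part to estimate the index α (eqn. 19).

```python
import numpy as np

def rho_H(lam, eps, phi):
    # Roots of  2*eps*phi*G^3 - (2*eps + lam*phi)*G^2 + (lam + phi - 1)*G - 1 = 0.
    roots = np.roots([2 * eps * phi, -(2 * eps + lam * phi), lam + phi - 1.0, -1.0])
    # Inside the support the cubic has a complex-conjugate pair; the Stieltjes
    # branch is the root with negative imaginary part, giving rho >= 0.
    return max(-roots.imag.min() / np.pi, 0.0)

lam = np.linspace(-3.0, 6.0, 3001)
dx = lam[1] - lam[0]
phi = 0.5
masses, alphas = [], []
for eps in (0.1, 0.5, 1.0):
    rho = np.array([rho_H(l, eps, phi) for l in lam])
    masses.append(rho.sum() * dx)                 # total mass: should be close to 1
    alphas.append(rho[lam < 0].sum() * dx)        # fraction of negative eigenvalues
    print(f"eps={eps}: mass={masses[-1]:.3f}, alpha={alphas[-1]:.4f}")
```

The total mass serves as a sanity check on the root selection, and the growth of α with ε mirrors the emergence of negative eigenvalues visible in fig. 1.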

We proceed by computing the R transform of H,

    R_H = R_{H₀} + R_{H₁} = 1/(1 − zφ) + 2εz ,   (17)

so that, using eqn. (13), we find that its Stieltjes transform G_H solves the following cubic equation,

    2εφ G_H³ − (2ε + zφ) G_H² + (z + φ − 1) G_H − 1 = 0 .   (18)

The correct root of this equation is determined by the asymptotic behavior G_H ∼ 1/z as z → ∞ (Tao, 2012). From this root, the spectral density can be derived through the Stieltjes inversion formula, eqn. (12). The result is illustrated in fig. 1. For small ε, the spectral density resembles the Marchenko-Pastur distribution, and for small enough φ there is an eigenvalue gap. As ε increases past a critical value ε_c, the eigenvalue gap disappears and negative eigenvalues begin to appear. As ε gets large, the spectrum becomes semicircular.

4.2. Normalized index

Because it measures the number of descent directions⁴, one quantity that is of importance in optimization and in the analysis of critical points is the fraction of negative eigenvalues of the Hessian, or the index of a critical point, α. Prior work (Dauphin et al., 2014; Choromanska et al., 2015) has observed that the index of critical points typically grows rapidly with energy, so that critical points with many descent directions have large loss values. The precise form of this relationship is important for characterizing the geometry of loss surfaces, and in our framework it can be readily computed from the spectral density via integration,

    α(ε, φ) ≡ ∫_{−∞}^{0} ρ(λ; ε, φ) dλ = 1 − ∫_{0}^{∞} ρ(λ; ε, φ) dλ .   (19)

⁴ Here we ignore degenerate critical points for simplicity.

Given that the spectral density is defined implicitly through eqn. (18), performing this integration analytically is nontrivial. We discuss a method for doing so in the supplementary material. The full result is too long to present here, but we find that for small α,

    α(ε, φ) ≈ α₀(φ) |(ε − ε_c)/ε_c|^{3/2} ,   (20)

where the critical value of ε,

    ε_c = (1/16) (1 − 20φ − 8φ² + (1 + 8φ)^{3/2}) ,   (21)

is the value of the energy below which all critical points are minimizers. We note that ε_c can be found in a simpler way: it is the value of ε below which eqn. (18) has only real roots at z = 0, i.e. it is the value of ε for which the discriminant of the polynomial in eqn. (18) vanishes at z = 0. Also, we observe that ε_c vanishes cubically as φ approaches 1,

    ε_c ≈ (2/27) (1 − φ)³ + O((1 − φ)⁴) .   (22)

The 3/2 scaling in eqn. (20) is the same behavior that was found in (Bray & Dean, 2007) in the context of the field theory of Gaussian random functions. As we will see later, the 3/2 scaling persists in a more refined version of this calculation and in numerical simulations.

5. Analysis of Assumptions

5.1. Universality

There is a wealth of literature establishing that many properties of large random matrices do not depend on the details of how their entries are distributed, i.e. many results are universal. For instance, Tao & Vu (2012) show that the spectrum of Wishart matrices asymptotically approaches the Marchenko-Pastur law regardless of the distribution of the individual entries, so long as they are independent, have mean zero, unit variance, and finite k-th moment for k > 2. Analogous results can be found for Wigner matrices (Aggarwal). Although we are unaware of any existing analyses relevant for the specific matrices studied here, we believe that our conclusions are likely to exhibit some universal properties – for example, as in the Wishart and Wigner cases, normality is probably not necessary.

On the other hand, most universality results and the tools we are using from random matrix theory are only exact in the limit of infinite matrix size. Finite-size corrections are actually largest for small matrices, which means (counter-intuitively) that the most conservative setting in which to test our results is for small networks. So we will investigate our assumptions in this regime. We expect agreement with theory only to improve for larger systems.

5.2. Primary assumptions

In sec. 2 we introduced a number of assumptions we dubbed primary, and we now discuss their validity.

5.2.1. FREE INDEPENDENCE OF H₀ AND H₁

Our use of free probability relies on the free independence of H₀ and H₁. Generically we may expect some alignment between the eigenspaces of H₀ and H₁ so that free independence is violated; however, we find that this violation is often quite small in practice. To perform this analysis, it suffices to examine the discrepancy between the distribution of H₀ + H₁ and that of H₀ + QH₁Qᵀ, where Q is an orthogonal random matrix of Haar measure (Chen et al., 2016). Fig. 2 shows minimal discrepancy for a network trained on an autoencoding task; see the supplementary material for further details and additional experiments. More precise methods for quantifying partial freeness exist (Chen et al., 2016), but we leave this analysis for future work.

Figure 2. Testing the validity of the free independence assumption by comparing the eigenvalue distribution of H₀ + H₁ (in blue) and H₀ + QH₁Qᵀ (in orange) for randomly generated orthogonal Q. The discrepancies are small, providing support for the assumption. Data is for a trained single-hidden-layer ReLU autoencoding network with 20 hidden units and no biases on 150 4×4 downsampled, grayscaled, whitened CIFAR-10 images.

5.2.2. RESIDUALS ARE I.I.D. RANDOM NORMAL

First, we note that e_{iµ} ∼ N(0, 2ε) is consistent with the definition of the energy ε in eqn. (2). Furthermore, because the gradient of the loss is proportional to the residuals, it vanishes in expectation (i.e. as m → ∞), which specializes our analysis to critical points. So this assumption seems necessary for our analysis. It is also consistent with the priors leading to the choice of the squared error loss function. Altogether we believe this assumption is fairly mild.

5.2.3. DATA ARE I.I.D. RANDOM NORMAL

This assumption is almost never strictly satisfied, but it is approximately enforced by common preprocessing methods, such as whitening and random projections.

5.2.4. WEIGHTS ARE I.I.D. RANDOM NORMAL

Although the i.i.d. assumption is clearly violated for a network that has learned any useful information, the weight distributions of trained networks often appear random, and sometimes appear normal (see, e.g., fig. S1 of the supplementary material).

5.3. Secondary assumptions

In sec. 4 we introduced two assumptions we dubbed secondary, and we discuss their validity here.

5.3.1. J AND H₁ ARE I.I.D. RANDOM NORMAL

Given that J and H₁ are constructed from the residuals, the data, and the weights – variables that we assume are i.i.d. random normal – it is not unreasonable to suppose that their entries satisfy the universality conditions mentioned above. In fact, if this were the case, it would go a long way to validate the approximations made in sec. 4. As it turns out, the situation is more complicated: both J and H₁ have substructure which violates the independence requirement for the universality results to apply. We must understand and take into account this substructure if we wish to improve our approximation further. We examine one way of doing so in the next section.

6. Beyond the Simple Case

In sec. 4 we illustrated our techniques by examining the simple case of Wishart plus Wigner matrices. This analysis shed light on several qualitative properties of the spectrum of the Hessian matrix, but, as discussed in the previous section, some of the assumptions were unrealistic. We believe that it is possible to relax many of these assumptions to produce results more representative of practical networks. In this section, we take one step in that direction and focus on the specific case of a single-hidden-layer ReLU network.
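The rotation diagnostic behind fig. 2 is easy to replicate on synthetic matrices. The sketch below (our own illustration, not the paper's experiment) compares the spectrum of A + B when B shares eigenvectors with A against A + QBQᵀ for a Haar-random orthogonal Q: the aligned case deviates strongly from the freely rotated one, while two independent rotations agree closely, as free independence predicts.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 400
X = rng.normal(size=(N, 2 * N)) / np.sqrt(2 * N)
A = X @ X.T                                # Wishart stand-in for H0 (phi = 1/2)

def haar_orthogonal(n, rng):
    # QR of a Gaussian matrix, with a sign fix, yields a Haar-distributed Q.
    Q, R = np.linalg.qr(rng.normal(size=(n, n)))
    return Q * np.sign(np.diag(R))

def spec_dist(M1, M2):
    """Mean absolute difference between the sorted spectra of M1 and M2."""
    return np.mean(np.abs(np.linalg.eigvalsh(M1) - np.linalg.eigvalsh(M2)))

Q1, Q2 = haar_orthogonal(N, rng), haar_orthogonal(N, rng)
rot1, rot2 = Q1 @ A @ Q1.T, Q2 @ A @ Q2.T

d_aligned = spec_dist(A + A, A + rot1)     # aligned copy: eigenspaces coincide
d_free = spec_dist(A + rot1, A + rot2)     # two "generic position" rotations
print(f"aligned: {d_aligned:.3f}  freely rotated: {d_free:.3f}")
```

Here A + A simply doubles each eigenvalue, whereas the freely rotated sum follows the free convolution of the two spectra, so the aligned discrepancy stays bounded away from zero as N grows while the free one shrinks.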

6.1. Product Wishart distribution and H₁

In the single-hidden-layer ReLU case, we can write H₁ as,

    H₁ = [ 0, H₁₂ ; H₁₂ᵀ, 0 ] ,

where its off-diagonal blocks are given by the matrix H₁₂,

    [H₁₂]_{ab;cd} = (1/m) Σ_{i,µ} e_{iµ} ∂/∂W^{(2)}_{cd} (∂ŷ_{iµ}/∂W^{(1)}_{ab}) = (1/m) Σ_{µ=1}^{m} e_{cµ} δ_{ad} θ(z^{(1)}_{aµ}) x_{bµ} .   (23)

Here it is understood that the matrix H₁₂ is obtained from the tensor [H₁₂]_{ab;cd} by independently vectorizing the first two and last two dimensions, δ_{ad} is the Kronecker delta function, and θ is the Heaviside theta function. As discussed above, we take the error to be normally distributed,

    e_{cµ} = Σ_j W^{(2)}_{cj} [z^{(1)}_{jµ}]₊ − y_{cµ}(x) ∼ N(0, 2ε) .   (24)

We are interested in the situation in which the layer width n and the number of examples m are large, in which case θ(z^{(1)}_{aµ}) can be interpreted as a mask that eliminates half of the terms in the sum over µ. If we reorder the indices so that the surviving ones appear first, we can write,

    H₁₂ ≈ (1/m) Σ_{µ̂=1}^{m/2} δ_{ad} e_{cµ̂} x_{bµ̂} = (1/m) I_n ⊗ êx̂ᵀ ,

where we have written ê and x̂ to denote the n × m/2 matrices derived from e and x by removing the vanishing half of their columns.

Owing to the block off-diagonal structure of H₁, the squares of its eigenvalues are determined by the eigenvalues of the product of the blocks,

    H₁₂ H₁₂ᵀ = (1/m²) I_n ⊗ (êx̂ᵀ)(êx̂ᵀ)ᵀ .   (25)

The Kronecker product gives an n-fold degeneracy to each eigenvalue, but the spectral density is the same as that of

    M ≡ (1/m²) (êx̂ᵀ)(êx̂ᵀ)ᵀ .   (26)

It follows that the spectral density of H₁ is related to that of M,

    ρ_{H₁}(λ) = |λ| ρ_M(λ²) .   (27)

Notice that M is a Wishart matrix where each factor is itself a product of real Gaussian random matrices. The spectral density for matrices of this type has been computed using the cavity method (Dupic & Castillo, 2014). As the specific form of the result is not particularly enlightening, we defer its presentation to the supplementary material. The result may also be derived using free probability theory (Muller, 2002; Burda et al., 2010). From either perspective it is possible to derive the R transform, which reads,

    R_{H₁}(z) = εφz / (2 − εφ²z²) .   (28)

See fig. 3 for a comparison of our theoretical prediction for the spectral density to numerical simulations⁵.

⁵ As the derivation requires that the error signals are random, in the simulations we manually overwrite the network's error signals with random noise.

Figure 3. Spectrum of H₁ at initialization, averaged over 10 runs, using 4×4 downsampled, grayscaled, whitened CIFAR-10 in a single-layer ReLU autoencoder with 16 hidden units. Error signals have been replaced by noise distributed as N(0, 100), i.e. ε = 50. The theoretical prediction (red) matches well. Left: φ = 1/16; right: φ = 1/4.

6.2. H₀ and the Wishart assumption

Unlike the situation for H₁, to the best of our knowledge the limiting spectral density of H₀ cannot easily be deduced from known results in the literature. In principle it can be computed using diagrammatics and the method of moments (Feinberg & Zee, 1997; Tao, 2012), but this approach is complicated by serious technical difficulties – for example, the matrix has both block structure and Kronecker structure, and there are nonlinear functions applied to the elements. Nevertheless, we may make some progress by focusing our attention on the smallest eigenvalues of H₀, which we expect to be the most relevant for computing the smallest eigenvalues of H.

The empirical spectral density of H₀ for an autoencoding network is shown in fig. 4. At first glance, this distribution does not closely resemble the Marchenko-Pastur distribution (see, e.g., the ε = 0 curve in fig. 1) owing to its heavy tail and small eigenvalue gap. On the other hand, we are not concerned with the large eigenvalues, and even though the gap is small, its scaling with φ appears to be well-approximated by λ₋ from eqn. (8) for appropriate σ. See fig. 5. This observation suggests that as a first approximation, it is sensible to continue to represent the limiting spectral distribution of H₀ with the Marchenko-Pastur distribution.

Figure 4. The spectrum of H₀ at initialization of a single-layer ReLU autoencoder with 16 hidden units and 256 4×4 downsampled, grayscaled, whitened CIFAR-10 images. There are 16 null directions, and the corresponding zero eigenvalues have been removed. The inset shows the values of the smallest 35 nonzero eigenvalues; the positive value of the first datapoint reflects the existence of a nonzero gap.

Figure 5. Evolution of the H₀ spectral gap as a function of 1/φ, comparing the Marchenko-Pastur prediction (red) against empirical results for a 16-hidden-unit single-layer ReLU autoencoder at initialization (black dots, each averaged over 30 independent runs). The dataset was taken from 4×4 downsampled, grayscaled, whitened CIFAR-10 images. A single fit parameter, characterizing the variance of the matrix elements in the Wishart ensemble, is used.

6.3. Improved approximation to the Hessian

The above analysis motivates us to propose an improved approximation to R_H(z),

    R_H(z) = σ/(1 − σφz) + εφz/(2 − εφ²z²) ,   (29)

where the first term is the R transform of the Marchenko-Pastur distribution with the σ parameter restored. As before, an algebraic equation defining the Stieltjes transform of ρ_H can be deduced from this expression,

    2 = [2z − 2σ(1 − φ)] G_H − [2σφz + εφ(1 − φ)] G_H² − εφ²[z + σ(φ − 2)] G_H³ + εσφ³z G_H⁴ .   (30)

Using the same techniques as in sec. 4, we can use this equation to obtain the function α(ε, φ). We again find the same 3/2 scaling, i.e.,

    α(ε, φ) ≈ α̃₀(φ) |(ε − ε_c)/ε_c|^{3/2} .   (31)

Compared with eqn. (20), the coefficient function α̃₀(φ) differs from α₀(φ), and,

    ε_c = σ² (27 − 18χ − χ² + 8χ^{3/2}) / (32φ(1 − φ)³) ,   χ ≡ 1 + 16φ − 8φ² .   (32)

Despite the apparent pole at φ = 1, ε_c actually vanishes there,

    ε_c ≈ (8/27) σ² (1 − φ)³ + O((1 − φ)⁴) .   (33)

Curiously, for σ = 1/2, this is precisely the same behavior we found for ε_c near φ = 1 in the Wishart plus Wigner approximation in sec. 4. This observation, combined with the fact that the 3/2 scaling in eqn. (31) is also what we found in sec. 4, suggests that it is H₀, rather than H₁, that is governing the behavior near ε_c.

6.4. Empirical distribution of critical points

We conduct large-scale experiments to examine the distribution of critical points and compare with our theoretical predictions. Uniformly sampling critical points of varying energy is a difficult problem. Instead, we take more of a brute-force approach: for each possible value of the index, we aim to collect many measurements of the energy and compute the mean value. Because we cannot control the index of the obtained critical point, we run a very large number of experiments (∼50k) in order to obtain sufficient data for each value of α. This procedure appears to be a fairly robust method for inferring the α(ε) curve.

We adopt the following heuristic for finding critical points. First we optimize the network with standard gradient descent until the loss reaches a random value between 0 and the initial loss. From that point on, we switch to minimizing a new objective, J_g = |∇_θ L|², which, unlike the primary objective, is attracted to saddle points. Gradient descent on J_g only requires the computation of Hessian-vector products and can be executed efficiently.

(a) Index of critical points versus energy. (b) Energy of minimizers versus parameters/data points.

Figure 6. Empirical observations of the distribution of critical points in single-hidden-layer tanh networks with varying ratios of parameters to data points, φ ∈ {2/3, 1/2, 1/3, 1/4, 1/8, 1/16}. (a) Each point represents the mean energy of critical points with index α, averaged over ∼200 training runs. Solid lines are best-fit curves for small α ≈ α₀|ε − ε_c|^{3/2}. The good agreement (emphasized in the inset, which shows the behavior for small α) provides support for our theoretical prediction of the 3/2 scaling. (b) The best-fit value of ε_c from (a) versus φ. A surprisingly good fit is obtained with ε_c = ½(1 − φ)². Linear networks obey ε_c = ½(1 − φ); the difference between the curves shows the benefit obtained from using a nonlinear activation function.

We discard any run for which the final J_g > 10⁻⁶; otherwise we record the final energy and index.

We consider relatively small networks and datasets in order to run a large number of experiments. We train single-hidden-layer tanh networks of size n = 16, which also equals the input and output dimensionality. For each training run, the data and targets are randomly sampled from standard normal distributions, which makes this a kind of memorization task. The results are summarized in fig. 6. We observe that for small α, the scaling α ≈ |ε − ε_c|^{3/2} is a good approximation, especially for smaller φ. This agreement with our theoretical predictions provides support for our analytical framework and for the validity of our assumptions.

As a byproduct of our experiments, we observe that the energy of minimizers is well described by a simple function, ε_c = ½(1 − φ)². Curiously, a similar functional form was derived for linear networks (Advani & Ganguli, 2016), ε_c = ½(1 − φ). In both cases, the value at φ = 0 and φ = 1 is understood simply: at φ = 0, the network has zero effective capacity and the variance of the target distribution determines the loss; at φ = 1, the matrix of hidden units is no longer rank constrained and can store the entire input-output map. For intermediate values of φ, the fact that the exponent of (1 − φ) is larger for tanh networks than for linear networks is the mathematical manifestation of the nonlinear network's better performance for the same number of parameters. Inspired by these observations and by the analysis of Zhang et al. (2016), we speculate that this result may have a simple information-theoretic explanation, but we leave a quantitative analysis to future work.

7. Conclusions

We introduced a new analytical framework for studying the Hessian matrix of neural networks based on free probability and random matrix theory. By decomposing the Hessian into two pieces, H = H₀ + H₁, one can systematically study the behavior of the spectrum and associated quantities as a function of the energy ε of a critical point. The approximations invoked are on H₀ and H₁ separately, which enables the analysis to move beyond the simple representation of the Hessian as a member of the Gaussian Orthogonal Ensemble of random matrices. We derived explicit predictions for the spectrum and the index under a set of simplifying assumptions.

We found empirical evidence in support of our prediction that for small α, α ≈ |ε − ε_c|^{3/2}, raising the question of how universal the 3/2 scaling may be, especially given the results of (Bray & Dean, 2007). We also showed how some of our assumptions can be relaxed at the expense of reduced generality of network architecture and increased technical calculations. An interesting result of our numerical simulations of a memorization task is that the energy of minimizers appears to be well-approximated by a simple function of φ. We leave the explanation of this observation as an open problem for future work.

Acknowledgements

We thank Dylan J. Foster, Justin Gilmer, Daniel Holtmann-Rice, Ben Poole, Maithra Raghu, Samuel S. Schoenholz, Jascha Sohl-Dickstein, and Felix X. Yu for useful comments and discussions.

References

Advani, Madhu and Ganguli, Surya. Statistical mechanics of optimal convex inference in high dimensions. Physical Review X, 6(3):031034, 2016.

Aggarwal, Amol. Bulk universality for generalized Wigner matrices with few moments. arXiv preprint arXiv:1612.00421.

Bray, Alan J. and Dean, David S. Statistics of critical points of Gaussian fields on large-dimensional spaces. Phys. Rev. Lett., 98:150201, Apr 2007. doi: 10.1103/PhysRevLett.98.150201. URL http://link.aps.org/doi/10.1103/PhysRevLett.98.150201.

Burda, Z, Jarosz, A, Livan, G, Nowak, MA, and Swiech, A. Eigenvalues and singular values of products of rectangular Gaussian random matrices. Physical Review E, 82(6):061114, 2010.

Chen, Jiahao, Van Voorhis, Troy, and Edelman, Alan. Partial freeness of random matrices. arXiv preprint arXiv:1204.2257, 2016.

Choromanska, Anna, Henaff, Mikael, Mathieu, Michaël, Arous, Gérard Ben, and LeCun, Yann. The loss surfaces of multilayer networks. JMLR, 38, 2015.

Dauphin, Yann N, Pascanu, Razvan, Gulcehre, Caglar, Cho, Kyunghyun, Ganguli, Surya, and Bengio, Yoshua. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Advances in Neural Information Processing Systems 27, pp. 2933–2941, 2014.

Dupic, Thomas and Castillo, Isaac Pérez. Spectral density of products of Wishart dilute random matrices. Part I: the dense case. arXiv preprint arXiv:1401.7802, 2014.

Feinberg, Joshua and Zee, Anthony. Renormalizing rectangles and other topics in random matrix theory. Journal of Statistical Physics, 87(3-4):473–504, 1997.

Freeman, C. Daniel and Bruna, Joan. Topology and geometry of half-rectified network optimization. 2016. URL http://arxiv.org/abs/1611.01540.

Goodfellow, Ian J., Vinyals, Oriol, and Saxe, Andrew M. Qualitatively characterizing neural network optimization problems. ICLR 2015. URL http://arxiv.org/abs/1412.6544.

Kawaguchi, Kenji. Deep learning without poor local minima. Advances in Neural Information Processing Systems 29, pp. 586–594, 2016.

Keskar, Nitish Shirish, Mudigere, Dheevatsa, Nocedal, Jorge, Smelyanskiy, Mikhail, and Tang, Ping Tak Peter. On large-batch training for deep learning: Generalization gap and sharp minima. ICLR 2017. URL http://arxiv.org/abs/1609.04836.

Laisant, C.-A. Intégration des fonctions inverses. Nouvelles annales de mathématiques, journal des candidats aux écoles polytechnique et normale, 5:253–257, 1905.

Marčenko, Vladimir A and Pastur, Leonid Andreevich. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4):457, 1967.

Müller, Ralf R. On the asymptotic eigenvalue distribution of concatenated vector-valued fading channels. IEEE Transactions on Information Theory, 48(7):2086–2091, 2002.

Neyshabur, Behnam, Salakhutdinov, Ruslan, and Srebro, Nathan. Path-SGD: Path-normalized optimization in deep neural networks. pp. 2413–2421, 2015.

Safran, Itay and Shamir, Ohad. On the quality of the initial basin in overspecified neural networks. JMLR, 48, 2016. URL http://arxiv.org/abs/1511.04210.

Saxe, Andrew M., McClelland, James L., and Ganguli, Surya. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. ICLR.

Speicher, Roland. Free probability theory. arXiv preprint arXiv:0911.0087, 2009.

Tao, T. and Vu, V. Random matrices: universality of local statistics of eigenvalues. Annals of Probability, 40(3):1285–1315, 2012.

Tao, Terence. Topics in random matrix theory, volume 132. American Mathematical Society, Providence, RI, 2012.

Wigner, Eugene P. Characteristic vectors of bordered matrices with infinite dimensions. Annals of Mathematics, pp. 548–564, 1955.

Zhang, Chiyuan, Bengio, Samy, Hardt, Moritz, Recht, Benjamin, and Vinyals, Oriol. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.