Arxiv:1910.05929V1 [Cs.LG] 14 Oct 2019 Multaneously Accounts for All 4 of These Sur- the Journey to These Minima

Emergent properties of the local geometry of neural loss landscapes Stanislav Fort Surya Ganguli Stanford University Stanford University Stanford, CA, USA Stanford, CA, USA Abstract 1 Introduction The geometry of neural network loss landscapes and The local geometry of high dimensional neu- the implications of this geometry for both optimiza- ral network loss landscapes can both chal- tion and generalization have been subjects of intense lenge our cherished theoretical intuitions as interest in many works, ranging from studies on the well as dramatically impact the practical suc- lack of local minima at significantly higher loss than cess of neural network training. Indeed recent that of the global minimum [1, 2] to studies debating works have observed 4 striking local prop- relations between the curvature of local minima and erties of neural loss landscapes on classifi- their generalization properties [3, 4, 5, 6]. Fundamen- cation tasks: (1) the landscape exhibits ex- tally, the neural network loss landscape is a scalar loss actly C directions of high positive curvature, function over a very high D dimensional parameter where C is the number of classes; (2) gradi- space that could depend a priori in highly nontriv- ent directions are largely confined to this ex- ial ways on the very structure of real-world data itself tremely low dimensional subspace of positive as well as intricate properties of the neural network Hessian curvature, leaving the vast major- architecture. Moreover, the regions of this loss land- ity of directions in weight space unexplored; scape explored by gradient descent could themselves (3) gradient descent transiently explores in- have highly atypical geometric properties relative to termediate regions of higher positive curva- randomly chosen points in the landscape. Thus un- ture before eventually finding flatter minima; derstanding the shape of loss functions over high di- (4) training can be successful even when con- mensional spaces with potentially intricate dependen- fined to low dimensional random affine hy- cies on both data and architecture, as well as biases in perplanes, as long as these hyperplanes in- regions explored by gradient descent, remains a signif- tersect a Goldilocks zone of higher than av- icant challenge in deep learning. Indeed many recent erage curvature. We develop a simple theo- studies explore extremely intriguing properties of the retical model of gradients and Hessians, jus- local geometry of these loss landscapes, as character- tified by numerical experiments on architec- ized by the gradient and Hessian of the loss landscape, tures and datasets used in practice, that si- both at minima found by gradient descent, and along arXiv:1910.05929v1 [cs.LG] 14 Oct 2019 multaneously accounts for all 4 of these sur- the journey to these minima. prising and seemingly unrelated properties. Our unified model provides conceptual in- In this work we focus on providing a simple, unified ex- sights into the emergence of these properties planation of 4 seemingly unrelated yet highly intrigu- and makes connections with diverse topics ing local properties of the loss landscape on classifica- in neural networks, random matrix theory, tion tasks that have appeared in the recent literature: and spin glasses, including the neural tangent kernel, BBP phase transitions, and Derrida's random energy model. (1) The Hessian eigenspectrum is composed of a bulk plus C outlier eigenvalues where C is the number of classes. Recent works have observed this phenomenon in small networks [7, 8, 9], as well as large networks [10, 11]. This implies that locally the loss landscape has C highly curved directions, while it is much flatter in the vastly larger number of D − C directions in weight space. (2) Gradient aligns with this tiny Hessian sub- rally from the very structure of real-world data, com- space. Recent work [12] demonstrated that the gra- plex neural architectures, and potentially biased ex- dient ~g over training time lies primarily in the subspace plorations of the loss landscape through the dynamics spanned by the top few largest eigenvalues of the Hes- of gradient descent starting at a random initialization. sian H (equal to the number of classes C). This implies Moreover, it is unclear what specific aspects of data, that most of the descent directions lie along extremely architecture and descent trajectory are important for low dimensional subspaces of high local positive curva- generating these 4 properties, and what myriad as- ture; exploration in the vastly larger number of D − C pects are not relevant. available directions in parameter space over training Our main contribution is to provide an extremely sim- utilizes a small portion of the gradient. ple, unified model that simultaneously accounts for all 4 of these local geometric properties. Our model yields (3) The maximal Hessian eigenvalue grows, conceptual insights into why these 4 properties might peaks and then declines during training. Given arise quite generically in neural networks solving classi- widespread interest in arriving at flat minima (e.g. [6]) fication tasks, and makes connections to diverse topics due to their presumed superior generalization prop- in neural networks, random matrix theory, and spin erties, it is interesting to understand how the local glasses, including the neural tangent kernel [19, 20], geometry, and especially curvature, of the loss land- BBP phase transitions [21, 22], and the random en- scape varies over training time. Interestingly, a recent ergy model [23]. study [13] found that the spectral norm of the Hessian, as measured by the top Hessian eigenvalue, displays The outline of this paper is as follows. We set up the a non-monotonic dependence on training time. This basic notation and questions we ask about gradients non-monotonicity implies that gradient descent trajec- and Hessians in detail Sec. 2. In Sec. 3 we introduce a tories tend to enter higher positive curvature regions sequence of simplifying assumptions about the struc- of the loss landscape before eventually finding the de- ture of gradients, Hessians, logit gradients and logit sired flatter regions. Related effects were observed in curvatures that enable us to obtain in the end an ex- [14, 15]. tremely simple random model of Hessians and gradients and how they evolve both over training time and (4) Initializing in a Goldilocks zone of higher weight scale. We then immediately demonstrate in convexity enables training in very low di- Sec. 4 that all 4 striking properties of local geometry mensional weight subspaces. Recent work [16] of the loss landscape emerge naturally from our sim- showed, surprisingly, that one need not train all D pa- ple random model. Finally, in Sec. 5 we give direct rameters independently to achieve good training and evidence that our simplifying theoretical assumptions test error; instead one can choose to train only within leading to our random model in Sec. 3 are indeed valid a random low dimensional affine hyperplane of param- in practice, by performing numerical experiments on eters. Indeed the dimension of this hyperplane can realistic architectures and datasets. be far less than D. More recent work [17] explored how the success of training depends on the position of 2 Overall framework this hyperplane in parameter space. This work found a Goldilocks zone as a function of the initial weight Here we describe the local shape of neural loss land- variance, such that the intersection of the hyperplane scapes, as quantified by their gradient and Hessian, with this Goldilocks zone correlated with training suc- and formulate the main problem we aim to solve: con- cess. Furthermore, this Goldilocks zone was character- ceptually understanding the emergence of the 4 strik- ized as a region of higher than usual positive curvature ing properties from these two fundamental objects. as measured by the Hessian statistic Trace(H)=jjHjj. This statistic takes larger positive values when typical 2.1 Notation and general setup randomly chosen directions in parameter space exhibit more positive curvature [17, 18]. Thus overall, the ease We consider a classification task with C classes. Let of optimizing over low dimensional hyperplanes corre- f~xµ; ~yµgN denote a dataset of N input-output vec- lates with intersections of this hyperplane with regions µ=1 tors where the outputs ~yµ 2 C are one-hot vectors, of higher positive curvature. R with all components equal to 0 except a single compo- µ Taken together these 4 somewhat surprising and nent yk = 1 if and only if k is the correct class label seemly unrelated local geometric properties fundamen- for input xµ. We assume a neural network transforms tally challenge our conceptual understanding of the each input ~xµ into a logit vector ~zµ 2 RC through µ µ ~ D shape of neural network loss landscapes. It is a pri- the function ~z = FW~ (x ), where W 2 R denotes ori unclear how these 4 properties may emerge natu- a D dimensional vector of trainable neural network parameters. We aim to obtain the aforementioned 4 and the Hessian of the loss Lµ with respect to logits is local properties of the loss landscape as a consequence @2Lµ of a set of simple properties and therefore we do not 2 µ µ µ rzL kl = µ µ = pk (δkl − pl ) : (7) @zk @zl assume the function FW~ corresponds to any particular architecture such as a ResNet [24], deep convolutional neural network [25], or a fully-connected neural net- Then applying the chain rule yields the gradient of L work. We do assume though that the predicted class

Load more