Theories of deep learning: generalization, expressivity, and training

Surya Ganguli

Dept. of Applied Physics, Neurobiology, and Electrical Engineering

Stanford University

Funding: Bio-X Neuroventures, NIH, Burroughs Wellcome, Office of Naval Research, Genentech Foundation, Simons Foundation, James S. McDonnell Foundation, Sloan Foundation, McKnight Foundation, Swartz Foundation, National Science Foundation, Stanford Terman Award

http://ganguli-gang.stanford.edu Twitter: @SuryaGanguli

An interesting artificial neural circuit for image classification

Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, NIPS 2012

References: http://ganguli-gang.stanford.edu

• M. Advani and S. Ganguli, An equivalence between high dimensional Bayes optimal inference and M-estimation, NIPS 2016.
• M. Advani and S. Ganguli, Statistical mechanics of optimal convex inference in high dimensions, Physical Review X, 6, 031034, 2016.
• A. Saxe, J. McClelland, S. Ganguli, Learning hierarchical category structure in deep neural networks, Proc. of the 35th Cognitive Science Society, pp. 1271-1276, 2013.
• A. Saxe, J. McClelland, S. Ganguli, Exact solutions to the nonlinear dynamics of learning in deep neural networks, ICLR 2014.
• Y. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, Y. Bengio, Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, NIPS 2014.
• B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, Exponential expressivity in deep neural networks through transient chaos, NIPS 2016.
• S. Schoenholz, J. Gilmer, S. Ganguli, and J. Sohl-Dickstein, Deep information propagation, ICLR 2017.
• S. Lahiri, J. Sohl-Dickstein and S. Ganguli, A universal tradeoff between energy speed and accuracy in physical communication, arXiv:1603.07758.
• S. Lahiri and S. Ganguli, A memory frontier for complex synapses, NIPS 2013.
• F. Zenke, B. Poole, S. Ganguli, Continual learning through synaptic intelligence, ICML 2017.
• J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, S. Ganguli, Modelling arbitrary probability distributions using non-equilibrium thermodynamics, ICML 2015.
• C. Piech, J. Bassen, J. Huang, S. Ganguli, M. Sahami, L. Guibas, J. Sohl-Dickstein, Deep Knowledge Tracing, NIPS 2015.
• L. McIntosh, N. Maheswaranathan, S. Ganguli, S. Baccus, Deep learning models of the retinal response to natural scenes, NIPS 2016.
• J. Pennington, S. Schoenholz, and S. Ganguli, Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice, NIPS 2017.
• A. Goyal, N.R. Ke, S. Ganguli, Y. Bengio, Variational walkback: learning a transition operator as a recurrent stochastic neural net, NIPS 2017.
• J. Pennington, S. Schoenholz, and S. Ganguli, The emergence of spectral universality in deep networks, AISTATS 2018.

Tools: non-equilibrium statistical mechanics, Riemannian geometry, dynamical mean field theory, random matrix theory, statistical mechanics of random landscapes, free probability theory

At a coarse grained level: 3 puzzles of deep learning

Generalization: How can neural networks predict the response to new examples?

A. Saxe, J. McClelland, S. Ganguli, Exact solutions to the nonlinear dynamics of learning in deep neural networks, ICLR 2014.

A. Lampinen, J. McClelland, S. Ganguli, An analytic theory of generalization dynamics and transfer learning in deep linear networks, work in progress.

Expressivity: Why deep? What can a deep neural network “say” that a shallow network cannot?

B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, Exponential expressivity in deep neural networks through transient chaos, NIPS 2016.

Trainability: How can we optimize non-convex loss functions to achieve small training error?

Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice, J. Pennington, S. Schoenholz, and S. Ganguli, NIPS 2017.

The emergence of spectral universality in deep networks, J. Pennington, S. Schoenholz, and S. Ganguli, AISTATS 2018.

Y. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, Y. Bengio, Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, NIPS 2014.

Learning dynamics of training and testing error

Andrew Saxe (Harvard), Andrew Lampinen (Stanford), Jay McClelland (Stanford)

Learning dynamics of training and testing error

The dynamics of learning in deep nonlinear networks are complex:

Training error (vs. training time): non-monotonicity, with plateaus followed by sudden transitions to lower error.

Test error (vs. training time): overfitting to the training examples, leading to bad predictions on new examples.

Deep network

• Little hope for a complete theory with arbitrary nonlinearities

The network computes h^1 = f(W^1 x), h^2 = f(W^2 h^1), ..., y = f(W^D h^{D-1}), with weight matrices W^1, W^2, ..., W^D and elementwise nonlinearity f(x), where x ∈ R^{N_1}, h^2 ∈ R^{N_3}, ..., y ∈ R^{N_{D+1}}.

Deep linear network

• Use a deep linear network as a starting point

The network computes h^1 = f(W^1 x), h^2 = f(W^2 h^1), ..., y = f(W^D h^{D-1}), with weights W^1, W^2, ..., W^D, nonlinearity f(x), x ∈ R^{N_1}, h^2 ∈ R^{N_3}, ..., y ∈ R^{N_{D+1}}; in the linear case f is the identity.

Final Report: Convergence properties of deep linear networks

Andrew Saxe Christopher Baldassano [email protected] [email protected]

1 Introduction

Deep learning approaches have realized remarkable performance across a range of application areas in machine learning, from computer vision [1, 2] to speech recognition [3] and natural language processing [4], but the complexity of deep nonlinear networks has made it difficult to develop a comprehensive theoretical understanding of deep learning. For example, the necessary conditions for convergence, the speed of convergence, and optimal methods for initialization are based primarily on empirical results without much theoretical support. As a first step in understanding the learning dynamics of deep nonlinear networks, we can analyze deep linear networks which compute y = W^D W^{D-1} ··· W^2 W^1 x, where x, y are input and output vectors respectively, and the W^i are D weight matrices in this D+1 layer deep network. Although these networks are no more expressive than a single linear map y = Wx (and therefore unlikely to yield high accuracy in practice), we have previously shown [5] that they do exhibit nonlinear learning dynamics similar to those observed in nonlinear networks. By precisely characterizing how the weight matrices evolve in linear networks, we may gain insight into the properties of nonlinear networks with simple nonlinearities (such as rectified linear units). In this progress report, we show preliminary results for continuous batch gradient descent, in which the gradient step size is assumed to be small enough to take a continuous time limit. By the end of the project, we hope to obtain similar results for discrete batch gradient descent (with a discrete step size) and stochastic (online) gradient descent.

Deep linear network

• Input-output map: always linear, y = ( ∏_{i=1}^{D} W^i ) x ≡ W^{tot} x
• Gradient descent dynamics: nonlinear, coupled, nonconvex
• Useful for studying learning dynamics, not representational power

2 Preliminaries and Previous Work

A deep linear network maps input vectors x to output vectors y = ( ∏_{i=1}^{D} W^i ) x ≡ Wx. We wish to minimize the squared error on the training set {x^µ, y^µ}_{µ=1}^{P}, l(W) = ∑_{µ=1}^{P} ||y^µ − Wx^µ||^2. The batch gradient descent update for a layer l is

ΔW^l = λ ∑_{µ=1}^{P} ( ∏_{i=l+1}^{D} W^i )^T [ y^µ x^{µT} − ( ∏_{i=1}^{D} W^i ) x^µ x^{µT} ] ( ∏_{i=1}^{l−1} W^i )^T,   (1)

where ∏_{i=a}^{b} W^i = W^b W^{b−1} ··· W^{a+1} W^a, with the caveat that ∏_{i=a}^{b} W^i = I if a > b.

The minimizing W can be found analytically by setting the derivative of the loss to zero:

∑_{µ=1}^{P} ( y^µ − Wx^µ ) x^{µT} = 0.   (2)

Let Σ^{xx} ≡ ∑_{µ=1}^{P} x^µ x^{µT} be the input correlation matrix, and Σ^{yx} ≡ ∑_{µ=1}^{P} y^µ x^{µT} be the input-output correlation matrix. The optimal W is

W* = Σ^{yx} (Σ^{xx})^{−1}.   (3)
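To make the batch update and the analytic optimum concrete, here is a minimal NumPy sketch (not from the report; the dataset, sizes, learning rate, and epoch count are illustrative assumptions) that runs full-batch gradient descent on a three-layer linear network and compares the learned composite map to W* = Σ^{yx}(Σ^{xx})^{−1}.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: P examples, inputs of dimension N_in, linear targets of dimension N_out.
P, N_in, N_hidden, N_out = 200, 10, 10, 5
X = rng.standard_normal((N_in, P))
W_true = rng.standard_normal((N_out, N_in))
Y = W_true @ X

# Three-layer linear network y = W2 W1 x with small random initialization.
W1 = 0.01 * rng.standard_normal((N_hidden, N_in))
W2 = 0.01 * rng.standard_normal((N_out, N_hidden))

lr = 0.05
for epoch in range(5000):
    E = Y - W2 @ W1 @ X              # residual on the full batch
    dW2 = E @ (W1 @ X).T / P         # gradient of 0.5*||Y - W2 W1 X||^2 wrt W2 (scaled by 1/P)
    dW1 = W2.T @ E @ X.T / P         # gradient wrt W1
    W2 += lr * dW2
    W1 += lr * dW1

# Analytic optimum from Eq. (3): W* = Sigma_yx (Sigma_xx)^{-1}
Sigma_xx = X @ X.T
Sigma_yx = Y @ X.T
W_star = Sigma_yx @ np.linalg.inv(Sigma_xx)

print("distance from optimum:", np.linalg.norm(W2 @ W1 - W_star))
```

Running it, the training loss exhibits the plateau-then-transition behavior discussed next, even though the end point is the simple linear regression solution.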

Nontrivial learning dynamics

Plateaus and sudden transitions; faster convergence from pretrained initial conditions

[Figure: training error vs. epochs (0-500) for random initial conditions vs. pretrained initial conditions, showing plateaus with sudden transitions and faster convergence from the pretrained initialization.]

• Build intuitions for the nonlinear case by analyzing the linear case

Nonlinear learning dynamics in a 3 layer linear net

A three-layer linear network with layer sizes N_1, N_2, N_3 maps an object (input) representation through weights W^{21} to a hidden layer, and through W^{32} to a feature (output) representation. Averaging over the input statistics guides a change of synaptic coordinates.

Dynamics of synaptic modes

Each input mode α (strength a^α) is linked to an output mode α (strength b^α); the modes undergo cooperative growth, stabilization, and inter-mode competition.

The learning environment is specified by an input-output correlation matrix. In the toy example of Figure 2, the dataset has four items: Canary, Salmon, Oak, and Rose. The two animals share the property that they can Move, while the two plants cannot; in addition, each item has a unique property: can Fly, can Swim, has Bark, and has Petals, respectively. The SVD Σ^{31} = U^{33} S^{31} V^{11T} decomposes the input-output correlations into modes that link a set of coherently covarying properties (output singular vectors in the columns of U) to a set of coherently covarying items (input singular vectors in the rows of V^T); the overall strength of each link is given by the singular values lying along the diagonal of S^{31}. In this toy example, mode 1 distinguishes plants from animals; mode 2, birds from fish; and mode 3, flowers from trees.

We wish to train the network to learn a particular input-output map from a set of P training examples {x^µ, y^µ}, µ = 1, ..., P. The input vector x^µ identifies item µ, while each y^µ is a set of attributes to be associated to this item. Training is accomplished in an online fashion via stochastic gradient descent; each time an example µ is presented, the weights W^{32} and W^{21} are adjusted by a small amount in the direction that minimizes the squared error ||y^µ − W^{32} W^{21} x^µ||^2 between the desired feature output and the network's feature output. This gradient descent procedure yields the learning rule

ΔW^{21} = λ W^{32T} ( y^µ x^{µT} − W^{32} W^{21} x^µ x^{µT} ),   (1)
ΔW^{32} = λ ( y^µ x^{µT} − W^{32} W^{21} x^µ x^{µT} ) W^{21T},   (2)

for each example µ, where λ is a small learning rate. We imagine that training is divided into a sequence of learning epochs, and in each epoch the above rules are followed for all P examples in random order. As long as λ is sufficiently small so that the weights change by only a small amount per learning epoch, we can average (1)-(2) over all P examples and take a continuous time limit to obtain the mean change in weights per learning epoch,

τ d/dt W^{21} = W^{32T} ( Σ^{31} − W^{32} W^{21} Σ^{11} ),   (3)
τ d/dt W^{32} = ( Σ^{31} − W^{32} W^{21} Σ^{11} ) W^{21T},   (4)

where Σ^{11} ≡ E[x x^T] is an N_1 × N_1 input correlation matrix,

Σ^{31} ≡ E[y x^T]   (5)

is an N_3 × N_1 input-output correlation matrix, and τ ≡ 1/λ. Here t measures time in units of learning epochs; as t varies from 0 to 1, the network has seen P examples corresponding to one learning epoch. We note that, although the network we analyze is completely linear with the simple input-output map y = W^{32} W^{21} x, the gradient descent learning dynamics given in Eqns. (3)-(4) are highly nonlinear.

Decomposing the input-output correlations. Our fundamental goal is to understand the dynamics of learning in (3)-(4) as a function of the input statistics Σ^{11} and Σ^{31}. In general, the outcome of learning will reflect an interplay between the perceptual correlations in the input patterns, described by Σ^{11}, and the input-output correlations described by Σ^{31}. To begin, though, we consider the case of orthogonal input representations where each item is designated by a single active input unit, as used by (Rumelhart & Todd, 1993) and (Rogers & McClelland, 2004). In this case, Σ^{11} corresponds to the identity matrix. Under this scenario, the only aspect of the training examples that drives learning is the second order input-output correlation matrix Σ^{31}. We consider its singular value decomposition (SVD)

Σ^{31} = U^{33} S^{31} V^{11T} = ∑_{α=1}^{N_1} s_α u^α v^{αT},   (6)

which will play a central role in understanding how the examples drive learning. The SVD decomposes any rectangular matrix into the product of three matrices: V^{11} is an N_1 × N_1 orthogonal matrix whose columns contain input-analyzing singular vectors v^α that reflect independent modes of variation in the input, U^{33} is an N_3 × N_3 orthogonal matrix whose columns contain output-analyzing singular vectors u^α that reflect independent modes of variation in the output, and S^{31} is an N_3 × N_1 matrix whose only nonzero elements are the singular values s_α, α = 1, ..., N_1, ordered so that s_1 ≥ s_2 ≥ ··· ≥ s_{N_1}, lying along its diagonal. The SVD thus extracts coherently covarying items and properties from the dataset, with the various modes picking out the underlying hierarchy present in the toy environment.

The temporal dynamics of learning. A central result of this work is that we have described the full time course of learning by solving the nonlinear dynamical equations (3)-(4) for orthogonal input representations (Σ^{11} = I), and arbitrary input-output correlation Σ^{31}. In particular, we find a class of exact solutions (whose derivation will be presented elsewhere) for W^{21}(t) and W^{32}(t) such that the composite mapping at any time t is given by

W^{32}(t) W^{21}(t) = ∑_{α=1}^{N_2} a(t, s_α, a_α^0) u^α v^{αT},   (7)

where the function a(t, s, a_0) governing the strength of each input-output mode is given by

a(t, s, a_0) = s e^{2st/τ} / ( e^{2st/τ} − 1 + s/a_0 ).   (8)

Fixed points

• As t → ∞, the weights approach the global minimum W^{32} W^{21} = Σ^{31} (Σ^{11})^{−1} (Baldi & Hornik, 1989; Sanger, 1989)
• Simple end point
• What dynamics occur along the way?

Analytic learning trajectory

• SVD of input-output correlations: Σ^{31} = ∑_α s_α u^α v^{αT}
• Network input-output map: W^{32}(t) W^{21}(t) = ∑_α a(t, s_α, a_α^0) u^α v^{αT}, with a(t, s, a_0) = s e^{2st/τ} / (e^{2st/τ} − 1 + s/a_0), where τ = 1/(learning rate) and a_0 is the initial mode strength
• Starting from decoupled initial conditions, each 'connectivity mode' evolves independently
• Singular value s is learned at a time O(1/s); each input-output mode strength follows a plateau followed by a sudden transition (simulation matches theory)

Saxe, McClelland, Ganguli, ICLR 2014

Deeper networks

• Can generalize to arbitrary-depth networks

• Each effective singular value a evolves independently

τ da/dt = (N_l − 1) a^{2 − 2/(N_l − 1)} (s − a), where τ = 1/(learning rate), s is the singular value, and N_l is the number of layers (see the numerical sketch after this slide)

• In deep networks, combined gradient is O(Nl τ )

a = ∏_{i=1}^{N_l − 1} w_i (the effective mode strength a is the product of the per-layer mode strengths w_1, w_2, ..., w_{N_l−1})
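The formulas above are easy to check numerically. The sketch below (illustrative parameters; not the authors' code) evaluates the exact 3-layer trajectory a(t, s, a_0) of Eq. (8) and integrates the deeper-network ODE by forward Euler; for N_l = 3 the two curves should coincide.

```python
import numpy as np

def a_exact(t, s, a0, tau):
    """Exact 3-layer mode strength a(t, s, a0) from Eq. (8)."""
    e = np.exp(2.0 * s * t / tau)
    return s * e / (e - 1.0 + s / a0)

def a_deep(t_grid, s, a0, tau, n_layers, dt=1e-3):
    """Forward-Euler integration of tau * da/dt = (Nl - 1) a^(2 - 2/(Nl - 1)) (s - a)."""
    a, t, out = a0, 0.0, []
    for t_target in t_grid:
        while t < t_target:
            a += dt / tau * (n_layers - 1) * a ** (2.0 - 2.0 / (n_layers - 1)) * (s - a)
            t += dt
        out.append(a)
    return np.array(out)

s, a0, tau = 3.0, 1e-3, 1.0
t_grid = np.linspace(0.0, 5.0, 6)
print("exact (Nl = 3):", np.round(a_exact(t_grid, s, a0, tau), 4))
print("ODE   (Nl = 3):", np.round(a_deep(t_grid, s, a0, tau, n_layers=3), 4))
# Both trajectories show a long plateau followed by a rapid transition to a = s,
# with the transition time scaling like 1/s.
```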

Learning as a singular mode detection wave

At time t, data singular modes with ŝ > τ/t have been learned; modes with ŝ < τ/t have not.

Teacher Student

The teacher is a shallow linear network with composite map W̄ ≡ W̄^{23} W̄^{12} from N_1 inputs through N̄_2 ≤ N_2 hidden units to N_3 outputs; a student network is trained on data generated by the teacher.

Training data are generated by the teacher with additive noise:

ŷ^µ = W̄ x̂^µ + z^µ
Σ^{11} = ∑_{µ=1}^{P} x̂^µ x̂^{µT} = I_{N_1×N_1}
Σ^{31} = ∑_{µ=1}^{P} ŷ^µ x̂^{µT} = W̄ + Z̃
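A minimal NumPy sketch of this teacher-student data generation (the dimensions, rank-1 teacher, and noise scale are illustrative assumptions): form Σ^{31} = W̄ + Z̃ from noisy teacher outputs and compare its singular values to the teacher's.

```python
import numpy as np

rng = np.random.default_rng(1)

N1, N3 = 100, 100              # input and output dimensions (illustrative)
P = N1                         # one example per input direction (orthogonal inputs)
s_teacher = 4.0                # single nonzero teacher singular value (rank-1 teacher)

u = rng.standard_normal(N3); u /= np.linalg.norm(u)
v = rng.standard_normal(N1); v /= np.linalg.norm(v)
W_bar = s_teacher * np.outer(u, v)               # composite teacher map W_bar

X_hat = np.eye(N1)                               # Sigma_11 = sum_mu x x^T = I
Z = rng.standard_normal((N3, P)) / np.sqrt(N3)   # noise z^mu (illustrative scale)
Y_hat = W_bar @ X_hat + Z                        # labels y^mu = W_bar x^mu + z^mu

Sigma_31 = Y_hat @ X_hat.T                       # = W_bar + Z_tilde

s_bar = np.linalg.svd(W_bar, compute_uv=False)
s_hat = np.linalg.svd(Sigma_31, compute_uv=False)
print("teacher singular values (top 3):      ", np.round(s_bar[:3], 3))
print("training-data singular values (top 3):", np.round(s_hat[:3], 3))
# The top data mode tracks the teacher mode, while the remaining modes form a
# noise bulk; fitting that bulk is what an early-stopped student avoids.
```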

How the teacher is buried in the training data

W̄ = ∑_{α=1}^{N_2} s̄^α ū^α v̄^{αT},    Σ^{31} = ∑_{α=1}^{N_3} ŝ^α û^α v̂^{αT}

s̄: teacher singular value; ŝ: training data singular value

Match between theory and numerics for training and testing error

Rank N student, Rank 1 Teacher, both have one hidden layer.

Test error at the optimal early stopping time is independent of number of student hidden units!

It only depends on the structure of the data, not on the student architecture.

Match between theory and numerics for training and testing error

Rank N student, Rank 1 Teacher, student has 5 layers (3 hidden layers)

Test error at the optimal early stopping time is independent of number of student hidden units!

It only depends on the structure of the data, not on the student architecture.

At a coarse grained level: 3 puzzles of deep learning

Generalization: How can neural networks predict the response to new examples?

A. Saxe, J. McClelland, S. Ganguli, Exact solutions to the nonlinear dynamics of learning in deep neural networks, ICLR 2014.

A. Lampinen, J. McClelland, S. Ganguli, An analytic theory of generalization dynamics and transfer learning in deep linear networks, work in progress.

Expressivity: Why deep? What can a deep neural network “say” that a shallow network cannot?

B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, Exponential expressivity in deep neural networks through transient chaos, NIPS 2016.

Trainability: How can we optimize non-convex loss functions to achieve small training error?

Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice, J. Pennington, S. Schoenholz, and S. Ganguli, NIPS 2017.

The emergence of spectral universality in deep networks, J. Pennington, S. Schoenholz, and S. Ganguli, AISTATS 2018.

Y. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, Y. Bengio, Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, NIPS 2014.

A theory of deep neural expressivity through transient input-output chaos

Stanford Google

Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein

Expressivity: what kinds of functions can a deep network express that shallow networks cannot?

Exponential expressivity in deep neural networks through transient chaos, B. Poole, S. Lahiri,M. Raghu, J. Sohl-Dickstein, S. Ganguli, NIPS 2016.

On the expressive power of deep neural networks, M. Raghu, B. Poole, J. Kleinberg, J. Sohl-Dickstein, S. Ganguli, under review, ICML 2017.

The problem of expressivity

Networks with one hidden layer are universal function approximators.

So why do we need depth?

Overall idea: there exist certain (special?) functions that can be computed:

a) efficiently using a deep network (poly # of neurons in input dimension)

b) but not by a shallow network (requires exponential # of neurons)

Intellectual traditions in boolean circuit theory: the parity function is such a function for boolean circuits.

Seminal works on the expressive power of depth

Nonlinearity Measure of Functional Complexity

Rectified Linear Unit (ReLu) Number of linear regions

There exists a “saw-tooth” function computable by a deep network where the number of linear regions is exponential in the depth.

To approximate this function with a shallow network, one would require exponentially many more neurons.

Guido F. Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks, NIPS 2014.

Seminal works on the expressive power of depth

Nonlinearity Measure of Functional Complexity

Sum-product network Number of monomials

There exists a function computable by a deep network where the number of unique monomials is exponential in the depth.

To approximate this function with a shallow network, one would require exponentially many more neurons.

Olivier Delalleau and Yoshua Bengio. Shallow vs. deep sum-product networks, NIPS 2011.

Questions

The particular functions exhibited in prior work do not seem natural.

Are such functions rare curiosities?

Or is this phenomenon much more generic than these specific examples?

In some sense, is any function computed by a generic deep network not efficiently computable by a shallow network?

If so we would like a theory of deep neural expressivity that demonstrates this for 1) Arbitrary nonlinearities 2) A natural, general measure of functional complexity.

We will combine Riemannian geometry + dynamic mean field theory to show that even in generic, random deep neural networks, measures of functional curvature grow exponentially with depth but not width!

Moreover, the origins of this exponential growth can be traced to chaos theory.

A maximum entropy ensemble of deep random networks

N_l = number of neurons in layer l; D = depth (l = 1, ..., D)

x^l = φ(h^l),   h^l = W^l x^{l−1} + b^l

Structure: i.i.d. random Gaussian weights and biases:

W^l_{ij} ~ N(0, σ_w² / N_{l−1}),   b^l_i ~ N(0, σ_b²)
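A small NumPy sketch of sampling from this ensemble (the widths, σ_w, σ_b, and the tanh nonlinearity are illustrative choices): draw i.i.d. Gaussian weights and biases with the stated variances and propagate an input through the layers, tracking the squared length of the pre-activations.

```python
import numpy as np

def sample_random_net(widths, sigma_w, sigma_b, seed=0):
    """Draw W^l_ij ~ N(0, sigma_w^2 / N_{l-1}) and b^l_i ~ N(0, sigma_b^2) for each layer."""
    rng = np.random.default_rng(seed)
    Ws = [sigma_w / np.sqrt(n_in) * rng.standard_normal((n_out, n_in))
          for n_in, n_out in zip(widths[:-1], widths[1:])]
    bs = [sigma_b * rng.standard_normal(n_out) for n_out in widths[1:]]
    return Ws, bs

def forward(x, Ws, bs, phi=np.tanh):
    """Propagate x^0 through the layers: h^l = W^l x^{l-1} + b^l, x^l = phi(h^l)."""
    hs = []
    for W, b in zip(Ws, bs):
        h = W @ x + b
        hs.append(h)
        x = phi(h)
    return hs

widths = [1000] * 11                              # D = 10 layers of width 1000 (illustrative)
Ws, bs = sample_random_net(widths, sigma_w=2.5, sigma_b=0.3)
x0 = np.random.default_rng(1).standard_normal(widths[0])
hs = forward(x0, Ws, bs)
# The squared pre-activation length q^l = (1/N_l) sum_i (h^l_i)^2 rapidly settles to a fixed value.
print([round(float(np.mean(h ** 2)), 3) for h in hs])
```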

Emergent, deterministic signal propagation in random neural networks

N_l = number of neurons in layer l; D = depth (l = 1, ..., D); x^l = φ(h^l), h^l = W^l x^{l−1} + b^l

Question: how do simple input manifolds propagate through the layers?

A pair of points: Do they become more similar or more different, and how fast?

A smooth manifold: How does its curvature and volume change?

Propagation of two points through a deep network

Two inputs x^{0,1} and x^{0,2}

Do nearby points come closer together or separate?

χ is the mean squared singular value of the Jacobian across one layer:

χ < 1: nearby points come closer together; gradients exponentially vanish
χ > 1: nearby points are driven apart; gradients exponentially explode

For the end-to-end Jacobian J of a depth-D network, (1/N) Tr(J^T J) = χ^D.
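The criterion above can be estimated numerically. This sketch (illustrative parameters, tanh nonlinearity) first iterates the mean-field length map q ← σ_w² E[φ(√q z)²] + σ_b² used in this line of work (an assumption not spelled out on the slide) to its fixed point q*, then estimates χ = σ_w² E[φ′(√q* z)²] by Monte Carlo and reports the ordered vs. chaotic regime.

```python
import numpy as np

def fixed_point_q(sigma_w, sigma_b, phi=np.tanh, n_samples=200_000, n_iter=100, seed=0):
    """Iterate the mean-field length map q <- sigma_w^2 E[phi(sqrt(q) z)^2] + sigma_b^2."""
    z = np.random.default_rng(seed).standard_normal(n_samples)
    q = 1.0
    for _ in range(n_iter):
        q = sigma_w ** 2 * np.mean(phi(np.sqrt(q) * z) ** 2) + sigma_b ** 2
    return q

def chi(sigma_w, sigma_b, dphi=lambda h: 1.0 - np.tanh(h) ** 2, n_samples=200_000, seed=0):
    """chi = sigma_w^2 E[phi'(sqrt(q*) z)^2]: mean squared singular value of one layer's Jacobian."""
    q_star = fixed_point_q(sigma_w, sigma_b)
    z = np.random.default_rng(seed).standard_normal(n_samples)
    return sigma_w ** 2 * np.mean(dphi(np.sqrt(q_star) * z) ** 2)

for sigma_w in (0.9, 1.0, 1.5, 2.5):
    c = chi(sigma_w, sigma_b=0.3)
    regime = "chaotic (nearby points separate)" if c > 1 else "ordered (nearby points converge)"
    print(f"sigma_w = {sigma_w:4.2f}: chi = {c:.3f} -> {regime}")
```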

x^0(θ): a smooth one-parameter input manifold

The geometry of the manifold is captured by the similarity matrix (how similar two points are in internal representation space):

q^l(θ_1, θ_2) = (1/N_l) ∑_{i=1}^{N_l} h_i^l[x^0(θ_1)] h_i^l[x^0(θ_2)]

or by its autocorrelation function: q^l(Δθ) = ∫ dθ q^l(θ, θ + Δθ)

Propagation of a manifold through a deep network

A great circle input manifold: h^1(θ) = √(N_1 q*) [u^0 cos(θ) + u^1 sin(θ)]
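As an empirical counterpart to these definitions, the sketch below (illustrative sizes and gains; not the authors' code) propagates a great circle of inputs through a random tanh network and computes the similarity matrix q^l(θ_1, θ_2) at each layer.

```python
import numpy as np

rng = np.random.default_rng(0)
N, depth = 1000, 8                 # width and depth (illustrative)
sigma_w, sigma_b = 2.5, 0.3        # chaotic-regime tanh parameters (illustrative)
n_theta = 64

# Great-circle input manifold spanned by two random orthonormal directions.
u = np.linalg.qr(rng.standard_normal((N, 2)))[0]
theta = np.linspace(0.0, 2.0 * np.pi, n_theta, endpoint=False)
x = np.sqrt(N) * (np.outer(u[:, 0], np.cos(theta)) + np.outer(u[:, 1], np.sin(theta)))

for l in range(1, depth + 1):
    W = sigma_w / np.sqrt(N) * rng.standard_normal((N, N))
    b = sigma_b * rng.standard_normal((N, 1))
    h = W @ x + b                  # pre-activations h^l for every point on the circle
    q = h.T @ h / N                # similarity matrix q^l(theta_1, theta_2)
    off_diag = (q.sum() - np.trace(q)) / (n_theta * (n_theta - 1))
    print(f"layer {l}: mean q(theta,theta) = {q.diagonal().mean():.2f}, "
          f"mean q(theta_1,theta_2) = {off_diag:.2f}")
    x = np.tanh(h)
# In the chaotic regime the diagonal approaches a fixed point q* while
# off-diagonal similarities decay: points on the circle decorrelate with depth.
```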

Riemannian geometry I: Euclidean length

h(θ): a point on the curve; ∂h(θ)/∂θ: its tangent vector

g^E(θ) = ∂h(θ)/∂θ · ∂h(θ)/∂θ: metric on the manifold coordinate θ induced by the Euclidean metric in internal representation space h.

Length element: dL^E = √(g^E(θ)) dθ; if one moves from θ to θ + dθ along the manifold, then one moves a distance dL^E in internal representation space.

Riemannian geometry II: Extrinsic Gaussian Curvature

h(θ): point on the curve
v(θ) = ∂h(θ)/∂θ: tangent or velocity vector
a(θ) = ∂v(θ)/∂θ: acceleration vector

The velocity and acceleration vectors span a two-dimensional plane in N-dimensional space.

Within this plane, there is a unique circle that touches the curve at h(θ), with the same velocity and acceleration.

The Gaussian curvature κ(θ) is the inverse of the radius of this circle:

κ(θ) = √[ ( (v·v)(a·a) − (v·a)² ) / (v·v)³ ]

v̂(θ) ∈ S^{N−1}: the unit tangent vector at a point on the curve

g^G(θ) = ∂v̂(θ)/∂θ · ∂v̂(θ)/∂θ: metric on the manifold coordinate θ induced by the metric on the Grassmannian; it measures how quickly the unit tangent vector changes.

Length element: dL^G = √(g^G(θ)) dθ; if one moves from θ to θ + dθ along the manifold, then one moves a distance dL^G along the Grassmannian.

Grassmannian length, Gaussian curvature and Euclidean length are related by g^G(θ) = κ(θ)² g^E(θ).

An example: the great circle

A great circle input manifold: h^1(θ) = √(Nq) [u^0 cos(θ) + u^1 sin(θ)]

Euclidean length:      g^E(θ) = Nq,      L^E = 2π √(Nq)
Gaussian curvature:    κ(θ) = 1/√(Nq)
Grassmannian length:   g^G(θ) = 1,       L^G = 2π

Behavior under isotropic linear expansion via a multiplicative stretch χ_1:

L^E → √χ_1 L^E,   κ → κ / √χ_1,   L^G → L^G

χ_1 < 1: Euclidean length contracts, curvature increases, Grassmannian length constant
χ_1 > 1: Euclidean length expands, curvature decreases, Grassmannian length constant

Under linear stretch, as the length increases the curvature decreases, so the Grassmannian length remains invariant.

Theory of curvature propagation in deep networks

ḡ^{E,l} = χ_1 ḡ^{E,l−1},   ḡ^{E,1} = q*,   with χ_1 = σ_w² ∫ Dz [φ′(√(q*) z)]²

(κ̄^l)² = 3 χ_2 / χ_1² + (1/χ_1) (κ̄^{l−1})²,   (κ̄^1)² = 1/q*,   with χ_2 = σ_w² ∫ Dz [φ″(√(q*) z)]²

The term (1/χ_1)(κ̄^{l−1})² is the modification of existing curvature due to stretch; the term 3 χ_2/χ_1² is the addition of new curvature due to the nonlinearity.

Regime            Local stretch   Extrinsic curvature       Grassmannian length
Ordered (χ_1 < 1): contraction     explosion                 constant
Chaotic (χ_1 > 1): expansion       attenuation + addition    exponential growth
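These recursions can be iterated directly; the sketch below (tanh nonlinearity and parameter values are illustrative, with χ_1 and χ_2 estimated by Monte Carlo) starts from the great-circle initial conditions and tracks ḡ^{E,l}, κ̄^l, and the local Grassmannian length density κ̄ √(ḡ^E) with depth.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(500_000)           # Monte Carlo samples for the Gaussian integrals
sigma_w, sigma_b, depth = 2.5, 0.3, 12     # chaotic-regime parameters for tanh (illustrative)

phi = np.tanh
dphi = lambda h: 1.0 - np.tanh(h) ** 2
d2phi = lambda h: -2.0 * np.tanh(h) * (1.0 - np.tanh(h) ** 2)

# Length fixed point q* of the single-input recursion.
q = 1.0
for _ in range(200):
    q = sigma_w ** 2 * np.mean(phi(np.sqrt(q) * z) ** 2) + sigma_b ** 2

chi1 = sigma_w ** 2 * np.mean(dphi(np.sqrt(q) * z) ** 2)
chi2 = sigma_w ** 2 * np.mean(d2phi(np.sqrt(q) * z) ** 2)

gE, kappa2 = q, 1.0 / q        # great-circle initial conditions: g^{E,1} = q*, (kappa^1)^2 = 1/q*
for l in range(2, depth + 1):
    gE = chi1 * gE                                    # stretch of the Euclidean metric
    kappa2 = 3.0 * chi2 / chi1 ** 2 + kappa2 / chi1   # new curvature added + old curvature diluted
    print(f"layer {l:2d}: g^E = {gE:.3e}, kappa = {np.sqrt(kappa2):.3f}, "
          f"kappa * sqrt(g^E) = {np.sqrt(kappa2 * gE):.3e}")
# With chi1 > 1 the Euclidean length grows exponentially while the curvature saturates,
# so the local Grassmannian length density kappa * sqrt(g^E) also grows exponentially.
```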

Curvature propagation: theory and experiment

Unlike linear expansion, deep neural signal propagation can:

1) exponentially expand length, 2) without diluting Gaussian curvature, 3) thereby yielding exponential growth of Grassmannian length.

As a result, the circle will fill space as it winds around at a constant rate of curvature to explore many dimensions!

Exponential expressivity is not achievable by shallow nets

[Figure: a great circle input manifold x^0(θ) propagated through a single hidden layer of width N_1.]

Summary

We have combined Riemannian geometry with dynamical mean field theory to study the emergent deterministic properties of signal propagation in deep nonlinear nets.

We derived analytic recursion relations for Euclidean length, correlations, curvature, and Grassmannian length as simple input manifolds propagate forward through the network.

We obtain an excellent quantitative match between theory and simulations.

Our results reveal the existence of a transient chaotic phase in which the network expands input manifolds without straightening them out, leading to “space filling” curves that explore many dimensions while turning at a constant rate. The number of turns grows exponentially with depth.

Such exponential growth does not happen with width in a shallow net.

Chaotic deep random networks can also take exponentially curved (N−1)-dimensional decision boundaries in the input and flatten them into hyperplane decision boundaries in the final layer: exponential disentangling!

(see Poggio’s talk later today!)

Are such functions rare curiosities?

Or, in some sense, is any function computed by a generic deep network not efficiently computable by a shallow network?

If so we would like a theory of deep neural expressivity that demonstrates this for 1) Arbitrary nonlinearities

2) A natural, general measure of functional complexity.

At a coarse grained level: 3 puzzles of deep learning

Generalization: How can neural networks predict the response to new examples?

A. Saxe, J. McClelland, S. Ganguli, Exact solutions to the nonlinear dynamics of learning in deep neural networks, ICLR 2014.

A. Lampinen, J. McClelland, S. Ganguli, An analytic theory of generalization dynamics and transfer learning in deep linear networks, work in progress.

Expressivity: Why deep? What can a deep neural network “say” that a shallow network cannot?

B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, Exponential expressivity in deep neural networks through transient chaos, NIPS 2016.

Trainability: How can we optimize non-convex loss functions to achieve small training error?

Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice, J. Pennington, S. Schoenholz, and S. Ganguli, NIPS 2017.

The emergence of spectral universality in deep networks, J. Pennington, S. Schoenholz, and S. Ganguli, AISTATS 2018.

Y. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, Y. Bengio, Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, NIPS 2014.

Beyond manifold geometry to entire Jacobian singular value distributions

Andrew Saxe (Harvard), Sam Schoenholz (Google Brain), Jeff Pennington (Google Brain)

Question: how do random initializations and nonlinearities impact learning dynamics?

Exact solutions to the nonlinear dynamics of learning in deep linear networks, A. Saxe, J. McClelland, S. Ganguli, ICLR 2014.

Investigating the learning dynamics of deep neural networks using random matrix theory, J. Pennington, S. Schoenholz, S. Ganguli, ICML 2017.

The emergence of spectral universality in deep networks, J. Pennington, S. Schoenholz, and S. Ganguli, AISTATS 2018.

A Deep network

The network computes h^1 = f(W^1 x), h^2 = f(W^2 h^1), ..., y = f(W^D h^{D−1}), with weights W^1, W^2, ..., W^D, nonlinearity f(x), x ∈ R^{N_1}, h^2 ∈ R^{N_3}, ..., y ∈ R^{N_{D+1}}.

End to end Jacobian: J = F^D W^D ··· F^2 W^2 F^1 W^1, where F^l is the diagonal matrix of derivatives f′(h^l).

Prediction from an analytic theory of nonlinear learning dynamics in deep linear nets:

If you initialize with random orthogonal weights (or rectangular matrices with random singular vectors but all singular values = 1) then:

Learning time, in number of epochs, will be independent of depth even as the depth goes to infinity.

If you initialize with random Gaussian weights then:

Learning time will grow with depth, and cannot remain constant.

Exact solutions to the nonlinear dynamics of learning in deep linear networks, A. Saxe, J. McClelland, S. Ganguli, ICLR 2014.

Theoretical prediction verified: Depth independent training times

• Deep linear networks on MNIST
• Scaled random Gaussian initialization (Glorot & Bengio, 2010)

[Figure: time to criterion and optimal learning rate vs. depth.]

• Pretrained and orthogonal initializations have fast, depth-independent training times!

Random vs orthogonal

• Gaussian preserves norm of random vector on average

[Figure: distribution of singular values of W^{tot} = ∏_{i=1}^{N_l−1} W^i for 1 layer, 5 layer, and 100 layer nets.]

• Attenuates on a subspace of high dimension
• Amplifies on a subspace of low dimension

Random vs orthogonal

• Glorot preserves norm of random vector on average

[Figure: distribution of singular values of W^{tot} = ∏_{i=1}^{N_l−1} W^i for 1 layer, 5 layer, and 100 layer nets.]

• Orthogonal preserves the norm of all vectors exactly: all singular values of W^{tot} = 1
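A quick numerical illustration of this contrast (matrix size and depth are illustrative): build W^{tot} as a product of scaled Gaussian matrices or of orthogonal matrices and inspect the singular values of the product.

```python
import numpy as np

rng = np.random.default_rng(0)
N, depth = 200, 100   # illustrative width and depth

def product_singular_values(kind):
    """Singular values of W_tot = W^depth ... W^2 W^1."""
    W_tot = np.eye(N)
    for _ in range(depth):
        if kind == "gaussian":
            W = rng.standard_normal((N, N)) / np.sqrt(N)      # preserves norm only on average
        else:
            W = np.linalg.qr(rng.standard_normal((N, N)))[0]  # orthogonal: preserves all norms exactly
        W_tot = W @ W_tot
    return np.linalg.svd(W_tot, compute_uv=False)

for kind in ("gaussian", "orthogonal"):
    s = product_singular_values(kind)
    print(f"{kind:10s}: max sv = {s.max():.3e}, median sv = {np.median(s):.3e}, min sv = {s.min():.3e}")
# The Gaussian product's singular values spread over many orders of magnitude,
# while the orthogonal product's singular values are all exactly 1.
```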

Analysis of gradient flow in nonlinear deep nets through free probability

The network computes h^1 = f(W^1 x), h^2 = f(W^2 h^1), ..., y = f(W^D h^{D−1}), with weights W^1, ..., W^D, nonlinearity f(x), x ∈ R^{N_1}, h^2 ∈ R^{N_3}, ..., y ∈ R^{N_{D+1}}.

End to end Jacobian: J = F^D W^D ··· F^2 W^2 F^1 W^1

Free probability theory for random matrix products: if σ(A) and σ(B) are the spectra of freely independent A and B, the spectrum σ(AB) follows from the S-transforms via S_{AB}(z) = S_A(z) S_B(z); applied layer by layer to the end-to-end Jacobian, the relevant S-transform is the D-fold product [ S_F(z) S_W(z) ]^D.

Beyond mean squared singular value: free probability analysis of all the Jacobian singular values

The network computes h^1 = f(W^1 x), ..., y = f(W^D h^{D−1}), with x ∈ R^{N_1} and y ∈ R^{N_{D+1}}. The end to end Jacobian J = F^D W^D ··· F^1 W^1 is a product of random matrices, and (1/N) Tr(J^T J) = χ^D gives only its mean squared singular value. If A and B are freely independent, then the spectrum of the product AB can be computed using the S-transform: S_{AB}(z) = S_A(z) S_B(z).
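To see these distributions empirically, the sketch below (widths, gains, and the tanh nonlinearity are illustrative, and the gain is only roughly near criticality) assembles the end-to-end Jacobian J = F^D W^D ··· F^1 W^1 of a random network by the chain rule and compares its singular values for Gaussian versus orthogonal weights.

```python
import numpy as np

def end_to_end_jacobian_svs(N=300, depth=30, sigma_w=1.05, sigma_b=0.1,
                            orthogonal=False, seed=0):
    """Singular values of J = F^D W^D ... F^1 W^1 for a random tanh network."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(N)
    J = np.eye(N)
    for _ in range(depth):
        if orthogonal:
            W = sigma_w * np.linalg.qr(rng.standard_normal((N, N)))[0]
        else:
            W = sigma_w / np.sqrt(N) * rng.standard_normal((N, N))
        b = sigma_b * rng.standard_normal(N)
        h = W @ x + b
        F = np.diag(1.0 - np.tanh(h) ** 2)   # F^l = diag(phi'(h^l)) for phi = tanh
        J = F @ W @ J                        # chain rule: accumulate this layer's Jacobian
        x = np.tanh(h)
    return np.linalg.svd(J, compute_uv=False)

for orthogonal in (False, True):
    s = end_to_end_jacobian_svs(orthogonal=orthogonal)
    kind = "orthogonal" if orthogonal else "gaussian"
    print(f"{kind:10s}: max sv = {s.max():.3f}, median sv = {np.median(s):.3f}, "
          f"fraction in [0.5, 2] = {np.mean((s > 0.5) & (s < 2)):.2f}")
# The theory summarized below predicts that the orthogonal tanh Jacobian stays far
# better conditioned than the Gaussian one as depth grows.
```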

Free probability analysis of Jacobian singular values

Scaling properties as the depth D goes to infinity (fraction of singular values within 1−ε to 1+ε; maximum singular value):

          Gaussian          Orthogonal
Linear:   1/D,  D           1,    1
ReLU:     1/D,  D           1/D,  D
Tanh:     1/D,  D           O(1), O(1)

Free probability analysis of Jacobian singular values

Example: linear network Gaussian weights at critical gain = 1

D = 2 D = 10

As depth D increases, the tail spreads out over an extent O(D) and the "middle" around 1 falls off as O(1/D).

Free probability analysis of Jacobian singular values

Example: ReLU network with orthogonal weights at critical gain = √2

D = 3 D = 10

As depth D increases, the tail spreads out over an extent O(D) and the "middle" around 1 falls off as O(1/D).

Example: tanh network with orthogonal weights at critical gain = 1

[Figure: histograms of the singular values of J at depth D = 100 for gains g = 0.9, 0.95, 1, 1.05, 1.1 and q = 0.2, 1, 4.]

As depth D increases, the entire singular value distribution stabilizes to a well defined limit distribution!

There is no extending tail that grows with depth!

The emergence of spectral universality in deep networks, J. Pennington, S. Schoenholz, and S. Ganguli, AISTATS 2018.

Free probability analysis of Jacobian singular values

[Figure panels: Gaussian W with any f; orthogonal W with ReLU; orthogonal W with tanh, σ_w ≫ 1; orthogonal W with tanh, σ_w ~ 1 + 1/L.]

Theorem: For Gaussian weights, no nonlinearity can achieve dynamical isometry.

Theorem: For ReLU, no random weights can achieve dynamical isometry.

Free probability analysis of Jacobian singular values

Scaling properties as the depth D goes to infinity:

Fraction of singular values within 1−ε to 1+ε; maximum singular value:

          Gaussian          Orthogonal
Linear:   1/L,  L           1,    1
ReLU:     1/L,  L           1/L,  L
Tanh:     1/L,  L           O(1), O(1)

Theoretical prediction: sigmoidal networks with orthogonal weights can learn faster than ReLU networks with orthogonal weights.

Learning speed: with orthogonal weights, sigmoidal can outperform ReLU

[Figure: test accuracy on CIFAR-10 for tanh and ReLU networks with orthogonal and Gaussian initializations; with orthogonal weights, tanh outperforms ReLU.]

Training time is sublinear in depth

A new scaling relation governing learning time as a function of depth

Optimal learning rate ~ 1/depth; training time ~ sqrt(depth)

Summary

An order to chaos phase transition governs the dynamics of random deep networks, often used for initialization.

Not all networks at the edge of chaos (with neither vanishing nor exploding gradients) are created equal.

The entire Jacobian singular value distribution, and not just its second moment, impacts learning speed.

We introduced free probability theory to deep learning to compute this entire distribution.

We found that tanh networks with orthogonal weights have well-conditioned Jacobians, but ReLU networks with orthogonal weights, or any network with Gaussian weights, do not.

Correspondingly, we found that with orthogonal weights, tanh networks learn faster than ReLU networks.

Controlling the entire singular value distribution at initialization may be an important architectural design principle in deep learning.

References

• M. Advani and S. Ganguli, An equivalence between high dimensional Bayes optimal inference and M-estimation, NIPS 2016.
• M. Advani and S. Ganguli, Statistical mechanics of optimal convex inference in high dimensions, Physical Review X, 6, 031034, 2016.
• A. Saxe, J. McClelland, S. Ganguli, Learning hierarchical category structure in deep neural networks, Proc. of the 35th Cognitive Science Society, pp. 1271-1276, 2013.
• A. Saxe, J. McClelland, S. Ganguli, Exact solutions to the nonlinear dynamics of learning in deep neural networks, ICLR 2014.
• Y. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, Y. Bengio, Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, NIPS 2014.
• B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, Exponential expressivity in deep neural networks through transient chaos, NIPS 2016.
• S. Schoenholz, J. Gilmer, S. Ganguli, and J. Sohl-Dickstein, Deep information propagation, https://arxiv.org/abs/1611.01232, under review at ICLR 2017.
• S. Lahiri, J. Sohl-Dickstein and S. Ganguli, A universal tradeoff between energy speed and accuracy in physical communication, arXiv:1603.07758.
• S. Lahiri and S. Ganguli, A memory frontier for complex synapses, NIPS 2013.
• F. Zenke, B. Poole, S. Ganguli, Continual learning through synaptic intelligence, ICML 2017.
• J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, S. Ganguli, Modelling arbitrary probability distributions using non-equilibrium thermodynamics, ICML 2015.
• C. Piech, J. Bassen, J. Huang, S. Ganguli, M. Sahami, L. Guibas, J. Sohl-Dickstein, Deep Knowledge Tracing, NIPS 2015.
• L. McIntosh, N. Maheswaranathan, S. Ganguli, S. Baccus, Deep learning models of the retinal response to natural scenes, NIPS 2016.
• J. Pennington, S. Schoenholz, and S. Ganguli, Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice, NIPS 2017.
• A. Goyal, N.R. Ke, S. Ganguli, Y. Bengio, Variational walkback: learning a transition operator as a recurrent stochastic neural net, NIPS 2017.
• J. Pennington, S. Schoenholz, and S. Ganguli, The emergence of spectral universality in deep networks, AISTATS 2018.

http://ganguli-gang.stanford.edu Twitter: @SuryaGanguli