Theories of deep learning: generalization, expressivity, and training

Surya Ganguli

Dept. of Applied Physics, Neurobiology, and Electrical Engineering

Stanford University

Funding: Bio-X Neuroventures, NIH, Burroughs Wellcome, Office of Naval Research, Genentech Foundation, Simons Foundation, James S. McDonnell Foundation, Sloan Foundation, McKnight Foundation, Swartz Foundation, National Science Foundation, Stanford Terman Award

http://ganguli-gang.stanford.edu Twitter: @SuryaGanguli

An interesting artificial neural circuit for image classification

Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, NIPS 2012

References: http://ganguli-gang.stanford.edu

• M. Advani and S. Ganguli, An equivalence between high dimensional Bayes optimal inference and M-estimation, NIPS 2016.
• M. Advani and S. Ganguli, Statistical mechanics of optimal convex inference in high dimensions, Physical Review X, 6, 031034, 2016.
• A. Saxe, J. McClelland, S. Ganguli, Learning hierarchical category structure in deep neural networks, Proc. of the 35th Cognitive Science Society, pp. 1271-1276, 2013.
• A. Saxe, J. McClelland, S. Ganguli, Exact solutions to the nonlinear dynamics of learning in deep neural networks, ICLR 2014.
• Y. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, Y. Bengio, Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, NIPS 2014.
• B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, Exponential expressivity in deep neural networks through transient chaos, NIPS 2016.
• S. Schoenholz, J. Gilmer, S. Ganguli, and J. Sohl-Dickstein, Deep information propagation, ICLR 2017.
• S. Lahiri, J. Sohl-Dickstein and S. Ganguli, A universal tradeoff between energy speed and accuracy in physical communication, arXiv:1603.07758.
• S. Lahiri and S. Ganguli, A memory frontier for complex synapses, NIPS 2013.
• F. Zenke, B. Poole, S. Ganguli, Continual learning through synaptic intelligence, ICML 2017.
• J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, S. Ganguli, Modelling arbitrary probability distributions using non-equilibrium thermodynamics, ICML 2015.
• C. Piech, J. Bassen, J. Huang, S. Ganguli, M. Sahami, L. Guibas, J. Sohl-Dickstein, Deep Knowledge Tracing, NIPS 2015.
• L. McIntosh, N. Maheswaranathan, S. Ganguli, S. Baccus, Deep learning models of the retinal response to natural scenes, NIPS 2016.
• J. Pennington, S. Schoenholz, and S. Ganguli, Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice, NIPS 2017.
• A. Goyal, N.R. Ke, S. Ganguli, Y. Bengio, Variational walkback: learning a transition operator as a recurrent stochastic neural net, NIPS 2017.
• J. Pennington, S. Schoenholz, and S. Ganguli, The emergence of spectral universality in deep networks, AISTATS 2018.

Tools: non-equilibrium statistical mechanics, Riemannian geometry, dynamical mean field theory, random matrix theory, statistical mechanics of random landscapes, free probability theory

At a coarse grained level: 3 puzzles of deep learning

Generalization: How can neural networks predict the response to new examples?

A. Saxe, J. McClelland, S. Ganguli, Exact solutions to the nonlinear dynamics of learning in deep neural networks, ICLR 2014.

A. Lampinen, J. McClelland, S. Ganguli, An analytic theory of generalization dynamics and transfer learning in deep linear networks, work in progress.

Expressivity: Why deep? What can a deep neural network “say” that a shallow network cannot?

B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, Exponential expressivity in deep neural networks through transient chaos, NIPS 2016.

Trainability: How can we optimize non-convex loss functions to achieve small training error?

Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice, J. Pennington, S. Schoenholz, and S. Ganguli, NIPS 2017.

The emergence of spectral universality in deep networks, J. Pennington, S. Schoenholz, and S. Ganguli, AISTATS 2018.

Y. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, Y. Bengio, Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, NIPS 2014.

Learning dynamics of training and testing error

Andrew Saxe (Harvard), Andrew Lampinen (Stanford), Jay McClelland (Stanford)

Learning dynamics of training and testing error

The dynamics of learning in deep nonlinear networks are complex:

Training error (vs. training time): non-monotonicity, with plateaus followed by sudden transitions to lower error.

Test error (vs. training time): overfitting to the training examples, leading to bad predictions on new examples.

Deep network

• Little hope for a complete theory with arbitrary nonlinearities

The network computes h^1 = f(W^1 x), h^2 = f(W^2 h^1), ..., y = f(W^D h^{D-1}), with weight matrices W^1, W^2, ..., W^D and elementwise nonlinearity f(x), where x ∈ R^{N_1}, h^2 ∈ R^{N_3}, ..., y ∈ R^{N_{D+1}}.

Deep linear network

• Use a deep linear network as a starting point

The network computes h^1 = f(W^1 x), h^2 = f(W^2 h^1), ..., y = f(W^D h^{D-1}), with weights W^1, W^2, ..., W^D, nonlinearity f(x), x ∈ R^{N_1}, h^2 ∈ R^{N_3}, ..., y ∈ R^{N_{D+1}}; in the linear case f is the identity.

Final Report: Convergence properties of deep linear networks

Andrew Saxe Christopher Baldassano [email protected] [email protected]

1 Introduction

Deep learning approaches have realized remarkable performance across a range of application areas in machine learning, from computer vision [1, 2] to speech recognition [3] and natural language processing [4], but the complexity of deep nonlinear networks has made it difficult to develop a comprehensive theoretical understanding of deep learning. For example, the necessary conditions for convergence, the speed of convergence, and optimal methods for initialization are based primarily on empirical results without much theoretical support. As a first step in understanding the learning dynamics of deep nonlinear networks, we can analyze deep linear networks which compute y = W^D W^{D-1} ··· W^2 W^1 x, where x, y are input and output vectors respectively, and the W^i are D weight matrices in this D+1 layer deep network. Although these networks are no more expressive than a single linear map y = Wx (and therefore unlikely to yield high accuracy in practice), we have previously shown [5] that they do exhibit nonlinear learning dynamics similar to those observed in nonlinear networks. By precisely characterizing how the weight matrices evolve in linear networks, we may gain insight into the properties of nonlinear networks with simple nonlinearities (such as rectified linear units). In this progress report, we show preliminary results for continuous batch gradient descent, in which the gradient step size is assumed to be small enough to take a continuous time limit. By the end of the project, we hope to obtain similar results for discrete batch gradient descent (with a discrete step size) and stochastic (online) gradient descent.

Deep linear network

• Input-output map: always linear, y = ( ∏_{i=1}^{D} W^i ) x ≡ W^{tot} x
• Gradient descent dynamics: nonlinear, coupled, nonconvex
• Useful for studying learning dynamics, not representational power

2 Preliminaries and Previous Work

A deep linear network maps input vectors x to output vectors y = ( ∏_{i=1}^{D} W^i ) x ≡ Wx. We wish to minimize the squared error on the training set {x^µ, y^µ}_{µ=1}^{P}, l(W) = ∑_{µ=1}^{P} ||y^µ − Wx^µ||^2. The batch gradient descent update for a layer l is

ΔW^l = λ ∑_{µ=1}^{P} ( ∏_{i=l+1}^{D} W^i )^T [ y^µ x^{µT} − ( ∏_{i=1}^{D} W^i ) x^µ x^{µT} ] ( ∏_{i=1}^{l−1} W^i )^T,   (1)

where ∏_{i=a}^{b} W^i = W^b W^{b−1} ··· W^{a+1} W^a, with the caveat that ∏_{i=a}^{b} W^i = I if a > b.

The minimizing W can be found analytically by setting the derivative of the loss to zero:

∑_{µ=1}^{P} ( y^µ − Wx^µ ) x^{µT} = 0.   (2)

Let Σ^{xx} ≡ ∑_{µ=1}^{P} x^µ x^{µT} be the input correlation matrix, and Σ^{yx} ≡ ∑_{µ=1}^{P} y^µ x^{µT} be the input-output correlation matrix. The optimal W is

W* = Σ^{yx} (Σ^{xx})^{−1}.   (3)
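To make the batch update and the analytic optimum concrete, here is a minimal NumPy sketch (not from the report; the dataset, sizes, learning rate, and epoch count are illustrative assumptions) that runs full-batch gradient descent on a three-layer linear network and compares the learned composite map to W* = Σ^{yx}(Σ^{xx})^{−1}.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: P examples, inputs of dimension N_in, linear targets of dimension N_out.
P, N_in, N_hidden, N_out = 200, 10, 10, 5
X = rng.standard_normal((N_in, P))
W_true = rng.standard_normal((N_out, N_in))
Y = W_true @ X

# Three-layer linear network y = W2 W1 x with small random initialization.
W1 = 0.01 * rng.standard_normal((N_hidden, N_in))
W2 = 0.01 * rng.standard_normal((N_out, N_hidden))

lr = 0.05
for epoch in range(5000):
    E = Y - W2 @ W1 @ X              # residual on the full batch
    dW2 = E @ (W1 @ X).T / P         # gradient of 0.5*||Y - W2 W1 X||^2 wrt W2 (scaled by 1/P)
    dW1 = W2.T @ E @ X.T / P         # gradient wrt W1
    W2 += lr * dW2
    W1 += lr * dW1

# Analytic optimum from Eq. (3): W* = Sigma_yx (Sigma_xx)^{-1}
Sigma_xx = X @ X.T
Sigma_yx = Y @ X.T
W_star = Sigma_yx @ np.linalg.inv(Sigma_xx)

print("distance from optimum:", np.linalg.norm(W2 @ W1 - W_star))
```

Running it, the training loss exhibits the plateau-then-transition behavior discussed next, even though the end point is the simple linear regression solution.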

Nontrivial learning dynamics

Plateaus and sudden transitions; faster convergence from pretrained initial conditions

[Figure: training error vs. epochs (0-500) for random initial conditions vs. pretrained initial conditions, showing plateaus with sudden transitions and faster convergence from the pretrained initialization.]

• Build intuitions for the nonlinear case by analyzing the linear case

Nonlinear learning dynamics in a 3 layer linear net

A three-layer linear network with layer sizes N_1, N_2, N_3 maps an object (input) representation through weights W^{21} to a hidden layer, and through W^{32} to a feature (output) representation. Averaging over the input statistics guides a change of synaptic coordinates.

Dynamics of synaptic modes

Each input mode α (strength a^α) is linked to an output mode α (strength b^α); the modes undergo cooperative growth, stabilization, and inter-mode competition.

The learning environment is specified by an input-output correlation matrix. In the toy example of Figure 2, the dataset has four items: Canary, Salmon, Oak, and Rose. The two animals share the property that they can Move, while the two plants cannot; in addition, each item has a unique property: can Fly, can Swim, has Bark, and has Petals, respectively. The SVD Σ^{31} = U^{33} S^{31} V^{11T} decomposes the input-output correlations into modes that link a set of coherently covarying properties (output singular vectors in the columns of U) to a set of coherently covarying items (input singular vectors in the rows of V^T); the overall strength of each link is given by the singular values lying along the diagonal of S^{31}. In this toy example, mode 1 distinguishes plants from animals; mode 2, birds from fish; and mode 3, flowers from trees.

We wish to train the network to learn a particular input-output map from a set of P training examples {x^µ, y^µ}, µ = 1, ..., P. The input vector x^µ identifies item µ, while each y^µ is a set of attributes to be associated to this item. Training is accomplished in an online fashion via stochastic gradient descent; each time an example µ is presented, the weights W^{32} and W^{21} are adjusted by a small amount in the direction that minimizes the squared error ||y^µ − W^{32} W^{21} x^µ||^2 between the desired feature output and the network's feature output. This gradient descent procedure yields the learning rule

ΔW^{21} = λ W^{32T} ( y^µ x^{µT} − W^{32} W^{21} x^µ x^{µT} ),   (1)
ΔW^{32} = λ ( y^µ x^{µT} − W^{32} W^{21} x^µ x^{µT} ) W^{21T},   (2)

for each example µ, where λ is a small learning rate. We imagine that training is divided into a sequence of learning epochs, and in each epoch the above rules are followed for all P examples in random order. As long as λ is sufficiently small so that the weights change by only a small amount per learning epoch, we can average (1)-(2) over all P examples and take a continuous time limit to obtain the mean change in weights per learning epoch,

τ d/dt W^{21} = W^{32T} ( Σ^{31} − W^{32} W^{21} Σ^{11} ),   (3)
τ d/dt W^{32} = ( Σ^{31} − W^{32} W^{21} Σ^{11} ) W^{21T},   (4)

where Σ^{11} ≡ E[x x^T] is an N_1 × N_1 input correlation matrix,

Σ^{31} ≡ E[y x^T]   (5)

is an N_3 × N_1 input-output correlation matrix, and τ ≡ 1/λ. Here t measures time in units of learning epochs; as t varies from 0 to 1, the network has seen P examples corresponding to one learning epoch. We note that, although the network we analyze is completely linear with the simple input-output map y = W^{32} W^{21} x, the gradient descent learning dynamics given in Eqns. (3)-(4) are highly nonlinear.

Decomposing the input-output correlations. Our fundamental goal is to understand the dynamics of learning in (3)-(4) as a function of the input statistics Σ^{11} and Σ^{31}. In general, the outcome of learning will reflect an interplay between the perceptual correlations in the input patterns, described by Σ^{11}, and the input-output correlations described by Σ^{31}. To begin, though, we consider the case of orthogonal input representations where each item is designated by a single active input unit, as used by (Rumelhart & Todd, 1993) and (Rogers & McClelland, 2004). In this case, Σ^{11} corresponds to the identity matrix. Under this scenario, the only aspect of the training examples that drives learning is the second order input-output correlation matrix Σ^{31}. We consider its singular value decomposition (SVD)

Σ^{31} = U^{33} S^{31} V^{11T} = ∑_{α=1}^{N_1} s_α u^α v^{αT},   (6)

which will play a central role in understanding how the examples drive learning. The SVD decomposes any rectangular matrix into the product of three matrices: V^{11} is an N_1 × N_1 orthogonal matrix whose columns contain input-analyzing singular vectors v^α that reflect independent modes of variation in the input, U^{33} is an N_3 × N_3 orthogonal matrix whose columns contain output-analyzing singular vectors u^α that reflect independent modes of variation in the output, and S^{31} is an N_3 × N_1 matrix whose only nonzero elements are the singular values s_α, α = 1, ..., N_1, ordered so that s_1 ≥ s_2 ≥ ··· ≥ s_{N_1}, lying along its diagonal. The SVD thus extracts coherently covarying items and properties from the dataset, with the various modes picking out the underlying hierarchy present in the toy environment.

The temporal dynamics of learning. A central result of this work is that we have described the full time course of learning by solving the nonlinear dynamical equations (3)-(4) for orthogonal input representations (Σ^{11} = I), and arbitrary input-output correlation Σ^{31}. In particular, we find a class of exact solutions (whose derivation will be presented elsewhere) for W^{21}(t) and W^{32}(t) such that the composite mapping at any time t is given by

W^{32}(t) W^{21}(t) = ∑_{α=1}^{N_2} a(t, s_α, a_α^0) u^α v^{αT},   (7)

where the function a(t, s, a_0) governing the strength of each input-output mode is given by

a(t, s, a_0) = s e^{2st/τ} / ( e^{2st/τ} − 1 + s/a_0 ).   (8)

Fixed points

• As t → ∞, the weights approach the global minimum W^{32} W^{21} = Σ^{31} (Σ^{11})^{−1} (Baldi & Hornik, 1989; Sanger, 1989)
• Simple end point
• What dynamics occur along the way?

Analytic learning trajectory

• SVD of input-output correlations: Σ^{31} = ∑_α s_α u^α v^{αT}
• Network input-output map: W^{32}(t) W^{21}(t) = ∑_α a(t, s_α, a_α^0) u^α v^{αT}, with a(t, s, a_0) = s e^{2st/τ} / (e^{2st/τ} − 1 + s/a_0), where τ = 1/(learning rate) and a_0 is the initial mode strength
• Starting from decoupled initial conditions, each 'connectivity mode' evolves independently
• Singular value s is learned at a time O(1/s); each input-output mode strength follows a plateau followed by a sudden transition (simulation matches theory)

Saxe, McClelland, Ganguli, ICLR 2014

Deeper networks

• Can generalize to arbitrary-depth networks

• Each effective singular value a evolves independently

τ da/dt = (N_l − 1) a^{2 − 2/(N_l − 1)} (s − a), where τ = 1/(learning rate), s is the singular value, and N_l is the number of layers (see the numerical sketch after this slide)

• In deep networks, combined gradient is O(Nl τ )

a = ∏_{i=1}^{N_l − 1} w_i (the effective mode strength a is the product of the per-layer mode strengths w_1, w_2, ..., w_{N_l−1})
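The formulas above are easy to check numerically. The sketch below (illustrative parameters; not the authors' code) evaluates the exact 3-layer trajectory a(t, s, a_0) of Eq. (8) and integrates the deeper-network ODE by forward Euler; for N_l = 3 the two curves should coincide.

```python
import numpy as np

def a_exact(t, s, a0, tau):
    """Exact 3-layer mode strength a(t, s, a0) from Eq. (8)."""
    e = np.exp(2.0 * s * t / tau)
    return s * e / (e - 1.0 + s / a0)

def a_deep(t_grid, s, a0, tau, n_layers, dt=1e-3):
    """Forward-Euler integration of tau * da/dt = (Nl - 1) a^(2 - 2/(Nl - 1)) (s - a)."""
    a, t, out = a0, 0.0, []
    for t_target in t_grid:
        while t < t_target:
            a += dt / tau * (n_layers - 1) * a ** (2.0 - 2.0 / (n_layers - 1)) * (s - a)
            t += dt
        out.append(a)
    return np.array(out)

s, a0, tau = 3.0, 1e-3, 1.0
t_grid = np.linspace(0.0, 5.0, 6)
print("exact (Nl = 3):", np.round(a_exact(t_grid, s, a0, tau), 4))
print("ODE   (Nl = 3):", np.round(a_deep(t_grid, s, a0, tau, n_layers=3), 4))
# Both trajectories show a long plateau followed by a rapid transition to a = s,
# with the transition time scaling like 1/s.
```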

Learning as a singular mode detection wave

At time t, data singular modes with ŝ > τ/t have been learned; modes with ŝ < τ/t have not.

Teacher Student

The teacher is a shallow linear network with composite map W̄ ≡ W̄^{23} W̄^{12} from N_1 inputs through N̄_2 ≤ N_2 hidden units to N_3 outputs; a student network is trained on data generated by the teacher.

Training data are generated by the teacher with additive noise:

ŷ^µ = W̄ x̂^µ + z^µ
Σ^{11} = ∑_{µ=1}^{P} x̂^µ x̂^{µT} = I_{N_1×N_1}
Σ^{31} = ∑_{µ=1}^{P} ŷ^µ x̂^{µT} = W̄ + Z̃
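A minimal NumPy sketch of this teacher-student data generation (the dimensions, rank-1 teacher, and noise scale are illustrative assumptions): form Σ^{31} = W̄ + Z̃ from noisy teacher outputs and compare its singular values to the teacher's.

```python
import numpy as np

rng = np.random.default_rng(1)

N1, N3 = 100, 100              # input and output dimensions (illustrative)
P = N1                         # one example per input direction (orthogonal inputs)
s_teacher = 4.0                # single nonzero teacher singular value (rank-1 teacher)

u = rng.standard_normal(N3); u /= np.linalg.norm(u)
v = rng.standard_normal(N1); v /= np.linalg.norm(v)
W_bar = s_teacher * np.outer(u, v)               # composite teacher map W_bar

X_hat = np.eye(N1)                               # Sigma_11 = sum_mu x x^T = I
Z = rng.standard_normal((N3, P)) / np.sqrt(N3)   # noise z^mu (illustrative scale)
Y_hat = W_bar @ X_hat + Z                        # labels y^mu = W_bar x^mu + z^mu

Sigma_31 = Y_hat @ X_hat.T                       # = W_bar + Z_tilde

s_bar = np.linalg.svd(W_bar, compute_uv=False)
s_hat = np.linalg.svd(Sigma_31, compute_uv=False)
print("teacher singular values (top 3):      ", np.round(s_bar[:3], 3))
print("training-data singular values (top 3):", np.round(s_hat[:3], 3))
# The top data mode tracks the teacher mode, while the remaining modes form a
# noise bulk; fitting that bulk is what an early-stopped student avoids.
```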

How the teacher is buried in the training data

W̄ = ∑_{α=1}^{N_2} s̄^α ū^α v̄^{αT},    Σ^{31} = ∑_{α=1}^{N_3} ŝ^α û^α v̂^{αT}

s̄: teacher singular value; ŝ: training data singular value

Match between theory and numerics for training and testing error

Rank N student, Rank 1 Teacher, both have one hidden layer.

Test error at the optimal early stopping time is independent of number of student hidden units!

It only depends on the structure of the data, not on the student architecture.

Match between theory and numerics for training and testing error

Rank N student, Rank 1 Teacher, student has 5 layers (3 hidden layers)

Test error at the optimal early stopping time is independent of number of student hidden units!

It only depends on the structure of the data, not on the student architecture.

At a coarse grained level: 3 puzzles of deep learning

Generalization: How can neural networks predict the response to new examples?

A. Saxe, J. McClelland, S. Ganguli, Exact solutions to the nonlinear dynamics of learning in deep neural networks, ICLR 2014.

A. Lampinen, J. McClelland, S. Ganguli, An analytic theory of generalization dynamics and transfer learning in deep linear networks, work in progress.

Expressivity: Why deep? What can a deep neural network “say” that a shallow network cannot?

B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, Exponential expressivity in deep neural networks through transient chaos, NIPS 2016.

Trainability: How can we optimize non-convex loss functions to achieve small training error?

Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice, J. Pennington, S. Schoenholz, and S. Ganguli, NIPS 2017.

The emergence of spectral universality in deep networks, J. Pennington, S. Schoenholz, and S. Ganguli, AISTATS 2018.

Y. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, Y. Bengio, Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, NIPS 2014.

A theory of deep neural expressivity through transient input-output chaos

Stanford Google

Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein

Expressivity: what kinds of functions can a deep network express that shallow networks cannot?

Exponential expressivity in deep neural networks through transient chaos, B. Poole, S. Lahiri,M. Raghu, J. Sohl-Dickstein, S. Ganguli, NIPS 2016.

On the expressive power of deep neural networks, M. Raghu, B. Poole, J. Kleinberg, J. Sohl-Dickstein, S. Ganguli, under review, ICML 2017.

The problem of expressivity

Networks with one hidden layer are universal function approximators.

So why do we need depth?

Overall idea: there exist certain (special?) functions that can be computed:

a) efficiently using a deep network (poly # of neurons in input dimension)

b) but not by a shallow network (requires exponential # of neurons)

Intellectual traditions in boolean circuit theory: the parity function is such a function for boolean circuits.

Seminal works on the expressive power of depth

Nonlinearity Measure of Functional Complexity

Rectified Linear Unit (ReLu) Number of linear regions

There exists a “saw-tooth” function computable by a deep network where the number of linear regions is exponential in the depth.

To approximate this function with a shallow network, one would require exponentially many more neurons.

Guido F. Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks, NIPS 2014.

Seminal works on the expressive power of depth

Nonlinearity Measure of Functional Complexity

Sum-product network Number of monomials

There exists a function computable by a deep network where the number of unique monomials is exponential in the depth.

To approximate this function with a shallow network, one would require exponentially many more neurons.

Olivier Delalleau and Yoshua Bengio. Shallow vs. deep sum-product networks, NIPS 2011.

Questions

The particular functions exhibited in prior work do not seem natural.

Are such functions rare curiosities?

Or is this phenomenon much more generic than these specific examples?

In some sense, is any function computed by a generic deep network not efficiently computable by a shallow network?

If so we would like a theory of deep neural expressivity that demonstrates this for 1) Arbitrary nonlinearities 2) A natural, general measure of functional complexity.

We will combine Riemannian geometry + dynamic mean field theory to show that even in generic, random deep neural networks, measures of functional curvature grow exponentially with depth but not width!

Moreover, the origins of this exponential growth can be traced to chaos theory.

A maximum entropy ensemble of deep random networks

N_l = number of neurons in layer l; D = depth (l = 1, ..., D)

x^l = φ(h^l),   h^l = W^l x^{l−1} + b^l

Structure: i.i.d. random Gaussian weights and biases:

W^l_{ij} ~ N(0, σ_w² / N_{l−1}),   b^l_i ~ N(0, σ_b²)
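A small NumPy sketch of sampling from this ensemble (the widths, σ_w, σ_b, and the tanh nonlinearity are illustrative choices): draw i.i.d. Gaussian weights and biases with the stated variances and propagate an input through the layers, tracking the squared length of the pre-activations.

```python
import numpy as np

def sample_random_net(widths, sigma_w, sigma_b, seed=0):
    """Draw W^l_ij ~ N(0, sigma_w^2 / N_{l-1}) and b^l_i ~ N(0, sigma_b^2) for each layer."""
    rng = np.random.default_rng(seed)
    Ws = [sigma_w / np.sqrt(n_in) * rng.standard_normal((n_out, n_in))
          for n_in, n_out in zip(widths[:-1], widths[1:])]
    bs = [sigma_b * rng.standard_normal(n_out) for n_out in widths[1:]]
    return Ws, bs

def forward(x, Ws, bs, phi=np.tanh):
    """Propagate x^0 through the layers: h^l = W^l x^{l-1} + b^l, x^l = phi(h^l)."""
    hs = []
    for W, b in zip(Ws, bs):
        h = W @ x + b
        hs.append(h)
        x = phi(h)
    return hs

widths = [1000] * 11                              # D = 10 layers of width 1000 (illustrative)
Ws, bs = sample_random_net(widths, sigma_w=2.5, sigma_b=0.3)
x0 = np.random.default_rng(1).standard_normal(widths[0])
hs = forward(x0, Ws, bs)
# The squared pre-activation length q^l = (1/N_l) sum_i (h^l_i)^2 rapidly settles to a fixed value.
print([round(float(np.mean(h ** 2)), 3) for h in hs])
```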

Emergent, deterministic signal propagation in random neural networks

N_l = number of neurons in layer l; D = depth (l = 1, ..., D); x^l = φ(h^l), h^l = W^l x^{l−1} + b^l

Question: how do simple input manifolds propagate through the layers?

A pair of points: Do they become more similar or more different, and how fast?

A smooth manifold: How does its curvature and volume change?

Propagation of two points through a deep network

Two inputs x^{0,1} and x^{0,2}

Do nearby points come closer together or separate?

χ is the mean squared singular value of the Jacobian across one layer:

χ < 1: nearby points come closer together; gradients exponentially vanish
χ > 1: nearby points are driven apart; gradients exponentially explode

For the end-to-end Jacobian J of a depth-D network, (1/N) Tr(J^T J) = χ^D.
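The criterion above can be estimated numerically. This sketch (illustrative parameters, tanh nonlinearity) first iterates the mean-field length map q ← σ_w² E[φ(√q z)²] + σ_b² used in this line of work (an assumption not spelled out on the slide) to its fixed point q*, then estimates χ = σ_w² E[φ′(√q* z)²] by Monte Carlo and reports the ordered vs. chaotic regime.

```python
import numpy as np

def fixed_point_q(sigma_w, sigma_b, phi=np.tanh, n_samples=200_000, n_iter=100, seed=0):
    """Iterate the mean-field length map q <- sigma_w^2 E[phi(sqrt(q) z)^2] + sigma_b^2."""
    z = np.random.default_rng(seed).standard_normal(n_samples)
    q = 1.0
    for _ in range(n_iter):
        q = sigma_w ** 2 * np.mean(phi(np.sqrt(q) * z) ** 2) + sigma_b ** 2
    return q

def chi(sigma_w, sigma_b, dphi=lambda h: 1.0 - np.tanh(h) ** 2, n_samples=200_000, seed=0):
    """chi = sigma_w^2 E[phi'(sqrt(q*) z)^2]: mean squared singular value of one layer's Jacobian."""
    q_star = fixed_point_q(sigma_w, sigma_b)
    z = np.random.default_rng(seed).standard_normal(n_samples)
    return sigma_w ** 2 * np.mean(dphi(np.sqrt(q_star) * z) ** 2)

for sigma_w in (0.9, 1.0, 1.5, 2.5):
    c = chi(sigma_w, sigma_b=0.3)
    regime = "chaotic (nearby points separate)" if c > 1 else "ordered (nearby points converge)"
    print(f"sigma_w = {sigma_w:4.2f}: chi = {c:.3f} -> {regime}")
```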

x^0(θ): a smooth one-parameter input manifold

The geometry of the manifold is captured by the similarity matrix (how similar two points are in internal representation space):

q^l(θ_1, θ_2) = (1/N_l) ∑_{i=1}^{N_l} h_i^l[x^0(θ_1)] h_i^l[x^0(θ_2)]

or by its autocorrelation function: q^l(Δθ) = ∫ dθ q^l(θ, θ + Δθ)

Propagation of a manifold through a deep network

A great circle input manifold: h^1(θ) = √(N_1 q*) [u^0 cos(θ) + u^1 sin(θ)]
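As an empirical counterpart to these definitions, the sketch below (illustrative sizes and gains; not the authors' code) propagates a great circle of inputs through a random tanh network and computes the similarity matrix q^l(θ_1, θ_2) at each layer.

```python
import numpy as np

rng = np.random.default_rng(0)
N, depth = 1000, 8                 # width and depth (illustrative)
sigma_w, sigma_b = 2.5, 0.3        # chaotic-regime tanh parameters (illustrative)
n_theta = 64

# Great-circle input manifold spanned by two random orthonormal directions.
u = np.linalg.qr(rng.standard_normal((N, 2)))[0]
theta = np.linspace(0.0, 2.0 * np.pi, n_theta, endpoint=False)
x = np.sqrt(N) * (np.outer(u[:, 0], np.cos(theta)) + np.outer(u[:, 1], np.sin(theta)))

for l in range(1, depth + 1):
    W = sigma_w / np.sqrt(N) * rng.standard_normal((N, N))
    b = sigma_b * rng.standard_normal((N, 1))
    h = W @ x + b                  # pre-activations h^l for every point on the circle
    q = h.T @ h / N                # similarity matrix q^l(theta_1, theta_2)
    off_diag = (q.sum() - np.trace(q)) / (n_theta * (n_theta - 1))
    print(f"layer {l}: mean q(theta,theta) = {q.diagonal().mean():.2f}, "
          f"mean q(theta_1,theta_2) = {off_diag:.2f}")
    x = np.tanh(h)
# In the chaotic regime the diagonal approaches a fixed point q* while
# off-diagonal similarities decay: points on the circle decorrelate with depth.
```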

Riemannian geometry I: Euclidean length

h(θ): a point on the curve; ∂h(θ)/∂θ: its tangent vector

g^E(θ) = ∂h(θ)/∂θ · ∂h(θ)/∂θ: metric on the manifold coordinate θ induced by the Euclidean metric in internal representation space h.

Length element: dL^E = √(g^E(θ)) dθ; if one moves from θ to θ + dθ along the manifold, then one moves a distance dL^E in internal representation space.

Riemannian geometry II: Extrinsic Gaussian Curvature

h(θ): point on the curve
v(θ) = ∂h(θ)/∂θ: tangent or velocity vector
a(θ) = ∂v(θ)/∂θ: acceleration vector

The velocity and acceleration vectors span a two-dimensional plane in N-dimensional space.

Within this plane, there is a unique circle that touches the curve at h(θ), with the same velocity and acceleration.

The Gaussian curvature κ(θ) is the inverse of the radius of this circle:

κ(θ) = √[ ( (v·v)(a·a) − (v·a)² ) / (v·v)³ ]

v̂(θ) ∈ S^{N−1}: the unit tangent vector at a point on the curve

g^G(θ) = ∂v̂(θ)/∂θ · ∂v̂(θ)/∂θ: metric on the manifold coordinate θ induced by the metric on the Grassmannian; it measures how quickly the unit tangent vector changes.

Length element: dL^G = √(g^G(θ)) dθ; if one moves from θ to θ + dθ along the manifold, then one moves a distance dL^G along the Grassmannian.

Grassmannian length, Gaussian curvature and Euclidean length are related by g^G(θ) = κ(θ)² g^E(θ).

An example: the great circle

A great circle input manifold: h^1(θ) = √(Nq) [u^0 cos(θ) + u^1 sin(θ)]

Euclidean length:      g^E(θ) = Nq,      L^E = 2π √(Nq)
Gaussian curvature:    κ(θ) = 1/√(Nq)
Grassmannian length:   g^G(θ) = 1,       L^G = 2π

Behavior under isotropic linear expansion via a multiplicative stretch χ_1:

L^E → √χ_1 L^E,   κ → κ / √χ_1,   L^G → L^G

χ_1 < 1: Euclidean length contracts, curvature increases, Grassmannian length constant
χ_1 > 1: Euclidean length expands, curvature decreases, Grassmannian length constant

Under linear stretch, as the length increases the curvature decreases, so the Grassmannian length remains invariant.

Theory of curvature propagation in deep networks

ḡ^{E,l} = χ_1 ḡ^{E,l−1},   ḡ^{E,1} = q*,   with χ_1 = σ_w² ∫ Dz [φ′(√(q*) z)]²

(κ̄^l)² = 3 χ_2 / χ_1² + (1/χ_1) (κ̄^{l−1})²,   (κ̄^1)² = 1/q*,   with χ_2 = σ_w² ∫ Dz [φ″(√(q*) z)]²

The term (1/χ_1)(κ̄^{l−1})² is the modification of existing curvature due to stretch; the term 3 χ_2/χ_1² is the addition of new curvature due to the nonlinearity.

Regime            Local stretch   Extrinsic curvature       Grassmannian length
Ordered (χ_1 < 1): contraction     explosion                 constant
Chaotic (χ_1 > 1): expansion       attenuation + addition    exponential growth
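These recursions can be iterated directly; the sketch below (tanh nonlinearity and parameter values are illustrative, with χ_1 and χ_2 estimated by Monte Carlo) starts from the great-circle initial conditions and tracks ḡ^{E,l}, κ̄^l, and the local Grassmannian length density κ̄ √(ḡ^E) with depth.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(500_000)           # Monte Carlo samples for the Gaussian integrals
sigma_w, sigma_b, depth = 2.5, 0.3, 12     # chaotic-regime parameters for tanh (illustrative)

phi = np.tanh
dphi = lambda h: 1.0 - np.tanh(h) ** 2
d2phi = lambda h: -2.0 * np.tanh(h) * (1.0 - np.tanh(h) ** 2)

# Length fixed point q* of the single-input recursion.
q = 1.0
for _ in range(200):
    q = sigma_w ** 2 * np.mean(phi(np.sqrt(q) * z) ** 2) + sigma_b ** 2

chi1 = sigma_w ** 2 * np.mean(dphi(np.sqrt(q) * z) ** 2)
chi2 = sigma_w ** 2 * np.mean(d2phi(np.sqrt(q) * z) ** 2)

gE, kappa2 = q, 1.0 / q        # great-circle initial conditions: g^{E,1} = q*, (kappa^1)^2 = 1/q*
for l in range(2, depth + 1):
    gE = chi1 * gE                                    # stretch of the Euclidean metric
    kappa2 = 3.0 * chi2 / chi1 ** 2 + kappa2 / chi1   # new curvature added + old curvature diluted
    print(f"layer {l:2d}: g^E = {gE:.3e}, kappa = {np.sqrt(kappa2):.3f}, "
          f"kappa * sqrt(g^E) = {np.sqrt(kappa2 * gE):.3e}")
# With chi1 > 1 the Euclidean length grows exponentially while the curvature saturates,
# so the local Grassmannian length density kappa * sqrt(g^E) also grows exponentially.
```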

Curvature propagation: theory and experiment

Unlike linear expansion, deep neural signal propagation can:

1) exponentially expand length, 2) without diluting Gaussian curvature, 3) thereby yielding exponential growth of Grassmannian length.

As a result, the circle will fill space as it winds around at a constant rate of curvature to explore many dimensions!

Exponential expressivity is not achievable by shallow nets

[Figure: a great circle input manifold x^0(θ) propagated through a single hidden layer of width N_1.]

Summary

We have combined Riemannian geometry with dynamical mean field theory to study the emergent deterministic properties of signal propagation in deep nonlinear nets.

We derived analytic recursion relations for Euclidean length, correlations, curvature, and Grassmannian length as simple input manifolds propagate forward through the network.

We obtain an excellent quantitative match between theory and simulations.

Our results reveal the existence of a transient chaotic phase in which the network expands input manifolds without straightening them out, leading to “space filling” curves that explore many dimensions while turning at a constant rate. The number of turns grows exponentially with depth.

Such exponential growth does not happen with width in a shallow net.

Chaotic deep random networks can also take exponentially curved (N−1)-dimensional decision boundaries in the input and flatten them into hyperplane decision boundaries in the final layer: exponential disentangling!

(see Poggio’s talk later today!)

Are such functions rare curiosities?

Or, in some sense, is any function computed by a generic deep network not efficiently computable by a shallow network?

If so we would like a theory of deep neural expressivity that demonstrates this for 1) Arbitrary nonlinearities

2) A natural, general measure of functional complexity.

At a coarse grained level: 3 puzzles of deep learning

Generalization: How can neural networks predict the response to new examples?

A. Saxe, J. McClelland, S. Ganguli, Exact solutions to the nonlinear dynamics of learning in deep neural networks, ICLR 2014.

A. Lampinen, J. McClelland, S. Ganguli, An analytic theory of generalization dynamics and transfer learning in deep linear networks, work in progress.

Expressivity: Why deep? What can a deep neural network “say” that a shallow network cannot?

B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, Exponential expressivity in deep neural networks through transient chaos, NIPS 2016.

Trainability: How can we optimize non-convex loss functions to achieve small training error?

Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice, J. Pennington, S. Schoenholz, and S. Ganguli, NIPS 2017.

The emergence of spectral universality in deep networks, J. Pennington, S. Schoenholz, and S. Ganguli, AISTATS 2018.

Y. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, Y. Bengio, Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, NIPS 2014.

Beyond manifold geometry to entire Jacobian singular value distributions

Andrew Saxe (Harvard), Sam Schoenholz (Google Brain), Jeff Pennington (Google Brain)

Question: how do random initializations and nonlinearities impact learning dynamics?

Exact solutions to the nonlinear dynamics of learning in deep linear networks, A. Saxe, J. McClelland, S. Ganguli, ICLR 2014.

Investigating the learning dynamics of deep neural networks using random matrix theory, J. Pennington, S. Schoenholz, S. Ganguli, ICML 2017.

The emergence of spectral universality in deep networks, J. Pennington, S. Schoenholz, and S. Ganguli, AISTATS 2018.

A Deep network

The network computes h^1 = f(W^1 x), h^2 = f(W^2 h^1), ..., y = f(W^D h^{D−1}), with weights W^1, W^2, ..., W^D, nonlinearity f(x), x ∈ R^{N_1}, h^2 ∈ R^{N_3}, ..., y ∈ R^{N_{D+1}}.

End to end Jacobian: J = F^D W^D ··· F^2 W^2 F^1 W^1, where F^l is the diagonal matrix of derivatives f′(h^l).

Prediction from an analytic theory of nonlinear learning dynamics in deep linear nets:

If you initialize with random orthogonal weights (or rectangular matrices with random singular vectors but all singular values = 1) then:

Learning time, in number of epochs, will be independent of depth even as the depth goes to infinity.

If you initialize with random Gaussian weights then:

Learning time will grow with depth, and cannot remain constant.

Exact solutions to the nonlinear dynamics of learning in deep linear networks, A. Saxe, J. McClelland, S. Ganguli, ICLR 2014.

Theoretical prediction verified: Depth independent training times

• Deep linear networks on MNIST
• Scaled random Gaussian initialization (Glorot & Bengio, 2010)

[Figure: time to criterion and optimal learning rate vs. depth.]

• Pretrained and orthogonal initializations have fast, depth-independent training times!

Random vs orthogonal

• Gaussian preserves norm of random vector on average

[Figure: distribution of singular values of W^{tot} = ∏_{i=1}^{N_l−1} W^i for 1 layer, 5 layer, and 100 layer nets.]

• Attenuates on a subspace of high dimension
• Amplifies on a subspace of low dimension

Random vs orthogonal

• Glorot preserves norm of random vector on average

[Figure: distribution of singular values of W^{tot} = ∏_{i=1}^{N_l−1} W^i for 1 layer, 5 layer, and 100 layer nets.]

• Orthogonal preserves the norm of all vectors exactly: all singular values of W^{tot} = 1
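A quick numerical illustration of this contrast (matrix size and depth are illustrative): build W^{tot} as a product of scaled Gaussian matrices or of orthogonal matrices and inspect the singular values of the product.

```python
import numpy as np

rng = np.random.default_rng(0)
N, depth = 200, 100   # illustrative width and depth

def product_singular_values(kind):
    """Singular values of W_tot = W^depth ... W^2 W^1."""
    W_tot = np.eye(N)
    for _ in range(depth):
        if kind == "gaussian":
            W = rng.standard_normal((N, N)) / np.sqrt(N)      # preserves norm only on average
        else:
            W = np.linalg.qr(rng.standard_normal((N, N)))[0]  # orthogonal: preserves all norms exactly
        W_tot = W @ W_tot
    return np.linalg.svd(W_tot, compute_uv=False)

for kind in ("gaussian", "orthogonal"):
    s = product_singular_values(kind)
    print(f"{kind:10s}: max sv = {s.max():.3e}, median sv = {np.median(s):.3e}, min sv = {s.min():.3e}")
# The Gaussian product's singular values spread over many orders of magnitude,
# while the orthogonal product's singular values are all exactly 1.
```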

Analysis of gradient flow in nonlinear deep nets through free probability

The network computes h^1 = f(W^1 x), h^2 = f(W^2 h^1), ..., y = f(W^D h^{D−1}), with weights W^1, ..., W^D, nonlinearity f(x), x ∈ R^{N_1}, h^2 ∈ R^{N_3}, ..., y ∈ R^{N_{D+1}}.

End to end Jacobian: J = F^D W^D ··· F^2 W^2 F^1 W^1

Free probability theory for random matrix products: if σ(A) and σ(B) are the spectra of freely independent A and B, the spectrum σ(AB) follows from the S-transforms via S_{AB}(z) = S_A(z) S_B(z); applied layer by layer to the end-to-end Jacobian, the relevant S-transform is the D-fold product [ S_F(z) S_W(z) ]^D.

Beyond mean squared singular value: free probability analysis of all the Jacobian singular values

The network computes h^1 = f(W^1 x), ..., y = f(W^D h^{D−1}), with x ∈ R^{N_1} and y ∈ R^{N_{D+1}}. The end to end Jacobian J = F^D W^D ··· F^1 W^1 is a product of random matrices, and (1/N) Tr(J^T J) = χ^D gives only its mean squared singular value. If A and B are freely independent, then the spectrum of the product AB can be computed using the S-transform: S_{AB}(z) = S_A(z) S_B(z).
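To see these distributions empirically, the sketch below (widths, gains, and the tanh nonlinearity are illustrative, and the gain is only roughly near criticality) assembles the end-to-end Jacobian J = F^D W^D ··· F^1 W^1 of a random network by the chain rule and compares its singular values for Gaussian versus orthogonal weights.

```python
import numpy as np

def end_to_end_jacobian_svs(N=300, depth=30, sigma_w=1.05, sigma_b=0.1,
                            orthogonal=False, seed=0):
    """Singular values of J = F^D W^D ... F^1 W^1 for a random tanh network."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(N)
    J = np.eye(N)
    for _ in range(depth):
        if orthogonal:
            W = sigma_w * np.linalg.qr(rng.standard_normal((N, N)))[0]
        else:
            W = sigma_w / np.sqrt(N) * rng.standard_normal((N, N))
        b = sigma_b * rng.standard_normal(N)
        h = W @ x + b
        F = np.diag(1.0 - np.tanh(h) ** 2)   # F^l = diag(phi'(h^l)) for phi = tanh
        J = F @ W @ J                        # chain rule: accumulate this layer's Jacobian
        x = np.tanh(h)
    return np.linalg.svd(J, compute_uv=False)

for orthogonal in (False, True):
    s = end_to_end_jacobian_svs(orthogonal=orthogonal)
    kind = "orthogonal" if orthogonal else "gaussian"
    print(f"{kind:10s}: max sv = {s.max():.3f}, median sv = {np.median(s):.3f}, "
          f"fraction in [0.5, 2] = {np.mean((s > 0.5) & (s < 2)):.2f}")
# The theory summarized below predicts that the orthogonal tanh Jacobian stays far
# better conditioned than the Gaussian one as depth grows.
```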

Free probability analysis of Jacobian singular values

Scaling properties as the depth D goes to infinity (fraction of singular values within 1−ε to 1+ε; maximum singular value):

          Gaussian          Orthogonal
Linear:   1/D,  D           1,    1
ReLU:     1/D,  D           1/D,  D
Tanh:     1/D,  D           O(1), O(1)

Free probability analysis of Jacobian singular values

Example: linear network Gaussian weights at critical gain = 1

D = 2 D = 10

As depth D increases, the tail spreads out over an extent O(D) and the "middle" around 1 falls off as O(1/D).

Free probability analysis of Jacobian singular values

Example: ReLU network with orthogonal weights at critical gain = √2

D = 3 D = 10

As depth D increases, the tail spreads out over an extent O(D) and the "middle" around 1 falls off as O(1/D).

Example: tanh network with orthogonal weights at critical gain = 1

[Figure: histograms of the singular values of J at depth D = 100 for gains g = 0.9, 0.95, 1, 1.05, 1.1 and q = 0.2, 1, 4.]

As depth D increases, the entire singular value distribution stabilizes to a well defined limit distribution!

There is no extending tail that grows with depth!

The emergence of spectral universality in deep networks, J. Pennington, S. Schoenholz, and S. Ganguli, AISTATS 2018.

Free probability analysis of Jacobian singular values

[Figure panels: Gaussian W with any f; orthogonal W with ReLU; orthogonal W with tanh, σ_w ≫ 1; orthogonal W with tanh, σ_w ~ 1 + 1/L.]

Theorem: For Gaussian weights, no nonlinearity can achieve dynamical isometry.

Theorem: For ReLU, no random weights can achieve dynamical isometry.

Free probability analysis of Jacobian singular values

Scaling properties as the depth D goes to infinity:

Fraction of singular values within 1−ε to 1+ε; maximum singular value:

          Gaussian          Orthogonal
Linear:   1/L,  L           1,    1
ReLU:     1/L,  L           1/L,  L
Tanh:     1/L,  L           O(1), O(1)

Theoretical prediction: sigmoidal networks with orthogonal weights can learn faster than ReLU networks with orthogonal weights.

Learning speed: with orthogonal weights, sigmoidal can outperform ReLU

[Figure: test accuracy on CIFAR-10 for tanh and ReLU networks with orthogonal and Gaussian initializations; with orthogonal weights, tanh outperforms ReLU.]

Training time is sublinear in depth

A new scaling relation governing learning time as a function of depth

Optimal learning rate ~ 1/depth; training time ~ sqrt(depth)

Summary

An order to chaos phase transition governs the dynamics of random deep networks, often used for initialization.

Not all networks at the edge of chaos (with neither vanishing nor exploding gradients) are created equal.

The entire Jacobian singular value distribution, and not just its second moment, impacts learning speed.

We introduced free probability theory to deep learning to compute this entire distribution.

We found that tanh networks with orthogonal weights have well-conditioned Jacobians, but ReLU networks with orthogonal weights, or any network with Gaussian weights, do not.

Correspondingly, we found that with orthogonal weights, tanh networks learn faster than ReLU networks.

Controlling the entire singular value distribution at initialization may be an important architectural design principle in deep learning.

References

• M. Advani and S. Ganguli, An equivalence between high dimensional Bayes optimal inference and M-estimation, NIPS 2016.
• M. Advani and S. Ganguli, Statistical mechanics of optimal convex inference in high dimensions, Physical Review X, 6, 031034, 2016.
• A. Saxe, J. McClelland, S. Ganguli, Learning hierarchical category structure in deep neural networks, Proc. of the 35th Cognitive Science Society, pp. 1271-1276, 2013.
• A. Saxe, J. McClelland, S. Ganguli, Exact solutions to the nonlinear dynamics of learning in deep neural networks, ICLR 2014.
• Y. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, Y. Bengio, Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, NIPS 2014.
• B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, Exponential expressivity in deep neural networks through transient chaos, NIPS 2016.
• S. Schoenholz, J. Gilmer, S. Ganguli, and J. Sohl-Dickstein, Deep information propagation, https://arxiv.org/abs/1611.01232, under review at ICLR 2017.
• S. Lahiri, J. Sohl-Dickstein and S. Ganguli, A universal tradeoff between energy speed and accuracy in physical communication, arXiv:1603.07758.
• S. Lahiri and S. Ganguli, A memory frontier for complex synapses, NIPS 2013.
• F. Zenke, B. Poole, S. Ganguli, Continual learning through synaptic intelligence, ICML 2017.
• J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, S. Ganguli, Modelling arbitrary probability distributions using non-equilibrium thermodynamics, ICML 2015.
• C. Piech, J. Bassen, J. Huang, S. Ganguli, M. Sahami, L. Guibas, J. Sohl-Dickstein, Deep Knowledge Tracing, NIPS 2015.
• L. McIntosh, N. Maheswaranathan, S. Ganguli, S. Baccus, Deep learning models of the retinal response to natural scenes, NIPS 2016.
• J. Pennington, S. Schoenholz, and S. Ganguli, Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice, NIPS 2017.
• A. Goyal, N.R. Ke, S. Ganguli, Y. Bengio, Variational walkback: learning a transition operator as a recurrent stochastic neural net, NIPS 2017.
• J. Pennington, S. Schoenholz, and S. Ganguli, The emergence of spectral universality in deep networks, AISTATS 2018.

http://ganguli-gang.stanford.edu Twitter: @SuryaGanguli