Theories of deep learning: generalization, expressivity, and training
Surya Ganguli
Dept. of Applied Physics, Neurobiology, and Electrical Engineering
Stanford University
Funding: Bio-X Neuroventures, NIH, Burroughs Wellcome, Office of Naval Research, Genentech Foundation, Simons Foundation, James S. McDonnell Foundation, Sloan Foundation, McKnight Foundation, Swartz Foundation, National Science Foundation, Stanford Terman Award
http://ganguli-gang.stanford.edu  Twitter: @SuryaGanguli
An interesting artificial neural circuit for image classification
Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, NIPS 2012

References: http://ganguli-gang.stanford.edu
• M. Advani and S. Ganguli, An equivalence between high dimensional Bayes optimal inference and M-estimation, NIPS 2016.
• M. Advani and S. Ganguli, Statistical mechanics of optimal convex inference in high dimensions, Physical Review X, 6, 031034, 2016.
• A. Saxe, J. McClelland, S. Ganguli, Learning hierarchical category structure in deep neural networks, Proc. of the 35th Cognitive Science Society, pp. 1271-1276, 2013.
• A. Saxe, J. McClelland, S. Ganguli, Exact solutions to the nonlinear dynamics of learning in deep neural networks, ICLR 2014.
• Y. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, Y. Bengio, Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, NIPS 2014.
• B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, Exponential expressivity in deep neural networks through transient chaos, NIPS 2016.
• S. Schoenholz, J. Gilmer, S. Ganguli, and J. Sohl-Dickstein, Deep information propagation, ICLR 2017.
• S. Lahiri, J. Sohl-Dickstein, and S. Ganguli, A universal tradeoff between energy, speed and accuracy in physical communication, arXiv:1603.07758.
• S. Lahiri and S. Ganguli, A memory frontier for complex synapses, NIPS 2013.
• F. Zenke, B. Poole, S. Ganguli, Continual learning through synaptic intelligence, ICML 2017.
• J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, S. Ganguli, Modelling arbitrary probability distributions using non-equilibrium thermodynamics, ICML 2015.
• C. Piech, J. Bassen, J. Huang, S. Ganguli, M. Sahami, L. Guibas, J. Sohl-Dickstein, Deep Knowledge Tracing, NIPS 2015.
• L. McIntosh, N. Maheswaranathan, S. Ganguli, S. Baccus, Deep learning models of the retinal response to natural scenes, NIPS 2016.
• J. Pennington, S. Schoenholz, and S. Ganguli, Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice, NIPS 2017.
• A. Goyal, N. R. Ke, S. Ganguli, Y. Bengio, Variational walkback: learning a transition operator as a recurrent stochastic neural net, NIPS 2017.
• J. Pennington, S. Schoenholz, and S. Ganguli, The emergence of spectral universality in deep networks, AISTATS 2018.
Tools: non-equilibrium statistical mechanics, Riemannian geometry, dynamical mean field theory, random matrix theory, statistical mechanics of random landscapes, free probability theory

At a coarse-grained level: 3 puzzles of deep learning
Generalization: How can neural networks predict the response to new examples?
A. Saxe, J. McClelland, S. Ganguli, Exact solutions to the nonlinear dynamics of learning in deep neural networks, ICLR 2014.
A. Lampinen, J. McClelland, S. Ganguli, An analytic theory of generalization dynamics and transfer learning in deep linear networks, work in progress.
Expressivity: Why deep? What can a deep neural network “say” that a shallow network cannot?
B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, Exponential expressivity in deep neural networks through transient chaos, NIPS 2016.
Trainability: How can we optimize non-convex loss functions to achieve small training error?
J. Pennington, S. Schoenholz, and S. Ganguli, Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice, NIPS 2017.
J. Pennington, S. Schoenholz, and S. Ganguli, The emergence of spectral universality in deep networks, AISTATS 2018.
Y. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, Y. Bengio, Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, NIPS 2014.

Learning dynamics of training and testing error
Andrew Saxe (Harvard), Andrew Lampinen (Stanford), Jay McClelland (Stanford)
The dynamics of learning in deep nonlinear networks is complex:

[Figure: training error and test error vs. training time]
• Non-monotonicity: plateaus with sudden transitions to lower error
• Overfitting to training examples: bad predictions on new examples

Deep network
• Little hope for a complete theory with arbitrary nonlinearities
h_1 = f(W^1 x), h_2 = f(W^2 h_1), …, y = f(W^D h_D)
x ∈ R^{N_1}, h_2 ∈ R^{N_3}, …, y ∈ R^{N_{D+1}}

Deep linear network
• Use a deep linear network as a starting point
Same architecture with f(x) = x, so that y = W^D W^{D−1} ⋯ W^2 W^1 x
x ∈ R^{N_1}, h_2 ∈ R^{N_3}, …, y ∈ R^{N_{D+1}}

Final Report: Convergence properties of deep linear networks
Andrew Saxe Christopher Baldassano [email protected] [email protected]
1 Introduction
Deep learning approaches have realized remarkable performance across a range of application areas in machine learning, from computer vision [1, 2] to speech recognition [3] and natural language processing [4], but the complexity of deep nonlinear networks has made it difficult to develop a comprehensive theoretical understanding of deep learning. For example, the necessary conditions for convergence, the speed of convergence, and optimal methods for initialization are based primarily on empirical results without much theoretical support. As a first step in understanding the learning dynamics of deep nonlinear networks, we can analyze deep linear networks which compute y = W^D W^{D−1} ⋯ W^2 W^1 x, where x, y are input and output vectors respectively, and the W^i are D weight matrices in this (D+1)-layer deep network. Although these networks are no more expressive than a single linear map y = W x (and therefore unlikely to yield high accuracy in practice), we have previously shown [5] that they do exhibit nonlinear learning dynamics similar to those observed in nonlinear networks. By precisely characterizing how the weight matrices evolve in linear networks, we may gain insight into the properties of nonlinear networks with simple nonlinearities (such as rectified linear units). In this progress report, we show preliminary results for continuous batch gradient descent, in which the gradient step size is assumed to be small enough to take a continuous time limit. By the end of the project, we hope to obtain similar results for discrete batch gradient descent (with a discrete step size) and stochastic (online) gradient descent.

• Input-output map: always linear, y = (∏_{i=1}^{D} W^i) x ≡ W_tot x
• Gradient descent dynamics: nonlinear, coupled, nonconvex
• Useful for studying learning dynamics, not representation power

2 Preliminaries and Previous Work

A deep linear network maps input vectors x to output vectors y = (∏_{i=1}^{D} W^i) x ≡ W x. We wish to minimize the squared error on the training set {x^μ, y^μ}_{μ=1}^P:  l(W) = Σ_{μ=1}^P ||y^μ − W x^μ||².

The batch gradient descent update for a layer l is

  ΔW^l = λ Σ_{μ=1}^P ( ∏_{i=l+1}^{D} W^i )^T [ y^μ x^{μT} − ( ∏_{i=1}^{D} W^i ) x^μ x^{μT} ] ( ∏_{i=1}^{l−1} W^i )^T,   (1)

where ∏_{i=a}^{b} W^i = W^b W^{b−1} ⋯ W^{a+1} W^a, with the caveat that ∏_{i=a}^{b} W^i = I if a > b.

The minimizing W can be found analytically by setting the derivative of the loss to zero:

  Σ_{μ=1}^P (y^μ − W x^μ) x^{μT} = 0.   (2)

Let Σ^{xx} ≡ Σ_{μ=1}^P x^μ x^{μT} be the input correlation matrix, and Σ^{yx} ≡ Σ_{μ=1}^P y^μ x^{μT} be the input-output correlation matrix. The optimal W is

  W* = Σ^{yx} (Σ^{xx})^{−1}.   (3)
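As a quick numerical sanity check on Eqs. (1)-(3), the following minimal numpy sketch (dimensions and random data are illustrative, not taken from the report) runs batch gradient descent on a two-matrix deep linear network and verifies that the composite map approaches the analytic optimum W* = Σ^{yx} (Σ^{xx})^{−1}:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy problem: D = 2 weight matrices, y = W2 W1 x.
N1, N2, N3, P = 4, 8, 5, 30
X = rng.standard_normal((N1, P))       # columns are inputs x^mu
Y = rng.standard_normal((N3, P))       # columns are targets y^mu

Sxx = X @ X.T                          # input correlation Sigma^xx
Syx = Y @ X.T                          # input-output correlation Sigma^yx
W_star = Syx @ np.linalg.inv(Sxx)      # Eq. (3): optimal linear map

# Small random initialization for each layer.
W1 = 0.1 * rng.standard_normal((N2, N1))
W2 = 0.1 * rng.standard_normal((N3, N2))

lr = 1e-3
for _ in range(20000):
    E = Syx - W2 @ W1 @ Sxx            # batch error term from Eq. (1)
    # Simultaneous update of both layers (prefix/suffix products reduce
    # to W2^T on the left and W1^T on the right when D = 2).
    W1, W2 = W1 + lr * (W2.T @ E), W2 + lr * (E @ W1.T)

# Despite the coupled, nonconvex per-layer dynamics, the composite
# linear map converges to the global optimum W*.
gap = np.linalg.norm(W2 @ W1 - W_star) / np.linalg.norm(W_star)
print(f"relative distance to W*: {gap:.2e}")
```

The per-layer updates are nonlinear in the weights, yet the end point is the same linear regression solution a single matrix would reach.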
Nontrivial learning dynamics

[Figure: training error vs. epochs (0-500). Left: plateaus and sudden transitions. Right: faster convergence from pretrained initial conditions than from random initial conditions.]

• Build intuitions for the nonlinear case by analyzing the linear case

Nonlinear learning dynamics in a 3-layer linear net
[Diagram: three-layer network, N1 input units (object representation) → W^21 → N2 hidden units → W^32 → N3 output units (feature representation)]

Averaging over the input statistics: the input statistics guide a change of synaptic coordinates.

Dynamics of synaptic modes: input mode α (strength a^α) and output mode α (strength b^α).
Cooperative growth → stabilization → inter-mode competition.

We wish to train the network to learn a particular input-output map from a set of P training examples {x^μ, y^μ}, μ = 1, …, P. The input vector x^μ identifies item μ while each y^μ is a set of attributes to be associated to this item. Training is accomplished in an online fashion via stochastic gradient descent; each time an example μ is presented, the weights W^32 and W^21 are adjusted by a small amount in the direction that minimizes the squared error ||y^μ − W^32 W^21 x^μ||² between the desired feature output and the network's feature output. This gradient descent procedure yields the learning rule

  ΔW^21 = λ W^32T (y^μ x^μT − W^32 W^21 x^μ x^μT),   (1)
  ΔW^32 = λ (y^μ x^μT − W^32 W^21 x^μ x^μT) W^21T,   (2)

for each example μ, where λ is a small learning rate. We imagine that training is divided into a sequence of learning epochs, and in each epoch, the above rules are followed for all P examples in random order. As long as λ is sufficiently small so that the weights change by only a small amount per learning epoch, we can average (1)-(2) over all P examples and take a continuous time limit to obtain the mean change in weights per learning epoch,

  τ (d/dt) W^21 = W^32T (Σ^31 − W^32 W^21 Σ^11),   (3)
  τ (d/dt) W^32 = (Σ^31 − W^32 W^21 Σ^11) W^21T,   (4)

where Σ^11 ≡ E[xx^T] is an N1 × N1 input correlation matrix,

  Σ^31 ≡ E[yx^T]   (5)

is an N3 × N1 input-output correlation matrix, and τ ≡ 1/λ. Here t measures time in units of learning epochs; as t varies from 0 to 1, the network has seen P examples corresponding to one learning epoch. We note that, although the network we analyze is completely linear with the simple input-output map y = W^32 W^21 x, the gradient descent learning dynamics given in Eqns. (3)-(4) are highly nonlinear.

Decomposing the input-output correlations  Our fundamental goal is to understand the dynamics of learning in (3)-(4) as a function of the input statistics Σ^11 and Σ^31. In general, the outcome of learning will reflect an interplay between the perceptual correlations in the input patterns, described by Σ^11, and the input-output correlations described by Σ^31. To begin, though, we consider the case of orthogonal input representations where each item is designated by a single active input unit, as used by (Rumelhart & Todd, 1993) and (Rogers & McClelland, 2004). In this case, Σ^11 corresponds to the identity matrix. Under this scenario, the only aspect of the training examples that drives learning is the second order input-output correlation matrix Σ^31. We consider its singular value decomposition (SVD)

  Σ^31 = U^33 S^31 V^11T = Σ_{α=1}^{N1} s_α u^α v^αT,   (6)

which will play a central role in understanding how the examples drive learning. The SVD decomposes any rectangular matrix into the product of three matrices. Here V^11 is an N1 × N1 orthogonal matrix whose columns contain input-analyzing singular vectors v^α that reflect independent modes of variation in the input, U^33 is an N3 × N3 orthogonal matrix whose columns contain output-analyzing singular vectors u^α that reflect independent modes of variation in the output, and S^31 is an N3 × N1 matrix whose only nonzero elements are on the diagonal; these elements are the singular values s_α, α = 1, …, N1, ordered so that s_1 ≥ s_2 ≥ ⋯ ≥ s_{N1}. An example SVD of a toy dataset is given in Fig. 2. As can be seen, the SVD extracts coherently covarying items and properties from this dataset, with various modes picking out the underlying hierarchy present in the toy environment.

Figure 2: Example singular value decomposition for a toy dataset. Left: The learning environment is specified by an input-output correlation matrix. This example dataset has four items: Canary, Salmon, Oak, and Rose. The two animals share the property that they can Move, while the two plants cannot. In addition each item has a unique property: can Fly, can Swim, has Bark, and has Petals, respectively. Right: The SVD decomposes Σ^31 into input-output modes that link a set of coherently covarying properties (output singular vectors in the columns of U) to a set of coherently covarying items (input singular vectors in the rows of V^T). The overall strength of this link is given by the singular values lying along the diagonal of S. In this toy example, mode 1 distinguishes plants from animals; mode 2 birds from fish; and mode 3 flowers from trees.

The temporal dynamics of learning  A central result of this work is that we have described the full time course of learning by solving the nonlinear dynamical equations (3)-(4) for orthogonal input representations (Σ^11 = I), and arbitrary input-output correlation Σ^31. In particular, we find a class of exact solutions (whose derivation will be presented elsewhere) for W^21(t) and W^32(t) such that the composite mapping at any time t is given by

  W^32(t) W^21(t) = Σ_{α=1}^{N2} a(t, s_α, a_α^0) u^α v^αT,   (7)

where the function a(t, s, a_0) governing the strength of each input-output mode is given by

  a(t, s, a_0) = s e^{2st/τ} / (e^{2st/τ} − 1 + s/a_0).   (8)

Fixed points
• As t → ∞, the weights approach W^32 W^21 → Σ_α s_α u^α v^αT (Baldi & Hornik, 1989; Sanger, 1989)
• Simple end point
• What dynamics occur along the way?
Analytic learning trajectory

SVD of input-output correlations: Σ^31 = Σ_{α=1}^{N1} s_α u^α v^αT

Network input-output map: W^32(t) W^21(t) = Σ_{α=1}^{N2} a(t, s_α, a_α^0) u^α v^αT,
with a(t, s, a_0) = s e^{2st/τ} / (e^{2st/τ} − 1 + s/a_0),

where τ = 1/(learning rate), s_α is the singular value of mode α, and a_α^0 is the initial mode strength.
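The analytic trajectory can be verified numerically. This sketch (network sizes and the random Σ³¹ are illustrative) integrates Eqs. (3)-(4) with Σ¹¹ = I from a small, balanced initialization aligned with the SVD modes, and compares each mode strength of the composite map against the sigmoid a(t, s, a₀) of Eq. (8); the sigmoidal rise of each mode is what produces the plateaus and sudden transitions seen in training curves:

```python
import numpy as np

rng = np.random.default_rng(1)
N1, N2, N3 = 4, 4, 5                      # illustrative sizes
Sigma31 = rng.standard_normal((N3, N1))   # arbitrary input-output correlations
U, s, Vt = np.linalg.svd(Sigma31, full_matrices=False)

# Balanced initialization aligned with the SVD modes, each starting at a
# small strength a0, so the modes evolve independently per Eq. (7).
a0 = 1e-3
W21 = np.sqrt(a0) * Vt                    # rows pick out the v^alpha
W32 = np.sqrt(a0) * U                     # columns are the u^alpha

tau, dt, T = 1.0, 1e-3, 6.0
for _ in range(int(T / dt)):
    E = Sigma31 - W32 @ W21               # error term of Eqs. (3)-(4)
    # Forward-Euler step of the coupled dynamics (both layers updated
    # from the same pre-step weights).
    W21, W32 = W21 + (dt / tau) * (W32.T @ E), W32 + (dt / tau) * (E @ W21.T)

# Mode strengths of the learned composite map vs. the analytic sigmoid
# a(t, s, a0) of Eq. (8), evaluated at t = T.
strengths = np.diag(U.T @ W32 @ W21 @ Vt.T)
e = np.exp(2 * s * T / tau)
pred = s * e / (e - 1 + s / a0)
print("simulated:", np.round(strengths, 4))
print("analytic: ", np.round(pred, 4))
```

Modes with large s_α have already saturated at a ≈ s_α by t = T, while weak modes are still climbing their sigmoid, reproducing the stage-like learning described above.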