The statistical mechanics of deep learning

On infant category learning, dynamic criticality, random landscapes, and the reversal of time.

Surya Ganguli

Dept. of Applied Physics, Neurobiology, and Electrical Engineering

Stanford University

Neural circuits and behavior: theory, computation and experiment

with the Baccus lab: inferring hidden circuits in the retina

with the Clandinin lab: unraveling the computations underlying fly motion vision from whole brain optical imaging

with the Giocomo lab: understanding the internal representations of space in the mouse entorhinal cortex

with the Shenoy lab: a theory of neural dimensionality, dynamics and measurement

with the Raymond lab: theories of how enhanced plasticity can either enhance or impair learning depending on experience

Statistical mechanics of high dimensional data analysis

N = dimensionality of data, M = number of data points, α = N / M

Classical Statistics: N ~ O(1), M -> ∞, α -> 0

Modern Statistics: N -> ∞, M -> ∞, α ~ O(1)
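A toy illustration of why the α ~ O(1) regime matters (an assumed example, not from the talk): for white data the sample covariance concentrates around the truth when α -> 0, but its eigenvalues spread over a wide range when α ~ O(1), so naive high dimensional estimates become unreliable.

```python
# Sketch: classical (alpha ~ 0) vs. modern (alpha ~ O(1)) statistics for white data.
# The true covariance is the identity; watch how the sample eigenvalues behave.
import numpy as np

rng = np.random.default_rng(0)
M = 2000                                  # number of data points
for N in (5, 1000):                       # alpha = N/M: ~0 (classical) vs. 0.5 (modern)
    X = rng.normal(size=(M, N))           # M samples of N-dimensional white data
    evals = np.linalg.eigvalsh(X.T @ X / M)
    print(f"alpha = {N/M:.3f}: sample eigenvalues in "
          f"[{evals.min():.2f}, {evals.max():.2f}] (true value 1)")
```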

Machine Learning and Data Analysis: learn statistical parameters by maximizing the log likelihood of the data given the parameters.

Statistical Physics of Quenched Disorder: Energy = − log Prob(data | parameters); data = quenched disorder, parameters = thermal degrees of freedom.

Applications to: 1) compressed sensing, 2) optimal inference in high dimensions, 3) a theory of neural dimensionality and measurement.
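A minimal sketch of this dictionary, under an assumed Gaussian model (the model, data, and parameter grid are illustrative, not from the talk): the negative log likelihood plays the role of an energy over the parameters, with the data held fixed as quenched disorder.

```python
# Sketch: "Energy = -log Prob(data | parameters)" for a toy Gaussian model.
# The data are quenched disorder (fixed once drawn); parameters are the degrees of freedom.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=500)   # quenched disorder

def energy(mu, sigma, x):
    """Negative log-likelihood of the data under a Gaussian(mu, sigma) model."""
    return np.sum(0.5 * ((x - mu) / sigma) ** 2 + np.log(sigma * np.sqrt(2 * np.pi)))

# Maximizing the log likelihood over parameters = minimizing this "energy".
mus = np.linspace(0.0, 4.0, 201)
best_mu = mus[np.argmin([energy(m, 1.0, data) for m in mus])]
print(f"energy-minimizing mu ~ {best_mu:.2f}, sample mean = {data.mean():.2f}")
```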

The project that really keeps me up at night

Motivations for an alliance between theoretical neuroscience and theoretical machine learning: opportunities for statistical physics

• What does it mean to understand the brain (or a neural circuit)?

• We understand how the connectivity and dynamics of a neural circuit give rise to behavior.

• And also how neural activity and synaptic learning rules conspire to self-organize useful connectivity that subserves behavior.

• It is a good start, but not enough, to develop a theory of random networks that have no function.

• The field of machine learning has generated a plethora of learned neural networks that accomplish interesting functions.

• We know their connectivity, dynamics, learning rule, and developmental experience, *yet*, we do not have a meaningful understanding of how they learn and work!

On simplicity and complexity in the brave new world of large scale neuroscience, Gao and Ganguli, Curr. Op. Neurobiology 2015

Talk Outline

Original motivation: understanding category learning in neural networks

We find random weight initializations that make a network dynamically critical and allow rapid training of very deep networks.
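One standard way to picture such critical initializations is to make every layer's weight matrix orthogonal, so the end-to-end linear map neither amplifies nor attenuates signals. The sketch below is an illustrative assumption (not necessarily the talk's exact construction), contrasting orthogonal with generic Gaussian initialization.

```python
# Sketch: orthogonal vs. Gaussian initialization in a deep linear chain.
# Orthogonal layers keep every singular value of the end-to-end map at 1,
# one signature of the "dynamically critical" initializations discussed above.
import numpy as np

rng = np.random.default_rng(0)
N, depth = 100, 50

def random_orthogonal(n):
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return q

def end_to_end(weights):
    out = np.eye(N)
    for w in weights:
        out = w @ out
    return out

ortho = end_to_end([random_orthogonal(N) for _ in range(depth)])
gauss = end_to_end([rng.normal(size=(N, N)) / np.sqrt(N) for _ in range(depth)])

print("orthogonal init, largest/smallest singular value:",
      np.linalg.svd(ortho, compute_uv=False)[[0, -1]])
print("gaussian init,   largest/smallest singular value:",
      np.linalg.svd(gauss, compute_uv=False)[[0, -1]])
```

With Gaussian layers the smallest singular values of the composite map collapse toward zero as depth grows, whereas the orthogonal chain preserves them exactly.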

Dynamic Criticality

Random Landscapes

Time Reversal

Understand and exploit the geometry of high dimensional error surfaces: the need to escape saddle points, not local minima.

Exploit violations of the second law of thermodynamics to create deep generative models.

Acknowledgements and Funding

A. Saxe, J. McClelland, S. Ganguli, Learning hierarchical category structure in deep neural networks, Cog Sci. 2013.
A. Saxe, J. McClelland, S. Ganguli, Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, ICLR 2014.
J. Sohl-Dickstein, B. Poole, and S. Ganguli, Fast large scale optimization by unifying stochastic gradient and quasi-Newton methods, ICML 2014.
Y. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, Y. Bengio, Identifying and attacking the saddle point problem in high dimensional non-convex optimization, NIPS 2014.
J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, S. Ganguli, Deep unsupervised learning using non-equilibrium thermodynamics, ICML 2015.
P. Gao and S. Ganguli, On simplicity and complexity in the brave new world of large-scale neuroscience, Current Opinion in Neurobiology, 2015.

http://ganguli-gang.standford.edu

Funding: Bio-X Neuroventures, Office of Naval Research, Burroughs Wellcome, Simons Foundation, Genentech Foundation, Sloan Foundation, James S. McDonnell Foundation, McKnight Foundation, Swartz Foundation, National Science Foundation, Stanford Terman Award.

A Mathematical Theory of Semantic Development*

Joint work with: Andrew Saxe and Jay McClelland

*AKA: The misadventures of an “applied physicist” wandering around the psychology department

What is “semantic cognition”?

Human semantic cognition: our ability to learn, recognize, comprehend and produce inferences about properties of objects and events in the world, especially properties that are not present in the current perceptual stimulus.

For example:

Does a cat have fur? Do birds fly?

Our ability to do this likely relies on our ability to form internal representations of categories in the world.

Psychophysical tasks that probe semantic cognition

Looking time studies: Can an infant distinguish between two categories of objects? At what age?

Property verification tasks: Can a canary move? Can it sing? Response latency => central and peripheral properties

Category membership queries: Is a sparrow a bird? An ostrich? Response latency => typical / atypical category members

Inductive generalization: (A) Generalize familiar properties to novel objects: e.g., a “blick” has feathers. Does it fly? Sing? (B) Generalize novel properties to familiar objects: e.g., a bird has gene “X”. Does a crocodile have gene X? Does a dog?

Semantic Cognition Phenomena

A Network for Semantic Cognition

Rogers and McClelland

Evolution of internal representations

Rogers and McClelland

Categorical representations in human and monkey

Kriegeskorte et al., Neuron 2008

Categorical representations in human and monkey

Kriegeskorte et al., Neuron 2008

Evolution of internal representations

Rogers and McClelland

Theoretical questions

What are the mathematical principles underlying the hierarchical self-organization of internal representations in the network?

What are the relative roles of: the nonlinear input-output response, the learning rule, and the input statistics (second order? higher order?)

What is a mathematical definition of category coherence, and how does it relate to the speed of category learning?

Why are some properties learned more quickly than others?

How can we explain changing patterns of inductive generalization over developmental time scales?

Learning hierarchical category structure in deep neural networks
Andrew M. Saxe ([email protected]), Department of Electrical Engineering
James L. McClelland ([email protected]), Department of Psychology
Surya Ganguli ([email protected]), Department of Applied Physics
Stanford University, Stanford, CA 94305 USA

Abstract

A wide array of psychology experiments have revealed remarkable regularities in the developmental time course of human cognition. For example, infants generally acquire broad categorical distinctions (i.e., plant/animal) before finer-scale distinctions (i.e., dog/cat), often exhibiting rapid, or stage-like transitions. What are the theoretical principles underlying the ability of neuronal networks to discover categorical structure from experience? We develop a mathematical theory of hierarchical category learning through an analysis of the learning dynamics of multilayer networks exposed to hierarchically structured data. Our theory yields new exact solutions to the nonlinear dynamics of error correcting learning in deep, three layer networks. These solutions reveal that networks learn input-output covariation structure on a time scale that is inversely proportional to its statistical strength. We further analyze the covariance structure of data sampled from hierarchical probabilistic generative models, and show how such models yield a hierarchy of input-output modes of differing statistical strength, leading to a hierarchy of time scales over which such modes are learned. Our results reveal that even the second order statistics of hierarchically structured data contain powerful statistical signals sufficient to drive complex experimentally observed phenomena in semantic development, including progressive, coarse-to-fine differentiation of concepts and sudden, stage-like transitions in performance punctuating longer dormant periods.

Keywords: neural networks; hierarchical generative models; semantic cognition; learning dynamics

Introduction

Our world is characterized by a rich, nested hierarchical structure of categories within categories, and one of the most remarkable aspects of human semantic development is our ability to learn and exploit this rich structure. Experimental work has shown that infants and children acquire broad categorical distinctions before fine categorical distinctions (Keil, 1979; Mandler & McDonough, 1993), suggesting that human category learning is marked by a progressive differentiation of concepts from broad to fine. Furthermore, humans can exhibit stage-like transitions as they learn, rapidly moving from ignorance to mastery (Inhelder & Piaget, 1958; Siegler, 1976).

Many neural network simulations have captured aspects of these broad patterns of semantic development (Rogers & McClelland, 2004; Rumelhart & Todd, 1993; McClelland, 1995; Plunkett & Sinha, 1992; Quinn & Johnson, 1997). The internal representations of such networks exhibit both progressive differentiation and stage-like transitions.

However, the theoretical basis for the ability of neuronal networks to exhibit such strikingly rich nonlinear behavior remains elusive. What are the essential principles that underlie such behavior? What aspects of statistical structure in the input are responsible for driving such dynamics? For example, must networks exploit nonlinearities in their input-output map to detect higher order statistical regularities to drive such learning?

Here we analyze the learning dynamics of a linear 3 layer network and find, surprisingly, that it can exhibit highly nonlinear learning dynamics, including rapid stage-like transitions. Furthermore, when exposed to hierarchically structured data sampled from a hierarchical probabilistic model, the network exhibits progressive differentiation of concepts from broad to fine. Since such linear networks are sensitive only to the second order statistics of inputs and outputs, this yields the intriguing result that merely second order patterns of covariation in hierarchically structured data contain statistical signals powerful enough to drive certain nontrivial, high level aspects of semantic development in deep networks.

Gradient descent dynamics in multilayer neural networks

We examine learning in a three layer network (input layer 1, hidden layer 2, and output layer 3) with linear activation functions, simplifying the network model of Rumelhart and Todd (1993), in which input units correspond to items, e.g., Canary, Rose, and output units correspond to possible predicates or attributes, e.g., Can Fly, Has Petals, that may or may not apply to each item. Let N_i be the number of neurons in layer i, W^21 be an N_2 × N_1 matrix of synaptic connections from layer 1 to 2, and similarly, W^32 an N_3 × N_2 matrix of connections from layer 2 to 3. The input-output map of the network is y = W^32 W^21 x, where x is an N_1 dimensional column vector representing inputs to the network, and y is an N_3 dimensional column vector representing the network output (see Fig. 1).

Figure 1: The three layer network analyzed in this work (y ∈ R^{N_3}, h ∈ R^{N_2}, x ∈ R^{N_1}, with weight matrices W^32 and W^21).

Problem formulation: we analyze a fully linear three layer network.
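A minimal sketch of this setup (dimensions and initialization scale are illustrative assumptions, not the paper's simulation): items enter as one-hot vectors, and the composite linear map W^32 W^21 produces the predicted properties.

```python
# Sketch of the three layer linear network y = W32 @ W21 @ x described above.
import numpy as np

rng = np.random.default_rng(0)
N1, N2, N3 = 4, 16, 5                    # input (items), hidden, output (properties)

W21 = 1e-3 * rng.normal(size=(N2, N1))   # layer 1 -> 2, small random initialization
W32 = 1e-3 * rng.normal(size=(N3, N2))   # layer 2 -> 3

x = np.eye(N1)[:, 0]                     # one-hot input: item 1 ("Canary", say)
h = W21 @ x                              # hidden representation
y = W32 @ h                              # predicted properties
print(y.shape)                           # (5,)
```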

Learning dynamics

• The network is trained on a set of items and their properties.
• Weights are adjusted using standard backpropagation: change each weight to reduce the error between the desired network output and the current network output.
• This highlights the error-corrective aspect of the learning process.

We wish to train the network to learn a particular input-output map from a set of P training examples {x^µ, y^µ}, µ = 1, ..., P. The input vector x^µ identifies item µ while each y^µ is a set of attributes to be associated to this item. Training is accomplished in an online fashion via stochastic gradient descent; each time an example µ is presented, the weights W^32 and W^21 are adjusted by a small amount in the direction that minimizes the squared error ||y^µ − W^32 W^21 x^µ||^2 between the desired feature output and the network's feature output. This gradient descent procedure yields the standard backpropagation learning rule

ΔW^21 = λ W^{32T} (y^µ − ŷ^µ) x^{µT}    (1)
ΔW^32 = λ (y^µ − ŷ^µ) h^{µT},    (2)

where ŷ^µ = W^32 W^21 x^µ denotes the output of the network in response to input example x^µ, h^µ = W^21 x^µ is the hidden unit activity, and λ is a small learning rate. Here W^{32T}(y^µ − ŷ^µ) in (1) corresponds to the signal back-propagated to the hidden units through the hidden-to-output weights. These equations emphasize that the learning process works by comparing the network's current output ŷ^µ to the desired target output y^µ, and adjusting the weights based on this error term.

By a substitution and rearrangement, however, we can equivalently write these equations as

ΔW^21 = λ W^{32T} (y^µ x^{µT} − W^32 W^21 x^µ x^{µT})    (3)
ΔW^32 = λ (y^µ x^{µT} − W^32 W^21 x^µ x^{µT}) W^{21T}.    (4)

This form emphasizes two crucial aspects of the learning dynamics. First, it highlights the importance of the statistics of the training set. In particular, the training set enters only through two terms, one related to the input-output correlations y^µ x^{µT} and the other related to the input correlations x^µ x^{µT}. Indeed, if λ is sufficiently small so that the weights change by only a small amount per epoch, we can average (3)-(4) over all P examples in random order and take a continuous time limit to obtain the mean change in weights per learning epoch,

τ d/dt W^21 = W^{32T} (S^31 − W^32 W^21 S^11)    (5)
τ d/dt W^32 = (S^31 − W^32 W^21 S^11) W^{21T},    (6)

where S^11 ≡ E[x x^T] is an N_1 × N_1 input correlation matrix, S^31 ≡ E[y x^T] is an N_3 × N_1 input-output correlation matrix, and τ ≡ P/λ. Here t measures time in units of learning epochs; as t varies from 0 to 1, the network has seen P examples corresponding to one learning epoch. Hence we see that linear networks are sensitive only to the second order statistics of inputs and outputs. We note that, although the network we analyze is completely linear with the simple input-output map y = W^32 W^21 x, the gradient descent learning dynamics given in Eqns. (5)-(6) are highly nonlinear.

Second, this form highlights the coupling between the two equations: to know how to change W^21 we must know W^32, and vice versa, since each appears in the update equation for the other. This coupling is the crucial element added by the addition of a hidden layer, and as we shall see, it qualitatively changes the learning dynamics of the network compared to a "shallow" network with no hidden layer. Intuitively, this coupling complicates the learning procedure since both weight matrices must cooperate to produce the correct answer; but crucially, it enables knowledge sharing between different items, by assigning them similar hidden unit representations. Without this coupling, the network would learn each item-property association independently, and would not be sensitive to shared structure in the training set.

In general the learning process is driven by both the input and input-output correlation matrices. Here we take the simplifying assumption that the input correlations are insignificant; formally, we assume S^11 = I, the identity matrix. Concretely, this assumption corresponds to the supposition that input representations for different items are highly differentiated from, or orthogonal to, each other. While this is unlikely to hold exactly in any natural domain, we take this assumption for two reasons. First, it was used in prior simulation studies (Rogers & McClelland, 2004), and hence our attempt to understand their results is not limited by this assumption. Second, Rogers and McClelland (2004) have shown that relaxing this assumption to incorporate more complex input correlations leaves intact the basic phenomena of progressive differentiation and stage-like transitions in learning. Nevertheless, understanding the impact of input correlations is an important direction for further work.

Decomposing the input-output correlations. Our fundamental goal is to understand the dynamics of learning in (5)-(6) as a function of the input statistics S^11 and S^31. In general, the outcome of learning will reflect an interplay between the perceptual correlations in the input patterns, described by S^11, and the input-output correlations described by S^31. To begin, though, we consider the case of orthogonal input representations where each item is designated by a single active input unit, as used by Rumelhart and Todd (1993) and Rogers and McClelland (2004). In this case, S^11 corresponds to the identity matrix. Under this scenario, the only aspect of the training examples that drives learning is the second order input-output correlation matrix S^31. We consider its singular value decomposition (SVD)

S^31 = U^33 S^31 V^{11T} = Σ_{a=1}^{N_1} s_a u^a v^{aT},    (7)

which will play a central role in understanding how the examples drive learning. The SVD decomposes any rectangular matrix into the product of three matrices. Here V^11 is an N_1 × N_1 orthogonal matrix whose columns contain input-analyzing singular vectors v^a that reflect independent modes of variation in the input, U^33 is an N_3 × N_3 orthogonal matrix whose columns contain output-analyzing singular vectors u^a that reflect independent modes of variation in the output, and S^31 is an N_3 × N_1 matrix whose only nonzero elements are on the diagonal; these elements are the singular values s_a, a = 1, ..., N_1, ordered so that s_1 ≥ s_2 ≥ ... ≥ s_{N_1}. An example SVD of a toy dataset is given in Fig. 2. As can be seen, the SVD extracts coherently covarying items and properties from this dataset, with the various modes picking out the underlying hierarchy present in the toy environment.

Figure 2: Example singular value decomposition for a toy dataset. Left: The learning environment is specified by an input-output correlation matrix. This example dataset has four items: Canary, Salmon, Oak, and Rose. The two animals share the property that they can Move, while the two plants cannot. In addition each item has a unique property: can Fly, can Swim, has Bark, and has Petals, respectively. Right: The SVD decomposes S^31 into input-output modes that link a set of coherently covarying properties (output singular vectors in the columns of U) to a set of coherently covarying items (input singular vectors in the rows of V^T). The overall strength of this link is given by the singular values lying along the diagonal of S. In this toy example, mode 1 distinguishes plants from animals; mode 2 birds from fish; and mode 3 flowers from trees.

The temporal dynamics of learning. A central result of this work is that we have described the full time course of learning by solving the nonlinear dynamical equations (5)-(6) for orthogonal input representations (S^11 = I) and an arbitrary input-output correlation matrix S^31. In particular, we find a class of exact solutions (whose derivation will be presented elsewhere) for W^21(t) and W^32(t) such that the composite mapping at any time t is given by

W^32(t) W^21(t) = Σ_{a=1}^{N_2} a(t, s_a, a^0_a) u^a v^{aT},    (8)

where the function a(t, s, a_0) governing the strength of each input-output mode is given by

a(t, s, a_0) = s e^{2st/τ} / (e^{2st/τ} − 1 + s/a_0).    (9)

That is, the network learns about the N_2 strongest input-output modes identified by the singular value decomposition, progressively incorporating each mode into its representation. The coefficient a(t, s_a, a_0) describes how strongly input-output mode a has been learned by time t, starting from some small initial value a_0. As can be seen from Fig. 3, this function is a sigmoidal curve, capturing the fact that the network initially knows nothing about a particular dimension (the animal-plant dimension, say), but over time learns the importance of this dimension and incorporates it into its representation, ultimately reaching the correct association strength s_a. At this point the network correctly maps items onto the animal-plant dimension using the object analyzer vector v^{aT}, and generates the corresponding correct features using the feature synthesizer vector u^a.

Eqns. (8)-(9) describe the fundamental connection between the structure of a training set and the learning dynamics. In particular, the dynamics depend on the singular value decomposition of the input-output correlation matrix of the training set.
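The sketch below (not the authors' code; the dataset, dimensions, and rates are illustrative assumptions) integrates Eqns. (5)-(6) for a toy dataset in the spirit of Fig. 2 and compares each simulated mode strength, u^aT W^32(t) W^21(t) v^a, with the analytic sigmoid of Eqn. (9).

```python
# Sketch: gradient-flow learning of a deep linear network on a toy item/property dataset,
# compared against the analytic mode-strength curve a(t, s, a0).
import numpy as np

# Input-output correlations S31 (properties x items): Canary, Salmon, Oak, Rose.
S31 = np.array([
    [1., 1., 0., 0.],   # Move  (shared by the two animals)
    [1., 0., 0., 0.],   # Fly
    [0., 1., 0., 0.],   # Swim
    [0., 0., 1., 0.],   # Bark
    [0., 0., 0., 1.],   # Petals
])
U, s, Vt = np.linalg.svd(S31, full_matrices=False)

N3, N1 = S31.shape
N2, tau, dt, a0 = 8, 1.0, 1e-3, 1e-3

# Decoupled, balanced small initial conditions (every mode starts at strength a0),
# for which the analytic solution is exact.
rng = np.random.default_rng(0)
R = np.linalg.qr(rng.normal(size=(N2, N1)))[0]      # orthonormal columns
W21 = np.sqrt(a0) * R @ Vt
W32 = np.sqrt(a0) * U @ R.T

def analytic(t):
    e = np.exp(2.0 * s * t / tau)
    return s * e / (e - 1.0 + s / a0)

def mode_strengths():
    return np.diag(U.T @ (W32 @ W21) @ Vt.T)

for step in range(int(6.0 / dt) + 1):
    t = step * dt
    if any(abs(t - c) < dt / 2 for c in (1, 2, 3, 4, 6)):   # print a few checkpoints
        print(f"t={t:.0f}  simulated {np.round(mode_strengths(), 2)}"
              f"  analytic {np.round(analytic(t), 2)}")
    err = S31 - W32 @ W21                                   # S11 = I for one-hot items
    dW21 = (dt / tau) * (W32.T @ err)                       # Eqn. (5)
    dW32 = (dt / tau) * (err @ W21.T)                       # Eqn. (6)
    W21 += dW21
    W32 += dW32
```

The mode with the largest singular value is learned first, illustrating the claim above that each input-output mode is acquired on a time scale inversely proportional to its statistical strength.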

Learning dynamics

In linear networks, there is an equivalent formulation that highlights the role of the statistics of the training environment:

Input correlations: S^11 ≡ E[x x^T]
Input-output correlations: S^31 ≡ E[y x^T]

Equivalent dynamics:

τ d/dt W^21 = W^{32T} (S^31 − W^32 W^21 S^11)
τ d/dt W^32 = (S^31 − W^32 W^21 S^11) W^{21T}

• Learning is driven only by correlations in the training data.
• The equations are coupled and nonlinear.
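As a concrete reminder of what "driven only by correlations" means, the short sketch below (example data, shapes, and the 1/P normalization are illustrative assumptions) computes the two matrices that fully determine the dynamics above.

```python
# Sketch: the only training-set statistics that enter the linear-network dynamics.
import numpy as np

rng = np.random.default_rng(0)
P, N1, N3 = 40, 4, 5
X = rng.normal(size=(N1, P))        # columns are input vectors x^mu
Y = rng.normal(size=(N3, P))        # columns are target property vectors y^mu

S11 = (X @ X.T) / P                 # input correlations        E[x x^T]  (N1 x N1)
S31 = (Y @ X.T) / P                 # input-output correlations E[y x^T]  (N3 x N1)
print(S11.shape, S31.shape)         # (4, 4) (5, 4)
```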

Items Modes Modes Items P Items Modes Modes C S O ItemsR 1 2 3 1 2 3 C S O R is an N3 N1 input-outputP correlation matrix, and t . Here Items Modes Modes Items + P l Items ModesC S O R Modes 1 2 3 Items 1 2 3 C S O R is an N3 N1 input-output correlation matrix,⇥ and t . Here ⌘ C S O R is an N N input-output1 correlationP matrix, and tl . Here 1 2 3 M 3 1 1 2 M 3 C S O R + t measures time in units of learning epochs; as t varies from C S O R 1 2 3 is an N3 N1 input-output+ correlation⇥ matrix, and t . Here ⌘ l 1 2 3 C S O R 1 ⇥ ⌘

M l M t measures time in units of learning epochs; as t varies from 1

+ 2 F

F 0 M ⇥ ⌘ M t measures time in units of learning0 to 1, epochs; the network as t hasvaries seen fromP examples corresponding to 1 Modes Modes M M t measures time in units of learning epochs; as t varies from 2 F F 0 0 to 1, the network has seen P examples corresponding to S S 3

2 one learning epoch. We note that, although the network we F

F 0 Modes

Modes 0 to 1, the network has seen P examples corresponding to

2 - F F Modes 0 0Modes to 1, the network has seen P examples corresponding to = Properties S S 3 B

B one learning epoch. We note that, although the network we Properties Properties Modes

Modes analyze is completely linear with the simple input-output map S S 3 - one learning epoch. We note that, although the network we S S Properties Properties 32 21 = 3 Items Modes Modes one learningItems epoch.- We note that, although the network we B P P P B Properties Properties analyze is completely linear with they = simpleW W input-outputx, the gradient map descent learning dynamics given = Properties C S O R 1 2 3 1 2 -3 C S O R is an N3 N1 input-output correlation matrix, and t . Here B B Properties Properties l = Properties analyze is completely linear with the simple input-output map + ⇥ ⌘

B 32 21 B Properties Properties

analyze1 is completely linear with the simple input-outputT map

M in Eqns. (3)-(4) are highly nonlinear. P M P 31 = t measuresy = timeW inW32 units of21x, learning the gradient epochs; as descentt varies from learning dynamics given P P 32 21 U y = W SW x, the gradientV descent learning dynamics given 2 F

F 0

P 0 to 1, the network has seen P examples corresponding to P y Σ= W W x, the gradient descent learning dynamics given

Modes T 31 Input-outputModes Outputin Eqns. (3)-(4) are highlyInput nonlinear. Our funda- S

S Decomposing the input-output correlations = 3 V T one learningin epoch. Eqns.Singular We values (3)-(4) note that, are although highly the nonlinear. network we 31 U T S correlationin Eqns. matrix (3)-(4)- singular are vectors highly nonlinear. singular vectors

31 Σ = = Properties

B V B Properties Properties U S analyze is completely linear with the simple input-output map = UInput-outputΣ S Output V Input mental goal is to understand the dynamics of learning in (3)- Σ Singular values 32Decomposing21 the input-output correlations Our funda- 11 31 P Input-outputP Output Figure 2: ExampleInput singulary = W W valuex, the decomposition gradient descent learningfor a toy dynamics given Input-output Outputcorrelation matrix singular vectorsInput Singular values singular vectors Decomposing the input-output(4) as correlations a function of theOur input funda- statistics S and S . In general, correlation matrixSingular valuessingular vectors Decomposingsingular vectors the input-output correlations Our funda- correlation matrix singular vectors 31 singular vectors dataset. Left:T The learningin Eqns.mental (3)-(4) environment are goal highly is is nonlinear. to specified understand by an the dynamics of learning in (3)- Figure 2: ExampleΣ singular= U value decompositionS mentalV for goal a is toy to understandmental the goal dynamics is to understand of learning thethe in dynamics outcome (3)-11 of of31 learning learning will in reflect (3)- an interplay between the Input-output Output input-outputInput correlation matrix.(4) as This a function example of thedataset input has statistics S and S . In general, 11 Figure 2: Example singular valueSingular decomposition values for a toyDecomposing the input-output correlations Ourperceptual funda- 11 correlations31 in the input patterns, described by S , Figure 2: Example singulardataset. value Left:correlation decomposition The matrix learning singular environment vectors for a toy is specifiedsingular vectors by an (4) as a function11 of the input31 statistics S and S . In general, four items:(4) asCanary, a function Salmon,mental of the goalthe Oak, input is outcome to and understand statistics Rose. ofThe the learningS dynamics twoand animals willS of learning. reflect In general,and in an (3)- the interplay input-output between correlations the described by S31. To begin, dataset. Left:Figure The 2: Example learning singular environment value decomposition is specified for a toy by an the outcome of learning11 will31 reflect an interplay between the dataset. Left: The learninginput-output environment correlation is specified matrix. by This an examplesharethe the outcome property dataset that has of(4) learning they as a functionperceptual can willMove of the reflect, while correlationsinput statistics an the interplay twoS in plantsand theS between input. In general, patterns, the described by S11, input-outputdataset. correlation Left: The learning matrix. environment This example is specified dataset by an has though, we consider the case of11 orthogonal input representa- input-output correlationfour matrix. items: Canary, This example Salmon, dataset Oak, and has Rose.cannot.The In two addition animals eachthe outcome item hasperceptual of a learning unique will correlations property: reflect an can interplay inFly the, between input the patterns,11 described31 by S , input-output correlation matrix. This exampleperceptual dataset has correlationsand in the the input-output input patterns, correlations describedtions described by S11 where, by eachS item. To is begin, designated by a single active input four items: Canary, Salmon, Oak, and Rose. 
The two animalsperceptual correlations in the input patterns, described31 by S , 31 four items: Canary, Salmon,share the Oak, propertyfour and items: Rose.Canary, thatThe they Salmon, two can animalsOak,Move and, Rose. whilecan TheSwimand the two, the has animals two input-outputBark plants, and hasthough, correlationsPetalsand the, respectively. we input-output consider described Right: the correlationsby case TheS of.31 To orthogonalunit, begin, described as used input by by (RumelhartS representa-. To begin, & Todd, 1993) and (Rogers & share the property that they can Move, while the two plants31and the input-output correlations described by S . To begin, share the property thatcannot. they can Inshare additionMove the, property while each item that the they two has can aplantsMove unique,SVD while property:though, decomposesthe two plants wecan considerSFlythough,,into input-outputtions we thethough, consider case where of the wemodes orthogonal case each consider ofthat item orthogonal link isthe input a designated setinput case representa- representa- ofMcClelland, orthogonal by a single 2004). input active In representa- this input case, S11 corresponds to the iden- cannot. Incannot. addition In addition each each item item has has a a unique uniqueof coherentlyproperty: property: can covaryingFly can, Fly properties, (output singular vectors in cannot. In addition eachcan itemSwim has, has a uniqueBark, and property: has Petals can,Fly respectively., tions Right: where The eachtions item wheretions is each designated item where is designated each by aitem by single a is single designated active activetity input matrix. by a Under single this active scenario, input the only aspect of the train- can Swim, has Bark, and has Petals, respectively.the columns Right: of TheU) tounit, a set as ofunit, used coherently by as (Rumelhart used covarying by & (Rumelhart Todd, items 1993) (in- and & (Rogers Todd, & 1993) and (Rogers & can Swim, has Bark31, and31 has Petals, respectively. Right: The ing11 examples that drives learning is the second order input- can Swim, has Bark, andSVD has decomposesPetalsSVD, decomposes respectively.S intoS input-outputinto Right: input-output The modesmodesunit,thatthat link as link a used set a set byMcClelland, (RumelhartMcClelland,unit, 2004). asT & In used this Todd, 2004). case, byS 1993) (Rumelhart11 Incorresponds this and case, (Rogers to &S the Todd, iden-corresponds & 1993) and to the (Rogers iden- & 31 31 put singular vectors in the rows of V ). The overall strength 11 31 SVD decomposes S ofSVDinto coherently input-output decomposesof coherently covaryingmodesS covaryinginto propertiesthat input-output properties link ( aoutput (output set modes singular singularthat vectors vectors linkin aintity set matrix.McClelland, Under this scenario,11 2004). the only In thisaspect case, of theoutputS train-corresponds correlation matrix to theS iden-. We consider its singular value of thisMcClelland, link is given 2004). by the singular Intity this matrix. valuescase, UnderSlyingcorresponds along this scenario,the di- to the the iden- only aspect of the train- of coherentlythe columns covarying of U) toproperties a set of coherently (output covarying singular items vectorsDecomposing input-output correlaons (in- ingin examples that drives learning is the second orderdecomposition input- (SVD) of coherently covaryingthe properties columns of (outputU) to singular a set of coherently vectors inT covarying items (in- tity matrix. 
Under this scenario, the only aspect of the train- put singular vectors in the rows of V ).agonal The overalltity of matrix.S strength. In this Under toyoutput example, this correlationing scenario, examples mode matrix 1 distinguishes theS that31. only We drives consider aspect plants learning its of singular the is train- value the second order input- the columns of U) to a set of coherentlyT covarying items (in- 31 N1 the columns of U) to aput set singular of coherentlyof this vectors link is covaryingin given the by rows the itemssingular of V (in- values).from Thelying overallanimals; along the strength mode di- 2 birds froming examples fish; and mode that 3 drives flowers learning is the second31 order33 31 input-11T a aT T ing examplesThe learning dynamics can be expressed using the SVD of thatdecomposition drivesoutput learning (SVD) correlation is the matrix secondS order. We considerinput- itsS singular= U S valueV = s u v , (6) put singular vectors inofput the this singular rows linkagonal of is vectors givenV T of).S. The by Inin this the theoverall toysingular rows example, strength of mode valuesV ). 1 distinguishes Thelying overall along plants strength the di- 31 31  a fromoutput trees. correlation matrixdecompositionoutputS . correlation We consider (SVD)N matrix1 its singularS . We value consider its singular value a=1 from animals; mode 2 birds from fish; and mode 3 flowers 31 33 31 11T a aT of this link is given byagonalof the thissingular linkof S. is In valuesgiven this toy bylying example,the alongsingular mode the values di- 1 distinguisheslying along plants the di- decompositionS = U S V (SVD)=  sau v , which(6) will play a central role in understanding how the ex- from trees. decomposition (SVD) a=1 N1 agonal of S. In this toyfromagonal example, animals; of S. mode In mode this 1 distinguishes toy 2 birds example, from plants mode fish; 1 and distinguishes mode 3 flowers plants 31 33 31 11T amples drivea a learning.T The SVD decomposes any rectangu- We wish toMode train thewhichα links a set of coherently network will play to a learn central aS role particularN = in understandingUcovaryingS input-V properes to how= the ex-Ns1au v , (6) 1 µ µ T  11 from animals; modefrom 2from birds trees. animals; fromWe fish; wishmode and to train 2 mode birds the network 3from flowers to fish; learnoutput and a particular mode map from input- 3 flowers a set31amples of P drivetraining33 31 learning.11 examplesT The SVD31 x decomposes,ya 33a,Tµ31= any11 rectangu-lar= matrix1 intoa a theT product of three matrices. Here V is a set of coherently = covarying= S items with strength = U S, V =a11 sau v , (6) µ µ Slar matrixUµ intoS theV product of three{sau matrices.v} µ HereanV N (6)Âis N orthogonal matrix whose columns contain input- from trees. from trees.output map from a set of P training examples1,...,P.x The,y , inputµ = vector x , identifies item µ while each y 1 = 1 µ { } µ an N whichN orthogonal will matrix playa= awhose1 central columns role contain in understandinginput-a⇥1 how the ex-a 1,...,P. The input vector x , identifies item µ while each y 31 1 ⇥ 1 analyzingT singular vectors v that reflect independent modes is a set of attributes toanalyzing be associatedampleswhichsingular= drive vectors will to this playlearning.va item.that a reflect central Training The independent SVD role in decomposes modes understandingV any rectangu- how the ex- We wishis to a set train of attributes the network to be associated to learn to this awhich particular item. 
Training will input- playΣ a central role inU understandingS howof the variation ex- in the input, U33 is an N N orthogonal ma- is accomplished inInput-output an online fashionFeature via synthesizer stochastic33 gradient Object analyzer 11 3 3 is accomplished in an online fashion via stochastic gradientµ µ of variationlaramples in matrix the input, drive intoU learning. theis an productSingularN3 N values The3 orthogonal of SVD three ma- decomposes matrices. Here any rectangu-V is ⇥ We wish to train theoutputWe network wish map to from to learn train a set a the particular of networkP training input- to learn examples aamples particularx ,y drivecorrelation, input-µ learning.= matrix The SVDvectors decomposes⇥ any rectangu-trix whosevectors columns contain output-analyzing singular vectors descent; each time an example µ is presented,descent; the{ each weights time} antrix example whose columnsµ is presented, contain output-analyzing the weightssingular vectors11 11 µ µ µ µ µ Itemsµ a anlarN1 matrixNModes1 orthogonal into the productModes matrix whose of threea Items columns matrices. contain Hereinput-V is output map from a set1output,..., of PP maptraining. TheW 32 from inputand examplesW a21 vector setare adjusted ofxPx, trainingidentifies, byy a small,µ = amountexamples itemW 32 larand inµ thewhile matrixW direction21x are, eachy into adjusted,µyu= thethat by product reflect a small independent of amount three modes in matrices. the of direction variation Here in theu output,V thatis reflect independent modes of variation in the output, C S O R31 ⇥ 1 2 3 1 2 3a C S 31O R µ { µ } µ µ 32 21 µ 2{ } andµ S analyzingisan an NN3 N1Nsingularmatrixorthogonal whose vectors2 only matrix nonzerov that elements whose reflectS columns independentN 1 containN modesinput- 1,...,P. The input vectoris1,..., a setxP,. ofidentifies Thethat attributes minimizes input item vector to theµ be squaredwhile associatedx , erroridentifies eachy y toW item thisWan item.xµNwhile1between TrainingN1 eachorthogonaly matrixµ1⇥ whose321 21 columnsµ containandinput-is an 3 1 matrix whose only nonzero elements

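A minimal sketch of the per-example update in Eqns. (1)-(2), written in NumPy; the layer sizes, learning rate, and initialization scale below are illustrative assumptions rather than values taken from the text.

    import numpy as np

    N1, N2, N3 = 4, 16, 5      # input, hidden, and output layer sizes (hypothetical)
    lam = 1e-2                 # small learning rate lambda (hypothetical)

    rng = np.random.default_rng(0)
    W21 = 1e-3 * rng.standard_normal((N2, N1))   # input-to-hidden weights
    W32 = 1e-3 * rng.standard_normal((N3, N2))   # hidden-to-output weights

    def sgd_step(x, y, W21, W32, lam):
        """One online gradient step on the squared error ||y - W32 W21 x||^2."""
        err = np.outer(y, x) - W32 @ W21 @ np.outer(x, x)   # y x^T - W32 W21 x x^T
        dW21 = lam * W32.T @ err                            # Eq. (1)
        dW32 = lam * err @ W21.T                            # Eq. (2)
        return W21 + dW21, W32 + dW32

    # Example usage: present item 0 (a one-hot input) paired with a feature vector.
    x = np.eye(N1)[0]
    y = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
    W21, W32 = sgd_step(x, y, W21, W32, lam)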
We imagine that training is divided into a sequence of learning epochs, and in each epoch the above rules are followed for all P examples in random order. As long as λ is sufficiently small, so that the weights change by only a small amount per learning epoch, we can average (1)-(2) over all P examples and take a continuous time limit to obtain the mean change in weights per learning epoch,

τ (d/dt) W^{21} = W^{32 T} ( S^{31} − W^{32} W^{21} S^{11} ),    (3)
τ (d/dt) W^{32} = ( S^{31} − W^{32} W^{21} S^{11} ) W^{21 T},    (4)

where S^{11} ≡ E[x x^T] is an N_1 × N_1 input correlation matrix,

S^{31} ≡ E[y x^T]    (5)

is an N_3 × N_1 input-output correlation matrix, and τ is a learning time constant inversely proportional to the learning rate λ. Here t measures time in units of learning epochs; as t varies from 0 to 1, the network has seen P examples, corresponding to one learning epoch. We note that, although the network we analyze is completely linear, with the simple input-output map y = W^{32} W^{21} x, the gradient descent learning dynamics given in Eqns. (3)-(4) are highly nonlinear.
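To make the averaged dynamics concrete, the sketch below estimates S^{11} and S^{31} from a batch of examples and integrates Eqns. (3)-(4) with simple Euler steps. The function names, the Euler step size, and the value of the time constant tau are illustrative assumptions.

    import numpy as np

    def correlation_matrices(X, Y):
        """Empirical S11 = E[x x^T] and S31 = E[y x^T]; X is N1 x P, Y is N3 x P."""
        P = X.shape[1]
        return X @ X.T / P, Y @ X.T / P

    def integrate_mean_dynamics(W21, W32, S11, S31, tau=1.0, dt=1e-2, t_max=50.0):
        """Euler integration of the averaged learning dynamics, Eqns. (3)-(4)."""
        for _ in range(int(t_max / dt)):
            err = S31 - W32 @ W21 @ S11              # S31 - W32 W21 S11
            dW21 = (dt / tau) * W32.T @ err          # Eq. (3)
            dW32 = (dt / tau) * err @ W21.T          # Eq. (4)
            W21, W32 = W21 + dW21, W32 + dW32
        return W21, W32

    # Usage: S11, S31 = correlation_matrices(X, Y) for data matrices X and Y,
    # then W21, W32 = integrate_mean_dynamics(W21, W32, S11, S31).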

Decomposing the input-output correlations. Our fundamental goal is to understand the dynamics of learning in (3)-(4) as a function of the input statistics S^{11} and S^{31}. In general, the outcome of learning will reflect an interplay between the perceptual correlations in the input patterns, described by S^{11}, and the input-output correlations described by S^{31}. To begin, though, we consider the case of orthogonal input representations, in which each item is designated by a single active input unit, as used by (Rumelhart & Todd, 1993) and (Rogers & McClelland, 2004). In this case S^{11} corresponds to the identity matrix, and the only aspect of the training examples that drives learning is the second-order input-output correlation matrix S^{31}. We consider its singular value decomposition (SVD)

S^{31} = U^{33} S^{31} V^{11 T} = Σ_{a=1}^{N_1} s_a u^a v^{a T},    (6)

which will play a central role in understanding how the examples drive learning. The SVD decomposes any rectangular matrix into the product of three matrices: V^{11} is an N_1 × N_1 orthogonal matrix whose columns contain input-analyzing singular vectors v^a that reflect independent modes of variation in the input; U^{33} is an N_3 × N_3 orthogonal matrix whose columns contain output-analyzing singular vectors u^a that reflect independent modes of variation in the output; and the middle factor is an N_3 × N_1 matrix whose only nonzero elements lie on the diagonal, the singular values s_a, a = 1, ..., N_1, ordered so that s_1 ≥ s_2 ≥ ... ≥ s_{N_1}. An example SVD of a toy dataset is given in Fig. 2. As can be seen, the SVD extracts coherently covarying items and properties from the dataset, with the various modes picking out the underlying hierarchy present in the toy environment.

Figure 2: Example singular value decomposition for a toy dataset. Left: The learning environment is specified by an input-output correlation matrix. This example dataset has four items: Canary, Salmon, Oak, and Rose. The two animals share the property that they can Move, while the two plants cannot; in addition, each item has a unique property: can Fly, can Swim, has Bark, and has Petals, respectively. Right: The SVD decomposes S^{31} into input-output modes, each of which links a set of coherently covarying properties (output singular vectors in the columns of U, the feature synthesizers) to a set of coherently covarying items (input singular vectors in the rows of V^T, the object analyzers), with the overall strength of each link given by the singular value lying along the diagonal of S. In this toy example, mode 1 distinguishes plants from animals, mode 2 birds from fish, and mode 3 flowers from trees.
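The toy example of Fig. 2 can be reproduced with a few lines of NumPy. The 0/1 property matrix below is one plausible encoding of the Canary/Salmon/Oak/Rose dataset described in the caption, not necessarily the exact matrix used to produce the figure.

    import numpy as np

    items = ["Canary", "Salmon", "Oak", "Rose"]
    properties = ["Move", "Fly", "Swim", "Bark", "Petals"]

    # Input-output correlation matrix S31 (properties x items): the two animals
    # share "Move", and each item has one unique property.
    S31 = np.array([
        [1, 1, 0, 0],   # Move
        [1, 0, 0, 0],   # Fly
        [0, 1, 0, 0],   # Swim
        [0, 0, 1, 0],   # Bark
        [0, 0, 0, 1],   # Petals
    ], dtype=float)

    # SVD: S31 = U @ diag(s) @ Vt, as in Eq. (6).
    U, s, Vt = np.linalg.svd(S31, full_matrices=False)

    # Rows of Vt are input-analyzing singular vectors (loadings over items);
    # columns of U are output-analyzing singular vectors (loadings over properties).
    # The leading mode links the shared "Move" property to the two animals,
    # separating animals from plants; the remaining modes carry the item-specific
    # properties (for this simple 0/1 encoding they share a singular value, so
    # their orientation within that subspace is basis-dependent).
    for a, (sa, va) in enumerate(zip(s, Vt), start=1):
        print(f"mode {a}: s = {sa:.2f}, item loadings = {np.round(va, 2)}")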

The temporal dynamics of learning. A central result of this work is that we have described the full time course of learning by solving the nonlinear dynamical equations (3)-(4) for orthogonal input representations (S^{11} = I) and an arbitrary input-output correlation matrix S^{31}. In particular, we find a class of exact solutions (whose derivation will be presented elsewhere) for W^{21}(t) and W^{32}(t) such that the composite mapping at any time t is given by

W^{32}(t) W^{21}(t) = Σ_{a=1}^{N_2} a(t, s_a, a_a^0) u^a v^{a T},    (7)

where the function a(t, s, a_0), governing the strength of each input-output mode, is given by

a(t, s, a_0) = s e^{2st/τ} / ( e^{2st/τ} − 1 + s/a_0 ).    (8)
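The exact solution in (7)-(8) is easy to evaluate numerically. The sketch below implements the mode-strength function a(t, s, a_0) and assembles the composite map W^{32}(t) W^{21}(t) from the SVD of S^{31}; the values of tau and a0, the use of a single shared a0 for every mode, and the assumption that the hidden layer is wide enough to carry all modes are illustrative simplifications.

    import numpy as np

    def mode_strength(t, s, a0, tau=1.0):
        """Sigmoidal trajectory of a single mode's strength, Eq. (8)."""
        e = np.exp(2.0 * s * t / tau)
        return s * e / (e - 1.0 + s / a0)

    def composite_map(t, S31, a0=1e-3, tau=1.0):
        """Composite input-output map W32(t) W21(t) predicted by Eq. (7)."""
        U, s, Vt = np.linalg.svd(S31, full_matrices=False)
        a = mode_strength(t, s, a0, tau)    # strength of each mode at time t
        return (U * a) @ Vt                 # sum over modes of a(t, s_a) u^a v^aT

    # Each mode rises sigmoidally from its small initial strength a0 toward its
    # asymptote s_a, so modes with larger singular values are learned earlier.
    # Example (using the toy S31 from the earlier sketch):
    #   for t in (0.0, 2.0, 5.0, 20.0):
    #       print(t, np.round(composite_map(t, S31), 2))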

... to a state in which the network fully incorporates that relation. To formalize this, we begin with the sigmoidal learning curve in (8). If we assume the initial strength of the mode a_0 satisfies a_0 ...