Journal of Statistical Mechanics: Theory and Experiment
Statistical mechanics of complex neural systems and high dimensional data

Madhu Advani, Subhaneil Lahiri and Surya Ganguli
Department of Applied Physics, Stanford University, Stanford, CA, USA
E-mail: [email protected], [email protected] and [email protected]

Received 9 October 2012
Accepted 14 January 2013
Published 12 March 2013

Online at stacks.iop.org/JSTAT/2013/P03014
doi:10.1088/1742-5468/2013/03/P03014

Abstract. Recent experimental advances in neuroscience have opened new vistas into the immense complexity of neuronal networks. This proliferation of data challenges us on two parallel fronts. First, how can we form adequate theoretical frameworks for understanding how dynamical network processes cooperate across widely disparate spatiotemporal scales to solve important computational problems? Second, how can we extract meaningful models of neuronal systems from high dimensional datasets? To aid in these challenges, we give a pedagogical review of a collection of ideas and methods arising at the intersection of statistical physics, computer science and neurobiology. We introduce the interrelated replica and cavity methods, which originated in statistical physics as powerful ways to quantitatively analyze large heterogeneous systems of many interacting degrees of freedom. We also introduce the closely related notion of message passing in graphical models, which originated in computer science as a distributed algorithm capable of solving large inference and optimization problems involving many coupled variables. We then show how both the statistical physics and computer science perspectives can be applied in a wide diversity of contexts to problems arising in theoretical neuroscience and data analysis. Along the way we discuss spin glasses, learning theory, illusions of structure in noise, random matrices, dimensionality reduction and compressed sensing, all within the unified formalism of the replica method. Moreover, we review recent conceptual connections between message passing in graphical models, and neural computation and learning. Overall, these ideas illustrate how statistical physics and computer science might provide a lens through which we can uncover emergent computational functions buried deep within the complexities of neuronal network dynamics.
© 2013 IOP Publishing Ltd and SISSA Medialab srl
Keywords: cavity and replica method, spin glasses (theory), message-passing algorithms, computational neuroscience

Contents

1. Introduction
2. Spin glass models of neural networks
   2.1. Replica solution
   2.2. Chaos in the SK model and the Hopfield solution
   2.3. Cavity method
   2.4. Message passing
3. Statistical mechanics of learning
   3.1. Perceptron learning
   3.2. Unsupervised learning
   3.3. Replica analysis of learning
   3.4. Perceptrons and Purkinje cells in the cerebellum
   3.5. Illusions of structure in high dimensional noise
   3.6. From message passing to synaptic learning
4. Random matrix theory
   4.1. Replica formalism for random matrices
   4.2. The Wishart ensemble and the Marchenko–Pastur distribution
   4.3. Coulomb gas formalism
   4.4. Tracy–Widom fluctuations
5. Random dimensionality reduction
   5.1. Point clouds
   5.2. Manifold reduction
   5.3. Correlated extreme value theory and dimensionality reduction
6. Compressed sensing
   6.1. $L_1$ minimization
   6.2. Replica analysis
   6.3. From message passing to network dynamics
7. Discussion
   7.1. Network dynamics
   7.2. Learning and generalization
   7.3. Machine learning and data analysis
Acknowledgments
Appendix. Replica theory
   A.1. Overall framework
   A.2. Physical meaning of overlaps
   A.3. Replica symmetric equations
      A.3.1. SK model
      A.3.2. Perceptron and unsupervised learning
   A.4. Distribution of alignments
   A.5. Inverting the Stieltjies transform
References

1. Introduction

Neuronal networks are highly complex dynamical systems consisting of large numbers of neurons interacting through synapses [1]–[3]. Such networks subserve dynamics over multiple timescales. For example, on fast timescales, of the order of milliseconds, synaptic connectivity is approximately constant, and this connectivity directs the flow of electrical activity through neurons. On slower timescales, of the order of seconds to minutes and beyond, the synaptic connectivity itself can change through synaptic plasticity induced by the statistical structure of experience, which itself is thought to stay constant over even longer timescales. These experience-induced synaptic changes are thought to underlie our ability to learn from experience. To the extent that such separations of timescale hold, one can exploit powerful tools from the statistical physics of disordered systems to obtain a remarkably precise understanding of neuronal dynamics and synaptic learning in basic models. For example, the replica method and the cavity method, which we introduce and review below, are relevant because they allow us to understand the statistical properties of many interacting degrees of freedom that become coupled to each other through interactions that may be highly heterogeneous, or disordered.

However, such networks of neurons and synapses, as well as the dynamical processes that occur on them, are not simply fixed, or quenched, tangled webs of complexity that exist for their own sake. Instead, they have been sculpted over time, through the processes of evolution, learning and adaptation, to solve important computational problems necessary for survival. Thus, neuronal networks serve a function that is useful for an organism in terms of improving its evolutionary fitness. The very concept of function does not of course arise in statistical physics, as large disordered statistical mechanical systems, like glasses or non-biological polymers, do not arise through evolutionary processes. In general, the function that a biological network performs (which may not always be obvious) can provide a powerful way to understand both its structure and the details of its complex dynamics [4].

As the functions performed by neuronal networks are often computational in nature, it can be useful to turn to ideas from computer science for sources of insight into how networks of neurons may learn and compute in a distributed manner. In this paper we also focus on distributed message passing algorithms in computer science whose goal is to compute the marginal probability distribution of a single degree of freedom in a large interacting system. Many problems in computer science, including error correcting codes and constraint satisfaction, can be formulated as message passing problems [5]. As we shall review below, message passing is intimately related to the replica and cavity methods of statistical physics, and can serve as a framework for thinking about how specific dynamical processes of neuronal plasticity and network dynamics may solve computational problems like learning and inference.

This combination of ideas from statistical physics and computer science is useful not only in thinking about how network dynamics and plasticity may mediate computation, but also in thinking about ways to analyze large scale datasets arising from high throughput experiments in neuroscience. Consider a data set consisting of $P$ points in an $N$ dimensional feature space. Much of the edifice of classical statistics and machine learning has been tailored to the situation in which $N$ is small and $P$ is large. This is the low dimensional data scenario in which we have large amounts of data. In such situations, many classical unsupervised machine learning algorithms can easily find structures or patterns in data, when they exist. However, the advent of high throughput techniques in neuroscience has pushed us into a high dimensional data scenario in which both $P$ and $N$ are large, but their ratio is O(1). For example, we can simultaneously measure the activity of O(100) neurons but often only under a limited number of trials (i.e. O(100)) for any given experimental condition. Also, we can measure the expression levels of O(100) genes but often only in a limited number of cells. In such a high dimensional scenario, it can be difficult to find statistically significant patterns in the data, as classical unsupervised machine learning algorithms again yield illusory structures. The statistical physics of disordered systems provides a powerful tool to understand high dimensional data, because many machine learning algorithms can be formulated as the minimization of a data-dependent energy function on a set of parameters. We review below how statistical physics plays a useful role in understanding possible illusions of structure in high dimensional data, as well as approaches like random projections and compressed sensing, which are tailored to the high dimensional data limit.

We give an outline and summary of this paper as follows. In section 2, we introduce the fundamental techniques of the replica method and cavity method within the context of a paradigmatic example, the Sherrington–Kirkpatrick (SK) model [6] of a spin glass [7]–[9]. In a neuronal network interpretation, in such a system the heterogeneous synaptic connectivity is fixed and plays the role of quenched disorder. On the other hand, the neuronal activity can fluctuate, and we are interested in understanding the statistical properties of the neuronal activity. We will find that certain statistical properties, termed self-averaging properties, do not depend on the detailed realization of the disordered connectivity matrix. This is a recurring theme in this paper: in large random systems with microscopic heterogeneity, deterministic macroscopic order can arise, in striking ways that do not depend on the details of the heterogeneity. Such order can govern the dynamics and learning in neuronal networks, as well as the performance of machine learning algorithms in analyzing data, and, moreover, this order can be understood theoretically through the replica and cavity methods.

We end section 2 by introducing message passing, which provides an algorithmic perspective on the replica and cavity methods. Many models in equilibrium statistical physics are essentially equivalent to joint probability distributions over many variables, which are known as graphical models in computer science [10]. Moreover, many statistical computations in such graphical models involve computing the marginal probability of a single variable. Message passing, also known in special cases as belief propagation [11], involves a class of algorithms that yield dynamical systems whose fixed points are designed to approximate marginal probabilities in graphical models. Another recurring theme in this paper is that certain aspects of neuronal (and also synaptic) dynamics may profitably be viewed through the lens of message passing; in essence, these neuronal dynamics can be viewed as approximate versions of message passing in a suitably defined graphical model. This correspondence between neuronal dynamics and message passing allows for the possibility of both understanding the computational significance of existing neuronal dynamics and deriving hypotheses for new forms of neuronal dynamics from a computational perspective.

In section 3, we apply the ideas of replicas, cavities and messages introduced in section 2 to the problem of learning in neuronal networks or in data analysis (see [12] for a beautiful book length review of this topic). In this context, the training examples play the role of quenched disorder, and the synaptic weights of a network, or the learning parameters of a machine learning algorithm, play the role of fluctuating statistical mechanical degrees of freedom. In the zero temperature limit, these degrees of freedom are optimized, or learned, by minimizing an energy function. The learning error, as well as aspects of the learned structure, can be described by macroscopic order parameters that do not depend on the detailed realization of the training examples. We show how to compute these order parameters for the classical perceptron [13, 14], thereby computing its storage capacity. Also, we compute these order parameters for learning algorithms, including Hebbian learning, principal component analysis (PCA) and K-means clustering, revealing that all of these algorithms are prone to discovering illusory structures that reliably arise in random realizations of high dimensional noise. Finally, we end section 3 by discussing an application of message passing to learning in networks with binary valued synapses, known to be an NP-complete problem [15, 16]. The authors of [17, 18] derived a biologically plausible learning algorithm capable of solving random instantiations of this problem by approximating message passing in a joint probability distribution over many synaptic weights determined by the training examples.

In section 4, we discuss the eigenvalue spectrum of random matrices. Matrices from many random matrix ensembles have eigenvalue spectra that display fascinating macroscopic structures that do not depend on the detailed realization of the matrix elements. These spectral distributions play a central role in a wide variety of fields [19, 20]; within the context of neural networks, for example, they play a role in understanding the stability of nonlinear networks [21] and the transition to chaos in random neural networks [22], as well as in the analysis of high dimensional data. We begin section 4 by showing how replica theory can also provide a general framework for computing the typical eigenvalue distribution of a variety of random matrix ensembles. Then, we focus on an ensemble of random empirical covariance matrices (the Wishart ensemble [23]) whose eigenvalue distribution, known as the Marchenko–Pastur distribution, provides a null model for the outcome of PCA applied to high dimensional data. Moreover, we review how the eigenvalues of many random matrix ensembles can be thought of as Coulomb charges living in the complex plane, and the eigenvalue distribution as the thermally equilibrated charge density of this Coulomb gas, which is stabilized via the competing effects of a repulsive two dimensional Coulomb interaction and an attractive confining external potential. Moreover, we review how the statistics of the largest eigenvalue, which obeys the Tracy–Widom distribution [24, 25], can be understood simply in terms of thermal fluctuations of this Coulomb gas [26, 27]. The statistics of this largest eigenvalue will make an appearance later in section 5, when we discuss how random projections distort the geometry of manifolds. Overall, section 4 illustrates the power of the replica formalism, and of two dimensional Coulomb gases, and plays a role in connecting the statistical physics of PCA, discussed in section 3.5, to the geometric distortions induced by dimensionality reduction, discussed in section 5.3.

In section 5, we discuss the notion of random dimensionality reduction. High dimensional data can be difficult to both model and process. One approach to circumvent such difficulties is to reduce the dimensionality of the data; indeed, many machine learning algorithms search for optimal directions on which to project the data. As discussed in section 3.5, such algorithms yield projected data distributions that reveal low dimensional, illusory structures that do not exist in the data. An alternate approach is to simply project the data onto a random subspace. As the dimensionality of this subspace is lower than the ambient dimensionality of the feature space in which the data reside, features of the data will necessarily be lost. However, it is often the case that interesting data sets lie along low dimensional submanifolds in their ambient feature space. In such situations, a random projection above a critical dimension, which is more closely related to the dimensionality of the submanifold than to the dimensionality of the ambient feature space, often preserves a surprising amount of the structure of the submanifold. In section 5, we review the theory of random projections and their ability to preserve the geometry of low dimensional submanifolds, like point clouds and hyperplanes. We end section 5 by introducing a simple statistical mechanics approach to random dimensionality reduction of manifolds. This analysis connects random dimensionality reduction to the extremal fluctuations of 2D Coulomb gases discussed in sections 4.3 and 4.4.

The manifold of sparse signals forms a ubiquitous and interesting low dimensional structure that accurately captures many types of data. The field of compressed sensing (CS) [28, 29], discussed in section 6, rests upon the central observation that a sparse high dimensional signal can be recovered from a random projection down to a surprisingly low dimension by solving a computationally tractable convex optimization problem, known as $L_1$ minimization. In section 6, we focus mainly on the analysis of $L_1$ minimization. After introducing CS in section 6.1, we show how replica theory can be used to analyze its performance in section 6.2. Remarkably, the performance of CS, unlike that of the other algorithms discussed in section 3.5, displays a phase transition. For any given level of signal sparsity, there is a critical lower bound on the dimensionality of a random projection that is required to accurately recover the signal; this critical dimension decreases with increasing sparsity. Also, in section 6.3, we review how the $L_1$ minimization problem can be formulated as a message passing problem [55]. This formulation yields a message passing dynamical system that qualitatively mimics neural network dynamics with a crucial history dependence term. $L_1$ minimization via gradient descent has been proposed as a framework for neuronal dynamics underlying sparse coding in both vision [56] and olfaction [57]. On the other hand, the efficiency of message passing in solving $L_1$ minimization, demonstrated in [55], may motivate revisiting the issue of sparse coding in neuroscience, and the role of history dependence in sparse coding network dynamics.

For readers who are more interested in applications of compressed sensing and random projections to neuronal information processing and data analysis, we refer them to [30]. There, diverse applications of how the techniques discussed in sections 5 and 6 can be used to acquire and analyze high dimensional neuronal data are reviewed, including magnetic resonance imaging [31]–[33], compressed gene expression arrays [34], compressed connectomics [35, 36], receptive field measurements [37, 38] and fluorescence microscopy [39] of multiple molecular species at high spatiotemporal resolution using single pixel camera [40, 41] technology. Also, diverse applications of these techniques to neural information processing are discussed, including semantic information processing [42]–[44], short-term memory [45, 46], neural circuits for $L_1$ minimization [47], learning sparse representations [48, 49], regularized learning of high dimensional synaptic weights from limited examples [50] and axonally efficient long range brain communication through random projections [51]–[54].

Finally, the appendix provides an overview of the replica method, in a general form that is immediately applicable to spin glasses, perceptron learning, unsupervised learning, random matrices and compressed sensing. Overall, the replica method is a powerful, non-rigorous method for analyzing the statistical mechanics of systems with quenched disorder. We hope that this exposition of the replica method, combined with the cavity and message passing methods discussed in this paper, will help to enable students and researchers in both theoretical neuroscience and physics to learn about the exciting interdisciplinary advances made in the last few decades at the intersection of statistical physics, computer science and neurobiology.
In a neural network ± is an independent, identically . ij J /N minimization, demonstrated in [55], 1 L minimization problem can be formulated as a 1 L , j s , i ) s J , ij s ) Statistical mechanics of complex neural systems and high dimensional data ( J is an inverse temperature reflecting sources of noise. The J , s ( βH β ij − X βH e ] 1 2 − e J [ − 1 Z s X spin degrees of freedom taking the values ) = represents the activity state of a neuron and J ) = , ] = N s s i ( J ( s [ J Z P H are i s The main property of interest is the statistical structure of high probability (low Finally, the appendix provides an overview of the replica method, in a general form minimization via gradient descent has been proposed as a framework for neuronal 1 energy) activity patterns. Muchpicture progress in in which spin the glass Gibbs theory distribution [7] in has (2) revealed decomposes a at physical low temperature (large doi:10.1088/1742-5468/2013/03/P030147 distributed (i.i.d.) zero mean Gaussian with variance 1 into many ‘lumps’ ofsubsets probability mass of (more activity rigorously, patterns. pure Equivalently, these states [61]) lumps concentrated can on be thought of as concentrated is the partition functionconnectivity and matrix is chosen to be random, where each where connectivity matrixdistribution of of the neural activity network. given by This Hamiltonian yields an equilibrium Gibbs The SK model [6]It is has a been prototypical employed as example aand of simple has a model made disordered of a spin statistical recent glassesmodeling mechanical resurgence [7,8], system. of as in well spike neuroscience as within trains neural the networks [59, [58], context].60 of It maximum entropy is defined by the energy function 2. Spin glass models of neural networks that is immediately applicable torandom spin matrices glasses, perceptron and learning,non-rigorous, compressed unsupervised method learning, sensing. for Overall,disorder. analyzing the We the replica hope statistical method thatand mechanics is this message of a exposition passing systems of powerful, methodscontexts, with the will if discussed help replica quenched in to method, this enablephysics students combined paper to and with learn within researchers about the in athe both exciting cavity wide theoretical interdisciplinary intersection variety neuroscience advances of of made and statistical in disparate physics, the computer last science few and decades neurobiology. at may motivate revisiting the issuedependence of in sparse sparse coding coding in network neuroscience, dynamics. and the role of history where the interpretation, dynamics underlying sparsehand, coding the in efficiency both of message vision passing [56] in and solving olfaction [57]. On the other message passing problem [55]. This formulationthat yields qualitatively mimics a neural message network passing dynamics dynamicalL with system a crucial history dependence term. to accurately recover the signal;Also, this in critical section dimension6.3, decreases we with review increasing how sparsity. the J. Stat. Mech. (2013) P03014 , ), a i (4) (5) , it s (the J limit, a P → − N does not i s q , which is the q is an average over a (i.e. the mean pattern i · a h ). In the large a , and is hard to compute. ) may not be self-averaging, J q ( J P , where a i i , and a probability mass s a h = a i m ]. Correlations between neurons can then be and provides a measure of the variability of . If there is indeed one state, then J . 
2 i [ ) J m Z ab i q P − ) q ( ] = ln Statistical mechanics of complex neural systems and high dimensional data . /N δ b i J [ b , then the overlap is m P b a i limit. As we see below, typical values of such quantities, for a βF = (1 , can be computed theoretically by computing their average m P − q J N and i ab X X a 1 N ) = can still yield a wealth of information about the geometric organization q = ( J J ab is the probability that a randomly chosen activity pattern belongs to valley P q a P ; despite the fact that the overlap distribution J . One interesting quantity that probes the geometry of free energy minima is J ii ) q vanish in the large ( Now, the detailed activity pattern in any free energy minimum J ) depends on the detailed realization of the connectivity J P a i ), unless there is only one valley, or state (modulo the reflection symmetry , the distribution of overlaps between any two pairs of activity patterns independently a chosen from equation (2) is given by Now, since This distribution turnsJ out not to be self-averaging (it fluctuates across realizations of self-overlap of the state, depend on the detailed realization of on the minima of a free energy landscape with many valleys. Each lump, indexed by in which case the distribution becomes concentrated at a singlemean number activity acrosscase neurons of due multiple to valleys,hh one the can quenched also disorder compute the in disorder the averaged connectivity. overlap distribution In the doi:10.1088/1742-5468/2013/03/P030148 is characterized by a mean activityprobability pattern that a random activity pattern belongs to valley its average over To understand the statisticalcompute properties its of freecomputed the energy via Gibbs suitable distribution derivativesaveraging, in of which (2), the means it free that is energy. Fortunately, to useful the understand to free the energy is free self- energy for any realization of 2.1. Replica solution of free energy minimamethod, in which neural we activity now space. introduce. This can be carried out using the replica any given realizationover of all the distribution of overlapsbelong between to all pairs two valleys, of activity patterns. If the activity patterns configurations belonging to the freefree energy energy barriers valley betweenif valleys an diverge, activity so pattern that startsergodicity in in is dynamical one valley, versions broken, it ofaverage will as this activity stay time model, pattern. in that average Theare valley network activity for interested can infinite patterns in thus time. understanding are Thus, maintain the not multiple structure steady equal of states, these tom and steady the states. we fullHowever, Gibbs many interesting quantities,averaging, which which involve by averages definitionof over means all that neurons, their are fluctuations self- across different realizations J. Stat. Mech. (2013) P03014 ], J (9) (8) (6) (7) [ (10) (11) Z . Thus, even J , the preferred , the replicated J J , 2 ab . Thus, minimization . Applying this to (8) Q 2 ab 2 ab σ Q P ab 4) / replicated neuronal activity 2 P β ( 0 limit. The appendix provides n 4) N / e 2 → } β a ( . n s X J { . This average is difficult to perform − J = ++ 2 ) = ) alone does not determine the overlap ) a j a j s Q s Q a i . ( a i ( s n s ij E E , J Z yields =1 J n a ij P , a j ∂ ∂n ii P s J ( ] =1 a i 0 in (10), which yields an entropic term corresponding n a ij s J [ → a a P P n ) s Z β P N e it is useful to introduce = lim 4 . 
However, for any fixed realization of ln β / Statistical mechanics of complex neural systems and high dimensional data } J n a 1 , (1 hh s = b i 2 Z e X { x s − 2 x n } a i = , yielding σ a n s s J 2) X can be performed because it is reduced to a set of Gaussian { ** Z / ii N =1 0 and i (1 ] X J = = → J , we expect this similarity to survive, and hence we expect average [ J J n , . . . , n 1 , and we have in (A.2) N J = e /N , the replicas will prefer certain patterns. Which patterns are preferred were independent, marginalizing over, or integrating out, the disorder ii ii Q = lim z J βF n n a = i = 1 s Z Z Z = 1 zx a ab e 2 hh ln hh hh − Q h denotes an average over the disorder σ , J , for ij . One must still sum over a ii J s ab · is a zero mean Gaussian random variable with variance Q = hh z z However, minimization of the energy Thus, although for any fixed realization of the quenched disorder where doi:10.1088/1742-5468/2013/03/P030149 matrix of this energy functionfixed realization promotes of alignment of the replicas. The intuition is that for any after averaging over overlaps between replicas to be nonzero. will vary across realizations of where with activity patterns set of patternsneuronal will activity pattern be are controlled similar by across the same replicas quenched connectivity since the fluctuations of each replicated introduces attractive interactionsframework, presented between in the thethe appendix, overlap matrix replicas. the interaction Consistent between replicas with depends only the on general integrals. To do so, we use the fundamental identity because the logarithmdifficulty by appears exploiting inside the identity the average. The replica trick circumvents this patterns Now the average over is the overlap matrix between replicated activity patterns. where a general outline ofto the compute replica the approach average that over can be used for many problems. Basically, suffices to compute its average over all which can be performed more easily, and then take the This identity is useful because it allows us to first average over an integer power of J. Stat. Mech. (2013) P03014 ). ab 1), for ab ) = Q (13) (14) (12) q n Q = β > ab , . . . , s Q 1 s ( P . , J ) ab Q − q . Unfortunately, we will not explore this 0 limit with this replica symmetric ansatz ( ij δ J (figure1(A)). At lower temperature ( b → 6= i a X (i.e. permuting the rows and columns of n . Now, the physical meaning of the saddle point is implicitly an ansatz about the geometry and b a 1) s ab . ab − 1 z (see equation (A.24) for the derivation), Q Q n q a
( s = 0 is the only solution, representing a ‘paramagnetic’ ) n Statistical mechanics of complex neural systems and high dimensional data is 0 for all ab 0 q qz i → P √ n m β 1), , β n ( = lim − i 2 b J s = β < a ii s ) is unstable [63], and so one must search for solutions in which eff tanh h q ( H (weighted by their probability) is simply the distribution of off-diagonal ab
= J a i . Q P i (a special case of (A.8) and (A.9)), ab = m m hh q Q Q , with denotes an average with respect to the Gibbs distribution ) is given by (5). Therefore, the distribution of overlaps between pairs of free eff q n ( i . This is equivalent to an assumption that there is only one free energy valley, and J βH · b h − P 6= )e Now, the effective Hamiltonian yielding the average in (12) is symmetric with respect While this scenario seems plausible, a further analysis of this solution [6, 62] yields a /Z measures its heterogeneity. Taking the yields a saddle point equation for all q to permutations of the replica indices multiplicity of free energy valleys in (2), averaged over Therefore, it is natural to search for a replica symmetric saddle point in which matrix elements ofany the ansatz replica about overlap matrix. the Thus, structure in of searching for solutions to (12), (1 replica overlap matrixaveraged is overlap distribution, explained in section A.2; it is simply related to thewhere disorder energy minima At high temperaturestate ( (figure1(A)), in whichand average activity neural patterns activity fluctuatea over nonzero all solution possible risesstate configurations, continuously corresponding from to 0, amean suggesting single activity a valley (figure1(B)) phase in transition which to each ainconsistent neuron ‘frozen’ physical has predictions (like a negative different framework, entropy for this the system). inconsistency Withinsaddle the can point replica be for detected by showing that the replica symmetric doi:10.1088/1742-5468/2013/03/P03014 10 where breaks the replicamany symmetry. free This energy corresponds minima. to Apredicts great a a deal nested physical of hierarchical, picture work(see has tree figure1(C) in led like and which to organization (D)), there ahighly on known remarkably symmetric are rich as the and ansatz an ordered space which from ultrametric low of temperature structure purely free hierarchical [64]. energy random, structure Itphenomenon emerges minima disordered is further generically striking here, couplings that since for this most of the applications of replica theory to neuronal to the number ofminimization replicated drives activity overlaps patternssmall, with to since a there be given areoverlaps. large, set many This of more entropy competition overlaps. replicated maximization between While configurationsoverlap energy drives with energy matrix. and small, After overlaps entropy rather computing tomatrix leads than this can large, to be entropic a be term,equations potentially the computed for nontrivial most via likely the value of saddle the point overlap method, yielding a set of self-consistent J. Stat. Mech. (2013) P03014 . 2 J q . This k [65]–[67]. J RSB ansatz [7]. ∞ = describing the typical k 1 q that does not depend on the q = 2. The true low temperature phase of k is characterized by two order parameters, or to the connectivity matrix , and found that such a network exhibits Q β J is 0. (B) The replica symmetric ansatz for a low q -step RSB schemes describing scenarios in which the k induce macroscopic changes in the location of energy was called temperature or disorder chaos respectively J J Statistical mechanics of complex neural systems and high dimensional data or or β . (C) One possible ansatz for replica symmetry breaking (RSB) β J Probability lumps in free energy valleys. Schematic figures of the space . This ansatz, known as one-step RSB, corresponds to a scenario in which 2 > q 1 Figure 1. 
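Equation (14) is easily solved numerically by fixed point iteration, with the Gaussian average evaluated by quadrature. A minimal sketch (our illustration, with arbitrary grid and tolerance choices); scanning $\beta$ exhibits the continuous transition at $\beta = 1$:

```python
import numpy as np

def solve_q(beta, n_grid=2001, tol=1e-10, q0=0.5):
    """Fixed point iteration of q = <<tanh^2(beta*sqrt(q)*z)>>_z, eq. (14)."""
    z = np.linspace(-8, 8, n_grid)
    dz = z[1] - z[0]
    gauss = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)   # unit variance Gaussian weight
    q = q0
    for _ in range(10000):
        q_new = np.sum(gauss * np.tanh(beta * np.sqrt(q) * z) ** 2) * dz
        if abs(q_new - q) < tol:
            break
        q = q_new
    return q

for beta in (0.5, 0.9, 1.1, 1.5, 2.0):
    print(beta, solve_q(beta))   # q = 0 for beta < 1; q > 0 beyond the transition
```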
of all possible neuronalconfigurations or spin with configurations non-negligible (large(2) circle) probability (shaded and areas). under the (A) space theby At of the Gibbs high spin temperature Gibbs distribution alldrawn distribution. in spin from Thus, configurations the the arethe Gibbs explored inner replica product distribution order will between parameter two typically random have spins 0 inner product, and so Gibbs distribution decomposes into a nested hierarchy ofthe lumps SK of model depth is thought to be described by a particular temperature phase: the spins freezevalley), into which a small can set differ of configurations from (free energy realization to realization of the connectivity the Gibbs distribution breaksinner into multiple product lumps, between with describing two the configurations typical inner chosen product(D) from between There configurations exists the from a same different series lumps. of lump, and figure describes a possible scenario for q However, the inner productreplica order between two parameter,realization random takes of spins, a and nonzeroin therefore value which also the the replica overlap matrix broken replica symmetrystable with corresponding respect to topossibility thermal a that or this noise hierarchy multiplicity induced ofprocessing of fluctuations. states It may tasks. low be is However, useful tempting energywith several for to performing states explore works respect neural the have that to information perturbations noted are either thermal that to fluctuations, the while inverse temperature they these are states not are stable structurally stable with respect to So far, in order towith introduce a the random replica symmetric method, connectivity we matrix have analyzed a toy neuronal network 2.2. Chaos in the SK model and the Hopfield solution doi:10.1088/1742-5468/2013/03/P03014 11 processing and data analysiscorrect. discussed below a replica symmetric analysis turns out to be Indeed very small changes to minima in the space ofactivity neuronal patterns activity patterns. to This sensitive either in dependence [66]. of low For energy connectivities neural information whose processing, noisy it dynamics would not be only useful thermally to instead stabilize have a network prescribed set of J. Stat. Mech. (2013) P03014 P (15) is O(1) neurons is large, patterns µ µ N P m m play the role µ , and when all . This problem µ ξ j ξ µ i ξ , the system is in a P/N c -dimensional patterns is initialized to either = s , the network will relax α µ , denoting the overlap of PN α > α ξ s ) are learned, or stored, in is chosen independently to can a network of µ · µ i ξ µ P ξ ξ ) /N 138. For . in each valley is large for one pattern µ = (1 = 0 µ c m m α < α 1. Hopfield’s proposal was to choose induces an equilibrium probability distribution over . Successful pattern completion is possible if there are ± J Statistical mechanics of complex neural systems and high dimensional data µ . = µ j that stabilizes a prescribed set of and the level of storage saturation µ i ξ . If so, then when network activity ξ through equation (2). Ideally, this distribution should have 2 J µ µ i β changes its synaptic weight by an amount proportional to the ξ ξ s i − =1 P are random and uncorrelated (each µ X µ 1 . This relaxation process is often called pattern completion. Thus, N is imposed upon the network, this correlation is ξ , where µ ξ µ = ξ ij to neuron J j ,...,P 1 with equal probability). 
These works extensively analyzed the properties − = 1 µ This synaptic connectivity A key issue then is storage capacity: how many patterns An early proposal to do just this was the Hopfield model [68]. Suppose one wishes to free energy valleys such that the average of and their reflections patterns are imposed upon the network in succession, the learned synaptic weights are , for and small for all the rest. These free energy valleys can be thought of as recall states. µ µ P fits the classic moldof of quenched disordered statistical disorder,of physics, and where freedom. neuronal the In patterns activitycollection particular, patterns of the self-averaging play order structure the parameters of role free of energy thermal minima degrees can be described by a free energy valleys, corresponding to lumpsξ of probability mass located near the store? This issue wasthe addressed stored in patterns [70, be71] via +1 the or replica method in the situation where neuronal activity with2 pattern of free energy valleysof in the the inverse temperature Gibbs distribution (2) with connectivity (15), as a function neuronal activity patterns P given by (15). a corrupted or(under partial a version dynamics of whosecorresponding stationary one distribution of to is the given by learned (2)) patterns to the free energy valley the network’s synaptic weights (i.e.be through viewed as (15)), motion and down subsequentsuccessful, a network free the dynamics energy minima landscape can determined ofthe by this the process weights. free If of energy learninginitial is landscape recalling network correspond past activity to patterns experience past induced corresponds experiences, by to current and stimuli. completing partial or corrupted µ The replica method inSolutions [70, to]71 yields the aare replica set found equations, of at in self-consistent low equations which temperature for precisely only these one when averages. order parameter find a network connectivity Hopfield’s prescription providesmemory: a the unifying structure of framework past for experience thinking (i.e. about the patterns learning and neuronal activity patterns, butto do changes so in in either a the manner connectivity that or is level structurally of stable noise. with respect This choice reflectsfrom the neuron outcome ofcorrelation a between Hebbian the learning activityactivity rule on pattern [69] its presynaptic in and which postsynaptic each neurons. synapse When the spin glass state withwith many free any energy of minima, the none patterns of (in which thedoi:10.1088/1742-5468/2013/03/P03014 have solutions a to macroscopic the overlap replica equations, no average 12 ξ J. Stat. Mech. (2013) P03014 (16) (17) (18) . Indeed, = +1 (or µ . Thus, all is O(1) for 1 ξ i s s µ i 1 m neurons in (16) J N . For example, because 1 , whenever s 1 is a sum of many terms, it i 1 J to h = N i 1 J becomes large behaves like the low . Since corresponding to low temperatures, , . . . , s N α 2 β s , . . . , s 2 1) and s . (0 O , = 1 \ α H j s + Statistical mechanics of complex neural systems and high dimensional data i 1 s h ij 1 J s ). At such high levels of storage, so many patterns ‘confuse’ i c s − =2 N i X 1 ij J 1 2 ) = , at low enough temperatures, spurious, metastable free energy through the symmetric coupling − c J and be found in [70, 71]. N =2 α > α i i , X s s = β phase plane with neurons can be written as ( = 1 . However, as the temperature is increased, such mixture states melt away. 
\ N β 1 N – with µ α < α h H H and α in (14), which may seem a bit obscure. In particular, we give an alternate α q → ∞ 1) this exerts a positive (or negative) effect on the combination decreases with increasing temperature. Nevertheless, in summary, there is a robust − c The starting point involves noting that the SK Hamiltonian (1) governing the Even for interacts with P,N α = 1 1 s doi:10.1088/1742-5468/2013/03/P03014 13 s by a Gaussian distribution.because However, such the a individual Gaussianarises terms approximation from is are a generally correlated common invalid coupling with of each all other. the One neurons source of correlation is tempting to approximate its thermal fluctuations in the full system of where is the local field acting on neuron 1, and is the Hamiltonian of the rest of the neurons the network, so that its low energy states do not look like any one pattern We now return tolight an on analysis the physical of meaningparameter the of SK the saddle model point through equation an for the alternate replica method symmetric that order sheds 2.3. Cavity method derivation of (14) through theintuition cavity for method [7, (14)72], bymethod, which describing while provides considerable indirect, it physical can asdirect often replica a provide methods. self-consistency intuition condition. for In the general, finalfluctuations the results of derived cavity via more region in the as the free energy landscape of the Hopfield model as more than one This phenomenon illustrates aHowever, beneficial there role is for aas noise tradeoff in to associative melting memory away operation. mixture statesin by which the increasing recall states temperature, and dominate the the free network energy can landscapedevice. successfully over operate neural Many activity as patterns, important afunction pattern details of completion, or about associative the memory phase diagram of free energy valleys as a valleys corresponding tocharacterized mixtures by of solutions patterns to can the replica also equations arise. in These which mixture the states average are temperature spin glass phase of the SK model discussed in the previous section. J. Stat. Mech. (2013) P03014 . 1 in s (20) (19) exerted from the 1 of neuron h ,...,N 1 s spins can be takes the form 1 N is known as the h 1 absence h and 1 in terms of the cavity s 1 h ) in (20) is Gaussian, then 1 N h . ( 1 1 βH \ and \ − P 1 neurons in (16), but instead the e βH s − ! e N i s i ! in the full system of i 1 . Note that this does not imply that the s J 1 i 1 1 \ h J N =2 , H i X ) by all other neurons can be approximated by ) (17) in a Gibbs distribution with respect to 1 neurons obtained by removing 1 1 N =2 s i − X 1 , that has been removed from the system. (C) In h − ( 1 1 h 1 s − h \ N 1 P
Thus, all the individual terms $J_{1i} s_i$ in (17) exhibit correlated fluctuations, so the distribution of the local field in the full system need not be Gaussian.

The key idea behind the cavity method is to consider not the distribution of the local field $h_1$ in the full system of $N$ neurons, but instead the distribution of $h_1$ in a 'cavity system' of $N - 1$ neurons obtained by removing $s_1$ from the system, thereby leaving a 'cavity' (see figures 2(A) and (B)). The distribution of $h_1$ in the cavity system is known as the cavity field distribution, i.e. the distribution of the field exerted on neuron 1 by all the others in the absence of $s_1$. The joint distribution of $s_1$ and its local field $h_1$ in the full system of $N$ neurons can then be written in terms of the cavity field distribution as follows:

\[ P(s_1, h_1) = \frac{1}{Z_1}\, e^{\beta s_1 h_1}\, P_{\backslash 1}(h_1), \tag{19} \]

where

\[ P_{\backslash 1}(h_1) = \frac{1}{Z_{\backslash 1}} \sum_{s_2, \ldots, s_N} \delta\!\left(h_1 - \sum_{i=2}^N J_{1i} s_i\right) e^{-\beta H_{\backslash 1}} \tag{20} \]

is the distribution of $h_1$ (17) in the cavity system (18) of neurons $s_2, \ldots, s_N$, i.e. in the absence of $s_1$. Because the cavity system does not couple to neuron 1, it does not know about the set of couplings $J_{1i}$, and therefore the thermal fluctuations of the cavity activity patterns $s_2, \ldots, s_N$, while of course correlated with each other, must be uncorrelated with the couplings $J_{1i}$. Motivated by this lack of correlation, we can make a Gaussian approximation to the thermal fluctuations of the cavity field. Indeed, the advantage of writing the joint distribution of $s_1$ and $h_1$ in terms of the cavity field distribution in (19) is that one can now plausibly make a Gaussian approximation to $P_{\backslash 1}(h_1)$.

Figure 2. The cavity method. (A) A network of neurons, or spins. (B) A cavity system surrounding a single neuron, $s_1$, that has been removed from the system. (C) In a replica symmetric approximation, the full distribution of the field $h_1$ on the cavity (in the absence of $s_1$) is a Gaussian distribution, while the joint distribution of $s_1$ and $h_1$ in the full system takes the form in equation (20).
other accuratelyconnected This hand, described correlation can if by will be the a system receive truecannot single is contributions neglect if free described the from the energy off-diagonal by fluctuations cavity is terms multiple across tantamount [7]. free to valleys, Thus, energy an and thelandscape. valleys, assumption we validity the As of of discussed replica this above, symmetry,self-averaging: cavity under or it approximation the a does assumption single of not valleyFinally, a in depend we single the on note valley, free we the energy thatsymmetry expect detailed is the realization broken cavity of and method there can are multiple be valleys extended [7]. to scenarios in which replica N assumption that the connected correlation above, this non-Gaussianity arises due to positive correlations between and variance where and by their coupling field is shown in the transition from figure2(B) to figure2(C). P J. Stat. Mech. (2013) P03014 i \ to i i is a (28) (29) , for ik h i , and h i J i \ i in (14), k s q h for all ik in (27), and respectively. limit. Under i J . However, we i i q s J 6= qz N h k , which is itself a √ q P , β q , we can replace the = i i \ \ i i i i h h h = tanh h N , , which are uncorrelated with i i , in the large q ik J J − over random realizations of 1 i \ i i qz, h h √ . For each i | should be the same as the distribution of 1 s J . h . 2 N z i to obtain an expression for q i
for each − 2 N i i 1 \ , q i i i , which we can do by demanding self-consistency of the \ h q i − for a fixed realization of Statistical mechanics of complex neural systems and high dimensional data i h is computed via (26) and (27), and reflects the thermal 1 i h N | h i qz, . Mathematically, this corresponds to computing the marginal · ) that do not depend on the detailed realization i h ij √ 2 s J i | h i i s in (26), which yields N s =1 h i i X h , reflecting the heterogeneity of the mean cavity field across neurons, 1 P sh z N
Equation (29) is a self-consistent equation for the order parameter $q = (1/N)\sum_i \langle s_i \rangle^2$, which is a measure of the heterogeneity of mean activity across neurons; at the same time, $q$ sets the heterogeneity of the mean cavity fields across neurons. Therefore, physically, (29) reflects a demand that the statistical properties of the cavity fields be consistent with the heterogeneity of mean neural activity that those fields generate. When the Gaussian distribution of cavity fields is substituted into this demand for self-consistency, we recover precisely the self-consistent equation for $q$ derived via the replica method in (14).

2.4. Message passing

So far, we have seen two methods that allow us to calculate self-averaging quantities (for example $q$) that do not depend on the detailed realization of $J$. However, we may wish to understand the detailed pattern of mean neural activity, i.e. $\langle s_i \rangle$ for each $i$, for a fixed realization of $J$. Mathematically, this corresponds to computing the marginal distribution of a single neuron in the full joint distribution given by (2). Efficient distributed message passing algorithms from computer science [5, 10, 11] have been developed to compute such marginals in probability distributions that obey certain factorization properties, which we now introduce.
represented by circlesflow and of factor messages nodes involved are ininteraction represented the by update squares.a of (B) the The message chain; the marginal on message passing approximation to the joint distribution of P if and only if factor or factors P a a i ∈ is any arbitrary variable that could be either continuous or discrete, and i i , or equivalently factor x a Consider, for example, a joint distribution over The utility of the factor graph representation is that an iterative algorithm to compute ∈ i into a set of Here, denotes the collectionabuse of notation and variables think that of factor each factor index variable if be visualized in ato factor variables graph, which is a bipartite graph whosemodel, or nodes more correspond generally either anyto neural a system factor with graph an in equilibrium which distribution, the corresponds neurons to nonzero synaptic weights connecting pairs of neurons. Thus, each neuron pair the marginals doi:10.1088/1742-5468/2013/03/P03014 17 J. Stat. Mech. (2013) P03014 ) b a i by x ) is ( (see i (32) (33) (34) (35) i feels a x \ i i ( → b i a → ∈ M i j M . In contrast, there are two except ) the message b i t induced by the x b , variable ( i i a x , supplemented by → b t b ψ M on variables b ) as an approximation to j . In this case, x a ( b (the left-hand side of (32)) is , and by → i t b j in the factor M can be approximated via i a ∈ , ) alone on i to factor j b x (see figure3(B)). Message passing involves j , since in the absence of ( i b a x → t j . ) i M . i x \ ) ( i b a ∈ x Y j ( → ∞ i i , ) induced by all other interactions besides interaction b ) +1 → i approximate the true marginals though equations (34) t M ) (see3(B)). figure The (unnormalized) update equation b x i Statistical mechanics of complex neural systems and high dimensional data x j ( a alone. These messages will be used below to approximate x a ( b x M i ∈ i b ( Y → ψ a ∞ i b → in (35) (see also figure3(C)). This approximation treats the \ ) i ∞ a i \ a → b M ∈ t a Y j x x b X . Intuitively, we can think of M x ( i i ) as an approximation to the distribution on M a i ∈ ψ a x Y induced by all other interactions besides interaction and ) = ) = is connected to only one factor node ( , except for interaction i i i i j i ∝ x i ∝ x → x ( ( → t ) b ) the message from variable in the full joint distribution of all interactions (see e.g. (34)). ) i ∞ a a i a j i +1 +1 → → x x x M t t b i x M ( ( ( b M M P P to variable → t denotes the set of all variables connected to factor node j b i M \ can be visualized as the flow of messages along the factor graph (figure3(B)). We b i The (unnormalized) update equation for a factor to variable message is given by The update equations (32) and (33), while intuitive, lead to two natural questions: can be approximated via i for all types of messages, one fromdenote variables to by factors and the other from factors to variables. We first define this iterativeis algorithm a and probability then distribution later over give a justification single for variable, it. and Every at message any given time the distribution on from factor we can think of direct influence of interaction the marginal of where figure3(B)). 
Intuitively, the directobtained influence by of marginalizing out all variables other than accounting for the effectsthe of product all of of messages the other interactions besides for the variable to factor messages is then given by Intuitively, the distribution on (the left-hand side of (33))that is simply involve the variable product ofrandomly the direct initializing influences all of all the(32) interactions messages and (33) and until then convergence. Onewhere iteratively exception running any to the the variable random update initialization equations is the situation initialized to be a uniform distribution over no influence from the rest of the graph. Under the message passing dynamics, will remain aguaranteed, uniform but if distribution. the algorithm Now, doesx converge, for then the general marginal distribution factor of a graphs, variable convergence is not and indeed the joint distribution of all variables and (35)? Amarginal key of intuition the variables arises from thedoi:10.1088/1742-5468/2013/03/P03014 structure of the approximation to the joint 18 for which factor graphspoint will messages they converge, and, if they converge, how well will the fixed J. Stat. Mech. (2013) P03014 . ) a N (36) (37) (38) (39) ∈ ) is i i in the s ( i i that were → ) i ,i 1 − would require i ∞ ( i s M , ) 1 on these variables by − . Overall, this method i k b , and the normalization s i ( s ) ,k are independent, and their 1) = 1. Note that whereas 1 , and, after convergence, we − − a i ( k ( ∈ 2). A similar leftward iteration P → , i 1 . by explicitly including the factor , − , t +1 k k a +1 s ∞ k M k s 1 s k = s (+1) + − . k +1 t ) s P +1 i k k,k s s ), is initialized to be a uniform distribution, J k,k ( 1 1 i J 1 i − s − 1 → = ( =1 − N k k,k i k 2) J , +1) P β P (1 ). Each iteration converges in an amount of time β e i,i i β ). An exactly analogous approximation is made in ∞ ( e i → 1 s e 0 1 ( − x 1 N i M k ( − X s i M ) a → i ,...,s → s ∞ i Statistical mechanics of complex neural systems and high dimensional data through interaction X ( X ,...,s +1) i +1 ) = 1 i M s s i,i a k → ∞ ( ) s . ( ,i ∈ leads to a factor graph in which all the variables are now weakly coupled (ideally independent) under all the 1 a M k i ) = ) = − a i i → i a ∞ ) 6= ( s s ( ( ,k b i i 1 M − → → ) +1 k ∝ t marginals simultaneously, as (36) holds for all ( ,i 1 removes all paths through the factor graph between variables +1) ) . For example, the rightward iteration for computing i i M − N i i,i s ∞ ∞ a ( ( ( P M M ) = k s ( +1) k,k ) operations, this iterative procedure for computing the marginal requires only O( ( This weak coupling assumption under the removal of a single interaction holds exactly Although message passing is only exact on trees, it can nevertheless be applied to N → +1 . However, it approximates the effects of all other interactions t k a coupling of theψ variables a simple product of messages the update equation (32).removing Such approximations the might interaction bepreviously expected to connected work to wellremaining whenever interactions whenever the factor graphone is interaction a tree,In with the no loops. absence Indeed, ofjoint in any distribution such such factorizes, a paths, consistentIn case, all with general, removing pairs the whenever any of the approximationsin variables made finite factor time, graph in and is) (32 a the a and general fixed point tree, proof35). ( messages of thechain yield this message (see the fact, passing true figure3(D)). 
The weak coupling assumption under the removal of a single interaction holds exactly whenever the factor graph is a tree, with no loops. Indeed, in such a case, removing the interaction a removes all paths through the factor graph between the variables that were previously connected to a. In the absence of any such paths, the joint distribution of these variables factorizes under all the remaining interactions, consistent with the approximations made in (32) and (35). In general, whenever the factor graph is a tree, message passing converges in finite time, and the fixed point messages yield the true marginals; we will not give a proof of this fact here (see [11]).

We will instead illustrate it in the case of a one dimensional Ising chain. Consider the marginal distribution of a spin at position i in the interior of the chain. This spin feels an interaction at its left and at its right, and so (34) tells us that its marginal is a product of two converged messages,

P(s_i) ∝ M^∞_{(i−1,i)→i}(s_i) M^∞_{(i,i+1)→i}(s_i).   (36)

Each of these two messages can be computed by iterating messages from either end of the chain (see figure 3(D)). For example, the rightward iteration is

M^{t+1}_{(k,k+1)→k+1}(s_{k+1}) ∝ Σ_{s_k} e^{βJ_{k,k+1} s_k s_{k+1}} M^t_{k→(k,k+1)}(s_k),  with  M^t_{k→(k,k+1)}(s_k) = M^t_{(k−1,k)→k}(s_k),   (37)

where the first equality is a special case of (32) and the second is a special case of (33). The first message in this iteration, M_{1→(1,2)}(s_1), is initialized to be a uniform distribution, since spin 1 is only connected to a single interaction (1, 2). A similar leftward iteration leads to the calculation of M^∞_{(i,i+1)→i}(s_i).   (38)

Each iteration converges in an amount of time given by the path length from the corresponding end of the chain to position i, and inserting the converged messages into (36) yields the correct marginal for s_i. Whereas a naive computation of this marginal, by demanding a brute-force sum over all spin configurations, would require O(2^N) operations, this iterative procedure requires only O(N) operations. Moreover, two sweeps through the chain allow us to compute all the messages, and therefore all N marginals simultaneously, as (36) holds for all i. Overall, this method is essentially identical to the transfer matrix method for the 1D Ising chain, and message passing on general graphs can be viewed as a generalization of it; the corresponding approximation to the free energy is the Bethe approximation [73].
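For the Ising chain specifically, the two-sweep recursion (36)-(38) can be checked against brute-force enumeration in a few lines. Chain length, couplings, and temperature below are illustrative choices.

```python
import numpy as np

# Message passing marginals for a 1D Ising chain, P(s) ∝ Π_k exp(beta*J[k]*s_k*s_{k+1}).
rng = np.random.default_rng(0)
N, beta = 8, 1.0
J = rng.normal(size=N - 1)
spins = np.array([-1.0, 1.0])

# Rightward messages R[k] = M_{(k-1,k)->k}(s_k); leftward L[k] = M_{(k,k+1)->k}(s_k).
R = np.ones((N, 2)) / 2
L = np.ones((N, 2)) / 2
for k in range(1, N):           # rightward sweep, special case of (32)
    m = np.einsum('a,ab->b', R[k - 1], np.exp(beta * J[k - 1] * np.outer(spins, spins)))
    R[k] = m / m.sum()
for k in range(N - 2, -1, -1):  # leftward sweep
    m = np.einsum('b,ab->a', L[k + 1], np.exp(beta * J[k] * np.outer(spins, spins)))
    L[k] = m / m.sum()

# Marginals from (36): product of the two converged messages at each site.
bp = R * L
bp /= bp.sum(axis=1, keepdims=True)

# Brute-force check over all 2^N configurations.
exact = np.zeros((N, 2))
for idx in range(2 ** N):
    s = spins[[(idx >> k) & 1 for k in range(N)]]
    w = np.exp(beta * np.sum(J * s[:-1] * s[1:]))
    for k in range(N):
        exact[k, int(s[k] > 0)] += w
exact /= exact.sum(axis=1, keepdims=True)
print(np.max(np.abs(bp - exact)))  # ~1e-16: message passing is exact on a chain
```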
Although message passing is only exact on trees, it can nevertheless be applied to graphical models with loops, and, as discussed above, it should yield good approximate marginals whenever the variables adjacent to a factor node are weakly correlated upon removal of that factor node. We will see examples of message passing successfully applied to learning in section 3.6 and to compressed sensing in section 6.3.

An early theoretical advance partially justifying the application of message passing to graphical models with loops is a variational connection: each solution to the fixed point equations of message passing corresponds to an extremum of a certain Bethe free energy, an approximation to the Gibbs free energy that is exact on trees [74] (see [75] for a review). However, there are no known general and precise conditions under which message passing in graphical models with loops is theoretically guaranteed to converge to messages that yield a good approximation to the true marginals. Nevertheless, in practice, message passing seems to achieve empirical success in many models with loops when the correlations between variables adjacent to a factor are indeed weak after removal of that factor.

We conclude this section by connecting message passing back to the replica method. In general, suitable averages of the message passing equations reduce to the replica equations [5]. To illustrate this connection, we outline the derivation of the replica symmetric saddle point equation (14) from message passing in the special case of the SK model. We first note that every factor node in the SK model has degree 2, since each factor corresponds to a nonzero synaptic weight connecting a pair of neurons. Therefore, the update of the message from a factor (i, j) to a variable j depends only on the message from variable i to factor (i, j):

M^{t+1}_{(i,j)→j}(s_j) ∝ Σ_{s_i} e^{βJ_{ij} s_i s_j} M^t_{i→(i,j)}(s_i),   (40)

which is a special case of (32). Thus, we can take one set of messages, for example the variable to factor messages, as the essential degrees of freedom upon which the message passing dynamics operates. We simplify the notation a little by letting M^t_{i→j}(s_i) ≡ M^t_{i→(i,j)}(s_i).

Now, each message is a distribution over a binary variable, and all such distributions can be usefully parameterized by a single scalar parameter,

M^t_{i→j}(s_i) ∝ e^{β h^t_{i→j} s_i}.   (41)

Here, the scalar parameter h^t_{i→j} can be thought of as a type of cavity field: if message passing is successful, h^t_{i→j} converges to the field exerted on spin i in a cavity system in which the interaction J_{ij} is removed. In terms of this parameterization, the message passing updates (40) and (41) yield a dynamical system on the cavity fields [76],

h^{t+1}_{i→j} = h_0 + Σ_{k ∈ i\j} u(J_{ik}, h^t_{k→i}),   (43)

where h_0 is any external field, k ∈ i\j runs over all spins k coupled to i through a nonzero J_{ik}, except j, and the scalar function u(J, h) is defined implicitly through the relation

e^{βu(J,h)s} ∝ Σ_{s'} e^{βJss' + βhs'},   (42)

or explicitly,

u(J, h) = (1/β) arctanh[ tanh(βJ) tanh(βh) ].   (44)
Physically, u(J, h_k) is the effective field exerted on a binary spin s_i by another spin s_k that is coupled to s_i with strength J and itself experiences an external field of strength h_k (besides the influence of s_i); it is obtained by marginalizing out s_k.

Using (43), we are now ready to derive (14). The key point is to consider self-consistency conditions for the distribution of cavity fields. At a message passing fixed point, there is an empirical distribution of cavity fields h_{i→j} across all choices of pairs i → j. For a fixed realization of the couplings, h_{i→j} is a random variable due to the random choice of couplings. The assumption of self-averaging means that as N → ∞, the former empirical distribution converges to the distribution Q(h) of the latter random variable. In any case, if we would like to write down a self-consistent equation for the distribution of cavity fields, this distribution must be self-reproducing under the update equation (43). More precisely, if the couplings J_k are drawn i.i.d. from a distribution P(J) and the cavity fields h_k are drawn i.i.d. from Q(h), then the new cavity field generated by (43) must again be distributed as Q(h):

Q(h) = ∫ Π_k dJ_k dh_k P(J_k) Q(h_k) δ( h − h_0 − Σ_k u(J_k, h_k) ).   (45)

Here, we have suppressed the arbitrary indices i and j. This yields a recursive distributional equation characterizing Q(h). More generally, one can track the time-dependent evolution of the distribution of cavity fields, an algorithmic analysis technique known as density evolution [5].

In general, it can be difficult to solve the distributional equation (45) for Q(h). In the weak coupling limit of small J, one can use the small coupling approximation u(J, h) ≈ J tanh(βh), which reflects the simple approximation that the field exerted on s_i is the coupling J times the average magnetization tanh(βh_k) of s_k in the presence of its own cavity field; the more complex form of u(J, h) in (44) reflects the back-reaction of s_i on s_k, which becomes non-negligible at larger values of the bi-directional coupling J. Now, in the SK model the couplings are zero mean Gaussian with variance 1/N, so the weak coupling approximation applies, and (43) becomes (setting the external field h_0 = 0)

h_{i→j} = Σ_{k ∈ i\j} J_{ik} tanh(βh_{k→i}).   (46)

One could then make the approximation that the distribution of cavity fields is a zero mean Gaussian with variance q, so that (45) reduces to a self-consistency condition for q. The left-hand side of (46) has variance q by definition, while the right-hand side is a sum of many independent terms whose couplings have variance 1/N; averaging the square of both sides of (46) therefore yields

q = ∫ dh Q(h) tanh²(βh).   (47)

Now, since we have assumed that Q(h) is zero mean Gaussian with variance q, (47) becomes

q = ∫ Dz tanh²(β√q z),   (48)

which is precisely the replica symmetric saddle point equation (14).
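The distributional equation (45) can also be solved numerically by population dynamics (density evolution): represent Q(h) by a large population of fields and repeatedly resample the update. The following minimal sketch, with illustrative parameters, applies this to the weak-coupling update (46) and checks the resulting order parameter against the fixed point of (48).

```python
import numpy as np

# Population dynamics sketch for the SK cavity equations under the weak coupling
# approximation h = sum_k J_k tanh(beta*h_k), J_k ~ N(0, 1/N).  Illustrative
# parameters; compares the population variance statistic with eq. (48) / (14).
rng = np.random.default_rng(1)
beta, N, pop_size, sweeps = 1.5, 100, 5000, 100

h = rng.normal(size=pop_size)           # initial population representing Q(h)
for _ in range(sweeps):
    idx = rng.integers(pop_size, size=(pop_size, N))  # resample N cavity neighbors
    J = rng.normal(scale=1.0 / np.sqrt(N), size=(pop_size, N))
    h = np.sum(J * np.tanh(beta * h[idx]), axis=1)    # update (46)

q_pop = np.mean(np.tanh(beta * h) ** 2)               # eq. (47) from the population

# Fixed point iteration of q = ∫ Dz tanh^2(beta*sqrt(q)*z), eq. (48).
z = rng.normal(size=200000)
q = 0.5
for _ in range(500):
    q = np.mean(np.tanh(beta * np.sqrt(q) * z) ** 2)

print(q_pop, q)   # the two estimates should agree above the transition (beta > 1)
```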
In summary, we have employed a toy model of a neural network, the SK spin glass model, to introduce the replica, cavity and message passing approaches to analyzing disordered statistical mechanical systems. In each case we have discussed in detail the simplest possible ansatz concerning the structure of the free energy landscape, namely the replica symmetric ansatz, corresponding to a single free energy valley with weak correlations between degrees of freedom. While this assumption is not true for the SK model, it nevertheless provides a good example system in which to gain familiarity with the various methods. In addition, for many of the applications discussed below, the assumption of a single free energy valley fortunately turns out to be correct. Finally, we note that just as the replica and cavity methods can be extended [7] to scenarios in which replica symmetry is broken, corresponding to many free energy valleys and long range correlations, so too can message passing approaches. Indeed, viewing optimization and inference problems through the lens of statistical physics has led to a new message passing algorithm, known as survey propagation [77, 78], which can find good marginals, or can minimize costs, in free energy landscapes characterized by many metastable minima that can confound more traditional, local algorithms.

3. Statistical mechanics of learning

In the above sections, we have reviewed powerful machinery designed to understand the statistical mechanics of fluctuating neural activity patterns in the presence of disordered synaptic connectivity matrices. A key conceptual advance made by Gardner [79, 80] was that this same machinery could be applied to the analysis of learning, by performing statistical mechanics directly on the space of synaptic connectivities, with the training examples presented to the system playing the role of quenched disorder. In this section, we will explore this viewpoint and its applications to diverse phenomena in neural and unsupervised learning (see [12] for an extensive review of this topic).

3.1. Perceptron learning

The perceptron is a simple neuronal model defined by a vector of N synaptic weights w, which linearly sums a pattern of incoming activity ξ and fires depending on whether or not the summed input is above a threshold. Mathematically, in the case of zero threshold, it computes the function σ = sgn(w · ξ), where σ = +1 represents the firing state and σ = −1 represents the quiescent state. Geometrically, the perceptron separates its input space into two classes, each on opposite sides of the N − 1 dimensional hyperplane orthogonal to the weight vector w. Since the absolute scale of the weight vector is not relevant to the problem, we will normalize the weights to satisfy w · w = N, so that the set of perceptrons lives on an N − 1 dimensional sphere.
Suppose we wish to train a perceptron to memorize a desired set of P input–output associations, ξ^μ → σ^μ, for μ = 1, …, P. Doing so requires a learning rule (an algorithm for modifying the synaptic weights w based on the inputs and outputs) that finds a set of synaptic weights w that satisfies the P inequalities

λ^μ ≡ (1/√N) σ^μ w · ξ^μ ≥ 0,   ∀ μ = 1, …, P,   (51)

where λ^μ is the alignment of example μ with the weight vector w. Successfully memorizing all the patterns requires all alignments to be positive. We will see below that, remarkably, as long as there exists a simultaneous solution to the inequalities, then a learning rule, known as the perceptron learning rule [13], can find the solution. The main remaining question then, is under what conditions on the training data does a solution to the inequalities exist?

A statistical mechanics based approach to answering this question involves defining an energy function on the space of synaptic weights,

E(w) = Σ_{μ=1}^P V(λ^μ),   (52)

where V(λ) should be a potential that penalizes negative alignments and favors positive ones. Indeed, a wide variety of learning algorithms for the perceptron architecture can be formulated as gradient descent on E(w) for various choices of potential function V(λ) in (52) [12]. However, if we are interested in probing the space of solutions to the inequalities (51), it is useful to take V(λ) = θ(−λ), where θ(x) is the Heaviside function (θ(x) = 1 for x ≥ 0, and 0 otherwise). With this choice, the energy function in (52) simply counts the number of misclassified examples, and so the Gibbs distribution

P(w) = (1/Z) e^{−βE(w)}   (53)

in the zero temperature (β → ∞) limit becomes a uniform distribution on the space of perceptrons satisfying (51) (see figure 4). Thus, the volume of the space of solutions to (51), and, in particular, whether or not it is nonzero, can be computed by analyzing the statistical mechanics of (53) in the zero temperature limit.

Figure 4. Perceptron learning. (A) The total sphere of all perceptron weights (gray circle) and a single example (black arrow); the shaded region is the set of weights that yield an output +1 on the example. (B) The same as (A), but for a different example. (C) The set of weights that yield +1 on both examples in (A) and (B). (D) As more examples are added, the space of correct weights shrinks, and its typical volume is governed by the replica order parameter q introduced in section 3.3.
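The perceptron learning rule referenced above is short enough to state as code. Below is a minimal sketch that trains on random associations well below capacity and then verifies the inequalities (51); N, α, and the step size are illustrative choices.

```python
import numpy as np

# Minimal sketch: train a perceptron on P = alpha*N random associations via the
# classical perceptron learning rule, then check the inequalities (51).
rng = np.random.default_rng(2)
N, alpha = 400, 0.5                      # well below capacity alpha_c = 2
P = int(alpha * N)
xi = rng.normal(size=(P, N))             # random inputs
sigma = rng.choice([-1.0, 1.0], size=P)  # random desired outputs

w = np.zeros(N)
for epoch in range(1000):
    errors = 0
    for mu in range(P):
        if sigma[mu] * (w @ xi[mu]) <= 0:          # misclassified pattern
            w += sigma[mu] * xi[mu] / np.sqrt(N)   # perceptron update
            errors += 1
    if errors == 0:
        break

lam = sigma * (xi @ w) / np.sqrt(N)
print(epoch, lam.min() >= 0)   # all alignments nonnegative once converged
```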
3.2. Unsupervised learning

This same statistical mechanics formulation can be extended to more general unsupervised learning scenarios. In unsupervised learning, one often starts with a set of P data vectors ξ^μ, where each vector is of dimension N. For example, each vector could be a pattern of expression of N genes across P experimental conditions, or a pattern of activity of N neurons in response to P stimuli. The overall goal of unsupervised learning is to find simple hidden structures or patterns in the data. The simplest approach is to find an interesting single dimension spanned by a vector w, such that the projections λ^μ = (1/√N) w · ξ^μ of the data onto this single dimension yield a useful one dimensional coordinate system for the data. This interesting dimension can often be defined by minimizing the energy function (52), with the choice of potential V(λ) determining the particular unsupervised learning algorithm.

One choice, V(λ) = −λ, corresponds to Hebbian learning. Upon minimization of (52), this choice leads to w ∝ Σ_{μ=1}^P ξ^μ, i.e. w points in the direction of the center of mass of the data. In situations in which the data has its center of mass at the origin, a useful choice is V(λ) = −λ². With this choice, w points in the direction of the eigenvector of maximal eigenvalue of the data covariance matrix. This is the direction of maximal variance in the data, also known as the first principal component of the data, i.e. it is the direction that maximizes the variance of the distribution of the projected coordinates λ^μ across data points.

Beyond finding an interesting dimension in the data, another unsupervised learning task is to find clusters in the data. A popular algorithm for doing so is K-means clustering. This is an iterative algorithm in which one maintains a guess about K cluster centroids w_1, …, w_K in the data. At each iteration, each data point ξ^μ is assigned to its closest centroid, and then each centroid is recomputed as the center of mass of the data points currently assigned to it,

w_i = (1/|C_i|) Σ_{μ ∈ C_i} ξ^μ,   (54)

where C_i is the set of data points μ assigned to cluster i, i.e. those closer to w_i than to any other centroid. The cluster assignments of the data are then recomputed with the new centroids, and the whole process repeats. The idea is that if there are K well separated clusters in the data, this iterative procedure should converge so that each w_i is the center of mass of cluster i.

This iterative procedure can be viewed as an alternating minimization of a joint energy function over cluster centroids and cluster membership assignments, in which the centroids are optimized by minimizing the sum of the distances from each centroid to the data points assigned to it. In the case where the distance measure is Euclidean distance, and when both the data and the cluster centroids are normalized to have norm √N, minimizing distance is equivalent to maximizing alignment, and for the special case of K = 2 this energy function can be written (up to an additive constant) as

E(w_1, w_2) = −Σ_μ max(λ_1^μ, λ_2^μ),  where λ_i^μ = (1/√N) w_i · ξ^μ,   (56)

or equivalently, using max(a, b) = (a + b)/2 + |a − b|/2,

E(w_1, w_2) = −Σ_μ [ (λ_1^μ + λ_2^μ)/2 + |λ_1^μ − λ_2^μ|/2 ].   (57)

Gradient descent on this energy function forces each centroid w_i to perform Hebbian learning only on the data points that are currently closest to it.
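The three unsupervised choices of V(λ) discussed above translate directly into simple numerical procedures. The following sketch computes the Hebbian direction, the first principal component, and a two-centroid K-means solution on synthetic data; all parameter values are illustrative.

```python
import numpy as np

# Sketch: the three unsupervised learning procedures on a synthetic dataset.
# Hebbian (V = -lambda), PCA (V = -lambda^2), and K-means with K = 2 (eq. (56)).
rng = np.random.default_rng(3)
P, N = 500, 100
xi = rng.normal(size=(P, N))

# Hebbian direction: center of mass of the data.
w_hebb = xi.mean(axis=0)
w_hebb *= np.sqrt(N) / np.linalg.norm(w_hebb)

# First principal component: top eigenvector of the data covariance.
cov = (xi - xi.mean(0)).T @ (xi - xi.mean(0)) / P
w_pca = np.linalg.eigh(cov)[1][:, -1] * np.sqrt(N)

# K-means, K = 2: alternate assignment and center-of-mass steps (eq. (54)).
w = rng.normal(size=(2, N))
for _ in range(50):
    assign = np.argmin(((xi[:, None, :] - w[None]) ** 2).sum(-1), axis=1)
    for i in range(2):
        if np.any(assign == i):
            w[i] = xi[assign == i].mean(axis=0)

lam = xi @ w_pca / np.sqrt(N)   # projections lambda^mu onto the PCA direction
print(lam.var())
```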
3.3. Replica analysis of learning

Both perceptron learning and unsupervised learning, when formulated as statistical mechanics problems as above, can be analyzed through the replica method. A natural question for perceptron learning is how many random associations can a perceptron with N synaptic weights memorize? One benchmark is the case of random associations, where each input ξ^μ is a random vector drawn from a uniform distribution on a sphere of radius √N (or, equivalently in the large N limit, from a Gaussian distribution with identity covariance matrix), and the desired outputs are σ^μ = ±1, each with probability half. Similarly, a natural question for unsupervised learning is how do we assess the statistical significance of any structure or pattern we may find in a high dimensional dataset consisting of P points in N dimensions? To address this question, it is often useful to analyze what structure we would find in data that itself has no structure, for example when the data points are drawn from a null distribution given by a zero mean multivariate Gaussian with identity covariance matrix.

In both cases, the analysis simplifies in the 'thermodynamic' limit N, P → ∞, with the ratio α = P/N held constant. Fortunately, this is the limit of relevance to neural models with many synaptic weights, and to high dimensional data. In the thermodynamic limit, the important observables, like the volume of low energy configurations of the Gibbs distribution (53), or the distribution of the data along the optimal direction(s), become self-averaging; they do not depend on the detailed realization of the examples ξ^μ. Therefore, we can compute these observables by averaging log Z over these realizations. This can be done by first averaging the replicated partition function,

⟨⟨Z^n⟩⟩ = ⟨⟨ ∫ Π_{a=1}^n dw^a e^{−β Σ_{μ,a} V(λ_a^μ)} ⟩⟩,  where λ_a^μ = (1/√N) w^a · ξ^μ.

(For the case of perceptron learning, we can make the redefinition ξ^μ → σ^μ ξ^μ, since both have the same distribution; in essence we absorb the sign of the desired output into the input, yielding only positive examples.) Averaging over ξ^μ then reduces to averaging over the variables λ_a^μ. These variables are jointly Gaussian distributed, with zero mean and covariance

⟨⟨ λ_a^μ λ_b^ν ⟩⟩ = δ^{μν} Q_ab,  where Q_ab = (1/N) w^a · w^b   (58)

is the replica overlap matrix. Thus, after averaging, the integrand depends on the configuration of replicated weights only through their overlap. Therefore, it is useful to separate the integral over all configurations of the replicated weights into an integral over all possible overlaps Q_ab, and an integral over all configurations with the same overlap. Following the appendix, this yields

⟨⟨Z^n⟩⟩ = ∫ Π_{ab} dQ_ab e^{N[S(Q) − α E(Q)]},   (59)

where S(Q) = (1/2) Tr log Q is an entropic term, reflecting the volume of replicated weight configurations with overlap matrix Q, and

E(Q) = −ln ⟨⟨ Π_a e^{−βV(λ_a)} ⟩⟩_Q   (60)

is an energetic term; here ⟨⟨·⟩⟩_Q denotes an average over the zero mean, jointly Gaussian variables λ_a with covariance Q_ab.
At large N, the integral over Q can be performed via the saddle point method, and the competition between entropy and energy selects a saddle point overlap matrix. We make the ansatz that the saddle point has a replica symmetric form,

Q_ab = (1 − q) δ_ab + q.   (61)

Given the connection (explained in section A.2) between replica overlap matrix elements and the distribution of overlaps of pairs of random weights drawn from the Gibbs distribution (53), this choice suggests the existence of a single free energy valley. This will be reasonable to expect for the unsupervised learning applications we will be analyzing, since most of the energy functions are convex. Also, in the zero temperature limit, this ansatz suggests that the set of ground state energy configurations, if degenerate, should form a connected, convex set. This is indeed true for perceptron learning, since the space of ground states is the intersection of P half-spheres (see figure 4). Thus, unlike the SK model, we expect the replica symmetric assumption to be a good approximation.

Inserting (61) into (59) and taking the n → 0 limit yields a saddle point equation for q which, as explained in section A.3.2, can be derived by extremizing a free energy

F(q) = α ∫ Dz ln ζ(q, z) + (1/2)[ ln(1 − q) + q/(1 − q) ],   (62)

where

ζ(q, z) = ∫ dλ (2π(1 − q))^{−1/2} e^{−βV(λ)} e^{−(λ − √q z)²/(2(1−q))}   (63)

is the partition function of the distribution appearing inside the average in (A.37). Here, q is the typical overlap between two weight configurations drawn from the Gibbs distribution (53); in the zero temperature limit of perceptron learning, 1 − q reflects the typical volume of the solution space to (51) (see figure 4(D)). The two terms in (62) compete. The first term is an energetic term that promotes the correct alignment of the replicated weights on any given set of examples, reflecting a pressure for synaptic weights to agree on all examples (promoting larger q). The second term is an entropic term that is a decreasing function of q, reflecting the fact that replicated weight configurations with small overlaps have larger volumes in weight space. At large α, the energy becomes more important than the entropy, placing greater weight on the first term in (62), and the saddle point value of q increases, with q → 1 as α → α_c. As shown in [80], for perceptron learning with V(λ) = θ(−λ), the solution volume vanishes at α_c = 2. Thus a perceptron with N weights can store at most 2N random associations.
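The capacity α_c = 2 can be illustrated numerically without any replica machinery: for each α, test whether the inequalities (51) admit a solution, e.g. via linear programming. Below is a minimal sketch, with an illustrative small margin κ to exclude the trivial solution w = 0; N and the number of trials are also illustrative choices.

```python
import numpy as np
from scipy.optimize import linprog

# Numerical illustration of the perceptron capacity alpha_c = 2: for random
# associations, test feasibility of the inequalities (51) via linear programming.
rng = np.random.default_rng(4)
N, kappa, trials = 100, 1e-6, 20

def separable(alpha):
    P = int(alpha * N)
    xi = rng.normal(size=(P, N)) * rng.choice([-1, 1], size=(P, 1))  # sigma absorbed
    # Feasibility of xi @ w >= kappa with |w_i| <= 1: minimize 0 under constraints.
    res = linprog(c=np.zeros(N), A_ub=-xi, b_ub=-kappa * np.ones(P),
                  bounds=[(-1, 1)] * N, method='highs')
    return res.success

for alpha in [1.0, 1.5, 2.0, 2.5, 3.0]:
    frac = np.mean([separable(alpha) for _ in range(trials)])
    print(alpha, frac)   # fraction separable drops from ~1 to ~0 near alpha = 2
```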
3.4. Perceptrons and Purkinje cells in the cerebellum

Interestingly, in [81], the authors developed a replica based analysis of perceptron learning and applied it to make predictions about the distribution of synaptic weights of Purkinje cells in the cerebellum. Indeed, an analogy between the Purkinje cell and the perceptron was first posited over 40 years ago [82, 83]. The Purkinje cell has one of the largest and most intricate dendritic arbors of all neuronal cell types; this arbor is capable of receiving excitatory synaptic inputs from about 100 000 granule cells which, in areas of the cerebellum devoted to motor control, convey a sparse representation of ongoing internal motor states, sensory feedback and contextual states. The Purkinje cell output, in turn, can exert an influence on outgoing motor control signals. In addition to granule cell input, each Purkinje cell receives input from on average one climbing fiber, whose firing induces large complex spikes in the Purkinje cell as well as plasticity in the granule cell to Purkinje cell synapses. Since climbing fiber firing is often correlated with errors in motor tasks, climbing fiber input can be thought of as conveying an error signal that can guide plasticity. Thus, at a qualitative level, the Purkinje cell can be thought of as performing supervised learning, mapping ongoing task related inputs to desired motor outputs, where the desired mapping is learned over time using error corrective signals transmitted through the climbing fibers.

Now, the actual distribution of synaptic weights between granule cells and Purkinje cells has been measured [84], and a prominent feature of this distribution is that it has a delta function at 0, while the rest of the distribution is a truncated Gaussian. In particular, about 80% of the synaptic weights are exactly 0; thus a majority of the synapses are silent. In general, the learned distribution of synaptic weights should reflect the properties of the sensorimotor mapping the network implements, the learning rule, and the statistics of the inputs and outputs. One might then be able to quantitatively derive the distribution of weights by positing a particular learning rule and particular input–output statistics. However, the authors of [81] took an even more elegant approach that did not depend on positing any particular learning rule. They simply modeled the Purkinje cell architecture as a perceptron, assumed that it operated optimally at capacity, and derived the distribution of synaptic weights based on a replica analysis of perceptrons of the Gardner type. Remarkably, for a wide range of input–output statistics, whenever the perceptron implemented the maximal number of input–output associations at a given level of reliability (its capacity), its distribution of synaptic weights consisted of a delta function at 0 plus a truncated Gaussian. Indeed, like the data, a majority of the synapses were silent. This prediction relies only on operation at (or near) capacity, and does not depend on the learning rule; any learning rule that does achieve capacity would necessarily yield such a weight distribution.

The key intuition for why a majority of the synapses are silent comes from the constraint that the granule cell to Purkinje cell synapses are excitatory. Thus, the Purkinje cell perceptron faces a difficult computational task: it must find a nonnegative synaptic weight vector that linearly combines nonnegative granule cell activity patterns and fires for some fraction of granule cell patterns while not firing for the rest. It turns out that false positive errors dominate the weight structure of the optimal perceptron operating at or near capacity: there are many granule cell activation patterns for which the perceptron must remain below threshold, and the way to achieve this requirement with nonnegative weights is to set many synapses exactly to zero. Indeed, by quantitatively matching the parameters of the replica based perceptron learning theory to the physiological data, the capacity of the generic Purkinje cell was estimated to be about 40 000 input–output associations, corresponding to 5 kB of information stored in the weights of a single cell [81].

3.5. Illusions of structure in high dimensional noise

In contrast to perceptron learning, in the applications of the statistical mechanics formulation in (52) and (53) to the unsupervised learning scenarios discussed here, the energy function E(w) typically has a unique minimum, leading to a non-degenerate ground state. Thus, in the zero temperature β → ∞ limit, we expect thermal fluctuations in the synaptic weights, reflected by 1 − q, to vanish. Indeed, we can find self-consistent solutions to the extremization of F(q) in (62) by taking q → 1 as β → ∞, with ∆ ≡ (1 − q)β remaining O(1). In this limit, (62) and (63) reduce to

F(∆) = α ⟨⟨ min_λ [ (λ − z)²/(2∆) + V(λ) ] ⟩⟩_z − 1/(2∆),   (64)

where ⟨⟨·⟩⟩_z denotes an average over the zero mean, unit variance Gaussian variable z.

Furthermore, the interesting observable for unsupervised learning is the distribution of alignments across examples with the optimal weight vector,

P(λ) = ⟨⟨ (1/P) Σ_μ δ(λ − λ^μ) ⟩⟩,   (65)

where λ^μ = (1/√N) w · ξ^μ and w minimizes E(w) in (52). This distribution is derived at finite temperature via the replica method in section A.4, and is given by equation (A.37). Its zero temperature limit yields

P(λ) = ⟨⟨ δ(λ − λ*(z, ∆)) ⟩⟩_z,   (66)

where the optimal alignment λ*(z, ∆) arises through the minimization

λ*(z, ∆) = argmin_λ [ (λ − z)²/(2∆) + V(λ) ],   (67)

with ∆ determined by (64).

Equations (64), (66) and (67) have a simple interpretation within the zero temperature cavity method applied to unsupervised learning [85, 86]. Consider a cavity system in which one of the examples, say example 1, is removed from the energy (52), and let w^{\1} be the 'cavity' weight vector that optimizes E(w) in the presence of all other examples μ = 2, …, P (assuming E has a unique minimum leading to a non-degenerate ground state). Since w^{\1} does not know about the random example ξ^1, its alignment z = (1/√N) w^{\1} · ξ^1 is a zero mean, unit variance random Gaussian variable. Now, suppose example 1 is then included in the unsupervised learning problem. Then, upon re-minimization of the total energy, the weight vector will change to a new weight vector w, and consequently its alignment with ξ^1 will also change, from z to an optimal alignment λ*; this re-minimization is captured by (67). Extremization of (64) over ∆ determines the value of ∆ as a function of α.
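The extremization of (64) over ∆ is easy to carry out numerically. The sketch below does so for the Hebbian potential V(λ) = −λ, for which the inner minimization in (67) gives λ* = z + ∆; the quadrature order and search bounds are illustrative choices, and the stationary point should track ∆ = 1/√α.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Sketch: numerical extremization of the zero-temperature free energy (64) for
# Hebbian learning, V(lambda) = -lambda.  Gaussian average over z by quadrature.
z, wts = np.polynomial.hermite_e.hermegauss(81)   # ∫ Dz f(z) ≈ Σ wts*f(z)/sqrt(2π)
wts = wts / np.sqrt(2 * np.pi)

def F(delta, alpha):
    # inner minimization of (67): lambda* = z + delta, value = -z - delta/2
    inner = (delta ** 2) / (2 * delta) - (z + delta)
    return alpha * np.sum(wts * inner) - 1 / (2 * delta)

for alpha in [0.5, 1.0, 2.0, 4.0]:
    res = minimize_scalar(lambda d: -F(d, alpha), bounds=(1e-3, 10), method='bounded')
    print(alpha, res.x, 1 / np.sqrt(alpha))   # Delta(alpha) ~ 1/sqrt(alpha), decreasing
```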
The minimization in (67) reflects a competition between two effects: the second term favors optimizing the alignment of the new example with respect to the new weight vector, but the first term tries to prevent changes from the alignment z with respect to the original weight vector. This term arises because w^{\1} was already optimal with respect to all the other examples, so any changes in w incur an energy penalty with respect to the old examples. The parameter ∆ plays the role of an inverse stiffness constant that determines the scale of a possible realignment of a weight vector with respect to a new example. A self-consistency condition for ∆ can be derived within the cavity approximation and is identical to the extremization of (64). This extremization makes ∆ implicitly a function of α, and it is usually a decreasing function of α: the weight vector becomes stiffer, and responds less to the presentation of any new example, as the number of examples increases. Finally, example 1 is not special in any way. Thus, repeating this analysis for each example, and averaging over the Gaussian distribution of z, yields the distribution of alignments across examples after learning in equation (66).

We can now apply these results to an analysis of illusions of structure in high dimensional data. Consider an unstructured dataset, i.e. a random Gaussian point cloud, consisting of P random points in an N dimensional space, where each point ξ^μ is drawn i.i.d. from a zero mean multivariate Gaussian distribution whose covariance matrix is the identity matrix. Thus, if we project these data onto a random direction w, the distribution of this projection λ^μ = (1/√N) w · ξ^μ across examples will be a zero mean, unit variance Gaussian (see figure 5(A)).

However, suppose we performed Hebbian learning to find the center of mass of the data. This corresponds to the choice V(λ) = −λ, and leads to λ*(z, ∆) = z + ∆, with ∆ = 1/√α determined by (64). Hebbian learning thus yields an additive shift in the alignment to a new example, whose magnitude decreases with the number of examples. After learning, we find that the distribution of alignments in (66) is a unit variance Gaussian with a nonzero mean given by 1/√α (see figure 5(B)). Thus, a high dimensional random Gaussian point cloud typically has a nonzero center of mass when projected onto the optimal Hebbian weight vector.

Similarly, we could perform principal component analysis to find the direction of maximal variance in the data. This corresponds to the choice V(λ) = −λ². Along λ, this choice scales up the alignment of each example, and (66) and (67) lead to a Gaussian distribution of alignments along the principal component with zero mean, but a standard deviation equal to 1 + 1/√α (see figure 5(C)). This extra width is larger than any unity eigenvalue of the covariance matrix, and leads to an illusion that the high dimensional Gaussian point cloud has a large width along the principal component direction.

Finally, consider K-means clustering for K = 2, defined by the energy function in (56), which involves a projection of the data onto two dimensions, determined by the two cluster centroids. However, the form of this energy function in (57) reveals a lack of interaction between the projected coordinates λ_+ = (λ_1 + λ_2)/√2 and λ_− = (λ_1 − λ_2)/√2. Along λ_+, the algorithm behaves like Hebbian learning, so we should expect a Gaussian distribution of alignments with a nonzero mean whose magnitude decreases as O(1/√α). Along λ_−, the algorithm maximizes the absolute value of the projection, corresponding to the choice V(λ) = −|λ|, and (67) yields λ*(z, ∆) = z + sgn(z)∆, with ∆ = 1/√α. This implies that the distribution of alignments in (66) has a gap of zero density in the region −1/√α ≤ λ ≤ 1/√α, and outside this region the distribution is a split Gaussian. Thus, K-means clustering (with K = 2) of a random high dimensional Gaussian point cloud reveals the illusion that there are two well separated clusters in the cloud.
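These illusions are easy to reproduce numerically. The following sketch mirrors the setup of figure 5 (N = 1000, P = 2000, so α = 2), comparing projections of a structureless Gaussian cloud onto a random direction, the Hebbian direction, and the principal component.

```python
import numpy as np

# Numerical illustration of the illusions of structure (cf. figure 5).
rng = np.random.default_rng(6)
N, P = 1000, 2000
alpha = P / N
xi = rng.normal(size=(P, N))

def project(w):
    w = w * np.sqrt(N) / np.linalg.norm(w)
    return xi @ w / np.sqrt(N)

lam_rand = project(rng.normal(size=N))   # random direction: zero mean, unit variance
lam_hebb = project(xi.sum(axis=0))       # Hebbian direction: mean ~ 1/sqrt(alpha)
w_pca = np.linalg.eigh(xi.T @ xi / P)[1][:, -1]
lam_pca = project(w_pca)                 # principal component: std ~ 1 + 1/sqrt(alpha)

print(lam_rand.mean(), lam_rand.std())
print(lam_hebb.mean(), 1 / np.sqrt(alpha))
print(lam_pca.std(), 1 + 1 / np.sqrt(alpha))
```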
Figure 5. Illusions of structure. P = 2000 random points in an N = 1000 dimensional space (so α = P/N = 2) are drawn from a structureless zero mean, identity covariance Gaussian distribution. These points are projected onto different directions. (A) A histogram of the projection of these points onto a random direction; in the large N, P limit this histogram is Gaussian with 0 mean and unit variance. (B) A histogram of the projection of the same point cloud onto the Hebbian weight vector. (C) A projection onto the principal component vector. (D) The same point cloud projected onto the two cluster directions found by K-means clustering with K = 2.

Therefore, quite remarkably, the joint distribution of the high dimensional data in figure 5(D) factorizes along the λ_+ and λ_− directions, into a displaced Gaussian along λ_+ and a split Gaussian along λ_−; the projected point cloud does indeed have a gap of width 2/√α along the λ_1 − λ_2 direction. There is not a perfect match between the replica symmetric theory and numerical experiments for K-means clustering, because the discontinuity in the derivative of the energy in (57) actually leads to replica symmetry breaking [87]. However, in this case the corrections to the replica symmetric result are relatively small, and replica symmetry is a good approximation; in contrast, replica symmetry is exact for Hebbian learning and PCA (see e.g. figures 5(B) and (C)).

In summary, figure 5 reveals different types of illusions of structure in high dimensional data whose effects diminish rather slowly, as O(1/√α), as the amount of data α increases. Indeed, it should be noted that the ability of the perceptron to store random patterns also depends on a certain illusion of structure: P random points in an N dimensional space will typically lie on one side of some hyperplane as long as α < 2.
3.6. From message passing to synaptic learning

We have seen in section 3.1 that a perceptron with N synapses has the capacity to learn P = αN random associations as long as α < α_c = 2. However, what learning algorithm can allow a perceptron to learn these associations, up to this critical capacity? In the case of the analog valued synaptic weights we have been discussing, a simple algorithm, known as the perceptron learning algorithm [13, 14], can be proven to learn any set of associations for which a solution weight vector to (51) exists (i.e. those associations that are realizable). The perceptron learning algorithm iteratively updates a set of randomly initialized weights as follows (for simplicity, we assume, without loss of generality, σ^μ = +1 for all patterns):

• When presented with pattern μ, compute the current input u^μ = w · ξ^μ.
• Rule 1. If u^μ > 0, do nothing.
• Rule 2. If u^μ ≤ 0, update all weights: w_i → w_i + (1/√N) ξ_i^μ.
• Iterate to the next pattern, until all patterns are learned correctly.

Such an algorithm will find realizable solutions to (51) in finite time for analog synaptic weights. However, what if synaptic weights cannot take arbitrary analog values? Indeed, evidence suggests that biological synapses behave like noisy binary switches [88, 89], and thus can reliably code only two discrete levels of synaptic weight, rather than a continuum. The general problem of learning in networks with binary weights (or, more generally, weights restricted to a finite number of discrete values) is much more difficult than the analog case; it is in fact an NP-complete problem [15, 16]. Exact enumeration and theoretical studies have revealed that when weights are binary (say w_i = ±1), the perceptron capacity is reduced to α_c = 0.83 [90, 91], i.e. the space of binary weight vector solutions to (51) is nonempty only when α < 0.83. Of course, below capacity, one can always find a solution through a brute-force search, but such a search will require a time that is exponential in N. Moreover, it is unlikely that one can find a learning algorithm that provably finds solutions in a time that is polynomial in N, as this would imply P = NP. However, is it possible to find a learning algorithm that can typically (but not provably) find solutions in polynomial time at large α < 0.83, and, moreover, can this algorithm be biologically plausible?

The work of [17, 18] provided the first such algorithm. Their approach was to consider message passing on the joint probability distribution over all binary synaptic weight configurations consistent with the desired associations (again we assume σ^μ = +1),

P(w) = (1/Z) Π_{μ=1}^P θ( (1/√N) w · ξ^μ ),  with w ∈ {−1, +1}^N.

Here, the factors are indexed by examples and the variables are the N binary synapses. The messages from examples to synapses are all distributions on binary variables, and therefore can each be parameterized by a real number; the message passing equations (32) and (33) then yield a dynamical system on these numbers. This system drives the messages to approximate the marginal distribution of each synapse across all synaptic weight configurations that correctly learn all the associations. However, we seek a single synaptic weight configuration, not a distribution. To do this, in [18] the message passing equations are supplemented by a positive feedback term on the updates for the messages M_{μ→i}. This positive feedback amplifies the polarization of each synapse's marginal, so that the message passing dynamics condenses onto a single binary weight configuration that correctly learns all the associations.

Moreover, these message passing dynamics can be dramatically simplified while preserving performance [17, 18]. Thus, one obtains a simple learning rule in which each synapse i maintains a hidden analog state h_i, and the actual binary weight used by the perceptron is w_i = sgn(h_i). When presented with pattern μ, the perceptron computes its input u^μ = Σ_i w_i ξ_i^μ using its binary weights. If the pattern is misclassified, every hidden state is updated according to h_i → h_i + 2ξ_i^μ; a similar update is applied when the pattern is correctly classified but only barely so, but in that case only to the synapses whose weights already agree with their input (w_i ξ_i^μ > 0), reinforcing them without flipping any weight. This algorithm can learn random associations at a substantial fraction of the binary perceptron capacity, using O(√N) presentations per pattern, for networks with up to N = O(10^5) synapses [17, 18]. The resulting rule is also biologically plausible in flavor: each synapse need only maintain, in addition to its binary weight, a single hidden variable (which can itself be restricted to take only a finite number of discrete values) that is updated using locally available information.
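A minimal sketch in the spirit of this simplified rule is given below: binary weights w_i = sgn(h_i) driven by hidden states h_i, with an error-driven update and a secondary reinforcement rule. The secondary-rule threshold and all parameter values here are our own illustrative choices, not the exact prescription of [17, 18].

```python
import numpy as np

# Illustrative sketch of a hidden-state binary perceptron learning rule.
rng = np.random.default_rng(8)
N, alpha = 1001, 0.3                       # odd N avoids ties; alpha below 0.83
P = int(alpha * N)
xi = rng.choice([-1.0, 1.0], size=(P, N))  # binary inputs, sigma^mu = +1 absorbed
h = rng.choice([-1.0, 1.0], size=N)        # hidden states (odd integers)
theta_m = np.sqrt(N)                       # illustrative "barely correct" threshold

for sweep in range(500):
    errors = 0
    for mu in rng.permutation(P):
        u = np.sign(h) @ xi[mu]
        if u <= 0:                          # error: update all hidden states
            h += 2 * xi[mu]
            errors += 1
        elif u <= theta_m:                  # barely correct: reinforce only the
            agree = h * xi[mu] > 0          # synapses that voted correctly
            h[agree] += 2 * xi[mu][agree]
    if errors == 0:
        break

print(sweep, np.all((np.sign(h) * xi).sum(axis=1) > 0))  # all patterns learned?
```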
Even when the true covariance matrix is the identity, so that the elements of the sample covariance matrix W = (1/P) A^T A of the data (where each of the P rows A_μ of the P × N matrix A is one data point) have means given by the identity matrix, the fluctuations in its elements are strong enough that its eigenvalue spectrum, for typical realizations of the data, will not converge to that of the identity matrix. There will instead be some spread in the density of eigenvalues around 1, and this spread can be thought of as another illusion of structure in high dimensional data, which we can now compute via the replica method.

Inserting the Wishart form (77) of W into the replicated generating function (76), we obtain

⟨⟨Z^n(z)⟩⟩ = ∫ Π_a du^a ⟨⟨ e^{−(iz/2) Σ_a u^a · u^a + (i/2) Σ_{μ,a} (λ_a^μ)²} ⟩⟩,   (79)

where λ_a^μ = (1/√P) A_μ · u^a. Now, the integrand depends on the quenched disorder A only through the variables λ_a^μ, which are jointly Gaussian distributed with zero mean and covariance ⟨⟨λ_a^μ λ_b^ν⟩⟩ = δ^{μν} Q_ab, where Q_ab = (1/P) u^a · u^b. Thus, consistent with the general framework in section A.1, averaging over the disorder can be performed by a Gaussian integral over the variables λ_a^μ. In going from (79) to (80), we have exploited the fact that the variables λ_a^μ are uncorrelated for different μ, yielding a single average over n variables λ_a, raised to the power P; in going from (80) to (82), we performed this Gaussian integral. The result depends on the replicated configurations u^a only through their overlap matrix Q. Therefore, we can compute the remaining integral over the u^a in equation (78) by integrating over all configurations with a given overlap, and then integrating over all possible overlaps. This latter integral yields an entropic factor that depends on the overlap. In the end, (78) becomes

⟨⟨Z^n(z)⟩⟩ = ∫ Π_{ab} dQ_ab e^{−N[E(Q) − S(Q)]},   (83)

where

E(Q) = (iαz/2) Tr Q + (α/2) Tr log(I − iQ)   (84)

and S(Q) = (1/2) Tr log Q is the usual entropic factor. The first term in (84) comes from the part outside the average in (82), while the second term comes from the average over the λ_a, which introduces interactions between the replicated degrees of freedom u^a.

Now, the final integral over Q can be performed via the saddle point method: at large N, the integral can be approximated by the value of the integrand at the saddle point matrix Q* extremizing E(Q) − S(Q). We can make a decoupled replica symmetric ansatz for this saddle point,

Q_ab = iq δ_ab,   (86)

where q satisfies the saddle point equation obtained by extremizing

F(q) = αzq − α log(1 + q) + log q,   (87)

which reduces to a z dependent quadratic equation for q,

αz q² + (αz − α + 1) q + 1 = 0.   (88)

With this choice, (75) leads to the electrostatic potential Φ_W(z) and (72) leads to the electric field R_W(z), both expressed in terms of q. Due to the relation between the electric field and charge density in (71), we are interested in those real values of z for which the solution q(z) has a nonzero imaginary part; it is in these regions of z that the charges (eigenvalues) will accumulate, and their density will be proportional to this imaginary part. A little algebra shows that q(z) has a nonzero imaginary part only when z_− < z < z_+, where

z_± = (1 ± 1/√α)².   (89)

In the regime α > 1 (so we have more data points than dimensions), the charge density in this region is

ρ(λ) = (α/(2πλ)) √( (z_+ − λ)(λ − z_−) ),
which is the Marchenko–Pastur (MP) distribution (see figure 6(A) below). Thus, due to the high dimensionality of the data, the eigenvalues of the sample covariance matrix spread out around 1 over a range of O(1/√α). This illusory spread becomes smaller as we obtain more data (increased α).

4.3. Coulomb gas formalism

In the previous section we found the marginal density of eigenvalues for a Wishart random matrix, but what about the entire joint distribution of all N eigenvalues? Fortunately, each matrix A has a unique singular value decomposition (SVD),

A = U Σ V^T,

where U and V are unitary matrices and Σ is a P × N matrix whose only nonzero elements, Σ_ii = σ_i, are on the diagonal; the σ_i are the singular values of A. The eigenvalues λ_i of W = (1/P) A^T A are simply the squares of these singular values, λ_i = σ_i²/P. Now, for a matrix A with i.i.d. zero mean, unit variance Gaussian elements, the distribution of A is

P(A) ∝ e^{−(1/2) Tr A^T A},   (90)

which depends on A only through its singular values. Thus, to obtain the joint distribution of the eigenvalues λ_i, we first perform the change of variables A → (U, Σ, V), and therefore we must include the Jacobian of this transformation in the measure (90). This Jacobian contains a Vandermonde-like factor,

Π_{j<k} |σ_j² − σ_k²|,   (91)

which introduces interactions between the singular values. Integrating over U and V then yields a joint distribution over the eigenvalues of the form

P({λ_i}) ∝ Π_{j<k} |λ_j − λ_k| Π_i e^{−V(λ_i)} = e^{−E({λ_i})},   (92)

where

E({λ_i}) = −Σ_{j<k} log|λ_j − λ_k| + Σ_i V(λ_i),

and, for the Wishart ensemble, V(λ) = (P/2)λ − ((P − N − 1)/2) log λ. Thus, the eigenvalues behave like the positions of identical charged particles in a Coulomb gas on the real line: the Jacobian of the change of variables induces a logarithmic Coulomb repulsion between the charges, while the Gaussian distribution of the matrix elements confines each charge within the single particle potential V(λ).
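The MP density and its spectral edges are simple to verify numerically. The following sketch diagonalizes the sample covariance of structureless Gaussian data and compares the empirical spectrum with the ρ(λ) derived above; N and α are illustrative choices.

```python
import numpy as np

# Sketch: eigenvalues of W = A^T A / P for structureless Gaussian data versus
# the Marchenko-Pastur density with edges z_± = (1 ± 1/sqrt(alpha))^2.
rng = np.random.default_rng(7)
N, alpha = 1000, 2.0
P = int(alpha * N)
A = rng.normal(size=(P, N))
eigs = np.linalg.eigvalsh(A.T @ A / P)

z_minus = (1 - 1 / np.sqrt(alpha)) ** 2
z_plus = (1 + 1 / np.sqrt(alpha)) ** 2
print(eigs.min(), eigs.max(), (z_minus, z_plus))  # spectrum fills [z_-, z_+]

# Compare a histogram with rho(lam) = alpha*sqrt((z_+ - lam)(lam - z_-))/(2*pi*lam).
lam = np.linspace(z_minus + 1e-6, z_plus - 1e-6, 200)
rho = alpha * np.sqrt((z_plus - lam) * (lam - z_minus)) / (2 * np.pi * lam)
hist, bins = np.histogram(eigs, bins=50, density=True)
centers = (bins[:-1] + bins[1:]) / 2
print(np.abs(hist - np.interp(centers, lam, rho)).max())  # small away from edges
```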