Statistical Mechanics of Complex Neural Systems and High

Journal of Statistical Mechanics: Theory and Experiment

PAPER Related content

- Probabilistic reconstruction in compressed Statistical mechanics of complex neural systems sensing: algorithms, phase diagrams, and threshold achieving matrices Florent Krzakala, Marc Mézard, Francois and high dimensional data Sausset et al.

- Constrained low-rank matrix estimation: To cite this article: Madhu Advani et al J. Stat. Mech. (2013) P03014 phase transitions, approximate message passing and applications Thibault Lesieur, Florent Krzakala and Lenka Zdeborová

- Clusters of solutions and replica symmetry View the article online for updates and enhancements. breaking in random k-satisfiability Andrea Montanari, Federico Ricci- Tersenghi and Guilhem Semerjian

Recent citations

- Energy–entropy competition and the effectiveness of stochastic gradient descent in machine learning Yao Zhang et al

- Statistical physics of community ecology: a cavity solution to MacArthur’s consumer resource model Madhu Advani et al

- Minimum and Maximum Entropy Distributions for Binary Systems with Known Means and Pairwise Correlations Badr Albanna et al

This content was downloaded from IP address 171.67.216.21 on 28/09/2018 at 18:56 J. Stat. Mech. (2013) P03014 66$33.00 + Theory and Experiment Theory Recent experimental advances in neuroscience have opened new Department of Applied Physics, Stanford University, Stanford, CA, USA Madhu Advani, Subhaneil Lahiri and Surya Ganguli E-mail: , [email protected] [email protected] and [email protected] Received 9 October 2012 Accepted 14 January 2013 Published 12 March 2013 Online at stacks.iop.org/JSTAT/2013/P03014 doi:10.1088/1742-5468/2013/03/P03014 Abstract. vistas into theof immense data challenges complexitytheoretical us of on frameworks neuronal twocooperate for parallel networks. fronts. understanding across This First,computational proliferation how widely how problems? can dynamical disparate we Second,neuronal network spatiotemporal form systems how from processes adequate scales high cangive dimensional to a we datasets? pedagogical To extract solve aid reviewat of in meaningful important a these the collection challenges, models we of intersectionWe ideas of of and introduce theoretical statistical methods thein arising physics, interrelated statistical computer replica physicsheterogeneous science systems as and of and many powerful cavity interactingthe neurobiology. ways degrees closely methods, of related to notion freedom. which of We quantitatively message alsoin passing originated introduce computer analyze in graphical science models, large as whichand originated a highly optimization distributed problems algorithm involving capable manyboth coupled of the variables. solving statistical We large then physicsa inference show and wide how computer diversity science ofdata perspectives contexts can analysis. to be Along problems appliedof the arising in structure way in in we noise, theoreticalsensing, random discuss neuroscience matrices, all spin and dimensionality reduction glasses, withinreview and learning the compressed theory, recent uniﬁed illusions formalism conceptualmodels, and of connections neural the computationstatistical between and replica learning. physics method. message Overall, and these computer Moreover,can passing ideas science we uncover illustrate emergent might in how computational provide functionscomplexities graphical a buried of lens deep neuronal within through the networks. which dynamical we

ournal of Statistical Mechanics: of Statistical ournal

2013 IOP Publishing Ltd and SISSA Medialab srl 1742-5468/13/P03014

c J

systems and high dimensional data Statistical mechanics of complex neural J. Stat. Mech. (2013) P03014 Statistical mechanics of complex neural systems and high dimensional data cavity and replica method, spin glasses (theory), message-passing Keywords: algorithms, computational neuroscience minimization...... 44 1 L Acknowledgments Appendix. Replica theory 55 55 7.1. Network dynamics...... 7.2. 51 Learning and7.3. generalization...... Machine52 learning and data analysis...... 53 6.2. Replica analysis...... 6.3. 45 From message passing to network dynamics...... 49 6.1. 5.1. Point clouds...... 5.2. 39 Manifold reduction...... 5.3. 40 Correlated extreme value theory and dimensionality reduction...... 42 4.1. Replica formalism4.2. for randomThe matrices...... Wishart4.3. ensemble33 and theCoulomb gas Marchenko–Pastur4.4. distribution...... formalism...... 34 Tracy–Widom36 fluctuations...... 38 3.1. Perceptron learning...... 3.2. 22 Unsupervised learning...... 3.3. 23 Replica analysis3.4. of learning...... Perceptrons and25 3.5. Purkinje cells inIllusions the of3.6. cerebellum structure...... inFrom high27 message dimensional passing noise...... to synaptic28 learning...... 31 2.1. Replica solution...... 8 2.2. Chaos in2.3. the SKCavity model method...... and2.4. the13 HopfieldMessage solution...... passing...... 11 16 doi:10.1088/1742-5468/2013/03/P030142 Contents 7. Discussion 50 6. Compressed sensing 44 5. Random dimensionality reduction 39 4. Random matrix theory 32 3. Statistical mechanics of learning 22 2. Spin glass models of neural7 networks 1. 3 Introduction J. Stat. Mech. (2013) P03014 obvious) can provide a priori that is useful for an organism in terms of function Statistical mechanics of complex neural systems and high dimensional data A.3.1. SK model...... A.3.2. 57 Perceptron and unsupervised...... learning. 58 However, such networks of neurons and synapses, as well as the dynamical processes References 61 A.1. Overall...... framework A.2. 55 Physical meaningA.3. of...... overlaps Replica57 symmetric...... equations 57 A.4. Distribution ofA.5...... alignments Inverting59 the Stieltjies transform...... 60 doi:10.1088/1742-5468/2013/03/P030143 a powerful way to[4]. understand As both the its functions performed structurecan by and be neuronal the useful networks are details to oftenfor turn of computational sources to its in of ideas nature, complex insight from it into dynamics manner. how distributed networks In computing of this algorithms neurons paper may in we learnis computer also and to science compute focus compute in on the a distributed marginalinteracting distributed message probability system. distribution passing Many of algorithms problems a whoseand single in goal constraint degree computer satisfaction, of can science, freedom be including inreview formulated as a error below, message large correcting message passing codes problems passing [5]. is As intimately we shall related to the replica and cavity methods of Neuronal networks areof highly neurons complex interacting dynamicalmultiple through timescales. systems synapses For example, consisting [1]–[3]. onconnectivity of fast Such is timescales, large networks approximately of constant, subserve numbers the andactivity order dynamics this through of connectivity over milliseconds, neurons. directs synaptic the Onbeyond, flow slower the of timescales, synaptic electrical ofby connectivity the the itself statistical order can structure of changetimescales. of seconds through These experience, to synaptic which synaptic minutes plasticityexperience. itself To and changes induced can the stay extent are that constanttools such thought over separations even from of to longer timescale the hold, underlieunderstanding one statistical of can our physics exploit neuronal powerful ability dynamics ofthe and to disordered replica synaptic method systems learning learn and in torelevant the from basic because cavity obtain they models. method, allow a For which us example, wedegrees remarkably to introduce understand precise and of the review statistical freedom below, propertiesinteractions become of that that many may interacting are be highly coupled heterogeneous, to or disordered. eachthat occur other on them, are through notInstead, simply some they tangled webs have fixed, of complexity beenand that or sculpted exist adaptation, for quenched, over to their time, own solve sake. biological important through computational neuronal the problems networks processes necessary serve of for a evolution, survival. Thus, learning 1. Introduction improving its evolutionary fitness. Thestatistical very physics, concept as of function largebiological does disordered polymers, not statistical of do mechanical course not systems, arisethat arise a in like through biological glasses network evolutionary or performs processes. (which non- may In not general, always the be function J. Stat. Mech. (2013) P03014 P points in P is small. This is the N is large and P Statistical mechanics of complex neural systems and high dimensional data are large, but their ratio is O(1). For example, we can simultaneously measure dimensional feature space. Much of the edifice of classical statistics and machine N This combination of ideas from statistical physics and computer science is not only We give an outline and summary of this paper as follows. In2, section we introduce the We end section2 by introducing message passing, which provides an algorithmic N useful in thinking aboutbut how network dynamics also and plasticity forthroughput may experiments mediate thinking computation, in about neuroscience. ways Consider a to data analyze set large consisting of scale datasets arising from high statistical physics, and can serve asprocesses a of framework neuronal for plasticity thinking and aboutlike how network learning specific dynamics and dynamical may inference. solve computational problems an learning has been tailored to the situation in which doi:10.1088/1742-5468/2013/03/P030144 fundamental techniques of theparadigmatic replica example, method the and Sherrington–Kirkpatrick cavity (SK)In method model a within [6] neuronal the of network contextin a interpretation, of such spin which a a glass the [7]–[9]. system heterogeneousdisorder. qualitatively synaptic models On connectivity a is the large fixedunderstanding network other the and statistical hand, plays properties the neuronalstatistical of role the activity properties, of neuronal can quenched activity. termed Werealization fluctuate will self-averaging find and of properties, that we certain the dopaper; are in not disordered interested large depend in connectivity randomdeterministic on macroscopic systems matrix. order with the can This microscopic ariseheterogeneity. detailed heterogeneity, in Such is striking ways order that levels can a do of governwell not the as almost recurring depend dynamics the on and theme performance the learning details ofthis in in of machine order the learning neuronal can this algorithms networks, be as in understood analyzing theoretically data, and, through moreover, perspective the replica on and the cavityphysics methods. replica are and essentially cavity equivalentwhich methods. to are equivalently Many joint known and probability modelsMoreover, described distributions in as over graphical many equilibrium many modelsprobabilities statistical in variables, of computations computer a science single [10]. in variablespecial in cases such as graphical statistical belief models. propagation Message [11], physics passing, involves also a known class involve in of algorithms that computing yield dynamical marginal low dimensional data scenario inmany which classical we have unsupervised largepatterns amounts machine of in data. learning data, In algorithms suchin when situations, can neuroscience they easily has exist.and find pushed However, the us structures advent into or the of a activity high high of throughput O(100) dimensionalO(100)) techniques neurons data for but any scenario often givenexpression in only experimental levels which under condition. of both a Also, O(100)dimensional limited we scenario, genes can number it can of measure but be the trialsas only difficult single (i.e. to often in find classical cell also statistically a unsupervised gene significantstatistical limited patterns machine physics in learning number of the algorithms disordered data, of yield systemsdimensional cells. again illusory provides data, In structures. a The because such powerful toolminimization many a to of machine high a understand data-dependent high learning energy algorithmshow function on statistical can a physics set be plays of formulatedin parameters. a We high as useful review below dimensional role the data, insensing, as understanding which well possible are as illusions tailored approaches of to like structure the random high projections dimensional and data compressed limit. J. Stat. Mech. (2013) P03014 Statistical mechanics of complex neural systems and high dimensional data In section3, we apply the ideas of replicas, cavities and messages introduced in In section4, we discuss the eigenvalue spectrum of random matrices. Matrices from -means clustering, revealing that all of these algorithms are prone to discovering illusory doi:10.1088/1742-5468/2013/03/P030145 section2 to the(see problem [12] for of a beautiful learning bookor in length data, review play neuronal of this the networks topic).learning role In parameters as of this of context, quenched well a training disorder, examples, machine asmechanical learning and algorithm, machine the degrees play synaptic the learning of role weights freedom.are of of fluctuating a optimized, In statistical network, or the or learned, zeroas the by temperature aspects minimizing limit, of an thesethat the energy degrees function. do learned of The not structure, learning freedom show can depend how error, to be on as compute described well these thecomputing order by detailed its parameters macroscopic for realization storage the orderlearning of classical capacity. algorithms, parameters perceptron including Also, the [13, Hebbian we training14], learning,K thereby principal compute examples, component analysis these or (PCA) orderstructures and data. that parameters We reliably for arise inend classical random section3 realizations ofby highvalued discussing dimensional synapses, noise. an known Finally, application to we derived be a of biologically an plausible message learning NP-complete algorithm passingof problem capable this of to [15, solving problem random learning16]. by instantiations synaptic approximating The with message weights authors binary passing determined of in by [ 17, a the18] joint training probability examples. distribution over many random matrix ensemblesdisplay have fascinating eigenvalue macroscopic spectra structuresof that whose the do probability not matrix distributions depend elements.of on These fields the spectral detailed [19, distributions realization in play20]; a within understanding central the role thenonlinear in context networks stability a of [21] wide of neuralby and variety showing networks, the linear how for analysis neural replica example,typical of theory eigenvalue networks, they high distribution can play the dimensional of alsoon a a data. provide transition understanding variety role a We of to begin general an randomensemble framework section4 [22]) ensemble chaos matrix whose for eigenvalue ensembles. of distribution, in computing known Then,[23], random as the we the provides empirical Marchenko–Pastur focus distribution a covarianceMoreover, null matrices model we (the for review Wishart thought the how of outcome as the Coulomb ofeigenvalues charges eigenvalues can PCA living be of applied in thought of the many to asgas, complex the which high random plane, thermally is equilibrated dimensional and matrix stabilized charge the data. interaction density via ensembles of distribution the and this of competing can an Coulomb these effects attractive bestatistics of confining of a external repulsive the potential. two largestcan dimensional Moreover, be eigenvalue, Coulomb we which understood review obeys simply how the in the Tracy–Widom terms distribution of [24, thermal25 ], fluctuations of this Coulomb gas [26, 27]. systems whose fixed points are designedmodels. to Another approximate recurring marginal theme probabilities in in thismay graphical paper profitably is be that certain viewed aspects(and through of also the neuronal synaptic) lens dynamics of dynamicsin message can a passing; be suitably in viewed essence, definedand as these message graphical approximate neuronal versions passing model. of allowssignificance This message for correspondence of passing the between possibility existing neuronalneuronal of dynamics neuronal dynamics both from dynamics understanding a computational the and perspective. computational deriving hypotheses for new forms of J. Stat. Mech. (2013) P03014 minimization based 1 L minimization to neuronal 1 L Statistical mechanics of complex neural systems and high dimensional data minimization [47], learning sparse representations [48, 49], 1 L In section5, we discuss the notion of random dimensionality reduction. High The manifold of sparse signals forms a ubiquitous and interesting low dimensional After introducing CS in section 6.1, we show how replica theory can be used to analyze minimization. In section6, we focus mainly on the analysis of 1 dimensional data can be difficultsuch difficulties to is both to model reducealgorithms and the dimensionality process. search of One the for approach data; tosection optimal indeed,3.5, circumvent many directions such machine algorithms learning on yieldillusory projected which structures data that to distributions do that project not revealthe exist low the data in dimensional, onto the data. a data. An random Asambient alternate subspace. dimensionality discussed approach As of is the in to the dimensionality simply featurewill of project space this necessarily subspace in be is which lower lost. thelow than dimensional data However, the submanifolds reside, it in features is theirprojection of ambient often the feature above the space. data a In case critical suchof dimension, that situations, the which a submanifold interesting random is than data to morea the sets closely dimensionality surprising lie of related amount the to along of ambient the feature structurerandom space, dimensionality of projections often the preserves and submanifold. their Inend ability section5,5 section we toby review preserve introducing thereduction the a theory of geometry simple of statistical of random mechanics submanifolds, dataconnects like approach submanifolds. point random to clouds We dimensionality random and reduction hyperplanes. dimensionality discussed to This in analysis extremal sections fluctuations4.3 ofand 2D4.4. Coulomb gases structure that accurately captures(CS) [28, many29 ], types discussed ofdimensional in signal data. section6, can rests The be upon fielddimension recovered the of by from central solving a observation compressed a that random sensing computationally aL projection tractable sparse down convex high to optimization a problem, surprisingly known as low The statistics of thiswe largest discuss eigenvalue how will random makeillustrates projections an the power appearance distort of the later thephysics geometry in replica of section5, formalism, of two and dimensional when manifolds. plays Coulomb Overall, a gasesinduced section4 to role by PCA in dimensionality in connecting reduction section the in3.5 statistical and section geometric. 5.3 distortions doi:10.1088/1742-5468/2013/03/P030146 its performance in section 6.2.discussed Remarkably, the in performance section of CS,3.5, unlikethere displays other is a algorithms a phase critical lower transition. bound For on any the given dimensionality of level a of random signal projection sparsity, that is required regularized learning of highaxonally dimensional efficient synaptic long weights range from brain limited communication examples through [50] random and projections [51]–[54]. on statistical mechanicsapplications and of message random projections, passing.information compressed processing For sensing and readers data and analysis, whoof we refer are them how to more [30]. the There,dimensional interested diverse techniques neuronal applications in data in arecompressed discussed, sections5 gene includingand6 magnetic expression resonance arrayscanmeasurements imaging [34], and [31]–[33], be compressed fluorescence connectomics used microscopy [35, spatiotemporal [37, 36], to resolution38] receptive [39] acquire of field using multipleapplications and single molecular of pixel analyze species camera these at high [40, in same high 41] techniques technology. [30], Also, to diverse including46], neuronal neural semantic information processing circuits information are for processing discussed 44], [42]–[ short-term memory [45, J. Stat. Mech. (2013) P03014 ) β (3) (2) (1) is the synaptic J 1. In a neural network ± is an independent, identically . ij J /N minimization, demonstrated in [55], 1 L minimization problem can be formulated as a 1 L , j s , i ) s J , ij s ) Statistical mechanics of complex neural systems and high dimensional data ( J is an inverse temperature reflecting sources of noise. The J , s ( βH β ij − X βH e ] 1 2 − e J [ − 1 Z s X spin degrees of freedom taking the values ) = represents the activity state of a neuron and J ) = , ] = N s s i ( J ( s [ J Z P H are i s The main property of interest is the statistical structure of high probability (low Finally, the appendix provides an overview of the replica method, in a general form minimization via gradient descent has been proposed as a framework for neuronal 1 energy) activity patterns. Muchpicture progress in in which spin the glass Gibbs theory distribution [7] in has (2) revealed decomposes a at physical low temperature (large doi:10.1088/1742-5468/2013/03/P030147 distributed (i.i.d.) zero mean Gaussian with variance 1 into many ‘lumps’ ofsubsets probability mass of (more activity rigorously, patterns. pure Equivalently, these states [61]) lumps concentrated can on be thought of as concentrated is the partition functionconnectivity and matrix is chosen to be random, where each where connectivity matrixdistribution of of the neural activity network. given by This Hamiltonian yields an equilibrium Gibbs The SK model [6]It is has a been prototypical employed as example aand of simple has a model made disordered of a spin statistical recent glassesmodeling mechanical resurgence [7,8], system. of as in well spike neuroscience as within trains neural the networks [59, [58], context].60 of It maximum entropy is defined by the energy function 2. Spin glass models of neural networks that is immediately applicable torandom spin matrices glasses, perceptron and learning,non-rigorous, compressed unsupervised method learning, sensing. for Overall,disorder. analyzing the We the replica hope statistical method thatand mechanics is this message of a exposition passing systems of powerful, methodscontexts, with the will if discussed help replica quenched in to method, this enablephysics students combined paper to and with learn within researchers about the in athe both exciting cavity wide theoretical interdisciplinary intersection variety neuroscience advances of of made and statistical in disparate physics, the computer last science few and decades neurobiology. at may motivate revisiting the issuedependence of in sparse sparse coding coding in network neuroscience, dynamics. and the role of history where the interpretation, dynamics underlying sparsehand, coding the in efficiency both of message vision passing [56] in and solving olfaction [57]. On the other message passing problem [55]. This formulationthat yields qualitatively mimics a neural message network passing dynamics dynamicalL with system a crucial history dependence term. to accurately recover the signal;Also, this in critical section dimension6.3, decreases we with review increasing how sparsity. the J. Stat. Mech. (2013) P03014 , ), a i (4) (5) , it s (the J limit, a P → − N does not i s q , which is the q is an average over a (i.e. the mean pattern i · a h ). In the large a , and is hard to compute. ) may not be self-averaging, J q ( J P , where a i i , and a probability mass s a h = a i m ]. Correlations between neurons can then be and provides a measure of the variability of . If there is indeed one state, then J . 2 i [ ) J m Z ab i q P − ) q ( ] = ln Statistical mechanics of complex neural systems and high dimensional data . /N δ b i J [ b , then the overlap is m P b a i limit. As we see below, typical values of such quantities, for a βF = (1 , can be computed theoretically by computing their average m P − q J N and i ab X X a 1 N ) = can still yield a wealth of information about the geometric organization q = ( J J ab is the probability that a randomly chosen activity pattern belongs to valley P q a P ; despite the fact that the overlap distribution J . One interesting quantity that probes the geometry of free energy minima is J ii ) q vanish in the large ( Now, the detailed activity pattern in any free energy minimum J ) depends on the detailed realization of the connectivity J P a i ), unless there is only one valley, or state (modulo the reflection symmetry , the distribution of overlaps between any two pairs of activity patterns independently a chosen from equation (2) is given by Now, since This distribution turnsJ out not to be self-averaging (it fluctuates across realizations of self-overlap of the state, depend on the detailed realization of on the minima of a free energy landscape with many valleys. Each lump, indexed by in which case the distribution becomes concentrated at a singlemean number activity acrosscase neurons of due multiple to valleys,hh one the can quenched also disorder compute the in disorder the averaged connectivity. overlap distribution In the doi:10.1088/1742-5468/2013/03/P030148 is characterized by a mean activityprobability pattern that a random activity pattern belongs to valley its average over To understand the statisticalcompute properties its of freecomputed the energy via Gibbs suitable distribution derivativesaveraging, in of which (2), the means it free that is energy. Fortunately, to useful the understand to free the energy is free self- energy for any realization of 2.1. Replica solution of free energy minimamethod, in which neural we activity now space. introduce. This can be carried out using the replica any given realizationover of all the distribution of overlapsbelong between to all pairs two valleys, of activity patterns. If the activity patterns configurations belonging to the freefree energy energy barriers valley betweenif valleys an diverge, activity so pattern that startsergodicity in in is dynamical one valley, versions broken, it ofaverage will as this activity stay time model, pattern. in that average Theare valley network activity for interested can infinite patterns in thus time. understanding are Thus, maintain the not multiple structure steady equal of states, these tom and steady the states. we fullHowever, Gibbs many interesting quantities,averaging, which which involve by averages definitionof over means all that neurons, their are fluctuations self- across different realizations J. Stat. Mech. (2013) P03014 ], J (9) (8) (6) (7) [ (10) (11) Z . Thus, even J , the preferred , the replicated J J , 2 ab . Thus, minimization . Applying this to (8) Q 2 ab 2 ab σ Q P ab 4) / replicated neuronal activity 2 P β ( 0 limit. The appendix provides n 4) N / e 2 → } β a ( . n s X J { . This average is difficult to perform − J = ++ 2 ) = ) alone does not determine the overlap ) a j a j s Q s Q a i . ( a i ( s n s ij E E , J Z yields =1 J n a ij P , a j ∂ ∂n ii P s J ( ] =1 a i 0 in (10), which yields an entropic term corresponding n a ij s J [ → a a P P n ) s Z β P N e it is useful to introduce = lim 4 . However, for any fixed realization of ln β / Statistical mechanics of complex neural systems and high dimensional data } J n a 1 , (1 hh s = b i 2 Z e X { x s − 2 x n } a i = , yielding σ a n s s J 2) X can be performed because it is reduced to a set of Gaussian { ** Z / ii N =1 0 and i (1 ] X J = = → J , we expect this similarity to survive, and hence we expect average [ J J n , . . . , n 1 , and we have in (A.2) N J = e /N , the replicas will prefer certain patterns. Which patterns are preferred were independent, marginalizing over, or integrating out, the disorder ii ii Q = lim z J βF n n a = i = 1 s Z Z Z = 1 zx a ab e 2 hh ln hh hh − Q h denotes an average over the disorder σ , J , for ij . One must still sum over a ii J s ab · is a zero mean Gaussian random variable with variance Q = hh z z However, minimization of the energy Thus, although for any fixed realization of the quenched disorder where doi:10.1088/1742-5468/2013/03/P030149 matrix of this energy functionfixed realization promotes of alignment of the replicas. The intuition is that for any after averaging over overlaps between replicas to be nonzero. will vary across realizations of where with activity patterns set of patternsneuronal will activity pattern be are controlled similar by across the same replicas quenched connectivity since the fluctuations of each replicated introduces attractive interactionsframework, presented between in the thethe appendix, overlap matrix replicas. the interaction Consistent between replicas with depends only the on general integrals. To do so, we use the fundamental identity because the logarithmdifficulty by appears exploiting inside the identity the average. The replica trick circumvents this patterns Now the average over is the overlap matrix between replicated activity patterns. where a general outline ofto the compute replica the approach average that over can be used for many problems. Basically, suffices to compute its average over all which can be performed more easily, and then take the This identity is useful because it allows us to first average over an integer power of J. Stat. Mech. (2013) P03014 ). ab 1), for ab ) = Q (13) (14) (12) q n Q = β > ab , . . . , s Q 1 s ( P . , J ) ab Q − q . Unfortunately, we will not explore this 0 limit with this replica symmetric ansatz ( ij δ J (figure1(A)). At lower temperature ( b → 6= i a X (i.e. permuting the rows and columns of n . Now, the physical meaning of the saddle point is implicitly an ansatz about the geometry and b a 1) s ab . ab − 1 z (see equation (A.24) for the derivation), Q Q n q a

( s = 0 is the only solution, representing a ‘paramagnetic’ ) n Statistical mechanics of complex neural systems and high dimensional data is 0 for all ab 0 q qz i → P √ n m β 1), , β n ( = lim − i 2 b J s = β < a ii s ) is unstable [63], and so one must search for solutions in which eﬀ tanh h q ( H (weighted by their probability) is simply the distribution of oﬀ-diagonal ab

= J a i . Q P i (a special case of (A.8) and (A.9)), ab = m m hh q Q Q , with denotes an average with respect to the Gibbs distribution ) is given by (5). Therefore, the distribution of overlaps between pairs of free eff q n ( i . This is equivalent to an assumption that there is only one free energy valley, and J βH · b h − P 6= )e Now, the effective Hamiltonian yielding the average in (12) is symmetric with respect While this scenario seems plausible, a further analysis of this solution [6, 62] yields a /Z measures its heterogeneity. Taking the yields a saddle point equation for all q to permutations of the replica indices multiplicity of free energy valleys in (2), averaged over Therefore, it is natural to search for a replica symmetric saddle point in which matrix elements ofany the ansatz replica about overlap matrix. the Thus, structure in of searching for solutions to (12), (1 replica overlap matrixaveraged is overlap distribution, explained in section A.2; it is simply related to thewhere disorder energy minima At high temperaturestate ( (figure1(A)), in whichand average activity neural patterns activity fluctuatea over nonzero all solution possible risesstate configurations, continuously corresponding from to 0, amean suggesting single activity a valley (figure1(B)) phase in transition which to each ainconsistent neuron ‘frozen’ physical has predictions (like a negative different framework, entropy for this the system). inconsistency Withinsaddle the can point replica be for detected by showing that the replica symmetric doi:10.1088/1742-5468/2013/03/P03014 10 where breaks the replicamany symmetry. free This energy corresponds minima. to Apredicts great a a deal nested physical of hierarchical, picture work(see has tree figure1(C) in led like and which to organization (D)), there ahighly on known remarkably symmetric are rich as the and ansatz an ordered space which from ultrametric low of temperature structure purely free hierarchical [64]. energy random, structure Itphenomenon emerges minima disordered is further generically striking here, couplings that since for this most of the applications of replica theory to neuronal to the number ofminimization replicated drives activity overlaps patternssmall, with to since a there be given areoverlaps. large, set many This of more entropy competition overlaps. replicated maximization between While configurationsoverlap energy drives with energy matrix. and small, After overlaps entropy rather computing tomatrix leads than this can large, to be entropic a be term,equations potentially the computed for nontrivial most via likely the value of saddle the point overlap method, yielding a set of self-consistent J. Stat. Mech. (2013) P03014 . 2 J q . This k [65]–[67]. J RSB ansatz [7]. ∞ = describing the typical k 1 q that does not depend on the q = 2. The true low temperature phase of k is characterized by two order parameters, or to the connectivity matrix , and found that such a network exhibits Q β J is 0. (B) The replica symmetric ansatz for a low q -step RSB schemes describing scenarios in which the k induce macroscopic changes in the location of energy was called temperature or disorder chaos respectively J J Statistical mechanics of complex neural systems and high dimensional data or or β . (C) One possible ansatz for replica symmetry breaking (RSB) β J Probability lumps in free energy valleys. Schematic figures of the space . This ansatz, known as one-step RSB, corresponds to a scenario in which 2 > q 1 Figure 1. of all possible neuronalconfigurations or spin with configurations non-negligible (large(2) circle) probability (shaded and areas). under the (A) space theby At of the Gibbs high spin temperature Gibbs distribution alldrawn distribution. in spin from Thus, configurations the the arethe Gibbs explored inner replica product distribution order will between parameter two typically random have spins 0 inner product, and so Gibbs distribution decomposes into a nested hierarchy ofthe lumps SK of model depth is thought to be described by a particular temperature phase: the spins freezevalley), into which a small can set differ of configurations from (free energy realization to realization of the connectivity the Gibbs distribution breaksinner into multiple product lumps, between with describing two the configurations typical inner chosen product(D) from between There configurations exists the from a same different series lumps. of lump, and figure describes a possible scenario for q However, the inner productreplica order between two parameter,realization random takes of spins, a and nonzeroin therefore value which also the the replica overlap matrix broken replica symmetrystable with corresponding respect to topossibility thermal a that or this noise hierarchy multiplicity induced ofprocessing of fluctuations. states It may tasks. low be is However, useful tempting energywith several for to performing states explore works respect neural the have that to information perturbations noted are either thermal that to fluctuations, the while inverse temperature they these are states not are stable structurally stable with respect to So far, in order towith introduce a the random replica symmetric method, connectivity we matrix have analyzed a toy neuronal network 2.2. Chaos in the SK model and the Hopfield solution doi:10.1088/1742-5468/2013/03/P03014 11 processing and data analysiscorrect. discussed below a replica symmetric analysis turns out to be Indeed very small changes to minima in the space ofactivity neuronal patterns activity patterns. to This sensitive either in dependence [66]. of low For energy connectivities neural information whose processing, noisy it dynamics would not be only useful thermally to instead stabilize have a network prescribed set of J. Stat. Mech. (2013) P03014 P (15) is O(1) neurons is large, patterns µ µ N P m m play the role µ , and when all . This problem µ ξ j ξ µ i ξ , the system is in a P/N c -dimensional patterns is initialized to either = s , the network will relax α µ , denoting the overlap of PN α > α ξ s ) are learned, or stored, in is chosen independently to can a network of µ · µ i ξ µ P ξ ξ ) /N 138. For . in each valley is large for one pattern µ = (1 = 0 µ c m m α < α 1. Hopfield’s proposal was to choose induces an equilibrium probability distribution over . Successful pattern completion is possible if there are ± J Statistical mechanics of complex neural systems and high dimensional data µ . = µ j that stabilizes a prescribed set of and the level of storage saturation µ i ξ . If so, then when network activity ξ through equation (2). Ideally, this distribution should have 2 J µ µ i β changes its synaptic weight by an amount proportional to the ξ ξ s i − =1 P are random and uncorrelated (each µ X µ 1 . This relaxation process is often called pattern completion. Thus, N is imposed upon the network, this correlation is ξ , where µ ξ µ = ξ ij to neuron J j ,...,P 1 with equal probability). These works extensively analyzed the properties − = 1 µ This synaptic connectivity A key issue then is storage capacity: how many patterns An early proposal to do just this was the Hopfield model [68]. Suppose one wishes to free energy valleys such that the average of and their reflections patterns are imposed upon the network in succession, the learned synaptic weights are , for and small for all the rest. These free energy valleys can be thought of as recall states. µ µ P fits the classic moldof of quenched disordered statistical disorder,of physics, and where freedom. neuronal the In patterns activitycollection particular, patterns of the self-averaging play order structure the parameters of role free of energy thermal minima degrees can be described by a free energy valleys, corresponding to lumpsξ of probability mass located near the store? This issue wasthe addressed stored in patterns [70, be71] via +1 the or replica method in the situation where neuronal activity with2 pattern of free energy valleysof in the the inverse temperature Gibbs distribution (2) with connectivity (15), as a function neuronal activity patterns P given by (15). a corrupted or(under partial a version dynamics of whosecorresponding stationary one distribution of to is the given by learned (2)) patterns to the free energy valley the network’s synaptic weights (i.e.be through viewed as (15)), motion and down subsequentsuccessful, a network free the dynamics energy minima landscape can determined ofthe by this the process weights. free If of energy learninginitial is landscape recalling network correspond past activity to patterns experience past induced corresponds experiences, by to current and stimuli. completing partial or corrupted µ The replica method inSolutions [70, to]71 yields the aare replica set found equations, of at in self-consistent low equations which temperature for precisely only these one when averages. order parameter find a network connectivity Hopfield’s prescription providesmemory: a the unifying structure of framework past for experience thinking (i.e. about the patterns learning and neuronal activity patterns, butto do changes so in in either a the manner connectivity that or is level structurally of stable noise. with respect This choice reflectsfrom the neuron outcome ofcorrelation a between Hebbian the learning activityactivity rule on pattern [69] its presynaptic in and which postsynaptic each neurons. synapse When the spin glass state withwith many free any energy of minima, the none patterns of (in which thedoi:10.1088/1742-5468/2013/03/P03014 have solutions a to macroscopic the overlap replica equations, no average 12 ξ J. Stat. Mech. (2013) P03014 (16) (17) (18) . Indeed, = +1 (or µ . Thus, all is O(1) for 1 ξ i s s µ i 1 m neurons in (16) J N . For example, because 1 , whenever s 1 is a sum of many terms, it i 1 J to h = N i 1 J becomes large behaves like the low . Since corresponding to low temperatures, , . . . , s N α 2 β s , . . . , s 2 1) and s . (0 O , = 1 \ α H j s + Statistical mechanics of complex neural systems and high dimensional data i 1 s h ij 1 J s ). At such high levels of storage, so many patterns ‘confuse’ i c s − =2 N i X 1 ij J 1 2 ) = , at low enough temperatures, spurious, metastable free energy through the symmetric coupling − c J and be found in [70, 71]. N =2 α > α i i , X s s = β phase plane with neurons can be written as ( = 1 . However, as the temperature is increased, such mixture states melt away. \ N β 1 N – with µ α < α h H H and α in (14), which may seem a bit obscure. In particular, we give an alternate α q → ∞ 1) this exerts a positive (or negative) effect on the combination decreases with increasing temperature. Nevertheless, in summary, there is a robust − c The starting point involves noting that the SK Hamiltonian (1) governing the Even for interacts with P,N α = 1 1 s doi:10.1088/1742-5468/2013/03/P03014 13 s by a Gaussian distribution.because However, such the a individual Gaussianarises terms approximation from is are a generally correlated common invalid coupling with of each all other. the One neurons source of correlation is tempting to approximate its thermal fluctuations in the full system of where is the local field acting on neuron 1, and is the Hamiltonian of the rest of the neurons the network, so that its low energy states do not look like any one pattern We now return tolight an on analysis the physical of meaningparameter the of SK the saddle model point through equation an for the alternate replica method symmetric that order sheds 2.3. Cavity method derivation of (14) through theintuition cavity for method [7, (14)72], bymethod, which describing while provides considerable indirect, it physical can asdirect often replica a provide methods. self-consistency intuition condition. for In the general, finalfluctuations the results of derived cavity via more region in the as the free energy landscape of the Hopfield model as more than one This phenomenon illustrates aHowever, beneficial there role is for aas noise tradeoff in to associative melting memory away operation. mixture statesin by which the increasing recall states temperature, and dominate the the free network energy can landscapedevice. successfully over operate neural Many activity as patterns, important afunction pattern details of completion, or about associative the memory phase diagram of free energy valleys as a valleys corresponding tocharacterized mixtures by of solutions patterns to can the replica also equations arise. in These which mixture the states average are temperature spin glass phase of the SK model discussed in the previous section. J. Stat. Mech. (2013) P03014 . 1 in s (20) (19) exerted from the 1 of neuron h ,...,N 1 s spins can be takes the form 1 N is known as the h 1 absence h and 1 in terms of the cavity s 1 h ) in (20) is Gaussian, then 1 N h . ( 1 1 βH \ and \ − P 1 neurons in (16), but instead the e βH s − ! e N i s i ! in the full system of i 1 . Note that this does not imply that the s J 1 i 1 1 \ h J N =2 , H i X ) by all other neurons can be approximated by ) (17) in a Gibbs distribution with respect to 1 neurons obtained by removing 1 1 N =2 s i − X 1 , that has been removed from the system. (C) In h − ( 1 1 h 1 s − h \ N 1 P

) h 1 δ

,h N 1 δ s is Gaussian. Indeed, if ( N ,...,s X N Statistical mechanics of complex neural systems and high dimensional data 2 βV s H , and therefore the thermal fluctuations of cavity activity − ,...,s X i 2 e and its local field N 1 s 1 1 J Z Z 1 1 \ s in the cavity system 1 = . Z 1 ) is that one can now plausibly make a Gaussian approximation The cavity method. (A) A network of neurons, or spins. (B) A cavity 1 ) = h 1 , unlike the case of these same fluctuations in the presence of h 1 i h 1 . 1 ( s ) = 1 , h 1 J 1 s , while of course correlated with each other, must be uncorrelated \ 1 − h s in a ‘cavity system’ of P N ( ( s 1 1 \ N h ) = a replica symmetric approximation, the full distribution of the field on the cavity (in thea absence Gaussian of distribution, whilein the equation joint (20). distribution of Figure 2. surrounding a single neuron, P P 1 acting on neuron 1 in the full system of in the full system 1 ,..., 1 , h 2 1 h h s s ), i.e. the distribution of (17) in the cavity system (18) of neurons 2 ( 1 h V ( 1 \ The advantage of writing the joint distribution of The key idea behind the cavity method is to consider not the distribution of the P where field distribution to doi:10.1088/1742-5468/2013/03/P03014 14 with the couplings patterns individual terms influctuating (17) neuron exhibit correlated fluctuations due tolocal common field coupling todistribution the of system, thereby leaving acavity ‘cavity’ field, (see or figures2(A) the and field (B)). exerted Then, on neuron 1 by all the others in the The joint distribution of the absence of 1. Becauseabout the the cavity system set does of not couplings couple to neuron 1, it does notMotivated know by thisthermal lack fluctuations of of correlation,local field we can make a Gaussian approximation to the (18), 1, and its distribution is given by that of written in terms of the cavity field distribution as follows: J. Stat. Mech. (2013) P03014 (27) (26) (21) (24) (25) (22) (23) to be limit. ), (20) q 1 induced N h limit fast 1 ( 1 h neurons, in \ N P N and 1 , s 2 ) 1 \ in the large i 1 h ij J −h 1 h ))( q − 2(1 / vanishes in the large in the full system of ) cannot be Gaussian; as discussed (1 . 1 1 i ) − \ ) 1 1 i , h j 1 ,h , h 1 s 1 s δs ( ( i s ( N 1 δs βV of its cavity field, ) is characterized by its mean N P \ h 1 − i P q j e h 1 in ] ( 1 s 1 1 − q δs \ 2 \ 1 \ i i i 1 s i 2 P s − ) δs s X i h h 1 j , = 1 1 δs 1 , N 1 ( \ 1 J i h i \ N i Statistical mechanics of complex neural systems and high dimensional data 1 1 i i 2 1 N =2 i J i h q, X q s J h , h [ i =2 − − − 2 N 1 N N Z =2 i X i i,j X J 1 i and variance 1 ). The simplification in replacing the network with a fluctuating , s 1 1 h = = = 1 = 1 1 N =2 \ ) = i \ X i 1 1 , h i N =1 \ 1 i 1 . Here, we have neglected various terms that vanish in the large 1 X i h = 1 s 2 , h h \ ( ) 1 h 1 1 i N 1 \ i | h s V i ( s 1 1 δh = N s ( h P h q h h − h i s is the order parameter = i q ) obtained by marginalizing out 1 δs h In the replica symmetric scenario, under the Gaussian approximation to Under a Gaussian approximation, limit, but, most importantly, in going from (22) to ( 23), we have made a strong ( N doi:10.1088/1742-5468/2013/03/P03014 15 becomes allowing us to computeterms the of mean the activity mean of neuron enough that we cansystem neglect (and all consequently off-diagonal thevalley. full terms On system) in the is (22). other accuratelyconnected This hand, described correlation can if by will be the a system receive truecannot single is contributions neglect if free described the from the energy off-diagonal by fluctuations cavity is terms multiple across tantamount [7]. free to valleys, Thus, energy an and thelandscape. valleys, assumption we validity the As of of discussed replica this above, symmetry,self-averaging: cavity under or it approximation the a does assumption single of not valleyFinally, a in depend we single the on note valley, free we the energy thatsymmetry expect detailed is the realization broken cavity of and method there can are multiple be valleys extended [7]. to scenarios in which replica N assumption that the connected correlation above, this non-Gaussianity arises due to positive correlations between and variance where and by their coupling field is shown in the transition from figure2(B) to figure2(C). P J. Stat. Mech. (2013) P03014 i \ to i i is a (28) (29) , for ik h i , and h i J i \ i in (14), k s q h for all ik in (27), and respectively. limit. Under i J . However, we i i q s J 6= qz N h k , which is itself a √ q P , β q , we can replace the = i i \ \ i i i i h h h = tanh h N , , which are uncorrelated with i i , in the large q ik J J − over random realizations of 1 i \ i i qz, h h √ . For each i | should be the same as the distribution of 1 s J . h . 2 N z i to obtain an expression for q i

for each − 2 N i i 1 \ , q i i i , which we can do by demanding self-consistency of the \ h q i − for a fixed realization of Statistical mechanics of complex neural systems and high dimensional data i h is computed via (26) and (27), and reflects the thermal 1 i h N | h i qz, . Mathematically, this corresponds to computing the marginal · ) that do not depend on the detailed realization i h ij √ 2 s J i | h i i s in (26), which yields N s =1 h i i X h , reflecting the heterogeneity of the mean cavity field across neurons, 1 P sh z N

) − = = /N , respectively. q q denotes a ‘quenched’ average with respect to a zero mean unit variance q across different realizations of ) = . Thus, we expect the distribution of z i − i = (1 ii s, h q · ( V hh by virtue of the fact that this thermal average occurs in a cavity system in the across different neurons and 1 i i \ Equation (29) is a self-consistent equation for the order parameter \ i i i k qz s h random variable due to theh randomness of theabsence couplings of be Gaussian, with a meanFurthermore, and we variance expect that are this easily distribution computed to to be be self-averaging, 0 i.e. and the distribution of this assumption, although we may not know each individual However, we do not yet know for a fixed h average over neurons in (28) with an average over a Gaussian distribution, yielding Here, Gaussian variable and the thermalfluctuations average of√ a single neuron in the presence of a cavity field with mean and variance So far, we haveexample seen two methods that allow us to calculate self-averaging quantities (for 2.4. Message passing measure of the heterogeneityreflects of a mean demand activity that across theheterogeneity statistical neurons. of properties Therefore, mean of physically, neural the (29) which activity. cavity Now, fields finally, are we consistent can withwhen the specialize this to is the substitutedderived SK via into model the (29), in replica we method. recover the self-consistent equation for may wish to understand the detailed pattern of mean neural activity, i.e. doi:10.1088/1742-5468/2013/03/P03014 16 However, now we must compute cavity approximation. First of1; all, the we above procedure notecarried of that out forming there a with was cavity any nothingwe system neuron. can by special Thus, removing average26) ( about neuron these and neuron 1 equations (27) could over hold have all individually been for all neurons some fixed realization of distribution of a singleefficient neuron in distributed a message full passinghave joint algorithms distribution been from given developed by computer (2).certain to science Here, factorization compute [5, we properties. introduce 11, such]10 marginals which in probability distributions that obey J. Stat. Mech. (2013) P03014 a x can (30) P . Here, the . if and only k ). (C) The j i s x i i x s ( ij that factorizes a variables, with J +1 and → β t i N corresponds to a j N M a , x i x ) = e j , . . . , x 1 , s and variable i x , s a ( a , and the factors correspond i ψ x ,...,P variables = 1 N a depends on. Thus, we systematically . The factorization properties of i (see3(A)). figure For example, the SK x also as a subset of the a i x a . ) ) (31) are the variables a N i x ( s a is computed exactly as a product of two messages. i ψ depends on x a depends on =1 , . . . , x P ψ Y a 1 a Statistical mechanics of complex neural systems and high dimensional data x 1 ( Z ψ P i 6= ) = is treated exactly, while the effects of all other interactions besides ,j N } a , and there is an edge between a factor j Message passing. (A) A factor graph in which variable nodes are X x a { ), and in the SK model of equation (2), , . . . , x ) = 1 i ij x x are approximated by a product of messages. (D) Exact message passing in a ( ( factors, or interactions, indexed by = ( P Figure 3. represented by circlesflow and of factor messages nodes involved are ininteraction represented the by update squares.a of (B) the The message chain; the marginal on message passing approximation to the joint distribution of P if and only if factor or factors P a a i ∈ is any arbitrary variable that could be either continuous or discrete, and i i , or equivalently factor x a Consider, for example, a joint distribution over The utility of the factor graph representation is that an iterative algorithm to compute ∈ i into a set of Here, denotes the collectionabuse of notation and variables think that of factor each factor index variable if be visualized in ato factor variables graph, which is a bipartite graph whosemodel, or nodes more correspond generally either anyto neural a system factor with graph an in equilibrium which distribution, the corresponds neurons to nonzero synaptic weights connecting pairs of neurons. Thus, each neuron pair the marginals doi:10.1088/1742-5468/2013/03/P03014 17 J. Stat. Mech. (2013) P03014 ) b a i by x ) is ( (see i (32) (33) (34) (35) i feels a x \ i i ( → b i a → ∈ M i j M . In contrast, there are two except ) the message b i t induced by the x b , variable ( i i a x , supplemented by → b t b ψ M on variables b ) as an approximation to j . In this case, x a ( b (the left-hand side of (32)) is , and by → i t b j in the factor M can be approximated via i a ∈ , ) alone on i to factor j b x (see figure3(B)). Message passing involves j , since in the absence of ( i b a x → t j . ) i M . i x \ ) ( i b a ∈ x Y j ( → ∞ i i , ) induced by all other interactions besides interaction b ) +1 → i approximate the true marginals though equations (34) t M ) (see3(B)). figure The (unnormalized) update equation b x i Statistical mechanics of complex neural systems and high dimensional data x j ( a alone. These messages will be used below to approximate x a ( b x M i ∈ i b ( Y → ψ a ∞ i b → in (35) (see also figure3(C)). This approximation treats the \ ) i ∞ a i \ a → b M ∈ t a Y j x x b X . Intuitively, we can think of M x ( i i ) as an approximation to the distribution on M a i ∈ ψ a x Y induced by all other interactions besides interaction and ) = ) = is connected to only one factor node ( , except for interaction i i i i j i ∝ x i ∝ x → x ( ( → t ) b ) the message from variable in the full joint distribution of all interactions (see e.g. (34)). ) i ∞ a a i a j i +1 +1 → → x x x M t t b i x M ( ( ( b M M P P to variable → t denotes the set of all variables connected to factor node j b i M \ can be visualized as the flow of messages along the factor graph (figure3(B)). We b i The (unnormalized) update equation for a factor to variable message is given by The update equations (32) and (33), while intuitive, lead to two natural questions: can be approximated via i for all types of messages, one fromdenote variables to by factors and the other from factors to variables. We first define this iterativeis algorithm a and probability then distribution later over give a justification single for variable, it. and Every at message any given time the distribution on from factor we can think of direct influence of interaction the marginal of where figure3(B)). Intuitively, the directobtained influence by of marginalizing out all variables other than accounting for the effectsthe of product all of of messages the other interactions besides for the variable to factor messages is then given by Intuitively, the distribution on (the left-hand side of (33))that is simply involve the variable product ofrandomly the direct initializing influences all of all the(32) interactions messages and (33) and until then convergence. Onewhere iteratively exception running any to the the variable random update initialization equations is the situation initialized to be a uniform distribution over no influence from the rest of the graph. Under the message passing dynamics, will remain aguaranteed, uniform but if distribution. the algorithm Now, doesx converge, for then the general marginal distribution factor of a graphs, variable convergence is not and indeed the joint distribution of all variables and (35)? Amarginal key of intuition the variables arises from thedoi:10.1088/1742-5468/2013/03/P03014 structure of the approximation to the joint 18 for which factor graphspoint will messages they converge, and, if they converge, how well will the fixed J. Stat. Mech. (2013) P03014 . ) a N (36) (37) (38) (39) ∈ ) is i i in the s ( i i that were → ) i ,i 1 − would require i ∞ ( i s M , ) 1 on these variables by − . Overall, this method i k b , and the normalization s i ( s ) ,k are independent, and their 1) = 1. Note that whereas 1 , and, after convergence, we − − a i ( k ( ∈ 2). A similar leftward iteration P → , i 1 . by explicitly including the factor , − , t +1 k k a +1 s ∞ k M k s 1 s k = s (+1) + − . k +1 t ) s P +1 i k k,k s s ), is initialized to be a uniform distribution, J k,k ( 1 1 i J 1 i − s − 1 → = ( =1 − N k k,k i k 2) J , +1) P β P (1 ). Each iteration converges in an amount of time β e i,i i β ). An exactly analogous approximation is made in ∞ ( e i → 1 s e 0 1 ( − x 1 N i M k ( − X s i M ) a → i ,...,s → s ∞ i Statistical mechanics of complex neural systems and high dimensional data through interaction X ( X ,...,s +1) i +1 ) = 1 i M s s i,i a k → ∞ ( ) s . ( ,i ∈ leads to a factor graph in which all the variables are now weakly coupled (ideally independent) under all the 1 a M k i ) = ) = − a i i → i a ∞ ) 6= ( s s ( ( ,k b i i 1 M − → → ) +1 k ∝ t marginals simultaneously, as (36) holds for all ( ,i 1 removes all paths through the factor graph between variables +1) ) . For example, the rightward iteration for computing i i M − N i i,i s ∞ ∞ a ( ( ( P M M ) = k s ( +1) k,k ) operations, this iterative procedure for computing the marginal requires only O( ( This weak coupling assumption under the removal of a single interaction holds exactly Although message passing is only exact on trees, it can nevertheless be applied to N → +1 . However, it approximates the effects of all other interactions t k a coupling of theψ variables a simple product of messages the update equation (32).removing Such approximations the might interaction bepreviously expected to connected work to wellremaining whenever interactions whenever the factor graphone is interaction a tree,In with the no loops. absence Indeed, ofjoint in any distribution such such factorizes, a paths, consistentIn case, all with general, removing pairs the whenever any of the approximationsin variables made finite factor time, graph in and is) (32 a the a and general fixed point tree, proof35). ( messages of thechain yield this message (see the fact, passing true figure3(D)). but marginals equations Consider wechain. [11]. converge will the This We illustrate will spin marginal it not feels distribution in give is an of the a interaction a case to product spin of its of a at left two one and position converged dimensional right, messages and Ising at so time (35) tells usEach of the these marginal two messageschain can to be position computed by iterating messages from either end of the M since spin 1leads is to only the connected calculation to of a single interaction (1 given by the path length from each corresponding end to where the first equalityThe is first message a in special this case iteration, of (33) and the second is a special case of (32). have Inserting (38) and (39) into (36) yields the correct marginal for doi:10.1088/1742-5468/2013/03/P03014 19 graphical models with loops, and, as discussed above, it should yield good approximate factor can bea fixed naive at sum the overO(2 all end spin by configurations demanding operations. to Moreover, two compute sweeps through theand the marginal therefore chain allow all over us tois compute essentially all the identical messages, generalization to of the the Bethe transfer approximation matrix [73]. method for the 1D Ising chain, and is a J. Stat. Mech. (2013) P03014 ), j , if s (43) (44) (42) (40) (41) ( j → ) → ∞ i,j t ( M by all the spins in , i.e. j i . ik J ) to a variable i, j , , ) ) i k s s ( ( i ) → i,j t through a nonzero ( k is removed. In terms of this parameterization, i → t i M s ij k M s J ) through i i j s s s . i . ik ( 0 s ), as the essential degrees of freedom upon which ) ) J i i ij β s i,j J βhs e → ( ( converges to the field exerted on spin β ) t k + . 0 j can be thought of as a type of cavity field; as e → i Statistical mechanics of complex neural systems and high dimensional data k i i,j s s , h → j ( j t i X i M ki s → → → j h βJss t t i t i i is coupled to X \ J e i h ( k βh ∈ M Y 0 u e s k s j ) = X \ j i ∝ s ). Then, the remaining message passing update (33) yields the ∈ ∝ X ) = ) ( i k i i j s s s s ) ( ( ( → ) = ) j j J,h j i,j +1 +1 ( i,j → → ( t t t i i ( +1 → t i βu → t i h e M M M if and only if M ) is defined implicitly through the relation i ≡ ∈ J, h ) ( i k s u ( j Now, each message is a distribution over a binary variable, and all such distributions We conclude this section by connecting message passing back to the replica method. → t i the message passing updates (41) yield a dynamical system on the cavity fields [76], a cavity system in which the interaction Here, doi:10.1088/1742-5468/2013/03/P03014 20 message passing is successful, Here, the scalar parameter dynamics can be usefully parameterized by a single scalar parameter, the message passingM dynamics operates. We simplify the notation a little by letting In general, suitable averages ofequations the and messaging passing the equationsmodel, replica reduce we outline equations to the both [5]. derivation of theperspective To the cavity of replica illustrate message saddle point this passing. equation Wehas in equation first degree (14) note the from 2, that the special since the every update case factor of node of a in the message the from SK SK model a factor ( marginals whenever the variablesremoval adjacent of to that alearning factor in factor node. section node3.6 Weand arein will compressed partially weakly sensing see justifying in correlated successful section the upon was6.3. examples a application An variational of of early connection: theoretical message this eachis advance solution passing in to in to the the fixed graphical one contexts pointapproximation models equations to to of with of the one message Gibbs loops passing free correspondencemodels energy that in with is variational approaches exact extrema toand on of inference precise trees in conditions (see a graphical under [75]is certain which for theoretically message a Bethe guaranteed passing review). to in free converge However,marginals. graphical to there energy messages models Nevertheless, are that with no [74], in yield many known a practice,in an loops general good message approximating approximation marginals to passing when the seems correlationsare between to indeed variables achieve adjacent weak to empirical after a success removal factor of node that factor. depends only on the message which is a specialnode case to of factor (32). messages, Thus, we can take one set of messages, for example the where J. Stat. Mech. (2013) P03014 0 . j ) ) k 0 ). s j to s J +1 → t i βh (45) (46) (47) (48) βh h J on (which s h tanh( tanh( (besides J ) reduces to J h . In (43), the k ( ). However, in . ≈ J h ) Q ( ! ) . We can think of Q k J, h ( , h 2 u k , the latter empirical , at a message passing ! ), and the cavity fields → ∞ ! J k J ( k J t across all choices of pairs on each side of (46). The ( u βh in the presence of its own j 2 P βh as 2 → ∞ , after marginalizing out k h → is a random variable due to k j ), which reflects the simple coupled with strength ∞ i h X s j h → N s t i βh − → tanh tanh h ∞ i . More generally, one can track h k 2 k h j J J ,

. j δ tanh( , due to the external ﬁeld k k )] 0 ) X X s and J k βh

h i and ( ≈ ) ) i ), then the induced distribution on k k Q ) can be neglected), exerts a field h h h k ( ( ( s h Q J, h ) tanh( Q Q d ( ) in (45) reflects the back-reaction of k k u k h h βJ , Y d d J, h J ) ( k k k ) at a message passing fixed point, u . Then, the distributional equation for Y Y J h q ( ( ) ) by taking the expectation of . Second, for a fixed realization of k k Statistical mechanics of complex neural systems and high dimensional data P . To simplify the right-hand side, since the couplings Q q J J J k q ( ( J obtained by marginalizing out P P d are drawn i.i.d. from a distribution k k u turns out to be a simple sum over all the spins ). This yields a recursive distributional equation characterizing arctanh [tanh( k J J ik h j Y d d 1 ( β J +1 → t i Q , we can use the small coupling approximation Z k k h Y Y on both sides of) (46 yields ) = /N 2 Z Z ) = ) if the back-reaction from ) is the effective field on a binary spin h h . J, h ( = = i that experiences an external field of strength ( βh 0 u Q q → J, h s t k ( h u . The assumption of self-averaging means that as j are drawn i.i.d. from a distribution . The more complex form of Using (43), we are now ready to derive (14). The key point is to consider self- In general, it can be difficult to solve the distributional equation for i s → and t k Physically, another spin In the weakapproximation coupling that limit the ofwould average be small magnetization tanh( of on that becomes non-negligible atupdated larger cavity values of field the bi-directional coupling this distribution in two ways. First, for a fixed Explicitly, of this same effectivecavity field field the random choice of couplings fixed point, there is an empirical distribution of cavity fields should be identical to consistency conditions for the distribution of cavity fields the distribution of cavity fields i distribution converges to thewould distribution like of to the write down formerthis a random distribution self-consistent must variable. equation be In for self-reproducingin any this under (43), case, distribution, the if by we update the observing equation couplings h that (43). More precisely, Here, we have suppressed the arbitrary indices the SK model, onezero could mean make Gaussian an with approximation variance thata self-consistency the condition distribution for ofleft-hand cavity fields side is is a byhave a definition variance of 1 the time-dependent evolution oftechnique the known distribution as density of evolution cavity [5]. fields, an algorithmic analysis Then, averaging doi:10.1088/1742-5468/2013/03/P03014 21 J. Stat. Mech. (2013) P03014 , w (49) (50) , this is equivalent q synaptic weights is not relevant to the N w , so that the set of perceptrons N = ! , and fires depending on whether or k w = +1 represents the firing state and ξ · σ βh w 2 1 dimensional hyperplane orthogonal to the tanh − k ), where X ξ N · 1 N w

βh. 2 ) ) is zero mean Gaussian with variance Statistical mechanics of complex neural systems and high dimensional data k h h ( ( Q Q = sgn( ) tanh k h σ h ( d h Q k d Y Z Z 1 dimensional sphere. . Since the absolute scale of the weight vector = = w − N 1 represents the quiescent state. Geometrically, it separates its input space into two − In summary, we have employed a toy model of a neural network, the SK spin = The perceptron is a simple neuronal model defined by a vector of 3.1. Perceptron learning In the above sections,statistical we mechanics have of reviewed fluctuating powerfulsynaptic neural machinery connectivity activity designed matrices. patterns A to inthat key understand the this conceptual the presence advance same of madestatistical disordered machinery by mechanics Gardner could directly [79, beexamples on80] presented was applied the to to space thewe of the system will synaptic analysis playing explore connectivities, the of thisunsupervised role with learning, learning viewpoint of the (see and by quenched training [12] its disorder. performing for applications In an this to extensive section, review diverse of phenomena this in topic). neural and 3. Statistical mechanics of learning glass model, toto introduce analyzing disordered the statisticaldetail various mechanical the systems. replica, simplest In possible cavity eachnamely ansatz case the and replica concerning we symmetric message the have ansatz, discussed structurecorrelations corresponding passing in to of between a approaches the single degrees free valleymodel, with of energy weak it landscape, freedom. connected nevertheless While providesthe this a various assumption good methods. example isbelow, system not In the in true assumption addition, which for of toout fortunately, a the gain to for single familiarity SK be many with freeextended correct. energy of [7] Finally, to valley the scenarios we governing inenergy the note applications which valleys fluctuations replica and that discussed long symmetry will just range is turn viewing correlations, broken, as optimization so corresponding and too the to inference can many replica problems message free to through passing and a the approaches. new cavity Indeed, lens algorithm, of methods knownor statistical as physics can minimize survey has be propagation led costs, [77, that in78], can which free confound can energy find more landscapes good traditional, marginals, characterized local algorithms. by many metastable minima Now, since we have assumed to the replica symmetric saddle point equation (14). which linearly sums a pattern of incoming activity not the summed input isit above computes a threshold. the Mathematically, function in the case of zero threshold, σ classes, each on opposite sides of the weight vector doi:10.1088/1742-5468/2013/03/P03014 22 problem, we will normalize thelives weights on to satisfy an J. Stat. Mech. (2013) P03014 . P w w (53) (51) (52) data , such should , i.e. P w µ V to the ξ µ w input–output P ) determining λ P ∝ ( , corresponds to λ V w − 0, and 0 otherwise). . For example, each N ≥ ) = experimental conditions, λ ( , x with the weight vector V ) in (52)[12]. However, if we P stimuli. The overall goal of λ µ ( ) = 1 P V x ( θ genes across ), this choice leads to w N ( E ,...,P. = 1 of the data onto this single dimension yield a useful µ µ ξ ) limit becomes a uniform distribution on the space of Statistical mechanics of complex neural systems and high dimensional data , neurons in response to · ∀ 1 dimensional sphere of perceptrons as follows: ) ) is the alignment of example µ w w ( − λ N , where each vector is of dimension µ ( → ∞ N 1 βE ξ √ N V µ β − ) is the Heaviside function ( 0 e σ x = =1 P ( ≥ · 1 Z does a solution to the inequalities exist? µ X inequalities θ µ . Doing so requires a learning rule (an algorithm for modifying the ,...,P µ } λ µ w ξ P based on the inputs and outputs) that finds a set of synaptic weights µ ) σ µ ) = ) = = 1 σ N , σ w w w · µ → µ ( ( √ ) for various choices of potential functions ξ / µ w E P { w ), where ξ ( λ E , for − = (1 µ ( µ θ ξ λ ) = Suppose we wish to train a perceptron to memorize a desired set of A statistical mechanics based approach to answering this question involves defining λ that satisfies the ( This same statistical mechanics formulation canlearning be extended scenarios. to more In general unsupervised unsupervised learning, one often starts with a set of 3.2. Unsupervised learning in the zero temperatureperceptrons ( satisfying (51)(51), (see and,4). figure in Thus, particular, thestatistical whether mechanics volume or of of not (53) the it in space is the of nonzero, zero solutions temperature can be limit. to computed by analyzing the vectors vector could beor a a pattern pattern ofunsupervised expression of learning of activity issimplest to of approach is find to simple findthat an hidden the interesting single structures projections dimension or spanned by patterns the in vector the data. The one dimensional coordinatebe system defined for by minimizing the the data. energy This function (52), interesting with dimension the can choice often of associations, w We will seeinequalities, then, below remarkably, a that learningcan rule, as find known the long as solution. thetraining The as perceptron main data learning there remaining rule question [13], exists is then, a under what simultaneous conditions solution on the where the particular unsupervisedHebbian learning algorithm. learning. One Upon choice, minimization of synaptic weights an energy function on the Successfully memorizing all the patterns requires all alignments to be positive, so be a potential thatvariety penalizes of learning negative algorithms alignments for and thedescent favors perceptron on positive architecture can ones. be Indeed, formulated aare as gradient wide interested in probing theV space of solutions toWith the this inequalities choice, (51), theexamples, it is and energy useful so function to the in take Gibbs) (52 distribution simply counts the number of misclassified doi:10.1088/1742-5468/2013/03/P03014 23 J. Stat. Mech. (2013) P03014 is w i (55) (54) is the replica q potential cluster -mean clustering. to be the center of K K i , where w . Under this choice, q 2 λ − − than to any other centroid. i ) = λ w ( . In the case where the distance i V . µ . i . The cluster assignments of the data are then i , ) µ 2 , λ µ 1 assigned to cluster are optimized by minimizing the sum of the distances . At each iteration in the algorithm, each cluster across data points λ i ( K µ Statistical mechanics of complex neural systems and high dimensional data µ ξ w V w λ µ =1 ξ P µ X · i ,..., 1 Perceptron learning. (A) The total sphere of all perceptron weights w ) = w 2 = 2, and when both the data and cluster centroids are normalized . This is the direction of maximal variance in the data, also known N 1 w T √ , µ K 1 ξ , this iterative procedure can be viewed as an alternating minimization µ = w is the center of mass of cluster , this energy function can be written as ξ ( K µ i i µ N order parameter introduced in section 3.3. (gray circle) andperceptron a weights single that example(A), yield (black but an arrow). for output The aexamples +1 blue in different on region (A) example. the is and (C)weights example. (B). the shrinks, The (D) (B) set and set As The of its of more same typical weights examples volume as that are is yield added, given the +1 by space on 1 of both correct Figure 4. E λ w well separated clusters in the data, this iterative procedure should converge P = K to those data points i C w Beyond finding an interesting dimension in the data, another unsupervised learning For general points in the directionhas of its the center center of of mass mass at of the the origin, data. a In useful situations choice in is which the data points in the directionmatrix of the eigenvector of maximal eigenvalue of the data covariance doi:10.1088/1742-5468/2013/03/P03014 24 as the firstvariance principal of component the of distribution the of data, i.e. it is the direction that maximizes the task is to find clusters in the data. A popular algorithm for doing so is This is an iterativecentroids algorithm in in the which data, one maintains a guessThen, about all the cluster centroids from defined to be the set of data pointsmeasure closer is to a centroid Euclideanmass of distance, the this data step points just assigned to sets cluster each centroid where of a joint energythe function over special cluster case centroids of and cluster membership assignments. For recomputed with thethere new are centroids, andso the that whole each process repeats. The idea is that if to have norm J. Stat. Mech. (2013) P03014 1 is N ± b is a (58) (59) (57) w = µ · ξ u a with the σ w ) and /N → ∞ N , and an integral ab = (1 . Therefore, we can P,N Q µ to perform Hebbian ab σ i Q limit, from a Gaussian w or N can a perceptron with are drawn from a uniform µ , ξ µ P ξ , where , the integrand depends on the ++ µ a µν ) δ λ µ a ) (56) 1 λ ab ( λ V Q dimensions? To address this question − , =1 . = P µ 2 )] | 2 λ N Q P over these realizations. This can be done ( ii ( λ S θ ν b =1 2 Z − n a − λ ) λ µ a 1 Q P ( λ λ − − | E . These variables are jointly Gaussian distributed [ e 1 2 ) µ a hh N 2 a λ points in − λ − w e ) d − P 2 ab 1 λ n =1 Statistical mechanics of complex neural systems and high dimensional data λ Y a dQ ( + θ 1 1 into an integral over all possible overlaps Z . (For the case of perceptron learning, we can make the ab λ λ Y µ ( a 1 2 ξ − w ** Z · 1 sphere (or, equivalently, in the large , since both have the same distribution; in essence we absorb = µ a ) = = = − ξ 2 w ) ii ii N → , λ n n 1 N µ λ Z Z √ ξ ( held constant. Fortunately, this is the limit of relevance to neural models / hh hh V µ σ = (1 P/N µ a then reduces to averaging over = λ µ α ξ In both cases, the analysis simplifies in the ‘thermodynamic’ limit compute these observables byby averaging first log averaging the replicated partition function random vector drawn fromeach a with uniform probability distribution half.do on Similarly, a a we natural sphere assess question ofdimensional for the radius dataset unsupervised statistical learning consisting significance isit of how is of often any usefulitself structure to has or analyze no what pattern structure,distribution structure we on for we the example find may whendistribution in find with the in a identity data a covariance high points matrix). null data distribution that ratio with many synapticanalysis weights, involves understanding and the to low(53). energy high In configurations of dimensional the the thermodynamic data.configurations Gibbs limit, distribution The or in important starting the observables,averaging; point distribution like they the of do of volume not the data of depend along low on energy the the detailed optimal realization direction(s), of become self- where synapses memorize? One benchmark is the case of random associations, where doi:10.1088/1742-5468/2013/03/P03014 25 Both perceptronmechanics learning problems and as unsupervisedquestion above, for can learning, perceptron be when learning analyzed is formulated through how the as many replica statistical associations method. A natural 3.3. Replica analysis of learning learning only on the data points that are currently closest to it. Gradient descent on this energy function forces each centroid the replica overlap matrix.configuration Thus, of after replicated averaging weightsseparate over the only integral over through theirover all overlap. configurations Therefore, with it the is same useful overlap. Following to the appendix, this yields the sign of theover desired output into the input, yielding only positive examples.) Averaging redefinition with zero mean and covariance matrix and J. Stat. Mech. (2013) P03014 . ) q Q ) is ( + (63) (61) (62) (60) = 2. Q S c ( ab α δ E ) increases q , in (A.37). → q − z α arises from a ), ), while q → ∞ q a ( = (1 β F 1 as . For example, for ab ) → Q a Q λ q ( which, as explained in 0 for all βV q a > P drawn independently from − a limit is the typical overlap b λ w λ ) 1 λ ab − random associations. which reflect larger volumes in ( Q , a q i βV N λ → ∞ ) − q 2) )) / β q − (1 − − (1 e / 2 reflects the typical volume of the solution ) Q q qz ) in (62). The first term is an energetic term in the + ln(1 1 q √ − det ( q q − λ F √ q − ), in the zero temperature limit π a 2)(( λ 1 / can be performed via the saddle point method, and 2 λ h − (1 d ( √ 2 1 ab − θ e Q + n , reflecting a pressure for synaptic weights to agree on all =1 Statistical mechanics of complex neural systems and high dimensional data ) Y z a q q ) = ii Q ). The second term is an entropic term that is an increasing λ − Z ( q ζ λ weights can store at most 2 V (1 ln d ln π N α 2 hh Tr log − 2 1 α p 0 limit yields a saddle point equation for increases, placing greater weight on the first term in ) = Z ) = → ) = α Q q Q ( = n ( ( , the integral over E ζ S F half-spheres (see figure4). Thus, unlike the SK model, we expect a replica , which thus promotes smaller values of N q P At large Taking the Thus a perceptron with doi:10.1088/1742-5468/2013/03/P03014 26 weight space. As as energy becomes more important than entropy. As shown in [80], that is a decreasing functionexamples of (promoting larger function of is the partition functionNow, of in the distribution the appearing case inside of the perceptron average learning, over between 1 two zero energyminimization synaptic of weight configurations the (see sum section of twoA.2). terms in space to (51) (see4(D)), figure in that where is the entropy ofperceptron the learning volume when of weight vectors with overlap matrix where and an energetic term thatyield promotes the the correct alignment answeris of on the an any replicated entropic given weights termsince set so they that of that have promotes they examples larger replicated all (i.e. volumes. weight configurations with small overlaps, Given the connection (explainedand in section theA.2) distribution between of(53), replica overlaps this overlap of matrix choice pairs elements, suggests ofexpect the random existence as weights of most a ofconvex. single the Also, free energy in energy functions the valley. we This zerostate will is temperature energy be reasonable limit, analyzing configurations, to this forindeed if ansatz unsupervised suggests true degenerate, learning that for are should theof perceptron form space a of learning, a ground set since convex, of connected the set. space This of is ground states is the intersection symmetric assumption to be a good approximation. the competition betweenmake entropy the and ansatz energy that selects the saddle a point saddle has point a overlap replica symmetric matrix. form We section A.3.2 can be derived by extremizing a free energy J. Stat. Mech. (2013) P03014 Statistical mechanics of complex neural systems and high dimensional data Now, the actual distribution of synaptic weights between granule cells and Purkinje The key intuition for why a majority of the synapses are silent comes from the doi:10.1088/1742-5468/2013/03/P03014 27 cells has beenhas measured a [84], deltaGaussian and function distribution. a In atIf prominent particular, 0, the feature about Purkinje while 80% ofa cell the of this is majority rest the implementing of distributionnetwork synaptic an of the is should important weights reflect synapses the that sensorimotor are the silent? properties nonzeroand mapping, it exactly In of outputs. why weights 0. the general, Thus, then learning one follow the ruleby are might as distribution a be positing well able of truncated as a to the synapticweight quantitatively particular distribution. statistics weights derive However, of learning the the in inputs authors distribution rule a did of of [81] not and weights took depend input–output an on evenPurkinje statistics more cell even elegant architecture and positing approach as that then a anyand perceptron, particular derive derived assumed the learning that the it distribution rule.a operated of They optimally synaptic replica simply at weights capacity, based modeled ofstatistics, the perceptrons Gardner whenever operating type the at analysis.associations capacity perceptron at via Remarkably, implemented a for given the levelconsisted a of maximal reliability of wide (its number capacity), a range of itsIndeed, delta distribution of input–output like of function input–output the synaptic weights data, aton a the 0 majority perceptron plus of operatingrule; a the at any truncated synapses (or learning were Gaussian near)distribution. rule silent. capacity, for that This and the can prediction does nonzero only achieve not relies capacity depend weights. would onconstraint the necessarily that learning yield all such thethe a granule Purkinje weight cell to cella Purkinje or nonnegative cell the synapses synapticactivity are perceptron patterns weight excitatory. and faces Thus, vector fires either for arest. that some It fraction difficult linearly of turns computational granule combines outperceptron cell that task: nonnegative patterns operating while false at it granule not positive or firing cell must errors near for the capacity: dominate find there the are weight many structure granule of cell the activation optimal patterns Interestingly, in [81], the authorsand developed a applied replica it based to analysiscells make of in predictions perceptron the about learning cerebellum. thewas distribution Indeed, first of an synaptic analogy posited weights betweenand of over the most Purkinje Purkinje 40 intricate cell yearsreceiving and dendritic excitatory ago the synaptic arbors inputs perceptron [82, from ofcerebellum about].83 all 100 devoted 000 The to neuronal granule cells motor Purkinjemotor cell which, control, in states, cell types; areas convey sensory of a has this feedback the can sparse and arbor one exert representation contextual is an of of states. influence ongoing capableinput, the The on internal of Purkinje each largest outgoing cell Purkinje motorthrough output, control cell a in signals. climbing receives turn, In fiber inputcell addition input, as to whose from well firing the on as induces granulefiring large plasticity average cell is complex in one often spikes the cell inconvey correlated granule an the in cell with error Purkinje to signal the errors Purkinjecell that in inferior can cell can motor guide olive synapses. be plasticity. tasks, Sincerelated Thus, thought inferior climbing at inputs of a olive fiber to as qualitative desired inputusing level, performing motor the error is supervised outputs, Purkinje corrective thought where signals learning to the transmitted in desired through order mapping the to is climbing fibers. map learned over ongoing time task 3.4. Perceptrons and Purkinje cells in the cerebellum J. Stat. Mech. (2013) P03014 1 µ \ ξ be w (67) (65) (64) (66) 1 \ → ∞ w β , this new , to vanish. P q ) in (62) by q − . ) has a unique ( will also change λ α and F ( , its overlap with 1 1 ξ V N ξ , the weight vector 1 . ξ 1 2∆ + z ) in (52). This is essentially the i ) w . ( λ i ( ) E λ V ( ) in the presence of all other examples + V ) in the presence of w 2 ( + ) w , ( z E 2 z ) E ii − z 2∆ minimizes λ , − 2∆ ( ) µ ∆)) h with ∆ remaining O(1). In this limit, (62) and (63) λ w ( λ . It can be shown [86] that for large z, λ h ∗ , is a zero mean unit variance random Gaussian variable. ( Statistical mechanics of complex neural systems and high dimensional data λ 1 − λ min ∗ ξ λ λ · ( → ∞ along the dimension discovered by unsupervised learning. 1 δ − , and \ does not know about the random example β µ µ α λ w 1 =1 ξ P ξ ( ) \ µ X − δ as · w N 1 P hh w √ /β ) / ∆) = argmin N ) = ) = (∆) = z, λ λ ( √ = ∆ F . Since ( ( = (1 ∗ / q 1 β λ P P z − = (1 ,...,P µ to an optimal alignment λ = 2 z Furthermore, the interesting observable for unsupervised learning is the distribution µ Equations (64), (66) andcavity (67) method have applied awhich to simple one interpretation unsupervised of within the learning the examples, [85, zero say temperature 86]. example Consider 1 in a equationthis cavity example, (52), system is in removed, and let the ‘cavity’ weight vectorfor that optimizes assuming 1 become where Now, suppose exampleupon 1 re-minimization is of the then total energy included in the unsupervised learning problem. Then, minimum leading to a non-degeneratelimit, ground we state. expect Thus, in thermal the fluctuationsIndeed, zero in temperature the we synaptic can weights, reflected find by 1 self-consistent solutions to the extremization of will change to a newfrom weight vector, and consequently its alignment with In contrast toformulation in perceptron (52) and learning, (53) to in unsupervised learning the discussed applications here, of the statistical mechanics 3.5. Illusions of structure in high dimensional noise doi:10.1088/1742-5468/2013/03/P03014 28 for which the perceptronachieve must this requirement remain with below nonnegative thresholdIndeed, weights is by and quantitatively to matching not set the fire, many parameterstheory of synapses and the to exactly the replica to physiological based only zero. data, perceptron the wayabout learning capacity to 40 of 000 the input–output genericthe associations, Purkinje weights cell corresponding of was a to estimated single to 5 cell be kB]. [81 of information stored in Extremization of (64) over ∆ determines the value of ∆ as a function of distribution of the data where optimal alignment arises through the minimization in (67). This minimization reflects a This distribution is derivedand via is the given by replica equation method (A.37). in Its section zeroA.4 temperatureat limit yields finite temperature, of alignments across examples with the optimal weight vector, J. Stat. Mech. (2013) P03014 α √ (see / , the + α λ √ is drawn / and leads µ ξ 2 λ , and leads to − λ the algorithm is , the distribution − 2 w . With this choice, ) = | λ λ λ ) = ( mean given by 1 − λ −| V ( 1 λ V ) = λ ( . Along the direction nonzero 2 V λ will be a zero mean unit variance + µ 1 determined by (64). Note that this . Thus, PCA scales up the alignment λ z ) α was already optimal with respect to all . Thus, unsupervised learning becomes 1 = √ α α − from (64). Thus, Hebbian learning yields w √ . However, along λ / α α √ . After learning, we find that the distribution √ / / incur an energy penalty with respect to the old α and dimensional space, where each point √ 2 w / λ across examples = 2, defined by the energy function in (56), involves N µ + )) = (1 + 1 K ξ 1 · α , in )∆, with ∆ = λ Statistical mechanics of complex neural systems and high dimensional data z P w ∆( ξ ) = z, N + ( λ ∗ √ ,..., λ + sgn( / 1 with a mean of 1 z ξ 2 , and outside this region the distribution is a split Gaussian. The λ = (1 . This term arises because α µ z + √ λ ∆) = 1 / before learning any example, yields the distribution of alignments across points, λ 1 z, ( z -mean clustering for ∗ P + ∆ from67), ( with ∆ = 1 ≤ K λ z , or the number of examples, increases and the weight vector responds less to λ α ≤ α ∆) = , and it is usually a decreasing function of √ Similarly, we could perform principal component analysis to find the direction of Finally, We can now apply these results to an analysis of illusions of structure in high z, / α ( 1 ∗ through (67) and64) ( to implies that the distribution− of alignments in (66) has a gap of zero density in the region maximal variance in the data. This corresponds to the choice doi:10.1088/1742-5468/2013/03/P03014 29 (see figure5(B)). Thus, anonzero high center dimensional of random mass Gaussian when point projected cloud onto typically the has optimal a Hebbian weight vector. to a newprincipal example, component and with (66) zero mean, leads but to a a standard Gaussian deviation distribution equal of to 1 alignments + along 1 the algorithm behaves like Hebbianthe learning, data so along we shouldmaximizing expect the a absolute Gaussian value(67) distribution of yields of the projection, so that of alignments in (66) is a unit variance Gaussian with a a projection ofHowever, the the data form onto ofthe two this dimensions, projected energy determined coordinates function by the in two (57) cluster reveals centroids. a lack of interaction between Gaussian (see figure5(A)).the However, center suppose of mass we of performed the Hebbian data.an This learning additive corresponds to shift tothe find the in number choice the of alignment previous examples to as a 1 new example whose magnitude decreases with figure5(C)). This extra width isand larger leads than to any an unityalong illusion eigenvalue of that the the the principal covariance high component matrix dimensional direction. Gaussian point cloud has a large width the presentation of anyrepeating new example. this Finally, analysis exampleof for 1 alignments is each not example, special and in averaging any way. over Thus, the Gaussian distribution competition between two effects: thewith second respect term to inoriginal (67) the alignment favors new optimizing the example, alignment but the firstexamples. term The parameter tries ∆ to playsthe the scale prevent role of changes of a an froma possible inverse self-consistency the realignment stiffness condition of constant that a foridentical determines weight ∆ to vector can with the be respect extremizationof derived to of within a (64). the new This cavity example,stiffer approximation extremization and as and makes is ∆ implicitly a function examples after learning in equation). (66 dimensional data. Consider anconsisting unstructured of dataset, i.e. a random Gaussian point cloud the other examples, so any changes in λ i.i.d. from a zeroidentity mean matrix. multivariate Thus, Gaussian if distribution we projectof whose these this covariance data matrix projection onto is a the random direction J. Stat. Mech. (2013) P03014 2 λ + = 2. 1 λ K direction (see 2 λ 2. = 2) of a random = 1000 dimensional dimensional space − ≤ increases. Indeed, it 1 N K N λ α P/N = -mean clustering with α K along the α √ / -mean clustering factorizes along random points in an K = 2000 random points in -mean clustering (with P P K ) as the amount of data α √ / Statistical mechanics of complex neural systems and high dimensional data = 2) are drawn from a structureless zero mean, identity covariance limit this histogram is Gaussian with 0 mean and unit variance. (B) Illusions of structure. α N,P Figure 5. space (so Gaussian distribution. These pointsA are histogram projected oflarge onto the different projection directions. of (A) A these histogram points of the onto(C) same A a point projection random cloud onto direction;projected the projected principal onto in onto component two the the vector. cluster (D) Hebbian directions The weight found same vector. by point cloud and does indeed have a gap of width 2 2 λ − 1 λ In summary,5 figure reveals different types of illusions in high dimensional data whose doi:10.1088/1742-5468/2013/03/P03014 30 effects diminish rather slowly as O(1 high dimensional Gaussian point cloud revealsclusters the in illusion that the there cloud. areand two There well numerical separated experiments is because not the aactually discontinuity in perfect leads the match derivative to of between replica thesymmetric the energy symmetry result in replica breaking (57) symmetric are [87]. theory relativelythis However, small, case; the in corrections and contrast, to replica(C)). it the symmetry is replica is exact for a Hebbian good learning approximation and in PCA (see e.g. figures5(B)should and be noteddepends that on the a verywill certain ability typically illusion lie of of on the structure: one perceptron side of to some store hyperplane random as patterns long as also joint distribution of high dimensional data in figure5(D)). Therefore, quite remarkably, and J. Stat. Mech. (2013) P03014 ∝ ) (68) i w ( µ → i M . µ ξ · for which a solution 1), and w } − µ = ( i I , σ → µ µ ξ { M . However, is it possible to find − synapses has the capacity to learn NP . µ i (+1) = ξ i . This positive feedback amplifies the N 1), the perceptron capacity is reduced = 1), µ → + P = 2. However, what learning algorithm µ ± µ → i i c σ h M = w i = → w i i → w µ u P/N < α . ) = µ , compute the current input ξ α µ · Statistical mechanics of complex neural systems and high dimensional data w ( θ , as this would imply . This system drives the messages to approximate the marginal N =1 P µ Y µ → i 1 Z h 83, and, moreover, can this algorithm be biologically plausible? 83 [90, 91]. Of course, below capacity, one can always find a solution . = 1, for all patterns). . ) = 0, do nothing. 0, update all weights: and 0 µ i w σ ≥ ( = 0 → µ I I < P c α < u 83, i.e. the space of binary weight vector solutions to (51) is nonempty only when . . The message passing equations (32) and33) ( then yield a dynamical system on i = 0 w P/N < α c µ Rule 1. If Rule 2. If Iterate to the next pattern, untilSuch all an algorithm patterns will are find realizable learned solutions correctly. to (51) in finite time for analog synaptic The work of [17, 18] provided the first such algorithm. Their approach was to consider When presented with pattern associations. However, we would like to find a single synaptic weight configuration, α random associations as long as → = . It is unlikely to expect to find a learning algorithm that provably finds solutions in a i • • • • h weights. However, what ifevidence synaptic suggests weights that cannot biological takethus synapses arbitrary behave can analog like values? reliably noisyThe Indeed, binary code general switches only [88, problem89],finite two and of discrete levels learning set ofan in of synaptic NP-complete networks values) weights, problem with isrevealed rather [15, that binary much than when].16 more weights weights a An are difficult (or binary continuum. exact than weights (say enumeration the with and analog a theoretical case; studies it have is in fact to time that is polynomial in through a brute-force search,N but such a search will require a time that is exponential in α message passing on thewith joint the probability desired distribution associations over (again all we synaptic assume weights consistent a learning algorithmtime that at can large typically (but not provably) find solutions in polynomial Here, the factorsand are synapses indexed tobe by examples parameterized examples. are by Thee real all messages numbers, distributions as from on examples binary to variables, synapses and therefore can the variables distribution of a synapseP across all synaptic weightnot configurations a that correctly distribution. learnby To all a do positive this, feedback term in on [18] the the updates for message passing equations are supplemented We have seen in section 3.1 that a perceptron with 3.6. From message passing to synaptic learning P can allow acase perceptron of to analog valued learnknown synaptic these as weights that associations, the weassociations perceptron up have that been learning to can discussing, be algorithm the a implementedweight [13, (i.e. simple critical vector those14] algorithm, capacity? associations to can Inset (51) be of the exists). proven randomly The to initializedgenerality, that learn perceptron weights as any learning follows set algorithm (for of iteratively simplicity, we updates assume, a without loss of doi:10.1088/1742-5468/2013/03/P03014 31 J. Stat. Mech. (2013) P03014 is 1. i , by ≥ h µ µ i ξ i 1 for all . h i ± h ). i h on pattern = 0). ) synapses, when i 5 I h = sgn( , but only if i µ i ξ w = O(10 ) presentations per pattern + 2 . 4 [17, 18]. Thus, one obtains a N i µ i i ξ h → 0 t µ , where → + 2 µ u i i ξ h =1 h · P µ w to only take a finite number of discrete → P i i =

N A ) , and integrating over all conﬁgurations b ν det − Q A T Q hDD h DD ( λ ab A ( a µ E Tr ( = = e = = Q T λ z 2 a i N u A − )

e − /P EE ab . In going from (80) to (82), we performed the (1 a a u Q P Q ) d P i can be performed via the saddle point method, and A α 2) T of the matrix ab i/ ab A + Y ( ( µ Q Statistical mechanics of complex neural systems and high dimensional data − I T a e Z u Q ) . This latter integral yields an entropic factor that depends = , yielding a single average over variables /P Q hDD µ . Therefore, we can compute the remaining integral over can be performed by a Gaussian integral over the variables is row (1 W . a a ab Tr log a µ u ii Tr log A P λ 2 a α Q d 1 2 ) 2) z i/ ( n =1 ( Y a ) = − n ) = W e Q Z Z Q ( , raised to the power ( , where hh DD = S E ab a u will not converge to that of the identity matrix. Even when Q W · = ii µ A ) a ) z ii ( with a given overlap b N λ n W a Now, the final integral over Inserting (77) into (76), we obtain Thus, consistent with the general framework in section A.1, averaging over the disorder a √ u Z λ / hh doi:10.1088/1742-5468/2013/03/P03014 35 the integral can be approximated by the value of the integrand at the saddle point matrix Here, in goinguncorrelated from for (79) different to (80), we have exploited the fact that the variables and is the usual entropicin factor. (84) The first comes term from in the (84) part comes outside from the ( 82) average while over the second term introduces interactions between the replicated degrees of freedom of interest here, therebe will thought be of somecompute as spread via another in the illusion replica the method. of density structure around 1, in and high this dimensional spread data, can which we now Now, the integrand depends on the quenched disorder hh on the overlap matrix equation (78) by integratingof over all overlaps on the overlap. In the end, (78) becomes where Gaussian integral over in its elements arethe strong data enough that its eigenvalue spectrum for typical realizations of distributed with zero mean and covariance, (1 Thus, the average over J. Stat. Mech. (2013) P03014 i λ for has (90) (89) (86) (87) (88) z A , where , where T , i.e. the + V A T Σ that charges . However, we A z U matrix Σ = = nonzero elements ), or equivalently N < z < z is given by q A ( − W N A z by F , and the eigenvalues P A in the measure (90). q eigenvalues? This distribution T N V Σ + log U zq , , and depends only on matrix whose only = ) , and due to the relation between the z + i q V N A − q + , and therefore we must include the Jacobian ). This illusory spread becomes smaller as we by i and scaling. Because the α z A ). We can make a decoupled replica symmetric P /α )( U 1 Q − πz ) d /P . With this choice, (75) leads to the electrostatic 1 + ( z 2 ± ab S A is a ( − . s are the singular values of 1 (so we have more data points than dimensions), i qδ − P z Σ σ log ( ) = . = 0 α Statistical mechanics of complex neural systems and high dimensional data Q has an imaginary part only when p A ( α > q ab iq. 1 − T α i ). q E A Q α . The = = + i = 2Tr σ z / W ) = W W 1 = + ii Q − ii ii ( e ) ) ii q ) ) is independent of z has a nonzero imaginary part. It is in these regions of z F z ( Σ has a unique singular value decomposition (SVD), ( ∝ ( q A α + i ) ( W W . In this region the charge density is A W α 2 P A Φ R ρ ) ( α − hh hh hh P √ dependent quadratic equation for i / 1 z are unitary matrices and ± satisfies the saddle point equation obtained by extremizing V q are simply the squares of these singular values. Thus, to obtain the joint distribution , we first perform the change of variables i Fortunately, λ = (1 and W , which extremizes ± U i.i.d. zero mean unit variance Gaussian elements, the distribution of Now, each matrix are on the diagonal, for of need to transform the full measure doi:10.1088/1742-5468/2013/03/P03014 36 In the previous section wematrix, found but the what marginal about density the of entire eigenvalues joint for distribution a of Wishart all random which is theto Marchenko–Pastur the (MP) high distribution dimensionalityspread out of (see around the figure6(A) 1 data, overobtain below). a the more range Thus, data eigenvalues of of (increased due O( the sample covariance matrix 4.3. Coulomb gas formalism matrix in equation (77) without the 1 ansatz for this saddlepotential point, and (72) leads to the electric field Here, the right-hand side of equation (86), This is a has a physically appealingdimensional interpretation statistics that discussed provides below. intuition Consider for applications the in distribution high of electric field and charge density in (71), we are interested in those real values of Q which the solution i (eigenvalues) will accumulate,part. and their In density the will regime be in proportionalz which to this imaginary a little algebra shows that i J. Stat. Mech. (2013) P03014 , A T (91) (92) A ) , where ) } /P i λ { , so we can ( i E = (1 σ = 2, rescaled by − e α W . ∝ | = 2) are drawn from k ) λ } α i λ − { j . ( ) λ | P V d Y j N D ≤ δ = 1, . Of course, with so few projections, one cannot reconstruct the original − dimensional space is randomlyK projected down to an Figure 7. nonlinear manifold respectively. (C) A manifold of N in the number of points . This is striking because the number of projected dimensions 1, as long as δ The celebrated Johnson–Lindenstrauss (JL) lemma [101] (see [102, 103] for more recent < δ < Consider data distributed along a nonlinear 5.2. Manifold reduction doi:10.1088/1742-5468/2013/03/P03014 40 How small can we make for all pairsexceeds of points logarithmic source data, data from theirthe projections. geometry Nevertheless, of the surprisingly, entirebased with point approach cloud so for is few understanding preserved. We the random will JL discuss projections lemma a in statistical mechanics section 5.3. dimensional space, as in figure7(B). An example might be a set of images of a single and simpler proofs)0 provides a striking answer. It states that for any distortion level space, so that the low and high dimensional distances are no longer similar? such that J. Stat. Mech. (2013) P03014 ) )) N is a NC C -sparse N/K nonzero log K 2 log( K 2 . Here, K/δ δ K/δ O( . We will give an δ O( dimensional space, as -sparse signals, with M > K N M > orthogonal vectors spanning ) RPs are sufficient to preserve 2 K = K/δ dimensional linear subspace in an O( M K dimensional signals with only M > N coordinate hyperplanes in N K dimensional space, how might one recover the original , recover the high dimensional signal from its projection. minimization. This is one of the observations underlying Statistical mechanics of complex neural systems and high dimensional data M 1 A L minimization, reviewed below, that can provably, under various 1 dimensional space? For the case of point clouds (figure7(A)) and L . δ N -sparse signals7(C)), (figure there exists a simple, computationally tractable K The simplest finite dimensional manifold is a Sparsity is another example of an interesting low dimensional structure. Consider, However, even in situations where one cannot accurately reconstruct the high Beyond the issue of preserving geometry under an RP, one might be interested in object observed underAnother example different would be lighting theto conditions, set of a perspectives, neural firing continuous rotations rate family vectors and in of a scales. stimuli. brain region In in response [104]–[106] it is shown that dimensional space. It can beall shown] [107 pairwise that distancesalternate between proof of data this result points incourse, section within for5.3 such abelow a using simple distortion theprojection, manifold, results there level of namely exists sections a 4.2 theand nonrandom,.4.3 optimal Of PCA geometry preserving basis consisting of components. This is a manifold of for example, a (nonsmooth) manifold of random projections preserve thenumber geometry related of the to manifold themore up curvature projections. to of distortion Overall, the the manifold, interestingdepends so result linearly is that on that highly the the curvedon required intrinsic manifolds its number dimensionality require of of ambient projections embedding the manifold, dimension. and only logarithmically the manifold. Thus,projections we rather pay a thanalong price the a in optimal hyperplane, the ones.generically a number preserve Of geometry. of PCA course, projections for based for data projection choosing that will random are no not longer distributed be optimal, and willin figure7(C). not The geometry ofIn this manifold particular, can [107] also be shows preserved that by random random projections. projections down to an doi:10.1088/1742-5468/2013/03/P03014 41 dimensional signal fromcompressed computation its directly projection,need in RPs for the signal can lowand recovery. dimensional signal still This projection processing be can algorithms space,For depend be very example, without only done regression useful the on because pairwise [109],learning by distances many signal [115] between allowing interesting detection data and machine for points. [110], learning classification nearest [111]–[114], neighbor manifold finding [102] can all be accomplished in a low dimensional space preserve the distance between any pair of vectors allows oneefficiently and to robustly reconstruct using the these field of vectors compressed from sensing, reviewed the below. low dimensional projection, It turns outparticular, that [108] geometry shows preservation that is any projection a which sufficient preserves condition the geometry for of signal all recovery; in situations in which onedata vector, can or invert signal the in the random low projection, i.e. given the projection of a assumptions on the RP matrix distortion less than signal in the high algorithm, known as general nonlinear manifolds (figure7(B)),algorithms there capable are no of generalcase achieving computationally this of tractable high dimensional signal recovery. However, for the J. Stat. Mech. (2013) P03014 , . α M A s growth and , and let R P points ln /M P √ ). Indeed, the R ). In general, the ln √ /M σ approaches a Gumbel growth of the maximal 2 σ P random projection operator ln √ N ), and can be neglected. In this by ), due to the random choice of /N M /M be an A is due to the strong, Gaussian suppression of any P Statistical mechanics of complex neural systems and high dimensional data limit, for a fixed (i.e. quenched) point cloud, a Gaussian limit, and takes typical values of O( N ) yields the conclusion that the maximal distortion over all ) pairs of points, or possible distortions, and we are interested 2 2 R realizations of such variables. Applying this general result with P P R = O( = O( dimensional space. Let R N 2 independent Gaussian variables with variance P ). Thus, the origin of the extremely slow R , in be the low dimensional image of the cloud. With this choice of scaling for the ) and P/M 1 α M ) independent Gaussian variables each with variance O(1 ln 2 ( p ,...,P As P O in (97) is in the large Consider a fixed realization of a Gaussian point cloud consisting of = = = 1 αβ α 2 distortion with the number of points doi:10.1088/1742-5468/2013/03/P03014 42 Gumbel distribution is avariables universal whose distribution tails governingextreme extreme vanish values values faster of in than any anyof exponentially random single the [116]; variable maximum this directly over σ leads strong to suppression an of extremely slow as O( pairs of points obeys a Gumbel distribution, and its typical values scale with distribution in the large The proofs],101]–[103 [ distortion [107, of submanifolds]104 underThus, behind RPs they rely the leave on remarkablenecessary, open sequences and sufficient how of the the potentially conditions question typical,to loose as for of use opposed inequalities. to low whether a worst directthe these case, approach, distortion probability sufficient behaves. distribution more conditions Is of inprojections? it are the the possible Here, typical actually spirit distortion the ofthe of random statistical random random mechanics, choice manifolds choice to of underobservable of simply of manifold random projection compute interest plays is plays the theover the distribution, all role across role pairs of choices of of ofaveraging, quenched in RPs, thermal points that disorder, of it degrees in the does of theensemble maximal not of distortion depend fixed freedom, on manifolds. manifold. and the In choicecan The the general, of hope a be this particular approach is carried manifold ishyperplanes. from that out a challenging, Our this suitable but for main distribution here twobehavior goal is we very of show in self- some that simple this of it classes section the of is inequalities discussed simply manifolds: in to random section obtain. 5.2 point intuition clouds for and the scaling 5.3. Correlated extreme value theory and dimensionality reduction dimensional space given ais relatively often small comparable numberoriginal high to of dimensional RPs. what space. Moreover,computations The can rely reason task only for be performance on this the obtained remarkable distances performance between by is data that points, performing which these are the preserved by task RPs. directly in the maximum of α whose matrix elementsx are i.i.d. Gaussian with zero mean and variancerandom 1 variable with zero meanNow there and are variance O(1 in the maximum distortion,between whose pairs behavior of could distortions. For in randombetween principle Gaussian two point depend clouds, pairs on the of the correlationmanner, distortions coefficients correlations the are ambient dimensionality weak, of the inthe point maximum fact cloud distortion O(1 disappears over all from pairs theof of problem. points Thus, O( can be well approximated by the maximum variance of the projectionD operator, it is straightforward to show that any one distortion J. Stat. Mech. (2013) P03014 . , , ¯ δ A α K αβ . P δ + and √ (99) D that / (100) by < , with U 2 M ) N , and its αβ α → ¯ A to +1 √ columns. In dimensional T / P α range over all ¯ will have zero A K K be an √ δ < D β / s ¯ A − 1 U 1, the distribution T − ¯ A and α > α s dimensional space can be to obey , we must have , then δ M αβ dimensional space so that the -dimensional coordinate vector D N K . 1 M < K consisting of its first − dimensional space. Thus, if we let ). Therefore, the maximal distortion in A ¯ As K M T − denote a ), while its large positive deviations of O(1) ¯ , in order to achieve an O(1) distortion 3 A s / has been characterized in sections 4.2 and T 2 . Therefore, as long as N s − , a similar argument holds for the minimum . Indeed, if A A coordinate axes. Thus, points in the hyperplane p α M } √ K =1 / 2 of the ambient subspace again disappears from the submatrix of 1 , and let k s need only be logarithmic in the number of points M > K , i.e. are O(e − k A , N K s Statistical mechanics of complex neural systems and high dimensional data { M M ), which, up to constants, is the JL result. Thus, extreme 2 by dimensional vectors whose only nonzero components are the will with high probability lie in the range < δ, M P/δ = max N orthonormal columns form a basis for a random ! αβ D P αβ K M O (ln D . As long as submatrix of ln α 1 αβ √ r K

dimensional space (drawn uniformly from the space of such subspaces). M > in (100) is simply the maximum eigenvalue of the matrix by are mapped to the first max O N , we can always perform a change of basis in U M ¯ As . Its typical fluctuations are O( U T . The maximal eigenvalue is with high probability equal to (1 + 1 ¯ coordinates, and the statistics of their projection to A T K and s K M/K For random Gaussian point clouds, we were able to neglect correlations in the → A of = of distance distortions for the hyperplane, then we have eigenvalues, which correspond geometricallylie to in vectors the in kernel the of random the hyperplane random projection 4.3. In fact, the results in these sections carry over with the replacements denote the distribution over the random choiceN of (100) is close to This means, of course, that if one wants all distortions problem. Second, bymaximal exploiting distortion the over all linearity pairsdistortion of of points over the in all projection the points plane, operator it on suffices to the to compute compute unit the the sphere maximal in determined simply by the this manner, the dimensionality Here, to obtain ain terms slightly of cleaner fractional finaldistance change result, used in we in Euclidean ares distance (97), now as hence measuring opposed the the to distortion square the squared root Euclidean in (100). The constrained maximum over α are exponentially suppressed in can be parameterized by first distortion between different pairs ofno points. longer For be more able generalanalysis to manifold is do ensembles, still this. we possible will However, for despite the the ensemble presence of of random these hyperplanes, correlations. an Let exact and independent of the ambient dimension random matrix whose subspace of What is the distributionpairs of of the points maximal in distortionof this in subspace? (97) First, where by exploitingcolumns rotational of invariance of the ensemble distortion, which will be close to doi:10.1088/1742-5468/2013/03/P03014 43 individual distortion. This effectexample, is if directly we responsible desire for our the remarkable maximal JL distortion lemma. to For be less than or equivalently value theory for uncorrelatednumber Gaussian of variables provides random a projections natural intuition for why the J. Stat. Mech. (2013) P03014 0 s N on µ by K/N (101) a of the = 0 M s f · µ a by an 0 s , or, equivalently, α √ / in the bottom manifold 1 0 s δ < minimization. that satisfy the measurement 1 temporal filters. In the context s L for which the recovery algorithm As N p , is a linear function = , which proves the claim about RPs x 2 . In the context of signal processing, ,...,M A is sparse by searching for sparse solutions 0 s = 1 to a pattern of presynaptic stimulation u µ M > K/δ ) is any sparsity promoting function. A natural ? In general, this is an underdetermined problem; x x x is then the projection of is just one point in this large space. However, we could be a set of ( th row of x 0 , for µ V s . µ µ a ) subject to dimensional vector which has only a fraction 0 i ) = 0 otherwise, so that (101) yields a signal consistent x from , so that (101) finds a solution to the measurement s x p ( is a point in the top manifold of7(C). figure Suppose we 0 N ( | Statistical mechanics of complex neural systems and high dimensional data = 1, the lowest value of As s 0 s V V | p s measurements, which is linearly related to = is the . Here, norm. However, this optimization problem is nonconvex for could be a vector of presynaptic weights governing the linear N =1 0 i x µ 0 X p s s a ) = s L with the minimum number of nonzero elements. However, this is s of ( x , i.e. s V ˆ = 0 and M < N A x of dimensional space of candidate signals . The true signal , where x 0 s M = arg min As minimization. Here, we review this algorithm and its analysis based on s ˆ − = ) = 1 if 1 x is an unknown sparse N -sparse signals (figure7(C)), as discussed above, one can actually recover the x L ( 0 [35]. K V s minimization µ 1 L 1. Thus, a natural choice is Now, how might one recover with the measurements in general a hardproblem combinatorial optimization by problem. choosing One could relax thisp optimization < can try to exploitto our the prior measurement knowledge constraints. that For example, one could solve the optimization problem to obtain an estimate choice is constraints with minimal (101) becomes a convex optimization problem, knowndoi:10.1088/1742-5468/2013/03/P03014 as 44 constraint response of a single postsynaptic neuron of figure7(C). Each measurement of network reconstruction, a trial there is an measurement matrix unknown signal could be a temporal signal, and the then one canthe achieve number this of with random high projections probability obeys as long as of its elements nonzero. Thus, are given a vector 6.1. Suppose high dimensional signal fromknown its as projection usingstatistical a mechanics computationally and tractable message algorithm, of passing. the ideas As in discussed this in section section1, and many the applications previous one are described in [30]. We have seen inof section5 lowthat dimensional random signal manifolds. projectionsspace Furthermore, of can in the preserve case the in geometric which structure the manifold is the 6. Compressed sensing of hyperplanes madefluctuations of in correlated section randomin5.2. variables section (i.e.4.3) Overall, can themanifolds, be this charges namely used of argument hyperplanes. to a understand shows Coulomb geometric gas distortions how induced described the by RPs extremal of simple J. Stat. Mech. (2013) P03014 and (102) (104) (103) (105) using which A 0 . Thus, s 0 , A s . We can A 0 s ii Z and held fixed, we and and fluctuations in (101), in the A log 0 0 A , are the replicated s Q to be self-averaging; M/N = ≡ hh . We take these to be Q = 0 s ˆ . Then, taking the low ¯ s F α , . . . , n β As − and ∆ = 1 = 0 with a x , Q measurement matrices given by play the role of quenched disorder a , average error 0 0 u minimization. Some of these results s s for any typical realization of Z 1 → ∞ − to the optimization problem in (101), G L . The replica method yields a set of , b i P s s ˆ | and and the signal , further averaged over u o i random a i G = log s = A , where u A P M,N are drawn independently from a standard a u + =1 u i βF T i · µi u | − µ P A a randomly chosen nonzero elements, each drawn N =1 i ). Thus, /T X N 0 1 . s + √ G ( fN 2 P / ≡ i P 2 . Statistical mechanics of complex neural systems and high dimensional data , ab ) Au ) i . In the limit = 1 G has u T Q we enforce the constraint ( 2 P u G a µ i 0 δ A b P i ( s βE T h u − h u e N =1 i X 1 N =1 N → ∞ Z λ i X by computing the average free energy 2 1 , where does not depend on the detailed realization of λ N limit condenses the Gibbs distribution onto the vicinity of the global 0 1 N 0 µν ) = Q δ = ) = to guarantee perfect signal recovery, so that u Q = ( ab u Q ( 0 G Q → ∞ A P ∆ E Q β , and therefore its free energy = minimization. However, often large G P

1 b ν L δb a µ Details of the replica calculation can be found in [121]. Basically, averaging over To understand the properties of the solution Now, , all depend on the measurement matrix in the replicated Gibbs distribution corresponding to the energy (102) reduces to coincides with their thermal averages over δb Q 0 A averaging over the variables the replica method. and analyze the statistical mechanics of the Gibbs distribution therefore compute and, if needed, the thermal fluctuations normal distribution, while independently from a distribution By taking the limit ∆ random variables; the matrix elements the typical error Much of theconditions seminal on theoretical work in CS [117, , 118 108 ] has focused on sufficient 6.2. Replica analysis case of have also been derived using message passing [55]we and define polyhedral an geometry energy [122]. function on the residual temperature minimum of (101). Then, we can compute the average error in the thermal distribution s residuals. These variables are jointly Gaussian distributed with zero mean and covariance violate these sufficient conditionsperformance. nevertheless Thus, typically yield thesestatistical good mechanics approach sufficient signal to CS reconstruction conditions basedone on to the are directly replica compute method not the], [119]–[121 typical which necessary. allows performance of Here, we review a expect interesting observables, including thei.e. free the energy, thermal average of these observables over doi:10.1088/1742-5468/2013/03/P03014 45 J. Stat. Mech. (2013) P03014 . 0 0 ), 0 k s Q f z,s ) = ( f c ii , and (109) (110) (106) (107) (108) ( · . This 0 c s α hh α > α → ∞ in (106) and β Q in (101). On the 0 s ) one finds solutions f and ∆ = ( c 0 s ˆ are the solutions to (106) ). Second, assuming that Q Q /f α > α due to too few measurements. remains O(1) as in (108) and the original Gibbs 0 log 1 0 , s and ∆ | , Q s MF 0 . For | 6= z limit, one finds two distinct classes Nf and the full distribution of the signal f P β Q s ˆ

z . Given the convexity of the energy + ) Q 0 ), given the true signal component is 2 ) but and 0 s is O( s → ∞ | ). It is expected that thermal ﬂuctuations /β α )). 0 limit of high sparsity. In this limit, 2 N s β 0 − /α ( s 0 . Furthermore, the quenched average s ( → /β ( u Q MF P G f limit allows us to compute the typical error ) P is O(1 = q P f z 0 .

, s Q 0 − − ) = , MF − plane, as verified in figure8(A). This phase boundary 0 0 z,s is performed with respect to a Gibbs distribution u → ∞ s ) = . Furthermore, consider the marginal distribution of a H ( s Statistical mechanics of complex neural systems and high dimensional data ii 0 z,s f − f β G s – − MF e ii P ) + (1 MF α H s 0 1 Z = i vanish as O(1 H s · MF in i ( 0 k and h 0 2 ), this class of solutions no longer exists, and instead a new 2 H s k Q i f | Q s fδ ) = α α δu u ( 0 c s s 2∆ | hh h hh h = and = s limit, yields a set of self-consistent equations k . Under this replica symmetric ansatz, further averaging over ( = = should always vanish in the low temperature limit, but the fact that 0 s Q ( α < α ) is defined by) (108 and109), ( and in (103) is as follows. The replica parameters MF Q Q 0 MF 0 G Q s (given by G P ∆ P H Q + | → ∞ 0 P s s ab λ ( . This result can be understood from an information theoretic perspective. First, Qδ MF P /f = ∆ The phase boundary simplifies in the The relationship between the mean field theory Now, in solving (106) and (107) in the minimization should exactly recover the true signal, so that , which captures the typical error of CS, also vanishes suggests that for 0 ab log 1 1 doi:10.1088/1742-5468/2013/03/P03014 46 f each of our measurements carries O(1) bits of entropy, and that they are not redundant or the entropy of a sparse signal of dimension was first derived in [123, 122] using very different methods of convex geometry. class of solutions predictsThus, an replica error regime theoryreconstruction in predicts regime which in a the phase transition between a perfect and an imperfect distribution (107) are identified with(107) the in order the parametersof zero (104) CS temperature and as (105). Thus, a solving function (106) of and of solutions [121] dependingin on which the both valuescaptured ∆ of by ∆ other hand, for class of solutions occurs in which ∆ Q single signal component where we make thedenotes substitution an average over a standardcomponent normal variable According to replica theory,marginal the is mean given field by theory prediction for the distribution of this where and (107). L Here, the thermal average with an effective mean field Hamiltonian saddle point equationsfunction), (102 for it the isQ reasonable overlap to matrix choose ataking replica the symmetric ansatz for the saddle point, J. Stat. Mech. (2013) P03014 . k q s ˆ 2 and = ∆ . 8, and . σ 0 , = 0 6 . f 0 and variance , 0 4 = ), we obtained . s f is increased (not in increments of 0 α ( , c 43. The red curves f 2 N . . = 1 , the order parameters = 0 and f q α > α = 500, f ) we never once obtained α f N ( and c that are zero, all conditioned limit at the phase boundary α k reconstruction errors obtained s ˆ 06 and ∆ 2 . α < α L . Here, → ∞ 0 k s = 1 N 0 q +1). For these values of 0 s obtained by solving (106) and (107) in the error phase. ( δ ) obtained by solving (106) and107). ( We also use linear 0 1 2 1 (top), 0 (middle) and +1 (bottom). Each distribution can f = 500. The error bars reflect the standard error. The red Q Statistical mechanics of complex neural systems and high dimensional data ( − c , N α 0 1)+ s − = 500. The black transition region shows when the fraction of times 0 = 0 predicted in (114), all conditioned on the three possible values of s s ( Compressed sensing analysis. (A) The red curve is the theoretical N δ , with 2 1 α ) = ). (B) The blue points are the average 0 f s ( , fed through the soft threshold function in (C), with a noise threshold ( 01, with c . 0 α perfect recovery all 50perfect times, recovery. The and width for of all thisshown), transition other region yielding narrows a as sharp transition in the were numerically found to take the values by solving (101) 100 times for each of four values of Figure 8. phase boundary programming to solve (101)0 50 times forperfect each value recovery of occurs is neither 0 nor 1. For all other are the theoretically predicted distributionin of nonzero (114), reconstruction components whilefunction at the redthe dot true signal is thebe thought theoretically of as predictedq arising height from a of Gaussian distribution the with delta mean curves are plots(C) of The soft-thresholdingare function the defined conditional in distribution of (113). nonzero (D) signal The reconstruction blue components histograms various obtained from solving equationbar represents (101) the average 2000 fraction times, of components while the heightP of the green on the value of the true signal component doi:10.1088/1742-5468/2013/03/P03014 47 J. Stat. Mech. (2013) P03014 , 0 s 0 Q αq (111) (112) (113) = . 0 and is 0 σ Q from above y > α if y plays the role of , and q 0 = q = 0 unless the data )∆ ) is simply the MAP + √ s + ) ) z y σ limit, the statistics of x, σ + α/β ). It will not be possible ( 0 β η s ), in contrast, is universal, | − M ) undergo a first or second = ( x f ). What is surprising is that | ( f c Q )( ( α α, f c x ( 0 Q . α > α corrupted by additive Gaussian noise i | = sgn( 0 s , in the large | minimization, is capable of performing s β | s + 1 | added to a cavity system in the absence of reconstruction algorithm whatsoever, if the 2 L on the true signal, + s 0 | . Then, the mean field Hamiltonian in109) ( 0 q 2 , s ) ) any √ x q plays the role of an effective noise level −| z σ q ∆ − → ∞ − , s 0 β 0 ( s q 1 2 √ Statistical mechanics of complex neural systems and high dimensional data − and ∆ s z s , which basically chooses the estimate x x + q 0 ? Fortunately, it is a second order phase transition, so that which is a true signal s 1 α , up to constant factors. ( ). Thus, from the perspective of information theory, it is not 2∆ x η f . Thus, we see that in (111) and112), ( h /f σ β = ) = argmin = remain O(1) as MF log 1 H 0 f x , σ MF i q ( . Under a Laplace prior e ( s ); namely, the more confined this distribution is to the origin, the shallower given the data h η H σ 0 s 0 ( s and P q M/N > O ) to below, do we see a catastrophic rise in the error, or does performance gracefully The optimization in (113) can be understood intuitively as follows. Suppose one This optimization has an interpretation within the cavity method [119]. It is the Finally, we can understand the nature of the errors made by CS by looking at What is the nature of this phase transition? For example, if we decrease f ( = c estimate of exceed the noise level with variance measures a scalar value Since the entire Hamiltonianare is dominated proportional by to the global minimum of (111). In particular, we have where is a soft-thresholdingapproaches function [55] to (see solvingotherwise the figure8(D)), CS 0. which problem in also equation arises101), ( and in ( message passing the observed, corrupted data optimization that a new signal component doi:10.1088/1742-5468/2013/03/P03014 48 that component must performreflects to a minimize compromise its between total minimizing energy its in own (103). absolute This value and minimization satisfying all the highly correlated, then theto entropy perfectly of reconstruct our the measurements is signal O( using entropy of ourthat measurements the is less measurementα than entropy the exceed entropy the of signal our signal. entropy The then requirement yields the inequality surprising that we can reconstruct the signal when the distribution ofcomponent. the signal This reconstruction is,temperature components of limit conditioned course, we on can the interesting make true only the signal in change of the variables error ∆ regime. To take the zero a very simple, polynomial time algorithm, order phase transition in in that it does not depend on the distribution of non-zeros in the signal. the reconstruction, down totheoretic a limit number at of small measurements that approaches the information α degrade? In the language of statisticalrises physics, continuously does from 0. Thenon-zeros exponent governing the rise depends on the distribution of the rise (see figure8(B)). Note that the phase boundary where ∆ becomes J. Stat. Mech. (2013) P03014 1 N L (115) (116) (114) implies that A variables. This system , in analogy to the SK z MN , z

) q ∆ , 0 q are determined through (106) and107), ( √ variable nodes, one for each component of norm. For example, the Gibbs distribution such messages (the degree 1 factors do q z 1 . This decomposition is in a form suitable for N L + A 0 s MN factor nodes, one for each measurement, plus ( , and ∆ , 2 η | ) i 0 s N s · q |

µ β a − − e µ ) = x Statistical mechanics of complex neural systems and high dimensional data 0 ( N =1 ı s Y degree βλ ) ) s = N ( M 2 0 k µ / s ψ (1 | − s =1 M , and Y µ ) is given by i in (110), reduces to s = s ( 0 k µ ) = e s s ) = s ψ ( s ( ( of the measurement matrix µ G is a real number, every message becomes a distribution over the real ψ P P µ i s minimization problem in equation (101) can also be formulated as a message is row 1 µ L a The first approximation made in [55] is to restrict the messages to be Gaussian, which The distribution of the signal reconstruction components, conditioned on the true problem. The graphical model consists of 1 is reasonable because the density of the random measurement matrix where the factor and not require associatedmessages). messages and can be incorporated into the updates ofin the each other update,allowing each one message to receives invokenumbers the contributions for central each from message, limit a leading theorem. to large Thus, a one dynamical numberdoi:10.1088/1742-5468/2013/03/P03014 system need of on keep 2 messages, track of only two 49 defined by (102) and (103) decomposes as the application of the messageapplication passing equations of (32) and these ( 33).component However, a equations straightforward isnumbers, computationally complex. and First, one since must each keep unknown track of and it reflectscomponents, the fed through Gaussianminimization the distribution problem soft-thresholding in of function (113). which Here, the arises zero from temperature the scalar cavity fields across the unknown signal more degree 1 factor nodes to implement the The passing problem [55, 124], and approximatesystem formulations yield of neural the message network-like passing dynamicsL dynamical which provide a fast iterative way to solve the 6.3. From message passing to network dynamics which can be thoughtdemanding of as that a thewith self-consistency the condition distribution within distribution across thecorresponding across cavity self-consistency components components approximation condition of29) ( ofbetween the in replica the signal the theory SK and reconstruction, cavity model. simulationsin in for field An figure8(D). analogy the example signal is to of reconstruction the the distribution consistent match is shown signal component measurement constraints, whose sumin total the effect mean ismeasurements field will encapsulated theory vary by from of the componentcan (111). to quadratic be component, The term approximated and cavity the bymodel field average the in over encapsulating average components going over the from the (28) effect Gaussian to of variable (29). all other J. Stat. Mech. (2013) P03014 t i r (117) (118) , one can variables. dimensional /N N in (118) that N + b M and have a nonlinear transfer for the unknown T t energy function in equation (102) s A 1 L variables can be interpreted as an iterative neurons in a second layer. The first layer N dimensional measurement space. The update minimization type computations. N + 1 possible factors. Thus, one might assume that M L M and a top down inhibitory prediction of that input in M , x , 1 t ). By performing a careful expansion in 1 ) r − t . The second layer neurons receive a feed forward drive N r , θ Statistical mechanics of complex neural systems and high dimensional data t neurons in the first layer, while the current estimate of b A √ r / is stored in T + out of M t t A s µ + As t ). A similar argument holds for the factor to variable messages, + O(1 s − is a global minimum of the minimization and sparse coding [48, ].47 Given the potential role are all quite similar to each other; they differ only in excluding ( µ M x η 1 µ ∞ M √ L s ). = = / b = i +1 − t t → r s µ (1 + O(1 θ M i . Interestingly, this network dynamics is different from other proposals for the = η is the soft-thresholding function defined in equation (113), and here is applied M η /λ = minimization as a computational description of early visual [56] and olfactory [ 57] µ 1 The crucial term is the message passing derived term involving A readable account of this reduction can be found in [124]. Here, we simply quote the → L i receives feed forward external input through synaptic connectivity from the residuals through a synaptic connectivity where component wise to its vector inputs.then In the [124], it resulting was shownwith that 1 if these equations converge, endows the temporal evolutionterm, of reconstruction the performance residualinterpreted with is as a severely that historyis of impaired. dependence. stored a Without This in this two dynamicsthe the layer sparse can activity interacting representation of loosely neuronal be network, where the residual suggesting update on two pairs of variables, a current estimate main result. The dynamical system on can be furtherto simplified different by factors the noting [55, effects124] of that one messages factor from the samethen variable reduce the message passing equations to a dynamical system on signal, and theequations resulting are residual ] [124 function implementation of doi:10.1088/1742-5468/2013/03/P03014 50 We have reviewed the applications ofof replicas, network cavities dynamics, and message memory passing storage, toof machine basic learning data. models algorithms and While statisticalstriking the models picture, ideas it is and naturalin models the to models ask reviewed are how relaxed, the yieldmechanics and above based a what results analysis theoretical are rich of progress modifiedfew and can these when more be more sometimes assumptions prominent made complex examples. through surprisingly However, scenarios. abarely we Here, statistical warn be the we scratching reader will the that brieflymechanics, surface in discuss computer of this a discussion science a we and deep will literature neuroscience. lying at the intersection of statistical M of 7. Discussion processing, it would be interestingand to explore dynamics more capable fully the of space implementing of neuronal architectures J. Stat. Mech. (2013) P03014 Statistical mechanics of complex neural systems and high dimensional data Moreover, dynamical versions of the SK and Hopfield models have a Lyapunov In section2, we discussed binary models of neurons. However, biological neurons More recently, in [137, 138], the authors went beyond mean field theory to track doi:10.1088/1742-5468/2013/03/P03014 51 function that isat bounded from zero below, temperaturedynamical implying possibilities consists that arise: of the oscillations(via long and fixed chaos. time dynamic points Seminal asymptotic mean work only.exhibit behavior has field In shown deterministic theory) analytically asymmetric high andconstant networks, current through dimensional two input, simulations chaos such more between that networks [21, excitatory exhibit neuronal chaos130, and networks byspontaneous inhibitory131]. can dynamically irregular inputs achieving Even neural a tospike when activity balance individual due characteristic driven neurons. of to bycurrent. This cortical fluctuations Interestingly, a balance states, in when leads innetworks the their to which exhibit a inputs input, neurons sharp have asis phase entrained transition more opposed by from the nontrivial a to input,the chaotic temporal as external a state the input structure, to input mean is an strength such there either increases. ordered superthreshold noisy This is state, [132] input happens which an or forphase oscillatory example interesting transition [133]. when non-monotonic occurs, In as the dependence case a in of function oscillatory of the input theare input oscillation characterized strength frequency by analog [133]. atconductance internal which states, states describing this and membrane theirvoltage. voltage Dynamic dynamics mean and field exhibits ion theory large methods channel been can spiking used be to events extended characterize to in the spikingintegrate phase their networks diagram and and membrane have of fire networks of neuronssolely excitatory [134]. and by inhibitory For leaky such amaintaining neurons, membrane a whose voltage, internal distributionself-consistently state one of solving is membrane can characterized for136] voltages derive this for across an reviews). distribution neuronsdynamics, appropriate This using characterized in mean work by Fokker–Planck the two [134] field(regular methods possibilities network, led periodic theory for (see and to spike the by [135, for trains temporal four the statistics or possible of population irregular macroscopicasynchronous single average aperiodic phases neurons or firing spike of constant rates trains)neuron rates). network (synchronous times properties Varying or two allow strengths possibilities all temporally four of structured combinations excitation, rates, to inhibition occur. or entire and microstate single trajectories in spiking neural networks consisting of neurons in which First, in section2,were considering when models weincluding of introduced neuronal symmetric the networks connectivity,happens, that SK binary had for model neurons manyno example, simplifying and and simple assumptions, the when lack formequilibrium Hopfield of the for Gibbs model, external connectivity distribution the inputs. we network in becomes stationary dynamics, (2). What distribution asymmetric? One andmethods of must Then, in then [125, neuronal there posit many126]neuronal activity a is situations activity. to analogous dynamical This one to understand modelInterestingly, was can the for time done, while the for use averaged fullynetworks example, asymptotic dynamic retain asymmetric in statistical mean fixed [127]–[129] networkspoints properties points, for field can become asymmetric but diverge of theory ergodic, exponentially networks. the with partially time network size. it asymmetric takes for transients to reach these fixed 7.1. Network dynamics J. Stat. Mech. (2013) P03014 ) → ∞ N of statistical , indicating that α , which leads to a N . This reveals that the N Drosophila ) for large /α ) decays as O(1 α ( g Statistical mechanics of complex neural systems and high dimensional data is the ratio of the number of training examples to the number of P/N = training inputs and outputs are no longer random, but are generated from . For a wide variety of training procedures, or learning algorithms, statistical α P N ), which is by definition the probability that the trained perceptron disagrees with The perceptron, while acting in some sense as the α ( g doi:10.1088/1742-5468/2013/03/P03014 52 learning theory,separable is classifications a inStatistical which very the mechanics two limited approaches classesgeneralization fall [147]–[149] have architecture in on been more in opposite sophisticated used sides multilayered that to of networks. In a it analyze a hyperplane. multilayered memory can [144]–[146] only and learn linearly the number of examplesgood should generalization be to proportional occur. to the number of synapses in order for a synapses mechanics approaches have found that the teacher perceptron’sset. correct Here, answer on a novel input, not present in the training At the beginning of section3, westore, considered or the memorize, capacity a of set simplethe of network architectures input–output goal to mappings. While of memorybut is most certainly rather important, organisms toappropriate is responses generalize not to from simply novelbeen inputs to past formalized the memorize experience in organismHere, past has in a the responses never statistical order seen to mechanics before.a to past framework ‘teacher’ This learn perceptron. inputs, idea for rules has The the observable that perceptron of can in interest then [142, yield becomes143]. the generalization error 7.2. Learning and generalization it is possiblethe to internal analytically state computenumerically of the exact all time neurons ofcomputing computations in the products of first the ofinitial neuron the network. Jacobians condition. to This They entire associated found spike alloweda classes with spectrum next, the of finite every given authors networks of fraction that future tothe exhibited of Lyapunov Lyapunov spike extensive perform spectrum all chaos, starting exponents is in Lyapunovpositive highly from which exponents by feedback sensitive an to effects were the associated positive.heavily details with to Moreover, of the the they the divergence rise actionfound showed of of ‘flux’ potential that microstate tubes the shape, trajectories. of action as stability Evenstate potential surrounding more trajectories. contribute decayed interestingly, Small most the perturbations quickly,between authors to whereas the trajectories. network larger Thus,the each perturbations trajectory radius led is of tocalculation surrounded this of by an tube Lyapunov a exponents exponential shrinkslimit stability in with divergence tube. is spiking the networks However, extremely inLyapunov number subtle, the exponents of thermodynamic due requires neurons, ( before to the the the taking thermodynamic non-commutation ofthermodynamic limit of a limit would limits. would small yield yieldinto The perturbation negative positive the computation limit, exponents. exponents network of In which butdivergence constitutes any if if in case, a taken taken injecting trajectories.suggesting large after extra that This the spikes perturbation the picture injection evendifferent of is spiking at extra consistent spikes trajectory, finite into withoutactivity with a [139]. changing cortical recent More the network generally, experimental leads overall for141]. to population reviews results a on statistics completely network of dynamics neural in neuroscience, see [140, J. Stat. Mech. (2013) P03014 Statistical mechanics of complex neural systems and high dimensional data This has been done for example in [158, 96] for PCA applied to signals confined to a Another generalization of the perceptron is the tempotron, an architecture and doi:10.1088/1742-5468/2013/03/P03014 53 low dimensional linear space, butdrawn corrupted from by a high covariance dimensional matrixbased noise. computation consisting The of of the data the typical are identity eigenvalue then plus spectrum of a empirical low covariance rank matrices part. for A replica Starting in the latterbased part analysis of of section3,from machine we data, turned focusing learning our on algorithmshigh illusions attention dimensional designed of to noise. structure to statistical In returnedsuch mechanics real extract by illusions, data such structured and algorithms analysis so when patterns problems,first applied understanding we step. to have these However, to illusions we protect present wouldalgorithms ourselves ideally in from when like pure to the noisequestion analyze is data in the an the performance contain important design of bothreliably and machine uncover analysis learning structured structure of patterns experiments buriedanalysis is and within of then, random noise? how learning Since, much noise.analyze algorithms, data in statistical A do the mechanics the problems we key statistical data in needrandom, mechanics which to plays but the based the itself quenched disorder has role is structure. of no longer quenched simply disorder, we must 7.3. Machine learning and data analysis learning rulespikes capable [154]. The oftime tempotron patterns, learning can and bespike to no trained can spikes to classify for be firetempotron spatiotemporal another a left was class, spike unspecified. patterns carriedweight while for A space out the one of statistical to precise class inthe incoming of timing any mechanics [155]. one-step input of given analysis Interestingly, replica spike the spikecomponent of the symmetry corresponds output time broken a space to classification a phase simplifieddirect of different problem shown analogy output binary solutions can schematically spike to time in in the beeach for replica component figure1(C). synaptic the well symmetry corresponds positive Each described broken classifications, tocomponents in phase by a are of different both multilayered internalclassifications) small networks representation. and (implying in The far that which various apartclassification). very solution The (implying similar authors that verified weights that very theserealistic can different properties Hodgkin–Huxley yield weights persist even can model very in yield adouble of different more an dissociation a biologically identical between single structure neuronclassification) (synaptic even connectivity) [155]. and at Overall, function the thisimplications (implemented level reveals for of a the single striking neurons. interpretationfor reviews This of on double incoming applications dissociation of connectomics statisticalsee has data mechanics [12, important to [156].157]. memory, learning More and generalization, generally, network, only themapping, input while the andgenerically internal, output leads hidden layersdesired to input–output layer are mapping replica activities breaks constrained intocomponent symmetry remain multiple to corresponds disconnected unspecified. breaking, components, to implement wherecapable This each where a of a implementing feature different the the desired desired internal mapping.the Statistical space analysis representation mechanics of has of also of learning hadsupport hidden in success solutions other vector in layer architectures machines activities and to [150]–[152] machine and learning a Gaussian algorithms, processes including [153]. J. Stat. Mech. (2013) P03014 -values for cluster significance [167], using extreme p Statistical mechanics of complex neural systems and high dimensional data In section 5.3 we initiated a statistical mechanics based analysis of random Finally, we note that throughout most of this paper we have focused on situations In summary, decades of interactions between statistical physics, computer science and doi:10.1088/1742-5468/2013/03/P03014 54 dimensionality reduction bycorrelated connecting extreme value the theory. Forneglected, maximal the simple while incurred case for of hyperplanes, geometric pointgas the clouds, distortion interactions correlations correlations of arose to could eigenvalues from of be be fluctuations random interesting matrices, in to and the study could Coulomb the be more maximal treated complex distortion exactly. manifolds. were It For proven would tangent example, in planes by [104] rigorous a by scaffold upper surrounding of boundsremains points, arbitrary undistorted and on manifolds then under and showing any that their projection, ifapplication the then of geometry so the of this does JL scaffold lemma thedistortion to geometry incurred the of by scaffold the the then manifold.or manifold suffices An loose under to this a obtain upper random an boundby projection. upper is, more To bound it understand complex on would how manifold the beplanes, tight ensembles. useful one For to would example, compute be for thecorrelated interested typical manifolds matrices, in distortion corresponding consisting incurred to the of the fluctuations restriction unionsplane. of of Thus, of the the results same from maximal random the projection eigenvaluecould eigenvalue to of become spectra each multiple of relevant. random correlated matrices168]–[170] [ in which replicamachine symmetry learning holds, problems,are although including described we multilayer by have networks, replicainto noted symmetry tempotrons many broken that and clusters, phasesat clustering, several in as the which neuronal well the end and as solutionalgorithm, of space suboptimal, known section breaks as higher2.4, up surveythe energy propagation statistical proliferation metastable, [77 of mechanics78], metastable states. basedand states which As inference approaches whose can noted algorithms. have presence findneuroscience can inspired Despite good remain confound a the solutions relatively simpler unexplored. despite power new optimization of survey propagation,neuroscience its have applications lead to tocomputation beautiful and insights, how into our both brains how might neuronal create dynamics machine leads learning to algorithms to analyze data of this type revealedthe the presence amount of of asignal series data of eigenvalues phase associated and transitions with its as(i.e. the the figure6(A)) ambient low ratio associated rank between with dimensionality the partthresholds increases. high pop in dimensional As out noise. the Thus, of this amount thiswork a work of ratio has reveals Marchenko–Pastur data sharp been sea increases, required carriedoutcomes to out for resolve on other signal the structured statistical fromGaussian data clouds mechanics noise. settings,, [159 based Also,160], including analysis supervised interesting findingtransitions of learning in a from typical clustering direction clustered learning as input separating a examplesmodels two [161], function [163, phase of164]. cluster scale Moreover,interesting [162] statistical new and mechanics algorithms learning approaches and Gaussian[165, to mixture practical166], clustering results, based have on including yielded anstates, superparamagnetic isomorphism and clustering between cluster a assignments method andvalue for Potts theory model computing to ground computeall a regions null in distribution a for feature the space maximal of number a of given data size. points over J. Stat. Mech. (2013) P03014 u (A.1) (A.2) (A.3) (A.4) , and ab Q . ] ab ), the synaptic , the different J D NQ − b , we must average the x D · a thermal degrees of freedom x is a random matrix) or the [ ). In the above applications, . δ N D D , D ) , ab Y ab x is the measurement matrix). As a ( ++ can be carried out by introducing x ) NQ H D d D − , ab b a x n =1 x Q · ( Y a a . Therefore, it is useful to separate the x H ( Q is the connectivity matrix ) Z ab =1 ) ˆ n a Q Q ( Q D − P ( e − NE e NE ab − a function, − ˆ , where the components are coupled to each other Q x e δ x d d = e is the set of examples to be stored), the variables ab , in a Hamiltonian D n Z Q =1 D Y a d are independent. However, integrating out the quenched D Statistical mechanics of complex neural systems and high dimensional data EE a ) ] = Z ab x . More precisely, the following identity holds: D Y , ab b a in (A.1) into an integral over all possible overlaps x x ( ** Z · a NQ H a x = = x in a spin glass (then =1 − ) n a over the overlap matrix D D with a fixed set of overlaps configurations with a prescribed set of overlaps by introducing a b s P x ii ii a a /N E · − n n x dimensional vector x e a Z Z x N [ = (1 hh DD hh δ ab in a compressed sensing problem (then of a perceptron (then Q u w could be the spins -function, in the Stieltjes transformresiduals of an eigenvalue spectrumdiscussed (then above, to properlyreplicated average partition over function the quenched disorder weights Conditioned on anyreplicated degrees particular of freedom realization of the quenched disorder for some function The integral over the exponential representation of the disorder introduces interactionsapplications, the resulting among interactions depend the onlydefined on as the replicated overlap matrix between variables. replicas, In all of the above δ encoded in the through some quenched disorder x remaining integral over then all possible doi:10.1088/1742-5468/2013/03/P03014 55 Suppose we wish to perform statistical mechanics on a set of A.1. Overall framework Appendix. Replica theory We thank DARPA,Genentech the Foundation, Stanford Swartz Bio-X Foundation, NeuroventuresBrain and the and the Burroughs Stanford Computation Center for Wellcomediscussions for support. Foundation, about Mind, SG the replicas, thanks cavities Haim and Sompolinsky messages. for many interesting Acknowledgments themselves. We suspect that furtherexciting interactions and between insightful these intellectual fields adventures are for likely many to years provide to come. J. Stat. Mech. (2013) P03014 (A.5) (A.6) (A.8) (A.9) (A.7) scalar with a (A.12) (A.11) (A.10) in (A.6) n a x ab ˆ Q in (A.6) can be a x , )] ab ˆ Q ( 0 limit. Now, in the case can be performed via the G + → ab ab ˆ ˆ Q n Q ab Q , ab and , yielding an integral over ) a P Q ( ab x − , ) ) )] Q n Q NS Q ( ( E S [ represents a compromise between energy ,...,x − 1 N ) x ] = e − ( Q Q ( e ab eff E [ H ab N − ˆ Q NQ e − d e a − x ab ab b d x Q Q . · d d a b Statistical mechanics of complex neural systems and high dimensional data Y a x x ab ab Q [ Y Y ab Z δ .(A.5) reduces to th power. This final result can be written as is understood to be along the imaginary axis. Inserting (A.4) , ˆ Q n Q ln Z Z , i N ab a ab Y b x ˆ − ab = = a x Q Tr log a 2 1 x ab ∂E D D x d limit, the final integrals over X ∂Q h ) = ii ii ab = ), n n n =1 ) = N = = in (A.7). Y a ˆ Q Z Z Q N eff ab ab ( ( eff ˆ Z hh hh G H Q S Q H raised to the denotes an average with respect to a Gibbs distribution with effective and a are real valued-variables (as opposed to binary variables in the SK model), n n x i a i · h x In general, both these equations must be solved in the doi:10.1088/1742-5468/2013/03/P03014 56 and entropy extremization in the exponent of (A.12). and the saddle point overlap configuration represents (up to an additive constant) the entropy of replicated configurations can be performed. Together, thisinvolving yields an entropic factor (up to a multiplicative constant prescribed overlap matrix Hamiltonian these equations can be further simplified because the integral over where the integral over into (A.3) decouples thevariables components of the vectors where is the partition function of an effective Hamiltonian where where performed exactly, since it is Gaussian, and, furthermore, the extremum over where Now, in the large saddle point method,extremizing yielding the a exponent set in of (A.5), self-consistent equations for the saddle point by J. Stat. Mech. (2013) P03014 D is, q that (A.18) (A.15) (A.17) (A.13) (A.16) (A.14) ab 0 at the Q → . . Using this, n , D ) 2 D , − 2 ++ n and the disorder using a sequence x ( Z 0 ab H 2 D − x → Q ) n · D , 1 1 x x both drawn from a Gibbs ( 1 H N 2 = lim − x 2 2 additional replicas added to e − − − q Z 2 n and x δ 1 · ) . The sum over these saddle points x 1 D , x a 2, and then take the limit a, b x 1 , ( N ) H ) over the disorder is difficult because ab − q n > =1 ( Q q n a D P − P δ − q , 2 e ( a ab δ x ). For a given realization of the disorder, the overlap x b d ) of two states Q d q 6= 1 D 2 0 limit for various problems, in the replica symmetric ( a X 2 . , x β n P =1 d x 2 ab → Y a ( 1) Statistical mechanics of complex neural systems and high dimensional data − D Q n H Z ii − 2 = 1 Z . Averaging ) 2 ) n ) q ab ( ( D 2 β , ˆ 1 n D Q ** D x ( ( 0 0 P Z H → → − − n hh n e We will now applyA.9) (A.6)–( to the SK model from section2. As = lim x ) = ) = q d ) = ) = lim ( Q q q are the original degrees of freedom with . ( ( ( R D q 2 P E P P x ) = is the saddle point replica overlap matrix. In situations where replica symmetry . One can then average the right-hand side of (A.15) over 2 D and ( − ab n 1 Z Q x Z doi:10.1088/1742-5468/2013/03/P03014 57 distribution is A.3.1. SK model. averaged distribution of overlaps distribution with Hamiltonian Here, approximation. We use the Einsteinmeant to summation convention be in summed which over. repeated indices are we saw in equation (10), we have Then, equation (A.8) gives Here we show how to take the A.3. Replica symmetric equations Here, we make the connection between the replica overlap matrix A.2. Physical meaning of overlaps where appears both inthis, the numerator one and can theone introduce can denominator replicas perform of via the equationend. the easier (A.13). Thus, average simple at To identity integer circumvent yield of steps very similar to section .A.1 The final answer yields where yields the sum inaccording (A.16). to replica In theory, summary, equaltake the to the probability the value that fraction two of states off-diagonal matrix have elements overlap is broken, there will be multipleof equivalent saddle the points related permutation to group each other on by the the action replica indices J. Stat. Mech. (2013) P03014 (A.21) (A.20) (A.19) (A.22) (A.23) . z ii z )] = 1. We will determine ii qz ) aa . √ .   qz z z Q β   z = 1, ; (A.24) √ 2

ii β σ ) ! q n a a , s s )] qz − a 2 z P √ qz a tanh( n   β X i 2 qz ln[2 cosh(

√ z ( 2 q √ 2

) ) β β a hh q The starting points for the learning s + e qz a q − hh + n sech √ P β 2 ) √ =0 ( ) 2 n β n 2 q ) (

| q q } + 2 q J [2 cosh( a n − s − − . − ) X { ii q hh ) − q n 2 ab   q − (1 (1 tanh Z , (1 Q (1 2 − ln ln − h z by minimizing this. We will need the identity 2 2   hh 1 2) − − ii 2 q / (1 2 ∂ 2 2 ) β 2 2 ∂n 2 2 2 β β , n n β βJ 2 2 2 z ( ) ) β β β ( ab Statistical mechanics of complex neural systems and high dimensional data − 1 e q q 0 N δ } f ) = = = a = = = − − − − q s J X { hh ab ab eff = = − (1 (1   ˆ ˆ = 2 2 Q Q J H 2 2 ln β β z ab ab ii − − − Q Q + (1 N βF ) q z = = − − ( N ) ) ) = βF = Q Q Q z f ( ( ab ( ∂ ∂q E Q G hh E We make the following replica symmetric ansatz for the saddle point: We can now evaluate (A.6) using the identity9) ( with by minimizing the free energy. This leads to where we have used the factq that equation (11) guarantees that doi:10.1088/1742-5468/2013/03/P03014 58 applications discussed here are the energy (60) and entropy (A.11) in the replicated A.3.2. Perceptron and unsupervised learning. As mentioned above, we determine which can be derived by integration by parts. We find therefore, the minimum satisfies (14). This gives us the free energy density J. Stat. Mech. (2013) P03014 z 0, ii · → hh of the (A.30) (A.26) (A.27) (A.28) (A.31) (A.32) (A.25) (A.29) n µ , b ˆ ) λ a λ ab ( . We can thus Q 2 a ˆ ) βV λ a a derived from an ˆ 2) λ / P a − (1 w 2 − ) P a a q ˆ ˆ λ λ a a √ λ . Then, the only coupling ( i P q e 1 2 q has one eigenvalue equal to √ a + ˆ λ ) ) 2)( a d Q ) π / ab λ λ a ( a δ 2 ( (1 λ ) ( λ − q βV βV 2 d ) − a βV ) ) a − a ˆ λ q λ P ( . Finally, inserting (A.28) and ( A.32) n ( =1 − P − Y a a via a saddle point yields a saddle point q 1 b − βV / λ b ) in (62). P q 2 ˆ − ) 1 . − ) λ q = (1 Z 2 q − ab i ( ˆ ab λ qz ) − ) Q ab Q F = q √ q a a λ ˆ − − b Q λ 2)(1 2 λ − λ 2 / / 1 / 1 1 (1 2)(1 − ab − 2)(( / − − Q / e a a a (1 Q ˆ ˆ λ λ (1 . This transformation yields (A.27), and, as λ − Q a a ) z − 2) + ln(1 λ λ e i i / qz 1 with an optimal weight vector e e q ) (1 √ det Statistical mechanics of complex neural systems and high dimensional data q a a − − Tr log q ˆ ˆ µ − √ e λ λ λ 2 1 ξ ( is a zero mean, unit variance Gaussian variable and − d d 1 π π ˆ λ 0 π a i Q λ a a h 0 limit. For the energy, we obtain 2 2 z e 2 → λ (1 d λ λ , 2 n n 1 d z π ˆ λ d √ d z → det 2 d ii π = ii ’s in (A.26) occurs through the term n √ 2 n λ n n n p ) = lim 1 eigenvalues equal to 1 =1 =1 =1 a ζ Y Y Y a a a ζ d λ π Q a − ln , where ( 2 hh λ Z z Z Z Z S Z n d hh √ ln 0

ln ln ln variables at the expense of introducing a Gaussian integral via the identity = = → α bz corresponding to extremizing lim α nα α α n Z i a ζ and e λ q − − − − − q

= = = 1) ) = = − Q 2 ( b n 2) E Now for the entropy, we obtain In going from (A.25) to (A.26) we used the identity / 0 (1 → − lim n between the various and inserted the replica symmetric ansatz decouple the e Suppose we wish to compute the probability distributiondoi:10.1088/1742-5468/2013/03/P03014 across examples 59 where denotes an average with respect to Here, we have1 used + the ( factinto that (A.12) the and performing replicaequation the symmetric for integration over A.4. Distribution of alignments alignment ofunsupervised each learning example problem. Alternatively, one can think of this as the distribution partition function equationA.1. (A.12). Here, These we can take the be derived by following sections 3.3 and (A.28). is the partition function of a distribution whose interpretation will be given in section A.4. J. Stat. Mech. (2013) P03014 . 1 we − n P Z (A.33) (A.34) (A.35) (A.37) (A.36) 0 → n and would hit N and take the + i µ a = lim z λ = /Z 0 , , z z ++ ++ ) µ a ) λ λ ( ( V µ βV , − , P ) ] q δ =1 − n a 1 , − 0 limit. Now we can introduce an . ) / , P ) . This average is hard to perform µ extremizes the free energy (62). 2 0] β ) λ µ ( q → − ∞ ξ qz π, V √ µ −∞ n δ, ) e − ( [ − [ 1 1 P λ λ β ∈ ∈ ∈ − 2)(( − / ) e = 0 in (71), as the pole at λ in the 1 (1 ( − λ δ e a /Z − ) , θ , w q θ λ ) i d ( µ e ) − δ ), perform the Gaussian average over λ µ δ n =1 1 1 0. λ 1 Y a ( (1 w is drawn from the distribution (53). For large λ Statistical mechanics of complex neural systems and high dimensional data − + V d π → µ λ x, x x, x z 2 − Z w ( δ P Z δ λ p = = = β ( 0 0 − 0 1 1 . Here, the first replica plays the role of the numerator in (A.34) ζ δ Z ** =1 P and z z e z µ µ 0 X µ ξ w → · ξ 1 P ** n · d a play the role of 1 w w ) ) Z ) = ) = ) = ) = lim ): N N δ λ λ λ λ = ( ( ( ( ( √ √ , . . . , n P C P P Z P / / denotes an average over the examples = (1 = (1 is the partition function given by (63) and ii µ a µ · ζ λ λ 0 limit using a sequence of steps very similar to those in sections A.1 and A.3.2. This hh → the contour. However, Cauchy’s theoremchanging tells the us integral, that provided wefollowing can that contour, deform which it the takes does contour a without not semicircle cross detour below any the singularities. singularity: We will use the doi:10.1088/1742-5468/2013/03/P03014 60 It will help to take the limit It is helpfulalong to the real think axis. of We cannot (70) simply set as a complex contour integral, with the contour running where and replicas 2 n yields A.5. Inverting the Stieltjies transform of the data projected onto the optimal dimension. This distribution is expect this distribution toit be will self-averaging, be so close for to any fixed realization of the examples, where and because the examples occurcan in be both circumvented the by numerator introducing and replicas the via denominator. the This simple difficulty identity 1 integral representation of where Thus, where J. Stat. Mech. (2013) P03014 , vol Phys. 117 Oxford ( 5 (San Mateo, . , 2006 ) (Cambridge, 1792 . z ( # 35 123 Phys. Rev. Lett. ) W z x ρ ( 34 − Neural Netw. = W 55 x , 1988 ρ 1 θ (Norwalk, CT: Appleton i x (Singapore: World Scientific) e d , 1992 (Oxford: Oxford University δ ∞ δ + Z Phys. Rev. Lett. z (Cambridge, MA: Cambridge + Rev. Mod. Phys. W θ i ρ , 1975 i e δ θ 051907 θ Spikes: Exploring the Neural Code d i , 1962 π Artif. Neural Netw. 0 + e 75 − δ z Z E ) ) 0 z z Efficient supervised learning in networks with binary x z , 1991 ( ( W − Im − ρ Principles of Neural Science 0 W W 1 π x z ρ ρ vol 1 (Cambridge: Cambridge University Press) θ The Oxford Handbook of Random Matrix Theory Chaos in random neural networks i 0. However, their sum is finite. It is referred to + 0 Spin Glass Theory and Beyond x 0 e z Phys. Rev. d δ → δ d ) → i δ δ − ( δ −∞ C θ Z Z , 2007 d ) = lim vol 142 (New York: Academic) π Statistical Mechanics of Learning 0 (Suppl 2) S13 Information, Physics, and Computation − Statistical mechanics of complex neural systems and high dimensional data 32 Solvable model of a spin-glass Function constrains network architecture and dynamics: a case study on Spin Glasses 8 Im Im Z Probabilistic Graphical Models: Principles and Techniques + i Theoretical Neuroscience. Computational and Mathematical Modelling of 20 Learning by message passing in networks of discrete synapses 1 1 π π + z ) (Oxford: Oxford University Press) ( + + Training a 3-node neural network is np-complete 0 0 W 386 → → δ δ R 65 = lim Im ) = lim Biometrika Statistical Physics of Spin Glasses and Information Processing: an Introduction (Cambridge, MA: MIT Press) Random Matrices 1 BMC Neurosci. π 30201 + + i 0 The perceptron: a probabilistic model for information storage and organization in the brain Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference 96 , 1928 z → lim The perceptron: a model for brain functioning. i The generalised product moment distribution in samples from a normal multivariate ( , 2007 On the complexity of training perceptrons W Psychol. Rev. R 259 Im 111 (Oxford: Oxford University Press) MA: MIT Press) CA: Morgan Kaufmann) University Press) 1958 Rev. Lett. Handbooks in Mathematics 61 population synapses the yeast cell cycle boolean network (Cambridge, MA: MIT Press) and Lange) Neural Systems Press) 1 π We can write + 0 [7] Mezard M,[8] Parisi GFischer and K Virasoro H[9] M and A,Nishimori Hertz 1987 H, J 2001 A, 1993 [4] Lau K Y, Ganguli S and Tang C, [1] Kandel E R, Schwartz J H[2] and JessellDayan T P M, and 1991 Abbott L F, 2001 [5] Mezard M and Montanari A, 2009 [6] Sherrington D and Kirkpatrick S, [3] Reike F, Warland D, van Steveninck R and Bialek W, 1996 [10] Koller D and Friedman N, 2009 [11] Pearl J, 1988 [12] Engel A and den Broeck C[13] V, 2001 Rosenblatt F, [16] Amaldi E, [17] Braunstein A and Zecchina R, [21] Sompolinsky H, Crisanti A and Sommers[22] H J, Wishart J, [14] Block H D, [18] Baldassi C, Braunstein A, Brunel N[19] and ZecchinaMehta R, M L,[20] 2004 Akemann G, Baik J and Di Francesco P, 2011 [15] Blum A L and Rivest R L, → lim doi:10.1088/1742-5468/2013/03/P03014 61 as the Cauchy principalinterested value in of the the imaginary integral. part. It This also leaves happens the to second be term, real, and we are only References The first and third terms diverge as J. Stat. Mech. (2013) P03014 , 170 A 66 21 25 1812 , 1967 , 2009 , 2002 , 2008 22 Single-pixel Neuron IS&T/SPIE , 2010 IEEE Signal Process. Proc. Nat. Acad. Sci. Advances in light , 2006 289 Commun. Math. Phys. 196 , 2008 , 2008 4296 Neural Comput. 83 , 1994 IEEE Sig. Proc. Mag. 97 25 , 2010 , 2008 10510 485 Sparse coding via thresholding and local 435 18 35 2526 32 34 Compressive sensing DNA microarrays 20 J. Neurosci. Methods J. Neurophysiol. ICASSP: IEEE Int. Conf. Acoustics Speech and Compressed sensing mri Adaptive compressed sensing—a new class of 51 , 2011 , 2007 , 2010 60601 Object category structure in response patterns of neuronal 4317 Opt. Express Lensless wide-field fluorescent imaging on a chip using Statistical mechanics of sparse generalization and graphical 162824 From sparse solutions of systems of equations to sparse 40 102 IEEE Signal Process. Mag. 1182 326 Large deviations of the maximum eigenvalue in Wishart random , 2010 SIAM Rev. Memory traces in dynamical systems 58 P10009 Semantic Cognition: A Parallel Distributed Processing Approach 2009 Neural Comput. Sparse mri: the application of compressed sensing for rapid mr Annu. Rev. Neurosci. 33 Ann. Rev. Neurosci. 607 , 2008 Large deviations of the maximum eigenvalue for Wishart and Gaussian Statistical mechanics of complex neural systems and high dimensional data Distribution of eigenvalues for some sets of random matrices , 2009 507 Fluorescence applications in molecular neurobiology , 2008 Short-term memory in neuronal networks through dynamical compressed Compressed sensing, sparsity, and dimensionality in neuronal information 381 , 2012 , 2009 114 Reconstruction of sparse circuits using multi-neuronal excitation (rescume) Level-spacing distributions and the Airy kernel Distribution functions for largest eigenvalues and their applications Phys. Rev. Lett. An introduction to compressive sampling J. Stat. Mech. Nature 6065 Continuous update with random encoding (cure): a new strategy for dynamic 4 (Piscataway, NJ: IEEE) pp 5494–7 , 2009 J. Phys. A: Math. Theor. , 2009 , 1996 Magn. Reson. Med. Magnet. Reson. Med. Neural Information Processing Systems (NIPS) et al, Emergence of simple-cell receptive field properties by learning a sparse code for Reconstruction of complete connectivity matrix for connectomics by sampling neural Role of homeostasis in learning sparse representations 1126 72 , 2007 60 , 1995 , 2007 , 2010 25 Advances in Neural Information Processing Systems 22 18970 151 model selection self-organizing coding models forSignal neuroscience Processing Matching categorical object representations inNeuron inferior temporal cortex of man and monkey 105 sensing competition in neural circuits natural images (Cambridge, MA: MIT Press) population in monkey inferior temporal cortex new compressive imaging cameraComput. architecture using Imaging optical-domain compression imaging via compressive sampling compressive decoding of sparse objects microscopy for neuroscience 2009 connectivity with fluorescent synaptic markers EURASIP J. Bioinform. Syst. Biol. imaging processing and data analysis Mag. imaging Matematicheskii Sbornik 159 0034 arXiv:math-ph/021 matrices modeling of signals and images random matrices [50] Lage-Castellanos A, Pagnani A and Weigt[51] M, Coulter W K, Hillar C J, Isley G and Sommer F T, [44] Kriegeskorte N, Mur M, Ruff D A, Kiani R, Bodurka[45] J, EstekyGanguli H, S, Tanaka Huh K D and and Bandettini Sompolinsky P[46] H, A, Ganguli S and Sompolinsky H, [47] Rozell C J, Johnson D H,[48] Baraniuk ROlshausen G B and A Olshausen B A, [49] Perrinet L U, [43] Kiani R, Esteky H, Mirpour K and Tanaka K, [41] Duarte M F, Davenport M A, Takhar D, Laska J N, Sun T, Kelly K F and Baraniuk R G, [40] Takhar D, Laska J N, Wakin M, Duarte M F, Baron D, Sarvotham S, Kelly K F and[42] BaraniukRogers R T G, T and McClelland J L, 2004 [38] Taraska J W[39] and ZagottaCoskun W A N, F, Sencan I, Su T W and Ozcan A, [37] Wilt B A, Burns L D, Ho E T W, Ghosh K K, Mukamel E A and Schnitzer M J, [36] Mishchenko Y, [35] Hu T and Chklovskii D B, [33] Parrish T and Hu X, [30] Ganguli S and Sompolinsky H, [31] Lustig M, Donoho D L, Santos J M and Pauly J M, [34] Dai W, Sheikh M A, Milenkovic O and Baraniuk R G, [32] Lustig M, Donoho D and Pauly J M, [23] Marchenko V A and Pastur L[24] A, Tracy C A and Widom H, [25] Tracy C A and Widom H, [26] Vivo P, Majumdar S N and Bohigas O, [28] Bruckstein A M, Donoho D L[29] and EladCandes M, E and Wakin M, [27] Majumdar S N and Vergassola M, doi:10.1088/1742-5468/2013/03/P03014 62 J. Stat. Mech. (2013) P03014 , , , 4496 , A Proc. 765 , 2008 , 1982 47 257 Phys. 58 E 4384 21 , 2009 57 , 2005 17 B , 2000 58 Phys. Rev. 10318 L997 8254 217 J. Phys. A: Math. 20 Phys. Rev. 26 20 104 , 1985 B , 1999 Rev. Mod. Phys. Phys. Rev. , 1993 Gibbs states and the set of 11539 32 Phys. Rev. Lett. , 1986 J. Neurosci. 745 , 1978 J. Phys. A: Math. Gen. 43 Eur. Phys. J. 26 , 1987 , 2006 2282 , 2009 arXiv:0907.3381 , 1988 10 J. Neurosci. 51 Proc. Nat. Acad. Sci. , 2001 J. Phys. A: Math. Gen. Neuron 437 1007 Optimal information storage and the distribution 202 , 2012 , 2007 440 , 1999 , 2004 Weak pairwise correlations imply strongly correlated Math. Biosci. (New York: Wiley) Nature Storing infinite numbers of patterns in a spinStatistical mechanics glass of model neural of networks near saturation Spin-glass models of neural networks 1530 Front. Neurosci. Conf. Abs: COSYNE Ultrametricity for physicists J. Physiol. Survey propagation: an algorithm for satisfiability 983 Message-passing algorithms for compressed sensing Early sensory processing as predictive coding: subtracting sparse Advances in Neural Information Processing Systems Constructing free-energy approximations and generalized belief , 1971 55 , 2006 11 552 Stability of the Sherrington–Kirkpatrick solution of a spin glass Spatial information outflow from the hippocampal circuit: distributed Deciphering subsampled data: adaptive compressive sampling as a , 2011 Analytic and algorithmic solution of random satisfiability problems , 1969 , 2010 2181 Statistical mechanics of complex neural systems and high dimensional data 201 Infinite-ranged models of spin-glasses Graphical models, exponential families, and variational inference 151 1 22 IEEE Trans. Inform. Theory 1 Thouless–Anderson–Palmer equations for neural networks A Sparse incomplete representations: a novel role for olfactory granule cells 27 9668 Ramsey theory reveals the conditions when sparse coding on subsampled data 2554 Chaotic nature of the spin-glass phase Pure states in spin glasses Optimal storage properties of neural network models 22 , 2005 The bethe lattice spin glass revisited Phys. Rev. Lett Properties of unitary granule cell Purkinje cell synapses in adult rat cerebellar 79 30 18914 812 173 124 106 , 1985 Proc. R. Soc. The Organization of Behavior 72 297 J. Phys. A: Math. Gen. J. Neurosci. Disorder chaos and multiple valleys in spin glasses Neural networks and physical systems with emergent collective computational abilities Cavity-approach analysis of the neural-network learning problem 1839 The space of interactions in neural network models The space of interactions in neural networks: Gardner’s computation with the cavity method , 2011g/abs/1106.3616 arXiv:or 271 A theory of cerebellar function 61 A theory of cerebellar cortex , 2001 21 Ann. Phys. Science Neuron J. Phys. A: Math. Gen. E , 2002 1007 slices Found. Trends Mach. Learn. Gen. of synaptic weights: perceptron versus Purkinje cell 1999 1987 Rev. propagation algorithms 2002 neural networks Proc. Nat. Acad. Sci. Random Struct. Algorithms Nat. Acad. Sci. approximations by circuit dynamics 2011 32 The structure of multi-neuron firing patterns in primatesolutions retina of random constraint satisfaction problems model spatial coding and phase precession in the subiculum network states in a neural population is unique principle of brain communication [75] Wainwright M J and Jordan M I, [76] MézardM and Parisi[77] G, MézardM, Parisi G and Zecchina R, [81] Brunel N, Hakim V, Isope P,[82] Nadal JMarr P D, and[83] BarbourAlbus B, J S, [84] Isope P and Barbour B, [86] Griniasty M, [85] MézardM, [72] Shamir M and Sompolinsky H, [74] Yedidia J S, Freeman W T and Weiss Y, [78] Braunstein A, MézardM and Zecchina R, [80] Gardner E and Derrida B, [73] Bethe H A, 1935 [71] Amit D J, Gutfreund H and Sompolinsky H, [67] Chatterjee S, [68] Hopfield J J, [69] Hebb D O,[70] 1949 Amit D J, Gutfreund H and Sompolinsky H, [79] Gardner E, [66] Bray A J and Moore M A, [55] Donoho D L, Maleki A and Montanari A, [65] Huse D A and Fisher D S, [57] Koulakov A A and Rinberg D, [59] Schneidman E, Berry M J, Segev[60] R andShlens Bialek J, W, Field G D, Gauthier[61] JlaF, L, Krzaka Grivich Montanari M A, I, Ricci-Tersenghi Petrusca F, D, Semerjian Sher[62] G A, andKirkpatrick Litke ZdeborováL, S A and[63] M Sherrington andDe D, Chichilnisky Almeida E J J, R L and[64] Thouless DRammal J, R, Toulouse G and Virasoro M A, [56] Hu T, Druckmann S and Chklovskii D B, [58] Amit D J, Gutfreund H and Sompolinsky H, [53] Hillar C J and Sommer F T, [52] Isely G, Hillar C J and Sommer F T, [54] Kim S M, Ganguli S and Frank L M, doi:10.1088/1742-5468/2013/03/P03014 63 J. Stat. Mech. (2013) P03014 , , , Random , 2006 325 5577 5577 Phys. Contemp. IEEE 67 23 23 J. Phys. Phys. Rev. , 2003 SPIE Electronic 4203 , 1999 , 1984 , 2009 Europhys. Lett. J. Phys. A: Math. 51 , 1989 , 2006 The smashed filter for Ann. Math. , 2007 , 1989 Found. Comput. Math. (Piscataway, NJ: IEEE) vol 3 (Piscataway, NJ: IEEE) 233 , 1958 Subspace, Latent Structure and , 2009 14 J. Phys. A: Math. Gen. J. Phys. A: Math. Gen. (ICIP 2007): IEEE Int. Conf. on , 2006 ACM, pp 604–613 neural network 959 1 , 1990 , 1999 Sparse signal detection from incoherent , 2007 ± 9679 All-or-none potentiation at ca3–ca1 synapses IEEE Trans. Inf. Theory 98 = vol 3940) (Berlin: Springer) pp 52–68 j Proc. Computational Imaging V 102 Spectrum of large random asymmetric matrices Acta Numerica 026124 , 2005 A simple proof of the restricted isometry property for Compressive sampling for signal classification Graded bidirectional synaptic plasticity is composed of 69 , 2007 253 , 2005 E Proc. IEEE 28 Low-dimensional models for dimensionality reduction and , 2010 744 Stable manifold embeddings with operators satisfying the restricted Compressed and privacy-sensitive sparse regression Discrete synaptic states define a major mechanism of synapse Phys. Rev. Extensions of Lipschitz mappings into a Hilbert space 27 Statistical mechanics of complex neural systems and high dimensional data Analysing cluster formation by replica method Proc. Nat. Acad. Sci. 4732 Random projections of smooth manifolds Distributions of singular values for some random matrices , 2004 95 1895 Principal-component-analysis eigenvalue spectra from data with , 2005 Storage capacity of memory networks with binary couplings Asymptotic corrections to the wigner semicircular eigenvalue spectrum of a Asymptotic corrections to the wigner semicircular eigenvalue spectrum of a An elementary proof of a theorem of Johnson and Lindenstrauss Eigenvalue spectra of random matrices for neural networks Random matrix theory CISS: 45th Annual Conf. on Information Sciences and Systems, 2011 Critical storage capacity of the Construct. Approx. Approximate nearest neighbors: towards removing the curse of dimensionality 60 846 60 Decoding by linear programming 22 55 Lecture Notes in Computer Science vol 6 (Piscataway, NJ: IEEE) pp VI–161 ( , 2011 ICASSP Proc. Acoustics, Speech and Signal Processing , 2008 Trends Neurosci. 3389 On the distribution of the roots of certain symmetric matrices , 2006 1 188104 , 2004 L519 60 Random projection, margins, kernels, and feature-selection 26 97 22 Phys. Rev. Lett. Proc. 13th Annual ACM Symp. on Theory of Computing Proc. Nat. Acad. Sci. E 381 3057 51 Trans. Inf. Theory projections p III compressive classification and targetImaging recognition Multiscale random projections forImage compressive classification Processing (ACSSC’06): 40th Asilomar Conf.pp on 1430–4 Signals, Systems and Computers, 2006 Feature Selection random matrices (Piscataway, NJ: IEEE) pp 1–6 Math. 1998 Struct. Algorithms 9 signal recovery: a geometric perspective isometry property large real symmetric random matrix using the replica method 30 Gen. 50 plasticity Lett. large real symmetric random matrix using the replica method switch-like unitary events 1998 1988 Rev. symmetry-breaking structure [99] Dhesi G S and Jones R C, [87] Lootens E and van den Broeck[88] C, Petersen C C H, Malenka R C, Nicoll R A and Hopfield J J, [90] Krauth W and Opper M, [91] Krauth W and MézardM, [92] Montgomery J M and Madison D[93] V, Dhesi G S and Jones R[94] C, Sommers H J, Crisanti A, Sompolinsky H and Stein Y, [96] Sengupta A M and Mitra P P, [98] Wigner E P, [95] Rajan K and Abbott L F, [97] Hoyle D C and Rattray M, [89] O’Connor D H, Wittenberg G M and Wang S S H, [109] Zhou S, Lafferty J and Wasserman[110] L, Duarte M F, Davenport M A, Wakin M B and[111] Baraniuk RDavenport G, M, Duarte M, Wakin M, Laska J, Takhar D, Kelly[112] KDuarte and M Baraniuk F, R, Davenport M A, Wakin M B, Laska J N, Takhar D, Kelly K F and Baraniuk R G, [114] Blum A, [113] Haupt J, Castro R, Nowak R, Fudge G and Yeh A, [108] Candes E and Tao T, [107] Baraniuk R, Davenport M, DeVore R and Wakin M, [101] Johnson W B and Lindenstrauss J, [102] Indyk P and Motwani R, [103] Dasgupta S and Gupta A, [106] Yap H L, Wakin M B and Rozell C J, [105] Baraniuk R G, Cevher V and Wakin M B, [104] Baraniuk R G and Wakin M B, [100] Edelman A and Rao N R, doi:10.1088/1742-5468/2013/03/P03014 64 J. Stat. Mech. (2013) P03014 , 423 J. , 8 Phys. Proc. Phys. , 1978 A Phys. 357 Phys. , 2000 Phys. Rev. , 2012 , 2005 , 1992 28 , 2007 Neural , 1990 Neural , 1992 vol 3146) (Berlin: , 1990 , 2004 Phys. Rev. , 1998 , 2007 Phys. Rev. Lett. , 1973 , 2010 (Berlin: Springer) Annu. Rev. Neurosci. 489 Compressed Sensing: Theory and 52 123 Dynamical principles in neuroscience , 2005 , 2010 466 Sensitivity to perturbations in vivo implies 4922 (Boca Raton, FL: CRC Press) 36 Nature Lecture Notes in Computer Science A ( 2197 9446 4865 Suppressing chaos in neural networks by noise , 2010 L09003 37 A typical reconstruction limit for compressed sensing based on 100 102 Stimulus-dependent suppression of chaos in recurrent neural Statistical mechanics of learning from examples Learning from examples in large neural networks Random projections for manifold learning IEEE Trans. Inf. Theory A Mean-Field Theory of Irregularly Spiking Neuronal Populations Asymptotic analysis of map estimation via the replica method , 200906.3234 arXiv:09 Phys. Rev. An exactly solvable asymmetric neural network model Statistical dynamics of classical systems Neural network dynamics Chaos in neuronal networks with balanced excitatory and inhibitory Mean field methods for cortical network dynamics , 2006 011903 Robust uncertainty principles: exact signal reconstruction from highly Statistical mechanics of complex neural systems and high dimensional data Statistical mechanics of a multilayered neural network , 1987 Chaotic balanced state in a model of cortical circuits Dynamics of spin systems with randomly asymmetricDynamics bonds: Langevin of spin systems with randomly asymmetric bonds: Ising spins 82 J. Stat. Mech. Statistical mechanics of compressed sensing Phys. Rev. 268104 E 1724 1213 Neighborliness of randomly projected simplices in high dimensions Sparse nonnegative solution of underdetermined linear equations by linear Dynamic flux tubes form reservoirs of stability in neuronal circuits Dynamical entropy production in spiking neuron networks in the balanced state Optimally sparse representation in general (non-orthogonal) dictionaries via l1 78 , 2009 105 274 183 , 1988 9452 Proc. Nat. Acad. Sci. Proc. Nat. Acad. Sci. 8 167 4913 4 102 Phys. Rev. Dynamics as a substitute for replicas in systems with quenched random impurities 3717 1683 Science 18 (Cambridge: Cambridge University Press) pp 394–438 , 2003 1321 , 2005 Graphical models concepts in compressed sensing An Introduction to Statistical Modeling of Extreme Values B 6056 69 65 041007 10 2312 , 2010 Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons 2 45 , 1996 65 Rev. Mod. Phys. Phys. Rev. Lett. X A 188701 2006 Rev. Lett. networks Computational Neuroscience: Cortical Dynamics Springer) p 71 2010 Rev. Rev. Lett. high noise and suggests rate coding in cortex Europhys. Lett. activity Comput. Comput. Neurosci. and Working Memory in Recurrent Cortical Networks Rev. Lett. Phys. Rev. dynamics and a spherical model and glauber dynamics 104 Nat. Acad. Sci. programming Applications l p-norm minimization and applications to compressed sensing Information Processing Systems minimization incomplete frequency information [140] Vogels T P,[141] RajanRabinovich K M and I, Abbott Varona L P, F, Selverston A I and Abarbanel H D I, [142] Seung H S, Sompolinsky H and Tishby N, [137] Monteforte M and Wolf F, [138] Monteforte M and Wolf F, [143] Sompolinsky H, Tishby N and Seung H S, [133] Rajan K, Abbott L F and Sompolinsky H, [139] London M, Roth A, Beeren L, HäusserM and Latham P E, [134] Brunel N, [136] Hertz J, Lerchner A and Ahmadi M, [130] van Vreeswijk C and Sompolinsky H, [131] Vreeswijk C and Sompolinsky H, [135] Renart A, Brunel N and Wang X J, 2004 [132] Molgedey L, Schuchhardt J and Schuster H G, [144] Barkai E, Hansel D and Kanter I, [129] Derrida B, Gardner E and Zippelius A, [126] De Dominicis C, [127] Crisanti A and Sompolinsky H, [128] Crisanti A and Sompolinsky H, [125] Martin P C, Siggia E D and Rose H A, [124] Montanari A, [122] Donoho D L and Tanner J, [123] Donoho D L and Tanner J, [120] Rangan S, Fletcher A K and[121] Goyal VGanguli K, S and Sompolinsky H, [115] Hegde C, Wakin M B and[116] Baraniuk RColes G, S, 2001 [117] Donoho D L and Elad M, [119] Kabashima Y, Wadayama T and Tanaka T, [118] Candes E, Romberg J and Tao T, doi:10.1088/1742-5468/2013/03/P03014 65 J. Stat. Mech. (2013) P03014 1191 3251 3 J. , 2008 3767 , 2012 76 Phys. , 1994 Phys. Phys. , 2006 57 Phys. Rev. Phys. Rev. 953 5781 E , 2010 Europhys. , 1990 17 J. Phys. A: , 1999 26 , 2010 , 1993 Phys. Rev. Lett. , 1992 2432 Rev. Mod. Phys. 391 , 2007 J. Math. Phys. , 1999 75 25 Phys. Rev. Phys. Rev. Lett. , 2001 , 1993 , 1962 , 1998 , 1996 Ann. Appl. Probab. Europhys. Lett. Phys. Rev. Lett. J. Phys. A: Math. Gen. Storage capacity and learning algorithms , 2007 2113 Exact distribution of the maximal height of p 72 , 1994 , 1995 , 1999 7590 45 A 150601 Theory of spike timing-based neural classifiers Statistical mechanics of support vector networks Phys. Rev. Lett. Scaling laws in learning of classification tasks 101 Broken symmetries in multilayered perceptrons 220601 Phys. Rev. Superparamagnetic clustering of data Superparamagnetic clustering of data The statistical mechanics of learning a rule Supervised learning from clustered input examples 105 P11001 , 1994 Statistical mechanics and phase transitions in clustering Statistical mechanics of complex neural systems and high dimensional data Significance analysis and statistical mechanics: an application to , 1992 Ome sweet ome: what can the genome tell us about the connectome? Statistical mechanics of the maximum-likelihood density estimation Weight space structure and internal representations: a direct approach to The tempotron: a neuron that learns spike timing—based decisions 346 Universal learning curves of support vector machines Nonintersecting brownian excursions A statistical physics approach for the analysis of machine learning algorithms 18 Statistical mechanics of unsupervised structure recognition Phys. Rev. Lett. Replica theory for learning curves for Gaussian processes on random graphs 420 J. Stat. Mech. 9 1766 Phys. Rev. Lett. 1885 218102 , 2008 945 2975 50 P04009 27 , 2005 E 65 105 82 Learning a rule in a multilayer neural network , 2010 Statistical mechanics of learning orthogonal signals for general covariance models A brownian-motion model for the eigenvalues of a random matrix 117 3167 Learning and generalization in a two-layer neural network: the role of the An exactly solvable model of unsupervised learning 4146 30 70 499 4410 45 Lett. Rev. Lett. Lett. Phys. Rev. vicious walkers arXiv:1202.5918 Nature Neurosci. Rev. Lett. Curr. Opin. Neurobiol. 65 Stat. Mech. Math. Gen. clustering on real data Rev. Lett. 86 Vapnik–Chervonvenkis dimension learning and generalization in multilayer neural networks for two-layer neural networks A [162] Rose K, Gurewitz E and Fox G C, [164] Barkai N and Sompolinsky H, [165] Blatt M, Wiseman S and Domany E, [169] Schehr G, Majumdar S N, Comtet[170] A andTracy C Randon-Furling J, A and Widom H, [163] Barkai N, Seung H S and Sompolinsky H, [159] Biehl M and Mietzner A, [154] GütigR and Sompolinsky H, [155] Rubin R, Monasson R and Sompolinsky H, [158] Hoyle D C, [161] Marangi C, Biehl M and Solla S A, [166] Wiseman S,[167] BlattLukszaM, M and LässigM Domany and E, Berg J, [168] Dyson F J, [156] Lichtman J W and Sanes J R, [160] Biehl M, [153] Urry M and Sollich P, [157] Watkin T L H, Rau A and Biehl M, [150] Dietrich R, Opper M and Sompolinsky H, [152] Malzahn D and Opper M, [151] Opper M and Urbanczik R, [147] Opper M, [148] Monasson R and Zecchina R, [149] Schwarze H, [145] Barkai E, Hansel D and Sompolinsky[146] H, Engel A, KöhlerH M, Tschepke F, Vollmayr H and Zippelius A, doi:10.1088/1742-5468/2013/03/P03014 66