Faster solutions of the inverse pairwise Ising problem

Tamara Broderick (a), Miroslav Dudík (b), Gašper Tkačik (c), Robert E. Schapire (b) and William Bialek (c,d)

(a) Department of Mathematics, (b) Department of Computer Science, (c) Joseph Henry Laboratories of Physics, (c) Lewis–Sigler Institute for Integrative Genomics, and (d) Princeton Center for Theoretical Physics, Princeton, NJ 08544

(Dated: December 15, 2007)

Recent work has shown that probabilistic models based on pairwise interactions (in the simplest case, the Ising model) provide surprisingly accurate descriptions of experiments on real biological networks ranging from neurons to genes. Finding these models requires us to solve an inverse problem: given experimentally measured expectation values, what are the parameters of the underlying Hamiltonian? This problem sits at the intersection of statistical physics and machine learning, and we suggest that more efficient solutions are possible by merging ideas from the two fields. We use a combination of recent coordinate descent algorithms with an adaptation of the histogram Monte Carlo method, and implement these techniques to take advantage of the sparseness found in data on real neurons. The resulting algorithm learns the parameters of an Ising model describing a network of forty neurons within a few minutes. This opens the possibility of analyzing much larger data sets now emerging, and thus testing hypotheses about the collective behaviors of these networks.

arXiv:0712.2437v2 [q-bio.QM] 15 Dec 2007

I. INTRODUCTION

In the standard problems of statistical mechanics, we begin with the definition of the Hamiltonian and proceed to calculate the expectation values or correlation functions of various observable quantities. In the inverse problem, we are given the expectation values and try to infer the underlying Hamiltonian. The history of the inverse problems goes back (at least) to 1959, when Keller and Zumino [1] showed that, for a classical gas, the temperature dependence of the second virial coefficient determines the interaction potential between molecules uniquely, provided that this potential is monotonic. Subsequent work on classical gases and fluids considered the connection between pair correlation functions and interaction potentials in various approximations [2], and more rigorous constructions of Boltzmann distributions consistent with given spatial variations in density [3] or higher order correlation functions [4].

In fact the inverse problem of statistical mechanics arises in many different contexts, with several largely independent literatures. In computer science, there are a number of problems where we try to learn the probability distribution that describes the observed correlations among a large set of variables in terms of some (hopefully) simpler set of interactions. Many algorithms for solving these learning problems rest on simplifications or approximations that correspond quite closely to established approximation methods in statistical mechanics [5]. More explicitly, in the context of neural network models [6, 7], a family of models referred to as 'Boltzmann machines' lean directly on the mapping of probabilistic models into statistical physics problems, identifying the parameters of the probabilistic model with the coupling constants in an Ising–like Hamiltonian [8].

Inverse problems in statistical mechanics have received new attention because of attempts to construct explicit network models of biological systems. Physicists have long hoped that the collective behavior which emerges from statistical mechanics could provide a model for the emergence of function in biological systems, and this general class of ideas has been explored most fully for networks of neurons [6, 7]. Recent work shows how these ideas can be linked much more directly to experiment [9, 10] by searching for maximum entropy models that capture some limited set of measured correlations. At a practical level, implementing this program requires us to solve a class of inverse problems for Ising models with pairwise interactions among the spins, and this is the problem that we consider here.

To be concrete, we consider a network of neurons. Throughout the brain, neurons communicate by generating discrete, identical electrical pulses termed action potentials or spikes. If we look in a small window of time, each neuron either generates a spike or it does not, so that there is a natural description of the instantaneous state of the network by a collection of binary or Ising variables; $\sigma_i = +1$ indicates that neuron $i$ generates a spike, and $\sigma_i = -1$ indicates that neuron $i$ is silent. Knowing the average rate at which spikes are generated by each cell is equivalent to knowing the expectation values $\langle\sigma_i\rangle$ for all $i$. Similarly, knowing the probabilities of coincident spiking (correlations) among all pairs of neurons is equivalent to knowing the expectation values $\langle\sigma_i\sigma_j\rangle$. Of course there are an infinite number of probability distributions $P(\sigma)$ over the states of the whole system ($\sigma \equiv \{\sigma_i\}$) that are consistent with these expectation values, but if we ask for the distribution that is as random as possible while still reproducing the data (the maximum entropy distribution) then this has the form of an Ising model with pairwise interactions:

$$ P(\sigma) = \frac{1}{Z} \exp\left[ -\sum_i h_i \sigma_i - \frac{1}{2} \sum_{i \neq j} J_{ij} \sigma_i \sigma_j \right]. \tag{1} $$

The inverse problem is to find the "magnetic fields" $\{h_i\}$ and "exchange interactions" $\{J_{ij}\}$ that reproduce the observed values of $\langle\sigma_i\rangle$ and $\langle\sigma_i\sigma_j\rangle$.
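For very small networks the forward half of this problem can be done by brute force, which makes the definition in Eq (1) concrete. The following is a minimal sketch, not the method used in this paper: the function name `ising_moments` and the list-of-lists layout for J are illustrative assumptions, and the cost grows as 2^N, so it is only usable for a handful of spins.

```python
import itertools
import math

def ising_moments(h, J):
    """Exact <sigma_i> and <sigma_i sigma_j> under Eq (1), enumerating all 2^N states.

    h is a list of N fields; J is an N x N symmetric list of lists (diagonal ignored).
    Both names and the data layout are illustrative assumptions.
    """
    n = len(h)
    Z = 0.0
    mean = [0.0] * n
    corr = [[0.0] * n for _ in range(n)]
    for state in itertools.product((-1, 1), repeat=n):
        # Exponent of Eq (1): sum_i h_i s_i + (1/2) sum_{i != j} J_ij s_i s_j
        energy = sum(h[i] * state[i] for i in range(n))
        energy += 0.5 * sum(J[i][j] * state[i] * state[j]
                            for i in range(n) for j in range(n) if i != j)
        w = math.exp(-energy)          # unnormalized Boltzmann weight
        Z += w
        for i in range(n):
            mean[i] += w * state[i]
            for j in range(n):
                corr[i][j] += w * state[i] * state[j]
    mean = [x / Z for x in mean]
    corr = [[x / Z for x in row] for row in corr]
    return mean, corr
```

With all couplings set to zero the spins are independent and each mean reduces to $-\tanh(h_i)$, which is a convenient sanity check on the sign convention of Eq (1). The inverse problem asks for the `h` and `J` that make these outputs match measured values.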

The surprising result of Ref [9] was that this Ising model provides an accurate quantitative description of the combinatorial patterns of spiking and silence observed in groups of order N = 10 neurons in the retina as it responds to natural sensory inputs, despite only taking account of pairwise interactions. The Ising model allows us to understand how, in this system, weak correlations among pairs of neurons can coexist with strong collective effects at the network level, and this is even clearer as one extends the analysis to larger groups (using real data for N = 40, and extrapolating to N = 120), where there is a hint that the system is poised near a critical point in its dynamics [10]. Since the initial results, a number of groups have found that the maximum entropy models provide surprisingly accurate descriptions of other neural systems [11, 12, 13] and similar approaches have been used to look at biochemical and genetic networks [14, 15].

The promise of the maximum entropy approach to biological networks is that it builds a bridge from easily observable correlations among pairs of elements to a global view of the collective behavior that can emerge from the network as a whole. Clearly this potential is greatest in the context of large networks. Indeed, even for the retina, methods are emerging that make it possible to record simultaneously from hundreds of neurons [16, 17], so just keeping up with the data will require methods to deal with much larger instances of the inverse problem. The essential difficulty, of course, is that once we have a large network, even checking that a given set of parameters $\{h_i, J_{ij}\}$ reproduces the observed expectation values requires a difficult calculation. In Ref [10] we took an essentially brute force Monte Carlo approach to this part of the problem, and then adjusted the parameters to improve the match between observed and predicted expectation values using a relatively naive algorithm.

In this work we combine several ideas, taken both from statistical physics and from machine learning [18], which seem likely to help arrive at more efficient solutions of the inverse problem for the pairwise Ising model. First, we adapt the histogram Monte Carlo method [19] to 'recycle' the Monte Carlo samples that we generate as we make small changes in the parameters of the Hamiltonian. Second, we use a coordinate descent method to adjust the parameters [20]. Finally, we exploit the fact that neurons use their binary states in a very asymmetric fashion, so that silence is much more common than spiking. Combining these techniques, we are able to solve the inverse problem for N = 40 neurons in tens of minutes, rather than many days for the naive approach, holding out hope for generalization to yet larger problems.

II. INGREDIENTS OF OUR ALGORITHM

A. Basic formulation

Our overall goal is to build a model for the distribution $P(\sigma)$ over the states of a system with N elements, $\sigma \equiv \{\sigma_1, \sigma_2, \cdots, \sigma_N\}$. As ingredients for determining this model, we use low order statistics computed from a set of m samples $\{\sigma^1, \sigma^2, \cdots, \sigma^m\}$, which we can think of as samples drawn from the distribution $P(\sigma)$. The classical idea of maximum entropy models is that we should construct $P(\sigma)$ to generate the correct values of certain average quantities (e.g., the energy in the case of the Boltzmann distribution), but otherwise the distribution should be 'as random' as possible [21]. Formally this means that we find $P(\sigma)$ as the solution of a constrained optimization problem, maximizing the entropy of the distribution subject to conditions that enforce the correct expectation values. We will refer to the quantities whose averages are constrained as "features" of the system, $f \equiv \{f_1, f_2, \cdots, f_K\}$, where each $f_\mu$ is a function of the state $\sigma$, $f_\mu(\sigma)$.

One special set of average features are just the marginal distributions for subsets of the variables. Thus we can construct the one–body marginals

$$ P_i(\sigma_i) = \sum_{\{\sigma_{j \neq i}\}} P(\sigma_1, \sigma_2, \cdots, \sigma_N), \tag{2} $$

the two–body marginals,

$$ P_{ij}(\sigma_i, \sigma_j) = \sum_{\{\sigma_{k \neq i,j}\}} P(\sigma_1, \sigma_2, \cdots, \sigma_N), \tag{3} $$

and so on for larger subsets. The maximum entropy distributions consistent with marginal distributions up to K–body terms generate a hierarchy of models that capture increasingly higher–order correlations, monotonically reducing the entropy of the model as K increases, toward the true value [22].

Let $\tilde P$ denote the empirical distribution

$$ \tilde P(\sigma) = \frac{1}{m} \sum_{n=1}^{m} \delta(\sigma, \sigma^n), \tag{4} $$

where $\delta(\sigma, \sigma')$ is the Kronecker delta, equal to one when $\sigma = \sigma'$ and equal to zero otherwise. The maximum-entropy problem is then

$$ \max_P S[P] \quad \text{such that} \quad \langle f(\sigma) \rangle_P = \langle f(\sigma) \rangle_{\tilde P}, \tag{5} $$

where $S[P]$ denotes the entropy of the distribution $P$, and $\langle \cdots \rangle_P$ denotes an expectation value with respect to that distribution. Using the method of Lagrange multipliers, the solution to the maximum entropy problem has the form of a Boltzmann or Gibbs distribution [21],

$$ Q_\theta(\sigma) = \frac{1}{Z(\theta)} \exp\left[ -\sum_{\mu=1}^{K} \theta_\mu f_\mu(\sigma) \right], \tag{6} $$

where as usual the partition function $Z(\theta)$ is the normalization constant ensuring that the distribution $Q_\theta$ sums to one, and the parameters $\{\theta_1, \theta_2, \cdots, \theta_K\} \equiv \theta \in \mathbb{R}^K$ correspond to the Lagrange multipliers. Note that the expression for $Q_\theta$ above describes the pairwise Ising model,

Eq (1), with $\theta = \{h_i, J_{ij}\}$, and the features $f$ are the one–spin and two–spin combinations $\{\sigma_i, \sigma_i\sigma_j\}$.

Rather than thinking of our problem as that of maximizing the entropy subject to constraints on the expectation values, we can now think of our task as searching the space of Gibbs distributions, parameterized as in Eq (6), to find the values of the parameters $\theta$ that generate the correct expectation values. Importantly, because of basic thermodynamic relationships, this search can also be formulated as an optimization problem. Specifically, we recall that expectation values in statistical mechanics can be written as derivatives of the free energy, which in turn is the logarithm of the partition function (up to factors of the temperature, which isn't relevant here). Thus, for distribution in the form of Eq (6), we have

$$ \langle f_\mu(\sigma) \rangle_{Q_\theta} = -\frac{\partial \ln Z(\theta)}{\partial \theta_\mu}. \tag{7} $$

Enforcing that these expectation values are equal to the expectation values computed from our empirical samples means solving the equation

$$ \langle f_\mu(\sigma) \rangle_{\tilde P} \equiv \frac{1}{m} \sum_{n=1}^{m} f_\mu(\sigma^n) = -\frac{\partial \ln Z(\theta)}{\partial \theta_\mu}. \tag{8} $$

But this can be written as

$$ \frac{1}{m} \sum_{n=1}^{m} f_\mu(\sigma^n) = \frac{1}{m} \sum_{n=1}^{m} \frac{\partial}{\partial \theta_\mu} \sum_{\nu=1}^{K} \theta_\nu f_\nu(\sigma^n) = -\frac{\partial \ln Z(\theta)}{\partial \theta_\mu} \tag{9} $$

$$ 0 = \frac{\partial}{\partial \theta_\mu} \left[ -\ln Z(\theta) - \frac{1}{m} \sum_{n=1}^{m} \sum_{\nu=1}^{K} \theta_\nu f_\nu(\sigma^n) \right] \tag{10} $$

$$ = \frac{\partial}{\partial \theta_\mu} \frac{1}{m} \sum_{n=1}^{m} \ln Q_\theta(\sigma^n). \tag{11} $$
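As a worked step (a sketch that uses only Eqs (6) and (7), with no further assumptions), the derivative in Eq (11) can be carried out explicitly. Since $\ln Q_\theta(\sigma) = -\ln Z(\theta) - \sum_\nu \theta_\nu f_\nu(\sigma)$,

$$ \frac{\partial}{\partial \theta_\mu} \frac{1}{m} \sum_{n=1}^{m} \ln Q_\theta(\sigma^n) = -\frac{\partial \ln Z(\theta)}{\partial \theta_\mu} - \frac{1}{m} \sum_{n=1}^{m} f_\mu(\sigma^n) = \langle f_\mu(\sigma) \rangle_{Q_\theta} - \langle f_\mu(\sigma) \rangle_{\tilde P}, $$

so each component of the gradient is the difference between a model expectation value and its empirical counterpart, and the gradient vanishes exactly when all of the constrained averages match.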

Thus we see that matching the empirical expectation values is equivalent to looking for a local extremum (which turns out to be a maximum) of the quantity

$$ \frac{1}{m} \sum_{n=1}^{m} \ln Q_\theta(\sigma^n). \tag{12} $$

But if all the samples $\{\sigma^1, \sigma^2, \cdots, \sigma^m\}$ are drawn independently, then the total probability that the Gibbs distribution with parameters $\theta$ generates the data is $P_{\rm total} = \prod_n Q_\theta(\sigma^n)$, and so the quantity in Eq (12) is (up to a factor of m) just $\ln P_{\rm total}$. Finding the maximum entropy distribution thus is equivalent to maximizing the probability or likelihood that our model generates the observed samples, within the class of models defined by the Gibbs distribution Eq (6).

We recall that, in information theory [23], probability distributions implicitly define strategies for encoding data, and the shortest codes are achieved when our model of the distribution actually matches the distribution from which the data are drawn. Since code lengths are related to the negative logarithm of probabilities, it is convenient to define the cost of coding or log loss $L_{\tilde P}(\theta)$ that arises when we use the model with parameters $\theta$ to describe data drawn from the empirical distribution $\tilde P$:

$$ L_{\tilde P}(\theta) = -\frac{1}{m} \sum_{n=1}^{m} \ln Q_\theta(\sigma^n) = \langle -\ln Q_\theta(\sigma) \rangle_{\tilde P}. \tag{13} $$

Comparing with Eq (12), we obtain the dual formulation of the maximum entropy problem,

$$ \min_\theta L_{\tilde P}(\theta). \tag{14} $$

Why is the optimization problem in Eq (14) difficult? In principle, the convexity properties of free energies should make the problem well behaved and tractable. But it remains possible for $L_{\tilde P}(\theta)$ to have a very sensitive dependence on the parameters $\theta$, and this can cause practical problems, especially if, as suggested in Ref [10], the systems we want to describe are poised near a critical point. Even before encountering this problem, however, we face the difficulty that computing $L_{\tilde P}(\theta)$ or even its gradient in parameter space involves computing expectation values with respect to the distribution $Q_\theta(\sigma)$. Once the space of states becomes large, it is no longer possible to do this by exact enumeration. We can try to use approximate analytical methods, or we can use Monte Carlo methods.

B. Monte Carlo methods

Our general strategy for solving the optimization problem in Eq (14) will be to use standard Monte Carlo simulations [24, 25, 26] to generate samples from the distribution $Q_\theta(\sigma)$, approximate the relevant expectation values as averages over these samples, and then use the results to propose changes in the parameters $\theta$ so as to proceed toward a minimum of $L_{\tilde P}(\theta)$. Implemented naively, as in Ref [10], this procedure is hugely expensive, because at each new setting of the parameters we have to generate a new set of Monte Carlo samples. Some of this cost can be avoided using the ideas of histogram Monte Carlo [19]. We recall that if we want to compute the expectation value of some function $\Phi(\sigma)$ in the distribution $Q_{\theta'}(\sigma)$, this can be written as

$$ \langle \Phi(\sigma) \rangle_{\theta'} \equiv \sum_\sigma Q_{\theta'}(\sigma) \Phi(\sigma) \tag{15} $$

$$ = \sum_\sigma Q_\theta(\sigma) \left[ \frac{Q_{\theta'}(\sigma)}{Q_\theta(\sigma)} \right] \Phi(\sigma) \tag{16} $$

$$ = \left\langle \frac{Q_{\theta'}(\sigma)}{Q_\theta(\sigma)} \Phi(\sigma) \right\rangle_\theta \tag{17} $$

$$ = \frac{\langle \Phi(\sigma) \exp[-(\theta' - \theta) \cdot f(\sigma)] \rangle_\theta}{\langle \exp[-(\theta' - \theta) \cdot f(\sigma)] \rangle_\theta}, \tag{18} $$

where we denote expectation values in the distribution $Q_\theta(\sigma)$ by $\langle \cdots \rangle_\theta$, and similarly for $\theta'$. We note that Eq (18) is exact. The essential step of histogram Monte Carlo is to use an approximation to this equation, replacing the expectation value in the distribution $Q_\theta$ by an average over a set of samples drawn from Monte Carlo simulation of this distribution.

Consider an algorithm A which searches for a minimum of $L_{\tilde P}(\theta)$. As this algorithm progresses, the values of the parameters $\theta$ change slowly. We will divide these changes into stages $s = 1, 2, \cdots$, and within each stage we will perform $t = 1, 2, \cdots, T$ iterations. At the first iteration, we will generate, via Monte Carlo, M samples of the state $\sigma$ drawn out of the distribution appropriate to the current value of $\theta = \theta(s, t = 1)$. Let us refer to averages over these samples as $\langle \cdots \rangle_{{\rm MC}\theta}$, which should approximate $\langle \cdots \rangle_\theta$. At subsequent iterations, the parameters $\theta$ will be adjusted (see below for details), but we keep the same Monte Carlo samples and approximate the expectation values over the distribution with parameters $\theta' = \theta(s, t)$ as

$$ \langle \Phi(\sigma) \rangle_{\theta'} \approx \frac{\langle \Phi(\sigma) \exp[-(\theta' - \theta) \cdot f(\sigma)] \rangle_{{\rm MC}\theta}}{\langle \exp[-(\theta' - \theta) \cdot f(\sigma)] \rangle_{{\rm MC}\theta}}. \tag{19} $$

We denote this approximation as $\langle \cdots \rangle_{\theta'} \approx \langle \cdots \rangle_{\theta'|\theta}$. Once we reach $t = T$, we run a new Monte Carlo simulation appropriate to the current value of $\theta$, and the cycle begins again at stage $s + 1$. This strategy is summarized as pseudocode in Fig 1.

Once we choose the optimization algorithm A, there are still two free parameters in our scheme: the number of samples M provided by the Monte Carlo simulation at each stage, and the number of iterations per stage T. In our experiments, we explore how choices of these parameters influence the convergence of the resulting algorithms.

C. Parameter adjustment and the use of sparsity

At the core of our problem is an algorithm which tries to adjust the parameters $\theta$ so as to minimize $L_{\tilde P}(\theta)$. In Ref [10] we used a simple gradient descent method. Here we use a coordinate descent method [20], adapted from Ref [27] and tuned specifically for the maximum entropy problem. Where gradient descent methods attempt to find a vector in parameter space along which one can achieve the maximum reduction in L, coordinate descent methods explore in sequence the individual parameters $\theta_\mu$.

Beyond the general considerations in Refs [20, 27], coordinate descent may be especially well suited to our problem because of the sparsity of the feature vectors. In implementing any parameter adjustment algorithm, it is useful to take account of the fact that in networks of real neurons, spikes and silences are used very asymmetrically. It thus is useful to write the basic variables as $n_i = (\sigma_i + 1)/2 \in \{0, 1\}$. This involves a slight redefinition of the parameters $\{h_i, J_{ij}\}$, but once this is done one finds that individual terms $\propto n_i$ or $\propto n_i n_j$ are often zero, because spikes (corresponding to $n_i = 1$) are rare. This is true not just in the experimental data, but of course also in the Monte Carlo simulations once we have parameters that approximately reproduce the overall probability of spiking vs. silence. This sparsity can lead to substantial time savings, since all expectation values can be evaluated in time proportional to the number of non–zero elements [28].

As an alternative to the coordinate descent method, we also used a general purpose convex optimization algorithm known as the limited memory variable metric (LMVM) [29], which is known to outperform other such general purpose algorithms on a wide variety of problems [30]. While LMVM has the advantage of changing multiple parameters at a time, the cost of updating may outweigh this advantage, especially for very sparse data such as ours.

Both the coordinate descent and the LMVM schemes are initialized with parameters [in the notation of Eq (1)] $J_{ij} = 0$ and the $h_i$ chosen to exactly reproduce the expectation values $\langle \sigma_i \rangle$. Our Monte Carlo method follows the discussion of the Gibbs sampler in Ref [26]. All computations were on the same computing cluster [31] used in Ref [10].

III. EXPERIMENTS

In this section, we evaluate the speed of convergence of our algorithms as a function of M (the number of Monte Carlo samples drawn in a sampling round) and T (the number of learning steps per stage) under a fixed time budget. We begin with synthetic data, for which we know the correct answer, and then consider the problem of constructing maximum entropy models to describe real data.

Input: empirical observations σ^1, ···, σ^m
       parameters M and T
       access to maxent inference algorithm A
Algorithm:
    initialize A
    θ(1, 1) ← initial parameter vector computed by A
    for s = 1, 2, ...
        generate M Monte Carlo samples using θ(s, 1)
        for t = 1, ..., T
            run one iteration of A, approximating ⟨···⟩_θ(s,t) = ⟨···⟩_θ(s,t)|θ(s,1)
            θ(s, t + 1) ← parameter vector computed by A on this iteration
        θ(s + 1, 1) ← θ(s, T + 1)

FIG. 1: Pseudocode for our scheme, which reuses a Monte Carlo sample set across T iterations of a maxent algorithm A.
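The loop of Fig 1, together with the reweighting of Eq (19), can be sketched in a few dozen lines for a toy problem. This is an illustration under stated assumptions rather than the implementation used here: the function names, the defaults for M, T, `stages` and `step`, and the sampler settings are all hypothetical; a plain gradient step stands in for algorithm A (the paper uses coordinate descent or LMVM); the parameters start at zero rather than at fields matched to the one-body averages; and pair couplings are stored once per pair i < j, which is equivalent to the symmetric sum in Eq (1).

```python
import itertools
import math
import random

def features(state):
    """f(sigma): all sigma_i, followed by all sigma_i * sigma_j with i < j."""
    pairs = itertools.combinations(range(len(state)), 2)
    return list(state) + [state[i] * state[j] for i, j in pairs]

def gibbs_feature_samples(theta, n, M, rng, sweeps_between=2, burn_in=50):
    """Draw M feature vectors from Q_theta ~ exp(-theta . f) with a Gibbs sampler."""
    pair_index = {p: n + k for k, p in enumerate(itertools.combinations(range(n), 2))}
    state = [rng.choice((-1, 1)) for _ in range(n)]
    out = []
    for sweep in range(burn_in + M * sweeps_between):
        for i in range(n):
            field = theta[i] + sum(theta[pair_index[(min(i, j), max(i, j))]] * state[j]
                                   for j in range(n) if j != i)
            # Conditional of one spin: P(sigma_i = +1 | rest) = 1 / (1 + exp(2 * field))
            state[i] = 1 if rng.random() < 1.0 / (1.0 + math.exp(2.0 * field)) else -1
        if sweep >= burn_in and (sweep - burn_in) % sweeps_between == sweeps_between - 1:
            out.append(features(state))
    return out

def reweighted_means(samples, theta_ref, theta_new):
    """Eq (19) with Phi = f: estimate <f> at theta_new from samples drawn at theta_ref."""
    d = [a - b for a, b in zip(theta_new, theta_ref)]
    logw = [-sum(dk * fk for dk, fk in zip(d, f)) for f in samples]
    shift = max(logw)  # subtract the maximum before exponentiating, for stability
    w = [math.exp(x - shift) for x in logw]
    norm = sum(w)
    return [sum(wi * f[k] for wi, f in zip(w, samples)) / norm
            for k in range(len(theta_ref))]

def fit(target, n, M=2000, T=20, stages=15, step=0.2, seed=0):
    """Recycling loop of Fig 1; a plain gradient step is an assumed stand-in for A."""
    rng = random.Random(seed)
    theta = [0.0] * len(target)
    for _ in range(stages):
        theta_ref = list(theta)
        samples = gibbs_feature_samples(theta_ref, n, M, rng)  # one simulation per stage
        for _ in range(T):
            model = reweighted_means(samples, theta_ref, theta)  # reuse the same samples
            # The log loss gradient is <f>_data - <f>_model; step against it.
            theta = [th - step * (tg - md) for th, tg, md in zip(theta, target, model)]
    return theta
```

Here `target` is the vector of empirical averages in the same order as `features`. Setting T = 1 recovers the naive scheme with a fresh simulation for every parameter update, while a larger T reuses each sample set for more updates at the cost of reweighting further from the parameters at which the samples were drawn.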

A. Synthetic data

The maximum entropy construction is a strategy for simplifying our description of a system with many interacting elements. Separate from the algorithmic question of finding the maximum entropy model is the natural science question of whether this model provides a good description of the system we are studying, making successful predictions beyond the set of low order correlations that are used to construct the model. To sharpen our focus on the algorithmic problem, we use as "empirical data" a set of samples generated by Monte Carlo simulation of an Ising model as in Eq (1). To get as close as possible to the problems we face in analyzing real data, we use the parameters of the Ising model found in Ref [10] in the analysis of activity in a network of N = 40 neurons in the retina as it responds to naturalistic movies. We note that this model has competing interactions, as in a spin glass, with multiple locally stable states; extrapolation to larger systems suggests that the parameters are close to a critical point.

Starting with the parameters $\theta_0 \equiv \{h_i, J_{ij}\}$ determined in Ref [10], we generate $m = 2 \times 10^5$ samples by Monte Carlo simulation. Let us call the empirical distribution over these samples $\hat P$. Then we proceed to minimize $L_{\hat P}(\theta)$. To monitor the progress of the algorithm, we compute

$$ \Delta L(\theta) = L_{\hat P}(\theta) - L_{\hat P}(\theta_0). \tag{20} $$

Note that this computation requires us to estimate the log ratios of partition functions, for which we again use the histogram Monte Carlo approach,

$$ \frac{Z(\theta)}{Z(\theta_0)} = \langle \exp[-(\theta - \theta_0) \cdot f(\sigma)] \rangle_{\theta_0} \tag{21} $$

$$ \approx \langle \exp[-(\theta - \theta_0) \cdot f(\sigma)] \rangle_{{\rm MC}\theta_0}. \tag{22} $$

As a check on this approximation, we can exchange the roles of $\theta$ and $\theta_0$,

$$ \frac{Z(\theta_0)}{Z(\theta)} \approx \langle \exp[+(\theta - \theta_0) \cdot f(\sigma)] \rangle_{{\rm MC}\theta}, \tag{23} $$

and test for consistency between the two results. Finally, to compare the performance of the two optimization algorithms, we withhold some fraction of the data, here chosen to be 10%, for testing.

Figure 2 illustrates the performance of the two algorithms on the synthetic dataset, measured by $\Delta L$. For simplicity, we have held the number of Monte Carlo samples at each stage, $M = 3.2 \times 10^5$, fixed. Choosing T = 1 iteration per stage corresponds to a naive use of Monte Carlo, with a new simulation for each new setting of the parameters, and we see that this approach converges very slowly, as found in Ref [10]. Simply increasing this number to T = 20 produces at least an order of magnitude speed up in convergence.

[Figure 2: log loss difference from the truth versus running time (sec). Curves: Coordinate update, 1 iteration per stage; LMVM, 1 iteration per stage; Coordinate update, 20 iterations per stage; LMVM, 20 iterations per stage.]

FIG. 2: Approximate difference in the test log loss between the current solution of an algorithm and the exact model that generates the test data, as a function of the algorithm running time. The number of Monte Carlo samples at each stage is set to 320,000. The two choices of optimization algorithm exhibit similar performance, but much greater efficiency is achieved in both cases by including multiple iterations of learning algorithm A per stage.
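The partition function estimates of Eqs (22) and (23) and the consistency check between them can be sketched as follows. This is a minimal illustration, with hypothetical function names and with samples represented as lists of feature vectors; it works with logarithms because the averages of exponentials can span many orders of magnitude.

```python
import math

def log_z_ratio(samples_at_ref, theta_ref, theta):
    """Eq (22): estimate ln[Z(theta) / Z(theta_ref)] from feature vectors
    sampled at theta_ref (names and data layout are illustrative assumptions)."""
    logs = [-sum((a - b) * fk for a, b, fk in zip(theta, theta_ref, f))
            for f in samples_at_ref]
    shift = max(logs)  # log-sum-exp trick to avoid overflow
    return shift + math.log(sum(math.exp(x - shift) for x in logs) / len(logs))

def consistency_gap(samples_at_theta0, samples_at_theta, theta0, theta):
    """Sum of the forward (Eq 22) and reverse (Eq 23) log estimates.

    The two are exact negatives of each other in the limit of perfect sampling,
    so a gap far from zero signals that the reweighting is unreliable.
    """
    forward = log_z_ratio(samples_at_theta0, theta0, theta)  # ln Z(theta)/Z(theta0)
    reverse = log_z_ratio(samples_at_theta, theta, theta0)   # ln Z(theta0)/Z(theta)
    return forward + reverse
```

For a single spin with feature $f = \sigma$ and $\theta_0 = 0$, the exact ratio is $\cosh\theta$, which gives a simple check of the sign conventions.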

[Figure 3. Left panel: average error in predicted correlations versus number of iterations per stage; legend: 1.6×10^5 MCMC samples, 3.2×10^5 MCMC samples, finish line. Right panel: average error in predicted correlations and total number of iterations versus number of MCMC samples (×10^5); legend: 101 iterations per stage, 21 iterations per stage, average error in predicted correlations, finish line for average error, number of stages in 300 sec.]

FIG. 3: (Left) Performance of our algorithm on the empirical data according to the absolute correlation difference metric, as a function of the number of iterations of the optimization algorithm and the number of Monte Carlo samples. The algorithm was terminated after the running time exceeded 300 seconds. Choosing too many or too few iterations per stage decreases efficiency (due to overlearning or inefficient use of samples, respectively). (Right) Performance of our algorithm on the empirical data according to the absolute correlation difference metric, as a function of the number of examples generated by the Monte Carlo sampler. The algorithm was terminated after the running time exceeded 300 seconds. Also displayed is the total number of iterations completed by the time cutoff. Fewer iterations per sample allow the generation of more samples, which is a computationally expensive process; therefore, fewer iterations result in fewer total optimization stages. Choosing too many or too few Monte Carlo samples per stage decreases efficiency, either because the algorithm samples for too long and terminates before enough learning stages can be completed, or because the sampling is too sparse and expectations are poorly estimated.
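The absolute correlation difference metric named in this caption is the average over pairs of the absolute difference between two sets of connected correlations $C_{ij} = \langle\sigma_i\sigma_j\rangle - \langle\sigma_i\rangle\langle\sigma_j\rangle$. A minimal sketch, assuming each data set is a list of ±1 state vectors and using hypothetical function names:

```python
def connected_correlations(samples):
    """C_ij = <s_i s_j> - <s_i><s_j> for all pairs i < j, from +/-1 state vectors."""
    m = len(samples)
    n = len(samples[0])
    mean = [sum(s[i] for s in samples) / m for i in range(n)]
    return {(i, j): sum(s[i] * s[j] for s in samples) / m - mean[i] * mean[j]
            for i in range(n) for j in range(i + 1, n)}

def delta_c(samples_a, samples_b):
    """Average over pairs of |C_ij(a) - C_ij(b)| (the metric Delta C)."""
    ca = connected_correlations(samples_a)
    cb = connected_correlations(samples_b)
    return sum(abs(ca[p] - cb[p]) for p in ca) / len(ca)

def finish_line(data):
    """Delta C between the two halves of one data set, used as a convergence target."""
    half = len(data) // 2
    return delta_c(data[:half], data[half:])
```

Calling `delta_c(model_samples, data)` gives the error of a candidate model, and comparing it with `finish_line(data)` indicates whether the remaining mismatch is already within the sampling noise of the data themselves.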

B. Real data

Here we consider the problem of constructing the maximum entropy distribution consistent with the correlations observed in real data. Our example is from Refs [9, 10], the responses of N = 40 retinal neurons. As noted at the outset, if we divide time into small slices of duration $\Delta\tau$ then each cell either does or does not generate an action potential, corresponding to a binary or Ising variable. The experiments of Ref [9], analyzed with $\Delta\tau = 0.02$ s, provide m = 189,950 samples of the state $\sigma$. We note that spikes are rare, and pairwise correlations are weak, as discussed at length in Ref [9].

As we try to find the parameters $\theta$ that describe these data, we need a metric to monitor the progress of our algorithm; of course we don't have access to the true distribution. Since we are searching in the space of Gibbs distributions, which automatically have the correct functional form to maximize the entropy, we check the quality of agreement between the predicted and observed expectation values. It is straightforward to ensure that the model reproduces the one–body averages $\langle\sigma_i\rangle$ almost exactly. Thus we focus on the (connected) two–body averages $C_{ij} = \langle\sigma_i\sigma_j\rangle - \langle\sigma_i\rangle\langle\sigma_j\rangle$. We compute these averages via Monte Carlo in the distribution $Q_\theta(\sigma)$, and then form the absolute difference between computed and observed values of $C_{ij}$, and finally average over all pairs $ij$. The resulting average error $\Delta C$ measures how close we come to solving the maximum entropy problem. Note that since the observed values of $C_{ij}$ are based on a finite set of samples, we don't expect that they are exact, and hence we shouldn't identify convergence with $\Delta C \to 0$. Instead we divide the real data in half and measure $\Delta C$ between these halves, and use this as the 'finish line' for our algorithms.

In Fig 3 we see the behavior of $\Delta C$ as a function of the number of iterations per stage (T), where we terminate the algorithm after 300 seconds of running time on our local cluster [31]. As expected intuitively, both small T and large T perform badly, but there is a wide range $T \sim 30 - 200$ in which the algorithm reaches the finish line within the allotted time. This performance also depends on M, the number of Monte Carlo samples per stage, so that again there is a range of $M \sim (1 - 3) \times 10^5$ that seems to work well. In effect, when we constrain the total run time of the algorithm there is a best way of apportioning this time between stages (in which we run new Monte Carlo simulations) and iterations (in which we adjust parameters using estimates based on a fixed set of Monte Carlo samples). By working within a factor of two of this optimum, we achieve substantial speed up of the algorithm, or an improvement in convergence at fixed run time.

IV. CONCLUSION

Our central conclusion is that recycling of Monte Carlo samples, as in the histogram Monte Carlo method [19], provides a substantial speedup in the solution of the inverse Ising problem. In detail, some degree of recycling speeds up the computations, while of course too much recycling renders the underlying approximations invalid, so that there is some optimal amount of recycling; fortunately it seems from Fig 3 that this optimum is quite broad. We expect that these basic results will be true more generally for problems in which we have to learn the parameters of probabilistic models to provide the best match to a large body of data. In the specific context of Ising models for networks of real neurons, the experimental state of the art is now providing data on populations of neurons which are sufficiently large that these issues of algorithmic efficiency become crucial.

Acknowledgments

We thank MJ Berry II, E Schneidman and R Segev for helpful discussions and for their contributions to the work which led to the formulation of the problems addressed here. This work was supported in part by NIH Grant P50 GM071508, and by NSF Grants IIS–0613435 and PHY–0650617.

[1] JB Keller & B Zumino, Determination of intermolecular potentials from thermodynamic data and the law of corresponding states. J Chem Phys 30, 1351–1353 (1959).
[2] W Kunkin & HL Frisch, Inverse problem in classical statistical mechanics. Phys Rev 177, 282–287 (1969).
[3] JT Chayes, L Chayes & E Lieb, The inverse problem in classical statistical mechanics. Commun Math Phys 93, 57–121 (1984).
[4] E Caglioti, T Kuna, J Lebowitz & E Speer, Point processes with specified low order correlation. J Markov Proc and Related Fields 12, 257–272 (2006).
[5] JS Yedidia, WT Freeman & Y Weiss, Constructing free energy approximations and generalized belief propagation algorithms. IEEE Trans Inf Theory 51, 2282–2312 (2005).
[6] JJ Hopfield, Neural networks and physical systems with emergent collective computational abilities. Proc Nat'l Acad Sci (USA) 79, 2554–2558 (1982).
[7] DJ Amit, Modeling Brain Function: The World of Attractor Neural Networks (Cambridge University Press, Cambridge, 1989).
[8] GE Hinton & TJ Sejnowski, Learning and relearning in Boltzmann machines, in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, ed. DE Rumelhart, JL McClelland & the PDP Research Group, pp. 282–317 (MIT Press, Cambridge, 1986).
[9] E Schneidman, MJ Berry II, R Segev & W Bialek, Weak pairwise correlations imply strongly correlated network states in a neural population. Nature 440, 1007–1012 (2006); q–bio.NC/0512013.
[10] G Tkačik, E Schneidman, MJ Berry II & W Bialek, Ising models for networks of real neurons. q–bio.NC/0611072 (2006).
[11] J Shlens et al, The structure of multi–neuron firing patterns in primate retina. J Neurosci 26, 8254–8266 (2006).
[12] See, for example, the presentations at the 2007 meeting on Computational and Systems Neuroscience, http://cosyne.org/wiki/Cosyne 07: IE Ohiorhenuan & JD Victor, Maximum entropy modeling of multi–neuron firing patterns in V1. J Shlens et al, Spatial structure of large–scale synchrony in the primate retina. A Tang et al, A second–order maximum entropy model predicts correlated network states, but not their evolution over time.
[13] See also the presentations at the 2007 meeting of the Society for Neuroscience, http://www.sfn.org/am2007/: J Shlens et al, Spatial organization of large–scale concerted activity in primate retina, 176.17/JJ10. IE Ohiorhenuan & JD Victor, Maximum-entropy analysis of multi-neuron firing patterns in primate V1 reveals stimulus-contingent patterns, 615.8/O01. S Yu, D Huang, W Singer & D Nikolić, A small world of neural synchrony, 615.14/O07. MA Sacek, TJ Blanche, JK Seamans & NV Swindale, Accounting for network states in cortex: are pairwise correlations sufficient?, 790.1/J12. A Tang et al, A maximum entropy model applied to spatial and temporal correlations from cortical networks in vitro, 792.4/K27.
[14] TR Lezon et al, Using the principle of entropy maximization to infer genetic interaction networks from gene expression patterns. Proc Nat'l Acad Sci (USA) 103, 19033–19038 (2006).
[15] G Tkačik, Information Flow in Biological Networks (Dissertation, Princeton University, 2007).
[16] AM Litke et al, What does the eye tell the brain? Development of a system for the large scale recording of retinal output activity. IEEE Trans Nucl Sci 51, 1434–1440 (2004).
[17] R Segev, J Goodhouse, JL Puchalla & MJ Berry II, Recording spikes from a large fraction of the ganglion cells in a retinal patch. Nat Neurosci 7, 1155–1162 (2004).
[18] For a general discussion of Monte Carlo methods in machine learning, see A Andrieu, N de Freitas, A Doucet & MI Jordan, An introduction to MCMC for machine learning, Machine Learning 50, 5–43 (2003); RM Neal, Probabilistic inference using Markov Chain Monte Carlo Methods, Technical report CRG–TR–93–1, Dep't of Computer Science, University of Toronto (1993).
[19] AM Ferrenberg & RH Swendsen, New Monte Carlo technique for studying phase transitions. Phys Rev Lett 61, 2635–2638 (1988).
[20] M Dudík, SJ Phillips & RE Schapire, Performance guarantees for regularized maximum entropy density estimation, in Learning Theory, 17th Annual Conference on Learning Theory, J Shawe–Taylor & Y Singer, eds, pp 472–486 (Springer–Verlag, Berlin, 2004).
[21] ET Jaynes, Information theory and statistical mechanics. Phys Rev 106, 62–79 (1957).
[22] E Schneidman, S Still, MJ Berry II & W Bialek, Network information and connected correlations. Phys Rev Lett 91, 238701 (2003); physics/0307072.
[23] TM Cover & JA Thomas, Elements of Information Theory (John Wiley & Sons, New York, 1991).
[24] K Binder, Monte Carlo Methods in Statistical Physics (Springer–Verlag, Berlin, 1979).
[25] MEJ Newman & GT Barkema, Monte Carlo Methods in Statistical Physics (Oxford University Press, Oxford, 1999).
[26] S Geman & D Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans PAMI 6, 721–741 (1984).
[27] M Collins, RE Schapire & Y Singer, Logistic regression, AdaBoost and Bregman distances. Machine Learning 48, 253–285 (2002).
[28] In early stages of our work we encountered some difficulties with convergence of the coordinate descent algorithm. One possible source of the problem is that if, for example, the Monte Carlo sample includes no spikes from a particular neuron, then trying to match the observed expectation values can become impossible. To ease this difficulty we sometimes bias expectation values over the Monte Carlo samples with a small contribution from expectation values taken from the real data. Further studies would be required to determine if this is essential once we have the optimal values for the parameters M and T.
[29] SJ Benson, L Curfman McInnes, J Moré & J Sarich, TAO user manual (revision 1.8). Technical Report ANL/MCS–TM–242, Mathematics and Computer Science Division, Argonne National Laboratory (2005); http://www.mcs.anl.gov/tao.
[30] R Malouf, A comparison of algorithms for maximum entropy parameter estimation. In Proceedings of the Sixth Conference on Natural Language Learning, D Roth & A van den Bosch, eds, pp 49–55 (Morgan Kaufman, 2002).
[31] The facility has 90 2.3 GHz Apple G5 and 16 1.3 GHz G4 CPUs, running under Sun's Grid Engine. See the description of the fafner grid at http://genomics.princeton.edu/support/grids/.