Ising distribution as a latent variable model Adrien Wohrer

To cite this version:

Adrien Wohrer. Ising distribution as a latent variable model. Physical Review E , American Physical Society (APS), 2019, 99 (4), pp.042147. ￿10.1103/PhysRevE.99.042147￿. ￿hal-02170202￿

HAL Id: hal-02170202 https://hal.uca.fr/hal-02170202 Submitted on 1 Jul 2019

HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Ising distribution as a latent variable model

Adrien Wohrer∗ Universit´eClermont Auvergne, CNRS, SIGMA Clermont, Institut Pascal, F-63000 Clermont-Ferrand, France.

During the past decades, the Ising distribution has attracted interest in many applied disciplines, as the maximum entropy distribution associated to any set of correlated binary (‘spin’) variables with observed means and covariances. However, numerically speaking, the Ising distribution is un- practical, so alternative models are often preferred to handle correlated binary data. One popular alternative, especially in life sciences, is the Cox distribution (or the closely related dichotomized Gaussian distribution and log-normal Cox ), where the spins are generated indepen- dently conditioned on the drawing of a latent variable with a multivariate normal distribution. This article explores the conditions for a principled replacement of the Ising distribution by a Cox dis- tribution. It shows that the Ising distribution itself can be treated as a latent variable model, and it explores when this latent variable has a quasi-normal distribution. A variational approach to this question reveals a formal link with classic mean field methods, especially Opper and Winther’s adaptive TAP approximation. This link is confirmed by weak coupling (Plefka) expansions of the different approximations, and then by numerical tests. Overall, this study suggests that an Ising distribution can be replaced by a Cox distribution in practical applications, precisely when its pa- rameters lie in the ‘mean field domain’.

I. INTRODUCTION The essential interest of the Ising distribution, in all dis- ciplines mentioned above, is its maximum entropy prop- During the last decades, the Ising distribution has erty : whenever a dataset of N binary variables has mea- been used in several disciplines such as (under sured moments (m, C), a single distribution of the form the name quadratic exponential model) [1, 2], machine (1) is guaranteed to exist which matches these moments, learning (under the name Boltzmann machine) [3], infor- and furthermore it has maximal entropy under this con- mation processing [4–6], biology [7] and neurosciences, straint. where it has been proposed as a natural model for the Unfortunately, the Ising distribution is numerically un- spike-based activities of interconnected neural popula- wieldy. The simple act of drawing samples from the tions [8, 9]. In most of these applications, classic as- distribution already requires to set up lengthy Markov sumptions from statistical physics do not hold (e.g., ar- Chain Monte Carlo (MCMC) schemes. Besides, there rangement on a rectangular lattice, uniform couplings, is no simple analytical link between parameters (h, J) independently distributed couplings, zero external fields, and resulting moments (m, C). Given natural parame- etc.), and even old problems have to be revisited, such ters (h, J), the direct Ising problem of estimating (m, C) as efficiently simulating the Ising distribution [10, 11] or can only be solved by numerical sampling from MCMC inferring its parameters from data [12–15]. chains. Given (m, C), the inverse Ising problem of re- In this article I will consider the Ising probability dis- trieving (h, J) can only be solved by gradient descent N based on numerous iterations of the direct problem, a tribution over a set of spins s = (s1, . . . , sN ) ∈ {−1, 1} defined as procedure known as Boltzmann learning. In practice, this means that the Ising distribution cannot be parametrized N N easily from observed data. 1  X 1 X  P (s|h, J) = exp hisi + Jijsisj , (1) For this reason, in spite of the ’s theoretical Z 2 I i=1 i,j=1 attractiveness when dealing with binary variables, alter- native models are generally preferred, which are numeri- with parameters h ∈ RN (external fields) and J an N ×N cally more convenient. In one such family of alternative N symmetric matrix (coupling weights), ZI (h, J) being the models, N latent variables r = (r1, . . . , rN ) ∈ R are corresponding partition function. In this formulation, drawn from a multivariate normal distribution diagonal elements Jii can be nonzero without influencing P −1/2  1 > −1  the distribution, simply adding a constant term Jii/2 N (r|µ, Σ) = |2πΣ| exp − (r − µ) Σ (r − µ) i 2 to both the exponent and log(ZI ). I will note (m, C) the two first centered moments of and then used to generate N spins independently. The the distribution, that is, for all indices i, j, simplest option, setting si = sign(ri) deterministically, yields the dichotomized Gaussian distribution [16, 17], mi = E(si) ,Cij = E(sisj) − E(si)E(sj). which has enjoyed recent popularity as a replacement for the Ising distribution when modeling neural spike trains, as it is easy to sample and to parametrize from data [18, 19]. ∗ [email protected] Slightly more generally, each variable ri can serve as Accepted for publication in Physical Review E an intensity to draw the corresponding spin si following 2 a Bernoulli distribution : similarities with classic mean-field methods which aim at approximating the Ising moments (m, C) (Section IV). N 1  X  Numerical simulations reveal that both types of approx- B(s|r) = exp risi ZB imations, despite their seemingly different goals, are effi- i=1 cient in roughly the same domain of parameters (Section with partition function V). Thus, an Ising distribution can be replaced in prac- tical applications by a Cox distribution, precisely if its Y ri −ri  ZB(r) = e + e . parameters lie in the ‘mean field domain’. i

In statistics, this is the model underlying logistic regres- II. THE ISING LATENT FIELD sion, as introduced by Cox [20], so I will refer to it as the Cox distribution : N Given any vector h ∈ R and N × N symmetric, def- inite positive matrix J, we will consider the following Q(r|µ, Σ) = N (r|µ, Σ), (2) probability distribution over r ∈ RN and s ∈ {−1, 1}N : Q(s|r) = B(s|r), (3) 1  1 > −1 >  N P (s, r) = exp − (r − h) J (r − h) + r s , (4) with µ any vector in R and Σ any N × N symmet- Z 2 ric definite positive matrix. Note that the dichotomized Gaussian corresponds to a limiting case of the Cox dis- with Z ensuring proper normalization. Marginalizing out variable s yields tribution, when the scaling of variables ri tends to +∞ 1. Z (r)  1  Both the Ising (eq. (1)) and Cox (eq. (2)-(3)) distri- P (r) = B exp − (r − h)>J−1(r − h) , (5) butions can be generalized to , by taking Z 2 a suitable limit when the N indexed variables tend to P (s|r) = B(s|r). (6) a continuum [21]. These are respectively known as the Gibbs process [21] and log Gaussian Cox process [22, 23]. Conversely, completing the square in eq. (4) and The latter, much simpler to handle in practice, is used marginalizing out variable r yields in various applied fields such as epidemiology, geostatis- |2πJ|1/2  1  tics [24] and neurosciences, to model neural spike trains P (s) = exp h>s + s>Js , (7) [25, 26]. Z 2 To summarize, in practical application, Cox models P (r|s) = N (r|h + Js, J). (8) (including the dichotomized Gaussian, and Cox point From eq. (7), the resulting spins s are distributed accord- processes) are often preferred to the corresponding max- ing to the Ising distribution of parameters (h, J). imum entropy distributions (Ising distribution, Gibbs The introduction of field variables r has long been point process) because they are easier to sample, and i known in statistical physics, as a mathematical construct to parametrize from a set of observed data. However, to express the Ising partition function Z in an integral to date, we have little analytical insights into the link I form [5]. Indeed, equating the respective expressions for between the two families of distributions. For example, Z imposed by eq. (5) and (7), we obtain the elegant for- given some dataset, we cannot tell in advance how simi- mula lar the Cox and Ising models fitting this data would be. The goal of this article is to investigate this link. Z I first show that the Ising distribution itself can be ZI (h, J) = ZB(r)N (r|h, J)dr (9) r∈ N viewed as a latent variable model, which differs from a R Cox distribution only because of the non-Gaussian distri- expressing ZI as the convolution of ZB with a Gaussian bution of its latent variable (Section II). This allows to kernel of covariance J. This formula can be used as a derive simple relations between an Ising distribution and justification of classic mean field equations [5], and more its ‘best-fitting’ Cox distributions, in two possible senses generally to derive the diagrammatic (i.e., Taylor) expan- (Section III). In particular, the variational approach for sion of ZI as a function of J [27]. targeting a best-fitting Cox distribution displays formal In this work instead, I view the ri as a set of probabilis- tic variables in their own right, coupled to the Ising spin variables, through eq. (5)-(8). Given a spin configuration s, variable r is normally distributed (eq. (8)). Thus, the 1 The dichotomized Gaussian of parameters (µ, Σ) is the limit of overall distribution P (r) is a mixture of Gaussians with 2 the Cox distribution of parameters (λµ, λ Σ) when λ → +∞. 2N components, where the component associated to spin Conversely, note that a Bernoulli variable S ∼ B(r) can be gener- ated as S = sign(r + X) where X follows the logistic distribution configuration s has weight P (s). More compactly, P (r) 1 −2 can be expressed with eq. (5). of density function f(x) = 2 cosh (x), which resembles closely the normal distribution N (x|0, κ2) with κ ' 0.85. As a result, Given some configuration r, the spins s can simply any Cox distribution of parameters (µ, Σ) is decently approxi- be drawn independently following a Bernoulli distribu- mated by a dichotomized Gaussian of parameters (µ, Σ + κ2I). tion (eq. (6)), so the Ising distribution P (s) itself can be 3 viewed as a latent variable model, based on hidden vari- divergence ables ri. It departs from a Cox distribution (eq. (2)-(3)) Z only through the fact that the fields’ distribution P (r) is Q(r|µ, Σ) KLr(Q||P ) = Q(r|µ, Σ) ln dr. (13) N P (r) not normal, in general. r∈R With some straightforward algebra, one can establish the derivatives of KLr(Q||P ) with respect to µ and Σ, and III. COX APPROXIMATIONS TO THE ISING thus its stationary points (µ, Σ). DISTRIBUTION Given the fundamental relation (10)-(11) between spin and field moments in the Ising distribution, it is natural This article investigates the possible replacement of the to reparametrize the Cox parameters (µ, Σ) by the ‘spin Ising distribution by a Cox distribution. With the above moment’ parameters (m, C) such that reformulation, this amounts to approximating the Ising latent field distribution P (r) by a well-chosen multivari- µ = h + Jm, (14) ate normal Q(r) = N (r|µ, Σ). I will now discuss two Σ = J + JCJ. (15) possible choices in this regard. From here on, I will note (m, C) for (any approxima- Then, the values of (m, C) at the stationary points of tion of) the first moments of a spin variable s, and (µ, Σ) KLr(Q||P ) are characterized by the following, fixed point for (any approximation of) the first moments of a field equation : variable r. I will distinguish the true moments of the Z ? ?  p  Ising distribution P (s) with a star : (m , C ). Likewise, mi = tanh µi + x Σii φ(x)dx, (16) the true moments of the corresponding Ising latent field x∈R Z P (r) follow, from eq. (8) :  2  p  di = 1 − tanh µi + x Σii φ(x)dx, x∈R µ? = h + Jm?, (10) (17) Σ? = J + JC?J. (11) −1 −1 (C )ij = di δij − Jij, (18) with φ(x) = N (x|0, 1) the standard one-dimensional nor- Optimal Cox distribution mal distribution. Equations (14)-(18) can be solved by an iterative fixed point method on variables {m , d } (see Appendix Arguably, the optimal approximation of P (r) by a nor- i i i=1...N A). At the solution, the Cox distribution Q(µ, Σ) pro- mal distribution Q(r) = N (µ, Σ) is achieved by equating vides an approximation to the Ising distribution P (h, J). their moments, i.e., setting (µ, Σ) = (µ?, Σ?). I will refer The formulas (16)-(17) are conceptually simple : m to this choice as the optimal Cox distribution. Its quali- i (resp. d ) is obtained as the average of tanh(r) (resp. 1− fication as ‘optimal’ stems from the observation that the i tanh2(r)) using a Gaussian kernel, centered around r = following KL divergence µi with variance Σii. Their estimation at any required Z P (r) precision is straightforward, using numerical integration. KLr(P ||Q) = P (r) ln dr (12) However, when repeated computations are required, it is N Q(r|µ, Σ) r∈R faster to use approximate formulas, given in Appendix E. is minimized when (µ, Σ) = (µ?, Σ?). Thus, in Figure 1 illustrates the nature of the latent field P (r), terms of information geometry, the resulting distribution and of its Cox approximations, on a 2-spin toy model. Q(r|µ?, Σ?) is the nearest neighbor of the latent Ising The optimal Cox distribution is unique by construction, distribution P (r) in the family of normal distributions. but the variational Cox approximation can have multiple Unfortunately, the optimal Cox distribution is unprac- ? ? solutions (panel b), a classic feature of variational meth- tical to characterize, as (m , C ) can only be estimated ods based on minimizing reversed KL divergence [28]. by lengthy Monte-Carlo simulation. In the scope of this article, its study will only be of theoretical interest : it allows to quantify the intrinsic effect of assuming a Gaus- Choice of diag(J) sian shape for P (r). The framework developed above requires strictly posi- tive diagonal coupling values Jii, large enough to ensure Variational Cox approximation that matrix J is definite positive. While these diagonal values do not influence the Ising distribution P (s) over More practically, one may require to approximate an spins, a different choice of diag(J) leads to a different la- Ising distribution P (s) by a Cox distribution, assuming tent field distribution P (r), and thus to a different Cox only knowledge of its natural parameters (h, J). The approximation (see Figure 1). This naturally raises the variational approach to this problem consists in choos- question of what self-couplings Jii represent in the la- ing (µ, Σ) that minimize the reversed Kullback-Leibler tent field formalism, and how they should be chosen in 4

efficiently [29]. Afterwards, a small ridge term λI can be added to J to make it strictly definite positive.

Moments of the Cox distribution

A note of caution is also required on the interpretation of variables (m, C) in the variational Cox approximation, eq. (14)-(18). By eq. (2)-(3), the first spin moment of Cox distribution Q(s|µ, Σ) is Z EQ(si) = tanh(r)N (r|µi, Σii)dr, (20) r∈R and we recognize eq. (16). Thus, at the fixed point of eq. (14)-(18), mi corresponds to the first moment of the approximating Cox distribution. In contrast, at the fixed point of eq. (14)-(18), C is not the spin covariance of the Cox distribution Q(s|µ, Σ). That would be computed (for i 6= j) as ZZ (ij) (ij) CovQ(si, sj) = tanh(r) tanh(t)N (r, t|µ , Σ )drdt 2 (r,t)∈R − EQ(si)EQ(sj), (21)

with µ(ij), Σ(ij) the two-dimensional restrictions of µ, Σ at indices (i, j). Thus, for given Cox parameters (µ, Σ), there are two possible predictions for the Ising covariance matrix C : the ‘forward’ prediction of eq. (21) (spin covariance ma- FIG. 1. Ising latent field and its normal approximations on trix in the Cox distribution) and the ‘backward’ pre- a toy model with N = 2. Normal distributions are ma- diction of eq. (15) (spin covariance in the Ising model terialized by their 2-σ level line. Classic Ising parameters which would give rise to field covariance matrix Σ). If (h1, h2,J12) = (0.2, 0, 0.9). In (a), diagonal couplings are Q(r|µ, Σ) is a good approximation of P (r), we expect fixed at J11 = J22 = 1, whereas in (b) they are fixed at both predictions to be very close. And indeed, the dis- J11 = J22 = 5. The Ising spin distribution P (s) is the same in crepancy between the two predictions of C is a good in- both cases, but not the field distribution P (r) and subsequent dicator of whether the approximation was successful (see Cox approximations. In (b), the variational Cox equation has Supplementary Material). multiple solutions – two of which are displayed.

IV. COMPARISON WITH MEAN FIELD practice. Here, I only detail one concrete proposal for APPROXIMATIONS this choice, and defer more general considerations to the Discussion. In the variational Cox approximation, eq. (14)-(18), In the perspective of this work, the choice should be variables (m, C) constitute an approximation for the mo- made to optimize the resemblance of P (r) with a normal ments of the Ising distribution P (h, J). The goodness of distribution. From eq. (8), larger values of J increase ? ? N fit wrt. exact Ising moments (m , C ) constitutes a sim- the overall separation of the 2 components in P (r) and ple measure of how well distribution P is approximated thus, its divergence from a normal distribution. In the N by the Cox distribution Q(µ, Σ). extreme case where diag(J) is very large, the 2 com- Deriving an approximation for (m, C) is also the goal ponents of P (r) display no overlap at all : see Figure of classic mean field methods. In these methods, the 1(b). magnetizations m are approximated first, as the solution Consequently, the general prescription is that diag(J) of some fixed point equation m = F (m|h, J). Then, this should be kept as small as possible. Given a fixed set of equation is differentiated wrt. h, yielding a predicted off-diagonal weights {Jij}i

Informally, we may say that a given set of Ising param- response formula (Appendix B, eq. (B10), derived here eters (h, J) lies in the ‘mean field domain’ when some with an original approach). mean field method can provide a good estimate of the Finally, the adaptive TAP approximation of Opper and corresponding moments (m?, C?). In the rest of this ar- Winther [33] is a generalized mean field approximation, ticle, I will argue that the variational approximation of based on the cavity method [4]. The magnetization vari- P by a Cox distribution Q is valid precisely in this ‘mean ables {mi} are joined with a second set of variables {Vi}, field domain’. which represent the variance of the cavity field distribu- In this section, I briefly remind the nature of different tion at each spin site, and obey the following fixed point mean field approximations, and compare their respective equations : Plefka expansions in the case of weak couplings, up to X  order 3 (resp. 4 in the Appendices). In section V, I will mi = tanh hi + Jijmj − miVi , (26) proceed to numerical comparisons. j

−1 2  δij (CA )ij = 1 + (1 − mi )Vi 2 − Jij, (27) 1 − mi Classic mean field approximations 2 1 − mi = (CA)ii. (28)

I considered two classic mean field methods, the TAP This derivation is detailed in Appendix C. Briefly, and Bethe approximations, and a more recent general- eq. (26) is a generalization of the TAP equation (23) ization called the adaptive TAP approximation. For the where the variance Vi of the cavity field is left as a free sake of self-completeness, these approximations are re- variable, and eq. (27) is the corresponding linear response minded in some detail in Appendices B and C. Here, I prediction. Equation (28) imposes coherent predictions only provide essential formulas. for individual variances Var(si), thereby closing the fixed point equation on variables {m ,V }. The archetypal mean field method in the Ising model is i i The adaptive TAP approximation is a ‘universal’ mean the TAP approximation, where the approximating vector field method : by letting the variances V adapt freely, of magnetizations m is sought as a fixed point of the i it can account for any statistical structure of matrix J, following Thouless-Anderson-Palmer equation [4, 5, 30] : whereas the classic TAP equation (eq. (23)) is only true when the individual coupling weights Jij are decorrelated  X X 2 2  mi = tanh hi + Jijmj − mi Jij(1 − mj ) . (23) [34]. This is especially welcome in and j6=i j6=i neurosciences, where coupling strengths Jij are generally structured (because they represent learned regularities of This equation can arise in different contexts, one of which the outside world). is the Plefka expansion of the exact Ising model (eq. 29), Equations (26)-(28) bear a striking similarity with the stopped at order 2. Covariances C are derived in turn variational Cox equations, eq. (14)-(18). Both can be based on the linear response formula (eq. (22)) : seen as modifications of the naive mean field equations through N additional variables (the di, resp. Vi) associ- ! ated to the variance of the field acting on each spin. This −1 X 2 2 2 δij (C )ij = 1 + J (1 − m )(1 − m ) similarity will be confirmed in the subsequent analytical TAP ik i k 1 − m2 k i and numerical results. 2 − Jij − 2Jijmimj. (24)

The Bethe approximation is a related mean field Weak coupling expansions method, where the approximating vector of magnetiza- tions m is sought as a fixed point of the following equa- In the Ising model, the link between natural parame- tion [31, 32] : ters (h, J) and magnetizations m can be abstractly de- scribed as h = f(m, J), for an intractable function f. (j) X (i) Only when J = 0 does the link become tractable : the θ = h + tanh−1 tanh(J ) tanh(θ ). (25) i i ik k Ising model boils down to a Bernoulli distribution, with k6=j,i −1 the obvious hi = tanh (mi). One step further, when couplings are weak but (j) Here, the so-called cavity fields θi are tractable func- nonzero, one can derive the Taylor expansion of f around tions of m, namely, the only numbers such that J = 0. In practice, the coupling matrix is written αJ, each 2-spin Ising distribution of natural parameters α being the small parameter of the expansion, and the (j) (i) (θi , θj ,Jij) have first moments (mi, mj). This equa- result is known as the Plefka expansion [35] – although tion can be justified as the exact solution for m when the the approach can be traced back to anterior work [27]. couplings Jij define a tree-like lattice, and its iterative The expansion up to order 4 is a classic computation, resolution is known as the belief propagation algorithm. outlined in Appendix B. Stopping at order 3 for brevity, Covariances C can be derived in turn based on the linear it reads : 6

−1 X 2 X 2 2 hi = tanh (mi) − α Jijmj + α mi Jij(1 − mj ) j6=i j6=i   3 X 2 2 2 1 X 3 2 3 + α 2mi JijJjkJki(1 − mj )(1 − mk) + 2(mi − 3 ) Jijmj(1 − mj ) + o(α ), (29) (jk|i) j6=i where (jk|i) denotes all unordered triplets of the form {i, j, k} with j and k distinct, and distinct from i.

This expansion can serve as a first test on the various nition, the exact Ising expansion (eq. (29)) stopped at approximations (TAP, Bethe, adaptive TAP, variational order 2. The expansion for the Bethe approximation Cox) introduced above. Indeed, these approximations is obtained from the exact Ising expansion by retaining can also be described as h = f(m, J) for a different func- only the sums over spin pairs [36] (see Appendix B). In tion f, and we can compare its Taylor expansion to that eq. (29), this means suppressing the sum over (jk|i) in of the true Ising model. the order 3 term, but keeping the sum over j 6= i. The expansion for the adaptive TAP approximation, The expansion for the TAP approximation is, by defi- derived in Appendix C, writes

−1 X 2 X 2 2 hi = tanh (mi) − α Jijmj + α mi Jij(1 − mj ) j6=i j6=i 3h X 2 2 i 3 + α 2mi JijJjkJki(1 − mj )(1 − mk) + o(α ). (30) (jk|i)

The expansion for the variational Cox approximation, derived in Appendix D, writes

−1 X 2 X 2 2 hi = tanh (mi) − α Jijmj + α mi Jij(1 − mj ) j6=i j6=i 3h X 2 2 2 1 3 2 i 3 + α 2mi JijJjkJki(1 − mj )(1 − mk) − 2(mi − 3 )Jiimi(1 − mi ) + o(α ). (31) (jk|i)

At order 2, all approximations considered have the contribution to the adaptive TAP (eq. (C7)) and varia- same expansion as the exact Ising solution, meaning that tional Cox (eq. (D12)) approximations. Hence, we may they will perform well in case of weak couplings. expect these two approximations to provide a better fit At order 3, discrepancies appear between the exact than the others in case of generic coupling matrices J – Ising solution and its various approximations. The first and this will indeed be our observation in numerical tests contribution to the order 3 term in eq. (29), the sum (Figure 6). over (jk|i), is correctly accounted for by the adaptive The similar structures of eq. (18) and (27) suggest a TAP and variational Cox approximations. The second proximity between the adaptive TAP and variational Cox contribution to the order 3 term in eq. (29), the sum approximations, and this is confirmed by their weak cou- over j 6= i, is correctly accounted for by the Bethe ap- pling expansions : up to order 4, their respective expan- proximation. Of these two sums, that over (jk|i) involves sions differ only through additional terms involving the many more terms, so we expect it will generally be the diagonal weights Jii, so they would be identical for a dominant contribution, except for very specific coupling classic coupling matrix such that diag(J) = 0. matrices J. At order 4, the same qualitative features are observed. The ‘generally dominant’ contribution at order 4 in the true Ising solution writes V. NUMERICAL TESTS

X 2 2 2 2mi JijJjkJklJli(1 − mj )(1 − mk)(1 − ml ) To gain more insights on the behavior of all approxima- (jkl|i) tions above, I turned to numerical exploration : I picked a large number of possible configurations (h, J), estimated (Appendices, eq. (B3)), and this is also the dominant the true Ising moments (m?, C?) in each configuration 7 with lengthy MCMC sampling, and compared all ap- proximations against this ground truth. The numerical details for computing the approximations are given in Appendix A. I should stress from the start that the role of these tests is not to target the most accurate mean field method for approximating (m, C) – which turns out to be the adap- tive TAP method, in most configurations tested here. Instead, the goal of these tests is to support one main claim of this article : that the Ising distribution is well approximated by a Cox distribution in the same domain of parameters where mean field methods are efficient. To focus on the most important aspect of the Ising model, I only tested configurations where h = 0, so that magnetizations verify m = 0 – both in the exact Ising solution and in the various approximations. Hence, the efficiency of a given approximation is assessed by its abil- ity to correctly predict the covariance matrix C. For each tested configuration and approximation, the fit perfor- mance is summarized by number

1 X ? z = Cij − C , N 2 ij i,j FIG. 2. TAP, Bethe, adaptive TAP and variational Cox ap- ? where Cij is the true covariance of spins i and j, and Cij proximations on a typical paramagnetic configuration with its approximation. N = 100 spins, and reference parameter values (J0, J, κ, p) = (0.5, 0.5, ∞, 1). The corresponding fit values z are indicated as insets. In each of Figures 3-6, one parameter will be mod- Generative model for couplings J ified, while the three others keep their reference value.

In this approach, the choice of a generative model for > random matrix XκXκ follows a Wishart distribution, as coupling matrix J is a delicate matter, as the Ising model in the Hopfield model of associative memory [4, 38]. The can exhibit very different behaviors depending on its pa- random variables Jij become dependent, and the spec- rameters. To test different regimes with a single formula, trum of J differs markedly from the SK case (Wigner vs. I used the following generative model : Marˇcenko-Pastur laws). Note that the probabilistic dis-   tribution of each element Jij remains virtually unchanged J0 J > ∗ J = 1 + √ XκXκ . Mp (32) in the process, given by eq. (33) except at very low values pN κpN of κ. Finally, parameter p ∈ [0, 1] allows to dilute the overall with 1 the N × N matrix with uniform unit entries and connectivity, so that only a proportion p of the couplings X an N × κN matrix of independent standard normal κ are nonzero. Coherently, J and J 2 in eq. (33) are scaled entries. Afterwards, element-wise multiplication by a 0 by pN, the effective number of neighbors in the (possibly symmetric masking matrix M can randomly set each p diluted) model. edge (ij) at 0 with probability 1 − p. This model has 4 parameters. The positive num- bers (J0,J) correspond to the standard Sherrington- Results for the approximations Kirkpatrick (SK) parameters for spin glasses [37], mean- ing that the probabilistic distribution of each nonzero off-diagonal term J follows As the 4 parameters in model (32) prevent from an ij exhaustive search, I performed a restricted exploration of  2  parameter space, based on a set of reference parameter J0 J Jij ∼ N , (33) values : pN pN (J, J , κ, p) = (0.5, 0.5, ∞, 1). (34) to a very good approximation, owing to the central limit 0 theorem applied to the κN samples in matrix Xκ. Since (κ, p) = (∞, 1), the individual coupling weights Jij Parameter κ > 0 fixes the amount of global correla- are drawn independently and the connectivity matrix is tion between individual couplings Jij. When κ → +∞, dense. Hence, this is a classic SK model in its paramag- all off-diagonal entries Jij constitute independent ran- netic phase, because J and J0 are smaller than 1 [37]. dom variables, and the classic SK model is recovered Figure 2 shows the fit performances of the various ap- (see Supplementary Material). When κ is smaller, the proximation methods on a typical configuration (h, J) 8 with these generative parameters. All approximations perform well, as expected, since the TAP equations are exact when N → ∞ in the paramagnetic phase of the SK model [4]. However, the variational Cox approximation (fourth panel) displays a bias : the overall magnitude of its predictions Cij is somewhat underestimated, lead- ? ing to a slant in the graph of (Cij,Cij), and a larger fit value z. This bias is a systematic property of the Cox approximation, which is mainly caused by the presence of nonzero self-coupling terms Jii > 0 (see Discussion). I then explored the approximations’ behavior in differ- ent departures from the SK paramagnetic situation. In each of Figures 3-6, one parameter in eq. (34) is varied while the three others keep their reference value. Panel (a) shows the various approximations on a typical con- figuration at the transition out of the ‘paramagnetic SK’ phase. Panel (b) shows the mean fit performance of each approximation as the concerned parameter is var- ied. Panel (c) shows when multiple solutions have been detected to each approximation’s constitutive fixed point equation. I first tested the approximations’ behavior when tran- siting into the ferromagnetic (Figure 3) and spin-glass (Figure 4) phases of the SK model. The ferromagnetic phase, corresponding to J0 > max(1,J), is character- ized by a symmetry breaking into two ‘magnetized’ states with si = 1 (resp. −1) for all spins, constituting the sta- ble solutions of the TAP equation [4, 6]. The spin glass phase, corresponding to J > max(1,J0), is characterized by the apparition of multiple ‘metastable’ local minima of the TAP free energy, with limited basins of attraction [4]. All approximations considered have roughly the same behavior at the phase transitions (Figure 3, J0 ≥ 1, Fig- ure 4, J ≥ 1), losing precise fit (panels (b)) concurrently with the apparition of multiple solutions to their respec- tive equations (panels (c)). This confirms the existence of a universal ‘mean field’ domain for the SK model, cor- responding to its paramagnetic phase. 2 When couplings are made sparser, all approximations again display the same qualitative behavior (Figure 5), maintaining a reasonable precision down to very diluted models. The Bethe approximation is the most efficient in this case, because the rarefaction of loops creates a ‘tree-like’ structure of connectivity. FIG. 3. Transition from paramagnetic to ferromagnetic phase, The main difference between the approximations is when parameter J0 is increased. Other parameters keep their their handling of structured coupling matrices J (Figure reference value (J, κ, p) = (0.5, ∞, 1). (a) TAP, Bethe, adap- 6). When parameter κ decreases and couplings weights tive TAP and variational Cox approximations on a typical Jij become correlated, the TAP and Bethe approxima- configuration with J0 = 1. (b) Average fit value z for each of tions deteriorate much faster than the variational Cox the approximations, assessed over 30 configurations of matrix J picked according to eq. (32), for each tested value of J0. Fit measure is also provided for the optimal Cox distribution (see text). The vertical dashed line corresponds to the refer- 2 Looking in more detail, a notable qualitative difference exists be- tween the different approximations, at least in the ferromagnetic ence values of Figure 2. (c) Apparition of multiple solutions phase. In the TAP and Bethe approximations, the symmetric to the respective fixed point equations. For each value of J0 solution with m = 0 becomes unstable [4, 6], and the linear and approximation considered, I plot the percentage of the response prediction for C at this point diverges – as visible in 30 configurations in which multiple solutions were found (see Figure 3(a). In the adaptive TAP and variational Cox approx- Appendix A, Numerical procedures). imations, this symmetric solution remains stable and coexists with the magnetized solutions (as in Figure 1(b)), and the pre- diction for C is progressively degraded, rather than totally lost – see Figure 3(a). 9

FIG. 4. Transition from paramagnetic to spin glass phase, FIG. 5. Introduction of sparseness in the coupling weights, when parameter J is increased. Other parameters keep their when parameter p is decreased. Other parameters keep their reference value (J0, κ, p) = (0.5, ∞, 1). (a) TAP, Bethe, adap- reference value (J, J0, κ) = (0.5, 0.5, ∞). (a) TAP, Bethe, tive TAP and variational Cox approximations on a typical adaptive TAP and variational Cox approximations on a typ- configuration with J = 1. (b), (c) : same as Figure 3. ical configuration with p = 0.01. (b), (c) : same as Figure 3. and adaptive TAP approximation – which was designed precisely for this purpose [33]. This numerical study confirms the adaptive TAP’s in- the latter’s systematic bias in ‘easy’ configurations, lead- terest as a ‘universal’ mean-field method, the most effi- ing to higher fit values z. In summary, a ‘mean field do- cient in all tested regimes with dense couplings, and sec- main’ can be defined as the ensemble of parameters (h, J) ond most efficient in case of sparse couplings (the Bethe for which the adaptive TAP method efficiently predicts approximation performing marginally better). It also re- the spin moments (m, C), and this is also the domain veals similar domains of validity for the adaptive TAP where the Ising distribution can be easily replaced by a and variational Cox approximations – notwithstanding Cox approximation, thanks to a variational principle. 10

the optimal Cox distribution Q(µ?, Σ?) defined above. My measure of fit in this case consisted in comparing ? the true spin covariances Cij to their values in the op- timal Cox distribution, as given by eq. (21). Thus, a successful fit indicates when assuming a Gaussian shape for the latent field distribution P (r), without modifying its moments, does not modify much the resulting spin moments. Figures 3-6(b) show that this measure globally corre- lates with the efficiency of the adaptive TAP and varia- tional Cox approximations, i.e, it also deteriorates out- side of the ‘mean field domain’. Hence, the increased discrepancy between Ising and Cox distributions outside of the ‘mean field domain’ seems intrinsically related to the Ising latent field P (r) becoming non-Gaussian. This is also coherent with the fact that the adaptive TAP ap- proximation is bound to fail precisely when the instan- taneous field acting on each spin cannot be considered Gaussian (see Appendix C). Nonetheless, in quantitative terms, the loss of fit by the optimal Cox distribution is never total. Figure 7 shows examples of fit performance for the optimal Cox distribu- tion in various configurations at the boundary (first row, compare to Figures 3-6(a)) and far outside (second row) of the mean field domain. It reveals that, even when the Ising latent field P (r) is far from being Gaussian- distributed, its replacement by a Gaussian preserves the overall pattern of spin correlations (as visible in Figure 1(b)). This global correctness is hardly reflected in the magnitude of error z, yet it does imply that an Ising model is never ‘too far’ away from its optimal Cox dis- tribution.

VI. DISCUSSION

I have proposed a reformulation of the Ising distribu- tion as a latent variable model, and used it to derive prin- cipled approximations by the simpler Cox distribution. In practical applications, Cox models (including the di- chotomized Gaussian, and Cox point processes) are often preferred to the corresponding maximum entropy distri- butions (Ising distribution, Gibbs point process) because FIG. 6. Introduction of stochastic dependence between the they are easier to sample, and to parametrize from a set coupling weights, when parameter κ is decreased. Other pa- of observed moments. This article establishes a simple rameters keep their reference value (J, J0, p) = (0.5, 0.5, 1). (a) TAP, Bethe, adaptive TAP and variational Cox approxi- analytical connection between the two families of mod- mations on a typical configuration with κ = 1. (b), (c) : same els, and investigates under what conditions they can be as Figure 3. used interchangeably. The most natural connection between an Ising and a Cox distribution is obtained by equating the two first moments of their latent variables, eq. (10)-(11), a sim- Optimal Cox distribution ple but fundamental result of the article. The result- ing ‘optimal’ Cox approximation holds well in paramag- As such, the above results do not explicitly tell whether netic conditions, and even beyond, as far as global trends the Ising distribution can be approximated by a Cox dis- are concerned (Figure 7). However, eq. (10)-(11) involve tribution outside of the ‘mean field domain’. It may be both the natural parameters (h, J) and the resulting mo- the case that a decent Cox approximation exists, but can- ments (m?, C?) of the Ising distribution, which makes not be retrieved by a variational principle anymore. To them unpractical in most concrete situations. clarify this point, I also considered the fit performance of To target a Cox approximation given only some natu- 11

FIG. 7. Fit for the optimal Cox distribution on typical configurations at the boundary (first row) and far outside (second ? row) of the paramagnetic SK phase. Each panel plots the pairwise spin covariances in the true Ising distribution (Cij ) against their values in the optimal Cox distribution (Cij ), computed with eq. (21). Starting from reference parameter values (J0, J, κ, p) = (0.5, 0.5, ∞, 1), the approximation is assessed at different values of J0 (first column, transition to ferromagnetic SK model), J (second column, transition to spin glass SK model), κ (third column, introduction of correlations between coupling weights Jij ), and p (fourth column, dilution of connectivity).

ral parameters (h, J), I have explored a classic variational Jii. approach, leading to eq. (14)-(18). These equations are This suggests that nonzero self-couplings J may be not particularly interesting as a mean-field method for ii responsible for the systematic bias of the Cox approxima- predicting (m, C), since they globally behave as a biased tion in its prediction of (m, C), compared to the adaptive version of Opper and Winther’s adaptive TAP method. TAP approximation. As a direct confirmation, I observed However, their analytical (Section IV) and numerical that when matrix J is given a zero diagonal, the vari- (Section V) analysis allows to formulate the key conclu- ational Cox equation (14)-(18) generally retains a solu- sion of this article : mean-field methods are efficient pre- tion, and it is then remarkably close to the adaptive TAP cisely when the Ising distribution can be associated to a solution (see Supplementary Material). Unfortunately, quasi-normal latent field distribution. nonzero weights J are required to endow the field vari- If the practical goal is to establish a Cox approxima- ii ables r with a true, multivariate distribution P (r) (this tion for the Ising distribution of parameters (h, J), the i is not the case in the cavity method), and thus produce moments (m, C) may as well be estimated with any other a concrete approximation of the Ising distribution by a choice of mean field method, and then input to eq. (14)- simpler latent variable distribution. (15) to produce the corresponding Cox approximation. Naturally, the relation to mean field methods is not A disturbing consequence is that there is not one la- accidental. From eq. (8), given a spin configuration s, tent field distribution associated to the Ising distribution, P but many different distributions, depending on the value the latent field ri is distributed as N (hi + j Jijsj,Jii). In particular, if self-couplings Jii are zero as in the classic given to self-couplings Jii. When all values Jii are taken Ising model, we simply recover very large, the latent field P (r) is ‘useless’ : it is the mere mixture of 2N Gaussian bumps with no overlap, located X N ri = hi + Jijsj, on the 2 summits of a hypercube, and the bump at sum- j6=i mit s is simply associated to weight P (s) (as in Figure N 1(b)). Then, as the Jii become smaller, some of the 2 that is, the instantaneous field variable considered in bumps start overlapping, and P (r) acquires a less triv- the cavity method (Appendix C, eq. (C1)) and adaptive ial overall distribution. In some cases, when the Jii are TAP equations. Thus, the latent field formalism differs made small enough, the overall shape of P (r) becomes from the classic cavity approach only through the role of quasi-Gaussian (as in Figure 1(a)). In other cases, P (r) nonzero self-couplings Jii. Coherently, the weak coupling never becomes quasi-Gaussian, because a lower bound expansions of the variational Cox and adaptive TAP ap- is reached where the Jii cannot be made smaller while proximations up to order 4 differ only because of nonzero ensuring that matrix J remains definite positive. 12

Given some classic Ising parameters {Jij}i

Appendix B: Variational mean field approximations TAP and Plefka approximations

In this appendix, I recall the variational approach of In so-called Plefka expansions, one approximates mean field theory, which can be used to recover the Plefka G(m, J) by its Taylor expansion in J around J = 0 : expansion of the exact Ising model, as well as the TAP 1 and Bethe approximations. G (m, J) :=G(m, 0) + [∂ G(m, 0)](J) + [∂2G(m, 0)](J, J) n J 2 J 1 + ··· + [∂nG(m, 0)](J,..., J), Mean field variational approach n! J where ∂nG(m, 0) is the n-th derivative of G wrt J (a sym- As in the main text, let us note P (s) the true Ising J metric tensor of order n) evaluated at point (m, 0). All distribution of parameters (h, J), and (m?, C?) its cor- these terms can be evaluated, albeit laboriously. First, responding moments. Let Q(s) any other distribution 2 the fundamental relation ∂J ln ZI = mimj + ∂ ln ZI proposed as an approximation of P . The KL divergence ij θi,θj KL(Q||P ) = P Q(s) ln(Q(s)/P (s)) can be expressed as allows to replace derivatives wrt. J by derivatives wrt. s θ. Second, at J = 0, the Ising distribution boils down to

KL(Q||P ) = U(Q) − S(Q) + ln ZI (h, J), a Bernoulli distribution, where all derivatives wrt. θ are fully tractable. where U(Q) is the average Ising energy, and S(Q) the The approximation at order n = 4 is a classic compu- entropy, under distribution Q : tation [27, 40–42], which yields :

X  1  X X 1 X U(Q) := − Q(s) h>s + s>Js , G (m, J) = G(m ) − J m m − J 2 c c 2 4 i ij i j 2 ij i j s i (ij) (ij) X S(Q) := − Q(s) ln Q(s). X 2 X 3 − JijJjkJkicicjck − J micimjcj s 3 ij (ijk) (ij) X The functional G(Q) := U(Q) − S(Q) is called the vari- − J J J J c c c c ational (or Gibbs) free energy of distribution Q as an ij jk kl li i j k l (ijkl) approximation of P . Smaller values of G(Q) correspond X 2 to a lower KL divergence and thus to a better fit, and its − 2 JijJjkJkimicimjcjck minimum is achieved for Q = P . (ij),k In the Ising distribution, given fixed couplings J, there 1 X + J 4 c c (1 + 3m2 + 3m2 − 15m2m2). is a one-to-one correspondence between values of h and 12 ij i j i j i j resulting values of m = E(s). Thus, in theory, we can (ij) apply the variational approach to the family of Ising dis- (B3) tributions Q(m, J), where couplings J are taken equal to those in P , and m constitutes the N-dimensional Here, (ij), (ijk), (ijkl) indicate respectively all un- parametrization variable. The natural field parameters ordered pairs, triplets and quadruplets of distinct spins. of Q, say θ, are then a (generally intractable) function of G(mi) is the free energy of each 1-spin marginal distri- −1 (m, J), and the associated free energy writes bution (eq. (B1) with θi = tanh (mi)). Finally, we use 2 the shorthand ci = 1 − mi . > G(m, J) = (θ − h) m − ln ZI (θ, J), (B1) By differentiating this function wrt. m, we obtain a fixed point characterization of its extremum(s) m. Stop- whose minimum over m is obtained when m = m?, that ping at order 1 in J yields the “naive” mean field equa- is, when Q = P . (Note that G also depends on the field tion. Stopping at order 2 (first line) yields the TAP equa- parameter h of distribution P , but I omit it for lighter tion. Stopping at order 3 (two first lines) yields eq. (29) notations.) from the main text. Using the classic conjugacy relation ∂θ ln ZI = m in the Ising model, one can note that Bethe approximation ∂mG(m, J) = θ − h, (B2) so function G(m, J) + h>m corresponds to the Legendre In one particular case, G(m, J) in eq. (B1) is tractable exactly. This is when the underlying couplings Jij define transform of ln ZI in its first variable. Function G(m, J) is not tractable in general, but it can a tree topology, that is, they are zero except on a subset be approximated – and the resulting minimum will yield of the edges defining a graph without loops. In that case, an approximation of the true moments m?. This ap- the test Ising distribution Q(m, J) can be written as proach is known as the mean field variational method, of Y Q(si, sj) Y which the TAP and Bethe approximations are two promi- Q(s|m, J) = Q(si), (B4) Q(si)Q(sj) nent examples. hiji i 14 where hiji denotes all edges in the tree. This can be The covariances Cij can be approximated in turn, proved by repeated applications of Bayes’ formula, start- based on the linear response formula, eq. (22). Ap- ing from any leaf of the tree. Besides, each marginal plied to each 2-spin distribution of natural parameters Q(s , s ) is a 2-spin Ising distribution with coupling pa- (j) (i) i j (θi , θj ,Jij), it implies the differential equality rameter Jij, as proved directly by integrating out the remaining variables from the original Ising formula. 2 (j) (i) ∀(i, j), (∂mi) = (1 − mi ) ∂θ + cij ∂θ (B9) In consequence, the (exact) free energy writes i j with c given by eq. (B7). We can then differentiate X X ij GB(m, J) = G(mi, mj,Jij) − (N − 2) G(mi), eq. (B6) as a function of (∂hi) and linearly eliminate the i

2 2 ti + tjtij (1 − ti )(1 − tj ) mi = , cij = tij 2 , (B7) 1 + titjtij (1 + tijtitj) Appendix C: Cavity method and adaptive TAP (j) (i) equations with tij = tanh(Jij), ti = tanh(θi ), tj = tanh(θj ). Inserting this expression for mi into eq. (B6), with −1 1 Cavity method tanh (m) = 2 (ln(1 + m) − ln(1 − m)), and after some linear recombinations, we obtain The cavity method is a classic approach allowing to (j) X (i) recover many analytical properties of the Ising model, ∀(i, j), θ = h + tanh−1 t tanh(θ ), (B8) i i ik k and other multivariate exponential models [4, 5]. Singling k6=j out an arbitrary spin location i, one can rewrite eq. (1) as and the optimum is now characterized by N(N −1) equa- tions over the N(N − 1) cavity variables. This switching  X  P (s\i, si) ∼ P\i(s\i) exp si hi + Jijsj , from N principal variables (the mi) to N(N − 1) auxil- (j) j6=i iary variables (the θi ) can also be interpreted as a dual Lagrangian optimization procedure [31]. where s\i denotes the remaining N − 1 spins, and the In a tree-like topology, eq. (B8) can be solved iter- so-called cavity distribution P\i is the Ising distribution atively starting from any leaf of the tree, allowing to obtained by deleting line and column i from (h, J). The (j) remaining spins interact with i only through the random recover the exact values for all the θi , and thus, for ? variable magnetizations mi . The resulting algorithm is known as belief propagation, or sum-product. In a general topol- X r˜ := h + J s , (C1) ogy, the fixed point approach to characterize solutions of i i ij j eq. (B8) is known as loopy belief propagation. It is not j6=i guaranteed to have a single solution anymore – and it and we can write only characterizes an approximation for the magnetiza- tions mi. P (˜ri, si) ∼ P\i(˜ri) exp(sir˜i), (C2) 15

P\i(˜ri) indicating the distribution of variabler ˜i when the coupling matrix writes αJ. Given the form of eq. (C4), spins s\i follow the cavity distribution P\i. this only requires to obtain the expansion for miVi or, In general, distribution P\i(˜ri) is not tractable exactly after a convenient rescaling, for variable 3. But in many circumstances, sincer ˜ is the sum of vari- i 2 ables with many degrees of freedom, it can be assumed xi := Vi(1 − mi ), to have a normal distribution : for which we want to establish the Taylor development

P\i(˜ri) = N (˜ri|θi,Vi), [1] 2 [2] 3 [3] xi = αxi + α xi + α xi + ... and eq. (C2) becomes, approximately : (note that x = V = 0 when α = 0).   i i 1 2 From eq. (C5), it is clear that the solution depends on P (˜ri, si) ∼ exp − (˜ri − θi) + sir˜i . (C3) 2Vi the diagonal of J only through the simple offset Vi → Vi+ Jii, so we may assume Jii = 0 without loss of generality. This equation is the starting point of the classic cavity Introducing the variables method. Note the formal similarity with our definition for the joint probability of spins and latent fields, eq. (4), 1 − m2 d := i = (1 − m2)1 − x + x2 − x3 + ... , so we can simply recycle our results. From eq. (7), P (s ) i 2 i i i i i 1 + (1 − mi )Vi is the Bernoulli distribution of parameter θi, so mi = tanh(θi). From eq. (14), we have E(˜ri) = θi+Vimi. Taken and matrix D = diag(di), we rewrite eq. (C5) as together, this yields the generalized TAP equation : C−1 = D−1 − (αJ) X m = tanh h + J m − m V . (C4) i i ij j i i −1 j and thus, after a classic switching from C to C :

For the SK model in the limit N → ∞, it can be C = D + αDJD + α2D(JD)2 + α3D(JD)3 + ..., shown that the different summands tor ˜i in the cavity distribution (eq. (C1)) become linearly independent [4, allowing to easily express the development of C from that P 2 2 of d . 6], so the variance writes Vi ' J (1 − m ) and we i j ij j 2 recover the classic TAP equation. Then, noting ci := 1−mi for concision, the fixed point equation (C6) imposes, at order 4 :

 2 3 4 Adaptive TAP approximation 1 = 1 − xi + xi − xi + xi X   + α2 c J 2 c 1 − 2x + 3x2 − x + 2x x + x2 Instead, in the adaptive TAP method [33, 34], Vi is i ij j i i j i j j left as a free variable which can adapt to any statisti- j cal structure of the couplings J. First, differentiating 3 X   + α ciJijcjJjkckJki 1 − 2xi − xj − xk eq. (C4) wrt. h (but neglecting the dependency of Vi j,k itself) leads to a linear response prediction for C : 4 X 4 + α ciJijcjJjkckJklclJli + o(α ). −1 2  δij j,k,l (CA )ij = 1 + (1 − mi )Vi 2 − Jij. (C5) 1 − mi We can then replace xi by its expansion, and regroup the Second, given magnetization mi, the individual variance powers of α. For the equation to be verified at order 1, 2 of spin i should be 1 − mi . Self-coherence of the variance this imposes that prediction imposes that x[1] = 0. 2 i 1 − mi = (CA)ii. (C6) Using this newly found value, the fixed point equation at Taken together, eq. (C4)-(C6) constitute a system on order 2 imposes variables {mi,Vi}, which can be solved by classic iter- ative methods [34]. mi [2] X 2 xi = mi Jijcj. ci I now turn to the weak coupling (Plefka) expansion of j6=i eq. (C4)-(C6), when magnetizations m are fixed, and the i Then, the fixed point equation at order 3 imposes

mi [3] X x = 2m J J J c c . c i i ij jk ki j k 3 i Except when the couplings Jij have a tree-like topology. In this (jk|i) case, P\i(˜ri) is a factorized product over the neighboring spins of i, and this is another way of deriving the Bethe equation (B8) where (jk|i) denotes all unordered triplets of the form [32]. {i, j, k} with j and k distinct, and distinct from i. By 16 inserting these values into eq. (C4), we recover the ex- tizations mi are fixed. More precisely, noting pansion from the main text, eq. (30). −1 Finally, the fixed point equation at order 4 yields ∆i := µi − tanh (mi),

mi [4] X X 4 2 xi = 2mi JijJjkJklJlicjckcl − mici Jijcj . our purpose is to find the parameters in the following ci (jkl|i) j6=i Taylor expansions : (C7)

Note that the sum over (jkl|i) – which involves the most [1] 2 [2] 3 [3] terms and is generally dominant – is identical to that for ∆i = α∆i + α ∆i + α ∆i + ... the true Ising expansion : see eq. (B3). [1] 2 [2] 3 [3] Σij = αΣij + α Σij + α Σij + ...

When inserted into eq. (D4), the development of ∆i will exactly provide the desired Plefka expansion. Appendix D: Weak coupling expansion for the Cox approximation Development for equation (D1)

We consider the ‘variational’ Cox approximation, so- Equation (D1) writes lution to the equations Z  −1 p  Z  p  mi = tanh tanh (mi) + ∆i + x Σii φ(x)dx, mi = tanh µi + x Σii φ(x)dx, (D1) x∈R x∈ R √ Z   p  d = 1 − tanh2 µ + x Σ φ(x)dx, where ∆i is of leading order α, and Σii is of leading i i ii 1/2 x∈R order α . Applying the Taylor development of tanh : (D2) −1 −1 −1  2 h 2 i (C )ij = di δij − αJij. (D3) tanh tanh (m) + X =m + (1 − m ) X − mX + ... µ = h + αJm, (D4) √ 2 1/2 Σ = αJ + α JCJ, (D5) up to order 6 (because Σii is of leading order α ), and using the classic integration formulas : when magnetizations mi are fixed, and the coupling ma- trix writes αJ, α being the small parameter of the ex- ( Z (n − 1)(n − 3) ... if n is even pansion. xnφ(x)dx = (D6) Here, I detail the computation up to order 3, and also x 0 if n is odd provide the result at order 4. The overall structure of the computation is largely similar to that for the adaptive we obtain TAP approximation, in the previous paragraph. When α = 0, the solution is obvious : couplings αJ h m =m + (1 − m2) ∆ − m ∆2 + Σ  vanish, and so does the covariance matrix Σ. The Cox i i i i i i ii distribution Q(µ, Σ) is simply a Bernoulli distribution 2 1  3  + mi − ∆i + 3∆iΣii B(µ) with µ = h, and the fixed point equations impose 3 + − m3 + 2 m ∆4 + 6∆2Σ + 3Σ2  that i 3 i i i ii ii 4 2 2  3 2  + m − m + 10∆ Σii + 15∆iΣ −1 i i 15 i ii µi = tanh (mi), 5 4 3 17  2 2 3 i 3 2 + − mi + 3 mi − 45 mi 45∆i Σii + 15Σii + o(α ). di = 1 − mi . We now seek a Taylor expansion for the solution of Notice that, after integration by the Gaussian kernel, eq. (D1)-(D5) when α is small but nonzero, and magne- only integer powers of Σii remain.

For this equation to be verified, the term inside square brackets must be equal to zero up to order α3. Expanding [k] [k] ∆i and Σii with the shorthand Xk = ∆i , Yk = Σii , and regrouping the powers of α, we obtain : h i 0 = α X1 − miY1 (D7)

2h 2 2 3 2i + α X2 − mi(X1 + Y2) + (3mi − 1)X1Y1 + (−3mi + 2mi)Y1 (D8)

3h 2 1 3 2 2 + α X3 − mi(2X2X1 + Y3) + (mi − 3 )(X1 + 3X2Y1 + 3X1Y2) + (−6mi + 4mi)(X1 Y1 + Y2Y1)

4 2 2 5 3 17 3i 3 + (15mi − 15mi + 2)X1Y1 + (−15mi + 20mi − 3 mi)Y1 +o(α ). (D9) 17

Solution at order 2 Solution at orders 3 and 4

[1] At this point, we can readily find the two first orders The coefficient Cij found in eq. (D10) is pasted into of the solution. Indeed, we have Σ = αJ + α2JCJ and Σ = αJ + α2JCJ, to obtain : 2 Cij = (1 − mi )δij + o(1), and so [3] X 2 2 Σij = JikJklJlj(1 − mk)(1 − ml ). (D11) [1] k6=l Σij = Jij, [2] X 2 Then, line (D9) allows to find the order 3 coefficient of Σij = JikJjk(1 − mk). k µ. After computation, this gives :

[3] X 2 2 Then, at order 1, the fixed point equation above (line ∆i = 2mi JijJjkJki(1 − mj )(1 − mk) (D7)) imposes that (jk|i) 2 1 3 2 − 2(mi − 3 )Jiimi(1 − mi ), ∆[1] = m Σ[1] = m J . i i ii i ii where (jk|i) denotes all unordered triplets of the form {i, j, k} with j and k distinct, and distinct from i. Using this new value, the fixed point equation at order Using the values found for ∆[1], ∆[2] and ∆[3], and the 2 (line (D8)) imposes that P i i i fact that µi = hi + α j Jijmj, yields eq. (31) from the main text. [2] [1] 2 [2] 2 [1] [1] ∆i = mi (∆i ) + Σii − (3mi − 1)∆i Σii + ... Pushing all computations one order further, with the X 2 2 help of the computer algebra system MAXIMA, yields : = mi Jij(1 − mj ), j6=i [4] X X 4 2 ∆i = 2mi JijJjkJklJlicjckcl − mici Jijcj which is identical to the order 2 coefficient in the exact (jkl|i) j6=i 2 2 X 2 Ising model – see eq. (29). Note that the diagonal terms − 2mici(3mi − 1)Jii Jijcj Jii are nonzero and an active part of the derivation, but j6=i cancel out in the final result, so they play no role in the − 2m c (7m4 − 8m2 + 5 )J 4 expansion up to order 2. i i i i 3 ii X 2 2 2 2 − 2mi JijJjjmj cj , (D12) j6=i

Development for C 2 with the shorthand ci = 1 − mi . The two first terms are identical to the adaptive TAP expansion, eq. (C7). The In general, to establish the development of Σ at any remaining terms involve the diagonal weights Jii, and given order n, we need the expansion of C up to order would be absent if the coupling matrix was such that n−2, because Σ = αJ+α2JCJ. Thus, to expand µ and diag(J) = 0. Σ at order 3, we must first establish the development for C at order 1, based on the development of (µ, Σ) at order 1 established above. Regarding eq. (D2), a very similar computation (de- Appendix E: Approximate formulas for the Cox velopment of tanh at order 2, and simplification by the distribution integral formulas of eq. (D6)) yields : The formulas inherent to the Cox distribution, eq. (13), 2 2 2 (16), (17) and (21) from the main text, are easily esti- di = (1 − m ) − α(1 − m ) Jii + o(α). i i mated by numerical integration (for example, Simpson Introducing matrix D = diag(d ), eq. (D3) writes quadrature). But the overall computation time quickly i becomes forbidding, as these estimations must be done for each pair of spins, and on many iterations to target C−1 = D−1 − (αJ), the fixed point. Hence, I found it more convenient to use approximate and thus, after a classic switching from C−1 to C : formulas. Let us note L the logistic function at scale 1/2, that is : C = D + αDJD + o(α). 1 L(r) := . So finally : 1 + e−2r Function L is pivotal in the Bernoulli distribution, since 2 r Cij = δij(1 − mi ) R tanh(r) = 2L(r) − 1, log 2 cosh(r) = 2 0 L(u)du − r, and 2 2 2 0 + α(1 − δij)(1 − mi )(1 − mj )Jij + o(α). (D10) 1 − tanh (r) = 2L (r). 18

I suggest to approximate L by the following combina- (or equivalently eq. (20)), can be approximated as tion of Gaussian functions : app  2 0   r   r  mi = 2 Φ(xiki) + ν(1 − li )φ (xili) − 1 (E2) Lapp(r) = Φ + νφ0 , (E1) κ λ app with a guaranteed maximum error |mi − m | < 0.002. R x i with Φ(x) := −∞ φ(u)du the standard normal cumula- The variance term di in eq. (17) is approximated as tive distribution, and φ0(r) = −rφ(r). Taking parameters (κ, ν, λ) ' (0.7072, 0.1648, 0.9712), app −1/2  2 00  app di = 2Σii kiφ(xiki) + ν(1 − li )liφ (xili) one has kL − L k∞ < 0.001 on the whole real line. The approximation also applies to the primitive, with app R app with a guaranteed maximum error |di − di | < 0.007. k (L − L )k∞ < 0.001, and to the first derivative, 0 app 0 To concretely estimate the free energy associated with kL − (L ) k∞ < 0.0033. to eq. (13), it is necessary to compute Fi := As the convolution product of two Gaussian functions R √ remains Gaussian, this replacement allows to compute x log 2 cosh(µi + x Σii)φ(x)dx. It is approximated as analytically all the formulas. Here, I only provide the re-  00  sults, and refer to Supplementary Material for the deriva- app p φ(xiki) φ (xili) Fi = µi [2Φ(xiki) − 1] + 2 Σii − ν tion. ki li Given spin index i, let us introduce the following re- duced quantities : with a guaranteed maximum error inferior to 0.002. µ The approximate formula for the covariance of the Cox x := √ i , i Σ distribution, eq. (21), is quite bulky and provided in Sup- √ii plementary Material. It has guaranteed maximum error Σii inferior to 0.008. ki := √ , κ2 + Σ √ ii In my numerical tests, using these approximate for- Σ mulas instead of lengthier Simpson quadrature yielded l := √ ii . i 2 no noticeable difference in the final solution of the fixed λ + Σii point equations. At the same time, computation times Then, the first moment of the Cox distribution, eq. (16) were cut by (up to) two orders of magnitude.

[1] D. R. Cox, Applied statistics , 113 (1972). [14] Y. Roudi, J. Tyrcha, and J. Hertz, Physical Review E [2] L. P. Zhao and R. L. Prentice, Biometrika 77, 642 (1990). 79, 051915 (2009). [3] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, in [15] S. Cocco and R. Monasson, Physical review letters 106, Readings in Computer Vision (Elsevier, 1987) pp. 522– 090601 (2011). 533. [16] K. Pearson, Biometrika 7, 96 (1909). [4] M. M´ezard,G. Parisi, and M. Virasoro, Spin glass theory [17] D. R. Cox and N. Wermuth, Biometrika 89, 462 (2002). and beyond, Vol. 9 (World Scientific Publishing Company, [18] S.-i. Amari, H. Nakahara, S. Wu, and Y. Sakai, Neural 1987). computation 15, 127 (2003). [5] M. Opper and D. Saad, eds., Advanced mean field meth- [19] J. H. Macke, M. Opper, and M. Bethge, Physical Review ods : Theory and practice (MIT Press, 2001). Letters 106, 208102 (2011). [6] H. Nishimori, Statistical physics of spin glasses and infor- [20] D. R. Cox, Journal of the Royal Statistical Society. Series mation processing: an introduction, Vol. 111 (Clarendon B (Methodological) , 215 (1958). Press, 2001). [21] D. J. Daley and D. Vere-Jones, An introduction to the [7] M. Weigt, R. A. White, H. Szurmant, J. A. Hoch, and theory of point processes, volume I : Elementary theory T. Hwa, Proceedings of the National Academy of Sciences and methods (Springer Science & Business Media, 2003). 106, 67 (2009). [22] D. R. Cox, Journal of the Royal Statistical Society. Series [8] E. Schneidman, M. J. Berry II, R. Segev, and W. Bialek, B (Methodological) , 129 (1955). Nature 440, 1007 (2006). [23] J. Møller, A. R. Syversveen, and R. P. Waagepetersen, [9] I. E. Ohiorhenuan, F. Mechler, K. P. Purpura, A. M. Scandinavian journal of statistics 25, 451 (1998). Schmid, Q. Hu, and J. D. Victor, Nature 466, 617 [24] P. J. Diggle, P. Moraga, B. Rowlingson, and B. M. Tay- (2010). lor, Statistical Science , 542 (2013). [10] K. Hukushima and K. Nemoto, Journal of the Physical [25] M. Krumin and S. Shoham, Neural computation 21, 1642 Society of Japan 65, 1604 (1996). (2009). [11] F. Wang and D. P. Landau, Physical review letters 86, [26] R. Brette, Neural computation 21, 188 (2009). 2050 (2001). [27] A. N. Vasil’ev and R. Radzhabov, Theoretical and Math- [12] H. J. Kappen and F. d. B. Rodr´ıguez,Neural Computa- ematical Physics 21, 963 (1974). tion 10, 1137 (1998). [28] C. M. Bishop, Pattern recognition and machine learning [13] V. Sessak and R. Monasson, Journal of Physics A: Math- (Springer Verlag, New York, USA, 2006). ematical and Theoretical 42, 055001 (2009). [29] S. Boyd and L. Vandenberghe, Convex optimization 19

(Cambridge university press, 2004). eral 15, 1971 (1982). [30] D. J. Thouless, P. W. Anderson, and R. G. Palmer, [36] F. Ricci-Tersenghi, Journal of Statistical Mechanics: Philosophical Magazine 35, 593 (1977). Theory and Experiment 2012, P08015 (2012). [31] J. S. Yedidia, W. T. Freeman, and Y. Weiss, in Advances [37] D. Sherrington and S. Kirkpatrick, Phys. Rev. Lett 35, in neural information processing systems (2001) pp. 689– 1792 (1975). 695. [38] J. J. Hopfield, Proceedings of the national academy of [32] M. M´ezardand G. Parisi, The European Physical Jour- sciences 79, 2554 (1982). nal B-Condensed Matter and Complex Systems 20, 217 [39] A. Decelle and F. Ricci-Tersenghi, Physical Review E 94, (2001). 012112 (2016). [33] M. Opper and O. Winther, Physical Review Letters 86, [40] A. Georges and J. S. Yedidia, Journal of Physics A: Math- 3695 (2001). ematical and General 24, 2173 (1991). [34] M. Opper and O. Winther, Physical Review E 64, 056131 [41] K. Nakanishi and H. Takayama, Journal of Physics A: (2001). Mathematical and General 30, 8085 (1997). [35] T. Plefka, Journal of Physics A: Mathematical and gen- [42] T. Tanaka, Physical Review E 58, 2302 (1998).