<<

SELF-GUIDED BELIEF PROPAGATION – A HOMOTOPY CONTINUATION METHOD 1 Self-Guided Belief Propagation – A Homotopy Continuation Method

Christian Knoll, Associate Member, IEEE, Adrian Weller, and Franz Pernkopf, Senior Member, IEEE

Abstract—Belief propagation (BP) is a popular method for performing probabilistic inference on graphical models. In this work, we enhance BP and propose self-guided belief propagation (SBP) that incorporates the pairwise potentials only gradually. This homotopy continuation method converges to a unique solution and increases the accuracy without increasing the computational burden. We provide a formal analysis to demonstrate that SBP finds the global optimum of the Bethe approximation for attractive models where all variables favor the same state. Moreover, we apply SBP to various graphs with random potentials and empirically show that: (i) SBP is superior in terms of accuracy whenever BP converges, and (ii) SBP obtains a unique, stable, and accurate solution whenever BP does not converge.

Index Terms—Graphical models, belief propagation, probabilistic inference, sum-product , partition function, inference . !

1 INTRODUCTION

OMPUTING the marginal distributions and evaluating the par- tentials often reduce accuracy and worsen the convergence proper- C tition function are two fundamental problems in the context ties [21] inspired us to construct a homotopy; i.e., we first consider of probabilistic graphical models. Both problems can be solved only local potentials (where BP is exact) and subsequently modify efficiently for -structured models but are NP-hard if the graph- the model by gradually increasing the pairwise potentials to their ical model contains loops [7]. Since loops are present in many given values. SBP thus solves a deterministic sequence of models problems of relevance there is a need for efficient approximation that iteratively refines the Bethe approximation towards a solution methods. that is uniquely defined by the initial model. Belief propagation (BP) provides a way to approximate the We empirically evaluate SBP for grid-graphs, complete graphs, marginal distribution and may be viewed as an approach to try and random graphs with random potentials. Compared to BP, we to minimize the Bethe free energy B [61]. BP has a long observe superior performance in terms of accuracy; in fact, SBP success story in many applications, includingF computer vision, achieves more accurate results than Gibbs sampling in a fraction speech processing, social network analysis, and error-correcting of run-time. Additionally, SBP exhibits favorable convergence codes [19], [25], [35]. But, despite its success, BP does not always properties and excels for hard problems, for which – despite approximate the marginals well and may even fail to converge. BP failing to converge – SBP provides accurate results. We Reasons for BP’s failure include the existence of multiple fixed also analyze SBP theoretically and demonstrate optimality of the points of varying accuracy (where it depends on implementation selected fixed point for attractive models with unidirectional local details to which one BP converges) and of unstable fixed points potentials.1 that cause BP to oscillate far away from any fixed point [16], [33], It is worth emphasizing that SBP is a relatively straightforward [51]. These limitations motivate the search for modifications of modification to standard BP that can be easily added to existing BP that have better convergence properties and find more accurate BP implementations. We thus expect that the ease of use lowers marginals. the hurdle for practical applications. One alternative is to perform a different optimization problem The paper is structured as follows: Section 2 provides back- arXiv:1812.01339v2 [stat.ML] 19 Mar 2021 that minimizes the Bethe free energy B, instead of applying ground on probabilistic graphical models, belief propagation, and BP [61]. While, for certain problemF classes, polynomial-time methods that minimize the free energy. Moreover, we present a algorithms exist, it is even problematic to approximate the global unified taxonomy of model-classes according to their complexity. minimum for models with arbitrary potentials [5], [42], [55]. Our proposed algorithm is presented in Section 3. We evaluate Hence, the pursuit for methods that approximate the marginals SBP and discuss empirical observations in Section 4 and provide well – ideally with both run-time and convergence guarantees – is a more formal analysis in Section 5. Finally, we conclude the paper still ongoing. in Section 6. In this work, we present self-guided belief propagation (SBP) to help address this gap. The observation that strong pairwise po- 2 BACKGROUND In this section, we briefly introduce probabilistic graphical models • Christian Knoll ([email protected]), and Franz Pernkopf and specify the models considered in this work. We further ([email protected]) are with the Signal Processing and Speech Com- munication Laboratory, Graz University of Technology. 1. An attractive model is a probabilistic where all pairwise • Adrian Weller ([email protected]) is with the University of Cambridge interactions are specified by positive couplings; a general model also contains and the Alan Turing Institute. pairwise interactions with negative couplings. Unidirectional local potentials are potentials that all favor the same state (see Section 2.2). SELF-GUIDED BELIEF PROPAGATION – A HOMOTOPY CONTINUATION METHOD 2

introduce the BP algorithm and its connection to the Bethe approx- − + + + imation. Finally, we summarize important properties regarding the

solution space of BP as a foundation for our theoretical analysis + + − + of SBP.

− + + + 2.1 Probabilistic Graphical Models (a) (b) (c) (d) Let us consider an undirected graph = (X, E), where G Fig. 1. Illustration of the model-classes specified in Section 2.2.1: (a) X = X1,...,XN is the set of nodes, and E is the set of { } frustrated; (b) balanced, and (c) its equivalent attractive model; and (d) undirected edges. Then, two nodes Xi and Xj are joined by an unidirectional model. Solid lines depict attractive edges and dashed lines edge if (i, j) E; note that we consider each edge only once, depict repulsive ones. The signs in the vertices are equal to the signs of ∈ i.e., (i, j) = (j, i). We further denote the set of neighbors of Xi the corresponding local fields θi. by N(i) = X X :(i, j) E . { j ∈ ∈ } Let us define a probabilistic graphical model = ( , Ψ), X U G χij = (Xi,Xj) = xixjP (xi, xj). (4) where Ψ = Φ(x1),..., Φ(xK ) is the set of all K potentials E and X is the{ set of random variables.} In this work we focus on xi,xj pairwise models where all potentials consist of two variables at Let us define couplings Jij R that are assigned to each ∈ most, so that the joint distribution PX(x) factorizes according to edge (i, j) E and local fields θi R that act on each ∈ ∈ N variable Xi X. These parameters then define the pairwise 1 Y Y potentials ∈ and the local potentials PX(x) = Φ(xi, xj) Φ(xi) (1) Φ(xi, xj) = exp(Jijxixj) Φ(x ) = exp(θ x ) so that the corresponding joint distribution Z (i,j) E i=1 i i i ∈ from (1) can be written as 1  = exp E(x) ,   − N Z 1 X X where the potentials specify the energy E(x). PX(x) = exp  Jijxixj + θixi . (5) (i,j) E i=1 We consider the following two problems: First, evaluating the Z ∈ partition function (i.e., the normalization function of the joint Z There are two different types of interactions between random distribution). Consider any distribution QX(x) and the (Gibbs) repulsive P  variables: if Jij is negative then the edge (i, j) is ; if free energy = (QX(x)) = QX(x) E(x)+ln QX(x) ; F F x Jij is positive then the edge (i, j) is attractive. Accordingly, one then, evaluating the partition function and minimizing the free distinguishes between general models that contain both attractive energy are equivalent since the minimum of corresponds to 2 F Z and repulsive edges and attractive models that contain only according to min = ∗ = ln (cf. [61]). F F − Z attractive edges. Second, we consider computing the marginal distribution Attractive models are often more well-behaved than general X models but, rather than only distinguishing between these two PY(y) = PX(x), (2) model-types, we find it beneficial to consider an even finer-grained xi:Xi X Y ∈{ \ } distinction into the following three model-classes. In doing so, we where Y X may be any set of RVs. The identities of the ⊂ additionally unify different naming conventions in the literature random variables are often obvious from the notation of the and provide a consistent taxonomy of model-classes. values; therefore, we often use the shorthand notation P (y) for the marginal distribution. Note that the problems of evaluating 2.2.1 Frustrated, Balanced, and Unidirectional Models the partition function and computing the marginal distributions First, frustrated models are models that contain cycles for which are inherently linked, as obtains its minimum precisely for the the product over all couplings equates to a negative number, i.e., marginal distributions P F(x). X cycles with an odd number of repulsive edges (see Fig. 1a).3 In general, both problems are intractable [7]. Consequently one Note that if a model is frustrated, it must be a general model often relaxes the problems and only approximates the marginal per definition. distributions and the partition function. This allows for an elegant Second, balanced models are models that are not frustrated. iterative algorithm that was discovered multiple times in different For a balanced model, it is possible to flip a certain subset of fields. It is known as belief propagation (BP) in computer science, variables without affecting the represented distribution such that the sum-product algorithm in , and the cavity- the resulting model is attractive (cf. [52]). To flip a variable or the Bethe-method in physics; we refer the reader to [26], [30] X , one must switch its two states, and: reverse the sign of the for a good overview of the underlying principles. i corresponding local field θi; and reverse the signs of the couplings Jij for all incident edges (i, j): Xj N(i). For an illustrative 2.2 MODEL SPECIFICATION example, consider the model in Fig. 1b.∈ This model is a general We consider binary pairwise models where every one in its current form; reversing the sign of the rightmost variable Xi takes values from = 1, +1 (i.e., Ising models). further implies changing both attached edges from repulsive to For these models, insteadX of considering{− } the singleton marginals attractive, which results in an equivalent attractive model (see P˜(xi) and the pairwise marginals P˜(xi, xj) explicitly, it is often more convenient to work with the means m and the correlations 2. Note that attractive models are also known as ferromagnetic models [30] i or log-supermodular models [40]. χij defined according to 3. Frustrated models are also referred to as spin glasses in the physics literature. These are some of the most complex models considered of the mi = E(Xi) = PXi (+1) PXi ( 1), (3) − − form (5). SELF-GUIDED BELIEF PROPAGATION – A HOMOTOPY CONTINUATION METHOD 3 Fig. 1c). Such an equivalent model exists if and only if the original unidirectional.5 model contains no cycles with an odd number of repulsive edges, i.e., if the original model was not frustrated [14]. 2.3 BELIEF PROPAGATION (BP) Finally, unidirectional models are attractive models with unidirectional local potentials, i.e., either all θi 0 or θi 0 BP approximates the marginals by recursively exchanging mes- 4 ≤ ≥ n+1 (see Fig. 1d). In a similar manner as for balanced models, certain sages between random variables. The messages µij (xj) from general models have equivalent unidirectional models, i.e., by Xi to Xj at iteration n + 1 are updated according to (6) and are normalized so that P n . flipping variables to render the model attractive all local potentials xj µij(xj) = 1 are mirrored so that the attractive model is unidirectional; we call ∈X X Y these models quasi-unidirectional. All theoretical findings for µn+1(x ) Φ(x , x )Φ(x ) µn (x ) (6) ij j ∝ i j i ki i unidirectional models thus readily extend to quasi-unidirectional xi X N(i) j ∈X k models. ∈ \ n n Let µ = µij(xj): eij E, xj be the set of all messages at iteration{ n and let∈ the mapping∈ X }induced by (6) be The major advantage of considering unidirectional models n+1 n is that they are relatively well understood. Thus unidirectional denoted as µ = (µ ). If all successive messages remain unchanged, i.e., if µnBP+1 = µn, then BP is converged to a fixed models are of particular relevance for our work – as well as for init studying inference algorithms in general. point µ◦. We further write µ◦ = ◦(µ ), where ◦ updates the messages until convergence.BP If BP fails to convergeBP and the Intuitively, in unidirectional models, we expect that all vari- messages oscillate, one can try to achieve convergence by either ables will energetically favor the same state as indicated by the changing the update-rule [10], [23], [47], or by replacing the local potentials. Additionally, since all edges are attractive, no con- messages with a convex combination of the last messages [34]. tradictions between neighboring variables will occur, thus further The latter method is known as damping (BPD) where a damping reinforcing this behavior. Moreover, various research directions parameter  [0, 1) specifies the new update rule indicate the relevance of this model class. We will list a couple of ∈ these indications below: µn+1 = (µn), BPD On the one hand, it is generally often easier to study homo- = (1 ) (µn) + µn. (7) geneous models, where all pairwise potentials are the same (i.e., − BP

Jij = J) and all local potentials are the same (i.e., θi = θ). After convergence, the singleton marginals P˜(xi) and pairwise Provided that the model is unidirectional, the findings from homo- marginals P˜(xi, xj) are approximated by geneous models carry over readily [12], [21]. Alternatively, one often considers models of the form (5) in an external field θ, i.e, ˜ 1 Y P (xi) = Φ(xi) µki◦ (xi), (8) where all local potentials are the same (or have the same sign) [17]. Zi Xk N(i) Keeping the physical interpretation in mind, we might hope that ∈ 1 models that are placed in an external field are easier to understand. P˜(xi, xj) = Φ(xi)Φ(xj)Φ(xi, xj) Let us use the concept of flipping variables for transforming Zij · Y Y balanced models into attractive models again. Interestingly, if we µ◦ (x ) µ◦ (x ), (9) ki i · lj j consider models in an external field only, then flipping variables Xk N(i) j Xl N(j) i ∈ \ ∈ \ is not a viable option as this would imply reversing the sign of where Zi,Zij R+∗ guarantee that all probabilities sum to some θi. Thus for models in a constant external field, the class of ∈ attractive models reduces precisely to the class of unidirectional one. We refer to the set of singleton and pairwise marginals as ˜ models. pseudomarginals PB, where On the other hand, the complexity of approximating the P˜ = P˜(xi), P˜(xi, xj): Xi X, (i, j) E . (10) partition function also depends on the model class and is in B { ∈ ∈ } line with our distinction. That is for unidirectional models the We further denote the pseudomarginals obtained at a fixed point partition function can be approximated efficiently in polynomial ˜ 6 of BP by PB◦ . time [17]; for frustrated models, the partition function cannot be approximated efficiently; balanced models are of intermediate complexity [12]. 2.4 THE BETHE APPROXIMATION & RELATED WORK Finally, it is sometimes beneficial to study models without BP is closely connected to variational methods and concepts from local potentials; in fact, one can find an auxiliary model without statistical physics. In particular, a direct relationship between per- local potentials for every model of the form (5) by adding one forming BP and minimizing the Bethe free energy exists (cf. [61]). variable to the graph (and connecting it to all existing variables), ˜ The Bethe free energy B(PB) is evaluated over the pseu- where the auxiliary model has the same marginals and the same domarginals (i.e., the singleton-F and pairwise marginals) and is a partition function (up to a constant factor of two) [18], [41], [53]. Provided that the original model was attractive, then the auxiliary 5. On the one hand, we need to allow for a flip of the additional variable model remains attractive (or can be rendered attractive by flipping for the case that all θi < 0; on the other hand, this makes the statement the additional variable) if and only if the original model was agnostic to the definition (i.e., it holds irrespective if we have X = {−1, +1} or X = {0, 1}). 6. With slight abuse of notation, we will refer to both the pseudomarginals ˜◦ ◦ PB and the messages µ as fixed points throughout this work. Strictly 4. Unidirectional models are also referred to as attractive models in an speaking, however, the pseudomarginals are only evaluated at a fixed point external field, true ferromagnetic models, or consistent models in the literature. rather than being a fixed point themselves. SELF-GUIDED BELIEF PROPAGATION – A HOMOTOPY CONTINUATION METHOD 4 ˜ function of the average energy EB(PB) and the Bethe entropy We aim to estimate B and PB; while the approximation S (P˜ ) that are defined by of in [55] is ε-accurate,F no run-time guarantees exist for B B FB X X general models. Our work, on the contrary, provides an estimate in E (P˜ ) = P˜(x ) ln Φ(x ) B B − i i constant run-time (see Theorem 6 in Section 5); the approximation Xi xi error, however, cannot be made arbitrarily small in general. Both X X P˜(x , x ) ln Φ(x , x ). (11) methods overcome their respective disadvantages when restricting − i j i j (i,j) E xi,xj the models; i.e., both methods do efficiently minimize the Bethe ∈ X X approximation for unidirectional models.7 S (P˜ ) = P˜(x , x ) ln P˜(x , x ) B B − i j i j (i,j) E xi,xj X ∈ X 2.5 Solution Space of Binary Pairwise Models + ( N(i) 1) P˜(x ) ln P˜(x ). (12) | | − i i Over the years, a substantial amount of work has been devoted Xi xi to getting a good understanding of how the solution space is This subsequently defines the Bethe free energy according to affected by the specification of the graphical model. We are ˜ ˜ ˜ particularly interested in how the performance of BP and the B(PB) =EB(PB) SB(PB). (13) energy landscape of vary, depending on the specification of the F − FB Let us define the local polytope , i.e, the set of consistent potentials. Here, we briefly summarize the most important results pseudomarginals, where L with a particular focus on how the solution space changes when considering increasingly stronger couplings. X X = P˜ : P˜(x ) = 1, P˜(x , x ) = P˜(x ) . This discussion serves two purposes: on the one hand, it L { B i i j i } xi xj provides a good intuition and motivates the working principle of SBP discussed in Section 3; on the other hand, it prepares the Minimizing B over the local polytope gives us the desired F ˜ required background for our theoretical analysis in Section 5. global minimum B∗ = min B(PB). This, however, is not L Moreover, the difference in the behavior of BP is often only straightforward asF the constrainedF is often non-convex. B considered between general and attractive models. Instead, we The main reason for consideringF the Bethe free energy is its consider the difference in BP’s behavior between frustrated, immediate connection with the fixed points of BP µ (and the ◦ balanced, and unidirectional models. We are convinced that this associated pseudomarginals P˜ ): all stationary points are in B◦ B◦ distinction is not only relevant for analyzing SBP in this work a one-to-one correspondence with the fixed points ofF BP. More but that this is a crucial distinction one must make in the pursuit precisely, we have of understanding what makes some problems inherently hard to ◦ = (P˜◦ ). (14) solve. FB FB B In practice, we are particularly interested in stable fixed points For as long as the couplings Jij are sufficiently small (relative (for which BP converges if initialized close enough); we index all to θi), the Bethe free energy B is convex, BP has a unique fixed ˜s F stable fixed points PB by s = 1,...,S and denote the set of all point, and BP converges [20], [33]. As the couplings increase, stable fixed points by S. Likewise, we consider the set of all fixed the performance tends to suffer as multiple fixed points emerge points that constitute minima of the Bethe free energy M and and as BP may fail to converge (even if the fixed point is still index them by m = 1,...,M. Note that every stable fixed point unique) [21], [33].8 The precise effect of the strong couplings P˜s must be a minimum of ; the converse, however, need not B FB on the behavior of BP and the shape of B mainly depends be the case, i.e., not every local minimum corresponds to a stable on the respective model-class. Thus, we willF split the following fixed point, so that S M [50]. ⊆ discussion accordingly. This correspondence between BP and B led to a better under- frustrated models F For , the number of stationary points scales standing of BP and inspired plenty of methods that minimize B exponentially with the number of variables. Often, multiple sta- directly [58], [62]. The minimization, however, is still highly non-F tionary points coexist with the same value of B but possibly trivial and requires good approximation methods in practice. Since completely different marginals; it may even happenF that there strong pairwise potentials often reduce the accuracy of the Bethe is not a single unambiguous global minimum. Moreover, for approximation and are responsible for its non-convexity [20], one frustrated models, the Bethe free energy is not lower bounded can correct the entropy term (12) by accounting for the strong by the exact free energy, so that the global minimum need not potentials; this admits convex relaxations that provide provable approximate the free energy best. Such complex energy landscapes convergent message passing algorithms [11], [15], [28], [29], [49]. render any attempt of minimizing the Bethe free energy futile. There is, however, a trade-off between convergence-properties and Moreover, in particular for models with strong couplings, BP often accuracy in general. That is, if it can be minimized, the Bethe fails to converge for frustrated models. approximation often provides accurate results and outperforms For balanced models, the number of stationary points still convex free energy approximations [29], [55]. Thus, it is a relevant scales exponentially with the number of variables. For example, problem to directly approximate B in a way that allows for attractive models with arbitrary local potentials (i.e., random field F efficient minimization. Polynomial run-time algorithms exist that Ising models) are some of the simplest forms of disordered sys- approximate B for restricted models: these include sparsity tems [2]. Yet, in comparison with frustrated models, they are much constraints [42]F or require attractive models [55]. If both properties are fulfilled, i.e., for locally tree-like attractive models the Bethe 7. The restriction to unidirectional models is actually too conservative for [55], where the approximation of FB takes polynomial run-time for approximation is exact and can be optimized efficiently [9]. Note balanced models as well. that B provides an upper bound on for attractive models [40], 8. Note that the differences in behavior coincide with different phases in [56],F [60]. F statistical mechanics (cf. [30], [48], [64]). ˜ SELF-GUIDED BELIEF PROPAGATION – A HOMOTOPY CONTINUATION METHOD PB◦ (ζ =1) 5

P˜ (ζ =0) easier to solve. Since every balanced model can be transformed B◦ 0.5 − 3.3 into an equivalent attractive model, the Bethe free energy is lower − 4.15 1 − − 3.4 1.5 − 4.2 B B bounded by the exact free energy. Moreover, many methods that − B − F F 2 3.5 F − − 4.25 minimize B are particularly efficient for attractive models and − 2.5 3.6 F − − 4.3 thus for balanced models as well. One of the properties that we 3 − − 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 will use throughout is that for attractive models all fixed points P˜ P˜ P˜ corresponding to local minima of B are stable [50, Theorem (a) (b) (c) 6], [36], [46]. Although BP may stillF fail to converge on balanced Fig. 2. Evolution of the Bethe free energy FB for Example 1. We models, it is thus at least guaranteed that BP will converge if the consider a complete graph with N = 4 variables with homogeneous messages are initialized close enough to a fixed point. potentials, attractive edges, and θi = 0. FB is evaluated for P˜(xi) = P˜ For unidirectional models two fixed points exist at most. This for (a) J = 0.1, (b) J = 0.55, and (c) J = 0.7. not only renders approximating the free energy and the marginals relatively straightforward but also allows for a thorough evaluation Fig. 3b), before turning into a local maximum with two symmetric of various algorithms. Since we devote a great part of our theoret- minima branching off (see Fig. 3c). ical analysis of SBP to its performance on unidirectional models, we will now discuss the solution space of unidirectional models in Second, when all local potentials are specified by θi > 0 the more detail and prepare some foundations for the later analysis. fixed points are not symmetric anymore. For later reference, we Most properties result from generalizing the properties of summarize some important properties of the fixed points in the homogeneous models. As discussed in Section 2.2.1 these models following Lemma. qualitatively behave the same. Lemma 2. Let us consider a unidirectional model with θ > 0 First, we treat the special case where all local potentials are i and couplings strong enough to admit multiple (i.e., two) fixed specified by θ = 0 and where both fixed points – if they exist – i points. Then, one fixed point has its means aligned with the local are symmetric. s potentials, i.e., mi > 0; we call this fixed point consistent or state- Lemma 1. Let us consider a unidirectional model with loops, with preserving (cf. [12], [22]). Similar, as in Lemma 1, the means of t θi = 0, and with couplings strong enough to admit multiple (i.e., the second fixed point have opposing signs, i.e., mi < 0. Both two) fixed points. Then, both fixed points s and t are symmetric, fixed points are not symmetric, however, and the consistent fixed s t s t i.e., mi = mi, and both fixed points have the same Bethe free point has larger means mi > mi and larger correlations − s t s t | | | | energy, i.e., B = B. χij > χij. Accordingly, the consistent fixed point constitutes the F F s global minimum of the Bethe free energy, i.e., ∗ = . Proof: Given that the local potentials are the same FB FB for both states, the message update rule from (6) simplifies Proof: First, the existence of a consistent fixed point, i.e., to n+1 P Q n . Consequently, if ms > 0, is an immediate consequence of bounding the location µij (xj) xi Φ(xi, xj) Xk µki(xi) i s ∝ µij(xj) = µij◦ (xj) are fixed point messages, so are the symmetric of the fixed points of BP in [54, Theorem 3], [61]. Second, the t t messages µij(xj) = 1 µij◦ (xj). By computing the pseudo- existence of a second point with opposing means, i.e., mi < 0, marginals according to (8)-(9)− it follows immediately that the follows from the fact that only a single fixed point exists with means are symmetric around 0 and that both fixed points have positive means [9, Lemma 2.3]; thus the second fixed point, if it the same Bethe free energy (see Example 1). exists, must have negative means (see Example 2). Third, it is an We will now present an illustrative example that describes the immediate consequence of the update rule in (6) that the consistent s t behavior of models with θ = 0. These models have been well- fixed point is energetically favored, i.e., µ (xj) > µ ( xj). ij ij − studied in the literature [20], [32], [57]. We present Example 1 to Then, computing the marginals according to (8) implies that the summarize the most important properties that serve as the basis consistent fixed point has larger means, i.e., ms > mt . The | i | | i| for our theoretical analysis of SBP in Section 5. same line of reasoning shows that the correlations are larger as well. Finally, the consistent fixed point has a lower average energy, Example 1. Consider a homogeneous model on a complete graph s ˜ t ˜ i.e., EB(PB) < EB(PB) (see (11)). Therefore (and by symmetry with N = 4 variables, positive couplings Jij = J > 0, and s of the Bethe entropy), it follows that B∗ = B. local potentials specified by θi = 0. For computing the Bethe F F free energy according to (13), we enforce the symmetry of Example 2. Consider a unidirectional homogeneous model on a FB the model by setting all singleton marginals to the same value complete graph with N = 4, positive couplings Jij = J > 0, and P˜(xi) = P˜. We compute the Bethe-optimal pairwise marginals local potentials specified by θi = θ = 0.02. Again, we enforce according to [59, Lemma 1] (see also [57]). This allows us to com- symmetry by setting all singleton marginals to the same value pute the pairwise marginals given only the singleton marginals.9 and compute the pairwise marginals and the Bethe free energy The effect of the coupling strength is illustrated in Fig. 2. Note accordingly. that for θ = 0, P˜ = (0.5, 0.5) is a fixed point for every value Similar as in Example 1, we illustrate the effect of the coupling of J; we will often refer to this fixed point in terms of its means strength on the shape of B. In some sense we see a similar 10 F mi = 0. Consequently, there is also a corresponding stationary picture emerging; i.e., B is convex for small values of J (see F 11 point of . For small values of J, is convex and this fixed Fig. 3a) and non-convex for larger values of J (see Fig. 3c). FB FB point is unique (see Fig. 3a); as J increases, B becomes flat (see There is one crucial difference however: because of θi > 0 F the minimum does not turn into a maximum with two minima 9. Note that an alternative representation of (5) is used in [57], [59]; there branching of. Instead, we see that the global minimum persists is, however, a straightforward mapping from one representation to the other. 10. This fixed point exists whenever all local potentials are specified by 11. Contrary to Example 1, no closed-form solution exists for computing θ = 0 and can be computed analytically [32]. the fixed points because of the local potentials. SELF-GUIDED BELIEF PROPAGATION – A HOMOTOPY CONTINUATION METHOD 6 both minima according to (15), however, gives the exact singleton 4.3 1.5 3.6 − − − marginals because of the model’s symmetry properties. 4.4 3.7 − 2 − B B Obtaining all minima of the Bethe free energy, however, is

− B F F F 4.5 2.5 3.8 − − − a complex task that is only possible for models with certain 4.6 − 3 3.9 structures or potentials [6], [20], [36], [64]. Besides the application − − 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 P˜ P˜ P˜ to optimization problems [4] the RSB assumption is thus still of (a) (b) (c) limited practical relevance for estimating the marginals; nonethe- less, it provides a powerful concept for assessing and comparing Fig. 3. Evolution of the Bethe free energy FB for Example 2. We the marginals given a selection of fixed points [22]. consider a complete graph with N = 4 variables with homogeneous potentials, attractive edges, and θi > 0. FB is evaluated for P˜(xi) = P˜ for (a) J = 0.25, (b) J = 0.62, and (c) J = 0.75. 3 SELF-GUIDED BELIEF PROPAGATION (SBP) The main aim of SBP is to select an accurate fixed point, i.e., for all values of J, whereas the energy landscape becomes flat a fixed point that approximates the singleton marginals well, or before two additional stationary points emerge somewhere else – if none are stable – to approximate an accurate fixed point. In (see Fig.3b). this section we present an intuitive justification of the proposed We will see in Section 5 that unidirectional (non- method and subsequently introduce SBP in detail. We further homogeneous) models behave qualitatively the same; conse- present practical considerations and pseudocode of SBP. A formal quently we will make repeated use of the behaviour illustrated treatment of SBP is presented in Section 5. in Example 2 for analysing the behaviour of SBP. The current understanding of BP is that strong pairwise po- Finally, note that every fixed point that corresponds to a local tentials negatively influence BP in the sense of deteriorating the minimum is a stable fixed point of BP for unidirectional models. convergence behavior [20], [32] and the approximation quality of This is a direct consequence of [50, Theorem 6] and the fact the marginals [20], [22], [57] and that incorporating the potentials that unidirectional models are attractive per definition. Even more slowly [3] may reduce the overall number of iterations. Inspired importantly, BP will always converge for unidirectional models: by the recent observation that strong local potentials increase therefore note that BP always converges for unidirectional models accuracy and lead to better convergence properties [21], we aim with θi = 0 [33] and that the existence of local potentials to reduce the influence of the pairwise potentials that negatively with θi = 0 can only enhance the convergence properties [21]. influence BP. In doing so, we hope that an accurate fixed point Recently,6 it has also been shown that for unidirectional models BP emerges if we start from a simple model with independent random will converge reasonably fast, if initialized properly [9], [24]. variables and slowly increase the potentials’ strength. SBP starts from a simple model with independent random 2.5.1 Combination of Fixed Points variables and slowly incorporates the edge potentials’ strength. If multiple fixed points exist, the performance of BP often varies More precisely, SBP is a homotopy method that solves the simple considerably between different fixed points. Here we briefly intro- problem first and – by repetitive application of BP – keeps track of duce the RSB (replica symmetry breaking) assumption that states the fixed point as the interaction strength is increased by a scaling how all those fixed points can be combined to compute the exact term. marginals. Therefore, let us consider all local minima of the Bethe Formally, SBP considers an increasing length-M sequence m ˜m ζm where m = 1,...,M such that ζm < ζm+1 and free energy B , the pseudomarginals PB at the corresponding { } F ζ [0, 1] with ζ1 = 0 and ζ = 1. This further indexes a fixed point, and the associated partition function approximation m ∈ M m sequence of probabilistic graphical models that converges . {Um} ZB to the model of interest = . Every probabilistic graphical UM U Conjecture 1 (RSB Assumption). Let us consider all M local model has a set of potentials Ψm = Φm(xi, xj), Φm(xi) m { } minima of the Bethe free energy B , the pseudomarginals at associated, where Φm(xi) = Φ(xi) and the pairwise potentials at ˜m F the corresponding fixed point PB , and the associated partition index m are exponentially scaled by function approximation m. Then, the exact singleton marginals ZB P (xi) are given by a convex combination of all fixed points at Φm(xi, xj) = exp(Jijζmxixj) ζ minima of B according to = Φ(x , x ) m . (16) F i j M 1 X We further write µ◦ to clarify that we consider a BP fixed point P (x ) = P˜m(x ) m. (15) m i M i for the model . If multiple fixed points exist, the initialization P m ZB m m=1 of BP plays a crucialU role in determining to which fixed point BP m=1 ZB converges and how it performs. SBP always provides a favorable This representation belongs to a set of postulates in [31]. initialization by the preceding fixed point and aims to perform the One underlying assumption is that the system does exhibit mul- composite function12 tiple fixed points (unique fixed points would falsely imply exact init marginals otherwise). Despite its non-rigorous flavor, Conjecture 1 µM◦ = M◦ M◦ 1 1◦ µ1 . (17) BP BP − ···BP has been verified for a wide range of problems [30, Ch.19]. Example 1 and Fig. 2c in particular give an intuitive illustration of 12. The underlying assumption is the existence of a continuous path that init ◦ connects the initial messages µ1 to the final fixed point µM , where all the RSB assumption: for this model the local maximum (i.e., the init intermediate initializations µm lie along this path. For now, we assume a unstable fixed point) corresponds to the exact marginals; yet, both fixed number of models M for practical reasons but we will consider SBP’s local minima do not approximate the marginals well. Combining behavior in the limit of M → ∞ in our formal analysis in Section 5. SELF-GUIDED BELIEF PROPAGATION – A HOMOTOPY CONTINUATION METHOD 7 3.2 Pseudocode SBP: intermediate solution Pseudocode of SBP is presented in Algorithm 1. Note that we 0.5 SBP: approximate solution exact solution initialize the messages µinit randomly. The sequence of messages over all models m is collected in µ◦ = µ◦,..., µ◦ . ) m 1 m i ExtrapolateMsg{Uapplies} cubic spline extrapolation{ } { to estimate the} m (

E 0.4 initial messages of the subsequent model. We further present the pseudocode for the adaptive step size controller in Algorithm 2. The step size is increased depending on k, where k is the number of preceeding fixed point messages that 0.3 are -close to the current fixed point messages µ 0 0.2 0.4 0.6 0.8 1 ε m◦ ζ Algorithm 1: Self-Guided Belief Propagation (SBP) input : graph = (X, E), potentials Ψ, Fig. 4. Example 3: SBP proceeds along the smooth solution path and G obtains accurate marginals despite instability of the terminal fixed point. initial step-size stepinit, maximum number of iterations NBP output: fixed point messages µ◦ This may lead to problems if the fixed point becomes unstable for init init 1 initialization µ1 µ some value m < M, in which case we cannot rely on BP to keep ← 2 m 1 track of the fixed point anymore. Instead, SBP provides the last ← 3 ζ1 0 stable fixed point in that case, i.e., µ◦ as the final estimate. ← m 4 while ζ 1 do m ≤ Example 3. We illustrate how SBP approximates the marginals 5 Ψ(ζm) ScalePotentials(Ψ, ζm) ← init for a frustrated model, for which BP fails to converge, in Fig. 4. 6 (µ, iterations) BP(µ , Ψ(ζ ),N ) ← m m BP Initially, SBP obtains the pseudomarginals for ζ = 0 by running 7 if iterations < NBP then BP on the simple model. As ζ increases, SBP estimates the 8 µ◦ µ m ← pseudomarginals along the solution path by running BP to keep 9 else track. For ζ > 0.7, however, BP does not converge anymore; 10 break SBP consequently terminates and provides the marginals of the 11 if adaptive stepsize then last stable fixed point as the final approximate solution. Note that 12 ζm+1 ζm+ AdaptiveStepSize( µm◦ , the approximated marginals are already close to the exact ones ← { } stepinit, m) in this example; experiments show that this is often the case (see 13 else Section 4). 14 ζ +1 ζ + step m ← m init init In other words, SBP relaxes the problem of minimizing B by 15 µ ExtrapolateMsg( µ◦ , ζ ) F m+1 ← { m} { m} making all variables independent (and the Bethe approximation 16 m m + 1 exact). Then the problem is deformed into the original one by ← 17 µ◦ µm◦ 1 increasing ζ from zero to one. Thereby, a stationary point B◦ ← − emerges as a well-behaved path (see Property 1 in Section 5)F and SBP keeps track of it with BP constantly correcting the stationary point. Algorithm 2: Adaptive Step Size Controller

input : sequence of messages µm◦ , stepinit, m threshold ε { } 3.1 Practical considerations output: stepsize step

In practice the run-time of SBP is influenced by the difference be- 1 step stepinit µ µ ← tween two successive fixed points m◦ and m◦ 1 – the difference 2 − k 1 is primarily determined by the number of steps M. Ideally M ← 2 3 while µm◦ µm◦ k < ε do should be as large as possible. This, however, increases the run- | − − | 4 k k + 1 time (see Theorem 6); in practice, we would choose M as small ← 5 step step + stepinit k as possible but as large as necessary. Moreover, one can adaptively ← · increase the step size if two successive fixed points are close, i.e., if µm◦ µm◦ 1 (cf. [43, pp.23], [1]). Our experiments show that it is sufficient' − to use rather coarse steps; we used M 10 for all reported experiments and using more steps would≤ not have 4 EXPERIMENTS improved the accuracy. We apply SBP to attractive (Section 4.2) and general (Section 4.3) Additionally, instead of initializing m with its preceding models on n n grid graphs of different sizes, on complete graphs BP fixed point messages, i.e., with N = 10×random variables, and on random graphs (i.e., Erdos init ˆ µm = µm◦ 1 one can (e.g., by spline extrapolation) estimate Renyi graphs with an average degree d = 3) with N = 10 µinit µ − µ µ so that µinit µ to random variables. These graphs are considered in order to render m = f( m◦ 1, m◦ 2,..., m◦ k) m ∼= m◦ reduce the overall− number− of iterations.− We empirically observed the computation of the exact marginals feasible and to make the that the benefit diminishes for k > 3. results comparable to previous work [29], [44], [45], [57]. SELF-GUIDED BELIEF PROPAGATION – A HOMOTOPY CONTINUATION METHOD 8

Grid Graph Grid Graph 1 0.2 0.5 MSE MSE 0.1

0 0 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 β β

1,000 3,000 2,000 500 1,000 BP Iterations BP Iterations 0 0 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 β β Complete Graph Complete Graph 0.8 0.8 0.6 0.6 0.4 MSE

0.4 MSE 0.2 0.2 0 0 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 β β 150 1,000 100 500 50 BP Iterations BP Iterations 0 0 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 β β Random Graph Random Graph 0.8 0.6 0.4 0.4 MSE MSE 0.2 0.2 0 0 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 β β

400 2,000

200 1,000 BP Iterations BP Iterations 0 0 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 β β

(a) (b)

◦ ◦ Fig. 5. MSE and number of iterations for: SBPall (blue), BP (orange), and BPD (green); θi ∼ U(−0.5, 0.5) and (a) Jij ∼ U(0, β) (attractive model); (b) Jij ∼ U(−β, β) (general model). In terms of accuracy, SBP is superior in all scenarios, while increasing the number of iterations only slightly.

3 4.1 Experimental settings stepinit = 0.1 and a threshold of ε = 10− . The overall number of steps is thus M 10. We evaluate SBP and compare it to BP, BPD (BP with damping), ≤ 5 For BPD we choose a large damping factor  = 0.9 in order and Gibbs sampling (run for 10 iterations) using the mean to prioritize convergence over run-time and therefore increase squared error (MSE) between the approximate marginals P˜X 4 i the maximum number of iterations to NBP = 10 . Carefully and the exact marginals , where, for binary random variables, PXi selecting a damping factor that depends on a given model may MSE = 2 PN P (+1) P˜ (+1) 2. We further compare N i=1 | Xi − Xi | reduce the number of iterations until convergence but cannot the run-time by counting the overall number of BP iterations and increase the accuracy; moreover, if chosen too small, BP may 13 D the number of iterations for Gibbs sampling. fail to converge at all. For BP and SBP we set the maximum number of iterations to The initial messages are randomly initialized 100 times for 3 NBP = 10 and use random scheduling. We randomly initialize each model, before applying BP with and without damping. We µinit and use an adaptive step size with an initial step size of consider BP (and BPD) as converged for a model if at least a single message initialization (out of 100) exists for which BP 13. Computing the acceptance-probability requires similar run-time as one converges. The convergence ratio is the number of experiments BP message update. (or probabilistic graphical models) for which BP converged at least SELF-GUIDED BELIEF PROPAGATION – A HOMOTOPY CONTINUATION METHOD 9

TABLE 1 RESULTSFORGENERALMODELSWITH Jij ∈ {−1, 1} ONGRIDGRAPHS (N =25 AND N =100), COMPLETEGRAPHS (N =10), ANDRANDOM GRAPHS (N =10).WEREPORTTHE MSE TOTHEEXACTMARGINALSANDTHE MSEB TOTHE BETHE APPROXIMATION, CONVERGENCE RATIO, ◦ ◦ ANDTHEOVERALLNUMBEROF BP ITERATIONS.ONLYCONVERGEDRUNSARECONSIDEREDFOR BP AND BPD BUTALLRUNSARECONSIDERED ∗ FOR SBPall,GIBBSall, AND FB,all.

Grid Graph( 5 × 5) Grid Graph (10 × 10) Complete Graph Random Graph θ 0 0.1 0.4 0 0.1 0.4 0 0.1 0.4 0 0.1 0.4 BP◦ 0.338 0.251 0.102 - - 0.184 0.463 0.466 0.356 0.252 0.202 0.101 ◦ BPD 0.226 0.198 0.066 0.186 0.240 0.154 0.463 0.473 0.422 0.128 0.116 0.083 MSE SBPall 0.000 0.029 0.047 0.000 0.026 0.077 0.000 0.055 0.074 0.000 0.048 0.049 ∗ FB,all 0.036 0.042 0.069 ------Gibbsall 0.001 0.016 0.064 0.001 0.037 0.120 0.096 0.096 0.077 0.001 0.011 0.048 Convergence BP◦ 0.05 0.11 0.26 0.00 0.00 0.02 0.41 0.42 0.50 0.30 0.33 0.49 ◦ ratio BPD 0.11 0.16 0.69 0.01 0.02 0.12 0.41 0.41 0.50 0.62 0.64 0.80 BP◦ 40 52 84 - - 102 17 17 18 42 53 50 ◦ Number of BPD 1370 1449 1735 2711 2313 2599 211 207 234 1077 1057 873 iterations SBPall 5 182 146 5 149 209 5 51 110 5 149 131 5 5 5 5 5 5 5 5 5 5 5 5 Gibbsall 10 10 10 10 10 10 10 10 10 10 10 10

MSEB SBPall 0.036 0.037 0.022 ------◦ FB (ζM ) equals SBP 100 10 23 ------

once divided by the overall number of experiments (i.e., 100). First, in order to evaluate the performance of SBP we consider The reported error (MSE) and the number of iterations are uniform local potentials θ = θ 0, 0.1, 0.4 and draw the i ∈ { } averaged over all convergent runs of BP and BPD (i.e., BP◦ and couplings with equal probability from J 1, 1 ; the results ij ∈ {− } BPD◦ ) while all runs that did not converge are discarded. On the are summarized in Tab. 1. Although BP and BPD fail to converge contrary, we average the error and the number of iterations over for most models we observe that SBP stops after only a few all models for SBP (SBPall), Gibbs sampling (Gibbsall), and for iterations and significantly outperforms BP in terms of accuracy. minimization of the Bethe approximation ( B,all∗ ). In fact, SBP achieves accuracy competitive with Gibbs sampling F but requires three orders of magnitude fewer iterations. 4.2 Attractive models Second, we apply SBP to general graphs and evaluate whether We consider grid graphs with N = 10 10 random variables, SBP provides a good approximation of the pseudomarginals P˜ × B∗ random graphs with N = 10 random variables, and complete that correspond to the global minimum of the Bethe free energy graphs with N = 10 random variables. We generate L = 100 ∗ . Therefore we consider grid graphs (of size 5 5), which FB × models for every value of β 0, 0.5,..., 5 and sample the still allows us to approximate the global minimum ∗ – and ∈ { } FB potentials according to θi ( 0.5, 0.5) and Jij (0, β); the related pseudomarginals P˜ – reasonably well by [55]. The ∼ U − ∼ U B∗ i.e., overall we consider 1100 different models for each individual results are summarized in Tab. 1 and show that SBP approximates graph-structure. Note that BP is randomly initialized 100 times for ˜ PB∗ within the accuracy of our reference method (MSEB). We every considered model. We compute the MSE for every value of further report the number of times where SBP obtains the terminal β and visualize the mean and the standard deviation of the MSE14  fixed point, i.e., for M , in Tab. 1 i.e., B◦ (ζM ) equals SBP . It as well as the number of iterations in Fig. 5a. becomes obvious thatU SBP approximatesF the terminal fixed point BP (orange) converges rapidly for all graphs considered; reasonably well, despite frequently stopping for ζm < 1. More- hence, there is no additional benefit in using BPD (green) that only over, the MSE reveals that SBP does not only approximate the increases the number of iterations. SBP (blue) slightly increases ˜ pseudomarginals PB∗ well but concurrently provides an accurate the number of iterations as compared to BP but converges in fewer approximation of the exact marginals PB. iterations than BPD. Note that SBP is guaranteed to capture the Third, we investigate how the approximation quality depends global optimum when applied to unidirectional models (see The- on the scaling term ζm. Therefore, we depict the evolution of orem 9-10). But even if we do allow for random local potentials, MSEB (to the approximate solution) and of the MSE (to the exact we empirically observe that SBP consistently outperforms BP with solution) in Fig. 6. We observe that MSEB (blue) decreases mono- respect to accuracy; specifically for models with strong couplings tonically with every iteration, which empirically verifies that SBP These models exhibit multiple stable fixed points [20] such that, proceeds along a well-behaved solution path (see Property 1). Note depending on the initialization, BP often converges to inaccurate that MSEB decreases rapidly in the first iterations and SBP spends fixed points while SBP selects an accurate one. a major part of the overall run-time for slight improvements. The MSE to the exact solution, on the other hand, decreases first until it 4.3 General models increases again as SBP incorporates stronger couplings. Stronger General models with random potentials are often frustrated and couplings tend to degrade the quality of the Bethe approximation traditionally pose problems for BP and other methods that aim to in loopy graphs and lead to marginals that are increasingly biased minimize the Bethe approximation. towards one state [20], [51]. This explains why the MSE to the 14. Note that the MSE is not Gaussian distributed but we report the standard exact solution increases as SBP converges towards the terminal deviation for simplicity. fixed point. One could exploit this behavior and restrict the run- SELF-GUIDED BELIEF PROPAGATION – A HOMOTOPY CONTINUATION METHOD 10

0.25 1 ˜ MSE PB◦ (ζ =1) MSEB 0.2 0.5 ) i ˜ 0.15 PB◦ (ζ =0) m 0 ( E MSE 0.1 0.5 − 0.05 1 − 0 0.2 0.4 0.6 0.8 1 0 0 50 100 150 200 ζ Number of BP iterations Fig. 7. Solution path (as defined in Section 5.1) for Example 4. The Fig. 6. MSE (orange) and MSEB (blue) and their standard deviation depicted lines illustrate how the fixed points of BP evolve as ζ goes from (shaded area) over the cumulative number of iterations. Results are zero to one. Solid lines correspond to minima of FB ; dashed line(s) to averaged over 100 grid graphs (5 × 5); θi = 0.4 and Jij ∈ {−1, 1}. maxima of FB . For more details see Example 4. time by stopping SBP after consumption of a fixed iteration which the pairwise potentials are specified according to (16); we budget; this may even increase the accuracy with respect to the define the homotopy accordingly by exact solution. H(µ, ζ) = µ (µ) where Ψ = Ψ(ζ). (18) Finally, we investigate the influence of the coupling strength: − BP Consequently, for a given value ζ, H(µ, ζ) is zero precisely for therefore we consider θi ( 0.5, 0.5) and Jij ( β, β). ∼ U − ∼ U − the fixed points µ◦ of = ( , Ψ(ζ). It follows that a solution For every β 0, 0.5,..., 5 we consider L = 100 models U G and present the∈ averaged { results} in Fig. 5b. We restrict the results path to β 2 on the grid graph because BP did only converge c(ζ): H(µ, ζ) = 0 (19) sporadically≤ for models with stronger couplings. Even for models where BP converged, SBP requires only slightly more iterations exists that (i) has a start point c(ζ = 0) = µ : H(µ, ζ = 0) = 0, µ µ than BP and fewer than BPD. The benefits of SBP become (ii) an endpoint c(ζ = 1) = : H( , ζ = 1) = 0 , and (iii) increasingly evident as the coupling strength increases. Again SBP is continuous over ζ [0, 1], i.e., the solution path connects the start- with the endpoint.∈ (blue) significantly outperforms BP◦ (orange) and BPD◦ (green) on all graphs with respect to accuracy. SBP then proceeds along a solution path c(ζ) that implicitly ˜ defines the pseudomarginals PB◦ (ζ) by (8)-(9). In particular, we ˜ ˜ refer to the start- and endpoint by PB◦ (ζ = 0) and PB◦ (ζ = 1) 5 THEORETICAL PROPERTIESOF SBP respectively. The following example illustrates the solution path for a Here, we present some more formal arguments and discuss the unidirectional model. properties of SBP to understand under which conditions the Example 4. Consider a unidirectional grid graph of size 3 3 algorithm (presented in Section 3) can be expected to perform × well. We begin by formally defining the notion of a solution path with J = 2 and θ = 0.1. For this model, we can compute the in Section 5.1 before we discuss the properties of SBP’s solution fixed points of BP; i.e., the solutions to H(µ, ζ) = 0 with the path in Section 5.2. Finally, we restrict ourselves to unidirectional numerical homotopy continuation method as in [20]. models, for which we analyze the accuracy of SBP in Section 5.3. Fig. 7 shows how the solutions evolve as ζ goes from zero to one. Similar to Example 2, this illustrates how increasing the interactions between variables changes the energy landscape. That 5.1 Solution Path: Definition is, for small values of ζ we have a unique fixed point until – as ζ increases – two additional fixed points emerge; one of which is a First, we fix our notation: For the model with its potentials m minimum and one a maximum of . Although multiple solutions Ψ we refer to the pseudomarginals by P˜ U(ζ ), and, with slight B m B◦ m to H(µ, ζ) = 0 exist for larger valuesF of ζ, all except for one lack abuse of notation, we refer to the corresponding stationary point a start point and are therefore of no relevance for any method that of the Bethe free energy by (ζ ) = (P˜ (ζ )). It is B◦ m B B◦ m proceeds along a solution path defined by the homotopy in (18). beneficial to study the behaviorF of SBP asFM tends towards infinity. Therefore we consider the unit interval ζ [0, 1] to be ˜∈ the compact support of the functions B(ζ) and PB(ζ). SBP is 5.2 Solution Path: Properties inspired by the idea to proceed along aF so-called solution path as The following proposition summarizes the main properties of the ζ increases from zero to one in order to obtain the marginal dis- solution path that is specified and followed by SBP. tributions for the model of interest. Therefore, we consider a con- µ +1 µ tinuous homotopy function H(µ, ζ): R| | R| |. Note that Properties 1 (Convergence Properties). SBP proceeds along a the scaling factor modifies the graphical model →= ( , Ψ(ζ), for well-defined solution path c(ζ). More precisely, we have: U G SELF-GUIDED BELIEF PROPAGATION – A HOMOTOPY CONTINUATION METHOD 11 ˜ (1) BP has a unique fixed point µ1◦ for ζ = 0, so that SBP Theorem 4 (Property 1.2). Consider the pseudomarginals PB◦ (ζ) ˜s has a unique start point PB(ζ = 0) (see Theorem 3). along the (one) solution path that originates from the start point ˜ ˜ (2) A smooth (i.e., continuous) solution path originates from PB◦ (ζ = 0). Then, both the pseudomarginals PB◦ (ζ) and the ˜s PB(ζ = 0) (see Theorem 4). associated stationary point B◦ (ζ) are continuous functions of (3) SBP proceeds along this (unique) solution path c(ζ) the scaling parameter ζ alongF the whole solution path, i.e., for and terminates after a fixed number of iterations (see ζ [0, 1]. Theorem 6). ∈ Proof: First, we show that the Bethe free energy (P˜ , ζ) is continuously differentiable. Therefore, the deriva- In the sequel, we validate all points of Property 1. B B tiveF with respect to ζ must exist and be continuous as well. ˜ Note that B(PB, ζ) is a high-dimensional function defined over 5.2.1 Uniqueness of the Start Point ˜ F PB , i.e., over all locally consistent marginals. Now, let us First, we show that the initial model with ζ = 0 has a unique fixed consider∈ L (13) with the pairwise potentials defined by (16). Then, point. This is an immediate (and known) consequence of the fact when taking the derivative with respect to ζ, all terms depending that all variables are independent; we still include the following only on singleton marginals vanish and the derivative is given Theorem for the sake of completeness. according to

Theorem 3 (Property 1.1). For the initial model 1 with param- ∂ B(ζ) ∂ X X U F = P˜(xi, xj) ln Φ(xi, xj) eters Ψ(ζ = 0), BP has a unique fixed point µ1◦. That is, only ∂ζ −∂ζ (i,j) E xi,xj a single start point exists for the solution path of SBP. Moreover, ∈ this start point is exact, i.e., ˜ , and BP ∂ X X PB◦ (ζ = 0) = PB(ζ = 0) = P˜(x , x ) ζJ x x is guaranteed to converge.15 −∂ζ i j · ij · i j (i,j) E xi,xj X ∈ Proof: First, we show that all messages converge to the = J χ . (24) − ij · ij identical value whenever all Jij = 0. Therefore, note that we (i,j) E can omit the pairwise potential in the update equation of BP and ∈ rewrite (6) to Since (24) is a finite sum over finite terms, this proves that B(ζ) is continuously differentiable.16 More precisely, (24) provesF that n+1 X Y n µij (xj) Φ(xi) µki(xi). (20) (P˜ , ζ) is continuously differentiable with respect to ζ for ∝ FB B xi Xk N(i) j ∈X ∈ \ every point on the energy surface. We specifically consider the local minimum ◦ (ζ) along the It follows that (20) is independent of the state xj, i.e., the B solution path, i.e., the minimum that emerges fromF the unique start computed messages are the same for all states of Xj. Thus, after 1 point B◦ (ζ = 0). normalization, µij(xj) = = 0.5 for all edges (i, j) E. F Second, the pseudomarginals|X | are then obtained according∈ to Given that the Bethe free energy varies in a continuous fashion along this solution path (for ζ [0, 1]), the local minimum B◦ (ζ) ∈ F17 ˜ 1 Y and its position vary in a continuous fashion as well [8]. For P (xi) = Φ(xi) µki◦ (xi) (21) Zi existence of this fixed point, note that minima of B persist Xk N(i) ∈ when increasing ζ (similar to increasing the inverse temperature;F exiθi = . (22) see [63]). Consequently, because of the one-to-one correspondence θi θi e + e− between stationary points of the Bethe free energy B◦ (ζ) and ˜ F Finally, we must show that the exact marginals are the same, fixed points of BP PB◦ (ζ), the pseudomarginals are also continu- i.e, that P (xi) = P˜(xi). Therefore, note that the joint in (5) ous. Note, however, despite each minimum varying continuously factorizes only over the local potentials, which allows us to reorder with increasing ζ, the global minimum may not (as a former local the summation and compute the exact marginals according to minimum becomes the global minimum).

xiθi 1 xiθi X Y xj θj e P (xi) = e e = . (23) In Theorem 4, we have shown that continuity of the Bethe θi θi e + e− free energy and the corresponding pseudomarginals ˜ Z xj :Xj X Xi Xj X Xi B(ζ) PB◦ (ζ) ∈{ \ } ∈{ \ } implies a smoothF solution path. We implicitly assumed that this further implies a well-behaved solution path. In general this is the Theorem 3 thus reduces the problem of initializing SBP to case but there are two scenarios that might be problematic as we computing the fixed point messages µ of the initial model with 1◦ will discuss now: ζ = 0; this can be done in linear time. Moreover, a unique start First, it is absolutely critical that the stationary points of point implies that only a single solution path emerges from it. B are zero-dimensional, i.e., that every minimum is a distinct point,F and that there are finitely many stationary points. Fortunately, this 5.2.2 Smooth Solution Path ˜ Second, we show that – for any model – the Bethe free energy 16. Moreover – according to its definition in (13) – FB (PB , ζ) is a is a smooth function of the scaling parameter ζ in the sense that polynomial, which guarantees that it is not just continuously differentiable but in fact an analytical function [39]. Note that this is in accordance with no discontinuities exist. This implies the existence of a smooth the fact that true phase transitions (singularities in the derivative of the free solution path; let us make this notion of a smooth solution path energy) can only occur in the thermodynamic limit, where (24) becomes a sum more precise: over infinitely many terms that equates to infinity. 17. This is actually not restricted to problems with a unique minimum; instead sets of minima are considered in [8] to prove that minima of constrained 15. Note that this concurs with the sandwich-bound [54, Th.4] that reduces minimization problems vary continuously under continuous changes of the θi −θi ˜ to PXi (+1) = θi/(e + e ) = PXi (+1) for Jij = 0. objective function (see also [38, Section 5]). SELF-GUIDED BELIEF PROPAGATION – A HOMOTOPY CONTINUATION METHOD 12

is the case and the number of stationary points of B is always between both fixed points and consequently imply the absence of finite [50]; the same is true for the fixed points of BPF (for as long bifurcations. as the messages are properly normalized) [27]. To summarize, Theorem 4 and Lemma 5 guarantee the ex- Second, a potential issue is how SBP copes with bifurcations istence of a smooth solution path that connects the start point ˜ ˜ of the solution path. A bifurcation is a point along the solution PB◦ (ζ = 0) to the endpoint PB◦ (ζ = 1). This is a property of path at which a minimum B◦ turns into a maximum with two new fundamental importance as, at least in principle, it allows path- minima branching off; i.e.,F one minimum splitting up into two new tracking algorithms (as SBP) to proceed along this well-behaved minima (see Example 1 for the illustration of a typical bifurcation). solution path. The only remaining question is whether SBP is In that case, SBP will follow the evolution of a single minimum actually capable of proceeding along the solution path and if it and discard the other one. Which fixed point is selected mainly obtains the endpoint; i.e., if SBP converges. depends on the implementation details of BP (this is because 5.2.3 Convergence Properties of SBP we rely on BP to find the next fixed point after our prediction). Note that different minima are usually wide apart (see Example 1 So far, we have substantiated the claim that a smooth solution or [22]). If, however, two minima would be close together, SBP path emerges from the trivial start system leading to the target might switch between branches because of numerical issues (e.g. system. The existence of this smooth solution path alone, however, because of quantization errors or a step-size too large). Since this is not sufficient to guarantee that SBP obtains the marginals of the is not a fundamental property of the solution path, we will assume target system. As already discussed in Section 3, SBP relies on sufficient numerical precision and rule out this phenomenon. Thus, BP to proceed along the solution path; ideally, keeping the overall the existence of bifurcations is no particular problem and SBP amount of BP iterations low. If – at some point (for ζ < 1) – the ˜ retains the property of a unique solution path. fixed point PB◦ (ζ) becomes unstable, this approach will inevitably Although bifurcations are not problematic from a practical break down. Nonetheless, SBP will proceed along the solution perspective, it is still relevant for our theoretical analysis of SBP path for as long as it has a stable fixed point and approximates to understand if and where bifurcations of the solution path exist. the marginals precisely at the onset of instability. That is, even if ˜ Unfortunately, the current state of knowledge is still relatively lim- the terminal fixed point PB◦ (ζ = 1) is unstable, SBP will always ited in this respect. Note, however, that at least for unidirectional converge and approximate the marginals in a limited number of 18 models it is actually possible to make precise statements regarding iterations. the existence of bifurcations. Theorem 6 (Property 1.3). There exists some ζ 1 so that SBP ˜ ≤ Lemma 5. Let be a unidirectional model. Then, if the local converges to PB◦ (ζ) in (MNBP). G ∈ L O potentials of all nodes are specified by θi = 0, a bifurcation Proof: SBP increases ζ as long as BP converges in less than occurs. Else, if some local potentials are nonzero, no bifurcation NBP iterations, and stops otherwise. Consequently, BP corrects exists along the solution path. the accuracy of the fixed point for each value ζm within a bounded Proof: Since we consider unidirectional models there are number of iterations. The run-time of SBP is further determined s by the choice of M, i.e., the step-size (see Section 3.1). That is, two minima of B at most. We denote these minima by B t F F BP is run until convergence for at most M times and thus, SBP and B. We will explain now, why these two minima branch off theF previously unique minimum as we increase the scaling converges in at most M NBP iterations. SBP is consequently· capable of tracking the fixed point that parameter ζ if and only if θ = 0. Therefore, we denote the critical i emerges as ζ increases and requires MN iterations at most. scaling parameter, for which the two minima begin to exist, by ζ . BP c SBP may, however, only converge to a surrogate model for Then, by definition, a bifurcation exists if and only if both minima ζ < 1 and is not guaranteed to obtain the pseudomarginals coincide for ζ , i.e, if m c of the endpoint. One can characterize this error by computing a s t lim m = lim m . (25) bound on B◦ (ζm) B◦ (ζ = 1) given the difference between + i + i ζ ζc ζ ζc |F − F | → → Ψ(ζm) and Ψ(ζ = 1) (cf. [16, Theorem 16]). SBP works particularly well for attractive models, for which First, remember that both fixed points are symmetric if all θi = 0, s t every minimum of B is also a stable fixed point of BP [50]. i.e., mi = mi (see Example 1 and Fig. 2). Because of this F symmetry, both− fixed points will bifurcate from a fixed point with More precisely, if for some scaling term ζm the fixed point of the solution path becomes unstable, i.e., if it turns into a local zero mean mi = 0. Note, that this specific case is well understood and serves as the archetype for the existence of bifurcations on the maximum, two stable fixed points will branch of [50, Theorem models we consider in this work [33]. 6]. Thus for attractive models (and also for balanced models) a smooth solution path with a stable fixed point always exits so that Second, it remains to show that this is not the case if θ = 0. i SBP is consequently capable of tracking the solution path until its Therefore, we show that both minima do not coincide for ζ ,6 i.e., c end. lim ms = lim mt. (26) + i + i Corollary 6.1. Let be an attractive or unidirectional model. ζ ζc 6 ζ ζc → → Then, every point alongG the solution path going from ζ = 0 to This is an immediate consequence of the fact that both fixed points ζ = 1 is stable. Consequently, some number of steps M exists for s are not symmetric anymore (see Section 2.5). In fact, mi > 0 which SBP converges and obtains the marginals P˜◦ (ζ = 1). t B is strictly positive, whereas mi < 0 is strictly negative; as a consequence, both fixed points can never coincide, which rules Proof. According to Theorem 4, a smooth solution path c(ζ) ˜ out the existence of bifurcations along the solution path. exists that goes from the start point PB◦ (ζ = 0) to the endpoint The implications of Lemma 5 are illustrated in Example 1 and 18. This is in stark contrast to plain BP that fails to converge and thus cannot 2. Note how even small values for θ = 0 break the symmetry be used to approximate the marginals in such a case. i 6 SELF-GUIDED BELIEF PROPAGATION – A HOMOTOPY CONTINUATION METHOD 13 ˜ PB◦ (ζ = 1), where every point along c(ζ) corresponds to a local Even though the fixed point with mi(ζ) = 0 is unstable, SBP minimum of the Bethe free energy B◦ . Every local minimum is will still converge to this fixed point. This is rather surprising, as – also stable for attractive models [50].F Therefore, the correction by in general – BP can neither converge to an unstable fixed point nor BP will converge in every step if initialized sufficiently close to would it remain there because of quantization errors. There are two the fixed point, i.e., if the step size is chosen small enough. reasons why SBP obtains the potentially unstable fixed point in the special case of models with θi = 0 nonetheless. First, note that 5.3 Accuracy of SBP the (unique) fixed point of the initial mode is identical to the exact solution; hence all variables have zero mean at the start point, i.e., The convergence properties of SBP (Property 1) are of fundamen- ms(ζ = 0) = 0. Second, all messages µ (x ) µ (ζ = 0) take tal importance for understanding SBP. Yet, so far we have only i ij j ◦ the same value of µ (x ) = 1/2, since θ = 0∈. The extrapolated been concerned if SBP converges but neglected any discussion ij◦ j i messages will stay the same for every iteration of SBP because – regarding the quality of the obtained solution. Thus, we will now as discussed above – they already constitute a valid (albeit possibly theoretically analyze how well the fixed point obtained by SBP unstable) fixed point (see Example 1). Note, however, that these approximates the marginals. Unfortunately, however, our under- fixed point messages can be represented without quantization error standing of what characterizes a fixed point that approximates the in binary arithmetic. Thus SBP will remain on the fixed point marginals well is relatively poor. Indeed, it is not even obvious corresponding to the exact marginals. how the accuracy of the approximated marginals relates to the The above statement arguably holds only for very restricted accuracy of the approximated free energy [22]. models and is thus of limited practical relevance. Indeed, not only Therefore, we will mainly focus our discussion on the simplest does standard BP obtain the same fixed point if initialized with model class for which the solution space is well understood (i.e., identical messages but the exact marginals are also known and unidirectional models). Note that for unidirectional models SBP is can be computed analytically if all θ = 0, obviating the need for actually guaranteed to converge (see Section 2.5). Regarding the i approximate inference methods. accuracy of SBP – when applied to unidirectional models – we Our next results will discuss the accuracy of SBP for the summarize the properties as follows: more general class of unidirectional models both with respect Properties 2 (Accuracy of SBP for Unidirectional Models). Let to approximating the marginals and the Free energy (or partition G be a unidirectional model, i.e., all edges are attractive (Jij > 0) function respectively). and all local potentials are specified by θi with the same sign. Therefore, we first describe how the scaling term ζ affects the 19 Without loss of generality we consider the case where all θi 0. marginals along the solution path c(ζ) by generalizing Griffiths’ Then, the solution path ends at a fixed point that approximates≥ the inequality [13] to the fixed points of BP. In particular, this analysis marginals and the free energy well. More specifically, SBP either reveals that the means mi◦ are monotonically increasing with ζ. obtains the exact marginals (if all θ = 0) or finds the global i Lemma 8 (Monotonicity of the Means). Consider two attractive minimum of the Bethe free energy that – for unidirectional B∗ models and on the same graph , associated scaling pa- models – corresponds to the fixed pointF with the most accurate m n rametersU U, and local potentialsG specified by . Note marginals. Let us now break down the argument into the following ζm < ζn θi > 0 that we are effectively considering two nearly equivalent models three points: with n having strictly larger couplings than m. Now, let us ˜ U U (1) For θi = 0, SBP obtains the exact marginals, i.e., P ◦ = consider a BP fixed point of with positive means m◦(ζ ) > 0. B Um i m PB (see Theorem 7). Then, the corresponding fixed point of has larger means Un (2) For θi = 0, SBP finds the global minimum of B(ζ) and mi◦(ζn) > mi◦(ζm) and correlations χij◦ (ζn) χij◦ (ζm). gives the6 best approximation to the free energyF (see ≥ Proof: The main ingredient of this proof is to show Theorem 9). F how the ratio between the messages for their respective states (3) For θi = 0, the global minimum of B(ζ) also gives µ (X =+1) 6 F µ¯ = ij j increases monotonically with ζ. Then, the fact the best approximation of the marginals, i.e., no other ij µ (Xj = 1) ij − minimum of B(ζ) has more accurate marginals (see that the means and the correlations increase monotonically follows Theorem 10).F directly from the definition of the pseudomarginals in terms of the messages (see (8) and (9)). First, we show that SBP obtains the exact marginals in the First, we show that the message ratio µ¯ij increases mono- special case where all local potentials are specified by θi = 0 tonically: therefore, let us denote the messages on both models by µ (xj) and µ (xj) respectively. Furthermore, with- Theorem 7 (Property 2.1). Consider a unidirectional model with (m)ij (n)ij s out loss of generality, we assume that all couplings of n θi = 0. Then, the pseudomarginals obtained by SBP P˜ (ζ = 1) U B are ε-larger than those of m. Then, for every ε there exists are identical to the exact ones PB(ζ = 1). δ > 0 so that whenever 0U< µ¯ µ¯ < δ, we have (n)ij − (m)ij Proof: For attractive models with θ = 0 it is straightfor- 0 < J(n)ij J(m)ij < ε. Note that by assumption mi (0, 1] i − ∈ ward to compute the exact marginals. The exact marginals of all so that variables are uniformly distributed and have zero mean m (ζ) = 0 i µ (X = +1) µ (X = 1). (27) for all values of ζ; note that these marginals also correspond to a ij j ≥ ij j − stationary point B◦ (ζ) [33]. For sufficiently small values of Jij First, we show that for all (i, j) E this stationary pointF is a (unique) minimum but for large values of ∈ µ(◦ ) (Xj = +1) µ(◦ ) (Xj = +1) Jij it turns into a maximum (see [30, pp.385] and Example 1). m ij < n ij . (28) µ(◦m)ij(Xj = 1) µ(◦n)ij(Xj = 1) 19. Note that identical results can be obtained in a straightforward manner − − for θi < 0 because of symmetry properties. Therefore, consider the update rules of (6) for both states SELF-GUIDED BELIEF PROPAGATION – A HOMOTOPY CONTINUATION METHOD 14 Theorem 9 (Property 2). Consider a unidirectional model, i.e, an attractive model with θ > 0. Then the minimum of the Bethe n+1 J(m)ij +θi+ε Y n i µ(n)ij(Xj = +1) e µ(n)ki(Xi = +1) s ∝ free energy along the solution path B(ζ) is always the global Xk N(i) j s F ∈ \ minimum, i.e., B(ζ) = min B(ζ). Moreover, the Bethe free J θi ε Y n F F + e− (m)ij − − µ (X = 1), (29) energy of this stationary point decreases monotonically with ζ. (n)ki i − Xk N(i) j ˜s ∈ \ Proof: A unique start point PB(ζ = 0) exists by The- orem 3. For θi > 0 this fixed point has only positive means n+1 J(m)ij +θi ε Y n s s µ (Xj = 1) e− − µ (Xi = +1) m (ζ = 0) > 0 and correlations χ (ζ = 0) > 0. (n)ij − ∝ (n)ki i ij Xk N(i) j Consequently, Lemma 8 applies, which implies that the corre- ∈ \ s s J(m)ij θi+ε Y n sponding means m (ζ) and correlations χ (ζ) are monotonically + e − µ( ) (Xi = 1). (30) i ij n ki − increasing along the solution path. This further implies that the Xk N(i) j ∈ \ Bethe free energy ◦ (ζ) decreases along the solution path. Let ε B In (29) the larger product is multiplied by e and the smaller s (ζ = 1) correspondF to the endpoint of the solution path c(ζ) ε B product is divided by e . For (30) it is exactly the other way round Fthat emerges from the origin; then, it immediately follows that the so that the ratio between the messages increases which proofs (28). s error with respect to the endpoint B(ζ = 1) decreases along the We shall denote the imposed difference δ0 R+∗ on the messages F ∈ solution path: i.e., consider two arbitrary values m, n [0, 1] such by s s s ∈ s that n > m, then B(ζm) B(ζ = 1) B(ζn) B(ζ = 1) . |F −F | ≥ |F −F µ(◦n)ij(Xj = +1) = µ(◦m)ij(Xj = +1) + δ0, (31) | s It remains to show that SBP obtains the fixed point B(ζ) µ◦ (X = 1) = µ◦ (X = 1) δ0. (32) F (n)ij j − (m)ij j − − that is the global minimum of the Bethe free energy for a given scaling parameter ζ. Therefore, consider the fact that unidirec- Second, we show that mi◦(ζn) > mi◦(ζm), which is an immediate consequence of plugging (31) and (32) into (8) (see Example 1 tional models have two fixed points at most, only one of which has positive means (see Lemma 2). As discussed above, SBP that illustrates how all fixed points, i.e., minima of B, are pulled towards extreme values of zero and one as increases).F proceeds along a solution path belonging to the consistent fixed mi◦ J s (i) (ii) point with mi (ζ) > 0. The second fixed point t, if it exists, has t Finally, it remains to show that 0 χ◦ (ζm) χ◦ (ζn). negative means m (ζ) < 0. Given that SBP obtains the consistent ≤ ij ≤ ij i Without loss of generality we assume that all variables have equal fixed point, it follows from Lemma 2 that SBP obtains the global t s degree d + 1, constant coupling strength Jij = J, and constant minimum, i.e., = ∗ . FB ≥ FB FB local fields θi = θ. First we show that (i) holds, i.e., χ = χij,◦ 0 Theorem 9 makes it explicit that – for unidirectional models is positive. Let us express the marginals by (9) and denote the – SBP favors the fixed point that defines the global minimum of messages by µ = µ(m)ij(Xj = 1). It follows that µ(m)ij(Xj = B. This is in accordance with the observation that the average 1) = (1 µ) and that energyF is linear in the pseudomarginals (see (11)), which further − − J+2θ 2d J 2θ 2d J d d implies that varying the local potentials essentially tilts the energy χ=e µ +e − (1 µ) 2e− µ (1 µ) . (33) − − − landscape (cf. [37]). Note that this behavior is also illustrated in Let us further represent the messages by µ = 1/2 + β with β Example 1 and 2 accordingly: i.e, as opposed to Fig. 2, local ∈ [0, 1/2]. It follows that potentials with θi > 0 tilt the energy landscape in Fig. 3 so that the consistent fixed point constitutes the global minimum of B. (a)   F χ (1/2 + β)2d +(1/2 β)2d 2 (1/2 + β)d (1/2 β)d Although finding the global minimum of B is desirable, one ≥ − − − is ultimately interested in approximating theF exact free energy  2 = (1/2 β)d (1/2 + β)d (34) as good as possible. Fortunately, both properties coincide for − − unidirectional models. (b) 0, (35) Corollary 9.1 (Accuracy of the Bethe Approximation). Consider ≥ a unidirectional model. Then there is no other fixed point of BP P˜k where (a) follows from neglecting all exponential terms and thus B that approximates the free energy better than the one obtained upper bounding the positive term and lower bounding the negative by SBP, i.e., F term (with equality if and only if J = 0 and θ = 0) and (b) is a direct consequence of the square in (34). Now let us show that ˜s k PB(ζ) = argmin B(ζ) ∗ . (ii) holds, i.e., χ increases monotonically, by taking the derivative P˜k (ζ) |F −F | B ∈L of (33), so that Proof: This is a direct consequence of SBP obtaining the ∂  J+2θ 2d 1 J 2θ 2d 1 global minimum of the Bethe free energy (see Theorem 9) and the χ =2d e µ − e − (1 µ) − ∂µ − − fact that the Bethe free energy upper bounds the free energy for J d d 1 d 1 d attractive models (cf. [40]). +2de− µ (1 µ) − µ − (1 µ) (36) − − − This implies that the fixed point obtained by SBP minimizes (a) k J  d d 1 d 1 d the approximation error of the free energy B(ζ) ∗ . Moreover 2de− µ (1 µ) − µ − (1 µ) (37) |F −F | ≥ − − − – because of the direct relationship between the Bethe free energy (b) and the Bethe approximation of the partition function – this 0, (38) ≥ implies that the fixed point obtained by SBP also approximates where (a) follows from neglecting the, strictly positive, first term the partition function best. in (36), and (b) is a direct consequence from (27). Note, however, that statements about the approximation qual- ity of the free energy do not necessarily extend to the accuracy of SELF-GUIDED BELIEF PROPAGATION – A HOMOTOPY CONTINUATION METHOD 15

the pseudomarginals [22]. Nonetheless, at least for unidirectional ACKNOWLEDGMENT models, this is the case and SBP obtains the fixed point with the The authors greatly appreciate the valuable discussions with most accurate marginals. Bernhard C. Geiger, KNOW-Center, and acknowledge his help in Theorem 10 (Accuracy of Marginals). Consider a unidirectional making the formal definition of SBP much more concise. More- model and assume that the RSB assumption holds (see Con- over, we thank Alexander Ihler, University of California Irvine, ˜k jecture 1). Then there is no other fixed point of BP PB that for the encouraging discussion during UAI’17 that was pivotal for approximates the marginals better than the one obtained by SBP, actually pursuing the – at this time – rather unpolished idea. We i.e, further appreciate the help of Florian Kulmer in implementing and ˜s ˜k empirically evaluating our algorithm and are particularly indebted PB(ζ) = argmin PB(ζ) PB(ζ) . P˜k (ζ) | − | to Harald Leisenberger for carefully reading the manuscript. B ∈L This work was supported by the Graz University of Technol- Proof: Note that we are only considering unidirectional ogy LEAD project “Dependable Internet of Things in Adverse models here. For these models, although not formally verified, Environments”. Adrian Weller acknowledges support from the the RSBP assumption is accepted to hold. Thus, if multiple fixed David MacKay Newton research fellowship at Darwin College, points do exist this allows us to express the exact marginals as a The Alan Turing Institute under EPSRC grant EP/N510129/1 & convex combination of all fixed points according to Conjecture 1. U/B/000074, and the Leverhulme Trust via the Leverhulme Centre Moreover, remember that the Bethe free energy of unidirec- for the Future of Intelligence (CFI). tional models has two minima at most (see Section 2.5). This – and the fact that the convex combination of both fixed points yields the exact marginals by assumption – implies that the fixed REFERENCES point obtained by SBP (i.e., the one that minimizes the Bethe free energy) approximates the marginals best. [1] E. Allgower and K. Georg. Introduction to Numerical Continuation Methods. Society for Industrial and Applied Mathematics, 2003. To conclude, the fixed point along the solution path is always [2] D. Belanger and A. Young. The random field Ising model. Journal of stable for unidirectional models. Thus, SBP follows the solution magnetism and magnetic materials, 100(1-3):272–291, 1991. path to its endpoint. In doing so, SBP not only decreases the Bethe [3] A. Braunstein, F. Kayhan, G. Montorsi, and R. Zecchina. Encoding free energy and obtains the global minimum but it also provides for the Blackwell channel with reinforced belief propagation. In IEEE International Symposium on Information Theory, 2007. the fixed point with the most accurate marginals. [4] A. Braunstein, M. Mezard,´ and R. Zecchina. Survey propagation: an algorithm for satisfiability. Random Structures & Algorithms, 27(2):201– 6 CONCLUSION 226, 2005. [5] V. Chandrasekaran, M. Chertkov, D. Gamarnik, D. Shah, and J. Shin. In this paper, we introduced an iterative algorithm to perform Counting independent sets using the Bethe approximation. SIAM Journal approximate inference: self-guided belief propagation (SBP) is a on Discrete Mathematics, 25(2):1012–1034, 2011. simple and robust method that gradually accounts for the pairwise [6] A. Coja-Oghlan and W. Perkins. Bethe states of random factor graphs. potentials and guides itself towards a unique, stable, and accurate Communications in Mathematical Physics, 366(1):173–201, 2019. [7] G. F. Cooper. The computational complexity of probabilistic inference solution. using Bayesian belief networks. Artificial intelligence, 42(2-3):393–405, We provide a comprehensive theoretical analysis in order to 1990. validate the underlying assumptions: in particular, we showed that: [8] G. B. Dantzig, J. Folkman, and N. Shapiro. On the continuity of the minimum set of a continuous function. Journal of Mathematical Analysis (i) a smooth solution path exists and originates from a unique start and Applications, 17:519–548, 1967. point that is obtained by neglecting the pairwise potentials; (ii) [9] A. Dembo and A. Montanari. Ising models on locally tree-like graphs. this solution path is well-behaved and can be tracked by SBP; and The Annals of Applied Probability, 20(2):565–592, 2010. (iii) for unidirectional models, the solution of SBP approximates [10] G. Elidan, I. McGraw, and D. Koller. Residual belief propagation: Informed scheduling for asynchronous message passing. In Proceedings the exact solution well and corresponds to the global optimum of of UAI, 2006. the Bethe approximation. [11] A. Globerson and T. Jaakkola. Convergent propagation algorithms via The theoretical analysis does not fully answer for which mod- oriented trees. In Proceedings of UAI, 2007. els SBP is expected to perform particularly well. Although SBP [12] L. A. Goldberg, M. Jerrum, and M. Paterson. The computational complexity of two-state spin systems. Random Structures & Algorithms, obtains the most accurate fixed point for unidirectional models, the 23(2):133–154, 2003. marginals will not be approximated well for models with strong [13] R. B. Griffiths. Correlations in Ising ferromagnets. ii. external magnetic couplings, for the simple reason that accurate fixed points do not fields. Journal of Mathematical Physics, 8(3):484–489, 1967. exist at all. We expect that SBP is particularly advantageous for [14] F. Harary et al. On the notion of balance of a signed graph. The Michigan Mathematical Journal, 2(2):143–146, 1953. all models that have a large amount of fixed points with varying [15] T. Hazan and A. Shashua. Convergent message-passing algorithms for accuracy. Thus SBP will perform well for dense models with inference over general graphs with convex free energies. In Proceedings random couplings (particularly if they are centered around zero) of UAI, 2008. [16] A. Ihler, J. Fisher, and A. Willsky. Loopy belief propagation: convergence for which the marginals are largely affected by the local potentials. and effects of message errors. In Journal of Machine Learning Research, Not only do we expect that SBP approximates the marginals well pages 905–936, 2005. in that case but it is actually essential for estimating marginals as [17] M. Jerrum and A. Sinclair. Polynomial-time approximation algorithms BP will often fail to converge. for the Ising model. SIAM Journal on Computing, 22(5):1087–1116, 1993. In order to verify SBP empirically, we applied SBP to a range [18] J. K. Johnson, D. Oyen, M. Chertkov, and P. Netrapalli. Learning planar of models, not only attractive but also general ones. In our ex- Ising models. The Journal of Machine Learning Research, 17(1):7539– periments, SBP consistently improves the accuracy in comparison 7564, 2016. with BP (with and without damping). Besides, SBP approximates [19] M. Jordan. Graphical models. Statistical Science, pages 140–155, 2004. [20] C. Knoll, D. Mehta, T. Chen, and F. Pernkopf. Fixed Points of Belief the exact marginals well on graphical models for which BP does Propagation–An Analysis via Polynomial Homotopy Continuation. IEEE not converge at all. Trans. on Pattern Analysis and Machine Intelligence, 2017. SELF-GUIDED BELIEF PROPAGATION – A HOMOTOPY CONTINUATION METHOD 16

[21] C. Knoll and F. Pernkopf. On Loopy Belief Propagation–Local Stability [54] A. Weller and T. Jebara. Bethe bounds and approximating the global Analysis for Non-Vanishing Fields. In Proceedings of UAI, 2017. optimum. In Proceedings of AISTATS, 2013. [22] C. Knoll and F. Pernkopf. Belief propagation: Accurate marginals or [55] A. Weller and T. Jebara. Approximating the Bethe partition function. In accurate partition function – where is the difference? In Proceedings of Proceedings of UAI, 2014. UAI, 2019. [56] A. Weller and T. Jebara. Clamping variables and approximate inference. [23] C. Knoll, M. Rath, S. Tschiatschek, and F. Pernkopf. Message Scheduling In NIPS, pages 909–917, 2014. Methods for Belief Propagation. In Machine Learning and Knowledge [57] A. Weller, K. Tang, T. Jebara, and D. Sontag. Understanding the Bethe Discovery in Databases, pages 295–310. Springer, 2015. approximation: when and how can it go wrong? In Proceedings of UAI, [24] F. Koehler. Fast convergence of belief propagation to global optima: 2014. Beyond correlation decay. In Proceedings of NeurIPS, 2019. [58] M. Welling and Y. Teh. Approximate inference in Boltzmann machines. [25] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles Artificial Intelligence, 143(1):19–50, 2003. and Techniques. MIT press, 2009. [59] M. Welling and Y. W. Teh. Belief optimization for binary networks: A [26] F. Kschischang, B. Frey, and H. Loeliger. Factor graphs and the sum- stable alternative to loopy belief propagation. In Proceedings of UAI. product algorithm. IEEE Trans. on Information Theory, 47(2), 2001. Morgan Kaufmann Publishers Inc., 2001. [27] V. Martin, J.-M. Lasgouttes, and C. Furtlehner. The role of normalization [60] A. S. Willsky, E. B. Sudderth, and M. J. Wainwright. Loop series and in the belief propagation algorithm. preprint arXiv:1101.4170, 2011. Bethe variational bounds in attractive graphical models. In NIPS, 2008. [28] T. Meltzer, A. Globerson, and Y. Weiss. Convergent message passing [61] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free-energy algorithms: a unifying view. In Proceedings of UAI, 2009. approximations and generalized belief propagation algorithms. IEEE [29] O. Meshi, A. Jaimovich, A. Globerson, and N. Friedman. Convexifying Trans. on Information Theory, 51(7):2282–2312, 2005. the Bethe free energy. In Proceedings of UAI, 2009. [62] A. Yuille and A. Rangarajan. The concave-convex procedure. Neural [30] M. Mezard and A. Montanari. Information, Physics, and Computation. Computation, 15(4):915–936, 2003. Oxford Univ. Press, 2009. [63] L. Zdeborova´ and F. Krzakala. Generalization of the cavity method for [31] M. Mezard, G. Parisi, and M. Virasoro. Spin Glass Theory and Beyond: adiabatic evolution of gibbs states. Physical Review B, 81(22), 2010. An Introduction to the Replica Method and Its Applications, volume 9. [64] L. Zdeborova´ and F. Krzakala. Statistical physics of inference: Thresh- World Scientific Publishing Co Inc, 1987. olds and algorithms. Advances in Physics, 65(5):453–552, 2016. [32] J. Mooij and H. Kappen. On the properties of the Bethe approximation and loopy belief propagation on binary networks. Journal of Statistical Mechanics: Theory and Experiment, 2005(11):P11012, 2005. [33] J. M. Mooij and H. J. Kappen. Sufficient conditions for convergence of the sum–product algorithm. IEEE Trans. on Information Theory, Christian Knoll received his Bsc and Msc de- 53(12):4422–4437, 2007. gree in Information and Computer Engineering [34] K. Murphy, Y. Weiss, and M. Jordan. Loopy belief propagation for in 2011 and 2014 and earned his PhD degree approximate inference: an empirical study. In Proceedings of UAI, 1999. in 2019 from Graz University of Technology. He [35] F. Pernkopf, R. Peharz, and S. Tschiatschek. Introduction to Probabilistic is currently a postdoctoral researcher with the Graphical Models. Academic Press’ Library in Signal Processing, 2014. Signal Processing and Speech Communication [36] G. Perugini and F. Ricci-Tersenghi. Improved belief propagation algo- Laboratory at Graz University of Technology. rithm finds many Bethe states in the random-field Ising model on random His research interests include machine learning, graphs. Physical Review E, 97(1):012152, 2018. graphical models, and statistical signal process- [37] X. Pitkow, Y. Ahmadian, and K. D. Miller. Learning unbelievable ing with a particular focus on message passing probabilities. In Proceedings of NIPS, 2011. methods for probabilistic inference. [38] R. T. Rockafellar and R. J.-B. Wets. Variational Analysis, volume 317. Springer Science & Business Media, 2009. [39] W. Rudin. Principles of mathematical analysis, volume 3. McGraw-hill New York, 1964. [40] N. Ruozzi. Beyond log-supermodularity: lower bounds and the Bethe Adrian Weller received an MA in mathematics partition function. In Proceedings of UAI, 2013. from the University of Cambridge, and a PhD [41] A. Saade, F. Krzakala, and L. Zdeborova.´ Spectral bounds for the in Computer Science from Columbia University. Ising ferromagnet on an arbitrary given graph. Journal of Statistical He is Programme Director for AI at The Alan Mechanics: Theory and Experiment, 2017(5):053403, 2017. Turing Institute, the UK national institute for data [42] J. Shin. Complexity of Bethe Approximation. In Proceedings of science and AI, where he is also a Turing Fel- AISTATS, pages 1037–1045, 2012. low leading work on safe and ethical AI. He is The Numerical Solution of Systems of [43] A. Sommese and C. Wampler. a Principal Research Fellow in Machine Learn- Polynomials Arising in Engineering and Science , volume 99. World ing at Cambridge, and at the Leverhulme Cen- Scientific, 2005. tre for the Future of Intelligence where he is [44] D. Sontag and T. S. Jaakkola. New outer bounds on the marginal Programme Director for Trust and Society. His polytope. In Proceedings of NIPS, 2008. interests span AI, its commercial applications and helping to ensure [45] C. Srinivasa, S. Ravanbakhsh, and B. Frey. Survey Propagation beyond beneficial outcomes for society. Constraint Satisfaction Problems. In Proceedings of AISTATS, pages 286–295, 2016. [46] N. Stoop, T. Ott, and R. Stoop. Loopy belief propagation: Benefits and pitfalls on Ising-like systems. In Proceedings of NOLTA. International Symposium on Nonlinear Theory and its Applications, 2006. [47] C. Sutton and A. McCallum. Improved dynamic schedules for belief Franz Pernkopf received his MSc (Dipl. Ing.) propagation. In Proceedings of UAI, 2007. degree in Electrical Engineering at Graz Univer- [48] S. C. Tatikonda and M. I. Jordan. Loopy belief propagation and Gibbs sity of Technology, and a PhD degree from the measures. In Proceedings of UAI, 2002. University of Leoben. In 2002 he was awarded [49] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. Tree-reweighted the Erwin Schrodinger¨ Fellowship. He was a Re- belief propagation algorithms and approximate ml estimation by pseudo- search Associate at the University of Washing- moment matching. In Proceedings of AISTATS, 2003. ton, Seattle, from 2004 to 2006. From 2010-2019 [50] Y. Watanabe and K. Fukumizu. Graph zeta function in the Bethe free he was an Associate Professor at the Laboratory energy and loopy belief propagation. In NIPS, pages 2017–2025, 2009. of Signal Processing and Speech Communica- [51] Y. Weiss. Correctness of local probability propagation in graphical tion; since 2019, he is a Professor for Intelligent models with loops. Neural Comp., 12(1), 2000. Systems, at Graz University of Technology. His [52] A. Weller. Bethe and related pairwise entropy approximations. In research is focused on pattern recognition, machine learning, and com- Proceedings of UAI, 2015. putational data analytics with applications in various fields ranging from [53] A. Weller. Uprooting and rerooting graphical models. In International signal processing to medical data analysis and industrial applications. Conference on Machine Learning, pages 21–29, 2016.