<<

arXiv:1107.4196v3 [cs.IT] 20 Oct 2012 aoAt,C 40,UA(-al [email protected] (e-mail: USA 94304, CA Alto, Palo − ento 1 Definition where nant hr h umto soe all over is summation the where [ emnn fasur arx(see, square a of inlBteapoiain rp oe,priinfuncti partition cover, algorithm. sum-product graph permanent, matching, approximation, perfect Bethe tional ecnld h ae ihsm bevtosadconjecture and observations permanent-ba an kernels. some and better, pseudo-codewords with even permanent-based paper permanent about the the approximates conclude it we that considerat so under manent matrix the of permane of versions terms combinatoria lifted in a permanent so-called permanent present of the Bethe also on the We permanent, of permanent. bounds Bethe characterization Bethe upper the the potential on by on based bounded comment f lower we sum-produc the is and discuss the the permanent then that We the that efficiently. and that show minimum convex its we is finds Namely, function algorithm energy well. free works Bethe permanent the imating h emnn of permanent The S,Sp 9Ot ,21,ada h 01IfrainTheory Information 2011 the at Feb. USA, and CA, Jolla, La 2010, Diego, San 1, Mon UC A Workshop, Computing, 29–Oct. 48th Applications and the Sep. at Control, presented USA, 201 Communications, previously 20, on October was version paper Conference current this of in date material 2011; 21, July ceived eutn prxmto ftepraetteBtepermane Bethe minimizatio t the call function permanent We the algorithm. energy sum-product of free approximat the approximation of resulting be Bethe help the well certain with very problem a can solving entries, by real non-negative only o-eaiesur matrix, square non-negative a n 1 .O otbli ihHwetPcadLbrtre,10 P 1501 Laboratories, Hewlett–Packard with is Vontobel O. P. Manu Theory. Information on Transactions IEEE for Accepted ] oevr ecmeto osblte omdf h eh pe Bethe the modify to possibilities on comment we Moreover, otatti ento ihtedfiiino the of definition the with definition this Contrast eta otetpco hsppri h ento fthe of definition the is paper this of topic the to Central ne Terms Index approx- to approach this why reasons give we paper this In Abstract , if h eh emnn faNnNgtv Matrix Non-Negative a of Permanent Bethe The of { σ sgn( 1 θ , sa d permutation. odd an is , 2 I a eetybe bevdta h emnn of permanent the that observed been recently has —It n , . . . , σ i.e. ) Let , equals Bteapoiain eh emnn,frac- permanent, Bethe approximation, —Bethe det( perm( θ } θ θ . ( = = ) +1 sdfie ob h scalar the be to defined is .I I. θ θ i,j X if = ) NTRODUCTION σ σ ) i,j sgn( sa vnpruainadequals and permutation even an is X i.e. σ eara arxo size of matrix real a be fasur arxcontaining matrix square a of , σ i Y ∈ ) n [ n i e.g. ! Y ∈ ] [ emttoso h set the of permutations θ n i,σ [1]). , ] θ ( i,σ i ) , ( i ) , g). .Sm fthe of Some 2. g ilRoad, Mill age na Allerton nnual –1 2011. 6–11, aclO Vontobel O. Pascal iel,IL, ticello, determi- citre- script n ion. × sed (1) on, nts and act nt. ed he n  r- n d s t l . .Apoiain otePermanent the to Approximations B. the of above- computation the surprising. the not Therefore, contain are for #P-complete. permanent that numbers even is matrices complexity that ones of mentioned Note and permanent NP. zeros the class only the of in computation associat problems the problems decision counting the the of set “number with the or is P” #P (“sharp where #P [3], class P”) complexity the in is permanent mat the in exponential still size. is but permanent, the computing n ie h atta nmn plctosi sgo enough good is it an applications many compute in to that fact the given and h rt-oc complexity brute-force the iemti includes: matrix tive captur setup. be this can problems by counting interesting many matrices, n destructively.” and htptnilyhv uhlwrcomplexity. algorithm lower devise much approximate to have one to potentially allows exact that permanent, from the permanent. of requirements, the evaluation approximate in to relaxation methods This efficient on focuses fcetagrtm o optn h emnn)requires permanent) the computing Θ( for algorithms efficient aeweetesmi 1 otispstv n negative and positive contains (1) in sum terms, the g the where to contrast case in is This constructively.” “interfere sum ssmlri hscs eas ihti etito h sum the restriction this terms, non-negative with only because contains case (1) this in perman in the approximating simpler that expected is be to is non-negative, It negative. is (1) in nre u hr h product the where but entries eladtosadmlilctos eddt opt the compute to needed the in be multiplications) is of to and (number efficiently seem complexity additions as not arithmetic real least does the whereas at this Namely, conclude However, computed case. to determinant. be tempting the can is as permanent it determinant, the the that of definition the Permanent the Computing of Complexity A. 1 ntrso opeiycass h optto fthe of computation the classes, complexity of terms In ie h ifiut fcmuigtepraetexactly, permanent the computing of difficulty the Given • non-nega- a of permanent the approximating on work Earlier oevr ewl osdrol h aeweetematrix the where case the only consider will we Moreover, eas h ento ftepraetlossmlrthan simpler looks permanent the of definition the Because titysekn,teeaeas matrices also are there speaking, Strictly n · cee(PA)b erm icar n ioa[5] Vigoda to and lead Sinclair, ultimately approximation Jerrum, and randomized by [4] (FPRAS) polynomial Broder scheme fully of famous a work the with started which methods, Markov-chain-Monte-Carlo-based 2 n i.e. ) rtmtcoeain 2.Ti lal mrvsupon improves clearly This [2]. operations arithmetic h em nti u itreeconstructively “interfere sum this in terms the , approximation O 1 ( n ept hsrsrcint non-negative to restriction this Despite 3 ) i.e. Q ye’ loih oeo h most the of (one algorithm Ryser’s , i ∈ hr l nre of entries all where , O [ n ( ] n θ i,σ otepraet hspaper this permanent, the to · n ( i )= !) ) snnngtv o every for non-negative is θ ihpstv n negative and positive with O i.e. n h em nthis in terms the , 3 / 2 · ( θ n/e r non- are ) n eneral  σ ent for . rix ed ed θ s 1 2

(for more details, in particular for complexity estimates of • The minimum of the Bethe free energy function is quite these and related methods, see for example the discussion close to the minimum of the Gibbs free energy function. in [6]); Namely, as was recently shown by Gurvits [14], [15], • Godsil-Gutman-estimator-based methods by Karmarkar, the permanent is lower bounded by the Bethe permanent. Karp, Lipton, Lov´asz, and Luby [7] and by Barvinok [8]; Moreover, we list conjectures on strict and probabilistic • a divide-and-conquer approach by Jerrum and Vazi- Bethe-permanent-based upper bounds on the permanent. rani [9]; In particular, for certain classes of square non-negative • a Sinkhorn-matrix-rescaling-based method by Linial, matrices, empirical evidence suggests that the permanent Samorodnitsky, and Wigderson [10]; is upper bounded by some constant (that grows rather • Bethe-approximation / sum-product-algorithm (SPA) modestly with the matrix size) times the Bethe perma- based methods by Chertkov, Kroc, and Vergassola [11] nent. and by Huang and Jebara [12]. • We show that the SPA finds the minimum of the Bethe The study in this paper was very much motivated by these last free energy function under rather mild conditions. In two papers on graphical-model-based methods, in particular fact, the error between the iteration-dependent estimate because the resulting algorithms are very efficient and the of the Bethe permanent and the Bethe permanent itself obtained permanent estimates have an accuracy that is good decays exponentially fast, with an exponent depending enough for many purposes. on the matrix θ. Interestingly enough, in the associated The main idea behind this graphical-model-based approach convergence analysis a key role is played by a certain is to formulate a factor graph whose partition function equals that maximizes the sum of its entropy rate the permanent that we are looking for. Consequently, the plus some average state transition cost. negative logarithm of the permanent equals the minimum of Besides leaving some questions open with respect to (w.r.t.) the so-called Gibbs free energy function that is associated with the Bethe free energy function (see, e.g., the above-mentioned this factor graph. Although being an elegant reformulation conjectures concerning permanent upper bounds), these results of the permanent computation problem, this does not yield by-and-large validate the empirical success, as observed by any computational savings yet. Nevertheless, it suggests to Chertkov, Kroc, and Vergassola [11] and by Huang and look for a function that is tractable and whose minimum is Jebara [12], of approximating the permanent by graphical- close to the minimum of the Gibbs free energy function. One model-based methods. such function is the so-called Bethe free energy function [13], Let us remark that for many factor graphs with cycles the and with this, paralleling the above-mentioned relationship Bethe free energy function is not as well behaved as the between the permanent and the minimum of the Gibbs free Bethe free energy function under consideration in this paper. energy function, the Bethe permanent is defined such that its In particular, as discussed in [16], every code picked from negative logarithm equals the global minimum of the Bethe an ensemble of regular low-density parity-check codes [17], free energy function. The Bethe free energy function is an where the ensemble is such that the minimum Hamming interesting candidate because a theorem by Yedidia, Freeman, distance grows (with high probability) linearly with the block and Weiss [13] says that fixed points of the SPA correspond length, has a Bethe free energy function that is non-convex to stationary points of the Bethe free energy function. in certain regions of its domain. Nevertheless, decoding such In general, this approach of replacing the Gibbs free energy codes with SPA-based decoders has been highly successful function by the Bethe free energy function comes with very (see, e.g. [18]). few guarantees, though. C. Related Work • The Bethe free energy function might have multiple local The literature on permanents (and adjacent areas of counting minima. perfect matchings, counting zero/one matrices with specified • It is unclear how close the (global) minimum of the Bethe row and column sums, etc.) is vast. Therefore, we just mention free energy function is to the minimum of the Gibbs free works that are (to the best of our knowledge) the most relevant energy function. to the present paper. • It is unclear if the SPA converges, even to a local Besides the already cited papers [11], [12] on Bethe- minimum of the Bethe free energy function. (As we will approximation-based methods to the permanent of a non- see, the factor graph that we use (see Fig. 1) is not sparse negative matrix, some aspects of the Bethe free energy and has many short cycles, in particular many four-cycles. function were analyzed by Watanabe and Chertkov in [19] These facts might suggest that the application of the SPA and by Chertkov, Kroc, Krzakala, Vergassola, and Zdeborov´a to this factor graph is rather problematic.) in [20]. (In particular, the paper [19] applied the loop calculus Luckily, in the case of the permanent approximation problem, technique by Chertkov and Chernyak [21].) Very recent work one can formulate a factor graph where the Bethe free energy in that line of research is presented in a paper by A. B. Yedidia function is very well behaved. In particular, in this paper we and Chertkov [22] that studies so-called fractional free energy discuss a factor graph that has the following properties. functionals, and resulting lower and upper bounds on the • We show that the Bethe free energy function is, when permanent of a non-negative matrix. suitably parameterized, a convex function; therefore it has Because computing the permanent is related to counting no non-global local minima. perfect matchings, the paper by Bayati and Nair [23] on 3 counting matchings in graphs with the help of the SPA is will comment on this connection in Section VII-E. very relevant. Note that their setup is such that the perfect Finally, as already mentioned in the previous subsection, matching case can be seen as a limiting case (namely the zero- Gurvits’s recent papers [14], [15] contain important observa- temperature limit) of the matching setup. However, for the tions w.r.t. the relationship between the permanent and the case (a case for which the authors of [23] Bethe permanent of a non-negative matrix, and puts them into make no claims) the convergence proof of the SPA in [23] the context of Schrijver’s permanental inequality. is incomplete. Moreover, their matchings are weighted only inasmuch as the weight of a matching depends on the size D. Overview of the Paper of the matching. Consequently, because all perfect matchings have the same size, they all are assigned the same weight. (See This paper is structured as follows. We conclude this intro- also the related paper by Bayati, Gamarnik, Katz, Nair, and ductory section with a discussion of some of the notation that Tetali [24], and an extension to counting perfect matchings in is used. In Section II we then introduce the main normal factor certain types of graph by Gamarnik and Katz [25].) For an graph (NFG) for this paper, in Section III we formally define SPA convergence analysis of a slightly generalized weighted the Bethe permanent, in Section IV we discuss properties of matching setup, the interested reader is referred to a recent the Bethe entropy function and the Bethe free energy function, paper by Williams and Lau [26]. in Section V we analyze the SPA, in Section VI we give Very relevant to the present paper are also papers on max- a “combinatorial characterization” of the Bethe permanent product algorithm / min-sum algorithm based approaches to in terms of graph covers of the above-mentioned NFG, in the maximum weight perfect matching problem [27]–[30]. Section VII we discuss Bethe-permanent-based bounds on the As shown in these papers, these algorithms find the desired permanent, in Section VIII we list some thoughts on using solution efficiently for bipartite graphs, a fact which is strongly the concept of the “fractional Bethe entropy function,” in related to the observation that the linear programming re- Section IX we list some observations and conjectures, and laxation of the underlying integer linear program is tight in we conclude the paper in Section X. Finally, the appendix this case. This tightness in relaxation, which is an immediate contains some of the proofs. consequence of a theorem by Birkhoff and von Neumann (see Theorem 3), goes also a long way towards explaining why the E. Basic Notations and Definitions Bethe free energy function under consideration in the present This subsection discusses the most important notations that paper is well behaved. Finally, let us remark that because the will be used in this paper. More notational definitions will be difference between two perfect matchings corresponds to a given in later sections. union of disjoint cycles, the max-product algorithm / min- We let R be the field of real numbers, R be the set of sum algorithm convergence analysis in [27]–[30] has some >0 non-negative real numbers, R be the set of positive real resemblance with Wiberg’s max-product algorithm / min-sum >0 numbers, Z be the ring of integers, Z be the set of non- algorithm convergence analysis for so-called cycle codes [31]. >0 negative integers, Z be the set of positive integers, and for Linial, Samorodnitsky, and Wigderson [10] published a >0 any positive integer L we define [L] , 1,...,L . Scalars deterministic strongly polynomial algorithm to compute the are denoted by non-boldface characters, whereas{ vectors} and permanent of an n n non-negative matrix within a multiplica- matrices by boldface characters. For any positive integer L, tive factor of en. This× is related to the present paper because the matrix 1 is the all-one matrix of size L L. their approach is based on Sinkhorn’s matrix rescaling method, L×L × which can be seen as finding the minimum of a certain free energy type function. Assumption 2 Throughout this paper, if not mentioned other- The present paper has some similarities with recent papers wise, n is a positive integer and θ = (θi,j )i,j is a non-negative matrix of size n n. Moreover, we assume that θ is such that by Barvinok on counting zero/one matrices with prescribed × row and column sums [32] and by Barvinok and Samorod- perm(θ) > 0, i.e., there is at least one permutation σ of [n] such that .  nitsky on computing the partition function for perfect match- i∈[n] θi,σ(i) > 0 ings in hypergraphs [33]. However, these papers pursue what We useQ calligraphic letters for sets, and the size of a set would be called a mean-field theory approach in the physics S literature [34]. An exception to the previous statement is is denoted by . For a finite set , we let ΠS be the set of probability mass|S| functions over ,Si.e., Section 3.2 in [32], which contains Bethe-approximation-type S computations. (See the references in that section for further , p > for all papers that investigate similar approaches.) ΠS = ps s∈S ps 0 s , ps =1 . ( ∈ S ) As mentioned in the abstract, the present paper discusses a s∈S  X combinatorial characterization of the Bethe permanent in terms Moreover, for any positive integer L, we define to be L×L of permanents of so-called lifted versions of the matrix under the set of all L L permutation matrices, i.e., P consideration. For this we use results from [16] that give a × combinatorial characterization of the Bethe partition function P is a matrix of size L L P contains exactly one 1×per row of a factor graph in terms of the partition function of graph , P  . PL×L P contains exactly one 1 per column covers of this factor graph. Interestingly, very similar objects    P contains 0s otherwise  were considered by Greenhill, Janson, and Ruci´nski [35]; we

    4

Clearly, there is a bijection between and the set of all 1 1 PL×L permutations of [L]. Finally, for any positive integer L, we let 2 2 ΓL×L be the set of doubly stochastic matrices of size L L, i.e., × 3 3 γ > 0 for all (i, j) [L] [L] 4 4 i,j ∈ × Γ , γ = γ γi,j =1 for all i [L] . 5 5 L×L  i,j j∈[L] ∈  γi,j =1 for all j [L]   Pi∈[L] ∈  Fig. 1. The NFG N(θ) is based on a complete with two times The convex hull [36] P of some subset of some multi- S n vertices (here n = 5). The function nodes on the left-hand side represent the dimensional real space is denoted by conv( ). In the follow- local functions {g } ∈I , the function nodes on the right-hand side represent S i i ing, when talking about the interior of a polytope, we will the functions {gj }j∈J , and with the edge e = (i,j) we associate the variable mean the relative interior [36] of that polytope. Ae = Ai,j . (See Definition 4 for more details.) When appropriate, we will identify the set of L L real matrices with the L2-dimensional real space. In that× sense, two times n vertices, is a rather natural candidate, and, as we can be seen as a polytope in the 2-dimensional real ΓL×L L will see, has very interesting and useful properties. space. Clearly, ΓL×L is a convex set, and every permutation matrix of size L L is a doubly stochastic matrix of size L Definition 4 We define the NFG N(θ) , N( , , , ) as . Most interestingly,× every doubly stochastic matrix of siz×e L follows (see also Fig. 1). F E A G L L can be written as a convex combination of permutation • The set of vertices (henceforth also called function nodes) matrices× of size L L; this observation is a consequence of , ˙ , the important Birkhoff–von× Neumann Theorem. is , where [n] will be called the set of leftF verticesI ∪ and J , [nI] will be called the set of right vertices.2 J Theorem 3 (Birkhoff–von Neumann Theorem) For any • The set of full-edges is , = (i, j) i positive integer L, the set of doubly stochastic matrices of full , j and the set ofE half-edgesI × is J = , i.e., the∈ size L L is a polytope whose vertex set equals the set of I ∈ J Ehalf  ∅ × empty set. (A full-edge is an edge connecting two vertices, permutation matrices of size L L, i.e., × whereas a half-edge is an edge that is connected to only one vertex.) The set of edges is , = . vertex-set(ΓL×L)= L×L. E Efull ∪Ehalf Efull P • With every edge e = (i, j) we associate the variable ∈E As a consequence, the set of doubly stochastic matrices of size Ae = Ai,j with alphabet e = i,j , 0, 1 ; a L L is the convex hull of the set of all permutation matrices realization of willA be denotedA by { } . × Ae = Ai,j ae = ai,j of size L L, i.e., • The set , = will be called the × e e i,j i,j configurationA set, andA so A ΓL×L = conv( L×L). Q Q P a , (ae)e∈E = (ai,j ) Proof: See, e.g., [37, Section 8.7].  (i,j)∈I×J ∈ A Finally, all logarithms will be natural logarithms and the will be called a configuration. For a given vector a, we value of 0 log(0) is defined to be equal to 0. also define the sub-vectors · ai , (ai,j )j∈J and aj , (ai,j )i∈I . II. NORMAL FACTOR GRAPH REPRESENTATION When convenient, the vector a will be considered to be an Factor graphs are a convenient way to represent multivariate n n matrix. Then a corresponds to the ith row of a, and × i functions [38]. In this paper we use a variant called “normal aj corresponds to the jth column of a. (Note that we will factor graphs (NFGs)” [39] (also called “Forney-style factor also use the notations ai , (ai,j )j∈J and aj , (ai,j )i∈I graphs” [40]), where variables are associated with edges. when there is not necessarily an underlying configuration As already mentioned in the introduction, the main idea a of the whole NFG.) 3 4 behind the graphical-model-based approach to estimating the • For every i we define the local functions permanent is to formulate an NFG such that its partition ∈ I function equals the permanent. There are of course different θi,j (if ai = uj ) gi : i,j′ R, ai A → 7→ 0 (otherwise) ways to do this and typically different formulations will yield j′ (p different results when estimating the permanent with sub- Y optimal algorithms like the SPA. It is well known that when 2Here, F , I ∪˙ J stands for the more cumbersome F , {left}×I ∪ the NFG has no cycles, then the SPA computes the partition {right}×J . In the following, i (and variations thereof) will refer to a left function exactly, however, for the given problem any NFG vertex and j (and variations thereof) will refer to a right vertex. In that spirit, variables like ηi and ηj are different variables, also if i = j. 3 without cycles yields highly inefficient SPA update rules for Here and in the following, uj , j ∈ J , stands for the length-n vector reasonably large n (otherwise there would be a contradiction where all entries are zero except for the jth entry that equals 1. The vector ui, i ∈I, is defined similarly. to the considerations in Section I-A), and so we will focus on 4 Here and in the following, we will use the short-hands , , ′ , NFGs with cycles. The NFG that is introduced in the following Pi Pj Pi Pj′ , Pe, Pe′ for Pi∈I , Pj∈J , Pi′∈I , Pj′∈J , Pe∈E , Pe′∈E , definition and that is based on a complete bipartite graph with respectively, with similar conventions for products. 5

Similarly, for every j we define the local functions Definition 6 The (Gibbs) partition function of the NFG N(θ) ∈ J is defined to be the sum of the global function over all θi,j (if aj = ui) gj : i′ ,j R, aj configurations, or, equivalently, the sum of the global function A → 7→ 0 (otherwise) i′ (p over all valid configurations, i.e., Y • For every i we define the function node alphabet i Z , g(a)= g(c). (2) to be the set∈ I A G a c X∈A X∈C In the following, when confusion can arise what NFG a certain , ′ i ai i,j gi(ai) =0 = uj j . N A  ∈ ′ A 6  { | ∈J} Gibbs partition function is referring to, we will use ZG (θ) ,  j  5 Y etc., instead of ZG.  Similarly, for every j we define the function node    alphabet to be the∈ set J Aj Definition 7 The Gibbs free energy function associated with the NFG N(θ) is defined to be , a ′ g (a ) =0 = u i . Aj j ∈ Ai ,j j j 6 { i | ∈ I} ( i′ ) F : Π R, p U (p) H (p), Y G C → 7→ G − G (The sets and are also known as the local con- i j where straint codesA of theA function nodes i and j, respectively.) • The global function g is defined to be U : Π R, p pc log g(c) , G C → 7→ − · c∈C X  g : R, a g (a ) g (a ) . HG : ΠC R, p pc log pc . A → 7→ i i ·  j j  → 7→ − · i ! j c∈C Y Y X  • A configuration c with g(c) = 0 will be called a valid Here, UG is called the Gibbs average energy function and HG configuration. The set of all valid6 configurations, i.e., is called the Gibbs entropy function. In the following, when confusion can arise what NFG a certain Gibbs free energy , c g(c) =0 C ∈ A 6 function is referring to, we will use FG,N(θ), etc., instead of  ci,j i,j , (i, j) FG. Similar comments apply to UG and HG.  = (c ) c∈ A , i ∈I×J ,  i,j i,j∈I×J i ∈ Ai ∈ I  cj j , j For more details on these functions we refer to, e.g., [13].  ∈ A ∈ J  will be called the global behavior of N(θ). Considering For a discussion of these functions in the context of NFGs we   the elements of as n n matrices, it can easily be refer to, e.g., [16]. Note that HG is a concave function of p, C × that U is a linear function of p, and that, consequently, F verified that = n×n. This allows us to associate with G G C P is a convex function of p. c the permutation σc :[n] [n] that maps i to ∈C → ∈ I j if ci,j =1.  ∈ J Lemma 8 The permanent of θ can be expressed in terms of Lemma 5 Consider the NFG N(θ) and let c be a valid the partition function or in terms of the minimum of the Gibbs N configuration of it. Then ∈C free energy function of (θ). Namely,

gi(ci)= θi,σc(i), i , perm(θ)= ZG = exp min FG(p) , (3) ∈ I − p q   gj(cj )= θ −1 , j , σc (j),j where the minimization is over p Π . ∈ J ∈ C q g(c)= θ = θ −1 . i,σc(i) σc (j),j Proof: The first equality is a straightforward consequence of i j Y Y Definitions 1 and 4, along with Lemma 5. For the second Proof: The first two expressions follow easily from the defi- equality we refer to, e.g., [13], [16].  nitions of g and g in Definition 4. The third expression is a i j The partition function ZG and the Gibbs free energy func- consequence of tion FG were specified for temperature T = 1 in the above definitions. For a general temperature parameter T R>0, ∈ 1/T g(c)= g (c ) g (c ) these functions have to be replaced by ZG , g(c) i i ·  j j  c∈C i ! j and by FG(p) , UG(p) T HG(p), respectively, and Y Y − · 1 P   Lemma 8 has to be replaced by ZG = exp minp FG(p) . − T Of course, ZG = perm(θ) does not hold anymore, unless a = θi,σ (i) θ −1 c ·  σc (j),j   i ! j suitable T -dependence is built into the definition of perm(θ). Y q Y q   5Note that “function” in “partition function” refers to the fact that the ex- = θ θ ′ ′ i,σc(i) · i ,σc(i ) pression in (2) typically is a function of some parameters like the temperature i ! i′ ! T (see the discussion below). A better word for “partition function” would Y q Y q possibly be “partition sum” or “state sum,” which would more closely follow = θi,σc(i). the German “Zustandssumme” whose first letter is used to denote the partition i  function. Y 6

III. THE BETHE PERMANENT where Although the reformulation of the permanent in Lemma 8 U : R, β U (β )+ U (β ) B B → 7→ B,i i B,j j in terms of a convex minimization problem is elegant, from a i j X X computational perspective it does not represent much progress. H : R, β H (β )+ H (β ) B B → 7→ B,i i B,j j However, it suggests to look for a minimization problem that i j can be solved efficiently and whose minimum value is related X X H (β ), to the desired quantity. This is the approach that is taken in − B,e e e this section and will be based on the Bethe approximation of X 6 the Gibbs free energy function: the resulting approximation with of the permanent of a non-negative square matrix will be U : R, β β a log g (a ) , B,i Bi → i 7→ − i, i · i i called the Bethe permanent. (Note that in this section we give ai the technical details only; for a general discussion w.r.t. the X  U : R, β β a log g (a ) , motivations behind the Bethe approximation we refer to [13], B,j Bj → j 7→ − j, j · j j aj and for a discussion of the Bethe approximation in the context X  R of NFGs we refer to [16].) HB,i : i , βi βi,ai log(βi,ai ), B → 7→ − a · Xi R Definition 9 Consider the NFG N(θ). We let HB,j : j , βj βj,aj log(βj,aj ), B → 7→ − · aj β , (β ) , (β ) , (β ) X i i∈I j j∈J e e∈E H : R, β β log(β ). B,e Be → e 7→ − e,ae · e,ae a be a collection of vectors based on the real vectors Xe Here, UB is the Bethe average energy function and HB is βi , (βi,ai )ai∈Ai , the Bethe entropy function. In the following, when confusion can arise what NFG a certain Bethe free energy function is βj , (βj,aj )aj ∈Aj , referring to, we will use FB,N(θ), etc., instead of FB. Similar βe , (βe,ae )ae∈Ae . comments apply to UB and HB.  Moreover, we define the sets With this, the Bethe partition function of an NFG is defined such that an equality analogous to the second equality in (3) i , ΠAi , i , B ∈ I holds. , Π , j , Bj Aj ∈ J , Π , e , Definition 11 The Bethe partition function of the NFG N(θ) Be Ae ∈E is defined to be and call , , and , the ith local marginal polytope, Bi Bj Be the jth local marginal polytope, and the eth local marginal ZB , exp min FB(β) . − β∈B polytope, respectively. (Sometimes i is also called the ith   belief polytope, etc.) B In the following, when confusion can arise what NFG a certain With this, the local marginal polytope (or belief polytope) Bethe partition function is referring to, we will use ZB(N), is defined to be the set etc., instead of ZB.  B βi i for all i The next definition is the main definition of this paper and ∈B ∈ I βj j for all j was motivated by the work of Chertkov, Kroc, and Vergasso-  β ∈B for all e ∈ J   e e  la [11] and by the work of Huang and Jebara [12].  ∈B ∈E   ′   βi,a = βe,ae   i  Definition 12 Consider the NFG N(θ). The Bethe permanent  a′ ′  =  β i∈Ai: ai,j =ae  , B  P  of θ, which will be denoted by permB(θ), is defined to be  for all e = (i, j) , ae e  ∈E ∈ A N permB(θ) , ZB (θ) .  β ′ = β  j,aj e,ae   a′ ∈A : a′ =a   j j i,j e   R  P  A similar comment w.r.t. a temperature parameter T >0  for all e = (i, j) , ae e  ∈  ∈E ∈ A  as at the end of Section II applies also to the definition of the   where β  is called a pseudo-marginal vector. (The two Bethe partition function and the Bethe free energy function. constraints∈ that B were listed last in the definition of will be In the following, however, we will only consider the case called “edge consistency constraints.”) B  T = 1. An exception is Section VIII on the fractional Bethe approximation: this approximation can be viewed as introduc- ing multiple temperature parameters, namely one temperature Definition 10 The Bethe free energy function associated with parameter for every term of H , and therefore includes the the NFG N(θ) is defined to be the function B single temperature parameter case as a special case.

FB : R, β UB(β) HB(β), 6 Here and in the following, we use the short-hand a for a , etc.. B → 7→ − P i P i∈Ai 7

IV. PROPERTIES OF THE BETHE ENTROPY FUNCTION Proof: It is straightforward to verify that the pseudo-marginal ANDTHE BETHE FREE ENERGY FUNCTION vector β which is specified in the lemma statement is indeed There are relatively few general statements about the shape in . Moreover, one can verify that for every pseudo-marginal vectorB β there is a γ Γ such that γ indexes β.  of the Bethe entropy function. In this section we show that ∈B ∈ n×n Bethe entropy function associated with N(θ) has many special In the following, for a given matrix γ = (γi,j )(i,j)∈I×J , properties. the ith row of γ will be denoted by γi = (γi,j )j∈J and the • In general, the Bethe entropy function is not a concave jth column of γ will be denoted by γj = (γi,j )i∈I . function. However, here we show that the Bethe entropy The above observations allow us to express the Bethe free N energy function and related functions in terms of γ Γn×n. function associated with (θ) is, when suitably parame- ∈ terized, a concave function. Similarly, the Bethe free energy function is in general Lemma 14 Consider the NFG N(θ). Then not a convex function. However, because the Bethe free FB : Γn×n R, γ UB(γ) HB(γ), energy function is the difference of the Bethe average → 7→ − energy function and the Bethe entropy function, because where the Bethe average energy function is linear in its argu- U : Γ R, γ U (γ )+ U (γ ), ments, and because the Bethe entropy function is concave, B n×n → 7→ B,i i B,j j i j the Bethe free energy function associated with N(θ) is X X 7 R convex and does not have non-global local minima. HB : Γn×n , γ HB,i(γi)+ HB,j (γi) → 7→ i j • In general, the Bethe entropy function can take on pos- X X itive, zero, and negative values. However, here we show H (γ ), − B,(i,j) i,j that the Bethe entropy function associated with N(θ) is i,j non-negative. X with • Very often, the directional derivative of the Bethe entropy 1 function away from a vertex of its domain is + or UB,i : Π R, γi γi,j log(θi,j ), N ∞ [n] → 7→ −2 · . For the Bethe entropy function of (θ) we show j −∞ X that the directional derivative away from any vertex of 1 U : Π R, γ γ log(θ ), its domain has a (non-negative) finite value. (As we will B,j [n] → j 7→ −2 i,j · i,j i see in Section V, this observation will have important X consequences for the SPA convergence analysis.) HB,i : Π R, γi γi,j log(γi,j ), [n] → 7→ − · j X H : Π R, γ γ log(γ ), A. Reformulation of the Bethe Entropy Function B,j [n] → j 7→ − i,j · i,j i and the Bethe Free Energy Function X H : [0, 1] R As mentioned in Section I-C, the successes of max-product B,(i,j) → γ γ log(γ ) (1 γ ) log(1 γ ), algorithm / min-sum algorithm based approaches to the bipar- i,j 7→ − i,j i,j − − i,j − i,j tite graph maximum weight perfect matching problem in the Proof: This follows straightforwardly from Definition 10 and papers [27]–[30] was heavily based on a theorem by Birkhoff Lemma 13.  and von Neumann (see Theorem 3). This theorem is equally central to the results of the present paper. Namely, in the next lemma we introduce a parameterization of the belief polytope Corollary 15 It holds that based on Γn×n that will be used for the rest of the paper. B permB(θ) = exp min FB(γ) , − γ∈Γn×n Lemma 13 Consider the NFG N(θ). Its belief polytope can   B where be parameterized by Γn×n, the set of doubly stochastic matri- ces of size n n. In particular, we define the parameterization FB(γ)= UB(γ) HB(γ), such that the× matrix γ = (γ ) Γ indexes the − i,j (i,j)∈I×J ∈ n×n U (γ)= γ log(θ ), pseudo-marginal vector β with B − i,j i,j ∈B i,j X βi,ai = βj,aj = γi,j , HB(γ)= γi,j log(γi,j )+ (1 γi,j ) log(1 γi,j ). ai=uj aj =ui − − − i,j i,j and X X Proof: This follows from Definitions 11 and 12 and from  βe,ae =1 γi,j , βe,ae = γi,j , Lemma 14. ae=0 − ae=1 If the sign in front of the second half of the expression for every i , j , and e = (i, j) . ∈ I ∈ J ∈E for HB(γ) in Corollary 15 were a minus sign, then HB(γ) could be expressed as a sum of binary entropy functions, 7The fact that convexity / non-convexity of a function depends on its param- eterization might explain the non-convexity observations in [12, Section 3.3] and therefore the concavity of HB(γ) would be immediate. w.r.t. the Bethe free energy function. However, the presence of the plus sign means that a more 8

0.15 1 careful look at HB(γ) is required to determine if it is concave 0.1 or not. 0.8

0.05 0.6 Assumption 16 For the rest of this section we assume that ) ξ 2

( 0 ξ n > 2 and that θ is a positive matrix of size n n. This s × 0.4 simplifies the wording of most results without hurting their −0.05

0.2 generality too much. In practice, two possible ways to deal −0.1 with the issue of zero entries in θ are the following. 0 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 • One can change the matrix θ so that zero entries become ξ ξ1 tiny positive entries. • One can redefine N(θ) by removing the edge e = (i, j), Fig. 2. Left: plot of the function s, see Definition 17. Right: contour plot of the function (ξ1, ξ2) 7→ S(ξ1, ξ2, 1−ξ1 −ξ2), see Definition 19. along with redefining the local functions gi and gj , if θi,j =0 

Fig. 2 (right) shows the function S for n = 3. More B. Concavity of the Bethe Entropy Function precisely, that plot shows the contour plot of the function and Convexity of the Bethe Free Energy Function (ξ , ξ ) S(ξ , ξ , 1 ξ ξ ). 1 2 7→ 1 2 − 1 − 2 Towards showing that HB(γ) is a concave function of γ, Clearly, if the domain of the function S were the set [0, 1]n, and subsequently that FB(γ) is a convex function of γ, we then S would not be concave everywhere because s is not first study two useful functions. Namely, in Definition 17 and concave everywhere. Therefore, the observation that is made in Lemma 18 we look at a function called s, and in Definition 19 the following theorem, namely that S is concave, is non-trivial. and Theorem 20 we look at a function called S. Note that in (Note that because the function s is concave in [0, 1/2], the this section we use the short-hands and ∗ for n ℓ ℓ6=ℓ ℓ∈[n] function S is concave in Π[n] [0, 1/2] . Therefore, as we will and ∗ , respectively. ∩ ℓ∈[n]: ℓ6=ℓ P P P see, most of the work in the proof of the following theorem P will be devoted to proving the concavity of the function S in Definition 17 Let s be the function Π [0, 1/2]n.) [n] \ s : [0, 1] R, ξ ξ log(ξ)+(1 ξ) log(1 ξ). → 7→ − − − Note that in contrast to the binary entropy function, there is a plus sign (not a minus sign) in front of the second term.  Theorem 20 The function S from Definition 19 is concave and satisfies S(ξ) > 0 for all ξ Π . Moreover, ∈ [n] Lemma 18 The function s that is specified in Definition 17 • For n =2, it holds that S(ξ)=0 for all ξ Π . ∈ [n] has the following properties. • For n > 3, the function S is at almost all points in • As can be seen from Fig. 2 (left), the graph of the function its domain a strictly concave function. However there s is s-shaped. are points in its domain and corresponding directions • The first-order derivative of s is in which the function S is linear. d s(ξ)= 2 log ξ(1 ξ) . dξ − − − Proof: See Appendix A.  • The second-order derivative of sis  After the original submission of the present paper, an d2 1 1 1 2ξ alternative proof of the concavity of the function S has been s(ξ)= + = − . dξ2 − ξ 1 ξ −ξ(1 ξ) given by Gurvits, see [15, Section 5.1]. − − Clearly, the function s(ξ) is strictly concave in the Interestingly, the functions s and S have recently appeared interval 0 6 ξ < 1/2 and strictly convex in the interval also in another context [41]. (We refer to [41] for details.) In 1/2 < ξ 6 1. particular, that paper gives a direct proof of S(ξ) > 0 for all • The graph of s has a point-symmetry at (1/2, 0). ξ Π[n]; this is in contrast to the proof of that statement in Theorem∈ 20 which was mainly based on the concavity of S. Proof: The proof of this lemma is based on straightforward calculus and is therefore omitted.  Lemma 21 The Bethe entropy function can be expressed in Definition 19 Let S be the function terms of the function S as follows S : Π R, ξ s(ξ )= ξ log(ξ ) [n] → 7→ ℓ − ℓ ℓ ℓ ℓ H : Γ R X X B n×n → + (1 ξℓ) log(1 ξℓ). 1 1 − − γ S(γ )+ S(γ ). ℓ 7→ 2 i 2 j X  i j X X 9

Proof: This result follows from i.e., the function S ξ(τ) can very well be approximated by a linear function for 0 < τ 1. Note that the coefficient of H (γ)  B τ in (4) is non-negative. ≪ (a) = γ log(γ )+ (1 γ ) log(1 γ ) − i,j i,j − i,j − i,j Proof: See Appendix B.  i,j i,j X X 1 = γ log(γ )+ (1 γ ) log(1 γ ) + A word of caution: the behavior of the function S is 2 − i,j i,j − i,j − i,j  i j j somewhat special around a vertex ξ of Π[n]: namely, in general X X X ˆ   there is no gradient vector G such that S(ξ + τ ξ) = 1 ˆ 2 ˆ · 2 γ log(γ )+ (1 γ ) log(1 γ ) S(ξ) + τ ℓ Gℓξℓ + O(τ ) = τ ℓ Gℓξℓ + O(τ ) for 2 − i,j i,j − i,j − i,j · · ˆ j i i ! 0 < τ 1 and for all possible direction vectors ξ. X X X ≪ P P (b) 1 1 Lemma 24 has the following consequences for the behavior = S(γ )+ S(γ ), 2 i 2 j of the Bethe entropy function at a vertex of its domain. i j X X where at step (a) we have used Corollary 15 and where at Lemma 25 Let step (b) we have used Definition 19.  γ(τ) , γ + τ γˆ, · Theorem 22 The Bethe entropy function HB(γ) is a concave where γ is a vertex of Γ and where γˆ = 0 is such ∈ C n×n 6 function of γ Γn×n. Moreover, for all γ Γn×n it holds that γ(τ) Γ for small non-negative τ. This means that γ ∈ ∈ ∈ n×n that HB(γ) > 0. corresponds to the permutation σγ . (In the following statement we will use the short-hands σ , σ and σ¯ , σ−1.) Then, for Proof: Lemma 21 showed that H (γ) can be written as a sum γ γ B 0 < τ 1, we have of S-functions. The concavity of HB(γ) then follows from ≪ Theorem 20 and the fact that the sum of concave functions HB γ(τ) is a concave function. Similarly, the non-negativity of HB(γ)  γˆ γˆ follows from Theorem 20 and the fact that the sum of non- = τ γˆ | i,j | log | i,j | + O(τ 2)  | i,σ(i)| ·− γˆ γˆ  negative functions is a non-negative function. i i,σ(i) i,σ(i) X j6=Xσ(i) | | | |   γˆ γˆ Corollary 23 The Bethe free energy function FB(γ) is a = τ γˆ | i,j | log | i,j | + O(τ 2), | σ¯(j),j | ·− γˆ γˆ  convex function of γ Γn×n. j σ¯(j),j σ¯(j),j ∈ X i6=¯Xσ(j) | | | | Proof: This follows from F (γ) = U (γ) H (γ) (see   B B − B i.e., the function HB γ(τ) can very well be approximated by Corollary 15), from the fact that UB(γ) is a linear function a linear function for 0 < τ 1. Note that the coefficient of  ≪ of γ (see Corollary 15), and from the fact that HB(γ) is a τ is non-negative. concave function of γ (see Theorem 22).  Proof: See Appendix C.  C. Behavior of the Bethe Entropy Function and the Bethe Free ˆ Energy Function at a Vertex of their Domain Assume that γ in Lemma 25 is chosen such that ˆ i γˆi,σ(i) = 1. (If this is not the case, then γ can be In this section we study the Bethe entropy function and rescaled| by| a positive real number such that this condition the Bethe free energy function near a vertex of their domain. Pis satisfied.) The coefficient of τ in the first display equation Because both functions can be expressed in terms of the of Lemma 25 can be given the following meaning. It is the function S, we first study the behavior of S near a vertex entropy rate of the time-invariant Markov chain corresponding of its domain. to the (backtrackless) random walk on the NFG N(θ) (see Fig. 1) with the following properties:8 Lemma 24 Let • The probability of being at vertex i is γˆ . ˆ ∈ I | i,σ(i)| ξ(τ) , ξ + τ ξ, • The probability of going to vertex j σ(i) , · ∈ J\{ } conditioned on being at vertex i , is γˆi,j / γˆi,σ(i) . where the vector ξ Π is a vertex of Π and where ξˆ = 0 ∈ I | | | | ∈ [n] [n] 6 The probability of going to vertex σ(i) , conditioned is such that ξ(τ) Π[n] for small non-negative τ. This means on being at vertex i , is 0. ∈ J ∈∗ that there is an ℓ [n] such that ξ satisfies ξ ∗ = 1 and ∈ I ℓ • The probability of being at vertex j is γˆ . ∗ ∈ ˆ σ¯(j),j ξ = 0, ℓ = ℓ , and such that ξ satisfies ξˆ ∗ < 0, ξˆ > 0, ∈ J | | ℓ ℓ ℓ • The probability of going to vertex σ¯(j) , conditioned ℓ = ℓ∗, and6 ξˆ =0. Then, for 0 < τ 1, we have ∈ I 6 ℓ ℓ ≪ on being at vertex j , is 1. The probability of going∈ J to vertex i′ σ¯(j) , P ˆ ˆ ∈ I\{ } ξℓ ξℓ 2 conditioned on being at vertex , is . S ξ(τ) = τ ξˆℓ∗ | | log | | + O(τ ), j 0 · | | · − ˆ ∗ ˆ ∗  ∈ J ℓ6=ℓ′ ξℓ ξℓ !  X | | | | 8For a discussion of the entropy rate of a time-invariant Markov chain, see,   (4) e.g., [42, Section 4.2]. 10

The above two half-steps of the random walk can be combined Corollary 27 Consider a vertex γ of Γn×n and define ρ for into one step: γ as in Theorem 26.

• The probability of being at vertex i is γˆi,σ(i) . • If ρ< 1 then FB has its unique minimum at γ. ′ ′ ∈ I | | • For i,i with i = i , the probability of going to • If ρ> 1 then FB is not minimal at γ. vertex σ(∈i′) Iand then6 to vertex i′, conditioned on being Proof: Consider the setup of Theorem 26. From that theorem at vertex i, is γˆ ′ / γˆ . | i,σ(i )| | i,σ(i)| we know that An analogous interpretation can be given to the coefficient of τ in the second display equation of Lemma 25. Observe that F γ(τ) > log(θ ) τ log(ρ)+ O(τ 2), B − i,σ(i) − · the condition γˆ = 1 is equivalent to the condition i i | i,σ(i)|  X γˆσ¯(j),j =1. with equality for the direction matrix γˆ that was specified j | | P Note that similar random walks appeared in the analysis of there. Moreover, from Corollary 23 we know that F is convex P B the Bethe entropy function for so-called cycle codes (cf. [43]) over Γn×n. Therefore, if log(ρ) < 0 (i.e., ρ< 1) then FB has and in the analysis of linear programming decoding of low- a unique minimum at γ. On the other hand, if log(ρ) > 0 density parity-check codes (cf. [44], which gives a ran- (i.e., ρ> 1) then FB cannot be minimal at γ. dom walk interpretation of a result by Arora, Daskalakis, Note that for log(ρ)=0 (i.e., ρ = 1), the minimality / 2 Steurer [45] and its extensions by Halabi and Even [46]). non-minimality of FB at γ is determined by the O(τ ) term. Actually, given the fact that the symmetric difference of two  perfect matchings corresponds to a union of cycles in N(θ), Typically, the Bethe entropy function and the Bethe free the similarity of the random walks here and of the random energy function have a positive or negative infinite directional walks in the above-mentioned context of cycle codes is not derivative away from a vertex of their domain because of the totally surprising. appearance of terms like c τ log(τ). However, because for the We come now to the main result of this subsection. Al- function S all these c τ log(· τ·) terms cancel in the vicinity of a though this result is interesting in its own right, it will be vertex of its domain (see· · the proof of Theorem 20, in particular especially important for the convergence analysis of the SPA Eq. (19) in Appendix A-B), the directional derivatives of the in Section V. Bethe entropy function and the Bethe free energy function are finite away from a vertex of their domain. Theorem 26 Let Let us conclude this section by pointing out that the obser- γ(τ) , γ + τ γˆ, vations that were made in this subsection give an alternative · viewpoint of some of the results that were presented in [19, where γ is a vertex of Γn×n and where γˆ = 0 is such Section 3]. that γ(τ)∈ CΓ for small non-negative τ. This means6 that γ ∈ n×n corresponds to the permutation σγ . (In the following statement V. SUM-PRODUCT-ALGORITHM-BASED SEARCHOFTHE −1 we will use the short-hands σ , σγ and σ¯ , σγ .) We also MINIMUMOFTHE BETHE FREE ENERGY FUNCTION assume that γˆ is normalized as follows Assumption 28 In this section we make the following two γˆ = γˆ =1. (5) | i,σ(i)| | σ¯(j),j | assumptions, both with the goal of simplifying the wording of i j X X most results without hurting their generality too much.9 Then, for 0 < τ 1, we have • We assume that n > 3 and that θ is a positive matrix of ≪ 2 size n n. FB γ(τ) > log(θi,σ(i)) τ log(ρ)+ O(τ ), (6) × − − · • We assume that the minimum of the Bethe free energy i  X function FB is either in the interior of Γn×n or at a where ρ is the maximal (real) eigenvalue of the n n matrix vertex of Γn×n, but not at a non-vertex boundary point A with entries × of Γn×n. A possibility to guarantee this with probability 1 θ ′ i,σ(i ) (if i = i′) is to apply tiny random perturbations to the entries of θ. θi,σ i Ai,i′ , ( ) 6 .  (0 (otherwise) Note that equality holds in (6) for the matrix γˆ with entries In Definition 12 we have defined the Bethe permanent of a square matrix θ via the minimum of the Bethe free energy L R u ·A ′ ·u ′ +κ i i,i i (if i = i′) γˆ ′ , ρ , 9The purpose of these assumptions is, in particular, to avoid dealing with i,σ(i ) · L R 6 ( κ u u (otherwise) matrices θ which have the following property. Namely, consider the subgraph − · i · i induced by the edge subset (i,j) ∈ E θi,j > 0}. Assume that one of where uL and uR are, respectively, the left and right eigen- the connected components of this subgraph is a cycle (necessarily of even vectors of A with eigenvalue ρ, and where κ is a suitable length), and consider the partition of the edge set of this cycle into two sets E′ and E′′ such that the edges of this cycle are alternatingly placed into E′ normalization constant such that (5) is satisfied. ′′ and E , respectively. If Q(i,j)∈E′ θi,j = Q(i,j)∈E′′ θi,j holds, then the SPA exhibits a periodic behavior unless the initial messages correspond to Proof: See Appendix D.  SPA fixed point messages. A matrix having this property is, e.g., the matrix 1 1 θ = 1 1 . Here, the relevant cycle (1, 1) − (1, 2) − (2, 2) − (2, 1) − (1, 1) has length four and one verifies that θ1,1 · θ2,2 = θ1,2 · θ2,1. 11 function of the NFG N(θ). In Corollary 23 we have seen that turns out to be sufficient to keep track of the likelihood ratios the Bethe free energy function is a convex function, i.e., it (t) (t) (t) −→µi,j (0) (t) ←−µi,j (0) behaves very favorably. This means that we could use any −→Λ , , ←−Λ , , generic optimization algorithm (see, e.g., [36], [47]) to find i,j (t) i,j (t) −→µi,j (1) ←−µi,j (1) the minimum of the Bethe free energy function, and with that respectively. Actually, for the NFG under consideration it is the Bethe permanent of θ. However, given the special structure more convenient to deal with the inverses of these quantities, of the optimization problem, there is the hope that there are and so we define the inverse likelihood ratios as follows more efficient approaches. −1 −1 A natural candidate for searching this minimum is the (t) (t) (t) (t) −→Vi,j , −→Λi,j , ←−Vi,j , ←−Λi,j . SPA [38]–[40]. The reason for this is that a theorem by     Yedidia, Freeman, and Weiss [13] says that fixed points of Lemma 29 Consider the NFG N(θ). The inverse likelihood the SPA correspond to stationary points of the Bethe free 10 ratio update rules for the left-hand side and right-hand side energy function. Given the convexity of the Bethe free function nodes of N(θ) are given by, respectively, energy function, the following two questions must therefore be answered: θi,j (t) > −→Vi,j = (t−1) , t 1, (i, j) , • If the minimum of F is in the interior of Γ , does ′ θ ′ ←−V ′ ∈I×J B n×n j 6=j pi,j · i,j the SPA always converge to a fixed point? θ (t) P p i,j > • If the minimum of FB is at a vertex of Γn×n, does the ←−Vi,j = (t) , t 1, (i, j) . ′ θ ′ −→V ′ ∈I×J SPA find that vertex? i 6=i p i ,j · i ,j In this section we answer both questions affirmatively, inde- The beliefsP at thep left-hand side and right-hand side function pendently of the matrix θ, and (nearly) independently of the nodes of N(θ) are given by, respectively, chosen initial messages. (t) (t) > βi,ai θi,j ←−Vi,j , t 0, (i, j) , The rest of this section is structured as follows. First ai=uj ∝ · ∈I×J we discuss the details of the SPA message update rules in (t) p (t) β θ −→V , t > 1, (i, j) . j,aj i,j i,j Section V-A. Afterwards, we state the SPA convergence result aj =ui ∝ · ∈I×J in Section V-B. Here the proportionalityp constants are defined such that for

every function node the beliefs sum to 1. At a fixed point of the SPA, the beliefs satisfy the edge consistency constraints, A. Sum-Product Algorithm Message Update Rules i.e., for every e = (i, j) and every ae e, it holds that (t) ∈E (∈t) A In this subsection we derive the SPA message update rules ′ ′ β ′ = ′ ′ β ′ . a ∈Ai: a =ae i,a a ∈Aj : a =ae j,a for the NFG N(θ) in Fig. 1. Here we only give the technical i i,j i j i,j j details; for a general discussion w.r.t. the motivations behind PProof: See Appendix E.P  the SPA we refer to [38]–[40]. Note that analogous SPA Let us remark on the side that the above update equations message update rules were already stated in [12], [20]. (In can be reformulated such that we only multiply by factors like contrast to [12], we use an undampened version of the SPA.) θi,j instead of by factors like θi,j . We leave the details to On a high level, the SPA works as follows. With every the reader. edge in Fig. 1 we associate a right-going message and a left- p going message. Every iteration of the SPA consists then of Remark 30 The SPA messages for the NFG N(θ) exhibit the two half-iterations, in the first half-iteration the right-going following property, a property that we will henceforth call messages are updated based on the left-going messages and “message gauge invariance.” Namely, consider the messages in the second half-iteration the left-going messages are updated (t) (t) based on the right-going messages. Finally, once some suitable ←−Vi,j and −→Vi,j i,j,t i,j,t convergence criterion is met or a fixed number of iterations n o n o has been reached, the pseudo-marginal vector (belief vector) that are connected by the update equations in Lemma 29. It is then easy to show that for any C R the messages is computed based on the messages at the last iteration. ∈ >0 Mathematically, we define for every t > 0 and every edge (t) 1 (t) (t) C ←−Vi,j and −→Vi,j (i, j) a left-going message µ : R, and · i,j,t C · ←−i,j i,j  i,j,t for every∈I×Jt > 1 and every edge (i, j) A a right-going→ n o (t) R ∈I×J also satisfy the update equations in Lemma 29. Moreover, the message −→µi,j : i,j . (t) (t) A → beliefs β a and β (aj ) are left unchanged For every left-going and for every right-going message it i, i i,ai,t j j,aj ,t by this rescaling of the inverse likelihood ratios. This is   10 because the normalization that appears in the definition of Strictly speaking, for NFGs with hard constraints, i.e., NFGs that contain (t) (t) β and β (aj ) removes the influence of this local functions that can assume the value zero for certain points in their i,ai i,ai,t j j,aj ,t domain (which is the case for N(θ)), this statement has only been proven message rescaling.  for interior stationary points of the Bethe free energy function (see [13,   Theorem 2]). For SPA fixed points with some beliefs equal to zero it is only conjectured that they correspond to edge-stationary points of the Bethe free Strictly speaking, the Bethe free energy function can only energy function (cf. discussion in [13, Section VI.D]). be evaluated at fixed points of the SPA. However, very often it 12 is desirable to track the progress towards the minimum Bethe of the underlying graph but also on the values of the non- free energy function value. This can be done via the so-called zero entries of the information matrix describing the Gaussian pseudo-dual function of the Bethe free energy function [48], graphical model. (Of course, the convergence speed of the SPA [49]. This function has the following two properties: it can on N(θ) does depend on the choice of θ.) be evaluated at any point during the SPA computations, and at a fixed point of the SPA its value equals the value of the Theorem 32 Consider the SPA for NFG N(θ), for which the Bethe free energy function. However, in general it is not a message update rules were established in Lemma 29. For non-increasing or a non-decreasing function of the iteration any initial set of inverse likelihood ratios (0) that ←−Vi,j i,j number. (0) satisfies 0 < ←−Vi,j < , (i, j) , the pseudo- marginals computed by∞ the SPA converge∈ I×J to the pseudo- N Lemma 31 Consider the NFG (θ). For any set of left- marginals that minimize the Bethe free energy function of going messages and any set of right-going messages ←−Vi,j i,j N(θ). More precisely, we can make the following statements. , the pseudo-dual function of the Bethe free energy (We remind the reader of the assumptions that were made in −→Vi,j i,j  function is Assumption 28.)  • If the minimum of FB is in the interior of Γn×n, then the # inverse likelihood ratios F ←−V , −→V = log θ ←−V Bethe i,j i,j −  i,j · i,j  i j (t) (t)   ←−Vi,j and −→Vi,j   X X p i,j,t t→∞ i,j,t t→∞   log θ −→V stay bounded and converge (modulo the message gauge − i,j · i,j j i ! invariance mentioned in Remark 30) to the fixed point X X p inverse likelihood ratios corresponding to the minimum + log 1+ ←−Vi,j −→Vi,j · of FB. i,j X   • If the minimum of FB is at the vertex γ of Γn×n, then Proof: See Appendix F.  the inverse likelihood ratios satisfy # In particular, if desired, we can evaluate FBethe af- (t) t→∞ (t) t→∞ ←−Vi,j , −→Vi,j , ter every half-iteration of the SPA, i.e., we can compute j=σγ (i) −−−→ ∞ j=σγ (i) −−−→ ∞ # (t−1) (t) # (t) (t) t→∞ t→∞ FBethe ←−Vi,j , −→Vi,j and FBethe ←−Vi,j , −→Vi,j for (t) (t) ←−Vi,j 0, −→Vi,j 0. every t > 1. j6=σγ (i) −−−→ j6=σγ (i) −−−→      

Finally, B. Convergence of the Sum-Product Algorithm exp F # ←−V (t) , −→V (t) perm (θ) 6 C e−ν·t Note that there are rather few general results concerning the − Bethe i,j i,j − B ·   behavior of message-passing type algorithms for NFGs with   for some constants C,ν R that depend on the matrix θ cycles. For certain classes of graphical models and message- >0 and the initial messages.∈ passing type algorithms, early results showed that under the assumption that the algorithm converges then the obtained Proof: See Appendix G.  estimates are correct (see, e.g., the results in [50], [51]). Later, Explicit convergence speed estimates (in particular, values conditions for convergence were established for a variety of for C and ν) can be extracted from the proof of Theorem 32. graphical models and message-passing type algorithms (see, However, we think that a more sophisticated analysis might e.g., [52]–[55] and references therein). However, these results yield tighter convergence speed estimates; we leave this as an do not seem to be applicable to the NFG under consideration open problem for future research. in this paper. The SPA convergence proof that is the most relevant for the present paper is the one in the paper by Bayati and Nair [23] VI. FINITE-GRAPH-COVER INTERPRETATION (see also the comments that we made about this paper in OFTHE BETHE PERMANENT Section I-C). However, the fact that the graphical model in [23] Note that the definition of the permanent of θ in Definition 1 counts matchings (and not only perfect matchings like here), has a “combinatorial flavor.” In particular, it can be seen as implies a different behavior of the Bethe free energy function a sum over all weighted perfect matchings of a complete near the boundary of its domain, and so no separate analysis bipartite graph. This is in contrast to the definition of the of interior and boundary minima of the Bethe free energy is Bethe permanent of θ (see Definitions 11 and 12) that has an required in the convergence proof in [23]. The SPA conver- “analytical flavor.” In this section we show that it is possible gence analysis for a slightly generalized weighted matching to represent the Bethe permanent by an expression that has setup was recently presented by Williams and Lau [26]. a “combinatorial flavor.” We do this by applying the results Note that, interestingly enough, establishing convergencefor from [16], that hold for general NFGs, to the NFG N(θ). The the SPA on N(θ) is independent of the choice of θ, which key concept in that respect are so-called finite graph covers. is in contrast to, say, Gaussian graphical models where the (We keep the discussion here somewhat brief and we refer convergence behavior not only depends on the connectivity to [16] for all the details. See also [56].) 13

(1, 1) (1, 1) (1, 1) (1, 1) Besides being of interest in its own right, we think that the (1, 2) (1, 2) (1, 2) (1, 2) 1 1 combinatorial interpretation of the Bethe permanent discussed (1, 3) (1, 3) (1, 3) (1, 3) , , , , in this section can lead to alternative proofs of known results (1 4) (1 4) (1 4) (1 4) or to proofs of new results for the Bethe permanent. See, e.g.,

Appendix I that gives an alternative proof of a special case of (2, 1) (2, 1) (2, 1) (2, 1) (2, 2) (2, 2) (2, 2) (2, 2) Theorem 49 in the next section. 2 2 (2, 3) (2, 3) (2, 3) (2, 3) This section is structured as follows. In Section VI-A we (2, 4) (2, 4) (2, 4) (2, 4) define the degree-M Bethe permanent of a non-negative square matrix with the help of finite graph covers and show that in the limit M the degree-M Bethe permanent converges to (3, 1) (3, 1) (3, 1) (3, 1) (3, 2) (3, 2) (3, 2) (3, 2) → ∞ 3 3 the Bethe permanent. Towards obtaining a better understanding (3, 3) (3, 3) (3, 3) (3, 3) of the degree-M Bethe permanent, we then study various (a) (3, 4) (3, 1) (3, 4) (3, 1) examples of 2 2 matrices in Sections VI-B–VI-E. Because (b) (c) the Bethe permanent× can be computed with the help of the SPA, and because the SPA is a locally operating algorithm on Fig. 3. (a) NFG N(θ) for n = 3. (b) “Trivial” 4-cover of N(θ) (c) A the relevant NFG, it is not surprising that finite graph covers possible 4-cover of N(θ). play a central role in the above-mentioned combinatorial interpretation of the Bethe permanent; this aspect will be discussed in Section VI-F. Proof: This follows from [16, Lemma 20] and the fact that the NFG N(θ) has n2 full-edges. 

A. The Degree-M Bethe Permanent of a Non-Negative Matrix The following definition is the main definition of this Definition 33 (see, e.g., [57], [58]) A cover of a graph G section. with vertex set and edge set is a graph G with vertex set V E ˜ and edge set ˜, along with a surjection π : ˜ which Definition 36 For any Z we define the degree- V E V→V M >0 M is a graph homomorphism (i.e., π takes adjacent vertices of Bethe permanent of θ to be∈ G to adjacent vertices of G) such that for each vertex v −1 ∈ V and each v˜ π (v), the neighborhood ∂(˜v) of v˜ is mapped M N˜ permB,M (θ) , ZG( ) , ∈ N˜∈N˜ bijectively to ∂(v). A cover is called an M-cover, where r M Z −1 11 D E M >0, if π (v) = M for every vertex v in .  where the angular brackets represent the arithmetic average ∈ V N˜ N˜ ˜ of ZG( ) over all M . (Note that the right-hand side is Because NFGs are graphs, it is straightforward to extend based on the Gibbs partition∈ N function, not the Bethe partition this definition to NFGs. (Of course, the variables that are function.)  associated with the M copies of an edge are allowed to take on different values.) For an M-cover, the left-hand side function As we will now show, one can express ZG(N˜) for any M- nodes will be labeled by elements of [M], the right-hand cover N˜ of N(θ) as the permanent of some matrix that is I× side function nodes will be labeled by elements of [M], derived from θ. and the edges will be labeled by elements of a cover-dependenJ × t subset of [M] [M]. We will denote the set of all Definition 37 For any Z we define ˜ to be the set I × ×J × M >0 ΨM M-covers N˜ of N(θ) by ˜M (θ). (Note that we distinguish ∈ two M-covers with differentN function node labels, even if the Ψ˜ , P˜ = P˜(i,j) P˜(i,j) . M i∈I,j∈J ∈PM×M underlying graphs are isomorphic; see also the comments on n o ˜ ˜ ˜ labeled graph covers after [16, Definition 19].) Moreover, for P ΨM we define the P -lifting of θ to be the following (nM) ∈ (nM) matrix × Example 34 Let n =3. The NFG N(θ) is shown in Fig. 3(a). (1,1) (1,n) θ1,1P˜ θ1,nP˜ There is only one 1-cover of N(θ), namely N(θ) itself. Two ··· θ↑P˜ , . . possible 4-covers of N(θ) are shown in Figs. 3(b)–(c). The  . .  . (n,1) (n,n) 4-cover in Fig. 3(b) is “trivial” in the sense that it consists of θn,1P˜ θn,nP˜ N  ···  4 disconnected copies of (θ). On the other hand, the 4-cover    in Fig. 3(c) is “nontrivial” in the sense that it consists of 4 copies of N θ that are intertwined.  ( ) For any positive integer M it is straightforward to see that there is a bijection between the set ˜M (θ) of all M-covers ↑P˜ N Lemma 35 It holds that of N(θ) and the set θ ˜ ˜ . In particular, because of { }P ∈ΨM 2 Lemma 8, for an -cover N˜ and its corresponding matrix ˜ (n ) M M (θ) = (M!) . (7) ↑P˜ N˜ ↑P˜ N θ it holds that ZG( ) = perm(θ ). Therefore, we have the following reformulation of Definition 36. 11The number M is also known as the degree of the cover. (Not to be confused with the degree of a vertex.) 14

Definition 38 (Reformulation of Definition 36) For any Z permB,M (θ) = permB(θ) M >0 we define the degree-M Bethe permanent of θ to M→∞ ∈ be

perm (θ) , M perm θ↑P˜ , (8) B,M ˜ ˜ perm (θ) P ∈ΨM B ,M rD  E where the angular brackets represent the arithmetic average of perm θ↑P˜ over all P˜ Ψ˜ . (Note that the permanent, ∈ M not the Bethe permanent, appears on the right-hand side of permB ,M (θ) = perm(θ)  M=1 the above expression.) 

Fig. 4. The degree-M Bethe permanent of the non-negative matrix θ for In order to better appreciate the right-hand side of the different values of M. above expression, it is worthwhile to make the following two observations. Lemma 40 For n =2 it holds that • For M =1, the averaging is trivial because Ψ˜ M contains only one element. Moreover, letting P˜ be this single perm(θ)= θ1,1θ2,2 + θ2,1θ1,2, element, it holds that θ↑P˜ θ. Therefore = permB(θ) = max(θ1,1θ2,2, θ2,1θ1,2).

permB,1(θ) = perm(θ). Proof: The result for perm(θ) follows from Definition 1. On the other hand, in order to obtain perm (θ), we apply • For any M Z>0, the “trivial” M-cover of N(θ) is given B by the choice∈ P˜ = P˜ (i,j) with P˜(i,j) = I˜, Corollary 15. The crucial step in Corollary 15 is to minimize i∈I,j∈J F (γ) over γ Γ . Because H (γ)=0, γ Γ , (i, j) , where I˜ is the identity matrix of size B 2×2 B 2×2  minimizing F (∈γ) is equivalent to minimizing U∈(γ) = M M∈I×J. For this M-cover we obtain B B × i,j γi,j log(θi,j ). ↑P˜ M − perm(θ ) = perm(θ) , • For θ θ = θ θ the minimum is achieved at every P 1,1 2,2 1,2 2,1 γ Γ . i.e. ∈ 2×2 • For θ1,1θ2,2 >θ1,2θ2,1 the minimum is achieved at γ = M ˜ 1 0 perm(θ↑P ) = perm(θ). 0 1 . • For θ1,1θ2,2 <θ1,2θ2,1 the minimum is achieved at γ = q 0 1  1 0 . With this, we are ready for the main result of this section.   Theorem 39 It holds that Example 41 For n =2 and θ =1, (i, j) , we have i,j ∈I×J lim sup permB,M (θ) = permB(θ). M→∞ perm(θ)=2,

Proof: This follows from Definitions 12 and 38, along with permB(θ)=1. the application of [16, Theorem 33] to N = N(θ).  Recall that perm(θ) represents the sum of all the weighted Theorem 39, together with the relation permB,1(θ) = perfect matchings of the complete bipartite graph N(θ), and perm(θ), are visualized in Fig. 4. Because the permanents so, for the special choice θi,j = 1, (i, j) , the that appear on the right-hand side of (8) are combinatorial quantity perm(θ) represents the number of perfect∈ I×J matchings objects, Definition 38 and Theorem 39 give the promised of N(θ). As is illustrated in Figs. 5(b)–(c), the graph N(θ) “combinatorial characterization” of the Bethe permanent. has two perfect matchings, thereby combinatorially verifying perm(θ)=2.  B. The Bethe Permanent for Matrices of Size 2 2 × In this and the following subsections we illustrate the C. The Degree-M Bethe Permanent for Matrices of Size 2 2 concepts and results that have been presented so far in this — Initial Considerations × section by having a detailed look at the case n = 2, i.e., we One of the goals of this and the next subsections is to obtain study the permanent, the Bethe permanent, and the degree-M a better combinatorial understanding of the result perm (θ)= Bethe permanent for the matrix B 1 for n = 2, in particular, why it is different from perm(θ), θ θ yet not too different. θ = 1,1 1,2 . θ2,1 θ2,2 Towards this goal, let us study the degree-M Bethe perma-   nent of θ as specified in Definition 38. Therein, the average The corresponding NFG N(θ) is shown in Fig. 5(a). Of course, ˜ 4 nobody would use the Bethe permanent to approximate the is taken over ΨM = (M!) matrices permanent of a 2 2 matrix, however, it gives some good × ˜ θ P˜ θ P˜ ˜ insights into the strengths and the weaknesses of the Bethe θ↑P = 1 ,1 1,1 1,2 1,2 , θ↑P Ψ˜ . θ P˜ θ P˜ ∈ M approximation to the permanent.  2,1 2,1 2,2 2,2 15

1′ 1′ 1′ 1′ 1 1 1′′ 1′′ 1′′ 1′′

2′ 2′ 2′ 2′ 2 2 2′′ 2′′ 2′′ 2′′ (a) (d) (i)

1 1 1 1 1′ 1′ 1′ 1′ 1′ 1′ 1′ 1′ 1′ 1′ 1′ 1′ 1′′ 1′′ 1′′ 1′′ 1′′ 1′′ 1′′ 1′′ 1′′ 1′′ 1′′ 1′′

2 2 2 2 2′ 2′ 2′ 2′ 2′ 2′ 2′ 2′ 2′ 2′ 2′ 2′ 2′′ 2′′ 2′′ 2′′ 2′′ 2′′ 2′′ 2′′ 2′′ 2′′ 2′′ 2′′ (b) (c) (e) (f) (g) (h) (j) (k)

Fig. 5. Graphs (NFGs) that are discussed in Sections VI-C–VI-E. (a) Base graph. (b)–(c) Perfect matchings of the graph in (a). (d) A possible double cover of the graph in (a). (e)–(h) Perfect matchings of the graph in (d). (i) A possible double cover of the graph in (a). (j)–(k) Perfect matchings of the graph in (i).

↑(2) We can simplify the analysis by realizing that the permanent • The matrix θ corresponds to the double cover of N(θ) of θ↑P˜ equals the permanent of a modified matrix θ↑P˜ , where shown in Fig. 5(i). Because that graph has 2 perfect ˜−1 the first block row is multiplied from the left by P1,1 , where matchings, see Figs. 5(j)–(k), we have the second block row is multiplied from the left by P˜ −1, and 2,1 perm(θ↑(1))=2. where the second block column is multiplied from the right ˜−1 ˜ by P1,2 P1,1, i.e., Putting everything together, we obtain the degree-2 Bethe · permanent of θ, i.e., ˜ ˜ ↑P˜ θ1,1I θ1,2I perm θ = perm ˜ −1 −1 , 2 1 2 1 2 θ2,1I θ2,2P˜ P˜2,2P˜ P˜1,1 θ √  2,1 1,2  permB,2( )= (4+2) = 6= 3 1.732.  2! · 2! · ≈ ˜ r r where I is the identity matrix of size M M. Therefore, we We note that the graph in Fig. 5(d) consists of M independent × can rewrite permB,M (θ) as follows copies of the graph in Fig. 5(a), therefore it is not surprising that perm(θ↑(1)) = perm(θ)M =22 =4. On the other hand, θ I˜ θ I˜ the graph in Fig. 5(i) consists of M coupled copies of the perm (θ) , M perm 1,1 2,1 , B,M ˜ ′ ′ θ I θ P˜ P˜ ∈PM×M graph in Fig. 5(a), which implies that we cannot choose the s  2,1 2,2 2,2 2,2 D E (9) perfect matchings independently. Therefore, it is not surprising that we have perm(θ↑(2)) = perm(θ)M = 22 = 4, which i.e., an average over the M! permutation matrices of size M finally results in perm (θ)6 = perm(θ). Nevertheless, these × B,2 6 M. considerations also show why permB,2(θ) is not too different from perm(θ).  D. The Degree-M Bethe Permanent for Matrices of Size 2 2 — All-One Matrix × Example 43 Let n =2, M =3, and θi,j =1, (i, j) . The average in (9) is over 3! = 6 matrices. These∈I×J matrices In this subsection we consider the cases M = 2, M = 3, correspond to the triple covers of N(θ) shown in Fig. 6(b)– and general M for the special choice (g). Computing the number of perfect matchings for each of 1 1 these cases, we obtain θ = . 1 1 3 1   permB,3(θ)= (8+4+4+4+2+2) r3! · Example 42 Let , , and , . n =2 M =2 θi,j =1 (i, j) 3 1 3 We make the following observations. ∈I×J = 24 = √4 1.587. r3! · ≈ • The average in (9) is over 2!=2 matrices, namely over In particular, for the triple cover in Fig. 6(c) we show its 4 1 0 1 0 1 0 1 0 perfect matchings explicitly in Fig. 7. 0 1 0 1 0 1 0 1 Overall, we can make similar observations as at the end θ↑(1) ,   , θ↑(2) ,   . of Example 42 concerning the coupling of the copies of 1 0 1 0 1 0 0 1 M N(θ) that make up a degree-M cover and its influence on the  0 1 0 1   0 1 1 0      number of perfect matchings.      ↑(1) • The matrix θ corresponds to the double cover of N(θ) shown in Fig. 5(d). Because that graph has 4 perfect Example 44 Let n = 2, M Z>0, and θi,j = 1, (i, j) matchings, see Figs. 5(e)–(h), we have . The average in (9) is over∈ M! matrices that correspond∈ toI ×J the -covers of N θ . For each of these matrices, their ↑(1) M ( ) perm(θ )=4. permanent equals the number of perfect matchings in the 16

1′ 1′ 1′ 1′ 1′ 1′ 1′ 1′ 1′ 1′ 1′ 1′ 1 1 1′′ 1′′ 1′′ 1′′ 1′′ 1′′ 1′′ 1′′ 1′′ 1′′ 1′′ 1′′ 1′′′ 1′′′ 1′′′ 1′′′ 1′′′ 1′′′ 1′′′ 1′′′ 1′′′ 1′′′ 1′′′ 1′′′

2′ 2′ 2′ 2′ 2′ 2′ 2′ 2′ 2′ 2′ 2′ 2′ 2 2 2′′ 2′′ 2′′ 2′′ 2′′ 2′′ 2′′ 2′′ 2′′ 2′′ 2′′ 2′′ 2′′′ 2′′′ 2′′′ 2′′′ 2′′′ 2′′′ 2′′′ 2′′′ 2′′′ 2′′′ 2′′′ 2′′′ (a) (b) 8 pms. (c) 4 pms. (d) 4 pms. (e) 4 pms. (f) 2 pms. (g) 2 pms.

Fig. 6. Graphs (NFGs) that are discussed in Sections VI-C–VI-E. (a) Base graph. (b)–(g) Possible triple covers of the graph in (a). (“pms.” stands for “perfect matchings”.)

1′ 1′ 1′ 1′ 1′ 1′ 1′ 1′ 1′′ 1′′ 1′′ 1′′ 1′′ 1′′ 1′′ 1′′ 1′′′ 1′′′ 1′′′ 1′′′ 1′′′ 1′′′ 1′′′ 1′′′

2′ 2′ 2′ 2′ 2′ 2′ 2′ 2′ 2′′ 2′′ 2′′ 2′′ 2′′ 2′′ 2′′ 2′′ 2′′′ 2′′′ 2′′′ 2′′′ 2′′′ 2′′′ 2′′′ 2′′′ (a) (b) (c) (d)

Fig. 7. The four perfect matchings of the triple cover in Fig. 6(c).

corresponding M-cover. We make the following observations • The average in (9) is over 2!=2 matrices, namely over (see Figs. 5–7 for illustrations for the cases M = 2 and M =3). θ1,1 0 θ1,2 0 ↑(1) 0 θ1,1 0 θ1,2 • Every M-cover consists of up to M cycles. θ ,   , θ2,1 0 θ2,2 0 • Every cycle supports two perfect matchings (indepen-  0 θ 0 θ  dently of the cycle length and independently of the perfect  2,1 2,2    matchings chosen on the rest of the graph). θ1,1 0 θ1,2 0 0 θ1,1 0 θ1,2 Therefore, if an M-cover has c cycles then it has 2c perfect θ↑(2) ,   . θ 0 0 θ matchings. The average in (9) can then be evaluated with 2,1 2,2  0 θ θ 0  suitable combinatorial tools, for example by using the so-  2,1 2,2    called cycle index of the symmetric group over M elements • We obtain (see, e.g., [59]), and we obtain ↑(1) 2 perm θ = (θ1,1θ2,2 + θ1,2θ2,1) M permB,M (θ)= √M +1. 2 2 2 2  = θ1,1θ2,2 +2θ1,1θ1,2θ2,1θ2,2 + θ1,2θ2,1. Therefore, in the limit M , we get ↑(1) → ∞ Note that the coefficients add up to 4 because θ cor- responds to the double cover of N(θ) shown in Fig. 5(d), permB(θ) = lim sup permB,M (θ)=1. M→∞ which admits 4 (weighted) perfect matchings. • We obtain This confirms the result for permB(θ) in Example 41, which was obtained by analytical means.  ↑(2) 2 2 2 2 perm θ = θ1,1θ2,2 + θ1,2θ2,1. Note that the coefficients  add up to 2 because θ↑(2) cor- E. The Degree-M Bethe Permanent for Matrices of Size 2 2 N × responds to the double cover of (θ) shown in Fig. 5(i), — General Non-Negative Matrix which admits 2 (weighted) perfect matchings. In this subsection we consider the cases M = 2, M = 3, Putting everything together, we obtain for the square of the and general M for the general non-negative matrix degree-2 Bethe partition function of θ

θ1,1 θ1,2 2 1 ↑(1) ↑(2) θ = . permB,2(θ) = perm(θ ) + perm(θ ) θ2,1 θ2,2 2 ·   2 2 2 2 A particular goal of this subsection is to compare the degree-  = θ1,1θ2,2 + θ1,1θ1,2θ2,1θ2,2 + θ1,2θ2,1. M Bethe permanent of θ with the permanent of θ. In fact, as Given the observations that we will see, for every considered case in this subsection we ↑(1) 2 have permB,M (θ) 6 perm(θ). perm θ 6 perm(θ) , ↑(2) 2 perm θ  6 perm(θ) , Example 45 Let n = 2 and M = 2. We perform similar computations as in Example 42, but for a general non-negative it is not surprising that we also have the inequality matrix θ. Towards computing perm (θ) as given in (9), we B,2 2 6 2 make the following observations. permB,2(θ) perm(θ) ,   17 i.e., Moreover, in the limit M , we have → ∞ θ 6 θ permB,2( ) perm( ). permB(θ) = lim sup permB,M (θ) M→∞  = max(θ1,1θ2,2, θ2,1θ1,2).

Example 46 Let n = 2 and M = 3. We perform similar This confirms the result for permB(θ) in Lemma 40, which computations as in Example 43, but for a general non-negative was obtained by analytical means.  matrix θ. Towards computing permB,3(θ) as given in (9), we make the following observations. For n > 2, we leave it as an open problem to obtain an “explicit expression” for perm (θ), M Z , either for • The average in (9) is over 3! = 6 matrices. These B,M ∈ >0 matrices correspond to the triple covers of N(θ) shown the all-one matrix case, or for the general non-negative matrix in Fig. 6(b)–(g). case. ↑(2) • For example, for the matrix θ corresponding to the In conclusion, the above examples shows that in general triple cover in Fig. 6(c), we obtain permB(θ) = perm(θ), however, they also show that the Bethe permanent6 has the potential to give reasonably good ↑(2) 3 3 1 2 2 1 perm θ = θ1,1θ2,2 + θ1,1θ1,2θ2,1θ2,2 estimates, in particular in the cases where the “coupling effect” 2 1 1 2 3 3 in the average graph cover is not too strong. Heuristically, this  + θ1,1θ1,2θ2,1θ2,2 + θ1,2θ2,1, “coupling effect” seems actually to be the worst for n =2 and where each (weighted) perfect matching in Fig. 7 con- to become weaker the larger n is. tributes one monomial to the above expression. One can verify that F. Relevance of Finite Graph Covers perm θ↑(2) = θ2 θ2 +θ2 θ2 (θ θ +θ θ ) 1,1 2,2 1,2 2,1 1,1 2,2 1,2 2,1 N 2· If the NFG (θ) had no cycles then the SPA could  6 θ1,1θ2,2 +θ1,2θ2,1 (θ1,1θ2,2 +θ1,2θ2,1) · be used to exactly compute the partition function. Namely, 3 after a finite number of iterations, the SPA would reach a = θ1,1θ2,2 +θ1,2θ2,1 3 fixed point and the partition function ZG N(θ) = perm(θ) = perm(θ) .  could be computed with the help of an expression like # (t) (t) #  (The product expression in the first line is not surprising exp F ←−V , −→V , where F is defined − Bethe { i,j } { i,j } Bethe given the fact that graph in Fig. 6(c) contains two in Lemma 31. However, N(θ) has cycles: the use of this  independent components, each contributing one factor to expression at a fixed point of the SPA is still possible but the above product.) usually it does not yield the correct partition function. In this Similar observations can be made for the other five triple subsection, we would like to better understand the source of covers in Fig. 6(b)–(g), and so we obtain this suboptimality. 3 3 To that end, observe that the SPA is an algorithm that permB,3(θ) 6 perm(θ) , processes information locally on N(θ), i.e., messages are sent i.e.,   along edges, function nodes take incoming messages from incident edges, do some computations, and send out new permB,3(θ) 6 perm(θ). messages along the incident edges. On the one hand, this  locality explains the main strengths of the SPA, namely its low complexity and its parallelizability, two key factors for making the SPA a popular algorithm. On the other hand, this locality Example 47 Let n = 2 and M Z . We perform similar >0 explains also the main weakness of the SPA. Namely, a locally computations as in Example 44,∈ but for a general non- operating like SPA “cannot distinguish” if it is operating on negative matrix θ. The observations that we made there can N(θ) or any of its covers [16], [60], [61]. be generalized (beyond the all-one matrix), and we obtain More precisely, let N˜ be an M-cover N˜ of N(θ). Such an M M-cover “looks locally the same” as N(θ) in the sense that M M−ℓ ℓ permB,M (θ) = (θ1,1θ2,2) (θ1,2θ2,1) . the local structure of N˜ is exactly the same as the one of ℓ=0 N N˜ N  X (θ). (Of course, globally and (θ) are different because Because the former NFG contains M times as many function nodes M and M times as many edges.) Consequently, if the SPA is M M M−ℓ ℓ perm(θ) = (θ1,1θ2,2) (θ1,2θ2,1) , run on N˜ with the same initialization as the SPA on N(θ) ℓ ℓ=0   (every initial message is replicated M times), we observe that,  X we see that because both graphs look locally the same and because the SPA is a locally operating algorithm, after every iteration the M 6 M permB,M (θ) perm(θ) , messages on N˜ are exactly the same as the messages on N(θ), i.e.,   simply replicated M times. In that sense, the SPA “cannot distinguish” if it is operating on N(θ), or, implicitly, on N˜, 6 permB,M (θ) perm(θ). or any other M-cover of N(θ). This observation allows us to 18 give the following interpretation of (8) (which is reproduced VII. THE RELATIONSHIPBETWEENTHE PERMANENT here for the ease of reference) ANDTHE BETHE PERMANENT In this section we explore the relationship between perm(θ) , M ↑P˜ permB,M (θ) perm θ . (10) and perm (θ), in particular, if and how perm(θ) can be P˜∈Ψ˜ M B rD  E upper and lower bounded by expressions that are functions Namely, because the SPA implicitly tries to compute in parallel of perm (θ). For an additional/complementary discussion on N ↑P˜ ↑P˜ B the partition function ZG (θ ) = perm(θ ) for all M- this topic we refer to [22]. covers of N(θ), yet it has to give back one real number only,  We start with a lemma that shows that there are non-negative the “best it can do” is to give back the average of these square matrices for which the Bethe permanent can give rather partition functions, i.e., perm θ↑P˜ . (The Mth root P˜∈Ψ˜ M accurate estimates of the permanent, thereby showing the that appears in (10) is included so that the result is properly   overall potential of the Bethe permanent to be the basis for normalized w.r.t. ZG N(θ) = perm(θ).) good upper and lower bounds on the permanent of general Let us conclude this subsection by commenting on two non-negative square matrices. recent papers.  • Translating the results of a paper by Greenhill, Janson, Lemma 48 Let 1n×n be the all-one matrix of size n n. Then × and Ruci´nski [35] to graphical models, it turns out that perm(1 ) 2πn the authors compute a high-order approximation to the n×n 1 = 1+ o(1) , ′ ′ permB( n×n) e · quantity N˜ ′ for some NFG N θ with r ZG( ) N˜ ′∈N˜ ( ) M  Z (N′(θ)) = perm(θ). The NFG N′(θ) is in general where o(1) is w.r.t. n. G different from N(θ), where the latter NFG was specified Proof: See Appendix H.  in Definition 4. We will elaborate on this interesting connection in Section VII-E. • The paper [32] by Barvinok presents bounds on the Although the factor 2πn/ e is non-negligible, compared to perm(1 )= n! it is rather small. number of zero/one matrices with prescribed row and n×n p column sums. (As already mentioned in Section I-C, in statistical physics terms the approach taken therein can A. Lower Bounds on the Permanent of the Matrix θ be considered as a mean-field approach.) In terms of In this subsection we study lower bounds on perm(θ) based NFGs, the quantity of interest is expressed as the partition on permB(θ). function of an NFG that has the same topology as N(θ) but different function nodes. Theorem 49 (Gurvits [14], [15]) It holds that Section 3.1 of [32] then presents an interpretation of perm(θ) these bounds that has a similar flavor of the graph cover > 1. interpretation of the Bethe permanent, however, it also has permB(θ) stark differences. Namely, in terms of NFGs, Section 3.1 Proof: This result was recently shown by Gurvits [14], [15]. of [32] presents an NFG where every function node of Roughly speaking, its elegant proof is based on first expressing the base graph is replicated M times and every edge is θ in terms of a stationary point of FB,N(θ) and then applying replicated M 2 times, i.e., all Mn left-hand side function an inequality due to Schrijver [62].  nodes are connected by exactly one edge to all the Mn For more details, along with a discussion of this result’s right-hand side function nodes. In order for this to make relationship to the results in [63], [64], we refer to [14], [15]. sense, the local functions are adapted so that they have For a somewhat different approach to proving this theorem, Mn arguments instead of n arguments. It is then shown we refer the interested reader to [22]. that the M 2th root of the partition function of this new NFG, M , yields the relevant number in which → ∞ Corollary 50 (Gurvits [14], [15]) For any γ Γn×n it the bounds are expressed. Despite all the similarities, the holds that ∈ differences to finite graph covers are clear: perm(θ) – There is only one such M-fold version of the base > 1. exp F N (γ) graph, whereas the number of M-covers of N(θ) is − B, (θ) 2 (M!)(n ). Proof: This is a straightforward consequence of Theorem 49 – The number of edges is M 2n2, whereas the number and Definitions 11 and 12.  of edges in an M-cover of N(θ) is Mn2. – The local functions need to be adapted in order to Some comments on Theorem 49 and Corollary 50: allow for instead of arguments, whereas the Mn n • Corollary 50 has its significance when one is not willing N local functions of an M-cover of (θ) are the same to run the SPA algorithm, but one has a reasonably good as the local functions of N(θ). estimate of the γ Γn×n that minimizes FB,N(θ). This approach is for example∈ interesting when one wants to obtain analytical lower bounds on the permanent of some parameterized class of non-negative square matrices. 19

• Chertkov, Kroc, and Vergassola [11] observed in 2008 Conjecture 52 Fix some M Z and consider the expres- ∈ >0 that perm(θ) > permB(θ) holds for all the matrices that sions they experimented with. They also outlined a potential ˜ M perm θ↑P and perm(θ) approach to proving this inequality via the loop calculus P˜∈Ψ˜ M D  E technique by Chertkov and Chernyak [21], which in the as polynomials in the indeterminates θ . We conjecture N { i,j}i,j case of (θ) states that perm(θ) equals permB(θ) plus that the coefficient of every monomial of the first polynomial certain correction terms (see [65] for a reformulation of is upper bounded by the coefficient of the corresponding the loop calculus in terms of NFGs). However, given monomial of the second polynomial. the fact that for N(θ) these correction terms happen Possibly also the following, stronger, statement is true. Fix to be positive and negative, it is at present unclear if some M Z>0 and P˜ Ψ˜ M , and consider the expressions Theorem 49 can be proven with this technique. ∈ ∈ ↑P˜ M • In the Allerton 2010 version of this paper we stated the perm θ and perm(θ) inequality that appears in Theorem 49 as a theorem. How-   as polynomials in the indeterminates θi,j i,j. We conjecture ever, while writing the present paper we realized that our that the coefficient of every monomial{ of the} first polynomial “proof” had a flaw, which, so far, we have not been able is upper bounded by the coefficient of the corresponding to fix. Nevertheless, we still think that our proof strategy monomial of the second polynomial.  can work out and possibly give an alternative viewpoint of Schrijver’s inequality that features prominently in [14], Let us conclude this subsection by noting that the inequali- M [15]. In that respect, we list below some special cases of ties ZG(N˜) 6 ZG(N) , M Z>0, N˜ ˜M , i.e., inequalities matrices θ for which our proof strategy works, along with of the type that appear in Conjecture∈ ∈ 51,N have recently been conjectures that, if true, would give an alternative proof used to prove ZB(N) 6 ZG(N) for graphical models N of Theorem 49 in its full generality. appearing in other contexts. We refer the interested reader to [16, Example 34 and Lemma 35] and [66] for details. Conjecture 51 For any M Z it holds that ∈ >0 ˜ M B. Upper Bounds on the Permanent of the Matrix θ perm θ↑P 6 perm(θ) . P˜∈Ψ˜ M In this subsection we list conjectures and open problems D  E Possibly also the following, stronger, statement is true: for any w.r.t. upper bounds on perm(θ) based on permB(θ). M Z and any P˜ Ψ˜ it holds that ∈ >0 ∈ M Conjecture 53 (Gurvits [14], [15]) Let θ be an arbitrary ˜ M perm θ↑P 6 perm(θ) . non-negative matrix of size n n. For even n it is conjectured that ×     perm(θ) n 6 √2 , (11) perm (θ) Theorem 49 would then follow from B with a similar conjecture for odd n. Note that (11) holds with (a) permB(θ) = lim sup permB,M (θ) equality for the matrix θ = I(n/2)×(n/2) 12×2, i.e., the M→∞ Kronecker product of an identity matrix of size⊗ (n/2) (n/2) (b) × = lim sup M perm θ↑P˜ and the all-one matrix of size 2 2.  P˜∈Ψ˜ × M→∞ r M (c) D  E We refer the interested reader to [14], [15] for a discussion M M 6 lim sup perm(θ) of families of non-negative matrices for which the above M→∞ q conjecture has been verified. = lim sup perm(θ) M→∞ Note that Conjecture 53 replaces the conjecture that we (d) made in the Allerton 2010 version of this paper where, for = perm(θ), fixed n, the largest ratio perm(θ)/ permB(θ) was thought to where at step (a) we have used Theorem 39, where at step (b) be obtained for the all-one matrix of size n n. × we have used Definition 38, where at step (c) we have used Besides proving the bound in Conjecture 53, it would be the weaker part of Conjecture 51, and where step (d) follows desirable to prove statements of the form from evaluating the (now trivial) limit M . perm(θ) → ∞ Pr θ Θ : 6 τ > 1 ε, We now list some special matrices θ for which Conjec- ∈ perm (θ) − ture 51 is true.  B  where Θ is some ensemble of random matrices of size n n, • Conjecture 51 is true for θ 1 . (The proof is given = n×n where τ is some positive real number, and where ε is some× in Appendix I.) small positive number. For example, for the ensemble of • Conjecture 51 is true for all matrices θ that were studied n n matrices where the matrix entries are chosen uni- in Section VI. formly× and independently between 0 and 1, we conjecture that Actually, the results in Section VI suggest the following, perm(θ)/ permB(θ) is, with high probability, upper bounded stronger version of Conjecture 51. by the ratio that appears in Lemma 48. (Note that this ratio is much smaller than the ratio that appears in Conjecture 53.) 20

C. Closeness of the Permanent to the Bethe Permanent Example 56 For some positive integer M, consider the ma- trix In this subsection we list some cases where perm(θ) is ˜ ˜ relatively close to perm (θ). We start with an auxiliary result ˜ θ I θ I B θ↑P = 1,1 1,2 , that relates the Bethe permanent of a lifted matrix to the Bethe θ I˜ θ P˜′  2,1 2,2 2,2 permanent of the base matrix. ˜ ˜′ where I is the identity matrix of size M M and where P2,2 is a once cyclically left-shifted identity matrix× of size M M. Lemma 54 For any Z and any P˜ ˜ it holds M >0 ΨM Then × that ∈ ∈ ↑P˜ M M M M perm θ = θ1,1θ2,2 + θ1,2θ2,1, θ↑P˜ θ M permB = permB( ) . P˜ M perm θ↑  = perm (θ)   B B Proof: See Appendix J.   M  = max(θ1,1θ2,2, θ1,2θ2,1) , where the first result is a consequence of the observation that Theorem 55 For any α> 1 and any M > Mα, the majority the underlying graph has exactly one cycle, i.e., only two of the matrices in θ↑P˜ satisfies P˜∈Ψ˜ M perfect matchings, and where the second result follows from Lemmas 40 and 54. Therefore,  ↑P˜ perm θ ↑P˜ 1 6 < αM . perm θ  ˜ 1 6 6 2. perm θ↑P ↑P˜ B permB θ    Note that the right-hand side of the above expression does not Here Mα is a parameter that depends on α.  only grow sub-exponentially in M, it does not grow at all.  Proof: The first inequality follows from Theorem 49. We prove the second inequality by contradiction. So, assume that there Let us conclude this subsection with the following remark. is an α > 1 and a constant Mα such that for all M > Mα As already mentioned, the proof of Theorem 49 takes ad- ˜ ′ ˜ ↑P˜ vantage of an inequality by Schrijver [62], and therefore the the set ΨM ΨM of all lifted matrices θ that satisfy ↑P˜ ⊆ M ↑P˜ ˜ closeness of perm(θ) to perm (θ) is linked with the tightness perm θ > α permB θ has size at least ΨM /2. B Then · | | of Schrijver’s inequality. Now, interestingly enough, when   Schrijver demonstrates a certain asymptotic tightness of his (a) M ↑P˜ inequality, see [62, Section 3], he implicitly evaluates and permB,M (θ) = perm θ P˜∈Ψ˜ r M compares both sides of his inequality for some finite cover D  E of a certain graph. (b) 1 ↑P˜ = M perm θ ˜ v ΨM ˜ ˜ u| | P ∈ΨM   u X D. Open Problems on the Relationship between the Permanent t and the Bethe Permanent 1 ↑P˜ > M perm θ Ψ˜ M There are also classes of structured matrices for which v P˜ ˜ ′ u| | ∈ΨM   u X it would be interesting to better understand the relationship (c) t 1 between the permanent and the Bethe permanent. For example, > M ↑P˜ M α permB θ the permanent of the matrix Ψ˜ M · v P˜∈Ψ˜ ′ u| | XM   µ1 µ2 µm u α1 α1 α1 1 1 t µ1 µ2 ··· µm ··· (d) 1 M M α2 α2 α2 1 1 = M α permB(θ) θ =  . . ··· . . ··· .  , Ψ˜ M · . . . . . v P˜∈Ψ˜ ′ . . . . . u| | XM u   µ1 µ2 µm  αn αn αn 1 1 t ˜ ′  ··· ···  M ΨM   = | | α permB(θ) with 0 6 m 6 n, real numbers αℓ > 0, ℓ [n], and real s Ψ˜ M · · ∈ | | numbers µℓ, ℓ [m], turns up in a variety of contexts. (e) −1/M ∈ > 2 α permB(θ), • When ℓ∈[n] αℓ = 1 and µℓ are non-negative integers · · then perm(θ) corresponds to the probability of the pat- where at step (a) we have used Definition 38, where at step (b) tern ofP a sequence (see, e.g., [67], [68]). we have replaced the angular brackets by the corresponding • When m = n and µℓ = n 1 ℓ, ℓ [n], then perm(θ) normalized sum, where at step (c) we have used the assump- appears in the analysis of− list− ordering∈ algorithms (see, tion, where at step (d) we have used Lemma 54, and where at e.g., [69]) or in the analysis of source coding algorithms step (e) we have again used the assumption. However, taking (see, e.g., [70]). Note that in this case, θ is a Vander- lim supM→∞ on both sides of the above expression, we see monde matrix.  that we obtain a contradiction w.r.t. Theorem 39. Moreover, given the fact that the above θ depends only on (at The following example partially corroborates Theorem 55. most) 2n parameters (and not on n2 parameters as θ in (1)), 21 one wonders if speed-ups in the SPA-based computation of 1 1 1 1 permB(θ) are possible. In some applications one is not interested in the absolute value of the permanent, only the relative value in the sense that for two matrices θ and θ′ one wants to know which 2 2 2 2 one has the larger permanent. Therefore, for some suitable (a) NFG N , N(θ). (b) NFG N′ , N′(θ). stochastic setting it would be desirable to state with what prob- ′ ability perm(θ) 6 perm(θ ) is equivalent to permB(θ) 6 Fig. 8. NFGs used in Section VII-E. ′ permB(θ ). Some very encouraging initial investigations of this topic have been presented in [12, Section 4.2]. generalizing the convexity results of Corollary 23 from N to N′ E. Connections to Results by Greenhill, Janson, and Rucinski´ .) With this, after a few manipulations, n After the initial submission of the present paper, we became (d 1)d−1 Z (N′)= − . (13) aware of the paper by Greenhill, Janson, and Ruci´nski [35] on B dd−2 counting perfect matchings in random graph covers. Using the   Interestingly, the expression on the right-hand side of (13) findings of [16] and the present paper, their results can, once appears also in Corollary 1a in [62]. (One of the main results they have been translated to factor graphs, be seen as defining of Schrijver’s paper [62] is to show that this expression is a an NFG N′ , N′(θ) with Z (N′) = perm(θ) and computing G lower bound on perm(θ).) Z (N′), along with approximately computing Z (N′). The B B,M Clearly, the advantage of N′ is that we can explicitly NFG N′ is in general different from N , N(θ), where the latter compute Z (N′). However, Z (N′) is a weaker lower bound NFG was specified in Definition 4 and shown in Figure 1. B B on perm(θ) than perm (θ)= Z (N). (For example, for the The advantage of N′ is that minimizing its Bethe free energy B B matrix θ in (12) we obtain perm(θ) = 10 > perm (θ) = function towards determining Z (N′) is quite straightforward. B B Z (N)=9 > Z (N′) = 729/256 = 2.848 ... .) This is not Moreover, high-order approximations to Z (N′) can be B B B,M totally surprising given the fact that the right-hand side of (13) given. The disadvantage of N′ is that Z (N′) is a weaker B depends only on θ inasmuch as θ determines n and d. Indeed, lower bound to perm(θ) than perm (θ)= Z (N). B B observing that 1 θ is a doubly stochastic matrix, we get Let us elaborate on these comments. Namely, consider a d · matrix like log ZB(N) 3 1 (a) θ , , (12) >  N γ FB, ( ) γ= 1 ·θ 1 3 − d   (b) where all entries are non-negative integers and where all row N γ N γ = UB, ( )+ HB, ( ) γ= 1 ·θ and all column sums are equal to some constant d. Here, − d (c) θi,j n = 2, d = 4, and perm(θ) = 10. Its NFG N , N(θ) as = log(θi,j ) d specified in Definition 4 is shown in Figure 8 (a). In terms of i,j X factor graphs, the paper [35] considers the NFG N′ , N′(θ) θ θ θ θ i,j log i,j + 1 i,j log 1 i,j shown in Figure 8 (b): like N it has n function nodes on the − d d − d − d i,j i,j left-hand side and n function nodes on the right-hand side. X   X     θi,j θi,j However, for every (i, j) , there are d θi,j edges = n log(d)+ 1 log 1 ∈I×J · − d − d connecting function node i on the left-hand side to function i,j X     node j on the right-hand side. The variable associated with n max(n,d) N′ (d) θ an edge of takes on values in the set 0, 1 . Moreover, = n log(d)+ u i,j + u(0) a local function takes on the value 1 if exactly{ } one of the  d  i j=1   j=n+1 variables associated with the incident edges is 1, and takes on X X X  d max(n,d)  the value 0 otherwise. One can show that these definitions (e) 1 ′ > n log(d)+ u + u(0) yield Z(N ) = perm(θ). Indeed, this result follows from  d  i j=1 observing that valid configurations of N′ correspond to perfect X X   j=Xd+1 matchings of the graph underlying N′, that the global function = n(d 1) log(d 1) n(d 2) log(d)  ′ − − − − value of every valid configurations of N is 1, and that the (f) ′ = log ZB(N ) , graph underlying N′ has perm(θ) perfect matchings. Note that in the case of N, the graph structure is independent where at step (a) we have used Definition 11, where at steps (b) of θ but the local function values depend on θ, whereas in and (c) we have used Lemma 14, where at step (d) we have the case of N′, the graph structure depends on θ but the local used the function u : [0, 1] R, ξ (1 ξ) log(1 ξ), where function node values are independent of θ. at step (e) we have used Karamata’s→ 7→ inequality− [71]− (note that u ′ The Bethe free energy function of N is minimized by is convex and that, after sorting, (θi,1/d,...,θi,n/d, 0,..., 0) ′ ′ N′ (βe,0,βe,1) = (1 1/d, 1/d), e ( ), with corresponding majorizes (1/d, . . . , 1/d, 0,..., 0)), and where at step (f) we beliefs for the function− nodes.∈ (This E can, e.g., be verified have used (13). (See also [15, Section 3] for similar inequali- with the help of symmetry arguments, along with suitably ties as in the above display equation.) 22

Interestingly enough, as shown by the authors of [35], for entropy function to be any M Z>0 one can give a high-order approximation ∈′ (κ) of N˜ ′ , and therefore of the degree- Bethe R ZG( ) N˜ ′∈N˜ M HB : Γn×n , M → ′ ′ 1/M partition function [16] N N˜ ′ . ZB,M ( ) = ZG( ) N˜ ′∈N˜ γ κi HB,i(γi)+ κj HB,j (γi) M 7→ · · For the corresponding expressions we refer the interested i j  X X reader to [35]. κ H (γ ). − i,j · B,(i,j) i,j Near the beginning of this subsection we assumed that θ is i,j a non-negative integral matrix where all row and all column X sums are equal to some constant d. This is less restrictive than (κ) (Clearly, if all values in κ equal 1 then H (γ) = HB(γ), it appears. Namely, Sinkhorn’s theorem states that any positive B with HB(γ) as shown in Lemma 14.)  n n matrix θ can be written as θ = D θ′ D where θ′ is × 1 · · 2 doubly stochastic and where D1 and D2 are diagonal matrices with strictly positive diagonal elements (see, e.g., [72], which Lemma 58 The fractional Bethe entropy function from Defi- presents also some generalizations of this statement). If there nition 57 can also be expressed as follows is a positive integer d such that d θ′ has only integral entries, 1 · ′ then we can write θ = D1 (d θ ) D2. (If there is no (κ) d H (γ)= (κi +κj κi,j) γi,j log(γi,j ) such d, then d can be chosen· large· enough· · so that d θ′ is as B − − · i,j close to an integral matrix as desired.) With this, perm(· θ)= X 1 D θ′ D , and we + κi,j (1 γi,j) log(1 γi,j). dn i∈[n]( 1)i,i perm(d ) i∈[n]( 2)i,i · − − · · · · i,j have reduced the problem of (approximately) computing the X permanent Q of θ to (approximately) computing Q the permanent (κ) ′ of d θ , a non-negative integral matrix where all row and all (If all values in κ equal 1 then HB (γ) = HB(γ), with · column sums are equal to some constant d. The complexity of HB(γ) as shown in Corollary 15.) ′ (approximately) computing the decomposition θ = D1 θ D2 is discussed in [10]. · · Proof: Follows from combining Definition 57 and Lemma 14. 

VIII. THE FRACTIONAL BETHE PERMANENT The following definition generalizes Definitions 11 and 12 and Corollary 15. The terms that appear in HB(β) in Definition 10 all have either coefficient +1 or 1, with obvious implications for the − Definition 59 We define the κ-fractional Bethe free energy coefficients of the terms of HB(γ) in Lemma 14. The main idea behind the fractional Bethe entropy function is to allow function to be these coefficients to take on also other values. This is done (κ) R towards the goal of obtaining a modified Bethe free energy F : Γn×n , B → function whose minimum resembles the minimum of the Gibbs (κ) γ UB(γ) HB (γ), free energy function even more.12 Such generalizations of the 7→ − Bethe entropy function were for example considered in [73]– and the κ-fractional Bethe permanent to be [78] and a combinatorial characterization of the fractional

Bethe entropy function was discussed in [56]. In particular, (κ) (κ) permB (θ) , exp min FB (β) . for the permanent estimation problem such generalizations are − β∈B extensively studied in the very recent paper by A. B. Yedidia   and Chertkov [22], to which we refer for additional discussion  on this topic. As we will see in this section, if the modifications to the The following theorem gives a sufficient condition on κ so Bethe entropy function are applied within some suitable limits, that the κ-fractional Bethe entropy function is concave in γ, the concavity of the modified Bethe entropy function (and thereby generalizing Theorem 22. therefore the convexity of the modified Bethe free energy function) will be maintained. Theorem 60 If κ is such that Definition 57 Let κ > 0 (i ), i ∈ I κ , κi i∈I, κj j∈J , κi,j (i,j)∈I×J κ > 0 (j ), { } { } { } j ∈ J  κi + κj > 2κi,j ((i, j) ). be a collection of real values. We define the κ-fractional Bethe ∈I×J (κ) (κ) 12 then H (γ) is a concave function of γ and F (γ) is a One might also modify UB(γ), however, we do not pursue this option B B here. convex function of γ. 23

Proof: We have 1

(κ) H (γ) ) 0.99

B n ×

(a) n κi +κj κi +κj 0.98 = + κi,j γi,j log(γi,j ) 1 (

− 2 2 − · )

i,j   κ X ( B 0.97 κi +κj κi +κj + + κi,j (1 γi,j) log(1 γi,j)

2 − 2 · − − perm i,j   0.96 X /

(b) κi κj )

= S(γi)+ S(γj ) n 0.95

2 · 2 · ×

i j n

X X 1 κ +κ 0.94 + i j κ h(γ ), 2 − i,j · i,j

i,j perm( X   0.93 where at step (a) we have used Lemma 58, and where at 0.92 step (b) we have used the S-function as specified in Def- 0 10 20 30 40 50 inition 19 and have introduced the binary entropy function n R . If > , h : [0, 1] , ξ ξ log(ξ) (1 ξ) log(1 ξ) κi 0 κ → κi+κ7→j − − − − 1 ( ) 1 > > Fig. 9. Illustration of the ratio perm( n×n)/ perm ( n×n) for the κj 0, and 2 κi,j 0 (the latter being equivalent B − (κ) special choice of κ in Lemma 61, when n varies from 2 to 50. to κi + κj > 2κi,j ), then the concavity of HB (γ) in γ follows from Theorem 20, the well-known concavity of the binary entropy function, and the fact that the sum of concave • For integers n and k such that k divides n we have functions is a concave function. (κ) The convexity of FB (γ) in γ follows from the concavity perm(θ) (κ) (0.922 ...)n/k 6 6 1 of HB (γ) in γ and the linearity of UB(γ) in γ.  (κ) permB (θ)

Lemma 61 An interesting choice for κ is for the matrix θ , I 1 . (n/k)×(n/k) ⊗ k×k Let us conclude this section on the fractional Bethe entropy κ = 1 (i ), i ∈ I function with a few comments. κj = 1 (j ), ∈ J • The SPA message update equations in Section V need 1 κ =1 ((i, j) ). to be modified so that its fixed points correspond to i,j − 2n ∈I×J stationary points of the fractional Bethe free energy, i.e., (κ) so that a modified version of the theorem by Yedidia, The resulting HB (γ) is a concave function of γ and the (κ) Freeman, and Weiss [13] holds. In contrast to the SPA resulting FB (γ) is a convex function of γ. Moreover, letting message update equations in Section V, the modified 1n×n be the all-one matrix of size n n, we obtain × SPA message update equations will be such that the perm(1n×n) √2π right-going messages depend not only on the previous = 1+ o(1) =0.922 ... 1+o(1) , (κ) 1 e · · left-going messages but also on the previous right-going permB ( n×n)   messages, and such that the left-going messages depend where o(1) is w.r.t. n. (Note that, in contrast to Lemma 48, not only on the previous right-going messages but also on there is no √n-factor on the right-hand side of the above the previous left-going messages. (We omit the details.) expression.) Moreover, the convergence analysis in Section V has to be revisited. Proof: See Appendix K.  • We leave it as an open problem to explore the κ parameter Let us make a few comments about the choice of κ in space and to find fractional Bethe permanents for which Lemma 61. interesting statements can be made, in particular for • Fig. 9 shows the exact ratios for n from 2 to 50. In which a statement like the one in Theorem 49 can be particular, note that for n =2 we have made. perm(1 ) 2×2 =1. (κ) 1 permB ( 2×2) IX. COMMENTS AND CONJECTURES

• For even integers n and for the choice of κ from It is an interesting challenge to look at theorems involving Lemma 61, the matrix θ = I 1 yields (n/2)×(n/2) ⊗ 2×2 permanents and to prove that the theorems still hold if the per- the ratio perm(θ) . This is in stark contrast to (κ) = 1 manents in these theorems are replaced by Bethe permanents. permB (θ) Conjecture 53 where θ represents the conjectured “worst- Let us mention two conjectures along these lines that were perm(θ) case” matrix for the ratio . listed in [43]. permB(θ) 24

A. Perm-Pseudo-Codewords ACKNOWLEDGMENTS The following conjecture is based on a theorem in [79] We gratefully acknowledge Farzad Parvaresh for pointing involving permanents of submatrices of a parity-check matrix. out to us the papers [32], [33], Krishna Viswanathan for discussions on the permanent of structured matrices, Adam Definition 62 Let be a binary linear code described by Yedidia and Misha Chertkov for sharing an early version of a parity-check matrixC H Fm×n, m < n. For a size- ∈ 2 their paper [22], and Leonid Gurvits for general discussions (m+1) subset of the column index set (H) we define about permanents. Moreover, we very much appreciate the S I n the Bethe perm-vector based on to be the vector ω Z helpful comments that were made by the reviewers. with components S ∈ perm H if i APPENDIX A , B S\i ωi ∈ S , PROOF OF THEOREM 20 (0  otherwise where H is the submatrix of H consisting of all the Observe that once the concavity of S is established, it is S\i straightforward to verify the claim in the theorem statement columns of H whose index is in the set i .  S\{ } that S(ξ) > 0 for all ξ Π[n]. Indeed, because Π[n] is a polytope with n vertices,∈ because S takes on the value 0 at Conjecture 63 Let be a binary linear code described by each of these vertices, and because S is concave, this statement the parity-check matrixC H Fm×n, m 3. ω (H), (14) By definition, a multi-dimensional function is concave if ∈ K  it is a concave function along any straight line in its domain. Towards showing that this is indeed the case for S, let usfix an ˆ Rn 0 A proof of this conjecture has recently been presented by arbitrary point ξ Π[n] and an arbitrary direction ξ ∈ ˆ ∈ \{ } Smarandache [80]. such that the function ξ(τ) , ξ + τ ξ satisfies ξ(τ) Π[n] for a suitable τ-interval around 0 (to· be defined later).∈ We B. Permanent-Based Kernels need to distinguish three different cases that will be discussed separately in the following subsections: Based on a result by Cuturi [81], Huang and Jebara [12] made the following conjecture. 1) The point ξ is in the interior of Π[n]. 2) The point ξ is at a vertex of Π[n]. Conjecture 64 (Huang and Jebara [12]) Let n be a posi- 3) The point ξ is neither in the interior nor at a vertex tive integer and let be a set endowed with a kernel κ. Let of Π[n]. X = x ,...,x X n and Y = y ,...,y n. Then { 1 n} ∈ X { 1 n} ∈ X A. The Point ξ is in the Interior of κ : (X, Y ) perm κ(x ,y ) Π[n] permB B i j 16i6n, 16j6n 7→ It is straightforward to see that the direction vector ξˆ must is a positive definite kernel on n n.    X × X satisfy

X. CONCLUSIONS ξˆℓ =0, (15) In this paper, we have pursued a graphical-model-based Xℓ approach to approximating the permanent of a non-negative otherwise ξ(τ) Π[n] holds only for τ = 0. Therefore, square matrix, the resulting approximation being called the we assume that∈ (15) is satisfied. Moreover, because ξ Bethe permanent. We have seen that the associated functions, interior(Π ), we have 0 < ξ < 1, ℓ [n], and we can∈ [n] ℓ ∈ like the Bethe entropy function and the Bethe free energy find an ε > 0 such that ξ(τ) Π[n] for ε 6 τ 6 ε. We function, are remarkably well behaved for a graphical model will now show that the function∈ τ S ξ(−τ) is concave at with a non-trivial cycle structure. In that respect, an important τ =0. 7→ part is played by a theorem by Birkhoff and von Neumann (see We start by computing the first-order derivative Theorem 3). Moreover, the SPA can be used to efficiently find d d the minimum of the Bethe free energy function and thereby the S ξ(τ) = s ξℓ(τ) ξˆℓ, dτ − dξℓ(τ) · Bethe permanent. We have also presented a graph-cover-based ℓ  X  analysis that gives additional insights into the inner workings and the second-order derivative of the Bethe permanent, its strengths, and its weaknesses, d2 d2 and we have commented on Bethe-permanent-based upper ξ ˆ2 2 S (τ) = 2 s ξℓ(τ) ξℓ and lower bounds on the permanent. Along the way we have dτ dξℓ(τ) · Xℓ stated several conjectures and open problems, that, if answered  2  2 (a) ξˆ ξˆ one way or the other, could further elucidate the relationship = ℓ + ℓ , − ξℓ(τ) 1 ξℓ(τ) between the permanent and the Bethe permanent. Xℓ Xℓ − 25 where at step (a) we have used Lemma 18. In particular, at all ℓ = ℓ∗. Moreover, step (b) follows from a simple inequality τ =0 we have and step6 (c) follows from (18). ′′ d2 The fact δ 6 0 is shown as follows. We start by observing S ξ(τ) = δℓ, that dτ 2 τ=0 ℓ  X ˆ2 ˆ2 ξℓ (a) ξℓ where δℓ, ℓ [n], is defined as (1 ξℓ∗ ) = ξℓ ∈ − ·  ∗ ξℓ   ∗  ·  ∗ ξℓ  ˆ2 ˆ2 ℓ6=ℓ ℓ6=ℓ ℓ6=ℓ ξℓ ξℓ ˆ2 1 2ξℓ X X X δℓ , + = ξℓ − . (16)       2 − ξℓ 1 ξℓ − · ξℓ(1 ξℓ) 2 ξˆ ℓ ℓ ℓ ℓ − − = ξℓ | | X X X   ·  √ξ  The proof will be finished once we have shown that ℓ6=ℓ∗ ℓ6=ℓ∗ ℓ ! 2 X p X d 6 2 dτ 2 S ξ(τ) 0 at τ =0, which is equivalent to the condition     (b) that ˆ (c) ˆ2  > ξℓ = ξℓ∗ ,  ∗ | | δℓ 6 0. (17) ℓX6=ℓ   Xℓ where step (a) follows from ξ being in Π[n] (which implies ∗ We show this by separately considering two cases, the first that ξℓ = 1 ℓ6=ℓ∗ ξℓ), where at step (b) we use the n − case being ξ interior(Π[n]) [0, 1/2] , the second case Cauchy-Schwarz inequality, and where at step (c) we use (18). ∈ ∩ n P being ξ interior(Π[n]) [0, 1/2] . Rearranging this inequality, we see that it is equivalent to the ∈ \ n ′′ The first case, ξ interior(Π[n]) [0, 1/2] , is relatively inequality δℓ 6 0. straightforward. Namely,∈ for all ℓ [n∩] we have 0 < ξ 6 1/2, ∈ ℓ which implies 1 2ξ > 0, which in turn implies δ 6 0, and B. The Point ξ is at a Vertex of Π[n] − ℓ ℓ so (17) is satisfied. Clearly, the direction vector ξˆ must satisfy (15). Moreover, n ∗ The second case, ξ interior(Π[n]) [0, 1/2] , needs because ξ is at a vertex of Π[n], there is an ℓ [n] such that ∈ \ ∗ ∈ somewhat more work. We start by observing that there is a ξℓ∗ =1 and ξℓ =0, ℓ = ℓ , and such that ξˆℓ∗ < 0 and ξˆℓ > 0, ∗ ∗ ∗ 6 unique ℓ [n] such that ξℓ > 1/2. (Note that there can ℓ = ℓ . Then we can find an ε> 0 such that ξ(τ) Π[n] for ∈ ∗ 6 ∈ only be one such ℓ [n] because ℓ ξℓ =1.) Consequently, 0 6 τ 6 ε. We will now show that the function τ S ξ(τ) ∈ ∗ 7→ 1 2ξℓ∗ < 0 and 1 2ξℓ > 0, ℓ = ℓ . is concave at τ =0. − − 6 P ˆ In the following, it is sufficient to consider only directions ξ We start by plugging in the definition of ξ(τ) into S ξ(τ) , ˆ ˆ ∗ ˆ that satisfy ξℓ∗ > 0 and ξℓ 6 0, ℓ = ℓ , or that satisfy ξℓ∗ < 0 i.e., and ξˆ > 0, ℓ = ℓ∗. This follows6 from contemplating (15)  ℓ 6 S ξ(τ) = ξ (τ) log ξ (τ) and (16) and from observing that for a given ξ and given − ℓ ℓ directional magnitudes ˆ , the left-hand side of (17) is ℓ ξℓ ℓ6=ℓ∗  X  ˆ | | + 1 ξ (τ) log 1 ξ (τ) maximized by a ξ that satisfies the conditions that we have just − ℓ − ℓ  ℓ mentioned.13 From (15) it follows that such direction vectors X   ˆ = (1 + τξˆ ∗ ) log(1 + τξˆ ∗ ) (τξˆ ) log(τξˆ ) ξ satisfy − ℓ ℓ − ℓ ℓ ℓ6=ℓ∗ ˆ ˆ X ξℓ∗ = ξℓ . (18) ˆ ˆ ˆ ˆ | | | | + ( τξℓ∗ ) log( τξℓ∗ )+ (1 τξℓ) log(1 τξℓ). ℓ6=ℓ∗ − − − − X ℓ6=ℓ∗ X Before continuing, let us introduce From this we compute the first-order derivative ˆ2 ˆ2 ′ ξℓ∗ ξℓ d δ , + , S ξ(τ) = ξˆℓ∗ log(1 + τξˆℓ∗ ) ξˆℓ∗ −ξℓ∗ 1 ξℓ dτ − − ℓ6=ℓ∗ − X  ˆ ˆ ˆ ˆ ˆ2 ˆ2 ξℓ log(τ) ξℓ log(ξℓ) ξℓ ξ ∗ ξ − − − δ′′ , + ℓ ℓ . ℓ6=ℓ∗ ℓ6=ℓ∗ ℓ6=ℓ∗ ∗ X X X 1 ξℓ − ∗ ξℓ − ℓ6=ℓ ξˆ ∗ log(τ) ξˆ ∗ log( ξˆ ∗ ) ξˆ ∗ X − ℓ − ℓ − ℓ − ℓ Note that ′ ′′, and so, if we can show that ′ 6 ℓ δℓ = δ + δ δ 0 ξˆℓ log(1 τξˆℓ) ξˆℓ ′′ − − − and δ 6 0 then we have verified the desired result (17). ℓ6=ℓ∗ ℓ6=ℓ∗ P ′ X X The fact δ 6 0 is a consequence of the equation (a) = ξˆℓ∗ log(1 + τξˆℓ∗ ) ξˆℓ log(ξˆℓ) 2 − − ∗ 2 (a) (b) 2 ℓ6=ℓ ξˆ 1 1 (c) ξˆ ∗ X ℓ 6 ξˆ2 6 ξˆ = ℓ , ℓ ℓ ˆ ∗ ˆ ∗ ˆ ˆ (19) 1 ξℓ ξℓ∗ ξℓ∗ ·  | | ξℓ∗ ξℓ log( ξℓ ) ξℓ log(1 τξℓ), ℓ6=ℓ∗ ℓ6=ℓ∗ ℓ6=ℓ∗ − − − − X − X X ℓ6=ℓ∗   X where step (a) follows from ξ being in Π[n], which implies that ˆ where at step (a) we have used ℓ ξℓ =0 multiple times. The ξ ∗ =1 ′ ∗ ξ ′ , which in turn implies that ξ ∗ 6 1 ξ for ℓ − ℓ 6=ℓ ℓ ℓ − ℓ second-order derivative is then 2 P2 2 13 P ˆ d ξˆ ∗ ξˆ In other words, such a ξ produces the “worst-case” left-hand side in (17): ξ ℓ ℓ 2 S (τ) = + . if we can show non-positivity for such direction vectors, we have implicitly dτ − ˆ ∗ ˆ 1+ τξℓ ℓ6=ℓ∗ 1 τξℓ shown non-positivity for any other direction vector.  X − 26

For τ 0 we obtain APPENDIX C ↓ 2 PROOF OF LEMMA 25 d 2 2 S ξ(τ) = ξˆ ∗ + ξˆ dτ 2 − ℓ ℓ τ↓0 ℓ6=ℓ∗  X Clearly we have γi,j =1 if j = σ(i) and γi,j =0 otherwise. 2 ˆ From the condition that γ is such that γ(τ) Γn×n for small (a) ˆ ˆ2 ∈ = ξℓ + ξℓ non-negative τ, it follows that j γˆi,j =0 for all i and − − ∗  ∗ ∈ I ℓ6=ℓ ℓ6=ℓ γˆi,j =0 for all j . Moreover, for every i we have X X i ∈ J P ∈ I (b)   γˆi,j 6 0 if j = σ(i) and γˆi,j > 0 otherwise. Then 6 0, P where at step (a) we have used (15) and where step (b) follows HB γ(τ) ˆ ∗ from a simple inequality and the fact that ξℓ > 0 for ℓ = ℓ . (a) 1 1 6 = S γ (τ) + S γ (τ) Therefore, the function τ S ξ(τ) is concave at τ =0. 2 i 2 j 7→ i j X X  (b) τ  τ  = γˆ log(ˆγ )+ ( γˆ ) log( γˆ ) C. The Point ξ is Neither in the Interior nor at a Vertex of Π[n] − 2 i,j i,j 2 − i,σ(i) − i,σ(i) i j6=σ(i) i The fact that ξ is neither in the interior nor at a vertex of τ X X τ X ∗ Π means that there is an ℓ [n] such that 0 < ξ ∗ < 1. γˆi,j log(ˆγi,j )+ ( γˆσ¯(j),j ) log( γˆσ¯(j),j ) [n] ℓ − 2 2 − − ˆ ∈ j j Clearly, the direction vector ξ must satisfy (15), plus some Xi6=¯Xσ(j) X additional constraints that are irrelevant for the discussion + O(τ 2), here. Then we can find an ε > 0 such that ξ(τ) Π[n] for 6 6 . The concavity of the function ∈ ξ at 0 τ ε τ S (τ) where step (a) follows from Lemma 21 and where at step (b) τ =0 follows then from the observation that,7→ for small non- we have used S(γ )=0, S(γ )=0, and (20). negative τ, the second-order derivative of S ξ(τ) w.r.t. τ is i j dominated by the second-order derivative of the expression We observe that in the above expression there are exactly  two terms for every edge e = (i, j) . Rewriting these ˆ ξℓ(τ) log ξℓ(τ) , a function that is concave ∈I×J − ℓ: ξℓ=0, ξℓ>0 summations such that all the main summations are over i , in τ. ∈ I P  we obtain

APPENDIX B PROOF OF LEMMA 24 HB γ(τ)

We obtain the expression in the lemma statement by evaluat- = τ  γˆi,j log(ˆγi,j )+ τ ( γˆi,σ(i)) log( γˆi,σ(i)) − − − i i ing S ξ(τ) and the first-order derivative of S ξ(τ) w.r.t. τ Xj6=Xσ(i) X at τ = 0. Clearly, S ξ(τ) = 0 and so we can focus on + O(τ 2) computing  the first-order derivative.   Fortunately, in Appendix A-B we have already computed (a) γˆi,j γˆi,j 2 = τ γˆi,σ(i) | | log | | + O(τ ), the first-order derivative for exactly the same setup. Namely, | | · − γˆi,σ(i) γˆi,σ(i)  i j6=σ(i) | |  | | from (19) we obtain X X   d S ξ(τ) = ξˆ ∗ log(1 + τξˆ ∗ ) ξˆ log(ξˆ ) which is the first display equation in the lemma state- dτ − ℓ ℓ − ℓ ℓ ℓ6=ℓ∗ ment. Here, at step (a) we have used γˆi,σ(i) = γˆi,σ(i) ,  X − | | γˆi,j = γˆi,j , j = σ(i), and γˆi,σ(i) = j6=σ(i) γˆi,j , i.e., ξˆℓ∗ log( ξˆℓ∗ ) ξˆℓ log(1 τξˆℓ). | | 6 | | | | − − − − j6=σ(i) γˆi,j / γˆi,σ(i) =1. ℓ6=ℓ∗ | | | | P X The non-negativity of the coefficient of τ in the above P In the limit τ 0 this simplifies to expression follows from γˆ > γˆ , j = σ(i), which ↓ | i,σ(i)| | i,j | 6 d is a consequence of the above-mentioned relation γˆi,σ(i) = S ξ(τ) = ξˆℓ log(ξˆℓ) + ( ξˆℓ∗ ) log( ξˆℓ∗ ). (20) | | dτ − − − j6=σ(i) γˆi,j . τ↓0 ℓ6=ℓ∗ | |  X On the other hand, rewriting these summations such that all P This can be rewritten as follows the main summations are over j , we obtain the second display equation in the lemma statement.∈ J d ξˆℓ ξˆℓ S ξ(τ) = ξˆℓ∗ | | log | | , dτ | | · − ∗ ξˆ ∗ ξˆ ∗  τ↓0 ℓ6=ℓ ℓ ℓ !  X | | | |

 ∗ APPENDIX D where we have used ξˆℓ∗ = ξˆℓ∗ , ξˆℓ = ξˆℓ , ℓ = ℓ , and − | | | | 6 PROOF OF THEOREM 26 ξˆ ∗ = ∗ ξˆ , i.e., ∗ ξˆ / ξˆ ∗ = 1. This verifies | ℓ | ℓ6=ℓ | ℓ| ℓ6=ℓ | ℓ| | ℓ | the expressions for S ξ(τ) in the lemma statement. P P Finally, the non-negativity of the coefficient of τ in (4) From the assumptions in the theorem statement it follows  ∗ follows from ξˆℓ∗ > ξˆℓ , ℓ = ℓ , which is a consequence that γˆ = γˆ for all i and that γˆ =γ ˆ for | | | | 6 | i,σ(i)| − i,σ(i) ∈ I | i,j | i,j of the above-mentioned relation ξˆℓ∗ = ∗ ξˆℓ . all i , j σ(i) (see also the proof of Lemma 25 in | | ℓ6=ℓ | | ∈ I ∈J\{ } P 27

5 Appendix C). Then,

4 UB γ(τ) (a) = (1+τγˆi,σ(i)) log(θi,σ(i)) (τγˆi,j ) log(θi,j ) − − 3 i i X X j6=Xσ(i) (b) θi,j = log(θi,σ(i)) τ γˆi,j log − − | | θ 2 i i i,σ(i) X X j6=Xσ(i)   (c) θi,j = C τ γˆi,j log , − | | θ 1 i i,σ(i) X j6=Xσ(i)   Fig. 10. Trellis for the random walk described in Appendix D. (Here n = 5.) where at step (a) we have used Corollary 15, where at step (b) Highlighted is an instance of a possible walk. we have used that γˆ = 0 holds for every i , j i,j ∈ I i.e., that γˆi,σ(i) = γˆi,j = γˆi,j holds − P j6=σ(i) j6=σ(i) | | for every i , and where at step (c) we have defined for all (i,i′) with i = i′. One can verify that the ∈ I P P 2 ∈ I×I 6 C , i log(θi,σ(i)). (Note that there is no O(τ ) term assumptions on γˆ imply that in the− above expressions.) Then P µi =1, i FB γ(τ) X (a) pi,i′ =1 (for all i ), = UB(γ) HB(γ) ′ ∈ I − Xi 6=i (b) θ i,j ′ = C τ γˆi,j log Qi,i = µi (for all i ), − | | θ ′ ∈ I i i,σ(i) i 6=i X j6=Xσ(i)   X ′ Qi,i′ = µi′ (for all i ), γˆi,j γˆi,j ′ ∈ I τ γˆi,σ(i) | | log | | i6=i − | | · − γˆi,σ(i) γˆi,σ(i)  X i j6=σ(i)   X X | | | | Qi,i′ =1. 2 + O(τ )   i i′6=i X X (c) θi,σ(i′) = C τ γˆ ′ log In order to obtain the theorem statement, we need to − | i,σ(i )| θ i i′6=i i,σ(i) maximize the coefficient of ( τ) in (21). Before doing this, X X   let us quickly discuss the meaning− of this coefficient. γˆ ′ γˆ ′ τ γˆ | i,σ(i )| log | i,σ(i )| Namely, consider the trellis in Fig. 10 with state space − | i,σ(i)| · − γˆ γˆ  I i i′6=i | i,σ(i)|  | i,σ(i)|  (i.e., with n states) and where a trellis section has a branch X X ′ ′ + O(τ 2)   from state i to state i if and only if i = i . It is straightforward∈ toI see that there∈ I is a bijection between,6 on the (d) 2 = C τ µi pi,i′ log(pi,i′ )+ Ti,i′ + O(τ ), one hand, the set of all left-to-right walks in the time-invariant − · · − i i′6=i trellis shown in Fig. 10, and, on the other hand, the set of X X = Qi,i′   N (21) backtrackless walks in (θ) (see Fig. 1) that were mentioned | {z } after Lemma 25. In particular, going from state i to state i′ i in the trellis of Fig. 10 corresponds to the∈ I two half- where at step (a) we have used Corollary 15, where at step (b) steps∈ I\{ of going} from node i to node σ(i′) and then we have inserted the above expression for U (γ) and the B to node i′ in N(θ). With∈ this,I translating (backtrackless)∈ J expression for HB(γ) from Lemma 25, where at step (c) we random walks∈ I to left-to-right random walks in the trellis in have replaced the summations over , , by j j = σ(i) Fig. 10, we obtain that summations over i′ , σ(i′) = σ(i),∈i.e. J, by summations6 ∈ I 6 over i′ , i′ = i, and where at step (d) we have introduced • µi is the probability of being in state i, ′ the definitions∈ I 6 • pi,i′ is the probability of going to state i = i, conditioned on being in state i, 6 • Qi,i′ is the probability of being in state i and then going µi , γˆi,σ(i) (22) to state i′ = i, | | 6 γˆ ′ ′ ′ i,σ(i ) • i i′6=i µipi,i log(pi,i ) is the entropy rate of (the pi,i′ , | |, (23) − γˆi,σ(i) Markov chain corresponding to) the random walk on this | | trellis,P P Qi,i′ , µi pi,i′ = γˆi,σ(i′) , (24) · | | • Ti,i′ is a branch metric, θ ′ i,σ(i ) ′ ′ Ti,i′ , log , (25) • i i′6=i µipi,i Ti,i is the average branch metric of the θi,σ(i) random walk on this trellis,   P P 28

• and maximizing the coefficient of ( τ) in the above where C is some suitable normalization constant. Conse- − i,j expression for FB γ(τ) (see (21)) means to find the quently, the update of the likelihood ratio reads (time-invariant) left-to-right random walk on this trellis (t−1) (t) ∈A ′′  ai i fi(ai) j′′6=j µi,j (ai,j ) that maximizes µ (0) a =0 ←− (t) , −→i,j i,j · −→Λi,j (t) = (t−1) ′′ µ (1) P ai∈Ai fi(ai) Q ′′ µ ′′ (ai,j ) −→i,j a j 6=j ←−i,j µi pi,i′ log(pi,i′ )+ Ti,i′ , i,j =1 · ′ · · − (t−1) (t−1) i i 6=i ′ θ P′ µ ′ (1) Q′′ ′ µ ′′ (0) X X   (a) j 6=j i,j · ←−i,j · j 6=j,j ←−i,j = (t−1) P p θi,j ′′ µQ ′′ (0) i.e., the sum of the entropy rate and the average branch · j 6=j ←−i,j −1 metric of the random walk. (In statistical physics terms, (b) 1 (t−1) = p θ ′Q ←−Λ ′ , this expression can be considered to be some negative · i,j · i,j θi,j j′6=j free energy function.) X p   where atp step (a) we have used = u j for The purpose of rewriting the above expression in the way i j simplifying the numerator, and whereA at{ step| (b)∈ we J} have we did, was so that it is very close to the notation used (t−1) ′ used the definition of ←−Λ ′ , j = j. This yields the first in [82, Lemma 44] that solved exactly the above maximization i,j expression in the lemma statement.6 The second expression is problem. (Note that related problems were also solved in [83] obtained analogously by considering the SPA message update and [84].) rule for function nodes gj , j , at iteration t > 1. As was shown in [82, Lemma 44], the maximal value of Now we turn our attention∈ to J computing the beliefs at the function nodes gi, i , at iteration t > 0. Following [38]– µi pi,i′ log(pi,i′ )+ Ti,i′ ∈ I · · − [40], we have for every i and every ai i, i i′6=i ∈ I ∈ A = Q ′ X X i,i   (t) 1 (t) β = fi(ai) µ (ai,j ), i,ai C · · ←−i,j | {z } i j is log(ρ) and is attained by Y (t) where Ci is chosen such that a βi,a =1. In particular, for µ∗ = κ uL uR, i i i i i ai = uj, j , we get · R · ′ ∈ J P ui′ Ai,i ′ ∗ uR ρ (if i = i ) (t) 1 (t) p ′ = i , β = f (a ) µ ′ (a ′ ) i,i · 6 i,ai i i ←−i,j i,j Ci · · (0 (otherwise) j′ L R Y ′ ui ·Ai,i ·ui′ ′ (t) ∗ ∗ ∗ κ (if i = i ) 1 µ ′ (ai,j′ ) ′ ′ ρ (t) ←−i,j Qi,i = µi pi,i = · 6 , a ′ = fi( i) ←−µi,j (0) (t) · (0 (otherwise) Ci · ·   · j′ j′ ←−µi,j′ (0) Y Y L R   where A, ρ, u , and u are defined in the theorem statement, 1 (t) (t) ∗ = θi,j ←−µi,j′ (0) ←−Vi,j . and where κ is a normalization constant such that i µi = Ci · ·   · j′ 1. Note that A, called the noisy in [82, p Y P ′   Lemma 44], is such that A ′ = exp(T ′ ) for i = i and Because Ci and the expression in the parentheses are inde- i,i i,i 6 such that Ai,i =0. pendent of j, we have just verified the third expression in Because A contains only non-negative entries, ρ is the so- the lemma statement. The fourth expression in the lemma called Perron eigenvector of A, and uL and uR are the so- statement is obtained analogously by considering the beliefs called left and right, respectively, Perron eigenvectors of A; at function nodes gj , j , at iteration t > 1. ∈ J one can show that these two vectors contain only non-negative entries. APPENDIX F Translating this result back using (22), (23), and (24), we PROOF OF LEMMA 31 obtain the result given in the theorem statement. The pseudo-dual function of the Bethe free energy function is given by evaluating the Lagrangian of the Bethe free energy function at a stationary point [48]. Therefore, in a first step, we APPENDIX E want to write down the Lagrangian of the Bethe free energy PROOF OF LEMMA 29 function. To that end, we take the Bethe free energy function as in Definition 10, i.e., We start by formulating the SPA message update rule for FB βi , βj , βe = UB,i(βi)+ UB,j (βj ) functions node gi, i , at iteration t > 1. Following [38]– { } { } { } ∈ I i j [40], we have for every i , every j , and every a¯i,j  X X ∈ I ∈ J ∈ H (β ) H (β )+ H (β ). i,j , − B,i i − B,j j B,e e A i j e X X X (t) 1 (t−1) (For the purposes of this appendix, the expression for in , ′ FB −→µi,j (¯ai,j ) fi(ai) ←−µi,j′ (ai,j ), Ci,j · · ′ Definition 10 is somewhat more convenient than the one in ai∈Ai j 6=j ai,jX=¯ai,j Y Lemma 14.) 29

Now, introducing a Lagrange multiplier for the edge con- taking advantage of the binary alphabet e = 0, 1 , e , sistency constraints (but not for the other constraints imposed we obtain (after some simplifications) A { } ∈ E by the local marginal polytope , see Definition 9), we obtain B F # ←−λ , −→λ the relevant Lagrangian Bethe { e} { e}  LBethe βi , βj , βe , ←−λe , −→λe = log θ exp ←−λ ←−λ { } { } { } { } { } −  i,j · (i,j),1 − (i,j),0  = F ( β , β , β ) i j B i j e  X X p   { } { } { }   log θi,j exp −→λ(i,j),1 −→λ(i,j),0 ←−λe,ae βi,ai βe,ae − · − − ·  −  j i ! e=(i,j) ae ai: ai,e=ae   X X X X X p   + log 1+exp ←−λe,1 ←−λe,0 + −→λe,1 −→λe,0 − − e −→λe,ae βj,aj βe,ae ,    − ·  −  X   a a : a =a e=(Xi,j) Xe j Xj,e e From the results in [13] it follows that at a fixed point of   the SPA, the quantity ←−λ ←−λ represents the log- Because F is convex in β and β , but concave in (i,j),0 (i,j),1 B i i j j likelihood ratio of the left-going− message along the edge (i, j), β , the pseudo-dual function{ } of F {is} given by { e}e B and the quantity −→λ −→λ represents the log-likelihood (i,j),0− (i,j),1 # ratio of the right-going message along the edge (i, j). Clearly, F ←−λe , −→λe Bethe { } { } for every edge (i, j) , these quantities are related to ∈I×J = max min LBethe βi , βj , βe , ←−λe , −→λe , the inverse likelihood ratios by {βe} {βi}, {βj } { } { } { } { } { }  where the maximization/minimization is over all β , β , ←−Vi,j = exp ←−λ(i,j),1 ←−λ(i,j),0 , { e}e { i}i − βj j that satisfy the constraints imposed by the local   { } −→Vi,j = exp −→λ −→λ , marginal polytope , except for the edge consistency con- (i,j),1 − (i,j),0 B   straints. We obtain the maximizing βe e and the minimizing respectively. Therefore, we get β , β by setting suitable partial{ } derivatives to zero. { i}i { j}j This yields, F # ←−V , −→V Bethe { i,j } { i,j } 1 = log θi,j ←−Vi,j log θi,j −→Vi,j βi,ai = gi(ai) exp ←−λe,ai,j(e) , − · − · Zi · · i j e: i(e)=i X   X   Y   p p 1 + log 1+ ←−Vi,j −→Vi,j , · βj,aj = gj (aj ) exp −→λe,ai(e),j , i,j Zj · · X   e: jY(e)=j   1 which is the expression in the lemma statement. βe,ae = exp ←−λe,ae exp −→λe,ae , Although the interpretation of the log-likelihood ratios was Ze · ·     given by looking at fixed points of the SPA, it is not difficult where i(e) and j(e) give the label of the, respectively, left to see that we can evaluate this last expression for any set of and right vertex to which e is incident, and where Z , inverse likelihood ratios. { i}i Zj j , and Ze e are suitable normalization constants such that{ } relevant{ sums} are equal to one. APPENDIX G Now, plugging these beliefs into the Lagrangian, we obtain PROOF OF THEOREM 32 (after cancelling several terms) the expression This appendix has two subsections. The first subsection con- # F ←−λ , −→λ siders the case where the global minimum of FB is achieved Bethe { e} { e} at a vertex of Γn×n, whereas the second subsection considers = log(Zi) log(Zj )+ log(Ze) − − the case where the global minimum of FB is achieved in the i j e X X X interior of Γn×n. For ease of reference, we reproduce here the SPA message = log g (a ) exp ←−λ −  i i · e,ai,j(e)  update rules from Lemma 29, i.e., i ai X X e: iY(e)=i     θ (t) i,j > (26) −→Vi,j = (t−1) , t 1, (i, j) , log gj(aj ) exp −→λe,a ′ ∈I×J −  · i(e),j  j′6=j pθi,j ←−Vi,j′ j aj e: j(e)=j · X X Y   θ   (t) P p i,j > (27) ←−Vi,j = (t) , t 1, (i, j) . + log exp ←−λ + −→λ . ′ θ ′ −→V ′ ∈I×J e,ae e,ae i 6=i p i ,j · i ,j e ae ! X X   In both partsP ofp this appendix, the main task will be to exhibit We proceed by using some details of the definition of N(θ). a contraction operation of a suitably chosen subset of the SPA Namely, using the definition of the local function nodes and messages. 30

A. Global Minimum of FB is Achieved at a Vertex of Γn×n Therefore,

Let γ be the vertex of Γn×n that uniquely minimizes (t) t→∞ ∈ C −→Λi,σ(i) 0, i . FB. This means that γ corresponds to the permutation σγ . −−−→ ∈ I (In the following, we will use the short-hands σ , σγ and A similar argument shows that , −1 σ¯ σγ .) (t) t→∞ (t) (t) ←−Λ 0, j . From (26) it follows that −→Λ =1/−→V , i , can be σ¯(j),j −−−→ ∈ J i,σ(i) i,σ(i) ∈ I written as14 Finally, from (26) and (27) and the above results it follows

(t) 1 (t−1) that −→Λ = θi,j ←−Vi,j , t > 1, i . i,σ(i) θ · · ∈ I (t) t→∞ i,σ(i) j6=σ(i) −→V 0, i , j , j = σ(i), X p i,j −−−→ ∈ I ∈ J 6 p (t) t→∞ On the other hand, for i and j = σ(i) the SPA message ←−Vi,j 0, i , j , j = σ(i). update equation in (27) implies∈ I 6 −−−→ ∈ I ∈ J 6 All these quantities converge to zero exponentially fast. θ (t−1) i,j When FB achieves its minimum in the interior of Γn×n, ←−Vi,j = (t−1) # ′ then we have equality between F and F at stationary i′6=i pθi ,j −→Vi′,j B Bethe · points of the SPA. However, we also have equality in the θ 1 # P p i,j present case. Namely, evaluating F (see Lemma 31) for = (t−1) −→(t−1) Bethe · √θi′,j · V ′ θσ¯(jp),j −→V σ¯(j),j i ,j the above messages, we obtain · 1+ −→(t−1) ′ √θσ¯(j),j · V σ¯(j),j p i 6=i,σ¯(j) # (t) (t) t→∞ P FBethe ←−Vi,j , −→Vi,j log(θi,σ(i)), θi,j −−−→ − 6 i (t−1)    X θσ¯(jp),j −→V · σ¯(j),j which indeed equals FB(γ). From ρ < 1 and FB(γ) = θ log perm (θ) it also follows that p i,j (t−1) > − B = −→Λ σ¯(j),j , t 1, i , j = σ(i), θσ¯(j),j · ∈ I 6 p #  (t) (t) −ν·t exp FBethe ←−Vi,j , −→Vi,j permB(θ) 6 C e where the inequalityp follows from the fact that all terms in the − − ·    summation ′ are non-negative. Then, combining the   i 6=i,σ¯(j) for some suitable constants C,ν R>0. two above expressions, we obtain ∈ P θ (t) 6 i,j (t−1) > B. Global Minimum of F is Achieved in the Interior of Γ −→Λi,σ(i) −→Λ σ¯(j),j , t 1, i . B n×n θi,σ(i) θσ¯(j),j · ∈ I j6=Xσ(i) In Corollary 23 we established that the Bethe free energy Rearranging terms,p we obtainp function of N(θ) is convex, i.e., it does not have stationary (t) (t−1) points besides the global minimum. Therefore, using a theorem −→Λ θ −→Λ i,σ(i) 6 i,j σ¯(j),j by Yedidia, Freeman, Weiss [13], we know that fixed points θi,σ(i) θi,σ(i) · θσ¯(j),j of the SPA correspond to the global minimum of the Bethe j6=Xσ(i) free energy function. p p(t−1) θi,σ(i′ ) −→Λ i′,σ(i′) Let ←−V , −→V be inverse likelihood ratios that = , t > 1, i . i,j i,j i,j i.j θ · θ ′ ′ ∈ I constitute a fixed point of the SPA update rules in (26)–(27). i′6=i i,σ(i) i ,σ(i )   X As such, these inverse likelihoods must satisfy p (t) Now, for every t > 0, consider the length-n vector −→m whose (t) θi,j ith entry is −→Λi,σ(i)/ θi,σ(i). Grouping several of the above −→Vi,j = , (29) inequalities together, we obtain the vector inequality ′ pθi,j′ ←−Vi,j′ p j 6=j · m(t) 6 A m(t−1), t > 1, (28) P pθi,j −→ · −→ ←−Vi,j = , (30) ′ θ ′ −→V ′ where the vector inequality has to be understood component- i 6=i p i ,j · i ,j wise, and where the n n matrix A was defined in Theorem 26 for every (i, j) . NoteP thatp these SPA fixed point inverse for the vertex γ of×Γ . Let ρ be the maximal (real) ∈ E n×n likelihood ratios satisfy 0 < −→Vi,j < and 0 < ←−Vi,j < , eigenvalue of A. Then, Corollary 27 and the assumption that otherwise the assumption that we are∞ dealing with an interio∞r γ is the unique minimizer of F allow us to conclude that B point of Γn×n would be violated. ρ< 1. However, because ρ< 1 implies that all eigenvalues of It follows from the message gauge invariance mentioned in A have magnitude strictly smaller than 1, the update equation Remark 30 that, for any positive real number C, the inverse in (28) represents a contraction, and so likelihoods C ←−V , 1 −→V also constitute a fixed · i,j i,j C · i,j i.j m(t) t→∞ 0. point of the SPA update rules. We will use this fact later on. −→ 2   (t) (t) −−−→ On the other hand, let , be a set of ←−Vi,j i,j,t −→Vi,j i,j,t 14For simplicity, because j does not appear on the left-hand side of this inverse likelihoods obtained by running the SPA on N(θ) ac-   equation, we use j as a summation variable on the right-hand side. This cording to the SPA update rules in (26)–(27). In the following, is in contrast to (26) where j appears on the left-hand side and where the (t) (t) ′ we will not work with , directly, but summation variable on the right-hand side is j . ←−Vi,j i,j,t −→Vi,j i,j,t   31 with (t) , (t) , which are implicitly defined by we obtain , which is structurally the same ←−ε i,j i,j,t −→εi,j i,j,t δ δ = ε/(1 + ε) the equations expression as the− first expression but with the roles of ε and   δ interchanged. −→V (t) = −→V 1+ ε (t) , (31) i,j i,j · −→i,j (t)  (t)  Lemma 66 Fix an iteration number t > 1. Taking advantage ←−Vi,j = ←−Vi,j 1+ ←−ε i,j . (32) · of the message gauge invariance that was mentioned in Re-   (Note that 1 < ε (t) < and 1 < ε (t) < .) mark 30, we can rescale the left-going and right-going fixed- ←−i,j −→i,j (t−1) −(t) (t) ∞ − ∞ point messages such that all are non-negative. Clearly, ε , ε can be considered to be a ←−ε i,j i,j ←−i,j i,j,t −→i,j i,j,t { (t−1)} (t) “measure” of the distance of the SPA messages to the fixed- With this we define the numbers ←−εmax > 0 and ←−εmax > 0 point messages. In particular, we have established convergence to be the smallest numbers that satisfy of the SPA if we can show that these values converge to zero ε (t−1) 6 ε (t−1), (i, j) , for t . ←−i,j ←−max ∈E → ∞ (t) 6 (t) In a first step, we express the SPA message update rules in ←−ε i,j ←−εmax, (i, j) . (t) (t) ∈E terms of ε and ε . ←−i,j i,j,t −→i,j i,j,t Then   Lemma 65 For the right-going messages it holds that 6 (t) 6 (t) 6 (t−1) 0 ←−ε i,j ←−εmax ←−εmax (i, j) . (t−1) ∈E ′ θ ′ ←−V ′ ε ′ (t) j 6=j i,j i,j ←−i,j Proof: It follows immediately from (33) that −→δi,j , · · , (33) ′ ′ P pj′ 6=j θi,j ←−Vi,j · 6 (t) 6 (t−1) (t) 0 −→δi,j ←−εmax , (i, j) , −→Pδ p ∈E ε (t) = i,j . (34) −→i,j − (t) and so, because of (34), we have 1+ −→δi,j ε (t−1) For the left-going messages it holds that ←−max 6 (t) 6 (37) 1 < (t−1) −→εi,j 0, (i, j) . (t) (t) − −1+ ε ∈E ′ ←−max (t) i′6=i θi ,j −→Vi′,j −→ε i′,j , · · (35) ←−δi,j (t) , Using (35), this implies P p′ θi′,j −→V ′ i 6=i · i ,j (t) (t−1) εmax (t) (t) P←−δi,j p ←− 1 < 6 ←−δi,j 6 0, (i, j) , ←−ε i,j = . (36) − − (t−1) ∈E − (t) 1+ ←−εmax 1+ ←−δi,j and so, because of (36), we have Proof: Let us establish (34). The expression in (36) then follows analogously. We compute 0 6 ε (t) 6 ε (t−1), (i, j) . (38) ←−i,j ←−max ∈E (t) (a) (t) −→Vi,j 1+ ε = −→V This proves the statement in the lemma.  · −→i,j i,j   (b) θi,j This shows that the errors stay bounded but it does not = (t−1) prove convergence yet. (This result is essentially equivalent ′ θ ′ ←−V ′ j 6=j pi,j · i,j to the result that is obtained by taking the zero-temperature (c) θi,j limit of the contraction coefficient that is computed in the = P p (t−1) ′ ′ SPA convergence analysis of [23]: the result is a contraction j′6=j θi,j p←−Vi,j 1+ ←−ε i,j′ · · coefficient of 1, which is non-trivial, but not good enough to 15 (d) P p θi,j   show that the message update map is a contraction. ) = (t) It turns out that in order to improve these bounds we have ′ θi,j′ p←−Vi,j′ 1+ −→δ j 6=j · · i,j to track the error values over two iteration, i.e., four half     (e) P−→Vi,j p iterations. (We suspect that this is related to the fact that the = , N N 1+ −→δ (t) girth of (θ), i.e., the length of the shortest cycle of (θ), i,j is 4.) where at step (a) we have used (31), where at step (b) we have used (26), where at step (c) we have used (32), where Lemma 67 Fix an iteration number t > 1. Taking advantage at step (d) we have used (33), and where at step (e) we have of the message gauge invariance that was mentioned in Re- used (29). Dividing both sides by −→Vi,j , and then subtracting 1 mark 30, we can rescale the left-going and right-going fixed- from both sides, yields the expression in (34).  (t−1) point messages such that all ←−ε i,j i,j are non-negative and Note that −→δ (t) is a weighted arithmetic average of the error { (t−1)} i,j such that, additionally, mini,j ε =0. With this, we define (t−1) (t) ←−i,j values ′ , and that is a weighted arithmetic ←−ε i,j j′ 6=j ←−δi,j (t) 15Given the difference in the graphical model in [23] and the graphical average of the error values ε ′ .  −→i ,j i′6=i model considered here, some care is required when comparing the temperature Note also that the expressions in (34) and (36) have the  that is mentioned here and the temperature that is mentioned in Sections II following peculiarity. Namely, solving ε = δ/(1 + δ) for and III. − 32

(t−1) (t+1) the numbers εmax > 0 and εmax > 0 to be the smallest for suitable constants C,ν R . This follows from, on the ←− ←− ∈ >0 numbers that satisfy one hand, the fact that when FB achieves its minimum in the interior of Γ then we have equality between F and F # ε (t−1) 6 ε (t−1), (i, j) , n×n B Bethe ←−i,j ←−max ∈E at stationary points of the SPA [13], and, on the other hand, ε (t+1) 6 ε (t+1), (i, j) . the above convergence analysis. ←−i,j ←−max ∈E Then APPENDIX H 0 6 ε (t+1) 6 ε (t+1) 6 ν′ ε (t−1) (i, j) , ←−i,j ←−max · ←−max ∈E PROOF OF LEMMA 48 ′ for some constant 0 6 ν < 1 that depends only on θ and In a first step we evaluate perm(1n×n). Namely, we obtain ′ the fixed-point messages ←−Vi,j i,j and −→Vi,j i,j , i.e., ν is (a) n n { } { } perm(1 )= n! = √2πn 1+ o(1) , (39) independent of t. n×n · e ·   (t+1) > where at step (a) we have used Stirling’s approximation  of n!. Proof: The statement ←−ε i,j 0, (i, j) follows from ∈ E 1 applying Lemma 66 twice. Therefore, we can focus on the In a second step we evaluate permB( n×n). From Defini- (t+1) ′ (t−1) tions 11 and 12 it follows that proof of ←−εmax 6 ν ←−εmax . For a given edge · , we observe that (t−1) (i, j) ←−εmax 1+ 1 (t−1) (t) ∈E − (t−1) permB( n×n) , exp min FB(γ) . εmax 6 ε in (37) holds with equality only if ε ′ = − γ ←− −→i,j ←−i,j   (t−1) ′ ′ ←−εmax for all edges (i, j ) with j = j. Similarly, for a given From Corollary 23 and symmetry considerations it follows that (6 t) 6 (t−1) edge (i, j) we observe that ←−ε i,j ←−εmax in (38) holds the minimum in the above expression is achieved by γi,j = ∈E (t) (t−1) (t−1) with equality only if ε ′ = ε 1 + ε for 1/n, (i, j) . Therefore, −→i ,j ←−max ←−max ∈I×J all edges (i′, j) with i′ = i. This− motivates the definition of log perm (1 ) the following sets where6 we track the edges for which a strict B n×n = FB(γ) inequality holds w.r.t. the inequalities just mentioned. Namely, − γi,j =1 /n, (i,j)∈I×J for t > 1 we define (a) = UB(γ)+ HB(γ) ′ − γi,j =1/n, (i,j)∈I×J (t) there is at least one edge (i, j ), −→ , (i, j) , (b) 2 1 1 2 1 1 E ∈E j′ = j, such that (i, j′) ←−(t−1) = n log + n 1 log 1  6 ∈ E  − · n · n · − n · − n ′       (t) there is at least one edge (i , j), ←− , (i, j) . 1 E ∈E i′ = i, such that (i′, j) −→(t) = n log(n)+ n (n 1) log 1   · · − · − n 6 ∈ E   With this, assume that ←−(t−1) contains all the edges for 1 1 1 (t−1) (t −1) E = n log(n)+ n (n 1) + o which . Clearly, (t) then contains all edges · · − · −n − 2n2 n2 ←−ε i,j < ←−ε max −→    (t) (t−1)E (t−1) n 1 (i, j) for which ε > ε max 1+ ε max . Similarly, −→i,j −←− ←− = n log(n) (n 1) − + o(1) (t) contains all edges for which (t) (t−1). · − − − 2n ←− (i, j)  ←−ε i,j < ←−ε max 1 E (t−1) = n log(n) n + + o(1), If ←−εmax = 0 then the lemma is clearly true. So, assume · − 2 (t−1) (t−1) that ←−εmax > 0. Let ←− contain all edges (i, j) for which (t−1) (t−1) E where at steps (a) and (b) we have used Corollary 15. −→ε i,j < ←−εmax . The assumptions in the lemma statement Consequently, guarantee that there is at least one such edge, namely the n (t−1) (t−1) 1 n edge(s) (i, j) for which −→ε i,j = 0, and so the set ←− permB( n×n)= √e 1+ o(1) . (40) is non-empty. It can then be verified that four half-iteratioE ns · e ·    later we have ←−(t+1) = . Combining (39) and (40) we obtain the promised result in The fact thatE there is, asE mentioned in the lemma statement, the lemma statement. a constant ν′ that is t-independent and strictly smaller than 1 is then established by tracking the differences between the APPENDIX I left- and the right-hand sides in the above-mentioned strict PROOF OF CONJECTURE 51 FOR θ = 1n×n inequalities. This is done with the help of (33) and (35).  Let θ = 1n×n. In this appendix we prove that for any The convergence proof is then completed by applying M Z and any P˜ Ψ˜ it holds that ∈ >0 ∈ M Lemma 67 repeatedly. One detail needs to be mentioned, ˜ M (t+1) perm θ↑P 6 perm(θ) . (41) though. Namely, if mini,j ←−ε i,j > 0, and a non-trivial re-gauging occurs at the beginning of the next application Although the proof is somewhat lengthy, the combinatorial of Lemma 67, then in this re-gauging process the value (t+1) idea behind it is quite straightforward. Moreover, the only of maxi,j ←−ε i,j > 0 never increases (in fact, it always inequality that we use is the AM–GM inequality, which decreases). says that the arithmetic mean of a list of non-negative real Finally, we have numbers is at least as large as the geometric mean of this list of numbers. Notably, there is no need to use Stirling’s exp F # ←−V (t) , −→V (t) perm (θ) 6 C e−ν·t − Bethe i,j i,j − B · approximation of the factorial function.     

33

(1, 1) (1, 1) (1, 1) (1, 1) Towards showing (41), let us fix some positive inte- (1, 2) (1, 2) (1, 2) (1, 2) 1 1 ger M, fix some collection of permutation matrices P˜ = (1, 3) (1, 3) (1, 3) (1, 3) (1, 4) (1, 4) (1, 4) (1, 4) P˜(i,j) Ψ˜ , define θ˜ , θ↑P˜ as in Definition 37, i∈I,j∈J ∈ M and let the row and column index sets of θ↑P˜ be [M] and  I× [M], respectively. With this, it follows from Definition 1 (2, 1) (2, 1) (2, 1) (2, 1) (2, 2) (2, 2) (2, 2) (2, 2) J × 2 2 that (2, 3) (2, 3) (2, 3) (2, 3) (2, 4) (2, 4) (2, 4) (2, 4) perm(θ)= θi,σ(i), (42) σ X iY∈I ˜ ˜ (3, 1) (3, 1) (3, 1) (3, 1) perm(θ)= θ(i,m),σ˜((i,m)), (43) (3, 2) (3, 2) (3, 2) (3, 2) σ˜ (i,m)∈I×[M] 3 3 X Y (3, 3) (3, 3) (3, 3) (3, 3) where σ ranges over all permutations of the set and where (a) (3, 4) (3, 1) (3, 4) (3, 1) σ˜ ranges over all permutations of the set [MI]. (b) (c) Note that, because all entries of θ˜ areI× either equal to Fig. 11. (a) NFG N(θ) for n = 3. (b) “Trivial” 4-cover of N(θ) (c) zero or to one, the products in (43) evaluate either to zero N ˜ A possible 4-cover of (θ). The coloring of the edges in (b) and (c) show or to one. Computing perm(θ) is therefore equivalent to visually the fact that he sets ∂˜((i, m)), m ∈ [M], form a partition of J×[M] counting the σ˜’s for which these products evaluate to one. (here for i = 1). (For more details, see the text in Appendix I). Equivalently, perm(θ˜) equals the number of perfect matchings in the NFG N(θ˜). to be the number of possibilities of choosing σ˜((i,m)), i.e., Example 68 Some of the steps of the proof will be illustrated the number of ways that the edge of the perfect matching of with the help of the NFGs in Fig. 3 (which are reproduced in N(θ˜) that is incident on (i,m) can be chosen. Fig. 11 for ease of reference), where n =3 and M =4. ˜ • Let i = 1. Then di,m|σ˜i−1 , m [M], is the number of N 1 • Fig. 11(a) shows the NFG (θ); perm(θ) equals possibilities of choosing the edge∈ of the perfect matching the number of perfect matchings in Fig. 11(a). Note: of N(θ˜) that is incident on (i,m). Because the ith perm(θ)= n! . row of θ contains only ones, and because of the above (i,j) ˜ ˜ • If P˜ = P˜ = I , where I is i∈I,j∈J i∈I,j∈J partitioning observation, we find that d˜i,m = n for all the identity matrix of size M M, then we obtain   m [M], and so, the M-cover shown in Fig. 11(b),× which is a “trivial” ∈ ˜ M-cover of N(θ); perm θ↑P equals the number of ˜ di,m|σ˜i−1 = Mn. ↑P˜ 1 perfect matchings in Fig. 11(b). Note: perm θ = m∈[M] M  X perm(θ) = (n!)M .  We observe that, whatever the selection of these M edges • For a “non-trivial” collection of permutation matrices P˜ = P˜(i,j) we obtain an M-cover like in is, M vertices on the right-hand side will be incident on i∈I,j∈J a selected edge, and therefore be “not available anymore” ↑P˜ Fig. 11(c); perm θ equals the number of perfect in the following steps. This reduces the number of “avail- matchings in Fig. 11(c).   able” right-hand side vertices to Mn M = M (n 1). ˜ − · − • Let i = 2. Then di,m|σ˜i−1 , m [M], is the number of Let us therefore count the number of perfect matchings in 1 ∈ N(θ˜), see Fig. 11(c). Before continuing, we define ∂˜((i,m)), possibilities of choosing the edge of the perfect matching N ˜ (i,m) [M], to be the set of neighbors of the vertex of (θ) that is incident on (i,m). Because the ith row of (i,m) in∈N I×(θ˜), i.e., θ contains only ones, because of the above partitioning observation, and because of the observation at the end of ′ (i,j) ∂˜((i,m)) , (j, m ) [M] P˜ ′ =1 . the above step, we find that ∈J × m,m n ˜o One can easily verify that for every i , the sets ∂((i,m)), ˜ i−1 6 di,m|σ˜ M (n 1). (44) ∈ I 1 · − m [M], form a partition of [M]. (See Figs. 11(b)–(c) m∈[M] that∈ highlight this partitioningJ for × i = 1.) This observation X will be the crucial ingredient of the following steps. (If all permutation matrices in P˜ are identity matrices, ˜ We count the number of perfect matchings in N(θ) by then it can be verified that the inequality in (44) is considering the vertices for , , up to ˜ (i,m) m∈[M] i =1 i =2 an equality. However, for general P , equality in (44) i = n, thereby counting in how many ways we can specify σ˜ does not need to hold.) Similar to the end of the above  such that the product in (43) equals one. Note that because of step, we observe that whatever the selection of these the above partitioning observation, we can, conditioned on the M edges is, M vertices on the right-hand side will selection of a perfect matching up to and including step i 1 be incident on a selected edge, and therefore be “not i−1 − (which we shall symbolically denote by σ˜1 ), consider the available anymore” in the following steps. This reduces vertices independently. Then we define (i,m) m∈[M] the number of “available” right-hand side vertices to M (n 1) M = M (n 2).  ˜ i−1 di,m|σ˜ , (i,m) [M], · − − · − 1 ∈I× • Continuing as above, we observe that for general i ∈ I 34

it holds that For the rest of the proof, we will use the short-hand θ˜ for θ↑P˜ . We remind the reader of Assumption 2, i.e., we will ˜ i−1 6 di,m|σ˜ M (n i + 1). (45) 1 · − assume that there is at least one permutation σ :[n] [n] such m∈[M] ˜ → X that i θi,σ(i) > 0 (otherwise, permB(θ) = permB(θ)=0). Note that for i we have Moreover, N(θ˜) will be the NFG associated with θ˜.16 ∈ I Q M Towards proving the first inequality, let γ Γ be a ∈ n×n ˜ ˜ 1/M matrix that minimizes FB,N(θ). Based on γ, we define the d i−1 = d i−1 i,m|σ˜1 i,m|σ˜  1  (Mn) (Mn) matrix γ˜ with entries mY∈[M] mY∈[M] ×   M ˜(i,j) (a) 1 γ˜(i,m),(j,m′) , γi,j Pm,m′ 6 d˜ i−1 · i,m|σ˜1 M  ′ mX∈[M] for all (i,m,j,m ) [M] [M]. One can easily verify M ˜ ∈I× ×J× ˜ (b)  1  that γ Γ(Mn)×(Mn) and that F N ˜ (γ)= M FB,N(θ)(γ). 6 M (n i + 1) ∈ B, (θ) · M · · − From this and Corollary 15 it then follows that   M = (n i + 1) , (46) ˜ M − permB(θ) > permB(θ) . where at step (a) we have used the fact that the geometric mean  of a collection of non-negative numbers is upper bounded by Towards proving the second inequality, let γ˜ Γ ∈ (Mn)×(Mn) ˜ the arithmetic mean of the same collection of numbers, and be a matrix that minimizes FB,N(θ). One can easily verify where at step (b) we have used (45). (i,j) ′ that ′ whenever ˜ ′ , ˜ γ˜(i,m),(j,m ) = 0 Pm,m = 0 (i,m,j,m ) With this, we obtain the following upper bound on perm(θ). [M] [M]. Based on γ˜, we define the n n matrix∈ Namely, γI×with entries×J × × ˜ (a) perm(θ) = 1 1 (i,j) ··· , ′ ˜ ′ σ˜1 σ˜2|σ˜1 n−1 n−2 n n−1 γi,j γ˜(i,m),(j,m ) Pm,m 1 1 1 σ˜1 |σ˜1 σ˜1 |σ˜1 M · X X X X m m′ (b) X X = d˜ n−1 n,mn|σ˜1 1 2 1 ··· n−1 n−2 for all (i, j) . One can easily verify that γ Γn×n. Let σ˜ σ˜ |σ˜ σ˜ |σ˜ mn∈[M] X1 X1 1 1 X1 Y γ˜ be the∈ length- I×J n vector based on the (i,m)∈th row of γ˜, (c) (i,m) 6 M ˜(i,j) (n n + 1) where we include an entry only if Pm,m′ =1. Similarly, define ··· − ′ σ˜1 σ˜2|σ˜1 σ˜n−1|σ˜n−2 the length-n vector γ˜ ′ based on the (j, m )th column X1 X1 1 1 X1 (j,m ) (d) of γ˜. One can verify that the ith row of γ, i.e., γi, equals 6 (n n + 1)M 1 1 ˜ M m γ(i,m). Similarly, the jth column of γ, i.e., γj , equals − · 1 2 1 ··· n−1 n−2 1 σ˜ σ˜ |σ˜ ˜ ′ 1 1 1 σ˜1 |σ˜1 ′ γ(j,m ). Then X X X M Pm . . P (a) 1 1 H N ˜ (γ˜) = S(γ˜ )+ S(γ˜ ′ ) (e) B, (θ) 2 (i,m) 2 (j,m ) 6 (n i + 1)M i m j m′ − X X X X i∈I (b) M M Y 6 S(γ˜ )+ S(γ˜ ) M 2 i 2 j = (n!) i j X X (f) M (c) = perm(θ) , = M H N (γ), · B, (θ) where at step (a) we have used the fact that perm(θ˜) equals where at step (a) we have used Lemma 21, where at step (b) the number of perfect matchings in N(θ˜), where at step (b) ˜ we have used the concavity of the S-function (see Theo- we have used the definition of dn,m|σ˜n−1 , where at step (c) we 1 rem 20), and where at step (c) we have used once again have used (46) for i = n, where at step (d) we take advantage ˜ ˜ M n−1 Lemma 21. Moreover, one can easily show that UB,N(θ)(γ)= of the fact that (n n+1) is independent of σ˜1 , where at − M U N (γ), and so F ˜ (γ˜) > M F N (γ). From step (e) we apply similar results as at steps (b)–(d) (note that · B, (θ) B,N(θ) · B, (θ) M i−1 this and Corollary 15 it then follows that for all i, the quantity (n i+1) is independent of σ˜1 ), and where at step (f) we have− used the observation perm(θ)= n!. θ˜ 6 θ M This shows that the desired inequality (41) indeed holds for permB( ) permB( ) . ˜ ˜ arbitrary positive integer M and P ΨM .  ∈ 16Let N˜ be the M-cover of N(θ) corresponding to P˜. Note that, strictly ˜ speaking, N˜ and N(θ) are not the same NFG. The former is an M-cover APPENDIX J of N(θ) (therefore it has two times Mn function nodes, all of them with PROOF OF LEMMA 54 degree n), whereas the latter is a complete bipartite graph with two times M Mn function nodes. However, with the above condition on θ, for all practical We first prove perm θ↑P˜ > perm (θ) and then B B purposes they are the same because F N θ˜ (γ˜) < ∞ only for matrices M B, ( ) ↑P˜ (i,j) perm θ 6 perm (θ) , from which the promised γ˜ for which ′ whenever ˜ , B B   ∈ Γ(Mn)×(Mn) γ˜(i,m),(j,m ) = 0 Pm,m′ = 0 equality follows. (i,m,j,m′) ∈I× [M] ×J × [M].   35

APPENDIX K [8] A. Barvinok, “Polynomial time algorithms to approximate permanents PROOF OF LEMMA 61 and mixed discriminants within a simply exponential factor,” Random Structures and Algorithms, vol. 14, no. 1, pp. 29–61, Jan. 1999. Because κ satisfies the conditions listed in Theorem 60, the [9] M. Jerrum and U. V. Vazirani, “A mildly exponential approximation concavity statement for the Bethe entropy function and the algorithm for the permanent,” Algorithmica, vol. 16, no. 4–5, pp. 392– 401, Oct.–Nov. 1996. convexity statement for the Bethe free energy function follow [10] N. Linial, A. Samorodnitsky, and A. Wigderson, “A deterministic immediately. strongly polynomial algorithm for matrix scaling and approximate Therefore, let us turn our attention to evaluating the ratio permanents,” Combinatorica, vol. 20, no. 4, pp. 545–568, 2000. 1 (κ) 1 [11] M. Chertkov, L. Kroc, and M. Vergassola, “Belief propagation and perm( n×n)/ permB ( n×n). In a first step we evaluate beyond for particle tracking,” CoRR, available online under http:// perm(1n×n). Namely, as in the proof of Lemma 48 in arxiv.org/abs/0806.1199, Jun. 2008. Appendix H we have [12] B. Huang and T. Jebara, “Approximating the permanent with belief prop- agation,” CoRR, available online under http://arxiv.org/abs/ n n 0908.1769, Aug. 2009. perm(1 )= n!= √2πn 1+ o(1) . (47) n×n · e · [13] J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Constructing free-energy   approximations and generalized belief propagation algorithms,” IEEE (κ)  Trans. Inf. Theory, vol. 51, no. 7, pp. 2282–2312, Jul. 2005. In a second step we evaluate perm (1n×n). From The- B [14] L. Gurvits, “Unharnessing the power of Schrijver’s permanental inequal- orem 60 and symmetry considerations it follows that the ity,” CoRR, http://arxiv.org/abs/1106.2844, Jun. 2011. minimum in the above expression is achieved by γi,j =1/n, [15] ——, “Unleashing the power of Schrijver’s permanental inequality with (i, j) . Therefore, the help of the Bethe approximation,” Elec. Coll. Comp. Compl., Dec. ∈I×J 2011. 1 [16] P. O. Vontobel, “Counting in graph covers: a combinatorial characteriza- log permB( n×n) tion of the Bethe entropy function,” submitted to IEEE Trans. Inf. The- (a) (κ) ory, Nov. 2010, available online under http://arxiv.org/abs/ = UB(γ)+ H  (γ) − B 1012.0065 (ver. 2), Oct. 2012. (b) 1 1 1 = n2 1+ log [17] R. G. Gallager, Low-Density Parity-Check Codes. M.I.T. Press, − · 2n · n · n Cambridge, MA, 1963.     [18] T. Richardson and R. Urbanke, Modern Coding Theory. New York, 1 1 1 + n2 1 1 log 1 NY: Cambridge University Press, 2008. · − 2n · − n · − n [19] Y. Watanabe and M. Chertkov, “Belief propagation and loop calculus       for the permanent of a non-negative matrix,” Journal of Physics A: 1 1 1 = n + log(n)+ n (n 1) log 1 Mathematical and Theoretical, vol. 43, p. 242002, 2010. 2 · − 2 · − · − n [20] M. Chertkov, L. Kroc, F. Krzakala, M. Vergassola, and L. Zdeborov´a,       1 “Inference in particle tracking experiments by passing messsages be- = n + log(n) tween images,” Proc. Natl. Acad. Sci., vol. 107, no. 17, pp. 7663–7668, 2 · Apr. 2010.   1 1 1 1 [21] M. Chertkov and V. Y. Chernyak, “Loop series for discrete statistical + n (n 1) + o models on graphs,” J. Stat. Mech.: Theory and Experiment, p. P06009, − 2 · − · −n − 2n2 n2 Jun. 2006.      1 [22] A. B. Yedidia and M. Chertkov, “Computing the permanent with belief = n + log(n) n +1+ o(1), propagation,” submitted to J. Mach. Learn. Res., available online under 2 · − http://arxiv.org/abs/1108.0065   , Jul. 2011. [23] M. Bayati and C. Nair, “A rigorous proof of the cavity method for (κ) (κ) where at step (a) we have used FB (γ)= UB(γ) HB (γ), counting matchings,” in Proc. 44th Allerton Conf. on Communications, where at (b) we have used U (γ)= γ log(− θ )=0 Control, and Computing, Allerton House, Monticello, IL, USA, Sep. 27– B − i,j i,j i,j 29 2006. and the expression for (κ) γ from Lemma 58. Therefore, HB ( ) P [24] M. Bayati, D. Gamarnik, D. Katz, C. Nair, and P. Tetali, “Simple deterministic approximation algorithms for counting matchings,” in n n perm(κ)(1 )=e √n 1+ o(1) . (48) Proc. Symp. Theory of Computing,, San Diego, CA, USA, Jun.13–16 B n×n · · e · 2007, pp. 122–127.    [25] D. Gamarnik and D. Katz, “A deterministic By combining (47) and (48) we obtain the promised result. for computing the permanent of a 0, 1 matrix,” J. Computer and System Sciences, vol. 76, no. 8, pp. 879–883, Dec. 2010. [26] J. L. Williams and R. A. Lau, “Convergence of loopy belief propagation REFERENCES for data association,” in Proc. 6th Int. Conf. on Intelligent Sensors, Sen- [1] H. Minc, Permanents. Reading, MA: Addison-Wesley, 1978. sor Networks and Information Processing, Brisbane, Australia, Dec. 7– [2] H. J. Ryser, Combinatorial Mathematics (Carus Mathematical Mono- 10 2010, pp. 175–180. graphs No. 14). Mathematical Association of America, 1963. [27] B. Huang and T. Jebara, “Loopy belief propagation for bipartite max- [3] L. Valiant, “The complexity of computing the permanent,” Theor. Comp. imum weight b-matching,” in Proc. 11th Intern. Conf. on Artificial Sc., vol. 8, no. 2, pp. 189–201, 1979. Intelligence and Statistics, San Juan, Puerto Rico, Mar. 21–24 2007. [4] A. Z. Broder, “How hard is it to marry at random? (On the approximation [28] M. Bayati, D. Shah, and M. Sharma, “Max-product for maximum weight of the permanent),” in Proc. 18th Annual ACM Symp. Theory of Comp., matching: convergence, correctness, and LP duality,” IEEE Trans. Inf. Berkeley, CA, USA, May 28–30 1986, pp. 50–58, (Erratum in Proc. Theory, vol. 54, no. 3, pp. 1241–1251, Mar. 2008. 20th Annual ACM Symposium on Theory of Computing, 1988, p. 551). [29] M. Bayati, C. Borgs, J. Chayes, and R. Zecchina, “Belief-propagation [5] M. Jerrum, A. Sinclair, and E. Vigoda, “A polynomial-time approxima- for weighted b-matchings on arbitrary graphs and its relation to linear tion algorithm for the permanent of a matrix with nonnegative entries,” programs with integer solutions,” SIAM J. Discr. Math., vol. 25, no. 2, J. ACM, vol. 51, no. 4, pp. 671–697, Jul. 2004. pp. 989–1011, 2011. [6] M. Huber and J. Law, “Fast approximation of the permanent for very [30] S. Sanghavi, D. Malioutov, and A. Willsky, “Belief propagation and LP dense problems,” in Proc. ACM-SIAM Symp. Discr. Alg., San Francisco, relaxation for weighted matching in general graphs,” IEEE Trans. Inf. CA, USA, Jan. 20–22 2008. Theory, vol. 57, no. 4, pp. 2203–2212, Apr. 2011. [7] N. Karmarkar, R. Karp, R. Lipton, L. Lov´asz, and M. Luby, “A Monte- [31] N. Wiberg, “Codes and decoding on general graphs,” Ph.D. dissertation, Carlo algorithm for estimating the permanent,” SIAM J. Comp., vol. 22, Department of Electrical Engineering, Link¨oping University, Sweden, no. 2, pp. 284–293, Apr. 1993. 1996. 36

[32] A. Barvinok, “On the number of matrices and a random matrix with [58] H. M. Stark and A. A. Terras, “Zeta functions of finite graphs and prescribed row and column sums and 0–1 entries,” Adv. in Math., vol. coverings,” Adv. in Math., vol. 121, no. 1, pp. 124–165, Jul. 1996. 224, no. 1, pp. 316–339, May 2010. [59] N. L. Biggs, Discrete Mathematics, 2nd ed. New York: The Clarendon [33] A. Barvinok and A. Samorodnitsky, “Computing the partition function Press and Oxford University Press, 1989. for perfect matchings in a hypergraph,” Comb., Prob., and Comp., [60] R. Koetter and P. O. Vontobel, “Graph covers and iterative decoding vol. 20, no. 6, pp. 815–835, Nov. 2011. of finite-length codes,” in Proc. 3rd Intern. Symp. on Turbo Codes and [34] J. Yedidia, “An idiosyncratic journey beyond mean field theory,” in Related Topics, Brest, France, Sep. 1–5 2003, pp. 75–82. Advanced Mean Field Methods, Theory and Practice, M. Opper and [61] P. O. Vontobel and R. Koetter, “Graph-cover decoding and finite-length D. Saad, Eds. MIT Press, Jan. 2001, pp. 21–36. analysis of message-passing iterative decoding of LDPC codes,” CoRR, [35] C. Greenhill, S. Janson, and A. Ruci´nski, “On the number of perfect http://www.arxiv.org/abs/cs.IT/0512078, Dec. 2005. matchings in random lifts,” Comb., Prob., and Comp., vol. 19, no. 5–6, [62] A. Schrijver, “Counting 1-factors in regular bipartite graphs,” J. Comb. pp. 791–817, Nov. 2010. Theory, Ser. B, vol. 72, no. 1, pp. 122–135, Jan. 1998. [36] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, UK: [63] L. Gurvits, “Van der Waerden / Schrijver-Valiant like conjectures and Cambridge University Press, 2004. stable (aka hyperbolic) homogeneous polynomials: one theorem for all,” [37] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge: Cambridge Elec. J. Comb., vol. 15, p. R66, 2008. University Press, 1990, corrected reprint of the 1985 original. [64] M. Laurent and A. Schrijver, “On Leonid Gurvits’s proof for perma- [38] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, “Factor graphs and nents,” Amer. Math. Monthly, vol. 117, no. 10, pp. 903–911, Dec. 2010. the sum-product algorithm,” IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. [65] G. D. Forney, Jr. and P. O. Vontobel, “Partition functions of normal 498–519, Feb. 2001. factor graphs,” in Proc. Inf. Theory Appl. Workshop, UC San Diego, La [39] G. D. Forney, Jr., “Codes on graphs: normal realizations,” IEEE Trans. Jolla, CA, USA, Feb. 6–11 2011. Inf. Theory, vol. 47, no. 2, pp. 520–548, Feb. 2001. [66] N. Ruozzi, “The Bethe partition function of log-supermodular graphical [40] H.-A. Loeliger, “An introduction to factor graphs,” IEEE Sig. Proc. Mag., models,” in Proc. Neural Inf. Proc. Sys. Conf., Lake Tahoe, NV, USA, vol. 21, no. 1, pp. 28–41, Jan. 2004. Dec. 3–6 2012. [41] F. Lad, G. Sanfilippo, and G. Agr`o, “Extropy: a complementary dual of [67] K. Viswanathan, “Pattern maximum-likelihood,” Talk at Workshop on entropy,” CoRR, available online under http://arxiv.org/abs/ “Permanents and modeling probability distributions,” American Institute 1109.6440, Sep. 2011. of Mathematics, Palo Alto, CA, USA, Sep. 1 2009. [42] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. [68] P. O. Vontobel, “The Bethe approximation of the pattern maximum like- New York: John Wiley & Sons Inc., 2006. lihood distribution,” in Proc. IEEE Int. Symp. Inf. Theory, Cambridge, [43] P. O. Vontobel, “Connecting the Bethe entropy and the edge zeta function MA, USA, Jul. 1–6 2012, pp. 2012–2016. of a cycle code,” in Proc. IEEE Int. Symp. Inf. Theory, Austin, TX, USA, [69] R. Rivest, “On self-organizing sequential search heuristics,” Comm. Jun. 13–18 2010, pp. 704–708. ACM, vol. 19, no. 2, pp. 63–67, Feb. 1976. [44] ——, “A factor-graph-based random walk, and its relevance for LP [70] J. Sayir, “Ordering memoryless source alphabets using competitive lists,” decoding analysis and Bethe entropy characterization,” in Proc. Inf. in Proc. First INTAS International Seminar on Coding Theory and Theory Appl. Workshop, UC San Diego, La Jolla, CA, USA, Jan. 31– Combinatorics, Thahkadzor, Armenia, Oct. 6–11 1996. Feb. 5 2010. [71] A. W. Marshall and I. Olkin, Inequalities: Theory of Majorization and [45] S. Arora, C. Daskalakis, and D. Steurer, “Message-passing algorithms Its Applications. San Diego, CA: Academic Press, 1979. and improved LP decoding,” in Proc. 41st Annual ACM Symp. Theory [72] A. Marshall and I. Olkin, “Scaling of matrices to achieve specified row of Computing, Bethesda, MD, USA, May 31–June 2 2009. and column sums,” Num. Math., vol. 12, no. 1, pp. 83–90, Jan. 1968. [46] N. Halabi and G. Even, “LP decoding of regular LDPC codes in [73] W. Wiegerinck and T. Heskes, “Fractional belief propagation,” in memoryless channels,” IEEE Trans. Inf. Theory, vol. 57, no. 2, pp. 887– Advances in Neural Information Processing Systems 15, S. Becker, 897, Feb. 2011. S. Thrun, and K. Obermayer, Eds. Cambridge, MA: MIT Press, 2003, [47] D. Bertsekas, Nonlinear Programming, 2nd ed. Belmont, MA: Athena pp. 438–445. Scientific, 1999. [74] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky, “A new class of [48] P. A. Regalia and J. M. Walsh, “Optimality and duality of the turbo upper bounds on the log partition function,” IEEE Trans. Inf. Theory, decoder,” Proceedings of the IEEE, vol. 95, no. 6, pp. 1362–1377, Jun. vol. 51, no. 7, pp. 2313–2335, Jul. 2005. 2007. [75] T. Heskes, “On the uniqueness of loopy belief propagation fixed points,” [49] M. M´ezard and A. Montanari, Information, Physics, and Computation. Neural Computation, vol. 16, no. 11, pp. 2379–2413, Nov. 2004. New York, NY: Oxford University Press, 2009. [76] Y. Weiss, T. Meltzer, and C. Yanover, “MAP estimation, linear program- [50] Y. Weiss and W. T. Freeman, “On the optimality of the max-product ming and belief propagation with convex free energies,” in Proc. Conf. belief propagation algorithm in arbitrary graphs,” IEEE Trans. Inf. Uncert. in Artif. Intell., Vancouver, Canada, July 19–22 2007. Theory, vol. 47, no. 2, pp. 736–744, 2001. [77] T. Hazan and A. Shashua, “Norm-product belief propagation: primal- [51] ——, “Correctness of belief propagation in Gaussian graphical models dual message-passing for approximate inference,” IEEE Trans. Inf. of arbitrary topology,” Neural Computation, vol. 13, no. 10, pp. 2173– Theory, vol. 56, no. 12, pp. 6294–6316, Dec. 2010. 2200, Oct. 2001. [78] N. Ruozzi and S. Tatikonda, “Convergent and correct message passing [52] P. Rusmevichientong and B. Van Roy, “An analysis of belief propagation schemes for optimization problems over graphical models,” submitted on the turbo decoding graph with Gaussian densities,” IEEE Trans. Inf. to JMLR, 2010, available online under http://arxiv.org/abs/ Theory, vol. 47, no. 2, pp. 745–765, 2001. 1002.3239. [53] J. M. Mooij and H. J. Kappen, “Sufficient conditions for convergence [79] R. Smarandache and P. O. Vontobel, “Absdet-pseudo-codewords and of the sum-product algorithm,” IEEE Trans. Inf. Theory, vol. 53, no. 12, perm-pseudo-codewords: definitions and properties,” in Proc. IEEE Int. pp. 4422–4437, Dec. 2007. Symp. Inf. Theory, Seoul, Korea, June 28–July 3 2009. [54] D. M. Malioutov, J. K. Johnson, and A. S. Willsky, “Walk-sums and [80] R. Smarandache, “Pseudocodewords from Bethe permanents,” submit- belief propagation in Gaussian graphical models,” J. Mach. Learn. Res., ted to IEEE Trans. Inf. Theory, available online under http:// vol. 7, pp. 2031–2064, Dec. 2006. arxiv.org/abs/1112.4625, Dec. 2011. [55] N. Ruozzi, J. Thaler, and S. Tatikonda, “Graph covers and quadratic min- [81] M. Cuturi, “Permanents, transportation polytopes and positive definite imization,” in Proc. 47th Allerton Conf. on Communications, Control, kernels on histograms,” in Proc. Int. Joint Conf. Artificial Intelligence, and Computing, Allerton House, Monticello, IL, USA, Sep. 30–Oct. 2 Hyderabad, India, Jan. 6–12 2007. 2009, pp. 1590–1596. [82] P. O. Vontobel, A. Kavˇci´c, D. M. Arnold, and H.-A. Loeliger, “A [56] P. O. Vontobel, “A combinatorial characterization of the Bethe and the generalization of the Blahut-Arimoto algorithm to finite-state channels,” Kikuchi partition functions,” in Proc. Inf. Theory Appl. Workshop, UC IEEE Trans. Inf. Theory, vol. 54, no. 5, pp. 1887–1918, May 2008. San Diego, La Jolla, CA, USA, Feb. 6–11 2011. [83] J. Justesen and T. Høholdt, “Maxentropic Markov chains,” IEEE Trans. [57] W. S. Massey, Algebraic Topology: an Introduction. New York: Inf. Theory, vol. 30, no. 4, pp. 665–667, Jul. 1984. Springer-Verlag, 1977, reprint of the 1967 edition, Graduate Texts in [84] A. S. Khayrallah and D. L. Neuhoff, “Coding for channels with cost Mathematics, Vol. 56. constraints,” IEEE Trans. Inf. Theory, vol. 42, no. 3, pp. 854–867, May 1996.