The Gibbs Free Energy, the Bethe Free Energy, and the Sum-Product Algorithm

Supplemental Material for Graphical Models and Inference
Henry D. Pfister
November 5th, 2014

1 Introduction

Let $\mathcal{X}$ be a finite set and $f : \mathcal{X}^n \to \mathbb{R}_{\geq 0}$ be a non-negative real function. For a finite set $A$, let $\mathcal{P}(A)$ be the set of probability distributions over $A$. The function $f$ implicitly defines a probability distribution $\mu \in \mathcal{P}(\mathcal{X}^n)$ via

$$ \mu(x) \triangleq \frac{1}{Z} f(x), $$

where $Z = \sum_{x \in \mathcal{X}^n} f(x)$ is called the partition function (or partition sum). For $[n] \triangleq \{1, 2, \ldots, n\}$ and a subset $I = \{i_1, i_2, \ldots, i_k\} \subseteq [n]$ with $i_1 < i_2 < \cdots < i_k$, let $x_I \triangleq (x_{i_1}, x_{i_2}, \ldots, x_{i_k})$ denote the ordered subvector. For $\mu \in \mathcal{P}(\mathcal{X}^n)$, the marginal distribution of $x_I$ is denoted

$$ \mu_I(x_I) \triangleq \sum_{x_{[n] \setminus I}} \mu(x) $$

or, in shorthand, $\mu(x_I)$. Likewise, the conditional distribution of $x_I$ given $x_J$ is denoted

$$ \mu_{I|J}(x_I \,|\, x_J) \triangleq \frac{\mu_{I \cup J}(x_{I \cup J})}{\mu_J(x_J)} $$

or, in shorthand, $\mu(x_I \,|\, x_J)$. The Shannon entropy (in nats) of any $\nu \in \mathcal{P}(A)$ is denoted

$$ H(\nu) \triangleq \sum_{a \in A} \nu(a) \ln \frac{1}{\nu(a)}. $$
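These definitions are easy to exercise numerically. The following Python sketch (the distribution and its weights are made up purely for illustration, and variables are 0-indexed rather than using $[n]$) computes a marginal and a conditional and checks that each conditional slice is itself a probability distribution:

```python
import itertools

# A toy distribution mu on X^3 with X = {0, 1}; the weights are illustrative.
weights = dict(zip(itertools.product((0, 1), repeat=3), [1, 2, 3, 4, 5, 6, 7, 8]))
Z = sum(weights.values())
mu = {x: w / Z for x, w in weights.items()}

def marginal(I):
    """Marginal mu_I(x_I), obtained by summing mu(x) over the remaining coordinates."""
    out = {}
    for x, p in mu.items():
        key = tuple(x[i] for i in I)
        out[key] = out.get(key, 0.0) + p
    return out

mu_01 = marginal((0, 1))   # joint marginal of the first two coordinates
mu_1 = marginal((1,))
# Conditional mu(x_0 | x_1) = mu_{0,1}(x_0, x_1) / mu_1(x_1).
cond = {(x0, x1): mu_01[x0, x1] / mu_1[(x1,)] for x0 in (0, 1) for x1 in (0, 1)}
# Each conditional slice sums to one over x_0.
for x1 in (0, 1):
    assert abs(sum(cond[x0, x1] for x0 in (0, 1)) - 1.0) < 1e-12
```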

2 The Gibbs Free Energy

In physics, the configuration distribution is denoted $\mu_\beta$ and the partition function is denoted $Z(\beta)$ due to their dependence on the inverse temperature parameter $\beta$. In this case, the quantity $-\frac{1}{\beta} \ln Z(\beta)$ is called the free energy of the system and the quantity $\ln Z(\beta)$ is called the free entropy of the system. The choice of $\beta = 1$ is natural for inference problems and, thus, the definitions of $\mu$ and $Z$ above assume $\beta = 1$. Therefore, one finds that the free entropy is simply the log partition function $\ln Z$ and the free energy is simply the negative log partition function $-\ln Z$.

Remark 2.1. If $f(x) \in \{0, 1\}$, then $\mu(x)$ is uniform over the set of valid patterns $\mathcal{V} = \{x \in \mathcal{X}^n \mid f(x) = 1\}$ and $Z = |\mathcal{V}|$ is the number of valid patterns. In this case, the free entropy $\ln Z$ equals the Shannon entropy $H(\mu)$.

The Gibbs free energy $G : \mathcal{P}(\mathcal{X}^n) \to \mathbb{R}$ maps $\nu \in \mathcal{P}(\mathcal{X}^n)$ to the real number

$$ G(\nu) = \underbrace{\sum_{x \in \mathcal{X}^n} \nu(x) \ln \frac{1}{f(x)}}_{U(\nu)} - H(\nu), \qquad (1) $$

where $U(\nu)$ is the average "energy" of the system for the configuration distribution $\nu$. Since $f(x) = Z\mu(x)$, one can also write the Gibbs free energy as

$$ G(\nu) = \sum_{x \in \mathcal{X}^n} \nu(x) \ln \frac{1}{Z\mu(x)} - \sum_{x \in \mathcal{X}^n} \nu(x) \ln \frac{1}{\nu(x)} = -\ln Z + \underbrace{\sum_{x \in \mathcal{X}^n} \nu(x) \ln \frac{\nu(x)}{\mu(x)}}_{D(\nu \| \mu)}, \qquad (2) $$

where $D(\nu \| \mu)$ is the Kullback-Leibler divergence. Since $D(\nu \| \mu) \geq 0$ with equality iff $\nu = \mu$, it follows that $\min_{\nu \in \mathcal{P}(\mathcal{X}^n)} G(\nu) = -\ln Z$ is achieved uniquely at $\nu^* = \mu$. This formulation is often called the variational Gibbs free energy.

To motivate this definition, we mention a connection to the theory of large deviations. Let $P_N$ be the probability that the empirical distribution of $N$ samples drawn according to $\mu$ is equal to $\nu$. If $N\nu(x)$ is an integer for all $x \in \mathcal{X}^n$, then [1, Thm. 11.1.4] shows that $P_N = e^{-N[D(\nu\|\mu)+o(1)]}$ and, thus, $D(\nu\|\mu) = G(\nu) + \ln Z$ implies that

$$ P_N = e^{-N[G(\nu)+\ln Z+o(1)]}. $$
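As a sanity check of the variational characterization, the following Python sketch (with a made-up non-negative $f$ on $\mathcal{X}^2$) verifies that $G(\mu) = -\ln Z$ and that any other distribution gives a larger value:

```python
import itertools
import math

# Hypothetical non-negative function f on X^2 with X = {0, 1}.
f = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
Z = sum(f.values())
mu = {x: fx / Z for x, fx in f.items()}

def gibbs_free_energy(nu):
    # G(nu) = U(nu) - H(nu) = sum_x nu(x) [ln(1/f(x)) + ln nu(x)]
    return sum(p * (math.log(1.0 / f[x]) + math.log(p))
               for x, p in nu.items() if p > 0)

# The minimum value -ln Z is achieved at nu = mu ...
assert abs(gibbs_free_energy(mu) + math.log(Z)) < 1e-12
# ... and any other distribution gives a strictly larger value.
uniform = {x: 0.25 for x in f}
assert gibbs_free_energy(uniform) > -math.log(Z)
```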

3 Factor Graphs

Many operations on $f$ can be simplified if $f$ factors into a product of local potentials. These simplified operations can often be understood in terms of a factor graph. A factor graph is a bipartite graph defined by a set of variable nodes $V = [n]$, a set of factor nodes $F$, a set of edges $E \subseteq V \times F$, and a weight function $f_a$ for each factor $a \in F$. Let $V(a) \triangleq \{i \in V \mid (i, a) \in E\}$ denote the set of variable nodes adjacent to $a$ and $F(i) \triangleq \{a \in F \mid (i, a) \in E\}$ denote the set of factor nodes adjacent to $i$. Such a factor graph represents the factorization

$$ f(x_1, \ldots, x_n) = \left( \prod_{a \in F} f_a(x_{V(a)}) \right) \left( \prod_{i \in V} f_i(x_i) \right). $$

From this, we see that a variable node $i \in V$ participates in factor $a \in F$ iff $i \in V(a)$. Likewise, a factor $a \in F$ depends on variable $i \in V$ iff $a \in F(i)$.

3.1 Factor Graphs without Cycles

If the factor graph does not have any cycles, then inference and analysis are both greatly simplified. In particular, the sum-product algorithm (SPA), which is also called belief propagation (BP), can be used to efficiently compute the factor marginals $\{\mu_{V(a)}\}_{a \in F}$ and $\{\mu_i\}_{i \in V}$. The message-passing update rules of the SPA are given by

$$ \mu_{j \to a}^{(\ell+1)}(x) \propto f_j(x) \prod_{b \in F(j) \setminus a} \hat{\mu}_{b \to j}^{(\ell)}(x) $$
$$ \hat{\mu}_{a \to j}^{(\ell)}(x) \propto \sum_{x_{V(a)}} f_a(x_{V(a)}) \, \delta_{x_j, x} \prod_{i \in V(a) \setminus j} \mu_{i \to a}^{(\ell)}(x_i), \qquad (3) $$

along with the normalization conditions $\sum_{x \in \mathcal{X}} \mu_{j \to a}^{(\ell)}(x) = 1$ and $\sum_{x \in \mathcal{X}} \hat{\mu}_{a \to j}^{(\ell)}(x) = 1$. The symbol $\delta_{x_j, x}$ denotes the Kronecker delta function and equals 1 if $x_j = x$ and 0 otherwise. The algorithm is typically initialized to $\mu_{j \to a}^{(0)}(x) \propto f_j(x)$. If the factor graph does not have cycles, then this iteration converges to a fixed point after a finite number of steps and we denote the fixed-point messages by $\hat{\mu}_{a \to j}^{*}(x)$ and $\mu_{j \to a}^{*}(x)$. In this case, the factor marginals are given by

$$ \mu_i(x) \propto f_i(x) \prod_{b \in F(i)} \hat{\mu}_{b \to i}^{*}(x) $$
$$ \mu_{V(a)}(x_{V(a)}) \propto f_a(x_{V(a)}) \prod_{i \in V(a)} \mu_{i \to a}^{*}(x_i). \qquad (4) $$
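The updates (3) and the marginal formulas (4) can be exercised on a toy tree. The sketch below (a three-variable chain with two illustrative pairwise factors and trivial $f_i \equiv 1$; all names and numbers are made up) runs a few SPA sweeps and checks the resulting marginals against brute-force summation:

```python
import itertools

# Toy tree factor graph: variables 0, 1, 2 over X = {0, 1}, pairwise factors
# a = (0, 1) and b = (1, 2), trivial variable potentials f_i = 1.
X = (0, 1)
scopes = {'a': (0, 1), 'b': (1, 2)}
fac = {'a': {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 1.0},
       'b': {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}}
edges = [(i, a) for a, sc in scopes.items() for i in sc]

def normalize(m):
    s = sum(m.values())
    return {x: v / s for x, v in m.items()}

m_va = {e: {x: 1.0 for x in X} for e in edges}   # mu_{i->a}
m_av = {e: {x: 1.0 for x in X} for e in edges}   # muhat_{a->i}

for _ in range(5):                               # a tree converges in a few sweeps
    for (i, a) in edges:                         # variable-to-factor updates (f_i = 1)
        msg = {x: 1.0 for x in X}
        for (j, b) in edges:
            if j == i and b != a:
                msg = {x: msg[x] * m_av[(j, b)][x] for x in X}
        m_va[(i, a)] = normalize(msg)
    for (i, a) in edges:                         # factor-to-variable updates
        sc = scopes[a]
        msg = {x: 0.0 for x in X}
        for xs in itertools.product(X, repeat=len(sc)):
            w = fac[a][xs]
            for pos, j in enumerate(sc):
                if j != i:
                    w *= m_va[(j, a)][xs[pos]]
            msg[xs[sc.index(i)]] += w
        m_av[(i, a)] = normalize(msg)

def spa_marginal(i):
    # Variable marginal from (4): product of incoming factor messages.
    p = {x: 1.0 for x in X}
    for (j, a) in edges:
        if j == i:
            p = {x: p[x] * m_av[(j, a)][x] for x in X}
    return normalize(p)

# Compare with brute-force marginals of the joint distribution.
joint = {xs: fac['a'][xs[0], xs[1]] * fac['b'][xs[1], xs[2]]
         for xs in itertools.product(X, repeat=3)}
Z = sum(joint.values())
for i in range(3):
    exact = {x: sum(p for xs, p in joint.items() if xs[i] == x) / Z for x in X}
    est = spa_marginal(i)
    assert all(abs(est[x] - exact[x]) < 1e-9 for x in X)
```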

Another consequence of the factor graph not having cycles is that the joint distribution $\mu$ can be written as a function of the factor marginals. This is especially convenient given that these marginals are easily computed with the SPA. The following lemma makes this precise.

Lemma 3.1. Consider a factor graph without cycles. Let $A$ be any subset of factor nodes whose induced subgraph is connected and let $V(A) \triangleq \cup_{a \in A} V(a)$ denote the set of variable nodes adjacent to $A$. Then, the marginal $\mu_{V(A)}$ can be written as

$$ \mu_{V(A)}(x_{V(A)}) = \left( \prod_{a \in A} \frac{\mu_{V(a)}(x_{V(a)})}{\prod_{i \in V(a)} \mu_i(x_i)} \right) \left( \prod_{i \in V(A)} \mu_i(x_i) \right). $$

Proof. The proof is by induction on $|A|$. If $|A| = 1$, then let $b$ denote the single factor node in $A$ and observe that the base case

$$ \mu_{V(b)}(x_{V(b)}) = \left( \frac{\mu_{V(b)}(x_{V(b)})}{\prod_{i \in V(b)} \mu_i(x_i)} \right) \left( \prod_{i \in V(b)} \mu_i(x_i) \right) = \mu_{V(b)}(x_{V(b)}) $$

holds trivially. The subgraph $S(A)$ induced by $A$ is a tree because it is a connected subgraph of a cycle-free graph. If $|A| > 1$, then choose $b \in A$ to be any factor node with $|V(b)| \geq 2$ that is adjacent to a leaf variable node. Such a $b$ exists because $S(A)$ is a tree and $|A| > 1$. Since $S(A)$ is a tree, there is a unique variable node $k \in V(b)$ that is in both $V(b)$ and $V(A \setminus b)$. In this case, $S(A \setminus b)$ is connected and $|A \setminus b| = |A| - 1$. Therefore, we can apply the induction hypothesis to get the formula for $\mu_{V(A \setminus b)}$. Since $x_k$ separates $V(A \setminus b)$ and $V(b)$ in the factor graph, conditional independence implies

$$ \mu_{V(A)}(x_{V(A)}) = \mu_{V(A \setminus b)}(x_{V(A \setminus b)}) \, \mu_{V(b) \setminus k \,|\, k}(x_{V(b) \setminus k} \,|\, x_k) $$
$$ = \mu_{V(A \setminus b)}(x_{V(A \setminus b)}) \, \frac{\mu_{V(b)}(x_{V(b)})}{\mu_k(x_k)} $$
$$ = \left( \prod_{a \in A \setminus b} \frac{\mu_{V(a)}(x_{V(a)})}{\prod_{i \in V(a)} \mu_i(x_i)} \right) \left( \prod_{i \in V(A \setminus b)} \mu_i(x_i) \right) \frac{\mu_{V(b)}(x_{V(b)})}{\mu_k(x_k)} $$
$$ = \left( \prod_{a \in A} \frac{\mu_{V(a)}(x_{V(a)})}{\prod_{i \in V(a)} \mu_i(x_i)} \right) \left( \prod_{i \in V(A \setminus b)} \mu_i(x_i) \right) \left( \prod_{i \in V(b)} \mu_i(x_i) \right) \frac{1}{\mu_k(x_k)} $$
$$ = \left( \prod_{a \in A} \frac{\mu_{V(a)}(x_{V(a)})}{\prod_{i \in V(a)} \mu_i(x_i)} \right) \left( \prod_{i \in V(A)} \mu_i(x_i) \right), $$

where the last equality holds because $V(A \setminus b) \cap V(b) = \{k\}$ implies that the second and third products have an extra factor of $\mu_k(x_k)$ that cancels the $1/\mu_k(x_k)$ term.

Remark 3.2. Choosing $A = F$ in the above lemma shows that the formula holds for any tree factor graph. Likewise, it is easy to verify that conditional independence implies the formula also holds for any factor graph without cycles (i.e., consisting of disjoint tree components).
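Lemma 3.1 with $A = F$ can be checked numerically. The sketch below (a three-variable chain with two illustrative pairwise factors) verifies that the joint equals the product of factor-marginal ratios times the single-variable marginals:

```python
import itertools

# Chain factor graph: variables 0, 1, 2 with pairwise factors a = (0, 1) and
# b = (1, 2); the numbers are illustrative. Variable 1 has F(1) = {a, b}.
fa = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 1.0}
fb = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}

joint = {x: fa[x[0], x[1]] * fb[x[1], x[2]]
         for x in itertools.product((0, 1), repeat=3)}
Z = sum(joint.values())
joint = {x: p / Z for x, p in joint.items()}

def marg(I):
    out = {}
    for x, p in joint.items():
        key = tuple(x[i] for i in I)
        out[key] = out.get(key, 0.0) + p
    return out

m01, m12 = marg((0, 1)), marg((1, 2))
m0, m1, m2 = marg((0,)), marg((1,)), marg((2,))

# Lemma 3.1 with A = F: factor-marginal ratios times single-variable marginals.
for x in itertools.product((0, 1), repeat=3):
    rhs = (m01[x[0], x[1]] / (m0[(x[0],)] * m1[(x[1],)])) \
        * (m12[x[1], x[2]] / (m1[(x[1],)] * m2[(x[2],)])) \
        * m0[(x[0],)] * m1[(x[1],)] * m2[(x[2],)]
    assert abs(joint[x] - rhs) < 1e-12
```

For this chain the right-hand side collapses to the familiar identity $\mu(x_0, x_1, x_2) = \mu_{01}(x_0, x_1)\mu_{12}(x_1, x_2)/\mu_1(x_1)$.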

Lemma 3.3. For a factor graph without cycles, the entropy of $\mu$ can be written in terms of the factor marginals as

$$ H(\mu) = \sum_{a \in F} H(\mu_{V(a)}) - \sum_{i \in V} (|F(i)| - 1) H(\mu_i) \qquad (5) $$

and the free energy can be written as

$$ -\ln Z = \sum_{a \in F} \sum_{x_{V(a)}} \mu_{V(a)}(x_{V(a)}) \ln \frac{1}{f_a(x_{V(a)})} + \sum_{i \in V} \sum_{x_i} \mu_i(x_i) \ln \frac{1}{f_i(x_i)} - H(\mu). \qquad (6) $$

Proof. Using the form of $\mu(x)$ given by Lemma 3.1, one can directly compute the entropy with

$$ H(\mu) = \sum_{x \in \mathcal{X}^n} \mu(x) \ln \frac{1}{\mu(x)} $$
$$ = -\sum_{x \in \mathcal{X}^n} \mu(x) \ln \left[ \left( \prod_{a \in F} \frac{\mu_{V(a)}(x_{V(a)})}{\prod_{i \in V(a)} \mu_i(x_i)} \right) \left( \prod_{i \in V} \mu_i(x_i) \right) \right] $$
$$ = -\sum_{x \in \mathcal{X}^n} \mu(x) \left[ \sum_{a \in F} \ln \mu_{V(a)}(x_{V(a)}) - \sum_{i \in V} (|F(i)| - 1) \ln \mu_i(x_i) \right] $$
$$ = \sum_{a \in F} \sum_{x \in \mathcal{X}^n} \mu(x) \ln \frac{1}{\mu_{V(a)}(x_{V(a)})} - \sum_{i \in V} (|F(i)| - 1) \sum_{x \in \mathcal{X}^n} \mu(x) \ln \frac{1}{\mu_i(x_i)} $$
$$ = \sum_{a \in F} H(\mu_{V(a)}) - \sum_{i \in V} (|F(i)| - 1) H(\mu_i). $$

The second result follows by combining the result $G(\mu) = -\ln Z$ from (2) with (1).
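Both identities in Lemma 3.3 can be checked on a toy chain. The sketch below (illustrative numbers, trivial $f_i \equiv 1$) verifies (5) and then (6):

```python
import itertools
import math

# Chain: variables 0, 1, 2, pairwise factors a = (0, 1) and b = (1, 2),
# trivial f_i = 1; the numbers are illustrative.
fa = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 1.0}
fb = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}

raw = {x: fa[x[0], x[1]] * fb[x[1], x[2]]
       for x in itertools.product((0, 1), repeat=3)}
Z = sum(raw.values())
mu = {x: p / Z for x, p in raw.items()}

def marg(I):
    out = {}
    for x, p in mu.items():
        key = tuple(x[i] for i in I)
        out[key] = out.get(key, 0.0) + p
    return out

def H(dist):
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

# Eq. (5): the degrees are |F(0)| = |F(2)| = 1 and |F(1)| = 2.
lhs = H(mu)
rhs = H(marg((0, 1))) + H(marg((1, 2))) - (2 - 1) * H(marg((1,)))
assert abs(lhs - rhs) < 1e-12

# Eq. (6) with f_i = 1: average energy under the factor marginals minus H(mu).
U = sum(p * math.log(1.0 / fa[xa]) for xa, p in marg((0, 1)).items()) \
  + sum(p * math.log(1.0 / fb[xb]) for xb, p in marg((1, 2)).items())
assert abs((U - H(mu)) + math.log(Z)) < 1e-12
```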

3.2 Factor Graphs with Cycles

For factor graphs with cycles, there is no guarantee that the SPA will converge or give useful results. Still, one can use equations that are exact for factor graphs without cycles as approximations for arbitrary factor graphs and hope for the best. This approach is sometimes called the Bethe formalism. Roughly speaking, the idea is to identify fixed points of the SPA and, at each fixed point, use (4) and Lemma 3.1 to estimate $\mu(x)$. Under various conditions, this can give good results. For example, if the SPA has a unique fixed point and the factor graph has large girth, then the marginal of any node is essentially determined by its tree-like neighborhood, on which the SPA is exact.

Consider a set of variable-node beliefs $b \triangleq \{b_i : \mathcal{X} \to [0,1]\}_{i \in V}$ and factor-node beliefs $\hat{b} \triangleq \{\hat{b}_a : \mathcal{X}^{|V(a)|} \to [0,1]\}_{a \in F}$ satisfying the marginal consistency constraints

$$ \sum_{x_{V(a) \setminus i}} \hat{b}_a(x_{V(a)}) = b_i(x_i), $$

for all $(i, a) \in E$ and $x_i \in \mathcal{X}$. The set of all $(b, \hat{b})$ pairs satisfying these conditions is called the marginal polytope associated with the factor graph and is denoted by $\mathcal{M}$. A set of factor-node beliefs $\hat{b}$ is called consistent if there is a $b$ such that $(b, \hat{b}) \in \mathcal{M}$, and the set of all such beliefs is denoted by $\mathcal{M}_0$. For any $\hat{b} \in \mathcal{M}_0$, the Bethe entropy is defined to be

$$ H_B(\hat{b}) \triangleq \sum_{a \in F} \sum_{x_{V(a)}} \hat{b}_a(x_{V(a)}) \ln \frac{1}{\hat{b}_a(x_{V(a)})} - \sum_{i \in V} (|F(i)| - 1) \sum_{x_i} b_i(x_i) \ln \frac{1}{b_i(x_i)}, $$

where we note that the $b_i$'s can be computed from the $\hat{b}_a$'s. Basically, this formula treats the beliefs as marginals and applies the entropy formula (5) for a factor graph without cycles. This extends the domain of the entropy formula in (5) from factor graphs without cycles to general factor graphs. Likewise, one can extend the free energy formula in (6) from factor graphs without cycles to general factor graphs. This results in the Bethe free energy formula

$$ F_B(\hat{b}) = \sum_{a \in F} \sum_{x_{V(a)}} \hat{b}_a(x_{V(a)}) \ln \frac{1}{f_a(x_{V(a)})} + \sum_{i \in V} \sum_{x_i} b_i(x_i) \ln \frac{1}{f_i(x_i)} - H_B(\hat{b}). \qquad (7) $$

Reasoning by analogy from the variational Gibbs free energy, one might hope that minimizing this function over all consistent beliefs will lead to a set of beliefs that approximates the marginals $\mu_{V(a)}$ well. Hence, we define

$$ \hat{b}^* = \operatorname*{arg\,min}_{\hat{b} \in \mathcal{M}_0} F_B(\hat{b}) $$

and note that $F_B(\hat{b}^*)$ is known as the Bethe estimate of $-\ln Z$ because $\min_{\nu \in \mathcal{P}(\mathcal{X}^n)} G(\nu) = -\ln Z$. This leads us to the question, "How hard is minimizing the Bethe free energy?" In the next section, we will see that this question is intimately connected to the SPA.

Remark 3.4. The negation of the Bethe free energy is called the Bethe free entropy and can be similarly maximized. An easy way to remember the difference is to recall that physical processes tend to minimize energy and maximize entropy.
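On a factor graph with cycles, the Bethe entropy evaluated at the exact marginals generally differs from the true entropy, which is why the Bethe estimate of $-\ln Z$ is only an approximation. The sketch below (a 3-cycle with an illustrative pairwise potential) exhibits this gap:

```python
import itertools
import math

# A 3-cycle: variables 0, 1, 2 with a pairwise factor g on each of (0, 1),
# (1, 2), and (0, 2); every variable has degree |F(i)| = 2. Numbers are
# illustrative.
g = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}

raw = {x: g[x[0], x[1]] * g[x[1], x[2]] * g[x[0], x[2]]
       for x in itertools.product((0, 1), repeat=3)}
Z = sum(raw.values())
mu = {x: p / Z for x, p in raw.items()}

def marg(I):
    out = {}
    for x, p in mu.items():
        key = tuple(x[i] for i in I)
        out[key] = out.get(key, 0.0) + p
    return out

def H(dist):
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

pairs = [marg((0, 1)), marg((1, 2)), marg((0, 2))]
singles = [marg((0,)), marg((1,)), marg((2,))]
HB = sum(H(p) for p in pairs) - sum((2 - 1) * H(p) for p in singles)

# On a graph with a cycle, H_B at the exact marginals need not equal H(mu);
# here the two differ noticeably (by about 0.05 nats).
assert abs(HB - H(mu)) > 0.01
```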

4 The Bethe Free Entropy and the SPA

In this section, we describe the dynamic formulation (9) of the Bethe free entropy in terms of the sum-product messages. If the messages are a fixed point of the SPA, then the dynamic formulation equals the Bethe free entropy associated with the sum-product beliefs given by (4). Otherwise, the function has no direct connection to the Bethe free energy. Instead, it defines the dynamics of the SPA because the SPA updates are defined naturally in terms of its derivatives.

4.1 The Bethe Free Energy at a SPA Fixed Point

Let the message vectors $\mu, \hat{\mu}$ define a fixed point of the SPA and, based on (4), let us define the variable-node and factor-node beliefs to be

$$ b_i(x_i) \triangleq \frac{1}{C_i} f_i(x_i) \prod_{a \in F(i)} \hat{\mu}_{a \to i}(x_i) $$
$$ \hat{b}_a(x_{V(a)}) \triangleq \frac{1}{\hat{C}_a} f_a(x_{V(a)}) \prod_{i \in V(a)} \mu_{i \to a}(x_i), $$

where $C_i \triangleq \sum_{x_i} f_i(x_i) \prod_{a \in F(i)} \hat{\mu}_{a \to i}(x_i)$ and $\hat{C}_a \triangleq \sum_{x_{V(a)}} f_a(x_{V(a)}) \prod_{i \in V(a)} \mu_{i \to a}(x_i)$. Also, for any $a \in F(i)$, one can observe that

$$ f_i(x_i) \prod_{b \in F(i)} \hat{\mu}_{b \to i}(x_i) = \hat{\mu}_{a \to i}(x_i) \, f_i(x_i) \prod_{b \in F(i) \setminus a} \hat{\mu}_{b \to i}(x_i) \propto \hat{\mu}_{a \to i}(x_i) \mu_{i \to a}(x_i). $$

Thus, for $a \in F(i)$, one has

$$ b_i(x_i) = \frac{1}{C_{i,a}} \hat{\mu}_{a \to i}(x_i) \mu_{i \to a}(x_i), $$

where $C_{i,a} \triangleq \sum_{x_i} \hat{\mu}_{a \to i}(x_i) \mu_{i \to a}(x_i)$. Using this, one can verify the marginal consistency condition by observing that

$$ \sum_{x_{V(a) \setminus j}} \hat{b}_a(x_{V(a)}) \propto \sum_{x_{V(a) \setminus j}} f_a(x_{V(a)}) \prod_{i \in V(a)} \mu_{i \to a}(x_i) $$
$$ = \mu_{j \to a}(x_j) \sum_{x_{V(a) \setminus j}} f_a(x_{V(a)}) \prod_{i \in V(a) \setminus j} \mu_{i \to a}(x_i) $$
$$ \propto \mu_{j \to a}(x_j) \, \hat{\mu}_{a \to j}(x_j) $$
$$ \propto b_j(x_j). $$

Now, we will write the Bethe free energy in terms of the SPA fixed-point messages. First, we expand the Bethe entropy term in (7) to get

$$ F_B(\hat{b}) = -\sum_{a \in F} \sum_{x_{V(a)}} \hat{b}_a(x_{V(a)}) \ln \frac{f_a(x_{V(a)})}{\hat{b}_a(x_{V(a)})} - \sum_{i \in V} \sum_{x_i} b_i(x_i) \ln \frac{f_i(x_i)}{b_i(x_i)} + \sum_{i \in V} |F(i)| \sum_{x_i} b_i(x_i) \ln \frac{1}{b_i(x_i)}. $$

Using SPA fixed-point messages, the first term $T_1$ can be rewritten as

$$ T_1 \triangleq -\sum_{a \in F} \sum_{x_{V(a)}} \hat{b}_a(x_{V(a)}) \ln \frac{\hat{C}_a f_a(x_{V(a)})}{f_a(x_{V(a)}) \prod_{i \in V(a)} \mu_{i \to a}(x_i)} $$
$$ = -\sum_{a \in F} \sum_{x_{V(a)}} \hat{b}_a(x_{V(a)}) \left[ \ln \hat{C}_a - \sum_{i \in V(a)} \ln \mu_{i \to a}(x_i) \right] $$
$$ = -\sum_{a \in F} \left[ \ln \hat{C}_a - \sum_{i \in V(a)} \sum_{x_i} b_i(x_i) \ln \mu_{i \to a}(x_i) \right]. $$

Using SPA fixed-point messages, the second term can be rewritten as

$$ T_2 \triangleq -\sum_{i \in V} \sum_{x_i} b_i(x_i) \ln \frac{C_i f_i(x_i)}{f_i(x_i) \prod_{a \in F(i)} \hat{\mu}_{a \to i}(x_i)} $$
$$ = -\sum_{i \in V} \sum_{x_i} b_i(x_i) \left[ \ln C_i - \sum_{a \in F(i)} \ln \hat{\mu}_{a \to i}(x_i) \right] $$
$$ = -\sum_{i \in V} \left[ \ln C_i - \sum_{a \in F(i)} \sum_{x_i} b_i(x_i) \ln \hat{\mu}_{a \to i}(x_i) \right]. $$

Using SPA fixed-point messages, the third term can be rewritten as

$$ T_3 \triangleq \sum_{i \in V} \sum_{a \in F(i)} \sum_{x_i} b_i(x_i) \ln \frac{C_{i,a}}{\hat{\mu}_{a \to i}(x_i) \mu_{i \to a}(x_i)} $$
$$ = \sum_{i \in V} \sum_{a \in F(i)} \sum_{x_i} b_i(x_i) \left[ \ln C_{i,a} - \ln \hat{\mu}_{a \to i}(x_i) - \ln \mu_{i \to a}(x_i) \right] $$
$$ = \sum_{i \in V} \sum_{a \in F(i)} \left[ \ln C_{i,a} - \sum_{x_i} b_i(x_i) \ln \hat{\mu}_{a \to i}(x_i) - \sum_{x_i} b_i(x_i) \ln \mu_{i \to a}(x_i) \right]. $$

Putting these three rewritten terms together, one gets

$$ F_B(\hat{b}) = -\sum_{a \in F} \ln \hat{C}_a - \sum_{i \in V} \ln C_i + \sum_{(i,a) \in E} \ln C_{i,a}. \qquad (8) $$
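At a SPA fixed point, (8) can be checked directly. The sketch below uses the smallest interesting example, a single factor on two variables with illustrative potentials, and confirms that the resulting Bethe free energy equals $-\ln Z$ (as it must, since this graph is cycle-free):

```python
import itertools
import math

# One factor a on variables {0, 1}, with variable potentials f_0, f_1; the
# numbers are illustrative. On this tree the SPA fixed point is reached in
# one sweep.
fa = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 1.0}
f0 = [1.0, 2.0]
f1 = [3.0, 1.0]
Z = sum(f0[x0] * f1[x1] * fa[x0, x1]
        for x0, x1 in itertools.product((0, 1), repeat=2))

# SPA fixed-point messages (F(i)\{a} is empty, so mu_{i->a} is prop. to f_i).
m0a = [v / sum(f0) for v in f0]
m1a = [v / sum(f1) for v in f1]
ma0 = [sum(fa[x0, x1] * m1a[x1] for x1 in (0, 1)) for x0 in (0, 1)]
ma0 = [v / sum(ma0) for v in ma0]
ma1 = [sum(fa[x0, x1] * m0a[x0] for x0 in (0, 1)) for x1 in (0, 1)]
ma1 = [v / sum(ma1) for v in ma1]

# Normalization constants from Section 4.1.
Ca = sum(fa[x0, x1] * m0a[x0] * m1a[x1]
         for x0, x1 in itertools.product((0, 1), repeat=2))
C0 = sum(f0[x] * ma0[x] for x in (0, 1))
C1 = sum(f1[x] * ma1[x] for x in (0, 1))
C0a = sum(m0a[x] * ma0[x] for x in (0, 1))
C1a = sum(m1a[x] * ma1[x] for x in (0, 1))

# Eq. (8); on a cycle-free graph the Bethe free energy equals -ln Z.
FB = -math.log(Ca) - math.log(C0) - math.log(C1) \
     + math.log(C0a) + math.log(C1a)
assert abs(FB + math.log(Z)) < 1e-9
```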

4.2 Dynamic Formulation of the Bethe Free Entropy

In this section, we observe that the RHS of (8) can be evaluated for any set of SPA messages (i.e., a fixed point is not required). Of course, the previous derivation provides no meaning for such an evaluation. Fortunately, the SPA itself will provide the connection. One caveat is that there are multiple ways to write the Bethe free energy in terms of the SPA fixed-point messages and we have picked precisely the one for which this works. Note: To match the formulas in [2, Sec. 14.2.4], we now flip signs and discuss the Bethe free entropy $-F_B(\hat{b})$.

To express the dynamic formulation, one groups the messages by factor node and variable node and we use the notation $\mu_{\to a} \triangleq \{\mu_{i \to a}\}_{i \in V(a)}$ and $\hat{\mu}_{\to i} \triangleq \{\hat{\mu}_{a \to i}\}_{a \in F(i)}$. The potential functions associated with the factor nodes, variable nodes, and edges are defined, respectively, by

$$ S_a(\mu_{\to a}) \triangleq \ln \left[ \sum_{x_{V(a)}} f_a(x_{V(a)}) \prod_{i \in V(a)} \mu_{i \to a}(x_i) \right] $$
$$ \hat{S}_i(\hat{\mu}_{\to i}) \triangleq \ln \left[ \sum_{x_i} f_i(x_i) \prod_{b \in F(i)} \hat{\mu}_{b \to i}(x_i) \right] $$
$$ S_{i,a}(\mu_{i \to a}, \hat{\mu}_{a \to i}) \triangleq \ln \left[ \sum_{x_i} \mu_{i \to a}(x_i) \hat{\mu}_{a \to i}(x_i) \right]. $$

The Bethe free entropy for a set of SPA messages is defined to be the sum of these potentials over all nodes and edges

$$ S(\mu, \hat{\mu}) \triangleq \sum_{a \in F} S_a(\mu_{\to a}) + \sum_{i \in V} \hat{S}_i(\hat{\mu}_{\to i}) - \sum_{(i,a) \in E} S_{i,a}(\mu_{i \to a}, \hat{\mu}_{a \to i}), \qquad (9) $$

where $\mu \triangleq \{\mu_{i \to a}\}_{(i,a) \in E}$ and $\hat{\mu} \triangleq \{\hat{\mu}_{a \to i}\}_{(i,a) \in E}$. This definition is taken from [2, Sec. 14.2.4], though our notation differs slightly from theirs. The following derivation of the connection between the Bethe free entropy and the sum-product algorithm is very much related to the Lagrangian approach of Yedidia et al. [3], but seems to avoid the fixed-point requirement by focusing on edge messages. We note that [3] proves, under some conditions, that beliefs minimizing the static Bethe free energy $F_B(\hat{b})$ (or maximizing the static Bethe free entropy) are given by fixed points of the SPA. For details, see [2, Sec. 14.4.1].

First, we note that $S(\mu, \hat{\mu})$ is unaffected by positive scaling of individual messages. This is because scaling any message $\mu_{i \to a}(x)$ (resp. $\hat{\mu}_{a \to i}(x)$) by a constant $\gamma > 0$ adds $\ln \gamma$ to $S_a(\mu_{\to a})$ (resp. $\hat{S}_i(\hat{\mu}_{\to i})$) and adds $\ln \gamma$ to $S_{i,a}(\mu_{i \to a}, \hat{\mu}_{a \to i})$. It is easy to verify that these two constants cancel in (9) and thus $S(\mu, \hat{\mu})$ is unchanged.

Now, consider the derivative of $S(\mu, \hat{\mu})$ with respect to $\mu_{j \to a}(x)$. The individual terms are given by

$$ \frac{d}{d\mu_{j \to a}(x)} \sum_{b \in F} S_b(\mu_{\to b}) = \frac{d}{d\mu_{j \to a}(x)} S_a(\mu_{\to a}) = \frac{\frac{d}{d\mu_{j \to a}(x)} \sum_{x_{V(a)}} f_a(x_{V(a)}) \prod_{i \in V(a)} \mu_{i \to a}(x_i)}{\sum_{x_{V(a)}} f_a(x_{V(a)}) \prod_{i \in V(a)} \mu_{i \to a}(x_i)} = \frac{\sum_{x_{V(a)}} f_a(x_{V(a)}) \, \delta_{x_j, x} \prod_{i \in V(a) \setminus j} \mu_{i \to a}(x_i)}{\sum_{x_j} \mu_{j \to a}(x_j) \sum_{x_{V(a) \setminus j}} f_a(x_{V(a)}) \prod_{i \in V(a) \setminus j} \mu_{i \to a}(x_i)} $$

$$ \frac{d}{d\mu_{j \to a}(x)} \sum_{(i,b) \in E} S_{i,b}(\mu_{i \to b}, \hat{\mu}_{b \to i}) = \frac{d}{d\mu_{j \to a}(x)} S_{j,a}(\mu_{j \to a}, \hat{\mu}_{a \to j}) = \frac{\frac{d}{d\mu_{j \to a}(x)} \sum_{x_j} \mu_{j \to a}(x_j) \hat{\mu}_{a \to j}(x_j)}{\sum_{x'} \mu_{j \to a}(x') \hat{\mu}_{a \to j}(x')} = \frac{\hat{\mu}_{a \to j}(x)}{\sum_{x'} \mu_{j \to a}(x') \hat{\mu}_{a \to j}(x')}. $$

In this case, one finds that

$$ \frac{d}{d\mu_{j \to a}(x)} S(\mu, \hat{\mu}) = \frac{\sum_{x_{V(a)}} f_a(x_{V(a)}) \, \delta_{x_j, x} \prod_{i \in V(a) \setminus j} \mu_{i \to a}(x_i)}{\sum_{x_j} \mu_{j \to a}(x_j) \sum_{x_{V(a) \setminus j}} f_a(x_{V(a)}) \prod_{i \in V(a) \setminus j} \mu_{i \to a}(x_i)} - \frac{\hat{\mu}_{a \to j}(x)}{\sum_{x'} \mu_{j \to a}(x') \hat{\mu}_{a \to j}(x')}. \qquad (10) $$

Setting this derivative to 0 and solving for $\hat{\mu}_{a \to j}(x)$ is equivalent to solving for $c$ in

$$ \frac{b_j}{\sum_i a_i b_i} - \frac{c_j}{\sum_i a_i c_i} = 0, $$

when $a$ and $b$ are fixed. It is not too hard to see that this holds iff $c_j / b_j$ is constant for all $j$. Therefore, as a function of $\hat{\mu}_{a \to j}(x)$, (10) is zero for all $x \in \mathcal{X}$ if and only if

$$ \hat{\mu}_{a \to j}(x) \propto \sum_{x_{V(a)}} f_a(x_{V(a)}) \, \delta_{x_j, x} \prod_{i \in V(a) \setminus j} \mu_{i \to a}(x_i). $$

This condition is precisely the sum-product update for $\hat{\mu}_{a \to j}(x)$.

Likewise, the derivative of $S(\mu, \hat{\mu})$ with respect to $\hat{\mu}_{a \to j}(x)$ is related to the sum-product update. The individual terms are given by

$$ \frac{d}{d\hat{\mu}_{a \to j}(x)} \sum_{i \in V} \hat{S}_i(\hat{\mu}_{\to i}) = \frac{d}{d\hat{\mu}_{a \to j}(x)} \hat{S}_j(\hat{\mu}_{\to j}) = \frac{\sum_{x_j} f_j(x_j) \left( \prod_{b \in F(j) \setminus a} \hat{\mu}_{b \to j}(x_j) \right) \delta_{x_j, x}}{\sum_{x_j} f_j(x_j) \prod_{b \in F(j)} \hat{\mu}_{b \to j}(x_j)} = \frac{f_j(x) \prod_{b \in F(j) \setminus a} \hat{\mu}_{b \to j}(x)}{\sum_{x_j} \hat{\mu}_{a \to j}(x_j) f_j(x_j) \prod_{b \in F(j) \setminus a} \hat{\mu}_{b \to j}(x_j)} $$

$$ \frac{d}{d\hat{\mu}_{a \to j}(x)} \sum_{(i,b) \in E} S_{i,b}(\mu_{i \to b}, \hat{\mu}_{b \to i}) = \frac{d}{d\hat{\mu}_{a \to j}(x)} S_{j,a}(\mu_{j \to a}, \hat{\mu}_{a \to j}) = \frac{\frac{d}{d\hat{\mu}_{a \to j}(x)} \sum_{x_j} \mu_{j \to a}(x_j) \hat{\mu}_{a \to j}(x_j)}{\sum_{x_j} \mu_{j \to a}(x_j) \hat{\mu}_{a \to j}(x_j)} = \frac{\mu_{j \to a}(x)}{\sum_{x_j} \mu_{j \to a}(x_j) \hat{\mu}_{a \to j}(x_j)}. $$

This implies that

$$ \frac{d}{d\hat{\mu}_{a \to j}(x)} S(\mu, \hat{\mu}) = \frac{f_j(x) \prod_{b \in F(j) \setminus a} \hat{\mu}_{b \to j}(x)}{\sum_{x_j} \hat{\mu}_{a \to j}(x_j) f_j(x_j) \prod_{b \in F(j) \setminus a} \hat{\mu}_{b \to j}(x_j)} - \frac{\mu_{j \to a}(x)}{\sum_{x_j} \mu_{j \to a}(x_j) \hat{\mu}_{a \to j}(x_j)}. \qquad (11) $$

Using the same argument as before, one finds that (11), as a function of $\mu_{j \to a}(x)$, is zero for all $x \in \mathcal{X}$ if and only if

$$ \mu_{j \to a}(x) \propto f_j(x) \prod_{b \in F(j) \setminus a} \hat{\mu}_{b \to j}(x). $$

This condition is precisely the sum-product update for $\mu_{j \to a}(x)$. Thus, the SPA alternately updates the forward and backward messages in a way that forces the gradient (with respect to the opposite set of messages) to the all-zero vector. For this reason, we interpret (9) as a function that generates the discrete-time dynamics of the SPA.
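Both properties derived above, scale invariance of $S$ and the stationarity induced by a SPA update, can be checked numerically. The sketch below uses a single factor $a$ on variables $\{0, 1\}$ with illustrative potentials and arbitrary (non-fixed-point) messages:

```python
import math

# Single-factor graph: factor a on variables {0, 1} with variable potentials
# f_0, f_1; the numbers are illustrative.
fa = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 1.0}
f0 = [1.0, 2.0]
f1 = [3.0, 1.0]

def S(m0a, m1a, ma0, ma1):
    """Bethe free entropy (9) for this graph: S_a + S_0 + S_1 - S_{0,a} - S_{1,a}."""
    Sa = math.log(sum(fa[x0, x1] * m0a[x0] * m1a[x1]
                      for x0 in (0, 1) for x1 in (0, 1)))
    S0 = math.log(sum(f0[x] * ma0[x] for x in (0, 1)))
    S1 = math.log(sum(f1[x] * ma1[x] for x in (0, 1)))
    S0a = math.log(sum(m0a[x] * ma0[x] for x in (0, 1)))
    S1a = math.log(sum(m1a[x] * ma1[x] for x in (0, 1)))
    return Sa + S0 + S1 - S0a - S1a

# Arbitrary (non-fixed-point) messages.
m1a, ma0, ma1 = [0.2, 0.8], [0.6, 0.4], [0.9, 0.1]
m0a = [0.7, 0.3]

# 1) S is invariant under positive rescaling of individual messages.
s_ref = S(m0a, m1a, ma0, ma1)
scaled = S([3.7 * v for v in m0a], m1a, [0.25 * v for v in ma0], ma1)
assert abs(s_ref - scaled) < 1e-12

# 2) Setting mu_{0->a} to its SPA update (prop. to f_0, since F(0)\{a} is
#    empty) zeroes the gradient of S with respect to muhat_{a->0}, as in (11).
m0a = [v / sum(f0) for v in f0]
eps = 1e-6
for x in (0, 1):
    up = list(ma0); up[x] += eps
    dn = list(ma0); dn[x] -= eps
    grad = (S(m0a, m1a, up, ma1) - S(m0a, m1a, dn, ma1)) / (2 * eps)
    assert abs(grad) < 1e-8
```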

References

[1] T. M. Cover and J. A. Thomas, Elements of Information Theory. Wiley Series in Telecommunications, Wiley, 2nd ed., 2006.
[2] M. Mézard and A. Montanari, Information, Physics, and Computation. New York, NY: Oxford University Press, 2009.
[3] J. S. Yedidia, W. T. Freeman, and Y. Weiss, "Constructing free-energy approximations and generalized belief propagation algorithms," IEEE Trans. Inform. Theory, vol. 51, pp. 2282–2312, 2005.
