
Probabilistic and Bayesian Machine Learning

Day 4: Expectation and Belief Propagation

Yee Whye Teh [email protected]

Gatsby Computational Neuroscience Unit, University College London
http://www.gatsby.ucl.ac.uk/~ywteh/teaching/probmodels

The Other KL

Variational methods find $Q = \operatorname{argmin}_Q \mathrm{KL}[Q \,\|\, P(Y|X)]$:
• guaranteed convergence;
• maximising the lower bound may increase the log likelihood.

What about the reversed KL, $Q = \operatorname{argmin}_Q \mathrm{KL}[P(Y|X) \,\|\, Q]$?

For a factored approximation $Q(Y) = \prod_j Q_j(Y_j)$, the optimal factors are the true marginals:
$$ \operatorname{argmin}_{Q_i} \mathrm{KL}\Big[ P(Y|X) \,\Big\|\, \prod_j Q_j(Y_j) \Big] = \operatorname{argmin}_{Q_i} \Big( -\int dY\; P(Y|X) \log \prod_j Q_j(Y_j) \Big) $$
$$ = \operatorname{argmin}_{Q_i} \Big( -\int dY\; P(Y|X) \sum_j \log Q_j(Y_j) \Big) = \operatorname{argmin}_{Q_i} \Big( -\int dY_i\; P(Y_i|X) \log Q_i(Y_i) \Big) = P(Y_i|X), $$
and the marginals are what we need for learning.

But (perversely) this means that optimizing this KL is intractable...
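As a quick numerical illustration of the claim above (a sketch, not part of the original slides; the discrete joint and the competitor factors are arbitrary made-up distributions), the factored Q minimising the reversed KL is the product of the true marginals:

```python
import numpy as np

# Illustrative check: for a 2-variable discrete joint P, the fully factored Q
# minimising KL[P || Q1 Q2] is the product of the true marginals of P.
rng = np.random.default_rng(0)
P = rng.random((3, 4))
P /= P.sum()                                  # arbitrary joint over (Y1, Y2)

P1, P2 = P.sum(axis=1), P.sum(axis=0)         # true marginals

def kl_p_q(Q1, Q2):
    Q = np.outer(Q1, Q2)
    return np.sum(P * (np.log(P) - np.log(Q)))

best = kl_p_q(P1, P2)
for _ in range(1000):                         # random factored competitors
    Q1 = rng.dirichlet(np.ones(3))
    Q2 = rng.dirichlet(np.ones(4))
    assert kl_p_q(Q1, Q2) >= best - 1e-12
print("KL at the product of marginals:", best)
```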

Expectation Propagation

The posterior distribution we need to approximate is often a (normalised) product of factors:
$$ P(Y|X) \propto \prod_j f_j(Y_{C_j}) $$
We wish to approximate this by a product of simpler terms:
$$ Q(Y) := \prod_j \tilde{f}_j(Y_{C_j}) $$

$$ \min_{\{\tilde{f}_j(Y_{C_j})\}_j} \mathrm{KL}\Big[ \prod_j f_j(Y_{C_j}) \,\Big\|\, \prod_j \tilde{f}_j(Y_{C_j}) \Big] \qquad \text{(intractable)} $$
$$ \min_{\tilde{f}_i(Y_{C_i})} \mathrm{KL}\big[ f_i(Y_{C_i}) \,\big\|\, \tilde{f}_i(Y_{C_i}) \big] \qquad \text{(simple, non-iterative, inaccurate)} $$
$$ \min_{\tilde{f}_i(Y_{C_i})} \mathrm{KL}\Big[ f_i(Y_{C_i}) \prod_{j \neq i} \tilde{f}_j(Y_{C_j}) \,\Big\|\, \tilde{f}_i(Y_{C_i}) \prod_{j \neq i} \tilde{f}_j(Y_{C_j}) \Big] \qquad \text{(simple, iterative, accurate)} \;\leftarrow\; \text{EP} $$

Expectation Propagation

Input: factors $\{f_i(Y_{C_i})\}$
Initialize $\tilde{f}_i(Y_{C_i}) = 1$, $Q(Y) = \prod_i \tilde{f}_i(Y_{C_i})$
repeat
  for each factor $i$ do
    Deletion:   $Q_{\neg i}(Y) \leftarrow \dfrac{Q(Y)}{\tilde{f}_i(Y_{C_i})} = \prod_{j \neq i} \tilde{f}_j(Y_{C_j})$
    Projection: $\tilde{f}_i^{\text{new}}(Y_{C_i}) \leftarrow \operatorname{argmin}_{f_i'(Y_{C_i})} \mathrm{KL}\big[ f_i(Y_{C_i})\, Q_{\neg i}(Y) \,\big\|\, f_i'(Y_{C_i})\, Q_{\neg i}(Y) \big]$
    Inclusion:  $Q(Y) \leftarrow \tilde{f}_i^{\text{new}}(Y_{C_i})\, Q_{\neg i}(Y)$
  end for
until convergence

• The KL minimisation (projection) only depends on $Q_{\neg i}(Y)$ marginalised to $Y_{C_i}$.
• If the $\tilde{f}_i$ are in the exponential family, the projection step is moment matching.
• The update order need not be sequential.
• EP minimizes the opposite KL to variational methods.
• Loopy belief propagation and assumed density filtering are special cases.
• There is no convergence guarantee (although convergent forms can be developed).
• The names (deletion, projection, inclusion) are not the same as in Minka (2001).
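A minimal numerical sketch of the deletion/projection/inclusion loop above, assuming Gaussian site approximations for a one-dimensional toy model with mixture ("clutter"-style) likelihood factors. The model, the grid-based moment matching, and all variable names are illustrative choices, not part of the slides; a practical implementation would match moments analytically and guard against non-positive cavity precisions.

```python
import numpy as np

def norm_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Toy model (illustrative): prior N(theta; 0, 100); each observation has a
# two-component likelihood, a "signal" Gaussian around theta plus broad clutter.
rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=8)
w, clutter_var, prior_var = 0.25, 10.0, 100.0

def factor(i, theta):
    return (1 - w) * norm_pdf(x[i], theta, 1.0) + w * norm_pdf(x[i], 0.0, clutter_var)

grid = np.linspace(-40, 40, 8001)        # grid for numerical moment matching
tau_s = np.zeros(len(x))                 # Gaussian site precisions
nu_s = np.zeros(len(x))                  # Gaussian site precision-means

for sweep in range(20):
    for i in range(len(x)):
        # Deletion: cavity = prior * all other sites (Gaussian division in
        # natural parameters; the prior contributes precision 1/prior_var, mean 0).
        tau_c = 1.0 / prior_var + tau_s.sum() - tau_s[i]
        nu_c = nu_s.sum() - nu_s[i]
        cavity = norm_pdf(grid, nu_c / tau_c, 1.0 / tau_c)
        # Projection: match mean/variance of the tilted distribution f_i * cavity.
        tilted = factor(i, grid) * cavity
        tilted /= tilted.sum()           # normalise on the uniform grid
        mean = (grid * tilted).sum()
        var = ((grid - mean) ** 2 * tilted).sum()
        # Inclusion: new site = matched Gaussian divided by the cavity.
        # (In general EP, site precisions can go negative; not guarded here.)
        tau_s[i] = 1.0 / var - tau_c
        nu_s[i] = mean / var - nu_c

tau_q = 1.0 / prior_var + tau_s.sum()
print("EP posterior mean %.3f, variance %.3f" % (nu_s.sum() / tau_q, 1.0 / tau_q))
```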

Recap: Belief Propagation on Undirected Trees

Joint distribution of an undirected tree:
$$ p(X) = \frac{1}{Z} \prod_{\text{nodes } i} f_i(X_i) \prod_{\text{edges } (ij)} f_{ij}(X_i, X_j) $$

Recursively compute messages:
$$ M_{i \to j}(X_j) := \sum_{X_i} f_{ij}(X_i, X_j)\, f_i(X_i) \prod_{k \in \mathrm{ne}(i) \setminus j} M_{k \to i}(X_i) $$

Marginal distributions:
$$ p(X_i) \propto f_i(X_i) \prod_{k \in \mathrm{ne}(i)} M_{k \to i}(X_i) $$
$$ p(X_i, X_j) \propto f_{ij}(X_i, X_j)\, f_i(X_i) f_j(X_j) \prod_{k \in \mathrm{ne}(i) \setminus j} M_{k \to i}(X_i) \prod_{l \in \mathrm{ne}(j) \setminus i} M_{l \to j}(X_j) $$
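As a sanity check of these formulas (a sketch, not from the slides), the following computes the messages and marginals on a small three-node chain with arbitrary random factors and compares them to brute-force enumeration:

```python
import numpy as np
from itertools import product

# Sum-product on the chain 0 -- 1 -- 2 with binary variables.
K = 2
rng = np.random.default_rng(0)
f = [rng.random(K) for _ in range(3)]                 # node factors f_i
f01, f12 = rng.random((K, K)), rng.random((K, K))     # edge factors

# Messages: leaves inward, then outward.
m0_1 = (f01 * f[0][:, None]).sum(0)                   # M_{0->1}(x1)
m2_1 = (f12 * f[2][None, :]).sum(1)                   # M_{2->1}(x1)
m1_0 = (f01 * (f[1] * m2_1)[None, :]).sum(1)          # M_{1->0}(x0)
m1_2 = (f12 * (f[1] * m0_1)[:, None]).sum(0)          # M_{1->2}(x2)

def normalise(p):
    return p / p.sum()

bp_marg = [normalise(f[0] * m1_0),
           normalise(f[1] * m0_1 * m2_1),
           normalise(f[2] * m1_2)]

# Brute force: enumerate all joint states.
joint = np.zeros((K, K, K))
for x0, x1, x2 in product(range(K), repeat=3):
    joint[x0, x1, x2] = f[0][x0] * f[1][x1] * f[2][x2] * f01[x0, x1] * f12[x1, x2]
joint /= joint.sum()
exact = [joint.sum((1, 2)), joint.sum((0, 2)), joint.sum((0, 1))]

for i in range(3):
    assert np.allclose(bp_marg[i], exact[i])
print("BP marginals match exact marginals on the tree.")
```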

Loopy Belief Propagation

Joint distribution of an undirected graph:
$$ p(X) = \frac{1}{Z} \prod_{\text{nodes } i} f_i(X_i) \prod_{\text{edges } (ij)} f_{ij}(X_i, X_j) $$

Recursively compute messages (and hope that the updates converge):
$$ M_{i \to j}(X_j) := \sum_{X_i} f_{ij}(X_i, X_j)\, f_i(X_i) \prod_{k \in \mathrm{ne}(i) \setminus j} M_{k \to i}(X_i) $$

Approximate marginal distributions:
$$ p(X_i) \approx b_i(X_i) \propto f_i(X_i) \prod_{k \in \mathrm{ne}(i)} M_{k \to i}(X_i) $$
$$ p(X_i, X_j) \approx b_{ij}(X_i, X_j) \propto f_{ij}(X_i, X_j)\, f_i(X_i) f_j(X_j) \prod_{k \in \mathrm{ne}(i) \setminus j} M_{k \to i}(X_i) \prod_{l \in \mathrm{ne}(j) \setminus i} M_{l \to j}(X_j) $$

Practical Considerations

• Convergence: Loopy BP is not guaranteed to converge for most graphs.
  – Trees: BP will converge.
  – Single loop: BP will converge for graphs containing at most one loop.
  – Weak interactions: BP will converge for graphs with weak enough interactions.
  – Long loops: BP is more likely to converge for graphs with long (weakly interacting) loops.
  – Gaussian networks: means are correct; variances may converge under some conditions.
• Damping: a popular approach to encourage convergence (a numerical sketch of damped updates on a small loopy graph follows this list):
$$ M_{i \to j}^{\text{new}}(X_j) := (1 - \alpha)\, M_{i \to j}^{\text{old}}(X_j) + \alpha \sum_{X_i} f_{ij}(X_i, X_j)\, f_i(X_i) \prod_{k \in \mathrm{ne}(i) \setminus j} M_{k \to i}(X_i) $$

• Other graphical models: equivalent formulations exist for DAGs and factor graphs.
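The sketch below (illustrative, not from the slides) runs damped loopy BP on a binary four-cycle with random factors and compares the resulting beliefs to the exact marginals; the graph, the factors, and the per-message normalisation are arbitrary choices. With a single loop the updates converge, and the beliefs are close to, but not exactly equal to, the true marginals.

```python
import numpy as np
from itertools import product

K, n = 2, 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]              # a single loop
rng = np.random.default_rng(0)
f = [rng.random(K) for _ in range(n)]                 # node factors
F = {e: rng.random((K, K)) for e in edges}            # edge factors F[(i,j)][xi, xj]

def edge_factor(i, j, xi, xj):
    return F[(i, j)][xi, xj] if (i, j) in F else F[(j, i)][xj, xi]

ne = {i: [j for e in edges for j in e if i in e and j != i] for i in range(n)}
msgs = {(i, j): np.ones(K) for i in range(n) for j in ne[i]}

alpha = 0.5                                           # damping weight on the new message
for sweep in range(200):                              # synchronous, damped updates
    new = {}
    for (i, j) in msgs:
        incoming = np.prod([msgs[(k, i)] for k in ne[i] if k != j], axis=0)
        m = np.array([sum(edge_factor(i, j, xi, xj) * f[i][xi] * incoming[xi]
                          for xi in range(K)) for xj in range(K)])
        m = m / m.sum()                               # normalise for numerical stability
        new[(i, j)] = (1 - alpha) * msgs[(i, j)] + alpha * m
    msgs = new

beliefs = [f[i] * np.prod([msgs[(k, i)] for k in ne[i]], axis=0) for i in range(n)]
beliefs = [b / b.sum() for b in beliefs]

# Exact marginals by enumeration.
joint = np.zeros((K,) * n)
for x in product(range(K), repeat=n):
    p = np.prod([f[i][x[i]] for i in range(n)])
    p *= np.prod([edge_factor(i, j, x[i], x[j]) for (i, j) in edges])
    joint[x] = p
joint /= joint.sum()
for i in range(n):
    exact = joint.sum(tuple(a for a in range(n) if a != i))
    print(i, beliefs[i], exact)                       # close, but not exact in general
```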

Different Perspectives on Loopy Belief Propagation

• Expectation propagation.

• Tree-based Reparametrization.

• Bethe free energy.

Loopy BP as Expectation Propagation


Approximate each factor $f_{ij}$ describing the interaction between $i$ and $j$ as:
$$ f_{ij}(X_i, X_j) \approx \tilde{f}_{ij}(X_i, X_j) = M_{i \to j}(X_j)\, M_{j \to i}(X_i) $$

The full joint distribution is thus approximated by a factorized distribution:
$$ p(X) \approx \frac{1}{Z} \prod_{\text{nodes } i} f_i(X_i) \prod_{\text{edges } (ij)} \tilde{f}_{ij}(X_i, X_j) = \frac{1}{Z} \prod_{\text{nodes } i} f_i(X_i) \prod_{j \in \mathrm{ne}(i)} M_{j \to i}(X_i) = \prod_{\text{nodes } i} b_i(X_i) $$

Loopy BP as Expectation Propagation

Each EP update to $\tilde{f}_{ij}$ is as follows:
• The "corrected" distribution is:
$$ f_{ij}(X_i, X_j)\, q_{\neg ij}(X) = f_{ij}(X_i, X_j)\, f_i(X_i) f_j(X_j) \prod_{k \in \mathrm{ne}(i) \setminus j} M_{k \to i}(X_i) \prod_{l \in \mathrm{ne}(j) \setminus i} M_{l \to j}(X_j) \;\times\; \prod_{s \neq i,j} f_s(X_s) \prod_{t \in \mathrm{ne}(s)} M_{t \to s}(X_s) $$

• The moments are just the marginal distributions on $i$ and $j$.
• Thus the optimal $\tilde{f}_{ij}(X_i, X_j)$ minimizing
$$ \mathrm{KL}\big[ f_{ij}(X_i, X_j)\, q_{\neg ij}(X) \,\big\|\, \tilde{f}_{ij}(X_i, X_j)\, q_{\neg ij}(X) \big] $$

is given by:
$$ f_j(X_j)\, M_{i \to j}(X_j) \prod_{l \in \mathrm{ne}(j) \setminus i} M_{l \to j}(X_j) = \sum_{X_i} f_{ij}(X_i, X_j)\, f_i(X_i) f_j(X_j) \prod_{k \in \mathrm{ne}(i) \setminus j} M_{k \to i}(X_i) \prod_{l \in \mathrm{ne}(j) \setminus i} M_{l \to j}(X_j) $$
$$ M_{i \to j}(X_j) = \sum_{X_i} f_{ij}(X_i, X_j)\, f_i(X_i) \prod_{k \in \mathrm{ne}(i) \setminus j} M_{k \to i}(X_i) $$

Similarly for $M_{j \to i}(X_i)$.

Loopy BP as Tree-based Reparametrization

There are many ways of parametrizing tree-structured distributions:
$$ p(X) = \frac{1}{Z} \prod_{\text{nodes } i} f_i(X_i) \prod_{\text{edges } (ij)} f_{ij}(X_i, X_j) \qquad \text{(undirected tree)} \quad (1) $$
$$ p(X) = p(X_r) \prod_{i \neq r} p(X_i \mid X_{\mathrm{pa}(i)}) \qquad \text{(directed, rooted tree)} \quad (2) $$
$$ p(X) = \prod_{\text{nodes } i} p(X_i) \prod_{\text{edges } (ij)} \frac{p(X_i, X_j)}{p(X_i)\, p(X_j)} \qquad \text{(locally consistent marginals)} \quad (3) $$

The undirected tree representation is redundant: multiplying a factor $f_{ij}(X_i, X_j)$ by $g(X_i)$ and dividing $f_i(X_i)$ by the same $g(X_i)$ does not change the distribution.

BP on a tree can be understood as reparametrizing (1) in terms of locally consistent factors. This results in (3), from which the local marginals can be read off.
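A small numerical check of the equivalence of (1) and (3) on a tree (a sketch, not from the slides; the three-node chain and its factors are arbitrary):

```python
import numpy as np
from itertools import product

# On the chain 0 -- 1 -- 2, the joint p(X) equals the locally consistent form
# prod_i p(X_i) * prod_(ij) p(X_i, X_j) / (p(X_i) p(X_j)).
K = 2
rng = np.random.default_rng(3)
f = [rng.random(K) for _ in range(3)]
f01, f12 = rng.random((K, K)), rng.random((K, K))

joint = np.zeros((K, K, K))
for x in product(range(K), repeat=3):
    joint[x] = f[0][x[0]] * f[1][x[1]] * f[2][x[2]] * f01[x[0], x[1]] * f12[x[1], x[2]]
joint /= joint.sum()

b = [joint.sum((1, 2)), joint.sum((0, 2)), joint.sum((0, 1))]   # node marginals
b01, b12 = joint.sum(2), joint.sum(0)                           # edge marginals

for x in product(range(K), repeat=3):
    reparam = (b[0][x[0]] * b[1][x[1]] * b[2][x[2]]
               * b01[x[0], x[1]] / (b[0][x[0]] * b[1][x[1]])
               * b12[x[1], x[2]] / (b[1][x[1]] * b[2][x[2]]))
    assert np.isclose(reparam, joint[x])
print("Tree joint equals the product of locally consistent marginal ratios.")
```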

Loopy BP as Tree-based Reparametrization

[Figure: graph, spanning tree 1, spanning tree 2]

$$ p(X) = \frac{1}{Z} \prod_{\text{nodes } i} f_i^0(X_i) \prod_{\text{edges } (ij)} f_{ij}^0(X_i, X_j) $$
$$ = \frac{1}{Z} \prod_{\text{nodes } i \in T_1} f_i^0(X_i) \prod_{\text{edges } (ij) \in T_1} f_{ij}^0(X_i, X_j) \prod_{\text{edges } (ij) \notin T_1} f_{ij}^0(X_i, X_j) $$
$$ = \frac{1}{Z} \prod_{\text{nodes } i \in T_1} f_i^1(X_i) \prod_{\text{edges } (ij) \in T_1} f_{ij}^1(X_i, X_j) \prod_{\text{edges } (ij) \notin T_1} f_{ij}^1(X_i, X_j) $$

where $f_i^1(X_i) = p^{T_1}(X_i)$, $f_{ij}^1(X_i, X_j) = \dfrac{p^{T_1}(X_i, X_j)}{p^{T_1}(X_i)\, p^{T_1}(X_j)}$ for edges in $T_1$, and $f_{ij}^1 = f_{ij}^0$ otherwise.

$$ = \frac{1}{Z} \prod_{\text{nodes } i \in T_2} f_i^1(X_i) \prod_{\text{edges } (ij) \in T_2} f_{ij}^1(X_i, X_j) \prod_{\text{edges } (ij) \notin T_2} f_{ij}^1(X_i, X_j) $$
$$ \cdots $$

Loopy BP as Tree-based Reparametrization

At convergence, loopy BP has reparametrized the joint distribution as:
$$ p(X) = \frac{1}{Z} \prod_{\text{nodes } i} f_i^\infty(X_i) \prod_{\text{edges } (ij)} f_{ij}^\infty(X_i, X_j) $$
where for any tree $T$ embedded in the graph,
$$ f_i^\infty(X_i) = p^T(X_i), \qquad f_{ij}^\infty(X_i, X_j) = \frac{p^T(X_i, X_j)}{p^T(X_i)\, p^T(X_j)}. $$
In particular, the local marginals of all embedded trees are locally consistent with each other:
$$ p(X) = \frac{1}{Z} \prod_{\text{nodes } i} b_i(X_i) \prod_{\text{edges } (ij)} \frac{b_{ij}(X_i, X_j)}{b_i(X_i)\, b_j(X_j)} $$

Loopy BP as Optimizing Bethe Free Energy

$$ p(X) = \frac{1}{Z} \prod_i f_i(X_i) \prod_{(ij)} f_{ij}(X_i, X_j) $$

Loopy BP can be derived as fixed point equations for finding stationary points of an objective function called the Bethe free energy.

The Bethe free energy is not optimized with respect to a full distribution over X, but over locally consistent pseudomarginals or beliefs $b_i \geq 0$ and $b_{ij} \geq 0$:
$$ \sum_{X_i} b_i(X_i) = 1 \quad \forall i $$
$$ \sum_{X_j} b_{ij}(X_i, X_j) = b_i(X_i) \quad \forall i,\ j \in \mathrm{ne}(i) $$

Loopy BP as Optimizing Bethe Free Energy

$$ F_{\text{Bethe}}(b) = E_{\text{Bethe}}(b) + H_{\text{Bethe}}(b) $$

The Bethe average energy is "exact":
$$ E_{\text{Bethe}}(b) = \sum_i \sum_{X_i} b_i(X_i) \log f_i(X_i) + \sum_{(ij)} \sum_{X_i, X_j} b_{ij}(X_i, X_j) \log f_{ij}(X_i, X_j) $$

While the Bethe entropy is approximate:

$$ H_{\text{Bethe}}(b) = -\sum_i \sum_{X_i} b_i(X_i) \log b_i(X_i) - \sum_{(ij)} \sum_{X_i, X_j} b_{ij}(X_i, X_j) \log \frac{b_{ij}(X_i, X_j)}{b_i(X_i)\, b_j(X_j)} $$

The factors in the denominator account for the overcounting of entropy on edges, so that the Bethe entropy is exact on trees.

The message updates of loopy BP can now be derived by finding the stationary points of the Lagrangian (with Lagrange multipliers included to enforce local consistency). The messages are related to the Lagrange multipliers.
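As a numerical illustration (a sketch, not from the slides): on a tree, plugging the exact marginals in as the beliefs makes the Bethe free energy, with the sign convention used here, equal to the log partition function log Z. The three-node chain and its factors are arbitrary choices.

```python
import numpy as np
from itertools import product

# Check F_Bethe = E_Bethe + H_Bethe = log Z on the chain 0 -- 1 -- 2.
K = 2
rng = np.random.default_rng(0)
f = [rng.random(K) for _ in range(3)]
f01, f12 = rng.random((K, K)), rng.random((K, K))

# Exact joint, log partition function, and marginals by enumeration.
joint = np.zeros((K, K, K))
for x in product(range(K), repeat=3):
    joint[x] = f[0][x[0]] * f[1][x[1]] * f[2][x[2]] * f01[x[0], x[1]] * f12[x[1], x[2]]
logZ = np.log(joint.sum())
joint /= joint.sum()
b = [joint.sum((1, 2)), joint.sum((0, 2)), joint.sum((0, 1))]   # node beliefs
b01, b12 = joint.sum(2), joint.sum(0)                           # edge beliefs

E = sum((b[i] * np.log(f[i])).sum() for i in range(3))
E += (b01 * np.log(f01)).sum() + (b12 * np.log(f12)).sum()
H = -sum((b[i] * np.log(b[i])).sum() for i in range(3))
H -= (b01 * np.log(b01 / np.outer(b[0], b[1]))).sum()
H -= (b12 * np.log(b12 / np.outer(b[1], b[2]))).sum()

print("F_Bethe =", E + H, " log Z =", logZ)                     # equal on a tree
assert np.isclose(E + H, logZ)
```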

Loopy BP as Optimizing Bethe Free Energy

• Fixed points of loopy BP are exactly the stationary points of the Bethe free energy.

• Stable fixed points of loopy BP are local maxima of the Bethe free energy (note that we use an inverted notion of free energy to be consistent with the variational free energy).

• For binary attractive networks, the Bethe free energy at the fixed points of loopy BP forms a lower bound on the log partition function log Z.

Loopy BP vs Variational Approximation

• Beliefs bi and bij in loopy BP are only locally consistent pseudomarginals and do not necessarily form a full joint distribution.

• Bethe free energy accounts for interactions between different sites, while variational free energy assumes independence.

• In the loop series or Plefka expansion of the log partition function Z, the variational free energy forms the first-order terms, while the Bethe free energy contains higher-order terms (involving generalized loops).

• Loopy BP tends to be significantly more accurate whenever it converges.

Extensions and Variations

• Generalized BP: group variables together to treat their interactions exactly.

• Convergent alternatives: fixed points of loopy BP are stationary points of the Bethe free energy, so we can derive algorithms that directly minimize the Bethe free energy and are thus guaranteed to converge.

• Convex alternatives: we can derive convex cousins of the Bethe free energy. These give rise to algorithms that will converge to the unique global minimum.

• The treatment of loopy Viterbi or max-product algorithms is different.

A Convex Perspective

An exponential family distribution is parametrized by its natural parameter vector θ, or equivalently by its mean parameter vector µ.

$$ P(X|\theta) = \exp\big( \theta^\top T(X) - \Phi(\theta) \big) $$
where $\Phi(\theta)$ is the log partition function:
$$ \Phi(\theta) = \log Z = \log \sum_x \exp\big( \theta^\top T(x) \big) $$
$\Phi(\theta)$ plays an important role in the characterization of the exponential family. It is the cumulant generating function of the distribution:

$$ \nabla \Phi(\theta) = \mathbb{E}_\theta[T(X)] = \mu(\theta) = \mu $$
$$ \nabla^2 \Phi(\theta) = \mathbb{V}_\theta[T(X)] $$

The second derivative is positive semi-definite, so $\Phi(\theta)$ is convex in $\theta$.

A Convex Perspective

The log partition function and the negative entropy are intimately related. We express the negative entropy as a function of the mean parameter:

$$ \Psi(\mu) = \mathbb{E}_\theta[\log P(X|\theta)] = \theta^\top \mu - \Phi(\theta) $$
$$ \theta^\top \mu = \Phi(\theta) + \Psi(\mu) $$

The KL divergence between two exponential family distributions $P(X|\theta)$ and $P(X|\theta')$ is:

$$ \mathrm{KL}\big( P(X|\theta) \,\big\|\, P(X|\theta') \big) = \mathrm{KL}(\theta \,\|\, \theta') = \mathbb{E}_\theta[\log P(X|\theta) - \log P(X|\theta')] = \Psi(\mu) - (\theta')^\top \mu + \Phi(\theta') \geq 0 $$
$$ \Psi(\mu) \geq (\theta')^\top \mu - \Phi(\theta') $$

This holds for any pair of mean and natural parameter vectors. Because the minimum of the KL divergence is zero, attained at θ = θ′, we have:

$$ \Psi(\mu) = \sup_{\theta'} \big\{ (\theta')^\top \mu - \Phi(\theta') \big\} $$
The construction on the RHS is called the convex dual of $\Phi(\theta)$. For continuous convex functions, the dual of the dual is the original function, thus:

$$ \Phi(\theta) = \sup_{\mu'} \big\{ \theta^\top \mu' - \Psi(\mu') \big\} $$
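A small numerical sketch of these identities (not from the slides) for a toy discrete exponential family, two binary variables with sufficient statistics T(x) = (x1, x2, x1·x2); the particular θ is arbitrary, and the gradient identity is checked by finite differences:

```python
import numpy as np
from itertools import product

def T(x):
    return np.array([x[0], x[1], x[0] * x[1]], dtype=float)

states = list(product([0, 1], repeat=2))
theta = np.array([0.3, -0.7, 1.2])

def Phi(th):                                   # log partition function
    return np.log(sum(np.exp(th @ T(x)) for x in states))

p = np.array([np.exp(theta @ T(x) - Phi(theta)) for x in states])
mu = sum(pi * T(x) for pi, x in zip(p, states))        # mean parameters
Psi = sum(pi * np.log(pi) for pi in p)                 # negative entropy

# theta^T mu = Phi(theta) + Psi(mu)
assert np.isclose(theta @ mu, Phi(theta) + Psi)

# grad Phi(theta) = mu, checked by central finite differences
eps = 1e-6
grad = np.array([(Phi(theta + eps * e) - Phi(theta - eps * e)) / (2 * eps)
                 for e in np.eye(3)])
assert np.allclose(grad, mu, atol=1e-6)
print("Conjugacy and gradient identities verified; mu =", mu)
```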

The Marginal Polytope

$$ \Phi(\theta) = \sup_{\mu'} \big\{ \theta^\top \mu' - \Psi(\mu') \big\} $$

The supremum is only over mean parameters that can in fact be expressed as means of the sufficient statistics function:

$$ \mathcal{M} = \{ \mu' : \mu' = \mathbb{E}_{P'(X)}[T(X)] \text{ for some distribution } P'(X) \} $$

$\mathcal{M}$ is a convex set. If X is discrete, it is a polytope, called the marginal polytope.

• One view of inference is the computation of the log partition function via the maximization over $\mu'$, together with the maximizing mean parameters.
• There are two difficulties with this computation: optimizing over $\mathcal{M}$ is intractable, and computing the negative entropy $\Psi$ is intractable for many models of interest.
• Many propagation algorithms can be viewed as explicit approximations to $\mathcal{M}$ and $\Psi$.
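A tiny illustration (a sketch, not from the slides) of $\mathcal{M}$ for two binary variables with $T(x) = (x_1, x_2, x_1 x_2)$: every realisable mean parameter is a convex combination of the sufficient-statistic vectors of the four configurations, whereas a vector violating $\mathbb{E}[X_1 X_2] \leq \min(\mathbb{E}[X_1], \mathbb{E}[X_2])$ lies outside the marginal polytope.

```python
import numpy as np
from itertools import product

states = list(product([0, 1], repeat=2))
T = np.array([[x[0], x[1], x[0] * x[1]] for x in states], dtype=float)

rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.dirichlet(np.ones(len(states)))    # a random distribution P'(X)
    mu = p @ T                                  # mu' = E_{P'}[T(X)]
    # Any realisable mean parameter satisfies E[X1 X2] <= min(E[X1], E[X2]).
    assert mu[2] <= min(mu[0], mu[1]) + 1e-12

# A vector that looks locally plausible but is NOT in the marginal polytope:
mu_bad = np.array([0.5, 0.5, 0.9])             # violates E[X1 X2] <= E[X1]
print("outside the polytope:", mu_bad[2] > min(mu_bad[0], mu_bad[1]))
```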

End Notes

T. P. Minka (2001). A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, MIT.

R. J. McEliece, D. J. C. MacKay and J. F. Cheng (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communications, 16(2):140-152.

F. Kschischang and B. Frey (1998). Iterative decoding of compound codes by probability propagation in graphical models. IEEE Journal on Selected Areas in Communications, 16(2):219-230.

J. S. Yedidia, W. T. Freeman and Y. Weiss (2005). Constructing free energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51:2282-2313.

A unifying viewpoint:

M. J. Wainwright and M. I. Jordan (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1:1-305.