13 : Variational Inference: Loopy Belief Propagation and Mean Field
10-708: Probabilistic Graphical Models, Spring 2012
13 : Variational Inference: Loopy Belief Propagation and Mean Field
Lecturer: Eric P. Xing    Scribes: Peter Schulam and William Wang

1 Introduction

Inference problems involve answering a query about the distribution, such as the likelihood of observed data. For example, to answer a query for a marginal $p(x_A)$, we can perform the marginalization $p(x_A) = \sum_{x_C} p(x)$, where $x_C$ denotes the variables outside the query set $A$. For queries concerning conditionals, such as $p(x_A \mid x_B)$, we can first compute the joint and then divide by the marginal $p(x_B)$. Sometimes, to answer a query, we might also need to compute the mode of the density, $\hat{x} = \arg\max_{x \in \mathcal{X}^m} p(x)$.

So far in the class we have covered exact inference. Brute-force enumeration is too inefficient for large graphs with complex structure, so a family of message-passing algorithms, such as forward-backward, sum-product, max-product, and the junction tree algorithm, was introduced. Although these message-passing exact inference algorithms work well for tree-structured graphical models, it was also shown in class that they might not yield consistent results on loopy graphs, and their convergence there is not guaranteed. Moreover, for complex graphical models such as the Ising model, we cannot run an exact inference algorithm like the junction tree algorithm, because it is computationally intractable. In this lecture, we look at two variational inference algorithms: loopy belief propagation (yww) and mean field approximation (pschulam).

2 Loopy Belief Propagation

The general idea of loopy belief propagation is that, even though the graph contains loops and the messages might circulate indefinitely, we let the algorithm run anyway and hope for the best. In this section, we first review the basic belief propagation algorithm. Then, we discuss an experimental study by Murphy et al. (1999) and show some empirical results on the behavior of loopy belief propagation. Most importantly, we start from the notion of KL divergence and show how the LBP algorithm can be explained from the perspective of minimizing the Bethe free energy.

2.1 Belief Propagation: a Quick Review

The basic idea of belief propagation is very simple: to update the belief at a node, we combine the doubleton potentials and the messages from its neighboring nodes, and multiply with the target node's singleton potential. As a concrete example, consider Figure 1. In part (a) of the figure, to compute the message $M_{i \to j}(x_j)$ we first need the messages from all the neighboring nodes $x_k$ to $x_i$, and then multiply with the singleton and doubleton potentials involving $x_i$ and $(x_i, x_j)$:

    $M_{i \to j}(x_j) \propto \sum_{x_i} \Phi_{ij}(x_i, x_j)\, \Phi_i(x_i) \prod_{k \in N(i) \setminus j} M_{k \to i}(x_i)$    (1)

[Figure 1: Belief propagation: an example.]

Here the doubleton potential $\Phi_{ij}(x_i, x_j)$ is also called the compatibility, and models the interaction between the two nodes, whereas the singleton potential $\Phi_i(x_i)$ is also called the external evidence.
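To make Eq. (1) concrete, below is a minimal sketch of the message update for a discrete pairwise MRF. The representation (numpy arrays for potentials, a dictionary of messages keyed by directed edge) and the names `phi_node`, `phi_edge`, `messages`, `neighbors` are illustrative assumptions, not part of the lecture notes.

```python
import numpy as np

def update_message(i, j, phi_node, phi_edge, messages, neighbors):
    """M_{i->j}(x_j) ~ sum_{x_i} Phi_ij(x_i, x_j) Phi_i(x_i) prod_{k != j} M_{k->i}(x_i)."""
    # Product of incoming messages from all neighbors of i except the target j.
    incoming = np.ones_like(phi_node[i])
    for k in neighbors[i]:
        if k != j:
            incoming = incoming * messages[(k, i)]
    # Sum over x_i: phi_edge[(i, j)] is a table indexed [x_i, x_j].
    m = phi_edge[(i, j)].T @ (phi_node[i] * incoming)
    return m / m.sum()  # normalize; Eq. (1) only fixes the message up to a constant
```

Sweeping this update over the directed edges of a tree (leaves inward, then back out) yields the exact marginals; applying the same update on a loopy graph is exactly the iteration discussed in Section 2.2.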
In part (b) of the figure, we can similarly update the belief of $x_i$:

    $b_i(x_i) \propto \Phi_i(x_i) \prod_{k \in N(i)} M_{k \to i}(x_i)$    (2)

Similarly, for factor graphs we also have the notion of "messages", and we update the belief of node $x_i$ by multiplying its local factor with the messages coming from its neighboring factor nodes:

    $b_i(x_i) \propto f_i(x_i) \prod_{a \in N(i)} m_{a \to i}(x_i)$    (3)

To compute the message from a factor node $a$ to a variable node $x_i$, we sum the product of the factor and the incoming variable-to-factor messages over all the other variables in the factor:

    $m_{a \to i}(x_i) \propto \sum_{X_a \setminus x_i} f_a(X_a) \prod_{j \in N(a) \setminus i} m_{j \to a}(x_j)$    (4)

From the class, we know that running BP on a tree always converges to the exact solution. This is not always the case for loopy graphs: when a message is sent into a loop, it can circulate indefinitely, so convergence is not guaranteed, and the algorithm may converge to the wrong solution.

2.2 Loopy Belief Propagation Algorithm

The loopy belief propagation algorithm uses a fixed-point iteration procedure to minimize the Bethe free energy. As long as the convergence criterion is not met, we keep updating the messages and the beliefs:

    $b_i(x_i) \propto \prod_{a \in N(i)} m_{a \to i}(x_i)$    (5)

    $b_a(X_a) \propto f_a(X_a) \prod_{i \in N(a)} m_{i \to a}(x_i)$    (6)

    $m^{new}_{i \to a}(x_i) = \prod_{c \in N(i) \setminus a} m_{c \to i}(x_i)$    (7)

    $m^{new}_{a \to i}(x_i) = \sum_{X_a \setminus x_i} f_a(X_a) \prod_{j \in N(a) \setminus i} m_{j \to a}(x_j)$    (8)

Therefore, the stationary properties are guaranteed when the algorithm converges. The big problem, however, is that convergence itself is not guaranteed, and the reason is intuitive: when BP runs on a graph that contains loops, the messages might circulate in the loops forever. Interestingly, Murphy et al. (UAI 1999) studied the empirical behavior of the loopy belief propagation algorithm and found that LBP can still achieve good approximations:

• The program is stopped after a fixed number of iterations.
• Stop when there is no significant difference in the belief updates.
• When the solution converges, it is usually a good approximation.

This is probably the reason why LBP is still a very popular inference algorithm, even though its convergence is not guaranteed. It was also mentioned in class that, in order to test the empirical performance of an approximate inference algorithm on large intractable problems, one can always start simple by testing on a small instance of the problem (e.g., a 20 x 20 Ising model).

2.3 Understanding LBP: an F_Bethe Minimization Perspective

To understand the LBP algorithm, let's first define the true distribution $P$ as

    $P(X) = \frac{1}{Z} \prod_{f_a \in F} f_a(X_a)$    (9)

where $Z$ is the partition function and the product runs over the factors. Since this distribution is often intractable, we approximate $P$ with a distribution $Q$. To do this, we can use the KL divergence:

    $KL(Q \| P) = \sum_X Q(X) \log \frac{Q(X)}{P(X)}$    (10)

Note that the KL divergence is asymmetric. Its value is non-negative, and it attains its minimum, zero, when $P = Q$. The KL divergence is useful in our problem because minimizing it over $Q$ yields a tractable approximation to $P$.
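As a quick numerical illustration of Eq. (10) and these properties, the sketch below evaluates the KL divergence between small, explicitly given discrete distributions; the arrays are made-up examples.

```python
import numpy as np

def kl_divergence(q, p):
    """KL(Q || P) = sum_x Q(x) log(Q(x) / P(x)); terms with Q(x) = 0 contribute zero."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

print(kl_divergence(q, p))  # non-negative
print(kl_divergence(p, q))  # a different value: KL is asymmetric
print(kl_divergence(p, p))  # 0.0, the minimum, attained when Q = P
```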
To do this, we can expand $KL(Q \| P)$ so that it can be evaluated without performing inference in $P$:

    $KL(Q \| P) = \sum_X Q(X) \log \frac{Q(X)}{P(X)}$    (11)

    $= \sum_X Q(X) \log Q(X) - \sum_X Q(X) \log P(X)$    (12)

    $= -H_Q(X) - E_Q[\log P(X)]$    (13)

If we replace $P(X)$ with our earlier definition of the true distribution, we get

    $KL(Q \| P) = -H_Q(X) - E_Q\Big[\log \frac{1}{Z} \prod_{f_a \in F} f_a(X_a)\Big]$    (14)

    $= -H_Q(X) - \log \frac{1}{Z} - \sum_{f_a \in F} E_Q[\log f_a(X_a)]$    (15)

Rearranging the terms on the right-hand side gives

    $KL(Q \| P) = -H_Q(X) - \sum_{f_a \in F} E_Q[\log f_a(X_a)] + \log Z$    (16)

Physicists call the first two terms on the right-hand side the "(Gibbs) free energy" $F(P, Q)$. Our goal therefore boils down to computing $F(P, Q)$. The term $\sum_{f_a \in F} E_Q[\log f_a(X_a)]$ can be computed from the factor marginals of $Q$, whereas computing the entropy $H_Q(X)$ is a much harder task that requires summing over all possible configurations, which is very expensive. However, we can always approximate $F(P, Q)$ with a tractable surrogate $\hat{F}(P, Q)$.

Before we show how to approximate the Gibbs free energy, let's first consider the case of a tree-structured graphical model, as in Fig. 2.

[Figure 2: Calculating the tree free energy: an example.]

Here we know the probability can be written as $b(x) = \prod_a b_a(x_a) \prod_i b_i(x_i)^{1 - d_i}$, where $d_i$ is the degree of node $i$, and $H_{tree}$ and $F_{tree}$ can be written as

    $H_{tree} = -\sum_a \sum_{x_a} b_a(x_a) \log b_a(x_a) + \sum_i (d_i - 1) \sum_{x_i} b_i(x_i) \log b_i(x_i)$    (17)

    $F_{tree} = \sum_a \sum_{x_a} b_a(x_a) \log \frac{b_a(x_a)}{f_a(x_a)} + \sum_i (1 - d_i) \sum_{x_i} b_i(x_i) \log b_i(x_i)$    (18)

    $= F_{12} + F_{23} + \cdots + F_{67} + F_{78} - F_1 - F_5 - F_2 - F_6 - F_3 - F_7$    (19)

From the above derivation we see that we only need to sum over the singletons and doubletons, which is easy to compute. We can use the same idea to approximate the Gibbs free energy on a general graph, such as the one in Fig. 3:

[Figure 3: Calculating the Bethe energy on a loopy graph: an example.]

    $H_{Bethe} = -\sum_a \sum_{x_a} b_a(x_a) \log b_a(x_a) + \sum_i (d_i - 1) \sum_{x_i} b_i(x_i) \log b_i(x_i)$    (20)

    $F_{Bethe} = \sum_a \sum_{x_a} b_a(x_a) \log \frac{b_a(x_a)}{f_a(x_a)} + \sum_i (1 - d_i) \sum_{x_i} b_i(x_i) \log b_i(x_i) = -\sum_a \langle \log f_a(x_a) \rangle - H_{Bethe}$    (21)

    $= F_{12} + F_{23} + \cdots + F_{67} + F_{78} - F_1 - F_5 - 2F_2 - 2F_6 - F_8$    (22)

This is called the Bethe approximation of the Gibbs free energy. The idea is simple: we only need to sum over the singletons and the doubletons to obtain the entropy. Note, however, that on a loopy graph this approximation is not guaranteed to be close to the true Gibbs free energy.
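To show how Eqs. (20)-(21) would be evaluated in practice, here is a minimal sketch that computes $F_{Bethe}$ from a given set of beliefs. The dictionary-based representation and the names `factor_beliefs`, `factor_tables`, `node_beliefs`, `degrees` are assumptions for illustration; beliefs are assumed strictly positive so that the logarithms are defined.

```python
import numpy as np

def bethe_free_energy(factor_beliefs, factor_tables, node_beliefs, degrees):
    """F_Bethe = sum_a sum_{x_a} b_a log(b_a / f_a) + sum_i (1 - d_i) sum_{x_i} b_i log b_i."""
    # Factor (doubleton) term: each b_a and f_a is an array over the factor's configurations.
    factor_term = sum(np.sum(b_a * np.log(b_a / factor_tables[a]))
                      for a, b_a in factor_beliefs.items())
    # Node (singleton) term, weighted by one minus the node degree d_i.
    node_term = sum((1 - degrees[i]) * np.sum(b_i * np.log(b_i))
                    for i, b_i in node_beliefs.items())
    return factor_term + node_term
```

At a fixed point of the updates (5)-(8), the resulting beliefs are stationary points of this quantity, which is the sense in which LBP can be viewed as minimizing the Bethe free energy.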