
Divergence measures and message passing

Thomas Minka Microsoft Research Ltd., Cambridge, UK MSR-TR-2005-173, December 7, 2005

Abstract

This paper presents a unifying view of message-passing algorithms, as methods to approximate a complex Bayesian network by a simpler network with minimum information divergence. In this view, the difference between mean-field methods and belief propagation is not the amount of structure they model, but only the measure of loss they minimize (‘exclusive’ versus ‘inclusive’ Kullback-Leibler divergence). In each case, message passing arises by minimizing a localized version of the divergence, local to each factor. By examining these divergence measures, we can intuit the types of solution they prefer (symmetry-breaking, for example) and their suitability for different tasks. Furthermore, by considering a wider variety of divergence measures (such as alpha-divergences), we can achieve different complexity and performance goals.

1 Introduction

Bayesian inference provides a mathematical framework for many artificial intelligence tasks, such as visual tracking, estimating range and position from noisy sensors, classifying objects on the basis of observed features, and learning. In principle, we simply draw up a belief network, instantiate the things we know, and integrate over the things we don't know, to compute whatever expectation or probability we seek. Unfortunately, even with simplified models of reality and clever algorithms for exploiting independences, exact Bayesian computations can be prohibitively expensive. For Bayesian methods to enjoy widespread use, there needs to be an array of approximation methods which can produce decent results in a user-specified amount of time.

Fortunately, many belief networks benefit from an averaging effect. A network with many interacting elements can behave, on the whole, like a simpler network. This insight has led to a class of approximation methods called variational methods (Jordan et al., 1999), which approximate a complex network p by a simpler network q, optimizing the parameters of q to minimize information loss. (Jordan et al. (1999) used convex duality and mean-field as the inspiration for their methods, but other approaches are also possible.) The payoff of variational approximation is that the simpler network q can then act as a surrogate for p in a larger inference process. This makes variational methods well-suited to large networks, especially ones that evolve through time. A large network can be divided into pieces, each of which is approximated variationally, yielding an overall variational approximation to the whole network. This decomposition strategy leads us directly to message-passing algorithms.

Message passing is a distributed method for fitting variational approximations, which is particularly well-suited to large networks. Originally, variational methods used coordinate-descent schemes (Jordan et al., 1999; Wiegerinck, 2000), which do not scale to large heterogeneous networks. Since then, a variety of scalable message-passing algorithms have been developed, each minimizing a different cost function with different message equations. These include:

• Variational message-passing (Winn & Bishop, 2005), a message-passing version of the mean-field method (Peterson & Anderson, 1987)

• Loopy belief propagation (Frey & MacKay, 1997)

• Expectation propagation (Minka, 2001b)

• Tree-reweighted message-passing (Wainwright et al., 2005b)

• Fractional belief propagation (Wiegerinck & Heskes, 2002)

• Power EP (Minka, 2004)

One way to understand these algorithms is to view their cost functions as free-energy functions from statistical

physics (Yedidia et al., 2004; Heskes, 2003). From this viewpoint, each algorithm arises as a different way to approximate the entropy of a distribution. This viewpoint can be very insightful; for example, it led to the development of generalized belief propagation (Yedidia et al., 2004).

The purpose of this paper is to provide a complementary viewpoint on these algorithms, which offers a new set of insights and opportunities. All six of the above algorithms can be viewed as instances of a recipe for minimizing information divergence. What makes the algorithms different is the measure of divergence that they minimize. Information divergences have been studied for decades in statistics, and many facts are now known about them. Using the theory of divergences, we can more easily choose the appropriate algorithm for our application. Using the recipe, we can construct new algorithms as desired. This unified view also allows us to generalize theorems proven for one algorithm to apply to the others.

The recipe to make a message-passing algorithm has four steps:

1. Pick an approximating family for q to be chosen from. For example, the set of fully-factorized distributions, the set of Gaussians, the set of k-component mixtures, etc.

2. Pick a divergence measure to minimize. For example, mean-field methods minimize the Kullback-Leibler divergence KL(q || p), expectation propagation minimizes KL(p || q), and power EP minimizes the α-divergence D_α(p || q).

3. Construct an optimization algorithm for the chosen divergence measure and approximating family. Usually this is a fixed-point iteration obtained by setting the gradients to zero.

4. Distribute the optimization across the network, by dividing the network p into factors, and minimizing local divergence at each factor.

All six algorithms above can be obtained from this recipe, via the choice of divergence measure and approximating family.

The paper is organized as follows:

1 Introduction
2 Divergence measures
3 Minimizing α-divergence
  3.1 A fixed-point scheme
  3.2 Exponential families
  3.3 Fully-factorized approximations
  3.4 Equality example
4 Message-passing
  4.1 Fully-factorized case
  4.2 Local vs. global divergence
  4.3 Mismatched divergences
  4.4 Estimating Z
  4.5 The free-energy function
5 Mean-field
6 Belief Propagation and EP
7 Fractional BP and Power EP
8 Tree-reweighted message passing
9 Choosing a divergence measure
10 Future work
A Ali-Silvey divergences
B Proof of Theorem 1
C Hölder inequalities
D Alternate upper bound proof
E Alpha-divergence and importance sampling

2 Divergence measures

This section describes various information divergence measures and illustrates how they behave. The behavior of divergence measures corresponds directly to the behavior of message-passing algorithms.

Let our task be to approximate a complex univariate or multivariate probability distribution p(x). Our approximation, q(x), is required to come from a simple predefined family F, such as Gaussians. We want q to minimize a divergence measure D(p || q), such as KL divergence. We will let p be unnormalized, i.e. ∫_x p(x)dx ≠ 1, because ∫_x p(x)dx is usually one of the things we would like to estimate. For example, if p(x) is a Markov random field (p(x) = ∏_ij f_ij(x_i, x_j)), then ∫_x p(x)dx is the partition function. If x is a parameter in Bayesian learning and p(x) is the likelihood times prior (p(x) ≡ p(x, D) = p(D|x)p_0(x), where the data D is fixed), then ∫_x p(x)dx is the evidence for the model. Consequently, q will also be unnormalized, so that the integral of q provides an estimate of the integral of p.

There are two basic divergence measures used in this paper. The first is the Kullback-Leibler (KL) divergence:

  KL(p || q) = ∫_x p(x) log(p(x)/q(x)) dx + ∫_x (q(x) − p(x)) dx    (1)

This formula includes a correction factor, so that it applies to unnormalized distributions (Zhu & Rohwer, 1995). Note this divergence is asymmetric with respect to p and q.

The second divergence measure is a generalization of KL-divergence, called the α-divergence (Amari, 1985; Trottini & Spezzaferri, 1999; Zhu & Rohwer, 1995). It is actually a family of divergences, indexed by α ∈ (−∞, ∞). Different authors use the α parameter in different ways.

[Figure 1: five panels showing p and the fitted Gaussian q, for α = −∞, α = 0, α = 0.5, α = 1, α = ∞]

Figure 1: The Gaussian q which minimizes α-divergence to p (a mixture of two Gaussians), for varying α. α → −∞ prefers matching one mode, while α → ∞ prefers covering the entire distribution.


Figure 2: The mass, mean, and standard deviation of the Gaussian q which minimizes α-divergence to p, for varying α. In each case, the true value is matched at α = 1.

Using the convention of Zhu & Rohwer (1995), with α in place of δ, the α-divergence is:

  D_α(p || q) = (1/(α(1 − α))) ∫_x [α p(x) + (1 − α) q(x) − p(x)^α q(x)^{1−α}] dx    (2)

As in (1), p and q do not need to be normalized. Both KL-divergence and α-divergence are zero if p = q and positive otherwise, so they satisfy the basic property of an error measure. This property follows from the fact that α-divergences are convex with respect to p and q (appendix A). Some special cases:

  D_{−1}(p || q) = (1/2) ∫_x (q(x) − p(x))² / p(x) dx    (3)

  lim_{α→0} D_α(p || q) = KL(q || p)    (4)

  D_{1/2}(p || q) = 2 ∫_x (√p(x) − √q(x))² dx    (5)

  lim_{α→1} D_α(p || q) = KL(p || q)    (6)

  D_2(p || q) = (1/2) ∫_x (p(x) − q(x))² / q(x) dx    (7)

The case α = 0.5 is known as the Hellinger distance (whose square root is a valid distance metric), and α = 2 is the χ² distance. Changing α to 1 − α swaps the position of p and q.

To illustrate the effect of changing the divergence measure, consider a simple example, illustrated in figures 1 and 2. The original distribution p(x) is a mixture of two Gaussians, one tall and narrow, the other short and wide. The approximation q(x) is required to be a single (scaled) Gaussian, with arbitrary mean, variance, and scale factor. For different values of α, figure 1 plots the global minimum of D_α(p || q) over q. The solutions vary smoothly with α, the most dramatic changes happening around α = 0.5. When α is a large negative number, the best approximation represents only one mode, the one with largest mass (not the mode which is highest). When α is a large positive number, the approximation tries to cover the entire distribution, eventually forming an upper bound when α → ∞. Figure 2 shows that the mass of the approximation continually increases as we increase α.

The properties observed in this example are general, and can be derived from the formula for α-divergence. Start with the mode-seeking property for α ≪ 0. It happens because the valleys of p force the approximation downward. Looking at (3) and (4), for example, we see that α ≤ 0 emphasizes q being small wherever p is small. These divergences are zero-forcing, because p(x) = 0 forces q(x) = 0. In other words, they avoid "false positives," to an increasing degree as α gets more negative. This causes some parts of p to be excluded. The cost of excluding an x, i.e. setting q(x) = 0, is p(x)/(1 − α). Therefore q will keep the areas of largest total mass, and exclude areas with small total mass.

Zero-forcing emphasizes modeling the tails, rather than the bulk of the distribution, which tends to underestimate the variance of p. For example, when p is a mixture of Gaussians, the tails reflect the component which is widest. The optimal Gaussian q will have variance similar to the variance of the widest component, even if there are many overlapping components. For example, if p has 100 identical Gaussians in a row, forming a plateau, the optimal q is only as wide as one of them.
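The behavior in figures 1 and 2 is easy to reproduce numerically. The sketch below (the mixture parameters, grid, and search ranges are assumptions for illustration, not the paper's exact setup) fits a scaled Gaussian to a two-Gaussian mixture by brute-force search over the mean and standard deviation, setting the mass by the formula Z̃ = [∫ p^α q̄^{1−α} dx]^{1/α} given later in (8):

```python
import numpy as np

# Discretized sketch of the figure 1-2 experiment: p is a two-component
# Gaussian mixture (assumed parameters), q a single scaled Gaussian.
x = np.linspace(-8.0, 12.0, 2001)
dx = x[1] - x[0]

def gauss(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

p = 0.7 * gauss(x, 0.0, 0.7) + 0.3 * gauss(x, 4.0, 2.0)  # tall+narrow, short+wide

def alpha_div(p, q, a):
    # eq. (2), valid for unnormalized p and q (a not in {0, 1})
    return np.sum(a * p + (1 - a) * q - p**a * q**(1 - a)) * dx / (a * (1 - a))

def fit(a):
    best = None
    for m in np.linspace(-1.0, 5.0, 61):
        for s in np.linspace(0.3, 6.0, 58):
            qbar = np.maximum(gauss(x, m, s), 1e-300)
            with np.errstate(all='ignore'):
                mass = np.sum(p**a * qbar**(1 - a) * dx) ** (1 / a)
                d = alpha_div(p, mass * qbar, a)
            if np.isfinite(d) and (best is None or d < best[0]):
                best = (d, mass, m, s)
    return best[1:]   # (mass, mean, std) of the best scaled Gaussian

fits = {a: fit(a) for a in (0.5, 2.0)}
for a, (mass, m, s) in fits.items():
    print(f"alpha={a}: mass={mass:.3f} mean={m:.2f} std={s:.2f}")
```

Consistent with figure 2, the recovered mass falls below the true mass (here 1) for α = 0.5 and above it for α = 2, and the α = 2 fit spreads out to cover more of p.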

When α ≥ 1, a different tendency happens. These divergences want to cover as much of p as possible. Following the terminology of Frey et al. (2000), these divergences are inclusive (those with α < 1 are exclusive). Inclusive divergences require q > 0 whenever p > 0, thus avoiding "false negatives." If two identical Gaussians are separated enough, an exclusive divergence prefers to represent only one of them, while an inclusive divergence prefers to stretch across both.

Figure 3 diagrams the structure of α space. As shown later, the six algorithms of section 1 correspond to minimizing different α-divergences, indicated on the figure. Variational message-passing/mean-field uses α = 0, belief propagation and expectation propagation use α = 1, tree-reweighted message-passing can use a variety of α ≥ 1, while fractional belief propagation and power EP can use any α-divergence. The divergences with 0 < α < 1 are a blend of these extremes. They are not zero-forcing, so they try to represent multiple modes, but they will ignore modes that are far away from the main mass (how far depends on α).

Figure 3: The structure of α-divergences.

Now consider the mass of the optimal q. Write q(x) = Z̃ q̄(x), where q̄ is normalized, so that Z̃ represents the mass. It is straightforward to obtain the optimal Z̃:

  Z̃_α = exp( ∫_x q̄(x) log(p(x)/q̄(x)) dx )       if α = 0
  Z̃_α = [ ∫_x p(x)^α q̄(x)^{1−α} dx ]^{1/α}      otherwise    (8)

This is true regardless of whether q̄ is optimal.

Theorem 1  If x is a non-negative random variable, then E[x^α]^{1/α} is nondecreasing in α.
Proof: See appendix B.

Theorem 2  Z̃_α is nondecreasing in α. As a consequence,

  Z̃ ≤ ∫_x p(x)dx   if α < 1    (9a)
  Z̃ = ∫_x p(x)dx   if α = 1    (9b)
  Z̃ ≥ ∫_x p(x)dx   if α > 1    (9c)

Proof: In Theorem 1, let x = p(x)/q̄(x) and take the expectation with respect to q̄(x).

Theorem 2 is demonstrated in figure 2: the integral of q monotonically increases with α, passing through the true value when α = 1. This theorem applies to an exact minimization over Z̃, which is generally not possible. But it shows that the α < 1 divergence measures tend to underestimate the integral of p, while α > 1 tends to overestimate. Only α = 1 tries to recover the correct integral.

Now that we have looked at the properties of different divergence measures, let's look at specific algorithms to minimize them.

3 Minimizing α-divergence

This section describes a simple method to minimize α-divergence, by repeatedly minimizing KL-divergence. The method is then illustrated on exponential families and factorized approximations.

3.1 A fixed-point scheme

When q minimizes the KL-divergence to p over a family F, we will say that q is the KL-projection of p onto F. As shorthand for this, define the operator proj[·] as:

  proj[p] = argmin_{q∈F} KL(p || q)    (10)

Theorem 3  Let F be indexed by a continuous parameter θ, possibly with constraints. If α ≠ 0:

  q is a stationary point of D_α(p || q)
  ⟺ q is a stationary point of proj[p(x)^α q(x)^{1−α}]    (11)

Proof: The derivative of the α-divergence with respect to θ is

  dD_α(p || q)/dθ = (1/α) [ ∫_x (dq(x)/dθ) dx − ∫_x (p'_θ(x)/q(x)) (dq(x)/dθ) dx ]    (12)

  where p'_θ(x) = p(x)^α q(x)^{1−α}    (13)

When α = 1 (KL-divergence), the derivative is

  dKL(p || q)/dθ = ∫_x (dq(x)/dθ) dx − ∫_x (p(x)/q(x)) (dq(x)/dθ) dx    (14)

Comparing (14) and (12), we find that

  dD_α(p || q)/dθ |_{θ=θ0} = (1/α) dKL(p'_{θ0} || q)/dθ |_{θ=θ0}    (15)

Therefore if α ≠ 0, the corresponding Lagrangians must have the same stationary points.

To find a q satisfying (11), we can apply a fixed-point iteration. Guess an initial q, then repeatedly update it via

  q'(x) = proj[p(x)^α q(x)^{1−α}]    (16)

  q^{new}(x) = q(x)^{1−ε} q'(x)^ε    (17)

This scheme is heuristic and not guaranteed to converge. However, it is often successful with an appropriate amount of damping (a small stepsize ε).

More generally, we can minimize D_α by repeatedly minimizing any other D_α′ (α′ ≠ 0):

  q'(x) = argmin_{q'} D_α′( p(x)^{α/α′} q(x)^{1−α/α′} || q'(x) )    (18)

3.2 Exponential families

A set of distributions F is called an exponential family if each can be written as

  q(x) = exp( Σ_j g_j(x) ν_j )    (19)

where the ν_j are the parameters of the distribution and the g_j are fixed features of the family, such as (1, x, x²) in the Gaussian case. To work with unnormalized distributions, we make g_0(x) = 1 a feature, whose corresponding parameter ν_0 captures the scale of the distribution. To ensure the distribution is proper, there may be constraints on the ν_j, e.g. the variance of a Gaussian must be positive.

KL-projection for exponential families has a simple interpretation. Substituting (19) into the KL-divergence, we find that the minimum is achieved at any member of F whose expectation of g_j matches that of p, for all j:

  q = proj[p] ⟺ ∀j ∫_x g_j(x) q(x)dx = ∫_x g_j(x) p(x)dx    (20)

For example, if F is the set of Gaussians, then proj[p] is the unique Gaussian whose mean, variance, and scale match p. Equation (16) in the fixed-point scheme reduces to computing the expectations of p(x)^α q(x)^{1−α} and setting q'(x) to match those expectations.

3.3 Fully-factorized approximations

A distribution is said to be fully-factorized if it can be written as

  q(x) = s ∏_i q_i(x_i)    (21)

We will use the convention that q_i is normalized, so that s represents the integral of q.

KL-projection onto a fully-factorized distribution reduces to matching the marginals of p:

  q = proj[p] ⟺ ∀i ∫_{x\x_i} q(x)dx = ∫_{x\x_i} p(x)dx    (22)

which simplifies to

  s = ∫_x p(x)dx    (23)

  ∀i  q_i(x_i) = (1/s) ∫_{x\x_i} p(x)dx    (24)

Equation (16) in the fixed-point scheme simplifies to:

  s' = ∫_x p(x)^α q(x)^{1−α} dx    (25)

  q'_i(x_i) = (1/s') ∫_{x\x_i} p(x)^α q(x)^{1−α} dx    (26)

            = (s^{1−α}/s') q_i(x_i)^{1−α} ∫_{x\x_i} p(x)^α ∏_{j≠i} q_j(x_j)^{1−α} dx    (27)

In these equations, q is assumed to have no constraints other than being fully-factorized. Going further, we may require q to be in a fully-factorized exponential family. A fully-factorized exponential family has features g_ij(x_i), involving one variable at a time. In this case, (20) becomes

  q = proj[p] ⟺ ∀ij ∫_x g_ij(x_i) q(x)dx = ∫_x g_ij(x_i) p(x)dx    (28)

This can be abbreviated using a projection onto the features of x_i (which may vary with i):

  proj[ ∫_{x\x_i} q(x)dx ] = proj[ ∫_{x\x_i} p(x)dx ]    (29)

  or  s q_i(x_i) = proj[ ∫_{x\x_i} p(x)dx ]    (30)

Equation (16) in the fixed-point scheme becomes:

  q'_i(x_i) = (1/s') proj[ ∫_{x\x_i} p(x)^α q(x)^{1−α} dx ]    (31)

            = (s^{1−α}/s') proj[ q_i(x_i)^{1−α} ∫_{x\x_i} p(x)^α ∏_{j≠i} q_j(x_j)^{1−α} dx ]    (32)
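A minimal sketch of the fixed-point scheme (16)-(17), with F taken to be scaled Gaussians discretized on a grid (the mixture p, the initialization, and the stepsize are assumptions for illustration); proj[·] here matches mass, mean, and variance, as described after (20):

```python
import numpy as np

x = np.linspace(-8.0, 12.0, 4001)
dx = x[1] - x[0]

def gauss(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

p = 0.7 * gauss(x, 0.0, 0.5) + 0.3 * gauss(x, 4.0, 4.0)  # assumed mixture

def proj(f):
    # KL-projection onto scaled Gaussians: match mass, mean, variance (eq. 20)
    mass = np.sum(f) * dx
    mean = np.sum(x * f) * dx / mass
    var = np.sum((x - mean) ** 2 * f) * dx / mass
    return mass * gauss(x, mean, var)

def minimize_alpha_div(alpha, eps=0.5, iters=200):
    q = gauss(x, 1.0, 4.0)                    # initial guess
    for _ in range(iters):
        qp = proj(p**alpha * q**(1 - alpha))  # eq. (16)
        q = q**(1 - eps) * qp**eps            # eq. (17), damped update
    return q

masses = {a: np.sum(minimize_alpha_div(a)) * dx for a in (0.5, 1.0)}
print(masses)
```

At α = 1 the update is simply q' = proj[p], so the converged mass matches the mass of p; at α = 0.5 the converged mass is smaller, as Theorem 2 predicts.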


Note that q_i(x_i) is inside the projection.

When α = 0, the fixed-point scheme of section 3.1 doesn't apply. However, there is a simple fixed-point scheme for minimizing KL(q || p) when q is fully-factorized and otherwise unconstrained (the other cases are more complicated). With q having the form (21), the KL-divergence becomes:

  KL(q || p) = s Σ_i ∫_{x_i} q_i(x_i) log q_i(x_i) dx_i − s ∫_x ∏_i q_i(x_i) log p(x) dx + s log s − s + ∫_x p(x)dx    (33)

Zeroing the derivative with respect to q_i(x_i) gives the update

  q_i^{new}(x_i) ∝ exp( ∫_{x\x_i} ∏_{j≠i} q_j(x_j) log p(x) dx )    (34)

which is analogous to (27) with α → 0. Cycling through these updates for all i gives a coordinate descent procedure. Because each subproblem is convex, the procedure must converge to a local minimum.

3.4 Equality example

This section considers a concrete example of minimizing α-divergence over fully-factorized distributions, illustrating the difference between divergences, and by extension, between message-passing schemes. Consider a binary variable x whose distribution is p_x(0) = 1/4, p_x(1) = 3/4. Now add a binary variable y which is constrained to equal x. The marginal distribution for x should be unchanged, and the marginal distribution for y should be the same as for x: p_y(0) = 1/4. However, this is not necessarily the case when using approximate inference.

These two pieces of information can be visualized as a factor graph (figure 4). The joint distribution of x and y can be written as a matrix (rows indexed by x, columns by y):

  p(x, y) = ( 1/4    0  )
            (  0    3/4 )    (35)

This distribution has two modes of different height, similar to the example in figure 1.

Figure 4: Factor graph for the equality example

Let's approximate this distribution with a fully-factorized q (21), minimizing different α-divergences. This approximation has 3 free parameters: the total mass s, q_x(0), and q_y(0). We can solve for these parameters analytically. By symmetry, we must have q_y(0) = q_x(0). Furthermore, at a fixed point q = q' (and s = s'). Thus (27) simplifies as follows:

  q_x(x) = (s^{1−α}/s') q_x(x)^{1−α} Σ_y p(x, y)^α q_y(y)^{1−α}    (36)

  q_x(x)^α = s^{−α} Σ_y p(x, y)^α q_y(y)^{1−α}    (37)

  q_x(0)^α = s^{−α} p_x(0)^α q_x(0)^{1−α}    (38)

  q_x(0)^{2α−1} = s^{−α} p_x(0)^α    (39)

  q_x(1)^{2α−1} = s^{−α} p_x(1)^α    (40)

  ( q_x(0)/q_x(1) )^{2α−1} = ( p_x(0)/p_x(1) )^α    (41)

  q_x(0) = p_x(0)^{α/(2α−1)} / ( p_x(0)^{α/(2α−1)} + p_x(1)^{α/(2α−1)} )   if α > 1/2
         = 0                                                               if α ≤ 1/2    (42)

  s = p_x(1) q_x(1)^{(1−2α)/α}    (43)

When α = 1, corresponding to running belief propagation, the result is (q_x(0) = p_x(0), s = 1), which means

  q_BP(x, y) = ( 1/4 ) ( 1/4  3/4 ) = ( 1/16  3/16 )
               ( 3/4 )                ( 3/16  9/16 )    (44)

The approximation matches the marginals and total mass of p. Because the divergence is inclusive, the approximation includes both modes, and smooths over the zeros. It over-represents the higher mode, making it 9 times higher than the other, while it should only be 3 times higher.

When α = 0, corresponding to running mean-field, or in fact whenever α ≤ 1/2, the result is (q_x(0) = 0, s = p_x(1)), which means

  q_MF(x, y) = (3/4) ( 0 ) ( 0  1 ) = ( 0    0  )
                     ( 1 )            ( 0   3/4 )    (45)

This divergence preserves the zeros, forcing it to model only one mode, whose height is represented correctly. There are two local minima in the minimization, corresponding to the two modes; the global minimum, shown here, models the more massive mode. The approximation does not preserve the marginals or overall mass of p.
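The closed-form solution (42)-(43) for this example can be verified directly. The sketch below transcribes those equations and rebuilds the table s · q_x(x) q_y(y); it is a check of the algebra, not code from the paper:

```python
import numpy as np

px = np.array([0.25, 0.75])   # p_x(0), p_x(1)

def solve(alpha):
    if alpha <= 0.5:
        qx = np.array([0.0, 1.0])   # exclusive branch of eq. (42)
        s = px[1]                    # mass of the chosen mode
    else:
        w = px ** (alpha / (2 * alpha - 1))             # eq. (42)
        qx = w / w.sum()
        s = px[1] * qx[1] ** ((1 - 2 * alpha) / alpha)  # eq. (43)
    return s, qx

tables = {}
for alpha in (1.0, 0.0, 100.0):   # BP, mean-field, and a proxy for alpha -> inf
    s, qx = solve(alpha)
    tables[alpha] = s * np.outer(qx, qx)   # q(x, y) = s q_x(x) q_y(y)
    print(f"alpha={alpha}:\n{tables[alpha]}")
```

The α = 1 and α = 0 tables reproduce (44) and (45) exactly, and large α approaches the α → ∞ table (47) below.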

At the other extreme, when α → ∞, the result is q_x(0) = √p_x(0) / (√p_x(0) + √p_x(1)) and s = (√p_x(0) + √p_x(1))², which means

  q_∞(x, y) = ((1 + √3)²/4) (  1/(1+√3) ) (  1/(1+√3)   √3/(1+√3) )    (46)
                            ( √3/(1+√3) )

            = ( 1/4    √3/4 )
              ( √3/4   3/4  )    (47)

As expected, the approximation is a pointwise upper bound to p. It preserves both peaks perfectly, but smooths away the zeros. It does not preserve the marginals or total mass of p.

From these results, we can draw the following conclusions:

• None of the approximations is inherently superior. It depends on what properties of p you care about preserving.

• Fitting a fully-factorized approximation does not imply trying to match the marginals of p. It depends on what properties the divergence measure is trying to preserve. Using α = 0 is equivalent to saying that zeros are more important to preserve than marginals, so when faced with the choice, mean-field will preserve the zeros.

• Under approximate inference, adding a new variable (y, in this case) to a model can change the estimate of existing variables (x), even when the new variable provides no information. For example, when using mean-field, adding y suddenly makes us believe that x = 1.

4 Message-passing

This section describes a general message-passing scheme to (approximately) minimize a given divergence measure D. Mean-field methods, belief propagation, and expectation propagation are all included in this scheme.

The procedure is as follows. We have a distribution p and we want to find the q ∈ F that minimizes D(p || q). First, we must restrict F to be an exponential family. Then we will write the distribution p as a product of factors, p(x) = ∏_a f_a(x), as in a Bayesian network. Each factor will be approximated by a member of F, such that when we multiply these approximations together we get a q ∈ F that has a small value of D(p || q). The best approximation of each factor depends on the rest of the network, giving a chicken-and-egg problem. This is solved by an iterative message-passing procedure, where each factor sends its approximation to the rest of the net, and then recomputes its approximation based on the messages it receives.

The first step is to choose an exponential family. The reason to use exponential families is closure under multiplication: the product of any distributions in the family is also in the family.

The next step is to write the original distribution p as a product of nonnegative factors:

  p(x) = ∏_a f_a(x)    (48)

This defines the specific way in which we want to divide the network, and it is not unique. Each factor can depend on several, perhaps all, of the variables of p. By approximating each factor f_a by f̃_a ∈ F, we get an approximation divided in the same way:

  f̃_a(x) = exp( Σ_j g_j(x) τ_aj )    (49)

  q(x) = ∏_a f̃_a(x)    (50)

Now we look at the problem from the perspective of a given approximate factor f̃_a. Define q^{\a}(x) to be the product of all the other approximate factors:

  q^{\a}(x) = q(x)/f̃_a(x) = ∏_{b≠a} f̃_b(x)    (51)

Similarly, define p^{\a}(x) = ∏_{b≠a} f_b(x). Then factor f̃_a seeks to minimize D(f_a p^{\a} || f̃_a q^{\a}). To make this tractable, assume that the approximations we've already made, q^{\a}(x), are a good approximation to the rest of the network, i.e. p^{\a} ≈ q^{\a}, at least for the purposes of solving for f̃_a. Then the problem becomes

  f̃_a(x) = argmin D( f_a(x) q^{\a}(x) || f̃_a(x) q^{\a}(x) )    (52)

This problem is tractable, provided we've made a sensible choice of factors. It can be solved with the procedures of section 3. Cycling through these coupled subproblems gives the message-passing algorithm:

Generic Message Passing
• Initialize f̃_a(x) for all a.
• Repeat until all f̃_a converge:
  1. Pick a factor a.
  2. Compute q^{\a} via (51).
  3. Using the methods of section 3:
     f̃_a(x)^{new} = argmin D( f_a(x) q^{\a}(x) || f̃_a(x) q^{\a}(x) )

This algorithm can be interpreted as message passing between the factors f_a. The approximation f̃_a is the message that factor a sends to the rest of the network, and q^{\a} is the collection of messages that factor a receives (its "inbox"). The inbox summarizes the behavior of the rest of the network.

Figure 5: Message-passing on a factor graph

4.1 Fully-factorized case

When q is fully-factorized as in section 3.3, message-passing has an elegant graphical interpretation via factor graphs. Instead of factors passing messages to factors, messages move along the edges of the factor graph, between variables and factors, as shown in figure 5. (The case where q is structured can also be visualized on a graph, but it requires a more complex type of graph known as a structured region graph (Welling et al., 2005).)

Because q is fully-factorized, the approximate factors will be fully-factorized into messages m_{a→i} from factor a to variable i:

  f̃_a(x) = ∏_i m_{a→i}(x_i)    (53)

Individual messages need not be normalized, and need not be proper distributions.

The inboxes q^{\a}(x) will factorize in the same way as q. We can collect all the factors involving the same variable x_i, to define messages m_{i→a} from variable i to factor a:

  m_{i→a}(x_i) = ∏_{b≠a} m_{b→i}(x_i)    (54)

  q^{\a}(x) = ∏_i ∏_{b≠a} m_{b→i}(x_i) = ∏_i m_{i→a}(x_i)    (55)

This implies q_i(x_i) ∝ m_{a→i}(x_i) m_{i→a}(x_i) for any a.

Now solve (52) in the fully-factorized case. If D is an α-divergence, we can apply the fixed-point iteration of section 3.1. Substitute p(x) = f_a(x) q^{\a}(x) and q(x) = f̃_a(x) q^{\a}(x) into (25) to get

  s' = ∫_x f_a(x)^α f̃_a(x)^{1−α} q^{\a}(x) dx    (56)

     = ∫_x f_a(x)^α ∏_j m_{a→j}(x_j)^{1−α} m_{j→a}(x_j) dx    (57)

Make the same substitution into (31):

  q'_i(x_i) = (1/s') proj[ ∫_{x\x_i} f_a(x)^α f̃_a(x)^{1−α} q^{\a}(x) dx ]    (58)

  m'_{a→i}(x_i) m_{i→a}(x_i) = (1/s') proj[ ∫_{x\x_i} f_a(x)^α ∏_j m_{a→j}(x_j)^{1−α} m_{j→a}(x_j) dx ]    (59)

  m'_{a→i}(x_i) = (1/s') (1/m_{i→a}(x_i)) proj[ m_{a→i}(x_i)^{1−α} m_{i→a}(x_i) ∫_{x\x_i} f_a(x)^α ∏_{j≠i} m_{a→j}(x_j)^{1−α} m_{j→a}(x_j) dx ]    (60)

A special case arises if x_i does not appear in f_a(x). Then the integral in (60) becomes constant with respect to x_i, the projection is exact, and we are left with m'_{a→i}(x_i) ∝ m_{a→i}(x_i)^{1−α}; in other words, m_{a→i}(x_i) = 1. With this substitution, we only need to propagate messages between a factor and the variables it uses.

The algorithm becomes:

Fully-Factorized Message Passing
• Initialize m_{a→i}(x_i) for all (a, i).
• Repeat until all m_{a→i} converge:
  1. Pick a factor a.
  2. Compute the messages into the factor via (54).
  3. Compute the messages out of the factor via (60) (if D is an α-divergence), and apply a stepsize ε as in (17).

If D is not an α-divergence, then the outgoing message formula will change, but the overall algorithm is the same.

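For discrete variables the projection in (60) is exact, and at α = 1 the update reduces to the familiar sum-product message of belief propagation. The sketch below runs it on an assumed three-variable chain x1 - f12 - x2 - f23 - x3 (a tree, where BP is exact) and checks the resulting belief against brute-force marginalization:

```python
import numpy as np

rng = np.random.default_rng(0)
f12 = rng.uniform(0.5, 2.0, size=(2, 2))   # f12[x1, x2], assumed potentials
f23 = rng.uniform(0.5, 2.0, size=(2, 2))   # f23[x2, x3]

m = {k: np.ones(2) for k in ('f12->1', 'f12->2', 'f23->2', 'f23->3')}

for _ in range(5):
    # variable-to-factor messages, eq. (54): product of other incoming messages
    m1_to_f12, m3_to_f23 = np.ones(2), np.ones(2)
    m2_to_f12, m2_to_f23 = m['f23->2'], m['f12->2']
    # factor-to-variable messages: eq. (60) with alpha = 1 (sum-product)
    m['f12->1'] = f12 @ m2_to_f12
    m['f12->2'] = f12.T @ m1_to_f12
    m['f23->2'] = f23 @ m3_to_f23
    m['f23->3'] = f23.T @ m2_to_f23

belief2 = m['f12->2'] * m['f23->2']        # q_2 propto product of messages
belief2 /= belief2.sum()

joint = np.einsum('ij,jk->ijk', f12, f23)  # exact joint table p(x1, x2, x3)
exact2 = joint.sum(axis=(0, 2))
exact2 /= exact2.sum()
print(belief2, exact2)
```

On a loopy graph the same updates give loopy BP, which is no longer exact; that is the setting of the experiment below.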
4.2 Local vs. global divergence

The generic message-passing algorithm is based on the assumption that minimizing the local divergences D(f_a(x) q^{\a}(x) || f̃_a(x) q^{\a}(x)) approximates minimizing the global divergence D(p || q). An interesting question is whether, in the end, we are minimizing the divergence we intended, or if the result resembles some other divergence. In the case α = 0, minimizing local divergences corresponds exactly to minimizing the global divergence, as shown in section 5. Otherwise, the correspondence is only approximate. To measure how close the correspondence is, consider the following experiment: given a global α-divergence index α_G, find the corresponding local α-divergence index α_L which produces the best q according to α_G.

This experiment was carried out with p(x) equal to a 4 × 4 Boltzmann grid, i.e. binary variables connected by pairwise factors:

  p(x) = ∏_i f_i(x_i) ∏_{ij∈E} f_ij(x_i, x_j)    (61)

The graph E was a grid with four-connectivity. The unary potentials had the form f_i(x_i) = [exp(θ_i1)  exp(θ_i2)], and the pairwise potentials had the form

  f_ij(x_i, x_j) = (     1        exp(w_ij) )
                   ( exp(w_ij)       1      )

The goal was to approximate p with a fully-factorized q. For a given local divergence α_L, this was done using the fractional BP algorithm of section 7 (all factors used the same α_L). Then D_{α_G}(p || q) was computed by explicit summation over x (enumerating all states of the network). Ten random networks were generated with (θ, w) drawn from a uniform distribution over [−1, 1]. The results are shown in figure 6(a). For individual networks, the best α_L sometimes differs from α_G when α_G > 1 (not shown), but the one best α_L across all 10 networks (shown) is α_L = α_G, with a slight downward bias for large α_G. Thus by minimizing localized divergence we are close to minimizing the same divergence globally.

Figure 6: The best local α for minimizing a given global α-divergence, across ten networks with (a) random couplings, w_ij ∼ U(−1, 1), or (b) positive couplings, w_ij = 1.

In general, if the approximating family F is a good fit to p, then we should expect local divergence to match global divergence, since q^{\a} ≈ p^{\a}. In a graph with random potentials, the correlations tend to be short, so approximating p^{\a} with a fully-factorized distribution does little harm (there is not much overcounting due to loops). If p has long-range correlations, then q^{\a} will not fit as well, and we expect a larger discrepancy between local and global divergence. To test this, another experiment was run with w_ij = 1 on all edges. In this case, there are long-range correlations, and message passing suffers from overcounting effects. The results in figure 6(b) now show a consistent discrepancy between α_G and α_L. When α_G < 0, the best α_L = α_G as before. But when α_G ≥ 0, the best α_L was strictly larger than α_G (the relationship is approximately linear, with slope > 1). To understand why a large α_L could be good, recall that increasing α leads to flatter approximations, which try to cover all of p. By making the local approximations flatter, we make the messages weaker, which reduces the overcounting. This example shows that if q is a poor fit to p, then we might do better by choosing a local divergence different from the global one we want to minimize.

We can also improve the quality of the approximation by changing the number of factors we divide p into. In the extreme case, we can use only one factor to represent all of p, in which case the local divergence is exactly the global divergence. By using more factors, we simplify the computations, at the cost of making additional approximations.

4.3 Mismatched divergences

It is possible to run message passing with a different divergence measure being minimized for each factor a. For example, one factor may use α = 1 while another uses α = 0. The motivation for this is that some divergences may be easier to minimize for certain factors (Minka, 2004). The effect of this on the global result is unclear, but locally the observations of section 2 will continue to hold. While less motivated theoretically, mismatched divergences are very useful in practice. Henceforth we will allow each factor a to have its own divergence index α_a.

4.4 Estimating Z

Just as in section 2, we can analytically derive the Z̃ that would be computed by message-passing, for any approximating family. Let q(x) = ∏_a f̃_a(x), possibly unnormalized, where the f̃_a(x) are any functions in the family F. Define the rescaled factors

  f̃'_a(x) = s_a f̃_a(x)    (62)

  q'(x) = ∏_a f̃'_a(x) = (∏_a s_a) q(x)    (63)

  Z̃ = ∫_x q'(x) dx = ( ∫_x q(x) dx ) ∏_a s_a    (64)

The scale s_a that minimizes the local α-divergence is

  s_a = exp( ∫_x q(x) log(f_a(x)/f̃_a(x)) dx / ∫_x q(x) dx )            if α_a = 0

  s_a = ( ∫_x (f_a(x)/f̃_a(x))^{α_a} q(x) dx / ∫_x q(x) dx )^{1/α_a}    otherwise    (65)

Plugging this into (64) gives (for α_a ≠ 0):

  Z̃ = ( ∫_x q(x) dx )^{1 − Σ_a 1/α_a} × ∏_a ( ∫_x (f_a(x)/f̃_a(x))^{α_a} q(x) dx )^{1/α_a}    (66)
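The optimal scale (65) can be spot-checked numerically. The sketch below uses small random discrete tables (assumptions for illustration; note that q = f̃_a q^{\a} by (50)-(51)) and compares the closed form against a brute-force search over the scalar s_a:

```python
import numpy as np

rng = np.random.default_rng(0)
K, alpha = 5, 0.7
q = rng.uniform(0.5, 2.0, K)        # the current approximation q(x)
ratio = rng.uniform(0.5, 2.0, K)    # f_a(x) / f_tilde_a(x)

# closed form, eq. (65), alpha_a != 0 branch
s_formula = (np.sum(ratio**alpha * q) / np.sum(q)) ** (1 / alpha)

def local_div(s):
    # D_alpha(f_a q^\a || s f_tilde_a q^\a) from eq. (2), as a function of s
    u, v = ratio * q, s * q
    return np.sum(alpha*u + (1-alpha)*v - u**alpha * v**(1-alpha)) \
           / (alpha * (1 - alpha))

grid = np.linspace(0.25, 4.0, 7501)
s_grid = grid[np.argmin([local_div(s) for s in grid])]
print(s_formula, s_grid)
```

The two estimates agree to within the grid resolution, confirming that (65) is the minimizer of the local divergence in s_a.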

Because the mass of q estimates the mass of p, (66) provides an estimate of ∫_x p(x) dx (the partition function or model evidence). Compared to (8), the minimum of the global divergence, this estimate is more practical to compute since it involves integrals over one factor at a time.

Interestingly, when α = 0 the local and global estimates are the same. This fact is explored in section 5.

Theorem 4  For any set of messages f̃:

    Z̃ ≤ ∫_x p(x) dx    if all α_a ≤ 0    (67a)

    Z̃ ≥ ∫_x p(x) dx    if all α_a > 0 and Σ_a 1/α_a ≤ 1    (67b)

Proof: Appendix C proves the following generalizations of the Hölder inequality:

    E[∏_i x_i] ≥ ∏_i E[x_i^{α_i}]^{1/α_i}    if α_i ≤ 0    (68a)

    E[∏_i x_i] ≤ ∏_i E[x_i^{α_i}]^{1/α_i}    if α_i > 0 and Σ_i 1/α_i ≤ 1    (68b)

Substituting x_i = f_i/f̃_i and taking the expectations with respect to the normalized distribution q/∫_x q(x)dx gives exactly the bounds in the theorem. The upper bound (67b) is equivalent to that of Wainwright et al. (2005b), who proved it in the case where p(x) was an exponential family, but in fact it holds more generally. Appendix D provides an alternative proof of (67b), using arguments similar to Wainwright et al. (2005b).

4.5 The free-energy function

Besides its use as an estimate of the model evidence, (66) has another interpretation. As a function of the message parameters τ_a, its stationary points are exactly the fixed points of α-divergence message passing (Minka, 2004; Minka, 2001a). In other words, (66) is the surrogate objective function that message-passing is optimizing, in lieu of the intractable global divergence D_α(p || q). Because mean-field, belief propagation, expectation propagation, etc. are all instances of α-divergence message passing, (66) describes the surrogate objective for all of them.

Now that we have established the generic message-passing algorithm, let's look at specific instances of it.

5 Mean-field

This section shows that the mean-field method is a special case of the generic message-passing algorithm. In the mean-field method (Jordan et al., 1999; Jaakkola, 2000) we minimize KL(q || p), the exclusive KL-divergence. Why should we minimize exclusive KL, versus other divergence measures? Some authors motivate the exclusive KL by the fact that it provides a bound on the model evidence Z = ∫_x p(x)dx, as shown by (9a). However, theorem 2 shows that there are many other upper and lower bounds we could obtain, by minimizing other divergences. What really makes α = 0 special is its computational properties. Uniquely among all α-divergences, it enjoys an equivalence between global and local divergence. Rather than minimize the global divergence directly, we can apply the generic message-passing algorithm of section 4, to get the variational message-passing algorithm of Winn & Bishop (2005). Uniquely for α = 0, the message-passing fixed points are exactly the stationary points of the global KL-divergence.

To get variational message-passing, use a fully-factorized approximation with no exponential family constraint (section 3.3). To minimize the local divergence (52), substitute p(x) = f_a(x) q^{\a}(x) and q(x) = f̃_a(x) q^{\a}(x) into the fixed-point scheme for α = 0 (34) to get:

    q_i(x_i)^new ∝ exp( ∫_{x\x_i} ∏_{j≠i} q_j(x_j) log f_a(x) dx ) × exp( ∫_{x\x_i} ∏_{j≠i} q_j(x_j) log q^{\a}(x) dx )    (69)

    m_a→i(x_i)^new m_i→a(x_i) ∝ exp( ∫_{x\x_i} ∏_{j≠i} m_a→j(x_j) m_j→a(x_j) log f_a(x) dx ) × exp( ( ∫_{x\x_i} ∏_{j≠i} q_j(x_j) dx ) log m_i→a(x_i) )    (70)

    m_a→i(x_i)^new ∝ exp( ∫_{x\x_i} ∏_{j≠i} m_a→j(x_j) m_j→a(x_j) log f_a(x) dx )    (71)

Applying the template of section 4.1, the algorithm becomes:

Variational message-passing
• Initialize m_a→i(x_i) for all (a, i).
• Repeat until all m_a→i converge:
  1. Pick a factor a.
  2. Compute the messages into the factor via (54).
  3. Compute the messages out of the factor via (71).

The above algorithm is for general factors f_a. However, because VMP does not project the messages onto an exponential family, they can get arbitrarily complex. (Section 6 discusses this issue with belief propagation.) The only way to control the message complexity is to restrict the factors f_a to already be in an exponential family. This is the restriction on f_a adopted by Winn & Bishop (2005).
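For a fully-factorized q over discrete variables, the α = 0 update (71) reduces to the familiar mean-field coordinate update q_i(x_i) ∝ exp( E_{q\i}[log f] ). A minimal sketch on a hypothetical two-variable model, with the coupling made strong enough and the initialization slightly biased so that the exclusive divergence locks onto a single mode:

```python
import math

# Hypothetical two-variable binary model: p(x1,x2) ∝ f(x1,x2),
# one strongly attractive pairwise factor with two equally massive modes.
f = {(0, 0): 20.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 20.0}

def normalize(d):
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}

# Mean-field / VMP coordinate update, eq. (71) with messages collapsed:
#   q_i(x_i) ∝ exp( sum_{x_j} q_j(x_j) log f(x_i, x_j) )
q = [normalize({0: 1.0, 1: 1.0}),   # overwritten on the first step
     normalize({0: 0.6, 1: 0.4})]   # slight bias toward state 0
for sweep in range(50):
    for i in (0, 1):
        j = 1 - i
        new = {}
        for xi in (0, 1):
            # f is symmetric, so the index order does not matter here
            avg_log = sum(q[j][xj] * math.log(f[(xi, xj)]) for xj in (0, 1))
            new[xi] = math.exp(avg_log)
        q[i] = normalize(new)

# The exclusive divergence concentrates on one of the two symmetric modes.
print(round(q[0][0], 3), round(q[1][0], 3))
```

Mirroring the initial bias toward state 1 would make the same iteration pick the other mode: the symmetry-breaking behavior discussed in the text.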

Now we show that this algorithm has the same fixed points as the global KL-divergence. Let q have the exponential form (19) with free parameters ν_j. The global divergence is

    KL(q || p) = ∫_x q(x) log( q(x)/p(x) ) dx + ∫_x (p(x) − q(x)) dx    (72)

Zeroing the derivative with respect to ν_j gives the stationary condition:

    d KL(q || p)/dν_j = ∫_x g_j(x) q(x) log( q(x)/p(x) ) dx = 0    (73)

Define the matrices H and B with entries

    h_jk = ∫_x g_j(x) g_k(x) q(x) dx    (74)

    b_aj = ∫_x g_j(x) q(x) log f_a(x) dx    (75)

Substituting the exponential form of q into (73) gives

    Hν − Σ_a b_a = 0    (76)

In message-passing, the local divergence for factor a is

    KL(q(x) || f_a(x) q^{\a}(x)) = ∫_x q(x) log( f̃_a(x)/f_a(x) ) dx + ∫_x (f_a(x) − f̃_a(x)) q^{\a}(x) dx    (77)

Here the free parameters are the τ_aj from (49). The derivative of the local divergence with respect to τ_aj gives the stationary condition:

    ∫_x g_j(x) q(x) log( f̃_a(x)/f_a(x) ) dx = 0    (78)

    Hτ_a − b_a = 0    (79)

    where Σ_a τ_a = ν    (80)

Now we show that the conditions (76) and (79) are equivalent. In one direction, if we have τ's satisfying (79) and (80), then we have a ν satisfying (76). In the other direction, if we have a ν satisfying (76), then we can compute (H, B) from (74,75) and solve for τ_a in (79). (If H is singular, there may be multiple valid τ's.) The resulting τ's will satisfy (80), providing a valid message-passing fixed point. Thus a message-passing fixed point implies a global fixed point and vice versa.

From the discussion in section 2, we expect that in multimodal cases this method will represent the most massive mode of p. When the modes are equally massive, it will pick one of them at random. This symmetry-breaking property is discussed by Jaakkola (2000). Sometimes symmetry-breaking is viewed as a problem, while other times it is exploited as a feature.

6 Belief Propagation and EP

This section describes how to obtain loopy belief propagation (BP) and expectation propagation (EP) from the generic message-passing algorithm. In both cases, we locally minimize KL(p || q), the inclusive KL-divergence. Unlike the mean-field method, we do not necessarily minimize the global KL-divergence exactly. However, if inclusive KL is what you want to minimize, then BP and EP do a better job than mean-field.

To get loopy belief propagation, use a fully-factorized approximation with no explicit exponential family constraint (section 3.3). This is equivalent to using an exponential family with lots of indicator features:

    g_ij(x_i) = δ(x_i − j)    (81)

where j ranges over the domain of x_i. Since α = 1, the fully-factorized message equation (60) becomes:

    m'_a→i(x_i) ∝ ∫_{x\x_i} f_a(x) ∏_{j≠i} m_j→a(x_j) dx    (82)

Applying the template of section 4.1, the algorithm is:

Loopy belief propagation
• Initialize m_a→i(x_i) for all (a, i).
• Repeat until all m_a→i converge:
  1. Pick a factor a.
  2. Compute the messages into the factor via (54).
  3. Compute the messages out of the factor via (82), and apply a stepsize ε.

It is possible to improve the performance of belief propagation by clustering variables together, corresponding to a partially-factorized approximation. However, the cost of the algorithm grows rapidly with the amount of clustering, since the messages get exponentially more complex.

Because BP does not project the messages onto an exponential family, they can have unbounded complexity. When discrete and continuous variables are mixed, the messages in belief propagation can get exponentially complex. Consider a dynamic Bayes net with a continuous state whose dynamics is controlled by discrete hidden switches (Heskes & Zoeter, 2002). As you go forward in time, the state distribution acquires multiple modes due to the unknown switches. The number of modes is multiplied at every time step, leading to an exponential increase in message complexity through the network.
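As a sketch of the update (82) and the algorithm template above, the following runs loopy BP on a hypothetical three-variable cycle (the smallest loopy graph) and compares a belief against the exact marginal obtained by enumeration. All potentials are illustrative.

```python
import itertools, math

# Hypothetical 3-cycle of binary variables: attractive pairwise factors on each
# edge, plus a unary factor pulling variable 0 toward state 0.
pair = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}
unary = {0: {0: 2.0, 1: 1.0}, 1: {0: 1.0, 1: 1.0}, 2: {0: 1.0, 1: 1.0}}
edges = [(0, 1), (1, 2), (2, 0)]

# factor-to-variable messages, initialized uniform
msg = {(e, v): {0: 0.5, 1: 0.5} for e in edges for v in e}

def var_to_factor(v, e):
    """Product of the unary factor and the other incoming messages (eq. 54)."""
    out = {x: unary[v][x] for x in (0, 1)}
    for e2 in edges:
        if e2 != e and v in e2:
            for x in (0, 1):
                out[x] *= msg[(e2, v)][x]
    return out

for sweep in range(100):
    for e in edges:
        for v in e:
            nb = e[0] if e[1] == v else e[1]
            inc = var_to_factor(nb, e)
            # eq. (82): m_{a->i}(x_i) ∝ sum_{x_j} f_a(x) m_{j->a}(x_j)
            new = {x: sum(pair[(x, y)] * inc[y] for y in (0, 1)) for x in (0, 1)}
            z = new[0] + new[1]
            msg[(e, v)] = {x: new[x] / z for x in (0, 1)}

def belief(v):
    b = {x: unary[v][x] for x in (0, 1)}
    for e in edges:
        if v in e:
            for x in (0, 1):
                b[x] *= msg[(e, v)][x]
    z = b[0] + b[1]
    return {x: b[x] / z for x in (0, 1)}

# exact marginal by brute-force enumeration, for comparison
weights = {}
for x in itertools.product((0, 1), repeat=3):
    wt = math.prod(pair[(x[a], x[b])] for a, b in edges)
    wt *= math.prod(unary[v][x[v]] for v in (0, 1, 2))
    weights[x] = wt
Z = sum(weights.values())
exact0 = sum(wt for x, wt in weights.items() if x[0] == 0) / Z

print(round(belief(0)[0], 3), round(exact0, 3))
```

On this loop the BP belief is close to, but slightly more confident than, the exact marginal: a small instance of the overcounting effect described in section 4.2.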

The only way to control the complexity of BP is to restrict the factors to already be in an exponential family. In practice, this limits BP to fully-discrete or fully-Gaussian networks.

Expectation propagation (EP) is an extension of belief propagation which fixes these problems. The essential difference between EP and BP is that EP imposes an exponential family constraint on the messages. This is useful in two ways. First, by bounding the complexity of the messages, it provides practical message-passing in general networks with continuous variables. Second, EP reduces the cost of clustering variables, since you don't have to compute the exact joint distribution of a cluster. You could fit a jointly Gaussian approximation to the cluster, or you could fit a tree-structured approximation to the cluster (Minka & Qi, 2003).

With an exponential family constraint, the fully-factorized message equation (60) becomes:

    m'_a→i(x_i) ∝ (1/m_i→a(x_i)) proj[ m_i→a(x_i) ∫_{x\x_i} f_a(x) ∏_{j≠i} m_j→a(x_j) dx ]    (83)

Applying the template of section 4.1, the algorithm is:

Expectation propagation
• Initialize m_a→i(x_i) for all (a, i).
• Repeat until all m_a→i converge:
  1. Pick a factor a.
  2. Compute the messages into the factor via (54).
  3. Compute the messages out of the factor via (83), and apply a stepsize ε.

7 Fractional BP and Power EP

This section describes how to obtain fractional belief propagation (FBP) and power expectation propagation (Power EP) from the generic message-passing algorithm. In this case, we locally minimize any α-divergence.

Previous sections have already derived the relevant equations. The algorithm of section 4.1 already implements Power EP. Fractional BP excludes the exponential family projection. If you drop the projection in the fully-factorized message equation (60), you get:

    m'_a→i(x_i) ∝ m_a→i(x_i)^{1−α} × ∫_{x\x_i} f_a(x)^α ∏_{j≠i} m_a→j(x_j)^{1−α} m_j→a(x_j) dx    (84)

Equating m_a→i(x_i) on both sides gives

    m_a→i(x_i) ∝ ( ∫_{x\x_i} f_a(x)^α ∏_{j≠i} m_a→j(x_j)^{1−α} m_j→a(x_j) dx )^{1/α}    (85)

which is the message equation for fractional BP.

8 Tree-reweighted message passing

This section describes how to obtain tree-reweighted message passing (TRW) from the generic message-passing algorithm. In the description of Wainwright et al. (2005b), tree-reweighted message passing is an algorithm for computing an upper bound on the partition function Z = ∫_x p(x)dx. However, TRW can also be viewed as an inference algorithm which approximates a distribution p by minimizing α-divergence. In fact, TRW is a special case of fractional BP.

In tree-reweighted message passing, each factor f_a is assigned an appearance probability μ(a) ∈ (0, 1]. Let the power α_a = 1/μ(a). The messages M_ts(x_s) in Wainwright et al. (2005b) are equivalent to m_a→i(x_i)^{α_a} in this paper. In the notation of this paper, the message equation of Wainwright et al. (2005b) is:

    m_a→i(x_i)^{α_a} ∝ ∫_{x\x_i} f_a(x)^{α_a} ∏_{j≠i} ( ∏_{b≠a} m_b→j(x_j) ) / m_a→j(x_j)^{α_a−1} dx    (86)

    = ∫_{x\x_i} f_a(x)^{α_a} ∏_{j≠i} m_a→j(x_j)^{1−α_a} m_j→a(x_j) dx    (87)

This is exactly the fractional BP update (85). The TRW update is therefore equivalent to minimizing local α-divergence. The constraint 0 < μ(a) ≤ 1 requires α_a ≥ 1. Note that (86) applies to factors of any degree, not just pairwise factors as in Wainwright et al. (2005b).

The remaining question is how to obtain the upper bound formula of Wainwright et al. (2005b). Because Σ_a μ(a) ≠ 1 in general, the upper bound in (67b) does not directly apply. However, if we redefine the factors in the bound to correspond to the trees in TRW, then (67b) gives the desired upper bound.

Specifically, let A be any subset of factors f_a, and let μ(A) be a normalized distribution over all possible subsets. In TRW, μ(A) > 0 only for spanning trees, but this is not essential. Let μ(a) denote the appearance probability of factor a, i.e. the sum of μ(A) over all subsets containing factor a.
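A minimal sketch of the fractional BP update (85) for a single pairwise discrete factor; the factor values and messages are illustrative. Setting α = 1 recovers the standard BP message, while α > 1 (as in TRW, where α_a = 1/μ(a) ≥ 1) produces a flatter, weaker message, consistent with the discussion in section 4.2.

```python
def fractional_bp_message(f, inc, m_out, alpha):
    """One fractional-BP update, eq. (85), for a pairwise discrete factor.

    f      : dict (xi, xj) -> factor value
    inc    : dict xj -> incoming message m_{j->a}(xj)
    m_out  : dict xj -> current outgoing message m_{a->j}(xj)
    alpha  : divergence index; alpha = 1 gives standard BP, and
             alpha_a = 1/mu(a) >= 1 gives a TRW update.
    """
    states = sorted({xi for (xi, _) in f})
    new = {}
    for xi in states:
        s = sum((f[(xi, xj)] ** alpha) * (m_out[xj] ** (1 - alpha)) * inc[xj]
                for xj in states)
        new[xi] = s ** (1 / alpha)
    z = sum(new.values())
    return {x: v / z for x, v in new.items()}

# Example: attractive factor, a biased incoming message, alpha = 1 vs alpha = 2.
f = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}
inc = {0: 0.8, 1: 0.2}
m_out = {0: 0.5, 1: 0.5}
bp = fractional_bp_message(f, inc, m_out, alpha=1.0)
frac = fractional_bp_message(f, inc, m_out, alpha=2.0)
print(round(bp[0], 3), round(frac[0], 3))
```

Here the α = 2 message assigns less mass to the favored state than the α = 1 message, illustrating how larger α weakens the messages.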

For each subset A, define the factor-group f_A according to:

    f_A(x) = ∏_{a∈A} f_a(x)^{μ(A)/μ(a)}    (88)

These factor-groups define a valid factorization of p:

    p(x) = ∏_A f_A(x)    (89)

This is true because of the definition of μ(a). Similarly, if q(x) = ∏_a f̃_a(x), then we can define approximate factor-groups f̃_A according to:

    f̃_A(x) = ∏_{a∈A} f̃_a(x)^{μ(A)/μ(a)}    (90)

which provide another factorization of q:

    q(x) = ∏_A f̃_A(x)    (91)

Now plug this factorization into (66), using powers α_A = 1/μ(A):

    Z̃ = ∏_A ( ∫_x (f_A(x)/f̃_A(x))^{α_A} q(x) dx )^{1/α_A}    (92)

Because Σ_A μ(A) = 1, we have Σ_A 1/α_A = 1. By theorem 4, (92) is an upper bound on Z. When restricted to spanning trees, it gives exactly the upper bound of Wainwright et al. (2005b) (their equation 16). To see the equivalence, note that exp(Φ(θ(T))) in their notation is the same as ∫_x (f_A(x)/f̃_A(x))^{α_A} q(x) dx, because of their equations 21, 22, 58, and 59.

In Wainwright et al. (2005b), it was observed that TRW sometimes achieves better estimates of the marginals than BP. This seems to contradict the result of section 3, that α = 1 is the best at estimating marginals. However, in section 4.2, we saw that sometimes it is better for message-passing to use a local divergence which is different from the global one we want to minimize. In particular, this is true when the network has purely attractive couplings. Indeed, this was the good case for TRW observed by Wainwright et al. (2005b) (their figures 7b and 9b).

9 Choosing a divergence measure

This section gives general guidelines for choosing a divergence measure in message passing. There are three main considerations: computational complexity, the approximating family, and the inference goal.

First, the reason we make approximations is to save computation, so if a divergence measure requires a lot of work to minimize, we shouldn't use it. Even among α-divergences, there can be vast differences in computational complexity, depending on the specific factors involved. Some divergences also have lots of local minima, to trap a would-be optimizer. So an important step in designing a message passing algorithm should be to determine which divergences are the easiest to minimize on the given problem.

Next we have the approximating family. If the approximating family is a good fit to the true distribution, then it doesn't matter which divergence measure you use, since all will give similar results. The only consideration at that point is computational complexity. If the approximating family is a poor fit to the true distribution, then you are probably safest to use an exclusive divergence, which only tries to model one mode. With an inclusive divergence, message passing probably won't converge at all. If the approximating family is a medium fit to the true distribution, then you need to consider the inference goal.

For some tasks, there are particular divergence measures that are uniquely suited. One example is choosing the proposal density for importance sampling, described in appendix E. If the task is to compute marginal distributions, using a fully-factorized approximation, then the best choice (among α-divergences) is inclusive KL (α = 1). The reason is simple: this is the only α which strives to preserve the marginals. Papers that compare mean-field versus belief propagation at estimating marginals invariably find belief propagation to be better (Weiss, 2001; Minka & Qi, 2003; Kappen & Wiegerinck, 2001; Mooij & Kappen, 2004). This is not surprising, since mean-field is optimizing for a different task. Just because the approximation is factorized does not mean that the factors are supposed to approximate the marginals of p; it depends on what divergence measure they optimize. The inclusive KL should also be preferred for estimating the integral of p (the partition function, see section 2) or other simple moments of p.

If the task is Bayesian learning, the situation is more complicated. Here x is a parameter vector, and p(x) ≡ p(x|D) is the posterior distribution given training data. The predictive distribution for future data y is ∫_x p(y|x)p(x)dx. To simplify this computation, we'd like to approximate p(x) with q(x) and predict using ∫_x p(y|x)q(x)dx. Typically, we are not interested in q(x) directly, but only this predictive distribution. Thus a sensible error measure is the divergence between the predictive distributions:

    D( ∫_x p(y|x)p(x)dx || ∫_x p(y|x)q(x)dx )    (93)

Because p(y|x) is a fixed function, this is a valid objective for q(x). Unfortunately, it is different from the divergence measures we've looked at so far. The measures so far compare p to q point-by-point, while (93) takes averages of p and compares these to averages of q. If we want to use algorithms for α-divergence, then we need to find the α most similar to (93).

Consider binary classification with a likelihood of the form p(y = ±1|x, z) = φ(y x^T z), where z is the input vector, y is the label, and φ is a step function. In this case, the predictive probability that y = 1 is Pr(x^T z > 0) under the (normalized) posterior for x. This is equivalent to projecting the unnormalized posterior onto the line x^T z, and measuring the total mass above zero, compared to below zero. These one-dimensional projections might look like the distributions in figure 1. By approximating p(x) as Gaussian, we make all these projections Gaussian, which may alter the total mass above/below zero. A good q(x) is one which preserves the correct mass on each side of zero; no other properties matter.

To find the α-divergence which best captures this error measure, we ran the following experiment. We first sampled 2^10 random one-dimensional mixtures of two Gaussians (means from N(0, 4), variances from squaring N(0, 1), scale factors uniform on [0, 1]). For each one, we fit a Gaussian by minimizing α-divergence, for several values of α. After optimization, both p and q were normalized, and we computed p(x > 0) and q(x > 0). The predictive error was defined to be the absolute difference |p(x > 0) − q(x > 0)|. (KL-divergence to p(x > 0) gives similar results.) The average error for each α value is plotted in figure 7. The best predictions came from α = 1 and in general from the inclusive divergences versus the exclusive ones. Exclusive divergences perform poorly because they tend to give extreme predictions (all mass on one side). So α = 1 seems to be the best substitute for (93) on this task.

[Figure 7: Average predictive error for various alpha-divergences on a mock classification task. Axes: α (from −3 to 5) vs. total predictive error.]

A task in which exclusive divergences are known to do well is Bayesian learning of mixture models, where each component has separate parameters. In this case, the predictive distribution depends on the posterior in a more complex way. Specifically, the predictive distribution is invariant to how the mixture components are indexed. Thus ∫_x p(y|x)p(x)dx is performing a non-local type of averaging, over all ways of permuting the elements of x. This is hard to capture with a pointwise divergence. For example, if our prior is symmetric with respect to the parameters and we condition on data, then any mode in the posterior will have a mirror copy corresponding to swapping components. Minimizing an inclusive divergence will waste resources by trying to represent all of these identical modes. An exclusive divergence, however, will focus on one mode. This doesn't completely solve the problem, since there may be multiple modes of the likelihood even for one component ordering, but it performs well in practice. This is an example of a problem where, because of the complexity of the posterior, it is safest to use an exclusive divergence. Perhaps with a different approximating family, e.g. one which assumes symmetrically placed modes, inclusive divergence would also work well.

10 Future work

The perspective of information divergences offers a variety of new research directions for the artificial intelligence community. For example, we could construct information divergence interpretations of other message-passing algorithms, such as generalized belief propagation (Yedidia et al., 2004), max-product versions of BP and TRW (Wainwright et al., 2005a), Laplace propagation (Smola et al., 2003), and bound propagation (Leisink & Kappen, 2003). We could improve the performance of Bayesian learning (section 9) by finding more appropriate divergence measures and turning them into message-passing algorithms. In networks with long-range correlations, it is difficult to predict the best local divergence measure (section 4.2). Answering this question could significantly improve the performance of message-passing on hard networks. By continuing to assemble the pieces of the inference puzzle, we can make Bayesian methods easier for everyone to enjoy.

Acknowledgement

Thanks to Martin Szummer for corrections to the manuscript.

References

Ali, S. M., & Silvey, S. D. (1966). A general class of coefficients of divergence of one distribution from another. J Royal Stat Soc B, 28, 131–142.

Amari, S. (1985). Differential-geometrical methods in statistics. Springer-Verlag.

Frey, B. J., & MacKay, D. J. (1997). A revolution: Belief propagation in graphs with cycles. NIPS.

Frey, B. J., Patrascu, R., Jaakkola, T., & Moran, J. (2000). Sequentially fitting inclusive trees for inference in noisy-OR networks. NIPS 13.

Heskes, T. (2003). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. NIPS.

Heskes, T., & Zoeter, O. (2002). Expectation propagation for approximate inference in dynamic Bayesian networks. Proc UAI.

Jaakkola, T. (2000). Tutorial on variational approximation methods. Advanced mean field methods: theory and practice. MIT Press.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Learning in Graphical Models. MIT Press.

Kappen, H. J., & Wiegerinck, W. (2001). Novel iteration schemes for the cluster variation method. NIPS 14.

Leisink, M. A. R., & Kappen, H. J. (2003). Bound propagation. J Artificial Intelligence Research, 19, 139–154.

Minka, T. P. (2001a). The EP energy function and minimization schemes. research.microsoft.com/~minka/.

Minka, T. P. (2001b). Expectation propagation for approximate Bayesian inference. UAI (pp. 362–369).

Minka, T. P. (2004). Power EP (Technical Report MSR-TR-2004-149). Microsoft Research Ltd.

Minka, T. P., & Qi, Y. (2003). Tree-structured approximations by expectation propagation. NIPS.

Mooij, J., & Kappen, H. (2004). Validity estimates for loopy belief propagation on binary real-world networks. NIPS.

Peterson, C., & Anderson, J. (1987). A mean field theory learning algorithm for neural networks. Complex Systems, 1, 995–1019.

Smola, A., Vishwanathan, S. V., & Eskin, E. (2003). Laplace propagation. NIPS.

Trottini, M., & Spezzaferri, F. (1999). A generalized predictive criterion for model selection (Technical Report 702). CMU Statistics Dept. www.stat.cmu.edu/tr/.

Wainwright, M. J., Jaakkola, T. S., & Willsky, A. S. (2005a). MAP estimation via agreement on (hyper)trees: Message-passing and linear-programming approaches. IEEE Transactions on Information Theory. To appear.

Wainwright, M. J., Jaakkola, T. S., & Willsky, A. S. (2005b). A new class of upper bounds on the log partition function. IEEE Trans. on Information Theory, 51, 2313–2335.

Weiss, Y. (2001). Comparing the mean field method and belief propagation for approximate inference in MRFs. Advanced Mean Field Methods. MIT Press.

Welling, M., Minka, T., & Teh, Y. W. (2005). Structured Region Graphs: Morphing EP into GBP. UAI.

Wiegerinck, W. (2000). Variational approximations between mean field theory and the junction tree algorithm. UAI.

Wiegerinck, W., & Heskes, T. (2002). Fractional belief propagation. NIPS 15.

Winn, J., & Bishop, C. (2005). Variational message passing. Journal of Machine Learning Research, 6, 661–694.

Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2004). Constructing free energy approximations and generalized belief propagation algorithms (Technical Report). MERL Research Lab. www.merl.com/publications/TR2004-040/.

Zhu, H., & Rohwer, R. (1995). Information geometric measurements of generalization (Technical Report NCRG/4350). Aston University.

A Ali-Silvey divergences

Ali & Silvey (1966) defined a family of convex divergence measures which includes α-divergence as a special case. These are sometimes called f-divergences because they are parameterized by the choice of a convex function f. Some properties of the α-divergence are easier to prove by thinking of it as an instance of an f-divergence. With appropriate corrections to handle unnormalized distributions, the general formula for an f-divergence is

    D_f(p || q) = (1/f''(1)) ∫_x [ q(x) f( p(x)/q(x) ) + (f'(1) − f(1)) q(x) − f'(1) p(x) ] dx    (94)

where f is any convex or concave function (concave functions are turned into convex ones by the f''(1) term). Evaluating f'' at r = 1 is arbitrary; only the sign of f'' matters. Some examples:

    KL(q || p):    f(r) = log(r),      f'(1) = 1,   f''(1) = −1    (95)
    KL(p || q):    f(r) = r log(r),    f'(1) = 1,   f''(1) = 1     (96)
    D_α(p || q):   f(r) = r^α,         f'(1) = α,   f''(1) = −α(1 − α)    (97)

The L1 distance ∫ |p(x) − q(x)| dx can be obtained as f(r) = |r − 1| if we formally define (f'(1) = 0, f''(1) = 1), for example by taking a limit. The f-divergences are a large class, but they do not include e.g. the L2 distance ∫ (p(x) − q(x))^2 dx.

The derivatives with respect to p and q are:

    dD_f(p || q)/dp(x) = (1/f''(1)) [ f'( p(x)/q(x) ) − f'(1) ]    (98)

    dD_f(p || q)/dq(x) = (1/f''(1)) [ f( p(x)/q(x) ) − f(1) − ( p(x)/q(x) ) f'( p(x)/q(x) ) + f'(1) ]    (99)

Therefore the divergence and its derivatives are zero at p = q. It can be verified by direct differentiation that D_f is jointly convex in (p, q) (the Hessian is positive semidefinite), therefore it must be ≥ 0 everywhere.
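A quick numeric check of the f-divergence formula (94) against the examples (95) and (96), assuming normalized discrete distributions (the distributions themselves are arbitrary):

```python
import math

def f_divergence(p, q, f, f1, fp1, fpp1):
    """Eq. (94): D_f(p||q) for discrete measures p, q given as dicts.
    f1 = f(1), fp1 = f'(1), fpp1 = f''(1)."""
    total = 0.0
    for s in p:
        r = p[s] / q[s]
        total += q[s] * f(r) + (fp1 - f1) * q[s] - fp1 * p[s]
    return total / fpp1

p = {0: 0.2, 1: 0.5, 2: 0.3}
q = {0: 0.4, 1: 0.4, 2: 0.2}

kl_pq = sum(p[s] * math.log(p[s] / q[s]) for s in p)
kl_qp = sum(q[s] * math.log(q[s] / p[s]) for s in q)

# f(r) = r log r recovers inclusive KL(p || q), eq. (96)
d1 = f_divergence(p, q, lambda r: r * math.log(r), 0.0, 1.0, 1.0)
# f(r) = log r recovers exclusive KL(q || p), eq. (95)
d2 = f_divergence(p, q, lambda r: math.log(r), 0.0, 1.0, -1.0)

print(round(d1 - kl_pq, 9), round(d2 - kl_qp, 9))
```

For normalized p and q the correction terms in (94) cancel, so both checks agree with the direct KL formulas up to floating-point rounding.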

As illustrated by (95,96), you can swap the position of p and q in the divergence by replacing f(r) with r f(1/r) (which is convex if f is convex). Thus p and q can be swapped in the definition (94) without changing the essential family.

B Proof of Theorem 1

Theorem 1 (Liapunov's inequality)  If x is a non-negative random variable, and we have two real numbers α_2 > α_1, then:

    E[x^{α_2}]^{1/α_2} ≥ E[x^{α_1}]^{1/α_1}    (100)

where α = 0 is interpreted as the limit

    lim_{α→0} E[x^α]^{1/α} = exp(E[log x])    (101)

Proof: It is sufficient to prove the cases α_1 ≥ 0 and α_2 ≤ 0, since the other cases follow by transitivity. If f is a convex function, then Jensen's inequality tells us that

    E[f(x^{α_1})] ≥ f(E[x^{α_1}])    (102)

If α_2 > α_1 > 0, then f(x) = x^{α_2/α_1} is convex, leading to:

    E[x^{α_2}] ≥ E[x^{α_1}]^{α_2/α_1}    (103)
    E[x^{α_2}]^{1/α_2} ≥ E[x^{α_1}]^{1/α_1}    (104)

If 0 > α_2 > α_1, then f(x) = x^{α_2/α_1} is concave, leading to:

    E[x^{α_2}] ≤ E[x^{α_1}]^{α_2/α_1}    (105)
    E[x^{α_2}]^{1/α_2} ≥ E[x^{α_1}]^{1/α_1}    (106)

If α_2 > α_1 = 0, Jensen's inequality for the logarithm says

    E[log x^{α_2}] ≤ log E[x^{α_2}]    (107)
    α_2 E[log x] ≤ log E[x^{α_2}]    (108)
    E[log x] ≤ (1/α_2) log E[x^{α_2}]    (109)
    exp(E[log x]) ≤ E[x^{α_2}]^{1/α_2}    (110)

If 0 = α_2 > α_1, Jensen's inequality for the logarithm says

    E[log x^{α_1}] ≤ log E[x^{α_1}]    (111)
    α_1 E[log x] ≤ log E[x^{α_1}]    (112)
    E[log x] ≥ (1/α_1) log E[x^{α_1}]    (113)
    exp(E[log x]) ≥ E[x^{α_1}]^{1/α_1}    (114)

This proves all cases.

C Hölder inequalities

Theorem 5  For any set of non-negative random variables x_1, ..., x_n (not necessarily independent) and a set of positive numbers α_1, ..., α_n satisfying Σ_i 1/α_i ≤ 1:

    E[∏_i x_i] ≤ ∏_i E[x_i^{α_i}]^{1/α_i}    (115)

Proof: Start with the case Σ_i 1/α_i = 1. By Jensen's inequality for the logarithm we know that

    log( Σ_i x_i^{α_i}/α_i ) ≥ Σ_i (1/α_i) log(x_i^{α_i}) = Σ_i log(x_i)    (116)

Reversing this gives:

    ∏_i x_i ≤ Σ_i x_i^{α_i}/α_i    (117)

Now consider the ratio of the lhs of (115) over the rhs:

    E[∏_i x_i] / ∏_i E[x_i^{α_i}]^{1/α_i} = E[ ∏_i x_i / E[x_i^{α_i}]^{1/α_i} ]    (118)
    ≤ E[ Σ_i (1/α_i) x_i^{α_i} / E[x_i^{α_i}] ]    by (117)    (119)
    = Σ_i (1/α_i) E[x_i^{α_i}] / E[x_i^{α_i}] = 1    (120)

Now if Σ_i 1/α_i < 1, this means some α_i is larger than needed. By Th. 1, this will only increase the right hand side of (115).

Theorem 6  For any set of non-negative random variables x_1, ..., x_n and a set of non-positive numbers α_1, ..., α_n ≤ 0:

    E[∏_i x_i] ≥ ∏_i E[x_i^{α_i}]^{1/α_i}    (121)

where the case α_i = 0 is interpreted as the limit in (101).

Proof: By Th. 1, this inequality is tightest for α_i = 0. By Jensen's inequality for the logarithm, we know that

    log E[∏_i x_i] ≥ E[log ∏_i x_i] = Σ_i E[log x_i]    (122)

This proves the case α_i = 0 for all i:

    E[∏_i x_i] ≥ ∏_i exp(E[log x_i])    (123)

By Th. 1, setting α_i < 0 will only decrease the right hand side.
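The two generalized Hölder inequalities can be checked empirically. For finite sample averages the α_i = 2 case of Theorem 5 is the Cauchy-Schwarz inequality and the α_i = 0 case of Theorem 6 is the AM-GM inequality, so both hold for any draw; the correlated lognormal construction below is illustrative.

```python
import random, math

random.seed(0)

# Correlated non-negative variables: x1 and x2 share a common Gaussian part,
# so they are deliberately not independent (as the theorems allow).
N = 20000
samples = []
for _ in range(N):
    z = random.gauss(0, 1)
    x1 = math.exp(z + 0.3 * random.gauss(0, 1))
    x2 = math.exp(0.5 * z + 0.3 * random.gauss(0, 1))
    samples.append((x1, x2))

def mean(vals):
    return sum(vals) / len(vals)

E_prod = mean([a * b for a, b in samples])

# Theorem 5 with alpha_1 = alpha_2 = 2 (so sum of 1/alpha_i = 1): upper bound.
upper = mean([a ** 2 for a, _ in samples]) ** 0.5 * \
        mean([b ** 2 for _, b in samples]) ** 0.5
# Theorem 6 at alpha_i = 0: the product of geometric means is a lower bound.
lower = math.exp(mean([math.log(a) for a, _ in samples])) * \
        math.exp(mean([math.log(b) for _, b in samples]))

print(lower <= E_prod <= upper)
```

These are the same bounds that, applied to x_i = f_i/f̃_i, yield Theorem 4.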

D Alternate upper bound proof

Define an exponential family with parameters (ν, a):

    p(x; ν, a) = Z(ν, a)^{−1} exp( Σ_k a_k log f_k(x) + Σ_j ν_j g_j(x) )    (124)

where

    log Z(ν, a) = log ∫_x exp( Σ_k a_k log f_k(x) + Σ_j ν_j g_j(x) ) dx    (125)

Because it is the partition function of an exponential family, log Z is convex in (ν, a). Define a set of parameter vectors ((λ_1, a_1), ..., (λ_n, a_n)) and non-negative weights c_1, ..., c_n which sum to 1. Then Jensen's inequality says

    log Z( Σ_i c_i λ_i, Σ_i c_i a_i ) ≤ Σ_i c_i log Z(λ_i, a_i)    (126)

    where Σ_i c_i = 1    (127)

Because it is a sum of convex functions, this upper bound is convex in ((λ_1, a_1), ..., (λ_n, a_n)). The integral that we are trying to bound is ∫_x p(x)dx = Z(0, 1). Plugging this into (126) and exponentiating gives

    ∫_x p(x) dx ≤ ∏_i Z(λ_i, a_i)^{c_i}    (128)

    provided that Σ_i c_i λ_i = 0    (129)

    and Σ_i c_i a_i = 1    (130)

Choose a_i to be the vector with 1/c_i in position i and 0 elsewhere. This satisfies (130) and the bound simplifies to:

    ∫_x p(x) dx ≤ ∏_i ( ∫_x f_i(x)^{1/c_i} exp( Σ_j λ_ij g_j(x) ) dx )^{c_i}    (131)

    provided that Σ_i c_i λ_i = 0    (132)

To put this in the notation of (66), define

    c_i = 1/α_i    (133)

    λ_i = Σ_j τ_j − α_i τ_i    (134)

where τ_i is the parameter vector of f̃_i(x) via (49). This definition automatically satisfies (132) and makes (131) reduce to (67b), which is what we wanted to prove.

E Alpha-divergence and importance sampling

Alpha-divergence has a close connection to importance sampling. Suppose we wish to estimate Z = ∫_x p(x)dx. In importance sampling, we draw n samples from a normalized proposal distribution q(x), giving x_1, ..., x_n. Then Z is estimated by:

    Z̃ = (1/n) Σ_i p(x_i)/q(x_i)    (135)

This estimator is unbiased, because

    E[Z̃] = (1/n) Σ_i ∫_x ( p(x)/q(x) ) q(x) dx = Z    (136)

The variance of the estimate (across different random draws) is

    var(Z̃) = Σ_i ( (1/n^2) ∫_x ( p(x)/q(x) )^2 q(x) dx − (1/n^2) Z^2 )    (137)

An optimal proposal distribution minimizes var(Z̃), i.e. it minimizes ∫_x p(x)^2/q(x) dx over q. This is equivalent to minimizing α-divergence with α = 2. Hence the problem of selecting an optimal proposal distribution for importance sampling is equivalent to finding a distribution with small α-divergence to p.
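A small numerical illustration of this equivalence, using Gaussian proposals for a standard normal target (the proposal widths are illustrative). The second moment ∫_x p(x)^2/q(x) dx, which controls var(Z̃) in (137), equals 1 + 2 D_α(p || q) at α = 2 for normalized p and q under the paper's convention, and is smaller for the proposal closer to p:

```python
import math

def normal_pdf(x, s):
    return math.exp(-0.5 * (x / s) ** 2) / (s * math.sqrt(2 * math.pi))

def second_moment(s_q, lo=-30.0, hi=30.0, n=60001):
    """Trapezoid-rule estimate of the integral of p(x)^2 / q(x) dx for
    p = N(0,1) and q = N(0, s_q^2); this quantity is >= 1, with equality
    only when q = p (the minimum-variance proposal)."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for k in range(n):
        x = lo + k * h
        wgt = 0.5 if k in (0, n - 1) else 1.0
        total += wgt * normal_pdf(x, 1.0) ** 2 / normal_pdf(x, s_q)
    return total * h

close = second_moment(1.2)   # proposal close to the target
wide = second_moment(3.0)    # overdispersed proposal
print(round(close, 3), round(wide, 3))
```

The proposal with smaller α = 2 divergence to p yields the smaller second moment, hence the smaller importance-sampling variance per (137).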
