Particle Belief Propagation
Alexander Ihler
[email protected]
Dept. of Computer Science
University of California, Irvine

David McAllester
[email protected]
Toyota Technological Institute, Chicago

Appearing in Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS) 2009, Clearwater Beach, Florida, USA. Volume 5 of JMLR: W&CP 5. Copyright 2009 by the authors.

Abstract

The popularity of particle filtering for inference in Markov chain models defined over random variables with very large or continuous domains makes it natural to consider sample-based versions of belief propagation (BP) for more general (tree-structured or loopy) graphs. Already, several such algorithms have been proposed in the literature. However, many questions remain open about the behavior of particle-based BP algorithms, and little theory has been developed to analyze their performance. In this paper, we describe a generic particle belief propagation (PBP) algorithm which is closely related to previously proposed methods. We prove that this algorithm is consistent, approaching the true BP messages as the number of samples grows large. We then use concentration bounds to analyze the finite-sample behavior and give a convergence rate for the algorithm on tree-structured graphs. Our convergence rate is O(1/√n), where n is the number of samples, independent of the domain size of the variables.

1 Introduction

Graphical models provide a powerful framework for representing structure in distributions over many random variables. This structure can then be used to efficiently compute or approximate many quantities of interest such as the posterior modes, means, or marginals of the distribution, often using "message-passing" algorithms such as belief propagation (Pearl, 1988). Traditionally, most such work has focused on systems of many variables, each of which has a relatively small state space (number of possible values), or particularly nice parametric forms (such as jointly Gaussian distributions). For systems with continuous-valued variables, or discrete-valued variables with very large domains, one possibility is to reduce the effective state space through gating, or discarding low-probability states (Freeman et al., 2000; Coughlan and Ferreira, 2002), or through random sampling (Arulampalam et al., 2002; Koller et al., 1999; Sudderth et al., 2003; Isard, 2003; Neal et al., 2003). The best-known example of the latter technique is particle filtering, defined on Markov chains, in which each distribution is represented using a finite collection of samples, or particles. It is therefore only natural to consider generalizations of particle filtering applicable to more general graphs ("particle" belief propagation); several variations have thus far been proposed, corresponding to different choices for certain fundamental questions.

As an example, consider the question of how to represent the messages computed during inference using particles. Broadly speaking, one might consider two possible approaches: to draw a set of particles for each message in the graph (Arulampalam et al., 2002; Sudderth et al., 2003; Isard, 2003), or to create a set of particles for each variable, for example by drawing samples from the estimated posterior marginal (Koller et al., 1999). This decision is closely related to the choice of proposal distribution in particle filtering; indeed, choosing better proposal distributions from which to draw the samples, or moving the samples via Markov chain Monte Carlo (MCMC) to match subsequent observations, comprises a large part of modern work on particle filters (Thrun et al., 2000; van der Merwe et al., 2001; Doucet et al., 2001; Khan et al., 2005).

Either method can be made asymptotically consistent, i.e., will produce the correct answer in the limit as the number of samples becomes infinite. However, consistency is a very weak condition; fundamentally, we are interested in the behavior of particle belief propagation for relatively small numbers of particles, ensuring computational efficiency. So far, little theory describes the finite-sample behavior of these algorithms.

In this work, we give a convergence rate for the accuracy of a relatively generic PBP algorithm, most closely related to that described in Koller et al. (1999). Our convergence rate is O(1/√n), where n is the number of particles, independent of the domain size of the nodes of the graphical model. The convergence rate is reminiscent of, and has a proof similar to, convergence rates for learning algorithms derived from Chernoff bounds applied to IID samples.

2 Definitions and Notation

Let G be an undirected graph consisting of nodes V = {1, ..., k} and edges E, and let Γ_s denote the set of neighbors of node s in G, i.e., the set of nodes t such that {s, t} is an edge of G. In a probabilistic graphical model, each node s ∈ V is associated with a random variable X_s taking on values in some domain \mathcal{X}_s. We assume that each node s and edge {s, t} are associated with potential functions Ψ_s and Ψ_{s,t} respectively, and given these potential functions we define a probability distribution on assignments of values to nodes as

    P(\vec{x}) = \frac{1}{Z} \prod_{s} \Psi_s(\vec{x}_s) \prod_{\{s,t\} \in E} \Psi_{s,t}(\vec{x}_s, \vec{x}_t)    (1)

Here \vec{x} is an assignment of values to all k variables, \vec{x}_s is the value assigned to X_s by \vec{x}, and Z is a scalar chosen to normalize the distribution P (also called the partition function). We consider the problem of computing marginal probabilities, defined by

    P_s(x_s) = \sum_{\vec{x} : \vec{x}_s = x_s} P(\vec{x}).    (2)

Equation (1) defines a pairwise Markov random field model. Our results are also directly applicable to more general graphical model formulations such as Bayes' nets (Pearl, 1988) and factor graphs (Kschischang et al., 2001), but for notational convenience we restrict our development to the pairwise form (1).
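To make the pairwise form (1) and the marginals (2) concrete, the following is a minimal Python sketch, not taken from the paper: the three-node chain, its potential tables, and all function names are invented for illustration. It evaluates Equation (1) by brute-force enumeration and normalizes the result to obtain Equation (2).

    import itertools
    import numpy as np

    # A toy pairwise MRF in the form of Equation (1): a hypothetical chain 1 - 2 - 3,
    # each variable binary. The potential tables are arbitrary positive values.
    domains = {1: [0, 1], 2: [0, 1], 3: [0, 1]}
    node_pot = {s: np.array([1.0, 2.0]) for s in domains}            # Psi_s(x_s)
    edge_pot = {(1, 2): np.array([[3.0, 1.0], [1.0, 3.0]]),          # Psi_{s,t}(x_s, x_t)
                (2, 3): np.array([[2.0, 1.0], [1.0, 2.0]])}

    def unnormalized_p(assign):
        """Product of node and edge potentials for a full assignment (Eq. 1 without 1/Z)."""
        p = 1.0
        for s, xs in assign.items():
            p *= node_pot[s][xs]
        for (s, t), psi in edge_pot.items():
            p *= psi[assign[s], assign[t]]
        return p

    def brute_force_marginal(s):
        """Marginal P_s(x_s) of Eq. (2): sum Eq. (1) over all other variables, then normalize."""
        marg = np.zeros(len(domains[s]))
        names = sorted(domains)
        for values in itertools.product(*(domains[v] for v in names)):
            assign = dict(zip(names, values))
            marg[assign[s]] += unnormalized_p(assign)
        return marg / marg.sum()   # normalization also divides out the partition function Z

    print(brute_force_marginal(2))

This enumeration costs time exponential in the number of variables, which is exactly what belief propagation avoids on tree-structured graphs.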
3 Review of Belief Propagation

In the case where G is a tree and the sets \mathcal{X}_s are small, the marginal probabilities can be computed efficiently by belief propagation (Pearl, 1988). This is done by computing messages m_{t→s}, each of which is a function on the state space of the target node, s. These messages are defined recursively as

    m_{t\to s}(x_s) = \sum_{x_t \in \mathcal{X}_t} \Psi_{t,s}(x_t, x_s)\, \Psi_t(x_t) \prod_{u \in \Gamma_t \setminus s} m_{u\to t}(x_t)    (3)

When G is a tree this recursion is well founded (loop-free) and Equation (3) uniquely determines the messages. We define the unnormalized belief function as

    B_s(x_s) = \Psi_s(x_s) \prod_{t \in \Gamma_s} m_{t\to s}(x_s).    (4)

When G is a tree the belief function is proportional to the marginal probability P_s defined by (2). It is sometimes useful to define the "pre-message" M_{t→s} as

    M_{t\to s}(x_t) = \Psi_t(x_t) \prod_{u \in \Gamma_t \setminus s} m_{u\to t}(x_t)    (5)

for x_t ∈ \mathcal{X}_t. Note that the pre-message M_{t→s} defines a weighting on the state space of the source node t, while the message m_{t→s} defines a weighting on the state space of the destination, s. We can then re-express (3)-(4) as

    m_{t\to s}(x_s) = \sum_{x_t \in \mathcal{X}_t} \Psi_{t,s}(x_t, x_s)\, M_{t\to s}(x_t)

    B_t(x_t) = M_{t\to s}(x_t)\, m_{s\to t}(x_t)

Although we develop our results for tree-structured graphs, it is common to apply belief propagation to graphs with cycles as well ("loopy" belief propagation). In this case the belief functions (4) will in general not equal the true marginals, but often provide good approximations in practice. We discuss the application of our results to particle-based versions of loopy BP at the end of Section 5.

For reasons of numerical stability, it is common to normalize each message m_{t→s} so that it has unit sum. However, such normalization of messages has no other effect on the (normalized) belief functions (4). Thus, for conceptual simplicity in developing and analyzing particle belief propagation, we avoid any explicit normalization of the messages; such normalization can be included in the algorithms in practice.

Additionally, for reasons of computational efficiency it is common to use the alternative expression

    m_{t\to s}(x_s) = \sum_{x_t \in \mathcal{X}_t} \Psi_{t,s}(x_t, x_s)\, \frac{B_t(x_t)}{m_{s\to t}(x_t)}    (6)

when computing the messages. By storing and updating the belief values B_t(x_t) incrementally as incoming messages are re-computed, one can significantly reduce the number of operations required. Although our development of particle belief propagation uses the update form (3), this alternative formulation can be applied to improve its efficiency as well.
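As a concrete illustration of the updates above, here is a minimal Python sketch of discrete BP on a tree; it is not from the paper, and the three-node graph, potential tables, and helper names are assumptions chosen only to keep Equations (3)-(5) easy to check by hand.

    import numpy as np

    # Hypothetical tree 1 - 2 - 3 with binary variables, potentials as in Eq. (1).
    neighbors = {1: [2], 2: [1, 3], 3: [2]}
    node_pot = {1: np.array([1.0, 2.0]), 2: np.array([1.0, 1.0]), 3: np.array([2.0, 1.0])}
    edge_pot = {(1, 2): np.array([[3.0, 1.0], [1.0, 3.0]]),
                (2, 3): np.array([[2.0, 1.0], [1.0, 2.0]])}

    def psi(t, s):
        """Psi_{t,s}(x_t, x_s) as an array indexed [x_t, x_s], for either edge orientation."""
        return edge_pot[(t, s)] if (t, s) in edge_pot else edge_pot[(s, t)].T

    def message(t, s, msgs):
        """Eq. (3): m_{t->s}(x_s) = sum_{x_t} Psi_{t,s}(x_t, x_s) M_{t->s}(x_t)."""
        pre = node_pot[t].copy()            # the pre-message M_{t->s}(x_t) of Eq. (5)
        for u in neighbors[t]:
            if u != s:
                pre = pre * msgs[(u, t)]    # product of incoming messages, excluding s
        return psi(t, s).T @ pre

    def belief(s, msgs):
        """Eq. (4): B_s(x_s) = Psi_s(x_s) prod_t m_{t->s}(x_s), normalized for readability."""
        b = node_pot[s].copy()
        for t in neighbors[s]:
            b = b * msgs[(t, s)]
        return b / b.sum()

    # On a tree the recursion (3) is well founded: a leaf-to-root sweep followed by a
    # root-to-leaf sweep (any valid order) determines every message exactly once.
    msgs = {}
    for (t, s) in [(1, 2), (3, 2), (2, 1), (2, 3)]:
        msgs[(t, s)] = message(t, s, msgs)
    print({s: belief(s, msgs) for s in neighbors})

Running the same message updates repeatedly over all directed edges, without the tree ordering, gives the "loopy" variant discussed above, with the beliefs then serving only as approximations to the true marginals.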
4 Particle Belief Propagation

We now consider the case where |\mathcal{X}_s| is too large to enumerate in practice, and define a generic particle- (sample-) based BP algorithm (PBP). This algorithm essentially corresponds to a non-iterative version of the method described in Koller et al. (1999).

4.1 Particle Belief Propagation Algorithm

PBP samples a set of particles x_s^{(1)}, ..., x_s^{(n)} with x_s^{(i)} ∈ \mathcal{X}_s at each node s of the network, drawn from a sampling distribution (or weighting) W_s(x_s) > 0 (corresponding to the proposal distribution in particle filtering). For each node s, define the set of distinct sampled values \hat{\mathcal{X}}_s and the count c_s(x) of each value x ∈ \hat{\mathcal{X}}_s as

    \hat{\mathcal{X}}_s = \{ x_s \in \mathcal{X}_s : \exists i,\; x_s^{(i)} = x_s \}, \qquad c_s(x_s) = |\{ i : x_s^{(i)} = x_s \}|.

Equation (8) has the property that if x_s^{(i)} = x_s^{(i')} then \hat{m}_{t\to s}^{(i)} = \hat{m}_{t\to s}^{(i')}; thus we can rewrite (8) as

    \hat{m}_{t\to s}(x_s) = \frac{1}{n} \sum_{x_t \in \hat{\mathcal{X}}_t} c_t(x_t)\, \Psi_{t,s}(x_t, x_s)\, \Psi_t(x_t)\, \frac{1}{W_t(x_t)} \prod_{u \in \Gamma_t \setminus s} \hat{m}_{u\to t}(x_t)    (9)
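The excerpt cuts off before the particle-message definitions referenced as Equations (7)-(8), so the sketch below is only a hedged illustration of the importance-weighted form in Equation (9), restricted to a single edge t -> s with no other incoming messages. The domain size, the potentials, the uniform proposal W_t, and every name here are assumptions for illustration, not the paper's implementation.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical single edge t -> s over a large discrete domain {0, ..., K-1}.
    K = 10_000
    def psi_t(x_t):        return 1.0 + (x_t % 3)                    # node potential Psi_t(x_t)
    def psi_ts(x_t, x_s):  return np.exp(-abs(x_t - x_s) / 1000.0)   # pairwise potential Psi_{t,s}
    def w_t(x_t):          return 1.0 / K                            # sampling distribution W_t (uniform)

    n = 500
    particles_t = rng.integers(0, K, size=n)   # x_t^{(1)}, ..., x_t^{(n)} drawn from W_t

    def pbp_message(x_s):
        """Importance-weighted estimate of m_{t->s}(x_s) in the spirit of Eq. (9).

        Duplicate particle values enter through their counts c_t(x_t). In a general graph
        each term would also carry the product of incoming messages m_hat_{u->t}(x_t) for
        u != s; that product is omitted because t has no other neighbors in this toy example.
        """
        vals, counts = np.unique(particles_t, return_counts=True)    # the set X_hat_t and c_t
        terms = counts * psi_ts(vals, x_s) * psi_t(vals) / w_t(vals)
        return terms.sum() / n

    print(pbp_message(x_s=1234))

Because PBP draws its particles per node rather than per message, the same sampled values x_t^{(i)} and weights 1/W_t(x_t^{(i)}) are reused in Equation (9) for every message leaving node t.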