
Particle Belief Propagation

Alexander Ihler ([email protected]), Dept. of Computer Science, University of California, Irvine
David McAllester ([email protected]), Toyota Technological Institute, Chicago

Appearing in Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS) 2009, Clearwater Beach, Florida, USA. Volume 5 of JMLR: W&CP 5. Copyright 2009 by the authors.

Abstract

The popularity of particle filtering for inference in Markov chain models defined over random variables with very large or continuous domains makes it natural to consider sample-based versions of belief propagation (BP) for more general (tree-structured or loopy) graphs. Already, several such algorithms have been proposed in the literature. However, many questions remain open about the behavior of particle-based BP algorithms, and little theory has been developed to analyze their performance. In this paper, we describe a generic particle belief propagation (PBP) algorithm which is closely related to previously proposed methods. We prove that this algorithm is consistent, approaching the true BP messages as the number of samples grows large. We then use concentration bounds to analyze the finite-sample behavior and give a convergence rate for the algorithm on tree-structured graphs. Our convergence rate is O(1/√n), where n is the number of samples, independent of the domain size of the variables.

1 Introduction

Graphical models provide a powerful framework for representing structure in distributions over many random variables. This structure can then be used to efficiently compute or approximate many quantities of interest such as the posterior modes, means, or marginals of the distribution, often using "message-passing" algorithms such as belief propagation (Pearl, 1988). Traditionally, most such work has focused on systems of many variables, each of which has a relatively small state space (number of possible values), or particularly nice parametric forms (such as jointly Gaussian distributions). For systems with continuous-valued variables, or discrete-valued variables with very large domains, one possibility is to reduce the effective state space through gating, or discarding low-probability states (Freeman et al., 2000; Coughlan and Ferreira, 2002), or through random sampling (Arulampalam et al., 2002; Koller et al., 1999; Sudderth et al., 2003; Isard, 2003; Neal et al., 2003). The best-known example of the latter technique is particle filtering, defined on Markov chains, in which each distribution is represented using a finite collection of samples, or particles. It is therefore only natural to consider generalizations of particle filtering applicable to more general graphs ("particle" belief propagation); several variations have thus far been proposed, corresponding to different choices for certain fundamental questions.

As an example, consider the question of how to represent the messages computed during inference using particles. Broadly speaking, one might consider two possible approaches: to draw a set of particles for each message in the graph (Arulampalam et al., 2002; Sudderth et al., 2003; Isard, 2003), or to create a set of particles for each variable, for example by drawing samples from the estimated posterior marginal (Koller et al., 1999). This decision is closely related to the choice of proposal distribution in particle filtering; indeed, choosing better proposal distributions from which to draw the samples, or moving the samples via Markov chain Monte Carlo (MCMC) to match subsequent observations, comprises a large part of modern work on particle filters (Thrun et al., 2000; van der Merwe et al., 2001; Doucet et al., 2001; Khan et al., 2005).

Either method can be made asymptotically consistent, i.e., will produce the correct answer in the limit as the number of samples becomes infinite.


However, consistency is a very weak condition—fundamentally, we are interested in the behavior of particle belief propagation for relatively small numbers of particles, ensuring computational efficiency. So far, little theory describes the finite sample behavior of these algorithms.

In this work, we give a convergence rate for the accuracy of a relatively generic PBP algorithm, most closely related to that described in Koller et al. (1999). Our convergence rate is O(1/√n), where n is the number of particles, independent of the domain size of the nodes of the graphical model. The convergence rate is reminiscent of, and has a proof similar to, convergence rates for learning algorithms derived from Chernoff bounds applied to IID samples.

2 Definitions and Notation

Let G be an undirected graph consisting of nodes V = {1, ..., k} and edges E, and let Γ_s denote the set of neighbors of node s in G, i.e., the set of nodes t such that {s, t} is an edge of G. In a probabilistic graphical model, each node s ∈ V is associated with a random variable X_s taking on values in some domain \mathcal{X}_s. We assume that each node s and edge {s, t} are associated with potential functions Ψ_s and Ψ_{s,t} respectively, and given these potential functions we define a probability distribution on assignments of values to nodes as

P(\vec{x}) = \frac{1}{Z} \left( \prod_{s} \Psi_s(\vec{x}_s) \right) \left( \prod_{\{s,t\} \in E} \Psi_{s,t}(\vec{x}_s, \vec{x}_t) \right)    (1)

Here \vec{x} is an assignment of values to all k variables, \vec{x}_s is the value assigned to X_s by \vec{x}, and Z is a scalar chosen to normalize the distribution P (also called the partition function). We consider the problem of computing marginal probabilities, defined by

P_s(x_s) = \sum_{\vec{x} : \vec{x}_s = x_s} P(\vec{x}).    (2)

Equation (1) defines a pairwise Markov random field model.
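To make (1)–(2) concrete, the following short sketch (ours, not part of the paper) builds a toy pairwise MRF and computes the marginals of Equation (2) by brute-force enumeration; the graph, potentials, and all names are assumptions chosen purely for illustration.

```python
import itertools
import numpy as np

def brute_force_marginals(domains, node_pot, edge_pot):
    """Compute P_s(x_s) of Eq. (2) by enumerating every joint assignment.

    domains:  {s: list of values for X_s}
    node_pot: {s: function x_s -> Psi_s(x_s)}
    edge_pot: {(s, t): function (x_s, x_t) -> Psi_{s,t}(x_s, x_t)}
    """
    nodes = sorted(domains)
    marginals = {s: {v: 0.0 for v in domains[s]} for s in nodes}
    Z = 0.0                                                   # partition function of Eq. (1)
    for joint in itertools.product(*(domains[s] for s in nodes)):
        x = dict(zip(nodes, joint))
        p = np.prod([node_pot[s](x[s]) for s in nodes])
        p *= np.prod([edge_pot[e](x[e[0]], x[e[1]]) for e in edge_pot])
        Z += p
        for s in nodes:
            marginals[s][x[s]] += p
    for s in nodes:
        for v in domains[s]:
            marginals[s][v] /= Z                              # normalize by Z
    return marginals

# Tiny 3-node chain 1 - 2 - 3 with binary variables (illustrative values).
domains = {1: [0, 1], 2: [0, 1], 3: [0, 1]}
node_pot = {s: (lambda xs: 1.5 if xs == 0 else 1.0) for s in domains}
agree = lambda a, b: 2.0 if a == b else 1.0                   # attractive pairwise potential
edge_pot = {(1, 2): agree, (2, 3): agree}
print(brute_force_marginals(domains, node_pot, edge_pot))
```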
Our results are also directly applicable to more general graphical model formulations such as Bayes' nets (Pearl, 1988) and factor graphs (Kschischang et al., 2001), but for notational convenience we restrict our development to the pairwise form (1).

3 Review of Belief Propagation

In the case where G is a tree and the sets \mathcal{X}_s are small, the marginal probabilities can be computed efficiently by belief propagation (Pearl, 1988). This is done by computing messages m_{t→s}, each of which is a function on the state space of the target node, s. These messages are defined recursively as

m_{t \to s}(x_s) = \sum_{x_t \in \mathcal{X}_t} \Psi_{t,s}(x_t, x_s)\, \Psi_t(x_t) \prod_{u \in \Gamma_t \setminus s} m_{u \to t}(x_t)    (3)

When G is a tree this recursion is well founded (loop-free) and Equation (3) uniquely determines the messages. We define the unnormalized belief function as

B_s(x_s) = \Psi_s(x_s) \prod_{t \in \Gamma_s} m_{t \to s}(x_s).    (4)

When G is a tree the belief function is proportional to the marginal probability P_s defined by (2). It is sometimes useful to define the "pre-message" M_{t→s} as

M_{t \to s}(x_t) = \Psi_t(x_t) \prod_{u \in \Gamma_t \setminus s} m_{u \to t}(x_t)    (5)

for x_t ∈ \mathcal{X}_t. Note that the pre-message M_{t→s} defines a weighting on the state space of the source node t, while the message m_{t→s} defines a weighting on the state space of the destination, s. We can then re-express (3)–(4) as

m_{t \to s}(x_s) = \sum_{x_t \in \mathcal{X}_t} \Psi_{t,s}(x_t, x_s)\, M_{t \to s}(x_t)

B_t(x_t) = M_{t \to s}(x_t)\, m_{s \to t}(x_t)

Although we develop our results for tree-structured graphs, it is common to apply belief propagation to graphs with cycles as well ("loopy" belief propagation). In this case the belief functions (4) will in general not equal the true marginals, but often provide good approximations in practice. We discuss the application of our results to particle-based versions of loopy BP at the end of Section 5.

For reasons of numerical stability, it is common to normalize each message m_{t→s} so that it has unit sum. However, such normalization of messages has no other effect on the (normalized) belief functions (4). Thus for conceptual simplicity in developing and analyzing particle belief propagation we avoid any explicit normalization of the messages; such normalization can be included in the algorithms in practice.

Additionally, for reasons of computational efficiency it is common to use the alternative expression

m_{t \to s}(x_s) = \sum_{x_t \in \mathcal{X}_t} \Psi_{t,s}(x_t, x_s)\, \frac{B_t(x_t)}{m_{s \to t}(x_t)}    (6)

when computing the messages. By storing and updating the belief values B_t(x_t) incrementally as incoming messages are re-computed, one can significantly reduce the number of operations required. Although our development of particle belief propagation uses the update form (3), this alternative formulation can be applied to improve its efficiency as well.
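For reference, here is a minimal illustrative implementation of the message recursion (3) and belief (4) for discrete variables on a small chain; this is our own sketch, with assumed potentials and function names rather than the authors' code.

```python
import numpy as np

# A 3-node chain 0 - 1 - 2 with 3 states per variable (all values assumed for illustration).
k, d = 3, 3
edges = [(0, 1), (1, 2)]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
node_pot = [np.array([2.0, 1.0, 0.5]) for _ in range(k)]      # Psi_s(x_s)
edge_pot = {e: np.eye(d) + 0.5 for e in edges}                # Psi_{s,t}, indexed [x_s, x_t]
edge_pot.update({(t, s): edge_pot[(s, t)].T for (s, t) in edges})

def bp_messages(sweeps=5):
    """Iterate Eq. (3): m_{t->s}(x_s) = sum_{x_t} Psi_{t,s} Psi_t prod_{u != s} m_{u->t}(x_t)."""
    msgs = {(t, s): np.ones(d) for t in range(k) for s in neighbors[t]}
    for _ in range(sweeps):                                   # a few sweeps reach the fixed point on a tree
        for (t, s) in list(msgs):
            pre = node_pot[t].copy()                          # the pre-message M_{t->s}(x_t) of Eq. (5)
            for u in neighbors[t]:
                if u != s:
                    pre *= msgs[(u, t)]
            msgs[(t, s)] = edge_pot[(t, s)].T @ pre           # sum over x_t
    return msgs

def beliefs(msgs):
    """Unnormalized beliefs of Eq. (4), then normalized to sum to one."""
    out = []
    for s in range(k):
        b = node_pot[s].copy()
        for t in neighbors[s]:
            b *= msgs[(t, s)]
        out.append(b / b.sum())
    return out

print(beliefs(bp_messages()))
```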


4 Particle Belief Propagation

We now consider the case where |\mathcal{X}_s| is too large to enumerate in practice, and define a generic particle (sample) based BP algorithm (PBP). This algorithm essentially corresponds to a non-iterative version of the method described in Koller et al. (1999).

4.1 Particle Belief Propagation Algorithm

PBP samples a set of particles x_s^{(1)}, ..., x_s^{(n)} with x_s^{(i)} ∈ \mathcal{X}_s at each node s of the network¹, drawn from a sampling distribution (or weighting) W_s(x_s) > 0 (corresponding to the proposal distribution in particle filtering). The selection of an appropriate sampling distribution is discussed in detail in Section 6.

¹It is also possible to sample a set of particles {x_{st}^{(i)}} for each pre-message M_{s→t} in the network from potentially different distributions W_{st}(x_s), for which our analysis remains essentially unchanged. However, for notational simplicity and to be able to apply the more computationally efficient message expression described in Section 3, we use a single distribution and sample set for each node.

First we note that (3) can be written as the following importance-sampling corrected expectation:

m_{t \to s}(x_s) = E_{x_t \sim W_t}\!\left[ \Psi_{s,t}(x_s, x_t)\, \frac{\Psi_t(x_t)}{W_t(x_t)} \prod_{u \in \Gamma_t \setminus s} m_{u \to t}(x_t) \right]    (7)

Given a sample x_t^{(1)}, ..., x_t^{(n)} of points drawn from W_t, we can estimate m_{t→s}(x_s^{(i)}) as

\hat{m}_{t \to s}^{(i)} = \frac{1}{n} \sum_{j=1}^{n} \Psi_{t,s}(x_t^{(j)}, x_s^{(i)})\, \frac{\Psi_t(x_t^{(j)})}{W_t(x_t^{(j)})} \prod_{u \in \Gamma_t \setminus s} \hat{m}_{u \to t}^{(j)}    (8)

Equation (8) represents a finite sample estimate for (7). Alternatively, (8) defines a belief propagation algorithm where messages are defined on particles rather than the entire set \mathcal{X}_s. As in classical belief propagation, for tree-structured graphs and fixed particle locations there is a unique set of messages satisfying (8). Equation (8) can also be applied for loopy graphs (again observing that message normalization can be conceptually ignored). In this simple version, the sample values x_s^{(i)} and weights W_s(x_s^{(i)}) remain unchanged as messages are updated.
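As a concrete reading of the update (8), the sketch below estimates one particle message for a toy continuous model. It is our own illustration: the Gaussian potentials, the sampling distribution W_t, and all names are assumptions, and the incoming message product is set to one as if t were a leaf.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200                                                        # particles per node

# Toy continuous potentials (assumed for illustration, not from the paper).
psi_node = lambda x: np.exp(-0.5 * x ** 2)                     # Psi_t(x_t)
psi_edge = lambda xt, xs: np.exp(-0.5 * (xt - xs) ** 2)        # Psi_{t,s}(x_t, x_s)

# Sampling distribution W_t: a broad Gaussian; dens_W is its density.
sample_W = lambda size: rng.normal(0.0, 2.0, size)
dens_W = lambda x: np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2.0 * np.pi))

x_t = sample_W(n)          # particles x_t^(j) at the source node t
x_s = sample_W(n)          # particles x_s^(i) at the destination node s
m_in = np.ones(n)          # product of incoming m_hat_{u->t}(x_t^(j)); equals 1 if t is a leaf

def pbp_message(x_s, x_t, m_in):
    """Eq. (8): average over source particles of Psi_{t,s} * Psi_t / W_t * incoming products."""
    w = psi_node(x_t) / dens_W(x_t) * m_in                     # per-particle importance weights
    K = psi_edge(x_t[None, :], x_s[:, None])                   # K[i, j] = Psi_{t,s}(x_t^j, x_s^i)
    return (K * w[None, :]).mean(axis=1)                       # one message value per particle x_s^(i)

m_hat = pbp_message(x_s, x_t, m_in)
print(m_hat[:5])
```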

4.2 Consistency of Particle BP

We now show that equation (8) is consistent—it agrees with (3) in the limit as n → ∞. For any finite collection of samples, define the particle domain \hat{\mathcal{X}}_s and the count c_s(x) for x ∈ \hat{\mathcal{X}}_s as

\hat{\mathcal{X}}_s = \{ x_s \in \mathcal{X}_s : \exists i,\; x_s^{(i)} = x_s \}

c_s(x_s) = |\{ i : x_s^{(i)} = x_s \}|

Equation (8) has the property that if x_s^{(i)} = x_s^{(i')} then \hat{m}_{t \to s}^{(i)} = \hat{m}_{t \to s}^{(i')}; thus we can rewrite (8) as

\hat{m}_{t \to s}(x_s) = \sum_{x_t \in \hat{\mathcal{X}}_t} \frac{c_t(x_t)}{n}\, \Psi_{t,s}(x_t, x_s)\, \Psi_t(x_t)\, \frac{1}{W_t(x_t)} \prod_{u \in \Gamma_t \setminus s} \hat{m}_{u \to t}(x_t)    (9)

for x_s ∈ \hat{\mathcal{X}}_s. Since we have assumed W_t(x_t) > 0, in the limit of an infinite sample \hat{\mathcal{X}}_t becomes all of \mathcal{X}_t and the ratio c_t(x_t)/n converges to W_t(x_t). So for sufficiently large samples the estimate (9) approaches the true message (3).

4.3 Connections to Non-parametric BP

Another popular technique for approximating belief propagation using sample-based messages is non-parametric belief propagation, or NBP (Sudderth et al., 2003). In NBP, each message is represented using a collection of samples, which are smoothed by a Gaussian kernel to ensure a well-defined product. Samples are drawn from the product of the incoming messages (the pre-message) and are propagated stochastically through Ψ_{s,t} to produce samples representing the new message from s to t. These samples are again smoothed to ensure a well-defined product, and the process is repeated.

The algorithm's most computationally expensive step is sampling from the product of messages; thus, the alternative expression (6) suggests an alternative approach in which one draws a single set of samples from the product of all messages (the belief) and weights these samples by the inverse of each incoming message (Ihler et al., 2005a); we refer to this procedure as belief-based sampling for NBP. The default approach, in which samples are drawn for each pre-message, we refer to as message-based sampling.

A key difference between NBP and PBP is that in NBP the incoming messages to each node do not share their collection of samples. In other words, in NBP node s draws the samples it will use to represent its message to t, while in PBP node s uses t's samples to represent its outgoing message. This difference sidesteps the need to smooth the sample set when taking products.
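For contrast with PBP, the following sketch shows the kind of kernel-smoothed, sample-based message representation that NBP relies on: a set of samples plus a Gaussian kernel yields a message that can be evaluated (and multiplied) anywhere. The bandwidth rule and all names are our own illustrative assumptions, not taken from Sudderth et al. (2003).

```python
import numpy as np

def kde_message(samples, bandwidth):
    """Return m(x): a Gaussian-kernel-smoothed message built from a sample set."""
    def m(x):
        x = np.atleast_1d(x)[:, None]                          # evaluation points as a column
        k = np.exp(-0.5 * ((x - samples[None, :]) / bandwidth) ** 2)
        return k.mean(axis=1) / (bandwidth * np.sqrt(2.0 * np.pi))
    return m

rng = np.random.default_rng(3)
samples = rng.normal(1.0, 0.5, 100)                            # samples representing one message
bw = samples.std() * samples.size ** (-1.0 / 5.0)              # a common KDE bandwidth rule of thumb
m = kde_message(samples, bw)
print(m(np.linspace(-1.0, 3.0, 5)))                            # the smoothed message, evaluable anywhere
```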


5 Finite Sample Analysis

Fundamentally, we are interested in particle-based approximations to belief propagation for their finite-sample behavior, i.e., we hope that a relatively small collection of samples will give rise to an accurate estimate of the beliefs. To analyze particle belief propagation's performance for finite numbers of samples, we apply a concentration bound on the estimated messages and beliefs. We use the shorthand x ∈ y(1 ± ǫ) to abbreviate the upper and lower bounds y(1 − ǫ) ≤ x ≤ y(1 + ǫ).

We begin by stating a variant of Bernstein's inequality. Consider n IID random variables {x_i} with mean \bar{x} and satisfying (with probability 1) 0 ≤ x_i ≤ R\bar{x}. Then, with probability at least (1 − δ) over the choice of values x_1, ..., x_n we have that

\frac{1}{n} \sum_{i=1}^{n} x_i \in \bar{x}\,(1 \pm \epsilon(R, n, \delta))    (10)

where

\epsilon(R, n, \delta) = \eta + \sqrt{\eta^2 + \frac{2R}{n} \ln(2/\delta)}, \qquad \eta = \frac{R \ln(2/\delta)}{3n}    (11)

Equation (11) can be derived by applying Bernstein's inequality to upper and lower intervals of size |\bar{x}|ǫ and observing that the variance of each x_i is bounded by R\bar{x}^2. To ensure that ǫ ≪ 1 (the range of primary interest) we require n ≫ R, in which case ǫ ≈ \sqrt{2 \ln(2/\delta) R / n}.

We now consider the following form of (7):

m_{t \to s}(x_s) = E_{x_t \sim W_t}\!\left[ \Psi_{s,t}(x_s, x_t)\, \frac{M_{t \to s}(x_t)}{W_t(x_t)} \right]    (12)

Given that we intend to apply (10) to (12), it is natural to define the constant

R_W = \max_{\{s,t\} \in E}\; \max_{x_s \in \mathcal{X}_s,\, x_t \in \mathcal{X}_t} \frac{\Psi_{s,t}(x_s, x_t)\, M_{t \to s}(x_t)}{W_t(x_t)\, m_{t \to s}(x_s)}

so that

\Psi_{s,t}(x_s, x_t)\, \frac{M_{t \to s}(x_t)}{W_t(x_t)} \;\le\; R_W\, m_{t \to s}(x_s) \;=\; R_W\, E_{x_t \sim W_t}\!\left[ \Psi_{s,t}(x_s, x_t)\, \frac{M_{t \to s}(x_t)}{W_t(x_t)} \right]

Note that the constant R_W depends on the choice of sampling distributions W_s; we discuss this relationship further in Section 6. We can now state our first main result.

Theorem 1. For a tree with k nodes, if we sample n particles at each node with n > k² R_W ln(kn/δ), and compute the message values defined by (8), then with probability at least 1 − δ over the choice of particles we have that the following holds simultaneously for all nodes s and all particles x_s^{(i)}:

\hat{B}_s(x_s^{(i)}) = \Psi_s(x_s^{(i)}) \prod_{t \in \Gamma_s} \hat{m}_{t \to s}^{(i)} \;\in\; B_s(x_s^{(i)}) \left( 1 \pm O\!\left( k \sqrt{\tfrac{1}{n} R_W \ln(kn/\delta)} \right) \right)    (13)

Proof. Let ǫ(R, n, δ) be as defined in (11). We apply a union bound over all 2(k − 1) messages and the n particles evaluated by each message to (10)–(11). Then, with probability at least (1 − δ) over the choice of particles at neighbors t, the following holds simultaneously for all directed edges (t → s) and particles x_s^{(i)}:

\frac{1}{n} \sum_{j=1}^{n} \Psi_{s,t}(x_s^{(i)}, x_t^{(j)})\, \frac{M_{t \to s}(x_t^{(j)})}{W_t(x_t^{(j)})} \;\in\; m_{t \to s}(x_s^{(i)}) \left( 1 \pm \epsilon\!\left(R_W, n, \tfrac{\delta}{2(k-1)n}\right) \right)    (14)

Now for each directed edge (t → s) we define k_{ts} to be the number of nodes connected to t (including t itself) by a path that does not go through s. Given (14), we prove the following upper and lower bounds:

\hat{m}_{t \to s}^{(i)} \;\ge\; m_{t \to s}(x_s^{(i)}) \left( 1 - k_{ts}\, \epsilon\!\left(R_W, n, \tfrac{\delta}{2(k-1)n}\right) \right)    (15)

\hat{m}_{t \to s}^{(i)} \;\le\; m_{t \to s}(x_s^{(i)})\, \exp\!\left( k_{ts}\, \epsilon\!\left(R_W, n, \tfrac{\delta}{2(k-1)n}\right) \right)    (16)

The proof of (15)–(16) proceeds by induction on k_{ts}. If k_{ts} = 1, so that t is a leaf node, then M_{t→s} = Ψ_t = \hat{M}_{t→s}, and thus the outgoing message \hat{m}_{t→s} is equal to the left-hand side of (14). In this case (15) is immediate and (16) follows from the fact that 1 + ǫ ≤ exp(ǫ).

For the inductive step we assume that (15) and (16) hold for the messages coming into t from all nodes other than s. This means that these incoming messages can be written as the true messages multiplied by correction factors bounded by (15) and (16). Applying (14) along with the facts that (1 − dǫ) ≤ (1 − ǫ)^d and exp(ǫ)^d = exp(dǫ), we prove the inductive step.
By invoking (15)–(16) on all incoming messages to s, and using a similar argument, we have that

\hat{B}_s^{(i)} \;\ge\; B_s(x_s^{(i)}) \left( 1 - (k-1)\, \epsilon\!\left(R_W, n, \tfrac{\delta}{2(k-1)n}\right) \right)

\hat{B}_s^{(i)} \;\le\; B_s(x_s^{(i)})\, \exp\!\left( (k-1)\, \epsilon\!\left(R_W, n, \tfrac{\delta}{2(k-1)n}\right) \right)

Finally, if we require n > k² R_W ln(kn/δ) we have ǫ(R_W, n, δ/(2(k−1)n)) ≤ O(1/k), which in turn implies that exp(kǫ) ≤ (1 + O(kǫ)). ∎

Theorem 1 argues that with high probability, the messages in particle belief propagation will be accurate at the particle locations themselves. Alternatively, however, we need not restrict the domain of our estimated messages and beliefs at node s to the sampled values {x_s^{(i)}}. The potential functions Ψ define functions valid at any x_s ∈ \mathcal{X}_s given samples at the neighboring nodes:

\tilde{m}_{t \to s}(x_s) = \frac{1}{n} \sum_{j=1}^{n} \Psi_{t,s}(x_t^{(j)}, x_s)\, \frac{\Psi_t(x_t^{(j)})}{W_t(x_t^{(j)})} \prod_{u \in \Gamma_t \setminus s} \hat{m}_{u \to t}^{(j)}

\tilde{B}_s(x_s) = \Psi_s(x_s) \prod_{t \in \Gamma_s} \tilde{m}_{t \to s}(x_s)    (17)

\tilde{P}_s(x_s) = \frac{1}{Z} \tilde{B}_s(x_s)

Note that here we have used the true normalizing constant Z (i.e., the normalizing constant for B_s(x_s)) in defining \tilde{P}_s(x_s). We can now state our second main result.

Theorem 2. Under the same conditions as Theorem 1, with probability at least 1 − δ′ over the choice of the particles we have for all nodes s that

\| P_s - \tilde{P}_s \|_1 \;\le\; O\!\left( \sqrt{ \frac{k^3 R_W \ln(k n R_W / \delta')}{n} } \right)    (18)

Proof. The proof of Theorem 1 can be modified to show that if one fixes s and x_s^{(i)} arbitrarily before drawing any samples, and then draws samples at all nodes other than s, then with probability at least 1 − δ over the draw of the sample at the other nodes we have the bound (13). Now consider a given ǫ > 0. By setting δ = exp(−Ω(nǫ² / (R_W k² ln(kn)))) in (13), we can set the width of the confidence interval to ǫ, yielding the following (where the probability P_S is over the draw of the samples at nodes other than s): for all x_s ∈ \mathcal{X}_s,

P_S\!\left[ \tilde{B}_s(x_s) \notin B_s(x_s)(1 \pm \epsilon) \right] \;\le\; \exp\!\left( -\Omega\!\left( \frac{n\epsilon^2}{R_W k^2 \ln kn} \right) \right)

We can then convert this probability bound into one on the two distributions' L1 distance; intuitively, if the two functions substantially disagree only on a set of small measure, they must also be close in an L1 sense. We begin by dividing the beliefs by Z to make the true belief a distribution, so that for all x_s ∈ \mathcal{X}_s,

P_S\!\left[ \tilde{P}_s(x_s) \notin P_s(x_s)(1 \pm \epsilon) \right] \;\le\; \exp\!\left( -\Omega\!\left( \frac{n\epsilon^2}{R_W k^2 \ln kn} \right) \right)

Taking the expected value of both sides with respect to x_s, we have

E_{x_s \sim P_s}\!\left[ P_S\!\left[ \tilde{P}_s(x_s) \notin P_s(x_s)(1 \pm \epsilon) \right] \right] \;\le\; \exp\!\left( -\Omega\!\left( \frac{n\epsilon^2}{R_W k^2 \ln kn} \right) \right)

Noting that, for binary z, P[z] = E[z], these two expectations commute and we can rewrite this as

E_S\!\left[ P_{x_s \sim P_s}\!\left[ \tilde{P}_s(x_s) \notin P_s(x_s)(1 \pm \epsilon) \right] \right] \;\le\; \exp\!\left( -\Omega\!\left( \frac{n\epsilon^2}{R_W k^2 \ln kn} \right) \right)

and applying Markov's inequality, we obtain

P_S\!\left[ P_{x_s \sim P_s}\!\left[ \tilde{P}_s(x_s) \notin P_s(x_s)(1 \pm \epsilon) \right] \ge \gamma \right] \;\le\; \frac{1}{\gamma} \exp\!\left( -\Omega\!\left( \frac{n\epsilon^2}{R_W k^2 \ln kn} \right) \right).

Now by taking ǫ to be O\!\left( k \sqrt{R_W \ln(kn/(\gamma\delta'))/n} \right), we can set the right hand side to be δ′. Then with probability at least 1 − δ′ over the choice of the sample points at nodes other than s we have

P_{x_s \sim P_s}\!\left[ \tilde{P}_s(x_s) \notin P_s(x_s)\left( 1 \pm O\!\left( k \sqrt{\tfrac{R_W}{n} \ln \tfrac{R_W kn}{\gamma\delta'}} \right) \right) \right] \;\le\; \gamma.

Because the constant R_W bounds the ratio of a message sample to its expected value (the true message), the product of incoming messages (of which there are at most k) is also bounded, so that \tilde{P}_s(x_s)/P_s(x_s) \le (R_W)^k. Thus

\| P_s - \tilde{P}_s \|_1 = E_{x_s \sim P_s}\!\left[ \left| 1 - \frac{\tilde{P}_s(x_s)}{P_s(x_s)} \right| \right] \;\le\; (1-\gamma)\, O\!\left( k \sqrt{\tfrac{R_W}{n} \ln \tfrac{R_W kn}{\gamma\delta'}} \right) + \gamma (R_W)^k.

Setting γ = 1/(n(R_W)^k) now gives the result for the single node s. To get the result simultaneously for all s we take a union bound over all k nodes (δ′ → δ′/k), which does not change the order of the bound. ∎
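To give a rough feel for the constants in Theorems 1 and 2, the small sketch below evaluates the Bernstein half-width of (11), as reconstructed above, and checks the sample-size condition of Theorem 1; the particular values of k, R_W, and δ are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

def epsilon(R, n, delta):
    """Half-width of Eq. (11): eta + sqrt(eta^2 + 2 (R/n) ln(2/delta)),
    with eta = R ln(2/delta) / (3 n), as reconstructed in the text above."""
    eta = R * np.log(2.0 / delta) / (3.0 * n)
    return eta + np.sqrt(eta ** 2 + 2.0 * (R / n) * np.log(2.0 / delta))

k, R_W, delta = 10, 5.0, 0.05                                  # illustrative values only
for n in [10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6]:
    ok = n > k ** 2 * R_W * np.log(k * n / delta)              # Theorem 1's requirement on n
    eps = epsilon(R_W, n, delta / (2 * (k - 1) * n))           # per-edge, per-particle confidence level
    print(f"n={n:>8d}  condition holds: {ok}  (k-1)*eps ~ {(k - 1) * eps:.3f}")
```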

260 Particle Belief Propagation

Although Theorems 1 and 2 are formulated for tree-structured graphs, they can also be applied (in a limited way) to loopy belief propagation. Loopy BP is analyzed in terms of its Bethe tree, a tree-structured "unrolling" of the graph to a depth equal to the number of iterations of loopy BP (Ihler et al., 2005b). This tree is then analyzed in the same way as before; although the random samples are correlated among nodes of the Bethe tree, the union bound still applies and the end result is unchanged. However, the potentially exponential growth of the number of nodes k in the Bethe tree causes additional complications, since our bounds depend polynomially on k.

6 Selecting Sampling Distributions and Resampling

The preceding section's analysis of finite sample accuracy is sensitive to the parameter R_W, which is itself sensitive to the choice of the sampling distributions W_s. Unfortunately, it appears difficult to optimize R_W over the choice of the sampling distributions W_s (or even compute its value) a priori. On the other hand, the following observations seem worth noting:

R_W = \max_{s,t,x_s,x_t} \frac{M_{t \to s}(x_t)\, \Psi_{t,s}(x_t, x_s)}{W_t(x_t)\, m_{t \to s}(x_s)}
    = \max_{s,t,x_s,x_t} \frac{P_{s,t}(x_s, x_t)}{W_t(x_t)\, P_s(x_s)}
    = \max_{s,t,x_s,x_t} \frac{P_{s,t}(x_t \mid x_s)}{W_t(x_t)}    (19)

So we want W_t to simultaneously match all possible conditional distributions on x_t given the value of a single neighboring node. The form of (19) suggests that, although not necessarily optimal in the sense of minimizing R_W, a good choice may be to sample from the marginal distribution: W_t(x_t) = P_t(x_t). This idea was originally proposed in Koller et al. (1999), based on an intuitive description of the issues.

Unfortunately, the true marginal P_t(x_t) is unavailable for sampling—indeed, this is the very quantity we are trying to compute. However, at any stage of BP we can use our current marginal estimate to construct a new sampling distribution for node t and draw a new set of particles {x_t^{(i)}}. This leads to an iterative algorithm which continues to improve its estimates as the sampling distributions become more accurately targeted. Unfortunately, such an iterative resampling process is significantly more difficult to analyze.

In Koller et al. (1999), the sampling distributions were constructed using a density estimation step (fitting mixtures of Gaussians). However, the fact that the belief estimate \tilde{B}_t(x_t) can be computed at any value of x_t allows us to use another approach, which has also been applied to particle filters with success (Doucet et al., 2001; Khan et al., 2005). By running a short MCMC simulation such as the Metropolis-Hastings algorithm, we can draw samples directly from \tilde{B}_t without it needing to be explicitly constructed or fitted. We simply apply the definition (17) to evaluate the acceptance probability of each step.
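A minimal sketch of this resampling step follows: a random-walk Metropolis-Hastings chain whose target is the (unnormalized) belief estimate \tilde{B}_t evaluated pointwise via (17). This is our own illustration; belief_estimate below is a self-contained stand-in for an evaluation of (17), and the initialization, proposal, and step size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def belief_estimate(x):
    """Stand-in for the unnormalized belief estimate B~_t(x_t) of Eq. (17); here an
    arbitrary two-component mixture so the example is self-contained."""
    return np.exp(-0.5 * (x - 1.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 2.0) ** 2)

def mh_resample(n_particles, n_steps=50, step=1.0):
    """Draw new particles approximately distributed according to the belief estimate."""
    x = rng.normal(0.0, 3.0, n_particles)                      # arbitrary initialization
    b = belief_estimate(x)
    for _ in range(n_steps):
        prop = x + step * rng.normal(size=n_particles)         # symmetric random-walk proposal
        b_prop = belief_estimate(prop)
        accept = rng.uniform(size=n_particles) < b_prop / b    # MH acceptance ratio
        x = np.where(accept, prop, x)
        b = np.where(accept, b_prop, b)
    return x

particles = mh_resample(500)
print(particles.mean(), particles.std())
```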
This approach manages to avoid any distributional assumptions or biases inherent in density estimation methods. One disadvantage, however, is that it can be difficult to assess convergence during MCMC sampling, and the MCMC chain must be run at each iteration of BP and at each node. On the other hand, many implementations of NBP also use MCMC steps at each iteration to draw samples from the message product (e.g., Sudderth et al., 2003), and thus have similar cost and convergence assessment issues.

7 Comparison to Previous Results

Our results describe the consistency and accuracy of particle-based representations of belief propagation. Considerable work has gone into analyzing particle representations on Markov chains (particle filtering); see for example Del Moral (2004) for details. Like our results, the convergence behavior of particle filters is also O(1/√n), but becomes independent of the number of nodes k as k → ∞.

Fundamentally, these results are based on the mixing properties of the conditional distributions. For example, Del Moral (2004) applies Dobrushin's contraction coefficient, which bounds the reduction in the total variation norm between two probability measures. The total variation is a natural distance for comparing distributions, essentially equivalent to the L1 norm (by Scheffé's theorem). These norms are also well-behaved with respect to sampling and are thus well-suited to analysis of particle filters and density estimation.

Unfortunately, these measures are less well-suited to dealing with the product operation of BP. In work analyzing the properties of BP, the norm of choice is Hilbert's projective metric, to which generalizations of Birkhoff's contraction coefficient are applied (Ihler et al., 2005b; Mooij and Kappen, 2007). Our results are stated in terms of multiplicative error, which is equivalent to the projective norm up to first order; see, e.g., Ihler et al. (2005b), Lemma 3.


Using the projective norm complicates the precise statement of the bound, but does not change its qualitative form: rather than the simple additive recursion outlined in Theorem 1, the projective norm would follow the recursion outlined in Ihler et al. (2005b), Sec. 5.4. For Markov chains, and assuming some degree of mixing as measured by Ihler et al. (2005b), Eq. (7), the errors decrease at an exponential rate, so that there exists a constant range k_0(ǫ) beyond which errors can be ignored without affecting the order of the bounds, and for k ≥ k_0(ǫ) this renders the bounds independent of k.

[Figure 1: "Sawtooth" example from the stereo dataset of Scharstein and Szeliski (2002). (a) Left-hand image; (b) ground truth disparity (depth) map.]

[Figure 2: Log-log plot of the median L1 error between the estimated beliefs \tilde{B}_s(x_s) and the true beliefs B_s(x_s) as a function of the number of samples n used in the particle representation. We show PBP under three sampling distributions W_s: the local potentials Ψ_s, the current belief estimates \tilde{B}_s(x_s) at each iteration, and the true beliefs B_s(x_s). All three decrease at a rate of n^{-1/2}. Also shown is NBP using message- and belief-based sampling; both have slightly higher error and appear to decrease at a slower rate corresponding to NBP's smoothing parameter.]

8 Experimental Results

To see how our theoretical results carry over into a practical setting, we evaluate the performance of particle BP on the problem of reconstructing depth (or equivalently pixel disparity) from stereo image pairs. Belief propagation was applied to the stereo vision problem in Sun et al. (2003), and since then a number of more sophisticated models have also used BP for inference (e.g., Sun et al., 2005; Klaus et al., 2006; Yang et al., 2006). These methods are quite successful; BP-based methods currently comprise half of the best ten algorithms² for stereo disparity estimation. We use the original model of Sun et al. (2003) and the "Sawtooth" image from Scharstein and Szeliski (2002) for our comparisons, shown in Figure 1.

²See http://vision.middlebury.edu/stereo/eval/.

Our experiments are not designed to showcase an application requiring particle BP, since there are already many such applications described in the literature (e.g., Coughlan and Ferreira, 2002; Sigal et al., 2004; Sudderth et al., 2004; Ihler et al., 2005a). Instead, our purpose is to assess the accuracy of particle BP compared to its deterministic counterpart. In the stereo model each x_s is univariate, allowing us to also create a high-quality discretized solution. We compare this solution to the estimated beliefs found via PBP and NBP with different sampling methods.

Figure 2 shows a log-log plot of the median L1 error between the true beliefs B_s(x_s) at each pixel and the estimated beliefs \tilde{B}_s(x_s) found via PBP and NBP using various sampling distributions. For PBP, we show three sampling distributions: drawing samples from the local potentials, W_s(x_s) = Ψ_s(x_s); drawing samples from the true beliefs, W_s(x_s) = B_s(x_s); and re-sampling at each iteration according to the currently estimated beliefs, W_s(x_s) = \tilde{B}_s(x_s), as described in Section 6. Note that the second option (sampling from the true beliefs) is not possible in general. Moreover, in our experiments, its performance is nearly identical to that of sampling from the estimated beliefs at each iteration. Sampling from the local potentials performed slightly less well. In all three cases, the errors decrease at a rate of 1/√n.

We also compare to two versions of NBP: sampling from the messages (message-based sampling), and sampling from the beliefs and reweighting to form messages (belief-based sampling). To enable a fair comparison with PBP, for NBP we evaluate \tilde{B}_s(x_s) as in (17) using samples drawn from the pre-message product \hat{M} at the neighbors of node s. Both message- and belief-based sampling performed nearly identically on this problem. We note two things about NBP's performance: first, that the error is slightly higher than that for PBP with belief-based sampling; and second, that the rate of decrease appears slower than 1/√n.

Both of these effects are likely due to the kernel-based smoothing performed on the samples when messages are constructed in NBP. This step is necessary to make NBP's message products well-defined, but biases the estimated messages to be smoother than the true messages. The Gaussian kernel's variance (smoothing parameter) used in our experiments decreases at a rate of n^{-2/5}, which visually matches the rate observed in NBP's L1 error in Figure 2. Thus, avoiding the smoothing step required by NBP appears to provide a measurable improvement.

9 Summary and Conclusions

In this paper we have described a generic algorithm for sample-based or particle belief propagation in systems of variables with large or continuous domains, and showed that the algorithm is consistent, i.e., approaches the true values of the message and belief functions as the number of samples grows large. We then demonstrated a convergence rate, showing that the beliefs obtained are accurate both at the particle locations themselves and in an L1 sense, at a rate of O(1/√n) where n is the number of particles. Finally, we illustrated the algorithm on a stereo vision data set, showing that the algorithm's behavior in practice corresponds to the theory, and comparing its performance across different sampling distributions and to two sampling approaches for NBP.

The relationship of the quantity R_W to the convergence rate highlights the importance of selecting a good sampling distribution (as in any Monte Carlo estimation process). Although it is difficult to select the optimal sampling distributions a priori, the form of (19) indicates that the quality of the sampling distribution can be evaluated as the inference process progresses, and that a new sampling distribution could be selected using the current marginal estimates for guidance. Although such "adaptive" choices for the sampling distribution are much more difficult to analyze, the form of R_W seems to support the notion of sampling from the current marginal estimates themselves; in our experiments this technique performed just as well as sampling from the true marginal distributions.

References

M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. SP, 50(2):174–188, Feb. 2002.

J. M. Coughlan and S. J. Ferreira. Finding deformable shapes using loopy belief propagation. In ECCV, 2002.

P. Del Moral. Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications. Springer-Verlag, New York, 2004.

A. Doucet, N. de Freitas, and N. Gordon, editors. Sequential Monte Carlo Methods in Practice. Springer-Verlag, New York, 2001.

W. T. Freeman, E. C. Pasztor, and O. T. Carmichael. Learning low-level vision. Int. J. Comp. Vis., 40(1):25–47, 2000.

A. Ihler, J. Fisher, R. Moses, and A. Willsky. Nonparametric belief propagation for self-calibration in sensor networks. IEEE JSAC, pages 809–819, Apr. 2005a.

A. Ihler, J. Fisher, and A. Willsky. Loopy belief propagation: Convergence and effects of message errors. J. Mach. Learn. Res., 6:905–936, May 2005b.

M. Isard. PAMPAS: Real-valued graphical models for computer vision. In CVPR, 2003.

Z. Khan, T. Balch, and F. Dellaert. MCMC-based particle filtering for tracking a variable number of interacting targets. IEEE Trans. PAMI, pages 1805–1819, Nov. 2005.

A. Klaus, M. Sormann, and K. Karner. Segment-based stereo matching using belief propagation and a self-adapting dissimilarity measure. In ICPR, 2006.

D. Koller, U. Lerner, and D. Angelov. A general algorithm for approximate inference and its application to hybrid Bayes nets. In UAI, pages 324–333, 1999.

F. Kschischang, B. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Trans. IT, 47(2):498–519, Feb. 2001.

J. M. Mooij and H. J. Kappen. Sufficient conditions for convergence of the sum-product algorithm. IEEE Trans. IT, 53(12):4422–4437, Dec. 2007.

R. Neal, M. Beal, and S. Roweis. Inferring state sequences for non-linear systems with embedded hidden Markov models. In NIPS, 2003.

J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, 1988.

D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comp. Vis., 47(1/2/3):7–42, Apr. 2002.

L. Sigal, S. Bhatia, S. Roth, M. J. Black, and M. Isard. Tracking loose-limbed people. In CVPR, 2004.

E. Sudderth, A. Ihler, W. Freeman, and A. Willsky. Nonparametric belief propagation. In CVPR, 2003.

E. Sudderth, M. Mandel, W. Freeman, and A. Willsky. Distributed occlusion reasoning for tracking with nonparametric belief propagation. In NIPS, 2004.

J. Sun, H. Y. Shum, and N. N. Zheng. Stereo matching using belief propagation. IEEE Trans. PAMI, 25(7):787–800, July 2003.

J. Sun, Y. Li, S. Kang, and H.-Y. Shum. Symmetric stereo matching for occlusion handling. In CVPR, 2005.

S. Thrun, D. Fox, and W. Burgard. Monte Carlo localization with mixture proposal distribution. In AAAI, 2000.

R. van der Merwe, N. de Freitas, A. Doucet, and E. Wan. The unscented particle filter. In NIPS, 2001.

Q. Yang, L. Wang, R. Yang, H. Stewénius, and D. Nistér. Stereo matching with color-weighted correlation, hierarchical belief propagation and occlusion handling. In CVPR, 2006.
