
Hybrid Loopy Belief Propagation

Changhe Yuan
Department of Computer Science and Engineering
Mississippi State University
Mississippi State, MS 39762
[email protected]

Marek J. Druzdzel
Decision Systems Laboratory
School of Information Sciences
University of Pittsburgh
Pittsburgh, PA 15260
[email protected]

Abstract

We propose an algorithm called Hybrid Loopy Belief Propagation (HLBP), which extends Loopy Belief Propagation (LBP) (Murphy et al., 1999) and Nonparametric Belief Propagation (NBP) (Sudderth et al., 2003) to deal with general hybrid Bayesian networks. The main idea is to represent the LBP messages with mixtures of Gaussians and to formulate their calculation as Monte Carlo integration problems. The new algorithm is general enough to deal with hybrid models that may represent linear or nonlinear equations and arbitrary probability distributions.

1 Introduction

Some real problems are more naturally modelled by hybrid Bayesian networks that contain mixtures of discrete and continuous variables. However, several factors make inference in hybrid models extremely hard. First, linear or nonlinear deterministic relations may exist in the models. Second, the models may contain arbitrary probability distributions. Third, the orderings among the discrete and continuous variables may be arbitrary. Since the general case is difficult, existing research often focuses on special instances of hybrid models, such as Conditional Linear Gaussians (CLG) (Lauritzen, 1992). However, one major assumption behind CLG is that discrete variables cannot have continuous parents. This limitation was later addressed by extending CLG with logistic and softmax functions (Lerner et al., 2001; Murphy, 1999). More recent research has begun to develop methodologies for more general non-Gaussian models, such as Mixtures of Truncated Exponentials (MTE) (Moral et al., 2001; Cobb and Shenoy, 2005), and the junction tree algorithm with approximate clique potentials (Koller et al., 1999). However, most of these approaches rely on the junction tree algorithm (Lauritzen and Spiegelhalter, 1988). As Lerner et al. (Lerner et al., 2001) pointed out, it is important to have alternative solutions in case junction tree-based methods are not feasible.

In this paper, we propose the Hybrid Loopy Belief Propagation (HLBP) algorithm, which extends the Loopy Belief Propagation (LBP) (Murphy et al., 1999) and Nonparametric Belief Propagation (NBP) (Sudderth et al., 2003) algorithms to deal with general hybrid Bayesian networks. The main idea is to represent LBP messages as mixtures of Gaussians (MGs) and to formulate their calculation as Monte Carlo integration problems. The extension is far from trivial due to the enormous complexity brought by deterministic equations and mixtures of discrete and continuous variables. Another advantage of the algorithm is that it approximates the true posterior probability distributions, unlike most existing approaches, which only produce their first two moments for CLG models.

2 Hybrid Loopy Belief Propagation

To extend LBP to hybrid Bayesian networks, we need to know how to calculate the LBP messages in hybrid models. There are two kinds of messages defined in LBP. Let X be a node in a hybrid model, let Y_j be X's children, and let U_i be X's parents. The message λ_X^{(t+1)}(u_i) that X sends to its parent U_i is given by

\[
\lambda_X^{(t+1)}(u_i) = \alpha \int_x \lambda_X(x) \prod_j \lambda_{Y_j}^{(t)}(x) \int_{u_k : k \neq i} P(x \mid u) \prod_{k \neq i} \pi_X^{(t)}(u_k) \, , \tag{1}
\]

and the message π_{Y_j}^{(t+1)}(x) that X sends to its child Y_j is defined as

\[
\pi_{Y_j}^{(t+1)}(x) = \alpha \, \lambda_X(x) \prod_{k \neq j} \lambda_{Y_k}^{(t)}(x) \int_u P(x \mid u) \prod_k \pi_X^{(t)}(u_k) \, , \tag{2}
\]

where λ_X(x) is a message that X sends to itself representing evidence. For simplicity, we use only integrations in the message definitions. It is evident that no closed-form solutions exist for the messages in general hybrid models. However, we observe that their calculation can be formulated as Monte Carlo integration problems.

First, let us look at the message π_{Y_j}^{(t+1)}(x) defined in Equation 2. We can rearrange the equation in the following way:

\[
\pi_{Y_j}^{(t+1)}(x) = \alpha \int_u \lambda_X(x) \prod_{k \neq j} \lambda_{Y_k}^{(t)}(x) \, \overbrace{P(x \mid u) \prod_k \pi_X^{(t)}(u_k)}^{P(x,u)} \, .
\]

Essentially, we put all the integrations outside of the other operations. Given the new formula, we realize that we have a joint probability distribution over x and the u_i's, and the task is to integrate all the u_i's out to obtain the marginal over x. Since P(x,u) can be naturally decomposed into P(x|u)P(u), the calculation of the message can be carried out using a Monte Carlo sampling technique called the composition method (Tanner, 1993). The idea is to first draw samples for each u_i from π_X^{(t)}(u_i), and then sample from the product λ_X(x) ∏_{k≠j} λ_{Y_k}^{(t)}(x) P(x|u). We will discuss how to take this product in the next section. For now, let us assume that the computation is possible. To make life even easier, we make further modifications and get

\[
\pi_{Y_j}^{(t+1)}(x) = \alpha \int_u \lambda_X(x) \prod_k \lambda_{Y_k}^{(t)}(x) \, P(x \mid u) \prod_k \pi_X^{(t)}(u_k) \, \Big/ \, \lambda_{Y_j}^{(t)}(x) \, .
\]

Now, for the messages sent from X to its different children, we can share most of the calculation. We first get samples for the u_i's, and then sample from the product λ_X(x) ∏_k λ_{Y_k}^{(t)}(x) P(x|u). For each different message π_{Y_j}^{(t+1)}(x), we use the same samples but assign them different weights 1/λ_{Y_j}^{(t)}(x).
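As an illustration of the composition scheme just described, the sketch below estimates a π message as a set of weighted samples. It is our own simplified variant, not the authors' code: instead of sampling x from the full product of the λ messages and P(x|u), it samples x from P(x|u) alone and folds the λ terms into the weight; the helper callables (parent samplers, a CPD sampler, and λ evaluators) are hypothetical.

```python
def estimate_pi_message(parent_samplers, sample_cpd, lambda_self, lambda_evals,
                        j, n_samples=1000):
    """Weighted-sample estimate of pi_{Y_j}^{(t+1)}(x) for child j (simplified sketch).

    parent_samplers: callables drawing u_i from pi_X^{(t)}(U_i), one per parent
    sample_cpd:      callable u -> x, drawing x from P(x | u)
    lambda_self:     callable evaluating lambda_X(x), the evidence self message
    lambda_evals:    callables evaluating lambda_{Y_k}^{(t)}(x), one per child
    """
    samples = []
    for _ in range(n_samples):
        u = [draw() for draw in parent_samplers]   # composition step over the parents
        x = sample_cpd(u)                          # then draw x given the sampled parents
        w = lambda_self(x)
        for k, lam in enumerate(lambda_evals):
            if k != j:                             # the recipient's own lambda is left out
                w *= lam(x)
        samples.append((x, w))
    return samples
```

Because the recipient's λ message only enters through the weight, the same set of draws can be reweighted to produce the messages for all children, which is exactly the sharing described above.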

Let us now consider how to calculate the message λ_X^{(t+1)}(u_i) defined in Equation 1. First, we rearrange the equation analogously:

\[
\lambda_X^{(t+1)}(u_i) = \alpha \int_{x,\, u_k : k \neq i} \lambda_X(x) \prod_j \lambda_{Y_j}^{(t)}(x) \, \overbrace{P(x \mid u) \prod_{k \neq i} \pi_X^{(t)}(u_k)}^{P(x,\, u_k : k \neq i \,\mid\, u_i)} \, . \tag{3}
\]

It turns out that here we are facing a quite different problem from the calculation of π_{Y_j}^{(t+1)}(x). Note that we now have P(x, u_k : k ≠ i | u_i), a joint distribution over x and the u_k (k ≠ i) conditional on u_i, so the whole expression is only a likelihood function of u_i, which is not guaranteed to be integrable. As in (Sudderth et al., 2003; Koller et al., 1999), we choose to restrict our attention to densities and assume that the ranges of all continuous variables are bounded (possibly large). This assumption only solves part of the problem. Another difficulty is that the composition method is no longer applicable here, because we have to draw samples for all parents u_i before we can evaluate P(x|u). We note, however, that for any fixed u_i, Equation 3 is an integration over x and the u_k, k ≠ i, and can be estimated using Monte Carlo methods. That means that we can estimate λ_X^{(t+1)}(u_i) up to a constant for any value u_i, although we do not know how to sample from it. This is exactly the situation where importance sampling becomes handy. We can estimate the message by drawing a set of importance samples as follows: sample u_i from a chosen importance function I(u_i), estimate λ_X^{(t+1)}(u_i) using Monte Carlo integration, and assign the ratio between the estimated value and I(u_i) as the weight for the sample. A simple choice for the importance function is the uniform distribution over the range of u_i, but we can improve the accuracy of the Monte Carlo integration by choosing a more informed importance function: the corresponding message λ_X^{(t)}(u_i) from the last iteration. Because of the iterative nature of LBP, messages usually keep improving over the iterations, so they are clearly better importance functions.

Note that the preceding discussion is quite general. The variables under consideration can be either discrete or continuous. We only need to know how to sample from or evaluate the conditional relations. Therefore, we can use any representation to model the case of discrete variables with continuous parents. For instance, we can use general softmax functions (Lerner et al., 2001) for the representation.

We now know how to use Monte Carlo integration methods to estimate the LBP messages, represented as sets of weighted samples. To complete the algorithm, we need to figure out how to propagate these messages. Belief propagation involves operations such as sampling, multiplication, and marginalization. Sampling from a message represented as a set of weighted samples is easy to do; we can use a resampling technique. However, multiplying two such messages is not straightforward. Therefore, we choose to use density estimation techniques to approximate each continuous message by a mixture of Gaussians (MG) (Sudderth et al., 2003). A K-component mixture of Gaussians has the following form:

\[
M(x) = \sum_{i=1}^{K} w_i \, N(x; \mu_i, \sigma_i) \, , \tag{4}
\]

where \sum_{i=1}^{K} w_i = 1. MGs have several nice properties. First, we can approximate any continuous distribution reasonably well using an MG. Second, the family is closed under multiplication. Last, we can estimate an MG from a set of weighted samples using a regularized version of the expectation maximization (EM) algorithm (Dempster et al., 1977; Koller et al., 1999).

Given the above discussion, we outline the final HLBP algorithm in Figure 1. We first initialize all the messages. Then, we iteratively recompute all the messages using the methods described above. In the end, we calculate the messages λ(x) and π(x) for each node X and use them to estimate the posterior probability distribution over X.

Algorithm: HLBP

1. Initialize the messages that evidence nodes send to themselves and their children as indicating messages with fixed values, and initialize all other messages to be uniform.

2. while (stopping criterion not satisfied)
      Recompute all the messages using Monte Carlo integration methods.
      Normalize discrete messages.
      Approximate all continuous messages using MGs.
   end while

3. Calculate the λ(x) and π(x) messages for each variable.

4. Calculate the posterior probability distributions for all the variables by sampling from the product of the λ(x) and π(x) messages.

Figure 1: The Hybrid Loopy Belief Propagation algorithm.
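Read as code, the outline in Figure 1 corresponds to the following top-level loop. This is a structural sketch only, under our own assumptions: the network interface, the Monte Carlo message estimator, the EM fitting routine, and the convergence test are hypothetical placeholders rather than parts of the paper.

```python
def hlbp(network, evidence, max_iter=20, n_components=2):
    """Structural sketch of the HLBP loop of Figure 1 (not the authors' implementation)."""
    msgs = initialize_messages(network, evidence)    # evidence messages fixed, others uniform
    for _ in range(max_iter):
        new_msgs = {}
        for edge in network.message_edges():
            samples = estimate_message(edge, msgs, evidence)       # Monte Carlo integration
            if edge.is_discrete():
                new_msgs[edge] = normalize(samples)                # discrete: renormalize
            else:
                new_msgs[edge] = fit_mg_em(samples, n_components)  # continuous: fit an MG
        if converged(msgs, new_msgs):                # stopping criterion
            break
        msgs = new_msgs
    # Combine lambda(x) and pi(x) for each node and sample from their product.
    return {x: posterior_from_messages(x, msgs) for x in network.nodes()}
```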
3 Product of Mixtures of Gaussians

One question that remains to be answered from the last section is how to sample from the product λ_X(x) ∏_{k≠j} λ_{Y_k}^{(t)}(x) P(x|u). We address this problem in this section. If P(x|u) is a continuous probability distribution, we can approximate P(x|u) with an MG. The problem then becomes how to compute the product of several MGs. The approximation is often a reasonable thing to do, because we can approximate any continuous probability distribution reasonably well with an MG. Even when such an approximation is poor, we can often approximate the product of P(x|u) and an MG directly with another MG. One example is that the product of a Gaussian distribution and a logistic function can be approximated well with another Gaussian distribution (Murphy, 1999).
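The closure of MGs under multiplication rests on the standard identity for the product of two Gaussian densities, which we state here for reference (written with variances σ₁², σ₂²; the paper leaves the variance versus standard deviation convention implicit):

\[
N(x; \mu_1, \sigma_1^2) \, N(x; \mu_2, \sigma_2^2)
  = N(\mu_1; \mu_2, \sigma_1^2 + \sigma_2^2) \; N(x; \hat{\mu}, \hat{\sigma}^2),
\qquad
\hat{\sigma}^2 = \Big(\frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}\Big)^{-1},
\quad
\hat{\mu} = \hat{\sigma}^2 \Big(\frac{\mu_1}{\sigma_1^2} + \frac{\mu_2}{\sigma_2^2}\Big).
\]

Applied pairwise, this shows that a product of K mixtures of Gaussians is again a mixture of Gaussians, with one component per combination of component labels, which is why sampling the labels, as described next, is sufficient.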

Suppose we have messages M1, M2, ..., MK, each represented as an MG. Sudderth et al. use a Gibbs sampling algorithm to address the problem (Sudderth et al., 2003). The shortcoming of the Gibbs sampler is its efficiency: we usually have to carry out several iterations of Gibbs sampling in order to get one sample.

Here we propose a more efficient method based on the chain rule. If we treat the selection of one component from each MG message as a random variable, which we call a label, our goal is to draw a sample from the joint probability distribution of all the labels, P(L1, L2, ..., LK). We note that this joint distribution can be factorized using the chain rule as follows:

\[
P(L_1, L_2, \ldots, L_K) = P(L_1) \prod_{i=2}^{K} P(L_i \mid L_1, \ldots, L_{i-1}) \, . \tag{5}
\]

Therefore, the idea is to sample the labels sequentially based on the prior or conditional probability distributions. Let w_{ij} be the weight of the jth component of the ith message, and let μ_{ij} and σ_{ij} be that component's parameters. We can sample from the product of the messages M1, ..., MK using the algorithm presented in Figure 2. The main idea is to calculate the conditional probability distributions cumulatively. Due to the Gaussian densities, the method has to correct the bias introduced during the sampling by assigning the samples weights. The method only needs to go over the messages once to obtain one importance sample and is therefore more efficient than the Gibbs sampler in (Sudderth et al., 2003). Empirical results show that the precision obtained by the importance sampler is comparable to that of the Gibbs sampler given a reasonable number of samples.

Algorithm: Sample from a product of MGs M1 × M2 × ... × MK.

1. Randomly pick a component, say j_1, from the first message M1 according to its weights w_{11}, ..., w_{1J_1}.

2. Initialize the cumulative parameters: μ*_1 ← μ_{1j_1}; σ*_1 ← σ_{1j_1}.

3. i ← 2; wImportance ← w_{1j_1}.

4. while (i ≤ K)
      Compute new parameters for each component k of the ith message:
         σ̂*_{ik} ← ((σ*_{i−1})^{−1} + (σ_{ik})^{−1})^{−1},
         μ̂*_{ik} ← (μ_{ik}/σ_{ik} + μ*_{i−1}/σ*_{i−1}) σ̂*_{ik}.
      Compute new weights for the ith message (with any x):
         ŵ_{ik} = w_{ik} ŵ_{i−1,j_{i−1}} N(x; μ*_{i−1}, σ*_{i−1}) N(x; μ_{ik}, σ_{ik}) / N(x; μ̂*_{ik}, σ̂*_{ik}).
      Calculate the normalized weights w̄_{ik}.
      Randomly pick a component, say j_i, from the ith message using the normalized weights.
      μ*_i ← μ̂*_{ij_i}; σ*_i ← σ̂*_{ij_i}.
      wImportance ← wImportance × w̄_{ij_i};
      i ← i + 1.
   end while

5. Sample from the Gaussian with mean μ*_K and variance σ*_K.

6. Assign the sample the weight ŵ_{Kj_K}/wImportance.

Figure 2: Sampling from a product of MGs.
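For concreteness, here is a compact re-derivation of the label-by-label importance sampler in Python. It follows the spirit of Figure 2 but is not a line-by-line transcription: components are scored with the exact Gaussian convolution constant N(μ*; μ_{ik}, σ*² + σ_{ik}²) instead of an evaluation at an arbitrary x, parameters are treated explicitly as variances, and the returned weight is the self-normalized importance weight.

```python
import math
import random

def gauss_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def sample_mg_product(messages, rng=random):
    """Draw one weighted sample from the product M1(x) * ... * MK(x).

    messages: list of mixtures, each a list of (weight, mean, variance) triples.
    Returns (x, w); the weights are meant to be used in a self-normalized way.
    """
    # Pick a component of the first message by its mixture weight.
    first = messages[0]
    j = rng.choices(range(len(first)), weights=[w for (w, _, _) in first])[0]
    _, mu_star, var_star = first[j]
    weight = 1.0                        # accumulates the importance correction

    for msg in messages[1:]:
        # Score each component against the running Gaussian; the exact factor for
        # multiplying two Gaussians is the convolution constant below.
        scores = [w * gauss_pdf(mu_star, mu, var_star + var) for (w, mu, var) in msg]
        weight *= sum(scores)           # ratio of unnormalized mass to the proposal
        k = rng.choices(range(len(msg)), weights=scores)[0]
        _, mu_k, var_k = msg[k]
        # Multiply the running Gaussian by the chosen component (cumulative update).
        var_new = 1.0 / (1.0 / var_star + 1.0 / var_k)
        mu_star = var_new * (mu_star / var_star + mu_k / var_k)
        var_star = var_new

    x = rng.gauss(mu_star, math.sqrt(var_star))
    return x, weight
```

One pass over the messages yields one weighted sample, and the resulting weighted samples can be handed directly to the EM step that fits them back into an MG.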
4 Belief Propagation with Evidence

Special care is needed for belief propagation with evidence and deterministic relations. In our preceding discussion, we approximate P(x|u) with an MG when P(x|u) is a continuous probability distribution. That is not the case when P(x|u) is deterministic or when X is observed. We discuss the following scenarios separately.

Deterministic relation without evidence: We simply evaluate P(x|u) to get the value x as a sample. Because we did not take the λ messages into account, we need to correct our bias using weights. For the message π_{Y_i}(x) sent from X to its child Y_i, we take x as the sample and assign it the weight λ_X(x) ∏_{k≠i} λ_{Y_k}^{(t)}(x). For the message λ_X(u_i) sent from X to its parent U_i, we take the value u_i as a sample for U_i and assign it the weight λ_X(x) ∏_k λ_{Y_k}^{(t)}(x) / λ_X^{(t)}(u_i).
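The sketch below illustrates the weighting just described for the λ message of a deterministic node. It is our own reading of the formula, with hypothetical helper callables: we assume the parent of interest is drawn from the previous-iteration message λ_X^{(t)}(U_i), which is then divided out, while the remaining parents are drawn from their π messages.

```python
def deterministic_lambda_sample(i, f, draw_pi, draw_prev_lambda, prev_lambda_pdf,
                                lambda_evals, lambda_self=lambda x: 1.0):
    """One weighted sample for lambda_X(u_i) when X is a deterministic node (sketch).

    i:                index of the parent the message is sent to
    f:                deterministic function u -> x implementing P(x | u)
    draw_pi:          callables drawing u_k from pi_X^{(t)}(U_k), one per parent
    draw_prev_lambda: draws u_i from the previous-iteration message lambda_X^{(t)}(U_i)
    prev_lambda_pdf:  evaluates lambda_X^{(t)}(u_i), the importance function divided out
    lambda_evals:     callables evaluating lambda_{Y_k}^{(t)}(x), one per child
    lambda_self:      the self message lambda_X(x); identically 1 if X is unobserved
    """
    u = [draw() for draw in draw_pi]     # draw every parent from its pi message...
    u[i] = draw_prev_lambda()            # ...except U_i, drawn from the importance function
    x = f(u)                             # the deterministic relation fixes x
    w = lambda_self(x)
    for lam in lambda_evals:             # every child's lambda contributes to this message
        w *= lam(x)
    w /= prev_lambda_pdf(u[i])           # divide out the importance function, as in the text
    return u[i], w
```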

Stochastic relation with evidence: The messages π_{Y_j}(x) sent from an evidence node X to its children are always indicating messages with fixed values. The messages λ_{Y_j}(x) sent from the children to X have no influence on X, so we need not calculate them. We only need to update the messages λ_X(u_i), for which we take the value u_i as the sample and assign it the weight λ_X(e) ∏_k λ_{Y_k}^{(t)}(e) / λ_X^{(t)}(u_i), where e is the observed value of X.

Deterministic relation with evidence: This case is the most difficult. To illustrate it more clearly, we use a simple hybrid Bayesian network with one discrete node A and two continuous nodes B and C (see Figure 3).

Figure 3: A simple hybrid Bayesian network with parents A and B and child C. A is discrete with P(a) = 0.7 and P(¬a) = 0.3; B follows N(B; 1, 1); C is deterministic, with C = B when A = a and C = 2B when A = ¬a.

Example 1. Let C be observed at state 2.0. Given the evidence, there are only two possible values for B: 2.0 when A = a and 1.0 when A = ¬a. We need to calculate the messages λ_C(a) and λ_C(b). If we follow the routine and sample from A and B first, it is extremely unlikely for us to hit a feasible sample; almost all the samples that we get would have weight 0.0. Clearly, we need a better way to do this.

First, let us consider how to calculate the message λ_C(a) sent from C to A. Suppose we choose the uniform distribution as the importance function; we first randomly pick a state for A. After the state of A is fixed, we note that we can solve P(C|A,B) for the state of B. Therefore, we need not sample B from π_C(b), in this case N(B; 1, 1). We then assign the density N(B; 1, 1), evaluated at the solved value of B, as the weight for the sample. Since A's two states are equally likely under the importance function, the message λ_C(A) would be proportional to {N(2; 1, 1), N(1; 1, 1)}.

For the message λ_C(b) sent from C to B, since we know that B can only take two values, we could choose a distribution that puts equal probabilities on these two values as the importance function, from which we sample B. The state of B then also determines A as follows: when B = 2, we have A = a; when B = 1, we have A = ¬a. We then assign weight 0.7 to the sample if B = 2 and 0.3 if B = 1. However, the magic of knowing the feasible values of B in advance is impossible in practice. Instead, we first sample A from π_C(A), in this case {0.7, 0.3}, given which we can solve P(C|A,B) for B and assign each sample weight 1.0. So λ_C(b) has probabilities proportional to {0.7, 0.3} on the two values 2.0 and 1.0.

In general, in order to calculate the λ messages sent out from a deterministic node with evidence, we need to sample from all parents except one, and then solve P(x|u) for that parent. There are several issues here. First, since we want to use the values of the other parents to solve for the chosen parent, we need an equation solver. We used an implementation of Newton's method for solving nonlinear sets of equations (Kelley, 2003). However, not all equations are solvable by this equation solver, or by any equation solver for that matter. We may want to choose the parent that is easiest to solve for; this can be tested by means of a preprocessing step. In more difficult cases, we have to resort to the user's help and ask, at the model building stage, which parent to solve for, or even for manually specified inverse functions. When there are multiple choices, one heuristic that we find helpful is to choose the continuous parent with the largest variance.
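To make the parent-solving step concrete, the sketch below inverts a deterministic relation numerically with a few Newton iterations. It is our own illustration, not the paper's solver: the derivative is approximated by central finite differences, and the example relation mirrors Example 1.

```python
def solve_parent(f, target, x0=0.0, tol=1e-9, max_iter=50):
    """Solve f(b) = target for b with Newton's method and a finite-difference derivative."""
    b, h = x0, 1e-6
    for _ in range(max_iter):
        g = f(b) - target
        if abs(g) < tol:
            break
        dg = (f(b + h) - f(b - h)) / (2.0 * h)    # numerical derivative of f at b
        if dg == 0.0:
            raise ValueError("derivative vanished; choose another parent to solve for")
        b -= g / dg                                # Newton update
    return b

# Example 1: C = B when A = a, C = 2 * B when A = not-a, with C observed at 2.0.
c_obs = 2.0
for state, relation in (("a", lambda b: b), ("not-a", lambda b: 2.0 * b)):
    print(state, "->", solve_parent(relation, c_obs, x0=1.0))   # 2.0 and 1.0, respectively
```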

5 Lazy LBP

We can see that HLBP involves repeated density estimation and Monte Carlo integration, which are both computationally intense. Efficiency naturally becomes a concern for the algorithm. To improve its efficiency, we propose a technique called Lazy LBP, which is also applicable to other extensions of LBP. The technique serves as a summary that contains both our original findings and some commonly used optimization methods.

After evidence is introduced into the network, we can pre-propagate the evidence to reduce computation in HLBP. First, we can plug any evidence into the conditional relations of its children, so we need not calculate the π messages from the evidence to its children. We need not calculate the λ messages from its children to the evidence either. Secondly, evidence may determine the values of its neighbors because of deterministic relations, in which case we can evaluate the deterministic relations in advance so that we need not calculate messages between them.

Furthermore, from the definitions of the LBP messages, we can easily see that we do not have to recompute the messages all the time. For example, the λ(x) messages from the children of a node with no evidence as a descendant are always uniform. Also, a message need not be updated if the sender of the message has received no new messages from neighbors other than the recipient.

Based on Equation 1, λ messages should be updated if the incoming π messages change. However, we have the following result.

Theorem 1. The λ messages sent from a non-evidence node to its parents remain uniform before the node receives any non-uniform messages from its children, even though there are new π messages coming from the parents.

Proof. Since there are no non-uniform λ messages coming in, Equation 1 simplifies to

\[
\lambda_X^{(t+1)}(u_i)
  = \alpha \int_{x,\, u_k : k \neq i} P(x \mid u) \prod_{k \neq i} \pi_X^{(t)}(u_k)
  = \alpha \int_{x,\, u_k : k \neq i} P(x,\, u_k : k \neq i \mid u_i)
  = \alpha \, .
\]

Finally, for HLBP, we may be able to calculate some messages exactly. For example, suppose a discrete node has only discrete parents. We can always calculate the messages sent from this node to its neighbors exactly. In this case, we should avoid using Monte Carlo sampling.
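The update rules above amount to a simple predicate deciding whether a given message must be recomputed in an iteration. The following is our own bookkeeping sketch (the node and message abstractions are hypothetical), combining the skip rules and Theorem 1.

```python
def needs_update(sender, recipient, received_new, has_nonuniform_child_lambda,
                 has_evidence_descendant):
    """Decide whether the message sender -> recipient must be recomputed.

    received_new: the set of neighbors from which `sender` has received a new
                  message since this message was last computed.
    """
    # Skip if the sender heard nothing new from anyone other than the recipient.
    if not (received_new - {recipient}):
        return False
    sending_lambda = recipient in sender.parents
    # Lambda messages from a node with no evidence among its descendants stay uniform.
    if sending_lambda and not has_evidence_descendant(sender):
        return False
    # Theorem 1: a non-evidence node's lambda messages stay uniform until it has
    # received some non-uniform message from its children, regardless of new pi messages.
    if sending_lambda and not sender.is_evidence and not has_nonuniform_child_lambda(sender):
        return False
    return True
```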
6 Experimental Results

We tested the HLBP algorithm on two benchmark hybrid Bayesian networks: the emission network (Lauritzen, 1992) and its extension, the augmented emission network (Lerner et al., 2001), shown in Figure 4(a). Note that HLBP is applicable to more general hybrid models; we chose these networks only for comparison purposes. To evaluate how well HLBP performs, we discretized the ranges of the continuous variables into 50 intervals and then calculated the Hellinger's distance (Kokolakis and Nanopoulos, 2001) between the results of HLBP and the exact solutions obtained by a massive amount of computation (likelihood weighting with 100M samples) as the error of HLBP. All reported results are averages over 50 runs of the experiments.

6.1 Parameter Selection

HLBP has several tunable parameters: the number of samples for estimating messages (number of message samples) and the number of samples for the integration in Equation 3 (number of integration samples). The most dramatic influence on precision comes from the number of message samples, as shown in Figure 4(b). Counterintuitively, the number of integration samples does not have as big an impact as one might think (see Figure 4(c), with 1K message samples). The reason, we believe, is that when we draw many samples for the messages, the precision of each individual sample becomes less critical. In our experiments, we set the number of message samples to 1,000 and the number of integration samples to 12. For the EM algorithm for estimating MGs, we typically set the regularization constant for preventing overfitting to 0.8, the stopping likelihood threshold to 0.001, and the number of components in the mixtures of Gaussians to 2.

We also compared the performance of two samplers for the product of MGs: the Gibbs sampler (Sudderth et al., 2003) and the importance sampler of Section 3. As we can see from Figure 4(b,c), when the number of message samples is small, the Gibbs sampler has a slight advantage over the importance sampler. As the number of message samples increases, the difference becomes negligible. Since the importance sampler is much more efficient, we use it in all our other experiments.
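For reference, the error measure used above can be computed as follows for two distributions discretized onto the same bins. This is a generic sketch of the standard Hellinger distance, not code from the paper.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions given on the same bins."""
    assert len(p) == len(q)
    sp, sq = float(sum(p)), float(sum(q))    # normalize both histograms to sum to one
    return math.sqrt(0.5 * sum((math.sqrt(a / sp) - math.sqrt(b / sq)) ** 2
                               for a, b in zip(p, q)))

# Example: a rough 3-bin estimate against an exact 3-bin reference.
print(hellinger([0.2, 0.5, 0.3], [0.25, 0.45, 0.3]))
```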

[Figure 4 appears here: (a) the emission and augmented emission network structures; (b,c) Hellinger's distance versus the number of message samples (10·2^(i−1)) and the number of message integration samples (2^i), for the importance and Gibbs samplers; (d) the estimated posterior, exact posterior, and normal approximation of DustEmission; (e–h) Hellinger's distance and running time (in seconds) versus propagation length for HLBP and Lazy HLBP. See the caption below.]

Figure 4: (a) Emission network (without dashed nodes) and augmented emission network (with dashed nodes). (b,c) The influence of the number of message samples and the number of message integration samples on the precision of HLBP on the augmented emission network, with CO2Sensor and DustSensor both observed to be true and Penetrability observed to be 0.5. (d) Posterior probability distribution of DustEmission when observing CO2Emission to be −1.3, Penetrability to be 0.5, and WasteType to be 1 in the emission network. (e,f,g,h) Results of HLBP and Lazy HLBP: (e) error on emission, (f) running time on emission, (g) error on augmented emission, (h) running time on augmented emission.

6.2 Results on Emission Networks

We note that mean and variance alone provide only limited information about the actual posterior probability distributions. Figure 4(d) shows the posterior probability distribution of the node DustEmission when observing CO2Emission at −1.3, Penetrability at 0.5, and WasteType at 1. We also plot in the same figure the corresponding normal approximation with mean 3.77 and variance 1.74. We see that the normal approximation does not reflect the true posterior: while the actual posterior distribution has a multimodal shape, the normal approximation does not tell us where the mass really is. We also report the estimated posterior probability distribution of DustEmission produced by HLBP. HLBP seemed able to estimate the shape of the actual distribution very accurately.

In Figures 4(e,f), we plot the error curves and running times of HLBP and Lazy HLBP (HLBP enhanced by Lazy LBP) as a function of the propagation length. We can see that HLBP needs only several steps to converge. Furthermore, HLBP achieves better precision than its lazy version, but Lazy HLBP is much more efficient than HLBP. Theoretically, Lazy LBP should not affect the results of HLBP but only improve its efficiency if the messages are calculated exactly. However, we use importance sampling to estimate the messages; since we use the messages from the last iteration as the importance functions, additional iterations help improve these functions.

We also tested the HLBP algorithm on the augmented emission network (Lerner et al., 2001), with CO2Sensor and DustSensor both observed to be true and Penetrability observed to be 0.5, and report the results in Figures 4(g,h). We again observed that not many iterations are needed for HLBP to converge. In this case, Lazy HLBP provides comparable results while improving the efficiency of HLBP.

7 Conclusion

The contribution of this paper is two-fold. First, we propose the Hybrid Loopy Belief Propagation (HLBP) algorithm. The algorithm is general enough to deal with hybrid Bayesian networks that contain mixtures of discrete and continuous variables, may represent linear or nonlinear equations and arbitrary probability distributions, and naturally accommodates the scenario where discrete variables have continuous parents. Another advantage is that it approximates the true posterior distributions. Second, we propose an importance sampler to sample from the product of MGs. Its accuracy is comparable to that of the Gibbs sampler in (Sudderth et al., 2003), but it is much more efficient given the same number of samples. We anticipate that, just like LBP, HLBP will work well for many practical models and can serve as a promising approximate method for hybrid Bayesian networks.

Acknowledgements

This research was supported by the Air Force Office of Scientific Research grants F49620–03–1–0187 and FA9550–06–1–0243 and by Intel Research. We thank the anonymous reviewers for several insightful comments that led to improvements in the presentation of the paper. All experimental data have been obtained using SMILE, a Bayesian inference engine developed at the Decision Systems Laboratory and available at http://genie.sis.pitt.edu/.

References

B. R. Cobb and P. P. Shenoy. 2005. Hybrid Bayesian networks with linear deterministic variables. In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI-05), pages 136-144. AUAI Press, Corvallis, Oregon.

A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38.

C. T. Kelley. 2003. Solving Nonlinear Equations with Newton's Method.

G. Kokolakis and P. H. Nanopoulos. 2001. Bayesian multivariate micro-aggregation under the Hellinger's distance criterion. Research in Official Statistics, 4(1):117-126.

D. Koller, U. Lerner, and D. Angelov. 1999. A general algorithm for approximate inference and its application to hybrid Bayes nets. In Proceedings of the Fifteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-99), pages 324-333. Morgan Kaufmann Publishers Inc.

S. L. Lauritzen and D. J. Spiegelhalter. 1988. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B (Methodological), 50(2):157-224.

S. L. Lauritzen. 1992. Propagation of probabilities, means and variances in mixed graphical association models. Journal of the American Statistical Association, 87:1098-1108.

U. Lerner, E. Segal, and D. Koller. 2001. Exact inference in networks with discrete children of continuous parents. In Proceedings of the Seventeenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-01), pages 319-328. Morgan Kaufmann Publishers Inc.

S. Moral, R. Rumi, and A. Salmeron. 2001. Mixtures of truncated exponentials in hybrid Bayesian networks. In Proceedings of the Fifth European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty (ECSQARU-01), pages 156-167.

K. Murphy, Y. Weiss, and M. Jordan. 1999. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Fifteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-99), pages 467-475, San Francisco, CA. Morgan Kaufmann Publishers.

K. P. Murphy. 1999. A variational approximation for Bayesian networks with discrete and continuous latent variables. In Proceedings of the Fifteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-99), pages 457-466. Morgan Kaufmann Publishers Inc.

E. Sudderth, A. Ihler, W. Freeman, and A. Willsky. 2003. Nonparametric belief propagation. In Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR-03).

M. Tanner. 1993. Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions. Springer, New York.
