
Refractor Importance Sampling

Haohai Yu and Robert A. van Engelen
Department of Computer Science, Florida State University
Tallahassee, FL 32306-4530 USA
{hyu,engelen}@cs.fsu.edu

Abstract

In this paper we introduce Refractor Importance Sampling (RIS), an improvement to reduce error variance in BN importance sampling propagation under evidential reasoning. We prove the existence of a collection of importance functions that are close to the optimal importance function under evidential reasoning. Based on this theoretic result we derive the RIS algorithm. RIS approaches the optimal importance function by applying localized arc changes to minimize the divergence between the evidence-adjusted importance function and the optimal importance function. The validity and performance of RIS is empirically tested with a large set of synthetic Bayesian networks and two real-world networks.

1 Introduction

The Bayesian Network (BN) [Pearl, 1988] formalism is one of the dominant representations for modeling uncertainty in intelligent systems [Neapolitan, 1990, Russell and Norvig, 1995]. A BN is a probabilistic graphical model of a joint probability distribution over a set of statistical variables. Bayesian inference on a BN answers probabilistic queries about the variables and their influence relationships. The posterior probability distribution is computed using belief updating methods [Pearl, 1988, Guo and Hsu, 2002]. Exact inference is NP-hard [Cooper, 1990]. Thus, exact methods only admit relatively small networks or simple network configurations in the worst case. Approximations are also NP-hard [Dagum and Luby, 1993]. However, approximate inference methods have anytime [Garvey and Lesser, 1994] and/or anywhere [Santos et al., 1995] properties that make these methods more attractive compared to exact methods.

Stochastic simulation algorithms, also called stochastic sampling or Monte Carlo (MC) algorithms, form one of the most prominent subclasses of approximate inference algorithms, of which Logic Sampling [Henrion, 1988] was the first and simplest sampling algorithm. Likelihood weighting [Fung and Chang, 1989] was designed to overcome the poor performance of logic sampling under evidential reasoning with unlikely evidence. Markov Chain Monte Carlo (MCMC) forms another important group of stochastic sampling algorithms. Examples in this group are Gibbs sampling, Metropolis sampling, and hybrid-MC sampling [Geman and Geman, 1984, Gilks et al., 1996, MacKay, 1998, Pearl, 1987, Chavez and Cooper, 1990]. Stratified sampling [Bouckaert, 1994], hypercube sampling [Cheng and Druzdzel, 2000c], and quasi-MC methods [Cheng and Druzdzel, 2000b] generate random samples from uniform distributions using various methods to improve sampling results. The importance sampling methods [Rubinstein, 1981] are widely used in Bayesian inference. Self Importance Sampling (SIS) [Shachter and Peot, 1990] and Adaptive Importance Sampling (AIS-BN) [Cheng and Druzdzel, 2000a] are among the most effective algorithms.

In this paper we prove that the importance functions of an evidence-updated BN can only approach the optimal importance function when the BN graph structure is modified according to the observed evidence. This implies the existence of a collection of importance functions with minimum divergence to the optimal importance function under evidential reasoning. Based on this result we derive our Refractor Importance Sampling (RIS) class of algorithms. In contrast to the AIS-BN and SIS methods, RIS removes the lower bound that prevents the updated importance function from approaching the optimal importance function. This is achieved by a graphical structure "refractor", consisting of a localized network structure change that minimizes the divergence between the evidence-adjusted importance function and the optimal importance function.

The remainder of this paper is organized as follows. Section 2 proves the existence of a lower bound on the divergence to the optimal importance function under evidential reasoning with a BN. The lower bound is used to derive the class of RIS algorithms introduced in Section 3. Section 4 empirically verifies the properties of the RIS algorithms on a large set of synthetic networks and two real-world networks, and compares the results to other importance sampling algorithms. Finally, Section 5 summarizes our conclusions and describes our future work.

2 Importance Function Divergence

In this section we first give BN definitions and briefly review importance sampling. We then give a KL-divergence lower bound for importance sampling error variance. We prove the existence of a collection of importance functions that approach the optimal importance function by adjusting both the quantitative and qualitative components of a BN under dynamic updating with evidence.

2.1 Definitions

The following definitions and notations are used.

Def. 1 A Bayesian network BN = (G, Pr) is a DAG G = (V, A) with vertices V and arcs A, A ⊆ V × V. Pr is the joint probability distribution over the discrete random variables (vertices) V defined by Pr(V) = ∏_{V∈V} Pr(V | π(V)). The set of parents of a vertex V is π(V). The conditional probability tables (CPT) of the BN assign values to Pr(V | π(V)) for all V ∈ V.

The graph G induces the d-separation criterion [Pearl, 1988], denoted by ⟨X, Y | Z⟩, which implies that X and Y are conditionally independent in Pr given Z, with X, Y, Z ⊆ V.

Def. 2 Let BN = (G, Pr) be a Bayesian network.

• The combined parent set of X ⊆ V is defined by π(X) = ∪_{X∈X} π(X) \ X.

• Let An(·) denote the transitive closure of π(·), i.e. the ancestor set of a vertex. The combined ancestor set of X ⊆ V is defined by An(X) = ∪_{X∈X} An(X) \ X.

• Let δ : V → ℕ denote a topological order of the vertices such that Y ∈ An(X) → δ(Y) < δ(X). The ahead set of a vertex X ∈ V given δ is defined by Ah(X) = {Y ∈ V | δ(Y) < δ(X)}.
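As a concrete illustration of Def. 1 (our own minimal sketch in Python, not taken from the paper), a discrete BN can be stored as a parent map plus CPTs, with the joint probability Pr(V) obtained as the product of the conditional entries. The variable names and the dictionary layout below are hypothetical choices made only for illustration.

    # A hypothetical three-vertex BN: parents and CPTs, states are 0/1.
    parents = {"A": [], "B": ["A"], "C": ["A", "B"]}
    cpt = {
        "A": {(): {0: 0.3, 1: 0.7}},
        "B": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.4, 1: 0.6}},
        "C": {(0, 0): {0: 0.5, 1: 0.5}, (0, 1): {0: 0.2, 1: 0.8},
              (1, 0): {0: 0.6, 1: 0.4}, (1, 1): {0: 0.1, 1: 0.9}},
    }

    def joint(assignment):
        """Pr(V) = prod over V of Pr(V | pi(V)) for a full assignment."""
        p = 1.0
        for v, ps in parents.items():
            p *= cpt[v][tuple(assignment[q] for q in ps)][assignment[v]]
        return p

    print(joint({"A": 1, "B": 0, "C": 1}))  # 0.7 * 0.4 * 0.4 = 0.112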
2.2 Importance Sampling

Importance sampling is an MC method to improve the convergence speed and reduce the error variance with probability density functions. Let g(X) be a function of m variables X = {X_1, ..., X_m} over domain Ω ⊆ ℝ^m, such that computing g(X) for any X is feasible. Consider the problem of approximating I = ∫_Ω g(X) dX using a sampling technique. Importance sampling approaches this problem by rewriting I = ∫_Ω [g(X)/f(X)] f(X) dX, where f(X) is a probability density function over Ω, often referred to as the importance function. In order to achieve the minimum error variance, equal to σ²_{f(X)} = (∫_Ω |g(X)| dX)² − I², the importance function should be f(X) = |g(X)| (∫_Ω |g(X)| dX)^{−1}, see [Rubinstein, 1981]. Note that when g(X) > 0 the optimal probability density function is f(X) = g(X) I^{−1} and σ²_{f(X)} = 0. It is obvious that in most cases it is impossible to obtain the optimal importance function.

The SIS [Shachter and Peot, 1990] and AIS-BN [Cheng and Druzdzel, 2000a] sampling algorithms are effective methods for approximate Bayesian inference. These methods attempt to approach the optimal importance function through learning by dynamically adjusting the importance function during sampling with evidence. To this end, AIS-BN heuristically changes the CPT values of a BN, a technique that has been shown to significantly improve the convergence rate of the approximation to the exact solution.

We use the following definitions for sake of exposition.

Def. 3 Let BN = (G, Pr) be a Bayesian network with G = (V, A) and evidence e for variables E ⊆ V. A posterior BN_e of the BN is some (new) network defined as BN_e = (G_e, Pr_e) with graph G_e over variables V \ E, such that BN_e exactly models the posterior joint probability distribution Pr_e = Pr(· | e).

A typical example of a posterior BN_e is a BN combined with an updated posterior state as defined by exact inference algorithms, e.g. using evidence absorption [van der Gaag, 1996]. Approximations of BN_e are used by importance sampling algorithms. These approximations consist of the original BN with all evidence vertices ignored from further consideration.

Def. 4 Let BN = (G, Pr) be a Bayesian network with G = (V, A) and evidence e for variables E ⊆ V. The evidence-simplified ESBN_e of BN is defined by ESBN_e = (G'_e, Pr'_e), where G'_e = (V'_e, A'_e), V'_e = V \ E, and A'_e = {(X, Y) | (X, Y) ∈ A ∧ X, Y ∉ E}.

The joint probability distribution Pr'_e of an evidence-simplified BN approximates Pr_e. For example, SIS and AIS-BN adjust the CPTs of the original BN.

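The importance sampling estimate of I = ∫ g(X) dX described above can be sketched in a few lines of Python. The example below is our own illustration (not from the paper): g(x) = x e^{-x} on [0, ∞), whose integral is 1, with the Exp(1) density as importance function, so the weighted integrand g(x)/f(x) reduces to x.

    import random

    def estimate(n=100000, seed=0):
        rng = random.Random(seed)
        total = 0.0
        for _ in range(n):
            x = rng.expovariate(1.0)   # sample from f(x) = e^{-x}
            total += x                 # g(x)/f(x) = x e^{-x} / e^{-x} = x
        return total / n

    print(estimate())  # close to the exact value I = 1

The variance of this estimator depends on how closely f matches |g|/I; the closer the match, the smaller the error variance, which is the motivation for learning a good importance function.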
2.3 KL-Divergence Bounds

We give a lower bound on the KL-divergence [Kullback, 1959] of the evidence-simplified Pr'_e from the exact Pr_e. The lower bound is valid for all variations of Pr'_e, including those generated by importance sampling algorithms that adjust the CPT.

Theorem 1 Let ESBN_e = (G'_e, Pr'_e) be an evidence-simplified BN given evidence e for E ⊆ V. If Pr'_e(V | π_e(V)) = Pr(V | π(V), e) for all V ∈ V, then the KL-divergence between Pr'_e and Pr_e is minimal and given by

  ∑_{X∈X} ∑_{Cfg(X,π(X))} Pr(x, π(x) | e) ln Pr(x | π(x))
  + ∑_{X∈X} ∑_{Cfg(X,π(X))} Pr(x, π(x) | e) ln [1 / Pr'_e(x | π_e(x))]
  + ∑_{Cfg(π(E))} Pr(π(e) | e) ln ∏_{e∈e} Pr(e | π(e)) − ln Pr(e)        (1)

where X = V \ E.

Proof. See Appendix A. □

Theorem 1 bounds the error variance from below, which is empirically verified for SIS and AIS-BN in the results Section 4. The divergence Eq. (1) is zero when specific conditions are met, as stated below.

Corollary 1 Let ESBN_e = (G'_e, Pr'_e) be an evidence-simplified BN given evidence e for E ⊆ V. If π(E) ∩ (V \ E) = ∅, then Pr'_e = Pr_e.

Proof. See Appendix B. □

Hence, the optimal importance function is obtained when all evidence vertices are clustered as roots in G_e. We will now show how Pr'_e can approach the optimal Pr_e without restrictions. For sake of explanation, the following widely-held assumptions are reiterated:

Assumption 1 The topological order δ of a BN and its posterior version δ_e of BN_e are consistent. That is, δ_e(Y) < δ_e(X) → δ(Y) < δ(X) for all X, Y ∈ V \ E.

Assumption 1 is reasonable for the following facts:

1. According to the chain rule, a BN can be built up in any topological order and all of them describe the same joint probability distribution.

2. Although there has never been a widely accepted definition of what causality is, it is widely accepted that the fact of observing evidence for random variables should not change the causality relationship between the variables.

Theorem 2 Let BN_e = (G_e, Pr_e) be the posterior of a BN = (G, Pr) given evidence e for E ⊆ V. If X ∉ An(E) for X ∈ V \ E, then Pr_e(X | Ah_e(X)) = Pr(X | π(X)). The evidence vertices in π(X) take configurations fixed by e, that is, Pr(X | π(X)) = Pr(X | π(X) \ E, e_1, ..., e_m) for all e_i ∈ π(X) ∩ E.

Proof. See [Cheng and Druzdzel, 2000a]. □

Hence, to compute the posterior probability of a vertex that is not an ancestor of an evidence vertex, there is no need to change the parents of the vertex or its CPT. For vertices that are ancestors of evidence vertices, we use Bayes' formula and d-separation to explore the effects of evidence on those vertices. Without loss of generality, only one evidence vertex is considered. The result applies to an evidence vertex set by transitivity.

Lemma 1 Let BN_e = (G_e, Pr_e) be the posterior of a BN = (G, Pr) given evidence e = {e} for E ∈ V. Let X ∈ An(E). Then, Pr_e(X | Ah_e(X)) = [Pr(e | X, Ah(X)) / Pr(e | Ah(X))] Pr(X | π(X)).

Proof. Because Ah_e(X) = Ah(X) by Assumption 1, we have

  Pr_e(X | Ah_e(X)) = Pr_e(X, Ah(X)) / Pr_e(Ah(X)) = Pr(X, Ah(X) | e) / Pr(Ah(X) | e)
  = Pr(X, Ah(X), e) / Pr(Ah(X), e)
  = [Pr(e | X, Ah(X)) / Pr(e | Ah(X))] · [Pr(X, Ah(X)) / Pr(Ah(X))]
  = [Pr(e | X, Ah(X)) / Pr(e | Ah(X))] Pr(X | π(X))

by using Theorem 2. □

Theorem 2 and Lemma 1 show that if we have Pr(e | X, Ah(X)) and Pr(e | Ah(X)) for all X ∈ An(E), we can derive Pr_e(X | Ah_e(X)) to compute

  Pr_e(V) = ∏_{V∈An(E)} Pr_e(V | Ah_e(V)) ∏_{V∉An(E)} Pr(V | π(V))

for the optimal importance function. However, there are two problems to derive Pr_e. Firstly, Ah(X) is too large to construct a posterior BN for Pr_e in practice. Secondly, instead of the exact Pr_e(X | Ah_e(X)) we have an estimate by importance sampling.

The parent sets of X ∈ An(E) can be minimized by exploiting d-separation. Let α_e(X) ⊆ X ∪ Ah(X) denote the minimal vertex set that d-separates evidence E and X ∪ Ah(X), thus ⟨E, X ∪ Ah(X) | α_e(X)⟩. We will refer to α_e(X) as the "shield" of X given E. Let β_e(X) ⊆ Ah(X) denote the minimal vertex set that d-separates evidence E and Ah(X), thus ⟨E, Ah(X) | β_e(X)⟩. We explore the relationship between α_e(X) and β_e(X) below.

Lemma 2 Let BN_e = (G_e, Pr_e) be the posterior of a BN = (G, Pr) given evidence e = {e} for E ∈ V. Then, β_e(X) ⊆ (α_e(X) \ X) ∪ π(X) for all X ∈ An(E).

Proof. See Appendix C. □

Therefore, we can approach the optimal importance function Pr_e(X | Ah_e(X)) by estimation of Pr_e(X | (α_e(X) \ X) ∪ π(X)) from importance samples.
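The practical consequence of Theorem 1 and Corollary 1 can be illustrated numerically. The following self-contained Python sketch is our own illustration with hypothetical numbers (it is not an experiment from the paper): for a v-structure A → E ← B with evidence E, the best evidence-simplified product of marginals keeps a strictly positive KL-divergence from the exact posterior, while adding the arc A → B, in the spirit of the refractoring idea, reduces the divergence to zero.

    from math import log

    pA = {0: 0.6, 1: 0.4}                       # Pr(A), hypothetical values
    pB = {0: 0.7, 1: 0.3}                       # Pr(B)
    pE = {(0, 0): 0.1, (0, 1): 0.8,             # Pr(E = 1 | A, B)
          (1, 0): 0.7, (1, 1): 0.2}

    # Exact posterior Pr(A, B | E = 1) by enumeration.
    joint = {(a, b): pA[a] * pB[b] * pE[(a, b)] for a in (0, 1) for b in (0, 1)}
    z = sum(joint.values())
    post = {ab: p / z for ab, p in joint.items()}

    # Best evidence-simplified approximation (cf. Theorem 1): A and B stay
    # disconnected, each factor set to its posterior marginal.
    mA = {a: sum(post[(a, b)] for b in (0, 1)) for a in (0, 1)}
    mB = {b: sum(post[(a, b)] for a in (0, 1)) for b in (0, 1)}
    approx = {(a, b): mA[a] * mB[b] for a in (0, 1) for b in (0, 1)}

    kl = sum(p * log(p / approx[ab]) for ab, p in post.items() if p > 0)
    print("KL(Pr_e || Pr'_e) =", kl)   # strictly positive: A and B are
                                       # dependent given E, so no product of
                                       # marginals reaches the posterior

    # Adding the arc A -> B lets Pr'_e(A) Pr'_e(B | A) represent Pr_e exactly.
    refr = {(a, b): mA[a] * (post[(a, b)] / mA[a]) for a in (0, 1) for b in (0, 1)}
    kl_refr = sum(p * log(p / refr[ab]) for ab, p in post.items() if p > 0)
    print("KL after adding A -> B =", kl_refr)  # zero up to rounding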

    Input: Evidence E ∈ V and X ∈ An(E)
    Output: The set S = α_e(X)
    Data: array A, queue Q
    A ← topSort_δ(Ah(X));
    S ← {X};
    for i ← |A| to 1 do
        Q ← ∅;
        push(Q, A[i]);
        while Q ≠ ∅ do
            V ← pop(Q);
            if V = E then S ← S ∪ {A[i]}; break;
            if V ∉ S ∧ V ∈ An(E) ∧ X ∉ An(V) then push(Q, children(V));
        end
    end

Algorithm 1: Computing the Shield α_e(X)

    Input: BN = (G, Pr), evidence e for E ⊆ V
    Output: refractored BN_e
    foreach E ∈ E do
        foreach X ∈ An(E) do
            expand π_e(X) = (α_e(X) \ {X}) ∪ π(X);
            update the CPT of X;
        end
    end

Algorithm 2: Refractor Procedure
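The following Python sketch is a direct transcription of Algorithm 1 as we read the pseudocode above; it is not the authors' implementation. The graph is assumed to be available through precomputed lookup tables with hypothetical names: children (vertex → list of children), ancestors (vertex → An(·)), ahead (vertex → Ah(·)), and topo_index (the topological order δ).

    from collections import deque

    def shield(X, E, ahead, children, ancestors, topo_index):
        """Compute alpha_e(X): X plus the vertices of Ah(X) from which the
        forward search below reaches the evidence vertex E."""
        A = sorted(ahead[X], key=lambda v: topo_index[v])   # topSort_delta(Ah(X))
        S = {X}
        for i in range(len(A) - 1, -1, -1):                 # i <- |A| downto 1
            Q = deque([A[i]])
            while Q:
                V = Q.popleft()
                if V == E:
                    S.add(A[i])
                    break
                if V not in S and V in ancestors[E] and X not in ancestors[V]:
                    Q.extend(children[V])
        return S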

Figure 1: Refractor Example

3 Refractor Importance Sampling

The RIS algorithm modifies the BN structure according to the shield α_e(X) for vertices X ∈ An(E) by expanding the parent set of X and adjusting its CPT accordingly. Visually in the graph, RIS refracts arcs from the evidence vertices, which inspired the choice of name for the method. The algorithms and general procedure of RIS are introduced in this section.

3.1 Computing the Shield

Alg. 1 computes α_e(X) in O(|A|) worst-case time, assuming An(·) is determined in unit time (e.g. using a lookup table). Function topSort_δ topologically sorts the set Ah(X) by topological order δ over V of the BN. Note that the shield α_e(X) can be computed in advance for each X ∈ V given evidence nodes E.

3.2 Refractor Procedure

Alg. 2 modifies the graphical structure of the BN. The time complexity of this algorithm is O(|V||A|) if |E| ≪ |V|, otherwise it is O(|V|²|A|). The CPT of a vertex X is updated by populating the expanded entries α_e(X) \ {X} using sampling data (described in Section 3.4).

Fig. 1 shows an example refractored BN using Alg. 2. E is the evidence node. Here, α_e(C) = {A} and α_e(B) = {A}. Arcs A → B and A → C are added. Note that arc A → B adjusts for the fact that the influence relationship between A and B has changed through evidence E. Arc E → D is no longer required and can be removed as in [van der Gaag, 1996].

3.3 General RIS Procedure

RIS utilizes both the qualitative and quantitative properties of a BN to approach the optimal importance function. The general procedure of RIS is:

1. The structure of the BN is modified by Algorithms 1 and 2. The CPTs of (a subset of) ancestor vertices of evidence vertices are expanded.

2. Update the CPT values through some specific learning algorithm (see Section 3.4 for details).

3. Sample the BN with an importance sampling algorithm using the new importance function.

3.4 Variations of RIS

Step 1 modifies the BN structure significantly, especially when the ancestor sets of evidence vertices are large, e.g. when evidence vertices are leaves. This increases the complexity of the BN. However, the effect of evidence on other vertices is attenuated when the path length between the evidence and the vertices is increased [Henrion, 1989]. Therefore, instead of modifying all ancestors An(E) of evidence E in Step 1, it is generally sufficient to select a subset of ancestors, such as the combined parent set π(E).

Steps 1 and 2 are independent, because any importance function learning algorithm can be applied in Step 2. Steps 2 and 3 can be combined by using the same importance sampling algorithm for learning and inference. In our experiments we used SIS and AIS-BN for both learning and inference (Steps 2 and 3), referred to as RISSIS and RISAIS, respectively. AIS-BN will be referred to as AIS. A sketch of the CPT update in Step 2 is given below.
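To make Step 2 concrete, the sketch below shows one simple way to populate the expanded CPT entries Pr_e(X | (α_e(X) \ {X}) ∪ π(X)) from weighted samples. It is our own illustration under stated assumptions: the samples are (assignment, weight) pairs produced by an importance sampler such as SIS, the data layout and the helper name estimate_cpt are hypothetical, and the use of weighted counting with a small prior is our choice, not a procedure prescribed by the paper.

    from collections import defaultdict

    def estimate_cpt(samples, X, expanded_parents, states, prior=1.0):
        """Estimate Pr_e(X | expanded parents) by weighted counting, with a
        small Dirichlet-style prior to avoid zero entries. Only parent
        configurations that occur in the samples are returned."""
        counts = defaultdict(lambda: defaultdict(float))
        for assignment, weight in samples:
            key = tuple(assignment[p] for p in expanded_parents)
            counts[key][assignment[X]] += weight
        cpt = {}
        for key, row in counts.items():
            total = sum(row.values()) + prior * len(states[X])
            cpt[key] = {x: (row[x] + prior) / total for x in states[X]}
        return cpt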

Figure 2: Synthetic BN Results: Ratio of Lowest MSE for RISAIS Versus AIS, and RISSIS Versus SIS.

4 Results

This section presents the experimental results of RISSIS and RISAIS compared to SIS and AIS for synthetic networks and two real-world networks.

4.1 Measurement

The MSE (mean squared error) metric was used to measure the error of the importance sampling results compared to the exact solution:

  MSE = sqrt( [1 / ∑_{X_i∈X} n_i] ∑_{X_i∈X} ∑_{j=1}^{n_i} (Pr'_e(x_ij) − Pr_e(x_ij))² ),

where X = V \ E. We also measured the KL-divergence of the approximate and exact posterior probability distributions:

  KL-divergence = ∑_{Cfg(X)} Pr_e(x) ln [Pr_e(x) / Pr'_e(x)].

Recall that Theorem 1 gives a lower bound for the KL-divergence of the posterior probability distributions of SIS and AIS, which is indicated in the results by the PostKLD lower bound from Eq. (1).

The number of samples is taken as a measure of running time instead of CPU time in our experimental implementation. Recall that the overhead of RIS is fixed at startup when the evidence set can be predetermined. Furthermore, the RIS overhead is limited to collecting the updated CPT values during sampling (and learning in the case of RISAIS).

The reported sampling frequencies for AIS (and RISAIS) are for calculating the posterior results. Because AIS separates the importance function learning stage from the sampling stage, the actual number of samples taken for AIS (total sampling for the importance function and sampling the results) is twice that of SIS. Recommended parameters [Cheng and Druzdzel, 2000a] are used in AIS and RISAIS.

4.2 Test Cases

Because computing the MSE is expensive and PostKLD is exponential in the number of vertices, small-sized synthetic BNs with random variables with two or three states and |V| = 20 vertices and |A| = 30 arcs were evaluated in our experiments. The CPT for each variable is randomly generated with uniform distribution for the probability interval [0.1, 0.9], with bias for the extreme probabilities in the intervals (0, 0.1) and (0.9, 1). For the experiments we generated 100 different synthetic BNs with these characteristics.

We also verified RIS with two real-world BNs: Alarm-37 [Beinlich et al., 1989] and HeparII-70 [Onisko, 2003]. The probability distributions of these networks are more extreme compared to the synthetic BNs. For each of the two BNs, 20 sets of evidence variables are randomly chosen, each with 10 evidence variables. For Alarm-37 and HeparII-70 we chose to limit the refractoring to the parent nodes of the evidence set π(E) instead of An(E), see Section 3.4.

4.3 Results for Synthetic Test Cases

We compared the MSE of four algorithms: AIS, RISAIS, SIS, and RISSIS. For this comparison a selection of 21 BNs from the generated synthetic test case suite was made. The other 79 test cases have PostKLD ≤ 0.1, which means according to Theorem 1 that the RIS advantage is limited.

Fig. 2 shows the results for the 21 synthetic BNs, where the sample frequency is varied from 1,000 to 19,000 in increments of 1,000. The dark columns in the figures represent the ratio of lowest MSE cases for RISAIS versus AIS and RISSIS versus SIS. A ratio of 50% or higher indicates that the RIS algorithm has lower error variance than the non-RIS algorithm. For RISAIS this is the case for all but one of the 19 measurements taken. In total, the MSE is lowest for RISAIS in 61.4% of the cases on average over all samples. For RISSIS this is the case for all but four of the 19 measurements taken. In total, the MSE is lowest for RISSIS in 58.4% of the cases on average over all samples.

In fact, it is to be expected that the higher the PostKLD lower bound, the better the RIS algorithms should perform. In order to determine the impact with increasing PostKLD, we selected all 100 synthetic BN test cases and measured the KL-divergence after 11,000 samples.
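For reference, the two error metrics of Section 4.1 can be computed as in the following Python sketch. The data layouts are hypothetical: per-variable posterior marginals for the MSE, and distributions over configurations of the non-evidence variables for the KL-divergence.

    from math import sqrt, log

    def mse(exact, approx):
        # exact, approx: {variable: {state: posterior probability}}
        n = sum(len(states) for states in exact.values())
        s = sum((approx[v][x] - exact[v][x]) ** 2
                for v, states in exact.items() for x in states)
        return sqrt(s / n)

    def kl(exact_joint, approx_joint):
        # distributions over configurations of the non-evidence variables
        return sum(p * log(p / approx_joint[cfg])
                   for cfg, p in exact_joint.items() if p > 0)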

Figure 3: Synthetic BN Results: KL-Divergence of RISAIS and AIS with PostKLD Lower Bound


Fig. 3 shows the result for RISAIS, where the 100 BNs are ranked according to the PostKLD. Recall that the PostKLD is the lower bound on the KL-divergence of AIS. From the figure it can be concluded that AIS does not approach the exact solution for a significant number of test cases, whereas RISAIS is not limited by the bound due to the BN refractoring.

It should be noted that around points 1 and 26 in Fig. 3 the KL-divergence of RISAIS is worse compared to AIS. We believe the reason is that AIS heuristically changes the original CPT, which has a negative impact on the RIS algorithm's ability to adjust the CPT to the optimal importance function.

Figure 4: Synthetic BN Results: KL-Divergence of RISSIS and SIS with PostKLD Lower Bound

Fig. 4 shows the result for RISSIS, where the 100 BNs are ranked according to the PostKLD. Interestingly, the RISSIS and SIS results are better on average than the RISAIS and AIS results. Note that the PostKLD lower bound is the same for AIS and SIS. However, in this study SIS appears to approach the PostKLD closer than AIS. Also here we can conclude that SIS does not approach the exact solution for a significant number of test cases, whereas RISSIS is not limited by the bound due to the BN refractoring.

4.4 Results for Alarm-37 and HeparII-70

Fig. 5 shows the results for Alarm-37 and HeparII-70, where the sample frequency is varied from 1,000 to 19,000 in increments of 1,000. The dark columns in the figures represent the ratio of lowest MSE cases for RISAIS versus AIS and RISSIS versus SIS. A ratio of 50% or higher indicates that the RIS algorithm has lower error variance than the non-RIS algorithm. For RISAIS this is the case for all but one of the 19 measurements taken. In total, the MSE is lowest for RISAIS in 56.7% of the cases on average over all samples. For RISSIS this is the case for all 19 measurements taken. In total, the MSE is lowest for RISSIS in 60.3% of the cases on average over all samples.

Figure 5: Ratio of Lowest MSE for RISAIS Versus AIS, and RISSIS Versus SIS (Alarm-37 and HeparII-70).

The combined results show that the RIS algorithms have reduced error variance for the synthetic networks and the two real-world networks. The PostKLD lower bound limits the ability of AIS and SIS to approach the optimal importance function. By contrast, the RIS approach successfully eliminates this limitation of AIS and SIS, thereby providing an improvement to reduce error variance in BN importance sampling propagation under evidential reasoning.

5 Conclusions

In order to approach the optimal importance function for importance sampling propagation under evidential reasoning with a Bayesian network, a modification of the network's structure is necessary to eliminate the lower bound on the error variance. To this end, the proposed RIS algorithms refractor the network and adjust the conditional probability tables to minimize the divergence to the optimal importance function. The validity and performance of the RIS approach was empirically tested with a set of synthetic networks and two real-world networks.

Additional improvements of RIS are possible to achieve a better accuracy/cost ratio. The goal is to find an effective subset of the full shield of an ancestor vertex of an evidence vertex, or to select a limited subset of the ancestors of evidence vertices that are refractored. Also, some of the additionally introduced arcs could be removed with the arc removal algorithm [van Engelen, 1997] when they present a weak influence. Such strategies would reduce the complexity of the refractored network while still ensuring higher accuracy over current importance sampling algorithms.

References

[Beinlich et al., 1989] Beinlich, I., Suermondt, G., Chavez, R., and Cooper, G. (1989). The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks. In Proceedings of the 2nd European Conference on AI and Medicine, Berlin. Springer-Verlag.

[Bouckaert, 1994] Bouckaert, R. R. (1994). A stratified simulation scheme for inference in Bayesian belief networks. In Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence, pages 110–117.

[Chavez and Cooper, 1990] Chavez, R. M. and Cooper, G. F. (1990). A randomized approximation algorithm for probabilistic inference on Bayesian belief networks. Networks, 20:661–685.

[Cheng and Druzdzel, 2000a] Cheng, J. and Druzdzel, M. J. (2000a). AIS-BN: An adaptive importance sampling algorithm for evidential reasoning in large Bayesian networks. Journal of Artificial Intelligence Research, 13:155–188.

[Cheng and Druzdzel, 2000b] Cheng, J. and Druzdzel, M. J. (2000b). Computational investigations of low-discrepancy sequences in simulation algorithms for Bayesian networks. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pages 72–81. Morgan Kaufmann Publishers.

[Cheng and Druzdzel, 2000c] Cheng, J. and Druzdzel, M. J. (2000c). Latin hypercube sampling in Bayesian networks. In Proceedings of the 13th International Florida Artificial Intelligence Research Symposium Conference (FLAIRS-2000), pages 287–292, Orlando, Florida. AAAI Publishers.

[Cooper, 1990] Cooper, G. F. (1990). The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42(2-3):393–405.

[Dagum and Luby, 1993] Dagum, P. and Luby, M. (1993). Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence, 60:141–153.

[Fung and Chang, 1989] Fung, R. and Chang, K. C. (1989). Weighting and integrating evidence for stochastic simulation in Bayesian networks. In Proceedings of the 5th Conference on Uncertainty in Artificial Intelligence, pages 209–219, New York, NY. Elsevier Science Publishing Company, Inc.

[Garvey and Lesser, 1994] Garvey, A. J. and Lesser, V. R. (1994). A survey of research in deliberative real-time artificial intelligence. Real-Time Systems, 6(3):317–347.

[Geman and Geman, 1984] Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distribution and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741.

[Gilks et al., 1996] Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (1996). Markov Chain Monte Carlo in Practice. Chapman and Hall, London.

[Guo and Hsu, 2002] Guo, H. and Hsu, W. (2002). A survey on algorithms for real-time Bayesian network inference. In the joint AAAI-02/KDD-02/UAI-02 workshop on Real-Time Decision Support and Diagnosis Systems, Edmonton, Alberta, Canada.

[Henrion, 1988] Henrion, M. (1988). Propagating uncertainty in Bayesian networks by probabilistic logic sampling. In Proceedings of the 2nd Conference on Uncertainty in Artificial Intelligence, pages 149–163, New York. Elsevier Science.

[Henrion, 1989] Henrion, M. (1989). Some practical issues in constructing belief networks. In Kanal, L., Levitt, T., and Lemmer, J., editors, Proceedings of the 3rd Conference on Uncertainty in Artificial Intelligence, pages 161–173, North Holland. Elsevier Science.

[Kullback, 1959] Kullback, S. (1959). Information Theory and Statistics. John Wiley and Sons, New York.

[MacKay, 1998] MacKay, D. (1998). Introduction to Monte Carlo methods. In Jordan, M., editor, Learning in Graphical Models. MIT Press.

[Neapolitan, 1990] Neapolitan, R. E. (1990). Probabilistic Reasoning in Expert Systems. John Wiley and Sons, NY.

[Onisko, 2003] Onisko, A. (2003). Probabilistic Causal Models in Medicine: Application to Diagnosis of Liver Disorders. PhD thesis, Institute of Biocybernetics and Biomedical Engineering, Polish Academy of Science, Warsaw.

[Pearl, 1987] Pearl, J. (1987). Evidential reasoning using stochastic simulation of causal models. Artificial Intelligence, 32(2):245–257.

[Pearl, 1988] Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, Inc., San Mateo, CA.

[Rubinstein, 1981] Rubinstein, R. Y. (1981). Simulation and the Monte Carlo Method. John Wiley and Sons, Hoboken, NJ.

[Russell and Norvig, 1995] Russell, S. and Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice Hall Series in Artificial Intelligence. Prentice Hall.

[Santos et al., 1995] Santos, E. J., Shimony, S. E., Solomon, E., and Williams, E. (1995). On a distributed anytime architecture for probabilistic reasoning. Technical report, AFIT/EN/TR95-02, Department of Electrical and Computer Engineering, Air Force Institute of Technology.

[Shachter and Peot, 1990] Shachter, R. D. and Peot, M. A. (1990). Simulation approaches to general probabilistic inference on belief networks. In Proceedings of the 5th Conference on Uncertainty in Artificial Intelligence, volume 5.

[Shannon, 1956] Shannon, C. E. (1956). The bandwagon. IRE Transactions on Information Theory, IT-2:3.

[van der Gaag, 1996] van der Gaag, L. (1996). On evidence absorption for belief networks. International Journal of Approximate Reasoning, 15(3):265–286.

[van Engelen, 1997] van Engelen, R. (1997). Approximating Bayesian belief networks by arc removal. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(8):916–920.

Appendix

A Proof of Theorem 1

Proof. We use the KL-divergence (cross entropy) [Kullback, 1959] to measure the difference between the posterior BN_e and ESBN_e. The KL-divergence is

  E_1[ln (Pr_1(V) / Pr_2(V))] = ∑_{Cfg(V)} Pr_1(v) ln [Pr_1(v) / Pr_2(v)].

Hence, the KL-divergence between the posterior BN_e and ESBN_e is

  ∑_{Cfg(V\E)} Pr_e(v \ e) ln [Pr_e(v \ e) / Pr'_e(v \ e)] = ∑_{Cfg(X)} Pr_e(x) ln [Pr_e(x) / Pr'_e(x)],

where X = V \ E. This is further simplified as follows:

  ∑_{Cfg(X)} Pr_e(x) ln [Pr_e(x) / Pr'_e(x)]
  = ∑_{Cfg(X)} Pr(x | e) ln [Pr(x | e) / Pr'_e(x)]
  = ∑_{Cfg(X)} Pr(x | e) ln [ ∏_{x_j∈x} Pr(x_j | π(x_j)) ∏_{e_i∈e} Pr(e_i | π(e_i)) / (Pr(e) ∏_{x_j∈x} Pr'_e(x_j | π_e(x_j))) ]
  = ∑_{Cfg(X)} Pr(x | e) ln [ ∏_{x_j∈x} Pr(x_j | π(x_j)) ∏_{e_i∈e} Pr(e_i | π(e_i)) / ∏_{x_j∈x} Pr'_e(x_j | π_e(x_j)) ] + ln [1 / Pr(e)] ∑_{Cfg(X)} Pr(x | e)
  = ∑_{Cfg(X)} Pr(x | e) ln [ ∏_{x_j∈x} Pr(x_j | π(x_j)) / ∏_{x_j∈x} Pr'_e(x_j | π_e(x_j)) ] + ∑_{Cfg(X)} Pr(x | e) ln ∏_{e_i∈e} Pr(e_i | π(e_i)) − ln Pr(e)
  = ∑_{X_j∈X} ∑_{Cfg(X_j,π(X_j))} Pr(x_j, π(x_j) | e) ln [Pr(x_j | π(x_j)) / Pr'_e(x_j | π_e(x_j))] + ∑_{Cfg(π(E))} Pr(π(e) | e) ln ∏_{e_i∈e} Pr(e_i | π(e_i)) − ln Pr(e)
  = ∑_{X_j∈X} ∑_{Cfg(X_j,π(X_j))} Pr(x_j, π(x_j) | e) ln Pr(x_j | π(x_j))
    + ∑_{X_j∈X} ∑_{Cfg(X_j,π(X_j))} Pr(x_j, π(x_j) | e) ln [1 / Pr'_e(x_j | π_e(x_j))]
    + ∑_{Cfg(π(E))} Pr(π(e) | e) ln ∏_{e_i∈e} Pr(e_i | π(e_i)) − ln Pr(e)        (Eq. 1)

The first, third, and fourth terms in Eq. 1 are decided by the original probability distribution, so

  ∑_{X_j∈X} ∑_{Cfg(X_j,π(X_j))} Pr(x_j, π(x_j) | e) ln Pr(x_j | π(x_j)) + ∑_{Cfg(π(E))} Pr(π(e) | e) ln ∏_{e_i∈e} Pr(e_i | π(e_i)) − ln Pr(e)

is constant. To minimize the difference between the posterior BN_e and ESBN_e the only choice is to minimize the following term:

  ∑_{X_j∈X} ∑_{Cfg(X_j,π(X_j))} Pr(x_j, π(x_j) | e) ln [1 / Pr'_e(x_j | π_e(x_j))]
  = ∑_{X_j∈X} ∑_{Cfg(π(X_j))} ∑_{Cfg(X_j)} Pr(x_j, π(x_j) | e) ln [1 / Pr'_e(x_j | π_e(x_j))].

This is equal to minimizing the term for each X_j ∈ X and each possible configuration of π(x_j):

  ∑_{Cfg(X_j)} Pr(x_j, π(x_j) | e) ln [1 / Pr'_e(x_j | π_e(x_j))]
  = Pr(π(x_j) | e) ∑_{Cfg(X_j)} Pr(x_j | π(x_j), e) ln [1 / Pr'_e(x_j | π(x_j) \ e)].

We have ∑_{Cfg(X_j)} Pr'_e(x_j | π_e(x_j)) = ∑_{Cfg(X_j)} Pr'_e(x_j | π(x_j) \ e) = 1. According to Shannon's information theory [Shannon, 1956], to minimize ∑_{Cfg(X_j)} Pr(x_j | π(x_j), e) ln [1 / Pr'_e(x_j | π(x_j) \ e)] we should set Pr'_e(x_j | π(x_j) \ e) = Pr(x_j | π(x_j), e). This proves Theorem 1. □

B Proof of Corollary 1

Proof. Let X = V \ E. Then

  ∑_{Cfg(V\E)} Pr_e(v \ e) ln [Pr_e(v \ e) / Pr'_e(v \ e)]
  = ∑_{Cfg(X)} Pr_e(x) ln [Pr_e(x) / Pr'_e(x)]
  = ∑_{Cfg(X)} Pr(x | e) ln [Pr(x | e) / Pr'_e(x)]
  = ∑_{Cfg(X)} Pr(x | e) ln [ ∏_{x_j∈x} Pr(x_j | π(x_j)) ∏_{e_i∈e} Pr(e_i | π(e_i)) / (Pr(e) ∏_{x_j∈x} Pr'_e(x_j | π_e(x_j))) ].

Since π(E) ∩ (V \ E) = ∅ ⇒ ∀X_j ∈ X, Pr(x_j | π(x_j), e) = Pr(x_j | π(x_j)), from Theorem 1 we set Pr'_e(x_j | π_e(x_j)) = Pr(x_j | π(x_j)) to minimize the divergence. Then

  ∑_{Cfg(X)} Pr(x | e) ln [ ∏_{x_j∈x} Pr(x_j | π(x_j)) ∏_{e_i∈e} Pr(e_i | π(e_i)) / (Pr(e) ∏_{x_j∈x} Pr'_e(x_j | π_e(x_j))) ]
  = ∑_{Cfg(X)} Pr(x | e) ln [ ∏_{e_i∈e} Pr(e_i | π(e_i)) / Pr(e) ].

Also from π(E) ∩ (V \ E) = ∅, ∀E_i ∈ E, π(E_i) ⊆ E ⇒ ∏_{e_i∈e} Pr(e_i | π(e_i)) = Pr(e), so

  ∑_{Cfg(X)} Pr(x | e) ln [ ∏_{e_i∈e} Pr(e_i | π(e_i)) / Pr(e) ] = 0.

The KL-divergence between Pr'_e and Pr_e is zero, thus Pr'_e(v \ e) = Pr_e(v \ e) according to [Kullback, 1959]. □

C Proof of Lemma 2

Proof. ∀X_k ∈ Ah(X_j) \ β_e(X_j), consider the following three cases.

Case 1: If a path X_k → E exists, then we show that this path is d-separated by α_e(X_j). There are two possibilities. First, the path X_k → E passes through X_j, so it must pass one of the parents of X_j. Then π(X_j) d-separates the path. Second, X_k → E does not pass X_j. Then the path must be d-separated by α_e(X_j) \ X_j, so (α_e(X_j) \ X_j) ∪ π(X_j) d-separates the path.

Case 2: If paths N → X_k and N → E exist, so N ∈ Ah(X_j), and N d-separates X_k and E, then according to Case 1, (α_e(X_j) \ X_j) ∪ π(X_j) d-separates the path N → E.

Case 3: If paths X_k → B and E → B exist, then according to the topological order {B, descendants of B} ∩ ((α_e(X_j) \ X_j) ∪ π(X_j)) = ∅, so (α_e(X_j) \ X_j) ∪ π(X_j) d-separates this path.

From Cases 1 to 3 we see that ⟨E, Ah(X_j) | (α_e(X_j) \ X_j) ∪ π(X_j)⟩, so β_e(X_j) ⊆ (α_e(X_j) \ X_j) ∪ π(X_j). □