
Refractor Importance Sampling

Haohai Yu and Robert A. van Engelen
Department of Computer Science, Florida State University
Tallahassee, FL 32306-4530 USA
{hyu,engelen}@cs.fsu.edu

Abstract

In this paper we introduce Refractor Importance Sampling (RIS), an improvement to reduce error variance in BN importance sampling propagation under evidential reasoning. We prove the existence of a collection of importance functions that are close to the optimal importance function under evidential reasoning. Based on this theoretic result we derive the RIS algorithm. RIS approaches the optimal importance function by applying localized arc changes to minimize the divergence between the evidence-adjusted importance function and the optimal importance function. The validity and performance of RIS is empirically tested with a large set of synthetic Bayesian networks and two real-world networks.

1 Introduction

The Bayesian Network (BN) [Pearl, 1988] formalism is one of the dominant representations for modeling uncertainty in intelligent systems [Neapolitan, 1990, Russell and Norvig, 1995]. A BN is a probabilistic graphical model of a joint probability distribution over a set of statistical variables. Bayesian inference on a BN answers probabilistic queries about the variables and their influence relationships. The posterior probability distribution is computed using belief updating methods [Pearl, 1988, Guo and Hsu, 2002]. Exact inference is NP-hard [Cooper, 1990]. Thus, exact methods only admit relatively small networks or simple network configurations in the worst case. Approximations are also NP-hard [Dagum and Luby, 1993]. However, approximate inference methods have anytime [Garvey and Lesser, 1994] and/or anywhere [Santos et al., 1995] properties that make these methods more attractive compared to exact methods.

Stochastic simulation algorithms, also called stochastic sampling or Monte Carlo (MC) algorithms, form one of the most prominent subclasses of approximate inference algorithms, of which Logic Sampling [Henrion, 1988] was the first and simplest sampling algorithm. Likelihood weighting [Fung and Chang, 1989] was designed to overcome the poor performance of logic sampling under evidential reasoning with unlikely evidence. Markov Chain Monte Carlo (MCMC) forms another important group of stochastic sampling algorithms. Examples in this group are Gibbs sampling, Metropolis sampling, and hybrid-MC sampling [Geman and Geman, 1984, Gilks et al., 1996, MacKay, 1998, Pearl, 1987, Chavez and Cooper, 1990]. Stratified sampling [Bouckaert, 1994], hypercube sampling [Cheng and Druzdzel, 2000c], and quasi-MC methods [Cheng and Druzdzel, 2000b] generate random samples from uniform distributions using various methods to improve sampling results. The importance sampling methods [Rubinstein, 1981] are widely used in Bayesian inference. Self Importance Sampling (SIS) [Shachter and Peot, 1990] and Adaptive Importance Sampling (AIS-BN) [Cheng and Druzdzel, 2000a] are among the most effective algorithms.

In this paper we prove that the importance functions of an evidence-updated BN can only approach the optimal importance function when the BN graph structure is modified according to the observed evidence. This implies the existence of a collection of importance functions with minimum divergence to the optimal importance function under evidential reasoning. Based on this result we derive our Refractor Importance Sampling (RIS) class of algorithms. In contrast to the AIS-BN and SIS methods, RIS removes the lower bound that prevents the updated importance function from approaching the optimal importance function. This is achieved by a graphical structure "refractor", consisting of a localized network structure change that minimizes the divergence between the evidence-adjusted importance function and the optimal importance function.

The remainder of this paper is organized as follows. Section 2 proves the existence of a lower bound on the divergence to the optimal importance function under evidential reasoning with a BN. The lower bound is used to derive the class of RIS algorithms introduced in Section 3. Section 4 empirically verifies the properties of the RIS algorithms on a large set of synthetic networks and two real-world networks, and compares the results to other importance sampling algorithms. Finally, Section 5 summarizes our conclusions and describes our future work.

2 Importance Function Divergence

In this section we first give BN definitions and briefly review importance sampling. We then give a KL-divergence lower bound for importance sampling error variance. We prove the existence of a collection of importance functions that approach the optimal importance function by adjusting both the quantitative and qualitative components of a BN under dynamic updating with evidence.

2.1 Definitions

The following definitions and notations are used.

Def. 1 A Bayesian network BN = (G, Pr) is a DAG G = (V, A) with vertices V and arcs A, A ⊆ V × V. Pr is the joint probability distribution over the discrete random variables (vertices) V defined by Pr(V) = ∏_{V∈V} Pr(V | π(V)). The set of parents of a vertex V is π(V). The conditional probability tables (CPT) of the BN assign values to Pr(V | π(V)) for all V ∈ V.

The graph G induces the d-separation criterion [Pearl, 1988], denoted by ⟨X, Y | Z⟩, which implies that X and Y are conditionally independent in Pr given Z, with X, Y, Z ⊆ V.

Def. 2 Let BN = (G, Pr) be a Bayesian network.

• The combined parent set of X ⊆ V is defined by π(X) = ∪_{X∈X} π(X) \ X.

• Let An(·) denote the transitive closure of π(·), i.e. the ancestor set of a vertex. The combined ancestor set of X ⊆ V is defined by An(X) = ∪_{X∈X} An(X) \ X.

• Let δ : V → ℕ denote a topological order of the vertices such that Y ∈ An(X) → δ(Y) < δ(X). The ahead set of a vertex X ∈ V given δ is defined by Ah(X) = {Y ∈ V | δ(Y) < δ(X)}.
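As a concrete illustration of Def. 1 (our own minimal sketch in Python, not taken from the paper), a discrete BN can be stored as a parent map plus CPTs, with the joint probability Pr(V) obtained as the product of the conditional entries. The variable names and the dictionary layout below are hypothetical choices made only for illustration.

    # A hypothetical three-vertex BN: parents and CPTs, states are 0/1.
    parents = {"A": [], "B": ["A"], "C": ["A", "B"]}
    cpt = {
        "A": {(): {0: 0.3, 1: 0.7}},
        "B": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.4, 1: 0.6}},
        "C": {(0, 0): {0: 0.5, 1: 0.5}, (0, 1): {0: 0.2, 1: 0.8},
              (1, 0): {0: 0.6, 1: 0.4}, (1, 1): {0: 0.1, 1: 0.9}},
    }

    def joint(assignment):
        """Pr(V) = prod over V of Pr(V | pi(V)) for a full assignment."""
        p = 1.0
        for v, ps in parents.items():
            p *= cpt[v][tuple(assignment[q] for q in ps)][assignment[v]]
        return p

    print(joint({"A": 1, "B": 0, "C": 1}))  # 0.7 * 0.4 * 0.4 = 0.112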
2.2 Importance Sampling

Importance sampling is an MC method to improve the convergence speed and reduce the error variance with probability density functions. Let g(X) be a function of m variables X = {X_1, ..., X_m} over domain Ω ⊆ ℝ^m, such that computing g(X) for any X is feasible. Consider the problem of approximating I = ∫_Ω g(X) dX using a sampling technique. Importance sampling approaches this problem by rewriting I = ∫_Ω [g(X)/f(X)] f(X) dX, where f(X) is a probability density function over Ω, often referred to as the importance function. In order to achieve the minimum error variance, equal to σ²_{f(X)} = (∫_Ω |g(X)| dX)² − I², the importance function should be f(X) = |g(X)| (∫_Ω |g(X)| dX)^{−1}, see [Rubinstein, 1981]. Note that when g(X) > 0 the optimal probability density function is f(X) = g(X) I^{−1} and σ²_{f(X)} = 0. It is obvious that in most cases it is impossible to obtain the optimal importance function.

The SIS [Shachter and Peot, 1990] and AIS-BN [Cheng and Druzdzel, 2000a] sampling algorithms are effective methods for approximate Bayesian inference. These methods attempt to approach the optimal importance function through learning by dynamically adjusting the importance function during sampling with evidence. To this end, AIS-BN heuristically changes the CPT values of a BN, a technique that has been shown to significantly improve the convergence rate of the approximation to the exact solution.

We use the following definitions for sake of exposition.

Def. 3 Let BN = (G, Pr) be a Bayesian network with G = (V, A) and evidence e for variables E ⊆ V. A posterior BN_e of the BN is some (new) network defined as BN_e = (G_e, Pr_e) with graph G_e over variables V \ E, such that BN_e exactly models the posterior joint probability distribution Pr_e = Pr(· | e).

A typical example of a posterior BN_e is a BN combined with an updated posterior state as defined by exact inference algorithms, e.g. using evidence absorption [van der Gaag, 1996]. Approximations of BN_e are used by importance sampling algorithms. These approximations consist of the original BN with all evidence vertices ignored from further consideration.

Def. 4 Let BN = (G, Pr) be a Bayesian network with G = (V, A) and evidence e for variables E ⊆ V. The evidence-simplified ESBN_e of BN is defined by ESBN_e = (G'_e, Pr'_e), where G'_e = (V'_e, A'_e), V'_e = V \ E, and A'_e = {(X, Y) | (X, Y) ∈ A ∧ X, Y ∉ E}.

The joint probability distribution Pr'_e of an evidence-simplified BN approximates Pr_e. For example, SIS and AIS-BN adjust the CPTs of the original BN.

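The importance sampling estimate of I = ∫ g(X) dX described above can be sketched in a few lines of Python. The example below is our own illustration (not from the paper): g(x) = x e^{-x} on [0, ∞), whose integral is 1, with the Exp(1) density as importance function, so the weighted integrand g(x)/f(x) reduces to x.

    import random

    def estimate(n=100000, seed=0):
        rng = random.Random(seed)
        total = 0.0
        for _ in range(n):
            x = rng.expovariate(1.0)   # sample from f(x) = e^{-x}
            total += x                 # g(x)/f(x) = x e^{-x} / e^{-x} = x
        return total / n

    print(estimate())  # close to the exact value I = 1

The variance of this estimator depends on how closely f matches |g|/I; the closer the match, the smaller the error variance, which is the motivation for learning a good importance function.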
2.3 KL-Divergence Bounds

We give a lower bound on the KL-divergence [Kullback, 1959] of the evidence-simplified Pr'_e from the exact Pr_e. The lower bound is valid for all variations of Pr'_e, including those generated by importance sampling algorithms that adjust the CPT.

Theorem 1 Let ESBN_e = (G'_e, Pr'_e) be an evidence-simplified BN given evidence e for E ⊆ V. If Pr'_e(V | π_e(V)) = Pr(V | π(V), e) for all V ∈ V, then the KL-divergence between Pr'_e and Pr_e is minimal and given by

  ∑_{X∈X} ∑_{Cfg(X,π(X))} Pr(x, π(x) | e) ln Pr(x | π(x))
  + ∑_{X∈X} ∑_{Cfg(X,π(X))} Pr(x, π(x) | e) ln [1 / Pr'_e(x | π_e(x))]
  + ∑_{Cfg(π(E))} Pr(π(e) | e) ln ∏_{e∈e} Pr(e | π(e)) − ln Pr(e)        (1)

where X = V \ E.

Proof. See Appendix A. □

Theorem 1 bounds the error variance from below, which is empirically verified for SIS and AIS-BN in the results Section 4. The divergence Eq. (1) is zero when specific conditions are met, as stated below.

Corollary 1 Let ESBN_e = (G'_e, Pr'_e) be an evidence-simplified BN given evidence e for E ⊆ V. If π(E) ∩ (V \ E) = ∅, then Pr'_e = Pr_e.

Proof. See Appendix B. □

Hence, the optimal importance function is obtained when all evidence vertices are clustered as roots in G_e. We will now show how Pr'_e can approach the optimal Pr_e without restrictions. For sake of explanation, the following widely-held assumptions are reiterated:

Assumption 1 The topological order δ of a BN and its posterior version δ_e of BN_e are consistent. That is, δ_e(Y) < δ_e(X) → δ(Y) < δ(X) for all X, Y ∈ V \ E.

Assumption 1 is reasonable for the following facts:

1. According to the chain rule, a BN can be built up in any topological order and all of them describe the same joint probability distribution.

2. Although there has never been a widely accepted definition of what causality is, it is widely accepted that the fact of observing evidence for random variables should not change the causality relationship between the variables.

Theorem 2 Let BN_e = (G_e, Pr_e) be the posterior of a BN = (G, Pr) given evidence e for E ⊆ V. If X ∉ An(E) for X ∈ V \ E, then Pr_e(X | Ah_e(X)) = Pr(X | π(X)). The evidence vertices in π(X) take configurations fixed by e, that is, Pr(X | π(X)) = Pr(X | π(X) \ E, e_1, ..., e_m) for all e_i ∈ π(X) ∩ E.

Proof. See [Cheng and Druzdzel, 2000a]. □

Hence, to compute the posterior probability of a vertex that is not an ancestor of an evidence vertex, there is no need to change the parents of the vertex or its CPT. For vertices that are ancestors of evidence vertices, we use Bayes' formula and d-separation to explore the effects of evidence on those vertices. Without loss of generality, only one evidence vertex is considered. The result applies to an evidence vertex set by transitivity.

Lemma 1 Let BN_e = (G_e, Pr_e) be the posterior of a BN = (G, Pr) given evidence e = {e} for E ∈ V. Let X ∈ An(E). Then, Pr_e(X | Ah_e(X)) = [Pr(e | X, Ah(X)) / Pr(e | Ah(X))] Pr(X | π(X)).

Proof. Because Ah_e(X) = Ah(X) by Assumption 1, we have

  Pr_e(X | Ah_e(X)) = Pr_e(X, Ah(X)) / Pr_e(Ah(X)) = Pr(X, Ah(X) | e) / Pr(Ah(X) | e)
  = Pr(X, Ah(X), e) / Pr(Ah(X), e)
  = [Pr(e | X, Ah(X)) / Pr(e | Ah(X))] · [Pr(X, Ah(X)) / Pr(Ah(X))]
  = [Pr(e | X, Ah(X)) / Pr(e | Ah(X))] Pr(X | π(X))

by using Theorem 2. □

Theorem 2 and Lemma 1 show that if we have Pr(e | X, Ah(X)) and Pr(e | Ah(X)) for all X ∈ An(E), we can derive Pr_e(X | Ah_e(X)) to compute

  Pr_e(V) = ∏_{V∈An(E)} Pr_e(V | Ah_e(V)) ∏_{V∉An(E)} Pr(V | π(V))

for the optimal importance function. However, there are two problems to derive Pr_e. Firstly, Ah(X) is too large to construct a posterior BN for Pr_e in practice. Secondly, instead of the exact Pr_e(X | Ah_e(X)) we have an estimate by importance sampling.

The parent sets of X ∈ An(E) can be minimized by exploiting d-separation. Let α_e(X) ⊆ X ∪ Ah(X) denote the minimal vertex set that d-separates evidence E and X ∪ Ah(X), thus ⟨E, X ∪ Ah(X) | α_e(X)⟩. We will refer to α_e(X) as the "shield" of X given E. Let β_e(X) ⊆ Ah(X) denote the minimal vertex set that d-separates evidence E and Ah(X), thus ⟨E, Ah(X) | β_e(X)⟩. We explore the relationship between α_e(X) and β_e(X) below.

Lemma 2 Let BN_e = (G_e, Pr_e) be the posterior of a BN = (G, Pr) given evidence e = {e} for E ∈ V. Then, β_e(X) ⊆ (α_e(X) \ X) ∪ π(X) for all X ∈ An(E).

Proof. See Appendix C. □

Therefore, we can approach the optimal importance function Pr_e(X | Ah_e(X)) by estimation of Pr_e(X | (α_e(X) \ X) ∪ π(X)) from importance samples.
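The practical consequence of Theorem 1 and Corollary 1 can be illustrated numerically. The following self-contained Python sketch is our own illustration with hypothetical numbers (it is not an experiment from the paper): for a v-structure A → E ← B with evidence E, the best evidence-simplified product of marginals keeps a strictly positive KL-divergence from the exact posterior, while adding the arc A → B, in the spirit of the refractoring idea, reduces the divergence to zero.

    from math import log

    pA = {0: 0.6, 1: 0.4}                       # Pr(A), hypothetical values
    pB = {0: 0.7, 1: 0.3}                       # Pr(B)
    pE = {(0, 0): 0.1, (0, 1): 0.8,             # Pr(E = 1 | A, B)
          (1, 0): 0.7, (1, 1): 0.2}

    # Exact posterior Pr(A, B | E = 1) by enumeration.
    joint = {(a, b): pA[a] * pB[b] * pE[(a, b)] for a in (0, 1) for b in (0, 1)}
    z = sum(joint.values())
    post = {ab: p / z for ab, p in joint.items()}

    # Best evidence-simplified approximation (cf. Theorem 1): A and B stay
    # disconnected, each factor set to its posterior marginal.
    mA = {a: sum(post[(a, b)] for b in (0, 1)) for a in (0, 1)}
    mB = {b: sum(post[(a, b)] for a in (0, 1)) for b in (0, 1)}
    approx = {(a, b): mA[a] * mB[b] for a in (0, 1) for b in (0, 1)}

    kl = sum(p * log(p / approx[ab]) for ab, p in post.items() if p > 0)
    print("KL(Pr_e || Pr'_e) =", kl)   # strictly positive: A and B are
                                       # dependent given E, so no product of
                                       # marginals reaches the posterior

    # Adding the arc A -> B lets Pr'_e(A) Pr'_e(B | A) represent Pr_e exactly.
    refr = {(a, b): mA[a] * (post[(a, b)] / mA[a]) for a in (0, 1) for b in (0, 1)}
    kl_refr = sum(p * log(p / refr[ab]) for ab, p in post.items() if p > 0)
    print("KL after adding A -> B =", kl_refr)  # zero up to rounding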

    Input: Evidence E ∈ V and X ∈ An(E)
    Output: The set S = α_e(X)
    Data: array A, queue Q
    A ← topSort_δ(Ah(X));
    S ← {X};
    for i ← |A| to 1 do
        Q ← ∅;
        push(Q, A[i]);
        while Q ≠ ∅ do
            V ← pop(Q);
            if V = E then S ← S ∪ {A[i]}; break;
            if V ∉ S ∧ V ∈ An(E) ∧ X ∉ An(V) then push(Q, children(V));
        end
    end

Algorithm 1: Computing the Shield α_e(X)

    Input: BN = (G, Pr), evidence e for E ⊆ V
    Output: refractored BN_e
    foreach E ∈ E do
        foreach X ∈ An(E) do
            expand π_e(X) = (α_e(X) \ {X}) ∪ π(X);
            update the CPT of X;
        end
    end

Algorithm 2: Refractor Procedure
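The following Python sketch is a direct transcription of Algorithm 1 as we read the pseudocode above; it is not the authors' implementation. The graph is assumed to be available through precomputed lookup tables with hypothetical names: children (vertex → list of children), ancestors (vertex → An(·)), ahead (vertex → Ah(·)), and topo_index (the topological order δ).

    from collections import deque

    def shield(X, E, ahead, children, ancestors, topo_index):
        """Compute alpha_e(X): X plus the vertices of Ah(X) from which the
        forward search below reaches the evidence vertex E."""
        A = sorted(ahead[X], key=lambda v: topo_index[v])   # topSort_delta(Ah(X))
        S = {X}
        for i in range(len(A) - 1, -1, -1):                 # i <- |A| downto 1
            Q = deque([A[i]])
            while Q:
                V = Q.popleft()
                if V == E:
                    S.add(A[i])
                    break
                if V not in S and V in ancestors[E] and X not in ancestors[V]:
                    Q.extend(children[V])
        return S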

Figure 1: Refractor Example

3 Refractor Importance Sampling

The RIS algorithm modifies the BN structure according to the shield α_e(X) for vertices X ∈ An(E) by expanding the parent set of X and adjusting its CPT accordingly. Visually in the graph, RIS refracts arcs from the evidence vertices, which inspired the choice of name for the method. The algorithms and general procedure of RIS are introduced in this section.

3.1 Computing the Shield

Alg. 1 computes α_e(X) in O(|A|) worst-case time, assuming An(·) is determined in unit time (e.g. using a lookup table). Function topSort_δ topologically sorts the set Ah(X) by topological order δ over V of the BN. Note that the shield α_e(X) can be computed in advance for each X ∈ V given evidence nodes E.

3.2 Refractor Procedure

Alg. 2 modifies the graphical structure of the BN. The time complexity of this algorithm is O(|V||A|) if |E| ≪ |V|, otherwise it is O(|V|²|A|). The CPT of a vertex X is updated by populating the expanded entries α_e(X) \ {X} using sampling data (described in Section 3.4).

Fig. 1 shows an example refractored BN using Alg. 2. E is the evidence node. Here, α_e(C) = {A} and α_e(B) = {A}. Arcs A → B and A → C are added. Note that arc A → B adjusts for the fact that the influence relationship between A and B has changed through evidence E. Arc E → D is no longer required and can be removed as in [van der Gaag, 1996].

3.3 General RIS Procedure

RIS utilizes both the qualitative and quantitative properties of a BN to approach the optimal importance function. The general procedure of RIS is:

1. The structure of the BN is modified by Algorithms 1 and 2. The CPTs of (a subset of) ancestor vertices of evidence vertices are expanded.

2. Update the CPT values through some specific learning algorithm (see Section 3.4 for details).

3. Sample the BN with an importance sampling algorithm using the new importance function.

3.4 Variations of RIS

Step 1 modifies the BN structure significantly, especially when the ancestor sets of evidence vertices are large, e.g. when evidence vertices are leaves. This increases the complexity of the BN. However, the effect of evidence on other vertices is attenuated when the path length between the evidence and the vertices is increased [Henrion, 1989]. Therefore, instead of modifying all ancestors An(E) of evidence E in Step 1, it is generally sufficient to select a subset of ancestors, such as the combined parent set π(E).

Steps 1 and 2 are independent, because any importance function learning algorithm can be applied in Step 2. Steps 2 and 3 can be combined by using the same importance sampling algorithm for learning and inference. In our experiments we used SIS and AIS-BN for both learning and inference (Steps 2 and 3), referred to as RISSIS and RISAIS, respectively. AIS-BN will be referred to as AIS. A sketch of the CPT update in Step 2 is given below.
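To make Step 2 concrete, the sketch below shows one simple way to populate the expanded CPT entries Pr_e(X | (α_e(X) \ {X}) ∪ π(X)) from weighted samples. It is our own illustration under stated assumptions: the samples are (assignment, weight) pairs produced by an importance sampler such as SIS, the data layout and the helper name estimate_cpt are hypothetical, and the use of weighted counting with a small prior is our choice, not a procedure prescribed by the paper.

    from collections import defaultdict

    def estimate_cpt(samples, X, expanded_parents, states, prior=1.0):
        """Estimate Pr_e(X | expanded parents) by weighted counting, with a
        small Dirichlet-style prior to avoid zero entries. Only parent
        configurations that occur in the samples are returned."""
        counts = defaultdict(lambda: defaultdict(float))
        for assignment, weight in samples:
            key = tuple(assignment[p] for p in expanded_parents)
            counts[key][assignment[X]] += weight
        cpt = {}
        for key, row in counts.items():
            total = sum(row.values()) + prior * len(states[X])
            cpt[key] = {x: (row[x] + prior) / total for x in states[X]}
        return cpt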

Figure 2: Synthetic BN Results: Ratio of Lowest MSE for RISAIS Versus AIS, and RISSIS Versus SIS.

4 Results

This section presents the experimental results of RISSIS and RISAIS compared to SIS and AIS for synthetic networks and two real-world networks.

4.1 Measurement

The MSE (mean squared error) metric was used to measure the error of the importance sampling results compared to the exact solution:

  MSE = sqrt( [1 / ∑_{X_i∈X} n_i] ∑_{X_i∈X} ∑_{j=1}^{n_i} (Pr'_e(x_ij) − Pr_e(x_ij))² ),

where X = V \ E. We also measured the KL-divergence of the approximate and exact posterior probability distributions:

  KL-divergence = ∑_{Cfg(X)} Pr_e(x) ln [Pr_e(x) / Pr'_e(x)].

Recall that Theorem 1 gives a lower bound for the KL-divergence of the posterior probability distributions of SIS and AIS, which is indicated in the results by the PostKLD lower bound from Eq. (1).

The number of samples is taken as a measure of running time instead of CPU time in our experimental implementation. Recall that the overhead of RIS is fixed at startup when the evidence set can be predetermined. Furthermore, the RIS overhead is limited to collecting the updated CPT values during sampling (and learning in the case of RISAIS).

The reported sampling frequencies for AIS (and RISAIS) are for calculating the posterior results. Because AIS separates the importance function learning stage from the sampling stage, the actual number of samples taken for AIS (total sampling for the importance function and sampling the results) is twice that of SIS. Recommended parameters [Cheng and Druzdzel, 2000a] are used in AIS and RISAIS.

4.2 Test Cases

Because computing the MSE is expensive and PostKLD is exponential in the number of vertices, small-sized synthetic BNs with random variables with two or three states and |V| = 20 vertices and |A| = 30 arcs were evaluated in our experiments. The CPT for each variable is randomly generated with uniform distribution for the probability interval [0.1, 0.9], with bias for the extreme probabilities in the intervals (0, 0.1) and (0.9, 1). For the experiments we generated 100 different synthetic BNs with these characteristics.

We also verified RIS with two real-world BNs: Alarm-37 [Beinlich et al., 1989] and HeparII-70 [Onisko, 2003]. The probability distributions of these networks are more extreme compared to the synthetic BNs. For each of the two BNs, 20 sets of evidence variables are randomly chosen, each with 10 evidence variables. For Alarm-37 and HeparII-70 we chose to limit the refractoring to the parent nodes of the evidence set π(E) instead of An(E), see Section 3.4.

4.3 Results for Synthetic Test Cases

We compared the MSE of four algorithms: AIS, RISAIS, SIS, and RISSIS. For this comparison a selection of 21 BNs from the generated synthetic test case suite was made. The other 79 test cases have PostKLD ≤ 0.1, which means according to Theorem 1 that the RIS advantage is limited.

Fig. 2 shows the results for the 21 synthetic BNs, where the sample frequency is varied from 1,000 to 19,000 in increments of 1,000. The dark columns in the figures represent the ratio of lowest MSE cases for RISAIS versus AIS and RISSIS versus SIS. A ratio of 50% or higher indicates that the RIS algorithm has lower error variance than the non-RIS algorithm. For RISAIS this is the case for all but one of the 19 measurements taken. In total, the MSE is lowest for RISAIS in 61.4% of the cases on average over all samples. For RISSIS this is the case for all but four of the 19 measurements taken. In total, the MSE is lowest for RISSIS in 58.4% of the cases on average over all samples.

In fact, it is to be expected that the higher the PostKLD lower bound, the better the RIS algorithms should perform. In order to determine the impact with increasing PostKLD, we selected all 100 synthetic BN test cases and measured the KL-divergence after 11,000 samples.
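For reference, the two error metrics of Section 4.1 can be computed as in the following Python sketch. The data layouts are hypothetical: per-variable posterior marginals for the MSE, and distributions over configurations of the non-evidence variables for the KL-divergence.

    from math import sqrt, log

    def mse(exact, approx):
        # exact, approx: {variable: {state: posterior probability}}
        n = sum(len(states) for states in exact.values())
        s = sum((approx[v][x] - exact[v][x]) ** 2
                for v, states in exact.items() for x in states)
        return sqrt(s / n)

    def kl(exact_joint, approx_joint):
        # distributions over configurations of the non-evidence variables
        return sum(p * log(p / approx_joint[cfg])
                   for cfg, p in exact_joint.items() if p > 0)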

Figure 3: Synthetic BN Results: KL-Divergence of RISAIS and AIS with PostKLD Lower Bound


Fig. 3 shows the result for RISAIS, where the 100 BNs are ranked according to the PostKLD. Recall that the PostKLD is the lower bound on the KL-divergence of AIS. From the figure it can be concluded that AIS does not approach the exact solution for a significant number of test cases, whereas RISAIS is not limited by the bound due to the BN refractoring.

It should be noted that around points 1 and 26 in Fig. 3 the KL-divergence of RISAIS is worse compared to AIS. We believe the reason is that AIS heuristically changes the original CPT, which has a negative impact on the RIS algorithm's ability to adjust the CPT to the optimal importance function.

Figure 4: Synthetic BN Results: KL-Divergence of RISSIS and SIS with PostKLD Lower Bound

Fig. 4 shows the result for RISSIS, where the 100 BNs are ranked according to the PostKLD. Interestingly, the RISSIS and SIS results are better on average than the RISAIS and AIS results. Note that the PostKLD lower bound is the same for AIS and SIS. However, in this study SIS appears to approach the PostKLD closer than AIS. Also here we can conclude that SIS does not approach the exact solution for a significant number of test cases, whereas RISSIS is not limited by the bound due to the BN refractoring.

4.4 Results for Alarm-37 and HeparII-70

Fig. 5 shows the results for Alarm-37 and HeparII-70, where the sample frequency is varied from 1,000 to 19,000 in increments of 1,000. The dark columns in the figures represent the ratio of lowest MSE cases for RISAIS versus AIS and RISSIS versus SIS. A ratio of 50% or higher indicates that the RIS algorithm has lower error variance than the non-RIS algorithm. For RISAIS this is the case for all but one of the 19 measurements taken. In total, the MSE is lowest for RISAIS in 56.7% of the cases on average over all samples. For RISSIS this is the case for all 19 measurements taken. In total, the MSE is lowest for RISSIS in 60.3% of the cases on average over all samples.

Figure 5: Ratio of Lowest MSE for RISAIS Versus AIS, and RISSIS Versus SIS (Alarm-37 and HeparII-70).

The combined results show that the RIS algorithms have reduced error variance for the synthetic networks and the two real-world networks. The PostKLD lower bound limits the ability of AIS and SIS to approach the optimal importance function. By contrast, the RIS approach successfully eliminates this limitation of AIS and SIS, thereby providing an improvement to reduce error variance in BN importance sampling propagation under evidential reasoning.

5 Conclusions

In order to approach the optimal importance function for importance sampling propagation under evidential reasoning with a Bayesian network, a modification of the network's structure is necessary to eliminate the lower bound on the error variance. To this end, the proposed RIS algorithms refractor the network and adjust the conditional probability tables to minimize the divergence to the optimal importance function. The validity and performance of the RIS approach was empirically tested with a set of synthetic networks and two real-world networks.

Additional improvements of RIS are possible to achieve a better accuracy/cost ratio. The goal is to find an effective subset of the full shield of an ancestor vertex of an evidence vertex, or to select a limited subset of the ancestors of evidence vertices that are refractored. Also, some of the additionally introduced arcs could be removed with the arc removal algorithm [van Engelen, 1997] when they present a weak influence. Such strategies would reduce the complexity of the refractored network while still ensuring higher accuracy over current importance sampling algorithms.

References

[Beinlich et al., 1989] Beinlich, I., Suermondt, G., Chavez, R., and Cooper, G. (1989). The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks. In Proceedings of the 2nd European Conference on AI and Medicine, Berlin. Springer-Verlag.

[Bouckaert, 1994] Bouckaert, R. R. (1994). A stratified simulation scheme for inference in Bayesian belief networks. In Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence, pages 110–117.

[Chavez and Cooper, 1990] Chavez, R. M. and Cooper, G. F. (1990). A randomized approximation algorithm for probabilistic inference on Bayesian belief networks. Networks, 20:661–685.

[Cheng and Druzdzel, 2000a] Cheng, J. and Druzdzel, M. J. (2000a). AIS-BN: An adaptive importance sampling algorithm for evidential reasoning in large Bayesian networks. Journal of Artificial Intelligence Research, 13:155–188.

[Cheng and Druzdzel, 2000b] Cheng, J. and Druzdzel, M. J. (2000b). Computational investigations of low-discrepancy sequences in simulation algorithms for Bayesian networks. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pages 72–81. Morgan Kaufmann Publishers.

[Cheng and Druzdzel, 2000c] Cheng, J. and Druzdzel, M. J. (2000c). Latin hypercube sampling in Bayesian networks. In Proceedings of the 13th International Florida Artificial Intelligence Research Symposium Conference (FLAIRS-2000), pages 287–292, Orlando, Florida. AAAI Publishers.

[Cooper, 1990] Cooper, G. F. (1990). The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42(2-3):393–405.

[Dagum and Luby, 1993] Dagum, P. and Luby, M. (1993). Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence, 60:141–153.

[Fung and Chang, 1989] Fung, R. and Chang, K. C. (1989). Weighting and integrating evidence for stochastic simulation in Bayesian networks. In Proceedings of the 5th Conference on Uncertainty in Artificial Intelligence, pages 209–219, New York, NY. Elsevier Science Publishing Company, Inc.

[Garvey and Lesser, 1994] Garvey, A. J. and Lesser, V. R. (1994). A survey of research in deliberative real-time artificial intelligence. Real-Time Systems, 6(3):317–347.

[Geman and Geman, 1984] Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distribution and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741.

[Gilks et al., 1996] Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (1996). Markov Chain Monte Carlo in Practice. Chapman and Hall, London.

[Guo and Hsu, 2002] Guo, H. and Hsu, W. (2002). A survey on algorithms for real-time Bayesian network inference. In the joint AAAI-02/KDD-02/UAI-02 workshop on Real-Time Decision Support and Diagnosis Systems, Edmonton, Alberta, Canada.

[Henrion, 1988] Henrion, M. (1988). Propagating uncertainty in Bayesian networks by probabilistic logic sampling. In Proceedings of the 2nd Conference on Uncertainty in Artificial Intelligence, pages 149–163, New York. Elsevier Science.

[Henrion, 1989] Henrion, M. (1989). Some practical issues in constructing belief networks. In Kanal, L., Levitt, T., and Lemmer, J., editors, Proceedings of the 3rd Conference on Uncertainty in Artificial Intelligence, pages 161–173, North Holland. Elsevier Science.

[Kullback, 1959] Kullback, S. (1959). Information Theory and Statistics. John Wiley and Sons, New York.

[MacKay, 1998] MacKay, D. (1998). Introduction to Monte Carlo methods. In Jordan, M., editor, Learning in Graphical Models. MIT Press.

[Neapolitan, 1990] Neapolitan, R. E. (1990). Probabilistic Reasoning in Expert Systems. John Wiley and Sons, NY.

[Onisko, 2003] Onisko, A. (2003). Probabilistic Causal Models in Medicine: Application to Diagnosis of Liver Disorders. PhD thesis, Institute of Biocybernetics and Biomedical Engineering, Polish Academy of Science, Warsaw.

[Pearl, 1987] Pearl, J. (1987). Evidential reasoning using stochastic simulation of causal models. Artificial Intelligence, 32(2):245–257.

[Pearl, 1988] Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, Inc., San Mateo, CA.

[Rubinstein, 1981] Rubinstein, R. Y. (1981). Simulation and the Monte Carlo Method. John Wiley and Sons, Hoboken, NJ.

[Russell and Norvig, 1995] Russell, S. and Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice Hall Series in Artificial Intelligence. Prentice Hall.

[Santos et al., 1995] Santos, E. J., Shimony, S. E., Solomon, E., and Williams, E. (1995). On a distributed anytime architecture for probabilistic reasoning. Technical report, AFIT/EN/TR95-02, Department of Electrical and Computer Engineering, Air Force Institute of Technology.

[Shachter and Peot, 1990] Shachter, R. D. and Peot, M. A. (1990). Simulation approaches to general probabilistic inference on belief networks. In Proceedings of the 5th Conference on Uncertainty in Artificial Intelligence, volume 5.

[Shannon, 1956] Shannon, C. E. (1956). The bandwagon. IRE Transactions on Information Theory, IT-2:3.

[van der Gaag, 1996] van der Gaag, L. (1996). On evidence absorption for belief networks. International Journal of Approximate Reasoning, 15(3):265–286.

[van Engelen, 1997] van Engelen, R. (1997). Approximating Bayesian belief networks by arc removal. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(8):916–920.

Appendix

A Proof of Theorem 1

Proof. We use the KL-divergence (cross entropy) [Kullback, 1959] to measure the difference between the posterior BN_e and ESBN_e. The KL-divergence is

  E_1[ln (Pr_1(V) / Pr_2(V))] = ∑_{Cfg(V)} Pr_1(v) ln [Pr_1(v) / Pr_2(v)].

Hence, the KL-divergence between the posterior BN_e and ESBN_e is

  ∑_{Cfg(V\E)} Pr_e(v \ e) ln [Pr_e(v \ e) / Pr'_e(v \ e)] = ∑_{Cfg(X)} Pr_e(x) ln [Pr_e(x) / Pr'_e(x)],

where X = V \ E. This is further simplified as follows:

  ∑_{Cfg(X)} Pr_e(x) ln [Pr_e(x) / Pr'_e(x)]
  = ∑_{Cfg(X)} Pr(x | e) ln [Pr(x | e) / Pr'_e(x)]
  = ∑_{Cfg(X)} Pr(x | e) ln [ ∏_{x_j∈x} Pr(x_j | π(x_j)) ∏_{e_i∈e} Pr(e_i | π(e_i)) / (Pr(e) ∏_{x_j∈x} Pr'_e(x_j | π_e(x_j))) ]
  = ∑_{Cfg(X)} Pr(x | e) ln [ ∏_{x_j∈x} Pr(x_j | π(x_j)) ∏_{e_i∈e} Pr(e_i | π(e_i)) / ∏_{x_j∈x} Pr'_e(x_j | π_e(x_j)) ] + ln [1 / Pr(e)] ∑_{Cfg(X)} Pr(x | e)
  = ∑_{Cfg(X)} Pr(x | e) ln [ ∏_{x_j∈x} Pr(x_j | π(x_j)) / ∏_{x_j∈x} Pr'_e(x_j | π_e(x_j)) ] + ∑_{Cfg(X)} Pr(x | e) ln ∏_{e_i∈e} Pr(e_i | π(e_i)) − ln Pr(e)
  = ∑_{X_j∈X} ∑_{Cfg(X_j,π(X_j))} Pr(x_j, π(x_j) | e) ln [Pr(x_j | π(x_j)) / Pr'_e(x_j | π_e(x_j))] + ∑_{Cfg(π(E))} Pr(π(e) | e) ln ∏_{e_i∈e} Pr(e_i | π(e_i)) − ln Pr(e)
  = ∑_{X_j∈X} ∑_{Cfg(X_j,π(X_j))} Pr(x_j, π(x_j) | e) ln Pr(x_j | π(x_j))
    + ∑_{X_j∈X} ∑_{Cfg(X_j,π(X_j))} Pr(x_j, π(x_j) | e) ln [1 / Pr'_e(x_j | π_e(x_j))]
    + ∑_{Cfg(π(E))} Pr(π(e) | e) ln ∏_{e_i∈e} Pr(e_i | π(e_i)) − ln Pr(e)        (Eq. 1)

The first, third, and fourth terms in Eq. 1 are decided by the original probability distribution, so

  ∑_{X_j∈X} ∑_{Cfg(X_j,π(X_j))} Pr(x_j, π(x_j) | e) ln Pr(x_j | π(x_j)) + ∑_{Cfg(π(E))} Pr(π(e) | e) ln ∏_{e_i∈e} Pr(e_i | π(e_i)) − ln Pr(e)

is constant. To minimize the difference between the posterior BN_e and ESBN_e the only choice is to minimize the following term:

  ∑_{X_j∈X} ∑_{Cfg(X_j,π(X_j))} Pr(x_j, π(x_j) | e) ln [1 / Pr'_e(x_j | π_e(x_j))]
  = ∑_{X_j∈X} ∑_{Cfg(π(X_j))} ∑_{Cfg(X_j)} Pr(x_j, π(x_j) | e) ln [1 / Pr'_e(x_j | π_e(x_j))].

This is equal to minimizing the term for each X_j ∈ X and each possible configuration of π(x_j):

  ∑_{Cfg(X_j)} Pr(x_j, π(x_j) | e) ln [1 / Pr'_e(x_j | π_e(x_j))]
  = Pr(π(x_j) | e) ∑_{Cfg(X_j)} Pr(x_j | π(x_j), e) ln [1 / Pr'_e(x_j | π(x_j) \ e)].

We have ∑_{Cfg(X_j)} Pr'_e(x_j | π_e(x_j)) = ∑_{Cfg(X_j)} Pr'_e(x_j | π(x_j) \ e) = 1. According to Shannon's information theory [Shannon, 1956], to minimize ∑_{Cfg(X_j)} Pr(x_j | π(x_j), e) ln [1 / Pr'_e(x_j | π(x_j) \ e)] we should set Pr'_e(x_j | π(x_j) \ e) = Pr(x_j | π(x_j), e). This proves Theorem 1. □

B Proof of Corollary 1

Proof. Let X = V \ E. Then

  ∑_{Cfg(V\E)} Pr_e(v \ e) ln [Pr_e(v \ e) / Pr'_e(v \ e)]
  = ∑_{Cfg(X)} Pr_e(x) ln [Pr_e(x) / Pr'_e(x)]
  = ∑_{Cfg(X)} Pr(x | e) ln [Pr(x | e) / Pr'_e(x)]
  = ∑_{Cfg(X)} Pr(x | e) ln [ ∏_{x_j∈x} Pr(x_j | π(x_j)) ∏_{e_i∈e} Pr(e_i | π(e_i)) / (Pr(e) ∏_{x_j∈x} Pr'_e(x_j | π_e(x_j))) ].

Since π(E) ∩ (V \ E) = ∅ ⇒ ∀X_j ∈ X, Pr(x_j | π(x_j), e) = Pr(x_j | π(x_j)), from Theorem 1 we set Pr'_e(x_j | π_e(x_j)) = Pr(x_j | π(x_j)) to minimize the divergence. Then

  ∑_{Cfg(X)} Pr(x | e) ln [ ∏_{x_j∈x} Pr(x_j | π(x_j)) ∏_{e_i∈e} Pr(e_i | π(e_i)) / (Pr(e) ∏_{x_j∈x} Pr'_e(x_j | π_e(x_j))) ]
  = ∑_{Cfg(X)} Pr(x | e) ln [ ∏_{e_i∈e} Pr(e_i | π(e_i)) / Pr(e) ].

Also from π(E) ∩ (V \ E) = ∅, ∀E_i ∈ E, π(E_i) ⊆ E ⇒ ∏_{e_i∈e} Pr(e_i | π(e_i)) = Pr(e), so

  ∑_{Cfg(X)} Pr(x | e) ln [ ∏_{e_i∈e} Pr(e_i | π(e_i)) / Pr(e) ] = 0.

The KL-divergence between Pr'_e and Pr_e is zero, thus Pr'_e(v \ e) = Pr_e(v \ e) according to [Kullback, 1959]. □

C Proof of Lemma 2

Proof. ∀X_k ∈ Ah(X_j) \ β_e(X_j), consider the following three cases.

Case 1: If a path X_k → E exists, then we show that this path is d-separated by α_e(X_j). There are two possibilities. First, the path X_k → E passes through X_j, so it must pass one of the parents of X_j. Then π(X_j) d-separates the path. Second, X_k → E does not pass X_j. Then the path must be d-separated by α_e(X_j) \ X_j, so (α_e(X_j) \ X_j) ∪ π(X_j) d-separates the path.

Case 2: If paths N → X_k and N → E exist, so N ∈ Ah(X_j), and N d-separates X_k and E, then according to Case 1, (α_e(X_j) \ X_j) ∪ π(X_j) d-separates the path N → E.

Case 3: If paths X_k → B and E → B exist, then according to the topological order {B, descendants of B} ∩ ((α_e(X_j) \ X_j) ∪ π(X_j)) = ∅, so (α_e(X_j) \ X_j) ∪ π(X_j) d-separates this path.

From Cases 1 to 3 we see that ⟨E, Ah(X_j) | (α_e(X_j) \ X_j) ∪ π(X_j)⟩, so β_e(X_j) ⊆ (α_e(X_j) \ X_j) ∪ π(X_j). □