
Learning Bayesian Network Models from Incomplete Data using Importance Sampling

Carsten Riggelsen and Ad Feelders
Institute of Information & Computing Sciences, Utrecht University
P.O. Box 80098, 3508 TB Utrecht, The Netherlands

Abstract

We propose a Bayesian approach to learning Bayesian network models from incomplete data. The objective is to obtain the posterior distribution of models, given the observed part of the data. We describe a new algorithm, called eMC4, to simulate draws from this posterior distribution. One of the new ideas in our algorithm is to use importance sampling to approximate the posterior distribution of models given the observed data and the current imputation model. The importance sampler is constructed by defining an approximate predictive distribution for the unobserved part of the data. In this way existing (heuristic) imputation methods can be used that don't require exact inference for generating imputations. We illustrate eMC4 by its application to modeling the risk factors of coronary heart disease. In the experiments we consider different missing data mechanisms and different fractions of missing data.

1 Introduction

Bayesian networks are probabilistic models that can represent complex interrelationships between random variables. They are an intuitively appealing formalism for reasoning with probabilities that can be employed for diagnosis and prediction purposes. Furthermore, learning Bayesian networks from data may provide valuable insight into the (in)dependences between the variables.

In the last decade, learning Bayesian networks from data has received considerable attention in the research community. Most learning algorithms work under the assumption that complete data is available. In practical learning problems, however, one frequently has to deal with missing values. The presence of incomplete data leads to analytical intractability and high computational complexity compared to the complete data case. It is very tempting to "make the problem go away", either by deleting observations with missing values or by using ad-hoc methods to fill in (impute) the missing data. Such procedures may, however, lead to biased results and, in the case of imputing a single value for the missing data, to overconfidence in the results of the analysis.

We avoid such ad-hoc approaches and use a method that takes all observed data into account and correctly reflects the increased uncertainty due to missing data. We do assume, however, that the missing data mechanism is ignorable as defined by Little and Rubin (1987). Essentially this means that the probability that some component is missing may depend on observed components, but not on unobserved components.

Our approach is Bayesian in the sense that we are not aiming for a single best model, but want to obtain (draws from) a posterior distribution over possible models. We show how to perform model averaging over Bayesian network models, or alternatively, how to get a range of good models, when we have incomplete data. We develop a method that can handle a broad range of imputation methods without violating the validity of the models returned. Our approach is not restricted to any imputation technique in particular, and therefore allows for imputation methods that do not require expensive inference in a Bayesian network.

This paper is organised as follows. In section 2 we briefly review previous research in this area and show how our work fits in. In section 3 we describe model learning from complete data. In sections 4 and 5 we introduce a new algorithm, called eMC4, for Bayesian network model learning from incomplete data. We performed a number of experiments to test eMC4 using real-life data.
The results of those experiments are reported in section 6. Finally, we summarize our work and draw conclusions.

2 Previous research

Here we briefly review relevant literature on learning Bayesian networks from incomplete data. Two popular iterative approaches for learning parameters are Expectation-Maximization (EM) by Dempster et al. (1977) and a simulation-based Gibbs sampler (Geman and Geman, 1984) called Data Augmentation (DA), introduced by Tanner and Wong (1987). For Bayesian networks, EM was studied by Lauritzen (1995). The Expectation step (E-step) involves performing inference in order to obtain expected sufficient statistics. The E-step is followed by a Maximization step (M-step) in which the Maximum Likelihood (ML) estimates are computed from the sufficient statistics. These two steps are iterated until the parameter estimates converge.

Data Augmentation (DA) is quite similar but is non-deterministic. Instead of calculating expected statistics, a value is drawn from a predictive distribution and imputed. Similarly, instead of calculating the ML estimates, one draws from the posterior distribution on the parameter space (conditioned on the sufficient statistics of the most recent imputed data set). Based on Markov chain theory this will eventually return realizations from the posterior parameter distribution. There are also EM derivatives that include a stochastic element quite similar to DA (see McLachlan and Krishnan, 1997).

Bound and Collapse (BC), introduced by Ramoni and Sebastiani (2001), is a two-phase algorithm. The bound phase considers possible completions of the data sample, and based on that computes an interval for each parameter estimate of the Bayesian network. The collapse phase computes a convex combination of the interval bounds, where the weights in the convex combination are computed from the available cases. The collapse phase seems to work quite well for particular missing data mechanisms but unfortunately is not guaranteed to give valid results for ignorable mechanisms in general.

Learning models from incomplete data so to speak adds a layer on top of the parameter learning methods described above. For EM, Friedman (1998) showed that doing a model selection search within EM will result in the best model in the limit according to some model scoring criterion. The Structural EM (SEM) algorithm is in essence similar to EM, but instead of computing expected sufficient statistics from the same Bayesian network model throughout the iterations, a model selection step is employed. To select the next model, a model search is performed, using the expected sufficient statistics obtained from the current model and current parameter values.

Ramoni and Sebastiani (1997) describe how BC can be used in a model selection setting. As remarked before, however, BC is not guaranteed to give valid results for ignorable mechanisms in general, and the risk of obtaining invalid results unfortunately increases when the model structure is not fixed.

In contrast to SEM, our aim is not to select a single model, but to obtain a posterior distribution over models that correctly reflects uncertainty, including uncertainty due to missing data. Therefore our approach is more related to the simulation-based DA described above.

3 Learning from complete data

In this section we discuss the Bayesian approach to learning Bayesian networks from complete data. First we introduce some notation. Capital letters denote discrete random variables, and lower case denotes a state. Boldface denotes random vectors and vector states. We use Pr(·) to denote probability distributions (or densities) and probabilities. D = (d_1, ..., d_c) denotes the multinomial data sample with c i.i.d. cases.

A Bayesian network (BN) for X = (X_1, ..., X_p) represents a joint probability distribution. It consists of a directed acyclic graph (DAG) m, called the model, where every vertex corresponds to a variable X_i, and a vector of conditional probabilities θ, called the parameter, corresponding to that model. The joint distribution factors recursively according to m as

Pr(X | m, θ) = ∏_{i=1}^{p} Pr(X_i | Π(X_i), θ)

where Π(X_i) is the parent set of X_i in m.
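To make the factorization concrete, the sketch below computes the joint probability of a complete case under a given DAG and parameter vector. It is an illustrative sketch only; the dictionary-based representation of the parent sets and of the conditional probability tables is our own assumption, not notation from the paper.

```python
# Sketch: joint probability of a BN, Pr(x | m, theta) = prod_i Pr(x_i | parents(x_i), theta).
# The data structures (dicts keyed by variable name) are illustrative assumptions.

def joint_probability(case, parents, cpt):
    """case:    dict variable -> observed state
    parents: dict variable -> tuple of parent variables (the DAG m)
    cpt:     dict variable -> dict (state, parent_states) -> probability (theta)"""
    prob = 1.0
    for var, state in case.items():
        parent_states = tuple(case[p] for p in parents[var])
        prob *= cpt[var][(state, parent_states)]
    return prob

# Example with two binary variables, B a parent of A:
parents = {"B": (), "A": ("B",)}
cpt = {
    "B": {("0", ()): 0.6, ("1", ()): 0.4},
    "A": {("0", ("0",)): 0.8, ("1", ("0",)): 0.2,
          ("0", ("1",)): 0.3, ("1", ("1",)): 0.7},
}
print(joint_probability({"B": "1", "A": "0"}, parents, cpt))  # 0.4 * 0.3
```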
Since we learn BNs from a Bayesian point of view, model and parameter are treated as random variables M and Θ. We define distributions on the parameter space, Pr^Θ(·), and on the model space, Pr^M(·). The superscript is omitted and we simply write Pr(·) for both. The distribution on the parameter space is a product Dirichlet distribution, which is conjugate for the multinomial sample D; Bayesian updating is easy because the posterior, once D has been taken into consideration, is again Dirichlet, but with updated hyper parameters.

The MAP model is found by maximizing with respect to M

Pr(M | D) ∝ Pr(D | M) · Pr(M)    (1)

where Pr(D | M) is the normalizing term in Bayes' theorem when calculating the posterior Dirichlet

Pr(D | M) = ∫ Pr(D | M, Θ) Pr(Θ | M) dΘ    (2)

where Pr(D | M, Θ) is the likelihood and Pr(Θ | M) is the product Dirichlet prior. In Cooper and Herskovits (1992) a closed formula is derived for (2) as a function of the sufficient statistics for D and the prior hyper parameters of the Dirichlet. This score can be written as a product of terms, each of which is a function of a vertex and its parents. This decomposability allows local changes of the model to take place without having to recompute the score for the parts that stay unaltered; that is, only the score for vertices whose parent set has changed needs to be recomputed.

Instead of the MAP model, we may be interested in the expectation of some quantity ∆ of models, using Pr(M | D) as a measure of uncertainty over all models:

E[∆] = Σ_M ∆_M · Pr(M | D) ≈ (1/q) Σ_{i=1}^{q} ∆_{m_i}    (3)

where the Monte Carlo approximation is obtained by sampling {m_i}_{i=1}^{q} from Pr(M | D).
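As a concrete illustration of the decomposability of the marginal likelihood (2), the sketch below evaluates a Cooper-Herskovits-style closed-form score as a sum of per-family log terms computed from counts. The uniform Dirichlet hyper parameter and the data representation (a list of complete cases as dicts) are assumptions made for illustration, not the authors' implementation.

```python
# Sketch: decomposable log marginal likelihood (Cooper & Herskovits, 1992),
#   log Pr(D | M) = sum over vertices i of a term depending only on X_i and Pi(X_i).
# alpha is an assumed uniform Dirichlet hyper parameter per cell.
from math import lgamma
from collections import Counter

def family_log_score(data, var, parents, states, alpha=1.0):
    """Log contribution of one vertex: depends only on var and its parent set."""
    r = len(states[var])                                    # number of states of var
    joint = Counter((tuple(d[p] for p in parents), d[var]) for d in data)
    by_parent = Counter(tuple(d[p] for p in parents) for d in data)
    score = 0.0
    for pa_cfg, n_pa in by_parent.items():
        score += lgamma(r * alpha) - lgamma(r * alpha + n_pa)
        for x in states[var]:
            score += lgamma(alpha + joint[(pa_cfg, x)]) - lgamma(alpha)
    return score

def log_marginal_likelihood(data, model, states, alpha=1.0):
    # model: dict vertex -> tuple of parents. Thanks to decomposability, a local
    # change to the DAG only requires rescoring the families whose parents changed.
    return sum(family_log_score(data, v, pa, states, alpha) for v, pa in model.items())
```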
It is infeasible to calculate the normalizing factor Pr(D) required to obtain equality in equation (1). Madigan and York (1995) and Giudici and Castelo (2003) propose to use (enhanced) Markov chain Monte Carlo Model Composition (eMC3) for drawing models from this distribution, leaving the calculation of the normalizing term implicit. It is a sampling technique based on Markov chain Monte Carlo Metropolis-Hastings sampling, summarized in the following iterated steps. At entrance assume model m^t:

1. Draw model m^{t+1} from a proposal distribution Pr(M | m^t), resulting in a slightly modified model compared to m^t (addition, reversal or removal of an arc).

2. The proposed model m^{t+1} is accepted with probability

   α(m^{t+1}, m^t) = min{ 1, [Pr(D | m^{t+1}) Pr(m^{t+1})] / [Pr(D | m^t) Pr(m^t)] },

   otherwise the proposed model is rejected and m^{t+1} := m^t.

For t → ∞ the models can be considered samples from the invariant distribution Pr(M | D). Note that in step two the normalizing factor Pr(D) has been eliminated. For enhanced MC3 a third step is required, Repeated Covered Arc Reversal (RCAR), which simulates the neighbourhood of equivalent DAG models. We refer to Kocka and Castelo (2001) for details.
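The Metropolis-Hastings step above can be sketched as follows, assuming a uniform model prior so that the acceptance ratio reduces to the marginal likelihood ratio. The helpers propose_neighbour (add, remove or reverse a random arc while keeping the graph acyclic) and log_marginal_likelihood are assumed to exist; they are not code from the paper.

```python
# Sketch of one MC^3 Metropolis-Hastings step and a short chain of q such steps.
import math
import random

def mc3_step(model, data, states, propose_neighbour, log_marginal_likelihood):
    candidate = propose_neighbour(model)
    log_ratio = (log_marginal_likelihood(data, candidate, states)
                 - log_marginal_likelihood(data, model, states))
    if random.random() < math.exp(min(0.0, log_ratio)):
        return candidate          # accept the proposed model m^{t+1}
    return model                  # reject: keep m^t

def mc3_chain(model, data, states, propose_neighbour, log_marginal_likelihood, q):
    samples = []
    for _ in range(q):
        model = mc3_step(model, data, states, propose_neighbour, log_marginal_likelihood)
        samples.append(model)
    return samples
```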
4 Learning from incomplete data

Running standard eMC3 can be quite slow, especially for large models and data sets. In the presence of missing data, a prediction 'engine' (predicting missing components) so to speak has to be wrapped around eMC3. A prediction engine that will always make the correct predictions is infeasible to construct, and when the engine itself has to adapt to the ever changing model this becomes even worse. An approximate predictive engine is usually easier to construct, but will obviously sometimes make slightly wrong predictions. In this section we show how such an approximation can be used together with eMC3 to obtain realizations from the posterior model distribution such that prediction errors are corrected for.

Our goal is to compute (3) when we have missing data. To be more precise, if we write D = (O, U) to denote the observed part O and the unobserved part U, our goal is to get draws from Pr(M | O) such that we can use the approximation in (3). Due to incompleteness the integral in (2) no longer has a tractable solution. Our approach is instead to rewrite the posterior model distribution such that U can be "summed out" by way of "filling in". Note that the desired model posterior can be written as

Pr(M | O) = Σ_U Pr(M | O, U) Pr(U | O).    (4)

The first term is the distribution given in (1), involving the prior and the marginal likelihood (2). The second part is the predictive distribution, which can be considered the predictor of the missing data based on the observed part. We explicitly model the predictive distribution as a BN with model M', the imputation model. We therefore write

Pr(M | O, M') = Σ_U Pr(M | O, U, M') Pr(U | O, M').    (5)

We assume that M is independent of M' given O and U, i.e. once we are presented with complete data, the imputation model has become irrelevant:

Pr(M | O, U, M') = Pr(M | O, U).

The Monte Carlo approximation of (5) is calculated as

Pr(M | O, M') ≈ (1/n) Σ_{i=1}^{n} Pr(M | O, U^i)

where U^i ∼ Pr(U | O, M') for i = 1, ..., n. So, if we could compute realizations from this predictive distribution we could approximate Pr(M | O, M'). Unfortunately we cannot use simple sequential Bayesian updating (Spiegelhalter and Lauritzen, 1990) for determining Pr(U | O, M'). Instead of sampling from the true predictive distribution, we define an approximate predictive distribution which can act as a proposal distribution for suggesting imputations. The predictive distribution can then be rewritten such that it acts as a quality measure for a proposed imputation. This is accomplished by using importance sampling. Denote the approximate predictive distribution Pr*(U) and rewrite (5):

Pr(M | O, M') = Σ_U Pr(M | O, U) [Pr(U | O, M') / Pr*(U)] Pr*(U).

Sample U^i ∼ Pr*(U) for i = 1, ..., n and use that sample in the importance sampling approximation

Pr(M | O, M') ≈ (1/W) Σ_{i=1}^{n} [Pr(U^i | O, M') / Pr*(U^i)] Pr(M | O, U^i)    (6)

where the bracketed ratio is the importance weight w_i and W = Σ_{i=1}^{n} w_i is the normalizing constant. Now rewrite the predictive distribution

Pr(U^i | O, M') = Pr(U^i, O | M') · 1 / Pr(O | M')

where the term Pr(U^i, O | M') is given by (2). Through normalization the denominator Pr(O | M') disappears, as it is independent of U. It therefore suffices to calculate the weights as

w_i = Pr(U^i, O | M') / Pr*(U^i)    (7)

where the numerator can be computed efficiently and the denominator is the probability of the proposal. The marginal likelihood in the numerator is in this context not used as a scoring criterion for models. Instead we use it as a scoring criterion for imputations. The denominator compensates for the bias introduced by drawing from Pr*(U) rather than the correct predictive distribution.
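A minimal sketch of the weights in (7), computed in log space and normalized into selection probabilities for the mixture (6), is given below. The helpers log_marginal_likelihood (the closed-form score of (2) applied to the augmented sample) and proposal_log_prob (the log probability of the proposed imputation under Pr*) as well as the per-case dict representation are assumptions for illustration.

```python
# Sketch: importance weights w_i = Pr(U^i, O | M') / Pr*(U^i), normalized to Pr(i) = w_i / W.
import math

def importance_weights(observed, imputations, imputation_model, states,
                       log_marginal_likelihood, proposal_log_prob):
    log_w = []
    for u in imputations:
        # Fill in U^i: merge each case's observed part with its imputed part.
        completed = [dict(o, **ui) for o, ui in zip(observed, u)]
        log_w.append(log_marginal_likelihood(completed, imputation_model, states)
                     - proposal_log_prob(u))
    m = max(log_w)                                   # log-sum-exp for numerical stability
    w = [math.exp(lw - m) for lw in log_w]
    total = sum(w)
    return [wi / total for wi in w]                  # selection probabilities Pr(i) = w_i / W
```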
Given a set {U^i}_{i=1}^{n} that has been sampled from Pr*(U), sampling from the mixture approximation (6) of Pr(M | O, M') is done as follows:

1. The probability of selecting sample U^i to augment O is proportional to the importance weight w_i.

2. The now complete sample (O, U^i) is used for sampling models from Pr(M | O, U^i) using eMC3.

We can now draw from Pr(M | O, M'), but our goal was to obtain draws from Pr(M | O), i.e. we need samples from the posterior model distribution given the observed data without conditioning on the imputation model M'. The desired distribution is obtained by Gibbs sampling. Given an imputation model m^l, draw the next model

m^{l+1} ∼ Pr(M | O, m^l)

which for l → ∞ results in a chain of realizations from Pr(M | O). This in effect allows us to calculate (3).

When n is large, the mixture approximation is close to the real distribution Pr(M | O, M'). However, the invariant model distribution is reached for any value assigned to n if we make sure that one of the imputation proposals is the current imputation, i.e. the augmented data sample that was selected at the last mixture draw before entering the eMC3 loop. By this overlap, n is indirectly increased every time the mixture is set up. From a practical point of view, however, n does have an impact on how well the model Markov chain mixes. Small n implies slow mixing, depending on how far the approximate predictive distribution is from the real predictive distribution.

We do not discuss parameter estimation in this paper, but merely mention that using the importance sampler presented above, it is also possible to approximate the posterior parameter distribution. In (6) simply plug in Pr(Θ | M', O, U^i) instead of Pr(M | O, U^i) to obtain the required posterior.

To summarize, figure 1 contains the pseudo-code for the algorithm, called enhanced Markov Chain Monte Carlo Model Composition with Missing Components, or for short, eMC4.

Algorithm eMC4(n, q, k)
 1  m^0 ← G = (V = X, E = ∅)
 2  r ← 0
 3  for l ← 0 to k
 4      W ← 0
 5      U^0 ← U^r
 6      for i ← 0 to n
 7          w_i ← Pr(O, U^i | m^l) / Pr*(U^i)
 8          W ← W + w_i
 9          if i ≠ n then draw U^{i+1} ∼ Pr*(U)
10      draw r ∼ Pr(i) = w_i / W
11      m^0 ← m^l
12      for t ← 0 to q
13          m^t ← RCAR(m^t)
14          draw m^{t+1} ∼ Pr(M | m^t)
15          B ← Pr(O, U^r | m^{t+1}) / Pr(O, U^r | m^t)
16          draw α ∼ Bernoulli(min{1, B})
17          if α ≠ 1 then m^{t+1} ← m^t
18      m^{l+1} ← m^{q+1}
19      LogToFile(m^{l+1})

Figure 1: The eMC4 algorithm.

In line 1 an initial empty graph is defined. In lines 6–9 the imputations take place and the importance weights are calculated. Line 10 is the first step in drawing from the mixture. Lines 12–17 perform the eMC3 algorithm based on the augmented sample selected. In the absence of relevant prior knowledge, a uniform model prior is assumed in line 15. Angelopoulos and Cussens (2001) discuss the construction of informative model priors. The choice of k (the number of iterations of the Gibbs model sampler) depends on when the Markov chain of models has converged. Monitoring the average number of edges is one method for doing so, suggested by Giudici and Green (1999). Once this average stabilizes the chain has converged.
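A hedged Python transcription of the control flow in figure 1 is sketched below. The callables draw_imputation (a draw from Pr*), weight, rcar, propose and score are placeholders under the same assumptions as the earlier sketches, so this shows only the structure of eMC4, not the authors' implementation.

```python
# Control-flow sketch of eMC4 (figure 1). Assumed helpers:
#   draw_imputation(model) ~ Pr*(U), weight(u, model) = Pr(O, U | m) / Pr*(U),
#   rcar(model) = repeated covered arc reversal, propose(model) ~ Pr(M | m),
#   score(u, model) = Pr(O, U | m). In practice one would work with log scores.
import random

def emc4(n, q, k, empty_model, draw_imputation, weight, rcar, propose, score, log_model):
    model = empty_model                       # line 1: start from the empty graph
    current_u = draw_imputation(model)        # initial imputation (assumption)
    for _ in range(k):                        # line 3
        imputations = [current_u]             # line 5: keep the current imputation U^0 = U^r
        weights = [weight(current_u, model)]
        for _ in range(n):                    # lines 6-9: propose n fresh imputations
            u = draw_imputation(model)
            imputations.append(u)
            weights.append(weight(u, model))
        r = random.choices(range(len(imputations)), weights=weights)[0]   # line 10
        current_u = imputations[r]
        for _ in range(q):                    # lines 12-17: eMC3 on the augmented sample
            model = rcar(model)
            candidate = propose(model)
            b = score(current_u, candidate) / score(current_u, model)     # uniform model prior
            if random.random() < min(1.0, b):
                model = candidate
        log_model(model)                      # line 19
    return model
```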
5 Proposal distribution Pr*(U)

We can choose the approximate predictive distribution from which samples are drawn for (6) freely, but of course some choices are better than others. Ideally, the approximate predictive distribution should be close to the real predictive distribution, because otherwise n is required to be large to obtain enough samples from the region where the mass of the real distribution is located. Existing imputation techniques can be used, as long as they can be cast in the form of a distribution from which imputations can be drawn. Naturally, it is also a requirement that Pr*(U) has support where Pr(U | O, M') has support. A uniform proposal distribution is probably unwise unless a very small fraction of data is missing. On the other hand, a distribution based on M' with parameters estimated using EM is not needed. Employing the BC algorithm for parameter estimation using M' could be interesting, since it is fast and is reported to often give reasonable results. Alternatively, a simple available cases analysis may prove to be good enough.

It is not a requirement to use the actual imputation model M' as the basis for Pr*(U). In fact Pr*(U) need not be modeled as a BN at all. However, an imputation method that does not take the independences portrayed by M' into account will have a hard time proposing qualified imputations, because the degree of freedom is simply too high (assuming that nothing is known about the missing data mechanism). Similarly, the parameter does not need to be derived from the data; however, since the missing data mechanism is assumed to be ignorable, all information we need to impute (predict) is contained (indirectly) in O, and therefore predictions should at least depend on observed values. We propose to model Pr*(U) := Pr(U | O, M', θ) as a BN with model M' and parameter θ = E[Θ | O, U^{M'}], where the expectation is with respect to Pr(Θ | M'), and (O, U^{M'}) is the augmented sample from which M' was learned. The latter makes sense because (O, U^{M'}) is the sample from which the model was learned and therefore reflects the most appropriate sample on which to base the parameter.

In order to draw multivariates we propose the following method based on Gibbs sampling, where realizations are drawn on a univariate level. Denote a case j in D by d_j = (x_j^1, ..., x_j^p) = (o_j, u_j) = (o_j, (u_j^1, ..., u_j^{r(j)})), where o_j and u_j refer to the observed and unobserved part of the case. The j'th case for U^t is sampled as follows:

u_j^{1,t} ∼ Pr(U_j^1 | u_j^{2,t-1}, ..., u_j^{r(j),t-1}, o_j, M', θ)
...
u_j^{r(j),t} ∼ Pr(U_j^{r(j)} | u_j^{1,t}, ..., u_j^{r(j)-1,t}, o_j, M', θ).

Based on Markov chain Monte Carlo theory, correlated multivariate realizations u_j ∼ Pr(U_j | o_j, M', θ) are obtained when t → ∞. Since each draw in the Gibbs sampler is univariate, and the entire Markov blanket of variable U_j^i has evidence, inference does not require any advanced techniques.

With the suggested Gibbs sampler we effectively collect all realizations, including samples in the burn-in phase. The idea is to let the importance sampler decide on the quality of the proposed imputations.

We can calculate the importance weights efficiently without explicitly knowing the actual probabilities Pr(U^i | O, M', θ). This can be seen by rewriting the importance weights from (7):

w_i = Pr(U^i, O | M') / Pr(U^i | O, M', θ)
    = [Pr(U^i, O | M') · Pr(O | M', θ)] / Pr(U^i, O | M', θ).

By normalization of these weights, Pr(O | M', θ) cancels out, and it suffices to calculate the weights as

w_i = Pr(U^i, O | M') / Pr(U^i, O | M', θ),

the ratio of the marginal likelihood over the likelihood given θ. This ratio is easily obtained because both probabilities can be computed efficiently in closed form. In summary, we can propose imputations efficiently and we can compute the "quality" of such a proposal efficiently as well.

When the structural difference between the imputation model M' and the exit model M is kept relatively small (dependent on q), we can make the following observations when θ is assigned E[Θ | O, U^{M'}]:

(1) Multivariate predictions based on the two models are correlated. Hence n need not be large in order to compensate for the predictive difference. However, we may still need many realizations U^i in order to capture the entire distribution. Correlation between multivariate predictions allows us to keep n relatively small and only move slightly in a direction towards a 'better' prediction.

(2) Because the models are correlated, and as a consequence also the predictions, the first predictions obtained by running the Markov blanket Gibbs sampler may be good. The Gibbs sampler so to speak picks up from where it left off the last time and continues imputation using the new model. This means that there is a fair chance that an initial imputation is actually selected.
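The univariate Gibbs imputation step described above can be sketched as follows: each missing variable in a case is redrawn from its full conditional given the current values of all other variables, which under the BN Pr*(U) = Pr(U | O, M', θ) only involves the CPT entries of the variable and of its children (its fully instantiated Markov blanket). The helper full_conditional is an assumption; no junction-tree or other advanced inference is implied.

```python
# Sketch: Markov-blanket Gibbs imputation of the missing entries of one case.
# `full_conditional(var, case, model, theta, states)` is an assumed helper returning the
# distribution Pr(var | all other variables in `case`) under the imputation model.
import random

def gibbs_impute_case(case, missing_vars, model, theta, states, full_conditional, sweeps=1):
    case = dict(case)                        # current completion (o_j, u_j)
    for _ in range(sweeps):                  # one sweep = one univariate pass over u_j^1..u_j^{r(j)}
        for var in missing_vars:
            probs = full_conditional(var, case, model, theta, states)
            case[var] = random.choices(states[var], weights=probs)[0]
    return case                              # a draw approaching Pr(U_j | o_j, M', theta)
```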

Figure 2: Top four visited models (with their sampling frequencies) for the complete data and for each missingness scenario. Complete data: 27%, 21%, 9%, 9% (total 60 models visited); Mechanism 1, 5–10% missing: 25%, 11%, 7%, 5% (total 137 models visited); Mechanism 1, 10–15% missing: 18%, 7%, 6%, 5% (total 168 models visited); Mechanism 1, 15–20% missing: 7%, 6%, 5%, 5% (total 188 models visited); Mechanism 2, 20–30% missing: 12%, 12%, 9%, 8% (total 152 models visited). Note that all edges are reversible.

Figure 3: Top: Generating model (left), Essential graph (right). Bottom: Missingness mechanism 1 (left), Mechanism 2 (right).

6 Experimental evaluation

In this section we perform a small experimental evaluation of eMC4 and briefly discuss the results. Because our approach is Bayesian, comparison of the results with model selection methods for incomplete data such as SEM is not very useful.

We used a data set from Edwards and Havránek (1985) about probable risk factors of coronary heart disease. The data set consists of 1841 records and 6 binary variables, A: smoking, B: strenuous mental work, C: strenuous physical work, D: blood pressure under 140, E: ratio of β to α proteins less than 3, F: family history of coronary heart disease. Because several DAGs encode the same set of assumptions about independence, we depict results as essential graphs, a canonical representation of an equivalence class (see Chickering, 1995; Kocka and Castelo, 2001).

Based on an eMC3 run using the 1841 complete records, the first model in figure 3 is a highly probable model (although edge F−B is not strongly supported). The second model in the figure is the corresponding essential graph of the DAG. The parameter corresponding to the DAG model was determined based on the aforementioned data set, and 1800 new records were sampled from the BN. Incomplete sets were generated by applying missingness mechanism one in figure 3 to the complete sample. This graph explicitly defines how the response R_i of variable i depends on observed variables. Since for all i, R_i only depends on completely observed variables, the missingness mechanism is clearly ignorable. Three incomplete sets were generated with 5–10%, 10–15% and 15–20% missing components. The probability of non-response of variable i conditional on a parent configuration of R_i was selected from the specified interval.

On the basis of the generating model and the missingness mechanism, we would expect the following results. Since the response of C only depends on B and the association C−B is strong, a big fraction of components can be deleted for C without destroying support for the edge in the data. The association D−E is also strong, so discarding components for E will probably not have a major impact on that edge either. The association E−A is influenced by B and D because the response is determined by those variables. Values for E and A may often be absent, and therefore information about the association might have changed. This may also be the case for the edges C−E and C−A.

We ran eMC4 using each incomplete data set. Parameter q (the number of eMC3 iterations) was set to 150 and n to 25 (the number of imputations). It took about 15 minutes on a 2 GHz machine before the Markov chain appeared to have converged.

In figure 2 the top four models are depicted along with their sampling frequencies. Notice the presence of the strong associations C−B and D−E everywhere, as expected. When the fraction of missing components for two associated variables increases, it has a big impact on the support of such an association. Indeed, from the figure we see that the support for associations between the variables A, C and E has changed. The sample frequencies and the number of visited models also suggest that the variance of the posterior distribution becomes bigger when more components are deleted. There is no longer a pronounced 'best' model.

The plot in figure 4 shows this more clearly. Here the cumulative frequencies are plotted against models (sorted on frequency in descending order). A steep plot indicates a small variance. For complete data the 10 best models account for 90% of the distribution, whereas for 15–20% missing components only 50% of the distribution is accounted for by the best 10 models. To investigate the similarity of the models between the three incomplete sets, we used equation (3) to calculate ∆, where ∆_M is set to 1 when there is an edge between two vertices of interest in M. This results in the expected probability of the presence of edges, as seen in figure 4. We can see, as we would expect, that the distance between points of complete data and incomplete data is dependent on the fraction of missing components. Diamonds (15–20%) have the biggest distance to triangles (complete), and plusses (5–10%) the smallest.

Figure 4: Top: Cumulative model frequencies for the complete data and the incomplete sets (Mechanism 1 with 5–10%, 10–15% and 15–20% missing; Mechanism 2 with 20–30% missing). Bottom: Expected probability of the presence of each edge (all variable pairs).
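The expected edge probabilities in the bottom panel of figure 4 are an instance of (3) with ∆_M as an edge indicator. A small sketch, assuming the sampled models are represented as sets of undirected edges (frozensets of vertex pairs):

```python
# Sketch: Monte Carlo estimate of the expected probability of an edge, eq. (3) with
# Delta_M = 1 if the edge is present in M. Models are assumed to be sets of
# frozenset({u, v}) pairs, i.e. edges considered up to reversal.
def expected_edge_probability(sampled_models, u, v):
    edge = frozenset({u, v})
    return sum(1 for m in sampled_models if edge in m) / len(sampled_models)

# Example: probability of an edge between C and B over a chain of models logged by eMC4.
# models = [...]  # list of frozenset-edge sets
# print(expected_edge_probability(models, "C", "B"))
```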

As we saw for mechanism one, discarding components for two associated variables can have a big impact on the presence of the corresponding edge in sampled models. For strongly associated variables the impact is less pronounced. We created another incomplete data set using mechanism two in figure 3. For the associated variables C, E and A, this mechanism only discards components that we think will not severely impact these associations. For the strong association E−D, discarding components of both variables should not matter. We therefore expect to be able to remove a substantial fraction of components and still obtain reasonable models. We selected the fraction of missing components in the interval 20–30%. In the last row in figure 2 we see that, although a substantial fraction of components were deleted, the models learned are quite similar to the models from the complete set.
To illustrate that it is not the fraction of missing components that determines the variance but rather the fraction of missing information (Little and Rubin, 1987), we plotted the cumulative frequency in figure 4. The variance of the posterior distribution is similar to the variance of the posterior for mechanism one with 5–10% missing components. This means that although the fraction of missing components is much higher than 5–10%, the uncertainty due to missing data has not changed substantially.

7 Conclusion

We have presented eMC4 for simulating draws from the posterior distribution of BN models given incomplete data. In contrast to existing methods for BN model learning with incomplete data, we take a Bayesian approach and approximate the posterior model distribution given the observed data. Different imputation methods may be used, and specifically we describe a method that does not require exact inference in a BN. By using importance sampling we give all multivariate realizations of the Markov chain a 'chance' of being selected, rather than just returning the last realization as in traditional Gibbs sampling. Importance sampling makes it possible to exploit qualified, yet not perfect, imputation proposals. From a computational point of view, specifying an approximate distribution is cheaper than a perfect one.

Valuable insight is gained when sampling models from the posterior; an illustration of the kind of information one can derive from posterior realizations is given in section 6. A posterior distribution is more informative than just a single model. This is especially true in the case of incomplete data, since the increased uncertainty due to missing data is reflected in the probability distribution.

References

N. Angelopoulos and J. Cussens. Markov chain Monte Carlo using tree-based priors on model structure. In J. Breese and D. Koller, editors, Proc. of the Conf. on Uncertainty in AI, pages 16–23, 2001.

D. Chickering. A transformational characterization of equivalent Bayesian networks. In P. Besnard and S. Hanks, editors, Proc. of the Conf. on Uncertainty in AI, pages 87–98, 1995.

G. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9(4):309–347, 1992.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. of the Royal Statistical Society, Series B, 34:1–38, 1977.

D. Edwards and T. Havránek. A fast procedure for model search in multidimensional contingency tables. Biometrika, 72(2):339–351, 1985.

N. Friedman. The Bayesian structural EM algorithm. In G. F. Cooper and S. Moral, editors, Proc. of the Conf. on Uncertainty in AI, pages 129–138, 1998.

S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. on Pattern Analysis and Machine Intelligence, 6(6), 1984.

P. Giudici and R. Castelo. Improving Markov chain Monte Carlo model search for data mining. Machine Learning, 50(1):127–158, 2003.

P. Giudici and P. Green. Decomposable graphical Gaussian model determination. Biometrika, 86(4):785–801, 1999.

T. Kocka and R. Castelo. Improved learning of Bayesian networks. In D. Koller and J. Breese, editors, Proc. of the Conf. on Uncertainty in AI, pages 269–276, 2001.

S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19:191–201, 1995.

R. J. Little and D. B. Rubin. Statistical Analysis with Missing Data. Wiley and Sons, 1987.

D. Madigan and J. York. Bayesian graphical models for discrete data. Intl. Statistical Review, 63:215–232, 1995.

G. J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley, 1997.

M. Ramoni and P. Sebastiani. Learning Bayesian networks from incomplete databases. In D. Geiger and P. Shenoy, editors, Proc. of the Conf. on Uncertainty in AI, pages 401–408, 1997.

M. Ramoni and P. Sebastiani. Robust learning with missing data. Machine Learning, 45(2):147–170, 2001.

D. J. Spiegelhalter and S. L. Lauritzen. Sequential updating of conditional probabilities on directed graphical structures. Networks, 20:579–605, 1990.

M. Tanner and W. Wong. The calculation of posterior distributions by data augmentation. J. of the Am. Stat. Assoc., 82(398):528–540, 1987.