Data Analysis with Bayesian Networks: A Bootstrap Approach

Nir Friedman
The Institute of Computer Science
The Hebrew University
Jerusalem 91904, ISRAEL
[email protected]

Moises Goldszmidt
SRI International
333 Ravenswood Ave.
Menlo Park, CA 94025
[email protected]

Abraham Wyner
Department of Statistics, Wharton School
University of Pennsylvania
Philadelphia, PA
[email protected]

Abstract

In recent years there has been significant progress in algorithms and methods for inducing Bayesian networks from data. However, in complex data analysis problems, we need to go beyond being satisfied with inducing networks with high scores. We need to provide confidence measures on features of these networks: Is the existence of an edge between two nodes warranted? Is the Markov blanket of a given node robust? Can we say something about the ordering of the variables? We should be able to address these questions, even when the amount of data is not enough to induce a high scoring network. In this paper we propose Efron's Bootstrap as a computationally efficient approach for answering these questions. In addition, we propose to use these confidence measures to induce better structures from the data, and to detect the presence of latent variables.

1 Introduction

In the last decade there has been a great deal of research focused on learning Bayesian networks from data [2, 12]. With few exceptions, these results have concentrated on computationally efficient induction methods and, more recently, on the issue of hidden variables and missing data. The main concern in this line of work is the induction of high scoring networks, where the score of the network reflects how well the network fits the data. A Bayesian network, however, also contains structural and qualitative information about the domain. We should be able to exploit this information in complex data analysis problems, even in situations where the available data is sparse.

Part of our motivation comes from our ongoing work on an application of Bayesian networks to molecular biology [11]. One of the central goals of molecular biology is to understand the mechanisms that control and regulate gene expression. A gene is expressed via a process that transcribes it into an RNA sequence, and this RNA sequence is in turn translated into a protein molecule. Recent technical breakthroughs in molecular biology enable biologists to measure the expression levels of thousands of genes in one experiment [6, 17, 21]. The data generated from these experiments consists of instances, each one of which has thousands of attributes. However, the largest datasets available today contain only a few hundred instances. We cannot expect to learn a detailed model from such a sparse data set. However, these data sets clearly contain valuable information. For example, we would like to induce correlation and causation relations among genes (e.g., high expression levels of one gene "cause" the suppression of another) [16]. The challenge is then to separate the measurable "signal" in this data from the "noise," that is, the genuine correlation and causation properties from spurious (random) correlations.

Analysis of such data poses many challenges. In this paper we examine how we can determine the level of confidence about various structural features of the Bayesian networks we induce from data sets. We consider an approach and methodology based on the Bootstrap method of Efron [7] for addressing this type of challenge. The Bootstrap is a computer-based method for assigning measures of accuracy to statistical estimates and performing statistical inference. We regard these measures of accuracy as establishing a level of confidence on the estimates, where confidence can be interpreted in two ways. The more important (and more elusive) notion assesses the likelihood that a given feature is actually true. This confidence will, ultimately, stand or fall by the method of estimation. The second notion is more akin to an assessment of the degree of support of a particular technique towards a given feature. This latter idea nicely separates the variation in the data from the shortcomings of the algorithm. It is this latter interpretation of confidence that was pursued in [10]. The methods introduced in this paper encompass both types of confidence, and focus on the former (more below).

Although the Bootstrap is conceptually easy to implement and apply in our context, there are open questions in the theoretical foundations.

The main difficulty (as compared to classic statistical estimation methods) is the lack of closed form expressions for the events under study (e.g., that an edge appears in a network). Still, the widespread use of the bootstrap despite such difficulties reflects the general conditions under which bootstrap distributions are consistent, even when the statistics cannot be concisely defined in a simple expression (see [7]). An example is the application of the bootstrap in evolutionary biology to measure confidence in phylogenetic trees. Felsenstein [9] has applied re-sampling tools to estimate uncertainty in edges (clades) of evolutionary trees (which specify the phylogenetic evolution of a gene over time).

Similar to phylogenies, we test re-sampling strategies for Bayesian networks, experimentally, by beginning with an explicit probability distribution and a known network model (the "golden model"). In [10], we report preliminary results that indicate that, in practice, high confidence estimates on certain structural features are indicative of the existence of these features in the generating model. In these experiments, we used edges in partially directed graphs (PDAGs) as the feature of interest. These edges describe features of equivalence classes of networks (see below).

This paper extends the results in [10] in three fundamental ways. First, it includes other important features of the induced models, such as the Markov neighborhood of a node (i.e., with what confidence can we assert that X is in Y's Markov blanket) and ordering relations between variables in the PDAGs (with what confidence can we assert that X is an ancestor of Y). Second, we focus on examining to what extent the degree of confidence returned by the bootstrap can be interpreted as establishing the likelihood of a feature being actually true in the generating model. To this end we performed an extensive set of experiments varying various parameters such as the search method in the learning algorithms, the sizes of the datasets, and the bootstrap method. Third, we also examine the bootstrap as providing information to guide the induction process. We look at the increase in performance when the learning procedure is biased with information from the bootstrap estimates.

Our experiments, in Section 4, yield the following results, which to the best of our knowledge are unknown on the application of the bootstrap for establishing the likelihood that a particular feature is in the generating model:

1. The bootstrap estimates are quite cautious. Features induced with high confidence are rarely false positives.

2. The Markov neighborhood and partial ordering amongst variables are more robust features than the existence of an edge in a PDAG.

3. The conclusions that can be established on high confidence features are reliable even in cases where the data sets are small for the model being induced.

In Section 5 we examine how to use the bootstrap estimates to induce higher scoring networks. These results are still preliminary but encouraging nevertheless. Altogether, these results provide strong evidence for the bootstrap as an appropriate method for extracting qualitative information about the domain of study from features in the induced Bayesian network.

The study of methods for establishing the quality of induced Bayesian networks has not been totally ignored in the literature. Cowell et al. [5] present a method based on the log-loss scoring function to monitor each variable in a given network. These monitors check the deviation of the predictions by these variables from the observations in the data. Heckerman et al. [14] present an approach, based on Bayesian considerations, to establish the belief that a causal edge is part of the underlying generating model. The problem of confidence estimation that we study in this paper is similar in spirit to the one investigated by Heckerman et al. Yet, the basis of the approach and the algorithmic implementation is completely different. The relation is further explored in [10], where we propose (and show results) how the Bootstrap can be used to implement a "practical" Bayesian estimate of the confidence on features of models. For completeness we summarize this relation in Section 6.

2 Learning Bayesian Networks

We briefly review learning of Bayesian networks from data. For a more complete exposition we refer the reader to [12]. Consider a finite set X = {X_1, ..., X_n} of discrete random variables where each variable X_i may take on values from a finite set. We use capital letters, such as X, Y, Z, for variable names and lowercase letters x, y, z to denote specific values taken by those variables. Sets of variables are denoted by boldface capital letters X, Y, Z, and assignments of values to the variables in these sets are denoted by boldface lowercase letters x, y, z.

A Bayesian network is an annotated directed acyclic graph that encodes a joint probability distribution of a set of random variables X. Formally, a Bayesian network for X is a pair B = (G, Θ). The first component, namely G, is a directed acyclic graph whose vertices correspond to the random variables X_1, ..., X_n, and whose edges represent direct dependencies between the variables. The graph G encodes the following set of independence statements: each variable X_i is independent of its non-descendants given its parents in G. The second component of the pair, namely Θ, represents the set of parameters that quantifies the network. It contains a parameter θ_{x_i|pa(x_i)} = P_B(x_i | pa(x_i)) for each possible value x_i of X_i and pa(x_i) of pa(X_i), where pa(X_i) denotes the set of parents of X_i in G. A Bayesian network B defines a unique joint probability distribution over X given by:

P_B(X_1, ..., X_n) = ∏_{i=1}^{n} P_B(X_i | pa(X_i)).
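As an illustration of this factorization, the short sketch below evaluates the joint probability of a complete assignment by multiplying the local conditional probabilities. The dictionary representation and the toy two-variable network are our own illustrative assumptions, not structures taken from the paper.

```python
# Sketch: evaluating the joint distribution encoded by a Bayesian network as
# the product of the local conditional probabilities P_B(x_i | pa(x_i)).
# The representation and the toy network below are illustrative assumptions.

def joint_probability(assignment, parents, cpts):
    """Return P_B(x) = prod_i P_B(x_i | pa(x_i)) for a complete assignment."""
    p = 1.0
    for var, value in assignment.items():
        pa_values = tuple(assignment[pa] for pa in parents[var])
        p *= cpts[var][(value, pa_values)]
    return p

# Toy network X -> Y over binary variables.
parents = {"X": (), "Y": ("X",)}
cpts = {
    "X": {(0, ()): 0.7, (1, ()): 0.3},
    "Y": {(0, (0,)): 0.9, (1, (0,)): 0.1,
          (0, (1,)): 0.4, (1, (1,)): 0.6},
}
print(joint_probability({"X": 1, "Y": 1}, parents, cpts))  # 0.3 * 0.6 = 0.18
```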

The problem of learning a Bayesian network structure can be stated as follows. Given a training set D = {x[1], ..., x[N]} of instances of X, find a network B that best matches D. The common approach to this problem is to introduce a scoring function (or a score) that evaluates the "fitness" of networks with respect to the training data, and then to search for the best network (according to this score). In this paper we use the score proposed in [13], which is based on Bayesian considerations, and which scores a network structure according to the posterior probability of the graph structure given the training data (up to a constant).

We note that the derivation of such a score treats the problem as a density estimation problem. The desire is to construct networks that will assign high probability to new (previously unseen) data from the same source. The structural features of the networks are induced indirectly, since presumably the "right" structure is the one that can better generalize from the training data.

Finding the structure that maximizes the score is usually an intractable problem [4]. Thus, we usually resort to heuristic search to find a high-scoring structure. Standard proposals for such search include greedy hill-climbing, stochastic hill-climbing, and simulated annealing; see [13]. In this paper, we will use a greedy hill-climbing strategy augmented with TABU lists and random restarts to escape local maxima.

In our experiments, we will not assess directly the confidence on the features of the induced network, but rather, on the features in the class of networks that are equivalent to it. Two Bayesian network structures G and G' are equivalent if they imply exactly the same set of independence statements. The characterization of Bayesian network equivalence classes is studied in [3, 18, 19, 20]. Results in these papers establish that equivalent networks agree on the connectivity between variables, but might disagree on the direction of the arcs. These results also show that each equivalence class of network structures can be represented by a partially directed graph (PDAG), where a directed edge X -> Y denotes that all members of the equivalence class contain the arc X -> Y, and an undirected edge X-Y denotes that some members of the class contain the arc X -> Y and some contain the arc Y -> X. The score in [13] is structure equivalent in the sense that equivalent networks receive the same score. In our experiments, we learn network structures and then use the procedure described in [3] to convert them to PDAGs.

3 Bootstrap for Confidence Estimation

Let G be a network structure. A feature of interest in this structure might be the existence of an edge X -> Y in the PDAG that corresponds to G. Another feature of interest might be that X precedes Y in the PDAG that corresponds to G. In general, we can treat these features as functions from network structures into the set {0, 1}. We will usually use the letters f and g to denote features.

Suppose we are given a data set of N observations D = {x[1], ..., x[N]}, each an assignment of values to X. Moreover, assume that these assignments were sampled independently from a probabilistic network B with structure G. Let G(D) be the network structure returned by our induction algorithm invoked with data D as input. For any feature f consider the following quantity

p_N(f) = Pr{ f(G(D)) = 1 | |D| = N }.

This is the probability of inducing a network with the feature f among all possible datasets of size N that can be sampled from B.¹ If our induction procedure is consistent, then we expect that as N grows larger, p_N(f) will converge to f(G). That is, we will give f confidence close to one if it holds in G, and close to 0 if it does not.

¹ More generally, we can consider the joint distribution of several features. Of course, there are nontrivial relationships between confidence estimates for different features. For example, if we consider edges in PDAGs, then clearly p_N(X -> Y) + p_N(Y -> X) + p_N(X-Y) ≤ 1.

The quantity p_N(f) is a natural measure of the power of any induction algorithm. Our goal is to estimate p_N(f), given only a single set of observations D of size N. This would mimic the usual induction situation when we want to learn a model from data. We now describe two possible algorithms: the parametric and non-parametric bootstraps.

We start with the non-parametric bootstrap. The underlying intuition is that we should be more confident in features that would still be induced when we "perturb" the data. The question is how to perturb the data and yet maintain the general statistical features of the dataset. In the non-parametric bootstrap we generate such perturbations by re-sampling from the given dataset. We then estimate confidence in a feature by examining in how many of the perturbed datasets it appears induced. The non-parametric bootstrap is performed by executing the following steps:

• For i = 1, 2, ..., m
  - Re-sample, with replacement, N instances from D. Denote by D_i the resulting dataset.
  - Apply the learning procedure on D_i to induce a network structure G_i = G(D_i).

• For each feature of interest, define

  p̂_N(f) = (1/m) Σ_{i=1}^{m} f(G_i).
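The procedure above translates directly into a short loop. In the sketch below, learn_structure stands in for the induction algorithm of Section 2 and extract_features for the mapping from an induced structure to the features of interest (PDAG edges, Markov neighbors, orderings); both are placeholders we assume, not code from the paper.

```python
# Sketch of the non-parametric bootstrap confidence estimate
#     p_hat_N(f) = (1/m) * sum_{i=1..m} f(G_i),
# where each G_i is induced from a dataset re-sampled with replacement from D.
# `learn_structure` and `extract_features` are assumed placeholders.
import random
from collections import Counter

def nonparametric_bootstrap(data, learn_structure, extract_features, m=100):
    counts = Counter()
    n = len(data)
    for _ in range(m):
        # Re-sample, with replacement, N instances from the input dataset D.
        resample = [random.choice(data) for _ in range(n)]
        graph = learn_structure(resample)
        # Record each feature induced in this replicate exactly once.
        counts.update(set(extract_features(graph)))
    # Confidence of a feature = fraction of replicates in which it was induced;
    # features never induced simply do not appear (confidence 0).
    return {feature: count / m for feature, count in counts.items()}
```

A feature would then be reported as "positive" whenever its estimate exceeds a chosen threshold t, as discussed in Section 4.2.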
Figure 1: Quality of prediction of partially directed edges, Markov neighborhoods, and orders in the alarm domain with the non-parametric bootstrap. The columns correspond to the average number of True Positive, False Positive, and False Negative classifications. Each curve corresponds to a value of the confidence threshold t. The x-axis shows the number of instances, and the y-axis shows the average number of edge features in each category. These averages are taken from bootstrap estimates, each with 100 resamples, from 10 datasets sampled from the "alarm" network.

The parametric bootstrap is a similar process. Instead of re-sampling the data with replacement from the training data, we sample new datasets from the network we induce from D:

• Induce a network B from D.

• For i = 1, 2, ..., m
  - Sample N instances from B. Denote by D_i the resulting dataset.
  - Apply the learning procedure on D_i to induce a network structure G_i = G(D_i).

• For each feature of interest, define

  p̂*_N(f) = (1/m) Σ_{i=1}^{m} f(G_i).

The parametric bootstrap is quite different than the non-parametric one in the following sense. We are using simulation to answer the question: if the true network was indeed B, could we induce it from datasets of this size? By answering this question we can determine the level of confidence in the results of our induction.

We note that the main computational cost in both variants of the bootstrap is dominated by the repeated calls to the induction procedure, and not by the sampling steps.

An important question is under what conditions the Bootstrap estimates converge. Namely, under what conditions |p_N(·) - p̂_N(·)| will approach 0 as m and N tend to ∞. The parametric bootstrap estimates of p_N(f) will converge under more general conditions than the non-parametric bootstrap, provided, of course, that the parameterization converges to the true underlying model at least asymptotically. On the other hand, if this last condition is not satisfied then no consistency claim can be made. The non-parametric bootstrap estimates require no such model consistency. The consistency of the non-parametric bootstrap, however, requires uniform convergence in distribution of the bootstrap statistic as well as a continuity condition (in the parameters). The experiments and results presented in [10] were designed to verify convergence in both types of bootstrap for the features tested (existence of an edge in the PDAGs). We are currently working on providing a thorough theoretical analysis of these conditions in the context of Bayesian network induction. The experiments in the next section test to what extent we can use the bootstrap estimates as expressing the likelihood that the features tested belong to the generating model.
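The step that distinguishes the parametric bootstrap is drawing the replicate datasets from the induced network B rather than from D. One standard way to do this is ancestral (forward) sampling along a topological order of G; the sketch below reuses the parents/cpts representation of the earlier sketch and is an illustrative assumption, not the authors' implementation.

```python
# Sketch: ancestral (forward) sampling from an induced network B = (G, Theta),
# the step used by the parametric bootstrap to generate the replicate datasets
# D_i. The parents/cpts dictionaries follow the earlier sketch and are an
# assumption, not the authors' data structures.
import random

def topological_order(parents):
    order, placed = [], set()
    while len(order) < len(parents):
        for var, pa in parents.items():
            if var not in placed and all(p in placed for p in pa):
                order.append(var)
                placed.add(var)
    return order

def sample_instance(parents, cpts):
    values = {}
    for var in topological_order(parents):
        pa_values = tuple(values[p] for p in parents[var])
        # Conditional distribution P_B(var | pa(var) = pa_values).
        dist = [(value, prob) for (value, pa), prob in cpts[var].items()
                if pa == pa_values]
        r, acc = random.random(), 0.0
        values[var] = dist[-1][0]  # fallback guards against rounding error
        for value, prob in dist:
            acc += prob
            if r < acc:
                values[var] = value
                break
    return values

def sample_dataset(parents, cpts, n):
    """Draw N independent instances from the induced network."""
    return [sample_instance(parents, cpts) for _ in range(n)]
```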
Figure 2: Quality of prediction of partially directed edges, Markov neighborhoods, and orders in the gene domain with the non-parametric bootstrap. (See caption of Figure 1 for details.)

4 Empirical Evaluation

To test the bootstrap, we use synthetic data that we generated from known models. This allows us to compare the features that the bootstrap is confident about to the true features in the generating network. Thus, for example, if our bootstrap confidence on node X belonging to the Markov blanket of node Y is high (above a determined threshold), we expect X to be in the Markov blanket of Y in the generating model. In addition, we also want to characterize how the bootstrap estimates depend on various parameters, such as the size of the dataset, the type of feature, and the bootstrap method.

4.1 Methodology

We performed simulation experiments with three networks:

• alarm [1]. This network has 37 random variables and 46 edges, only 4 of which are undirected in the PDAG. This is a standard benchmark in the learning literature.

• gene. A network induced using a gene expression dataset from [8] for 76 genes. Genes were grouped by a clustering algorithm that searches for groups of related genes (details of the induction can be found in [11]). The network has 140 edges, only 5 of which are undirected in the PDAG.

• text. A network induced from a dataset of messages from 20 newsgroups [15]. Each document is represented as an instance with a variable denoting the newsgroup, and 99 boolean variables corresponding to the most frequent words (other than stop words) and denoting whether the word appears in the message. The network has 350 edges, only 12 of which are undirected in the PDAG.

From these networks, we performed experiments with N (the number of instances in our data set) being 100, 250, 500, 1,000. For each network and sample size, we sampled 10 "input" datasets for the bootstrap procedure. We then applied both the parametric and non-parametric bootstraps with m = 100.

In all of our experiments, we used the BDe score of [13] with a uniform prior distribution with equivalent sample size 5. This prior was chosen as a relatively uninformative one. The search procedure we used is a greedy hill-climbing search with random restarts. This procedure attempts to apply the best scoring change to the current network until no further improvement can be made.
Figure 3: Quality of prediction of partially directed edges, Markov neighborhoods, and orders in the text domain with the non-parametric bootstrap. (See caption of Figure 1 for details.)

Once the hill-climbing procedure is stuck at a local maximum, it applies 20 random arc changes (addition/deletion/reversal) and restarts the search. The search is terminated after a fixed number of restarts.

We computed the bootstrap estimates for three types of features:

• Edges in PDAGs. We treat the directed and undirected edges between pairs of variables as different features.

• Ordering relations of the form "X is an ancestor of Y" in the PDAG.

• Markov neighborhoods, of the form "X is in the Markov blanket of Y" (or vice versa). Two variables are Markov neighbors if there is an arc between them, or if they are both parents of another variable.

4.2 Evaluation

There are many possible ways of interpreting the bootstrap results. Perhaps the simplest is to select a threshold t and report all features with p̂_N(f) ≥ t. This way we can label all features as either "positive", if the confidence in them is above the threshold, or "negative", if it is below the threshold. Given such a labeling of features, we can measure the number of "true positives", correct features of the generating network that are correctly labeled, "false positives", wrong features that are labeled as positive, "false negatives", correct features that are labeled as negative, and "true negatives", wrong features that are labeled correctly. We report the numbers in the first three categories in prediction of the three types of features in Figures 1, 2, and 3, for the alarm, gene, and text domains, respectively. The reported numbers are averaged over the estimates generated by the 10 non-parametric bootstrap runs.

There are several noticeable trends in these results. First, as expected, as the number of instances grows, the prediction quality improves. That is, the number of true positives increases, and the number of false positives and false negatives decreases. In addition, since as we increase the threshold we label fewer features as positive, the number of true positives and false positives decreases, while the number of false negatives increases.

Second, and more interestingly, the bootstrap samples are quite cautious. As we can see, the number of false positives is usually smaller than the number of true positives and false negatives. (Note the different scales in the graphs.) Thus, most of the prediction errors are one-sided in that they usually omit correct features and do not include incorrect ones.
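The labeling scheme above amounts to a simple confusion count over sets of features. The sketch below thresholds the bootstrap confidences and compares the resulting positive set with the features of the generating ("golden") network; the argument names are ours and purely illustrative.

```python
# Sketch: thresholding the bootstrap confidences at t and counting true
# positives, false positives, and false negatives against the generating
# ("golden") network. `confidence` maps each candidate feature to p_hat_N(f)
# and `golden_features` is the set of features holding in the generating
# model; both names are illustrative assumptions.

def evaluate(confidence, golden_features, t=0.8):
    positives = {f for f, p in confidence.items() if p >= t}
    true_pos = positives & golden_features
    false_pos = positives - golden_features
    false_neg = golden_features - positives
    return len(true_pos), len(false_pos), len(false_neg)

# Example: trace the trade-off at several thresholds, as in Figures 1 to 4.
# for t in (0.5, 0.65, 0.8, 0.95):
#     print(t, evaluate(confidence, golden_features, t))
```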


Figure 4: Comparison of the parametric to the non-parametric bootstrap. The x-axis shows the average number of false positives and the y-axis shows the average number of false negatives. Each curve shows the tradeoff between false positives and false negatives at different values of t, 0.95, 0.8, 0.75, and 0.5, for the predictions of a particular feature by one of the procedures. The columns correspond to the type of feature predicted.

Third, we notice that the "reasonable" level of confidence for thresholding depends on the domain. For example, in the alarm domain setting t = 0.8 leads to few false positives and a reasonable number of true positives. On the other hand, in the text domain, setting t = 0.8 leads to few positive predictions, while setting t = 0.65 returns few false positives. Thus, we might be inclined to use this lower threshold value in this domain. It is unclear to us at this stage what is the source of this phenomenon.

Finally, some features are easier to predict than others. For example, the prediction of the Markov neighborhood of two variables is more robust than that of PDAG edges. Similarly, ordering information can also be quite reliably predicted based on the bootstrap confidence measures. This last observation is a bit surprising. Clearly, the "long-range" orderings between variables are a function of edge direction. Thus, the fact that we can predict some of them reliably indicates that some variables are recognized as ancestors of others, although this relation is determined by different directed paths in different bootstrap runs.

The ability to predict Markov neighborhoods, on the other hand, seems in line with common sense. This type of feature is less sensitive to the exact ordering between variables. In fact, it might be argued that these features might be easily estimated by other methods. To test this, we performed a simple test (suggested by an anonymous reviewer): instead of learning networks in the bootstrap samples, we learned Bayesian networks with in-degree at most one. These networks are easy to learn and take into account only pairwise interactions between variables. Figure 5 shows the tradeoff curves for the non-parametric bootstrap using networks and trees. As we can see, the tree-based estimates are worse (both in terms of false positives and false negatives), except for the text domain. We suspect that this is partially due to the sparse nature of the source network in this domain.

As a conclusion, the bootstrap confidence measures are quite informative about the generating distribution. Moreover, some global features, such as partial ordering relations, can be determined from small data sets.

Next, we compared the parametric bootstrap to the non-parametric one. Figure 4 shows graphs of false positive vs. false negative tradeoffs between the two methods. Although the performance of the two methods is similar,

Figure 5: Comparison of the non-parametric bootstrap with networks and with trees in prediction of Markov neighborhoods. (See caption of Figure 4 for details.)

Figure 6: Comparison of the non-parametric bootstrap and the Bayesian weighted non-parametric bootstrap on the alarm domain. (See caption of Figure 4 for details.)

Domain   N      Constrained       Unconstrained
                avg.      s.d.    avg.      s.d.
alarm    100    -21.64    0.65    -21.51    0.61
         250    -18.39    0.36    -18.41    0.37
         500    -17.01    0.17    -17.02    0.16
         1000   -16.17    0.20    -16.19    0.20
gene     100    -58.81    2.03    -58.85    2.10
         250    -52.25    1.23    -52.53    1.22
         500    -48.04    0.62    -48.33    0.67
         1000   -45.21    0.52    -45.70    0.67
text     100    -60.44    1.00    -60.44    1.00
         250    -57.87    0.74    -57.77    0.76
         500    -56.06    0.69    -55.90    0.57
         1000   -54.68    0.52    -54.69    0.53

Table 1: Average and standard deviation of normalized scores (BDe score divided by N) for networks learned with and without using the ordering constraints from the non-parametric bootstrap estimation.

Domain   N      Constrained       Unconstrained
                avg.      s.d.    avg.      s.d.
alarm    100    -17.67    0.26    -17.54    0.22
         250    -16.04    0.16    -16.05    0.18
         500    -15.63    0.07    -15.62    0.06
         1000   -15.34    0.02    -15.36    0.03
gene     100    -54.73    1.18    -54.82    1.02
         250    -47.30    0.51    -47.73    0.35
         1000   -42.01    0.26    -42.55    0.52
text     100    -57.08    0.24    -57.05    0.24
         250    -55.57    0.07    -55.57    0.07
         500    -54.53    0.07    -54.54    0.09
         1000   -53.62    0.06    -53.65    0.07

Table 2: Average and standard deviation of test set log-loss of the networks learned with and without using the constraints from the non-parametric bootstrap estimation.

these graphs suggest that the non-parametric bootstrap has better performance. The performance curves for the non-parametric bootstrap are usually closer to the origin, implying a smaller number of errors.

5 Bootstrap for Network Induction

A common idea in learning is the use of prior knowledge. In particular, when learning structure, we can use prior knowledge on the structures we are searching to reduce the size of the search space, and thus improve both the speed of induction and, more importantly, the quality of the learned network. Commonly used prior information includes ordering constraints on the random variables, or the existence of certain arcs. In this section we explore the use of the Bootstrap for determining this information. The proposal consists of re-sampling from the dataset to induce bootstrap samples and then gathering estimates of the confidence of these features. Then, we can use structural properties with high confidence to constrain the search process.

As a preliminary exploration of this idea, we performed the following experiment. We generated non-parametric bootstrap samples, and collected from them two types of constraints. First, if the estimate that X precedes Y has confidence higher than 0.8, then we require that the learned network respect this order. That is, we disallow learning networks where Y is an ancestor of X. In addition, if the confidence that X is in the Markov neighborhood of Y is smaller than 0.05, then we disallow Y as a parent of X. The intuition is that if X and Y are closely related, then we should be able to detect that in our bootstrap runs. If only a tiny fraction of the bootstrap networks have these two variables connected to each other, then they are probably not related.

After collecting these constraints, we invoke the search procedure to learn a network from the original data set, but we restrict it to consider only structures that satisfy the given constraints. We repeated this experiment 10 times for different initial data sets. In Table 1 we report the score of the networks induced by this procedure. In Table 2 we report the error from the generating distribution (measured in terms of log-likelihood assigned to test data) for the same networks.

These results show that for small training sets we can find slightly better scoring networks using the constraints generated by the bootstrap. Note that given the robustness of the estimates found in the previous section, these improvements can be trusted, even though in some cases the standard deviations of the scores and test set log-loss for the 10 experiments may seem relatively large. We should remember, however, that most of this variance is due to the small sample size.

6 Discussion: Bayesian estimation

The Bayesian perspective on confidence estimation is quite different than the "frequentist" measures we discussed above. A Bayesian would compute (or estimate) the posterior probability of each feature. Via reasoning by cases this is simply:

Pr(f | D) = Σ_G Pr(G | D) f(G),   (1)

where f denotes the feature being investigated and the term Pr(G | D) is the posterior of a structure given the training data, which, for certain classes of priors, can be computed up to a multiplicative constant (where the constant is the same for all graphs G) [13].

A serious obstacle in computing this posterior is that it requires summing over a large (potentially exponential) number of equivalence classes. Heckerman et al. [14] suggest approximating (1) by finding a set 𝒢 of high scoring structures, and then estimating the relative mass of the structures in 𝒢 that contain f:

Pr(f | D) ≈ Σ_{G ∈ 𝒢} Pr(G | D) f(G) / Σ_{G ∈ 𝒢} Pr(G | D).

This raises the question of how we construct 𝒢. One simple approach for finding such a set is to record all the structures examined during the search, and return the high scoring ones. The set of structures found in this manner is quite sensitive to the search procedure we use. For example, if we use greedy hill-climbing, then the structures we collect will all be quite similar. Such a restricted set of candidates also shows up when we consider multiple restarts of greedy hill-climbing and beam-search. This is a serious problem since we run the risk of getting estimates of confidence that are based on a biased sample of structures. A way of avoiding this problem is to run an extensive MCMC simulation of the posterior of G. Then we might expect to get a more representative group of structures. This procedure, however, can be quite expensive in terms of computation time.

The bootstrap approach suggests a relatively cheap alternative. We can use the structures G(D_1), ..., G(D_m) from the non-parametric bootstrap as our representative set of structures in the Bayesian approximation. In this proposal we use the re-sampling in the Bootstrap process as a way of widening the set of candidates we examine. The confidence estimate is now quite similar to the non-parametric bootstrap, except that structures in the bootstrap samples are weighted in proportion to their posterior probability.

Figure 6 shows a comparison of the predictions of this approach with the non-parametric bootstrap on the alarm domain. The comparison on the other two domains is quite similar, so we omit it here. In general, the two approaches agree on high confidence features. This is not surprising, since the high confidence features appear in most of the bootstrap networks, and thus the Bayesian reweighting would still assign to them most of the mass. However, when we examine lower thresholds we can see some differences between the two approaches. This is particularly visible in the estimates of ordering relations.
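A sketch of this reweighting is given below: each bootstrap structure G_i is weighted in proportion to its posterior score on the original data (computed from log scores via a normalized exponential for numerical stability), and the confidence of a feature is the total weight of the structures that contain it. The helper functions log_score (standing in for the BDe score of a structure on D) and extract_features are assumptions, not the authors' code.

```python
# Sketch: Bayesian reweighting of the bootstrap structures. Each structure G_i
# from the non-parametric bootstrap is weighted in proportion to its posterior
# score on the original data D. `log_score` (e.g., the log BDe score) and
# `extract_features` are assumed placeholders.
import math
from collections import defaultdict

def weighted_confidence(structures, data, log_score, extract_features):
    log_scores = [log_score(g, data) for g in structures]
    shift = max(log_scores)
    weights = [math.exp(s - shift) for s in log_scores]  # relative posterior mass
    total = sum(weights)
    conf = defaultdict(float)
    for g, w in zip(structures, weights):
        for feature in set(extract_features(g)):
            conf[feature] += w / total
    return dict(conf)
```

With uniform weights this reduces to the plain non-parametric bootstrap estimate, which is why the two approaches agree on high confidence features.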

We are currently exploring how to use the bootstrap in a more focused way to get a good approximation of the Bayesian posterior over features.

7 Conclusions

This paper proposes a methodology for computing confidence on features of an induced model based on Efron's bootstrap method. Whereas in a previous paper [10] we studied the bootstrap as assessing the degree of support of a particular technique towards a given feature, in this paper we examine the more important notion of confidence that assesses the likelihood that a given feature appears in the generating model. Our experiments lead to several conclusions. First, the bootstrap estimates are cautious and trustworthy; high confidence estimations seldom contain false positives. Second, features such as establishing a Markov neighborhood and partial ordering relations amongst variables are more robust than features such as the existence of an edge in a PDAG. Third, the conclusions that can be established on high confidence features are reliable even in cases where the data sets are small for the model being induced. These results extend, in our opinion, the role that adaptive Bayesian networks are currently playing in data analysis tasks, enabling users to exploit the amount of qualitative information that the network structure provides about the domain.

We also provide preliminary results on the use of the bootstrap for the induction of networks, and discuss its use in implementing a practical version of Bayesian estimates.

Finally, these results indicate that the bootstrap can be a reliable method for detecting latent causes. The problem of signaling the existence of latent causes, and uncovering the set of variables they should directly influence, is of great interest. Given that we are computing estimates about the Markov blanket of each variable, it would seem that a clique of variables that are definitely in each other's Markov blanket, but whose edge relationships are unclear, would be indicative of the existence of a hidden variable. Given the reliability of these estimates we are optimistic about the results, and are currently experimenting with this approach.

Acknowledgments

Some of this work was done while Nir Friedman and Abraham Wyner were at the University of California at Berkeley. We thank the NOW group at UC Berkeley and the MOSIX group at the Hebrew University for the use of their computational resources. Moises Goldszmidt was supported in part by DARPA's High Performance Knowledge Bases program under SPAWAR contract N66001-97-C-8548.

References

[1] I. Beinlich, G. Suermondt, R. Chavez, and G. Cooper. The ALARM monitoring system. In Proc. 2nd Euro. Conf. on AI and Medicine. 1989.
[2] W. L. Buntine. A guide to the literature on learning probabilistic networks from data. IEEE Trans. Knowledge and Data Engineering, 8:195-210, 1996.
[3] D. M. Chickering. A transformational characterization of equivalent Bayesian network structures. In UAI '95, pp. 87-98. 1995.
[4] D. M. Chickering. Learning Bayesian networks is NP-complete. In AI&STAT V, 1996.
[5] R. G. Cowell, A. P. Dawid, and D. J. Spiegelhalter. Sequential model criticism in probabilistic expert systems. IEEE Trans. Pat. Ana. Mach. Int., 15:209-219, 1993.
[6] J. DeRisi, V. Iyer, and P. Brown. Exploring the metabolic genetic control of gene expression on a genomic scale. Science, 278:680-686, 1997.
[7] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap, 1993.
[8] Spellman et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. of the Cell, 9:3273-3297, 1998.
[9] J. Felsenstein. Confidence limits for phylogenies: An approach using the bootstrap. Evolution, 39, 1985.
[10] N. Friedman, M. Goldszmidt, and A. Wyner. On the application of the bootstrap for computing confidence measures on features of induced Bayesian networks. In AI&STAT VII. 1999.
[11] N. Friedman, I. Nachman, and D. Peer. On using Bayesian networks to analyze whole-genome expression data. TR CS99-6, Inst. of Computer Science, Hebrew University, 1999. In progress. See www.cs.huji.ac.il/labs/pmai2.
[12] D. Heckerman. A tutorial on learning with Bayesian networks. In Learning in Graphical Models. 1998.
[13] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197-243, 1995.
[14] D. Heckerman, C. Meek, and G. Cooper. A Bayesian approach to causal discovery. MSR-TR-97-05, Microsoft Research.
[15] T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In ML '97. 1997.
[16] E. Lander. Array of hope. Nature Genetics, 21:3-4, 1999.
[17] D. J. Lockhart, H. Dong, M. C. Byrne, M. T. Follettie, M. V. Gallo, M. S. Chee, M. Mittmann, C. Wang, M. Kobayashi, H. Horton, and E. L. Brown. DNA expression monitoring by hybridization of high density oligonucleotide arrays. Nature Biotechnology, 14:1675-1680, 1996.
[18] C. Meek. Causal inference and causal explanation with background knowledge. In UAI '95, pp. 403-410. 1995.
[19] J. Pearl. Probabilistic Reasoning in Intelligent Systems, 1988.
[20] J. Pearl and T. S. Verma. A theory of inferred causation. In KR '91, pp. 441-452. 1991.
[21] X. Wen, S. Fuhrmann, G. S. Michaels, D. B. Carr, S. Smith, J. L. Barker, and R. Somogyi. Large-scale temporal gene expression mapping of central nervous system development. Proc. Nat. Acad. Sci., 95:334-339, 1998.