Data Analysis with Bayesian Networks: a Bootstrap Approach

Nir Friedman
The Institute of Computer Science, The Hebrew University, Jerusalem 91904, ISRAEL
[email protected]

Moises Goldszmidt
SRI International, 333 Ravenswood Ave., Menlo Park, CA 94025
[email protected]

Abraham Wyner
Department of Statistics, Wharton School, University of Pennsylvania, Philadelphia, PA
[email protected]. upenn.edu

Abstract

In recent years there has been significant progress in algorithms and methods for inducing Bayesian networks from data. However, in complex data analysis problems, we need to go beyond being satisfied with inducing networks with high scores. We need to provide confidence measures on features of these networks: Is the existence of an edge between two nodes warranted? Is the Markov blanket of a given node robust? Can we say something about the ordering of the variables? We should be able to address these questions, even when the amount of data is not enough to induce a high scoring network. In this paper we propose Efron's Bootstrap as a computationally efficient approach for answering these questions. In addition, we propose to use these confidence measures to induce better structures from the data, and to detect the presence of latent variables.

1 Introduction

In the last decade there has been a great deal of research focused on learning Bayesian networks from data [2, 12]. With few exceptions, these results have concentrated on computationally efficient induction methods and, more recently, on the issue of hidden variables and missing data. The main concern in this line of work is the induction of high scoring networks, where the score of the network reflects how well the network fits the data. A Bayesian network, however, also contains structural and qualitative information about the domain. We should be able to exploit this information in complex data analysis problems, even in situations where the available data is sparse.

Part of our motivation comes from our ongoing work on an application of Bayesian networks to molecular biology [11]. One of the central goals of molecular biology is to understand the mechanisms that control and regulate gene expression. A gene is expressed via a process that transcribes it into an RNA sequence, and this RNA sequence is in turn translated into a protein molecule. Recent technical breakthroughs in molecular biology enable biologists to measure the expression levels of thousands of genes in one experiment [6, 17, 21]. The data generated from these experiments consists of instances, each one of which has thousands of attributes. However, the largest datasets available today contain only a few hundred instances. We cannot expect to learn a detailed model from such a sparse data set. Nevertheless, these data sets clearly contain valuable information. For example, we would like to induce correlation and causation relations among genes (e.g., high expression levels of one gene "cause" the suppression of another) [16]. The challenge, then, is to separate the measurable "signal" in this data from the "noise," that is, the genuine correlation and causation properties from spurious (random) correlations.

Analysis of such data poses many challenges. In this paper we examine how we can determine the level of confidence about various structural features of the Bayesian networks we induce from data sets. We consider an approach and methodology based on the Bootstrap method of Efron [7] for addressing this type of challenge. The Bootstrap is a computer-based method for assigning measures of accuracy to statistical estimates and performing statistical inference. We regard these measures of accuracy as establishing a level of confidence on the estimates, where confidence can be interpreted in two ways. The more important (and more elusive) notion assesses the likelihood that a given feature is actually true. This confidence will, ultimately, stand or fall by the method of estimation. The second notion is more akin to an assessment of the degree of support of a particular technique towards a given feature. This latter idea nicely separates the variation in the data from the shortcomings of the algorithm. It is this latter interpretation of confidence that was pursued in [10]. The methods introduced in this paper encompass both types of confidence, and focus on the former (more below).

Although the Bootstrap is conceptually easy to implement and apply in our context, there are open questions in the theoretical foundations. The main difficulty (as compared to classic statistical estimation methods) is the lack of closed form expressions for the events under study (e.g., that an edge appears in a network). Still, the widespread use of the bootstrap despite such difficulties reflects the general conditions under which bootstrap distributions are consistent, even when the statistics cannot be concisely defined in a simple expression (see [7]). An example is the application of the bootstrap in evolutionary biology to measure confidence in inferences from phylogenetic trees. Felsenstein [9] has applied re-sampling tools to estimate uncertainty in edges (clades) of evolutionary trees (which specify the phylogenetic evolution of a gene over time).

Similar to phylogenies, we test re-sampling strategies for Bayesian networks experimentally, by beginning with an explicit probability distribution and a known network model (the "golden model"). In [10], we report preliminary results that indicate that, in practice, high confidence estimates on certain structural features are indicative of the existence of these features in the generating model. In these experiments, we used edges in partially directed graphs (PDAGs) as the feature of interest. These edges describe features of equivalence classes of networks (see below).

This paper extends the results in [10] in three fundamental ways. First, it includes other important features of the induced models, such as the Markov neighborhood of a node (i.e., with what confidence can we assert that X is in Y's Markov blanket), and ordering relations between variables in the PDAGs (with what confidence can we assert that X is an ancestor of Y). Second, we focus on examining to what extent the degree of confidence returned by the bootstrap can be interpreted as establishing the likelihood of a feature being actually true in the generating model. To this end we performed an extensive set of experiments varying parameters such as the search method in the learning algorithms, the sizes of the datasets, and the bootstrap method. Third, we also examine the bootstrap as providing information to guide the induction process. We look at the increase in performance when the learning procedure is biased with information from the bootstrap estimates.

Our experiments, in Section 4, yield the following results, which to the best of our knowledge are unknown on the application of the bootstrap for establishing the likelihood that a particular feature is in the generating model:

1. The bootstrap estimates are quite cautious. Features induced with high confidence are rarely false positives.

2. [...]

[...] these results provide strong evidence for the bootstrap as an appropriate method for extracting qualitative information about the domain of study from features in the induced Bayesian network.

The study of methods for establishing the quality of induced Bayesian networks has not been totally ignored in the literature. Cowell et al. [5] present a method based on the log-loss scoring function to monitor each variable in a given network. These monitors check the deviation of the predictions by these variables from the observations in the data. Heckerman et al. [14] present an approach, based on Bayesian considerations, to establish the belief that a causal edge is part of the underlying generating model. The problem of confidence estimation that we study in this paper is similar in spirit to the one investigated by Heckerman et al. Yet, the basis of the approach and the algorithmic implementation are completely different. The relation is further explored in [10], where we propose (and show results on) how the Bootstrap can be used to implement a "practical" Bayesian estimate of the confidence on features of models. For completeness we summarize this relation in Section 6.

2 Learning Bayesian Networks

We briefly review learning of Bayesian networks from data. For a more complete exposition we refer the reader to [12]. Consider a finite set $\mathbf{X} = \{X_1, \ldots, X_n\}$ of discrete random variables where each variable $X_i$ may take on values from a finite set. We use capital letters, such as $X, Y, Z$, for variable names and lowercase letters $x, y, z$ to denote specific values taken by those variables. Sets of variables are denoted by boldface capital letters $\mathbf{X}, \mathbf{Y}, \mathbf{Z}$, and assignments of values to the variables in these sets are denoted by boldface lowercase letters $\mathbf{x}, \mathbf{y}, \mathbf{z}$.

A Bayesian network is an annotated directed acyclic graph that encodes a joint probability distribution of a set of random variables $\mathbf{X}$. Formally, a Bayesian network for $\mathbf{X}$ is a pair $B = (G, \Theta)$. The first component, namely $G$, is a directed acyclic graph whose vertices correspond to the random variables $X_1, \ldots, X_n$, and whose edges represent direct dependencies between the variables. The graph $G$ encodes the following set of independence statements: each variable $X_i$ is independent of its non-descendants given its parents in $G$. The second component of the pair, namely $\Theta$, represents the set of parameters that quantifies the network. It contains a parameter [...]
