Using Bayesian Networks to Analyze Expression Data
Total Page:16
File Type:pdf, Size:1020Kb
JOURNAL OFCOMPUTATIONAL BIOLOGY Volume7, Numbers 3/4,2000 MaryAnn Liebert,Inc. Pp. 601–620 Using Bayesian Networks to Analyze Expression Data NIR FRIEDMAN, 1 MICHAL LINIAL, 2 IFTACH NACHMAN, 3 andDANA PE’ER 1 ABSTRACT DNAhybridization arrayssimultaneously measurethe expression level forthousands of genes.These measurementsprovide a “snapshot”of transcription levels within the cell. Ama- jorchallenge in computationalbiology is touncover ,fromsuch measurements,gene/ protein interactions andkey biological features ofcellular systems. In this paper,wepropose a new frameworkfor discovering interactions betweengenes based onmultiple expression mea- surements. This frameworkbuilds onthe use of Bayesian networks forrepresenting statistical dependencies. ABayesiannetwork is agraph-basedmodel of joint multivariateprobability distributions thatcaptures properties ofconditional independence betweenvariables. Such models areattractive for their ability todescribe complexstochastic processes andbecause theyprovide a clear methodologyfor learning from(noisy) observations.We start byshowing howBayesian networks can describe interactions betweengenes. W ethen describe amethod forrecovering gene interactions frommicroarray data using tools forlearning Bayesiannet- works.Finally, we demonstratethis methodon the S.cerevisiae cell-cycle measurementsof Spellman et al. (1998). Key words: geneexpression, microarrays, Bayesian methods. 1.INTRODUCTION centralgoal of molecularbiology isto understand the regulation of protein synthesis and its Areactionsto external and internal signals. All the cells in an organism carry the same genomic data, yettheir protein makeup can be drastically different both temporally and spatially, due to regulation. Proteinsynthesis is regulated by many mechanisms at its different stages. These include mechanisms for controllingtranscription initiation, RNA splicing,mRNA transport,translation initiation, post-translational modi cations,and degradation of mRNA/ protein.One of the main junctions at which regulation occurs ismRNA transcription.A majorrole in this machinery is played by proteins themselves that bind to regulatoryregions along the DNA, greatlyaffecting the transcription of the genes they regulate. Inrecent years, technical breakthroughs in spotting hybridization probes and advances in genome se- quencingefforts leadto developmentof DNA microarrays ,whichconsist of manyspecies of probes, either oligonucleotidesor cDNA, thatare immobilized in a prede nedorganization to a solidphase. By using 1Schoolof Computer Science and Engineering, Hebrew University, Jerusalem, 91904, Israel. 2Instituteof Life Sciences, Hebrew University, Jerusalem, 91904, Israel. 3Centerfor Neural Computation and School of ComputerScience and Engineering, Hebrew University, Jerusalem, 91904,Israel. 601 602 FRIEDMAN ETAL. DNA microarrays,researchers are now able to measure the abundance of thousands of mRNA targets simultaneously(DeRisi et al.,1997;Lockhart et al., 1996; Wen et al.,1998).Unlike classical experiments, wherethe expression levels of only a few geneswere reported,DNA microarrayexperiments can measure all thegenes of an organism, providing a “genomic”viewpoint on gene expression. As aconsequence, thistechnology facilitates new experimental approaches for understandinggene expression and regulation (Iyer et al.,1999;Spellman et al., 1998). Earlymicroarray experiments examined few samplesand mainly focused on differential display across tissuesor conditionsof interest.The design of recentexperiments focuses on performing a largernumber ofmicroarray assays ranging in size from adozento a few hundredsof samples. In the near future, data setscontaining thousands of samples will become available. Such experiments collect enormous amounts ofdata,which clearly re ect many aspects of the underlying biological processes. An important challenge isto develop methodologies that are both statistically sound and computationally tractable for analyzing suchdata sets and inferring biological interactions from them. Mostof the analysis tools currently used are based on clustering algorithms.These algorithms attempt tolocate groups of genes that have similar expression patterns over a setof experiments (Alon et al., 1999;Ben-Dor et al.,1999;Eisen et al.,1999;Michaels et al.,1998;Spellman et al.,1998).Such analysis hasproven to be useful in discovering genes that are co-regulated and/ orhave similar function. A more ambitiousgoal for analysisis toreveal the structure of the transcriptional regulation process (Akutsu, 1998; Chen et al.,1999;Somogyi et al.,1996;W eaver et al.,1999).This is clearlya hardproblem. The current datais extremely noisy. Moreover, mRNA expressiondata alone only gives a partialpicture that does not reect key events, such as translation and protein (in)activation. Finally, the amount of samples, even in thelargest experiments in the foreseeable future, does not provide enough information to constructa fully detailedmodel with high statistical signi cance. Inthis paper, we introducea newapproach for analyzinggene expression patterns that uncovers prop- ertiesof the transcriptional program by examining statistical properties of dependence and conditional independence inthe data. W ebaseour approach on the well-studied statistical tool of Bayesiannetworks (Pearl,1988). These networks represent the dependence structure between multiple interacting quantities (e.g.,expression levels of different genes). Our approach,probabilistic in nature, is capable of handling noiseand estimating the con dencein the different features of the network. W eare,therefore, able to focuson interactions whose signal in the data is strong. Bayesiannetworks are apromisingtool for analyzinggene expression patterns. First, they are particularly usefulfor describingprocesses composed of locally interactingcomponents; that is, the value of each component directly dependson the values of arelativelysmall number of components. Second, statistical foundationsfor learningBayesian networks from observations,and computational algorithms to do so, arewell understood and have been used successfully in many applications. Finally, Bayesian networks providemodels of causal in uence: Although Bayesian networks are mathematically de nedstrictly in termsof probabilities and conditional independence statements, a connectioncan be made between this characterizationand the notion of directcausal in uence .(Heckerman et al.,1999;Pearl and V erma,1991; Spirtes et al.,1993).Although this connection depends on severalassumptions that do not necessarily hold ingene expression data, the conclusions of Bayesian network analysis might be indicativeof somecausal connectionsin the data. Theremainder of thispaper is organized as follows. In Section 2, we reviewkey concepts of Bayesian networks,learning them from observationsand using them to infer causality. In Section 3, we describe howBayesian networks can be appliedto model interactions among genes and discuss the technical issues thatare posed by this type of data. In Section 4, we applyour approach to the gene-expression data of Spellman et al. (1998),analyzing the statistical signi canceof the results and their biological plausibility. Finally,in Section 5, we concludewith a discussionof related approaches and future work. 2.BA YESIANNETWORKS 2.1.Representing distributions with Bayesiannetworks Consider a nite set 5 X1; : : : ; Xn ofrandom variables where each variable Xi may take on a X f g value xi from thedomain V al .Xi /.Inthis paper, we usecapital letters, such as X; Y; Z,for variablenames andlowercase letters, such as x; y; z,todenote speci cvaluestaken by those variables. Sets of variables aredenoted by boldface capital letters ; ; ,andassignments of values to the variables in these sets USING BAYESIAN NETWORKS 603 FIG. 1. Anexample of a simpleBayesian network structure. This network structure implies several con- ditionalindependence statements: I .A; E/, I .B; D A; E/, I .C; A; D; E B/, I .D; B; C; E A/, and j j j I .E; A; D/.Thenetwork structure also implies that the joint distribution has the product form: P.A;B; C;D;E/ 5 P .A/P .B A; E/P .C B/P .D A/P .E/. j j j aredenoted by boldface lowercase letters x; y; z. We denote I . ; / to mean isindependent of j conditionedon : P . ; / 5 P . /. j j A Bayesiannetwork isa representationof a jointprobability distribution. This representation consists oftwocomponents. The rst component, G, is a directedacyclic graph (DAG) whosevertices correspond tothe random variables X1; : : : ; Xn.Thesecond component, µ describesa conditionaldistribution for eachvariable, given its parents in G.Together,these two components specify a uniquedistribution on X1; : : : ; Xn. The graph G representsconditional independence assumptions that allow the joint distribution to be decomposed,economizing on the number of parameters. The graph G encodes the MarkovAssumption : (*) Eachvariable Xi isindependent of its nondescendants, given its parents in G. Byapplying the chain rule of probabilities and properties of conditional independencies, any joint distributionthat satis es(*) canbe decomposed into the productform n G P .X1; : : : ; Xn/ 5 P .Xi Pa .Xi//; (1) j Yi5 1 G where Pa .Xi / isthe set of parents of Xi in G.Figure1 showsan example of a graph G andlists the Markovindependencies it encodes and the product form theyimply. A graph G speci esa productform asin (1). T ofullyspecify a jointdistribution, we alsoneed to specifyeach of theconditional probabilities in theproduct form. Thesecond