Bayesian Networks Without Tears
Eugene Charniak

I give an introduction to Bayesian networks for AI researchers with a limited grounding in probability theory. Over the last few years, this method of reasoning using probabilities has become popular within the AI probability and uncertainty community. Indeed, it is probably fair to say that Bayesian networks are to a large segment of the AI-uncertainty community what resolution theorem proving is to the AI-logic community. Nevertheless, despite what seems to be their obvious importance, the ideas and techniques have not spread much beyond the research community responsible for them. This is probably because the ideas and techniques are not that easy to understand. I hope to rectify this situation by making Bayesian networks more accessible to the probabilistically unsophisticated.

Over the last few years, a method of reasoning using probabilities, variously called belief networks, Bayesian networks, knowledge maps, probabilistic causal networks, and so on, has become popular within the AI probability and uncertainty community. This method is best summarized in Judea Pearl's (1988) book, but the ideas are a product of many hands. I adopted Pearl's name, Bayesian networks, on the grounds that the name is completely neutral about the status of the networks (do they really represent beliefs, causality, or what?). Bayesian networks have been applied to problems in medical diagnosis (Heckerman 1990; Spiegelhalter, Franklin, and Bull 1989), map learning (Dean 1990), language understanding (Charniak and Goldman 1989a, 1989b; Goldman 1990), vision (Levitt, Mullin, and Binford 1989), heuristic search (Hansson and Mayer 1989), and so on. It is probably fair to say that Bayesian networks are to a large segment of the AI-uncertainty community what resolution theorem proving is to the AI-logic community. Nevertheless, despite what seems to be their obvious importance, the ideas and techniques have not spread much beyond the research community responsible for them. This is probably because the ideas and techniques are not that easy to understand. I hope to rectify this situation by making Bayesian networks more accessible to the probabilistically unsophisticated. That is, this article tries to make the basic ideas and intuitions accessible to someone with a limited grounding in probability theory (equivalent to what is presented in Charniak and McDermott [1985]).

An Example Bayesian Network

The best way to understand Bayesian networks is to imagine trying to model a situation in which causality plays a role but where our understanding of what is actually going on is incomplete, so we need to describe things probabilistically. Suppose when I go home at night, I want to know if my family is home before I try the doors. (Perhaps the most convenient door to enter is double locked when nobody is home.) Now, often when my wife leaves the house, she turns on an outdoor light. However, she sometimes turns on this light if she is expecting a guest. Also, we have a dog.
When nobody is home, the dog is put in the back yard. The same is true if the dog has bowel troubles. Finally, if the dog is in the backyard, I will probably hear her barking (or what I think is her barking), but sometimes I can be confused by other dogs barking. This example, partially inspired by Pearl's (1988) earthquake example, is illustrated in figure 1. There we find a graph not unlike many we see in AI. We might want to use such diagrams to predict what will happen (if my family goes out, the dog goes out) or to infer causes from observed effects (if the light is on and the dog is out, then my family is probably out).

Figure 1. A Causal Graph. The nodes denote states of affairs, and the arcs can be interpreted as causal connections. (The nodes are family-out, bowel-problem, light-on, dog-out, and hear-bark; family-out has arcs to light-on and dog-out, bowel-problem has an arc to dog-out, and dog-out has an arc to hear-bark.)

The important thing to note about this example is that the causal connections are not absolute. Often, my family will have left without putting out the dog or turning on a light. Sometimes we can use these diagrams anyway, but in such cases, it is hard to know what to infer when not all the evidence points the same way. Should I assume the family is out if the light is on, but I do not hear the dog? What if I hear the dog, but the light is out? Naturally, if we knew the relevant probabilities, such as P(family-out | light-on, ¬hear-bark), then we would be all set. However, typically, such numbers are not available for all possible combinations of circumstances. Bayesian networks allow us to calculate them from a small set of probabilities, relating only neighboring nodes.

Bayesian networks are directed acyclic graphs (DAGs) (like figure 1), where the nodes are random variables, and certain independence assumptions hold, the nature of which I discuss later. (I assume without loss of generality that the DAG is connected.) Often, as in figure 1, the random variables can be thought of as states of affairs, and the variables have two possible values, true and false. However, this need not be the case. We could, say, have a node denoting the intensity of an earthquake with values no-quake, trembler, rattler, major, and catastrophe. Indeed, the variable values do not even need to be discrete. For example, the value of the variable earthquake might be a Richter scale number. (However, the algorithms I discuss only work for discrete values, so I stick to this case.)

In what follows, I use a sans serif font for the names of random variables, as in earthquake. I use the name of the variable in italics to denote the proposition that the variable takes on some particular value (but where we are not concerned with which one), for example, earthquake. For the special case of Boolean variables (with values true and false), I use the variable name in a sans serif font to denote the proposition that the variable has the value true (for example, family-out). I also show the arrows pointing downward so that "above" and "below" can be understood to indicate arrow direction.

The arcs in a Bayesian network specify the independence assumptions that must hold between the random variables. These independence assumptions determine what probability information is required to specify the probability distribution among the random variables in the network. The reader should note that in informally talking about the DAG, I said that the arcs denote causality, whereas in the Bayesian network, I am saying that they specify things about the probabilities. The next section resolves this conflict.

To specify the probability distribution of a Bayesian network, one must give the prior probabilities of all root nodes (nodes with no predecessors) and the conditional probabilities of all nonroot nodes given all possible combinations of their direct predecessors. Thus, figure 2 shows a fully specified Bayesian network corresponding to figure 1. For example, it states that if family members leave our house, they will turn on the outside light 60 percent of the time, but the light will be turned on even when they do not leave 5 percent of the time (say, because someone is expected).

Figure 2. A Bayesian Network for the family-out Problem. I added the prior probabilities for root nodes and the posterior probabilities for nonroots given all possible values of their parents. The graph is that of figure 1, with the nodes abbreviated family-out (fo), bowel-problem (bp), light-on (lo), dog-out (do), and hear-bark (hb), and the probabilities

    P(fo) = .15              P(bp) = .01
    P(lo | fo) = .6          P(lo | ¬fo) = .05
    P(do | fo bp) = .99      P(do | fo ¬bp) = .90
    P(do | ¬fo bp) = .97     P(do | ¬fo ¬bp) = .3
    P(hb | do) = .7          P(hb | ¬do) = .01

Bayesian networks allow one to calculate the conditional probabilities of the nodes in the network given that the values of some of the nodes have been observed. To take the earlier example, if I observe that the light is on (light-on = true) but do not hear my dog (hear-bark = false), I can calculate the conditional probability of family-out given these pieces of evidence. (For this case, it is .5.) I talk of this calculation as evaluating the Bayesian network (given the evidence). In more realistic cases, the networks would consist of hundreds or thousands of nodes, and they might be evaluated many times as new information comes in. As evidence comes in, it is tempting to think of the probabilities of the nodes changing, but, of course, what is changing is the conditional probability of the nodes given the changing evidence. Speaking loosely of the belief in a node changing is harmless provided that one keeps in mind that here, belief is simply the conditional probability given the evidence.
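To make the arithmetic concrete, here is a small Python sketch, entirely my own illustration rather than code from the article, that writes the figure 2 network down as plain dictionaries and evaluates it by brute force. It builds the probability of each complete assignment as the product of every node's probability given its parents (the standard Bayesian network factorization) and then sums over all assignments consistent with the evidence; the names NETWORK, joint_prob, and query are invented for this sketch.

    # Brute-force check of the family-out example (figures 1 and 2).
    from itertools import product

    # Each node maps to (list of parents, probability that the node is true
    # given each tuple of parent values).  Root nodes have no parents.
    NETWORK = {
        "family-out":    ([], {(): 0.15}),
        "bowel-problem": ([], {(): 0.01}),
        "light-on":      (["family-out"], {(True,): 0.60, (False,): 0.05}),
        "dog-out":       (["family-out", "bowel-problem"],
                          {(True, True): 0.99, (True, False): 0.90,
                           (False, True): 0.97, (False, False): 0.30}),
        "hear-bark":     (["dog-out"], {(True,): 0.70, (False,): 0.01}),
    }

    def joint_prob(assignment):
        """Probability of one complete true/false assignment to every node."""
        p = 1.0
        for node, (parents, cpt) in NETWORK.items():
            p_true = cpt[tuple(assignment[parent] for parent in parents)]
            p *= p_true if assignment[node] else 1.0 - p_true
        return p

    def query(target, evidence):
        """P(target = true | evidence), summing the joint over all assignments."""
        names = list(NETWORK)
        numerator = total = 0.0
        for values in product([True, False], repeat=len(names)):
            assignment = dict(zip(names, values))
            if any(assignment[n] != v for n, v in evidence.items()):
                continue                      # inconsistent with the evidence
            p = joint_prob(assignment)
            total += p
            if assignment[target]:
                numerator += p
        return numerator / total

    # Light on but no barking heard; the article gives .5 for this case.
    print(query("family-out", {"light-on": True, "hear-bark": False}))  # ~0.50

Enumerating the joint this way touches all 2^5 = 32 assignments, which is fine for five Boolean nodes but hopeless for the networks of hundreds or thousands of nodes mentioned above; the evaluation methods discussed later in the article exist precisely to avoid this blowup.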
In the remainder of this article, I first describe the independence assumptions implicit in Bayesian networks and show how they relate to the causal interpretation of arcs (Independence Assumptions). I then show that given these independence assumptions, the numbers I specified are, in fact, all that are needed (Consistent Probabilities). Evaluating Networks describes how Bayesian networks are evaluated, and the next section describes some of their applications.

Independence Assumptions

One objection to the use of probability theory is that the complete specification of a probability distribution requires absurdly many numbers. For example, if there are n binary random variables, the complete distribution is specified by 2^n - 1 joint probabilities. (If you do not know where this 2^n - 1 comes from, wait until the next section, where I define joint probabilities.) Thus, the complete