Bayesian Networks
In Michael A. Arbib (Ed.), The Handbook of Brain Theory and Neural Networks, Second Edition, Cambridge, MA: The MIT Press, 157–160, 2003. Technical Report R-216.

…appropriate with how likely each model would be to have generated the data. This results in an elegant, general framework for fitting models to data, which, however, may be compromised by computational difficulties in carrying out the ideal procedure. There are many approximate Bayesian implementations, using methods such as sampling, perturbation techniques, and variational methods. Often these enable the successful approximate realization of practical Bayesian schemes. An attractive, built-in effect of the Bayesian approach is an automatic procedure for combining predictions from several different models, the combination strength of a model being given by the posterior likelihood of the model. In the case of models linear in their parameters, Bayesian neural networks are closely related to Gaussian processes, and many of the computational difficulties of dealing with more general stochastic nonlinear systems can be avoided.

Bayesian methods are readily extendable to other areas, in particular density estimation, and the benefits of dealing with uncertainty are again to be found (see Bishop in Jordan, 1998). Traditionally, neural networks are graphical representations of functions, in which the computations at each node are deterministic. In the classification discussion, however, the final output represents a stochastic variable. We can consider such stochastic variables elsewhere in the network, and the sigmoid belief network is an early example of a stochastic network (Neal, 1992). There is a major conceptual difference between such models and conventional neural networks. Networks in which nodes represent stochastic variables are called graphical models (see BAYESIAN NETWORKS) and are graphical representations of distributions (GRAPHICAL MODELS: PROBABILISTIC INFERENCE). Such models evolve naturally from the desire of incorporating uncertainty and nonlinearity in networked systems.

Road Map: Learning in Artificial Networks
Related Reading: Bayesian Networks; Gaussian Processes; Graphical Models: Probabilistic Inference; Support Vector Machines

References

Barber, D., and Bishop, C., 1997, Ensemble learning in Bayesian neural networks, in Neural Networks and Machine Learning (C. Bishop, Ed.), NATO ASI Series, New York: Springer-Verlag.
Berger, J. O., 1985, Statistical Decision Theory and Bayesian Analysis, 2nd ed., New York: Springer-Verlag. ◆
Bishop, C. M., 1995, Neural Networks for Pattern Recognition, Oxford, Engl.: Oxford University Press. ◆
Box, G., and Tiao, G., 1973, Bayesian Inference in Statistical Analysis, Reading, MA: Addison-Wesley.
Cover, T., and Thomas, J., 1991, Elements of Information Theory, New York: Wiley.
Jordan, M., Ed., 1998, Learning in Graphical Models, Cambridge, MA: MIT Press.
MacKay, D. J. C., 1992, Bayesian interpolation, Neural Computat., 4(3):415–447.
MacKay, D. J. C., 1995, Probable networks and plausible predictions: A review of practical Bayesian methods for supervised neural networks, Netw. Computat. Neural Syst., 6(3). ◆
Neal, R. M., 1992, Connectionist learning of belief networks, Artif. Intell., 56:71–113.
Neal, R. M., 1993, Probabilistic Inference Using Markov Chain Monte Carlo Methods, Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, Toronto, Canada.
Neal, R. M., 1996, Bayesian Learning for Neural Networks, Lecture Notes in Statistics 118, New York: Springer-Verlag.
Tipping, M. E., 2001, Sparse Bayesian learning and the relevance vector machine, J. Machine Learn. Res., 1:211–244. ◆
Walker, A. M., 1969, On the asymptotic behaviour of posterior distributions, J. R. Statist. Soc. B, 31:80–88.
Bayesian Networks

Judea Pearl and Stuart Russell

Introduction

Probabilistic models based on directed acyclic graphs have a long and rich tradition, beginning with work by the geneticist Sewall Wright in the 1920s. Variants have appeared in many fields. Within statistics, such models are known as directed graphical models; within cognitive science and artificial intelligence (AI), they are known as Bayesian networks. The name honors the Reverend Thomas Bayes (1702–1761), whose rule for updating probabilities in light of new evidence is the foundation of the approach.

The initial development of Bayesian networks in the late 1970s was motivated by the need to model the top-down (semantic) and bottom-up (perceptual) combination of evidence in reading. The capability for bidirectional inferences, combined with a rigorous probabilistic foundation, led to the rapid emergence of Bayesian networks as the method of choice for uncertain reasoning in AI and expert systems, replacing earlier, ad hoc rule-based schemes (Pearl, 1988; Shafer and Pearl, 1990; Jensen, 1996).

The nodes in a Bayesian network represent propositional variables of interest (e.g., the temperature of a device, the sex of a patient, a feature of an object, the occurrence of an event) and the links represent informational or causal dependencies among the variables. The dependencies are quantified by conditional probabilities for each node, given its parents in the network. The network supports the computation of the probabilities of any subset of variables given evidence about any other subset.

Figure 1 illustrates a simple yet typical Bayesian network. It describes the causal relationships among five variables: the season of the year (X1), whether it's raining or not (X2), whether the sprinkler is on or off (X3), whether the pavement is wet or dry (X4), and whether the pavement is slippery or not (X5). Here, the absence of a direct link between X1 and X5, for example, captures our understanding that there is no direct influence of season on slipperiness; the influence is mediated by the wetness of the pavement. (If freezing is a possibility, then a direct link could be added.)

Figure 1. A Bayesian network representing causal influences among five variables. Each arc indicates a causal influence of the "parent" node on the "child" node.

Perhaps the most important aspect of Bayesian networks is that they are direct representations of the world, not of reasoning processes. The arrows in the diagram represent real causal connections and not the flow of information during reasoning (as in rule-based systems and neural networks). Reasoning processes can operate on Bayesian networks by propagating information in any direction. For example, if the sprinkler is on, then the pavement is probably wet (prediction); if someone slips on the pavement, that also provides evidence that it is wet (abduction, or reasoning to a probable cause). On the other hand, if we see that the pavement is wet, that makes it more likely that the sprinkler is on or that it is raining (abduction); but if we then observe that the sprinkler is on, that reduces the likelihood that it is raining (explaining away). It is this last form of reasoning, explaining away, that is especially difficult to model in rule-based systems and neural networks in any natural way, because it seems to require the propagation of information in two directions.

Probabilistic Semantics

Any complete probabilistic model of a domain must, either explicitly or implicitly, represent the joint distribution—the probability of every possible event as defined by the values of all the variables. There are exponentially many such events, yet Bayesian networks achieve compactness by factoring the joint distribution into local, conditional distributions for each variable given its parents. If xi denotes some value of the variable Xi and pai denotes some set of values for Xi's parents, then P(xi | pai) denotes this conditional distribution. For example, P(x4 | x2, x3) is the probability of wetness given the values of sprinkler and rain. The global semantics of Bayesian networks specifies that the full joint distribution is given by the product

P(x1, ..., xn) = ∏i P(xi | pai)    (1)

In our example network, we have

P(x1, x2, x3, x4, x5) = P(x1) P(x2 | x1) P(x3 | x1) P(x4 | x2, x3) P(x5 | x4)
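The factored semantics in Equation 1 is straightforward to make concrete. The sketch below is a minimal illustration (not from the article) that encodes the Figure 1 network as one conditional probability table per variable and evaluates the joint distribution as the product of local terms; the variable names follow the text, but every numeric probability is an invented placeholder, since the article specifies only the structure of the network.

```python
# Illustrative sketch only: the Figure 1 network written out as one conditional
# probability table (CPT) per variable. Variable names follow the text
# (X1 = season, X2 = rain, X3 = sprinkler, X4 = wet, X5 = slippery); all
# numeric probabilities are invented placeholders.

import itertools

SEASONS = ["spring", "summer", "fall", "winter"]

P_SEASON = {s: 0.25 for s in SEASONS}                                         # P(X1)
P_RAIN = {"spring": 0.40, "summer": 0.10, "fall": 0.30, "winter": 0.50}       # P(X2 = true | X1)
P_SPRINKLER = {"spring": 0.30, "summer": 0.60, "fall": 0.20, "winter": 0.05}  # P(X3 = on | X1)
P_WET = {(True, True): 0.99, (True, False): 0.90,
         (False, True): 0.85, (False, False): 0.05}                           # P(X4 = wet | X2, X3)
P_SLIPPERY = {True: 0.80, False: 0.01}                                        # P(X5 = true | X4)


def joint(x1, x2, x3, x4, x5):
    """Full joint P(x1, ..., x5) as the product of local CPTs (Equation 1)."""
    p = P_SEASON[x1]
    p *= P_RAIN[x1] if x2 else 1.0 - P_RAIN[x1]
    p *= P_SPRINKLER[x1] if x3 else 1.0 - P_SPRINKLER[x1]
    p *= P_WET[(x2, x3)] if x4 else 1.0 - P_WET[(x2, x3)]
    p *= P_SLIPPERY[x4] if x5 else 1.0 - P_SLIPPERY[x4]
    return p


# Sanity check: the product of normalized CPTs defines a proper joint
# distribution, so summing over all 4 * 2**4 = 64 elementary events gives 1.
total = sum(joint(x1, x2, x3, x4, x5)
            for x1 in SEASONS
            for x2, x3, x4, x5 in itertools.product([True, False], repeat=4))
print(total)  # 1.0 up to floating-point rounding
```

Because each table is normalized for every parent configuration, the product is automatically a proper joint distribution over the 64 events; this is the compactness argument made above, with a few small local tables standing in for the full joint table.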
Evidential Reasoning

From the product specification in Equation 1, one can express the probability of any desired proposition in terms of the conditional probabilities specified in the network. For example, the probability that the sprinkler is on, given that the pavement is slippery, is

P(X3 = on | X5 = true) = P(X3 = on, X5 = true) / P(X5 = true)

  = [ Σ_{x1, x2, x4} P(x1, x2, X3 = on, x4, X5 = true) ] / [ Σ_{x1, x2, x3, x4} P(x1, x2, x3, x4, X5 = true) ]

  = [ Σ_{x1, x2, x4} P(x1) P(x2 | x1) P(X3 = on | x1) P(x4 | x2, X3 = on) P(X5 = true | x4) ] / [ Σ_{x1, x2, x3, x4} P(x1) P(x2 | x1) P(x3 | x1) P(x4 | x2, x3) P(X5 = true | x4) ]

These expressions can often be simplified in ways that reflect the structure of the network itself. The first algorithms proposed for probabilistic calculations in Bayesian networks used a local, distributed message-passing architecture, typical of many cognitive activities (Kim and Pearl, 1983). Initially this approach was limited to tree-structured networks, but it was later extended to general networks in Lauritzen and Spiegelhalter's (1988) method of join-tree propagation. A number of other exact methods have been developed and can be found in recent textbooks (Jensen, 1996; Jordan, 1999).

It is easy to show that reasoning in Bayesian networks subsumes the satisfiability problem in propositional logic and, hence, is NP-hard. Monte Carlo simulation methods can be used for approximate inference (Pearl, 1988), giving gradually improving estimates as sampling proceeds. (These methods use local message propagation on the original network structure, unlike join-tree methods.) Alternatively, variational methods provide bounds on the true probability (Jordan, 1999).

Uncertainty over Time

Entities that live in a changing environment must keep track of variables whose values change over time.
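Returning to the query derived in the Evidential Reasoning section above, the following sketch (again illustrative only, using the same invented CPT numbers as the earlier example) computes P(X3 = on | X5 = true) by brute-force enumeration of the 64 joint events, and also reproduces the explaining-away pattern described in the Introduction; the helper name marginal is hypothetical.

```python
# Illustrative sketch (same invented CPTs as before): compute
#   P(X3 = on | X5 = true) = P(X3 = on, X5 = true) / P(X5 = true)
# by summing the factored joint over all events consistent with the evidence.

import itertools

SEASONS = ["spring", "summer", "fall", "winter"]
P_RAIN = {"spring": 0.40, "summer": 0.10, "fall": 0.30, "winter": 0.50}       # P(X2 = true | X1)
P_SPRINKLER = {"spring": 0.30, "summer": 0.60, "fall": 0.20, "winter": 0.05}  # P(X3 = on | X1)
P_WET = {(True, True): 0.99, (True, False): 0.90,
         (False, True): 0.85, (False, False): 0.05}                           # P(X4 = wet | X2, X3)
P_SLIPPERY = {True: 0.80, False: 0.01}                                        # P(X5 = true | X4)


def joint(x1, x2, x3, x4, x5):
    """P(x1, ..., x5) = P(x1) P(x2|x1) P(x3|x1) P(x4|x2,x3) P(x5|x4)."""
    p = 0.25  # uniform prior over the four seasons
    p *= P_RAIN[x1] if x2 else 1.0 - P_RAIN[x1]
    p *= P_SPRINKLER[x1] if x3 else 1.0 - P_SPRINKLER[x1]
    p *= P_WET[(x2, x3)] if x4 else 1.0 - P_WET[(x2, x3)]
    p *= P_SLIPPERY[x4] if x5 else 1.0 - P_SLIPPERY[x4]
    return p


def marginal(**fixed):
    """Sum the joint over every event that agrees with the fixed values,
    e.g. marginal(x3=True, x5=True) = P(X3 = on, X5 = true)."""
    total = 0.0
    for x1 in SEASONS:
        for x2, x3, x4, x5 in itertools.product([True, False], repeat=4):
            event = {"x1": x1, "x2": x2, "x3": x3, "x4": x4, "x5": x5}
            if all(event[name] == value for name, value in fixed.items()):
                total += joint(x1, x2, x3, x4, x5)
    return total


# The query from the text: probability the sprinkler is on given slipperiness.
print(marginal(x3=True, x5=True) / marginal(x5=True))

# Explaining away: with these invented numbers, learning that the sprinkler is
# on lowers the probability of rain once the pavement is already known wet.
print(marginal(x2=True, x4=True) / marginal(x4=True))                     # P(rain | wet)
print(marginal(x2=True, x3=True, x4=True) / marginal(x3=True, x4=True))   # P(rain | wet, sprinkler on)
```

Enumeration like this is exponential in the number of variables, which is exactly why the message-passing and join-tree algorithms cited above, along with the approximate Monte Carlo and variational methods, matter in practice.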