Mean Field Theory for Sigmoid Belief Networks
Journal of Artificial Intelligence Research 4 (1996) 61–76. Submitted 11/95; published 3/96.

Lawrence K. Saul [email protected]
Tommi Jaakkola [email protected]
Michael I. Jordan [email protected]
Center for Biological and Computational Learning
Massachusetts Institute of Technology
79 Amherst Street, E10-243
Cambridge, MA 02139

Abstract

We develop a mean field theory for sigmoid belief networks based on ideas from statistical mechanics. Our mean field theory provides a tractable approximation to the true probability distribution in these networks; it also yields a lower bound on the likelihood of evidence. We demonstrate the utility of this framework on a benchmark problem in statistical pattern recognition: the classification of handwritten digits.

1. Introduction

Bayesian belief networks (Pearl, 1988; Lauritzen & Spiegelhalter, 1988) provide a rich graphical representation of probabilistic models. The nodes in these networks represent random variables, while the links represent causal influences. These associations endow directed acyclic graphs (DAGs) with a precise probabilistic semantics. The ease of interpretation afforded by this semantics explains the growing appeal of belief networks, now widely used as models of planning, reasoning, and uncertainty.

Inference and learning in belief networks are possible insofar as one can efficiently compute or approximate the likelihood of observed patterns of evidence (Buntine, 1994; Russell, Binder, Koller, & Kanazawa, 1995). There exist provably efficient algorithms for computing likelihoods in belief networks with tree or chain-like architectures. In practice, these algorithms also tend to perform well on more general sparse networks. However, for networks in which nodes have many parents, the exact algorithms are too slow (Jensen, Kong, & Kjærulff, 1995). Indeed, in large networks with dense or layered connectivity, exact methods are intractable, as they require summing over an exponentially large number of hidden states. One approach to dealing with such networks has been to use Gibbs sampling (Pearl, 1988), a stochastic simulation methodology with roots in statistical mechanics (Geman & Geman, 1984).

Our approach in this paper relies on a different tool from statistical mechanics: mean field theory (Parisi, 1988). The mean field approximation is well known for probabilistic models that can be represented as undirected graphs, so-called Markov networks. For example, in Boltzmann machines (Ackley, Hinton, & Sejnowski, 1985), mean field learning rules have been shown to yield tremendous savings in time and computation over sampling-based methods (Peterson & Anderson, 1987).

The main motivation for this work was to extend the mean field approximation for undirected graphical models to their directed counterparts. Since belief networks can be transformed to Markov networks, and mean field theories for Markov networks are well known, it is natural to ask why a new framework is required at all. The reason is that probabilistic models which have compact representations as DAGs may have unwieldy representations as undirected graphs. As we shall see, avoiding this complexity and working directly on DAGs requires an extension of existing methods.

In this paper we focus on sigmoid belief networks (Neal, 1992), for which the resulting mean field theory is most straightforward. These are networks of binary random variables whose local
conditional distributions are based on log-linear models. We develop a mean field approximation for these networks and use it to compute a lower bound on the likelihood of evidence. Our method applies to arbitrary partial instantiations of the variables in these networks and makes no restrictions on the network topology. Note that once a lower bound is available, a learning procedure can maximize the lower bound; this is useful when the true likelihood itself cannot be computed efficiently. A similar approximation for models of continuous random variables is discussed by Jaakkola et al. (1995).

The idea of bounding the likelihood in sigmoid belief networks was introduced in a related architecture known as the Helmholtz machine (Hinton, Dayan, Frey, & Neal, 1995). A fundamental advance of this work was to establish a framework for approximation that is especially conducive to learning the parameters of layered belief networks. The close connection between this idea and the mean field approximation from statistical mechanics, however, was not developed. In this paper we hope not only to elucidate this connection, but also to convey a sense of which approximations are likely to generate useful lower bounds while, at the same time, remaining analytically tractable. We develop here what is perhaps the simplest such approximation for belief networks, noting that more sophisticated methods (Jaakkola & Jordan, 1996a; Saul & Jordan, 1995) are also available.

It should be emphasized that approximations of some form are required to handle the multilayer neural networks used in statistical pattern recognition. For these networks, exact algorithms are hopelessly intractable; moreover, Gibbs sampling methods are impractically slow.

The organization of this paper is as follows. Section 2 introduces the problems of inference and learning in sigmoid belief networks. Section 3 contains the main contribution of the paper: a tractable mean field theory. Here we present the mean field approximation for sigmoid belief networks and derive a lower bound on the likelihood of instantiated patterns of evidence. Section 4 looks at a mean field algorithm for learning the parameters of sigmoid belief networks. For this algorithm, we give results on a benchmark problem in pattern recognition: the classification of handwritten digits. Finally, section 5 presents our conclusions, as well as future issues for research.

2. Sigmoid Belief Networks

The great virtue of belief networks is that they clearly exhibit the conditional dependencies of the underlying probability model. Consider a belief network defined over binary random variables $S = (S_1, S_2, \ldots, S_N)$. We denote the parents of $S_i$ by $\mathrm{pa}(S_i) \subseteq \{S_1, S_2, \ldots, S_{i-1}\}$; this is the smallest set of nodes for which

    $P(S_i \mid S_1, S_2, \ldots, S_{i-1}) = P(S_i \mid \mathrm{pa}(S_i)).$    (1)

In sigmoid belief networks (Neal, 1992), the conditional distributions attached to each node are based on log-linear models. In particular, the probability that the $i$th node is activated is given by

    $P(S_i = 1 \mid \mathrm{pa}(S_i)) = \sigma\Big(\sum_j J_{ij} S_j + h_i\Big),$    (2)

where $J_{ij}$ and $h_i$ are the weights and biases in the network, and

    $\sigma(z) = \frac{1}{1 + e^{-z}}$    (3)

is the sigmoid function shown in Figure 1. In sigmoid belief networks, we have $J_{ij} = 0$ for $S_j \notin \mathrm{pa}(S_i)$; moreover, $J_{ij} = 0$ for $j \geq i$, since the network's structure is that of a directed acyclic graph.

The sigmoid function in eq. (3) provides a compact parametrization of the conditional probability distributions in eq. (2) used to propagate beliefs.¹ In particular, $P(S_i \mid \mathrm{pa}(S_i))$ depends on $\mathrm{pa}(S_i)$ only through a sum of weighted inputs, where the weights may be viewed as the parameters in a logistic regression (McCullagh & Nelder, 1983).

1. The relation to noisy-OR models is discussed in appendix A.
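To make the parametrization in eqs. (2) and (3) concrete, here is a minimal Python/NumPy sketch. This is our own illustration, not code from the paper; the names `sigmoid`, `node_prob`, `J`, and `h`, and the example network, are hypothetical choices.

```python
import numpy as np

def sigmoid(z):
    """The sigmoid function of eq. (3): sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def node_prob(i, S, J, h):
    """P(S_i = 1 | pa(S_i)) from eq. (2).

    S is the vector of binary node states; J is the weight matrix, with
    J[i, j] = 0 whenever S_j is not a parent of S_i (in particular for
    j >= i, reflecting the DAG ordering); h is the vector of biases.
    """
    return sigmoid(J[i] @ S + h[i])

# A three-node chain S_1 -> S_2 -> S_3 (0-indexed arrays below).
J = np.array([[0.0,  0.0, 0.0],
              [2.0,  0.0, 0.0],
              [0.0, -1.0, 0.0]])
h = np.array([-1.0, 0.5, 0.0])
S = np.array([1, 0, 1])
print(node_prob(1, S, J, h))   # P(S_2 = 1 | S_1 = 1) = sigmoid(2.0 + 0.5)
```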
Figure 1: The sigmoid function $\sigma(z) = [1 + e^{-z}]^{-1}$. If $z$ is the sum of weighted inputs to node $S_i$, then $P(S_i = 1 \mid z) = \sigma(z)$ is the conditional probability that node $S_i$ is activated.

The conditional probability distribution for $S_i$ may be summarized as:

    $P(S_i \mid \mathrm{pa}(S_i)) = \frac{\exp\big[\big(\sum_j J_{ij} S_j + h_i\big) S_i\big]}{1 + \exp\big[\sum_j J_{ij} S_j + h_i\big]}.$    (4)

Note that substituting $S_i = 1$ in eq. (4) recovers the result in eq. (2). Combining eqs. (1) and (4), we may write the joint probability distribution over the variables in the network as:

    $P(S) = \prod_i P(S_i \mid \mathrm{pa}(S_i))$    (5)

    $\phantom{P(S)} = \prod_i \frac{\exp\big[\big(\sum_j J_{ij} S_j + h_i\big) S_i\big]}{1 + \exp\big[\sum_j J_{ij} S_j + h_i\big]}.$    (6)

The denominator in eq. (6) ensures that the probability distribution is normalized to unity.

We now turn to the problem of inference in sigmoid belief networks. Absorbing evidence divides the units in the belief network into two types, visible and hidden. The visible units (or "evidence nodes") are those for which we have instantiated values; the hidden units are those for which we do not. When there is no possible ambiguity, we will use $H$ and $V$ to denote the subsets of hidden and visible units. Using Bayes' rule, inference is done under the conditional distribution

    $P(H \mid V) = \frac{P(H, V)}{P(V)},$    (7)

where

    $P(V) = \sum_H P(H, V)$    (8)

is the likelihood of the evidence $V$. In principle, the likelihood may be computed by summing over all $2^{|H|}$ configurations of the hidden units. Unfortunately, this calculation is intractable in large, densely connected networks. This intractability presents a major obstacle to learning parameters for these networks, as nearly all procedures for statistical estimation require frequent estimates of the likelihood.
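To make this cost explicit, here is a brute-force sketch of the joint distribution in eq. (6) and the likelihood in eq. (8). Again this is our own illustration, reusing the hypothetical `J` and `h` from the sketch above; the exponential loop over hidden configurations is exactly the computation that mean field theory is designed to avoid.

```python
import itertools
import numpy as np

def log_joint(S, J, h):
    """log P(S) from eq. (6): each node i contributes
    (sum_j J_ij S_j + h_i) S_i - log(1 + exp(sum_j J_ij S_j + h_i))."""
    z = J @ S + h                        # summed weighted input to each node
    return float(np.sum(z * S - np.logaddexp(0.0, z)))

def likelihood(V, visible, N, J, h):
    """P(V) from eq. (8), by exhaustive enumeration.

    visible : indices of the instantiated (evidence) nodes
    V       : their observed binary values
    The loop visits all 2^|H| configurations of the hidden units --
    the exponential cost that makes exact inference intractable.
    """
    hidden = [i for i in range(N) if i not in visible]
    total = 0.0
    for bits in itertools.product([0, 1], repeat=len(hidden)):
        S = np.zeros(N)
        S[visible] = V
        S[hidden] = bits
        total += np.exp(log_joint(S, J, h))
    return total

# Example with the chain network above: observe S_3 = 1, sum over S_1, S_2.
# likelihood(V=np.array([1]), visible=[2], N=3, J=J, h=h)
```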