An Introduction to Graphical Models
Michael I. Jordan
Center for Biological and Computational Learning
Massachusetts Institute of Technology
http://www.ai.mit.edu/projects/jordan.html

Acknowledgments: Zoubin Ghahramani, Tommi Jaakkola, Marina Meila, Lawrence Saul

December, 1997

GRAPHICAL MODELS

- Graphical models are a marriage between graph theory and probability theory.
- They clarify the relationship between neural networks and related network-based models such as HMMs, MRFs, and Kalman filters.
- Indeed, they can be used to give a fully probabilistic interpretation to many neural network architectures.
- Some advantages of the graphical model point of view:
  - inference and learning are treated together
  - supervised and unsupervised learning are merged seamlessly
  - missing data handled nicely
  - a focus on conditional independence and computational issues
  - interpretability, if desired

Graphical models (cont.)

- There are two kinds of graphical models: those based on undirected graphs and those based on directed graphs. Our main focus will be directed graphs.
- Alternative names for graphical models: belief networks, Bayesian networks, probabilistic independence networks, Markov random fields, loglinear models, influence diagrams.
- A few myths about graphical models:
  - they require a localist semantics for the nodes
  - they require a causal semantics for the edges
  - they are necessarily Bayesian
  - they are intractable

Learning and inference

- A key insight from the graphical model point of view: it is not necessary to learn that which can be inferred.
- The weights in a network make local assertions about the relationships between neighboring nodes.
- Inference algorithms turn these local assertions into global assertions about the relationships between nodes:
  - e.g., correlations between hidden units conditional on an input-output pair
  - e.g., the probability of an input vector given an output vector
- This is achieved by associating a joint probability distribution with the network.

Directed graphical models: basics

- Consider an arbitrary directed acyclic graph, where each node in the graph corresponds to a random variable (scalar or vector):

  [Figure: a directed acyclic graph on the six nodes A, B, C, D, E, F]

- There is no a priori need to designate units as "inputs," "outputs," or "hidden."
- We want to associate a probability distribution P(A, B, C, D, E, F) with this graph, and we want all of our calculations to respect this distribution, e.g.,

      P(F \mid A, B) = \frac{\sum_{C,D,E} P(A, B, C, D, E, F)}{\sum_{C,D,E,F} P(A, B, C, D, E, F)}

  (a brute-force version of this calculation is sketched in the code example at the end of this part).

Some ways to use a graphical model

- Prediction
- Diagnosis, control, optimization
- Supervised learning

  [Figure: in each case some of the nodes are shaded, i.e., observed]

- We want to marginalize over the unshaded nodes in each case, i.e., integrate them out from the joint probability.
- "Unsupervised learning" is the general case.

Specification of a graphical model

- There are two components to any graphical model:
  - the qualitative specification
  - the quantitative specification
- Where does the qualitative specification come from?
  - prior knowledge of causal relationships
  - prior knowledge of modular relationships
  - assessment from experts
  - learning from data
  - we simply like a certain architecture, e.g., a layered graph

Qualitative specification of graphical models

[Figure: chains in which C lies between A and B. A and B are marginally dependent; A and B are conditionally independent given C.]

[Figure: a common cause C with arrows to both A and B. A and B are marginally dependent; A and B are conditionally independent given C.]

Semantics of graphical models (cont.)

[Figure: a common effect, A -> C <- B. A and B are marginally independent; A and B are conditionally dependent given C.]

This is the interesting case...
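For concreteness, here is a minimal Python sketch (not part of the original notes) of the brute-force conditional calculation shown above: enumerate the joint distribution of six binary variables from local conditional probability tables and sum out the unwanted nodes. The graph structure follows the six-node example used throughout these notes, but the numerical table entries and the function names are made up purely for illustration.

```python
# A minimal sketch (illustrative numbers only) of
#   P(F | A, B) = sum_{C,D,E} P(A,..,F) / sum_{C,D,E,F} P(A,..,F)
# for the DAG A -> C <- B, C -> D, C -> E, D -> F <- E with binary nodes.
import itertools

# P(node = 1 | parent values); the entries below are made up for illustration.
p_A = 0.3
p_B = 0.6
p_C = {(a, b): 0.1 + 0.4 * a + 0.3 * b for a in (0, 1) for b in (0, 1)}   # P(C=1 | A, B)
p_D = {c: 0.2 + 0.5 * c for c in (0, 1)}                                  # P(D=1 | C)
p_E = {c: 0.7 - 0.4 * c for c in (0, 1)}                                  # P(E=1 | C)
p_F = {(d, e): 0.05 + 0.45 * d + 0.4 * e for d in (0, 1) for e in (0, 1)} # P(F=1 | D, E)

def bernoulli(p, x):
    """Probability that a binary variable with success probability p takes value x."""
    return p if x == 1 else 1.0 - p

def joint(a, b, c, d, e, f):
    """P(A,B,C,D,E,F) as the product of the local conditionals."""
    return (bernoulli(p_A, a) * bernoulli(p_B, b) * bernoulli(p_C[(a, b)], c)
            * bernoulli(p_D[c], d) * bernoulli(p_E[c], e) * bernoulli(p_F[(d, e)], f))

def p_f_given_ab(f, a, b):
    """P(F = f | A = a, B = b): sum out C, D, E (and also F for the normalizer)."""
    numer = sum(joint(a, b, c, d, e, f) for c, d, e in itertools.product((0, 1), repeat=3))
    denom = sum(joint(a, b, c, d, e, ff) for c, d, e, ff in itertools.product((0, 1), repeat=4))
    return numer / denom

print(p_f_given_ab(1, a=1, b=0))
```

Exact enumeration like this scales exponentially in the number of summed-out nodes; avoiding that cost is the point of the inference algorithms (such as the junction tree algorithm) discussed later.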
\Explaining away" Burglar Earthquake Alarm Radio All connections in b oth directions are \excitatory" But an increase in \activation" of Earthquake leads to a decrease in \activation" of Burglar Where do es the \inhibition" come from? Quantitative sp eci cation of directed mo dels Question: how do we sp ecify a joint distribution over the no des in the graph? Answer: asso ciate a conditional probability with each no de: P(D|C) P(A) P(F|D,E) P(C|A,B) P(B) P(E|C) and take the pro duct of the local probabilities to yield the global probabilities Justi cation In general, let fS g represent the set of random variables corresp onding to the N no des of the graph For any no de S , let paS represent the set of i i parents of no de S i Then P S = P S P S jS P S jS ;:::;S 1 2 1 N N 1 1 Y = P S jS ;:::;S i i1 1 i Y = P S jpaS i i i where the last line is by assumption It is possible to prove a theorem that states that if arbitrary probability distributions are utilized for P S jpaS in the formula above, then the family i i of probability distributions obtained is exactly that set which respects the qualitative speci cation the conditional independence relations described earlier Semantics of undirected graphs ABC ABC A and B are marginally dependent A and B are conditionally independent Comparative semantics D A B A B C C The graph on the left yields conditional indep endencies that a directed graph can't represent The graph on the right yields marginal indep endencies that an undirected graph can't represent Quantitative sp eci cation of undirected mo dels D A F C B E identify the cliques in the graph: A, C C, D, E D, E, F B, C de ne a con guration of a clique as a sp eci cation of values for each no de in the clique de ne a potential of a clique as a function that asso ciates a real number with each con guration of the clique Φ A, C Φ Φ C, D, E D, E, F Φ B, C Quantitative sp eci cation cont. Consider the example of a graph with binary no des A \p otential" is a table with entries for each combination of no des in a clique B 0 1 AB 0 1.5 .4 A 1 .7 1.2 \Marginalizing" over a p otential table simply means collapsing summing the table along one or more dimensions marginalizing over B marginalizing over A B 0 1.9 0 1 A 2.2 1.6 1 1.9 Quantitative sp eci cation cont. 
Quantitative specification (cont.)

- Finally, define the probability of a global configuration of the nodes as the product of the local potentials on the cliques:

      P(A, B, C, D, E, F) = \Phi_{A,C}\, \Phi_{B,C}\, \Phi_{C,D,E}\, \Phi_{D,E,F}

  where, without loss of generality, we assume that the normalization constant (if any) has been absorbed into one of the potentials.
- It is then possible to prove a theorem that states that if arbitrary potentials are utilized in the product formula for probabilities, then the family of probability distributions obtained is exactly that set which respects the qualitative specification (the conditional independence relations) described earlier.
- This theorem is known as the Hammersley-Clifford theorem.

Boltzmann machine

- The Boltzmann machine is a special case of an undirected graphical model.
- For a Boltzmann machine, all of the potentials are formed by taking products of factors of the form exp{J_{ij} S_i S_j}.

  [Figure: two nodes S_i and S_j joined by a weight J_{ij}]

- Setting J_{ij} equal to zero for non-neighboring nodes guarantees that we respect the clique boundaries.
- But we don't get the full conditional probability semantics with the Boltzmann machine parameterization; i.e., the family of distributions parameterized by a Boltzmann machine on a graph is a proper subset of the family characterized by the conditional independencies.

Evidence and Inference

- "Absorbing evidence" means observing the values of certain of the nodes.
- Absorbing evidence divides the units of the network into two groups:
  - visible units {V}: those for which we have instantiated values ("evidence nodes")
  - hidden units {H}: those for which we do not have instantiated values
- "Inference" means calculating the conditional distribution

      P(H \mid V) = \frac{P(H, V)}{\sum_{\{H\}} P(H, V)}

- Prediction and diagnosis are special cases.

Inference algorithms for directed graphs

- There are several inference algorithms, some of which operate directly on the directed graph.
- The most popular inference algorithm, known as the junction tree algorithm (which we'll discuss here), operates on an undirected graph.
- It also has the advantage of clarifying some of the relationships between the various algorithms.
- To understand the junction tree algorithm, we need to understand how to "compile" a directed graph into an undirected graph.

Moral graphs

- Note that for both directed graphs and undirected graphs, the joint probability is in a product form.
- So let's convert local conditional probabilities into potentials; then the products of potentials will give the right answer.
- Indeed, we can think of a conditional probability, e.g., P(C | A, B), as a function of the three variables A, B, and C: we get a real number for each configuration.

  [Figure: node C with parents A and B, labeled P(C | A, B)]

- Problem: a node and its parents are not generally in the same clique.
- Solution: marry the parents to obtain the "moral graph," so that, e.g., \Phi_{A,B,C} = P(C | A, B) fits inside a single clique (a small code sketch of this construction appears at the end of this section).

Moral graphs (cont.)

- Define the potential on a clique as the product over all conditional probabilities contained within the clique.
- Now the product of potentials gives the right answer:

      P(A, B, C, D, E, F) = P(A)\, P(B)\, P(C \mid A, B)\, P(D \mid C)\, P(E \mid C)\, P(F \mid D, E)
                          = \Phi_{A,B,C}\, \Phi_{C,D,E}\, \Phi_{D,E,F}

  where

      \Phi_{A,B,C} = P(A)\, P(B)\, P(C \mid A, B)
      \Phi_{C,D,E} = P(D \mid C)\, P(E \mid C)
      \Phi_{D,E,F} = P(F \mid D, E)

  [Figure: the directed graph on A, B, C, D, E, F and its moral graph]

Propagation of probabilities

- Now suppose that some evidence has been absorbed.
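As promised above, here is a minimal Python sketch (not part of the original notes) of the moralization step: connect ("marry") the parents of every node and drop the edge directions. The parent-dictionary representation and the function name moralize are illustrative choices, not anything prescribed in these notes.

```python
# Moralization sketch: given a DAG as a parent list, build the moral graph's edge set.
from itertools import combinations

def moralize(parents):
    """parents: dict mapping each node to a list of its parents.
    Returns the moral graph's edges as a set of frozensets (undirected edges)."""
    edges = set()
    for child, pa in parents.items():
        # Drop directions: keep each parent-child edge as an undirected edge.
        for p in pa:
            edges.add(frozenset((p, child)))
        # "Marry the parents": connect every pair of parents of the same child.
        for p, q in combinations(pa, 2):
            edges.add(frozenset((p, q)))
    return edges

# The six-node example used throughout the notes:
# A -> C, B -> C, C -> D, C -> E, D -> F, E -> F.
dag = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"], "E": ["C"], "F": ["D", "E"]}
moral = moralize(dag)
print(sorted(tuple(sorted(e)) for e in moral))
```

On this example, moralization adds the edges A-B and D-E, so the moral graph has the cliques {A, B, C}, {C, D, E}, and {D, E, F} used in the factorization above, and each node sits in a clique together with its parents.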