Machine Learning: What Is a Graphical Model?

Machine Learning 10-701/15-781, Spring 2008
Graphical Models
Eric Xing, Lecture 18, March 26, 2008
Reading: Chap. 8, C. Bishop book

[Figure: the "Asia" network: Visit to Asia (X1), Smoking (X2), Tuberculosis (X3), Lung Cancer (X4), Bronchitis (X5), Tuberculosis or Cancer (X6), X-Ray Result (X7), Dyspnea (X8)]

What is a graphical model? An example from medical diagnostics

• A possible world for a patient with a lung problem:

[Figure: the same "Asia" network over X1 through X8]

Recap of Basic Probability Concepts

• Representation: what is the joint probability distribution on multiple variables, P(X1, X2, X3, X4, X5, X6, X7, X8)?
  • How many state configurations are there in total? 2^8.
  • Do they all need to be represented explicitly?
  • Do we get any scientific/medical insight?
• Learning: where do we get all these probabilities?
  • Maximum-likelihood estimation? But how much data do we need?
  • Where do we put domain knowledge, in terms of plausible relationships between variables and plausible values of the probabilities?
• Inference: if not all variables are observable, how do we compute the conditional distribution of latent variables given evidence?

Dependencies among variables

[Figure: the same network, grouped into Patient Information (X1, X2), Medical Difficulties (X3 through X6), and Diagnostic Tests (X7, X8)]

Probabilistic Graphical Models

• Represent dependency structure with a graph
  • Node <-> random variable
  • Edges encode dependencies
  • Absence of an edge -> conditional independence
  • Directed and undirected versions
• Why is this useful?
  • A language for communication
  • A language for computation
  • A language for development
• Origins:
  • Wright, 1920s
  • Independently developed by Spiegelhalter and Lauritzen in statistics and by Pearl in computer science in the late 1980s

Probabilistic Graphical Models, cont'd

If the Xi's are conditionally independent (as described by a PGM), the joint can be factored into a product of simpler terms, e.g.,

  P(X1, X2, X3, X4, X5, X6, X7, X8)
    = P(X1) P(X2) P(X3 | X1) P(X4 | X2) P(X5 | X2) P(X6 | X3, X4) P(X7 | X6) P(X8 | X5, X6)

Why might we favor a PGM?
• Representation cost: how many probability statements are needed? 2+2+4+4+4+8+4+8 = 36, roughly an 8-fold reduction from 2^8 = 256!
• Algorithms for systematic and efficient inference/learning computation, exploiting the graph structure and the probabilistic (e.g., Bayesian, Markovian) semantics
• Incorporation of domain knowledge and causal (logical) structures

Two types of GMs

• Directed edges give causality relationships (Bayesian Network or Directed Graphical Model)
• Undirected edges simply give (physical or symmetric) correlations between variables (Markov Random Field or Undirected Graphical Model)

Bayesian Network: Factorization Theorem

  P(X1, X2, X3, X4, X5, X6, X7, X8)
    = P(X1) P(X2) P(X3 | X1) P(X4 | X2) P(X5 | X2) P(X6 | X3, X4) P(X7 | X6) P(X8 | X5, X6)

• Theorem: given a DAG, the most general form of the probability distribution that is consistent with the graph factors according to "node given its parents":

    P(X) = ∏_{i=1}^{d} P(X_i | X_{π_i})

  where X_{π_i} is the set of parents of X_i and d is the number of nodes (variables) in the graph.
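To make the factorization concrete, here is a minimal sketch (not from the lecture; the CPT values and the helper `joint_prob` are illustrative assumptions) that evaluates the Asia-network joint as a product of node-given-parents factors and counts the CPT entries, assuming all eight variables are binary.

```python
import itertools
import random

# Parents of each node in the Asia network (X1..X8), as on the slide.
parents = {
    "X1": [],            # Visit to Asia
    "X2": [],            # Smoking
    "X3": ["X1"],        # Tuberculosis
    "X4": ["X2"],        # Lung Cancer
    "X5": ["X2"],        # Bronchitis
    "X6": ["X3", "X4"],  # Tuberculosis or Cancer
    "X7": ["X6"],        # X-Ray Result
    "X8": ["X5", "X6"],  # Dyspnea
}

# Hypothetical CPTs: cpt[node][parent_values] = P(node = 1 | parent_values).
# The numbers are made up; only the graph structure comes from the slide.
random.seed(0)
cpt = {n: {pv: round(random.uniform(0.05, 0.95), 2)
           for pv in itertools.product((0, 1), repeat=len(ps))}
       for n, ps in parents.items()}

def joint_prob(assignment):
    """P(X1,...,X8) = prod_i P(Xi | parents(Xi)) for one full 0/1 assignment."""
    p = 1.0
    for node, ps in parents.items():
        p1 = cpt[node][tuple(assignment[q] for q in ps)]
        p *= p1 if assignment[node] == 1 else 1.0 - p1
    return p

# Representation cost in the binary case: 2^(|parents|+1) entries per CPT,
# i.e. 2+2+4+4+4+8+4+8 = 36, versus 2^8 = 256 for the full joint table.
print(sum(2 ** (len(ps) + 1) for ps in parents.values()), "CPT entries vs", 2 ** 8)
print("P(all X_i = 0) =", joint_prob({n: 0 for n in parents}))
```

The parameter count printed by the sketch reproduces the 36-versus-256 comparison on the slide; the probability value itself is meaningless because the CPT entries are random placeholders.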
Bayesian Network: Conditional Independence Semantics

• Structure: DAG
• Meaning: a node is conditionally independent of every other node in the network outside its Markov blanket
• Local conditional probability distributions (CPDs) and the DAG completely determine the joint distribution
• Directed models give causality relationships and facilitate a generative process

[Figure: a node X with its ancestors, parents Y1 and Y2, children, children's co-parents, and descendants]

Local Structures & Independencies

• Common parent (A <- B -> C): fixing B decouples A and C. "Given the level of gene B, the levels of A and C are independent."
• Cascade (A -> B -> C): knowing B decouples A and C. "Given the level of gene B, the level of gene A provides no extra predictive value for the level of gene C."
• V-structure (A -> C <- B): knowing C couples A and B, because A can "explain away" B with respect to C. "If A correlates with C, then the chance that B also correlates with C decreases."
• The language is compact, the concepts are rich!

A simple justification

[Figure: the common-parent structure A <- B -> C]

Graph separation criterion

• D-separation criterion for Bayesian networks (D for directed edges)
• Definition: variables x and y are D-separated (conditionally independent) given z if they are separated in the moralized ancestral graph
• Example: [figure]

Global Markov properties of DAGs

• X is d-separated (directed-separated) from Z given Y if we cannot send a ball from any node in X to any node in Z using the "Bayes-ball" algorithm (plus some boundary conditions)
• Defn: I(G) = all independence properties that correspond to d-separation:

    I(G) = {X ⊥ Z | Y : dsep_G(X; Z | Y)}

• D-separation is sound and complete

Example

• Complete the I(G) of this graph: [figure with nodes x1, x2, x3, x4]

Towards a quantitative specification of the probability distribution

• Separation properties in the graph imply independence properties about the associated variables
• For the graph to be useful, any conditional independence properties we can derive from the graph should hold for the probability distribution that the graph represents
• The Equivalence Theorem: for a graph G, let D1 denote the family of all distributions that satisfy I(G), and let D2 denote the family of all distributions that factor according to G. Then D1 ≡ D2.

Example

[Figure: a three-node graph over A, B, C; write out p(A, B, C)]

Example, cont'd: Evolution

[Figure: an evolutionary tree model: an ancestor, T years, processes Qh and Qm along the branches, and observed nucleotides (A, C, G, ...) at the leaves]

Example, cont'd: Speech recognition

[Figure: a Hidden Markov Model with hidden chain Y1 -> Y2 -> Y3 -> ... -> YT and observations X1, X2, X3, ..., XT]

Example, cont'd: Genetic Pedigree

[Figure: a pedigree model with nodes A0, B0, Ag, Bg, A1, B1, F0, Fg, M0, F1, Sg, M1, C0, Cg, C1]

Conditional probability tables (CPTs)

  P(a, b, c, d) = P(a) P(b) P(c | a, b) P(d | c)

[Figure: the DAG A -> C <- B, C -> D, with the following tables]

P(a):   a0: 0.75   a1: 0.25
P(b):   b0: 0.33   b1: 0.67

P(c | a, b):
        a0,b0   a0,b1   a1,b0   a1,b1
  c0    0.45    1       0.9     0.7
  c1    0.55    0       0.1     0.3

P(d | c):
        c0      c1
  d0    0.3     0.5
  d1    0.7     0.5
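As a sanity check on the CPT slide above, here is a small sketch (mine, not from the lecture) that stores the slide's tables and evaluates P(a, b, c, d) = P(a) P(b) P(c | a, b) P(d | c) for one assignment; the entry for P(d1 | c0) is read as 0.7 so that each column sums to one.

```python
# CPT values transcribed from the slide (binary variables a, b, c, d).
P_a = {0: 0.75, 1: 0.25}
P_b = {0: 0.33, 1: 0.67}
# P(c | a, b): keyed by (a, b); the value is P(c = 0 | a, b).
P_c0_given_ab = {(0, 0): 0.45, (0, 1): 1.0, (1, 0): 0.9, (1, 1): 0.7}
# P(d | c): keyed by c; the value is P(d = 0 | c).
P_d0_given_c = {0: 0.3, 1: 0.5}

def joint(a, b, c, d):
    """P(a, b, c, d) = P(a) P(b) P(c | a, b) P(d | c)."""
    pc = P_c0_given_ab[(a, b)] if c == 0 else 1.0 - P_c0_given_ab[(a, b)]
    pd = P_d0_given_c[c] if d == 0 else 1.0 - P_d0_given_c[c]
    return P_a[a] * P_b[b] * pc * pd

# e.g. P(a=0, b=0, c=1, d=0) = 0.75 * 0.33 * 0.55 * 0.5 ≈ 0.068
print(joint(0, 0, 1, 0))

# Sanity check: the factored joint sums to 1 over all 2^4 assignments.
total = sum(joint(a, b, c, d) for a in (0, 1) for b in (0, 1)
            for c in (0, 1) for d in (0, 1))
print(round(total, 6))  # 1.0
```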
Conditional probability density functions (CPDs)

  P(a, b, c, d) = P(a) P(b) P(c | a, b) P(d | c)

[Figure: the same DAG A -> C <- B, C -> D, with continuous CPDs and a plot of P(D | C)]

  A ~ N(µa, Σa)
  B ~ N(µb, Σb)
  C ~ N(A + B, Σc)
  D ~ N(µa + C, Σa)

Conditionally Independent Observations

[Figure: model parameters θ with arrows to data points y1, y2, ..., yn-1, yn]

"Plate" Notation

[Figure: θ (model parameters) pointing into a plate containing yi, i = 1:n; Data = {y1, ..., yn}]

• Plate = rectangle in a graphical model
• Variables within a plate are replicated in a conditionally independent manner

Example: Gaussian Model

[Figure: µ and σ pointing to yi inside a plate, i = 1:n]

• Generative model:

    p(y1, ..., yn | µ, σ) = ∏_i p(yi | µ, σ) = p(data | parameters) = p(D | θ),  where θ = {µ, σ}

• Likelihood = p(data | parameters) = p(D | θ) = L(θ)
• The likelihood tells us how likely the observed data are, conditioned on a particular setting of the parameters
• It is often easier to work with log L(θ)

Example: Bayesian Gaussian Model

[Figure: hyperparameters α -> µ and β -> σ, which point to yi inside a plate, i = 1:n]

• Note: priors and parameters are assumed independent here

Markov Random Fields

• Structure: an undirected graph
• Meaning: a node is conditionally independent of every other node in the network given its neighbors
• Local contingency functions (potentials) and the cliques in the graph completely determine the joint distribution
• Undirected models give correlations between variables, but no explicit way to generate samples

[Figure: a node X with undirected neighbors Y1, Y2, ...]

Semantics of Undirected Graphs

• Let H be an undirected graph:
  • B separates A and C if every path from a node in A to a node in C passes through a node in B: sep_H(A; C | B)
  • A probability distribution satisfies the global Markov property if for any disjoint A, B, C such that B separates A and C, A is independent of C given B:

      I(H) = {A ⊥ C | B : sep_H(A; C | B)}

Representation

• Defn: an undirected graphical model represents a distribution P(X1, ..., Xn) defined by an undirected graph H and a set of positive potential functions ψc associated with the cliques of H, s.t.

    P(x1, ..., xn) = (1/Z) ∏_{c ∈ C} ψc(xc)

  where Z is known as the partition function:

    Z = ∑_{x1, ..., xn} ∏_{c ∈ C} ψc(xc)

• Also known as Markov Random Fields, Markov networks, ...
• A potential function can be understood as a contingency function of its arguments, assigning a "pre-probabilistic" score to their joint configuration.
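To ground the last definition, here is a brute-force sketch (the graph, clique potentials, and function names are hypothetical, not from the lecture) of P(x) = (1/Z) ∏_c ψc(xc) for a tiny binary MRF, computing the partition function Z by summing the unnormalized product over every configuration.

```python
import itertools

# A tiny undirected model on binary variables x1, x2, x3 with cliques
# {x1, x2} and {x2, x3}; the potential values are illustrative only.
psi_12 = {(0, 0): 30.0, (0, 1): 5.0, (1, 0): 1.0, (1, 1): 10.0}
psi_23 = {(0, 0): 100.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 100.0}

def unnormalized(x1, x2, x3):
    """Product of clique potentials: a 'pre-probabilistic' score."""
    return psi_12[(x1, x2)] * psi_23[(x2, x3)]

# Partition function: Z = sum of the potential product over all configurations.
Z = sum(unnormalized(*x) for x in itertools.product((0, 1), repeat=3))

def prob(x1, x2, x3):
    """P(x1, x2, x3) = (1/Z) * psi_12(x1, x2) * psi_23(x2, x3)."""
    return unnormalized(x1, x2, x3) / Z

print(prob(0, 0, 0))
print(sum(prob(*x) for x in itertools.product((0, 1), repeat=3)))  # 1.0
```

Note that this brute-force Z sums over 2^n configurations, which is exactly why the structured inference algorithms mentioned earlier in the lecture matter once n grows.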
