Generative and Discriminative Approaches to Graphical Models
CMSC 35900 Topics in AI, Lecture 2
Yasemin Altun
January 22, 2007

Graphical Models

A framework for multivariate statistical models: dependency between variables is encoded in the graph structure. A graphical model is associated with a family of probability distributions, characterized by
• conditional independencies
• numerical representations
and it captures a joint probability distribution from a set of local pieces.

Two types:
• Bayesian Networks: graphical models with DAGs
• Markov Random Fields: graphical models with undirected graphs

Directed Graphical Models: Bayes(ian) Net(work)s

Consider a directed acyclic graph G(V, E) over n variables, where V is the set of nodes and E the set of directed edges, and G is acyclic. Each node i is associated with a random variable X_i with assignments x_i. Each node has a (possibly empty) set of parents π_i; v_i denotes the non-descendants of node i.

Each node maintains a function f_i(x_i; x_{π_i}) such that f_i > 0 and Σ_{x_i} f_i(x_i; x_{π_i}) = 1 for every value of x_{π_i}. Define the joint probability to be

    P(x_1, x_2, ..., x_n) = Π_i f_i(x_i; x_{π_i})

Even with no further restriction on the f_i, it is always true that f_i(x_i; x_{π_i}) = P(x_i | x_{π_i}), so we will just write

    P(x_1, x_2, ..., x_n) = Π_i P(x_i | x_{π_i})

This is a factorization of the joint in terms of local conditional probabilities, and it is exponential in the "fan-in" of each node instead of in the total number of variables n.

Local Markov property: a node is conditionally independent of its non-descendants given its parents,

    I_l(G) = { X_i ⊥ X_{v_i} | X_{π_i} }.

Example: consider the six-node network X_1, ..., X_6 with edges X_1 → X_2, X_1 → X_3, X_2 → X_4, X_3 → X_5, and X_2, X_5 → X_6. The joint probability is

    P(x_1, x_2, x_3, x_4, x_5, x_6) = P(x_1) P(x_2 | x_1) P(x_3 | x_1) P(x_4 | x_2) P(x_5 | x_3) P(x_6 | x_2, x_5)

[Figure: the six-node DAG with a binary conditional probability table attached to each node.]
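To make the factorization concrete, here is a small sketch (not part of the lecture) that computes the joint of the six-node example directly from its local conditionals. All CPT numbers are invented purely for illustration; the final loop checks by brute force that the product of normalized conditionals sums to 1 over all 2^6 configurations, so no global normalization is needed.

```python
# Six-node Bayes net: P(x1..x6) = P(x1) P(x2|x1) P(x3|x1) P(x4|x2) P(x5|x3) P(x6|x2,x5).
# All probability values below are hypothetical, chosen only for illustration.
from itertools import product

p_x1 = 0.6  # P(x1 = 1)
# CPTs: key = parent assignment, value = P(child = 1 | parents)
p_x2_given_x1 = {0: 0.3, 1: 0.8}
p_x3_given_x1 = {0: 0.5, 1: 0.1}
p_x4_given_x2 = {0: 0.7, 1: 0.2}
p_x5_given_x3 = {0: 0.4, 1: 0.9}
p_x6_given_x2_x5 = {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.95}

def bernoulli(p_one, value):
    """Return P(X = value) for a binary variable with P(X = 1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

def joint(x1, x2, x3, x4, x5, x6):
    """Joint probability via the Bayes-net factorization."""
    return (bernoulli(p_x1, x1)
            * bernoulli(p_x2_given_x1[x1], x2)
            * bernoulli(p_x3_given_x1[x1], x3)
            * bernoulli(p_x4_given_x2[x2], x4)
            * bernoulli(p_x5_given_x3[x3], x5)
            * bernoulli(p_x6_given_x2_x5[(x2, x5)], x6))

# Because each local factor is a normalized conditional, the product
# already sums to 1 over all 2^6 configurations.
total = sum(joint(*cfg) for cfg in product([0, 1], repeat=6))
print(total)  # ~1.0 up to floating-point error
```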
Conditional Independence in DAGs

If we order the nodes in a directed graphical model so that parents always come before their children in the ordering, then the graphical model implies the following about the distribution:

    { x_i ⊥ x_{π̃_i} | x_{π_i} }  for all i,

where x_{π̃_i} are the nodes coming before x_i that are not its parents. In other words, the DAG is telling us that each variable is conditionally independent of its non-descendants given its parents. Such an ordering is called a "topological" ordering.

Missing Edges

Key point about directed graphical models: missing edges imply conditional independence. Remember that by the chain rule we can always write the full joint as a product of conditionals, given an ordering:

    P(x_1, x_2, x_3, x_4, ...) = P(x_1) P(x_2 | x_1) P(x_3 | x_1, x_2) P(x_4 | x_1, x_2, x_3) ...

If the joint is represented by a DAGM, then some of the conditioning variables on the right-hand sides are missing. This is equivalent to enforcing conditional independence. Start with the "idiot's graph", in which each node has all previous nodes in the ordering as its parents, and remove edges to get your DAG. Removing an edge into node i eliminates an argument from the conditional probability factor p(x_i | x_1, x_2, ..., x_{i-1}).

Bayes Nets

Defn: P factorizes with respect to G if the joint probability is

    P(X_{1:n}) = Π_{i=1}^{n} P(X_i | X_{π_i}).

Thm: If I_l(G) ⊆ I(P), then P factorizes according to G.
Proof: Order the nodes topologically and use the chain rule:

    P(X_{1:n}) = P(X_1) P(X_2 | X_1) P(X_3 | X_1, X_2) ...
               = Π_{i=1}^{n} P(X_i | X_{1:i-1})
               = Π_{i=1}^{n} P(X_i | X_{π_i}, X_{π̃_i})
               = Π_{i=1}^{n} P(X_i | X_{π_i}),

where the last step uses the local Markov property, since the predecessors X_{π̃_i} that are not parents are non-descendants of X_i.

Global independencies I(G) can be inferred using the Bayes Ball algorithm.

Thm: If P factorizes with respect to G, then I(G) ⊆ I(P).
Thm: If (X ⊥ Y | Z) ∉ I(G), then there is some distribution P that factorizes with respect to G in which X is not independent of Y given Z.

Surprisingly, once you have specified the basic conditional independencies, there are other ones that follow from those. In general, it is a hard problem to say which extra CI statements follow from a basic set. However, in the case of DAGMs, we have an efficient way of generating all CI statements that must be true given the connectivity of the graph: this involves the idea of d-separation in a graph. Notice that for specific (numerical) choices of factors at the nodes there may be even more conditional independencies, but we are only concerned with statements that are always true of every member of the family of distributions, no matter what specific factors live at the nodes.

Explaining Away

Consider X → Y ← Z, so that P(x, y, z) = P(x) P(z) P(y | x, z). Q: when we condition on y, are x and z independent? x and z are marginally independent, but given y they are conditionally dependent. This important effect is called explaining away (Berkson's paradox). For example, flip two coins independently; let x = coin 1 and z = coin 2, and let y = 1 if the coins come up the same and y = 0 if different. x and z are independent, but if I tell you y, they become coupled!

Markov Random Fields (MRFs)

Now G(V, E) has V a set of nodes and E a set of undirected edges: again one node per random variable, with edges connecting pairs of nodes, but the edges are undirected.

Global Markov property: a node i is conditionally independent of its non-neighbors given its neighbors N_i.
Separation: X_A ⊥ X_C | X_B if every path from a node in X_A to a node in X_C includes at least one node in X_B.

A clique c ∈ C is a fully connected subset of nodes. The joint distribution factorizes over cliques:

    P(X_{1:N}) = (1/Z) Π_{c∈C} ψ_c(x_c),    Z = Σ_X Π_{c∈C} ψ_c(x_c),

where C is the set of maximal cliques, the ψ_c are potential (compatibility) functions (positive, but not necessarily probabilistic), and Z is the partition function, which ensures Σ_x P(x) = 1.

Undirected models can represent symmetric interactions that directed models cannot. They are also known as Markov Random Fields, Markov Networks, Boltzmann Machines, Spin Glasses, or Ising Models. Remember: the graph alone represents a family of joint distributions consistent with its CI assumptions, not any specific distribution.

Simple Graph Separation

In undirected models, simple graph separation (as opposed to d-separation) tells us about conditional independencies: x_A ⊥ x_C | x_B if every path between x_A and x_C is blocked by some node in x_B. "Markov Ball" algorithm: remove x_B and see if there is any path left from x_A to x_C.
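As an illustration of the undirected factorization and the global Markov property, the following sketch (my own, not from the slides) builds a three-node chain x1 - x2 - x3 with arbitrary positive pairwise potentials, computes the partition function Z by brute-force enumeration, and checks numerically that P(x1 | x2, x3) does not depend on x3, since node 2 separates nodes 1 and 3.

```python
# Chain MRF x1 - x2 - x3 with maximal cliques {1,2} and {2,3}.
# The potential values are arbitrary positive numbers chosen for illustration.
from itertools import product

psi_12 = {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 1.0, (1, 1): 3.0}
psi_23 = {(0, 0): 1.5, (0, 1): 2.0, (1, 0): 0.2, (1, 1): 1.0}

def unnormalized(x1, x2, x3):
    """Product of clique potentials (the factorization before dividing by Z)."""
    return psi_12[(x1, x2)] * psi_23[(x2, x3)]

# Partition function: sum of the unnormalized product over all configurations.
Z = sum(unnormalized(*cfg) for cfg in product([0, 1], repeat=3))

def p(x1, x2, x3):
    return unnormalized(x1, x2, x3) / Z

# Global Markov property: node 2 separates node 1 from node 3,
# so P(x1 | x2, x3) should be the same for x3 = 0 and x3 = 1.
for x2 in (0, 1):
    for x3 in (0, 1):
        num = p(1, x2, x3)
        den = sum(p(v, x2, x3) for v in (0, 1))
        print(f"P(x1=1 | x2={x2}, x3={x3}) = {num / den:.4f}")
```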
Hammersley-Clifford Theorem

(From CS 281a / Stat 241a, Lecture 4, September 13, Fall 2005.)

Writing each potential as ψ_C(x_C) = exp{-E_C(x_C)}, the function E_C is referred to as the energy function, from its connections to statistical mechanics in physics. We then have the nice interpretation that the most likely configurations of our system have lower energies.

As in the directed case, we have two ways of characterizing the set of probability distributions represented by the graph:

1. Markov: X_A ⊥ X_B | X_S whenever S separates A from B.
2. Factorization (Gibbs): p(x) = (1/Z) Π_{C∈C} ψ_C(x_C), where Z = Σ_{x_1,...,x_n} Π_{C∈C} ψ_C(x_C).

Theorem 4.1 (Hammersley-Clifford, 1973). For strictly positive distributions, the two characterizations are equivalent.

Proof: For Markov ⇒ Factorization, see §4.5.3 of chapter 16 for a proof using the Möbius Inversion Formula.

Let us show Factorization ⇒ Markov. Let S separate A and B. We want to show that p(x_A | x_B, x_S) = p(x_A | x_S).

Claim: we can write

    p(x) = (1/Z) Π_{C ⊆ A∪S} ψ_C(x_C) Π_{C ⊆ B∪S, C ⊄ S} ψ_C(x_C),

because S separates A and B, so every clique is contained in A ∪ S or in B ∪ S, and all the cliques are included in this product (the condition C ⊄ S in the second product avoids counting cliques that lie entirely inside S twice). The first product depends only on (x_A, x_S) and the second only on (x_B, x_S), so the joint splits into a function of (x_A, x_S) times a function of (x_B, x_S), which gives p(x_A | x_B, x_S) = p(x_A | x_S).
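To illustrate the energy-function interpretation, here is a short sketch (with assumed, made-up energy values, not from the notes) that defines clique energies E_C on the same three-node chain, forms the Gibbs distribution p(x) = exp(-Σ_C E_C(x_C)) / Z by brute force, and confirms that ordering configurations by increasing energy matches ordering them by decreasing probability.

```python
# Energy-function view: psi_C(x_C) = exp(-E_C(x_C)), so
# p(x) = exp(-E(x)) / Z with total energy E(x) = sum_C E_C(x_C).
import math
from itertools import product

# Pairwise energies on the chain cliques {1,2} and {2,3}; values are arbitrary.
E_12 = {(0, 0): 0.0, (0, 1): 1.2, (1, 0): 0.7, (1, 1): 0.1}
E_23 = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 1.5, (1, 1): 0.3}

def total_energy(x1, x2, x3):
    return E_12[(x1, x2)] + E_23[(x2, x3)]

configs = list(product([0, 1], repeat=3))
Z = sum(math.exp(-total_energy(*c)) for c in configs)

def prob(cfg):
    return math.exp(-total_energy(*cfg)) / Z

# Lower energy <=> higher probability: the two orderings coincide.
by_energy = sorted(configs, key=lambda c: total_energy(*c))
by_prob = sorted(configs, key=prob, reverse=True)
print(by_energy == by_prob)  # True
for c in by_energy:
    print(c, f"E={total_energy(*c):.2f}", f"p={prob(c):.3f}")
```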