
Probabilistic Graphical Models

David Sontag

New York University

Lecture 3, February 14, 2013

Undirected graphical models

Reminder of lecture 2: An alternative representation for joint distributions is as an undirected graph (also known as a Markov random field). As in BNs, we have one node for each random variable. Rather than CPDs, we specify (non-negative) potential functions over sets of variables associated with the cliques C of the graph:

p(x_1, ..., x_n) = (1/Z) ∏_{c ∈ C} φ_c(x_c)

Z is the partition function and normalizes the distribution:

Z = Σ_{x̂_1, ..., x̂_n} ∏_{c ∈ C} φ_c(x̂_c)

Undirected graphical models

p(x_1, ..., x_n) = (1/Z) ∏_{c ∈ C} φ_c(x_c),    Z = Σ_{x̂_1, ..., x̂_n} ∏_{c ∈ C} φ_c(x̂_c)

Simple example (potential function on each edge encourages the variables to take the same value):

φ_{A,B}(a, b):          φ_{B,C}(b, c):          φ_{A,C}(a, c):
         b=0  b=1                c=0  c=1                c=0  c=1
  a=0     10    1         b=0     10    1         a=0     10    1
  a=1      1   10         b=1      1   10         a=1      1   10

(The graph is a triangle over A, B, C, with one potential per edge.)

p(a, b, c) = (1/Z) φ_{A,B}(a, b) · φ_{B,C}(b, c) · φ_{A,C}(a, c),

where

Z = Σ_{â, b̂, ĉ ∈ {0,1}^3} φ_{A,B}(â, b̂) · φ_{B,C}(b̂, ĉ) · φ_{A,C}(â, ĉ) = 2 · 1000 + 6 · 10 = 2060.
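To make the arithmetic concrete, here is a minimal brute-force sketch (not from the slides) that recomputes Z and one probability for this example:

    import itertools

    def phi(u, v):
        """Pairwise potential that rewards agreement: 10 if equal, 1 otherwise."""
        return 10 if u == v else 1

    def unnormalized(a, b, c):
        return phi(a, b) * phi(b, c) * phi(a, c)

    # Partition function: sum the unnormalized measure over all 2^3 assignments.
    Z = sum(unnormalized(a, b, c) for a, b, c in itertools.product([0, 1], repeat=3))
    print(Z)                          # 2060 = 2*1000 + 6*10
    print(unnormalized(0, 0, 0) / Z)  # p(0,0,0) = 1000/2060 ≈ 0.485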

Example: Ising model

Invented by the physicist Wilhelm Lenz (1920), who gave it as a problem to his student Ernst Ising.
A mathematical model of ferromagnetism in statistical mechanics.
The spin of an atom is biased by the spins of atoms nearby on the material:

[Figure: a grid of atomic spins; one marker denotes spin = +1, the other spin = -1]

Each atom X_i ∈ {−1, +1}, whose value is the direction of the atom spin.
If a spin at position i is +1, what is the probability that the spin at position j is also +1?
Are there phase transitions where spins go from "disorder" to "order"?

Example: Ising model

Each atom X_i ∈ {−1, +1}, whose value is the direction of the atom spin.
The spin of an atom is biased by the spins of atoms nearby on the material:

[Figure: the same grid of spins as above (+1 / −1)]

p(x_1, ..., x_n) = (1/Z) exp( Σ_{i < j} w_{i,j} x_i x_j − Σ_i u_i x_i )

When w_{i,j} > 0, nearby atoms are encouraged to have the same spin (called ferromagnetic), whereas w_{i,j} < 0 encourages X_i ≠ X_j.
Node potentials exp(−u_i x_i) encode the bias of the individual atoms.
Scaling the parameters makes the distribution more or less spiky.
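As a rough illustration (not part of the lecture), the following sketch evaluates the Ising model on a tiny grid and normalizes by brute force; the grid size and the uniform coupling w and bias u are arbitrary choices:

    import itertools
    import math

    def log_potential(x, w=0.5, u=0.1, size=3):
        """Unnormalized log-probability of a spin configuration x
        (a dict (i, j) -> ±1) on a size x size grid with coupling w and bias u."""
        score = 0.0
        for i, j in itertools.product(range(size), repeat=2):
            if i + 1 < size:
                score += w * x[i, j] * x[i + 1, j]   # vertical neighbor
            if j + 1 < size:
                score += w * x[i, j] * x[i, j + 1]   # horizontal neighbor
            score -= u * x[i, j]                     # node bias
        return score

    size = 3
    cells = list(itertools.product(range(size), repeat=2))
    # Brute-force partition function over all 2^(size*size) configurations.
    Z = sum(math.exp(log_potential(dict(zip(cells, spins))))
            for spins in itertools.product([-1, +1], repeat=len(cells)))

    all_up = {c: +1 for c in cells}
    print(math.exp(log_potential(all_up)) / Z)  # probability of the all-up configuration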

Today's lecture

Markov random fields
1 Factor graphs
2 Bayesian networks ⇒ Markov random fields (moralization)
3 Hammersley-Clifford theorem (conditional independence ⇒ joint distribution factorization)

Conditional models
3 Discriminative versus generative classifiers
4 Conditional random fields

Higher-order potentials

The examples so far have all been pairwise MRFs, involving only node potentials φ_i(X_i) and pairwise potentials φ_{i,j}(X_i, X_j). Often we need higher-order potentials, e.g.

φ(x, y, z) = 1[x + y + z ≥ 1],

where X, Y, Z are binary, enforcing that at least one of the variables takes the value 1.

Although Markov networks are useful for understanding independencies, they hide much of the distribution's structure:

[Figure: a Markov network over four variables A, B, C, D]

Does this have pairwise potentials, or one potential for all 4 variables?

Factor graphs

G does not reveal the structure of the distribution: maximum cliques vs. subsets of them.
A factor graph is a bipartite undirected graph with variable nodes and factor nodes; edges are only between the variable nodes and the factor nodes.
Each factor node is associated with a single potential, whose scope is the set of variables that are its neighbors in the factor graph.

[Figure: two factor graphs over A, B, C, D – one with a single factor over all four variables, one with pairwise factors – and the Markov network they both induce]

The distribution is the same as in the MRF – this is just a different representation of its structure.
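A small sketch (not from the slides) of the point above: a factor graph with one factor over all four variables and a factor graph with a factor for every pair induce the same Markov network (the all-pairs choice for the pairwise version is an assumption made for illustration):

    from itertools import combinations

    def induced_markov_network(factor_scopes):
        """Edges of the Markov network induced by a factor graph:
        connect every pair of variables that share a factor."""
        edges = set()
        for scope in factor_scopes:
            edges.update(combinations(sorted(scope), 2))
        return edges

    one_big_factor = [("A", "B", "C", "D")]
    all_pairwise = list(combinations("ABCD", 2))
    print(induced_markov_network(one_big_factor) == induced_markov_network(all_pairwise))  # True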

Example: Low-density parity-check codes

Error-correcting codes for transmitting a message over a noisy channel (invented by Gallager in the 1960s, then re-discovered in 1996).

[Figure: factor graph with parity-check factors fA, fB, fC over codeword bits Y1–Y6, and factors f1–f6 connecting each Yi to its observed bit Xi]

Each of the top-row factors enforces that its variables have even parity: f_A(Y_1, Y_2, Y_3, Y_4) = 1 if Y_1 ⊗ Y_2 ⊗ Y_3 ⊗ Y_4 = 0, and 0 otherwise.

Thus, the only assignments to Y with non-zero probability are the following (called codewords) – 3 bits encoded using 6 bits:

000000, 011001, 110010, 101011, 111100, 100101, 001110, 010111

f_i(Y_i, X_i) = p(X_i | Y_i), the likelihood of a bit flip according to the noise model.
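The slide only spells out the scope of f_A; in the sketch below the scopes of f_B and f_C are an assumption (one choice of parity checks consistent with the listed codewords), and brute-force enumeration then recovers exactly those eight codewords:

    from itertools import product

    # Parity checks as index sets over (Y1, ..., Y6); f_A is from the slide,
    # f_B and f_C are assumed scopes chosen to match the codewords above.
    checks = [(0, 1, 2, 3),   # f_A: Y1 xor Y2 xor Y3 xor Y4 = 0
              (0, 3, 4),      # f_B (assumed): Y1 xor Y4 xor Y5 = 0
              (2, 3, 5)]      # f_C (assumed): Y3 xor Y4 xor Y6 = 0

    def is_codeword(y):
        return all(sum(y[i] for i in c) % 2 == 0 for c in checks)

    codewords = ["".join(map(str, y)) for y in product([0, 1], repeat=6) if is_codeword(y)]
    print(codewords)
    # ['000000', '001110', '010111', '011001', '100101', '101011', '110010', '111100']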

Example: Low-density parity-check codes

[Figure: the same factor graph as above – parity factors fA, fB, fC over Y1–Y6 and noise factors f1–f6 connecting each Yi to Xi]

The decoding problem for LDPCs is to find argmax_y p(y | x). This is called the maximum a posteriori (MAP) assignment.

Since Z and p(x) are constants with respect to the choice of y, we can equivalently solve (taking the log of p(y, x)):

argmax_y Σ_{c ∈ C} θ_c(x_c),  where θ_c(x_c) = log φ_c(x_c)

This is a discrete optimization problem!
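A minimal brute-force decoder for this example (not from the slides): it enumerates the eight codewords listed above and maximizes Σ_i log p(x_i | y_i) under an assumed independent bit-flip noise model with a made-up flip probability:

    import math

    CODEWORDS = ["000000", "011001", "110010", "101011",
                 "111100", "100101", "001110", "010111"]

    def log_likelihood(y, x, eps=0.1):
        """log p(x | y) under an independent bit-flip noise model with
        flip probability eps (eps is an illustrative value, not from the slides)."""
        return sum(math.log(1 - eps) if yi == xi else math.log(eps)
                   for yi, xi in zip(y, x))

    def map_decode(x):
        # Brute force over valid codewords: argmax_y log p(y, x) = argmax_y log p(x | y),
        # since p(y) is uniform over the codewords.
        return max(CODEWORDS, key=lambda y: log_likelihood(y, x))

    print(map_decode("011101"))  # one bit flipped from 011001 -> decodes back to '011001'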

Converting BNs to Markov networks

What is the equivalent Markov network for a hidden Markov model?

[Figure: hidden Markov model with hidden chain Y1–Y6 and observations X1–X6]

Many inference algorithms are more conveniently given for undirected models – this shows how they can be applied to Bayesian networks

Moralization of Bayesian networks

Procedure for converting a Bayesian network into a Markov network.
The moral graph M[G] of a BN G = (V, E) is an undirected graph over V that contains an undirected edge between X_i and X_j if
1 there is a directed edge between them (in either direction), or
2 X_i and X_j are both parents of the same node.

[Figure: a BN over A, B, C, D and its moralized Markov network]

(The term historically arose from the idea of "marrying the parents" of the node.)
The addition of the moralizing edges leads to the loss of some independence information, e.g., A → C ← B, where A ⊥ B is lost.
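A minimal sketch (not from the slides) of the moralization procedure, for a DAG given as parent sets; the example DAG is a hypothetical structure matching the A → C ← B discussion:

    from itertools import combinations

    def moralize(parents):
        """Moral graph of a DAG given as {node: set of parents}.
        Returns the set of undirected edges (as frozensets)."""
        edges = set()
        for child, pa in parents.items():
            for p in pa:                              # 1. keep every directed edge, now undirected
                edges.add(frozenset((p, child)))
            for p, q in combinations(sorted(pa), 2):  # 2. "marry" all pairs of co-parents
                edges.add(frozenset((p, q)))
        return edges

    dag = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"C"}}  # A -> C <- B, C -> D (hypothetical)
    print(sorted(tuple(sorted(e)) for e in moralize(dag)))
    # [('A', 'B'), ('A', 'C'), ('B', 'C'), ('C', 'D')]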

Converting BNs to Markov networks

1 Moralize the directed graph to obtain the undirected graphical model:

[Figure: the same moralization example as above – A, B, C, D before and after moralization]

2 Introduce one potential function for each CPD:

φ_i(x_i, x_{pa(i)}) = p(x_i | x_{pa(i)})

So, converting a hidden Markov model to a Markov network is simple:

For variables having > 1 parent, factor graph notation is useful

Factorization implies conditional independencies

p(x) is a Gibbs distribution over G if it can be written as

p(x_1, ..., x_n) = (1/Z) ∏_{c ∈ C} φ_c(x_c),

where the variables in each potential c ∈ C form a clique in G.

Recall that conditional independence is given by graph separation:

[Figure: X_B separates X_A from X_C in the graph]

Theorem (soundness of separation): If p(x) is a Gibbs distribution for G, then G is an I-map for p(x), i.e. I(G) ⊆ I(p).

Proof: Suppose B separates A from C. Then no clique contains variables from both A and C, so every potential can be assigned to one of two groups and we can write

p(X_A, X_B, X_C) = (1/Z) f(X_A, X_B) g(X_B, X_C).

Conditional independencies imply factorization

Theorem (soundness of separation): If p(x) is a Gibbs distribution for G, then G is an I-map for p(x), i.e. I(G) ⊆ I(p).

What about the converse? We need one more assumption: a distribution is positive if p(x) > 0 for all x.

Theorem (Hammersley-Clifford, 1971): If p(x) is a positive distribution and G is an I-map for p(x), then p(x) is a Gibbs distribution that factorizes over G.

Proof is in the book (as is a counter-example for when p(x) is not positive).

This is important for learning:
Prior knowledge is often in the form of conditional independencies (i.e., a graph structure G).
Hammersley-Clifford tells us that it suffices to search over Gibbs distributions for G – this allows us to parameterize the distribution.

Today's lecture

Markov random fields
1 Factor graphs
2 Bayesian networks ⇒ Markov random fields (moralization)
3 Hammersley-Clifford theorem (conditional independence ⇒ joint distribution factorization)

Conditional models
3 Discriminative versus generative classifiers
4 Conditional random fields

Discriminative versus generative classifiers

There is often significant flexibility in choosing the structure and parameterization of a graphical model

Generative: Y → X        Discriminative: X → Y

It is important to understand the trade-offs

In the next few slides, we will study this question in the context of e-mail classification

From lecture 1... naive Bayes for single label prediction

Classify e-mails as spam (Y = 1) or not spam (Y = 0).
Let 1 : n index the words in our vocabulary (e.g., English).
X_i = 1 if word i appears in an e-mail, and 0 otherwise.
E-mails are drawn according to some distribution p(Y, X_1, ..., X_n).
Words are conditionally independent given Y:

[Figure: naive Bayes model – label Y with feature children X_1, X_2, X_3, ..., X_n]

Prediction given by:

p(Y = 1 | x_1, ..., x_n) = [ p(Y = 1) ∏_{i=1}^n p(x_i | Y = 1) ] / [ Σ_{y ∈ {0,1}} p(Y = y) ∏_{i=1}^n p(x_i | Y = y) ]
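A minimal sketch of this prediction rule (not from the slides), with made-up parameters for a two-word vocabulary:

    def predict_spam(x, prior, cond):
        """Naive Bayes p(Y = 1 | x) for binary features x (a dict word -> 0/1).
        prior = p(Y = 1); cond[y][w] = p(X_w = 1 | Y = y)."""
        def joint(y):
            p = prior if y == 1 else 1 - prior
            for w, xw in x.items():
                p *= cond[y][w] if xw == 1 else 1 - cond[y][w]
            return p
        return joint(1) / (joint(0) + joint(1))

    # Hypothetical parameters, chosen only for illustration.
    prior = 0.3
    cond = {1: {"bank": 0.6, "meeting": 0.1},
            0: {"bank": 0.1, "meeting": 0.4}}
    print(predict_spam({"bank": 1, "meeting": 0}, prior, cond))  # ≈ 0.79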

Discriminative versus generative models

Recall that these are equivalent models of p(Y, X):

Generative: Y → X        Discriminative: X → Y

However, suppose all we need for prediction is p(Y | X).
In the left model, we need to estimate both p(Y) and p(X | Y).
In the right model, it suffices to estimate just the conditional distribution p(Y | X).
We never need to estimate p(X)! (We would need p(X) if X were only partially observed.)
Called a discriminative model because it is only useful for discriminating Y's label.

Discriminative versus generative models

Let's go a bit deeper to understand the trade-offs inherent in each approach. Since X is a random vector, for Y → X to be equivalent to X → Y, we must have:

[Figure: generative model – Y with children X_1, X_2, X_3, ..., X_n; discriminative model – X_1, X_2, X_3, ..., X_n with child Y]

We must make the following choices:
1 In the generative model, how do we parameterize p(X_i | X_{pa(i)}, Y)?
2 In the discriminative model, how do we parameterize p(Y | X)?

Discriminative versus generative models

We must make the following choices:
1 In the generative model, how do we parameterize p(X_i | X_{pa(i)}, Y)?
2 In the discriminative model, how do we parameterize p(Y | X)?

1 For the generative model, assume that X_i ⊥ X_{−i} | Y (naive Bayes)
2 For the discriminative model, assume a sigmoid applied to a linear function of the data:

p(Y = 1 | x; α) = 1 / (1 + e^{−α_0 − Σ_{i=1}^n α_i x_i})

This is called logistic regression. (To simplify the story, we assume X_i ∈ {0, 1}.) In general, features can be discrete or continuous.

Naive Bayes

1 For the generative model, assume that X_i ⊥ X_{−i} | Y (naive Bayes)

[Figure: the naive Bayes structure over Y and X_1, X_2, X_3, ..., X_n]

Logistic regression

2 For the discriminative model, assume that

p(Y = 1 | x; α) = e^{α_0 + Σ_{i=1}^n α_i x_i} / (1 + e^{α_0 + Σ_{i=1}^n α_i x_i}) = 1 / (1 + e^{−α_0 − Σ_{i=1}^n α_i x_i})

Let z(α, x) = α_0 + Σ_{i=1}^n α_i x_i. Then, p(Y = 1 | x; α) = f(z(α, x)), where f(z) = 1/(1 + e^{−z}) is called the logistic function:

[Figure: plot of the logistic function 1/(1 + e^{−z}); the graphical model (X_1, ..., X_n → Y) is the same as before]
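A minimal sketch of this parameterization (not from the slides), with made-up weights:

    import math

    def logistic(z):
        """The logistic (sigmoid) function f(z) = 1 / (1 + e^{-z})."""
        return 1.0 / (1.0 + math.exp(-z))

    def p_y1_given_x(x, alpha0, alpha):
        """p(Y = 1 | x; alpha) = f(alpha0 + sum_i alpha_i * x_i)."""
        z = alpha0 + sum(a * xi for a, xi in zip(alpha, x))
        return logistic(z)

    # Hypothetical weights: alpha0 = -1.0 and two features with weights 2.0 and -0.5.
    print(p_y1_given_x([1, 0], alpha0=-1.0, alpha=[2.0, -0.5]))  # logistic(1.0) ≈ 0.73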

Discriminative versus generative models

1 For the generative model, assume that X_i ⊥ X_{−i} | Y (naive Bayes)
2 For the discriminative model, assume that

p(Y = 1 | x; α) = e^{α_0 + Σ_{i=1}^n α_i x_i} / (1 + e^{α_0 + Σ_{i=1}^n α_i x_i}) = 1 / (1 + e^{−α_0 − Σ_{i=1}^n α_i x_i})

In problem set 1, you showed assumption 1 ⇒ assumption 2.
Thus, every conditional distribution that can be represented using naive Bayes can also be represented using the logistic model.

What can we conclude from this? With a large amount of training data, logistic regression will perform at least as well as naive Bayes!

Discriminative models are powerful

[Figure: generative (naive Bayes) model – Y with children X_1, ..., X_n; discriminative (logistic regression) model – X_1, ..., X_n with child Y]

The logistic model does not assume X_i ⊥ X_{−i} | Y, unlike naive Bayes.
This can make a big difference in many applications.

For example, in spam classification, let X1 = 1[“bank” in e-mail] and X2 = 1[“account” in e-mail]

Regardless of whether the e-mail is spam, these always appear together, i.e. X_1 = X_2.

Learning in naive Bayes results in p(X_1 | Y) = p(X_2 | Y). Thus, naive Bayes double counts the evidence.

Learning with logistic regression sets α_i = 0 for one of the words, in effect ignoring it (there are other equivalent solutions).

Generative models are still very useful

1 Using a conditional model is only possible when X is always observed

When some X_i variables are unobserved, the generative model allows us to compute p(Y | X_e) (the conditional given the observed variables X_e) by marginalizing over the unseen variables.

2 Estimating the generative model using maximum likelihood is more efficient (statistically) than discriminative training [Ng & Jordan, 2002]
Amount of training data needed to get close to the infinite-data solution:
Naive Bayes needs O(log n) samples.
Logistic regression needs O(n) samples.
Naive Bayes converges more quickly to its (perhaps less helpful) asymptotic estimates.

[Figure: some experiments from UCI data sets – learning curves comparing naive Bayes and logistic regression (© Carlos Guestrin 2005–2009)]

Conditional random fields (CRFs)

Conditional random fields are undirected graphical models of a conditional distribution p(Y | X).
Y is a set of target variables.
X is a set of observed variables.
We typically show the graphical model using just the Y variables; potentials are a function of X and Y.

Formal definition

A CRF is a Markov network on variables X ∪ Y, which specifies the conditional distribution

P(y | x) = (1/Z(x)) ∏_{c ∈ C} φ_c(x_c, y_c)

with partition function

Z(x) = Σ_{ŷ} ∏_{c ∈ C} φ_c(x_c, ŷ_c).

As before, two variables in the graph are connected with an undirected edge if they appear together in the scope of some factor.
The only difference with a standard Markov network is the normalization term – before we marginalized over both X and Y, now we marginalize only over Y.
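A minimal sketch (not from the slides) of this definition: Z(x) sums over assignments to y only, with x held fixed. The toy factors and domains below are made up for illustration:

    from itertools import product

    def crf_conditional(y, x, factors, y_domain=(0, 1)):
        """p(y | x) for a CRF given as a list of factor functions phi(x, y) > 0.
        For simplicity each factor sees the full assignment in this sketch."""
        def score(y_assign):
            s = 1.0
            for phi in factors:
                s *= phi(x, y_assign)
            return s
        # Partition function Z(x): sum over assignments to y only, x held fixed.
        Z = sum(score(y_hat) for y_hat in product(y_domain, repeat=len(y)))
        return score(tuple(y)) / Z

    # Hypothetical chain CRF over two target variables (y1, y2) and one observation x.
    factors = [lambda x, y: 2.0 if y[0] == y[1] else 1.0,   # pairwise agreement potential
               lambda x, y: 3.0 if y[0] == x else 1.0]      # y1 prefers to match x
    print(crf_conditional((1, 1), x=1, factors=factors))    # 6 / 12 = 0.5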

CRFs in computer vision

Undirected graphical models are very popular in applications such as computer vision: segmentation, stereo, de-noising.
Grids are particularly popular, e.g., pixels in an image with 4-connectivity.

[Figure: stereo example – input: two images; output: disparity]

Not encoding p(X) is the main strength of this technique: e.g., if X is the image, then we would need to encode the distribution of natural images!
We can encode a rich set of features, without worrying about their distribution.

Parameterization of CRFs

Factors may depend on a large number of variables. We typically parameterize each factor as a log-linear function,

φ_c(x_c, y_c) = exp{ w · f_c(x_c, y_c) }

f_c(x_c, y_c) is a feature vector.
w is a weight vector, which is typically learned – we will discuss this extensively in later lectures.
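A minimal sketch (not from the slides) of a single log-linear factor; the feature function and weights are hypothetical:

    import math

    def log_linear_factor(w, f, x_c, y_c):
        """phi_c(x_c, y_c) = exp(w . f_c(x_c, y_c)) for a feature function f."""
        return math.exp(sum(wi * fi for wi, fi in zip(w, f(x_c, y_c))))

    # Hypothetical features for an NER-style factor over one word x_c and its label y_c.
    def f(x_c, y_c):
        return [1.0 if x_c[0].isupper() and y_c != "O" else 0.0,  # capitalized word with an entity label
                1.0 if y_c == "O" else 0.0]                       # bias for the non-entity label

    w = [1.5, 0.5]  # made-up weights
    print(log_linear_factor(w, f, "London", "B-location"))  # exp(1.5) ≈ 4.48
    print(log_linear_factor(w, f, "London", "O"))           # exp(0.5) ≈ 1.65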

NLP example: named-entity recognition

Given a sentence, determine the people and organizations involved and the relevant locations:
"Mrs. Green spoke today in New York. Green chairs the finance committee."
Entities sometimes span multiple words. The entity of a word is not obvious without considering its context.

The CRF has one variable Y_i for each word, which encodes the possible labels of that word.
The labels are, for example, "B-person, I-person, B-location, I-location, B-organization, I-organization".
Having beginning (B) and within (I) labels allows the model to segment adjacent entities.
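A small sketch (not from the slides) of how B/I labels are laid out over the example sentence; the 'O' (outside) tag, the span boundaries, and the helper function are assumptions made for illustration:

    def bio_tags(tokens, entities):
        """Convert entity spans {(start, end): type} into per-token B-/I- labels,
        with 'O' for tokens outside any entity."""
        tags = ["O"] * len(tokens)
        for (start, end), etype in entities.items():
            tags[start] = "B-" + etype
            for t in range(start + 1, end):
                tags[t] = "I-" + etype
        return tags

    tokens = ["Mrs.", "Green", "spoke", "today", "in", "New", "York", "."]
    entities = {(0, 2): "person", (5, 7): "location"}  # "Mrs. Green", "New York" (assumed spans)
    print(list(zip(tokens, bio_tags(tokens, entities))))
    # [('Mrs.', 'B-person'), ('Green', 'I-person'), ('spoke', 'O'), ..., ('New', 'B-location'), ('York', 'I-location'), ('.', 'O')]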

NLP example: named-entity recognition

The graphical model looks like the following (called a skip-chain CRF):

[Figure: skip-chain CRF – a linear chain over the label variables Y_1, ..., Y_T with additional long-range edges between positions whose words are identical]

There are three types of potentials:
1 φ(Y_t, Y_{t+1}) represents dependencies between neighboring target variables [analogous to the transition distribution in an HMM]
2 φ(Y_t, Y_{t'}) for all pairs t, t' such that x_t = x_{t'}, because if a word appears twice, it is likely to be the same entity
3 φ(Y_t, X_1, ..., X_T) for dependencies between an entity and the word sequence [e.g., may have features taking capitalization into consideration]

Notice that the graph structure changes depending on the sentence!
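A small sketch (not from the slides) of how the skip edges in potential type 2 could be constructed from a tokenized sentence; real implementations typically restrict which repeated tokens get skip edges, so matching every repeated token (including punctuation) is a simplification:

    def skip_chain_edges(tokens):
        """Pairwise potential scopes for a skip-chain CRF over labels Y_1, ..., Y_T:
        chain edges (t, t+1) plus skip edges (t, t') whenever x_t == x_t'."""
        T = len(tokens)
        chain = [(t, t + 1) for t in range(T - 1)]
        skips = [(t, u) for t in range(T) for u in range(t + 1, T)
                 if tokens[t] == tokens[u] and u != t + 1]  # skip pairs already adjacent in the chain
        return chain, skips

    tokens = "Mrs. Green spoke today in New York . Green chairs the finance committee .".split()
    chain, skips = skip_chain_edges(tokens)
    print(skips)  # [(1, 8), (7, 13)] – the two 'Green' tokens and the two '.' tokens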