10-701/15-781, Spring 2008

Graphical Models

Eric Xing

Lecture 18, March 26, 2008

[Figure: the "Asia" Bayesian network: Visit to Asia (X1), Smoking (X2), Tuberculosis (X3), Lung Cancer (X4), Bronchitis (X5), Tuberculosis or Cancer (X6), X-Ray Result (X7), Dyspnea (X8)]

Reading: Chap. 8, C.B book

What is a graphical model? --- example from medical diagnostics

• A possible world for a patient with lung problems:

[Figure: the Asia network again: Visit to Asia (X1), Smoking (X2), Tuberculosis (X3), Lung Cancer (X4), Bronchitis (X5), Tuberculosis or Cancer (X6), X-Ray Result (X7), Dyspnea (X8)]

Recap of Basic Prob. Concepts

• Representation: what is the joint dist. on multiple variables?

  P(X1, X2, X3, X4, X5, X6, X7, X8)

• How many state configurations in total? --- 2^8

• Do they all need to be represented explicitly?

• Do we get any scientific/medical insight?

• Learning: where do we get all these probabilities?

  - Maximum-likelihood estimation? But how much data do we need?

  - Where do we put domain knowledge in terms of plausible relationships between variables, and plausible values of the probabilities?

• Inference: if not all variables are observable, how to compute the conditional distribution of latent variables given evidence?

Dependencies among variables

[Figure: the Asia network grouped by variable type: Patient Information (Visit to Asia X1, Smoking X2), Medical Difficulties (Tuberculosis X3, Lung Cancer X4, Bronchitis X5, Tuberculosis or Cancer X6), Diagnostic Tests (X-Ray Result X7, Dyspnea X8)]

Probabilistic Graphical Models

• Represent dependency structure with a graph
  - Node <-> random variable
  - Edges encode dependencies
  - Absence of edge -> conditional independence
  - Directed and undirected versions

• Why is this useful?
  - A language for communication
  - A language for computation
  - A language for development

• Origins:
  - Wright 1920's
  - Independently developed by Spiegelhalter and Lauritzen in statistics and Pearl in computer science in the late 1980's

Probabilistic Graphical Models, con'd

• If the Xi's are conditionally independent (as described by a PGM), the joint can be factored into a product of simpler terms, e.g.,

[Figure: the Asia network]

  P(X1, X2, X3, X4, X5, X6, X7, X8)
  = P(X1) P(X2) P(X3 | X1) P(X4 | X2) P(X5 | X2) P(X6 | X3, X4) P(X7 | X6) P(X8 | X5, X6)

• Why may we favor a PGM?
  - Representation cost: how many probability statements are needed? 2+2+4+4+4+8+4+8 = 36, an 8-fold reduction from 2^8 = 256!
  - Algorithms for systematic and efficient inference/learning computation
    • Exploring the graph structure and probabilistic (e.g., Bayesian, Markovian) semantics
  - Incorporation of domain knowledge and causal (logical) structures
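The counting above can be checked mechanically. Below is a minimal sketch (not from the lecture; the variable names are assumptions) that tallies the CPT entries implied by the factorization versus the full joint table, for binary variables.

```python
# Parent sets read off the factorization on this slide; all variables binary.
parents = {
    "X1": [], "X2": [],
    "X3": ["X1"], "X4": ["X2"], "X5": ["X2"],
    "X6": ["X3", "X4"], "X7": ["X6"], "X8": ["X5", "X6"],
}

# Each CPT P(Xi | parents) has 2^(|parents| + 1) entries for binary variables.
factored = sum(2 ** (len(pa) + 1) for pa in parents.values())
full_joint = 2 ** len(parents)

print(factored)    # 2+2+4+4+4+8+4+8 = 36
print(full_joint)  # 2^8 = 256
```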


Two types of GMs

• Directed edges give causality relationships (Bayesian Network or Directed Graphical Model):

• Undirected edges simply give (physical or symmetric) correlations between variables (Markov Random Field or Undirected Graphical Model):

Bayesian Network: Factorization Theorem

[Figure: the Asia network]

P(X1, X2, X3, X4, X5, X6, X7, X8)
= P(X1) P(X2) P(X3 | X1) P(X4 | X2) P(X5 | X2) P(X6 | X3, X4) P(X7 | X6) P(X8 | X5, X6)

• Theorem: Given a DAG, the most general form of the probability distribution that is consistent with the graph factors according to "node given its parents":

  P(X) = ∏_{i=1}^{d} P(Xi | X_πi)

  where πi is the set of parents of Xi, and d is the number of nodes (variables) in the graph.

Bayesian Network: Conditional Independence Semantics

Structure: DAG

[Figure: a node X with its Markov blanket: parents (Y1, Y2), children, and children's co-parents; ancestors and descendants lie outside the blanket]

• Meaning: a node is conditionally independent of every other node in the network outside its Markov blanket

• Local conditional distributions (CPDs) and the DAG completely determine the joint dist.

• Gives causality relationships, and facilitates a generative process

Local Structures & Independencies

• Common parent (A ← B → C)
  - Fixing B decouples A and C: "given the level of gene B, the levels of A and C are independent"

• Cascade (A → B → C)
  - Knowing B decouples A and C: "given the level of gene B, the level of gene A provides no extra prediction value for the level of gene C"

• V-structure (A → C ← B)
  - Knowing C couples A and B, because A can "explain away" B w.r.t. C: "if A correlates to C, then the chance for B to also correlate to C will decrease" (a small numeric sketch follows this list)

• The language is compact, the concepts are rich!
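To make "explaining away" concrete, here is a small numeric sketch (illustrative numbers, not from the lecture) for the v-structure A → C ← B, computed by brute-force enumeration:

```python
# A and B are independent causes of C; observing C couples them.
from itertools import product

pA = {1: 0.1, 0: 0.9}                     # P(A)
pB = {1: 0.1, 0: 0.9}                     # P(B)
pC = {(a, b): 0.99 if (a or b) else 0.01  # P(C=1 | A=a, B=b), noisy-OR-like
      for a, b in product((0, 1), repeat=2)}

def joint(a, b, c):
    pc1 = pC[(a, b)]
    return pA[a] * pB[b] * (pc1 if c == 1 else 1 - pc1)

def cond(query, evidence):
    """P(A=query | evidence), with evidence a dict over {'B', 'C'}."""
    num = den = 0.0
    for a, b, c in product((0, 1), repeat=3):
        if any({'A': a, 'B': b, 'C': c}[k] != v for k, v in evidence.items()):
            continue
        p = joint(a, b, c)
        den += p
        if a == query:
            num += p
    return num / den

print(cond(1, {'C': 1}))          # P(A=1 | C=1) ~ 0.50
print(cond(1, {'C': 1, 'B': 1}))  # P(A=1 | C=1, B=1) ~ 0.10: B "explains away" C
```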


A simple justification

[Figure: nodes A and C with a common parent B]

Graph separation criterion

• D-separation criterion for Bayesian networks (D for Directed edges):

  Definition: variables x and y are D-separated (conditionally independent) given z if they are separated in the moralized ancestral graph

• Example: [figure]

Global Markov properties of DAGs

• X is d-separated (directed-separated) from Z given Y if we can't send a ball from any node in X to any node in Z using the "Bayes-ball" algorithm illustrated below (plus some boundary conditions):

• Defn: I(G) = all independence properties that correspond to d-separation:

  I(G) = {X ⊥ Z | Y : dsepG(X; Z | Y)}

• D-separation is sound and complete
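The d-separation check itself is short to implement. Below is a minimal sketch (an assumed implementation, not lecture code) of the "moralized ancestral graph" test stated two slides back, applied to the Asia network whose parent sets come from the earlier factorization:

```python
def d_separated(dag, X, Z, Y):
    """dag: {node: list_of_parents}; X, Z, Y: sets of nodes."""
    # 1. Ancestral subgraph: keep only X ∪ Y ∪ Z and their ancestors.
    keep, frontier = set(), set(X) | set(Y) | set(Z)
    while frontier:
        n = frontier.pop()
        if n not in keep:
            keep.add(n)
            frontier |= set(dag[n])
    # 2. Moralize: undirected parent-child edges, plus edges between co-parents.
    nbrs = {n: set() for n in keep}
    for n in keep:
        pas = [p for p in dag[n] if p in keep]
        for p in pas:
            nbrs[n].add(p); nbrs[p].add(n)
        for i, p in enumerate(pas):            # "marry" the parents
            for q in pas[i + 1:]:
                nbrs[p].add(q); nbrs[q].add(p)
    # 3. Remove Y and check reachability from X to Z.
    reached, stack = set(X), list(X)
    while stack:
        for m in nbrs[stack.pop()] - set(Y):
            if m not in reached:
                reached.add(m); stack.append(m)
    return reached.isdisjoint(Z)

# Asia network from the earlier slides, as parent lists:
asia = {"X1": [], "X2": [], "X3": ["X1"], "X4": ["X2"], "X5": ["X2"],
        "X6": ["X3", "X4"], "X7": ["X6"], "X8": ["X5", "X6"]}
print(d_separated(asia, {"X1"}, {"X2"}, set()))   # True: no active path
print(d_separated(asia, {"X1"}, {"X2"}, {"X6"}))  # False: conditioning on the collider couples them
```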


Example:

• Complete the I(G) of this graph:

[Figure: a four-node DAG over x1, x2, x3, x4]

Towards quantitative specification of probability distribution

• Separation properties in the graph imply independence properties about the associated variables

• For the graph to be useful, any conditional independence properties we can derive from the graph should hold for the probability distribution that the graph represents

• The Equivalence Theorem: for a graph G,

  let D1 denote the family of all distributions that satisfy I(G),

  let D2 denote the family of all distributions that factor according to G,

  then D1 ≡ D2.

Example

[Figure: a three-node graph over A, B, C]

p(A, B, C) =

Example, con'd

• Evolution

[Figure: a tree model of sequence evolution: an unknown ancestral sequence evolves over T years into observed nucleotide sequences (A, C, G, ...) at the leaves]

Tree Model

Example, con'd

[Figure: a Hidden Markov Model: hidden states Y1 → Y2 → Y3 → ... → YT, with observations XA1, XA2, XA3, ..., XAT]

Hidden Markov Model


Example, con'd

• Genetic Pedigree

[Figure: a genetic pedigree network over variables A0, B0, Ag, Bg, A1, B1, F0, Fg, M0, F1, Sg, M1, C0, Cg, C1]

Conditional probability tables (CPTs)

P(a, b, c, d) = P(a) P(b) P(c|a,b) P(d|c)

[Figure: the DAG A → C ← B, C → D]

P(A):   a0: 0.75   a1: 0.25
P(B):   b0: 0.33   b1: 0.67

P(C | A, B):
        a0b0   a0b1   a1b0   a1b1
  c0    0.45   1      0.9    0.7
  c1    0.55   0      0.1    0.3

P(D | C):
        c0     c1
  d0    0.3    0.5
  d1    0.7    0.5
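As a quick check, the factored joint can be evaluated directly from these tables. A minimal sketch (not from the lecture):

```python
# CPTs from the slide above; states encoded as 0/1 (a0 -> 0, a1 -> 1, etc.).
P_a = {0: 0.75, 1: 0.25}
P_b = {0: 0.33, 1: 0.67}
P_c_given_ab = {  # (a, b) -> (P(c=0), P(c=1))
    (0, 0): (0.45, 0.55), (0, 1): (1.0, 0.0),
    (1, 0): (0.9, 0.1),   (1, 1): (0.7, 0.3),
}
P_d_given_c = {0: (0.3, 0.7), 1: (0.5, 0.5)}  # c -> (P(d=0), P(d=1))

def joint(a, b, c, d):
    # P(a, b, c, d) = P(a) P(b) P(c|a,b) P(d|c)
    return P_a[a] * P_b[b] * P_c_given_ab[(a, b)][c] * P_d_given_c[c][d]

print(joint(1, 0, 0, 1))   # P(a1, b0, c0, d1) = 0.25 * 0.33 * 0.9 * 0.7
```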


Conditional probability density functions (CPDs)

P(a, b, c, d) = P(a) P(b) P(c|a,b) P(d|c)

[Figure: the same DAG A → C ← B, C → D, now with continuous variables]

A ~ N(µa, Σa)
B ~ N(µb, Σb)
C ~ N(A + B, Σc)
D ~ N(µa + C, Σa)

Conditionally Independent Observations

[Figure: model parameters θ at the top, with arrows to data nodes y1, y2, ..., yn-1, yn]

“Plate” Notation

[Figure: θ (model parameters) with an arrow into a plate containing yi, i = 1:n; Data = {y1, ..., yn}]

Plate = rectangle in graphical model

variables within a plate are replicated in a conditionally independent manner

Example: Gaussian Model

[Figure: plate model with parameters µ, σ and data yi, i = 1:n]

  p(y1, ..., yn | µ, σ) = ∏_{i=1:n} p(yi | µ, σ) = p(data | parameters) = p(D | θ), where θ = {µ, σ}

• Likelihood = p(data | parameters) = p(D | θ) = L(θ)
• Likelihood tells us how likely the observed data are conditioned on a particular setting of the parameters
• Often easier to work with log L(θ)
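A minimal sketch (assumed, not lecture code) of evaluating log L(θ) for this i.i.d. Gaussian model; the data values are hypothetical:

```python
import math

def gaussian_log_likelihood(data, mu, sigma):
    # log L(θ) = Σ_i [ -0.5 log(2π σ²) - (y_i - µ)² / (2σ²) ]
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (y - mu) ** 2 / (2 * sigma ** 2) for y in data)

D = [1.2, 0.7, 1.9, 1.1]                              # hypothetical observations
print(gaussian_log_likelihood(D, mu=1.0, sigma=0.5))  # relatively high
print(gaussian_log_likelihood(D, mu=5.0, sigma=0.5))  # much lower: data are unlikely here
```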


Example: Bayesian Gaussian Model

[Figure: hyperparameters α and β placed over the parameters µ and σ, which in turn generate yi inside a plate, i = 1:n]

Note: priors and parameters are assumed independent here
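For intuition, here is a minimal sketch (an assumed illustration, not lecture code) of this model with σ held fixed: a Gaussian prior on µ is conjugate, so the posterior over µ has a closed form.

```python
def posterior_mu(data, m0, s0, sigma):
    """Prior µ ~ N(m0, s0^2), likelihood yi ~ N(µ, sigma^2); returns posterior mean/std of µ."""
    n = len(data)
    ybar = sum(data) / n
    prec = 1 / s0 ** 2 + n / sigma ** 2                    # posterior precision
    mean = (m0 / s0 ** 2 + n * ybar / sigma ** 2) / prec   # precision-weighted average
    return mean, prec ** -0.5

D = [1.2, 0.7, 1.9, 1.1]                                   # hypothetical observations
print(posterior_mu(D, m0=0.0, s0=10.0, sigma=0.5))         # weak prior: posterior near the sample mean
```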


Markov Random Fields

Structure: an undirected graph

[Figure: a node X and its neighbors Y1, Y2 in an undirected graph]

• Meaning: a node is conditionally independent of every other node in the network given its direct neighbors

• Local contingency functions (potentials) and the cliques in the graph completely determine the joint dist.

• Gives correlations between variables, but no explicit way to generate samples

Semantics of Undirected Graphs

• Let H be an undirected graph:

• B separates A and C if every path from a node in A to a node in C passes through a node in B:

  sepH(A; C | B)

• A probability distribution satisfies the global Markov property if for any disjoint A, B, C, such that B separates A and C, A is independent of C given B:

  I(H) = {A ⊥ C | B : sepH(A; C | B)}

Representation

• Defn: an undirected graphical model represents a distribution P(X1, ..., Xn) defined by an undirected graph H, and a set of positive potential functions ψc associated with the cliques of H, s.t.

  P(x1, ..., xn) = (1/Z) ∏_{c∈C} ψc(xc)

  where Z is known as the partition function:

  Z = ∑_{x1,...,xn} ∏_{c∈C} ψc(xc)

• Also known as Markov Random Fields, Markov networks, ...

• The potential function can be understood as a contingency function of its arguments, assigning a "pre-probabilistic" score to their joint configuration.

Cliques

• For G = {V, E}, a complete subgraph (clique) is a subgraph G' = {V' ⊆ V, E' ⊆ E} such that the nodes in V' are fully interconnected

• A (maximal) clique is a complete subgraph s.t. any superset V'' ⊃ V' is not complete.

• A sub-clique is a not-necessarily-maximal clique.

• Example:

[Figure: four nodes A, B, C, D with edges forming two triangles that share the edge B-D]

  - max-cliques = {A,B,D}, {B,C,D}
  - sub-cliques = {A,B}, {C,D}, ... → all edges and singletons

Example UGM – using max cliques

[Figure: the same four-node graph A, B, C, D]

  P(x1, x2, x3, x4) = (1/Z) ψc(x124) × ψc(x234)

  Z = ∑_{x1,x2,x3,x4} ψc(x124) × ψc(x234)

• For discrete nodes, we can represent P(X1:4) as two 3D tables instead of one 4D table
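A brute-force sketch (assumed, not lecture code) of this max-clique factorization, with the two potential tables replaced by arbitrary positive scoring functions for illustration:

```python
from itertools import product

def psi_124(x1, x2, x4):      # any positive "compatibility" scores will do
    return 1.0 + 2.0 * (x1 == x2) + 0.5 * x4

def psi_234(x2, x3, x4):
    return 1.0 + 3.0 * (x2 == x3 == x4)

# Partition function: sum of clique products over all joint configurations.
Z = sum(psi_124(x1, x2, x4) * psi_234(x2, x3, x4)
        for x1, x2, x3, x4 in product((0, 1), repeat=4))

def P(x1, x2, x3, x4):
    return psi_124(x1, x2, x4) * psi_234(x2, x3, x4) / Z

print(P(1, 1, 1, 1))
print(sum(P(*x) for x in product((0, 1), repeat=4)))  # sums to 1.0
```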


Example UGM – using subcliques

[Figure: the same four-node graph A, B, C, D]

  P(x1, x2, x3, x4) = (1/Z) ∏_{ij} ψij(xij)
                    = (1/Z) ψ12(x12) ψ14(x14) ψ23(x23) ψ24(x24) ψ34(x34)

  Z = ∑_{x1,x2,x3,x4} ∏_{ij} ψij(xij)

• For discrete nodes, we can represent P(X1:4) as 5 2D tables instead of one 4D table


Interpretation of Clique Potentials

[Figure: the chain with edges X-Y and Y-Z]

• The model implies X ⊥ Z | Y. This independence statement implies (by definition) that the joint must factorize as:

  p(x, y, z) = p(y) p(x|y) p(z|y)

• We can also write this as p(x, y, z) = p(x, y) p(z|y), or as p(x, y, z) = p(x|y) p(z, y), but:

  - we cannot have all potentials be marginals

  - we cannot have all potentials be conditionals

• The positive clique potentials can only be thought of as general "compatibility", "goodness" or "happiness" functions over their variables, but not as probability distributions.

Exponential Form

• Constraining clique potentials to be positive could be inconvenient (e.g., the interactions between a pair of atoms can be either attractive or repulsive). We represent a clique potential ψc(xc) in an unconstrained form using a real-valued "energy" function φc(xc):

  ψc(xc) = exp{−φc(xc)}

  For convenience, we will call φc(xc) a potential when no confusion arises from the context.

• This gives the joint a nice additive structure:

  p(x) = (1/Z) exp{−∑_{c∈C} φc(xc)} = (1/Z) exp{−H(x)}

  where the sum in the exponent is called the "free energy":

  H(x) = ∑_{c∈C} φc(xc)

• In physics, this is called the "Boltzmann distribution".

• In statistics, this is called a log-linear model.

Example: Boltzmann machines

[Figure: a fully connected graph over four nodes 1, 2, 3, 4]

• A fully connected graph with pairwise (edge) potentials on binary-valued nodes (for xi ∈ {−1, +1} or xi ∈ {0, 1}) is called a Boltzmann machine:

  P(x1, x2, x3, x4) = (1/Z) exp{∑_{ij} φij(xi, xj)}
                    = (1/Z) exp{∑_{ij} θij xi xj + ∑_i αi xi + C}

• Hence the overall energy function has the form:

  H(x) = ∑_{ij} (xi − µi) Θij (xj − µj) = (x − µ)^T Θ (x − µ)
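A brute-force sketch (assumed, not lecture code) of this distribution on four ±1-valued nodes; the weights θij and biases αi below are random illustrative values:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
Theta = rng.normal(size=(4, 4))
Theta = (Theta + Theta.T) / 2        # symmetric pairwise weights
np.fill_diagonal(Theta, 0.0)
alpha = rng.normal(size=4)

def unnormalized(x):
    x = np.asarray(x, dtype=float)
    # exp{ Σ_{i<j} θ_ij x_i x_j + Σ_i α_i x_i }; the 0.5 avoids double-counting edges
    return np.exp(0.5 * x @ Theta @ x + alpha @ x)

states = list(product((-1, 1), repeat=4))
Z = sum(unnormalized(s) for s in states)             # partition function by enumeration
probs = {s: unnormalized(s) / Z for s in states}
print(max(probs, key=probs.get))                     # most likely configuration
print(sum(probs.values()))                           # sanity check: 1.0
```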


Example: Ising (spin-glass) models

• Nodes are arranged in a regular topology (often a regular packing grid) and connected only to their geometric neighbors.

• Same as a sparse Boltzmann machine, where θij ≠ 0 iff i, j are neighbors.

• e.g., nodes are pixels, and the potential function encourages nearby pixels to have similar intensities.

• Potts model: multi-state Ising model.

Example: Protein-Protein networks


Example: Modeling Go

GMs are your old friends

• Density estimation: parametric and nonparametric methods

• Regression: linear, conditional mixture, nonparametric

• Classification: generative and discriminative approaches

[Figure: a small graphical model accompanies each task]

An (incomplete) genealogy of graphical models

(Picture by Zoubin Ghahramani and Sam Roweis)


Probabilistic Inference

• Computing statistical queries regarding the network, e.g.:

  - Is node X independent of node Y given nodes Z, W?

  - What is the probability of X = true if (Y = false and Z = true)?

  - What is the joint distribution of (X, Y) if Z = false?

  - What is the likelihood of some full assignment?

  - What is the most likely assignment of values to all or a subset of the nodes of the network?

• General-purpose algorithms exist to fully automate such computation

• Computational cost depends on the topology of the network

• Exact inference:

  - The elimination algorithm; the junction tree algorithm

• Approximate inference:

  - Loopy belief propagation, variational inference, Monte Carlo
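The elimination idea can be made concrete with a tiny sketch (an assumed illustration, not lecture code): computing a marginal on the chain A → B → C by summing variables out one at a time; the CPT numbers are made up.

```python
P_A = {0: 0.6, 1: 0.4}
P_B_given_A = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # [a][b]
P_C_given_B = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}   # [b][c]

# Eliminate A: message m_A(b) = Σ_a P(a) P(b|a)
m_A = {b: sum(P_A[a] * P_B_given_A[a][b] for a in (0, 1)) for b in (0, 1)}

# Eliminate B: P(c) = Σ_b m_A(b) P(c|b)
P_C = {c: sum(m_A[b] * P_C_given_B[b][c] for b in (0, 1)) for c in (0, 1)}

print(P_C[1])   # P(C=1); the cost is driven by the largest intermediate factor, not 2^n
```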


Learning Graphical Models

The goal:

Given set of independent samples (assignments of random variables), find the best (the most likely?) Bayesian Network (both DAG and CPDs)

[Figure: two candidate network structures over the variables E, B, R, A, C]

Data samples:

  (B,E,A,C,R) = (T,F,F,T,F)
  (B,E,A,C,R) = (T,F,T,T,F)
  (B,E,A,C,R) = (F,T,T,T,F)
  ........

P(A | E, B):

            A = T   A = F
  e,  b      0.9     0.1
  e,  ¬b     0.2     0.8
  ¬e, b      0.9     0.1
  ¬e, ¬b     0.01    0.99

A Bayesian Learning Approach

For a model p(A | Θ):

• Treat parameters/model Θ as unobserved random variable(s)

• Probabilistic statements about Θ are conditioned on the values of the observed variables A_obs

[Figure: the two networks over E, B, R, A, C, now with a prior p(Θ) over the parameters]

• Bayes' rule:  p(Θ | A) ∝ p(A | Θ) p(Θ)
  (posterior ∝ likelihood × prior)

• Bayesian estimation:  Θ_Bayes = ∫ Θ p(Θ | A) dΘ
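As a concrete instance of the recipe above, here is a minimal sketch (an assumed example, not from the lecture) for a single binary variable with a Beta prior, where the Bayesian estimate ∫ Θ p(Θ | A) dΘ is simply the posterior mean.

```python
def beta_bernoulli_posterior_mean(data, alpha=1.0, beta=1.0):
    # Prior Θ ~ Beta(alpha, beta); posterior is Beta(alpha + #ones, beta + #zeros).
    ones = sum(data)
    return (alpha + ones) / (alpha + beta + len(data))

samples = [1, 0, 1, 1, 0, 1]                          # hypothetical observations of one variable
print(beta_bernoulli_posterior_mean(samples))         # Bayesian (posterior-mean) estimate
print(sum(samples) / len(samples))                    # compare: maximum-likelihood estimate
```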

Why graphical models

• Probability theory provides the glue whereby the parts are combined, ensuring that the system as a whole is consistent, and providing ways to interface models to data.

• The graph theoretic side of graphical models provides both an intuitively appealing interface by which humans can model highly-interacting sets of variables as well as a data structure that lends itself naturally to the design of efficient general-purpose algorithms.

• Many of the classical multivariate probabilistic systems studied in fields such as statistics, systems engineering, information theory, and statistical mechanics are special cases of the general graphical model formalism

  - examples include mixture models, factor analysis, hidden Markov models, Kalman filters and Ising models.

• The graphical model framework provides a way to view all of these systems as instances of a common underlying formalism.

--- M. Jordan
