
CS 6347

Lecture 4

Markov Random Fields (a.k.a. Undirected Graphical Models)

• A Bayesian network is a directed graphical model that captures independence relationships of a given distribution

– Encodes local Markov independence assumptions that each node is independent of its non-descendants given its parents

– Corresponds to a factorization of the joint distribution

$$p(x_1, \dots, x_n) = \prod_i p(x_i \mid x_{\text{parents}(i)})$$
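For concreteness, a small example (not from the slides): in a chain-structured network 퐴 → 퐵 → 퐶, each node's only parent is its predecessor, so the factorization reads

$$p(x_A, x_B, x_C) = p(x_A)\, p(x_B \mid x_A)\, p(x_C \mid x_B)$$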

Limits of Bayesian Networks

• Not all sets of independence relations can be captured by a Bayesian network

– 퐴 ⊥ 퐶 | 퐵, 퐷

– 퐵 ⊥ 퐷 | 퐴, 퐶

• Possible DAGs that represent these independence relationships?

[Figure: two candidate DAGs over the nodes 퐴, 퐵, 퐶, 퐷]

Markov Random Fields (MRFs)

• A Markov random field is an undirected graphical model

– Undirected graph 퐺 = (푉, 퐸)

– One node for each random variable

– One potential function or "factor" associated with each of the cliques, 퐶, of the graph

– Nonnegative potential functions represent interactions and need not correspond to conditional distributions (they may not even sum to one)

Markov Random Fields (MRFs)

• A Markov random field is an undirected graphical model

– Corresponds to a factorization of the joint distribution

$$p(x_1, \dots, x_n) = \frac{1}{Z} \prod_{c \in C} \psi_c(x_c)$$

$$Z = \sum_{x_1', \dots, x_n'} \prod_{c \in C} \psi_c(x_c')$$


The normalizing constant, 푍, is often called the partition function.

An Example

[Figure: MRF over the nodes 퐴, 퐵, 퐶 connected in a triangle]

• $p(x_A, x_B, x_C) = \frac{1}{Z}\, \psi_{AB}(x_A, x_B)\, \psi_{BC}(x_B, x_C)\, \psi_{AC}(x_A, x_C)$

• Each potential function can be specified as a table, as before:

휓퐴퐵(푥퐴, 푥퐵) | 푥퐴 = 0 | 푥퐴 = 1
푥퐵 = 0 | 1 | 1
푥퐵 = 1 | 1 | 0
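As a concrete illustration, here is a minimal Python sketch (not part of the slides) that evaluates this triangle MRF by brute force: it multiplies the clique potentials, sums over all eight assignments to obtain 푍, and normalizes. Only the 휓퐴퐵 table matches the slide; the 휓퐵퐶 and 휓퐴퐶 values are made up for the example.

```python
# Brute-force evaluation of the triangle MRF p = (1/Z) psi_AB psi_BC psi_AC
# over three binary variables. psi_AB follows the table above; psi_BC and
# psi_AC are illustrative made-up potentials.
import itertools

psi_AB = {(0, 0): 1.0, (1, 0): 1.0, (0, 1): 1.0, (1, 1): 0.0}  # keyed by (x_A, x_B)
psi_BC = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}  # made up
psi_AC = {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 3.0, (1, 1): 1.0}  # made up

def unnormalized(xa, xb, xc):
    """Product of the clique potentials for one joint assignment."""
    return psi_AB[xa, xb] * psi_BC[xb, xc] * psi_AC[xa, xc]

# Partition function: sum of the unnormalized measure over all 2^3 assignments.
Z = sum(unnormalized(xa, xb, xc)
        for xa, xb, xc in itertools.product([0, 1], repeat=3))

def p(xa, xb, xc):
    """Normalized joint probability p(x_A, x_B, x_C)."""
    return unnormalized(xa, xb, xc) / Z

print(Z, p(0, 0, 0))
```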

The Ising Model of Ferromagnets

• Each atom has an associated spin that is biased by both its neighbors in the material and an external magnetic field

– Spins can be either +1 or −1

– Edge potentials capture the local interactions

– Singleton potentials capture the external field

[Figure: grid of atoms with spins +1/−1]

$$p(x_V) = \frac{1}{Z} \exp\Big( \sum_{i \in V} h_i x_i + \sum_{(i,j) \in E} J_{ij} x_i x_j \Big)$$
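A small code sketch (an assumed example, not from the slides) of this distribution on a tiny chain of three atoms: the field strengths ℎ and couplings 퐽 below are made up, and 푍 is computed by brute force over all 2³ spin configurations.

```python
# Unnormalized Ising measure exp( sum_i h_i x_i + sum_{(i,j) in E} J_ij x_i x_j )
# on a small graph, normalized by brute-force summation over spin configurations.
import itertools
import math

V = [0, 1, 2]                    # three atoms
E = [(0, 1), (1, 2)]             # interaction edges (a chain)
h = {0: 0.5, 1: 0.0, 2: -0.5}    # external-field strengths (made up)
J = {(0, 1): 1.0, (1, 2): 1.0}   # coupling strengths (made up)

def weight(x):
    """x maps each node in V to a spin in {+1, -1}."""
    total = sum(h[i] * x[i] for i in V)
    total += sum(J[i, j] * x[i] * x[j] for (i, j) in E)
    return math.exp(total)

Z = sum(weight(dict(zip(V, spins)))
        for spins in itertools.product([+1, -1], repeat=len(V)))

x = {0: +1, 1: +1, 2: -1}
print(weight(x) / Z)             # probability of this spin configuration
```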

Independence Assertions

• Instead of d-separation, we need only consider separation:

– If 푋 ⊆ 푉 is graph separated from 푌 ⊆ 푉 by 푍 ⊆ 푉 (i.e., all paths from 푋 to 푌 pass through 푍), then 푋 ⊥ 푌 | 푍

– What independence assertions follow from this MRF?

[Figure: MRF over the nodes 퐴, 퐵, 퐶, 퐷]
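A minimal Python sketch (an assumed helper, not from the slides) of the graph-separation test: delete the separating set 푍 and check with breadth-first search whether 푋 can still reach 푌. The 4-cycle below is used purely as an illustration; it is consistent with the independence assertions from the earlier slide.

```python
# Graph separation: Z separates X from Y iff every path from X to Y passes
# through Z, i.e., X cannot reach Y once the nodes in Z are removed.
from collections import deque

def separated(adj, X, Y, Z):
    """adj: dict node -> set of neighbors (undirected graph)."""
    blocked = set(Z)
    frontier = deque(x for x in X if x not in blocked)
    visited = set(frontier)
    while frontier:
        u = frontier.popleft()
        if u in Y:
            return False                 # found a path from X to Y avoiding Z
        for v in adj[u]:
            if v not in blocked and v not in visited:
                visited.add(v)
                frontier.append(v)
    return True

# A 4-cycle A - B - C - D - A.
adj = {'A': {'B', 'D'}, 'B': {'A', 'C'}, 'C': {'B', 'D'}, 'D': {'A', 'C'}}
print(separated(adj, {'A'}, {'C'}, {'B', 'D'}))   # True:  A is separated from C by {B, D}
print(separated(adj, {'A'}, {'C'}, {'B'}))        # False: the path A - D - C survives
```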

Independence Assertions

[Figure: chain MRF 퐴 - 퐵 - 퐶]

$$p(x_A, x_B, x_C) = \frac{1}{Z}\, \psi_{AB}(x_A, x_B)\, \psi_{BC}(x_B, x_C)$$

• How does separation imply independence?

• Show that 퐴 ⊥ 퐶 | 퐵
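A sketch of the argument (filling in the steps implied by the slide, using the factorization above): conditioning on 푥퐵 cancels 푍 and splits the conditional into a term depending only on 푥퐴 and a term depending only on 푥퐶,

$$p(x_A, x_C \mid x_B) = \frac{p(x_A, x_B, x_C)}{\sum_{x_A', x_C'} p(x_A', x_B, x_C')} = \frac{\psi_{AB}(x_A, x_B)}{\sum_{x_A'} \psi_{AB}(x_A', x_B)} \cdot \frac{\psi_{BC}(x_B, x_C)}{\sum_{x_C'} \psi_{BC}(x_B, x_C')},$$

and summing over 푥퐶 (resp. 푥퐴) shows that these two factors are exactly 푝(푥퐴 | 푥퐵) and 푝(푥퐶 | 푥퐵), so 푝(푥퐴, 푥퐶 | 푥퐵) = 푝(푥퐴 | 푥퐵) 푝(푥퐶 | 푥퐵), i.e., 퐴 ⊥ 퐶 | 퐵.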

Independence Assertions

• In particular, each variable is independent of all of its non-neighbors given its neighbors

– All paths leaving a single variable must pass through some neighbor

• If the joint probability distribution, 푝, factorizes with respect to the graph 퐺, then 퐺 is an I-map for 푝

• If 퐺 is an I-map of a positive distribution 푝, then 푝 factorizes with respect to the graph 퐺

– Hammersley-Clifford Theorem


BNs vs. MRFs

Property | Bayesian Networks | Markov Random Fields
Factorization | Conditional Distributions | Potential Functions
Distribution | Product of Conditional Distributions | Normalized Product of Potentials
Cycles | Not Allowed | Allowed
Partition Function | 1 | Potentially NP-hard to Compute
Independence Test | d-Separation | Graph Separation

Moralization

• Every Bayesian network can be converted into an MRF with some possible loss of independence information

– Remove the direction of all arrows in the network

– If 퐴 and 퐵 are parents of 퐶 in the Bayesian network, we add an edge between 퐴 and 퐵 in the MRF

• This procedure is called "moralization" because it "marries" the parents of every node



[Figure: a Bayesian network over 퐴, 퐵, 퐶, 퐷 and its moralized MRF]
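A minimal Python sketch (an assumed helper, not from the slides) of the moralization procedure just described: drop edge directions and connect every pair of parents that share a child.

```python
# Moralize a DAG given as a mapping from each node to the set of its parents.
from itertools import combinations

def moralize(parents):
    """Return the undirected edge set of the moral graph (edges as frozensets)."""
    edges = set()
    for child, pars in parents.items():
        for p in pars:
            edges.add(frozenset((p, child)))      # drop the arrow's direction
        for p, q in combinations(sorted(pars), 2):
            edges.add(frozenset((p, q)))          # "marry" co-parents of the child
    return edges

# Hypothetical example: the v-structure A -> C <- B moralizes to the triangle over A, B, C.
print(moralize({'A': set(), 'B': set(), 'C': {'A', 'B'}}))
```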

Moralization

[Figure: a Bayesian network over 퐴, 퐵, 퐶 and its moralized MRF]

• What independence information is lost?

Factorizations

• Many factorizations over the same graph may represent the same joint distribution

– Some are better than others (e.g., they more compactly represent the distribution)

– Simply looking at the graph is not enough to understand which specific factorization is being assumed

[Figure: an undirected graph over the nodes 퐴, 퐵, 퐶, 퐷]

Factor Graphs

• Factor graphs are used to explicitly represent a given factorization over a given graph

– Not a different model, but rather a different way to visualize an MRF

– Undirected graph with two types of nodes: variable nodes (circles) and factor nodes (squares)

– Factor nodes are connected to the variable nodes on which they depend

Factor Graphs

$$p(x_A, x_B, x_C) = \frac{1}{Z}\, \psi_{AB}(x_A, x_B)\, \psi_{BC}(x_B, x_C)\, \psi_{AC}(x_A, x_C)$$

[Figure: factor graph with variable nodes 퐴, 퐵, 퐶 (circles) and factor nodes 휓퐴퐵, 휓퐵퐶, 휓퐴퐶 (squares)]
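A small sketch (an assumed representation, not from the slides): one direct way to store this factor graph is a mapping from each factor node to the variable nodes it touches, from which the bipartite adjacency follows.

```python
# Factor graph for the factorization above: each factor node and the variable
# nodes it is connected to.
factors = {
    'psi_AB': ('A', 'B'),
    'psi_BC': ('B', 'C'),
    'psi_AC': ('A', 'C'),
}

# Neighbors of each variable node are the factor nodes that mention it.
var_neighbors = {}
for factor, scope in factors.items():
    for v in scope:
        var_neighbors.setdefault(v, set()).add(factor)

print(var_neighbors)   # e.g. 'A' is connected to psi_AB and psi_AC
```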

Conditional Random Fields (CRFs)

• Undirected graphical models that represent conditional probability distributions 푝(푌 | 푋)

– Potentials can depend on both 푋 and 푌; the 푋 variables are always observed, so only the distribution of 푌 given 푋 is modeled

$$p(Y \mid X) = \frac{1}{Z(x)} \prod_{c \in C} \psi_c(x_c, y_c)$$

$$Z(x) = \sum_{y'} \prod_{c \in C} \psi_c(x_c, y_c')$$

Log-Linear Models

• CRFs often assume that the potentials are log-linear functions

$$\psi_c(x_c, y_c) = \exp\big(w \cdot f_c(x_c, y_c)\big)$$

where 푓푐 is referred to as a feature vector and 푤 is a vector of feature weights (a small sketch of such a potential follows below)

• The feature weights are typically learned from data

• CRFs don’t require us to model the full joint distribution (which may not be possible anyhow)
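A small Python sketch (an assumed example, not from the slides) of a log-linear CRF on a tiny chain of labels: each clique potential is exp(푤 ⋅ 푓푐), and the conditional normalizer 푍(푥) sums over label configurations only. The features and weights below are made up.

```python
# Log-linear CRF potentials psi_c = exp(w . f_c(x_c, y_c)) on a tiny model with
# two binary labels y1, y2, two observed values x1, x2, and three cliques:
# one per (x_i, y_i) pair plus one on (y1, y2).
import itertools
import math

w = [1.5, -0.5, 0.8]   # weights for the three features below (would be learned)

def f_node(x_i, y_i):
    # Feature 1 fires in proportion to x_i when the label is 1; feature 2 is a bias on label 1.
    return [x_i if y_i == 1 else 0.0, 1.0 if y_i == 1 else 0.0, 0.0]

def f_pair(y_i, y_j):
    # Feature 3 fires when neighboring labels agree.
    return [0.0, 0.0, 1.0 if y_i == y_j else 0.0]

def psi(features):
    return math.exp(sum(wi * fi for wi, fi in zip(w, features)))

def unnormalized(x, y):
    value = psi(f_pair(y[0], y[1]))
    for x_i, y_i in zip(x, y):
        value *= psi(f_node(x_i, y_i))
    return value

x = (0.2, 1.7)                                   # observed values
Z_x = sum(unnormalized(x, y) for y in itertools.product([0, 1], repeat=2))
for y in itertools.product([0, 1], repeat=2):
    print(y, unnormalized(x, y) / Z_x)           # p(y | x)
```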

Conditional Random Fields (CRFs)

• Binary image segmentation

– Label the pixels of an image as belonging to the foreground or background

– +/- correspond to foreground/background

– Interaction between neighboring pixels in the image depends on how similar the pixels are

• Similar pixels should prefer having the same spin (i.e., being in the same part of the image)
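One way this preference is often encoded (a sketch; the Gaussian similarity form and the parameter names are assumptions, not from the slides): the edge potential rewards neighboring pixels for taking the same label, more strongly when their colors are similar.

```python
# Similarity-weighted edge potential for binary segmentation: agreement between
# neighboring labels is rewarded in proportion to how similar the pixels look.
import math

def edge_potential(color_i, color_j, label_i, label_j, beta=2.0):
    """Colors are scalars in [0, 1]; labels are +1 (foreground) or -1 (background)."""
    similarity = math.exp(-beta * (color_i - color_j) ** 2)
    return math.exp(similarity if label_i == label_j else -similarity)

# Similar pixels strongly prefer matching labels...
print(edge_potential(0.50, 0.52, +1, +1), edge_potential(0.50, 0.52, +1, -1))
# ...while dissimilar pixels express only a weak preference.
print(edge_potential(0.10, 0.90, +1, +1), edge_potential(0.10, 0.90, +1, -1))
```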

Conditional Random Fields (CRFs)

• Binary image segmentation

– This can be modeled as a CRF where the image information (e.g., pixel colors) is observed, but the segmentation is unobserved

– Because the model is conditional, we don’t need to describe the joint probability distribution of (natural) images and their foreground/background segmentations

– CRFs will be particularly important when we want to learn graphical models

Low Density Parity Check Codes

• Want to send a message across a noisy channel in which bits can be flipped with some probability – use error correcting codes

[Figure: factor graph with parity-check factors 휓퐴, 휓퐵, 휓퐶 connected to the codeword bits 푦1, …, 푦6, and channel factors 휙1, …, 휙6 connecting each 푦푖 to the received bit 푥푖]

• 휓퐴, 휓퐵, 휓퐶 are all parity check constraints: they equal one if their input contains an even number of ones and zero otherwise

• 휙푖(푥푖, 푦푖) = 푝(푦푖 | 푥푖), the probability that the 푖th bit was flipped during transmission

Low Density Parity Check Codes


• The parity check constraints enforce that the 푦’s can only be one of a few possible codewords: 000000, 001011, 010101, 011110, 100110, 101101, 110011, 111000

• Decoding the message that was sent is equivalent to computing the most likely codeword under the joint probability distribution
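As a quick check (an illustrative sketch; the exact variable scopes of 휓퐴, 휓퐵, 휓퐶 are inferred from the codeword list above, so treat them as an assumption), enumerating all assignments that satisfy three parity checks recovers exactly these eight codewords.

```python
# Enumerate the assignments of (y1, ..., y6) allowed by three parity-check
# factors; each factor is 1 when its scope contains an even number of ones.
import itertools

checks = [(0, 1, 3),   # a check over y1, y2, y4 (scopes inferred, see above)
          (0, 2, 4),   # a check over y1, y3, y5
          (1, 2, 5)]   # a check over y2, y3, y6

def parity_ok(y, scope):
    return sum(y[i] for i in scope) % 2 == 0

codewords = [''.join(map(str, y))
             for y in itertools.product([0, 1], repeat=6)
             if all(parity_ok(y, scope) for scope in checks)]
print(codewords)
# ['000000', '001011', '010101', '011110', '100110', '101101', '110011', '111000']
```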

Low Density Parity Check Codes


• The most likely codeword is given by MAP inference:

$$\arg\max_y \; p(y \mid x)$$

• Do we need to compute the partition function for MAP inference?
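One way to see the answer (a short aside, not spelled out on this slide): the partition function 푍(푥) does not depend on 푦, so it rescales every candidate codeword equally and cannot change the maximizer,

$$\arg\max_y \; p(y \mid x) = \arg\max_y \; \frac{1}{Z(x)} \prod_{c \in C} \psi_c(x_c, y_c) = \arg\max_y \; \prod_{c \in C} \psi_c(x_c, y_c),$$

so MAP inference can be carried out on the unnormalized product of potentials alone.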
