CS 6347
Lecture 11
MCMC Sampling Methods

Last Time
• Sampling from discrete univariate distributions
– Rejection sampling
• To sample $p(y)$, draw samples $(x', y')$ from $p(x, y)$ and reject those with $y' \neq y$
– Importance sampling
• Introduce a proposal distribution $q(x)$ whose support contains the support of $p(x, y)$
• Sample from $q$ and reweight the samples to generate samples from $p$
Today
• We saw how to sample from Bayesian networks, but how do we sample from MRFs?
– Can’t even compute $p(x) = \frac{1}{Z} \prod_{c \in C} \psi_c(x_c)$ without knowing the partition function $Z$
– No well-defined ordering in the model
• To sample from MRFs, we will need fancier forms of sampling
– So-called Markov Chain Monte Carlo (MCMC) methods
Markov Chains
• A Markov chain is a sequence of random variables $X_1, \ldots, X_n \in S$ such that
$$p(x_{n+1} \mid x_1, \ldots, x_n) = p(x_{n+1} \mid x_n)$$
• The set $S$ is called the state space, and $p(X_{n+1} = b \mid X_n = a)$ is the probability of transitioning from state $a$ to state $b$ at step $n$
• Viewed as a Bayesian network or an MRF, the joint distribution over the first $n$ steps factorizes over a chain
Markov Chains
• When the probability of transitioning between two states does not depend on time, we call it a time-homogeneous Markov chain
– Represent it by an $|S| \times |S|$ transition matrix $P$
• $P_{ij} = p(X_{n+1} = j \mid X_n = i)$
• 푃 is a stochastic matrix (all rows sum to one)
– Draw it as a directed graph over the state space with an arrow from 푎 ∈ 푆 to 푏 ∈ 푆 labelled by the probability of transitioning from 푎 to 푏
Markov Chains
• Given some initial distribution over states $p(x_1)$
– Represent $p(x_1)$ as a length-$|S|$ row vector $\pi_1$
– The probability distribution after $n$ steps is then given by $\pi_{n+1} = \pi_1 P^n$
• Typically, we are interested in the long-term behavior (i.e., the state of the system when $n$ is large)
• In particular, we are interested in steady-state distributions $\mu$ such that $\mu = \mu P$
– A given chain may or may not converge to a steady state
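A minimal sketch of these two ideas in code: represent the initial distribution as a row vector and repeatedly apply $P$ to approximate the steady state. The transition matrix below is made up for illustration.

```python
import numpy as np

# A made-up 3-state transition matrix; each row sums to one.
P = np.array([
    [0.5, 0.25, 0.25],
    [0.2, 0.6,  0.2 ],
    [0.3, 0.3,  0.4 ],
])

pi = np.array([1.0, 0.0, 0.0])  # initial distribution pi_1

# Repeatedly apply P; for an irreducible, aperiodic chain this converges
# to the unique steady-state distribution mu satisfying mu = mu P.
for _ in range(100):
    pi = pi @ P

print(pi)                       # approximate steady state
print(np.allclose(pi, pi @ P))  # checks mu = mu P at convergence
```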
Markov Chains
• Theorem: every irreducible, aperiodic Markov chain converges to a unique steady state distribution independent of the initial distribution
– Irreducible: the directed graph of transitions is strongly connected (i.e., there is a directed path between every pair of nodes)
– Aperiodic: $p(X_n = i \mid X_1 = i) > 0$ for all large enough $n$
• If the state graph is strongly connected and there is a non-zero probability of remaining in any state, then the chain is irreducible and aperiodic
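For a small chain, both conditions can be checked mechanically. A sketch using Wielandt's bound: a chain is irreducible and aperiodic exactly when some power of $P$ is entrywise positive, and $P^{(n-1)^2+1}$ suffices to test.

```python
import numpy as np

def is_irreducible_aperiodic(P):
    """Check irreducibility + aperiodicity (primitivity) of a transition
    matrix via Wielandt's bound: P is primitive iff P^((n-1)^2 + 1) has
    all entries strictly positive."""
    n = P.shape[0]
    M = np.linalg.matrix_power(P, (n - 1) ** 2 + 1)
    return bool(np.all(M > 0))
```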
Detailed Balance
• Lemma: a vector of probabilities $\mu$ is a stationary distribution of the Markov chain with transition matrix $P$ if it satisfies detailed balance:
$$\mu_i P_{ij} = \mu_j P_{ji} \quad \text{for all } i, j$$
Proof:
$$(\mu P)_j = \sum_i \mu_i P_{ij} = \sum_i \mu_j P_{ji} = \mu_j \sum_i P_{ji} = \mu_j,$$
where the last step uses that the rows of $P$ sum to one. So, $\mu P = \mu$.
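A numerical sanity check of the lemma (the weights below are made up): a random walk on a weighted undirected graph, $P_{ij} = w_{ij} / \sum_j w_{ij}$, satisfies detailed balance with $\mu_i \propto \sum_j w_{ij}$.

```python
import numpy as np

# Made-up symmetric edge weights for a 3-node undirected graph.
w = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 1.0, 1.0],
    [0.5, 1.0, 3.0],
])
P = w / w.sum(axis=1, keepdims=True)  # random-walk transition matrix
mu = w.sum(axis=1) / w.sum()          # candidate stationary distribution

flux = mu[:, None] * P                # flux[i, j] = mu_i P_ij
print(np.allclose(flux, flux.T))      # detailed balance: mu_i P_ij = mu_j P_ji
print(np.allclose(mu @ P, mu))        # hence stationarity: mu P = mu
```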
MCMC Sampling
• Markov chain Monte Carlo sampling
– Construct a Markov chain where the stationary distribution is the distribution we want to sample from
– Use the Markov chain to generate samples from the distribution
– Use the same Monte Carlo estimation strategy as before
– Will let us sample conditional distributions easily as well!
Gibbs Sampling
• Choose an initial assignment $x^0$
• Fix an ordering of the variables (any order is fine)
• For each 푗 ∈ 푉 in order
– Draw a sample $z$ from $p(x_j \mid x_1^{t+1}, \ldots, x_{j-1}^{t+1}, x_{j+1}^t, \ldots, x_{|V|}^t)$
– Set $x_j^{t+1} = z$
• Set 푡 ← 푡 + 1 and repeat
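A minimal sketch of this loop for discrete variables; `conditional(j, x)` is a hypothetical user-supplied helper that returns the vector of probabilities $p(x_j \mid x_{V \setminus \{j\}})$ under the current assignment.

```python
import numpy as np

def gibbs(conditional, n_vars, n_samples, seed=0):
    """Generic Gibbs sampler over discrete variables.

    conditional(j, x) -> probability vector for x_j given the rest of x
    (hypothetical helper supplied by the caller)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n_vars, dtype=int)     # initial assignment x^0
    samples = []
    for _ in range(n_samples):
        for j in range(n_vars):         # fixed ordering of the variables
            probs = conditional(j, x)
            x[j] = rng.choice(len(probs), p=probs)  # draw z, set x_j = z
        samples.append(x.copy())        # record x^{t+1}; then t <- t + 1
    return samples
```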
Gibbs Sampling
• If $p(x) = \frac{1}{Z} \prod_C \psi_C(x_C)$, we can use the conditional independence assumptions to sample from $p(x_j \mid x_{N(j)})$
– This lets us exploit the graph structure for sampling
– For Bayesian networks, this reduces to $p(x_j \mid x_{MB(j)})$, where $MB(j)$ is $j$’s Markov blanket ($j$’s parents, children, and its children’s parents)
Gibbs Sampling
[Figure: Bayesian network with edges A → C, B → C, and C → D]

Order: A, B, C, D, A, B, C, D, …

A  P(A)      B  P(B)      C  D  P(D|C)
0  .3        0  .4        0  0  .3
1  .7        1  .6        0  1  .7
                          1  0  .4
                          1  1  .6

A  B  C  P(C|A,B)
0  0  0  .1
0  0  1  .9
0  1  0  .2
0  1  1  .8
1  0  0  .01
1  0  1  .99
1  1  0  .25
1  1  1  .75

Samples (A, B, C, D) so far: (0, 0, 0, 0)
(1) Sample from $p(x_A \mid x_B = 0, x_C = 0, x_D = 0)$. Using Bayes’ rule,
$p(x_A \mid x_B = 0, x_C = 0) \propto p(x_A)\, p(x_C = 0 \mid x_A, x_B = 0)$
$p(x_A = 0 \mid x_B = 0, x_C = 0) \propto .3 \cdot .1 = .03$
$p(x_A = 1 \mid x_B = 0, x_C = 0) \propto .7 \cdot .01 = .007$
Gibbs Sampling
[Same network, CPTs, and order as above. Samples so far: (0, 0, 0, 0); next sample in progress: A = 0]
(1) Sample from $p(x_A \mid x_B = 0, x_C = 0, x_D = 0)$. Using Bayes’ rule,
$p(x_A \mid x_B = 0, x_C = 0) \propto p(x_A)\, p(x_C = 0 \mid x_A, x_B = 0)$
$p(x_A = 0 \mid x_B = 0, x_C = 0) \propto .3 \cdot .1 \rightarrow .811$
$p(x_A = 1 \mid x_B = 0, x_C = 0) \propto .7 \cdot .01 \rightarrow .189$
Gibbs Sampling
[Same network, CPTs, and order as above. Samples so far: (0, 0, 0, 0); next sample in progress: A = 0]
(1) Sample from $p(x_B \mid x_A = 0, x_C = 0, x_D = 0)$. Using Bayes’ rule,
$p(x_B \mid x_A = 0, x_C = 0) \propto p(x_B)\, p(x_C = 0 \mid x_A = 0, x_B)$
$p(x_B = 0 \mid x_A = 0, x_C = 0) \propto .4 \cdot .1 = .04$
$p(x_B = 1 \mid x_A = 0, x_C = 0) \propto .6 \cdot .2 = .12$
Gibbs Sampling
[Same network, CPTs, and order as above. Samples so far: (0, 0, 0, 0); next sample in progress: A = 0, B = 1]
(1) Sample from $p(x_B \mid x_A = 0, x_C = 0, x_D = 0)$. Using Bayes’ rule,
$p(x_B \mid x_A = 0, x_C = 0) \propto p(x_B)\, p(x_C = 0 \mid x_A = 0, x_B)$
$p(x_B = 0 \mid x_A = 0, x_C = 0) \propto .4 \cdot .1 \rightarrow .25$
$p(x_B = 1 \mid x_A = 0, x_C = 0) \propto .6 \cdot .2 \rightarrow .75$
Gibbs Sampling
[Same network, CPTs, and order as above. Samples so far: (0, 0, 0, 0); next sample in progress: A = 0, B = 1]
(1) Sample from $p(x_C \mid x_A = 0, x_B = 1, x_D = 0)$. Using Bayes’ rule,
$p(x_C \mid x_A = 0, x_B = 1, x_D = 0) \propto p(x_C \mid x_A = 0, x_B = 1)\, p(x_D = 0 \mid x_C)$
$p(x_C = 0 \mid x_A = 0, x_B = 1, x_D = 0) \propto .2 \cdot .3 = .06$
$p(x_C = 1 \mid x_A = 0, x_B = 1, x_D = 0) \propto .8 \cdot .4 = .32$
Gibbs Sampling
[Same network, CPTs, and order as above. Samples so far: (0, 0, 0, 0); next sample in progress: A = 0, B = 1, C = 1]
(1) Sample from $p(x_C \mid x_A = 0, x_B = 1, x_D = 0)$. Using Bayes’ rule,
$p(x_C \mid x_A = 0, x_B = 1, x_D = 0) \propto p(x_C \mid x_A = 0, x_B = 1)\, p(x_D = 0 \mid x_C)$
$p(x_C = 0 \mid x_A = 0, x_B = 1, x_D = 0) \propto .2 \cdot .3 \rightarrow .158$
$p(x_C = 1 \mid x_A = 0, x_B = 1, x_D = 0) \propto .8 \cdot .4 \rightarrow .842$
Gibbs Sampling
[Same network, CPTs, and order as above. Samples so far: (0, 0, 0, 0); next sample in progress: A = 0, B = 1, C = 1]
(1) Sample from $p(x_D \mid x_C = 1)$
$p(x_D = 0 \mid x_C = 1) = .4$
$p(x_D = 1 \mid x_C = 1) = .6$
Gibbs Sampling
[Same network, CPTs, and order as above. Samples so far: (0, 0, 0, 0); next sample complete: (0, 1, 1, 0)]
(1) Sample from $p(x_D \mid x_C = 1)$
$p(x_D = 0 \mid x_C = 1) = .4$
$p(x_D = 1 \mid x_C = 1) = .6$
Gibbs Sampling
[Same network, CPTs, and order as above. Samples so far: (0, 0, 0, 0) and (0, 1, 1, 0)]
(2) Repeat the same process to generate the next sample
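A sketch of this worked example in code, using the CPTs from the slides; each conditional matches the computations above (e.g., $(.811, .189)$ for $A$ and $(.158, .842)$ for $C$).

```python
import numpy as np

# CPTs from the example network A -> C <- B, C -> D.
pA = np.array([.3, .7])
pB = np.array([.4, .6])
pC = np.array([[[.1, .9], [.2, .8]],      # p(C | A, B), indexed [a, b, c]
               [[.01, .99], [.25, .75]]])
pD = np.array([[.3, .7], [.4, .6]])       # p(D | C), indexed [c, d]

rng = np.random.default_rng(0)
a, b, c, d = 0, 0, 0, 0                   # initial assignment (0, 0, 0, 0)

def normalize(u):
    return u / u.sum()

for _ in range(5):                        # Gibbs sweeps in the order A, B, C, D
    a = rng.choice(2, p=normalize(pA * pC[:, b, c]))     # p(A) p(C=c | A, B=b)
    b = rng.choice(2, p=normalize(pB * pC[a, :, c]))     # p(B) p(C=c | A=a, B)
    c = rng.choice(2, p=normalize(pC[a, b] * pD[:, d]))  # p(C | a, b) p(D=d | C)
    d = rng.choice(2, p=pD[c])                           # p(D | C=c)
    print(a, b, c, d)                     # one new sample per sweep
```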
Gibbs Sampling
• Gibbs sampling forms a Markov chain
• The states of the chain are the assignments, and the probability of transitioning from an assignment $y$ to an assignment $z$ (updating in the order $1, \ldots, n$) is
$$p(z_1 \mid y_{V \setminus \{1\}})\, p(z_2 \mid y_{V \setminus \{1,2\}}, z_1) \cdots p(z_n \mid z_{V \setminus \{n\}})$$
• If there are no zero probability states, then the chain is irreducible and aperiodic (hence it converges)
• The stationary distribution is 푝(푥) – proof?
Gibbs Sampling
• Recall that it takes time to reach the steady state distribution from an arbitrary starting distribution
• The mixing time is the number of samples that it takes before the approximate distribution is close to the steady state distribution
– In practice, this can take 1000s of iterations (or more)
– We typically discard the samples generated during an initial period, called the burn-in phase, and only then begin collecting samples
Gibbs Sampling
• We can use Gibbs sampling for MRFs as well!
– We don’t need to compute the partition function to use it (why not? see the sketch after this list)
– Many “real” MRFs will have lots of zero probability assignments
• If you don’t start from an assignment with non-zero probability, the algorithm can get stuck (changing a single variable may not allow you to escape)
• Might not be possible to go between all possible non-zero assignments by only flipping one variable at a time
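To see why the partition function is never needed, here is a sketch of the cancellation in the conditional that Gibbs sampling draws from:

```latex
% The conditional used by Gibbs sampling, with p(x) = (1/Z) \prod_c \psi_c(x_c):
\[
p(x_j \mid x_{V \setminus \{j\}})
  = \frac{\frac{1}{Z} \prod_{c \in C} \psi_c(x_c)}
         {\sum_{x_j'} \frac{1}{Z} \prod_{c \in C} \psi_c(x_c)\big|_{x_j = x_j'}}
  \propto \prod_{c \,:\, j \in c} \psi_c(x_c)
\]
% The 1/Z cancels between numerator and denominator, and every factor
% \psi_c with j \notin c is constant in x_j and cancels as well.
```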
Metropolis-Hastings Algorithm
• This idea of choosing a transition probability from the current assignment to a new assignment can be generalized beyond the transition probabilities used in Gibbs sampling
• Pick some transition function 푞(푥′|푥) that depends on the current state 푥
– We would ideally want the probability of transitioning between any two non-zero probability states to be strictly positive
Metropolis-Hastings Algorithm
• Choose an initial assignment 푥
• Sample an assignment 푧 from the proposal distribution 푞(푥′|푥)
• Sample 푟 uniformly from [0,1]
• If $r < \min\left(1, \frac{p(z)\, q(x \mid z)}{p(x)\, q(z \mid x)}\right)$
– Set 푥 to 푧
• Else
– Leave 푥 unchanged
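A minimal sketch of this loop; the target is passed unnormalized (only the ratio $p(z)/p(x)$ is needed, so $Z$ cancels), and the standard-normal target with a random-walk proposal below is an illustrative choice, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_hastings(p_tilde, propose, q_ratio, x0, n_samples):
    """p_tilde: unnormalized target density (Z cancels in the ratio).
    propose(x): draw z ~ q(. | x); q_ratio(x, z) = q(x | z) / q(z | x)."""
    x = x0
    samples = []
    for _ in range(n_samples):
        z = propose(x)                 # sample z from the proposal q(x' | x)
        r = rng.uniform()              # sample r uniformly from [0, 1]
        # Accept with probability min(1, p(z) q(x|z) / (p(x) q(z|x))).
        if r < min(1.0, p_tilde(z) / p_tilde(x) * q_ratio(x, z)):
            x = z                      # set x to z
        samples.append(x)              # else x is left unchanged
    return samples

# Illustrative target/proposal: standard normal target with a symmetric
# random-walk proposal, for which q(x|z)/q(z|x) = 1.
samples = metropolis_hastings(
    p_tilde=lambda x: np.exp(-x**2 / 2),
    propose=lambda x: x + rng.normal(scale=0.5),
    q_ratio=lambda x, z: 1.0,
    x0=0.0,
    n_samples=10000,
)
print(np.mean(samples), np.var(samples))  # roughly 0 and 1 after burn-in
```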
Metropolis-Hastings Algorithm
• Choose an initial assignment 푥
• Sample an assignment 푧 from the proposal distribution 푞(푥′|푥)
• Sample 푟 uniformly from [0,1]
• If $r < \min\left(1, \frac{p(z)\, q(x \mid z)}{p(x)\, q(z \mid x)}\right)$
– Set $x$ to $z$
• Else
– Leave $x$ unchanged
• Note: $\frac{p(z)}{q(z \mid x)}$ and $\frac{p(x)}{q(x \mid z)}$ are like importance weights; the acceptance probability is then a function of the ratio of the importance of $z$ and the importance of $x$
Metropolis-Hastings Algorithm
• The Metropolis-Hastings algorithm produces a Markov chain that converges to $p(x)$ from any initial distribution (assuming the chain is irreducible and aperiodic)
• What are some choices for 푞(푥′|푥)?
– Use an importance sampling distribution
– Use a uniform distribution (like a random walk)
• Gibbs sampling is a special case of this algorithm in which the proposal resamples a single variable from its conditional distribution, so every proposed move is accepted (see the sketch below)
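A sketch of the check behind this last claim (writing $x_{-j}$ for $x_{V \setminus \{j\}}$): if the proposal resamples coordinate $j$ from its conditional, $q(z \mid x) = p(z_j \mid x_{-j})$ with $z_{-j} = x_{-j}$, then the acceptance ratio is always one.

```latex
\[
\frac{p(z)\, q(x \mid z)}{p(x)\, q(z \mid x)}
  = \frac{p(z_j \mid x_{-j})\, p(x_{-j}) \cdot p(x_j \mid x_{-j})}
         {p(x_j \mid x_{-j})\, p(x_{-j}) \cdot p(z_j \mid x_{-j})}
  = 1
\]
% so a Gibbs move is accepted with probability min(1, 1) = 1.
```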