CS 6347
Lecture 11
MCMC Sampling Methods

Last Time
• Sampling from discrete univariate distributions
– Rejection sampling
• To sample $p(y)$, draw samples $(x', y')$ from $p(x, y)$ and reject those with $y' \neq y$
– Importance sampling
• Introduce a proposal distribution $q(x)$ whose support contains the support of $p(x, y)$
• Sample from $q$ and reweight the samples to generate samples from $p$
Today
• We saw how to sample from Bayesian networks, but how do we sample from MRFs?
– Can’t even compute $p(x) = \frac{1}{Z} \prod_{c \in C} \psi_c(x_c)$ without knowing the partition function $Z$
– No well-defined ordering in the model
• To sample from MRFs, we will need fancier forms of sampling
– So-called Markov Chain Monte Carlo (MCMC) methods
Markov Chains
• A Markov chain is a sequence of random variables $X_1, \ldots, X_n \in S$ such that
$$p(x_{n+1} \mid x_1, \ldots, x_n) = p(x_{n+1} \mid x_n)$$
• The set $S$ is called the state space, and $p(X_{n+1} = b \mid X_n = a)$ is the probability of transitioning from state $a$ to state $b$ at step $n$
• Viewed as a Bayesian network or an MRF, the joint distribution over the first $n$ steps factorizes over a chain
Markov Chains
• When the probability of transitioning between two states does not depend on time, we call it a time-homogeneous Markov chain
– Represent it by an $|S| \times |S|$ transition matrix $P$
• $P_{ij} = p(X_{n+1} = j \mid X_n = i)$
• 푃 is a stochastic matrix (all rows sum to one)
– Draw it as a directed graph over the state space with an arrow from 푎 ∈ 푆 to 푏 ∈ 푆 labelled by the probability of transitioning from 푎 to 푏
Markov Chains
• Given some initial distribution over states $p(x_1)$
– Represent $p(x_1)$ as a length-$|S|$ row vector $\pi_1$
– The probability distribution after $n$ steps is then given by $\pi_{n+1} = \pi_1 P^n$
• Typically, we are interested in the long-term behavior (i.e., the state of the system when $n$ is large)
• In particular, we are interested in steady-state distributions $\mu$ such that $\mu = \mu P$
– A given chain may or may not converge to a steady state
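A minimal sketch of these two ideas in code: represent the initial distribution as a row vector and repeatedly apply $P$ to approximate the steady state. The transition matrix below is made up for illustration.

```python
import numpy as np

# A made-up 3-state transition matrix; each row sums to one.
P = np.array([
    [0.5, 0.25, 0.25],
    [0.2, 0.6,  0.2 ],
    [0.3, 0.3,  0.4 ],
])

pi = np.array([1.0, 0.0, 0.0])  # initial distribution pi_1

# Repeatedly apply P; for an irreducible, aperiodic chain this converges
# to the unique steady-state distribution mu satisfying mu = mu P.
for _ in range(100):
    pi = pi @ P

print(pi)                       # approximate steady state
print(np.allclose(pi, pi @ P))  # checks mu = mu P at convergence
```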
Markov Chains
• Theorem: every irreducible, aperiodic Markov chain converges to a unique steady state distribution independent of the initial distribution
– Irreducible: the directed graph of transitions is strongly connected (i.e., there is a directed path between every pair of nodes)
– Aperiodic: $p(X_n = i \mid X_1 = i) > 0$ for all large enough $n$
• If the state graph is strongly connected and there is a non-zero probability of remaining in any state, then the chain is irreducible and aperiodic
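For a small chain, both conditions can be checked mechanically. A sketch using Wielandt's bound: a chain is irreducible and aperiodic exactly when some power of $P$ is entrywise positive, and $P^{(n-1)^2+1}$ suffices to test.

```python
import numpy as np

def is_irreducible_aperiodic(P):
    """Check irreducibility + aperiodicity (primitivity) of a transition
    matrix via Wielandt's bound: P is primitive iff P^((n-1)^2 + 1) has
    all entries strictly positive."""
    n = P.shape[0]
    M = np.linalg.matrix_power(P, (n - 1) ** 2 + 1)
    return bool(np.all(M > 0))
```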
Detailed Balance
• Lemma: a vector of probabilities $\mu$ is a stationary distribution of the Markov chain with transition matrix $P$ if it satisfies detailed balance:
$$\mu_i P_{ij} = \mu_j P_{ji} \quad \text{for all } i, j$$
Proof:
$$(\mu P)_j = \sum_i \mu_i P_{ij} = \sum_i \mu_j P_{ji} = \mu_j \sum_i P_{ji} = \mu_j,$$
where the last step uses that the rows of $P$ sum to one. So, $\mu P = \mu$.
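A numerical sanity check of the lemma (the weights below are made up): a random walk on a weighted undirected graph, $P_{ij} = w_{ij} / \sum_j w_{ij}$, satisfies detailed balance with $\mu_i \propto \sum_j w_{ij}$.

```python
import numpy as np

# Made-up symmetric edge weights for a 3-node undirected graph.
w = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 1.0, 1.0],
    [0.5, 1.0, 3.0],
])
P = w / w.sum(axis=1, keepdims=True)  # random-walk transition matrix
mu = w.sum(axis=1) / w.sum()          # candidate stationary distribution

flux = mu[:, None] * P                # flux[i, j] = mu_i P_ij
print(np.allclose(flux, flux.T))      # detailed balance: mu_i P_ij = mu_j P_ji
print(np.allclose(mu @ P, mu))        # hence stationarity: mu P = mu
```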
MCMC Sampling
• Markov chain Monte Carlo sampling
– Construct a Markov chain where the stationary distribution is the distribution we want to sample from
– Use the Markov chain to generate samples from the distribution
– Use the same Monte Carlo estimation strategy as before
– Will let us sample conditional distributions easily as well!
Gibbs Sampling
• Choose an initial assignment $x^0$
• Fix an ordering of the variables (any order is fine)
• For each 푗 ∈ 푉 in order
– Draw a sample $z$ from $p(x_j \mid x_1^{t+1}, \ldots, x_{j-1}^{t+1}, x_{j+1}^t, \ldots, x_{|V|}^t)$
– Set $x_j^{t+1} = z$
• Set 푡 ← 푡 + 1 and repeat
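A minimal sketch of this loop for discrete variables; `conditional(j, x)` is a hypothetical user-supplied helper that returns the vector of probabilities $p(x_j \mid x_{V \setminus \{j\}})$ under the current assignment.

```python
import numpy as np

def gibbs(conditional, n_vars, n_samples, seed=0):
    """Generic Gibbs sampler over discrete variables.

    conditional(j, x) -> probability vector for x_j given the rest of x
    (hypothetical helper supplied by the caller)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n_vars, dtype=int)     # initial assignment x^0
    samples = []
    for _ in range(n_samples):
        for j in range(n_vars):         # fixed ordering of the variables
            probs = conditional(j, x)
            x[j] = rng.choice(len(probs), p=probs)  # draw z, set x_j = z
        samples.append(x.copy())        # record x^{t+1}; then t <- t + 1
    return samples
```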
Gibbs Sampling
• If $p(x) = \frac{1}{Z} \prod_C \psi_C(x_C)$, we can use the conditional independence assumptions to sample from $p(x_j \mid x_{N(j)})$
– This lets us exploit the graph structure for sampling
– For Bayesian networks, this reduces to $p(x_j \mid x_{MB(j)})$, where $MB(j)$ is $j$’s Markov blanket ($j$’s parents, children, and its children’s parents)
Gibbs Sampling
[Figure: Bayesian network with edges A → C, B → C, and C → D]

Order: A, B, C, D, A, B, C, D, …

A  P(A)      B  P(B)      C  D  P(D|C)
0  .3        0  .4        0  0  .3
1  .7        1  .6        0  1  .7
                          1  0  .4
                          1  1  .6

A  B  C  P(C|A,B)
0  0  0  .1
0  0  1  .9
0  1  0  .2
0  1  1  .8
1  0  0  .01
1  0  1  .99
1  1  0  .25
1  1  1  .75

Samples (A, B, C, D) so far: (0, 0, 0, 0)
(1) Sample from $p(x_A \mid x_B = 0, x_C = 0, x_D = 0)$. Using Bayes’ rule,
$p(x_A \mid x_B = 0, x_C = 0) \propto p(x_A)\, p(x_C = 0 \mid x_A, x_B = 0)$
$p(x_A = 0 \mid x_B = 0, x_C = 0) \propto .3 \cdot .1 = .03$
$p(x_A = 1 \mid x_B = 0, x_C = 0) \propto .7 \cdot .01 = .007$
Gibbs Sampling
[Same network, CPTs, and order as above. Samples so far: (0, 0, 0, 0); next sample in progress: A = 0]
(1) Sample from $p(x_A \mid x_B = 0, x_C = 0, x_D = 0)$. Using Bayes’ rule,
$p(x_A \mid x_B = 0, x_C = 0) \propto p(x_A)\, p(x_C = 0 \mid x_A, x_B = 0)$
$p(x_A = 0 \mid x_B = 0, x_C = 0) \propto .3 \cdot .1 \rightarrow .811$
$p(x_A = 1 \mid x_B = 0, x_C = 0) \propto .7 \cdot .01 \rightarrow .189$
Gibbs Sampling
[Same network, CPTs, and order as above. Samples so far: (0, 0, 0, 0); next sample in progress: A = 0]
(1) Sample from $p(x_B \mid x_A = 0, x_C = 0, x_D = 0)$. Using Bayes’ rule,
$p(x_B \mid x_A = 0, x_C = 0) \propto p(x_B)\, p(x_C = 0 \mid x_A = 0, x_B)$
$p(x_B = 0 \mid x_A = 0, x_C = 0) \propto .4 \cdot .1 = .04$
$p(x_B = 1 \mid x_A = 0, x_C = 0) \propto .6 \cdot .2 = .12$
Gibbs Sampling
[Same network, CPTs, and order as above. Samples so far: (0, 0, 0, 0); next sample in progress: A = 0, B = 1]
(1) Sample from $p(x_B \mid x_A = 0, x_C = 0, x_D = 0)$. Using Bayes’ rule,
$p(x_B \mid x_A = 0, x_C = 0) \propto p(x_B)\, p(x_C = 0 \mid x_A = 0, x_B)$
$p(x_B = 0 \mid x_A = 0, x_C = 0) \propto .4 \cdot .1 \rightarrow .25$
$p(x_B = 1 \mid x_A = 0, x_C = 0) \propto .6 \cdot .2 \rightarrow .75$
Gibbs Sampling
[Same network, CPTs, and order as above. Samples so far: (0, 0, 0, 0); next sample in progress: A = 0, B = 1]
(1) Sample from $p(x_C \mid x_A = 0, x_B = 1, x_D = 0)$. Using Bayes’ rule,
$p(x_C \mid x_A = 0, x_B = 1, x_D = 0) \propto p(x_C \mid x_A = 0, x_B = 1)\, p(x_D = 0 \mid x_C)$
$p(x_C = 0 \mid x_A = 0, x_B = 1, x_D = 0) \propto .2 \cdot .3 = .06$
$p(x_C = 1 \mid x_A = 0, x_B = 1, x_D = 0) \propto .8 \cdot .4 = .32$
Gibbs Sampling
[Same network, CPTs, and order as above. Samples so far: (0, 0, 0, 0); next sample in progress: A = 0, B = 1, C = 1]
(1) Sample from $p(x_C \mid x_A = 0, x_B = 1, x_D = 0)$. Using Bayes’ rule,
$p(x_C \mid x_A = 0, x_B = 1, x_D = 0) \propto p(x_C \mid x_A = 0, x_B = 1)\, p(x_D = 0 \mid x_C)$
$p(x_C = 0 \mid x_A = 0, x_B = 1, x_D = 0) \propto .2 \cdot .3 \rightarrow .158$
$p(x_C = 1 \mid x_A = 0, x_B = 1, x_D = 0) \propto .8 \cdot .4 \rightarrow .842$
Gibbs Sampling
[Same network, CPTs, and order as above. Samples so far: (0, 0, 0, 0); next sample in progress: A = 0, B = 1, C = 1]
(1) Sample from $p(x_D \mid x_C = 1)$
$p(x_D = 0 \mid x_C = 1) = .4$
$p(x_D = 1 \mid x_C = 1) = .6$
Gibbs Sampling
[Same network, CPTs, and order as above. Samples so far: (0, 0, 0, 0); next sample complete: (0, 1, 1, 0)]
(1) Sample from $p(x_D \mid x_C = 1)$
$p(x_D = 0 \mid x_C = 1) = .4$
$p(x_D = 1 \mid x_C = 1) = .6$
Gibbs Sampling
[Same network, CPTs, and order as above. Samples so far: (0, 0, 0, 0) and (0, 1, 1, 0)]
(2) Repeat the same process to generate the next sample
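A sketch of this worked example in code, using the CPTs from the slides; each conditional matches the computations above (e.g., $(.811, .189)$ for $A$ and $(.158, .842)$ for $C$).

```python
import numpy as np

# CPTs from the example network A -> C <- B, C -> D.
pA = np.array([.3, .7])
pB = np.array([.4, .6])
pC = np.array([[[.1, .9], [.2, .8]],      # p(C | A, B), indexed [a, b, c]
               [[.01, .99], [.25, .75]]])
pD = np.array([[.3, .7], [.4, .6]])       # p(D | C), indexed [c, d]

rng = np.random.default_rng(0)
a, b, c, d = 0, 0, 0, 0                   # initial assignment (0, 0, 0, 0)

def normalize(u):
    return u / u.sum()

for _ in range(5):                        # Gibbs sweeps in the order A, B, C, D
    a = rng.choice(2, p=normalize(pA * pC[:, b, c]))     # p(A) p(C=c | A, B=b)
    b = rng.choice(2, p=normalize(pB * pC[a, :, c]))     # p(B) p(C=c | A=a, B)
    c = rng.choice(2, p=normalize(pC[a, b] * pD[:, d]))  # p(C | a, b) p(D=d | C)
    d = rng.choice(2, p=pD[c])                           # p(D | C=c)
    print(a, b, c, d)                     # one new sample per sweep
```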
Gibbs Sampling
• Gibbs sampling forms a Markov chain
• The states of the chain are the assignments, and the probability of transitioning from an assignment $y$ to an assignment $z$ (updating in the order $1, \ldots, n$) is
$$p(z_1 \mid y_{V \setminus \{1\}})\, p(z_2 \mid y_{V \setminus \{1,2\}}, z_1) \cdots p(z_n \mid z_{V \setminus \{n\}})$$
• If there are no zero probability states, then the chain is irreducible and aperiodic (hence it converges)
• The stationary distribution is 푝(푥) – proof?
Gibbs Sampling
• Recall that it takes time to reach the steady state distribution from an arbitrary starting distribution
• The mixing time is the number of samples that it takes before the approximate distribution is close to the steady state distribution
– In practice, this can take 1000s of iterations (or more)
– We typically discard the samples generated during an initial period, called the burn-in phase, and only then begin collecting samples
Gibbs Sampling
• We can use Gibbs sampling for MRFs as well!
– We don’t need to compute the partition function to use it (why not? see the sketch after this list)
– Many “real” MRFs will have lots of zero probability assignments
• If you don’t start from an assignment with non-zero probability, the algorithm can get stuck (changing a single variable may not allow you to escape)
• Might not be possible to go between all possible non-zero assignments by only flipping one variable at a time
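To see why the partition function is never needed, here is a sketch of the cancellation in the conditional that Gibbs sampling draws from:

```latex
% The conditional used by Gibbs sampling, with p(x) = (1/Z) \prod_c \psi_c(x_c):
\[
p(x_j \mid x_{V \setminus \{j\}})
  = \frac{\frac{1}{Z} \prod_{c \in C} \psi_c(x_c)}
         {\sum_{x_j'} \frac{1}{Z} \prod_{c \in C} \psi_c(x_c)\big|_{x_j = x_j'}}
  \propto \prod_{c \,:\, j \in c} \psi_c(x_c)
\]
% The 1/Z cancels between numerator and denominator, and every factor
% \psi_c with j \notin c is constant in x_j and cancels as well.
```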
Metropolis-Hastings Algorithm
• This idea of choosing a transition probability from the current assignment to a new assignment can be generalized beyond the transition probabilities used in Gibbs sampling
• Pick some transition function 푞(푥′|푥) that depends on the current state 푥
– We would ideally want the probability of transitioning between any two non-zero probability states to be strictly positive
Metropolis-Hastings Algorithm
• Choose an initial assignment 푥
• Sample an assignment 푧 from the proposal distribution 푞(푥′|푥)
• Sample 푟 uniformly from [0,1]
• If $r < \min\left(1, \frac{p(z)\, q(x \mid z)}{p(x)\, q(z \mid x)}\right)$
– Set 푥 to 푧
• Else
– Leave 푥 unchanged
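A minimal sketch of this loop; the target is passed unnormalized (only the ratio $p(z)/p(x)$ is needed, so $Z$ cancels), and the standard-normal target with a random-walk proposal below is an illustrative choice, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_hastings(p_tilde, propose, q_ratio, x0, n_samples):
    """p_tilde: unnormalized target density (Z cancels in the ratio).
    propose(x): draw z ~ q(. | x); q_ratio(x, z) = q(x | z) / q(z | x)."""
    x = x0
    samples = []
    for _ in range(n_samples):
        z = propose(x)                 # sample z from the proposal q(x' | x)
        r = rng.uniform()              # sample r uniformly from [0, 1]
        # Accept with probability min(1, p(z) q(x|z) / (p(x) q(z|x))).
        if r < min(1.0, p_tilde(z) / p_tilde(x) * q_ratio(x, z)):
            x = z                      # set x to z
        samples.append(x)              # else x is left unchanged
    return samples

# Illustrative target/proposal: standard normal target with a symmetric
# random-walk proposal, for which q(x|z)/q(z|x) = 1.
samples = metropolis_hastings(
    p_tilde=lambda x: np.exp(-x**2 / 2),
    propose=lambda x: x + rng.normal(scale=0.5),
    q_ratio=lambda x, z: 1.0,
    x0=0.0,
    n_samples=10000,
)
print(np.mean(samples), np.var(samples))  # roughly 0 and 1 after burn-in
```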
Metropolis-Hastings Algorithm
• Choose an initial assignment 푥
• Sample an assignment 푧 from the proposal distribution 푞(푥′|푥)
• Sample 푟 uniformly from [0,1]
• If $r < \min\left(1, \frac{p(z)\, q(x \mid z)}{p(x)\, q(z \mid x)}\right)$
– Set $x$ to $z$
• Else
– Leave $x$ unchanged
• Note: $\frac{p(z)}{q(z \mid x)}$ and $\frac{p(x)}{q(x \mid z)}$ are like importance weights; the acceptance probability is then a function of the ratio of the importance of $z$ and the importance of $x$
Metropolis-Hastings Algorithm
• The Metropolis-Hastings algorithm produces a Markov chain that converges to $p(x)$ from any initial distribution (assuming the chain is irreducible and aperiodic)
• What are some choices for 푞(푥′|푥)?
– Use an importance sampling distribution
– Use a uniform distribution (like a random walk)
• Gibbs sampling is a special case of this algorithm in which the proposal resamples a single variable from its conditional distribution, so every proposed move is accepted (see the sketch below)
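A sketch of the check behind this last claim (writing $x_{-j}$ for $x_{V \setminus \{j\}}$): if the proposal resamples coordinate $j$ from its conditional, $q(z \mid x) = p(z_j \mid x_{-j})$ with $z_{-j} = x_{-j}$, then the acceptance ratio is always one.

```latex
\[
\frac{p(z)\, q(x \mid z)}{p(x)\, q(z \mid x)}
  = \frac{p(z_j \mid x_{-j})\, p(x_{-j}) \cdot p(x_j \mid x_{-j})}
         {p(x_j \mid x_{-j})\, p(x_{-j}) \cdot p(z_j \mid x_{-j})}
  = 1
\]
% so a Gibbs move is accepted with probability min(1, 1) = 1.
```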