
INTRODUCTION TO MONTE CARLO METHODS

D. J. C. MACKAY

Department of Physics, Cambridge University
Cavendish Laboratory, Madingley Road,
Cambridge CB3 0HE, United Kingdom

ABSTRACT

This chapter describes a sequence of Monte Carlo methods: importance sampling, rejection sampling, the Metropolis method, and Gibbs sampling. For each method, we discuss whether the method is expected to be useful for high-dimensional problems such as arise in inference with graphical models. After the methods have been described, the terminology of Markov chain Monte Carlo methods is presented. The chapter concludes with a discussion of advanced methods, including methods for reducing random walk behaviour.

For details of Monte Carlo methods, theorems and proofs and a full list of references, the reader is directed to Neal (1993), Gilks, Richardson and Spiegelhalter (1996), and Tanner (1996).

The problems to be solved

The aims of Monte Carlo methods are to solve one or both of the following problems.

Problem 1: to generate samples {x^(r)}, r = 1, ..., R, from a given probability distribution P(x).¹

Problem 2: to estimate expectations of functions under this distribution, for example

    Φ = ⟨φ(x)⟩ ≡ ∫ d^N x  P(x) φ(x).

¹ Please note that I will use the word "sample" in the following sense: a sample from a distribution P(x) is a single realization x whose probability distribution is P(x). This contrasts with the alternative usage in statistics, where "sample" refers to a collection of realizations {x}.

The probability distribution P(x), which we will call the target density, might be a distribution from statistical physics or a conditional distribution arising in data modelling, for example the posterior probability of a model's parameters given some observed data. We will generally assume that x is an N-dimensional vector with real components x_n, but we will sometimes consider discrete spaces also.

We will concentrate on the first problem (sampling), because if we have solved it, then we can solve the second problem by using the random samples {x^(r)}, r = 1, ..., R, to give the estimator

    Φ̂ ≡ (1/R) Σ_r φ(x^(r)).

Clearly if the vectors {x^(r)} are generated from P(x), then the expectation of Φ̂ is Φ. Also, as the number of samples R increases, the variance of Φ̂ will decrease as σ²/R, where σ² is the variance of φ,

    σ² = ∫ d^N x  P(x) (φ(x) − Φ)².

This is one of the important properties of Monte Carlo methods.

The accuracy of the Monte Carlo estimate above is independent of the dimensionality of the space sampled. To be precise, the variance of Φ̂ goes as σ²/R. So, regardless of the dimensionality of x, it may be that as few as a dozen independent samples {x^(r)} suffice to estimate Φ satisfactorily.
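For illustration, the following short Octave sketch checks this 1/R behaviour in a toy case where we can sample from P(x) directly; the choice of a unit Gaussian target, the function φ(x) = x² and the loop sizes are all arbitrary choices made for the demonstration, not part of the argument above.

    % Octave sketch (illustration): the variance of Phi_hat goes as sigma^2/R.
    % Target P(x) = Normal(0,1), phi(x) = x^2, so Phi = 1 and sigma^2 = 2.
    phi = @(x) x.^2;
    for R = [10 100 1000 10000]
      estimates = zeros(1, 100);
      for k = 1:100
        x = randn(1, R);               % R independent samples from P(x)
        estimates(k) = mean(phi(x));   % the simple estimator Phi_hat
      endfor
      printf("R = %5d   empirical std of Phi_hat = %.3f   sigma/sqrt(R) = %.3f\n", ...
             R, std(estimates), sqrt(2)/sqrt(R));
    endfor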

We will find later, however, that high dimensionality can cause other difficulties for Monte Carlo methods. Obtaining independent samples from a given distribution P(x) is often not easy.

WHY IS SAMPLING FROM P(x) HARD?

We will assume that the density from which we wish to draw samples, P(x), can be evaluated, at least to within a multiplicative constant; that is, we can evaluate a function P*(x) such that

    P(x) = P*(x)/Z.

If we can evaluate P*(x), why can we not easily solve problem 1? Why is it, in general, difficult to obtain samples from P(x)? There are two difficulties. The first is that we typically do not know the normalizing constant

    Z = ∫ d^N x  P*(x).

Figure 1. (a) The function P*(x) = exp[0.4(x − 0.4)² − 0.08 x⁴]. How to draw samples from this density? (b) The function P*(x) evaluated at a discrete set of uniformly spaced points {x_i}. How to draw samples from this discrete distribution?

The second is that, even if we did know Z, the problem of drawing samples from P(x) is still a challenging one, especially in high-dimensional spaces. There are only a few high-dimensional densities from which it is easy to draw samples, for example the Gaussian distribution.²

Let us start from a simple one-dimensional example. Imagine that we wish to draw samples from the density P(x) = P*(x)/Z, where

    P*(x) = exp[ 0.4(x − 0.4)² − 0.08 x⁴ ].

We can plot this function (figure 1a). But that does not mean we can draw samples from it. To give ourselves a simpler problem, we could discretize the variable x and ask for samples from the discrete probability distribution over a set of uniformly spaced points {x_i} (figure 1b). How could we solve this problem? If we evaluate p*_i = P*(x_i) at each point x_i, we can compute

    Z = Σ_i p*_i

and

    p_i = p*_i / Z,

and we can then sample from the probability distribution {p_i} using various methods based on a source of random bits. But what is the cost of this procedure, and how does it scale with the dimensionality of the space, N?

² A sample from a univariate Gaussian can be generated by computing cos(2πu₁) √(2 ln(1/u₂)), where u₁ and u₂ are uniformly distributed in (0, 1).
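For illustration, a minimal Octave sketch of the discretized procedure just described: it evaluates p*_i on a uniform grid, normalizes, and then draws samples by inverting the cumulative distribution with uniform random numbers; the grid range, spacing and sample count are arbitrary choices.

    % Octave sketch (illustration): sampling from the discretized density of figure 1b.
    x  = -4:0.1:4;                               % uniformly spaced points {x_i}
    ps = exp(0.4*(x - 0.4).^2 - 0.08*x.^4);      % p*_i = P*(x_i), unnormalized
    Z  = sum(ps);                                % normalizing constant over the grid
    p  = ps / Z;                                 % p_i
    F  = cumsum(p);                              % cumulative distribution
    R  = 1000;                                   % number of samples to draw
    samples = zeros(1, R);
    for r = 1:R
      u = rand();                                % a uniform random number
      samples(r) = x(find(F >= u, 1));           % smallest x_i with F_i >= u
    endfor
    hist(samples, x);                            % histogram should resemble figure 1b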

Let us concentrate on the initial cost of evaluating Z. To compute Z we have to visit every point in the space. In figure 1b there are 50 uniformly spaced points in one dimension. If our system had N dimensions, N = 1000 say, then the corresponding number of points would be 50^1000, an unimaginable number of evaluations of P*. Even if each component x_n only took two discrete values, the number of evaluations of P* would be 2^1000, a number that is still horribly huge, equal to the fourth power of the number of particles in the universe.

One system with 2^1000 states is a collection of 1000 spins, for example a fragment of an Ising model (or Boltzmann machine, or Markov field) (Yeomans 1992) whose probability distribution is proportional to

    P*(x) = exp[ −βE(x) ]

where x_n ∈ {−1, +1} and

    E(x) = −[ ½ Σ_{m,n} J_{mn} x_m x_n + Σ_n H_n x_n ].

The energy function E(x) is readily evaluated for any x. But if we wish to evaluate this function at all states x, the computer time required would be 2^1000 function evaluations.

The Ising model is a simple model which has been around for a long time, but the task of generating samples from the distribution P(x) = P*(x)/Z is still an active research area, as evidenced by the work of Propp and Wilson (1996).

UNIFORM SAMPLING

Having agreed that we cannot visit every location x in the state space, we might consider trying to solve the second problem (estimating the expectation of a function φ(x)) by drawing random samples {x^(r)}, r = 1, ..., R, uniformly from the state space and evaluating P*(x) at those points. Then we could introduce Z_R, defined by

    Z_R = Σ_{r=1}^{R} P*(x^(r)),

and estimate Φ = ∫ d^N x φ(x) P(x) by

    Φ̂ = Σ_{r=1}^{R} φ(x^(r)) P*(x^(r)) / Z_R.

Is anything wrong with this strategy? Well, it depends on the functions φ(x) and P*(x). Let us assume that φ(x) is a benign, smoothly varying function and concentrate on the nature of P*(x). A high-dimensional distribution is often concentrated in a small region of the state space known as its typical set T, whose volume is given by |T| ≃ 2^{H(X)}, where H(X) is the Shannon-Gibbs entropy of the probability distribution P(x),

    H(X) = Σ_x P(x) log₂ (1/P(x)).

If almost all the probability mass is located in the typical set and φ(x) is a benign function, the value of Φ = ∫ d^N x φ(x) P(x) will be principally determined by the values that φ(x) takes on in the typical set. So uniform sampling will only stand a chance of giving a good estimate of Φ if we make the number of samples R sufficiently large that we are likely to hit the typical set a number of times. So how many samples are required? Let us take the case of the Ising model again. The total size of the state space is 2^N states, and the typical set has size 2^H. So each sample has a chance of 2^{H−N} of falling in the typical set. The number of samples required to hit the typical set once is thus of order

    R_min ≃ 2^{N−H}.

So what is H? At high temperatures, the probability distribution of an Ising model tends to a uniform distribution and the entropy tends to H_max = N bits, so R_min is of order 1. Under these conditions, uniform sampling may well be a satisfactory technique for estimating Φ. But high temperatures are not of great interest. Considerably more interesting are intermediate temperatures such as the critical temperature at which the Ising model melts from an ordered phase to a disordered phase. At this temperature the entropy of an Ising model is roughly N/2 bits. For this probability distribution the number of samples required simply to hit the typical set once is of order

    R_min ≃ 2^{N − N/2} = 2^{N/2},

which for N = 1000 is about 10^150. This is roughly the square of the number of particles in the universe. Thus uniform sampling is utterly useless for the study of Ising models of modest size. And in most high-dimensional problems, if the distribution P(x) is not actually uniform, uniform sampling is unlikely to be useful.

OVERVIEW

Having established that drawing samples from a high-dimensional distribution P(x) = P*(x)/Z is difficult even if P*(x) is easy to evaluate, we will now study a sequence of Monte Carlo methods: importance sampling, rejection sampling, the Metropolis method, and Gibbs sampling.

Figure 2. Functions involved in importance sampling. We wish to estimate the expectation of φ(x) under P(x) ∝ P*(x). We can generate samples from the simpler distribution Q(x) ∝ Q*(x). We can evaluate Q*(x) and P*(x) at any point.

Importance sampling

Importance sampling is not a method for generating samples from P(x) (problem 1); it is just a method for estimating the expectation of a function φ(x) (problem 2). It can be viewed as a generalization of the uniform sampling method.

For illustrative purposes, let us imagine that the target distribution is a one-dimensional density P(x). It is assumed that we are able to evaluate this density, at least to within a multiplicative constant; thus we can evaluate a function P*(x) such that

    P(x) = P*(x)/Z.

But P(x) is too complicated a function for us to be able to sample from it directly. We now assume that we have a simpler density Q(x) which we can evaluate to within a multiplicative constant (that is, we can evaluate Q*(x), where Q(x) = Q*(x)/Z_Q), and from which we can generate samples. An example of the functions P*, Q* and φ is shown in figure 2. We call Q the sampler density.

In importance sampling, we generate R samples {x^(r)}, r = 1, ..., R, from Q(x). If these points were samples from P(x), then we could estimate Φ by the simple estimator Φ̂ given earlier.

Figure 3. Importance sampling in action: (a) using a Gaussian sampler density; (b) using a Cauchy sampler density. The horizontal axis shows the number of samples on a log scale; the vertical axis shows the estimate Φ̂. The horizontal line indicates the true value of Φ.

But when we generate samples from Q, values of x where Q(x) is greater than P(x) will be over-represented in this estimator, and points where Q(x) is less than P(x) will be under-represented. To take into account the fact that we have sampled from the wrong distribution, we introduce weights

    w_r ≡ P*(x^(r)) / Q*(x^(r))

which we use to adjust the importance of each point in our estimator thus:

    Φ̂ ≡ Σ_r w_r φ(x^(r)) / Σ_r w_r.

If Q(x) is non-zero for all x where P(x) is non-zero, it can be proved that the estimator Φ̂ converges to Φ, the mean value of φ(x), as R increases.
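For illustration, the following Octave sketch applies this weighted estimator to the one-dimensional density of figure 1; the unit Gaussian sampler, the function φ(x) = x and the sample size are arbitrary choices made for the demonstration.

    % Octave sketch (illustration): importance sampling for the density of figure 1.
    Pstar = @(x) exp(0.4*(x - 0.4).^2 - 0.08*x.^4);   % target, known up to a constant
    Qstar = @(x) exp(-x.^2/2);                        % unit Gaussian sampler, unnormalized
    phi   = @(x) x;                                   % function whose expectation we want
    R = 100000;
    x = randn(1, R);                                  % R samples from Q
    w = Pstar(x) ./ Qstar(x);                         % importance weights w_r = P*/Q*
    Phi_hat = sum(w .* phi(x)) / sum(w);              % weighted estimator
    printf("importance sampling estimate of <x>: %.3f\n", Phi_hat);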

A practical difficulty with importance sampling is that it is hard to estimate how reliable the estimator Φ̂ is. The variance of Φ̂ is hard to estimate, because the empirical variances of the quantities w_r and w_r φ(x^(r)) are not necessarily a good guide to the true variances of the numerator and denominator in the estimator above. If the proposal density Q(x) is small in a region where |φ(x)P*(x)| is large, then it is quite possible, even after many points x^(r) have been generated, that none of them will have fallen in that region. This leads to an estimate of Φ that is drastically wrong, and no indication in the empirical variance that the true variance of the estimator is large.

CAUTIONARY ILLUSTRATION OF IMPORTANCE SAMPLING

In a toy problem related to the modelling of amino acid probability distributions, with a one-dimensional variable x, I evaluated a quantity of interest using importance sampling. The results using a Gaussian sampler and a Cauchy sampler are shown in figure 3. The horizontal axis shows the number of samples on a log scale. In the case of the Gaussian sampler, after about 500 samples had been evaluated one might be tempted to call a halt; but evidently there are infrequent samples that make a huge contribution to Φ̂, and the value of the estimate at 500 samples is wrong. Even after a million samples have been taken, the estimate has still not settled down close to the true value. In contrast, the Cauchy sampler does not suffer from glitches and converges, on the scale shown here, after about 5000 samples.

This example illustrates the fact that an importance sampler should have heavy tails.

IMPORTANCE SAMPLING IN MANY DIMENSIONS

We have already observed that care is needed in one-dimensional importance sampling problems. Is importance sampling a useful technique in spaces of higher dimensionality, say N = 1000?

Consider a simple case-study where the target density P(x) is a uniform distribution inside a sphere,

    P*(x) = 1   for ρ(x) ≤ R_P,
    P*(x) = 0   for ρ(x) > R_P,

where ρ(x) ≡ (Σ_i x_i²)^{1/2}, and the proposal density is a Gaussian centred on the origin,

    Q(x) = Π_i Normal(x_i; 0, σ²).

An importance sampling method will be in trouble if the estimator Φ̂ is dominated by a few large weights w_r. What will be the typical range of values of the weights w_r? By the central limit theorem, if ρ is the distance from the origin of a sample from Q, the quantity ρ² has a roughly Gaussian distribution with mean and standard deviation

    ρ² ∼ N σ² ± √(2N) σ².

Thus almost all samples from Q lie in a typical set with distance from the origin very close to √N σ. Let us assume that σ is chosen such that the typical set of Q lies inside the sphere of radius R_P. (If it does not, then the law of large numbers implies that almost all the samples generated from Q will fall outside R_P and will have weight zero.) Then we know that most samples from Q will have a value of Q that lies in the range

    (2πσ²)^{−N/2} exp( −N/2 ± √(2N)/2 ).

Thus the weights w_r = P*/Q will typically have values in the range

    (2πσ²)^{N/2} exp( N/2 ± √(2N)/2 ).

So if we draw a hundred samples, what will the typical range of weights be? We can roughly estimate the ratio of the largest weight to the median weight by doubling the standard deviation in the expression above. The largest weight and the median weight will typically be in the ratio

    w_r^max / w_r^med = exp( √(2N) ).

In N = 1000 dimensions, therefore, the largest weight after one hundred samples is likely to be roughly 10^19 times greater than the median weight. Thus an importance sampling estimate for a high-dimensional problem will very likely be utterly dominated by a few samples with huge weights.
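For illustration, a short Octave sketch of this weight spread: it draws one hundred samples from the Gaussian Q in N = 1000 dimensions and reports the ratio of the largest to the median weight; the values of σ and R_P below are arbitrary choices that keep the typical set of Q inside the sphere.

    % Octave sketch (illustration): spread of importance weights in N = 1000 dimensions.
    N = 1000; sigma = 1; R_P = 2*sqrt(N)*sigma;   % sphere comfortably contains Q's typical set
    R = 100;                                      % draw one hundred samples
    logw = zeros(1, R);
    for r = 1:R
      x = sigma * randn(N, 1);                    % a sample from Q
      rho2 = sum(x.^2);
      if sqrt(rho2) <= R_P                        % inside the sphere: P* = 1, so w = 1/Q(x)
        logw(r) = rho2 / (2*sigma^2);             % log weight, up to an additive constant
      else
        logw(r) = -Inf;                           % outside the sphere: weight zero
      endif
    endfor
    printf("largest weight / median weight = exp(%.1f) = %.2e\n", ...
           max(logw) - median(logw), exp(max(logw) - median(logw)));
    printf("rough estimate from the text: exp(sqrt(2N)) = %.2e\n", exp(sqrt(2*N)));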

In conclusion, importance sampling in high dimensions often suffers from two difficulties. First, we clearly need to obtain samples that lie in the typical set of P, and this may take a long time unless Q is a good approximation to P. Second, even if we obtain samples in the typical set, the weights associated with those samples are likely to vary by large factors, because the probabilities of points in a typical set, although similar to each other, still differ by factors of order exp(√N).

Rejection sampling

We assume again a one-dimensional density P(x) = P*(x)/Z that is too complicated a function for us to be able to sample from it directly. We assume that we have a simpler proposal density Q(x) which we can evaluate (within a multiplicative factor Z_Q, as before), and from which we can generate samples. We further assume that we know the value of a constant c such that

    c Q*(x) > P*(x)   for all x.

A schematic picture of the two functions is shown in figure 4a.

Figure 4. Rejection sampling. (a) The functions involved in rejection sampling. We desire samples from P(x) ∝ P*(x). We are able to draw samples from Q(x) ∝ Q*(x), and we know a value c such that c Q*(x) > P*(x) for all x. (b) A point (x, u) is generated at random in the lightly shaded area under the curve c Q*(x). If this point also lies below P*(x), then it is accepted.

We generate two random numbers. The first, x, is generated from the proposal density Q(x). We then evaluate c Q*(x) and generate a uniformly distributed random variable u from the interval [0, c Q*(x)]. These two random numbers can be viewed as selecting a point in the two-dimensional plane, as shown in figure 4b.

We now evaluate P*(x) and accept or reject the sample x by comparing the value of u with the value of P*(x). If u > P*(x) then x is rejected; otherwise it is accepted, which means that we add x to our set of samples {x^(r)}. (The value of u is discarded.)

Why does this procedure generate samples from P(x)? The proposed point (x, u) comes with uniform probability from the lightly shaded area underneath the curve c Q*(x), as shown in figure 4b. The rejection rule rejects all the points that lie above the curve P*(x). So the points (x, u) that are accepted are uniformly distributed in the heavily shaded area under P*(x). This implies that the probability density of the x-coordinates of the accepted points must be proportional to P*(x), so the samples must be independent samples from P(x).

Rejection sampling will work best if Q is a good approximation to P. If Q is very different from P, then c will necessarily have to be large and the frequency of rejection will be large.
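For illustration, an Octave sketch of rejection sampling for the density of figure 1, using a standard Cauchy proposal; the constant c is estimated numerically on a grid and inflated by a safety factor, so this bound is an assumption of the sketch rather than a proof, and the sample count is an arbitrary choice.

    % Octave sketch (illustration): rejection sampling for the density of figure 1.
    Pstar = @(x) exp(0.4*(x - 0.4).^2 - 0.08*x.^4);   % target, known up to a constant
    Qstar = @(x) 1 ./ (pi*(1 + x.^2));                % standard Cauchy proposal density
    grid = -10:0.001:10;
    c = 1.5 * max(Pstar(grid) ./ Qstar(grid));        % numerical bound so that c Q*(x) > P*(x)
    samples = [];
    while numel(samples) < 1000
      x = tan(pi*(rand() - 0.5));                     % a sample from the Cauchy proposal
      u = c * Qstar(x) * rand();                      % uniform on [0, c Q*(x)]
      if u <= Pstar(x)                                % accept if the point lies below P*(x)
        samples(end+1) = x;
      endif
    endwhile
    printf("collected %d samples with c = %.1f\n", numel(samples), c);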

REJECTION SAMPLING IN MANY DIMENSIONS

In a high-dimensional problem it is very likely that the requirement that c Q* be an upper bound for P* will force c to be so huge that acceptances will be very rare indeed. Finding such a value of c may be difficult too, since in many problems we do not know beforehand where the modes of P* are located or how high they are.

Figure 5. A Gaussian P(x) and a slightly broader Gaussian Q(x) scaled up by a factor c such that c Q(x) ≥ P(x).

As a case study, consider a pair of N-dimensional Gaussian distributions with mean zero (figure 5). Imagine generating samples from one with standard deviation σ_Q and using rejection sampling to obtain samples from the other, whose standard deviation is σ_P. Let us assume that these two standard deviations are close in value, say σ_Q is one percent larger than σ_P. (σ_Q must be larger than σ_P, because if this is not the case there is no c such that cQ upper-bounds P for all x.) So, what is the value of c if the dimensionality is N = 1000? The density of Q(x) at the origin is 1/(2πσ_Q²)^{N/2}, so for cQ to upper-bound P we need to set

    c = (2πσ_Q²)^{N/2} / (2πσ_P²)^{N/2} = exp( N ln(σ_Q/σ_P) ).

With N = 1000 and σ_Q/σ_P = 1.01, we find c = exp(10) ≈ 20,000. What will the rejection rate be for this value of c? The answer is immediate: since the acceptance rate is the ratio of the volume under the curve P(x) to the volume under c Q(x), the fact that P and Q are both normalized implies that the acceptance rate will be 1/c; for our case study this is 1/20,000. In general, c grows exponentially with the dimensionality N.

Rejection sampling, therefore, whilst a useful method for one-dimensional problems, is not a practical technique for generating samples from high-dimensional distributions P(x).

The Metropolis method

Importance sampling and rejection sampling only work well if the proposal density Q(x) is similar to P(x). In large and complex problems it is difficult to create a single density Q(x) that has this property.

Figure 6. Metropolis method in one dimension. The proposal distribution Q(x'; x) is here shown as having a shape that changes as x changes, though this is not typical of the proposal densities used in practice.

The Metropolis algorithm instead makes use of a proposal density Q which depends on the current state x^(t). The density Q(x'; x^(t)) might in the simplest case be a simple distribution such as a Gaussian centred on the current x^(t). The proposal density Q(x'; x) can be any fixed density. It is not necessary for Q(x'; x^(t)) to look at all similar to P(x). An example of a proposal density is shown in figure 6; this figure shows the density Q(x'; x^(t)) for two different states x^(1) and x^(2).

As before, we assume that we can evaluate P*(x) for any x. A tentative new state x' is generated from the proposal density Q(x'; x^(t)). To decide whether to accept the new state, we compute the quantity

    a = [ P*(x') / P*(x^(t)) ] × [ Q(x^(t); x') / Q(x'; x^(t)) ].

If a ≥ 1 then the new state is accepted. Otherwise, the new state is accepted with probability a.

If the step is accepted, we set x^(t+1) = x'. If the step is rejected, then we set x^(t+1) = x^(t). Note the difference from rejection sampling: in rejection sampling, rejected points are discarded and have no influence on the list of samples {x^(r)} that we collect. Here, a rejection causes the current state to be written onto the list of points another time.

Notation: I have used the superscript r = 1, ..., R to label points that are independent samples from a distribution, and the superscript t = 1, ..., T to label the sequence of states in a Markov chain. It is important to note that a Metropolis simulation of T iterations does not produce T independent samples from the target distribution P. The samples are correlated.

To compute the acceptance probability, we need to be able to compute the probability ratios P*(x')/P*(x^(t)) and Q(x^(t); x')/Q(x'; x^(t)). If the proposal density is a simple symmetrical density, such as a Gaussian centred on the current point, then the latter factor is unity, and the Metropolis method simply involves comparing the value of the target density at the two points. The general algorithm for asymmetric Q, given above, is often called the Metropolis-Hastings algorithm.

Figure 7. Metropolis method in two dimensions, showing a traditional proposal density that has a sufficiently small step size ε that the acceptance frequency will be about 0.5.
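For illustration, an Octave sketch of this symmetric-proposal Metropolis method applied to the one-dimensional density of figure 1; the step size, starting state and run length are arbitrary choices.

    % Octave sketch (illustration): random-walk Metropolis for the density of figure 1.
    Pstar = @(x) exp(0.4*(x - 0.4).^2 - 0.08*x.^4);   % target, known up to a constant
    T = 10000; epsilon = 1.0;                         % iterations and proposal width
    x = zeros(1, T); x(1) = 0;                        % starting state
    for t = 1:T-1
      xprime = x(t) + epsilon*randn();                % symmetric Gaussian proposal
      a = Pstar(xprime) / Pstar(x(t));                % acceptance ratio (Q factor is unity)
      if rand() < a                                   % accept with probability min(1, a)
        x(t+1) = xprime;                              % accepted: move to the new state
      else
        x(t+1) = x(t);                                % rejected: record the current state again
      endif
    endfor
    hist(x, 50);                                      % occupancy should resemble P(x)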

It can be shown that, for any positive Q (that is, any Q such that Q(x'; x) > 0 for all x, x'), as t → ∞ the probability distribution of x^(t) tends to P(x) = P*(x)/Z. (This statement should not be seen as implying that Q has to assign positive probability to every point x'; we will discuss examples later where Q(x'; x) = 0 for some x, x'. Notice also that we have said nothing about how rapidly the convergence to P(x) takes place.)

The Metropolis method is an example of a Markov chain Monte Carlo method (abbreviated MCMC). In contrast to rejection sampling, where the accepted points {x^(r)} are independent samples from the desired distribution, Markov chain Monte Carlo methods involve a Markov process in which a sequence of states {x^(t)} is generated, each sample x^(t) having a probability distribution that depends on the previous value x^(t−1). Since successive samples are correlated with each other, the Markov chain may have to be run for a considerable time in order to generate samples that are effectively independent samples from P.

Just as it was difficult to estimate the variance of an importance sampling estimator, so it is difficult to assess whether a Markov chain Monte Carlo method has converged, and to quantify how long one has to wait to obtain samples that are effectively independent samples from P.

DEMONSTRATION OF THE METROPOLIS METHOD

The Metropolis method is widely used for high-dimensional problems. Many implementations of the Metropolis method employ a proposal distribution with a length scale ε that is short relative to the length scale L of the probable region (figure 7). A reason for choosing a small length scale is that, for most high-dimensional problems, a large random step from a typical point (that is, a sample from P(x)) is very likely to end in a state which has very low probability; such steps are unlikely to be accepted. If ε is large, movement around the state space will only occur when a transition to a state which has very low probability is actually accepted, or when a large random step chances to land in another probable state. So the rate of progress will be slow unless small steps are used.

The disadvantage of small steps, on the other hand, is that the Metropolis method will explore the probability distribution by a random walk, and random walks take a long time to get anywhere. Consider a one-dimensional random walk, for example, on each step of which the state moves randomly to the left or to the right with equal probability. After T steps of size ε, the state is only likely to have moved a distance about √T ε. Recall that the first aim of Monte Carlo sampling is to generate a number of independent samples from the given distribution (a dozen, say). If the largest length scale of the state space is L, then we have to simulate a random-walk Metropolis method for a time T ≃ (L/ε)² before we can expect to get a sample that is roughly independent of the initial condition, and that is assuming that every step is accepted; if only a fraction f of the steps are accepted on average, then this time is increased by a factor 1/f.

Rule of thumb: lower bound on number of iterations of a Metropolis method. If the largest length scale of the space of probable states is L, a Metropolis method whose proposal distribution generates a random walk with step size ε must be run for at least T ≃ (L/ε)² iterations to obtain an independent sample.

This rule of thumb only gives a lower bound; the situation may be much worse if, for example, the probability distribution consists of several islands of high probability separated by regions of low probability.

To illustrate how slow the exploration of a state space by random walk is, figure 8 shows a simulation of a Metropolis algorithm for generating samples from the distribution

    P(x) = 1/21   for x ∈ {0, 1, 2, ..., 20},
    P(x) = 0      otherwise.

Figure 8. Metropolis method for a toy problem. (a) The state sequence as a function of time (horizontal direction: the states from 0 to 20; vertical direction: time; the cross bars mark regular time intervals). (b) Histogram of occupancy of the states after 100, 400 and 1200 iterations. (c) For comparison, histograms resulting when successive points are drawn independently from the target distribution.

The proposal distribution is

    Q(x'; x) = 1/2   for x' = x ± 1,
    Q(x'; x) = 0     otherwise.

Because the target distribution P(x) is uniform, rejections will occur only when the proposal takes the state to x' = −1 or x' = 21.
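For illustration, a short Octave sketch of this toy simulation (the run length here is an arbitrary choice, taken to match the longest histogram of figure 8):

    % Octave sketch (illustration): the toy Metropolis simulation described above.
    % Target: uniform on {0, ..., 20}; proposal: x' = x +/- 1 with probability 1/2 each.
    T = 1200;
    x = zeros(1, T); x(1) = 10;                   % start in the middle of the state space
    for t = 1:T-1
      xprime = x(t) + 2*(rand() > 0.5) - 1;       % propose a step to the left or to the right
      if xprime >= 0 && xprime <= 20              % P(x') > 0: accept (P is uniform, so a = 1)
        x(t+1) = xprime;
      else                                        % P(x') = 0: reject, record the state again
        x(t+1) = x(t);
      endif
    endfor
    hist(x, 0:20);                                % occupancy histogram, cf. figure 8b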

The simulation was started in the state x^(0) = 10 and its evolution is shown in figure 8a. How long does it take to reach one of the end states x = 0 and x = 20? Since the distance is 10 steps, the rule of thumb above predicts that it will typically take a time T ≃ 100 iterations to reach an end state. This is confirmed in the present example: the first step into an end state occurs on the 178th iteration. How long does it take to visit both end states? The rule of thumb predicts that about 400 iterations are required to traverse the whole state space; and indeed the first encounter with the other end state takes place on the 540th iteration. Thus effectively independent samples are only generated by simulating for about four hundred iterations.

This simple example shows that it is important to try to abolish random walk behaviour in Monte Carlo methods. A systematic exploration of the toy state space {0, 1, 2, ..., 20} could get around it, using the same step sizes, in about twenty steps instead of four hundred.

METROPOLIS METHOD IN HIGH DIMENSIONS

The rule of thumb that we discussed above, giving a lower bound on the number of iterations of a random walk Metropolis method, also applies to higher-dimensional problems. Consider the simplest case of a target distribution that is an N-dimensional Gaussian, and a proposal distribution that is a spherical Gaussian of standard deviation ε in each direction. Without loss of generality, we can assume that the target distribution is a separable distribution aligned with the axes {x_n}, and that it has standard deviations {σ_n} in the different directions n. Let σ^max and σ^min be the largest and smallest of these standard deviations. Let us assume that ε is adjusted such that the acceptance probability is close to 1. Under this assumption, each variable x_n evolves independently of all the others, executing a random walk with step sizes about ε. The time taken to generate effectively independent samples from the target distribution will be controlled by the largest lengthscale σ^max: just as in the previous section, where we needed at least T ≃ (L/ε)² iterations to obtain an independent sample, here we need T ≃ (σ^max/ε)².

Now how big can ε be? The bigger it is, the smaller this number T becomes, but if ε is too big (bigger than σ^min), then the acceptance rate will fall sharply. It seems plausible that the optimal ε must be similar to σ^min. Strictly, this may not be true; in special cases where the second smallest σ_n is significantly greater than σ^min, the optimal ε may be closer to that second smallest σ_n. But our rough conclusion is this: where simple spherical proposal distributions are used, we will need at least T ≃ (σ^max/σ^min)² iterations to obtain an independent sample, where σ^max and σ^min are the longest and shortest lengthscales of the target distribution.

Figure 9. Gibbs sampling. (a) The joint density P(x) from which samples are required. (b) Starting from a state x^(t), x_1 is sampled from the conditional density P(x_1 | x_2^(t)). (c) A sample is then made from the conditional density P(x_2 | x_1). (d) A couple of iterations of Gibbs sampling.

This is good news and bad news. It is good news because, unlike the cases of rejection sampling and importance sampling, there is no catastrophic dependence on the dimensionality N. But it is bad news in that, all the same, this quadratic dependence on the lengthscale ratio may force us to make very lengthy simulations.

Fortunately, there are methods for suppressing random walks in Monte Carlo simulations, which we will discuss later.

Gibbs sampling

We introduced importance sampling, rejection sampling and the Metropolis method using one-dimensional examples. Gibbs sampling, also known as the heat bath method, is a method for sampling from distributions over at least two dimensions. It can be viewed as a Metropolis method in which the proposal distribution Q is defined in terms of the conditional distributions of the joint distribution P(x). It is assumed that, whilst P(x) is too complex to draw samples from directly, its conditional distributions P(x_i | {x_j}, j ≠ i) are tractable to work with. For many graphical models (but not all) these one-dimensional conditional distributions are straightforward to sample from. Conditional distributions that are not of standard form may still be sampled from by adaptive rejection sampling if the conditional distribution satisfies certain convexity properties (Gilks and Wild 1992).

Gibbs sampling is illustrated for a case with two variables x = (x_1, x_2) in figure 9. On each iteration, we start from the current state x^(t), and x_1 is sampled from the conditional density P(x_1 | x_2), with x_2 fixed to x_2^(t). A sample x_2 is then made from the conditional density P(x_2 | x_1), using the new value of x_1. This brings us to the new state x^(t+1), and completes the iteration.

In the general case of a system with K variables, a single iteration involves sampling one parameter at a time:

    x_1^(t+1) ∼ P(x_1 | x_2^(t), x_3^(t), ..., x_K^(t))
    x_2^(t+1) ∼ P(x_2 | x_1^(t+1), x_3^(t), ..., x_K^(t))
    x_3^(t+1) ∼ P(x_3 | x_1^(t+1), x_2^(t+1), x_4^(t), ..., x_K^(t)),   etc.

Gibbs sampling can be viewed as a Metropolis method which has the property that every proposal is always accepted. Because Gibbs sampling is a Metropolis method, the probability distribution of x^(t) tends to P(x) as t → ∞, as long as P(x) does not have pathological properties.
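For illustration, an Octave sketch of this two-variable scheme for a strongly correlated bivariate Gaussian, whose conditionals are themselves Gaussian; the correlation and run length are arbitrary choices.

    % Octave sketch (illustration): Gibbs sampling for a bivariate Gaussian.
    rho = 0.95; T = 1000;
    X = zeros(T, 2);                          % each row is a state (x1, x2)
    for t = 1:T-1
      % conditional P(x1 | x2) is Normal(rho*x2, 1 - rho^2)
      X(t+1, 1) = rho*X(t, 2)   + sqrt(1 - rho^2)*randn();
      % conditional P(x2 | x1) is Normal(rho*x1, 1 - rho^2), using the new x1
      X(t+1, 2) = rho*X(t+1, 1) + sqrt(1 - rho^2)*randn();
    endfor
    plot(X(:, 1), X(:, 2), '.');              % the slow random-walk exploration is visible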

GIBBS SAMPLING IN HIGH DIMENSIONS

Gibbs sampling suffers from the same defect as simple Metropolis algorithms: the state space is explored by a random walk, unless a fortuitous parameterization has been chosen which makes the probability distribution P(x) separable. If, say, two variables x_1 and x_2 are strongly correlated, having marginal densities of width L and conditional densities of width ε, then it will take at least about (L/ε)² iterations to generate an independent sample from the target density. However, Gibbs sampling involves no adjustable parameters, so it is an attractive strategy when one wants to get a model running quickly. An excellent software package, BUGS, is available, which makes it easy to set up almost arbitrary probabilistic models and simulate them by Gibbs sampling (Thomas, Spiegelhalter and Gilks 1992).

Terminology for Markov chain Monte Carlo methods

We now spend a few moments sketching the theory on which the Metropolis method and Gibbs sampling are based.

A Markov chain can be specified by an initial probability distribution p^(0)(x) and a transition probability T(x'; x). The probability distribution of the state at the (t+1)th iteration of the Markov chain is given by

    p^(t+1)(x') = ∫ d^N x  T(x'; x) p^(t)(x).

We construct the chain such that:

1. The desired distribution P(x) is the invariant distribution of the chain. (A distribution π(x) is an invariant distribution of T(x'; x) if

    π(x') = ∫ d^N x  T(x'; x) π(x).)

2. The chain must also be ergodic, that is,

    p^(t)(x) → π(x) as t → ∞, for any p^(0)(x).

It is often convenient to construct T by mixing or concatenating simple base transitions B, all of which satisfy

    P(x') = ∫ d^N x  B(x'; x) P(x)

for the desired density P(x). These base transitions need not be individually ergodic.

Many useful transition probabilities satisfy the detailed balance property,

    T(x'; x) P(x) = T(x; x') P(x')   for all x and x'.

This equation says that, if we pick a state from the target density P and make a transition under T to another state, it is just as likely that we will pick x and go from x to x' as it is that we will pick x' and go from x' to x. Markov chains that satisfy detailed balance are also called reversible Markov chains. The reason why the detailed balance property is of interest is that detailed balance implies invariance of the distribution P(x) under the Markov chain T (the proof of this is left as an exercise for the reader). Proving that detailed balance holds is often a key step when proving that a Markov chain Monte Carlo simulation will converge to the desired distribution. The Metropolis method and Gibbs sampling method both satisfy detailed balance, for example. Detailed balance is not an essential condition, however, and we will see later that irreversible Markov chains can be useful in practice.
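As an added check of these definitions, the following Octave sketch builds the 21-state transition matrix of the toy Metropolis chain of figure 8 and verifies numerically that the uniform distribution is invariant and that detailed balance holds; the explicit construction of T is specific to this toy example.

    % Octave sketch (illustration): invariance and detailed balance for the chain of figure 8.
    K = 21;                                   % states 0..20
    T = zeros(K, K);                          % T(i, j): probability of moving to state i from state j
    for j = 1:K
      if j > 1
        T(j-1, j) = T(j-1, j) + 0.5;          % proposal to the left is accepted
      else
        T(j, j) = T(j, j) + 0.5;              % proposal to x' = -1 is rejected; stay put
      endif
      if j < K
        T(j+1, j) = T(j+1, j) + 0.5;          % proposal to the right is accepted
      else
        T(j, j) = T(j, j) + 0.5;              % proposal to x' = 21 is rejected; stay put
      endif
    endfor
    P = ones(K, 1) / K;                       % the uniform target distribution
    printf("invariance:       max |T*P - P| = %.2e\n", max(abs(T*P - P)));
    D = T .* repmat(P', K, 1);                % D(i, j) = T(i, j) P(j)
    printf("detailed balance: max |T(i,j)P(j) - T(j,i)P(i)| = %.2e\n", max(max(abs(D - D'))));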

Practicalities

Can we predict how long a Markov chain Monte Carlo simulation will take to equilibrate? By considering the random walks involved in a Markov chain Monte Carlo simulation, we can obtain simple lower bounds on the time required for convergence. But predicting this time more precisely is a difficult problem, and most of the theoretical results are of little practical use.

Can we diagnose or detect convergence in a running simulation? This is also a difficult problem. There are a few practical tools available, but none of them is perfect (Cowles and Carlin 1996).

Can we speed up the convergence time and time between independent samples of a Markov chain Monte Carlo method? Here there is good news.

SPEEDING UP MONTE CARLO METHODS

Reducing random walk behaviour in Metropolis methods

The hybrid Monte Carlo method (reviewed in Neal 1993) is a Metropolis method, applicable to continuous state spaces, which makes use of gradient information to reduce random walk behaviour.

For many systems the probability P(x) can be written in the form

    P(x) = e^{−E(x)} / Z,

where not only E(x) but also its gradient with respect to x can be readily evaluated. It seems wasteful to use a simple random-walk Metropolis method when this gradient is available; the gradient indicates which direction one should go in to find states with higher probability.

In the hybrid Monte Carlo method, the state space x is augmented by momentum variables p, and there is an alternation of two types of proposal. The first proposal randomizes the momentum variable, leaving the state x unchanged. The second proposal changes both x and p using simulated Hamiltonian dynamics, as defined by the Hamiltonian

    H(x, p) = E(x) + K(p),

    g = gradE ( x ) ;              % set gradient using initial x
    E = findE ( x ) ;              % set objective function too
    for l = 1:L                    % loop L times
      p = randn ( size(x) ) ;      % initial momentum is Normal(0,1)
      H = p' * p / 2 + E ;         % evaluate H(x,p)
      xnew = x ;
      gnew = g ;
      for tau = 1:Tau              % make Tau `leapfrog' steps
        p = p - epsilon * gnew / 2 ;     % make half-step in p
        xnew = xnew + epsilon * p ;      % make step in x
        gnew = gradE ( xnew ) ;          % find new gradient
        p = p - epsilon * gnew / 2 ;     % make half-step in p
      endfor
      Enew = findE ( xnew ) ;      % find new value of H
      Hnew = p' * p / 2 + Enew ;
      dH = Hnew - H ;              % decide whether to accept
      if ( dH < 0 ) accept = 1 ;
      elseif ( rand() < exp(-dH) ) accept = 1 ;
      else accept = 0 ;
      endif
      if ( accept )
        g = gnew ; x = xnew ; E = Enew ;
      endif
    endfor

Figure 10. Octave source code for the hybrid Monte Carlo method.

where K(p) is a kinetic energy such as K(p) = pᵀp/2. These two proposals are used to create, asymptotically, samples from the joint density

    P_H(x, p) = (1/Z_H) exp[ −H(x, p) ] = (1/Z_H) exp[ −E(x) ] exp[ −K(p) ].

This density is separable, so it is clear that the marginal distribution of x is the desired distribution exp[−E(x)]/Z. So, simply discarding the momentum variables, we will obtain a sequence of samples {x^(t)} which asymptotically come from P(x).

Figure 11. (a, b) Hybrid Monte Carlo used to generate samples from a bivariate Gaussian with correlation ρ = 0.998; (c, d) a random-walk Metropolis method, for comparison. (a) Starting from the state indicated by the arrow, the continuous line represents two successive trajectories generated by the Hamiltonian dynamics. The squares show the endpoints of these two trajectories. Each trajectory consists of Tau leapfrog steps of size epsilon. After each trajectory the momentum is randomized. Here both trajectories are accepted; the errors in the Hamiltonian were small. (b) The second figure shows how a sequence of four trajectories converges from an initial condition, indicated by the arrow, that is not close to the typical set of the target distribution. The trajectory parameters Tau and epsilon were randomized for each trajectory. The first trajectory takes us to a new state similar in energy to the first state. The second trajectory happens to end in a state nearer the bottom of the energy landscape. Here, since the potential energy E is smaller, the kinetic energy K(p) is necessarily larger than it was at the start. When the momentum is randomized for the third trajectory, its magnitude becomes much smaller. After the fourth trajectory has been simulated, the state appears to have become typical of the target density. (c) A random-walk Metropolis method using a Gaussian proposal density, with the number of proposals chosen so that the total amount of computer time used was similar to that in (a). The distance moved is small because of random walk behaviour. (d) A random-walk Metropolis method given a similar amount of computer time to (b).

The first proposal draws a new momentum from the Gaussian density exp[−K(p)]/Z_K. During the second, dynamical, proposal the momentum variable determines where the state x goes, and the gradient of E(x) determines how the momentum p changes, in accordance with the equations

    ẋ = p
    ṗ = −∂E(x)/∂x.

Because of the persistent motion of x in the direction of the momentum p during each dynamical proposal, the state of the system tends to move a distance that goes linearly with the computer time, rather than as the square root.

If the simulation of the Hamiltonian dynamics is numerically perfect, then the proposals are accepted every time, because the total energy H(x, p) is a constant of the motion and so the acceptance ratio a is equal to one. If the simulation is imperfect, because of finite step sizes for example, then some of the dynamical proposals will be rejected. The rejection rule makes use of the change in H(x, p), which is zero if the simulation is perfect. The occasional rejections ensure that, asymptotically, we obtain samples (x^(t), p^(t)) from the required joint density P_H(x, p).

The source code in figure 10 describes a hybrid Monte Carlo method which uses the leapfrog algorithm to simulate the dynamics on the function findE(x), whose gradient is found by the function gradE(x). Figure 11 shows this algorithm generating samples from a bivariate Gaussian whose energy function is E(x) = ½ xᵀAx with

    A = [  250.25   −249.75
          −249.75    250.25 ].
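For completeness, a minimal Octave sketch of the definitions needed to drive the listing of figure 10 on this bivariate Gaussian; the step size, trajectory length, number of trajectories and starting state below are arbitrary choices for this sketch.

    % Octave sketch (illustration): definitions and parameters for the figure 10 listing.
    A = [ 250.25, -249.75 ; -249.75, 250.25 ];
    findE = @(x) 0.5 * x' * A * x;   % energy E(x) = (1/2) x' A x
    gradE = @(x) A * x;              % its gradient
    x = [1; -1];                     % starting state
    epsilon = 0.05; Tau = 20; L = 100;
    % The loop of figure 10 can now be run unchanged; it calls findE(x) and gradE(x) above.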

Overrelaxation

The method of overrelaxation is a similar method for reducing random walk behaviour in Gibbs sampling. Overrelaxation was originally introduced for systems in which all the conditional distributions are Gaussian. (There are joint distributions that are not Gaussian whose conditional distributions are all Gaussian, for example P(x, y) = exp(−x²y²)/Z.)

In ordinary Gibbs sampling, one draws the new value x_i^(t+1) of the current variable x_i from its conditional distribution, ignoring the old value x_i^(t). This leads to lengthy random walks in cases where the variables are strongly correlated, as illustrated in the left hand panel of figure 12.

Figure 12. Overrelaxation contrasted with Gibbs sampling for a bivariate Gaussian with correlation ρ = 0.998. (a) The state sequence; each iteration involves one update of both variables. The overrelaxation method had α = −0.98; this excessively large value is chosen to make it easy to see how the overrelaxation method reduces random walk behaviour. The dotted line shows the contour xᵀΣ⁻¹x = 1. (b) Detail of (a), showing the two steps making up each iteration. (After Neal 1995.)

In Adler's (1981) overrelaxation method, one instead samples x_i^(t+1) from a Gaussian that is biased to the opposite side of the conditional distribution. If the conditional distribution of x_i is Normal(μ, σ²) and the current value of x_i is x_i^(t), then Adler's method sets x_i^(t+1) to

    x_i^(t+1) = μ + α (x_i^(t) − μ) + (1 − α²)^{1/2} σ ν,

where ν ∼ Normal(0, 1) and α is a parameter between −1 and 1, commonly set to a negative value.



The transition matrix T(x'; x) defined by this procedure does not satisfy detailed balance. The individual transitions for the individual coordinates just described do satisfy detailed balance, but when we form a chain by applying them in a fixed sequence, the overall chain is not reversible. If, say, two variables are positively correlated, then they will, on a short timescale, evolve in a directed manner instead of by random walk, as shown in figure 12. This may significantly reduce the time required to obtain effectively independent samples. This method is still a valid sampling strategy (it converges to the target density P(x)) because it is made up of transitions that satisfy detailed balance.
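For illustration, an Octave sketch of Adler's update for the strongly correlated bivariate Gaussian used in the Gibbs sampling sketch earlier; the correlation, α and run length are arbitrary choices.

    % Octave sketch (illustration): Adler overrelaxation for a bivariate Gaussian.
    rho = 0.95; alpha = -0.98; T = 1000;
    X = zeros(T, 2);
    s = sqrt(1 - rho^2);                      % conditional standard deviation
    for t = 1:T-1
      % overrelaxed update of x1 given x2: conditional mean mu = rho*x2
      mu = rho*X(t, 2);
      X(t+1, 1) = mu + alpha*(X(t, 1) - mu) + sqrt(1 - alpha^2)*s*randn();
      % overrelaxed update of x2 given the new x1
      mu = rho*X(t+1, 1);
      X(t+1, 2) = mu + alpha*(X(t, 2) - mu) + sqrt(1 - alpha^2)*s*randn();
    endfor
    plot(X(:, 1), X(:, 2), '-');              % compare with the Gibbs sampling sketch above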

The overrelaxation method has been generalized by Neal (1995, and this volume), whose ordered overrelaxation method is applicable to any system where Gibbs sampling is used. For practical purposes this method may speed up a simulation by a factor of ten or twenty.

Simulated annealing

A third technique for speeding convergence is simulated annealing. In simulated annealing, a temperature parameter is introduced which, when large, allows the system to make transitions which would be improbable at temperature 1. The temperature may be initially set to a large value and reduced gradually to 1. It is hoped that this procedure reduces the chance of the simulation becoming stuck in an unrepresentative probability island.

We assume that we wish to sample from a distribution of the form

    P(x) = e^{−E(x)} / Z,

where E(x) can be evaluated. In the simplest simulated annealing method, we instead sample from the distribution

    P_T(x) = (1/Z(T)) e^{−E(x)/T}

and decrease T gradually to 1.
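For illustration, a minimal Octave sketch of this simplest scheme, wrapping the earlier random-walk Metropolis sampler in a temperature schedule; the energy function (chosen so that P(x) at T = 1 is the density of figure 1), the schedule, step size and run length are all arbitrary choices.

    % Octave sketch (illustration): Metropolis sampling of P_T(x) = exp(-E(x)/T)/Z(T),
    % with T lowered from a large value to 1 and then held there.
    E = @(x) -(0.4*(x - 0.4).^2 - 0.08*x.^4);    % so exp(-E(x)) is P*(x) of figure 1
    T_sched = [linspace(10, 1, 100), ones(1, 900)];
    x = zeros(1, numel(T_sched)); x(1) = 0; epsilon = 1.0;
    for t = 1:numel(T_sched)-1
      xprime = x(t) + epsilon*randn();
      dE = E(xprime) - E(x(t));
      if rand() < exp(-dE / T_sched(t))          % accept with probability min(1, e^{-dE/T})
        x(t+1) = xprime;
      else
        x(t+1) = x(t);
      endif
    endfor
    hist(x(101:end), 50);                        % samples collected after T has reached 1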

Often the energy function can be separated into two terms,

    E(x) = E_0(x) + E_1(x),

of which the first term is "nice" (for example, a separable function of x) and the second is "nasty". In these cases, a better simulated annealing method might make use of the distribution

    P_T(x) = (1/Z(T)) e^{ −E_0(x) − E_1(x)/T },

with T gradually decreasing to 1. In this way, the distribution at high temperatures reverts to a well-behaved distribution defined by E_0.

Simulated annealing is often used as an optimization method, where the aim is to find an x that minimizes E(x), in which case the temperature is decreased to zero rather than to 1. As a Monte Carlo method, simulated annealing as described above does not sample exactly from the right distribution; the closely related "simulated tempering" methods (Marinari and Parisi 1992) correct the biases introduced by the annealing process by making the temperature itself a random variable that is updated in Metropolis fashion during the simulation.

CAN THE NORMALIZING CONSTANT BE EVALUATED?

If the target density P(x) is given in the form of an unnormalized density P*(x) with P(x) = P*(x)/Z, the value of Z may well be of interest. Monte Carlo methods do not readily yield an estimate of this quantity, and it is an area of active research to find ways of evaluating it. Techniques for evaluating Z include:

- Importance sampling (reviewed by Neal 1993).
- Thermodynamic integration during simulated annealing, the acceptance ratio method, and umbrella sampling (reviewed by Neal 1993).
- Reversible jump Markov chain Monte Carlo (Green 1995).

Perhaps the best way of dealing with Z, however, is to find a solution to one's task that does not require that Z be evaluated. In Bayesian data modelling one can avoid the need to evaluate Z (which would be important for model comparison) by not having more than one model. Instead of using several models, differing in complexity for example, and evaluating their relative posterior probabilities, one can make a single hierarchical model having, for example, various continuous hyperparameters which play a role similar to that played by the distinct models (Neal 1996).

THE METROPOLIS METHOD FOR BIG MODELS

Our original description of the Metropolis method involved a joint updating of all the variables using a proposal density Q(x'; x). For big problems it may be more efficient to use several proposal distributions Q^(b)(x'; x), each of which updates only some of the components of x. Each proposal is individually accepted or rejected, and the proposal distributions are repeatedly run through in sequence.

In the Metropolis method, the proposal density Q(x'; x) typically has a number of parameters that control, for example, its "width". These parameters are usually set by trial and error, with the rule of thumb being that one aims for a rejection frequency of about 0.5. It is not valid to have the width parameters be dynamically updated during the simulation in a way that depends on the history of the simulation. Such a modification of the proposal density would violate the detailed balance condition which guarantees that the Markov chain has the correct invariant distribution.

GIBBS SAMPLING IN BIG MODELS

Our description of Gibbs sampling involved sampling one parameter at a time, as described above. For big problems it may be more efficient to sample groups of variables jointly, that is, to use several proposal distributions:

    x_1^(t+1), ..., x_a^(t+1) ∼ P(x_1, ..., x_a | x_{a+1}^(t), ..., x_K^(t))
    x_{a+1}^(t+1), ..., x_b^(t+1) ∼ P(x_{a+1}, ..., x_b | x_1^(t+1), ..., x_a^(t+1), x_{b+1}^(t), ..., x_K^(t)),   etc.

HOW MANY SAMPLES ARE NEEDED?

At the start of this chapter, we observed that the variance of an estimator Φ̂ depends only on the number of independent samples R and the value of

    σ² = ∫ d^N x  P(x) (φ(x) − Φ)².

We have now discussed a variety of methods for generating samples from P(x). How many independent samples R should we aim for?

In many problems, we really only need about twelve independent samples from P(x). Imagine that x is an unknown vector such as the amount of corrosion present in each of a large number of underground pipelines around Sicily, and φ(x) is the total cost of repairing those pipelines. The distribution P(x) describes the probability of a state x given the tests that have been carried out on some pipelines and the assumptions about the physics of corrosion. The quantity Φ is the expected cost of the repairs. The quantity σ² is the variance of the cost; σ measures by how much we should expect the actual cost to differ from the expectation Φ.

Now, how accurately would a manager like to know Φ? I would suggest there is little point in knowing Φ to a precision finer than about σ/3. After all, the true cost is likely to differ by ±σ from Φ. If we obtain R = 12 independent samples from P(x), we can estimate Φ to a precision of σ/√12, which is smaller than σ/3. So twelve samples suffice.

ALLOCATION OF RESOURCES

Assuming we have decided how many independent samples R are required, an important question is how one should make use of one's limited computer resources to obtain these samples.

A typical Markov chain Monte Carlo experiment involves an initial period in which control parameters of the simulation, such as step sizes, may be adjusted.

Figure 13. Three possible Monte Carlo strategies for obtaining twelve samples using a fixed amount of computer time. Computer time is represented by horizontal lines, samples by white circles. (1) A single run consisting of one long "burn in" period followed by a sampling period. (2) Four medium-length runs with different initial conditions and a medium-length "burn in" period. (3) Twelve short runs.

This is followed by a "burn in" period during which we hope the simulation converges to the desired distribution. Finally, as the simulation continues, we record the state vector occasionally so as to create a list of states {x^(r)}, r = 1, ..., R, that we hope are roughly independent samples from P(x).

There are several possible strategies (figure 13):

1. Make one long run, obtaining all R samples from it.
2. Make a few medium length runs with different initial conditions, obtaining some samples from each.
3. Make R short runs, each starting from a different random initial condition, with the only state that is recorded being the final state of each simulation.

The first strategy has the best chance of attaining convergence. The last strategy may have the advantage that the correlations between the recorded samples are smaller. The middle path appears to be popular with Markov chain Monte Carlo experts because it avoids the inefficiency of discarding burn-in iterations in many runs, while still allowing one to detect problems with lack of convergence that would not be apparent from a single run.

PHILOSOPHY

One curious defect of these Monte Carlo methods, which are widely used by Bayesian statisticians, is that they are all non-Bayesian. They involve computer experiments from which estimators of quantities of interest are derived. These estimators depend on the sampling distributions that were used to generate the samples. In contrast, an alternative Bayesian approach to the problem would use the results of our computer experiments to infer the properties of the target function P(x) and generate predictive distributions for quantities of interest such as Φ. This approach would give answers which would depend only on the computed values of P*(x^(r)) at the points {x^(r)}; the answers would not depend on how those points were chosen.

It remains an open problem to create a Bayesian version of Monte Carlo methods.

Summary

- Monte Carlo methods are a powerful tool that allow one to implement any probability distribution that can be expressed in the form P(x) = P*(x)/Z.

- Monte Carlo methods can answer virtually any query related to P(x) by putting the query in the form

    ∫ φ(x) P(x) dx ≃ (1/R) Σ_r φ(x^(r)).

- In high-dimensional problems the only satisfactory methods are those based on Markov chain Monte Carlo: the Metropolis method and Gibbs sampling.

- Simple Metropolis algorithms, although widely used, perform poorly because they explore the space by a slow random walk. More sophisticated Metropolis algorithms such as hybrid Monte Carlo make use of proposal densities that give faster movement through the state space. The efficiency of Gibbs sampling is also troubled by random walks. The method of ordered overrelaxation is a general purpose technique for suppressing them.

ACKNOWLEDGEMENTS

This presentation of Monte Carlo methods owes a great deal to Wally Gilks, and I thank Radford Neal for teaching me about Monte Carlo methods and for giving helpful comments on the manuscript.

References

Adler, S. L. (1981) Over-relaxation method for the Monte-Carlo evaluation of the partition function for multiquadratic actions. Physical Review D: Particles and Fields.

Cowles, M. K., and Carlin, B. P. (1996) Markov-chain Monte-Carlo convergence diagnostics: a comparative review. Journal of the American Statistical Association.

Gilks, W., and Wild, P. (1992) Adaptive rejection sampling for Gibbs sampling. Applied Statistics.

Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (1996) Markov Chain Monte Carlo in Practice. Chapman and Hall.

Green, P. J. (1995) Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika.

Marinari, E., and Parisi, G. (1992) Simulated tempering: a new Monte-Carlo scheme. Europhysics Letters.

Neal, R. M. (1993) Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto.

Neal, R. M. (1995) Suppressing random walks in Markov chain Monte Carlo using ordered overrelaxation. Technical Report, Dept. of Statistics, University of Toronto.

Neal, R. M. (1996) Bayesian Learning for Neural Networks. Number 118 in Lecture Notes in Statistics. Springer, New York.

Propp, J. G., and Wilson, D. B. (1996) Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Structures and Algorithms.

Tanner, M. A. (1996) Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions. Springer Series in Statistics, 3rd edn. Springer Verlag.

Thomas, A., Spiegelhalter, D. J., and Gilks, W. R. (1992) BUGS: A program to perform Bayesian inference using Gibbs sampling. In Bayesian Statistics 4, ed. J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith. Clarendon Press, Oxford.

Yeomans, J. (1992) Statistical Mechanics of Phase Transitions. Clarendon Press, Oxford.

For a full bibliography and a more thorough review of Monte Carlo methods, the reader is encouraged to consult Neal (1993), Gilks et al. (1996) and Tanner (1996).