Markov chain Monte Carlo in action: a tutorial
Peter J. Green
University of Bristol, Department of Mathematics, Bristol, BS8 1TW, UK.
1. Introduction
Markov chain Monte Carlo is probably about 50 years old, and has been both developed and
extensively used in physics for the last four decades. However, the most spectacular increase
in its impact and influence in statistics and probability has come since the late 1980s.
It has now come to be an all-pervading technique in statistical computation, in particular for
Bayesian inference, and especially in complex stochastic systems.
2. Cyclones example: point processes and change points
We will illustrate the ideas of MCMC with a running example: the observations are a point
process of events at times y_1, y_2, ..., y_N in an interval [0, L]. We suppose the events occur as a
Poisson process, but at a possibly non-uniform rate: say x(t) per unit time at time t; we wish
to make inference about x(t). We consider a series of models, ultimately allowing an unknown
number of change points, unknown hyperparameters, and a parametric periodic component.
The models and the respective algorithms and inferences will be illustrated by an analysis of
a data set of the times of cyclones hitting the Bay of Bengal; there were 141 cyclones over a
period of 100 years.
Model 1: constant rate. First suppose that x(t) = x for all t. Then the times of the events
are immaterial: we observe N events in a time interval of length L; the obvious estimate of x
is x̂ = N/L, the maximum likelihood estimator of x under the assumption that N has a Poisson
distribution with mean xL.
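As a quick check of the arithmetic, here is a minimal sketch; the counts N = 141 and L = 100 are the cyclones data from the text.

```python
# Maximum likelihood estimate of a constant Poisson rate:
# with N events observed over a window of length L, the MLE is x_hat = N / L.
N = 141   # cyclones observed (from the text)
L = 100   # years of observation (from the text)
x_hat = N / L
print(x_hat)  # 1.41 cyclones per year
```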
Model 2: constant rate, the Bayesian way. For a Bayesian approach to this example,
suppose that we have prior information about x, from previous studies for example. Suppose
we can model this by x ~ Γ(α, β).
Then we find that a posteriori x has a Γ(α + N, β + L) distribution, with mean (α + N)/(β + L), or
approximately N/L if N and L are large compared with α and β. Thus with a lot of data, the
Bayesian posterior mean is close to the maximum likelihood estimator.
There is no need for MCMC in this model: you can calculate the posterior exactly, and recognise
it as a standard distribution; it worked like this only because we used a conjugate prior.
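The conjugate update is easy to verify numerically. A small sketch, with illustrative prior parameters (the values α = 2, β = 1 are arbitrary choices, not from the text):

```python
# Gamma(alpha, beta) prior + Poisson(x * L) likelihood -> Gamma(alpha + N, beta + L) posterior.
alpha, beta = 2.0, 1.0   # illustrative prior parameters (assumed, not from the text)
N, L = 141, 100          # cyclones data from the text
post_mean = (alpha + N) / (beta + L)
print(post_mean)         # close to the MLE N/L = 1.41, since N and L dominate the prior
```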
Model 3: constant rate, with hyperparameter. Suppose you are reluctant to specify
your prior fully: you are happy to say x ~ Γ(α, β), and can specify α but not β, and want
to state only β ~ Γ(e, f) for fixed e and f (a formulation that makes more sense in our next
formulation, model 4).
Now p(x | N, α, e, f) no longer has an explicit form, but the full conditionals p(x | N, α, β, e, f)
and p(β | x, N, α, e, f) are simple:

x | N, α, β, e, f ~ Γ(α + N, β + L)

as before, and

β | x, N, α, e, f ~ Γ(e + α, f + x).

What happens if we generate a sample of (x, β) pairs by alternately drawing x and β from
these distributions?
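A minimal sketch of this two-block scheme in standard-library Python, assuming α = 1 for illustration (the text fixes α but does not state its value here) and using the e and f settings quoted for the cyclones run:

```python
import random

# Alternate draws from the two full conditionals of model 3:
#   x    | beta ~ Gamma(alpha + N, beta + L)   (shape, rate)
#   beta | x    ~ Gamma(e + alpha, f + x)
# Note: random.gammavariate(shape, scale) takes a SCALE parameter, i.e. 1/rate.

N, L = 141, 100    # cyclones data from the text
alpha = 1.0        # illustrative choice; the text specifies alpha but not beta
e, f = 1.0, 1.41   # hyperprior settings used for the cyclones run

random.seed(1)
beta = 1.0         # arbitrary starting value
xs = []
for sweep in range(1000):
    x = random.gammavariate(alpha + N, 1.0 / (beta + L))
    beta = random.gammavariate(e + alpha, 1.0 / (f + x))
    xs.append(x)

print(sum(xs) / len(xs))   # posterior mean of x, roughly N/L = 1.41
```

The two draws per sweep are exactly the alternation described above; the chain's state after each sweep is the pair (x, β).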
Figure 1 shows the first few moves of this process applied to model 3 on the cyclones data; we
took e = 1 and f = N/L = 1.41. The marginal distribution for x, as accumulated from the
first 1000 sweeps of this process, is also displayed in Figure 1.

Figure 1: First few moves of the Gibbs sampler, and marginal distribution for x
from 1000 sweeps, for the cyclones data, model 3.
This is a simple example of a Gibbs sampler. The alternating updates of one variable conditioned
on the other induce Markov dependence: the successively sampled pairs form a Markov chain
on the uncountable state space ℝ⁺ × ℝ⁺, and it is readily shown that the joint posterior is the
unique invariant distribution of the chain. Standard theorems imply that the chain converges
to this invariant distribution in several useful senses, so that we can treat the realised values
as a sample from the posterior.
Model 4: constant rate, with change points. Now let us suppose x(t) is a step function
(a suitable model if we postulate one or more change points): the process is completely random,
but the rate switches between levels. Let us take k change points at known times T_1, T_2, ..., T_k,
with

x(t) = x_0 if 0 ≤ t < T_1;  x_1 if T_1 ≤ t < T_2;  ...;  x_k if T_k ≤ t ≤ L.

Suppose that x_0, x_1, ..., x_k are a priori independently drawn from Gamma distributions, as
before: x_j ~ Γ(α, β). Then if N_0, N_1, ..., N_k are the numbers of events between adjacent {T_j},
the above method extends to sampling in turn from

x_j | · ~ Γ(α + N_j, β + T_{j+1} − T_j),  j = 0, 1, ..., k  (writing T_0 = 0 and T_{k+1} = L),

and

β | · ~ Γ(e + (k + 1)α, f + Σ_{i=0}^{k} x_i),

forming a Markov chain with a (k + 2)-dimensional state space {x_0, x_1, ..., x_k, β}. Note that
we write '| ·' to mean 'given all other variables', including the data.

The hierarchical model using random β allows 'borrowing strength' in estimation from all the
data together: the x_j are conditionally independent given β, but are unconditionally dependent.
In inference their values will be shrunk together.

3. Beyond the Gibbs sampler - MCMC in general

Having motivated the idea of MCMC by use of the Gibbs sampler in a very basic problem, we
are now in a position to discuss the subject from a rather more general perspective.
The objective is to construct a discrete-time Markov chain whose state space is X (the parameter
space in Bayesian statistics), and whose limiting distribution is a specified target π (e.g. a
Bayesian posterior). That is, we want a transition kernel P such that

P{x^(t) ∈ A | x^(0)} → π(A) as t → ∞, for all x^(0).

Having constructed such a Markov chain, in the sense of devising a transition kernel with this
limiting property, we then construct it in another sense: we form a realisation of the chain
{x^(1), x^(2), ..., x^(N)} and treat this as if it were a random sample from π.
A number of standard recipes have been developed (see, for example, Besag et al., 1995),
including the Gibbs sampler and the Metropolis and Hastings methods; the latter is very general,
and can even be extended to cases where the parameter space is not of fixed dimension. We
illustrate some of the extensions to our point process model that can also be handled.

Model 5: another hyperparameter.
Let us now suppose α is also unknown, with, a priori, α ~ Γ(c, d) for fixed constants c and d.
For the cyclones data, we took c = d = 2.
This last change means that Gibbs sampling is not enough: the full conditional for α is not a
standard distribution. In a Markov chain with states x = (x_0, x_1, ..., x_k, α, β), we can update
α using a random walk Metropolis move, on the log scale: the acceptance ratio simplifies to

min{ 1, (Γ(α)/Γ(α'))^(k+1) (α'/α)^c e^(−d(α'−α)) ( β^(k+1) ∏_{j=0}^{k} x_j )^(α'−α) }.
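A sketch of this Metropolis-within-Gibbs step for α, computed in log space for numerical stability; the step size, starting values, and current rate levels x_j are illustrative assumptions, while c = d = 2 follows the cyclones setting in the text.

```python
import math
import random

# Random-walk Metropolis on the log scale for alpha in model 5.
# Target terms involving alpha: prior alpha^(c-1) * exp(-d*alpha), and
# likelihood prod_j beta^alpha * x_j^(alpha-1) / Gamma(alpha).
# The log-scale walk contributes a Jacobian factor alpha'/alpha.

def update_alpha(alpha, beta, x, c, d, step=0.5):
    """One proposal log(alpha') = log(alpha) + u, u ~ Uniform(-step, step)."""
    k1 = len(x)  # k + 1 rate levels
    alpha_new = alpha * math.exp(random.uniform(-step, step))
    log_ratio = (c * math.log(alpha_new / alpha)          # prior + Jacobian
                 - d * (alpha_new - alpha)
                 + k1 * (math.lgamma(alpha) - math.lgamma(alpha_new))
                 + (alpha_new - alpha) * (k1 * math.log(beta)
                                          + sum(map(math.log, x))))
    if math.log(random.random()) < log_ratio:
        return alpha_new   # accept the proposal
    return alpha           # reject: retain the current value

random.seed(4)
alpha, beta = 1.0, 1.0     # illustrative starting values
x = [1.2, 0.8, 1.9]        # illustrative current rate levels (k = 2)
c = d = 2.0                # prior constants from the text
for _ in range(100):
    alpha = update_alpha(alpha, beta, x, c, d)
print(alpha)               # state of alpha after 100 Metropolis moves
```

In the full sampler for model 5, this move would be interleaved with the Gamma full-conditional draws for the x_j and β from model 4.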