Markov chain Monte Carlo in action: a tutorial
Peter J. Green
University of Bristol, Department of Mathematics, Bristol, BS8 1TW, UK.
1. Introduction
Markov chain Monte Carlo is probably about 50 years old, and has been both developed and
extensively used in physics for the last four decades. However, the most spectacular increase
in its impact and influence in statistics and probability has come since the late 1980s.
It has now come to be an all-pervading technique in statistical computation, in particular for
Bayesian inference, and especially in complex stochastic systems.
2. Cyclones example: point processes and change points
We will illustrate the ideas of MCMC with a running example: the observations are a point
process of events at times y_1, y_2, ..., y_N in an interval [0, L]. We suppose the events occur as a
Poisson process, but at a possibly non-uniform rate: say x(t) per unit time at time t; we wish
to make inference about x(t). We consider a series of models, ultimately allowing an unknown
number of change points, unknown hyperparameters, and a parametric periodic component.
The models and the respective algorithms and inferences will be illustrated by an analysis of
a data set of the times of cyclones hitting the Bay of Bengal; there were 141 cyclones over a
period of 100 years.
Model 1: constant rate. First suppose that x(t) = x for all t. Then the times of the events
are immaterial: we observe N events in a time interval of length L; the obvious estimate of x
is x̂ = N/L, the maximum likelihood estimator of x under the assumption that N has a Poisson
distribution with mean xL.
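As a quick check of the arithmetic, here is a minimal sketch; the counts N = 141 and L = 100 are the cyclones data from the text.

```python
# Maximum likelihood estimate of a constant Poisson rate:
# with N events observed over a window of length L, the MLE is x_hat = N / L.
N = 141   # cyclones observed (from the text)
L = 100   # years of observation (from the text)
x_hat = N / L
print(x_hat)  # 1.41 cyclones per year
```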
Model 2: constant rate, the Bayesian way. For a Bayesian approach to this example,
suppose that we have prior information about x, from previous studies for example. Suppose
we can model this by x ~ Γ(α, β).
Then we find that a posteriori x has a Γ(α + N, β + L) distribution, with mean (α + N)/(β + L), or
approximately N/L if N and L are large compared with α and β. Thus with a lot of data, the
Bayesian posterior mean is close to the maximum likelihood estimator.
There is no need for MCMC in this model: you can calculate the posterior exactly, and recognise
it as a standard distribution; it worked like this only because we used a conjugate prior.
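The conjugate update is easy to verify numerically. A small sketch, with illustrative prior parameters (the values α = 2, β = 1 are arbitrary choices, not from the text):

```python
# Gamma(alpha, beta) prior + Poisson(x * L) likelihood -> Gamma(alpha + N, beta + L) posterior.
alpha, beta = 2.0, 1.0   # illustrative prior parameters (assumed, not from the text)
N, L = 141, 100          # cyclones data from the text
post_mean = (alpha + N) / (beta + L)
print(post_mean)         # close to the MLE N/L = 1.41, since N and L dominate the prior
```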
Model 3: constant rate, with hyperparameter. Suppose you are reluctant to specify
your prior fully: you are happy to say x ~ Γ(α, β), and can specify α but not β, and want
to state only β ~ Γ(e, f) for fixed e and f (a formulation that makes more sense in our next
formulation, model 4).
Now p(x | N, α, e, f) no longer has an explicit form, but the full conditionals p(x | N, α, β, e, f)
and p(β | x, N, α, e, f) are simple:

x | N, α, β, e, f ~ Γ(α + N, β + L)

as before, and

β | x, N, α, e, f ~ Γ(e + α, f + x).

What happens if we generate a sample of (x, β) pairs by alternately drawing x and β from
these distributions?
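A minimal sketch of this two-block scheme in standard-library Python, assuming α = 1 for illustration (the text fixes α but does not state its value here) and using the e and f settings quoted for the cyclones run:

```python
import random

# Alternate draws from the two full conditionals of model 3:
#   x    | beta ~ Gamma(alpha + N, beta + L)   (shape, rate)
#   beta | x    ~ Gamma(e + alpha, f + x)
# Note: random.gammavariate(shape, scale) takes a SCALE parameter, i.e. 1/rate.

N, L = 141, 100    # cyclones data from the text
alpha = 1.0        # illustrative choice; the text specifies alpha but not beta
e, f = 1.0, 1.41   # hyperprior settings used for the cyclones run

random.seed(1)
beta = 1.0         # arbitrary starting value
xs = []
for sweep in range(1000):
    x = random.gammavariate(alpha + N, 1.0 / (beta + L))
    beta = random.gammavariate(e + alpha, 1.0 / (f + x))
    xs.append(x)

print(sum(xs) / len(xs))   # posterior mean of x, roughly N/L = 1.41
```

The two draws per sweep are exactly the alternation described above; the chain's state after each sweep is the pair (x, β).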
Figure 1 shows the first few moves of this process applied to model 3 on the cyclones data; we
took e = 1 and f = N/L = 1.41. The marginal distribution for x, as accumulated from the
first 1000 sweeps of this process, is also displayed in Figure 1.

Figure 1: First few moves of the Gibbs sampler, and marginal distribution for x
from 1000 sweeps, for the cyclones data, model 3.
This is a simple example of a Gibbs sampler. The alternating updates of one variable conditioned
on the other induce Markov dependence: the successively sampled pairs form a Markov chain
on the uncountable state space ℝ⁺ × ℝ⁺, and it is readily shown that the joint posterior is the
unique invariant distribution of the chain. Standard theorems imply that the chain converges
to this invariant distribution in several useful senses, so that we can treat the realised values
as a sample from the posterior.
Model 4: constant rate, with change points. Now let us suppose x(t) is a step function
(a suitable model if we postulate one or more change points): the process is completely random,
but the rate switches between levels. Let us take k change points at known times T_1, T_2, ..., T_k,
with

x(t) = x_0 if 0 ≤ t < T_1;  x_1 if T_1 ≤ t < T_2;  ...;  x_k if T_k ≤ t ≤ L.

Suppose that x_0, x_1, ..., x_k are a priori independently drawn from Gamma distributions, as
before: x_j ~ Γ(α, β). Then if N_0, N_1, ..., N_k are the numbers of events between adjacent {T_j},
the above method extends to sampling in turn from

x_j | · ~ Γ(α + N_j, β + T_{j+1} − T_j),  j = 0, 1, ..., k  (writing T_0 = 0 and T_{k+1} = L),

and

β | · ~ Γ(e + (k + 1)α, f + Σ_{i=0}^{k} x_i),

forming a Markov chain with a (k + 2)-dimensional state space {x_0, x_1, ..., x_k, β}. Note that
we write '| ·' to mean 'given all other variables', including the data.

The hierarchical model using random β allows 'borrowing strength' in estimation from all the
data together: the x_j are conditionally independent given β, but are unconditionally dependent.
In inference their values will be shrunk together.

3. Beyond the Gibbs sampler - MCMC in general

Having motivated the idea of MCMC by use of the Gibbs sampler in a very basic problem, we
are now in a position to discuss the subject from a rather more general perspective.
The objective is to construct a discrete-time Markov chain whose state space is X (the parameter
space in Bayesian statistics), and whose limiting distribution is a specified target π (e.g. a
Bayesian posterior). That is, we want a transition kernel P such that

P{x^(t) ∈ A | x^(0)} → π(A) as t → ∞, for all x^(0).

Having constructed such a Markov chain, in the sense of devising a transition kernel with this
limiting property, we then construct it in another sense: we form a realisation of the chain
{x^(1), x^(2), ..., x^(N)} and treat this as if it were a random sample from π.
A number of standard recipes have been developed (see, for example, Besag et al., 1995),
including the Gibbs sampler and the Metropolis and Hastings methods; the latter is very general,
and can even be extended to cases where the parameter space is not of fixed dimension. We
illustrate some of the extensions to our point process model that can also be handled.

Model 5: another hyperparameter.
Let us now suppose α is also unknown, with, a priori, α ~ Γ(c, d) for fixed constants c and d.
For the cyclones data, we took c = d = 2.
This last change means that Gibbs sampling is not enough: the full conditional for α is not a
standard distribution. In a Markov chain with states x = (x_0, x_1, ..., x_k, α, β), we can update
α using a random walk Metropolis move, on the log scale: the acceptance ratio simplifies to

min{ 1, (Γ(α)/Γ(α'))^(k+1) (α'/α)^c e^(−d(α'−α)) ( β^(k+1) ∏_{j=0}^{k} x_j )^(α'−α) }.
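A sketch of this Metropolis-within-Gibbs step for α, computed in log space for numerical stability; the step size, starting values, and current rate levels x_j are illustrative assumptions, while c = d = 2 follows the cyclones setting in the text.

```python
import math
import random

# Random-walk Metropolis on the log scale for alpha in model 5.
# Target terms involving alpha: prior alpha^(c-1) * exp(-d*alpha), and
# likelihood prod_j beta^alpha * x_j^(alpha-1) / Gamma(alpha).
# The log-scale walk contributes a Jacobian factor alpha'/alpha.

def update_alpha(alpha, beta, x, c, d, step=0.5):
    """One proposal log(alpha') = log(alpha) + u, u ~ Uniform(-step, step)."""
    k1 = len(x)  # k + 1 rate levels
    alpha_new = alpha * math.exp(random.uniform(-step, step))
    log_ratio = (c * math.log(alpha_new / alpha)          # prior + Jacobian
                 - d * (alpha_new - alpha)
                 + k1 * (math.lgamma(alpha) - math.lgamma(alpha_new))
                 + (alpha_new - alpha) * (k1 * math.log(beta)
                                          + sum(map(math.log, x))))
    if math.log(random.random()) < log_ratio:
        return alpha_new   # accept the proposal
    return alpha           # reject: retain the current value

random.seed(4)
alpha, beta = 1.0, 1.0     # illustrative starting values
x = [1.2, 0.8, 1.9]        # illustrative current rate levels (k = 2)
c = d = 2.0                # prior constants from the text
for _ in range(100):
    alpha = update_alpha(alpha, beta, x, c, d)
print(alpha)               # state of alpha after 100 Metropolis moves
```

In the full sampler for model 5, this move would be interleaved with the Gamma full-conditional draws for the x_j and β from model 4.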