
Markov Chain Monte Carlo

Recall: To compute the expectation E(h(Y)) we use the approximation

    E(h(Y)) \approx \frac{1}{n} \sum_{t=1}^{n} h(Y^{(t)})   with   Y^{(1)}, \ldots, Y^{(n)} \sim f(y).

Thus our aim is to sample Y^{(1)}, \ldots, Y^{(n)} from f(y).

Problem: Independent sampling from f(y) may be difficult.

Markov chain Monte Carlo (MCMC) approach:

- Generate a Markov chain {Y^{(t)}} with stationary distribution f(y).
- Early iterations Y^{(1)}, \ldots, Y^{(m)} reflect the starting value Y^{(0)}. These iterations are called burn-in.
- After the burn-in, we say the chain has "converged".
- Omit the burn-in from averages:

    \frac{1}{n - m} \sum_{t=m+1}^{n} h(Y^{(t)})

[Figure: trace plot of a chain over 1000 iterations, divided into the burn-in phase and the stationarity phase.]

How do we construct a Markov chain {Y^{(t)}} which has stationary distribution f(y)?

- Gibbs sampler
- Metropolis-Hastings (Metropolis et al. 1953; Hastings 1970)


Gibbs Sampler

Let Y = (Y_1, \ldots, Y_d) be d-dimensional with d \geq 2 and distribution f(y). The full conditional distribution of Y_i is given by

    f(y_i | y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_d)
      = \frac{f(y_1, \ldots, y_{i-1}, y_i, y_{i+1}, \ldots, y_d)}
             {\int f(y_1, \ldots, y_{i-1}, y_i, y_{i+1}, \ldots, y_d) \, dy_i}.

Sample or update in turn:

    Y_1^{(t+1)} \sim f(y_1 | Y_2^{(t)}, Y_3^{(t)}, \ldots, Y_d^{(t)})
    Y_2^{(t+1)} \sim f(y_2 | Y_1^{(t+1)}, Y_3^{(t)}, \ldots, Y_d^{(t)})
    Y_3^{(t+1)} \sim f(y_3 | Y_1^{(t+1)}, Y_2^{(t+1)}, Y_4^{(t)}, \ldots, Y_d^{(t)})
    ...
    Y_d^{(t+1)} \sim f(y_d | Y_1^{(t+1)}, Y_2^{(t+1)}, \ldots, Y_{d-1}^{(t+1)})

Always use the most recent values.

In two dimensions, the sample path of the Gibbs sampler looks like this:

[Figure: sample path of (Y_1^{(t)}, Y_2^{(t)}) for t = 1, \ldots, 7; each update moves parallel to one of the coordinate axes.]
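In code, one sweep of the update scheme above is simply a loop over the components, always conditioning on the most recent values. The following minimal R sketch is not part of the original notes; sample_full_conditional(i, y) stands for a hypothetical user-supplied function that draws Y_i from its full conditional given the current values of the other components.

    # One Gibbs sweep: update each component in turn,
    # always conditioning on the most recent values of the others.
    gibbs_sweep <- function(y, sample_full_conditional) {
      for (i in seq_along(y)) {
        y[i] <- sample_full_conditional(i, y)   # draw Y_i | all other components
      }
      y
    }

    # Run the chain for n iterations from the starting value y0;
    # row t of the result is Y^(t).
    gibbs_chain <- function(y0, n, sample_full_conditional) {
      out <- matrix(NA, nrow = n, ncol = length(y0))
      y <- y0
      for (t in 1:n) {
        y <- gibbs_sweep(y, sample_full_conditional)
        out[t, ] <- y
      }
      out
    }

Dropping the first m rows of the output before averaging corresponds to omitting the burn-in as described above.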



Detailed balance for the Gibbs sampler: For simplicity, let Y = (Y_1, Y_2)^T. Then the update Y^{(t+1)} at time t+1 is obtained from the previous Y^{(t)} in two steps:

    Y_1^{(t+1)} \sim p(y_1 | Y_2^{(t)})
    Y_2^{(t+1)} \sim p(y_2 | Y_1^{(t+1)})

Accordingly, the transition matrix P(y, y') = P(Y^{(t+1)} = y' | Y^{(t)} = y) can be factorized into two separate transition matrices,

    P(y, y') = P_1(y, \tilde{y}) P_2(\tilde{y}, y'),

where \tilde{y} = (y_1', y_2)^T is the intermediate result after the first step. Obviously we have

    P_1(y, \tilde{y}) = p(y_1' | y_2)   and   P_2(\tilde{y}, y') = p(y_2' | y_1').

Note that for any y, y', we have P_1(y, y') = 0 if y_2 \neq y_2' and P_2(y, y') = 0 if y_1 \neq y_1'.

According to the detailed balance condition for time-dependent Markov chains, it suffices to show detailed balance for each of the transition matrices. For any states y, y' such that y_2 = y_2',

    p(y) P_1(y, y') = p(y_1, y_2) p(y_1' | y_2) = p(y_1 | y_2) p(y_1', y_2)
                    = p(y_1 | y_2') p(y_1', y_2') = P_1(y', y) p(y'),

while for y, y' with y_2 \neq y_2' the equation is trivially fulfilled. Similarly we obtain for y, y' such that y_1 = y_1'

    p(y) P_2(y, y') = p(y_1, y_2) p(y_2' | y_1) = p(y_2 | y_1) p(y_1, y_2')
                    = p(y_2 | y_1') p(y_1', y_2') = P_2(y', y) p(y'),

while for y, y' with y_1 \neq y_1' the equation trivially holds. Altogether this shows that p(y) is indeed the stationary distribution of the Gibbs sampler. Note that combined we get

    p(y) P(y, y') = p(y) P_1(y, \tilde{y}) P_2(\tilde{y}, y') = p(y') P_2(y', \tilde{y}) P_1(\tilde{y}, y) \neq p(y') P(y', y)

in general, so the full sweep itself does not satisfy detailed balance.

Explanation: Markov chains {Y_t} which satisfy the detailed balance equation are called time-reversible, since it can be shown that

    P(Y_{t+1} = y' | Y_t = y) = P(Y_t = y' | Y_{t+1} = y).

For the above Gibbs sampler, to go back in time we have to update the two components in reverse order: first Y_2^{(t+1)} and then Y_1^{(t+1)}.
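This argument can be checked numerically on a toy example. The following R sketch is not part of the original notes; the 2 x 2 joint distribution is an arbitrary choice. It builds the transition matrices P_1, P_2 and P = P_1 P_2 for a discrete two-component distribution and verifies that P_1 and P_2 satisfy detailed balance while the combined sweep does not, although p remains its stationary distribution.

    # Toy check: a discrete distribution p(y1, y2) on {1,2} x {1,2}
    states <- expand.grid(y1 = 1:2, y2 = 1:2)        # the four states (y1, y2)
    pjoint <- c(0.1, 0.2, 0.3, 0.4)                  # an arbitrary joint distribution

    # P1 redraws y1 from p(y1 | y2); P2 redraws y2 from p(y2 | y1)
    P1 <- matrix(0, 4, 4)
    P2 <- matrix(0, 4, 4)
    for (a in 1:4) for (b in 1:4) {
      if (states$y2[a] == states$y2[b])              # y2 is held fixed by P1
        P1[a, b] <- pjoint[b] / sum(pjoint[states$y2 == states$y2[a]])
      if (states$y1[a] == states$y1[b])              # y1 is held fixed by P2
        P2[a, b] <- pjoint[b] / sum(pjoint[states$y1 == states$y1[a]])
    }
    P <- P1 %*% P2                                   # one full Gibbs sweep

    db <- function(Q) max(abs(diag(pjoint) %*% Q - t(diag(pjoint) %*% Q)))
    db(P1); db(P2)                  # both ~0: each half-step satisfies detailed balance
    db(P)                           # > 0: the combined sweep does not
    max(abs(pjoint %*% P - pjoint)) # ~0: yet pjoint is its stationary distribution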

Example: Bayes inference for a univariate normal sample

Consider normally distributed observations Y = (Y_1, \ldots, Y_n)^T,

    Y_i \sim_{iid} N(\mu, \sigma^2).

Likelihood:

    f(Y | \mu, \sigma^2) \propto (\sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (Y_i - \mu)^2 \right)

Prior distribution (noninformative prior):

    \pi(\mu, \sigma^2) \propto \frac{1}{\sigma^2}

Posterior distribution:

    \pi(\mu, \sigma^2 | Y) \propto (\sigma^2)^{-(n/2 + 1)} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (Y_i - \mu)^2 \right)

Define \tau = 1/\sigma^2. Then we can show that

    \mu | \tau, Y \sim N(\bar{Y}, \sigma^2/n) = N(\bar{Y}, (n\tau)^{-1})
    \tau | \mu, Y \sim \Gamma\left( \frac{n}{2}, \frac{1}{2} \sum_{i=1}^{n} (Y_i - \mu)^2 \right)

Gibbs sampler:

    \mu^{(t+1)} \sim N\left( \bar{Y}, (n \tau^{(t)})^{-1} \right)
    \tau^{(t+1)} \sim \Gamma\left( \frac{n}{2}, \frac{1}{2} \sum_{i=1}^{n} (Y_i - \mu^{(t+1)})^2 \right)

with (\sigma^2)^{(t+1)} = 1/\tau^{(t+1)}.
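The full conditionals above follow directly from the posterior; a short justification (not spelled out in the original notes) reads as follows. As a function of \mu alone,

    \pi(\mu | \sigma^2, Y) \propto \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (Y_i - \mu)^2 \right)
                           \propto \exp\left( -\frac{n}{2\sigma^2} (\mu - \bar{Y})^2 \right),

since \sum_i (Y_i - \mu)^2 = \sum_i (Y_i - \bar{Y})^2 + n(\mu - \bar{Y})^2; this is the kernel of N(\bar{Y}, \sigma^2/n). Transforming \sigma^2 to \tau = 1/\sigma^2 (Jacobian |d\sigma^2/d\tau| = \tau^{-2}) gives

    \pi(\mu, \tau | Y) \propto \tau^{n/2 + 1} \, \tau^{-2} \exp\left( -\frac{\tau}{2} \sum_{i=1}^{n} (Y_i - \mu)^2 \right)
                      = \tau^{n/2 - 1} \exp\left( -\frac{\tau}{2} \sum_{i=1}^{n} (Y_i - \mu)^2 \right),

which, as a function of \tau, is the kernel of the \Gamma\left( \frac{n}{2}, \frac{1}{2} \sum_i (Y_i - \mu)^2 \right) distribution.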


Implementation in R

    n<-20                                    #Data
    Y<-rnorm(n,2,2)
    MC<-2; N<-1000                           #Run MC=2 chains of length N=1000
    p<-rep(0,2*MC*N)                         #Allocate memory for results
    dim(p)<-c(2,MC,N)
    for (j in (1:MC)) {                      #Loop over chains
      p2<-rgamma(1,n/2,1/2)                  #Starting value for tau
      for (i in (1:N)) {                     #Gibbs iterations
        p1<-rnorm(1,mean(Y),sqrt(1/(p2*n)))  #Update mu
        p2<-rgamma(1,n/2,sum((Y-p1)^2)/2)    #Update tau
        p[1,j,i]<-p1                         #Save results
        p[2,j,i]<-p2
      }
    }

Results: Bayes inference for a univariate normal sample

Two runs of the Gibbs sampler (N = 500):

[Figure: trace plots of \mu^{(t)} and \tau^{(t)} for the two runs, and the autocorrelation functions of \mu and \tau up to lag 50.]

Marginal and joint posterior distributions (based on 1000 draws):

[Figure: histograms of the draws of \mu^{(t)} and \tau^{(t)}, and a scatter plot of \tau^{(t)} against \mu^{(t)}.]
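These summaries can be reproduced from the stored draws with base R graphics. This is a minimal sketch, not part of the original notes; it assumes the array p from the implementation above is in the workspace and uses chain 1 only, and the panel titles are mine.

    # Marginal and joint posterior draws of mu and tau from chain 1
    par(mfrow = c(1, 3))
    hist(p[1, 1, ], xlab = "mu",  main = "Posterior draws of mu")
    hist(p[2, 1, ], xlab = "tau", main = "Posterior draws of tau")
    plot(p[1, 1, ], p[2, 1, ], xlab = "mu", ylab = "tau", main = "Joint draws")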

Markov Chain Monte Carlo

Example: Bivariate normal distribution

Let Y = (Y_1, Y_2)^T be normally distributed with mean \mu = (0, 0)^T and covariance matrix

    \Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.

The conditional distributions are

    Y_1 | Y_2 \sim N(\rho Y_2, 1 - \rho^2)
    Y_2 | Y_1 \sim N(\rho Y_1, 1 - \rho^2)

Thus the steps of the Gibbs sampler are

    Y_1^{(t+1)} \sim N(\rho Y_2^{(t)}, 1 - \rho^2),
    Y_2^{(t+1)} \sim N(\rho Y_1^{(t+1)}, 1 - \rho^2).

Note: We can obtain an independent sample Y^{(t)} = (Y_1^{(t)}, Y_2^{(t)})^T by

    Y_1^{(t)} \sim N(0, 1),
    Y_2^{(t)} \sim N(\rho Y_1^{(t)}, 1 - \rho^2).
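This example translates directly into a few lines of R. The sketch below is not part of the original notes; rho = 0.8 and M = 1000 are arbitrary choices, and the starting value is (0, 0). The last two lines generate the independent sample described in the note above.

    rho <- 0.8; M <- 1000                       # hypothetical correlation and run length
    chain <- matrix(0, M, 2)                    # chain[t, ] = (Y1^(t), Y2^(t)); start at (0, 0)
    for (t in 2:M) {
      chain[t, 1] <- rnorm(1, rho * chain[t - 1, 2], sqrt(1 - rho^2))  # Y1 | most recent Y2
      chain[t, 2] <- rnorm(1, rho * chain[t, 1],     sqrt(1 - rho^2))  # Y2 | most recent Y1
    }

    # Independent draws from the same bivariate normal distribution
    Z1 <- rnorm(M)
    Z2 <- rnorm(M, rho * Z1, sqrt(1 - rho^2))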


Comparison of MCMC and independent draws

[Figure: trace plot and autocorrelation function of Y_1^{(t)}, and histograms of the Gibbs draws of Y_1 after iterations 100 to 400, 700, 1000 and 10000, each compared with an independent sample of size n = 300, 600, 900 and 9900.]
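A comparison in the spirit of this figure can be produced with the bivariate-normal sketch above. This snippet is not part of the original notes and assumes the objects chain, Z1 and M defined there; the first 100 iterations are discarded as burn-in.

    # Histograms of Y1: Gibbs draws after burn-in vs. independent draws
    par(mfrow = c(1, 2))
    hist(chain[101:M, 1], xlab = "Y1", main = "Gibbs sampler (iterations 101-1000)")
    hist(Z1[1:(M - 100)], xlab = "Y1",
         main = paste0("Independent sampling (n = ", M - 100, ")"))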

Convergence diagnostics

- Plot the chain for each quantity of interest.
- Plot the auto-correlation function (ACF)

      \rho_i(h) = corr(Y_i^{(t)}, Y_i^{(t+h)}),

  which measures the correlation of values h lags apart. Slow decay of the ACF indicates slow convergence and bad mixing. It can be used to find an independent subsample. (See the R sketch below.)
- Run multiple, independent chains (e.g. 3-10):
  - several long runs (Gelman and Rubin 1992) give an indication of convergence and a sense of statistical security;
  - one very long run (Geyer 1992) reaches parts other schemes cannot reach.
- Widely dispersed starting values are particularly helpful to detect slow convergence.

[Figure: several chains started from widely dispersed values; after an initial period they converge to the same region.]

If not satisfied, try some other diagnostics (→ literature).
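For the univariate normal example, the first two diagnostics can be produced with base R functions. This sketch is not part of the original notes; it assumes the array p and the constant MC from the implementation above are in the workspace.

    # Trace plot and autocorrelation function of mu for each chain
    par(mfrow = c(2, 2))
    for (j in 1:MC) plot(p[1, j, ], type = "l",
                         xlab = "Iteration", ylab = "mu", main = paste("Chain", j))
    for (j in 1:MC) acf(p[1, j, ], lag.max = 50, main = paste("ACF of mu, chain", j))

The Gelman-Rubin diagnostic mentioned above is implemented, for example, in the R package coda (function gelman.diag), which expects the chains as an mcmc.list object.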

Note: Even after the chain has reached convergence, it might not yet be good enough for estimating E(h(Y)).

[Figure: trace plot over 1000 iterations and the corresponding histogram of the draws.]

Problem: The chain should show good mixing (transitions between the states) → run the chain for a longer period.

[Figure: trace plot over 2000 iterations and the corresponding histogram of the draws.]

Monte Carlo error

Suppose we want to estimate E(h(Y)) by

    \hat{h} = \frac{1}{N} \sum_{t=1}^{N} h(Y^{(t)})   with   Y^{(t)} \sim f(y).

The error of the approximation (the Monte Carlo error) is \sqrt{var(\hat{h})}.

Estimation of the Monte Carlo error: Let {Y^{(i,t)}} be I Markov chains. Then var(\hat{h}) can be estimated by

    \frac{1}{I(I-1)} \sum_{i=1}^{I} \left( \hat{h}^{(i)} - \bar{\hat{h}} \right)^2

where

- \hat{h}^{(i)} is the MCMC estimate based on the ith chain, and
- \bar{\hat{h}} is the average of the \hat{h}^{(i)} (the overall estimate).
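A small R sketch of this estimate, again assuming the array p and the constant MC from the implementation above (with only MC = 2 chains the estimate is of course crude, and the burn-in should ideally be discarded first):

    # Monte Carlo error of the posterior-mean estimate of mu, based on MC parallel chains
    h_i   <- rowMeans(p[1, , ])                             # per-chain estimates hat(h)^(i)
    h_bar <- mean(h_i)                                      # overall estimate
    mc_se <- sqrt(sum((h_i - h_bar)^2) / (MC * (MC - 1)))   # estimated Monte Carlo error
    mc_se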
