Markov Chain Monte Carlo: The Gibbs Sampler
Markov Chain Monte Carlo

Recall: To compute the expectation E(h(Y)) we use the approximation

    E(h(Y)) ≈ (1/n) sum_{t=1}^{n} h(Y^{(t)})   with   Y^{(1)}, ..., Y^{(n)} ~ f(y).

Thus our aim is to sample Y^{(1)}, ..., Y^{(n)} from f(y).

Problem: Independent sampling from f(y) may be difficult.

Markov chain Monte Carlo (MCMC) approach:
  - Generate a Markov chain {Y^{(t)}} with stationary distribution f(y).
  - Early iterations Y^{(1)}, ..., Y^{(m)} reflect the starting value Y^{(0)}.
  - These iterations are called burn-in.
  - After the burn-in, we say the chain has "converged".
  - Omit the burn-in from averages:

        (1/(n - m)) sum_{t=m+1}^{n} h(Y^{(t)})

[Figure: trace plot of a chain over 1000 iterations, marking the burn-in phase and subsequent stationarity]

How do we construct a Markov chain {Y^{(t)}} which has stationary distribution f(y)?
  - Gibbs sampler
  - Metropolis-Hastings algorithm (Metropolis et al. 1953; Hastings 1970)

MCMC, April 29, 2004

Gibbs Sampler

Let Y = (Y_1, ..., Y_d) be d-dimensional with d >= 2 and distribution f(y). The full conditional distribution of Y_i is given by

    f(y_i | y_1, ..., y_{i-1}, y_{i+1}, ..., y_d)
        = f(y_1, ..., y_{i-1}, y_i, y_{i+1}, ..., y_d)
          / ∫ f(y_1, ..., y_{i-1}, y_i, y_{i+1}, ..., y_d) dy_i.

Gibbs sampling: sample, or update, each component in turn:

    Y_1^{(t+1)} ~ f(y_1 | Y_2^{(t)}, Y_3^{(t)}, ..., Y_d^{(t)})
    Y_2^{(t+1)} ~ f(y_2 | Y_1^{(t+1)}, Y_3^{(t)}, ..., Y_d^{(t)})
    Y_3^{(t+1)} ~ f(y_3 | Y_1^{(t+1)}, Y_2^{(t+1)}, Y_4^{(t)}, ..., Y_d^{(t)})
    ...
    Y_d^{(t+1)} ~ f(y_d | Y_1^{(t+1)}, Y_2^{(t+1)}, ..., Y_{d-1}^{(t+1)})

Always use the most recent values.

[Figure: sample path of the Gibbs sampler in two dimensions, moving one coordinate at a time through the points t = 1, ..., 7]
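The full-conditional formula can be checked numerically on a small discrete two-dimensional target; this is only a sketch, and the probability table `f` below is hypothetical, chosen purely for illustration.

```r
# Small discrete illustration of the full-conditional formula
#   f(y1 | y2) = f(y1, y2) / sum over y1 of f(y1, y2)
# on a hypothetical 2x2 probability table.
f <- matrix(c(0.10, 0.20,
              0.30, 0.40), nrow = 2, byrow = TRUE)  # f[y1, y2]
cond <- f[, 1] / sum(f[, 1])                        # f(y1 | y2 = 1)
print(cond)                                         # 0.25 0.75
```

The denominator is the sum (the discrete analogue of the integral) over the component being updated, so `cond` is a proper distribution over y1.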
Gibbs Sampler

Detailed balance for the Gibbs sampler: For simplicity, let Y = (Y_1, Y_2)^T. Then the update Y^{(t+1)} at time t + 1 is obtained from the previous Y^{(t)} in two steps:

    Y_1^{(t+1)} ~ p(y_1 | Y_2^{(t)})
    Y_2^{(t+1)} ~ p(y_2 | Y_1^{(t+1)})

Accordingly, the transition matrix P(y, y') = P(Y^{(t+1)} = y' | Y^{(t)} = y) can be factorized into two separate transition matrices,

    P(y, y') = P_1(y, ỹ) P_2(ỹ, y'),

where ỹ = (y'_1, y_2)^T is the intermediate result after the first step. Obviously we have

    P_1(y, ỹ) = p(y'_1 | y_2)   and   P_2(ỹ, y') = p(y'_2 | y'_1).

Note that for any y, y', we have P_1(y, y') = 0 if y'_2 ≠ y_2, and P_2(y, y') = 0 if y'_1 ≠ y_1. According to the detailed balance condition for time-dependent Markov chains, it suffices to show detailed balance for each of the two transition matrices. For any states y, y' such that y_2 = y'_2,

    p(y) P_1(y, y') = p(y_1, y_2) p(y'_1 | y_2) = p(y_1 | y_2) p(y'_1, y_2)
                    = P_1(y', y) p(y'),

while for y, y' with y_2 ≠ y'_2 the equation is trivially fulfilled, since both sides are zero. Similarly, for y, y' such that y_1 = y'_1 we obtain

    p(y) P_2(y, y') = p(y_1, y_2) p(y'_2 | y_1) = p(y_2 | y_1) p(y_1, y'_2)
                    = P_2(y', y) p(y'),

while for y, y' with y_1 ≠ y'_1 the equation trivially holds. Altogether this shows that p(y) is indeed the stationary distribution of the Gibbs sampler.

Example: Bayes inference for a univariate normal sample

Consider normally distributed observations Y = (Y_1, ..., Y_n)^T with

    Y_i ~ iid N(μ, σ²).

Likelihood function:

    f(Y | μ, σ²) ∝ (σ²)^{-n/2} exp( -(1/(2σ²)) sum_{i=1}^{n} (Y_i - μ)² )

Prior distribution (noninformative prior):

    π(μ, σ²) ∝ 1/σ²

Posterior distribution:

    π(μ, σ² | Y) ∝ (σ²)^{-(n/2 + 1)} exp( -(1/(2σ²)) sum_{i=1}^{n} (Y_i - μ)² )

Define τ = 1/σ². Then we can show that

    π(μ | σ², Y) = N( Ȳ, σ²/n )
    π(τ | μ, Y)  = Γ( n/2, (1/2) sum_{i=1}^{n} (Y_i - μ)² )
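The per-step detailed-balance identity derived above, p(y) P_1(y, y') = p(y') P_1(y', y) for states sharing y_2, can be verified numerically. A sketch on a hypothetical discrete 2x2 target p:

```r
# Numerical check of detailed balance for the first Gibbs step P1,
# which updates y1 given y2, on a hypothetical 2x2 target p[y1, y2].
p <- matrix(c(0.1, 0.3,
              0.2, 0.4), nrow = 2, byrow = TRUE)       # p[y1, y2]
P1 <- function(y1new, y2) p[y1new, y2] / sum(p[, y2])  # p(y1' | y2)
y2  <- 2                       # y = (1, 2) and y' = (2, 2) share y2
lhs <- p[1, y2] * P1(2, y2)    # p(y)  P1(y,  y')
rhs <- p[2, y2] * P1(1, y2)    # p(y') P1(y', y)
all.equal(lhs, rhs)            # TRUE
```

The same check with states differing in y_2 would give zero on both sides, matching the "trivially fulfilled" case in the derivation.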
Note that, combined, the two steps of the Gibbs transition give

    p(y) P(y, y') = p(y) P_1(y, ỹ) P_2(ỹ, y') = p(y') P_2(y', ỹ) P_1(ỹ, y) ≠ p(y') P(y', y)

in general, so the full transition P itself does not satisfy detailed balance. Explanation: Markov chains {Y_t} which satisfy the detailed balance equation are called time-reversible, since it can be shown that

    P(Y_{t+1} = y' | Y_t = y) = P(Y_t = y' | Y_{t+1} = y).

For the above Gibbs sampler, to go back in time we have to update the two components in reverse order: first Y_2^{(t+1)} and then Y_1^{(t+1)}.

Gibbs sampler for the normal example:

    μ^{(t+1)} ~ N( Ȳ, (n τ^{(t)})^{-1} )
    τ^{(t+1)} ~ Γ( n/2, (1/2) sum_{i=1}^{n} (Y_i - μ^{(t+1)})² )

with (σ²)^{(t+1)} = 1/τ^{(t+1)}.

Implementation in R:

    n<-20                      # Data
    Y<-rnorm(n,2,2)
    MC<-2; N<-1000             # Run MC=2 chains of length N=1000
    p<-rep(0,2*MC*N)           # Allocate memory for results
    dim(p)<-c(2,MC,N)
    for (j in (1:MC)) {        # Loop over chains
      p2<-rgamma(1,n/2,1/2)    # Starting value for tau
      for (i in (1:N)) {       # Gibbs iterations
        p1<-rnorm(1,mean(Y),sqrt(1/(p2*n)))    # Update mu
        p2<-rgamma(1,n/2,sum((Y-p1)^2)/2)      # Update tau
        p[1,j,i]<-p1                           # Save results
        p[2,j,i]<-p2
      }
    }

Results: Bayes inference for a univariate normal sample

[Figure: two runs of the Gibbs sampler (N = 500): trace plots of μ^{(t)} and τ^{(t)}, autocorrelation functions of μ and τ up to lag 50, and marginal and joint posterior distributions of μ and τ (based on 1000 draws)]

Gibbs Sampler

Example: Bivariate normal distribution

Let Y = (Y_1, Y_2)^T be normally distributed with mean μ = (0, 0)^T and covariance matrix

    Σ = ( 1  ρ )
        ( ρ  1 ).

The conditional distributions are

    Y_1 | Y_2 ~ N( ρ Y_2, 1 - ρ² )
    Y_2 | Y_1 ~ N( ρ Y_1, 1 - ρ² ).

Thus the steps of the Gibbs sampler are

    Y_1^{(t+1)} ~ N( ρ Y_2^{(t)}, 1 - ρ² ),
    Y_2^{(t+1)} ~ N( ρ Y_1^{(t+1)}, 1 - ρ² ).

Note: In this example we can also obtain an independent sample Y^{(t)} = (Y_1^{(t)}, Y_2^{(t)})^T directly, by drawing

    Y_1^{(t)} ~ N( 0, 1 ),
    Y_2^{(t)} ~ N( ρ Y_1^{(t)}, 1 - ρ² ).

Markov Chain Monte Carlo

Comparison of MCMC and independent draws

[Figure: histograms of Y_1^{(t)} from the Gibbs sampler over increasing spans of iterations (100 to 400, 100 to 700, 100 to 1000, 100 to 10000), compared with histograms of independent samples of matching size (n = 300, 600, 900, 9900); trace plot of the chain with burn-in, and ACF of Y_1]

Convergence diagnostics

  - Plot the chain for each quantity of interest.
  - Plot the autocorrelation function (ACF),

        ρ_i(h) = corr( Y_i^{(t)}, Y_i^{(t+h)} ),

    which measures the correlation of values h lags apart.
      - Slow decay of the ACF indicates slow convergence and bad mixing.
      - The ACF can be used to find an (approximately) independent subsample.
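The slow-decay behaviour of the ACF can be seen on a slowly mixing chain. As a sketch, an AR(1) series with coefficient 0.9 serves here as a hypothetical stand-in for strongly autocorrelated MCMC output:

```r
# Estimated ACF of a slowly mixing chain: an AR(1) series with
# coefficient 0.9 (a hypothetical stand-in for MCMC output).
set.seed(1)
n <- 10000
x <- numeric(n)                              # x[1] = 0 as starting value
for (t in 2:n) x[t] <- 0.9 * x[t - 1] + rnorm(1)
r <- acf(x, lag.max = 20, plot = FALSE)$acf  # r[h + 1] estimates rho(h)
round(r[c(2, 6, 11)], 2)                     # lags 1, 5, 10: slow decay
```

The estimated ACF decays roughly like 0.9^h, so many lags are needed before draws are close to uncorrelated; thinning at such a lag would yield an approximately independent subsample.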
  - Run multiple, independent chains (e.g. 3-10).
      - Several long runs (Gelman and Rubin, 1992)
          - give an indication of convergence,
          - give a sense of statistical security.
      - One very long run (Geyer, 1992)
          - reaches parts other schemes cannot reach.
  - Widely dispersed starting values are particularly helpful to detect slow convergence.

[Figure: trace plots of several chains started from widely dispersed values, all converging towards the stationary distribution π(μ)]

If not satisfied, try some other diagnostics (→ literature).

Note: Even after the chain has reached convergence, the resulting sample might not yet be large enough for estimating E(h(Y)) accurately.
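The bivariate normal example from earlier gives a concrete testbed for this last point. A sketch of its two-step Gibbs sampler (ρ = 0.6 and the run length are hypothetical choices for illustration):

```r
# Sketch of the two-step Gibbs sampler for the bivariate normal example:
#   Y1 | Y2 ~ N(rho * Y2, 1 - rho^2),  Y2 | Y1 ~ N(rho * Y1, 1 - rho^2).
set.seed(1)
rho    <- 0.6                                   # hypothetical correlation
n.iter <- 5000
Y  <- matrix(0, n.iter, 2)
y1 <- 0; y2 <- 0                                # starting values
for (t in 1:n.iter) {
  y1 <- rnorm(1, rho * y2, sqrt(1 - rho^2))     # update Y1 given current Y2
  y2 <- rnorm(1, rho * y1, sqrt(1 - rho^2))     # update Y2 given new Y1
  Y[t, ] <- c(y1, y2)
}
cor(Y[, 1], Y[, 2])                             # should be close to rho
```

Because successive draws Y^{(t)} are autocorrelated, estimates such as this sample correlation are less precise than estimates from the same number of independent draws, which is why a converged chain may still need to be run longer.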