Bayesian Modeling Strategies for Generalized Linear Models, Part 1
Reading: Hoff, Chapter 9; Albert and Chib (1993), Sections 1-3.2 and 4.1; Polson (2012), Sections 1-3 and Appendix S6.2; Pillow and Scott (2012), Sections 1-3.1 and 4; Dadaneh et al. (2018); Neelon (2018), Sections 1 and 2

Fall 2018

Linear Regression Model

Consider the following simple linear regression model:

Y_i = x_i^T β + e_i, i = 1, ..., n,

where
• x_i is a p × 1 vector of covariates (including an intercept)
• β is a p × 1 vector of regression coefficients
• e_i ~ iid N(0, τ_e⁻¹), where τ_e = 1/σ_e² is a precision term

Or, combining all n observations:

Y = Xβ + e,

where Y and e are n × 1 vectors and X is an n × p design matrix

Maximum Likelihood Inference for Linear Model

It is straightforward to show that the MLEs are

β̂ = (X^T X)⁻¹ X^T Y
σ̂_e² = (1/n) (Y − Xβ̂)^T (Y − Xβ̂)   (MLE)
σ̃_e² = (1/(n − p)) (Y − Xβ̂)^T (Y − Xβ̂)   (REML)

Gauss-Markov Theorem: Under the standard linear regression assumptions, β̂ is the best linear unbiased estimator (BLUE) of β

Prior Specification for Linear Model

In the Bayesian framework, we place priors on β and σ_e² (or, equivalently, τ_e)

Common choices are the so-called semi-conjugate or conditionally conjugate priors:

β ~ N_p(β_0, T_0⁻¹), where T_0 is the prior precision
τ_e ~ Ga(a, b), where b is a rate parameter

Equivalently, we can say that σ_e² is inverse-gamma (IG) with shape a and scale b

These choices lead to conjugate full conditional distributions

Choice of Prior Parameters

Common choices for the prior parameters include:
• β_0 = 0
• T_0 = 0.01 I_p, so that Σ_0 = T_0⁻¹ = 100 I_p
• a = b = 0.001

See Hoff, Section 9.2.2 for alternative default priors, including Zellner's g-prior for β

Full Conditional Distributions∗

Can show that the full conditional for β is

β | Y = y, τ_e ~ N(m, V),

where

V = (T_0 + τ_e X^T X)⁻¹ and m = V (T_0 β_0 + τ_e X^T y)

To derive this, you must complete the square in p dimensions:

Proposition: For vectors β and η and a symmetric, invertible matrix V,

β^T V⁻¹ β − 2 η^T β = (β − Vη)^T V⁻¹ (β − Vη) − η^T V η

Full Conditional Distributions (Cont'd)

Similarly, the full conditional distribution of τ_e is

τ_e | y, β ~ Ga(a*, b*),

where

a* = a + n/2
b* = b + (y − Xβ)^T (y − Xβ)/2

Homework for Friday: Please derive the full conditionals for β and τ_e

We can use these full conditionals to develop a straightforward Gibbs sampler for posterior inference

See Linear Regression.r for an illustration

R Code for Linear Regression Gibbs Sampler

library(mvtnorm)

# Simulated data (as in the example that follows)
n<-1000
x<-rnorm(n)
X<-cbind(1,x)                # n x p design matrix
p<-ncol(X)
beta.true<-c(-1,1)
y<-rnorm(n,X%*%beta.true,2)  # true sigma_e = 2 (sigma_e^2 = 4)

# Priors
beta0<-rep(0,p)      # Prior mean of beta, where p = # of parameters
T0<-diag(0.01,p)     # Prior precision of beta
a<-b<-0.001          # Gamma hyperparameters for taue

# Inits
taue<-1              # Error precision

# MCMC Info
nsim<-1000           # Number of MCMC iterations
thin<-1              # Thinning interval
burn<-nsim/2         # Burn-in
lastit<-(nsim-burn)/thin  # Number of stored samples

# Store
Beta<-matrix(0,lastit,p)   # Matrices to store results
Sigma2<-rep(0,lastit)
Resid<-matrix(0,lastit,n)  # Store resids
Dy<-matrix(0,lastit,512)   # Store density values for residual density plot
Qy<-matrix(0,lastit,100)   # Store quantiles for QQ plot

#########
# Gibbs #
#########
tmp<-proc.time()     # Store current time

for (i in 1:nsim){

  # Update beta
  v<-solve(T0+taue*crossprod(X))
  m<-v%*%(T0%*%beta0+taue*crossprod(X,y))
  beta<-c(rmvnorm(1,m,v))

  # Update taue
  taue<-rgamma(1,a+n/2,b+crossprod(y-X%*%beta)/2)

  # Store Samples
  if (i>burn & i%%thin==0){
    j<-(i-burn)/thin
    Beta[j,]<-beta
    Sigma2[j]<-1/taue
    Resid[j,]<-resid<-y-X%*%beta                      # Raw resids
    Dy[j,]<-density(resid/sd(resid),from=-5,to=5)$y   # Density of standardized resids
    Qy[j,]<-quantile(resid/sd(resid),probs=seq(.001,.999,length=100))  # Quantiles for QQ plot
  }

  if (i%%100==0) print(i)
}

run.time<-proc.time()-tmp  # MCMC run time
# Took 1 second to run 1000 iterations with n=1000 subjects
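After the loop finishes, the stored draws can be summarized to produce the posterior columns reported in the example below. A minimal post-processing sketch (not part of the original slide code), assuming the objects Beta and Sigma2 created above:

# Posterior summaries from the stored Gibbs draws (illustrative sketch)
colMeans(Beta)                             # posterior means of beta
apply(Beta,2,sd)                           # posterior SDs of beta
mean(Sigma2); sd(Sigma2)                   # posterior mean and SD of sigma_e^2
apply(Beta,2,quantile,probs=c(.025,.975))  # 95% credible intervals for beta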
Example

The program Linear Regression.r simulates 1000 observations from the following linear regression model:

Y_i = β_0 + β_1 x_i + e_i, i = 1, ..., 1000,
e_i ~ iid N(0, σ_e²)

The results, based on 1000 iterations with a burn-in of 500, are:

Table 1: Results for Linear Regression Model

Parameter   True Value   MLE (SE)       Posterior Mean (SD)
β_0         −1           −1.01 (0.06)   −1.01 (0.06)
β_1          1            0.99 (0.06)    0.99 (0.06)
σ_e²         4            3.83 (0.17)    3.84 (0.16)

Plots of Standardized Residuals

[Figure: density plot and normal QQ plot of the standardized residuals, comparing the MLE and Bayes fits]

Skew-Normal Data

Suppose we generate data from a skew-normal distribution¹ SN(µ, ω², α), where µ ∈ ℝ, ω > 0, and α ∈ ℝ are location, scale, and skewness parameters

α > 0 implies positive skewness, α < 0 implies negative skewness, and when α = 0 the density reduces to a symmetric normal distribution

For details on Bayesian approaches to fitting SN models, see Frühwirth-Schnatter and Pyne (2010) and Neelon et al. (2015)

For now, suppose we ignore the skewness and fit an ordinary linear regression to the data

See Residual Diagnostics with SN Data.r for details

¹ O'Hagan and Leonard (1976); Azzalini (1985)

Plots of Standardized Residuals

[Figure: density of the true errors (α = −5, ω = 4), density plot of the standardized residuals, and QQ plot of the standardized residuals, comparing the MLE and Bayes fits]

Probit and Logit Models for Binary Data∗

Consider the following probit model for a dichotomous outcome Y_i:

Φ⁻¹[Pr(Y_i = 1)] = x_i^T β, i = 1, ..., n,

where Φ(·) denotes the CDF of a standard normal random variable

We can represent the model vis-à-vis a latent variable Z_i such that

Z_i ~ N(x_i^T β, 1) and Y_i = 1 ⟺ Z_i > 0,

implying that

Pr(Y_i = 1) = Pr(Z_i > 0) = Φ(x_i^T β)

Latent Variable Interpretation of Probit Model

[Figure: density f(z_i) of the latent variable Z_i, centered at x_i^T β; the area above 0 equals Pr(Z_i > 0) = Pr(Y_i = 1)]
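To see this latent-variable representation in action before turning to the sampler, here is a small simulation sketch (the coefficient values are hypothetical, chosen only for illustration): thresholding Z_i at zero reproduces the probit probabilities.

# Simulation check of the latent-variable representation (hypothetical values)
set.seed(1)
n<-1e5
x<-rnorm(n)             # single covariate
X<-cbind(1,x)
beta<-c(-0.5,1)         # hypothetical true coefficients
z<-rnorm(n,X%*%beta,1)  # latent normals, Z_i ~ N(x_i'beta, 1)
y<-as.numeric(z>0)      # Y_i = 1 iff Z_i > 0
mean(y)                 # empirical Pr(Y = 1)
mean(pnorm(X%*%beta))   # average of Phi(x_i'beta) -- should agree closely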
Albert and Chib (1993) Data-Augmented Sampler

Albert and Chib (1993) take advantage of this latent variable structure to develop an efficient data-augmented Gibbs sampler for probit regression

Data augmentation is a method by which we introduce additional (or augmented) variables, Z = (Z_1, ..., Z_n)^T, as part of the Gibbs sampler to facilitate sampling

Data augmentation is useful when the conditional density π(β | y) is intractable, but the joint posterior π(β, z | y) is easy to sample from via Gibbs, where z is an n × 1 vector of realizations of Z

Data Augmentation Sampler

In particular, suppose it's straightforward to sample from the full conditionals π(β | y, z) and π(z | y, β)

Then we can apply Gibbs sampling to obtain the joint posterior π(β, z | y)

After convergence, the samples of β, {β^(1), ..., β^(T)}, will constitute a Monte Carlo sample from π(β | y)

Note that if β and Y are conditionally independent given Z, so that π(β | y, z) = π(β | z), then the sampler proceeds in two stages:
1. Draw z from π(z | y, β)
2. Draw β from π(β | z)

Gibbs Sampler for Probit Model∗

The data-augmented sampler proposed by Albert and Chib proceeds by assigning a N_p(β_0, T_0⁻¹) prior to β and defining the posterior variance of β as V = (T_0 + X^T X)⁻¹

Note that because Var(Z_i) = 1, we can define V outside the Gibbs loop

Next, we iterate through the following Gibbs steps:
1. For i = 1, ..., n, sample z_i from a N(x_i^T β, 1) distribution truncated below (above) by 0 for y_i = 1 (y_i = 0)
2. Sample β from N_p(m, V), where m = V (T_0 β_0 + X^T z) and V is defined above

Note: Conditional on Z, β is independent of Y, so we can work solely with the augmented likelihood when updating β

See Probit.r for details

R Code for Probit Gibbs Sampler

# Priors
beta0<-rep(0,p)   # Prior mean of beta (of dimension p)
T0<-diag(.01,p)   # Prior precision of beta

# Inits
beta<-rep(0,p)
z<-rep(0,n)       # Latent normal variables
ns<-table(y)      # Category sample sizes

# MCMC info analogous to linear reg. code

# Posterior var of beta -- Note: can calculate outside of loop
vbeta<-solve(T0+crossprod(X,X))

#########
# Gibbs #
#########
tmp<-proc.time()  # Store current time

for (i in 1:nsim){

  # Update latent normals, z, from truncated normal using inverse-CDF method
  muz<-X%*%beta
  z[y==0]<-qnorm(runif(ns[1],0,pnorm(0,muz[y==0],1)),muz[y==0],1)
  z[y==1]<-qnorm(runif(ns[2],pnorm(0,muz[y==1],1),1),muz[y==1],1)

  # Alternatively, can use rtnorm function from msm package -- this is slower
  # z[y==0]<-rtnorm(n0,muz[y==0],1,-Inf,0)
  # z[y==1]<-rtnorm(n1,muz[y==1],1,0,Inf)

  # Update beta
  mbeta<-vbeta%*%(T0%*%beta0+crossprod(X,z))  # Posterior mean of beta
  beta<-c(rmvnorm(1,mbeta,vbeta))

  #################
  # Store Results #
  #################
  if (i>burn & i%%thin==0){
    j<-(i-burn)/thin
    Beta[j,]<-beta
  }

  if (i%%100==0) print(i)
}

proc.time()-tmp  # MCMC run time -- 0.64 seconds to run 1000 iterations with n=1000

Example

The program Probit.r fits the following probit model:

Y_i ~ Bern(π_i)
Φ⁻¹(π_i) = β_0 + β_1 x_i, i = 1, ..., 1000

The results, based on 1000 iterations with a burn-in of 500, are:

Table 2: Results for Probit Model.
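For comparison with the posterior summaries, the MLE column of such a table can be reproduced with a standard probit fit via glm. A minimal sketch, assuming the simulated x and y from Probit.r and the stored draws in Beta:

# Probit MLE fit for comparison with the Gibbs output (illustrative sketch)
mle<-glm(y~x,family=binomial(link="probit"))
summary(mle)$coef[,1:2]   # MLEs and SEs
colMeans(Beta)            # posterior means of beta
apply(Beta,2,sd)          # posterior SDs of beta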