Bayesian Modeling Strategies for Generalized Linear Models, Part 1

Reading: Hoff Chapter 9; Albert and Chib (1993) Sections 1–3.2, 4.1; Polson (2012) Sections 1–3, Appendix S6.2; Pillow and Scott (2012) Sections 1–3.1, 4; Dadaneh et al. (2018); Neelon (2018) Sections 1 and 2

Fall 2018

1 / 160 Model

Consider the following model:

$Y_i = x_i^T\beta + e_i, \quad i = 1, \ldots, n$, where

• $x_i$ is a $p \times 1$ vector of covariates (including an intercept)
• $\beta$ is a $p \times 1$ vector of regression coefficients
• $e_i \overset{iid}{\sim} N(0, \tau_e^{-1})$, where $\tau_e = 1/\sigma_e^2$ is a precision term

Or, combining all n observations: Y = X β + e, where Y and e are n × 1 vectors and X is an n × p design matrix 2 / 160 Maximum Likelihood Inference for Linear Model

It is straightforward to show that the MLEs are

$\hat\beta = (X^T X)^{-1} X^T Y$

$\hat\sigma_e^2 = \dfrac{1}{n}(Y - X\hat\beta)^T(Y - X\hat\beta)$ is the MLE

$\tilde\sigma_e^2 = \dfrac{1}{n - p}(Y - X\hat\beta)^T(Y - X\hat\beta)$ is the REML estimate

Gauss-Markov Theorem: Under the standard linear regression assumptions, βb is the best linear unbiased estimator (BLUE) of β

3 / 160 Prior Specification for Linear Model

In the Bayesian framework, we place priors on $\beta$ and $\sigma_e^2$ (or, equivalently, $\tau_e$)

Common choices are the so-called semi-conjugate or conditionally conjugate priors

$\beta \sim N_p(\beta_0, T_0^{-1})$, where $T_0$ is the prior precision

$\tau_e \sim \mathrm{Ga}(a, b)$, where $b$ is a rate parameter. Equivalently, we can say that $\sigma_e^2$ is inverse-Gamma (IG) with shape $a$ and scale $b$

These choices lead to conjugate full conditional distributions

4 / 160 Choice of Prior Parameters

Common choices for the prior parameters include:

• β0 = 0

• $T_0 = 0.01 I_p \Rightarrow \Sigma_0 = T_0^{-1} = 100 I_p$
• $a = b = 0.001$

See Hoff Section 9.2.2 for alternative default priors, including Zellner's g-prior for $\beta$

5 / 160 Full Conditional Distributions∗

Can show that the full conditional for β is

β|Y = y, τe ∼ N(m, V ), where

$V = (T_0 + \tau_e X^T X)^{-1}$ and

$m = V(T_0\beta_0 + \tau_e X^T y)$

To derive this, you must complete the square in $p$ dimensions:

Proposition: For vectors $\beta$ and $\eta$ and a symmetric, invertible matrix $V$,

$\beta^T V^{-1}\beta - 2\eta^T\beta = (\beta - V\eta)^T V^{-1}(\beta - V\eta) - \eta^T V\eta$

6 / 160 Full Conditional Distributions (Cont'd)

Similarly, the full conditional distribution of τe is

$\tau_e \mid y, \beta \sim \mathrm{Ga}(a^*, b^*)$, where

$a^* = a + n/2$

$b^* = b + (y - X\beta)^T(y - X\beta)/2$

Homework for Friday: Please derive the full conditionals for β and τe

We can use these full conditionals to develop a straightforward Gibbs sampler for posterior inference

See Linear Regression.r for an illustration

7 / 160 R Code for Linear Regression Gibbs Sampler

Gibbs Sampler for Linear Regression Model

library(mvtnorm)

# Priors
beta0<-rep(0,p)            # Prior mean of beta, where p=# of parameters
T0<-diag(0.01,p)           # Prior precision of beta
a<-b<-0.001                # Gamma hyperparms for taue

# Inits
taue<-1                    # Error precision

# MCMC Info
nsim<-1000                 # Number of MCMC iterations
thin<-1                    # Thinning interval
burn<-nsim/2               # Burn-in
lastit<-(nsim-burn)/thin   # Last stored value

# Store
Beta<-matrix(0,lastit,p)   # Matrices to store results
Sigma2<-rep(0,lastit)
Resid<-matrix(0,lastit,n)  # Store resids
Dy<-matrix(0,lastit,512)   # Store density values for residual density plot
Qy<-matrix(0,lastit,100)   # Store quantiles for QQ plot

#########
# Gibbs #
#########
tmp<-proc.time()           # Store current time

for (i in 1:nsim){
  # Update beta
  v<-solve(T0+taue*crossprod(X))
  m<-v%*%(T0%*%beta0+taue*crossprod(X,y))
  beta<-c(rmvnorm(1,m,v))

  # Update taue
  taue<-rgamma(1,a+n/2,b+crossprod(y-X%*%beta)/2)

  # Store samples
  if (i>burn & i%%thin==0){
    j<-(i-burn)/thin
    Beta[j,]<-beta
    Sigma2[j]<-1/taue
    Resid[j,]<-resid<-y-X%*%beta                                       # Raw resids
    Dy[j,]<-density(resid/sd(resid),from=-5,to=5)$y                    # Density of standardized resids
    Qy[j,]<-quantile(resid/sd(resid),probs=seq(.001,.999,length=100))  # Quantiles for QQ plot
  }
  if (i%%100==0) print(i)
}
run.time<-proc.time()-tmp  # MCMC run time -- took 1 second to run 1000 iterations with n=1000 subjects

8 / 160

Example

The program Linear Regression.r simulates 1000 observations from the following linear regression model:

$Y_i = \beta_0 + \beta_1 x_i + e_i, \quad i = 1, \ldots, 1000$, with $e_i \overset{iid}{\sim} N(0, \sigma_e^2)$

The results, based on 1000 iterations with a burn-in of 500, are:

Table 1: Results for Linear Regression Model.

Parameter   True Value   MLE (SE)       Posterior Mean (SD)
β0          −1           −1.01 (0.06)   −1.01 (0.06)
β1          1            0.99 (0.06)    0.99 (0.06)
σe²         4            3.83 (0.17)    3.84 (0.16)

9 / 160 Plots of Standardized Residuals

[Figure: density plot of the standardized residuals (MLE vs. Bayes) and a QQ plot of the sample quantiles against normal quantiles]

10 / 160 Skew-Normal Data

Suppose we generate data from a skew-normal distribution¹ SN(µ, ω², α), where µ ∈ ℝ, ω > 0, and α ∈ ℝ are location, scale, and skewness parameters

α > 0 ⇒ positive skewness, α < 0 ⇒ negative skewness, and when α = 0, the density reduces to a symmetric normal distribution

For details on Bayesian approaches to fitting SN models, see Frühwirth-Schnatter and Pyne (2010) and Neelon et al. (2015)

For now, suppose we ignore skewness and fit an ordinary linear regression to the data

See Residual Diagnostics with SN Data.r for details

¹ O'Hagan and Leonard (1976); Azzalini (1985)

11 / 160 Plots of Standardized Residuals

[Figure: density of the true skew-normal errors (α = −5, ω = 4), density of the standardized residuals (MLE vs. Bayes), and a QQ plot of the observed quantiles against normal quantiles]

12 / 160 Probit and Logit Models for Binary Data∗

Consider the following model for a dichotomous outcome $Y_i$:

$\Phi^{-1}[\Pr(Y_i = 1)] = x_i^T\beta, \quad i = 1, \ldots, n,$

where Φ(·) denotes the CDF of a standard normal random variable

We can represent the model via a latent variable $Z_i$ such that

$Z_i \sim N(x_i^T\beta, 1)$ and

$Y_i = 1 \iff Z_i > 0$, implying that $\Pr(Y_i = 1) = \Pr(Z_i > 0) = \Phi(x_i^T\beta)$
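As a quick illustration (not part of the course programs; the sample size and coefficient values below are arbitrary), the following R sketch simulates binary data through the latent-variable representation and checks that the empirical event rate matches $\Phi(x_i^T\beta)$:

set.seed(1)
n<-10000
x<-rnorm(n)
X<-cbind(1,x)
beta<-c(-0.5,1)            # Hypothetical true values
z<-X%*%beta+rnorm(n)       # Z_i ~ N(x_i'beta, 1)
y<-as.numeric(z>0)         # Y_i = 1 iff Z_i > 0
mean(y)                    # Empirical Pr(Y=1)
mean(pnorm(X%*%beta))      # Model-implied Pr(Y=1); should agree closely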

13 / 160 Latent Variable Interpretation of Probit Model

[Figure: density of $Z_i \sim N(x_i^T\beta, 1)$ with the area above zero shaded, illustrating $\Pr(Z_i > 0) = \Pr(Y_i = 1)$]

14 / 160 Albert and Chib (1993) Data-Augmented Sampler

Albert and Chib (1993) take advantage of this latent variable structure to develop an efficient data-augmented Gibbs sampler for probit regression

Data augmentation is a method by which we introduce additional (or augmented) variables, $Z = (Z_1, \ldots, Z_n)^T$, as part of the Gibbs sampler to facilitate sampling

Data augmentation is useful when the conditional density π(β|y) is intractable, but the joint posterior π(β, z|y) is easy to sample from via Gibbs, where z is an n × 1 vector of realizations of Z

15 / 160 Data Augmentation Sampler

In particular, suppose it's straightforward to sample from the full conditionals π(β|y, z) and π(z|y, β)

Then we can apply Gibbs sampling to obtain the joint posterior π(β, z|y)

After convergence, the samples of β, {β(1),..., β(T )}, will constitute a Monte Carlo sample from π(β|y)

Note that if β and Y are conditionally independent given Z, so that π(β|y, z) = π(β|z), then the sampler proceeds in two stages:

1 Draw z from π(z|y, β)

2 Draw β from π(β|z)

16 / 160 Gibbs Sampler for Probit Model∗

The data-augmented sampler proposed by Albert and Chib proceeds by assigning a $N_p(\beta_0, T_0^{-1})$ prior to $\beta$ and defining the posterior variance of $\beta$ as $V = (T_0 + X^T X)^{-1}$

Note that because $\mathrm{Var}(Z_i) = 1$, we can define $V$ outside the Gibbs loop. Next, we iterate through the following Gibbs steps:

1. For $i = 1, \ldots, n$, sample $z_i$ from a $N(x_i^T\beta, 1)$ distribution truncated below (above) by 0 for $y_i = 1$ ($y_i = 0$)

2. Sample $\beta$ from $N_p(m, V)$, where $m = V(T_0\beta_0 + X^T z)$ and $V$ is defined above

Note: Conditional on Z, β is independent of Y , so we can work solely with the augmented likelihood when updating β

See Probit.r for details 17 / 160 R Code for Probit Gibbs Sampler

Gibbs for Probit Regression Model

# Priors
beta0<-rep(0,p)            # Prior mean of beta (of dimension p)
T0<-diag(.01,p)            # Prior precision of beta

# Inits
beta<-rep(0,p)
z<-rep(0,n)                # Latent normal variables
ns<-table(y)               # Category sample sizes

# MCMC info analogous to linear reg. code

# Posterior var of beta -- Note: can calculate outside of loop
vbeta<-solve(T0+crossprod(X,X))

#########
# Gibbs #
#########
tmp<-proc.time()           # Store current time

for (i in 1:nsim){

  # Update latent normals, z, from truncated normal using inverse-CDF method
  muz<-X%*%beta
  z[y==0]<-qnorm(runif(ns[1],0,pnorm(0,muz[y==0],1)),muz[y==0],1)
  z[y==1]<-qnorm(runif(ns[2],pnorm(0,muz[y==1],1),1),muz[y==1],1)

  # Alternatively, can use rtnorm function from msm package -- this is slower
  # z[y==0]<-rtnorm(n0,muz[y==0],1,-Inf,0)
  # z[y==1]<-rtnorm(n1,muz[y==1],1,0,Inf)

  # Update beta
  mbeta<-vbeta%*%(T0%*%beta0+crossprod(X,z))   # Posterior mean of beta
  beta<-c(rmvnorm(1,mbeta,vbeta))

  #################
  # Store Results #
  #################
  if (i>burn & i%%thin==0){
    j<-(i-burn)/thin
    Beta[j,]<-beta
  }

  if (i%%100==0) print(i)

}

proc.time()-tmp  # MCMC run time -- 0.64 seconds to run 1000 iterations with n=1000

18 / 160

Example

The program Probit.r fits the following probit model:

$Y_i \sim \mathrm{Bern}(\pi_i)$

$\Phi^{-1}(\pi_i) = \beta_0 + \beta_1 x_i, \quad i = 1, \ldots, 1000.$

The results, based on 1000 iterations with a burn-in of 500, are:

Table 2: Results for Probit Model.

Parameter   True Value   MLE (SE)       Posterior Mean (SD)†
β0          −0.5         −0.64 (0.07)   −0.64 (0.07)
β1          0.5          0.55 (0.04)    0.56 (0.04)

† Based on Albert and Chib data augmentation sampler.

19 / 160 t-Link Models

Albert and Chib also discuss extensions to so-called t-link models that allow Z to assume a t-distribution with heavier tails than the normal distribution

The choice of df will correspond to different link functions

A t1 distribution corresponds to a Cauchy link

A t8 distribution approximates a (scaled) logistic distribution

And, in the limit, t∞ implies a probit link

By varying the dfs, we permit flexibility in the choice of link function

20 / 160 Latent Variable Interpretation of t-Link Model

[Figure: densities of $t_3(x_i^T\beta)$ and $N(x_i^T\beta, 1)$ latent variables, with the area above zero corresponding to $\Pr(Z_i > 0) = \Pr(Y_i = 1)$]

21 / 160 t-Link vs Logistic Quantiles

In particular, a t8/0.634 approximates a logistic distribution

[Figure: QQ plot of $t_3$, $t_8$, and $t_{16}$ quantiles against logistic quantiles, with the scaled logistic (× 0.634) closely matching the $t_8$ quantiles]

22 / 160 Gibbs Sampler for the t-Link Model∗

Recall that a t-distribution arises as a scale mixture of normals, where the scale (i.e., variance) parameter follows an IG distribution

Equivalently, the precision of the normal variables is gamma

In particular, a tν random variable, Zi (i = 1,..., n), can be generated by

1. Generating $\lambda_i$ from Ga($\nu/2$, $\nu/2$) (parameterized as rate)

2. Generating $Z_i \mid \lambda_i$ from $N(x_i^T\beta, \lambda_i^{-1})$

Marginally, $Z_i$ is distributed as $t_\nu(x_i^T\beta)$, and the original binary $Y_i$ is modeled using the corresponding $t_\nu$-link function
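A quick simulation check of this scale-mixture representation (a sketch, not part of the course code):

set.seed(1)
nu<-8
nsamp<-1e5
lambda<-rgamma(nsamp,nu/2,rate=nu/2)       # lambda ~ Ga(nu/2, nu/2)
z<-rnorm(nsamp,0,sqrt(1/lambda))           # Z | lambda ~ N(0, 1/lambda)
qqplot(qt(ppoints(500),df=nu),quantile(z,ppoints(500)),
       xlab="t quantiles",ylab="mixture quantiles")
abline(0,1)                                # Points should fall on the 45-degree line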

23 / 160 Gibbs Sampler for the t-Link Model∗

This leads to the following modification of the probit Gibbs sampler:

1. For $i = 1, \ldots, n$, sample $z_i$ from its truncated $N(x_i^T\beta, \lambda_i^{-1})$ distribution, analogous to the earlier sampler

2. Sample $\beta$ from $N_p(m, V)$, where

$V = (T_0 + X^T W X)^{-1}$

$m = V(T_0\beta_0 + X^T W z)$ (analogous to WLS)

$W = \mathrm{diag}(\lambda_i)$

3. For $i = 1, \ldots, n$, sample $\lambda_i$ from $\mathrm{Ga}\big((\nu + 1)/2,\ (\nu + (z_i - x_i^T\beta)^2)/2\big)$

4 Optionally, place a discrete uniform prior on ν and apply the discrete version of Bayes' Theorem to get the posterior probabilities

5 For the logistic approximation, set ν = 8 and divide the posterior means and SDs by 0.634 to recover the logistic results

See t-Link Logit.r for details

24 / 160 R Code for t-Link Logistic Model

Gibbs Sampler for t-Link Logit Model

# A&C assign diffuse (improper) priors for beta, so no explicit specification

# Initial Values
lambda<-rep(1,n)           # Weights
nu<-8                      # t df (8 ~ logistic) -- assume fixed
z<-rep(0,n)                # Latent z vector
ns<-table(y)               # Category sample sizes

########### # Gibbs # ########### tmp<-proc.time() # Store current time

for (i in1:nsim) {

# Update z using inverse-CDF method muz<-X%*%beta z[y==0]<-qnorm(runif(ns[1],0,pnorm(0,muz[y==0],sqrt(1/lambda[y==0]))),muz[y==0],sqrt(1/lambda[y==0])) z[y==1]<-qnorm(runif(ns[2],pnorm(0,muz[y==1],sqrt(1/lambda[y==1])),1),muz[y==1],sqrt(1/lambda[y==1]))

vbeta<-solve(crossprod(sqrt(lambda)*X)) # Can no longer update outside Gibbs loop betahat<-vbeta%*%(crossprod(lambda*X,z)) beta<-c(rmvnorm(1,betahat,vbeta))

# Update lambda lambda<-rgamma(n,(nu+1)/2,(nu+(z-X%*%beta)^2)/2)

################# # Store Results # ################# if (i> burn & i%%thin==0){ j<-(i-burn)/thin Beta[j,]<-beta }

if (i%%100==0) print(i)

}

proc.time()-tmp # MCMC run time -- 1 sec to run 1000 iterations with n=1000

# Results mbeta<-colMeans(Beta/.634) # Correction factor is 1/.634 sbeta<-apply(Beta/.634,2,sd)

25 / 160

1 Example

The program t-Link Logit.r generates 1000 observations from the following logistic model:

Yi ∼ Bern(πi )

logit(πi ) = β0 + β1xi , i = 1,..., 1000.

The results, based on 1000 iterations with a burn-in of 500, are:

Table 3: Results for Logistic Model with t-Link.

Parameter   True Value   MLE (SE)       Posterior Mean (SD)†
β0          −1           −1.08 (0.08)   −1.08 (0.08)
β1          1            0.92 (0.09)    0.93 (0.09)

† Based on Albert and Chib t-link model.

26 / 160 Using Pólya-Gamma Latent Variables

Polson et al. (2012) proposed an alternative Gibbs sampler for logistic and negative binomial models

The approach introduces a vector of latent variables, Zi , that are scale mixtures of normals with independent Pólya-Gamma precision terms rather than Gamma precision terms as in the t-link model

A random variable $\omega$ is said to have a Pólya-Gamma distribution with parameters $b > 0$ and $c \in \mathbb{R}$, written $\omega \sim \mathrm{PG}(b, c)$, if

$\omega \overset{d}{=} \dfrac{1}{2\pi^2} \sum_{k=1}^{\infty} \dfrac{g_k}{(k - 1/2)^2 + c^2/(4\pi^2)},$

where the $g_k$'s are independently distributed according to a Ga($b$, 1) distribution

27 / 160 Pólya-Gamma Density Plot

[Figure: density of the PG(1, 0) distribution overlaid with a Ga(2, 10) density for comparison]
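The figure above can be reproduced with a short sketch using the rpg function (assuming the BayesLogit package is installed; the Ga(2, 10) overlay matches the figure legend):

library(BayesLogit)
set.seed(1)
w<-rpg(1e5,1,0)                                     # Draws from PG(1, 0)
plot(density(w,from=0),xlim=c(0,1.5),main="PG(1, 0) vs Ga(2, 10)")
curve(dgamma(x,2,rate=10),add=TRUE,lty=2)           # Gamma density for comparison
legend("topright",c("PG(1, 0)","Ga(2, 10)"),lty=1:2)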

28 / 160 Properties of the Pólya-Gamma Distribution

Polson et al. establish an important property of the PG($b$, $c$) density, namely that for $a \in \mathbb{R}$ and $\eta \in \mathbb{R}$,

$\dfrac{(e^\eta)^a}{(1 + e^\eta)^b} = 2^{-b} e^{\kappa\eta} \int_0^\infty e^{-\omega\eta^2/2}\, p(\omega \mid b, 0)\, d\omega,$

where κ = a − b/2 and p(ω|b, 0) denotes a PG(b, 0) density.

The left-hand side has the same functional form as the probability parameter in a logistic regression model

The integrand on the right-hand side is the kernel of a normal density with precision ω (i.e., the conditional density of η) times the prior for ω
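This identity can be verified numerically with a small Monte Carlo sketch (the values of a, b, and η below are arbitrary):

library(BayesLogit)
set.seed(1)
a<-1; b<-1; eta<-0.7
kappa<-a-b/2
omega<-rpg(1e5,b,0)                                       # omega ~ PG(b, 0)
lhs<-exp(eta)^a/(1+exp(eta))^b
rhs<-2^(-b)*exp(kappa*eta)*mean(exp(-omega*eta^2/2))      # Monte Carlo estimate of the integral
c(lhs=lhs,rhs=rhs)                                        # Should agree up to Monte Carlo error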

29 / 160 Connection to the Logistic Model

In particular, under the logistic model, the likelihood for the binary response vector $Y = (Y_1, \ldots, Y_n)^T$ is

$p(y \mid \beta) = \prod_{i=1}^n p(y_i \mid \beta) = \prod_{i=1}^n \left(\dfrac{\exp(\eta_i)}{1 + \exp(\eta_i)}\right)^{y_i} \left(\dfrac{1}{1 + \exp(\eta_i)}\right)^{1 - y_i} = \prod_{i=1}^n \dfrac{(e^{\eta_i})^{y_i}}{1 + e^{\eta_i}},$

where $\eta_i = x_i^T\beta$.

The i-th element of the Bernoulli likelihood has the same form as the left-hand expression in the earlier property, with ai = Yi and b = 1. 30 / 160 Connection to the Logistic Model

Thus, we can re-write the Bernoulli likelihood in terms of the Pólya-Gamma random variables $\omega = (\omega_1, \ldots, \omega_n)^T$:

$p(y_i \mid \beta) \propto e^{\kappa_i \eta_i} \int_0^\infty e^{-\omega_i \eta_i^2/2}\, p(\omega_i \mid 1, 0)\, d\omega_i,$

where $\kappa_i = y_i - \tfrac{1}{2}$.

So, at each iteration, we'll draw $\omega_i$, update $\beta$ based on the normal model for $\eta_i = x_i^T\beta$, and then average across the iterations to perform the Monte Carlo integration

31 / 160 Distribution of Latent Normals Z

By appealing to the above properties of the Pólya-Gamma distribution, Polson et al. show that the full conditional distribution of $\beta$, given $Y$ and $\omega$, is

$p(\beta \mid Y = y, \omega) \propto \pi(\beta)\, \exp\!\left\{-\tfrac{1}{2}(z - X\beta)^T W (z - X\beta)\right\},$

where

• $\pi(\beta)$ is the prior distribution for $\beta$

• For $i = 1, \ldots, n$, $z_i = \dfrac{y_i - 1/2}{\omega_i}$, with $z = (z_1, \ldots, z_n)^T$

• $W = \mathrm{diag}(\omega_i)$ is an $n \times n$ precision matrix

It is clear that the random variable $Z$ is normally distributed with mean $\eta = X\beta$ and diagonal covariance $W^{-1}$

32 / 160 Full Conditional for β

Thus, assuming a $N_p(\beta_0, T_0^{-1})$ prior for $\beta$, the full conditional for $\beta$ given $Z = z$ and $W$ is $N_p(m, V)$, where

$V = (T_0 + X^T W X)^{-1}$

$m = V(T_0\beta_0 + X^T W z)$

This leads to the following Gibbs sampler:

1. For $i = 1, \ldots, n$, update $\omega_i$ from a PG(1, $\eta_i$) density, where $\eta_i = x_i^T\beta$

2. For $i = 1, \ldots, n$, define $z_i = \dfrac{y_i - 1/2}{\omega_i}$

3. Conditional on $z$ and $W$, update $\beta$ from $N_p(m, V)$, where $m$ and $V$ are given above.

33 / 160 Sampling from the PG Density

Acceptance-rejection sampling is used to draw from the PG density

This can be implemented using the rpg function from the R package BayesLogit

I have uploaded a zip file of the package onto the course website

Alternatively, you can download from https://mran.microsoft.com/snapshot/2014-10-20/ web/packages/BayesLogit/index.html

Or go to https://github.com/jwindle/BayesLogit for a more recent version

See PG Logit.r for details

34 / 160 R Code for Pólya-Gamma Logistic Model

Logistic Regression Using PG Latent Variables

library(BayesLogit)        # For rpg function
library(mvtnorm)

# Priors
beta0<-rep(0,p)            # Prior mean of beta
T0<-diag(.01,p)            # Prior precision of beta

# Inits
beta<-rep(0,p)

#################
# Store Samples #
#################
nsim<-1000                 # Number of MCMC iterations
thin<-1                    # Thinning interval
burn<-nsim/2               # Burn-in
lastit<-(nsim-burn)/thin   # Last stored value
Beta<-matrix(0,lastit,p)

#########
# Gibbs #
#########
tmp<-proc.time()           # Store current time

for (i in 1:nsim){
  eta<-X%*%beta
  w<-rpg(n,1,eta)
  z<-(y-1/2)/w                            # Or define z=y-1/2 and omit w in posterior mean m below
  v<-solve(crossprod(X*sqrt(w))+T0)       # Or solve(t(X)%*%W%*%X+T0), where W=diag(w) -- but this is slower
  m<-v%*%(T0%*%beta0+t(w*X)%*%z)          # Can omit w here if you define z=y-1/2
  beta<-c(rmvnorm(1,m,v))

  #################
  # Store Results #
  #################
  if (i>burn & i%%thin==0){
    j<-(i-burn)/thin
    Beta[j,]<-beta
  }

  if (i%%100==0) print(i)
}

proc.time()-tmp  # MCMC run time -- 1.2 seconds to run 1000 iterations with n=1000

35 / 160

Example

The program PG Logit.r fits the same logistic model as before:

Yi ∼ Bern(πi )

logit(πi ) = β0 + β1xi , i = 1,..., 1000.

The results, based on 1000 iterations with a burn-in of 500, are identical to the previous ones:

Table 4: Results for Logistic Model.

Parameter   True Value   MLE (SE)       Posterior Mean (SD)†
β0          −1           −1.08 (0.08)   −1.08 (0.08)
β1          1            0.92 (0.09)    0.93 (0.09)

† Based on PG sampler.

36 / 160 Bayesian

Albert and Chib (1993) extend their data augmentation sampler to accommodate ordered categorical outcomes

For example, the cumulative logit model with K categories is

$\mathrm{logit}(\phi_{ik}) = \mathrm{logit}[\Pr(Y_i \leq k)] = \gamma_k + x_i^T\beta, \quad k = 1, \ldots, K - 1,$

• φik is the k-th cumulative probability

• $\gamma_k$ is the intercept associated with the $k$-th cumulative logit

• $x_i$ is a vector of covariates excluding an intercept

• $\beta$ is a vector of regression coefficients common to all cumulative logits (proportional odds)

$\beta_j$ denotes the increase in the log odds of lower category values for a unit increase in $x_j$

37 / 160 Latent Variable Interpretation of Ordinal Model

[Figure: density of the latent $Z_i$ centered at $x_i^T\beta$, partitioned by cutpoints $\gamma_1 < \gamma_2$: $\Pr(Z_i < \gamma_1) = \Pr(Y_i = 1)$, $\Pr(\gamma_1 < Z_i < \gamma_2) = \Pr(Y_i = 2)$, and $\Pr(Z_i > \gamma_2) = \Pr(Y_i = 3)$]

38 / 160 Gibbs Sampler for Proportional Odds Model

The Gibbs sampler is a slight modification of the previous sampler for probit/logit regression

Here, in addition to updating $Z_i$ and $\beta$, we must also update $\gamma = (\gamma_1, \ldots, \gamma_{K-1})^T$

Note that we have the following order constraint:

all $Z_{(y = k)} < \gamma_k <$ all $Z_{(y = k+1)}$, for $k = 1, \ldots, K - 1$

This implies, for instance, that $\gamma_1$ has to lie between the largest $Z$ such that $Y = 1$ and the smallest $Z$ such that $Y = 2$

We have to obey this constraint when updating γ

39 / 160 Gibbs Sampler for Proportional Odds Model

The sampler proceeds by initializing γ and β and then iterating through the following steps:

1. For $i = 1, \ldots, n$, if $y_i = k$, draw $z_i$ from $N(x_i^T\beta, \lambda_i^{-1})$ truncated between $\gamma_{k-1}$ and $\gamma_k$, with $\gamma_0 = -\infty$ and $\gamma_K = \infty$

Note: For the cumulative probit model, $\lambda_i = 1$ for all $i$

2. Update $\beta$ from $N_p(m, V)$, where

$V = (T_0 + X^T W X)^{-1}$

$m = V(T_0\beta_0 + X^T W z)$

where $W = I_n$ for the probit model and $W = \mathrm{diag}(\lambda_i)$ for the t-link model

3. For $k = 1, \ldots, K - 1$, update $\gamma_k$ from Unif($\max\{z_i : y_i = k\}$, $\min\{z_i : y_i = k + 1\}$)

4 For the cumulative logit model, update λi from its Gamma full conditional analogous to the (2-category) logistic model discussed earlier

See Cum_Logit.r and Cum_Probit.r for details

40 / 160 R Code for Cumulative Logit Model Gibbs Sampler for 3-Category Cumulative Logit Model

First assign a $N(\beta_0, T_0^{-1})$ prior to $\beta$

# Initial Values
gam1<-0
gam2<-3
beta<-0                    # Only 1 covariate in this example
lambda<-rep(1,n)           # Weights
nu<-8                      # 8 df ~ logistic -- assume fixed
z<-rep(0,n)                # Latent z vector

################### # GIBBS SAMPLER # ################### tmp<-proc.time() for (i in1:nsim) {

# Draw latent z using inverse CDF method muz<-x*beta z[y==1]<-qnorm(runif(ns[1],0,pnorm(gam1,muz[y==1],sqrt(1/lambda[y==1]))),muz[y==1],sqrt(1/lambda[y==1])) z[y==2]<-qnorm(runif(ns[2],pnorm(gam1,muz[y==2],sqrt(1/lambda[y==2])),pnorm(gam2,muz[y==2],sqrt(1/lambda[y==2]))),muz[y==2],sqrt(1/lambda[y==2])) z[y==3]<-qnorm(runif(ns[3],pnorm(gam2,muz[y==3],sqrt(1/lambda[y==3])),1),muz[y==3],sqrt(1/lambda[y==3]))

# Update gammas gam1<-runif(1,max(z[y==1]),min(z[y==2])) gam2<-runif(1,max(z[y==2]),min(z[y==3]))

# Update beta vbeta<-solve(crossprod(x*sqrt(lambda))) mbeta<-vbeta%*%(crossprod(x*lambda,z)) beta<-rnorm(1,mbeta,sqrt(vbeta))

# Update lambda lambda<-rgamma(n,(nu+1)/2,(nu+(z-x*beta)^2)/2)

################# # Store Results # ################# if (i> burn & i%%thin==0){ j<-(i-burn)/thin Beta[j]<-beta Gam1[j]<-gam1 Gam2[j]<-gam2 }

if(i%%100==0) print(i) }

proc.time()-tmp # MCMC run time -- took 4.5 seconds to run 10K iterations with n=500 41 / 160

Example

The program Cum_Logit.r simulates 500 observations according to the following three-category cumulative logit model:

logit(φik ) = logit[Pr(Yi ≤ k)] = γk + βxi , k = 1, 2.

The results, based on 10,000 iterations with a burn-in of 5000 and thinning of 10, are:

Table 5: Results for Cumulative Logit Model.

Parameter   True Value   MLE (SE)       Posterior Mean (SD)†
γ1          −1           −1.04 (0.17)   −1.00 (0.19)
γ2          1            0.87 (0.16)    0.95 (0.17)
β           0.75         0.73 (0.08)    0.77 (0.08)

† Based on Albert and Chib sampler.

42 / 160 Trace Plots

Figure 1: Trace plots show high autocorrelation, especially among the γs.

[Figure: trace plots of γ1, γ2, and β over 500 stored iterations]

43 / 160 Multinomial Logit Regression

Let $Y_i$ denote an unordered or nominal categorical RV taking values $k = 1, \ldots, K$ with $k$-th probability $\pi_{ik} = \Pr(Y_i = k \mid x_i)$, where $\sum_{k=1}^K \pi_{ik} = 1$

The multinomial logit model is given by

$\Pr(Y_i = k \mid x_i) = \pi_{ik} = \dfrac{e^{x_i^T\beta_k}}{\sum_{j=1}^K e^{x_i^T\beta_j}}, \quad k = 1, \ldots, K,$

with $\beta_K = 0$ for the reference category $K$. Alternatively, we can write

$\Pr(Y_i = k \mid x_i) = \pi_{ik} = \dfrac{e^{x_i^T\beta_k}}{1 + \sum_{j=1}^{K-1} e^{x_i^T\beta_j}}, \quad k = 1, \ldots, K - 1$

$\Pr(Y_i = K \mid x_i) = \pi_{iK} = \dfrac{1}{1 + \sum_{j=1}^{K-1} e^{x_i^T\beta_j}}$ (reference category),

where $\beta_k$ denotes the regression coefficients associated with the $k$-th category

In other words, conditional on $x_i$, $Y_i \sim \mathrm{Multi}(1, \pi_i)$, where $\pi_i = (\pi_{i1}, \ldots, \pi_{iK})$ and $\sum_{k=1}^K \pi_{ik} = 1$.

44 / 160 Multinomial Logit Regression

We can also write the multinomial logit model as

$\dfrac{\Pr(Y_i = k \mid x_i)}{\Pr(Y_i = K \mid x_i)} = \dfrac{\pi_{ik}}{\pi_{iK}} = \exp(x_i^T\beta_k), \quad k = 1, \ldots, K - 1,$

which is the odds of being in category $k$ vs. the reference category $K$.

Or, equivalently,

$\ln\!\left(\dfrac{\pi_{ik}}{\pi_{iK}}\right) = x_i^T\beta_k.$

In the proportional odds model, only the intercepts $\gamma_k$ varied across categories, while $\beta$ remained constant

Here, all regression parameters vary with respect to $k$ (hence $\beta_k$)

There are also partial proportional odds models that allow only some β's to vary across categories

45 / 160 Latent Variable Interpretations

Both the proportional odds and multinomial logit models have latent variable interpretations

For the MNL model, we have K underlying utilities:

$U_{i1} = x_i^T\beta_1 + e_{i1}$
$\vdots$
$U_{i(K-1)} = x_i^T\beta_{K-1} + e_{i(K-1)}$
$U_{iK} = e_{iK},$

where the $e_{ik}$'s follow i.i.d. standard Gumbel or Type-I extreme value distributions with CDFs of the form $F_X(x) = e^{-e^{-x}}$

Set $Y_i = k$ iff $U_{ik} = \max(U_{i1}, \ldots, U_{iK})$

HW: Show that the difference of two independent standard Gumbel RVs follows a logistic distribution; a simulation sketch is given below
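A simulation sketch for this homework check (not a proof): generate two independent standard Gumbel samples via the inverse CDF and compare their difference with logistic quantiles.

set.seed(1)
nsamp<-1e5
g1<- -log(-log(runif(nsamp)))              # Standard Gumbel via inverse CDF
g2<- -log(-log(runif(nsamp)))
d<-g1-g2
qqplot(qlogis(ppoints(500)),quantile(d,ppoints(500)),
       xlab="logistic quantiles",ylab="Gumbel-difference quantiles")
abline(0,1)                                # Points should fall on the 45-degree line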

46 / 160 Models

This forms the basis of so-called random utility or discrete choice models

See Train Discrete Choice Methods with Simulation for more details on discrete choice models

The author discusses pros and cons of the multinomial logit model, including the independence of irrelevant alternatives (IIA) property as exemplified in the classic Red Bus/Blue Bus problem

To circumvent this problem, one can use the nested logit or multinomial probit model

For further details, see Agresti, Categorical Data Analysis, Sections 8.5 and 8.6, and references therein

47 / 160 Bayesian Inference for Multinomial Logit Model

The Pólya-Gamma data-augmentation approach can be extended to handle multinomial logit regression

Recall that the multinomial logit model can be written as

$\Pr(Y_i = k \mid x_i) = \pi_{ik} = \dfrac{e^{x_i^T\beta_k}}{\sum_{j=1}^K e^{x_i^T\beta_j}}, \quad k = 1, \ldots, K,$ with $\beta_K = 0$ for the reference category $K$.

Let's consider the update for $\beta_k$ ($k = 1, \ldots, K - 1$) conditional on both $y$ and $\beta_j\ \forall\, j \neq k$.

The idea is to cycle through the updates of $\beta_k$ one at a time, conditional on the other $\beta$'s

48 / 160 Bayesian Inference for Multinomial Logit Model∗

Following Holmes and Held (2006), we can write the full conditional for $\beta_k$, given $y$ and $\beta_{j \neq k}$, as a function of a Bernoulli likelihood

$f(\beta_k \mid y, \beta_{j \neq k}) \propto f(\beta_k) \prod_{i=1}^n \pi_{ik}^{U_{ik}} (1 - \pi_{ik})^{1 - U_{ik}}$

where

• $f(\beta_k)$ denotes the prior for $\beta_k$

• $U_{ik} = 1_{(Y_i = k)}$ is an indicator that $Y_i = k$

• $\pi_{ik} = \Pr(Y_i = k) = \Pr(U_{ik} = 1) = \dfrac{e^{x_i^T\beta_k}}{\sum_{j=1}^K e^{x_i^T\beta_j}}$

49 / 160 Bayesian Inference for Multinomial Logit Model∗

Further, we can rewrite πik as

$\pi_{ik} = \Pr(U_{ik} = 1) = \dfrac{e^{x_i^T\beta_k - c_{ik}}}{1 + e^{x_i^T\beta_k - c_{ik}}} = \dfrac{e^{\eta_{ik}}}{1 + e^{\eta_{ik}}},$

where $c_{ik} = \log \sum_{j \neq k} e^{x_i^T\beta_j}$ and $\eta_{ik} = x_i^T\beta_k - c_{ik}$.

Note that $\sum_{j \neq k} e^{x_i^T\beta_j}$ includes the reference category $K$. Because $\beta_K = 0$, it follows that $e^{x_i^T\beta_K} = 1$ and hence

$c_{ik} = \log\!\Big(\sum_{j \neq k} e^{x_i^T\beta_j}\Big) = \log\!\Big(1 + \sum_{j \notin \{k, K\}} e^{x_i^T\beta_j}\Big)$

50 / 160 Bayesian Inference for Multinomial Logit Model

Thus, the full conditional for $\beta_k$, given $y$ and $\beta_{j \neq k}$, is

$f(\beta_k \mid y, \beta_{j \neq k}) \propto f(\beta_k) \prod_{i=1}^n \left(\dfrac{e^{\eta_{ik}}}{1 + e^{\eta_{ik}}}\right)^{U_{ik}} \left(\dfrac{1}{1 + e^{\eta_{ik}}}\right)^{1 - U_{ik}} = f(\beta_k) \prod_{i=1}^n \dfrac{(e^{\eta_{ik}})^{U_{ik}}}{1 + e^{\eta_{ik}}},$

which is essentially a logistic regression likelihood.

Thus, we can apply the Pólya-Gamma data-augmentation scheme to update the $\beta_k$'s ($k = 1, \ldots, K - 1$) one at a time based on the binary indicators $U_{ik} = 1_{(Y_i = k)}$

51 / 160 Gibbs Sampler for Multinomial Logit Model

Assuming a $N(\beta_0, T_0^{-1})$ prior for $\beta_1, \ldots, \beta_{K-1}$, the Gibbs sampler proceeds as follows:

1. Outside of the Gibbs loop, define $u_{ik} = 1_{(y_i = k)}$ for $i = 1, \ldots, n$ and $k = 1, \ldots, K - 1$

2. For $i = 1, \ldots, n$ and $k = 1, \ldots, K - 1$, update $\omega_{ik}$ from a PG(1, $\eta_{ik}$) density, where $\eta_{ik} = x_i^T\beta_k - c_{ik}$ and $c_{ik}$ was defined earlier

3. For $i = 1, \ldots, n$ and $k = 1, \ldots, K - 1$, define $z_{ik} = \dfrac{u_{ik} - 1/2}{\omega_{ik}} + c_{ik}$ and let $z_k = (z_{1k}, \ldots, z_{nk})^T$

4. For $k = 1, \ldots, K - 1$, update $\beta_k$ from $N_p(m_k, V_k)$, where

$V_k = (T_0 + X^T W_k X)^{-1}$

$m_k = V_k(T_0\beta_0 + X^T W_k z_k),$

and $W_k = \mathrm{diag}(\omega_{ik})$. See Polson et al. (2012), page 41, for details, but note the typo in the expression for $m_j$ at the bottom of the page

52 / 160 Example

The program multinomial.r simulates 1000 observations from the following three-category multinomial model

$\Pr(Y_i = k \mid x_i) = \pi_{ik} = \dfrac{e^{x_i^T\beta_k}}{\sum_{j=1}^K e^{x_i^T\beta_j}}, \quad k = 1, \ldots, 3,$

with β1 = 0 for reference category 1.

The program generates data using the Gumbel latent utilities

See multinomial.r for details

53 / 160 R Code for Generating Multinomial Logit Data

Data Generation for Multinomial Logit Model

# Multinomial.r # Generate and fit a 3-Category Multinomial Logit # Generate data using latent Gumbel random utilities # See Polson et al. 2012 Appendix S6.3 BUT NOTE TYPO AT BOTTOM OF PAGE 41! # 3-Category outcome: Independent, Republican and Democrat, with Ind as ref group ####################################### library(QRM) # For rGumbel function library(nnet) # To fit multinom function library(BayesLogit) # For rpg function

############################## # Generate Data under # # Random Utility Model (RUM) # ############################## set.seed(060817) n<-1000 K<-3 # Number of response categories female<-rbinom(n,1,.5) X<-cbind(1,female) p<-ncol(X) beta2<-c(1,-.5) # Males much more likely to be Rep than Ind and females "somewhat" more likely beta3<-c(.5,.5) # Males somewhat more likely to be Dem than Indep and females much more likely eta2<-X%*%beta2 eta3<-X%*%beta3 u1<-rGumbel(n, mu =0, sigma =1) u2<-rGumbel(n, mu = eta2, sigma =1) u3<-rGumbel(n, mu = eta3, sigma =1) U<-cbind(u1,u2,u3) y<-c(apply(U,1,which.max)) # Y=1 if Ind (Reference), 2 if Rep, 3 if Dem

fit<- multinom(y~female)

54 / 160

1 R Code for 3-Category Multinomial Logit Model

Gibbs Sampler for Multinomial Logit Model

# Define category-specific binary responses (Note: cat 1 is reference) u2<-1*(y==2) u3<-1*(y==3)

# Priors beta0<-rep(0,p) # Prior mean of beta T0<-diag(.01,p) # Prior precision of beta

# Inits beta2<-beta3<-rep(0,p)

################# # Gibbs sampler # ################# tmp<-proc.time() # Store current time

for (i in1:nsim){

# Update category 2 c2<-log(1+exp(X%*%beta3)) # Note that for Cat 1, beta1=0 so that exp(X%*%beta1)= 1 eta2<-X%*%beta2-c2 w2<-rpg(n,1,eta2) z2<-(u2-1/2)/w2+c2 # Note plus sign before c2 -- Polson has typo # Could also define z2<-u2-1/2+w2*c2 and omit w2 in post. mean m below v<-solve(T0+crossprod(X*sqrt(w2))) m<-v%*%(T0%*%beta0+t(w2*X)%*%z2) beta2<-c(rmvnorm(1,m,v))

# Update category 3 c3<-log(1+exp(X%*%beta2)) eta3<-X%*%beta3-c3 w3<-rpg(n,1,eta3) z3<-(u3-1/2)/w3+c3 v<-solve(T0+crossprod(X*sqrt(w3))) m<-v%*%(T0%*%beta0+t(w3*X)%*%z3) beta3<-c(rmvnorm(1,m,v))

# Store if (i> burn & i%%thin==0){ j<-(i-burn)/thin Beta[j,]<-c(beta2,beta3) }

if (i%%100==0) print(i) }

proc.time()-tmp # MCMC run time = 2.5 seconds to run 1000 iterations 55 / 160

1 Results

The results, based on 1000 iterations with a burn-in of 500, are:

Table 6: Results for Multinomial Logit Model.

Parameter   True Value   MLE (SE)       Posterior Mean (SD)†
β10         1            1.00 (0.12)    1.00 (0.13)
β11         −0.5         −0.42 (0.18)   −0.41 (0.18)
β20         0.5          0.49 (0.13)    0.50 (0.14)
β21         0.5          0.52 (0.18)    0.53 (0.19)

† Based on Polson et al. Pólya-Gamma sampler.

56 / 160 Bayesian Inference for Negative Binomial Model

Pillow and Scott (2012) extend the Pólya-Gamma sampler to the negative binomial (NB) regression setting

Consider the following model for a count r.v. $Y_i$:

$p(y_i \mid r, \beta) = \dfrac{\Gamma(y_i + r)}{\Gamma(r)\, y_i!} (1 - \psi_i)^r \psi_i^{y_i}, \quad r > 0,$ where

$\psi_i = \dfrac{\exp(x_i^T\beta)}{1 + \exp(x_i^T\beta)} = \dfrac{\exp(\eta_i)}{1 + \exp(\eta_i)}$

Note that the NB probability parameter ψi is parameterized using the expit function

This allows us to apply the same properties of the Pólya-Gamma density as in the logistic case 57 / 160 Bayesian Inference for Negative Binomial Model

The mean and variance of Yi are

$E(Y_i \mid r, \beta) = \dfrac{r\psi_i}{1 - \psi_i} = r\exp(\eta_i) = \mu_i$

$\mathrm{Var}(Y_i \mid r, \beta) = \dfrac{r\psi_i}{(1 - \psi_i)^2} = r\exp(\eta_i)[1 + \exp(\eta_i)] = \mu_i(1 + \mu_i/r)$

The parameter α = 1/r captures the overdispersion in the data, such that as α → ∞, the counts become increasingly dispersed relative to the Poisson distribution
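A quick simulation check of these moment formulas (arbitrary values of r and µ):

set.seed(1)
r<-2; mu<-3
y<-rnbinom(1e6,size=r,mu=mu)
c(mean(y),mu)                              # Sample mean vs mu
c(var(y),mu*(1+mu/r))                      # Sample variance vs mu(1 + mu/r)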

58 / 160 Bayesian Inference for Negative Binomial Model

Exploiting the earlier property of the Pólya-Gamma distribution, it follows that

$p(y_i \mid r, \beta) \propto e^{\kappa_i \eta_i} \int_0^\infty e^{-\omega_i \eta_i^2/2}\, p(\omega_i \mid y_i + r, 0)\, d\omega_i,$

where κi = (yi − r)/2 and the ωi 's are independently distributed according to PG(yi + r, ηi ).

59 / 160 Bayesian Inference for Negative Binomial Model

Following Pillow and Scott, the full conditional for $\beta$ is

$p(\beta \mid y, r, \omega) \propto \pi(\beta)\, \exp\!\left\{-\tfrac{1}{2}(z - X\beta)^T W (z - X\beta)\right\},$

where

• $z$ is an $n \times 1$ vector with elements $z_i = \dfrac{y_i - r}{2\omega_i}$

• W = diag(ωi )

60 / 160 Gibbs Sampler for Negative Binomial Model

Given current values for β and r, the Gibbs sampler for the NB model proceeds as follows:

1. For $i = 1, \ldots, n$, draw $\omega_i$ from its PG($y_i + r$, $\eta_i$) distribution, where $\eta_i = x_i^T\beta$

2. For $i = 1, \ldots, n$, define $z_i = \dfrac{y_i - r}{2\omega_i}$

3. Assuming a $N_p(\beta_0, T_0^{-1})$ prior, update $\beta$ from its $N_p(m, V)$ full conditional, where

$V = (T_0 + X^T W X)^{-1}$

$m = V(T_0\beta_0 + X^T W z)$

4 Update r using a random-walk Metropolis-Hastings step with a zero-truncated normal proposal density. Alternatively, update r using a conjugate Gamma distribution as described in Dadaneh et al. (2018) See NB_MH.r and NB_Gibbs.r for details 61 / 160 Conjugate Gibbs Update for r: Step 1

Zhou and Carin (2015) and Dadaneh et al. (2018) describe a two-step conjugate Gibbs update for r

The approach introduces a sample of latent counts, li , underlying each observed count yi

Conditional on $y_i$ and $r$, $l_i$ has a distribution defined by a Chinese restaurant table (CRT) distribution:

$l_i = \sum_{j=1}^{y_i} u_j, \qquad u_j \sim \mathrm{Bern}\!\left(\dfrac{r}{r + j - 1}\right).$

The name Chinese restaurant table derives from the fact that $u_j = 1$ if a new customer sits at an unoccupied table in a Chinese restaurant (according to a so-called Chinese restaurant

process), and li is the total number of occupied tables in the restaurant after yi customers

So in Step 1, we draw li (i = 1,..., n) according to this CRT distribution 62 / 160 Conjugate Gibbs Update for r: Step 2

In Step 2, the authors exploit the fact that the NB distribution can be derived from a random convolution of logarithmic RVs

Specically, they note that, conditional on r and ψi ,

ind li ∼ Poi[−r ln(1 − ψi )], where exp x T β ψ = i , i = 1,..., n. i 1 exp x T  + i β See Dadaneh et al. (2018) and Zhou and Carin (2015) for details

Thus, if we assume a Ga(a, b) prior for r, then the full conditional for r in Step 2 is

" n n # X X r|l, ψ ∼ Ga a + li , b − ln(1 − ψi ) , i=1 i=1

T T where l = (l1,..., ln) and ψ = (ψ1, . . . , ψn) .

The Gibbs update rst draws li (i = 1,..., n) independently from a CRT distribution, and then r from its full conditional Gamma distribution given l and ψ 63 / 160 R Code for Negative Binomial Model with MH Update for r

Hybrid Gibbs-MH Sampler for Negative Binomial Model

library(msm)               # For rtnorm and dtnorm (truncated-normal proposal)

# Priors (diffuse prior for r)
beta0<-rep(0,p)
T0<-diag(.01,p)
s<-0.01                    # Proposal variance -- NOTE: may need to lower this as n_i increases

# Inits and Store
beta<-rep(0,p)
r<-1                       # Initial value for r
Acc<-0                     # MH acceptance counter

########
# MCMC #
########
for (i in 1:nsim){

  # Update r
  eta<-X%*%beta
  q<-1/(1+exp(eta))                        # dnbinom fn uses q=1-psi
  rnew<-rtnorm(1,r,sqrt(s),lower=0)        # Proposal
  ratio<-sum(dnbinom(y,rnew,q,log=T))-sum(dnbinom(y,r,q,log=T))+   # Likelihood -- diffuse prior for r
    dtnorm(r,rnew,sqrt(s),0,log=T)-dtnorm(rnew,r,sqrt(s),0,log=T)  # Proposal not symmetric
  if (log(runif(1))<ratio){                # Accept proposal
    r<-rnew
    Acc<-Acc+1
  }

  # Update beta
  w<-rpg(n,y+r,eta)                        # Polya weights
  z<-(y-r)/(2*w)                           # Latent response
  v<-solve(crossprod(X*sqrt(w))+T0)
  m<-v%*%(T0%*%beta0+t(sqrt(w)*X)%*%(sqrt(w)*z))
  beta<-c(rmvnorm(1,m,v))

  # Store
  if (i>burn & i%%thin==0){
    j<-(i-burn)/thin
    Beta[j,]<-beta
    R[j]<-r
  }

  if (i%%100==0) print(i)                  # 11 seconds to run 2000 iterations with n=1000

}

64 / 160

1 R Code for NB Model with Gibbs Update for r

Negative Binomial Sampler with Gibbs Update for r

beta0<-rep(0,p) T0<-diag(.01,p) a<-b<-0.01 # Gamma hyperparms for r

# Inits and Store beta<-rep(0,p) l<-rep(0,n) # Latent counts r<-1

######## # MCMC # ######## for (i in1:nsim){

# Update latent counts, l, using CRT distribution for(j in1:n) l[j]<- sum(rbinom(y[j],1,round(r/(r+1:y[j]-1),6))) # Could try to avoid loop # Rounding avoids numerical instability # Update r from conjugate gamma distribution given l and beta eta<-X%*%beta psi<-exp(eta)/(1+exp(eta)) r<-rgamma(1,a+sum(l),b-sum(log(1-psi)))

# Update beta w<-rpg(n,y+r,eta) # Polya weights z<-(y-r)/(2*w) # Latent response v<-solve(crossprod(X*sqrt(w))+T0) m<-v%*%(T0%*%beta0+t(sqrt(w)*X)%*%(sqrt(w)*z)) beta<-c(rmvnorm(1,m,v))

# Store if (i> burn & i%%thin==0){ j<-(i-burn)/thin Beta[j,]<-beta R[j]<-r }

if (i%%100==0) print(i) # 21 seconds to run 2000 iterations with n=1000

}

65 / 160

Example

The program NB.r fits the following NB model:

$p(y_i \mid r, \beta) = \dfrac{\Gamma(y_i + r)}{\Gamma(r)\, y_i!} (1 - \psi_i)^r \psi_i^{y_i}, \quad r > 0,$ where

$\psi_i = \dfrac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)}$

The results, based on 2000 iterations with a burn-in of 1000, are:

Table 7: Results for NB Model.

Parameter   True Value   MLE (SE)      Posterior Mean (SD)†   Posterior Mean (SD)‡
β1          1.0          0.93 (0.08)   0.92 (0.08)            0.92 (0.07)
β2          0.5          0.50 (0.04)   0.50 (0.04)            0.50 (0.04)
r           1.0          1.10 (0.08)   1.10 (0.07)            1.11 (0.07)

† Pólya-Gamma sampler with MH update for r. Acceptance rate = 37%.
‡ Pólya-Gamma sampler with Gibbs update for r.

66 / 160 Trace Plots for NB Model with MH Update for r

[Figure: trace plots of β1, β2, and r over 1000 stored iterations]

67 / 160 Bayesian Inference for Zero-Inflated Count Data

Zero-inflated count data arise when the data contain a larger proportion of zeros than predicted by an ordinary count model such as the NB

Zero-inflated models have been proposed to address the over-abundance of zeros

Zero-inflated models are mixtures of a point mass at zero, representing the excess zeros, and a count distribution for the remaining values

By construction, zero-inflated models partition zeros into two types:

• Structural zeros corresponding to individuals who are not at risk for an event, and therefore have no opportunity for a positive count

• At-risk zeros, which apply to a latent class of individuals who are at risk for an event but nevertheless have an observed response of zero

68 / 160 Zero-Inflated Negative Binomial Model

Consider the following zero-inflated negative binomial (ZINB) model:

$\Pr(Y_i = 0) = (1 - \phi_i) + \phi_i (1 - \psi_i)^r$

$\Pr(Y_i = y_i) = \phi_i \dfrac{\Gamma(y_i + r)}{\Gamma(r)\, y_i!} (1 - \psi_i)^r \psi_i^{y_i}, \quad y_i = 1, 2, \ldots$

$\psi_i = \dfrac{\exp(x_i^T\beta)}{1 + \exp(x_i^T\beta)} = \dfrac{\exp(\eta_i)}{1 + \exp(\eta_i)}$

The parameter φi denotes the probability of belonging to the at-risk class

1 − φi denotes the probability of an excess (i.e., structural) zero
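A minimal sketch for simulating data from this ZINB model (the parameter values are chosen to mirror the example later in the notes; this is not the course data-generation code):

set.seed(1)
n<-1000
x<-rbinom(n,1,0.5)
phi<-plogis(-0.5+0.5*x)                    # At-risk probability
psi<-plogis(2+0.5*x)                       # NB probability parameter
r<-1
w<-rbinom(n,1,phi)                         # Latent at-risk indicator W_i
y<-ifelse(w==1,rnbinom(n,size=r,prob=1-psi),0)   # Structural zeros when W_i = 0
mean(y==0)                                 # Overall proportion of zeros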

69 / 160 Bayesian Inference for Zero-Inflated Count Data

We can rewrite the ZINB model by introducing a latent at-risk indicator variable, $W_i$:

$Y_i \sim (1 - \phi_i)\, 1_{(W_i = 0 \,\wedge\, Y_i = 0)} + \phi_i\, \mathrm{NB}(\mu_i, r)\, 1_{(W_i = 1)}, \quad i = 1, \ldots, n$

Comments:

• With probability 1 − φi , Wi = Yi = 0 (structural zero)

• With probability φi , Wi = 1 and Yi is drawn from a NB distribution with mean µi and dispersion parameter r > 0

• µi = E(Yi |β, r, Wi = 1) is the mean count among those in the at-risk class (conditional on Wi =1) • r captures overdispersion in the at-risk class

• The overall (unconditional) mean is E(Yi |β, r) = φi µi

• If φi = 1, the ZINB reduces to an NB model

• If $\phi_i = 0$, we have a point mass at zero

70 / 160 Zero-Inflated Negative Binomial Model

Typically, we model φi using a logit model and Yi |Wi = 1 using a NB model:

$\mathrm{logit}(\phi_i) = \mathrm{logit}[\Pr(W_i = 1 \mid \alpha)] = x_i^T\alpha = \eta_{1i}$

$p(y_i \mid r, \beta, W_i = 1) = \dfrac{\Gamma(y_i + r)}{\Gamma(r)\, y_i!} (1 - \psi_i)^r \psi_i^{y_i} \quad \forall\, i \text{ s.t. } W_i = 1,$ where

$\psi_i = \dfrac{\exp(x_i^T\beta)}{1 + \exp(x_i^T\beta)} = \dfrac{\exp(\eta_{2i})}{1 + \exp(\eta_{2i})}.$

71 / 160 Gibbs Sampler for ZINB Model This suggests the following Gibbs sampler:

1 Given current parameter values, update α using the Gibbs sampler proposed by Polson et al. for logistic regression

2 Conditional on Wi = 1, update β using the NB Gibbs sampler proposed by Pillow and Scott

3. Update r using a random-walk Metropolis-Hastings step or using a conjugate Gamma update as in Dadaneh et al. (2018)

4 Update the latent at-risk indicators, W1,..., Wn, from their discrete full conditional distributions

The only new step is step (4), the update for Wi

72 / 160 Update for Wi

The full conditional for Wi is a discrete distribution with probabilities that depend on whether the observed count, yi , is zero or non-zero

If yi > 0, then subject i belongs to the at-risk class, and hence by denition, Wi = 1 with probability 1

Conversely, if yi = 0, then we observe either a structural zero (implying that Wi = 0) or an at-risk zero (implying Wi = 1)

See Neelon (2018) for details

73 / 160 ∗ Update for Wi

In the latter case, we draw Wi from a Bernoulli distribution with probability

$\theta_i = \Pr(W_i = 1 \mid y_i = 0, \text{rest}) = \Pr(\text{at-risk zero} \mid \text{at-risk or structural zero}) = \dfrac{\phi_i \upsilon_i^r}{1 - \phi_i (1 - \upsilon_i^r)},$

where

• φi = exp(η1i )/[1 + exp(η1i )] is the unconditional probability that Wi = 1

• υi = 1 − ψi , where ψi is the NB event probability

See ZINB.r for details

74 / 160 Example Data Histogram

[Figure: histogram of the simulated counts, showing a large spike at zero; the AIC for the ZINB model is 104 points lower than for the NB model]

75 / 160 Hybrid Gibbs-MH Algorithm for ZINB Model

Hybrid Gibbs-MH Sampler for the ZINB Model

for (i in1:nsim){

# Update alpha mu<-X%*%alpha w<-rpg(n,1,mu) z<-(y1-1/2)/w # Latent response v<-solve(crossprod(X*sqrt(w))+T0a) m<-v%*%(T0a%*%alpha0+t(w*X)%*%z) alpha<-c(rmvnorm(1,m,v))

# Update latent class indicator y1 (= W in slides) eta<-X%*%alpha phi<-exp(eta)/(1+exp(eta)) # At-risk probability theta<-phi*(q^r)/(phi*(q^r)+1-phi) # Cond prob that y1=1 given y=0 -- i.e. Pr(chance zero|observed zero) y1[y==0]<-rbinom(n0,1,theta[y==0]) # If y=0, draw "chance zero" w.p. theta; if y=1, then y1=1 n1<-sum(y1)

# Update beta conditional on y1=1 eta1<-X[y1==1,]%*%beta w<-rpg(n1,y[y1==1]+r,eta1) # Polya weights z<-(y[y1==1]-r)/(2*w) # Latent response v<-solve(crossprod(X[y1==1,]*sqrt(w))+T0b) m<-v%*%(T0b%*%beta0+t(sqrt(w)*X[y1==1,])%*%(sqrt(w)*z)) beta<-c(rmvnorm(1,m,v)) eta<-X%*%beta q<-1/(1+exp(eta))

  # Update r
  rnew<-rtnorm(1,r,sqrt(s),lower=0)        # Truncated-normal proposal
  ratio<-sum(dnbinom(y[y1==1],rnew,q[y1==1],log=T))-sum(dnbinom(y[y1==1],r,q[y1==1],log=T))+   # Diffuse prior for r
    dtnorm(r,rnew,sqrt(s),0,log=T)-dtnorm(rnew,r,sqrt(s),0,log=T)                              # Proposal not symmetric
  if (log(runif(1))<ratio) r<-rnew         # Accept proposal

# Store if (i> burn & i%%thin==0){ j<-(i-burn)/thin Alpha[j,]<-alpha Beta[j,]<-beta R[j]<-r }

if (i%%100==0) print(i) # 11 seconds to run 2000 iterations with n=1000

} 76 / 160

1 R Code for ZINB Model with Gibbs Update for r

Gibbs Sampler for ZINB Model

for (i in1:nsim){

# Update alpha mu<-X%*%alpha w<-rpg(n,1,mu) z<-(y1-1/2)/w # Latent response v<-solve(crossprod(X*sqrt(w))+T0a) m<-v%*%(T0a%*%alpha0+t(w*X)%*%z) alpha<-c(rmvnorm(1,m,v))

# Update latent class indicator y1 (= W in slides) eta<-X%*%alpha phi<-exp(eta)/(1+exp(eta)) # At-risk probability theta<-phi*(q^r)/(phi*(q^r)+1-phi) # Cond prob that y1=1 given y=0 -- i.e. Pr(chance zero|observed zero) y1[y==0]<-rbinom(n0,1,theta[y==0]) # If y=0, draw "chance zero" w.p. theta; if y=1, then y1=1 n1<-sum(y1)

# Update beta conditional on y1=1 eta1<-X[y1==1,]%*%beta w<-rpg(n1,y[y1==1]+r,eta1) # Polya weights z<-(y[y1==1]-r)/(2*w) # Latent response v<-solve(crossprod(X[y1==1,]*sqrt(w))+T0b) m<-v%*%(T0b%*%beta0+t(sqrt(w)*X[y1==1,])%*%(sqrt(w)*z)) beta<-c(rmvnorm(1,m,v)) eta<-X%*%beta q<-1/(1+exp(eta))

# Update latent counts, l l<-rep(0,n1) ytmp<-y[y1==1] for(j in1:n1) l[j]<- sum(rbinom(ytmp[j],1,round(r/(r+1:ytmp[j]-1),6)))

# Update r from conjugate gamma distribution given l and psi eta<-X[y1==1,]%*%beta psi<-exp(eta)/(1+exp(eta)) r<-rgamma(1,a+sum(l),b-sum(log(1-psi)))

# Store if (i> burn & i%%thin==0){ j<-(i-burn)/thin Alpha[j,]<-alpha Beta[j,]<-beta R[j]<-r }

if (i%%100==0) print(i) # 15 seconds to run 2000 iterations with n=1000 } 77 / 160

Example

The program ZINB.r fits the following ZINB model:

$\mathrm{logit}(\phi_i) = \alpha_1 + \alpha_2 x_i$

$p(y_i \mid r, \beta, W_i = 1) = \dfrac{\Gamma(y_i + r)}{\Gamma(r)\, y_i!} (1 - \psi_i)^r \psi_i^{y_i} \quad \forall\, i \text{ s.t. } W_i = 1,$ where

$\psi_i = \dfrac{\exp(\beta_1 + \beta_2 x_i)}{1 + \exp(\beta_1 + \beta_2 x_i)}$

The results, based on 2000 iterations with a burn-in of 1000, are:

Table 8: Results for ZINB Model.

Parameter   True Value   MLE (SE)       Posterior Mean (SD)†   Posterior Mean (SD)‡
α1          −0.5         −0.63 (0.11)   −0.62 (0.10)           −0.64 (0.11)
α2          0.5          0.56 (0.14)    0.55 (0.14)            0.57 (0.14)
β1          2.0          1.94 (0.13)    1.96 (0.12)            1.94 (0.13)
β2          0.5          0.38 (0.11)    0.39 (0.11)            0.39 (0.11)
r           1.0          1.11 (0.13)    1.10 (0.11)            1.13 (0.13)

† Pólya-Gamma sampler with MH update for r. Acceptance rate = 40%.
‡ Pólya-Gamma sampler with Gibbs update for r.

78 / 160 Trace Plots for MH ZINB Model

[Figure: trace plots of α1, α2, β1, β2, and r over 1000 stored iterations from the hybrid Gibbs-MH sampler]

79 / 160 Trace Plots for Gibbs ZINB Model

[Figure: trace plots of α1, α2, β1, β2, and r over 1000 stored iterations from the Gibbs (conjugate r) sampler]

80 / 160 Bayesian Modeling Strategies for Generalized Linear Models, Part 2

Reading: Hoff Section 7.3; Wakefield Sections 10.5.1, 11.2.8, 11.2.9 (Penalized Regression); Wakefield Sections 8.6, 8.7 (Linear Mixed Models); Neelon (2018) Sections 2.3–3.2; Frühwirth-Schnatter and Pyne (2010); Neelon et al. (2015)

Fall 2018

81 / 160 Bayesian Penalized Linear Splines

Let's consider the problem of modeling a mean response using a piecewise linear (PWL) spline function

As an illustration, consider the following scatterplot showing the relationship between age and log C-peptide concentration among 43 diabetic children². This example is for independent data, but the approach easily extends to repeated-measures data

Figure 2: Scatterplot of log C-peptide concentration by age.

[Figure: scatterplot of log C-peptide concentration (y-axis) versus age in years (x-axis) for the 43 children]

² cf. Fitzmaurice et al., Chapter 19

82 / 160 Bayesian Penalized Linear Splines

The goal is to flexibly model log peptide concentration by age

A relatively simple choice is to fit the following PWL model:

$E(Y_i \mid x_i) = \beta_0 + \beta_1 x_i + \sum_{k=1}^K b_k (x_i - c_k)_+, \quad i = 1, \ldots, n,$

where $c_1, \ldots, c_K$ are interior knot locations, $b_1, \ldots, b_K$ are spline coefficients, and for a number $u$, $u_+ = u$ if $u > 0$ and 0 otherwise.

Note that if all bk = 0, we reduce to a straight-line regression function
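Before penalizing, a minimal sketch of the unpenalized PWL fit (assuming vectors age and logc have been loaded, as in peptide.r): build the truncated-line basis and fit by ordinary least squares.

K<-10
knots<-seq(5,14,length=K)
Z<-sapply(knots,function(ck) pmax(age-ck,0))    # (x - c_k)_+ basis
fit<-lm(logc~age+Z)                             # Unpenalized PWL fit
age.grid<-seq(0,16,by=0.1)
Z.grid<-sapply(knots,function(ck) pmax(age.grid-ck,0))
yhat<-cbind(1,age.grid,Z.grid)%*%coef(fit)      # Fitted mean curve on a fine grid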

83 / 160 Bayesian Penalized Linear Splines

Fitting the model in SAS with 10 knots from ages 5–14, we see that the model is reasonably flexible but not smooth

Figure 3: Estimated mean regression function for piecewise linear model.

[Figure: fitted PWL mean regression function overlaid on the scatterplot of log concentration versus age]

84 / 160 Penalized Linear Splines

To impose smoothness, we can shrink the spline coefficients $b_1, \ldots, b_K$ by penalizing large values. Essentially, we shrink large spline coefficients toward zero, so that there are no extreme values

This can be useful if we have a large number of knots, since a large number of knots can introduce collinearity among the PWL basis functions

Collinearity would lead to large SEs for the spline coefficients

Penalizing the coefficients stabilizes the estimates and avoids overfitting, which can arise when we fit an overly noisy curve

85 / 160 Penalized Regression

Rather than minimizing the residual sum of squares

$Q = \sum_{i=1}^n \left[Y_i - \left(\beta_0 + \beta_1 x_i + \sum_{k=1}^K b_k (x_i - c_k)_+\right)\right]^2,$

penalized regression minimizes $Q$ subject to a constraint that restricts the size of the $\{b_k\}$, thus smoothing or regularizing the spline curve

Popular constraints include³

1. $\max_k |b_k| < t$
2. $\sum_{k=1}^K |b_k| < t$ (Lasso constraint)
3. $\sum_{k=1}^K b_k^2 = b^T b < t$ (Ridge constraint)
4. $b^T D b < t$ for some $K \times K$ positive semi-definite penalty matrix $D$ (e.g., a spatial smoothing matrix)

³ Ruppert et al., page 65.

86 / 160 Penalized Regression Penalized regression minimizes the penalized sum of squares

$Q(\lambda) = \sum_{i=1}^n \left[Y_i - \left(\beta_0 + \beta_1 x_i + \sum_{k=1}^K b_k (x_i - c_k)_+\right)\right]^2 + \lambda b^T D b,$

where

• $\lambda b^T D b$ is known as the roughness penalty, penalizing overly rough or noisy features of the curve, and

• λ is a tuning parameter that controls the degree of smoothness, with increasing smoothness as λ → ∞

λ can be viewed as a Lagrange multiplier, which is a technique commonly used to minimize constrained functions

Note also that we don't penalize the intercept, $\beta_0$, or the linear coefficient, $\beta_1$

87 / 160 Ridged PWL Regression

Choosing D = I K yields the ridge penalty

$Q(\lambda) = \sum_{i=1}^n \left[Y_i - \left(\beta_0 + \beta_1 x_i + \sum_{k=1}^K b_k (x_i - c_k)_+\right)\right]^2 + \lambda b^T b$

$= \sum_{i=1}^n \left(Y_i - x_i^T\beta - z_i^T b\right)^2 + \lambda \sum_{k=1}^K b_k^2$

$= (Y - X\beta - Zb)^T (Y - X\beta - Zb) + \lambda b^T b,$

where $X$ is an $n \times 2$ matrix $[1, x]$, $\beta = (\beta_0, \beta_1)^T$, $Z$ is an $n \times K$ spline basis design matrix with $(i, j)$-th element equal to $(x_i - c_j)_+$, and $b = (b_1, \ldots, b_K)^T$.

Note that the $(i, j)$-th element $Z_{ij} = 0$ if $x_i \leq c_j$ and equals $x_i - c_j$ if $x_i > c_j$, for $i = 1, \ldots, n$ and $j = 1, \ldots, K$.

88 / 160 Structure

Q(λ) has the form of a mixed model with fixed effects $\beta$ and random effects $b$

We will discuss mixed models shortly, but for now note that the random effects are not subject-specific (i.e., not $b_i$)

Instead, they are shared by all n subjects

Nevertheless, we can use a mixed model framework (frequentist or Bayesian) to estimate β, λ, b and σ2

89 / 160 Mixed Model Structure In particular, we can write our model as

$\underset{n\times 1}{Y} = \underset{n\times p}{X}\,\underset{p\times 1}{\beta} + \underset{n\times K}{Z}\,\underset{K\times 1}{b} + \underset{n\times 1}{e},$

where $\beta = (\beta_0, \beta_1)^T$ is a $p \times 1$ ($p = 2$) vector of fixed effects, $b \sim N(0, \sigma_b^2 I_K)$ is a vector of random effects, $e \sim N(0, \sigma_e^2 I_n)$, and $\sigma_b^2 = \sigma_e^2/\lambda$.

$N(0, \sigma_b^2 I_K)$ is a prior distribution for $b$ that imposes prior shrinkage toward zero.

For fixed $\sigma_e^2$, as $\lambda \to 0$, $\sigma_b^2 \to \infty$. Thus, there is high variability in the prior, and little shrinkage/smoothing

As $\lambda \to \infty$, $\sigma_b^2 \to 0$. Here, there is low variability, and $b$ is more tightly centered about zero (increased shrinkage)

90 / 160 Frequentist Inference for β

We can show⁴ that if $\sigma_e^2$ and $\sigma_b^2$ are known, then the BLUE of $\beta$ is

$\hat\beta = (X^T \Sigma^{-1} X)^{-1} X^T \Sigma^{-1} Y,$

where $\Sigma = \sigma_e^2 I_n + \sigma_b^2 Z Z^T$ is the $n \times n$ marginal variance of $Y$ after integrating out $b$.

This is obtained by maximizing the marginal likelihood of $Y$ after integrating out the random effects $b$.

Clearly, $\hat\beta$ is the weighted (generalized) least squares estimate with $\mathrm{Cov}(Y \mid X) = \Sigma$.

⁴ See Wakefield, pp. 565–566.

91 / 160 Frequentist Inference for b

Likewise, if the variance components are known, the best linear unbiased predictor (BLUP) of b is

$\tilde b = (G^{-1} + Z^T R^{-1} Z)^{-1} Z^T R^{-1} (Y - X\hat\beta) = \sigma_b^2 Z^T \Sigma^{-1} (Y - X\hat\beta),$

where $G = \sigma_b^2 I_K$, $R = \sigma_e^2 I_n$, and the last line follows from the fact that, for invertible matrices $A$ and $C$,

$(A + BCD)^{-1} BC = A^{-1} B (C^{-1} + D A^{-1} B)^{-1}$
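This identity is easy to check numerically; a small sketch with random matrices:

set.seed(1)
A<-crossprod(matrix(rnorm(9),3,3))+diag(3)      # 3 x 3 invertible
C<-crossprod(matrix(rnorm(4),2,2))+diag(2)      # 2 x 2 invertible
B<-matrix(rnorm(6),3,2)
D<-matrix(rnorm(6),2,3)
lhs<-solve(A+B%*%C%*%D)%*%B%*%C
rhs<-solve(A)%*%B%*%solve(solve(C)+D%*%solve(A)%*%B)
max(abs(lhs-rhs))                               # Should be numerically zero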

92 / 160 Unknown Variance Components

Since $\sigma_e^2$ and $\sigma_b^2$ are unknown, in practice we plug in the REML estimates of $\sigma_e^2$ and $\sigma_b^2$ to derive large-sample estimators and predictors

$\hat\beta = (X^T \hat\Sigma^{-1} X)^{-1} X^T \hat\Sigma^{-1} Y$ and $\hat b = \hat\sigma_b^2 Z^T \hat\Sigma^{-1} (Y - X\hat\beta)$

Finally, the REML estimate of the smoothing parameter is $\hat\lambda = \hat\sigma_e^2 / \hat\sigma_b^2$.

93 / 160 Bayesian Inference

In the Bayesian setting, we use the same mixed model framework with a $N_K(0, \sigma_b^2 I_K)$ prior for $b$

But now we place additional conjugate priors on $\beta$, $\sigma_e^2$, and $\sigma_b^2$

Or, equivalently, we place priors on $\tau_e = 1/\sigma_e^2$ and $\tau_b = 1/\sigma_b^2$

A common choice is:

1. $\beta \sim N_2(\beta_0, T_0^{-1})$
2. $\pi(\sigma_e^2) \propto 1/\sigma_e^2$ (improper prior)
3. $\tau_b \sim \mathrm{Ga}(c, d)$

The choice of $c$ and $d$ affects smoothness. To select values, one can use Bayesian information criteria, or place a discrete bivariate

prior on (c, d) 94 / 160 Bayesian Inference∗ This leads to the following full conditionals:

1. $\beta \mid y, \text{rest} \sim N_2(m, V)$, where

$V = (T_0 + \tau_e X^T X)^{-1}$, with $\tau_e = 1/\sigma_e^2$

$m = V(T_0\beta_0 + \tau_e X^T(y - Zb))$

2. $b \mid y, \text{rest} \sim N_K(m, V)$, where

$V = (\tau_b I_K + \tau_e Z^T Z)^{-1}$

$m = \tau_e V Z^T (y - X\beta)$

3. $\tau_e \mid y, \text{rest} \sim \mathrm{Ga}\!\left(\dfrac{n}{2}, \dfrac{(y - X\beta - Zb)^T(y - X\beta - Zb)}{2}\right)$

4. $\tau_b \mid y, \text{rest} \sim \mathrm{Ga}\!\left(c + \dfrac{K}{2}, d + \dfrac{b^T b}{2}\right)$

See peptide.r for details

95 / 160 R Code for Penalized PWL Model

Gibbs Sampler for Penalized PWL Spline Model

# Import Data: x=age, y=logc

# PWL splines k<-10 # number of knots grid<-seq(5,14,length=k) # k+1 initial knot locations (includes boundaries) Z<-matrix(0,n,k) for (j in1:k) Z[,j]<- pmax(age-grid[j],0)

# Priors c<-d<-1e-5 # Hyperpriors for taub -- increase to 1 to reduce smoothness # Prior var of taub small and var sigma_b is large = less shrinkage T0<-diag(.0001,p) # Prior Precision

# Inits -- see R code

# Fine grid of x values with which to plot yhat x2<-seq(0,16,by=.01) # Grid of age values X2<-cbind(1,x2) # Fixed effect covs Z2<-matrix(0,length(x2),k) for (j in1:k) Z2[,j]<- pmax(x2-grid[j],0)

######################### # Gibbs of Ridged Model # ######################### for (i in1:nsim){ # Update beta v<-solve(T0+taue*crossprod(X)) m<-v%*%(T0%*%beta0+taue*t(X)%*%(y-Z%*%b)) Beta[i,]<-beta<-c(rmvnorm(1,m,v))

# Update b v<-solve(taub*diag(k)+taue*crossprod(Z)) m<-taue*v%*%t(Z)%*%(y-X%*%beta) B[i,]<-b<-c(rmvnorm(1,m,v))

# Update taue (diffuse) g<-crossprod(y-X%*%beta-Z%*%b) taue<-rgamma(1,n/2,g/2) Sigmae[i]<-1/taue

# Update taub (proper) taub<-rgamma(1,c+k/2,d+crossprod(b)/2) Sigmab[i]<-1/taub

# Yhat Yhat[i,]<-X2%*%beta+Z2%*%b # Predicted values } 96 / 160

1 REML Estimate of Mean Regression Function

Figure 4: REML estimates of the original and ridge-smoothed PWL regression functions.

[Figure: log concentration versus age showing the original PWL fit, the ridged PWL fit, and lower/upper CI bounds for the ridged PWL fit]

97 / 160 Posterior Mean Estimate of the Regression Function

Figure 5: Posterior mean estimates of the original and ridge-smoothed PWL regression functions.

[Figure: scatterplot with the original PWL fit, the posterior mean of the ridged PWL fit, and 95% credible interval bands]

98 / 160 Posterior Mean Estimates with hyperparameters c = d = 1

[Figure: original PWL fit, ridged PWL fit, and 95% CrI bands under hyperparameters c = d = 1, showing a smoother ridged curve]

99 / 160 Linear Mixed Models

Consider a linear mixed model of the form

$Y_i = X_i\beta + Z_i b_i + e_i, \quad i = 1, \ldots, n,$

where $Y_i$ and $e_i$ are $n_i \times 1$, $X_i$ is $n_i \times p$, $\beta$ is $p \times 1$, $Z_i$ is $n_i \times q$, $b_i$ is $q \times 1$, and, under classical assumptions,

• $b_i \sim N_q(0, G)$ is a vector of subject-specific random effects (could allow for non-normality)

• $Z_i$ is a random effects design matrix

• $e_i \sim N_{n_i}(0, R_i) = N_{n_i}(0, \sigma_e^2 I_{n_i})$

• bi and ei are independent RVs Note that we allow for unbalanced data with the number of

repeated measurements, ni , varying across subjects 100 / 160 Linear Mixed Models∗

The conditional distribution of $Y_i$, given $b_i$, is

$f_{Y_i}(y_i \mid \beta, b_i, \sigma_e^2) = N_{n_i}(X_i\beta + Z_i b_i, R_i).$

The marginal distribution of $Y_i$, integrating over $b_i$, is

$f_{Y_i}(y_i \mid \beta, G, \sigma_e^2) = N_{n_i}(X_i\beta, \Sigma_i),$

where $\Sigma_i = R_i + Z_i G Z_i^T = \sigma_e^2 I_{n_i} + Z_i G Z_i^T$.

101 / 160 Frequentist Inference for Linear Mixed Models

In the frequentist setting, we can show that the restricted maximum likelihood (REML) estimator of β is

$\hat\beta = \left(\sum_{i=1}^n X_i^T \hat\Sigma_i^{-1} X_i\right)^{-1} \sum_{i=1}^n X_i^T \hat\Sigma_i^{-1} Y_i,$

where $\hat\Sigma_i$ is the REML estimator of $\Sigma_i$ (in other words, based on the REML estimators $\hat\sigma_e^2$ and $\hat G$).

The estimated variance-covariance of $\hat\beta$ is

$\widehat{\mathrm{Cov}}(\hat\beta) = \left(\sum_{i=1}^n X_i^T \hat\Sigma_i^{-1} X_i\right)^{-1}$

102 / 160 Frequentist Inference for Linear Mixed Models

Similarly, the best linear unbiased predictor (BLUP) of $b_i$ is

$\tilde b_i = G Z_i^T \Sigma_i^{-1} (Y_i - X_i\hat\beta).$

• Linear: $\tilde b_i$ is a linear function of $Y_i$
• Predictor: $\tilde b_i$ is predicting (not estimating) a random variable, $b_i$
• Unbiased: $E(\tilde b_i) = E(b_i) = 0$
• Best: $\tilde b_i$ minimizes the mean square prediction error:

$\mathrm{MSPE} = E[(\tilde b_i - b_i)^2] = E[((\tilde b_i - b_i) - 0)^2]$

$= E[((\tilde b_i - b_i) - E(\tilde b_i - b_i))^2] = \mathrm{Var}(\tilde b_i - b_i)$

103 / 160 Bayesian Connection

From a Bayesian standpoint, $\tilde b_i$ is the marginal posterior mean of $b_i$ after integrating over the posterior distribution of $\beta$ under a diffuse prior for $\beta$

That is, $\tilde b_i = E_{\beta \mid y_i}[E(b_i \mid y_i, \beta, G, \sigma_e^2)] = E(b_i \mid y_i, G, \sigma_e^2)$

However, this assumes $G$ and $\sigma_e^2$ are known

See Wakefield, pages 376–377, for details

It is easy to show that $E(\tilde b_i) = E(b_i) = 0$ and hence $\tilde b_i$ is a (linear) unbiased predictor

104 / 160 Frequentist Inference for Linear Mixed Models

The corresponding empirical BLUP (eBLUP) is

$\hat b_i = \hat G Z_i^T \hat\Sigma_i^{-1} (Y_i - X_i\hat\beta),$

where $\hat G$ and $\hat\Sigma_i^{-1}$ are based on the REML estimates

However, after plugging in $\hat G$ and $\hat\Sigma_i^{-1}$, $\hat b_i$ is no longer a linear function of $Y_i$

105 / 160 ˜ Variance-Covariance of bi We can show, after extensive algebra, that

$\mathrm{Var}(\tilde b_i) = G Z_i^T P_i Z_i G,$ where

• $P_i = \Sigma_i^{-1}(I_{n_i} - H_i)$

• $H_i = X_i (X_i^T \Sigma_i^{-1} X_i)^{-1} X_i^T \Sigma_i^{-1}$ is a generalized hat matrix (oblique projector)

Note the distinction between $\mathrm{Var}(b_i)$ in the earlier slide and $\mathrm{Var}(\tilde b_i)$ above: the latter accounts for the variability in estimating $\beta$

Note also the connection to the variance of the residuals in ordinary linear regression: $\mathrm{Var}(e) = \sigma_e^2(I_n - H)$.

106 / 160 Prediction Error Variance

From a Bayesian perspective, a more appropriate measure of

uncertainty for $b_i$ is the variance of the prediction error, $\mathrm{Var}(\tilde b_i - b_i)$, rather than $\mathrm{Var}(\tilde b_i)$ itself

The former takes into account both the uncertainty in estimating $b_i$ and $\hat\beta$, whereas the latter takes into account the estimation of $\hat\beta$ only

It is analogous to computing the prediction interval for a future response $Y_i$ in ordinary linear regression

107 / 160 Prediction Error Variance (Cont'd)

We can show that

Var(b̃i − bi) = Var(b̃i) + Var(bi) − 2 Cov(b̃i, bi)
             = Var(b̃i) + Var(bi) − 2 Var(b̃i)   (after algebra, since Cov(b̃i, bi) = Var(b̃i))
             = Var(bi) − Var(b̃i)
             = G − Var(b̃i)
             = G − G Ziᵀ Pi Zi G,

where Pi was defined earlier.

108 / 160 Connection to Bayesian Inference

Under a non-informative prior for β, the prediction error variance is also the marginal posterior variance of bi given Y = y after integrating over the posterior distribution of β

And earlier, we showed that the marginal posterior mean of bi was b̃i

Thus, under a non-informative prior for β (but assuming known variance components), the marginal posterior distribution of bi is

f(bi | Yi = yi, σe², G) = Nq(b̃i, G − G Ziᵀ Pi Zi G), where Pi = Σi⁻¹(I_{ni} − Hi). See Wakefield page 377 for details

109 / 160 Three Variances to Keep in Mind

So far we have discussed three variance estimators:

1. Var(bi | Yi = yi, β, σe², G) = G − G Ziᵀ Σi⁻¹ Zi G
                                = (G⁻¹ + Ziᵀ Ri⁻¹ Zi)⁻¹   (after matrix algebra)
                                = (G⁻¹ + τe Ziᵀ Zi)⁻¹,  where τe = 1/σe²

• This is the posterior variance of bi given yi, β, G, and Σi

• This would be used in drawing bi from its full conditional as part of a Gibbs sampler

2. Var(b̃i) = Var[ G Ziᵀ Σi⁻¹ (Yi − Xiβ̂) ] = G Ziᵀ Pi Zi G

• This is a variance estimator derived after plugging in β̂

• It is also Var( E_{β|Y=y}[ E(bi | Y = y, β, G, Σi) ] ), i.e., the variance of the marginal posterior mean of bi after integrating across the posterior of β under a vague prior for β (see Wakefield page 380)

• Takes into account the uncertainty in estimating β 110 / 160 Three Variances (Cont'd)

3. Prediction error variance: Var(b̃i − bi) = G − G Ziᵀ Pi Zi G

• This is a dual frequentist and Bayesian error variance estimator

• In the Bayesian setting, it is the marginal posterior variance of bi after integrating over the posterior of β

• This differs from the variance of the marginal posterior mean (i.e., Var(b̃i))

• This is the ideal measure of uncertainty

However, all three estimators assume G and σe² (and hence Σi) are known

• Empirical Bayes plugs in REML estimates

• A fully Bayesian approach assigns priors
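The following small numerical sketch (toy values, not course code) computes the three variances for a single subject's random intercept (q = 1), using the per-subject generalized hat matrix defined above; σb² = 9, σe² = 2, and ni = 5 are assumed purely for illustration:

sb2 <- 9; se2 <- 2; ni <- 5
set.seed(1)
Zi <- matrix(1, ni, 1); Xi <- cbind(1, rnorm(ni)); G <- matrix(sb2, 1, 1)
Si <- se2 * diag(ni) + Zi %*% G %*% t(Zi)                           # Sigma_i
v1 <- 1 / (1 / sb2 + ni / se2)                                      # 1) full conditional variance of b_i
Hi <- Xi %*% solve(t(Xi) %*% solve(Si, Xi)) %*% t(Xi) %*% solve(Si) # generalized hat matrix
Pi <- solve(Si) %*% (diag(ni) - Hi)
v2 <- as.numeric(G %*% t(Zi) %*% Pi %*% Zi %*% G)                   # 2) Var(b-tilde_i), plug-in beta-hat
v3 <- sb2 - v2                                                      # 3) prediction error variance
c(full.cond = v1, var.blup = v2, pred.error = v3)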

111 / 160 Bayesian Inference for Linear Mixed Models

As usual, Bayesian inference begins by assigning prior distributions to all model parameters

Common choices are the conditionally conjugate priors:

• β ∼ Np(β0, T0⁻¹)

• bi ∼ Nq(0, G)  (iid)

• For q > 1, G ∼ IW(ν0, C0), where C0 is a q × q scale matrix
• For q = 1, τb = 1/σb² ∼ Ga(c, d)
• τe = 1/σe² ∼ Ga(c, d)

These priors admit closed-form conjugate full conditionals, leading to straightforward Gibbs sampling 112 / 160 Bayesian Inference for Linear Random Intercept Model

Consider the following linear random intercept model

Yij = xijᵀβ + bi + eij,  i = 1,..., n; j = 1,..., ni,

where bi ∼ N(0, σb²) and eij ∼ N(0, σe²), both iid.

Or, combining all N = ∑_{i=1}^n ni observations,

Y = Xβ + Zb + e,

where X is an N × p fixed effects design matrix, Z is an N × n random intercept design matrix, and e is an N × 1 vector of errors.

113 / 160 Random Intercept Model (Cont'd)

Specifically, Z is an N × n random effects design matrix of the form

     [ 1 0 0 ··· 0 ]
     [ 1 0 0 ··· 0 ]  } repeat n1 times
     [ ··········· ]
     [ 0 1 0 ··· 0 ]
Z =  [ 0 1 0 ··· 0 ]  } repeat n2 times
     [ ··········· ]
     [ 0 0 0 ··· 1 ]
     [ 0 0 0 ··· 1 ]  } repeat nn times

Note that this implies that Zb is an N × 1 vector of random effects with bi repeated ni times for subject i. That is, Zb = (b1,..., b1, b2,..., b2,..., bn,..., bn)ᵀ.
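A small R sketch (toy values, not from the course programs) showing one way to build Z from a subject id vector and to verify that Zb stacks each bi, ni times:

nis <- c(2, 3, 2)                      # n_i for n = 3 subjects, so N = 7
id  <- rep(1:3, times = nis)
Z   <- model.matrix(~ factor(id) - 1)  # N x n matrix of subject indicators
b   <- rnorm(3)
all.equal(as.numeric(Z %*% b), rep(b, times = nis))  # TRUE: Zb repeats b_i, n_i times each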

114 / 160 Marginal Distribution of Yi

Putting it together, it follows that the marginal distribution of the ni × 1 vector Yi is multivariate normal with mean µi = Xiβ and covariance

                [ σb² + σe²   σb²         ···  σb²       ]
Σi = Σi(θ) =    [ σb²         σb² + σe²   ···  σb²       ]   (ni × ni)
                [ ········································]
                [ σb²         σb²         ···  σb² + σe² ]

where θ = (σb², σe²)ᵀ is the vector of variance components.

This implies that, marginally, Yi has an exchangeable or compound symmetric covariance structure, with constant correlation ρ = σb²/(σb² + σe²) among all pairs (Yij, Yik).
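A quick illustration (using σb² = 9 and σe² = 2, the true values in the later example) of the compound symmetric structure and the implied correlation:

sb2 <- 9; se2 <- 2; ni <- 4
Sigma_i <- sb2 * matrix(1, ni, ni) + se2 * diag(ni)  # sigma_b^2 * J + sigma_e^2 * I
rho <- sb2 / (sb2 + se2)                             # constant within-subject correlation
cov2cor(Sigma_i)                                     # all off-diagonal entries equal rho (about 0.82)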

115 / 160 Conjugate Prior Specification

Conjugate priors for the random intercept model include:

• β ∼ Np(β0, T0⁻¹), where β0 and T0 are known quantities and T0 is the prior precision matrix

• σe² is IG with known shape and scale parameters c and d, where the IG(c, d) prior is given by

  f(σe²; c, d) = [dᶜ / Γ(c)] (σe²)^{−(c+1)} exp(−d/σe²)

  Equivalently, τe = 1/σe² ∼ Ga(c, d), where d is a rate parameter

• Similarly, τb = 1/σb² ∼ Ga(g, h) for known shape and rate parameters g and h

116 / 160 Full Conditionals for Random Intercept Model∗

One can show that the full conditionals for the linear random intercept model are:

1. β | y, rest ∼ Np(m, V), where

   V = (T0 + τe XᵀX)⁻¹, where τe = 1/σe²   (p × p)
   m = V(T0β0 + τe Xᵀ(y − Zb))   (p × 1)

2. For i = 1,..., n, bi | yi, rest ∼ N(mi, vi), where

   vi = 1/(τb + ni τe)
   mi = vi τe 1_{ni}ᵀ(yi − Xiβ) = vi τe ∑_{j=1}^{ni} (yij − xijᵀβ)
      = (1 − wi)·0 + wi(ȳi − x̄iᵀβ),  where wi = ni σb² / (ni σb² + σe²)
   (a small numerical illustration of the shrinkage weight wi follows this list)

3. τe | y, rest ∼ Ga( c + N/2, d + (y − Xβ − Zb)ᵀ(y − Xβ − Zb)/2 )

4. τb | y, rest ∼ Ga( g + n/2, h + bᵀb/2 )
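As promised in step 2, a small numerical illustration (toy values, not course code) of the shrinkage weight wi: it approaches 1 as ni grows, so mi relies increasingly on subject i's own data rather than shrinking toward zero:

sb2 <- 9; se2 <- 2
ni  <- c(1, 2, 5, 10, 50)
wi  <- ni * sb2 / (ni * sb2 + se2)
round(rbind(ni = ni, wi = wi), 3)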

See Random Intercept.r for details 117 / 160 R Code for Random Intercept Model

Gibbs Sampler for Random Intercept Model

########### # Priors # ########### beta0<-rep(0,p) # Prior mean for beta T0<-diag(.01,p) # Prior precision for beta c<-d<-.001 # Gamma hyperpriors for taue, taub

######### # Inits # ######### taue<-1 # Error precision = 1/sigma2 b<-rep(0,n) # Random effects taub<-1 # Random Effects precision

################# # GIBBS SAMPLER # ################# for (i in 1:nsim) { # Update beta vbeta<-solve(T0+taue*crossprod(X,X)) mbeta<-vbeta%*%(T0%*%beta0 + taue*crossprod(X,y-rep(b,nis))) beta<-c(rmvnorm(1,mbeta,vbeta))

# Update b vb<-1/(taub+nis*taue) mb<-vb*(taue*tapply(y-X%*%beta,id,sum)) # tapply sums (y-xbeta)'s for each subject b<-rnorm(n,mb,sqrt(vb))

# Update taue tmp<-d+crossprod(y-X%*%beta-rep(b,nis))/2 taue<-rgamma(1,c+N/2,tmp)

# Update taub tmp<-c(d+crossprod(b)/2) taub<-rgamma(1,c+n/2,tmp)

################# # Store Results # ################# if (i> burn & i%%thin==0){ j<-(i-burn)/thin Beta[j,]<-beta Sigmae[j]<-1/taue Sigmab[j]<-1/taub }

if (i%%100==0) print(i) # 2.6 seconds to run 1000 iterations with n=1000 and N=5525 } 118 / 160

Example: The program Random Intercept.r fits the following random intercept model:

Yij = β0 + β1xi + β2tij + bi + eij,  i = 1,..., 1000; j = 1,..., ni
bi ∼ N(0, σb²)  (iid)
eij ∼ N(0, σe²)  (iid)

The results, based on 1000 iterations with a burn-in of 500, are:

Table 9: Results for Random Intercept Model.

Parameter   True Value   MLE (SE)†       Posterior Mean (SD)
β0          10           10.18 (0.13)    10.15 (0.14)
β1          2.5          2.25 (0.002)    2.25 (0.002)
β2          −1.5         −1.52 (0.19)    −1.47 (0.17)
σe²         2            2.00 (0.04)     1.99 (0.04)
σb²         9            8.23 (0.39)     8.30 (0.40)
† From SAS Proc NLMIXED.

119 / 160 Skew-Normal Random Intercepts

Obviously, the normal distribution for bi implies symmetry about zero

Suppose, instead, that the distribution is skewed

One way to accommodate skewness is to assume a skew-normal distribution⁵ for bi

⁵ O'Hagan and Leonard (1975); Azzalini (1985)

120 / 160 Skew-Normal Distribution

Definition: A random variable Y is said to follow a skew-normal (SN) distribution with location µ ∈ ℝ, scale ω > 0, and skewness α ∈ ℝ if

f(y; µ, ω², α) = (2/ω) φ((y − µ)/ω) Φ(αω⁻¹(y − µ)),

where φ(·) and Φ(·) are the density and CDF of a standard normal random variable. α < 0 implies negative skewness and α > 0 implies positive skewness. The distribution reduces to N(µ, ω²) when α = 0.
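A minimal sketch (not part of the course code) that plots SN densities directly from this definition; the curves correspond to those shown in the figure on the next slide:

dsn_manual <- function(y, mu = 0, omega = 1, alpha = 0)
  2 / omega * dnorm((y - mu) / omega) * pnorm(alpha * (y - mu) / omega)
x <- seq(-6, 4, length = 500)
plot(x, dnorm(x), type = "l", ylim = c(0, 0.8), ylab = "f(x)")  # N(0,1)
lines(x, dsn_manual(x, 0, 1,  3), col = 2)                      # SN(0,1,3)
lines(x, dsn_manual(x, 0, 1, -3), col = 3)                      # SN(0,1,-3)
lines(x, dsn_manual(x, 0, 2, -3), col = 4)                      # SN(0,2,-3)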

121 / 160 SN Density Functions

[Figure: SN density curves for N(0,1), SN(0,1,3), SN(0,1,−3), and SN(0,2,−3)]

122 / 160 Stochastic Representation of the Skew-Normal Distribution

Theorem: Let Y = µ + ψw + ε, where
• ψ = ωα/√(1 + α²)
• w ∼ N⁺(0, 1), where N⁺(·) denotes a standard normal density truncated below by zero
• ε ∼ N(0, σ²)
• σ² = ω²/(1 + α²).
Then, Y ∼ SN(µ, ω², α).
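A quick simulation check of this representation (a sketch; it assumes the dsn_manual function from the earlier density sketch is in the workspace):

set.seed(1)
mu <- 0; omega <- 2; alpha <- -3
psi  <- omega * alpha / sqrt(1 + alpha^2)
sig2 <- omega^2 / (1 + alpha^2)
w <- abs(rnorm(1e5))                              # N+(0,1) draws (half-normal)
y <- mu + psi * w + rnorm(1e5, 0, sqrt(sig2))
hist(y, breaks = 100, freq = FALSE, main = "SN(0, 4, -3) via the representation")
curve(dsn_manual(x, mu, omega, alpha), add = TRUE, col = 2)  # should match the histogram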

Our goal will be to update w, then fit a linear model to Y to estimate µ, ψ, and σ²

Then back-transform to recover α and ω 123 / 160 Bayesian Inference for SN Random Intercept Model

Consider the following random intercept model

Yij = xijᵀβ + bi + eij,  i = 1,..., n; j = 1,..., ni,

where bi ∼ SN(0, ω², α), iid. Following the stochastic representation on the prior slide, we can write

bi as

bi = ψwi + εi,

which implies that bi | wi ∼ N(ψwi, σb²), with σb² = ω²/(1 + α²).

Note: We could additionally allow eij to follow an SN distribution!

There are also skew-t distributions that allow for heavier tails

There are also multivariate extensions of the SN and skew-t distributions

See Frühwirth-Schnatter and Pyne (2010) and Neelon et al. (2015) for details 124 / 160 Prior Distributions for Bayesian Inference for SN Random Intercept Model

The conjugate prior distributions for the SN random intercept model are

• β ∼ Np(β0, T0⁻¹)

• τe ∼ Ga(c, d)
• bi | wi ∼ N(ψwi, σb²)

• wi ∼ N⁺(0, 1)
• ψ ∼ N(µψ, τψ⁻¹)
• τb = 1/σb² ∼ Ga(g, h)

125 / 160 Gibbs Sampler for SN Random Intercept Model∗

The Gibbs sampler for the SN random intercept model is:

1. Draw β from Np(m, V), where

   V = (T0 + τe XᵀX)⁻¹   (p × p)
   m = V(T0β0 + τe Xᵀ(y − Zb))   (p × 1)

2. For i = 1,..., n, draw wi from N⁺(mi, v), where

   v = 1/(1 + τbψ²)

   mi = v τb ψ bi

3. Update ψ from N(m, v), where

   v = 1/(τψ + τb wᵀw), where w = (w1,..., wn)ᵀ
   m = v(τψµψ + τb wᵀb)

4. For i = 1,..., n, draw bi from N(mi, vi), where

   vi = 1/(τb + ni τe)
   mi = vi [τb ψ wi + τe 1_{ni}ᵀ(yi − Xiβ)]

5. Draw τe from Ga( c + N/2, d + (y − Xβ − Zb)ᵀ(y − Xβ − Zb)/2 )

6. Draw τb from Ga( g + n/2, h + (b − ψw)ᵀ(b − ψw)/2 )

7. Recover the original α and ω as

   α = ψ√τb
   ω = √(1/τb + ψ²)

See SN Random Intercept.r for details 126 / 160 R Code for SN Random Intercept Model

Gibbs Sampler for SN Random Intercept Model

for (i in 1:nsim){

# Update beta vbeta<-solve(T0+taue*crossprod(X,X)) mbeta<-vbeta%*%(T0%*%beta0 + taue*crossprod(X,y-rep(b,nis))) beta<-c(rmvnorm(1,mbeta,vbeta))

# Update w v<-1/(1+taub*psi^2) m<-v*taub*psi*b w<-rtnorm(n,m,sqrt(v),lower=0)

# Update psi v<-1/(taupsi+taub*crossprod(w)) m<-v*(taupsi*mupsi+taub*crossprod(w,b)) psi<-rnorm(1,m,sqrt(v))

# Update b vb<-1/(taub+nis*taue) mb<-vb*(taub*psi*w+taue*tapply(y-X%*%beta,id,sum)) b<-rnorm(n,mb,sqrt(vb))

# Update taue taue<-rgamma(1,0.01+N/2,0.01+crossprod(y-X%*%beta-rep(b,nis))/2)

# Update taub tmp<-c(0.01+crossprod(b-psi*w)/2) taub<-rgamma(1,0.01+n/2,tmp)

############################### # Transform and Store Results # ############################### if (i> burn & i%%thin==0){ j<-(i-burn)/thin Beta[j,]<-beta Omega[j]<-sqrt(1/taub+psi^2) Alpha[j]<-psi*sqrt(taub) Sigmae[j]<-1/taue B[j,]<-b }

if (i%%100==0) print(i) # 38 seconds to run 10K iterations

}

127 / 160

Example: The program SN Random Intercept.r fits the following random intercept model:

Yij = β0 + β1xij + bi + eij,  i = 1,..., 1000; j = 1,..., ni
bi ∼ SN(0, ω², α)  (iid)
eij ∼ N(0, σe²)  (iid)

The results, based on 10,000 iterations with a burn-in of 5000, are:

Table 10: Results for SN Random Intercept Model.

Parameter   True Value   MLE (SE)†      Posterior Mean (SD)
β0          1            −0.51 (0.05)   1.01 (0.11)
β1          2            1.98 (0.02)    1.99 (0.02)
ω           2            –              2.11 (0.09)
α           −2           –              −2.22 (0.36)
σe²         2            2.05 (0.04)    2.06 (0.04)
† Assuming normal random effects.

128 / 160 Trace Plots for SN Random Intercept Model

[Figure: trace plots for β1, ω, α, and σe across the stored iterations]

129 / 160 Random Slope Model

Let's now consider a basic linear random slope model of the form

Yij = β0 + tij β1 + b1i + tij b2i + eij , j = 1,..., ni , where

• tij denotes the timing of the j-th measurement for subject i

• β0 is a population average intercept or mean baseline response E(Yij |tij = 0)

• β1 is the slope of the population-average regression line

• b1i is a subject-specific random intercept representing subject i's departure from the population average intercept

• b2i is a subject-specific random slope that describes departures from the population average slope

130 / 160 Random Slope Model (Cont'd) Random slope models allow for individual variation in the baseline measurements (intercepts) and trajectories

[Figure: population-average trajectory, subject-specific trajectories, and individual data points; Y vs. time, 0 to 10]

131 / 160 Random Slope Model (Cont'd)

In vector form, we have

Yi = Xiβ + Zibi + ei,

with dimensions ni × 1, ni × p, p × 1, ni × q, q × 1, and ni × 1, respectively, where here p = q = 2 and

          [ 1   ti1  ]
Xi = Zi = [ ⋮    ⋮   ]
          [ 1   tini ]

More generally, we may have p > q if additional fixed effects covariates are incorporated

132 / 160 Random Slope Model (Cont'd)

Or, combining all N observations, we have

Y = Xβ + Zb + e,  with Y (N × 1), X (N × p), β (p × 1), Z (N × qn), b (qn × 1), and e (N × 1),

where, with p = q = 2, we have

• X = [1N , t]

• t = (t11, t12,..., t_{n,n_n})ᵀ

• b = (b11, b21,..., b1n, b2n)ᵀ,

and ....

133 / 160 Random Slope Model (Cont'd)

The N × 2n matrix Z has the block-diagonal form

    [ 1  t11     0  0      ···  0  0     ]
    [ 1  t12     0  0      ···  0  0     ]   } repeat n1 times
    [ ·································· ]
    [ 1  t1n1    0  0      ···  0  0     ]
    [ 0  0       1  t21    ···  0  0     ]
Z = [ 0  0       1  t22    ···  0  0     ]   } repeat n2 times
    [ ·································· ]
    [ 0  0       1  t2n2   ···  0  0     ]
    [ ·································· ]
    [ 0  0       0  0      ···  1  tn1   ]
    [ 0  0       0  0      ···  1  tn2   ]   } repeat nn times
    [ ·································· ]
    [ 0  0       0  0      ···  1  tnnn  ]
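A short sketch (toy values, not course code) that builds the N × 2n matrix Z from id and time vectors so the columns line up with b = (b11, b21,..., b1n, b2n)ᵀ:

nis <- c(2, 3, 2); n <- 3
id  <- rep(1:n, times = nis)
t   <- c(0, 1, 0, 1, 2, 0, 1)
Zint   <- model.matrix(~ factor(id) - 1)  # N x n intercept indicators
Zslope <- Zint * t                        # N x n subject-specific time columns
ord    <- as.vector(rbind(1:n, 1:n + n))  # interleave intercept and slope columns
Z      <- cbind(Zint, Zslope)[, ord]      # N x 2n, block pattern as displayed above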

134 / 160 Marginal Mean, Variance, and Covariance

Returning to our random slope model, we have

Yij = xijᵀβ + b1i + tij b2i + eij,

where Cov(bi) = Cov[(b1i, b2i)ᵀ] = Σb = [ σ1²  σ12 ; σ12  σ2² ]⁶

The marginal mean, variance, and covariance are

E(Yij) = xijᵀβ
V(Yij) = σ1² + 2 tij σ12 + tij² σ2² + σe²
Cov(Yij, Yik) = σ1² + (tij + tik) σ12 + tij tik σ2²

⁶ Note change of notation from G to Σb. 135 / 160 Marginal Mean, Variance, and Covariance (Cont'd)

Thus, the inclusion of the random slope implies that the marginal variance of Y can vary over time

Moreover, Cov(Yij , Yik ) is no longer compound symmetric, but instead depends on the time separation between the two occasions

More reasonable in longitudinal settings
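A quick numerical illustration (using the true values from the later example, σ1² = 2, σ12 = 1, σ2² = 2, σe² = 2) of how the marginal variance and the lag-1 correlation change with time under the random slope model:

s1 <- 2; s12 <- 1; s2 <- 2; se <- 2
tt <- 0:5
vy   <- s1 + 2 * tt * s12 + tt^2 * s2 + se   # V(Y_ij) increases with t_ij
lag1 <- (s1 + (2 * tt + 1) * s12 + tt * (tt + 1) * s2) /
        sqrt(vy * (s1 + 2 * (tt + 1) * s12 + (tt + 1)^2 * s2 + se))  # Corr(Y_ij, Y_ik), t_ik = t_ij + 1
round(rbind(time = tt, var = vy, lag1.corr = lag1), 3)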

136 / 160 Marginal Mean and Covariance: Vector Form

In vector form we have

E(Yi) = Xiβ   (ni × 1)

Cov(Yi) = Cov(Xiβ + Zibi + ei)   (ni × ni)

        = Cov(Zibi) + Cov(ei)   (by independence of bi and ei)
        = Zi Σb Ziᵀ + σe² I_{ni}
        = Zi Σb Ziᵀ + Ri = Σi, say.

And hence,

Yi | bi ∼ N_{ni}(Xiβ + Zibi, Ri)
Yi ∼ N_{ni}(Xiβ, Σi)

137 / 160 Conjugate Prior Specification

Conjugate priors for the random slope model are:

• β ∼ Np(β0, T0⁻¹)

• bi ∼ Nq(0, Σb)

• τe ∼ Ga(c, d)

• Σb ∼ IW(ν0, C 0), where C 0 is a q × q scale matrix

Equivalently, Σb⁻¹ ∼ Wish(ν0, C0⁻¹)

138 / 160 Inverse-Wishart Distribution

Definition

Let Σ ∼ IW(ν0, C0) with degrees of freedom ν0 and q × q positive-definite scale matrix C0. Then, the density of Σ is

f(Σ) = [ |C0|^{ν0/2} / (2^{ν0 q/2} Γq(ν0/2)) ] |Σ|^{−(ν0+q+1)/2} exp{ −(1/2) tr(C0 Σ⁻¹) },

where Γq(·) is the multivariate gamma function. The expected value of Σ is

E(Σ) = C0 / (ν0 − q − 1),

for ν0 > q + 1.

A popular non-informative prior for Σ is to select ν0 = q + 1 and C0 = I_q. This has the appealing property that the marginal prior distribution of each correlation in Σ is uniform. See Gelman et al. (2014) p. 73 for details. 139 / 160 Relationship between the MVN, Wishart and Inverse-Wishart Distributions

Proposition: Let z1,..., z_{ν0} ∼ Nq(0, C0), iid, and define ZᵀZ = ∑_{i=1}^{ν0} zi ziᵀ. Then, ZᵀZ ∼ Wish(ν0, C0) and (ZᵀZ)⁻¹ ∼ IW(ν0, C0⁻¹).

We can think of ν0 as a prior sample size and C 0 as a prior sum of squares.
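A quick empirical check of the proposition (a sketch, not course code): averaging ZᵀZ over many replicates should recover ν0 C0, the mean of the Wish(ν0, C0) distribution; the mvtnorm package is assumed:

library(mvtnorm)
set.seed(1)
nu0 <- 5
C0  <- matrix(c(1, 0.5, 0.5, 2), 2, 2)
ZtZ <- replicate(5000, crossprod(rmvnorm(nu0, sigma = C0)))  # each replicate is Z'Z (2 x 2)
apply(ZtZ, 1:2, mean)  # approximately nu0 * C0
nu0 * C0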

Moreover, just as the normal/inverse-gamma joint prior is conjugate for the univariate normal mean model, where Y ∼ N(µ, σ2) and

π(µ, σ²) = π(µ|σ²)π(σ²) = N(µ0, σ²/κ0) IG(ν0, σ0²), so is the Matrix Normal/Inverse-Wishart conjugate in the multivariate

case where Y ∼ Nk (µ, Σ) and

π(µ, Σ) = Matnorm(µ0, Σ0, Σ) IW(ν0, C0). For more on Bayesian inference using the Wishart and Inverse-Wishart distributions, please see Hoff (2009) Section 7.3. 140 / 160 Full Conditionals for Random Slope Model∗

One can show that the full conditionals for the linear random slope model are:

1. β | y, rest ∼ Np(m, V), where

   V = (T0 + τe XᵀX)⁻¹   (p × p)
   m = V(T0β0 + τe Xᵀ(y − Zb))   (p × 1)

2. b | y, rest ∼ N_{2n}(m, V), where

   V = (I_n ⊗ Σb⁻¹ + τe ZᵀZ)⁻¹   (2n × 2n)
   m = τe V Zᵀ(y − Xβ)   (2n × 1)

3. τe | y, rest ∼ Ga( c + N/2, d + (y − Xβ − Zb)ᵀ(y − Xβ − Zb)/2 )

4. Σb | y, rest ∼ IW( ν0 + n, C0 + BᵀB ), where B is the n × 2 matrix whose i-th row is (b1i, b2i) (extensions to q > 2 are straightforward)

See Random Slope.r for details 141 / 160 R Code for Random Slope Model

Gibbs Sampler for Random Slope Model

########### # Priors # ########### beta0<-rep(0,p) # Prior Mean for beta T0<-diag(.01,p) # Prior Precision Matrix of beta (vague), independent d0<-g0<-.001 # Hyperpriors for tau nu0<-3 # DF for Wishart prior on Sigmab (G in notes) C0<-diag(2) # Scale matrix for IW Prior on Sigmab

d<-d0+N/2 # Posterior df for taue nu<-nu0+n # Posterior df for Sigmab

################# # GIBBS SAMPLER # ################# for (i in 1:nsim) { # Update Beta vbeta<-solve(T0+taue*crossprod(X,X)) mbeta<-vbeta%*%(T0%*%beta0 + taue*crossprod(X,y-Z%*%b)) beta<-c(rmvnorm(1,mbeta,vbeta))

# Update b precb<-diagn%x%Taub+taue*crossprod(Z) # Posterior precision mb<-taue*crossprod(Z,y-X%*%beta) # Likelihood contribution to posterior mean b<-rmvnorm.canonical(1,mb,precb)[1,] # Update without inverting using spam package

btmp[,]<-b # More efficient to fill in a pre-defined matrix object bmat<-t(btmp)

# Update taue zb<-Z%*%b g<-g0+crossprod(y-X%*%beta-zb)/2 taue<-rgamma(1,d,as.numeric(g))

# Update Taub Sigma.b<-riwish(nu,C0+crossprod(bmat,bmat)) Taub<-solve(Sigma.b)

################# # Store Results # ################# if (i> burn & i%%thin==0){ j<-(i-burn)/thin Beta[j,]<-beta Sigmae[j]<-1/taue Sigmab[j,]<-c(Sigma.b) }

if (i%%50==0) print(i) # Very slow! } 142 / 160 Conditional Update for bk

In general, with q random effects, the full-conditional update for b is multivariate normal with dimension qn

For example, the previous example with n = 1000 subjects and N = 5525 total observations took 2.4 minutes to run 1000 iterations

An alternative is to write out a conditional update for each n × 1 vector bk given the others, where

bk = (bk1,..., bkn)ᵀ,  k = 1,..., q.

For example, with q = 2, we first update b1 | b2, then b2 | b1

143 / 160 Conditional Update for bk

To do so, we first note that b1i depends on b2i only through the bivariate normal prior distribution

bi = (b1i, b2i)ᵀ ∼ N2(0, Σb), where

Σb = [ σ1²    σ12 ]  =  [ σ1²      ρσ1σ2 ]
     [ σ12    σ2² ]     [ ρσ1σ2    σ2²   ]

Appealing to properties of the bivariate normal distribution, the conditional prior for b1i | b2i is

b1i | b2i ∼ N(µ_{1|2}, σ²_{1|2}) = N( ρ(σ1/σ2) b2i , σ1²(1 − ρ²) )

Likewise,

b2i | b1i ∼ N(µ_{2|1}, σ²_{2|1}) = N( ρ(σ2/σ1) b1i , σ2²(1 − ρ²) )
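A tiny numerical check (toy values σ1² = σ2² = 2 and σ12 = 1, so ρ = 0.5) of the conditional prior of b1i given b2i:

s1 <- sqrt(2); s2 <- sqrt(2); rho <- 0.5
b2i <- 0.8
m12 <- rho * (s1 / s2) * b2i   # conditional prior mean of b1i | b2i
v12 <- s1^2 * (1 - rho^2)      # conditional prior variance
c(mean = m12, var = v12)       # 0.4 and 1.5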

144 / 160 Conditional Update for bk

Note that we've moved from a bivariate normal prior for bi to two univariate conditional priors

This allows us to construct the vectors b1 = (b11,..., b1n)ᵀ and b2 = (b21,..., b2n)ᵀ, with conditional priors

b1 | b2 ∼ Nn( ρ(σ1/σ2) b2 , σ1²(1 − ρ²) I_n )  and
b2 | b1 ∼ Nn( ρ(σ2/σ1) b1 , σ2²(1 − ρ²) I_n )

This leads to univariate full conditional updates for b1 and b2, which are much faster than jointly updating the 2n × 1 vector b = (b11, b21,..., b1n, b2n)ᵀ

Extensions to q > 2 random effects are straightforward: one only needs to appeal to conditional properties of the q-dimensional MVN distribution

See Neelon (2015, 2018) for details in the context of semicontinuous and zero-inflated data, including extensions to multivariate spatial data 145 / 160 Modified Gibbs Sampler

This leads to a modied Gibbs sampler:

1 Update β, τe , Σb as before

2 From Σb, form

ρ = σ12 / (σ1σ2)
τ_{1|2} = 1 / [σ1²(1 − ρ²)]
τ_{2|1} = 1 / [σ2²(1 − ρ²)]

3. For i = 1,..., n, update b1i given b2i from N(mi, vi), where

   vi = 1/(τ_{1|2} + τe ni)   (note the similarity to the random intercept model)
   mi = vi [ τ_{1|2} µ_{1|2} + τe ∑_{j=1}^{ni} (yij − xijᵀβ − tij b2i) ]
   µ_{1|2} = ρ(σ1/σ2) b2i

4. For i = 1,..., n, update b2i given b1i from N(mi, vi), where

   vi = 1/(τ_{2|1} + τe ∑_{j=1}^{ni} tij²)
   mi = vi [ τ_{2|1} µ_{2|1} + τe ∑_{j=1}^{ni} tij (yij − xijᵀβ − b1i) ]
   µ_{2|1} = ρ(σ2/σ1) b1i

See Random Slope Conditional.r for details

146 / 160 R Code for Conditional Gibbs Sampler

Conditional Gibbs Sampler for Random Slope Model

# Inits taue<-tau12<-tau21<-1 # Error precision = 1/sigma2, conditional precision of b1|b2, etc sigmab1<-sigmab2<-1 # Marginal variances of b1 and b2 rhob<-0 # Corr(b1,b2) b1<-b2<-rep(0,n) # Random effects (int and slope) beta<-rep(0,p) # Posterior mean and var of beta

# Gibbs for (i in 1:nsim) { # Update beta vbeta<-solve(prec0+taue*crossprod(X,X)) mbeta<-vbeta%*%(prec0%*%beta0 + taue*crossprod(X,y-rep(b1,nis)-t*rep(b2,nis))) beta<-c(rmvnorm(1,mbeta,vbeta))

# Update b1|b2 vb<-1/(tau12+taue*nis) mu12<-rhob*sqrt(sigmab1/sigmab2)*b2 # Prior Mean of b1|b2 mb<-vb*(tau12*mu12+taue*tapply(y-X%*%beta-t*rep(b2,nis),id,sum)) b1<-rnorm(n,mb,sqrt(vb))

# Update b2|b1 vb<-1/(tau21+taue*tapply(t^2,id,sum)) mu21<-rhob*sqrt(sigmab2/sigmab1)*b1 # Prior Mean of b2|b1 mb<-vb*(tau21*mu21+taue*tapply(t*(y-X%*%beta-rep(b1,nis)),id,sum)) b2<-rnorm(n,mb,sqrt(vb))

# Update taue g<-g0+crossprod(y-X%*%beta-rep(b1,nis)-t*rep(b2,nis))/2 taue<-rgamma(1,d0+N/2,g)

# Update Sigma.b bmat<-cbind(b1,b2) Sigma.b<-riwish(nu0+n,C0+crossprod(bmat)) sigmab1<-Sigma.b[1,1] # Marginal variance of b1 sigmab2<-Sigma.b[2,2] # Marginal variance of b2 rhob<-Sigma.b[1,2]/sqrt(sigmab1*sigmab2) # Corr(b1,b2) tau12<-1/(sigmab1*(1-rhob^2)) # Conditional precision of b1|b2 tau21<-1/(sigmab2*(1-rhob^2)) # Conditional precision of b2|b1 ################# # Store Results # ################# if (i> burn & i%%thin==0){ j<-(i-burn)/thin Beta[j,]<-beta Sigmae[j]<-1/taue Sigmab[j,]<-c(Sigma.b) B1[j,]<-b1[1:10] # Store first 10 random effects B2[j,]<-b2[1:10] } } # 66 seconds to run 10,000 iterations with n=1000

147 / 160 Example: The program Random Slope Conditional.r fits the following random slope model:

Yij = β0 + β1tij + b1i + b2i tij + eij,  i = 1,..., 1000; j = 1,..., ni
bi ∼ N2(0, Σb)  (iid)
eij ∼ N(0, σe²)  (iid)

The results, based on 10,000 iterations with a burn-in of 5000 and thinning of 5, are:

Table 11: Results from Conditional Gibbs Algorithm.

Parameter   True Value   MLE (SE)†      Posterior Mean (SD)
β0          −1           −1.03 (0.06)   −1.03 (0.06)
β1          1            0.97 (0.05)    0.96 (0.04)
σ1²         2            2.04 (0.16)    2.01 (0.15)
σ12         1            1.08 (0.09)    1.08 (0.09)
σ2²         2            2.04 (0.10)    2.04 (0.10)
σe²         2            2.04 (0.05)    2.05 (0.04)
† From SAS Proc Mixed.

148 / 160 Comments and Extensions to GLMMs

The conditional approach is especially useful for joint modeling (e.g., correlated outcomes or ZI models with joint effects)

It easily accommodates different sample sizes for each outcome

More generally, Bayesian random effects models can be extended to accommodate discrete outcomes

149 / 160 Logistic Random Intercept Model

Consider the following logistic random intercept model

logit[Pr(Yij = 1)] = logit(φij) = xijᵀβ + bi
bi ∼ N(0, τb⁻¹),  i = 1,..., n; j = 1,..., ni.

We can modify the Pólya-Gamma data-augmentation sampler to accommodate the random effect bi:

1 For all (i, j), update the PG weights, ωij , to include bi

2 Form the corresponding latent normal variables, Zij

3 Conditional on the latent normal variables, update β and bi (i = 1,..., n) from the LMM formulas described earlier

150 / 160 Gibbs Sampler for Logistic Random Intercept Model

Specically, the Gibbs sampler proceeds as follows:

1. For all (i, j), update ωij from PG(1, ηij), where ηij = xijᵀβ + bi

2. For all (i, j), define zij = (yij − 1/2)/ωij

3. Update β from Np(m, V), where

   V = (T0 + XᵀΩX)⁻¹
   m = V( T0β0 + XᵀΩ(z − Wb) ),

   where Ω = diag(ωij) is N × N and W denotes the N × n random effects design matrix.

4. For all i, update bi from N(mi, vi), where

   vi = 1/(τb + ∑_{j=1}^{ni} ωij)
   mi = vi Wiᵀ Ωi (zi − Xiβ) = vi ∑_{j=1}^{ni} ωij (zij − xijᵀβ)

5. Update τb from Ga( c + n/2, d + bᵀb/2 ), where c and d are the Gamma hyperparameters

See Logistic Random Intercept.r for details 151 / 160 R Code for Gibbs Sampler

Gibbs Sampler for Logistic Random Intercept Model

# Priors beta0<-rep(0,p) # Prior mean for beta T0<-diag(.01,p) # Prior precision for beta c<-d<-0.01 # Gamma hyperpriors

# Inits beta<-rep(0,p) b<-rep(0,n) # Random effect taub<-1 # Random effect precision

######### # Gibbs # ######### tmp<-proc.time()

for (i in 1:nsim){

# Update z mu<-X%*%beta+rep(b,nis) omega<-rpg(N,1,mu) z<-(y-1/2)/omega

# Update beta v<-solve(crossprod(sqrt(omega)*X)+T0) m<-v%*%(T0%*%beta0+t(sqrt(omega)*X)%*%(sqrt(omega)*(z-rep(b,nis)))) beta<-c(rmvnorm(1,m,v))

# Update b vb<-1/(taub+tapply(omega,id,sum)) mb<-vb*(tapply(omega*(z-X%*%beta),id,sum)) b<-rnorm(n,mb,sqrt(vb))

# Update taub taub<-rgamma(1,c+n/2,d+crossprod(b)/2)

# Store if (i> burn & i%%thin==0){ j<-(i-burn)/thin Beta[j,]<-beta Sigmab[j]<-1/taub }

if (i%%100==0) print(i) }

tot.time<-proc.time()-tmp # 8.5 secs to run 1000 iterations 152 / 160

Negative Binomial Random Intercept Model

The Gibbs sampler for the negative binomial model has a similar form:

1. For all (i, j), draw ωij from its PG(yij + r, ηij) distribution, where ηij = xijᵀβ + bi

2. For all (i, j), define zij = (yij − r)/(2ωij)

3 Update β from its Np(m, V ) full conditional, where expressions for m and V are identical to the logistic GLMM

4 Update b and τb from their full conditionals analogous to the ones for the logistic model

5 Update r using random-walk MH or a conjugate Gamma update following Dadaneh et al.

See NB Random Intercept.r for details 153 / 160 R Code for Hybrid Gibbs-MH Sampler

Gibbs Sampler for NB Random Intercept Model

# Priors similar to fixed effects NB model, but with gamma for taub

# Inits beta<-rep(0,p) b<-rep(0,n) # Random effects taub<-1 # Random effect precision r<-1 # Inverse Dispersion Acc<-0 # Acceptance rate indicator s<-.005 # Proposal variance

######## # MCMC # ######## for (i in 1:nsim){ # Update z and beta eta<-X%*%beta+rep(b,nis) omega<-rpg(N,y+r,eta) # Polya weights z<-(y-r)/(2*omega) v<-solve(crossprod(X*sqrt(omega))+T0) m<-v%*%(T0%*%beta0+t(sqrt(omega)*X)%*%(sqrt(omega)*(z-rep(b,nis)))) beta<-c(rmvnorm(1,m,v))

# Update b vb<-1/(taub+tapply(omega,id,sum)) mb<-vb*(tapply(omega*(z-X%*%beta),id,sum)) b<-rnorm(n,mb,sqrt(vb))

# Update taub taub<-rgamma(1,c+n/2,d+crossprod(b)/2)

# Update r via random-walk MH with a truncated-normal proposal q<-1/(1+exp(eta)) # dnbinom uses prob=1-p rnew<-rtnorm(1,r,sqrt(s),lower=0) # Proposed value ratio<-sum(dnbinom(y,rnew,q,log=T))-sum(dnbinom(y,r,q,log=T))+ dtnorm(r,rnew,sqrt(s),0,log=T)-dtnorm(rnew,r,sqrt(s),0,log=T) # Likelihood ratio plus Hastings correction for the asymmetric proposal if (log(runif(1))<ratio) {r<-rnew; Acc<-Acc+1} # Accept/reject; Acc tracks acceptances

if (i> burn & i%%thin==0){ j<-(i-burn)/thin Beta[j,]<-beta Sigmab[j]<-1/taub R[j]<-r }

if (i%%100==0) print(i) # 23 secs to run 1000 iterations with n=1000 and N=5515 } 154 / 160

R Code with Gibbs update for r

Gibbs Sampler for NB Random Intercept Model with Gibbs Update for r

# Priors similar to fixed effects NB gibbs model, but with gamma for taub

# Inits beta<-rep(0,p) b<-rep(0,n) # Random effects taub<-1 # Random effect precision r<-1 # Inverse Dispersion

######## # MCMC # ######## for (i in 1:nsim){

# Update z eta<-X%*%beta+rep(b,nis) omega<-rpg(N,y+r,eta) # Polya weights z<-(y-r)/(2*omega)

# Update beta v<-solve(crossprod(X*sqrt(omega))+T0) m<-v%*%(T0%*%beta0+t(sqrt(omega)*X)%*%(sqrt(omega)*(z-rep(b,nis)))) beta<-c(rmvnorm(1,m,v))

# Update b vb<-1/(taub+tapply(omega,id,sum)) mb<-vb*(tapply(omega*(z-X%*%beta),id,sum)) b<-rnorm(n,mb,sqrt(vb))

# Update taub taub<-rgamma(1,c+n/2,d+crossprod(b)/2)

# Update latent counts, l, using CRT distribution for(j in 1:N) l[j]<- sum(rbinom(y[j],1,round(r/(r+1:y[j]-1),6))) # Could try to avoid loop # Rounding avoids numerical instability # Update r from conjugate gamma distribution given l and psi eta<-X%*%beta+rep(b,nis) psi<-exp(eta)/(1+exp(eta)) r<-rgamma(1,a+sum(l),b-sum(log(1-psi)))

if (i> burn & i%%thin==0){ j<-(i-burn)/thin Beta[j,]<-beta Sigmab[j]<-1/taub R[j]<-r B[j,]<-b }

if (i%%100==0) print(i) # 49 secs to run 1000 iterations with n=1000 and N=5515 } 155 / 160

Joint Probit-Normal Model

The conditional prior structure can be extended to accommodate joint models that are linked via correlated random eects

For example, consider the following bivariate probit-normal regression model for a longitudinal binary variable Y1 and a longitudinal continuous variable Y2:

Φ⁻¹[Pr(Y1ij = 1)] = xijᵀβ + b1i
Y2ij = xijᵀγ + b2i + eij

(b1i, b2i)ᵀ ∼ N2(0, Σb)

Σb = [ σ1²   σ12 ]
     [ σ12   σ2² ]

eij ∼ N(0, τe⁻¹),  i = 1,..., n; j = 1,..., ni

156 / 160 Gibbs Sampler for the Probit-Normal Model

Using the conditional update for bi , the Gibbs sampler for the probit-normal model is straightforward:

1 Augment the Albert & Chib sampler with a random intercept to update the probit model parameters

2 Fit a linear random intercept model to update the linear model parameters

3 Update the random intercepts using the conditional formula described above

4 Update Σb from its IW full conditional

Note that the association between the outcomes comes from σ12, the off-diagonal of Σb, which is a fairly weak form of dependence 157 / 160 Gibbs Sampler for the Probit-Normal Model

The specic details are:

1 Probit model:

• For all (i, j), draw a latent normal zij from a N(xijᵀβ + b1i, 1) distribution truncated below (above) by 0 for Y1ij = 1 (= 0)

• Conditional on z, update β and b1i from their normal full conditionals using expressions for a linear random intercept model

2 Linear model:

• Update γ and b2i from their normal full conditionals using expressions for a linear random intercept model

3 Update Σb from its conjugate IW full conditional and retrieve τ1|2, τ2|1, and ρ as described earlier

See Probit-Normal Random Intercept.r for details

158 / 160 R Code for Gibbs Sampler

Gibbs Sampler for Probit-Normal Intercept Model

# Priors same as for probit and linear model with IW prior for (b1,b2)

vbeta<-solve(T0+crossprod(X,X)) # Posterior Var(beta) -- can do outside loop

# GIBBS SAMPLER for (i in 1:nsim) { # Binomial Model # Draw Latent Variable, z muz<-X%*%beta+rep(b1,nis) # Mean of z z[y1==0]<-qnorm(runif(N,0,pnorm(0,muz)),muz)[y1==0] z[y1==1]<-qnorm(runif(N,pnorm(0,muz),1),muz)[y1==1]

# Update beta mbeta <- vbeta%*%(T0%*%beta0+crossprod(X,z-rep(b1,nis))) beta<-c(rmvnorm(1,mbeta,vbeta))

# Update b1|b2 taub12<-taub1/(1-rhob^2) # Prior precision of b1|b2 mb12<-rhob*sqrt(taub2/taub1)*b2 # Prior mean of b1|b2 vb1<-1/(nis+taub12) # Posterior var of b1|b2,y mb1<-vb1*(taub12*mb12+tapply(z-X%*%beta,id,sum)) # Posterior mean of b1|b2,y b1<-rnorm(n,mb1,sqrt(vb1))

# Linear Model # Update gamma vgam<-solve(G0+taue*crossprod(X)) # G0 is prior precision of gamma mgam<-vgam%*%(G0%*%gamma0 + taue*crossprod(X,y2-rep(b2,nis))) gamma<-c(rmvnorm(1,mgam,vgam))

# Update taue l<-l0+crossprod(y2-X%*%gamma-rep(b2,nis))/2 taue<-rgamma(1,d0+N/2,l)

# Update b2|b1 taub21<-taub2/(1-rhob^2) # Prior precision of b2|b1 mb21<-rhob*sqrt(taub1/taub2)*b1 # Prior mean of b2|b1 vb2<-1/(taue*nis+taub21) # Posterior var of b2|b1 mb2<-vb2*(taub21*mb21+taue*tapply(y2-X%*%gamma,id,sum)) b2<-rnorm(n,mb2,sqrt(vb2)) b<-cbind(b1,b2)

# Update variance components Sigmab<-riwish(nu0+n,c0+crossprod(b)) rhob<-Sigmab[1,2]/sqrt(Sigmab[1,1]*Sigmab[2,2]) taub1<-1/Sigmab[1,1] taub2<-1/Sigmab[2,2] } # 7.4 seconds to run 1000 iterations with n=1000 and N=5441 159 / 160

A Look Ahead

There are countless settings in which Bayesian inference is useful, including

• Latent class and state-space modeling

• Variable selection

• Density estimation

• Spatial data analysis and disease mapping

• Pattern recognition and language processing

• Functional data analysis

Please see me if you're interested in working on Bayesian methods for your dissertation

Homework: The final homework will be posted on the course website this evening and will be due via e-mail 2 weeks from today at 5pm 160 / 160