Bayesian Modeling Strategies for Generalized Linear Models, Part 1

Reading: Hoff Chapter 9; Albert and Chib (1993) Sections 1–3.2, 4.1; Polson (2012) Sections 1–3, Appendix S6.2; Pillow and Scott (2012) Sections 1–3.1, 4; Dadaneh et al. (2018); Neelon (2018) Sections 1 and 2

Fall 2018

1 / 160 Model

Consider the following model:

$Y_i = x_i^T\beta + e_i, \quad i = 1, \ldots, n$, where

• $x_i$ is a $p \times 1$ vector of covariates (including an intercept)
• $\beta$ is a $p \times 1$ vector of regression coefficients
• $e_i \overset{iid}{\sim} N(0, \tau_e^{-1})$, where $\tau_e = 1/\sigma_e^2$ is a precision term

Or, combining all n observations: Y = X β + e, where Y and e are n × 1 vectors and X is an n × p design matrix 2 / 160 Maximum Likelihood Inference for Linear Model

It is straightforward to show that the MLEs are

$\hat\beta = (X^T X)^{-1} X^T Y$

$\hat\sigma_e^2 = \dfrac{1}{n}(Y - X\hat\beta)^T(Y - X\hat\beta)$ is the MLE

$\tilde\sigma_e^2 = \dfrac{1}{n - p}(Y - X\hat\beta)^T(Y - X\hat\beta)$ is the REML estimate

Gauss-Markov Theorem: Under the standard linear regression assumptions, βb is the best linear unbiased estimator (BLUE) of β

3 / 160 Prior Specification for Linear Model

In the Bayesian framework, we place priors on $\beta$ and $\sigma_e^2$ (or, equivalently, $\tau_e$)

Common choices are the so-called semi-conjugate or conditionally conjugate priors

$\beta \sim N_p(\beta_0, T_0^{-1})$, where $T_0$ is the prior precision

$\tau_e \sim \mathrm{Ga}(a, b)$, where $b$ is a rate parameter. Equivalently, we can say that $\sigma_e^2$ is inverse-Gamma (IG) with shape $a$ and scale $b$

These choices lead to conjugate full conditional distributions

4 / 160 Choice of Prior Parameters

Common choices for the prior parameters include:

• β0 = 0

• $T_0 = 0.01 I_p \Rightarrow \Sigma_0 = T_0^{-1} = 100 I_p$
• $a = b = 0.001$

See Hoff Section 9.2.2 for alternative default priors, including Zellner's g-prior for $\beta$

5 / 160 Full Conditional Distributions∗

Can show that the full conditional for β is

β|Y = y, τe ∼ N(m, V ), where

$V = (T_0 + \tau_e X^T X)^{-1}$ and

$m = V(T_0\beta_0 + \tau_e X^T y)$

To derive this, you must complete the square in $p$ dimensions:

Proposition: For vectors $\beta$ and $\eta$ and a symmetric, invertible matrix $V$,

$\beta^T V^{-1}\beta - 2\eta^T\beta = (\beta - V\eta)^T V^{-1}(\beta - V\eta) - \eta^T V\eta$

6 / 160 Full Conditional Distributions (Cont'd)

Similarly, the full conditional distribution of τe is

$\tau_e \mid y, \beta \sim \mathrm{Ga}(a^*, b^*)$, where

$a^* = a + n/2$

$b^* = b + (y - X\beta)^T(y - X\beta)/2$

Homework for Friday: Please derive the full conditionals for β and τe

We can use these full conditionals to develop a straightforward Gibbs sampler for posterior inference

See Linear Regression.r for an illustration

7 / 160 R Code for Linear Regression Gibbs Sampler

Gibbs Sampler for Linear Regression Model

library(mvtnorm)

# Priors
beta0<-rep(0,p)            # Prior mean of beta, where p=# of parameters
T0<-diag(0.01,p)           # Prior precision of beta
a<-b<-0.001                # Gamma hyperparms for taue

# Inits
taue<-1                    # Error precision

# MCMC Info
nsim<-1000                 # Number of MCMC iterations
thin<-1                    # Thinning interval
burn<-nsim/2               # Burn-in
lastit<-(nsim-burn)/thin   # Last stored value

# Store
Beta<-matrix(0,lastit,p)   # Matrices to store results
Sigma2<-rep(0,lastit)
Resid<-matrix(0,lastit,n)  # Store resids
Dy<-matrix(0,lastit,512)   # Store density values for residual density plot
Qy<-matrix(0,lastit,100)   # Store quantiles for QQ plot

#########
# Gibbs #
#########
tmp<-proc.time()           # Store current time

for (i in 1:nsim){
  # Update beta
  v<-solve(T0+taue*crossprod(X))
  m<-v%*%(T0%*%beta0+taue*crossprod(X,y))
  beta<-c(rmvnorm(1,m,v))

  # Update taue
  taue<-rgamma(1,a+n/2,b+crossprod(y-X%*%beta)/2)

  # Store samples
  if (i>burn & i%%thin==0){
    j<-(i-burn)/thin
    Beta[j,]<-beta
    Sigma2[j]<-1/taue
    Resid[j,]<-resid<-y-X%*%beta                                       # Raw resids
    Dy[j,]<-density(resid/sd(resid),from=-5,to=5)$y                    # Density of standardized resids
    Qy[j,]<-quantile(resid/sd(resid),probs=seq(.001,.999,length=100))  # Quantiles for QQ plot
  }
  if (i%%100==0) print(i)
}
run.time<-proc.time()-tmp  # MCMC run time -- took 1 second to run 1000 iterations with n=1000 subjects

8 / 160

Example

The program Linear Regression.r simulates 1000 observations from the following linear regression model:

$Y_i = \beta_0 + \beta_1 x_i + e_i, \quad i = 1, \ldots, 1000$, with $e_i \overset{iid}{\sim} N(0, \sigma_e^2)$

The results, based on 1000 iterations with a burn-in of 500, are:

Table 1: Results for Linear Regression Model.

Parameter   True Value   MLE (SE)       Posterior Mean (SD)
β0          −1           −1.01 (0.06)   −1.01 (0.06)
β1          1            0.99 (0.06)    0.99 (0.06)
σe²         4            3.83 (0.17)    3.84 (0.16)

9 / 160 Plots of Standardized Residuals

[Figure: density plot of the standardized residuals (MLE vs. Bayes) and a QQ plot of the sample quantiles against normal quantiles]

10 / 160 Skew-Normal Data

Suppose we generate data from a skew-normal distribution¹ SN(µ, ω², α), where µ ∈ ℝ, ω > 0, and α ∈ ℝ are location, scale, and skewness parameters

α > 0 ⇒ positive skewness, α < 0 ⇒ negative skewness, and when α = 0, the density reduces to a symmetric normal distribution

For details on Bayesian approaches to fitting SN models, see Frühwirth-Schnatter and Pyne (2010) and Neelon et al. (2015)

For now, suppose we ignore skewness and fit an ordinary linear regression to the data

See Residual Diagnostics with SN Data.r for details

¹ O'Hagan and Leonard (1976); Azzalini (1985)

11 / 160 Plots of Standardized Residuals

[Figure: density of the true skew-normal errors (α = −5, ω = 4), density of the standardized residuals (MLE vs. Bayes), and a QQ plot of the observed quantiles against normal quantiles]

12 / 160 Probit and Logit Models for Binary Data∗

Consider the following model for a dichotomous outcome $Y_i$:

$\Phi^{-1}[\Pr(Y_i = 1)] = x_i^T\beta, \quad i = 1, \ldots, n,$

where Φ(·) denotes the CDF of a standard normal random variable

We can represent the model via a latent variable $Z_i$ such that

$Z_i \sim N(x_i^T\beta, 1)$ and

$Y_i = 1 \iff Z_i > 0$, implying that $\Pr(Y_i = 1) = \Pr(Z_i > 0) = \Phi(x_i^T\beta)$
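As a quick illustration (not part of the course programs; the sample size and coefficient values below are arbitrary), the following R sketch simulates binary data through the latent-variable representation and checks that the empirical event rate matches $\Phi(x_i^T\beta)$:

set.seed(1)
n<-10000
x<-rnorm(n)
X<-cbind(1,x)
beta<-c(-0.5,1)            # Hypothetical true values
z<-X%*%beta+rnorm(n)       # Z_i ~ N(x_i'beta, 1)
y<-as.numeric(z>0)         # Y_i = 1 iff Z_i > 0
mean(y)                    # Empirical Pr(Y=1)
mean(pnorm(X%*%beta))      # Model-implied Pr(Y=1); should agree closely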

13 / 160 Latent Variable Interpretation of Probit Model

[Figure: density of $Z_i \sim N(x_i^T\beta, 1)$ with the area above zero shaded, illustrating $\Pr(Z_i > 0) = \Pr(Y_i = 1)$]

14 / 160 Albert and Chib (1993) Data-Augmented Sampler

Albert and Chib (1993) take advantage of this latent variable structure to develop an efficient data-augmented Gibbs sampler for probit regression

Data augmentation is a method by which we introduce additional (or augmented) variables, $Z = (Z_1, \ldots, Z_n)^T$, as part of the Gibbs sampler to facilitate sampling

Data augmentation is useful when the conditional density π(β|y) is intractable, but the joint posterior π(β, z|y) is easy to sample from via Gibbs, where z is an n × 1 vector of realizations of Z

15 / 160 Data Augmentation Sampler

In particular, suppose it's straightforward to sample from the full conditionals π(β|y, z) and π(z|y, β)

Then we can apply Gibbs sampling to obtain the joint posterior π(β, z|y)

After convergence, the samples of β, {β(1),..., β(T )}, will constitute a Monte Carlo sample from π(β|y)

Note that if β and Y are conditionally independent given Z, so that π(β|y, z) = π(β|z), then the sampler proceeds in two stages:

1 Draw z from π(z|y, β)

2 Draw β from π(β|z)

16 / 160 Gibbs Sampler for Probit Model∗

The data-augmented sampler proposed by Albert and Chib proceeds by assigning a $N_p(\beta_0, T_0^{-1})$ prior to $\beta$ and defining the posterior variance of $\beta$ as $V = (T_0 + X^T X)^{-1}$

Note that because $\mathrm{Var}(Z_i) = 1$, we can define $V$ outside the Gibbs loop. Next, we iterate through the following Gibbs steps:

1. For $i = 1, \ldots, n$, sample $z_i$ from a $N(x_i^T\beta, 1)$ distribution truncated below (above) by 0 for $y_i = 1$ ($y_i = 0$)

2. Sample $\beta$ from $N_p(m, V)$, where $m = V(T_0\beta_0 + X^T z)$ and $V$ is defined above

Note: Conditional on Z, β is independent of Y , so we can work solely with the augmented likelihood when updating β

See Probit.r for details 17 / 160 R Code for Probit Gibbs Sampler

Gibbs for Probit Regression Model

# Priors
beta0<-rep(0,p)            # Prior mean of beta (of dimension p)
T0<-diag(.01,p)            # Prior precision of beta

# Inits
beta<-rep(0,p)
z<-rep(0,n)                # Latent normal variables
ns<-table(y)               # Category sample sizes

# MCMC info analogous to linear reg. code

# Posterior var of beta -- Note: can calculate outside of loop
vbeta<-solve(T0+crossprod(X,X))

#########
# Gibbs #
#########
tmp<-proc.time()           # Store current time

for (i in 1:nsim){

  # Update latent normals, z, from truncated normal using inverse-CDF method
  muz<-X%*%beta
  z[y==0]<-qnorm(runif(ns[1],0,pnorm(0,muz[y==0],1)),muz[y==0],1)
  z[y==1]<-qnorm(runif(ns[2],pnorm(0,muz[y==1],1),1),muz[y==1],1)

  # Alternatively, can use rtnorm function from msm package -- this is slower
  # z[y==0]<-rtnorm(n0,muz[y==0],1,-Inf,0)
  # z[y==1]<-rtnorm(n1,muz[y==1],1,0,Inf)

  # Update beta
  mbeta<-vbeta%*%(T0%*%beta0+crossprod(X,z))   # Posterior mean of beta
  beta<-c(rmvnorm(1,mbeta,vbeta))

  #################
  # Store Results #
  #################
  if (i>burn & i%%thin==0){
    j<-(i-burn)/thin
    Beta[j,]<-beta
  }

  if (i%%100==0) print(i)

}

proc.time()-tmp  # MCMC run time -- 0.64 seconds to run 1000 iterations with n=1000

18 / 160

Example

The program Probit.r fits the following probit model:

$Y_i \sim \mathrm{Bern}(\pi_i)$

$\Phi^{-1}(\pi_i) = \beta_0 + \beta_1 x_i, \quad i = 1, \ldots, 1000.$

The results, based on 1000 iterations with a burn-in of 500, are:

Table 2: Results for Probit Model.

Parameter   True Value   MLE (SE)       Posterior Mean (SD)†
β0          −0.5         −0.64 (0.07)   −0.64 (0.07)
β1          0.5          0.55 (0.04)    0.56 (0.04)

† Based on Albert and Chib data augmentation sampler.

19 / 160 t-Link Models

Albert and Chib also discuss extensions to so-called t-link models that allow Z to assume a t-distribution with heavier tails than the normal distribution

The choice of df will correspond to different link functions

A t1 distribution corresponds to a Cauchy link

A t8 distribution approximates a (scaled) logistic distribution

And, in the limit, t∞ implies a probit link

By varying the dfs, we permit flexibility in the choice of link function

20 / 160 Latent Variable Interpretation of t-Link Model

[Figure: densities of $t_3(x_i^T\beta)$ and $N(x_i^T\beta, 1)$ latent variables, with the area above zero corresponding to $\Pr(Z_i > 0) = \Pr(Y_i = 1)$]

21 / 160 t-Link vs Logistic Quantiles

In particular, a t8/0.634 approximates a logistic distribution

[Figure: QQ plot of $t_3$, $t_8$, and $t_{16}$ quantiles against logistic quantiles, with the scaled logistic (× 0.634) closely matching the $t_8$ quantiles]

22 / 160 Gibbs Sampler for the t-Link Model∗

Recall that a t-distribution arises as a scale mixture of normals, where the scale (i.e., variance) parameter follows an IG distribution

Equivalently, the precision of the normal variables is gamma

In particular, a tν random variable, Zi (i = 1,..., n), can be generated by

1. Generating $\lambda_i$ from Ga($\nu/2$, $\nu/2$) (parameterized as rate)

2. Generating $Z_i \mid \lambda_i$ from $N(x_i^T\beta, \lambda_i^{-1})$

Marginally, $Z_i$ is distributed as $t_\nu(x_i^T\beta)$, and the original binary $Y_i$ is modeled using the corresponding $t_\nu$-link function
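A quick simulation check of this scale-mixture representation (a sketch, not part of the course code):

set.seed(1)
nu<-8
nsamp<-1e5
lambda<-rgamma(nsamp,nu/2,rate=nu/2)       # lambda ~ Ga(nu/2, nu/2)
z<-rnorm(nsamp,0,sqrt(1/lambda))           # Z | lambda ~ N(0, 1/lambda)
qqplot(qt(ppoints(500),df=nu),quantile(z,ppoints(500)),
       xlab="t quantiles",ylab="mixture quantiles")
abline(0,1)                                # Points should fall on the 45-degree line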

23 / 160 Gibbs Sampler for the t-Link Model∗

This leads to the following modification of the probit Gibbs sampler:

1. For $i = 1, \ldots, n$, sample $z_i$ from its truncated $N(x_i^T\beta, \lambda_i^{-1})$ distribution, analogous to the earlier sampler

2. Sample $\beta$ from $N_p(m, V)$, where

$V = (T_0 + X^T W X)^{-1}$

$m = V(T_0\beta_0 + X^T W z)$ (analogous to WLS)

$W = \mathrm{diag}(\lambda_i)$

3. For $i = 1, \ldots, n$, sample $\lambda_i$ from $\mathrm{Ga}\big((\nu + 1)/2,\ (\nu + (z_i - x_i^T\beta)^2)/2\big)$

4 Optionally, place a discrete uniform prior on ν and apply the discrete version of Bayes' Theorem to get the posterior probabilities

5 For the logistic approximation, set ν = 8 and divide the posterior means and SDs by 0.634 to recover the logistic results

See t-Link Logit.r for details

24 / 160 R Code for t-Link Logistic Model

Gibbs Sampler for t-Link Logit Model

# A&C assign diffuse (improper) priors for beta, so no explicit specification

# Initial Values
lambda<-rep(1,n)           # Weights
nu<-8                      # t df (8 ~ logistic) -- assume fixed
z<-rep(0,n)                # Latent z vector
ns<-table(y)               # Category sample sizes

########### # Gibbs # ########### tmp<-proc.time() # Store current time

for (i in1:nsim) {

# Update z using inverse-CDF method muz<-X%*%beta z[y==0]<-qnorm(runif(ns[1],0,pnorm(0,muz[y==0],sqrt(1/lambda[y==0]))),muz[y==0],sqrt(1/lambda[y==0])) z[y==1]<-qnorm(runif(ns[2],pnorm(0,muz[y==1],sqrt(1/lambda[y==1])),1),muz[y==1],sqrt(1/lambda[y==1]))

vbeta<-solve(crossprod(sqrt(lambda)*X)) # Can no longer update outside Gibbs loop betahat<-vbeta%*%(crossprod(lambda*X,z)) beta<-c(rmvnorm(1,betahat,vbeta))

# Update lambda lambda<-rgamma(n,(nu+1)/2,(nu+(z-X%*%beta)^2)/2)

################# # Store Results # ################# if (i> burn & i%%thin==0){ j<-(i-burn)/thin Beta[j,]<-beta }

if (i%%100==0) print(i)

}

proc.time()-tmp # MCMC run time -- 1 sec to run 1000 iterations with n=1000

# Results mbeta<-colMeans(Beta/.634) # Correction factor is 1/.634 sbeta<-apply(Beta/.634,2,sd)

25 / 160

1 Example

The program t-Link Logit.r generates 1000 observations from the following logistic model:

Yi ∼ Bern(πi )

logit(πi ) = β0 + β1xi , i = 1,..., 1000.

The results, based on 1000 iterations with a burn-in of 500, are:

Table 3: Results for Logistic Model with t-Link.

Parameter   True Value   MLE (SE)       Posterior Mean (SD)†
β0          −1           −1.08 (0.08)   −1.08 (0.08)
β1          1            0.92 (0.09)    0.93 (0.09)

† Based on Albert and Chib t-link model.

26 / 160 Using Pólya-Gamma Latent Variables

Polson et al. (2012) proposed an alternative Gibbs sampler for logistic and negative binomial models

The approach introduces a vector of latent variables, Zi , that are scale mixtures of normals with independent Pólya-Gamma precision terms rather than Gamma precision terms as in the t-link model

A random variable $\omega$ is said to have a Pólya-Gamma distribution with parameters $b > 0$ and $c \in \mathbb{R}$, written $\omega \sim \mathrm{PG}(b, c)$, if

$\omega \overset{d}{=} \dfrac{1}{2\pi^2} \sum_{k=1}^{\infty} \dfrac{g_k}{(k - 1/2)^2 + c^2/(4\pi^2)},$

where the $g_k$'s are independently distributed according to a Ga($b$, 1) distribution

27 / 160 Pólya-Gamma Density Plot

[Figure: density of the PG(1, 0) distribution overlaid with a Ga(2, 10) density for comparison]
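The figure above can be reproduced with a short sketch using the rpg function (assuming the BayesLogit package is installed; the Ga(2, 10) overlay matches the figure legend):

library(BayesLogit)
set.seed(1)
w<-rpg(1e5,1,0)                                     # Draws from PG(1, 0)
plot(density(w,from=0),xlim=c(0,1.5),main="PG(1, 0) vs Ga(2, 10)")
curve(dgamma(x,2,rate=10),add=TRUE,lty=2)           # Gamma density for comparison
legend("topright",c("PG(1, 0)","Ga(2, 10)"),lty=1:2)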

28 / 160 Properties of the Pólya-Gamma Distribution

Polson et al. establish an important property of the PG($b$, $c$) density, namely that for $a \in \mathbb{R}$ and $\eta \in \mathbb{R}$,

$\dfrac{(e^\eta)^a}{(1 + e^\eta)^b} = 2^{-b} e^{\kappa\eta} \int_0^\infty e^{-\omega\eta^2/2}\, p(\omega \mid b, 0)\, d\omega,$

where κ = a − b/2 and p(ω|b, 0) denotes a PG(b, 0) density.

The left-hand side has the same functional form as the probability parameter in a logistic regression model

The integrand on the right-hand side is the kernel of a normal density with precision ω (i.e., the conditional density of η) times the prior for ω
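This identity can be verified numerically with a small Monte Carlo sketch (the values of a, b, and η below are arbitrary):

library(BayesLogit)
set.seed(1)
a<-1; b<-1; eta<-0.7
kappa<-a-b/2
omega<-rpg(1e5,b,0)                                       # omega ~ PG(b, 0)
lhs<-exp(eta)^a/(1+exp(eta))^b
rhs<-2^(-b)*exp(kappa*eta)*mean(exp(-omega*eta^2/2))      # Monte Carlo estimate of the integral
c(lhs=lhs,rhs=rhs)                                        # Should agree up to Monte Carlo error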

29 / 160 Connection to the Logistic Model

In particular, under the logistic model, the likelihood for the binary response vector $Y = (Y_1, \ldots, Y_n)^T$ is

$p(y \mid \beta) = \prod_{i=1}^n p(y_i \mid \beta) = \prod_{i=1}^n \left(\dfrac{\exp(\eta_i)}{1 + \exp(\eta_i)}\right)^{y_i} \left(\dfrac{1}{1 + \exp(\eta_i)}\right)^{1 - y_i} = \prod_{i=1}^n \dfrac{(e^{\eta_i})^{y_i}}{1 + e^{\eta_i}},$

where $\eta_i = x_i^T\beta$.

The i-th element of the Bernoulli likelihood has the same form as the left-hand expression in the earlier property, with ai = Yi and b = 1. 30 / 160 Connection to the Logistic Model

Thus, we can re-write the Bernoulli likelihood in terms of the Pólya-Gamma random variables $\omega = (\omega_1, \ldots, \omega_n)^T$:

$p(y_i \mid \beta) \propto e^{\kappa_i \eta_i} \int_0^\infty e^{-\omega_i \eta_i^2/2}\, p(\omega_i \mid 1, 0)\, d\omega_i,$

where $\kappa_i = y_i - \tfrac{1}{2}$.

So, at each iteration, we'll draw $\omega_i$, update $\beta$ based on the normal model for $\eta_i = x_i^T\beta$, and then average across the iterations to perform the Monte Carlo integration

31 / 160 Distribution of Latent Normals Z

By appealing to the above properties of the Pólya-Gamma distribution, Polson et al. show that the full conditional distribution of $\beta$, given $Y$ and $\omega$, is

$p(\beta \mid Y = y, \omega) \propto \pi(\beta)\, \exp\!\left\{-\tfrac{1}{2}(z - X\beta)^T W (z - X\beta)\right\},$

where

• $\pi(\beta)$ is the prior distribution for $\beta$

• For $i = 1, \ldots, n$, $z_i = \dfrac{y_i - 1/2}{\omega_i}$, with $z = (z_1, \ldots, z_n)^T$

• $W = \mathrm{diag}(\omega_i)$ is an $n \times n$ precision matrix

It is clear that the random variable $Z$ is normally distributed with mean $\eta = X\beta$ and diagonal covariance $W^{-1}$

32 / 160 Full Conditional for β

Thus, assuming a $N_p(\beta_0, T_0^{-1})$ prior for $\beta$, the full conditional for $\beta$ given $Z = z$ and $W$ is $N_p(m, V)$, where

$V = (T_0 + X^T W X)^{-1}$

$m = V(T_0\beta_0 + X^T W z)$

This leads to the following Gibbs sampler:

1. For $i = 1, \ldots, n$, update $\omega_i$ from a PG(1, $\eta_i$) density, where $\eta_i = x_i^T\beta$

2. For $i = 1, \ldots, n$, define $z_i = \dfrac{y_i - 1/2}{\omega_i}$

3. Conditional on $z$ and $W$, update $\beta$ from $N_p(m, V)$, where $m$ and $V$ are given above.

33 / 160 Sampling from the PG Density

Acceptance-rejection sampling is used to draw from the PG density

This can be implemented using the rpg function from the R package BayesLogit

I have uploaded a zip file of the package onto the course website

Alternatively, you can download from https://mran.microsoft.com/snapshot/2014-10-20/ web/packages/BayesLogit/index.html

Or go to https://github.com/jwindle/BayesLogit for a more recent version

See PG Logit.r for details

34 / 160 R Code for Pólya-Gamma Logistic Model

Logistic Regression Using PG Latent Variables

library(BayesLogit)        # For rpg function
library(mvtnorm)

# Priors
beta0<-rep(0,p)            # Prior mean of beta
T0<-diag(.01,p)            # Prior precision of beta

# Inits
beta<-rep(0,p)

#################
# Store Samples #
#################
nsim<-1000                 # Number of MCMC iterations
thin<-1                    # Thinning interval
burn<-nsim/2               # Burn-in
lastit<-(nsim-burn)/thin   # Last stored value
Beta<-matrix(0,lastit,p)

#########
# Gibbs #
#########
tmp<-proc.time()           # Store current time

for (i in 1:nsim){
  eta<-X%*%beta
  w<-rpg(n,1,eta)
  z<-(y-1/2)/w                            # Or define z=y-1/2 and omit w in posterior mean m below
  v<-solve(crossprod(X*sqrt(w))+T0)       # Or solve(t(X)%*%W%*%X+T0), where W=diag(w) -- but this is slower
  m<-v%*%(T0%*%beta0+t(w*X)%*%z)          # Can omit w here if you define z=y-1/2
  beta<-c(rmvnorm(1,m,v))

  #################
  # Store Results #
  #################
  if (i>burn & i%%thin==0){
    j<-(i-burn)/thin
    Beta[j,]<-beta
  }

  if (i%%100==0) print(i)
}

proc.time()-tmp  # MCMC run time -- 1.2 seconds to run 1000 iterations with n=1000

35 / 160

Example

The program PG Logit.r fits the same logistic model as before:

Yi ∼ Bern(πi )

logit(πi ) = β0 + β1xi , i = 1,..., 1000.

The results, based on 1000 iterations with a burn-in of 500, are identical to the previous ones:

Table 4: Results for Logistic Model.

Parameter   True Value   MLE (SE)       Posterior Mean (SD)†
β0          −1           −1.08 (0.08)   −1.08 (0.08)
β1          1            0.92 (0.09)    0.93 (0.09)

† Based on PG sampler.

36 / 160 Bayesian

Albert and Chib (1993) extend their data augmentation sampler to accommodate ordered categorical outcomes

For example, the cumulative logit model with K categories is

$\mathrm{logit}(\phi_{ik}) = \mathrm{logit}[\Pr(Y_i \leq k)] = \gamma_k + x_i^T\beta, \quad k = 1, \ldots, K - 1,$

• φik is the k-th cumulative probability

• $\gamma_k$ is the intercept associated with the $k$-th cumulative logit

• $x_i$ is a vector of covariates excluding an intercept

• $\beta$ is a vector of regression coefficients common to all cumulative logits (proportional odds)

$\beta_j$ denotes the increase in the log odds of lower category values for a unit increase in $x_j$

37 / 160 Latent Variable Interpretation of Ordinal Model

[Figure: density of the latent $Z_i$ centered at $x_i^T\beta$, partitioned by cutpoints $\gamma_1 < \gamma_2$: $\Pr(Z_i < \gamma_1) = \Pr(Y_i = 1)$, $\Pr(\gamma_1 < Z_i < \gamma_2) = \Pr(Y_i = 2)$, and $\Pr(Z_i > \gamma_2) = \Pr(Y_i = 3)$]

38 / 160 Gibbs Sampler for Proportional Odds Model

The Gibbs sampler is a slight modification of the previous sampler for probit/logit regression

Here, in addition to updating $Z_i$ and $\beta$, we must also update $\gamma = (\gamma_1, \ldots, \gamma_{K-1})^T$

Note that we have the following order constraint:

all $Z_{(y = k)} < \gamma_k <$ all $Z_{(y = k+1)}$, for $k = 1, \ldots, K - 1$

This implies, for instance, that $\gamma_1$ has to lie between the largest $Z$ such that $Y = 1$ and the smallest $Z$ such that $Y = 2$

We have to obey this constraint when updating γ

39 / 160 Gibbs Sampler for Proportional Odds Model

The sampler proceeds by initializing γ and β and then iterating through the following steps:

1. For $i = 1, \ldots, n$, if $y_i = k$, draw $z_i$ from $N(x_i^T\beta, \lambda_i^{-1})$ truncated between $\gamma_{k-1}$ and $\gamma_k$, with $\gamma_0 = -\infty$ and $\gamma_K = \infty$

Note: For the cumulative probit model, $\lambda_i = 1$ for all $i$

2. Update $\beta$ from $N_p(m, V)$, where

$V = (T_0 + X^T W X)^{-1}$

$m = V(T_0\beta_0 + X^T W z)$

where $W = I_n$ for the probit model and $W = \mathrm{diag}(\lambda_i)$ for the t-link model

3. For $k = 1, \ldots, K - 1$, update $\gamma_k$ from Unif($\max\{z_i : y_i = k\}$, $\min\{z_i : y_i = k + 1\}$)

4 For the cumulative logit model, update λi from its Gamma full conditional analogous to the (2-category) logistic model discussed earlier

See Cum_Logit.r and Cum_Probit.r for details

40 / 160 R Code for Cumulative Logit Model Gibbs Sampler for 3-Category Cumulative Logit Model

First assign a $N(\beta_0, T_0^{-1})$ prior to $\beta$

# Initial Values
gam1<-0
gam2<-3
beta<-0                    # Only 1 covariate in this example
lambda<-rep(1,n)           # Weights
nu<-8                      # 8 df ~ logistic -- assume fixed
z<-rep(0,n)                # Latent z vector

################### # GIBBS SAMPLER # ################### tmp<-proc.time() for (i in1:nsim) {

# Draw latent z using inverse CDF method muz<-x*beta z[y==1]<-qnorm(runif(ns[1],0,pnorm(gam1,muz[y==1],sqrt(1/lambda[y==1]))),muz[y==1],sqrt(1/lambda[y==1])) z[y==2]<-qnorm(runif(ns[2],pnorm(gam1,muz[y==2],sqrt(1/lambda[y==2])),pnorm(gam2,muz[y==2],sqrt(1/lambda[y==2]))),muz[y==2],sqrt(1/lambda[y==2])) z[y==3]<-qnorm(runif(ns[3],pnorm(gam2,muz[y==3],sqrt(1/lambda[y==3])),1),muz[y==3],sqrt(1/lambda[y==3]))

# Update gammas gam1<-runif(1,max(z[y==1]),min(z[y==2])) gam2<-runif(1,max(z[y==2]),min(z[y==3]))

# Update beta vbeta<-solve(crossprod(x*sqrt(lambda))) mbeta<-vbeta%*%(crossprod(x*lambda,z)) beta<-rnorm(1,mbeta,sqrt(vbeta))

# Update lambda lambda<-rgamma(n,(nu+1)/2,(nu+(z-x*beta)^2)/2)

################# # Store Results # ################# if (i> burn & i%%thin==0){ j<-(i-burn)/thin Beta[j]<-beta Gam1[j]<-gam1 Gam2[j]<-gam2 }

if(i%%100==0) print(i) }

proc.time()-tmp # MCMC run time -- took 4.5 seconds to run 10K iterations with n=500 41 / 160

Example

The program Cum_Logit.r simulates 500 observations according to the following three-category cumulative logit model:

logit(φik ) = logit[Pr(Yi ≤ k)] = γk + βxi , k = 1, 2.

The results, based on 10,000 iterations with a burn-in of 5000 and thinning of 10, are:

Table 5: Results for Cumulative Logit Model.

Parameter   True Value   MLE (SE)       Posterior Mean (SD)†
γ1          −1           −1.04 (0.17)   −1.00 (0.19)
γ2          1            0.87 (0.16)    0.95 (0.17)
β           0.75         0.73 (0.08)    0.77 (0.08)

† Based on Albert and Chib sampler.

42 / 160 Trace Plots

Figure 1: Trace plots show high autocorrelation, especially among the γs.

[Figure: trace plots of γ1, γ2, and β over 500 stored iterations]

43 / 160 Multinomial Logit Regression

Let $Y_i$ denote an unordered or nominal categorical RV taking values $k = 1, \ldots, K$ with $k$-th probability $\pi_{ik} = \Pr(Y_i = k \mid x_i)$, where $\sum_{k=1}^K \pi_{ik} = 1$

The multinomial logit model is given by

$\Pr(Y_i = k \mid x_i) = \pi_{ik} = \dfrac{e^{x_i^T\beta_k}}{\sum_{j=1}^K e^{x_i^T\beta_j}}, \quad k = 1, \ldots, K,$

with $\beta_K = 0$ for the reference category $K$. Alternatively, we can write

$\Pr(Y_i = k \mid x_i) = \pi_{ik} = \dfrac{e^{x_i^T\beta_k}}{1 + \sum_{j=1}^{K-1} e^{x_i^T\beta_j}}, \quad k = 1, \ldots, K - 1$

$\Pr(Y_i = K \mid x_i) = \pi_{iK} = \dfrac{1}{1 + \sum_{j=1}^{K-1} e^{x_i^T\beta_j}}$ (reference category),

where $\beta_k$ denotes the regression coefficients associated with the $k$-th category

In other words, conditional on $x_i$, $Y_i \sim \mathrm{Multi}(1, \pi_i)$, where $\pi_i = (\pi_{i1}, \ldots, \pi_{iK})$ and $\sum_{k=1}^K \pi_{ik} = 1$.

44 / 160 Multinomial Logit Regression

We can also write the multinomial logit model as

$\dfrac{\Pr(Y_i = k \mid x_i)}{\Pr(Y_i = K \mid x_i)} = \dfrac{\pi_{ik}}{\pi_{iK}} = \exp(x_i^T\beta_k), \quad k = 1, \ldots, K - 1,$

which is the odds of being in category $k$ vs. the reference category $K$.

Or, equivalently,

$\ln\!\left(\dfrac{\pi_{ik}}{\pi_{iK}}\right) = x_i^T\beta_k.$

In the proportional odds model, only the intercepts $\gamma_k$ varied across categories, while $\beta$ remained constant

Here, all regression parameters vary with respect to $k$ (hence $\beta_k$)

There are also partial proportional odds models that allow only some β's to vary across categories

45 / 160 Latent Variable Interpretations

Both the proportional odds and multinomial logit models have latent variable interpretations

For the MNL model, we have K underlying utilities:

$U_{i1} = x_i^T\beta_1 + e_{i1}$
$\vdots$
$U_{i(K-1)} = x_i^T\beta_{K-1} + e_{i(K-1)}$
$U_{iK} = e_{iK},$

where the $e_{ik}$'s follow i.i.d. standard Gumbel or Type-I extreme value distributions with CDFs of the form $F_X(x) = e^{-e^{-x}}$

Set $Y_i = k$ iff $U_{ik} = \max(U_{i1}, \ldots, U_{iK})$

HW: Show that the difference of two independent standard Gumbel RVs follows a logistic distribution; a simulation sketch is given below
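A simulation sketch for this homework check (not a proof): generate two independent standard Gumbel samples via the inverse CDF and compare their difference with logistic quantiles.

set.seed(1)
nsamp<-1e5
g1<- -log(-log(runif(nsamp)))              # Standard Gumbel via inverse CDF
g2<- -log(-log(runif(nsamp)))
d<-g1-g2
qqplot(qlogis(ppoints(500)),quantile(d,ppoints(500)),
       xlab="logistic quantiles",ylab="Gumbel-difference quantiles")
abline(0,1)                                # Points should fall on the 45-degree line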

46 / 160 Models

This forms the basis of so-called random utility or discrete choice models

See Train Discrete Choice Methods with Simulation for more details on discrete choice models

The author discusses pros and cons of the multinomial logit model, including the independence of irrelevant alternatives (IIA) property as exemplified in the classic Red Bus/Blue Bus problem

To circumvent this problem, one can use the nested logit or multinomial probit model

For further details, see Agresti, Categorical Data Analysis, Sections 8.5 and 8.6, and references therein

47 / 160 Bayesian Inference for Multinomial Logit Model

The Pólya-Gamma data-augmentation approach can be extended to handle multinomial logit regression

Recall that the multinomial logit model can be written as

$\Pr(Y_i = k \mid x_i) = \pi_{ik} = \dfrac{e^{x_i^T\beta_k}}{\sum_{j=1}^K e^{x_i^T\beta_j}}, \quad k = 1, \ldots, K,$ with $\beta_K = 0$ for the reference category $K$.

Let's consider the update for $\beta_k$ ($k = 1, \ldots, K - 1$) conditional on both $y$ and $\beta_j\ \forall\, j \neq k$.

The idea is to cycle through the updates of $\beta_k$ one at a time, conditional on the other $\beta$'s

48 / 160 Bayesian Inference for Multinomial Logit Model∗

Following Holmes and Held (2006), we can write the full conditional for $\beta_k$, given $y$ and $\beta_{j \neq k}$, as a function of a Bernoulli likelihood

$f(\beta_k \mid y, \beta_{j \neq k}) \propto f(\beta_k) \prod_{i=1}^n \pi_{ik}^{U_{ik}} (1 - \pi_{ik})^{1 - U_{ik}}$

where

• $f(\beta_k)$ denotes the prior for $\beta_k$

• $U_{ik} = 1_{(Y_i = k)}$ is an indicator that $Y_i = k$

• $\pi_{ik} = \Pr(Y_i = k) = \Pr(U_{ik} = 1) = \dfrac{e^{x_i^T\beta_k}}{\sum_{j=1}^K e^{x_i^T\beta_j}}$

49 / 160 Bayesian Inference for Multinomial Logit Model∗

Further, we can rewrite πik as

$\pi_{ik} = \Pr(U_{ik} = 1) = \dfrac{e^{x_i^T\beta_k - c_{ik}}}{1 + e^{x_i^T\beta_k - c_{ik}}} = \dfrac{e^{\eta_{ik}}}{1 + e^{\eta_{ik}}},$

where $c_{ik} = \log \sum_{j \neq k} e^{x_i^T\beta_j}$ and $\eta_{ik} = x_i^T\beta_k - c_{ik}$.

Note that $\sum_{j \neq k} e^{x_i^T\beta_j}$ includes the reference category $K$. Because $\beta_K = 0$, it follows that $e^{x_i^T\beta_K} = 1$ and hence

$c_{ik} = \log\!\Big(\sum_{j \neq k} e^{x_i^T\beta_j}\Big) = \log\!\Big(1 + \sum_{j \notin \{k, K\}} e^{x_i^T\beta_j}\Big)$

50 / 160 Bayesian Inference for Multinomial Logit Model

Thus, the full conditional for $\beta_k$, given $y$ and $\beta_{j \neq k}$, is

$f(\beta_k \mid y, \beta_{j \neq k}) \propto f(\beta_k) \prod_{i=1}^n \left(\dfrac{e^{\eta_{ik}}}{1 + e^{\eta_{ik}}}\right)^{U_{ik}} \left(\dfrac{1}{1 + e^{\eta_{ik}}}\right)^{1 - U_{ik}} = f(\beta_k) \prod_{i=1}^n \dfrac{(e^{\eta_{ik}})^{U_{ik}}}{1 + e^{\eta_{ik}}},$

which is essentially a logistic regression likelihood.

Thus, we can apply the Pólya-Gamma data-augmentation scheme to update the $\beta_k$'s ($k = 1, \ldots, K - 1$) one at a time based on the binary indicators $U_{ik} = 1_{(Y_i = k)}$

51 / 160 Gibbs Sampler for Multinomial Logit Model

Assuming a $N(\beta_0, T_0^{-1})$ prior for $\beta_1, \ldots, \beta_{K-1}$, the Gibbs sampler proceeds as follows:

1. Outside of the Gibbs loop, define $u_{ik} = 1_{(y_i = k)}$ for $i = 1, \ldots, n$ and $k = 1, \ldots, K - 1$

2. For $i = 1, \ldots, n$ and $k = 1, \ldots, K - 1$, update $\omega_{ik}$ from a PG(1, $\eta_{ik}$) density, where $\eta_{ik} = x_i^T\beta_k - c_{ik}$ and $c_{ik}$ was defined earlier

3. For $i = 1, \ldots, n$ and $k = 1, \ldots, K - 1$, define $z_{ik} = \dfrac{u_{ik} - 1/2}{\omega_{ik}} + c_{ik}$ and let $z_k = (z_{1k}, \ldots, z_{nk})^T$

4. For $k = 1, \ldots, K - 1$, update $\beta_k$ from $N_p(m_k, V_k)$, where

$V_k = (T_0 + X^T W_k X)^{-1}$

$m_k = V_k(T_0\beta_0 + X^T W_k z_k),$

and $W_k = \mathrm{diag}(\omega_{ik})$. See Polson et al. (2012), page 41, for details, but note the typo in the expression for $m_j$ at the bottom of the page

52 / 160 Example

The program multinomial.r simulates 1000 observations from the following three-category multinomial model

$\Pr(Y_i = k \mid x_i) = \pi_{ik} = \dfrac{e^{x_i^T\beta_k}}{\sum_{j=1}^K e^{x_i^T\beta_j}}, \quad k = 1, \ldots, 3,$

with β1 = 0 for reference category 1.

The program generates data using the Gumbel latent utilities

See multinomial.r for details

53 / 160 R Code for Generating Multinomial Logit Data

Data Generation for Multinomial Logit Model

# Multinomial.r # Generate and fit a 3-Category Multinomial Logit # Generate data using latent Gumbel random utilities # See Polson et al. 2012 Appendix S6.3 BUT NOTE TYPO AT BOTTOM OF PAGE 41! # 3-Category outcome: Independent, Republican and Democrat, with Ind as ref group ####################################### library(QRM) # For rGumbel function library(nnet) # To fit multinom function library(BayesLogit) # For rpg function

############################## # Generate Data under # # Random Utility Model (RUM) # ############################## set.seed(060817) n<-1000 K<-3 # Number of response categories female<-rbinom(n,1,.5) X<-cbind(1,female) p<-ncol(X) beta2<-c(1,-.5) # Males much more likely to be Rep than Ind and females "somewhat" more likely beta3<-c(.5,.5) # Males somewhat more likely to be Dem than Indep and females much more likely eta2<-X%*%beta2 eta3<-X%*%beta3 u1<-rGumbel(n, mu =0, sigma =1) u2<-rGumbel(n, mu = eta2, sigma =1) u3<-rGumbel(n, mu = eta3, sigma =1) U<-cbind(u1,u2,u3) y<-c(apply(U,1,which.max)) # Y=1 if Ind (Reference), 2 if Rep, 3 if Dem

fit<- multinom(y~female)

54 / 160

1 R Code for 3-Category Multinomial Logit Model

Gibbs Sampler for Multinomial Logit Model

# Define category-specific binary responses (Note: cat 1 is reference) u2<-1*(y==2) u3<-1*(y==3)

# Priors beta0<-rep(0,p) # Prior mean of beta T0<-diag(.01,p) # Prior precision of beta

# Inits beta2<-beta3<-rep(0,p)

################# # Gibbs sampler # ################# tmp<-proc.time() # Store current time

for (i in1:nsim){

# Update category 2 c2<-log(1+exp(X%*%beta3)) # Note that for Cat 1, beta1=0 so that exp(X%*%beta1)= 1 eta2<-X%*%beta2-c2 w2<-rpg(n,1,eta2) z2<-(u2-1/2)/w2+c2 # Note plus sign before c2 -- Polson has typo # Could also define z2<-u2-1/2+w2*c2 and omit w2 in post. mean m below v<-solve(T0+crossprod(X*sqrt(w2))) m<-v%*%(T0%*%beta0+t(w2*X)%*%z2) beta2<-c(rmvnorm(1,m,v))

# Update category 3 c3<-log(1+exp(X%*%beta2)) eta3<-X%*%beta3-c3 w3<-rpg(n,1,eta3) z3<-(u3-1/2)/w3+c3 v<-solve(T0+crossprod(X*sqrt(w3))) m<-v%*%(T0%*%beta0+t(w3*X)%*%z3) beta3<-c(rmvnorm(1,m,v))

# Store if (i> burn & i%%thin==0){ j<-(i-burn)/thin Beta[j,]<-c(beta2,beta3) }

if (i%%100==0) print(i) }

proc.time()-tmp # MCMC run time = 2.5 seconds to run 1000 iterations 55 / 160

1 Results

The results, based on 1000 iterations with a burn-in of 500, are:

Table 6: Results for Multinomial Logit Model.

Parameter   True Value   MLE (SE)       Posterior Mean (SD)†
β10         1            1.00 (0.12)    1.00 (0.13)
β11         −0.5         −0.42 (0.18)   −0.41 (0.18)
β20         0.5          0.49 (0.13)    0.50 (0.14)
β21         0.5          0.52 (0.18)    0.53 (0.19)

† Based on Polson et al. Pólya-Gamma sampler.

56 / 160 Bayesian Inference for Negative Binomial Model

Pillow and Scott (2012) extend the Pólya-Gamma sampler to the negative binomial (NB) regression setting

Consider the following model for a count r.v. $Y_i$:

$p(y_i \mid r, \beta) = \dfrac{\Gamma(y_i + r)}{\Gamma(r)\, y_i!} (1 - \psi_i)^r \psi_i^{y_i}, \quad r > 0,$ where

$\psi_i = \dfrac{\exp(x_i^T\beta)}{1 + \exp(x_i^T\beta)} = \dfrac{\exp(\eta_i)}{1 + \exp(\eta_i)}$

Note that the NB probability parameter ψi is parameterized using the expit function

This allows us to apply the same properties of the Pólya-Gamma density as in the logistic case 57 / 160 Bayesian Inference for Negative Binomial Model

The mean and variance of Yi are

$E(Y_i \mid r, \beta) = \dfrac{r\psi_i}{1 - \psi_i} = r\exp(\eta_i) = \mu_i$

$\mathrm{Var}(Y_i \mid r, \beta) = \dfrac{r\psi_i}{(1 - \psi_i)^2} = r\exp(\eta_i)[1 + \exp(\eta_i)] = \mu_i(1 + \mu_i/r)$

The parameter α = 1/r captures the overdispersion in the data, such that as α → ∞, the counts become increasingly dispersed relative to the Poisson distribution
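A quick simulation check of these moment formulas (arbitrary values of r and µ):

set.seed(1)
r<-2; mu<-3
y<-rnbinom(1e6,size=r,mu=mu)
c(mean(y),mu)                              # Sample mean vs mu
c(var(y),mu*(1+mu/r))                      # Sample variance vs mu(1 + mu/r)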

58 / 160 Bayesian Inference for Negative Binomial Model

Exploiting the earlier property of the Pólya-Gamma distribution, it follows that

$p(y_i \mid r, \beta) \propto e^{\kappa_i \eta_i} \int_0^\infty e^{-\omega_i \eta_i^2/2}\, p(\omega_i \mid y_i + r, 0)\, d\omega_i,$

where κi = (yi − r)/2 and the ωi 's are independently distributed according to PG(yi + r, ηi ).

59 / 160 Bayesian Inference for Negative Binomial Model

Following Pillow and Scott, the full conditional for $\beta$ is

$p(\beta \mid y, r, \omega) \propto \pi(\beta)\, \exp\!\left\{-\tfrac{1}{2}(z - X\beta)^T W (z - X\beta)\right\},$

where

• $z$ is an $n \times 1$ vector with elements $z_i = \dfrac{y_i - r}{2\omega_i}$

• W = diag(ωi )

60 / 160 Gibbs Sampler for Negative Binomial Model

Given current values for β and r, the Gibbs sampler for the NB model proceeds as follows:

1. For $i = 1, \ldots, n$, draw $\omega_i$ from its PG($y_i + r$, $\eta_i$) distribution, where $\eta_i = x_i^T\beta$

2. For $i = 1, \ldots, n$, define $z_i = \dfrac{y_i - r}{2\omega_i}$

3. Assuming a $N_p(\beta_0, T_0^{-1})$ prior, update $\beta$ from its $N_p(m, V)$ full conditional, where

$V = (T_0 + X^T W X)^{-1}$

$m = V(T_0\beta_0 + X^T W z)$

4 Update r using a random-walk Metropolis-Hastings step with a zero-truncated normal proposal density. Alternatively, update r using a conjugate Gamma distribution as described in Dadaneh et al. (2018) See NB_MH.r and NB_Gibbs.r for details 61 / 160 Conjugate Gibbs Update for r: Step 1

Zhou and Carin (2015) and Dadaneh et al. (2018) describe a two-step conjugate Gibbs update for r

The approach introduces a sample of latent counts, li , underlying each observed count yi

Conditional on $y_i$ and $r$, $l_i$ has a distribution defined by a Chinese restaurant table (CRT) distribution:

$l_i = \sum_{j=1}^{y_i} u_j, \qquad u_j \sim \mathrm{Bern}\!\left(\dfrac{r}{r + j - 1}\right).$

The name Chinese restaurant table derives from the fact that $u_j = 1$ if a new customer sits at an unoccupied table in a Chinese restaurant (according to a so-called Chinese restaurant

process), and li is the total number of occupied tables in the restaurant after yi customers

So in Step 1, we draw li (i = 1,..., n) according to this CRT distribution 62 / 160 Conjugate Gibbs Update for r: Step 2

In Step 2, the authors exploit the fact that the NB distribution can be derived from a random convolution of logarithmic RVs

Specically, they note that, conditional on r and ψi ,

ind li ∼ Poi[−r ln(1 − ψi )], where exp x T β ψ = i , i = 1,..., n. i 1 exp x T  + i β See Dadaneh et al. (2018) and Zhou and Carin (2015) for details

Thus, if we assume a Ga(a, b) prior for r, then the full conditional for r in Step 2 is

" n n # X X r|l, ψ ∼ Ga a + li , b − ln(1 − ψi ) , i=1 i=1

T T where l = (l1,..., ln) and ψ = (ψ1, . . . , ψn) .

The Gibbs update rst draws li (i = 1,..., n) independently from a CRT distribution, and then r from its full conditional Gamma distribution given l and ψ 63 / 160 R Code for Negative Binomial Model with MH Update for r

Hybrid Gibbs-MH Sampler for Negative Binomial Model

library(msm)               # For rtnorm and dtnorm (truncated-normal proposal)

# Priors (diffuse prior for r)
beta0<-rep(0,p)
T0<-diag(.01,p)
s<-0.01                    # Proposal variance -- NOTE: may need to lower this as n_i increases

# Inits and Store
beta<-rep(0,p)
r<-1                       # Initial value for r
Acc<-0                     # MH acceptance counter

########
# MCMC #
########
for (i in 1:nsim){

  # Update r
  eta<-X%*%beta
  q<-1/(1+exp(eta))                        # dnbinom fn uses q=1-psi
  rnew<-rtnorm(1,r,sqrt(s),lower=0)        # Proposal
  ratio<-sum(dnbinom(y,rnew,q,log=T))-sum(dnbinom(y,r,q,log=T))+   # Likelihood -- diffuse prior for r
    dtnorm(r,rnew,sqrt(s),0,log=T)-dtnorm(rnew,r,sqrt(s),0,log=T)  # Proposal not symmetric
  if (log(runif(1))<ratio){                # Accept proposal
    r<-rnew
    Acc<-Acc+1
  }

  # Update beta
  w<-rpg(n,y+r,eta)                        # Polya weights
  z<-(y-r)/(2*w)                           # Latent response
  v<-solve(crossprod(X*sqrt(w))+T0)
  m<-v%*%(T0%*%beta0+t(sqrt(w)*X)%*%(sqrt(w)*z))
  beta<-c(rmvnorm(1,m,v))

  # Store
  if (i>burn & i%%thin==0){
    j<-(i-burn)/thin
    Beta[j,]<-beta
    R[j]<-r
  }

  if (i%%100==0) print(i)                  # 11 seconds to run 2000 iterations with n=1000

}

64 / 160

1 R Code for NB Model with Gibbs Update for r

Negative Binomial Sampler with Gibbs Update for r

beta0<-rep(0,p) T0<-diag(.01,p) a<-b<-0.01 # Gamma hyperparms for r

# Inits and Store beta<-rep(0,p) l<-rep(0,n) # Latent counts r<-1

######## # MCMC # ######## for (i in1:nsim){

# Update latent counts, l, using CRT distribution for(j in1:n) l[j]<- sum(rbinom(y[j],1,round(r/(r+1:y[j]-1),6))) # Could try to avoid loop # Rounding avoids numerical instability # Update r from conjugate gamma distribution given l and beta eta<-X%*%beta psi<-exp(eta)/(1+exp(eta)) r<-rgamma(1,a+sum(l),b-sum(log(1-psi)))

# Update beta w<-rpg(n,y+r,eta) # Polya weights z<-(y-r)/(2*w) # Latent response v<-solve(crossprod(X*sqrt(w))+T0) m<-v%*%(T0%*%beta0+t(sqrt(w)*X)%*%(sqrt(w)*z)) beta<-c(rmvnorm(1,m,v))

# Store if (i> burn & i%%thin==0){ j<-(i-burn)/thin Beta[j,]<-beta R[j]<-r }

if (i%%100==0) print(i) # 21 seconds to run 2000 iterations with n=1000

}

65 / 160

Example

The program NB.r fits the following NB model:

$p(y_i \mid r, \beta) = \dfrac{\Gamma(y_i + r)}{\Gamma(r)\, y_i!} (1 - \psi_i)^r \psi_i^{y_i}, \quad r > 0,$ where

$\psi_i = \dfrac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)}$

The results, based on 2000 iterations with a burn-in of 1000, are:

Table 7: Results for NB Model.

Parameter   True Value   MLE (SE)      Posterior Mean (SD)†   Posterior Mean (SD)‡
β1          1.0          0.93 (0.08)   0.92 (0.08)            0.92 (0.07)
β2          0.5          0.50 (0.04)   0.50 (0.04)            0.50 (0.04)
r           1.0          1.10 (0.08)   1.10 (0.07)            1.11 (0.07)

† Pólya-Gamma sampler with MH update for r. Acceptance rate = 37%.
‡ Pólya-Gamma sampler with Gibbs update for r.

66 / 160 Trace Plots for NB Model with MH Update for r

[Figure: trace plots of β1, β2, and r over 1000 stored iterations]

67 / 160 Bayesian Inference for Zero-Inflated Count Data

Zero-inflated count data arise when the data contain a larger proportion of zeros than predicted by an ordinary count model such as the NB

Zero-inflated models have been proposed to address the over-abundance of zeros

Zero-inflated models are mixtures of a point mass at zero, representing the excess zeros, and a count distribution for the remaining values

By construction, zero-inflated models partition zeros into two types:

• Structural zeros corresponding to individuals who are not at risk for an event, and therefore have no opportunity for a positive count

• At-risk zeros, which apply to a latent class of individuals who are at risk for an event but nevertheless have an observed response of zero

68 / 160 Zero-Inflated Negative Binomial Model

Consider the following zero-inflated negative binomial (ZINB) model:

$\Pr(Y_i = 0) = (1 - \phi_i) + \phi_i (1 - \psi_i)^r$

$\Pr(Y_i = y_i) = \phi_i \dfrac{\Gamma(y_i + r)}{\Gamma(r)\, y_i!} (1 - \psi_i)^r \psi_i^{y_i}, \quad y_i = 1, 2, \ldots$

$\psi_i = \dfrac{\exp(x_i^T\beta)}{1 + \exp(x_i^T\beta)} = \dfrac{\exp(\eta_i)}{1 + \exp(\eta_i)}$

The parameter φi denotes the probability of belonging to the at-risk class

1 − φi denotes the probability of an excess (i.e., structural) zero
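A minimal sketch for simulating data from this ZINB model (the parameter values are chosen to mirror the example later in the notes; this is not the course data-generation code):

set.seed(1)
n<-1000
x<-rbinom(n,1,0.5)
phi<-plogis(-0.5+0.5*x)                    # At-risk probability
psi<-plogis(2+0.5*x)                       # NB probability parameter
r<-1
w<-rbinom(n,1,phi)                         # Latent at-risk indicator W_i
y<-ifelse(w==1,rnbinom(n,size=r,prob=1-psi),0)   # Structural zeros when W_i = 0
mean(y==0)                                 # Overall proportion of zeros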

69 / 160 Bayesian Inference for Zero-Inflated Count Data

We can rewrite the ZINB model by introducing a latent at-risk indicator variable, $W_i$:

$Y_i \sim (1 - \phi_i)\, 1_{(W_i = 0 \,\wedge\, Y_i = 0)} + \phi_i\, \mathrm{NB}(\mu_i, r)\, 1_{(W_i = 1)}, \quad i = 1, \ldots, n$

Comments:

• With probability 1 − φi , Wi = Yi = 0 (structural zero)

• With probability φi , Wi = 1 and Yi is drawn from a NB distribution with mean µi and dispersion parameter r > 0

• µi = E(Yi |β, r, Wi = 1) is the mean count among those in the at-risk class (conditional on Wi =1) • r captures overdispersion in the at-risk class

• The overall (unconditional) mean is E(Yi |β, r) = φi µi

• If φi = 1, the ZINB reduces to an NB model

• If $\phi_i = 0$, we have a point mass at zero

70 / 160 Zero-Inflated Negative Binomial Model

Typically, we model φi using a logit model and Yi |Wi = 1 using a NB model:

$\mathrm{logit}(\phi_i) = \mathrm{logit}[\Pr(W_i = 1 \mid \alpha)] = x_i^T\alpha = \eta_{1i}$

$p(y_i \mid r, \beta, W_i = 1) = \dfrac{\Gamma(y_i + r)}{\Gamma(r)\, y_i!} (1 - \psi_i)^r \psi_i^{y_i} \quad \forall\, i \text{ s.t. } W_i = 1,$ where

$\psi_i = \dfrac{\exp(x_i^T\beta)}{1 + \exp(x_i^T\beta)} = \dfrac{\exp(\eta_{2i})}{1 + \exp(\eta_{2i})}.$

71 / 160 Gibbs Sampler for ZINB Model This suggests the following Gibbs sampler:

1 Given current parameter values, update α using the Gibbs sampler proposed by Polson et al. for logistic regression

2 Conditional on Wi = 1, update β using the NB Gibbs sampler proposed by Pillow and Scott

3. Update r using a random-walk Metropolis-Hastings step or using a conjugate Gamma update as in Dadaneh et al. (2018)

4 Update the latent at-risk indicators, W1,..., Wn, from their discrete full conditional distributions

The only new step is step (4), the update for Wi

72 / 160 Update for Wi

The full conditional for Wi is a discrete distribution with probabilities that depend on whether the observed count, yi , is zero or non-zero

If yi > 0, then subject i belongs to the at-risk class, and hence by denition, Wi = 1 with probability 1

Conversely, if yi = 0, then we observe either a structural zero (implying that Wi = 0) or an at-risk zero (implying Wi = 1)

See Neelon (2018) for details

73 / 160 ∗ Update for Wi

In the latter case, we draw Wi from a Bernoulli distribution with probability

$\theta_i = \Pr(W_i = 1 \mid y_i = 0, \text{rest}) = \Pr(\text{at-risk zero} \mid \text{at-risk or structural zero}) = \dfrac{\phi_i \upsilon_i^r}{1 - \phi_i (1 - \upsilon_i^r)},$

where

• φi = exp(η1i )/[1 + exp(η1i )] is the unconditional probability that Wi = 1

• υi = 1 − ψi , where ψi is the NB event probability

See ZINB.r for details

74 / 160 Example Data Histogram

[Figure: histogram of the simulated counts, showing a large spike at zero; the AIC for the ZINB model is 104 points lower than for the NB model]

75 / 160 Hybrid Gibbs-MH Algorithm for ZINB Model

Hybrid Gibbs-MH Sampler for the ZINB Model

for (i in1:nsim){

# Update alpha mu<-X%*%alpha w<-rpg(n,1,mu) z<-(y1-1/2)/w # Latent response v<-solve(crossprod(X*sqrt(w))+T0a) m<-v%*%(T0a%*%alpha0+t(w*X)%*%z) alpha<-c(rmvnorm(1,m,v))

# Update latent class indicator y1 (= W in slides) eta<-X%*%alpha phi<-exp(eta)/(1+exp(eta)) # At-risk probability theta<-phi*(q^r)/(phi*(q^r)+1-phi) # Cond prob that y1=1 given y=0 -- i.e. Pr(chance zero|observed zero) y1[y==0]<-rbinom(n0,1,theta[y==0]) # If y=0, draw "chance zero" w.p. theta; if y=1, then y1=1 n1<-sum(y1)

# Update beta conditional on y1=1 eta1<-X[y1==1,]%*%beta w<-rpg(n1,y[y1==1]+r,eta1) # Polya weights z<-(y[y1==1]-r)/(2*w) # Latent response v<-solve(crossprod(X[y1==1,]*sqrt(w))+T0b) m<-v%*%(T0b%*%beta0+t(sqrt(w)*X[y1==1,])%*%(sqrt(w)*z)) beta<-c(rmvnorm(1,m,v)) eta<-X%*%beta q<-1/(1+exp(eta))

  # Update r
  rnew<-rtnorm(1,r,sqrt(s),lower=0)        # Truncated-normal proposal
  ratio<-sum(dnbinom(y[y1==1],rnew,q[y1==1],log=T))-sum(dnbinom(y[y1==1],r,q[y1==1],log=T))+   # Diffuse prior for r
    dtnorm(r,rnew,sqrt(s),0,log=T)-dtnorm(rnew,r,sqrt(s),0,log=T)                              # Proposal not symmetric
  if (log(runif(1))<ratio) r<-rnew         # Accept proposal

# Store if (i> burn & i%%thin==0){ j<-(i-burn)/thin Alpha[j,]<-alpha Beta[j,]<-beta R[j]<-r }

if (i%%100==0) print(i) # 11 seconds to run 2000 iterations with n=1000

} 76 / 160

1 R Code for ZINB Model with Gibbs Update for r

Gibbs Sampler for ZINB Model

for (i in1:nsim){

# Update alpha mu<-X%*%alpha w<-rpg(n,1,mu) z<-(y1-1/2)/w # Latent response v<-solve(crossprod(X*sqrt(w))+T0a) m<-v%*%(T0a%*%alpha0+t(w*X)%*%z) alpha<-c(rmvnorm(1,m,v))

# Update latent class indicator y1 (= W in slides) eta<-X%*%alpha phi<-exp(eta)/(1+exp(eta)) # At-risk probability theta<-phi*(q^r)/(phi*(q^r)+1-phi) # Cond prob that y1=1 given y=0 -- i.e. Pr(chance zero|observed zero) y1[y==0]<-rbinom(n0,1,theta[y==0]) # If y=0, draw "chance zero" w.p. theta; if y=1, then y1=1 n1<-sum(y1)

# Update beta conditional on y1=1 eta1<-X[y1==1,]%*%beta w<-rpg(n1,y[y1==1]+r,eta1) # Polya weights z<-(y[y1==1]-r)/(2*w) # Latent response v<-solve(crossprod(X[y1==1,]*sqrt(w))+T0b) m<-v%*%(T0b%*%beta0+t(sqrt(w)*X[y1==1,])%*%(sqrt(w)*z)) beta<-c(rmvnorm(1,m,v)) eta<-X%*%beta q<-1/(1+exp(eta))

# Update latent counts, l l<-rep(0,n1) ytmp<-y[y1==1] for(j in1:n1) l[j]<- sum(rbinom(ytmp[j],1,round(r/(r+1:ytmp[j]-1),6)))

# Update r from conjugate gamma distribution given l and psi eta<-X[y1==1,]%*%beta psi<-exp(eta)/(1+exp(eta)) r<-rgamma(1,a+sum(l),b-sum(log(1-psi)))

# Store if (i> burn & i%%thin==0){ j<-(i-burn)/thin Alpha[j,]<-alpha Beta[j,]<-beta R[j]<-r }

if (i%%100==0) print(i) # 15 seconds to run 2000 iterations with n=1000 } 77 / 160

Example

The program ZINB.r fits the following ZINB model:

$\mathrm{logit}(\phi_i) = \alpha_1 + \alpha_2 x_i$

$p(y_i \mid r, \beta, W_i = 1) = \dfrac{\Gamma(y_i + r)}{\Gamma(r)\, y_i!} (1 - \psi_i)^r \psi_i^{y_i} \quad \forall\, i \text{ s.t. } W_i = 1,$ where

$\psi_i = \dfrac{\exp(\beta_1 + \beta_2 x_i)}{1 + \exp(\beta_1 + \beta_2 x_i)}$

The results, based on 2000 iterations with a burn-in of 1000, are:

Table 8: Results for ZINB Model.

Parameter   True Value   MLE (SE)       Posterior Mean (SD)†   Posterior Mean (SD)‡
α1          −0.5         −0.63 (0.11)   −0.62 (0.10)           −0.64 (0.11)
α2          0.5          0.56 (0.14)    0.55 (0.14)            0.57 (0.14)
β1          2.0          1.94 (0.13)    1.96 (0.12)            1.94 (0.13)
β2          0.5          0.38 (0.11)    0.39 (0.11)            0.39 (0.11)
r           1.0          1.11 (0.13)    1.10 (0.11)            1.13 (0.13)

† Pólya-Gamma sampler with MH update for r. Acceptance rate = 40%.
‡ Pólya-Gamma sampler with Gibbs update for r.

78 / 160 Trace Plots for MH ZINB Model

[Figure: trace plots of α1, α2, β1, β2, and r over 1000 stored iterations from the hybrid Gibbs-MH sampler]

79 / 160 Trace Plots for Gibbs ZINB Model

[Figure: trace plots of α1, α2, β1, β2, and r over 1000 stored iterations from the Gibbs (conjugate r) sampler]

80 / 160 Bayesian Modeling Strategies for Generalized Linear Models, Part 2

Reading: Hoff Section 7.3; Wakefield Sections 10.5.1, 11.2.8, 11.2.9 (Penalized Regression); Wakefield Sections 8.6, 8.7 (Linear Mixed Models); Neelon (2018) Sections 2.3–3.2; Frühwirth-Schnatter and Pyne (2010); Neelon et al. (2015)

Fall 2018

81 / 160 Bayesian Penalized Linear Splines

Let's consider the problem of modeling a mean response using a piecewise linear (PWL) spline function

As an illustration, consider the following scatterplot showing the relationship between age and log C-peptide concentration among 43 diabetic children². This example is for independent data, but the approach easily extends to repeated-measures data

Figure 2: Scatterplot of log C-peptide concentration by age.

[Figure: scatterplot of log C-peptide concentration (y-axis) versus age in years (x-axis) for the 43 children]

² cf. Fitzmaurice et al., Chapter 19

82 / 160 Bayesian Penalized Linear Splines

The goal is to flexibly model log peptide concentration by age

A relatively simple choice is to fit the following PWL model:

$E(Y_i \mid x_i) = \beta_0 + \beta_1 x_i + \sum_{k=1}^K b_k (x_i - c_k)_+, \quad i = 1, \ldots, n,$

where $c_1, \ldots, c_K$ are interior knot locations, $b_1, \ldots, b_K$ are spline coefficients, and for a number $u$, $u_+ = u$ if $u > 0$ and 0 otherwise.

Note that if all bk = 0, we reduce to a straight-line regression function
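Before penalizing, a minimal sketch of the unpenalized PWL fit (assuming vectors age and logc have been loaded, as in peptide.r): build the truncated-line basis and fit by ordinary least squares.

K<-10
knots<-seq(5,14,length=K)
Z<-sapply(knots,function(ck) pmax(age-ck,0))    # (x - c_k)_+ basis
fit<-lm(logc~age+Z)                             # Unpenalized PWL fit
age.grid<-seq(0,16,by=0.1)
Z.grid<-sapply(knots,function(ck) pmax(age.grid-ck,0))
yhat<-cbind(1,age.grid,Z.grid)%*%coef(fit)      # Fitted mean curve on a fine grid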

83 / 160 Bayesian Penalized Linear Splines

Fitting the model in SAS with 10 knots from ages 5–14, we see that the model is reasonably flexible but not smooth

Figure 3: Estimated mean regression function for piecewise linear model.

[Figure: fitted PWL mean regression function overlaid on the scatterplot of log concentration versus age]

84 / 160 Penalized Linear Splines

To impose smoothness, we can shrink the spline coefficients $b_1, \ldots, b_K$ by penalizing large values. Essentially, we shrink large spline coefficients toward zero, so that there are no extreme values

This can be useful if we have a large number of knots, since a large number of knots can introduce collinearity among the PWL basis functions

Collinearity would lead to large SEs for the spline coefficients

Penalizing the coefficients stabilizes the estimates and avoids overfitting, which can arise when we fit an overly noisy curve

85 / 160 Penalized Regression

Rather than minimizing the residual sum of squares

$Q = \sum_{i=1}^n \left[Y_i - \left(\beta_0 + \beta_1 x_i + \sum_{k=1}^K b_k (x_i - c_k)_+\right)\right]^2,$

penalized regression minimizes $Q$ subject to a constraint that restricts the size of the $\{b_k\}$, thus smoothing or regularizing the spline curve

Popular constraints include³

1. $\max_k |b_k| < t$
2. $\sum_{k=1}^K |b_k| < t$ (Lasso constraint)
3. $\sum_{k=1}^K b_k^2 = b^T b < t$ (Ridge constraint)
4. $b^T D b < t$ for some $K \times K$ positive semi-definite penalty matrix $D$ (e.g., a spatial smoothing matrix)

³ Ruppert et al., page 65.

86 / 160 Penalized Regression Penalized regression minimizes the penalized sum of squares

$Q(\lambda) = \sum_{i=1}^n \left[Y_i - \left(\beta_0 + \beta_1 x_i + \sum_{k=1}^K b_k (x_i - c_k)_+\right)\right]^2 + \lambda b^T D b,$

where

• $\lambda b^T D b$ is known as the roughness penalty, penalizing overly rough or noisy features of the curve, and

• λ is a tuning parameter that controls the degree of smoothness, with increasing smoothness as λ → ∞

λ can be viewed as a Lagrange multiplier, which is a technique commonly used to minimize constrained functions

Note also that we don't penalize the intercept, $\beta_0$, or the linear coefficient, $\beta_1$

87 / 160 Ridged PWL Regression

Choosing D = I K yields the ridge penalty

$Q(\lambda) = \sum_{i=1}^n \left[Y_i - \left(\beta_0 + \beta_1 x_i + \sum_{k=1}^K b_k (x_i - c_k)_+\right)\right]^2 + \lambda b^T b$

$= \sum_{i=1}^n \left(Y_i - x_i^T\beta - z_i^T b\right)^2 + \lambda \sum_{k=1}^K b_k^2$

$= (Y - X\beta - Zb)^T (Y - X\beta - Zb) + \lambda b^T b,$

where $X$ is an $n \times 2$ matrix $[1, x]$, $\beta = (\beta_0, \beta_1)^T$, $Z$ is an $n \times K$ spline basis design matrix with $(i, j)$-th element equal to $(x_i - c_j)_+$, and $b = (b_1, \ldots, b_K)^T$.

Note that the $(i, j)$-th element $Z_{ij} = 0$ if $x_i \leq c_j$ and equals $x_i - c_j$ if $x_i > c_j$, for $i = 1, \ldots, n$ and $j = 1, \ldots, K$.

88 / 160 Structure

Q(λ) has the form of a mixed model with fixed effects $\beta$ and random effects $b$

We will discuss mixed models shortly, but for now note that the random effects are not subject-specific (i.e., not $b_i$)

Instead, they are shared by all n subjects

Nevertheless, we can use a mixed model framework (frequentist or Bayesian) to estimate β, λ, b and σ2

89 / 160 Mixed Model Structure In particular, we can write our model as

$\underset{n\times 1}{Y} = \underset{n\times p}{X}\,\underset{p\times 1}{\beta} + \underset{n\times K}{Z}\,\underset{K\times 1}{b} + \underset{n\times 1}{e},$

where $\beta = (\beta_0, \beta_1)^T$ is a $p \times 1$ ($p = 2$) vector of fixed effects, $b \sim N(0, \sigma_b^2 I_K)$ is a vector of random effects, $e \sim N(0, \sigma_e^2 I_n)$, and $\sigma_b^2 = \sigma_e^2/\lambda$.

$N(0, \sigma_b^2 I_K)$ is a prior distribution for $b$ that imposes prior shrinkage toward zero.

For fixed $\sigma_e^2$, as $\lambda \to 0$, $\sigma_b^2 \to \infty$. Thus, there is high variability in the prior, and little shrinkage/smoothing

As $\lambda \to \infty$, $\sigma_b^2 \to 0$. Here, there is low variability, and $b$ is more tightly centered about zero (increased shrinkage)

90 / 160 Frequentist Inference for β

We can show⁴ that if $\sigma_e^2$ and $\sigma_b^2$ are known, then the BLUE of $\beta$ is

$\hat\beta = (X^T \Sigma^{-1} X)^{-1} X^T \Sigma^{-1} Y,$

where $\Sigma = \sigma_e^2 I_n + \sigma_b^2 Z Z^T$ is the $n \times n$ marginal variance of $Y$ after integrating out $b$.

This is obtained by maximizing the marginal likelihood of $Y$ after integrating out the random effects $b$.

Clearly, $\hat\beta$ is the weighted (generalized) least squares estimate with $\mathrm{Cov}(Y \mid X) = \Sigma$.

⁴ See Wakefield, pp. 565–566.

91 / 160 Frequentist Inference for b

Likewise, if the variance components are known, the best linear unbiased predictor (BLUP) of b is

$\tilde b = (G^{-1} + Z^T R^{-1} Z)^{-1} Z^T R^{-1} (Y - X\hat\beta) = \sigma_b^2 Z^T \Sigma^{-1} (Y - X\hat\beta),$

where $G = \sigma_b^2 I_K$, $R = \sigma_e^2 I_n$, and the last line follows from the fact that, for invertible matrices $A$ and $C$,

$(A + BCD)^{-1} BC = A^{-1} B (C^{-1} + D A^{-1} B)^{-1}$
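This identity is easy to check numerically; a small sketch with random matrices:

set.seed(1)
A<-crossprod(matrix(rnorm(9),3,3))+diag(3)      # 3 x 3 invertible
C<-crossprod(matrix(rnorm(4),2,2))+diag(2)      # 2 x 2 invertible
B<-matrix(rnorm(6),3,2)
D<-matrix(rnorm(6),2,3)
lhs<-solve(A+B%*%C%*%D)%*%B%*%C
rhs<-solve(A)%*%B%*%solve(solve(C)+D%*%solve(A)%*%B)
max(abs(lhs-rhs))                               # Should be numerically zero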

92 / 160 Unknown Variance Components

Since $\sigma_e^2$ and $\sigma_b^2$ are unknown, in practice we plug in the REML estimates of $\sigma_e^2$ and $\sigma_b^2$ to derive large-sample estimators and predictors

$\hat\beta = (X^T \hat\Sigma^{-1} X)^{-1} X^T \hat\Sigma^{-1} Y$ and $\hat b = \hat\sigma_b^2 Z^T \hat\Sigma^{-1} (Y - X\hat\beta)$

Finally, the REML estimate of the smoothing parameter is $\hat\lambda = \hat\sigma_e^2 / \hat\sigma_b^2$.

93 / 160 Bayesian Inference

In the Bayesian setting, we use the same mixed model framework with a $N_K(0, \sigma_b^2 I_K)$ prior for $b$

But now we place additional conjugate priors on $\beta$, $\sigma_e^2$, and $\sigma_b^2$

Or, equivalently, we place priors on $\tau_e = 1/\sigma_e^2$ and $\tau_b = 1/\sigma_b^2$

A common choice is:

1. $\beta \sim N_2(\beta_0, T_0^{-1})$
2. $\pi(\sigma_e^2) \propto 1/\sigma_e^2$ (improper prior)
3. $\tau_b \sim \mathrm{Ga}(c, d)$

The choice of $c$ and $d$ affects smoothness. To select values, one can use Bayesian information criteria, or place a discrete bivariate

prior on (c, d) 94 / 160 Bayesian Inference∗ This leads to the following full conditionals:

1. $\beta \mid y, \text{rest} \sim N_2(m, V)$, where

$V = (T_0 + \tau_e X^T X)^{-1}$, with $\tau_e = 1/\sigma_e^2$

$m = V(T_0\beta_0 + \tau_e X^T(y - Zb))$

2. $b \mid y, \text{rest} \sim N_K(m, V)$, where

$V = (\tau_b I_K + \tau_e Z^T Z)^{-1}$

$m = \tau_e V Z^T (y - X\beta)$

3. $\tau_e \mid y, \text{rest} \sim \mathrm{Ga}\!\left(\dfrac{n}{2}, \dfrac{(y - X\beta - Zb)^T(y - X\beta - Zb)}{2}\right)$

4. $\tau_b \mid y, \text{rest} \sim \mathrm{Ga}\!\left(c + \dfrac{K}{2}, d + \dfrac{b^T b}{2}\right)$

See peptide.r for details

95 / 160 R Code for Penalized PWL Model

Gibbs Sampler for Penalized PWL Spline Model

# Import Data: x=age, y=logc

# PWL splines k<-10 # number of knots grid<-seq(5,14,length=k) # k+1 initial knot locations (includes boundaries) Z<-matrix(0,n,k) for (j in1:k) Z[,j]<- pmax(age-grid[j],0)

# Priors c<-d<-1e-5 # Hyperpriors for taub -- increase to 1 to reduce smoothness # Prior var of taub small and var sigma_b is large = less shrinkage T0<-diag(.0001,p) # Prior Precision

# Inits -- see R code

# Fine grid of x values with which to plot yhat x2<-seq(0,16,by=.01) # Grid of age values X2<-cbind(1,x2) # Fixed effect covs Z2<-matrix(0,length(x2),k) for (j in1:k) Z2[,j]<- pmax(x2-grid[j],0)

######################### # Gibbs of Ridged Model # ######################### for (i in1:nsim){ # Update beta v<-solve(T0+taue*crossprod(X)) m<-v%*%(T0%*%beta0+taue*t(X)%*%(y-Z%*%b)) Beta[i,]<-beta<-c(rmvnorm(1,m,v))

# Update b v<-solve(taub*diag(k)+taue*crossprod(Z)) m<-taue*v%*%t(Z)%*%(y-X%*%beta) B[i,]<-b<-c(rmvnorm(1,m,v))

# Update taue (diffuse) g<-crossprod(y-X%*%beta-Z%*%b) taue<-rgamma(1,n/2,g/2) Sigmae[i]<-1/taue

# Update taub (proper) taub<-rgamma(1,c+k/2,d+crossprod(b)/2) Sigmab[i]<-1/taub

# Yhat Yhat[i,]<-X2%*%beta+Z2%*%b # Predicted values } 96 / 160

1 REML Estimate of Mean Regression Function

Figure 4: REML estimates of the original and ridge-smoothed PWL regression functions.

[Figure: log concentration versus age showing the original PWL fit, the ridged PWL fit, and lower/upper CI bounds for the ridged PWL fit]

97 / 160 Posterior Mean Estimate of the Regression Function

Figure 5: Posterior mean estimates of the original and ridge-smoothed PWL regression functions.

[Figure: scatterplot with the original PWL fit, the posterior mean of the ridged PWL fit, and 95% credible interval bands]

98 / 160 Posterior Mean Estimates with hyperparameters c = d = 1

[Figure: original PWL fit, ridged PWL fit, and 95% CrI bands under hyperparameters c = d = 1, showing a smoother ridged curve]

99 / 160 Linear Mixed Models

Consider a linear mixed model of the form

$Y_i = X_i\beta + Z_i b_i + e_i, \quad i = 1, \ldots, n,$

where $Y_i$ and $e_i$ are $n_i \times 1$, $X_i$ is $n_i \times p$, $\beta$ is $p \times 1$, $Z_i$ is $n_i \times q$, $b_i$ is $q \times 1$, and, under classical assumptions,

• $b_i \sim N_q(0, G)$ is a vector of subject-specific random effects (could allow for non-normality)

• $Z_i$ is a random effects design matrix

• $e_i \sim N_{n_i}(0, R_i) = N_{n_i}(0, \sigma_e^2 I_{n_i})$

• bi and ei are independent RVs Note that we allow for unbalanced data with the number of

repeated measurements, ni , varying across subjects 100 / 160 Linear Mixed Models∗

The conditional distribution of $Y_i$, given $b_i$, is

$f_{Y_i}(y_i \mid \beta, b_i, \sigma_e^2) = N_{n_i}(X_i\beta + Z_i b_i, R_i).$

The marginal distribution of $Y_i$, integrating over $b_i$, is

$f_{Y_i}(y_i \mid \beta, G, \sigma_e^2) = N_{n_i}(X_i\beta, \Sigma_i),$

where $\Sigma_i = R_i + Z_i G Z_i^T = \sigma_e^2 I_{n_i} + Z_i G Z_i^T$.

101 / 160 Frequentist Inference for Linear Mixed Models

In the frequentist setting, we can show that the restricted maximum likelihood (REML) estimator of β is

$\hat\beta = \left(\sum_{i=1}^n X_i^T \hat\Sigma_i^{-1} X_i\right)^{-1} \sum_{i=1}^n X_i^T \hat\Sigma_i^{-1} Y_i,$

where $\hat\Sigma_i$ is the REML estimator of $\Sigma_i$ (in other words, based on the REML estimators $\hat\sigma_e^2$ and $\hat G$).

The estimated variance-covariance of $\hat\beta$ is

$\widehat{\mathrm{Cov}}(\hat\beta) = \left(\sum_{i=1}^n X_i^T \hat\Sigma_i^{-1} X_i\right)^{-1}$

102 / 160 Frequentist Inference for Linear Mixed Models

Similarly, the best linear unbiased predictor (BLUP) of $b_i$ is

$\tilde b_i = G Z_i^T \Sigma_i^{-1} (Y_i - X_i\hat\beta).$

• Linear: $\tilde b_i$ is a linear function of $Y_i$
• Predictor: $\tilde b_i$ is predicting (not estimating) a random variable, $b_i$
• Unbiased: $E(\tilde b_i) = E(b_i) = 0$
• Best: $\tilde b_i$ minimizes the mean square prediction error:

$\mathrm{MSPE} = E[(\tilde b_i - b_i)^2] = E[((\tilde b_i - b_i) - 0)^2]$

$= E[((\tilde b_i - b_i) - E(\tilde b_i - b_i))^2] = \mathrm{Var}(\tilde b_i - b_i)$

103 / 160 Bayesian Connection

From a Bayesian standpoint, $\tilde b_i$ is the marginal posterior mean of $b_i$ after integrating over the posterior distribution of $\beta$ under a diffuse prior for $\beta$

That is, $\tilde b_i = E_{\beta \mid y_i}[E(b_i \mid y_i, \beta, G, \sigma_e^2)] = E(b_i \mid y_i, G, \sigma_e^2)$

However, this assumes $G$ and $\sigma_e^2$ are known

See Wakefield, pages 376–377, for details

It is easy to show that $E(\tilde b_i) = E(b_i) = 0$ and hence $\tilde b_i$ is a (linear) unbiased predictor

104 / 160 Frequentist Inference for Linear Mixed Models

The corresponding empirical BLUP (eBLUP) is

$\hat b_i = \hat G Z_i^T \hat\Sigma_i^{-1} (Y_i - X_i\hat\beta),$

where $\hat G$ and $\hat\Sigma_i^{-1}$ are based on the REML estimates

However, after plugging in $\hat G$ and $\hat\Sigma_i^{-1}$, $\hat b_i$ is no longer a linear function of $Y_i$

105 / 160 ˜ Variance-Covariance of bi We can show, after extensive algebra, that

$\mathrm{Var}(\tilde b_i) = G Z_i^T P_i Z_i G,$ where

• $P_i = \Sigma_i^{-1}(I_{n_i} - H_i)$

• $H_i = X_i (X_i^T \Sigma_i^{-1} X_i)^{-1} X_i^T \Sigma_i^{-1}$ is a generalized hat matrix (oblique projector)

Note the distinction between $\mathrm{Var}(b_i)$ in the earlier slide and $\mathrm{Var}(\tilde b_i)$ above: the latter accounts for the variability in estimating $\beta$

Note also the connection to the variance of the residuals in ordinary linear regression: $\mathrm{Var}(e) = \sigma_e^2(I_n - H)$.

106 / 160 Prediction Error Variance

From a Bayesian perspective, a more appropriate measure of

uncertainty for $b_i$ is the variance of the prediction error, $\mathrm{Var}(\tilde b_i - b_i)$, rather than $\mathrm{Var}(\tilde b_i)$ itself

The former takes into account both the uncertainty in estimating $b_i$ and $\hat\beta$, whereas the latter takes into account the estimation of $\hat\beta$ only

It is analogous to computing the prediction interval for a future response $Y_i$ in ordinary linear regression

107 / 160 Prediction Error Variance (Cont'd)

We can show that

Var(b̃i − bi) = Var(b̃i) + Var(bi) − 2 Cov(b̃i, bi)
             = Var(b̃i) + Var(bi) − 2 Var(b̃i)   (after algebra, since Cov(b̃i, bi) = Var(b̃i))
             = Var(bi) − Var(b̃i)
             = G − Var(b̃i)
             = G − G Ziᵀ Pi Zi G,

where Pi was defined earlier.

108 / 160 Connection to Bayesian Inference

Under a non-informative prior for β, the prediction error variance is also the marginal posterior variance of bi given Y = y after integrating over the posterior distribution of β

And earlier, we showed that the marginal posterior mean of bi was b̃i

Thus, under a non-informative prior for β (but assuming known variance components), the marginal posterior distribution of bi is

f(bi | Yi = yi, σe², G) = Nq(b̃i, G − G Ziᵀ Pi Zi G), where Pi = Σi⁻¹(I_{ni} − Hi). See Wakefield page 377 for details

109 / 160 Three Variances to Keep in Mind

So far we have discussed three variance estimators:

1. Var(bi | Yi = yi, β, σe², G) = G − G Ziᵀ Σi⁻¹ Zi G
                                = (G⁻¹ + Ziᵀ Ri⁻¹ Zi)⁻¹   (after matrix algebra)
                                = (G⁻¹ + τe Ziᵀ Zi)⁻¹,  where τe = 1/σe²

• This is the posterior variance of bi given yi, β, G, and Σi

• This would be used in drawing bi from its full conditional as part of a Gibbs sampler

2. Var(b̃i) = Var[ G Ziᵀ Σi⁻¹ (Yi − Xiβ̂) ] = G Ziᵀ Pi Zi G

• This is a variance estimator derived after plugging in β̂

• It is also Var( E_{β|Y=y}[ E(bi | Y = y, β, G, Σi) ] ), i.e., the variance of the marginal posterior mean of bi after integrating across the posterior of β under a vague prior for β (see Wakefield page 380)

• Takes into account the uncertainty in estimating β 110 / 160 Three Variances (Cont'd)

3. Prediction error variance: Var(b̃i − bi) = G − G Ziᵀ Pi Zi G

• This is a dual frequentist and Bayesian error variance estimator

• In the Bayesian setting, it is the marginal posterior variance of bi after integrating over the posterior of β

• This differs from the variance of the marginal posterior mean (i.e., Var(b̃i))

• This is the ideal measure of uncertainty

However, all three estimators assume G and σe² (and hence Σi) are known

• Empirical Bayes plugs in REML estimates

• A fully Bayesian approach assigns priors
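The following small numerical sketch (toy values, not course code) computes the three variances for a single subject's random intercept (q = 1), using the per-subject generalized hat matrix defined above; σb² = 9, σe² = 2, and ni = 5 are assumed purely for illustration:

sb2 <- 9; se2 <- 2; ni <- 5
set.seed(1)
Zi <- matrix(1, ni, 1); Xi <- cbind(1, rnorm(ni)); G <- matrix(sb2, 1, 1)
Si <- se2 * diag(ni) + Zi %*% G %*% t(Zi)                           # Sigma_i
v1 <- 1 / (1 / sb2 + ni / se2)                                      # 1) full conditional variance of b_i
Hi <- Xi %*% solve(t(Xi) %*% solve(Si, Xi)) %*% t(Xi) %*% solve(Si) # generalized hat matrix
Pi <- solve(Si) %*% (diag(ni) - Hi)
v2 <- as.numeric(G %*% t(Zi) %*% Pi %*% Zi %*% G)                   # 2) Var(b-tilde_i), plug-in beta-hat
v3 <- sb2 - v2                                                      # 3) prediction error variance
c(full.cond = v1, var.blup = v2, pred.error = v3)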

111 / 160 Bayesian Inference for Linear Mixed Models

As usual, Bayesian inference begins by assigning prior distributions to all model parameters

Common choices are the conditionally conjugate priors:

• β ∼ Np(β0, T0⁻¹)

• bi ∼ Nq(0, G)  (iid)

• For q > 1, G ∼ IW(ν0, C0), where C0 is a q × q scale matrix
• For q = 1, τb = 1/σb² ∼ Ga(c, d)
• τe = 1/σe² ∼ Ga(c, d)

These priors admit closed-form conjugate full conditionals, leading to straightforward Gibbs sampling 112 / 160 Bayesian Inference for Linear Random Intercept Model

Consider the following linear random intercept model

Yij = xijᵀβ + bi + eij,  i = 1,..., n; j = 1,..., ni,

where bi ∼ N(0, σb²) and eij ∼ N(0, σe²), both iid.

Or, combining all N = ∑_{i=1}^n ni observations,

Y = Xβ + Zb + e,

where X is an N × p fixed effects design matrix, Z is an N × n random intercept design matrix, and e is an N × 1 vector of errors.

113 / 160 Random Intercept Model (Cont'd)

Specifically, Z is an N × n random effects design matrix of the form

     [ 1 0 0 ··· 0 ]
     [ 1 0 0 ··· 0 ]  } repeat n1 times
     [ ··········· ]
     [ 0 1 0 ··· 0 ]
Z =  [ 0 1 0 ··· 0 ]  } repeat n2 times
     [ ··········· ]
     [ 0 0 0 ··· 1 ]
     [ 0 0 0 ··· 1 ]  } repeat nn times

Note that this implies that Zb is an N × 1 vector of random effects with bi repeated ni times for subject i. That is, Zb = (b1,..., b1, b2,..., b2,..., bn,..., bn)ᵀ.
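A small R sketch (toy values, not from the course programs) showing one way to build Z from a subject id vector and to verify that Zb stacks each bi, ni times:

nis <- c(2, 3, 2)                      # n_i for n = 3 subjects, so N = 7
id  <- rep(1:3, times = nis)
Z   <- model.matrix(~ factor(id) - 1)  # N x n matrix of subject indicators
b   <- rnorm(3)
all.equal(as.numeric(Z %*% b), rep(b, times = nis))  # TRUE: Zb repeats b_i, n_i times each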

114 / 160 Marginal Distribution of Yi

Putting it together, it follows that the marginal distribution of the ni × 1 vector Yi is multivariate normal with mean µi = Xiβ and covariance

                [ σb² + σe²   σb²         ···  σb²       ]
Σi = Σi(θ) =    [ σb²         σb² + σe²   ···  σb²       ]   (ni × ni)
                [ ········································]
                [ σb²         σb²         ···  σb² + σe² ]

where θ = (σb², σe²)ᵀ is the vector of variance components.

This implies that, marginally, Yi has an exchangeable or compound symmetric covariance structure, with constant correlation ρ = σb²/(σb² + σe²) among all pairs (Yij, Yik).
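A quick illustration (using σb² = 9 and σe² = 2, the true values in the later example) of the compound symmetric structure and the implied correlation:

sb2 <- 9; se2 <- 2; ni <- 4
Sigma_i <- sb2 * matrix(1, ni, ni) + se2 * diag(ni)  # sigma_b^2 * J + sigma_e^2 * I
rho <- sb2 / (sb2 + se2)                             # constant within-subject correlation
cov2cor(Sigma_i)                                     # all off-diagonal entries equal rho (about 0.82)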

115 / 160 Conjugate Prior Specification

Conjugate priors for the random intercept model include:

• β ∼ Np(β0, T0⁻¹), where β0 and T0 are known quantities and T0 is the prior precision matrix

• σe² is IG with known shape and scale parameters c and d, where the IG(c, d) prior is given by

  f(σe²; c, d) = [dᶜ / Γ(c)] (σe²)^{−(c+1)} exp(−d/σe²)

  Equivalently, τe = 1/σe² ∼ Ga(c, d), where d is a rate parameter

• Similarly, τb = 1/σb² ∼ Ga(g, h) for known shape and rate parameters g and h

116 / 160 Full Conditionals for Random Intercept Model∗

One can show that the full conditionals for the linear random intercept model are:

1. β | y, rest ∼ Np(m, V), where

   V = (T0 + τe XᵀX)⁻¹, where τe = 1/σe²   (p × p)
   m = V(T0β0 + τe Xᵀ(y − Zb))   (p × 1)

2. For i = 1,..., n, bi | yi, rest ∼ N(mi, vi), where

   vi = 1/(τb + ni τe)
   mi = vi τe 1_{ni}ᵀ(yi − Xiβ) = vi τe ∑_{j=1}^{ni} (yij − xijᵀβ)
      = (1 − wi)·0 + wi(ȳi − x̄iᵀβ),  where wi = ni σb² / (ni σb² + σe²)
   (a small numerical illustration of the shrinkage weight wi follows this list)

3. τe | y, rest ∼ Ga( c + N/2, d + (y − Xβ − Zb)ᵀ(y − Xβ − Zb)/2 )

4. τb | y, rest ∼ Ga( g + n/2, h + bᵀb/2 )
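As promised in step 2, a small numerical illustration (toy values, not course code) of the shrinkage weight wi: it approaches 1 as ni grows, so mi relies increasingly on subject i's own data rather than shrinking toward zero:

sb2 <- 9; se2 <- 2
ni  <- c(1, 2, 5, 10, 50)
wi  <- ni * sb2 / (ni * sb2 + se2)
round(rbind(ni = ni, wi = wi), 3)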

See Random Intercept.r for details 117 / 160 R Code for Random Intercept Model

Gibbs Sampler for Random Intercept Model

########### # Priors # ########### beta0<-rep(0,p) # Prior mean for beta T0<-diag(.01,p) # Prior precision for beta c<-d<-.001 # Gamma hyperpriors for taue, taub

######### # Inits # ######### taue<-1 # Error precision = 1/sigma2 b<-rep(0,n) # Random effects taub<-1 # Random Effects precision

################# # GIBBS SAMPLER # ################# for (i in 1:nsim) { # Update beta vbeta<-solve(T0+taue*crossprod(X,X)) mbeta<-vbeta%*%(T0%*%beta0 + taue*crossprod(X,y-rep(b,nis))) beta<-c(rmvnorm(1,mbeta,vbeta))

# Update b vb<-1/(taub+nis*taue) mb<-vb*(taue*tapply(y-X%*%beta,id,sum)) # tapply sums (y-xbeta)'s for each subject b<-rnorm(n,mb,sqrt(vb))

# Update taue tmp<-d+crossprod(y-X%*%beta-rep(b,nis))/2 taue<-rgamma(1,c+N/2,tmp)

# Update taub tmp<-c(d+crossprod(b)/2) taub<-rgamma(1,c+n/2,tmp)

################# # Store Results # ################# if (i> burn & i%%thin==0){ j<-(i-burn)/thin Beta[j,]<-beta Sigmae[j]<-1/taue Sigmab[j]<-1/taub }

if (i%%100==0) print(i) # 2.6 seconds to run 1000 iterations with n=1000 and N=5525 } 118 / 160

Example: The program Random Intercept.r fits the following random intercept model:

Yij = β0 + β1xi + β2tij + bi + eij,  i = 1,..., 1000; j = 1,..., ni
bi ∼ N(0, σb²)  (iid)
eij ∼ N(0, σe²)  (iid)

The results, based on 1000 iterations with a burn-in of 500, are:

Table 9: Results for Random Intercept Model.

Parameter   True Value   MLE (SE)†       Posterior Mean (SD)
β0          10           10.18 (0.13)    10.15 (0.14)
β1          2.5          2.25 (0.002)    2.25 (0.002)
β2          −1.5         −1.52 (0.19)    −1.47 (0.17)
σe²         2            2.00 (0.04)     1.99 (0.04)
σb²         9            8.23 (0.39)     8.30 (0.40)
† From SAS Proc NLMIXED.

119 / 160 Skew-Normal Random Intercepts

Obviously, the normal distribution for bi implies symmetry about zero

Suppose, instead, that the distribution is skewed

One way to accommodate skewness is to assume a skew-normal distribution⁵ for bi

⁵ O'Hagan and Leonard (1975); Azzalini (1985)

120 / 160 Skew-Normal Distribution

Definition: A random variable Y is said to follow a skew-normal (SN) distribution with location µ ∈ ℝ, scale ω > 0, and skewness α ∈ ℝ if

f(y; µ, ω², α) = (2/ω) φ((y − µ)/ω) Φ(αω⁻¹(y − µ)),

where φ(·) and Φ(·) are the density and CDF of a standard normal random variable. α < 0 implies negative skewness and α > 0 implies positive skewness. The distribution reduces to N(µ, ω²) when α = 0.
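A minimal sketch (not part of the course code) that plots SN densities directly from this definition; the curves correspond to those shown in the figure on the next slide:

dsn_manual <- function(y, mu = 0, omega = 1, alpha = 0)
  2 / omega * dnorm((y - mu) / omega) * pnorm(alpha * (y - mu) / omega)
x <- seq(-6, 4, length = 500)
plot(x, dnorm(x), type = "l", ylim = c(0, 0.8), ylab = "f(x)")  # N(0,1)
lines(x, dsn_manual(x, 0, 1,  3), col = 2)                      # SN(0,1,3)
lines(x, dsn_manual(x, 0, 1, -3), col = 3)                      # SN(0,1,-3)
lines(x, dsn_manual(x, 0, 2, -3), col = 4)                      # SN(0,2,-3)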

121 / 160 SN Density Functions

[Figure: SN density curves for N(0,1), SN(0,1,3), SN(0,1,−3), and SN(0,2,−3)]

122 / 160 Stochastic Representation of the Skew-Normal Distribution

Theorem: Let Y = µ + ψw + ε, where
• ψ = ωα/√(1 + α²)
• w ∼ N⁺(0, 1), where N⁺(·) denotes a standard normal density truncated below by zero
• ε ∼ N(0, σ²)
• σ² = ω²/(1 + α²).
Then, Y ∼ SN(µ, ω², α).
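A quick simulation check of this representation (a sketch; it assumes the dsn_manual function from the earlier density sketch is in the workspace):

set.seed(1)
mu <- 0; omega <- 2; alpha <- -3
psi  <- omega * alpha / sqrt(1 + alpha^2)
sig2 <- omega^2 / (1 + alpha^2)
w <- abs(rnorm(1e5))                              # N+(0,1) draws (half-normal)
y <- mu + psi * w + rnorm(1e5, 0, sqrt(sig2))
hist(y, breaks = 100, freq = FALSE, main = "SN(0, 4, -3) via the representation")
curve(dsn_manual(x, mu, omega, alpha), add = TRUE, col = 2)  # should match the histogram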

Our goal will be to update w, then fit a linear model to Y to estimate µ, ψ, and σ²

Then back-transform to recover α and ω 123 / 160 Bayesian Inference for SN Random Intercept Model

Consider the following random intercept model

Yij = xijᵀβ + bi + eij,  i = 1,..., n; j = 1,..., ni,

where bi ∼ SN(0, ω², α), iid. Following the stochastic representation on the prior slide, we can write

bi as

bi = ψwi + εi,

which implies that bi | wi ∼ N(ψwi, σb²), with σb² = ω²/(1 + α²).

Note: We could additionally allow eij to follow an SN distribution!

There are also skew-t distributions that allow for heavier tails

There are also multivariate extensions of the SN and skew-t distributions

See Frühwirth-Schnatter and Pyne (2010) and Neelon et al. (2015) for details 124 / 160 Prior Distributions for Bayesian Inference for SN Random Intercept Model

The conjugate prior distributions for the SN random intercept model are

• β ∼ Np(β0, T0⁻¹)

• τe ∼ Ga(c, d)
• bi | wi ∼ N(ψwi, σb²)

• wi ∼ N⁺(0, 1)
• ψ ∼ N(µψ, τψ⁻¹)
• τb = 1/σb² ∼ Ga(g, h)

125 / 160 Gibbs Sampler for SN Random Intercept Model∗

The Gibbs sampler for the SN random intercept model is:

1. Draw β from Np(m, V), where

   V = (T0 + τe XᵀX)⁻¹   (p × p)
   m = V(T0β0 + τe Xᵀ(y − Zb))   (p × 1)

2. For i = 1,..., n, draw wi from N⁺(mi, v), where

   v = 1/(1 + τbψ²)

   mi = v τb ψ bi

3. Update ψ from N(m, v), where

   v = 1/(τψ + τb wᵀw), where w = (w1,..., wn)ᵀ
   m = v(τψµψ + τb wᵀb)

4. For i = 1,..., n, draw bi from N(mi, vi), where

   vi = 1/(τb + ni τe)
   mi = vi [τb ψ wi + τe 1_{ni}ᵀ(yi − Xiβ)]

5. Draw τe from Ga( c + N/2, d + (y − Xβ − Zb)ᵀ(y − Xβ − Zb)/2 )

6. Draw τb from Ga( g + n/2, h + (b − ψw)ᵀ(b − ψw)/2 )

7. Recover the original α and ω as

   α = ψ√τb
   ω = √(1/τb + ψ²)

See SN Random Intercept.r for details 126 / 160 R Code for SN Random Intercept Model

Gibbs Sampler for SN Random Intercept Model

for (i in 1:nsim){

# Update beta vbeta<-solve(T0+taue*crossprod(X,X)) mbeta<-vbeta%*%(T0%*%beta0 + taue*crossprod(X,y-rep(b,nis))) beta<-c(rmvnorm(1,mbeta,vbeta))

# Update w v<-1/(1+taub*psi^2) m<-v*taub*psi*b w<-rtnorm(n,m,sqrt(v),lower=0)

# Update psi v<-1/(taupsi+taub*crossprod(w)) m<-v*(taupsi*mupsi+taub*crossprod(w,b)) psi<-rnorm(1,m,sqrt(v))

# Update b vb<-1/(taub+nis*taue) mb<-vb*(taub*psi*w+taue*tapply(y-X%*%beta,id,sum)) b<-rnorm(n,mb,sqrt(vb))

# Update taue taue<-rgamma(1,0.01+N/2,0.01+crossprod(y-X%*%beta-rep(b,nis))/2)

# Update taub tmp<-c(0.01+crossprod(b-psi*w)/2) taub<-rgamma(1,0.01+n/2,tmp)

############################### # Transform and Store Results # ############################### if (i> burn & i%%thin==0){ j<-(i-burn)/thin Beta[j,]<-beta Omega[j]<-sqrt(1/taub+psi^2) Alpha[j]<-psi*sqrt(taub) Sigmae[j]<-1/taue B[j,]<-b }

if (i%%100==0) print(i) # 38 seconds to run 10K iterations

}

127 / 160

Example: The program SN Random Intercept.r fits the following random intercept model:

Yij = β0 + β1xij + bi + eij,  i = 1,..., 1000; j = 1,..., ni
bi ∼ SN(0, ω², α)  (iid)
eij ∼ N(0, σe²)  (iid)

The results, based on 10,000 iterations with a burn-in of 5000, are:

Table 10: Results for SN Random Intercept Model.

Parameter   True Value   MLE (SE)†      Posterior Mean (SD)
β0          1            −0.51 (0.05)   1.01 (0.11)
β1          2            1.98 (0.02)    1.99 (0.02)
ω           2            –              2.11 (0.09)
α           −2           –              −2.22 (0.36)
σe²         2            2.05 (0.04)    2.06 (0.04)
† Assuming normal random effects.

128 / 160 Trace Plots for SN Random Intercept Model

[Figure: trace plots for β1, ω, α, and σe across the stored iterations]

129 / 160 Random Slope Model

Let's now consider a basic linear random slope model of the form

Yij = β0 + tij β1 + b1i + tij b2i + eij , j = 1,..., ni , where

• tij denotes the timing of the j-th measurement for subject i

• β0 is a population average intercept or mean baseline response E(Yij |tij = 0)

• β1 is the slope of the population-average regression line

• b1i is a subject-specific random intercept representing subject i's departure from the population average intercept

• b2i is a subject-specific random slope that describes departures from the population average slope

130 / 160 Random Slope Model (Cont'd) Random slope models allow for individual variation in the baseline measurements (intercepts) and trajectories

[Figure: population-average trajectory, subject-specific trajectories, and individual data points; Y vs. time, 0 to 10]

131 / 160 Random Slope Model (Cont'd)

In vector form, we have

Yi = Xiβ + Zibi + ei,

with dimensions ni × 1, ni × p, p × 1, ni × q, q × 1, and ni × 1, respectively, where here p = q = 2 and

          [ 1   ti1  ]
Xi = Zi = [ ⋮    ⋮   ]
          [ 1   tini ]

More generally, we may have p > q if additional fixed effects covariates are incorporated

132 / 160 Random Slope Model (Cont'd)

Or, combining all N observations, we have

Y = Xβ + Zb + e,  with Y (N × 1), X (N × p), β (p × 1), Z (N × qn), b (qn × 1), and e (N × 1),

where, with p = q = 2, we have

• X = [1N , t]

• t = (t11, t12,..., t_{n,n_n})ᵀ

• b = (b11, b21,..., b1n, b2n)ᵀ,

and ....

133 / 160 Random Slope Model (Cont'd)

The N × 2n matrix Z has the block-diagonal form

    [ 1  t11     0  0      ···  0  0     ]
    [ 1  t12     0  0      ···  0  0     ]   } repeat n1 times
    [ ·································· ]
    [ 1  t1n1    0  0      ···  0  0     ]
    [ 0  0       1  t21    ···  0  0     ]
Z = [ 0  0       1  t22    ···  0  0     ]   } repeat n2 times
    [ ·································· ]
    [ 0  0       1  t2n2   ···  0  0     ]
    [ ·································· ]
    [ 0  0       0  0      ···  1  tn1   ]
    [ 0  0       0  0      ···  1  tn2   ]   } repeat nn times
    [ ·································· ]
    [ 0  0       0  0      ···  1  tnnn  ]
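A short sketch (toy values, not course code) that builds the N × 2n matrix Z from id and time vectors so the columns line up with b = (b11, b21,..., b1n, b2n)ᵀ:

nis <- c(2, 3, 2); n <- 3
id  <- rep(1:n, times = nis)
t   <- c(0, 1, 0, 1, 2, 0, 1)
Zint   <- model.matrix(~ factor(id) - 1)  # N x n intercept indicators
Zslope <- Zint * t                        # N x n subject-specific time columns
ord    <- as.vector(rbind(1:n, 1:n + n))  # interleave intercept and slope columns
Z      <- cbind(Zint, Zslope)[, ord]      # N x 2n, block pattern as displayed above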

134 / 160 Marginal Mean, Variance, and Covariance

Returning to our random slope model, we have

Yij = xijᵀβ + b1i + tij b2i + eij,

where Cov(bi) = Cov[(b1i, b2i)ᵀ] = Σb = [ σ1²  σ12 ; σ12  σ2² ]⁶

The marginal mean, variance, and covariance are

E(Yij) = xijᵀβ
V(Yij) = σ1² + 2 tij σ12 + tij² σ2² + σe²
Cov(Yij, Yik) = σ1² + (tij + tik) σ12 + tij tik σ2²

⁶ Note change of notation from G to Σb. 135 / 160 Marginal Mean, Variance, and Covariance (Cont'd)

Thus, the inclusion of the random slope implies that the marginal variance of Y can vary over time

Moreover, Cov(Yij , Yik ) is no longer compound symmetric, but instead depends on the time separation between the two occasions

More reasonable in longitudinal settings
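A quick numerical illustration (using the true values from the later example, σ1² = 2, σ12 = 1, σ2² = 2, σe² = 2) of how the marginal variance and the lag-1 correlation change with time under the random slope model:

s1 <- 2; s12 <- 1; s2 <- 2; se <- 2
tt <- 0:5
vy   <- s1 + 2 * tt * s12 + tt^2 * s2 + se   # V(Y_ij) increases with t_ij
lag1 <- (s1 + (2 * tt + 1) * s12 + tt * (tt + 1) * s2) /
        sqrt(vy * (s1 + 2 * (tt + 1) * s12 + (tt + 1)^2 * s2 + se))  # Corr(Y_ij, Y_ik), t_ik = t_ij + 1
round(rbind(time = tt, var = vy, lag1.corr = lag1), 3)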

136 / 160 Marginal Mean and Covariance: Vector Form

In vector form we have

E(Yi) = Xiβ   (ni × 1)

Cov(Yi) = Cov(Xiβ + Zibi + ei)   (ni × ni)

        = Cov(Zibi) + Cov(ei)   (by independence of bi and ei)
        = Zi Σb Ziᵀ + σe² I_{ni}
        = Zi Σb Ziᵀ + Ri = Σi, say.

And hence,

Yi | bi ∼ N_{ni}(Xiβ + Zibi, Ri)
Yi ∼ N_{ni}(Xiβ, Σi)

137 / 160 Conjugate Prior Specification

Conjugate priors for the random slope model are:

• β ∼ Np(β0, T0⁻¹)

• bi ∼ Nq(0, Σb)

• τe ∼ Ga(c, d)

• Σb ∼ IW(ν0, C 0), where C 0 is a q × q scale matrix

Equivalently, Σb⁻¹ ∼ Wish(ν0, C0⁻¹)

138 / 160 Inverse-Wishart Distribution

Definition

Let Σ ∼ IW(ν0, C0) with degrees of freedom ν0 and q × q positive-definite scale matrix C0. Then, the density of Σ is

f(Σ) = [ |C0|^{ν0/2} / (2^{ν0 q/2} Γq(ν0/2)) ] |Σ|^{−(ν0+q+1)/2} exp{ −(1/2) tr(C0 Σ⁻¹) },

where Γq(·) is the multivariate gamma function. The expected value of Σ is

E(Σ) = C0 / (ν0 − q − 1),

for ν0 > q + 1.

A popular non-informative prior for Σ is to select ν0 = q + 1 and C0 = I_q. This has the appealing property that the marginal prior distribution of each correlation in Σ is uniform. See Gelman et al. (2014) p. 73 for details. 139 / 160 Relationship between the MVN, Wishart and Inverse-Wishart Distributions

Proposition: Let z1,..., z_{ν0} ∼ Nq(0, C0), iid, and define ZᵀZ = ∑_{i=1}^{ν0} zi ziᵀ. Then, ZᵀZ ∼ Wish(ν0, C0) and (ZᵀZ)⁻¹ ∼ IW(ν0, C0⁻¹).

We can think of ν0 as a prior sample size and C 0 as a prior sum of squares.
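A quick empirical check of the proposition (a sketch, not course code): averaging ZᵀZ over many replicates should recover ν0 C0, the mean of the Wish(ν0, C0) distribution; the mvtnorm package is assumed:

library(mvtnorm)
set.seed(1)
nu0 <- 5
C0  <- matrix(c(1, 0.5, 0.5, 2), 2, 2)
ZtZ <- replicate(5000, crossprod(rmvnorm(nu0, sigma = C0)))  # each replicate is Z'Z (2 x 2)
apply(ZtZ, 1:2, mean)  # approximately nu0 * C0
nu0 * C0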

Moreover, just as the normal/inverse-gamma joint prior is conjugate for the univariate normal mean model, where Y ∼ N(µ, σ2) and

π(µ, σ²) = π(µ|σ²)π(σ²) = N(µ0, σ²/κ0) IG(ν0, σ0²), so is the Matrix Normal/Inverse-Wishart conjugate in the multivariate

case where Y ∼ Nk (µ, Σ) and

π(µ, Σ) = Matnorm(µ0, Σ0, Σ) IW(ν0, C0). For more on Bayesian inference using the Wishart and Inverse-Wishart distributions, please see Hoff (2009) Section 7.3. 140 / 160 Full Conditionals for Random Slope Model∗

One can show that the full conditionals for the linear random slope model are:

1. β | y, rest ∼ Np(m, V), where

   V = (T0 + τe XᵀX)⁻¹   (p × p)
   m = V(T0β0 + τe Xᵀ(y − Zb))   (p × 1)

2. b | y, rest ∼ N_{2n}(m, V), where

   V = (I_n ⊗ Σb⁻¹ + τe ZᵀZ)⁻¹   (2n × 2n)
   m = τe V Zᵀ(y − Xβ)   (2n × 1)

3. τe | y, rest ∼ Ga( c + N/2, d + (y − Xβ − Zb)ᵀ(y − Xβ − Zb)/2 )

4. Σb | y, rest ∼ IW( ν0 + n, C0 + BᵀB ), where B is the n × 2 matrix whose i-th row is (b1i, b2i) (extensions to q > 2 are straightforward)

See Random Slope.r for details 141 / 160 R Code for Random Slope Model

Gibbs Sampler for Random Slope Model

########### # Priors # ########### beta0<-rep(0,p) # Prior Mean for beta T0<-diag(.01,p) # Prior Precision Matrix of beta (vague), independent d0<-g0<-.001 # Hyperpriors for tau nu0<-3 # DF for Wishart prior on Sigmab (G in notes) C0<-diag(2) # Scale matrix for IW Prior on Sigmab

d<-d0+N/2 # Posterior df for taue nu<-nu0+n # Posterior df for Sigmab

################# # GIBBS SAMPLER # ################# for (i in 1:nsim) { # Update Beta vbeta<-solve(T0+taue*crossprod(X,X)) mbeta<-vbeta%*%(T0%*%beta0 + taue*crossprod(X,y-Z%*%b)) beta<-c(rmvnorm(1,mbeta,vbeta))

# Update b precb<-diagn%x%Taub+taue*crossprod(Z) # Posterior precision mb<-taue*crossprod(Z,y-X%*%beta) # Likelihood contribution to posterior mean b<-rmvnorm.canonical(1,mb,precb)[1,] # Update without inverting using spam package

btmp[,]<-b # More efficient to fill in a pre-defined matrix object bmat<-t(btmp)

# Update taue zb<-Z%*%b g<-g0+crossprod(y-X%*%beta-zb)/2 taue<-rgamma(1,d,as.numeric(g))

# Update Taub Sigma.b<-riwish(nu,C0+crossprod(bmat,bmat)) Taub<-solve(Sigma.b)

################# # Store Results # ################# if (i> burn & i%%thin==0){ j<-(i-burn)/thin Beta[j,]<-beta Sigmae[j]<-1/taue Sigmab[j,]<-c(Sigma.b) }

if (i%%50==0) print(i) # Very slow! } 142 / 160 Conditional Update for bk

In general, with q random effects, the full-conditional update for b is multivariate normal with dimension qn

For example, the previous example with n = 1000 subjects and N = 5525 total observations took 2.4 minutes to run 1000 iterations

An alternative is to write out a conditional update for each n × 1 vector bk given the others, where

bk = (bk1,..., bkn)ᵀ,  k = 1,..., q.

For example, with q = 2, we first update b1 | b2, then b2 | b1

143 / 160 Conditional Update for bk

To do so, we first note that b1i depends on b2i only through the bivariate normal prior distribution

bi = (b1i, b2i)ᵀ ∼ N2(0, Σb), where

Σb = [ σ1²    σ12 ]  =  [ σ1²      ρσ1σ2 ]
     [ σ12    σ2² ]     [ ρσ1σ2    σ2²   ]

Appealing to properties of the bivariate normal distribution, the conditional prior for b1i | b2i is

b1i | b2i ∼ N(µ_{1|2}, σ²_{1|2}) = N( ρ(σ1/σ2) b2i , σ1²(1 − ρ²) )

Likewise,

b2i | b1i ∼ N(µ_{2|1}, σ²_{2|1}) = N( ρ(σ2/σ1) b1i , σ2²(1 − ρ²) )
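A tiny numerical check (toy values σ1² = σ2² = 2 and σ12 = 1, so ρ = 0.5) of the conditional prior of b1i given b2i:

s1 <- sqrt(2); s2 <- sqrt(2); rho <- 0.5
b2i <- 0.8
m12 <- rho * (s1 / s2) * b2i   # conditional prior mean of b1i | b2i
v12 <- s1^2 * (1 - rho^2)      # conditional prior variance
c(mean = m12, var = v12)       # 0.4 and 1.5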

144 / 160 Conditional Update for bk

Note that we've moved from a bivariate normal prior for bi to two univariate conditional priors

This allows us to construct the vectors b1 = (b11,..., b1n)ᵀ and b2 = (b21,..., b2n)ᵀ, with conditional priors

b1 | b2 ∼ Nn( ρ(σ1/σ2) b2 , σ1²(1 − ρ²) I_n )  and
b2 | b1 ∼ Nn( ρ(σ2/σ1) b1 , σ2²(1 − ρ²) I_n )

This leads to univariate full conditional updates for b1 and b2, which are much faster than jointly updating the 2n × 1 vector b = (b11, b21,..., b1n, b2n)ᵀ

Extensions to q > 2 random effects are straightforward: one only needs to appeal to conditional properties of the q-dimensional MVN distribution

See Neelon (2015, 2018) for details in the context of semicontinuous and zero-inflated data, including extensions to multivariate spatial data 145 / 160 Modified Gibbs Sampler

This leads to a modied Gibbs sampler:

1 Update β, τe , Σb as before

2 From Σb, form

ρ = σ12 / (σ1σ2)
τ_{1|2} = 1 / [σ1²(1 − ρ²)]
τ_{2|1} = 1 / [σ2²(1 − ρ²)]

3. For i = 1,..., n, update b1i given b2i from N(mi, vi), where

   vi = 1/(τ_{1|2} + τe ni)   (note the similarity to the random intercept model)
   mi = vi [ τ_{1|2} µ_{1|2} + τe ∑_{j=1}^{ni} (yij − xijᵀβ − tij b2i) ]
   µ_{1|2} = ρ(σ1/σ2) b2i

4. For i = 1,..., n, update b2i given b1i from N(mi, vi), where

   vi = 1/(τ_{2|1} + τe ∑_{j=1}^{ni} tij²)
   mi = vi [ τ_{2|1} µ_{2|1} + τe ∑_{j=1}^{ni} tij (yij − xijᵀβ − b1i) ]
   µ_{2|1} = ρ(σ2/σ1) b1i

See Random Slope Conditional.r for details

146 / 160 R Code for Conditional Gibbs Sampler

Conditional Gibbs Sampler for Random Slope Model

# Inits taue<-tau12<-tau21<-1 # Error precision = 1/sigma2, conditional precision of b1|b2, etc sigmab1<-sigmab2<-1 # Marginal variances of b1 and b2 rhob<-0 # Corr(b1,b2) b1<-b2<-rep(0,n) # Random effects (int and slope) beta<-rep(0,p) # Posterior mean and var of beta

# Gibbs for (i in 1:nsim) { # Update beta vbeta<-solve(prec0+taue*crossprod(X,X)) mbeta<-vbeta%*%(prec0%*%beta0 + taue*crossprod(X,y-rep(b1,nis)-t*rep(b2,nis))) beta<-c(rmvnorm(1,mbeta,vbeta))

# Update b1|b2 vb<-1/(tau12+taue*nis) mu12<-rhob*sqrt(sigmab1/sigmab2)*b2 # Prior Mean of b1|b2 mb<-vb*(tau12*mu12+taue*tapply(y-X%*%beta-t*rep(b2,nis),id,sum)) b1<-rnorm(n,mb,sqrt(vb))

# Update b2|b1 vb<-1/(tau21+taue*tapply(t^2,id,sum)) mu21<-rhob*sqrt(sigmab2/sigmab1)*b1 # Prior Mean of b2|b1 mb<-vb*(tau21*mu21+taue*tapply(t*(y-X%*%beta-rep(b1,nis)),id,sum)) b2<-rnorm(n,mb,sqrt(vb))

# Update taue g<-g0+crossprod(y-X%*%beta-rep(b1,nis)-t*rep(b2,nis))/2 taue<-rgamma(1,d0+N/2,g)

# Update Sigma.b bmat<-cbind(b1,b2) Sigma.b<-riwish(nu0+n,C0+crossprod(bmat)) sigmab1<-Sigma.b[1,1] # Marginal variance of b1 sigmab2<-Sigma.b[2,2] # Marginal variance of b2 rhob<-Sigma.b[1,2]/sqrt(sigmab1*sigmab2) # Corr(b1,b2) tau12<-1/(sigmab1*(1-rhob^2)) # Conditional precision of b1|b2 tau21<-1/(sigmab2*(1-rhob^2)) # Conditional precision of b2|b1 ################# # Store Results # ################# if (i> burn & i%%thin==0){ j<-(i-burn)/thin Beta[j,]<-beta Sigmae[j]<-1/taue Sigmab[j,]<-c(Sigma.b) B1[j,]<-b1[1:10] # Store first 10 random effects B2[j,]<-b2[1:10] } } # 66 seconds to run 10,000 iterations with n=1000

147 / 160 Example: The program Random Slope Conditional.r fits the following random slope model:

Yij = β0 + β1tij + b1i + b2i tij + eij,  i = 1,..., 1000; j = 1,..., ni
bi ∼ N2(0, Σb)  (iid)
eij ∼ N(0, σe²)  (iid)

The results, based on 10,000 iterations with a burn-in of 5000 and thinning of 5, are:

Table 11: Results from Conditional Gibbs Algorithm.

Parameter   True Value   MLE (SE)†      Posterior Mean (SD)
β0          −1           −1.03 (0.06)   −1.03 (0.06)
β1          1            0.97 (0.05)    0.96 (0.04)
σ1²         2            2.04 (0.16)    2.01 (0.15)
σ12         1            1.08 (0.09)    1.08 (0.09)
σ2²         2            2.04 (0.10)    2.04 (0.10)
σe²         2            2.04 (0.05)    2.05 (0.04)
† From SAS Proc Mixed.

148 / 160 Comments and Extensions to GLMMs

The conditional approach is especially useful for joint modeling (e.g., correlated outcomes or ZI models with joint effects)

It easily accommodates different sample sizes for each outcome

More generally, Bayesian random effects models can be extended to accommodate discrete outcomes

149 / 160 Logistic Random Intercept Model

Consider the following logistic random intercept model

logit[Pr(Yij = 1)] = logit(φij) = xijᵀβ + bi
bi ∼ N(0, τb⁻¹),  i = 1,..., n; j = 1,..., ni.

We can modify the Pólya-Gamma data-augmentation sampler to accommodate the random effect bi:

1 For all (i, j), update the PG weights, ωij , to include bi

2 Form the corresponding latent normal variables, Zij

3 Conditional on the latent normal variables, update β and bi (i = 1,..., n) from the LMM formulas described earlier

150 / 160 Gibbs Sampler for Logistic Random Intercept Model

Specically, the Gibbs sampler proceeds as follows:

1. For all (i, j), update ωij from PG(1, ηij), where ηij = xijᵀβ + bi

2. For all (i, j), define zij = (yij − 1/2)/ωij

3. Update β from Np(m, V), where

   V = (T0 + XᵀΩX)⁻¹
   m = V( T0β0 + XᵀΩ(z − Wb) ),

   where Ω = diag(ωij) is N × N and W denotes the N × n random effects design matrix.

4. For all i, update bi from N(mi, vi), where

   vi = 1/(τb + ∑_{j=1}^{ni} ωij)
   mi = vi Wiᵀ Ωi (zi − Xiβ) = vi ∑_{j=1}^{ni} ωij (zij − xijᵀβ)

5. Update τb from Ga( c + n/2, d + bᵀb/2 ), where c and d are the Gamma hyperparameters

See Logistic Random Intercept.r for details 151 / 160 R Code for Gibbs Sampler

Gibbs Sampler for Logistic Random Intercept Model

# Priors beta0<-rep(0,p) # Prior mean for beta T0<-diag(.01,p) # Prior precision for beta c<-d<-0.01 # Gamma hyperpriors

# Inits beta<-rep(0,p) b<-rep(0,n) # Random effect taub<-1 # Random effect precision

######### # Gibbs # ######### tmp<-proc.time()

for (i in 1:nsim){

# Update z mu<-X%*%beta+rep(b,nis) omega<-rpg(N,1,mu) z<-(y-1/2)/omega

# Update beta v<-solve(crossprod(sqrt(omega)*X)+T0) m<-v%*%(T0%*%beta0+t(sqrt(omega)*X)%*%(sqrt(omega)*(z-rep(b,nis)))) beta<-c(rmvnorm(1,m,v))

# Update b vb<-1/(taub+tapply(omega,id,sum)) mb<-vb*(tapply(omega*(z-X%*%beta),id,sum)) b<-rnorm(n,mb,sqrt(vb))

# Update taub taub<-rgamma(1,c+n/2,d+crossprod(b)/2)

# Store if (i> burn & i%%thin==0){ j<-(i-burn)/thin Beta[j,]<-beta Sigmab[j]<-1/taub }

if (i%%100==0) print(i) }

tot.time<-proc.time()-tmp # 8.5 secs to run 1000 iterations 152 / 160

Negative Binomial Random Intercept Model

The Gibbs sampler for the negative binomial model has a similar form:

1. For all (i, j), draw ωij from its PG(yij + r, ηij) distribution, where ηij = xijᵀβ + bi

2. For all (i, j), define zij = (yij − r)/(2ωij)

3 Update β from its Np(m, V ) full conditional, where expressions for m and V are identical to the logistic GLMM

4 Update b and τb from their full conditionals analogous to the ones for the logistic model

5 Update r using random-walk MH or a conjugate Gamma update following Dadaneh et al.

See NB Random Intercept.r for details 153 / 160 R Code for Hybrid Gibbs-MH Sampler

Gibbs Sampler for NB Random Intercept Model

# Priors similar to fixed effects NB model, but with gamma for taub

# Inits beta<-rep(0,p) b<-rep(0,n) # Random effects taub<-1 # Random effect precision r<-1 # Inverse Dispersion Acc<-0 # Acceptance rate indicator s<-.005 # Proposal variance

######## # MCMC # ######## for (i in 1:nsim){ # Update z and beta eta<-X%*%beta+rep(b,nis) omega<-rpg(N,y+r,eta) # Polya weights z<-(y-r)/(2*omega) v<-solve(crossprod(X*sqrt(omega))+T0) m<-v%*%(T0%*%beta0+t(sqrt(omega)*X)%*%(sqrt(omega)*(z-rep(b,nis)))) beta<-c(rmvnorm(1,m,v))

# Update b vb<-1/(taub+tapply(omega,id,sum)) mb<-vb*(tapply(omega*(z-X%*%beta),id,sum)) b<-rnorm(n,mb,sqrt(vb))

# Update taub taub<-rgamma(1,c+n/2,d+crossprod(b)/2)

# Update r via random-walk MH with a truncated-normal proposal q<-1/(1+exp(eta)) # dnbinom uses prob=1-p rnew<-rtnorm(1,r,sqrt(s),lower=0) # Proposed value ratio<-sum(dnbinom(y,rnew,q,log=T))-sum(dnbinom(y,r,q,log=T))+ dtnorm(r,rnew,sqrt(s),0,log=T)-dtnorm(rnew,r,sqrt(s),0,log=T) # Likelihood ratio plus Hastings correction for the asymmetric proposal if (log(runif(1))<ratio) {r<-rnew; Acc<-Acc+1} # Accept/reject; Acc tracks acceptances

if (i> burn & i%%thin==0){ j<-(i-burn)/thin Beta[j,]<-beta Sigmab[j]<-1/taub R[j]<-r }

if (i%%100==0) print(i) # 23 secs to run 1000 iterations with n=1000 and N=5515 } 154 / 160

R Code with Gibbs update for r

Gibbs Sampler for NB Random Intercept Model with Gibbs Update for r

# Priors similar to fixed effects NB gibbs model, but with gamma for taub

# Inits beta<-rep(0,p) b<-rep(0,n) # Random effects taub<-1 # Random effect precision r<-1 # Inverse Dispersion

######## # MCMC # ######## for (i in 1:nsim){

# Update z eta<-X%*%beta+rep(b,nis) omega<-rpg(N,y+r,eta) # Polya weights z<-(y-r)/(2*omega)

# Update beta v<-solve(crossprod(X*sqrt(omega))+T0) m<-v%*%(T0%*%beta0+t(sqrt(omega)*X)%*%(sqrt(omega)*(z-rep(b,nis)))) beta<-c(rmvnorm(1,m,v))

# Update b vb<-1/(taub+tapply(omega,id,sum)) mb<-vb*(tapply(omega*(z-X%*%beta),id,sum)) b<-rnorm(n,mb,sqrt(vb))

# Update taub taub<-rgamma(1,c+n/2,d+crossprod(b)/2)

# Update latent counts, l, using CRT distribution for(j in 1:N) l[j]<- sum(rbinom(y[j],1,round(r/(r+1:y[j]-1),6))) # Could try to avoid loop # Rounding avoids numerical instability # Update r from conjugate gamma distribution given l and psi eta<-X%*%beta+rep(b,nis) psi<-exp(eta)/(1+exp(eta)) r<-rgamma(1,a+sum(l),b-sum(log(1-psi)))

if (i> burn & i%%thin==0){ j<-(i-burn)/thin Beta[j,]<-beta Sigmab[j]<-1/taub R[j]<-r B[j,]<-b }

if (i%%100==0) print(i) # 49 secs to run 1000 iterations with n=1000 and N=5515 } 155 / 160

Joint Probit-Normal Model

The conditional prior structure can be extended to accommodate joint models that are linked via correlated random eects

For example, consider the following bivariate probit-normal regression model for a longitudinal binary variable Y1 and a longitudinal continuous variable Y2:

Φ⁻¹[Pr(Y1ij = 1)] = xijᵀβ + b1i
Y2ij = xijᵀγ + b2i + eij

(b1i, b2i)ᵀ ∼ N2(0, Σb)

Σb = [ σ1²   σ12 ]
     [ σ12   σ2² ]

eij ∼ N(0, τe⁻¹),  i = 1,..., n; j = 1,..., ni

156 / 160 Gibbs Sampler for the Probit-Normal Model

Using the conditional update for bi , the Gibbs sampler for the probit-normal model is straightforward:

1 Augment the Albert & Chib sampler with a random intercept to update the probit model parameters

2 Fit a linear random intercept model to update the linear model parameters

3 Update the random intercepts using the conditional formula described above

4 Update Σb from its IW full conditional

Note that the association between the outcomes comes from σ12, the off-diagonal of Σb, which is a fairly weak form of dependence 157 / 160 Gibbs Sampler for the Probit-Normal Model

The specic details are:

1 Probit model:

• For all (i, j), draw a latent normal zij from a N(xijᵀβ + b1i, 1) distribution truncated below (above) by 0 for Y1ij = 1 (= 0)

• Conditional on z, update β and b1i from their normal full conditionals using expressions for a linear random intercept model

2 Linear model:

• Update γ and b2i from their normal full conditionals using expressions for a linear random intercept model

3 Update Σb from its conjugate IW full conditional and retrieve τ1|2, τ2|1, and ρ as described earlier

See Probit-Normal Random Intercept.r for details

158 / 160 R Code for Gibbs Sampler

Gibbs Sampler for Probit-Normal Intercept Model

# Priors same as for probit and linear model with IW prior for (b1,b2)

vbeta<-solve(T0+crossprod(X,X)) # Posterior Var(beta) -- can do outside loop

# GIBBS SAMPLER for (i in 1:nsim) { # Binomial Model # Draw Latent Variable, z muz<-X%*%beta+rep(b1,nis) # Mean of z z[y1==0]<-qnorm(runif(N,0,pnorm(0,muz)),muz)[y1==0] z[y1==1]<-qnorm(runif(N,pnorm(0,muz),1),muz)[y1==1]

# Update beta mbeta <- vbeta%*%(T0%*%beta0+crossprod(X,z-rep(b1,nis))) beta<-c(rmvnorm(1,mbeta,vbeta))

# Update b1|b2 taub12<-taub1/(1-rhob^2) # Prior precision of b1|b2 mb12<-rhob*sqrt(taub2/taub1)*b2 # Prior mean of b1|b2 vb1<-1/(nis+taub12) # Posterior var of b1|b2,y mb1<-vb1*(taub12*mb12+tapply(z-X%*%beta,id,sum)) # Posterior mean of b1|b2,y b1<-rnorm(n,mb1,sqrt(vb1))

# Linear Model # Update gamma vgam<-solve(G0+taue*crossprod(X)) # G0 is prior precision of gamma mgam<-vgam%*%(G0%*%gamma0 + taue*crossprod(X,y2-rep(b2,nis))) gamma<-c(rmvnorm(1,mgam,vgam))

# Update taue l<-l0+crossprod(y2-X%*%gamma-rep(b2,nis))/2 taue<-rgamma(1,d0+N/2,l)

# Update b2|b1 taub21<-taub2/(1-rhob^2) # Prior precision of b2|b1 mb21<-rhob*sqrt(taub1/taub2)*b1 # Prior mean of b2|b1 vb2<-1/(taue*nis+taub21) # Posterior var of b2|b1 mb2<-vb2*(taub21*mb21+taue*tapply(y2-X%*%gamma,id,sum)) b2<-rnorm(n,mb2,sqrt(vb2)) b<-cbind(b1,b2)

# Update variance components Sigmab<-riwish(nu0+n,c0+crossprod(b)) rhob<-Sigmab[1,2]/sqrt(Sigmab[1,1]*Sigmab[2,2]) taub1<-1/Sigmab[1,1] taub2<-1/Sigmab[2,2] } # 7.4 seconds to run 1000 iterations with n=1000 and N=5441 159 / 160

A Look Ahead

There are countless settings in which Bayesian inference is useful, including

• Latent class and state-space modeling

• Variable selection

• Density estimation

• Spatial data analysis and disease mapping

• Pattern recognition and language processing

• Functional data analysis

Please see me if you're interested in working on Bayesian methods for your dissertation

Homework: The final homework will be posted on the course website this evening and will be due via e-mail 2 weeks from today at 5pm 160 / 160