Generalized linear models Further extensions

Regression analysis: Extensions

Patrick Breheny

February 12

Patrick Breheny BST 701: Bayesian Modeling in 1/32 Robust regression Nonlinear regression Generalized linear models Further extensions Introduction

Last time we explored the basic model from a Bayesian perspective Today, we will look at various ways in which the linear regression model can be extended Specifically, we will consider four examples, all of which have some sort of frequentist analog; however, as we will see, the flexibility of the Bayesian approach makes these and further extensions straightforward to implement and interpret

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 2/32 Robust regression Nonlinear regression Generalized linear models Further extensions Scottish hill races

Our first extension is robust regression: the normality assumption of OLS linear regression renders it quite sensitive to A classic set in the this literature pertains to hill racing (apparently a somewhat popular sport in Scotland) The hills.txt contains information on the winning times in 1984 for 35 Scottish hill races, as well as two factors which presumably influence the duration of the race: dist: The distance of the race (in miles) climb: The elevation change (in feet)

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 3/32 Robust regression Nonlinear regression Generalized linear models Further extensions Residuals from OLS fit

Fitting a model for time assuming additive linear relationships for dist and climb, we obtain the following residuals:

● 60 40

● 20 Residual

● ● ● ● ● ●● ●● ● ● 0 ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

50 100 150

Fitted value

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 4/32 Robust regression Nonlinear regression Generalized linear models Further extensions Using a thicker-tailed distribution

To reduce the impact of the fit of the model, we can replace the normal distribution with a thicker-tailed distribution A natural choice is the t-distribution, which can be implemented by simply replacing the normal likelihood with: time[i] ~ dt(mu[i], tau, nu) Recall that as ν → ∞, the t-distribution resembles the normal, but for small ν has considerably thicker tails

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 5/32 Robust regression Nonlinear regression Generalized linear models Further extensions Contrasting the two posteriors

Climb (100 fit) Dist. (1 mi) SD Mean SD Normal 1.11 0.21 6.21 0.62 t5 0.81 0.15 6.59 0.29

Recall that there are two large outliers in this data set; as they are in some sense downweighted, there is a modest change in the posterior (the posterior mean for distance goes up, while the one for climb goes down), and a sizeable drop (roughly 2-fold) in the posterior SD

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 6/32 Robust regression Nonlinear regression Generalized linear models Further extensions Setting a prior on ν

Of course, one may ask, why a t5 distribution? Since we do not actually know ν, it would be more reasonable to include ν as a in our model; the only condition is that we must place a prior on it A reasonably uninformative prior would seem to be ν ∝ ν−1; i.e., nu ∼ dgamma(.001,.001) in BUGS

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 7/32 Robust regression Nonlinear regression Generalized linear models Further extensions Posterior

Climb (100 ft): Mean SD 0.8 Normal 1.11 0.21 t5 0.81 0.15 0.6 ν ∝ ν−1 0.69 0.10

0.4 Density Distance (1 mi): 0.2 Mean SD Normal 6.21 0.62 0.0 t5 6.59 0.29 0 1 2 3 4 5 ν ∝ ν−1 6.56 0.24 ν

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 8/32 Robust regression Nonlinear regression Generalized linear models Further extensions Puromycin data

Another desirable extension is the ability to fit a nonlinear model Our illustrating data set here is a study of the reaction kinetics of an enzyme called galactosyltransferase The recorded variables are Conc, the concentration of the enzyme’s substrate (in ppm) and Rate, the reaction rate (in DPM/min) Furthermore, there were two experimental groups: 12 from cells treated with an antibiotic called puromycin, and an untreated control group of size 11

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 9/32 Robust regression Nonlinear regression Generalized linear models Further extensions Puromycin data: Illustration

0.0 0.2 0.4 0.6 0.8 1.0 treated untreated ● 200 ● ● ●

● ● ● ● 150 ● ● ● ● ● ● Rate ● 100 ● ● ● ● ●

50 ● ●

0.0 0.2 0.4 0.6 0.8 1.0 Conc

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 10/32 Robust regression Nonlinear regression Generalized linear models Further extensions MichaelisMenten kinetics

The standard model for the study of these kinds of reactions is the Michaelis-Menten model: V [S] v = m , K + [S]

where v is the reaction rate, [S] is the substrate concentration, Vm is the maximum rate, and K is the substrate concentration at which v is one-half Vm

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 11/32 Robust regression Nonlinear regression Generalized linear models Further extensions BUGS code

We can implement this model in BUGS as follows:

for (i in 1:n) { Rate[i] ~ dnorm(mu[i], tau) mu[i] <- (Vm[State[i]]*Conc[i])/(K[State[i]]+Conc[i]) } for (j in 1:2) { Vm[j] ~ dunif(0,500) K[j] ~ dunif(0,2) } tau ~ dgamma(0.001, 0.001)

Note the use of “nested indexing” to match observation i up with its correct state

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 12/32 Robust regression Nonlinear regression Generalized linear models Further extensions

Posterior: Vm

treated untreated

0.06

0.05 90% CI 0.04 Mean Lower Upper 0.03 Treated 213.6 202.3 225.4 Density 0.02 Untreated 161.5 149.0 175.1

0.01 ∆ 52.0 35.0 69.6

0.00

140 160 180 200 220 240

Vm

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 13/32 Robust regression Nonlinear regression Generalized linear models Further extensions Posterior: K

treated untreated

50

40 90% CI 30 Mean Lower Upper Treated 0.066 0.052 0.081 Density 20 Untreated 0.050 0.035 0.068 10 ∆ 0.015 -0.006 0.037

0

0.02 0.04 0.06 0.08 0.10 0.12

Km

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 14/32 Robust regression Nonlinear regression Generalized linear models Further extensions Posterior: Efficiency

A commonly used measure of enzyme efficiency is Vm/K, the slope at [S] = 0:

treated untreated

0.0012

0.0010 90% CI 0.0008 Mean Lower Upper 0.0006 Treated 3295 2750 3908 0.0004 Untreated 3324 2524 4297

0.0002 ∆ -29 -1163 984

0.0000

2000 3000 4000 5000 6000 7000

Efficiency

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 15/32 Robust regression Nonlinear regression Generalized linear models Further extensions Fitted curves

treated untreated

250 250

● ● ● 200 ● 200

● ● ● ● ● 150 ● 150 ● ● ● ● ● Rate Rate 100 ● 100 ● ● ● ● 50 ● 50 ●

0 0

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

Concentration Concentration

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 16/32 Robust regression Nonlinear regression Generalized linear models Further extensions Overview

Of course, a very important class of extensions to linear regression are the generalized linear models (GLMs) Analogous to frequentist GLMs, in which there is no closed-form solution for βb, we do not generally have conjugate relationships in Bayesian GLMs Nevertheless, fitting Bayesian GLMs is straightforward with MCMC; indeed, the MCMC approach is more flexible than the iteratively reweighted approach used in frequentist GLMs, and can even applied to distributions outside the

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 17/32 Robust regression Nonlinear regression Generalized linear models Further extensions

Perhaps the common specific model in the GLM family is logistic regression:

Yi|θi ∼ Binom(ni, θi)

g(θi) = ηi T ηi = xi β In logistic regression, we use the “canonical” link:   θi ηi = log 1 − θi eηi θi = ; 1 + eηi

this function is known as the “logit” of θi

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 18/32 Robust regression Nonlinear regression Generalized linear models Further extensions Beetle data

A classic data set used to illustrate logistic regression is the beetle mortality data from Bliss (1935) The data (beetles.txt) consists of 8 in which beetles were exposed to various concentrations of carbon disulphide (a fumigant) for five hours The response variable is the number of beetles killed by the fumigant, with dose being the explanatory variable

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 19/32 Robust regression Nonlinear regression Generalized linear models Further extensions Beetle data: summary

Dose n Killed 1.69 59 6 1.72 60 13 1.76 62 18 1.78 56 28 1.81 63 52 1.84 59 53 1.86 62 61 1.88 60 60

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 20/32 Robust regression Nonlinear regression Generalized linear models Further extensions Implementation

The logistic regression model is easily implemented in BUGS:

## Likelihood for (i in 1:N) { y[i] ~ dbinom(theta[i], n[i]) logit(theta[i]) <- beta[1] + beta[2]*Dose[i] } ## Prior for (j in 1:2) { beta[j] ~ dnorm(0, 0.0001) }

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 21/32 Robust regression Nonlinear regression Generalized linear models Further extensions Posterior

1.0 ● ●

● 0.8 Posterior and 95% CI 0.6 for a difference of 0.05 units: ●

P(Kill) 0.4

● Mean Lower Upper 0.2 ● 5.8 4.5 7.4 ●

0.0

1.65 1.70 1.75 1.80 1.85 1.90

Dose

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 22/32 Robust regression Nonlinear regression Generalized linear models Further extensions Lethal dose

A common quantity of interest in a study such as this one is the appropriate dose to kill a certain percent of beetles, typically abbreviated LD50 for the lethal dose For a specified percent, this is merely a function of the and thus straightforward to sample from:

  π   LD = β−1 log − β π 1 1 − π 0

For this , we have LD50 = 1.77(1.765, 1.778) and LD99 = 1.90(1.89, 1.93)

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 23/32 Robust regression Nonlinear regression Generalized linear models Further extensions Introduction

One of the strengths of BUGS and MCMC approaches is the ease with which models can be extended in order to account for the complexities of real data We will look at one such example here, in which we extend a model with a to account for an excess of zeros in the data

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 24/32 Robust regression Nonlinear regression Generalized linear models Further extensions Rooting dataset

Our data here come from a horticultural experiment in which apple shoots of the “Trajan” cultivar were grown under various experimental conditions The outcome variable is the number of roots produced by the plant Two possible explanatory variables are Photoperiod, the length of daily exposure to light (in hours) and Dose, the soil concentration of a plant growth ctyokinin called BAP

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 25/32 Robust regression Nonlinear regression Generalized linear models Further extensions Roots data

0.20

0.15

Percent 0.10

0.05

0.00 0 1 2 3 4 5 6 7 8 9 10 12 14 16

Roots

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 26/32 Robust regression Nonlinear regression Generalized linear models Further extensions Mixture distribution

The covariates are able to explain some of the mass at zero, but it is clear that there are simply far more zeros in the data set than the can account for One possibility, then, is to assume that Y follows a mixture distribution: ( Pois(µ) with probability π Y |µ, π ∼ 0 with probability 1-π

This sort of model is known as a zero-inflated Poisson, or ZIP model

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 27/32 Robust regression Nonlinear regression Generalized linear models Further extensions ZIP model

In the presence of covariates, we would need to include a model for how µ and π depend on the explanatory variables, the most natural model being

 π  log(µ) = Xβ log = Zγ 1 − π

This model allows for different effects and even different covariates to be involved with each aspect of the mixture; a simpler and more stable, albeit less flexible, model is to assume  π  log(µ) = Xβ log = τXβ 1 − π

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 28/32 Robust regression Nonlinear regression Generalized linear models Further extensions BUGS implementation

Here is an implementation in BUGS taking the first approach, and leaving out Dose for the sake of simplicity

## Likelihood for (i in 1:n) { Roots[i] ~ dpois(m[i]) m[i] <- group[i] * mu[Photoperiod[i]] group[i] ~ dbern(pi[Photoperiod[i]]) } ## Prior for (j in 1:2) { mu[j] ~ dgamma(0.5, 0.0001) pi[j] ~ dunif(0, 1) }

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 29/32 Robust regression Nonlinear regression Generalized linear models Further extensions Posterior: π

8 16

30 Posterior means, 90% intervals

20 Mean Lower Upper

Density 8 hr 0.980 0.957 0.995

10 16 hr 0.525 0.453 0.598

0

0.4 0.5 0.6 0.7 0.8 0.9 1.0

π

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 30/32 Robust regression Nonlinear regression Generalized linear models Further extensions Posterior: µ

8 16

1.5 Posterior means, 90% intervals

1.0 Mean Lower Upper 8 hr 7.2 6.8 7.6 Density 16 hr 5.5 5.0 5.9 0.5 ∆ 1.7 1.1 2.3

0.0

5 6 7 8 µ Note that the actual difference in means in the two groups is much larger than the difference in means for the Poisson portion of the mixture

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 31/32 Robust regression Nonlinear regression Generalized linear models Further extensions Distribution

Observed Fit 8 hr 16 hr 0.5 0.4 0.20 0.3 0.2 0.10 Probability Probability 0.1 0.0 0.00 0 5 10 15 0 5 10 15

Roots Roots

Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 32/32