
STAT 405 – BIOSTATISTICS (Fall 2016) Handout 17 – Poisson Regression

Example 1: Mating Success of African Elephants

In this study, 41 male African elephants were followed over a period of 8 years. The age of each elephant at the beginning of the study and the number of successful matings during the 8 years were recorded. The objective was to learn whether older animals are more successful at mating or whether they have diminished success after reaching a certain age. The data set Elephants.csv contains the following variables:

• Y = Number of matings in the 8-year follow-up period (Matings)
• X = Age (yrs.) of elephant at the start of the study (Age)

> plot(Matings~Age)

Consider fitting a model using ordinary (OLS) regression:

> ele.lm = lm(Matings~Age, data=Elephants)
> summary(ele.lm)

Call:
lm(formula = Matings ~ Age, data = Elephants)

Residuals:
    Min      1Q  Median      3Q     Max
-4.1158 -1.3087 -0.1082  0.8892  4.8842

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.50589    1.61899  -2.783  0.00826 **
Age          0.20050    0.04443   4.513 5.75e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.849 on 39 degrees of freedom
Multiple R-squared:  0.343,  Adjusted R-squared:  0.3262
F-statistic: 20.36 on 1 and 39 DF,  p-value: 5.749e-05

> abline(ele.lm)

> resplot(ele.lm)

Do these plots suggest any violations of the OLS regression assumptions?

One approach to correcting the problem is to transform the response using a variance-stabilizing transformation, such as the square root.

> elesq.lm = lm(sqrt(Matings)~Age,data=Elephants)
> summary(elesq.lm)

Call:
lm(formula = sqrt(Matings) ~ Age, data = Elephants)

Residuals:
     Min       1Q   Median       3Q      Max
-1.90532 -0.33654  0.07767  0.45871  1.09468

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.81220    0.56867  -1.428 0.161187
Age          0.06320    0.01561   4.049 0.000236 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6493 on 39 degrees of freedom
Multiple R-squared:  0.296,  Adjusted R-squared:  0.2779
F-statistic:  16.4 on 1 and 39 DF,  p-value: 0.0002362

> resplot(elesq.lm)

While this may seem like a satisfactory model, the model coefficients are difficult to interpret because the response is now on the square-root scale.

In this case, Poisson regression may be a better option.

Introduction to Poisson Regression

Recall that the Poisson distribution is given by

$$P(Y = y) = \frac{e^{-\mu}\mu^{y}}{y!}, \qquad y = 0, 1, 2, \ldots \ \text{and} \ \mu > 0.$$

The response $Y$ is a discrete random variable that represents the number of occurrences per time or space unit. In Poisson regression, we seek a model for the mean of the response ($\mu$) as a function of terms based upon a set of predictors $x_1, x_2, \ldots, x_p$. For a Poisson random variable, the mean and variance are both $\mu$, so the variance changes whenever the mean does. Traditional OLS regression will therefore not be adequate, because the constant error variance assumption would be violated.
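The probability function and the mean-equals-variance property can be checked numerically. The short R sketch below is only an illustration: the mean `mu <- 3`, the seed, and the sample size are arbitrary choices and do not come from the Elephants data.

```r
# Poisson probabilities from the formula versus R's built-in dpois()
mu <- 3                      # hypothetical mean, chosen only for illustration
y  <- 0:10
p_formula <- exp(-mu) * mu^y / factorial(y)
p_builtin <- dpois(y, lambda = mu)
all.equal(p_formula, p_builtin)

# Simulated counts: the sample mean and sample variance should both be near mu
set.seed(405)                # arbitrary seed so the run is reproducible
counts <- rpois(10000, lambda = mu)
c(mean = mean(counts), variance = var(counts))
```

Repeating the simulation with a larger `mu` shows the spread of the counts growing along with the mean, which is the pattern that breaks the constant-variance assumption.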

The model that we have been studying is one type of a broader class of models called Generalized Linear Models. Generalized linear models are an extension of regular linear models that allow: (1) the mean of a population to depend on a linear function of terms through a nonlinear link function and (2) the response to be any member of a special class of distributions referred to as the exponential family. The exponential family contains the normal distribution (used in OLS), the binomial distribution (used in logistic regression), and the Poisson distribution.

The link function is a function that relates the mean of the response, $\mu_i = E(Y_i)$, linearly to a set of terms based on the explanatory variables (i.e., the predictors).

OLS Regression

For a normally distributed response, the link function is the identity function, $g(\mu) = \mu$; thus,

$$g(\mu) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}.$$

We typically write the model for the mean as follows:

$$E(Y \mid \mathbf{X}) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}.$$

Logistic Regression

For a binomial response, we know that

$$g(\mu) = \ln\left(\frac{\mu}{1-\mu}\right) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}.$$

We expressed this as:

$$\ln\left(\frac{\theta(\tilde{x})}{1-\theta(\tilde{x})}\right) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}.$$

Poisson Regression

For a Poisson distributed response variable, the link function is $g(\mu) = \ln(\mu)$; so,

$$\ln(\mu) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}.$$

Thus,

$$\mu = \exp(\eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}).$$
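The three link functions above are the defaults stored in R's family objects, which is one way to see how `lm()`-style, logistic, and Poisson models fit into the same framework. The sketch below only inspects those objects; the value `eta <- 0.5` is a hypothetical linear predictor used for illustration.

```r
# Default link functions stored in R's family objects
gaussian()$link   # "identity"  (OLS)
binomial()$link   # "logit"     (logistic regression)
poisson()$link    # "log"       (Poisson regression)

# For the Poisson family, the inverse link maps the linear predictor to the mean
eta <- 0.5                            # hypothetical value of the linear predictor
mu  <- poisson()$linkinv(eta)         # inverse link: mu = exp(eta)
all.equal(mu, exp(eta))
all.equal(poisson()$linkfun(mu), eta) # link: ln(mu) = eta
```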

Fitting the Poisson Regression Model in R

As the number of matings per 8-year period is likely to be well modeled by a Poisson distribution, we will now consider a Poisson regression model for the Elephants data.

> ele.glm = glm(Matings~Age,family="poisson")
> summary(ele.glm)

Call:
glm(formula = Matings ~ Age, family = "poisson")

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.80798  -0.86137  -0.08629   0.60087   2.17777

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.58201    0.54462  -2.905  0.00368 **
Age          0.06869    0.01375   4.997 5.81e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 75.372  on 40  degrees of freedom
Residual deviance: 51.012  on 39  degrees of freedom
AIC: 156.46

Number of Fisher Scoring iterations: 5
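To see what the fitted model says on the original count scale, the estimated mean can be computed directly from the coefficients as exp(intercept + slope × Age). The sketch below hard-codes the rounded estimates printed above; the ages 30, 40, and 50 are hypothetical values chosen for illustration and should be checked against the observed age range before being interpreted.

```r
# Rounded estimates copied from summary(ele.glm)
b0 <- -1.58201   # (Intercept)
b1 <-  0.06869   # Age

age    <- c(30, 40, 50)          # hypothetical ages, for illustration only
mu_hat <- exp(b0 + b1 * age)     # estimated mean number of matings in 8 years
data.frame(age, mu_hat)

# With the fitted object in the workspace, the equivalent call would be
# predict(ele.glm, newdata = data.frame(Age = age), type = "response")
```

Because the model is linear on the log scale, each 10-year step in `age` multiplies `mu_hat` by the same factor, `exp(10 * b1)`.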

> par(mfrow=c(2,2))
> plot(ele.glm)

> par(mfrow=c(1,1))
> plot(Age,Matings,xlab="Age of Elephant",ylab="Num. of Matings")
> ord = order(Age)   # sort by Age so the fitted curve is drawn from left to right
> lines(Age[ord],fitted(ele.glm)[ord])
> title(main="Plot of Matings vs. Age of Elephant w/ Poisson Fit")

Interpretation of Coefficients in the Poisson Regression Model

The coefficients in the Poisson regression model can be interpreted as follows. Assume that we change one of the explanatory terms (for example, the first one) by one unit from u to u+1 while holding all other terms fixed. The percent increase (or decrease) in the mean response can then be calculated as follows:

$$100 \times \frac{\exp(\eta_0 + \eta_1 (u+1) + \cdots + \eta_{k-1} u_{k-1}) - \exp(\eta_0 + \eta_1 u + \cdots + \eta_{k-1} u_{k-1})}{\exp(\eta_0 + \eta_1 u + \cdots + \eta_{k-1} u_{k-1})} = 100\,[\exp(\eta_1) - 1]\%.$$

Alternatively, we can simply take the ratio

$$\frac{\exp(\eta_0 + \eta_1 (u+1) + \cdots + \eta_{k-1} u_{k-1})}{\exp(\eta_0 + \eta_1 u + \cdots + \eta_{k-1} u_{k-1})} = e^{\eta_1},$$

which says the mean of the response is multiplied by a factor of $e^{\eta_1}$ per unit increase in the term $u_1$.

Interpretation of the estimated coefficient for Age:

The estimated coefficient for Age is $\hat{\eta}_1 = .0687$. Thus, we have a $100[e^{.0687} - 1] = 7.11\%$ increase in the mean number of matings in the 8-year period per one year of age at the start of the study. Expressed as a multiplicative increase, this would be 1.0711.

For a 5-year difference in initial age, we would expect a $100[e^{5 \times 0.0687} - 1] = 40.99\%$ increase in the mean number of matings in the following 8-year period. Expressed as a multiplicative increase, this would be 1.4099.
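The two percentages above can be reproduced with a few lines of R. This is a minimal sketch that uses the rounded estimate .0687 from the text; the helper `pct_change()` and its argument names are hypothetical and exist only for this illustration.

```r
b1 <- 0.0687   # rounded Age coefficient used in the text

# Percent change in the mean response for a difference of `delta` units
pct_change <- function(b, delta = 1) 100 * (exp(delta * b) - 1)

pct_change(b1)              # one-year difference in initial age
pct_change(b1, delta = 5)   # five-year difference in initial age
exp(5 * b1)                 # the same five-year effect as a multiplicative factor
```

Note that the 5-year effect is `exp(5 * b1)`, not 5 times the 1-year percentage, because the effects multiply rather than add.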

Wald Confidence Intervals and Tests for $\eta_i$

95% CI for $\eta_i$: $\hat{\eta}_i \pm 1.96 \cdot SE(\hat{\eta}_i)$

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.58201    0.54462  -2.905  0.00368 **
Age          0.06869    0.01375   4.997 5.81e-07 ***

> confint.default(ele.glm)
                  2.5 %      97.5 %
(Intercept) -2.64944613 -0.51456980
Age          0.04175158  0.09563404

A confidence interval for the multiplicative increase in the response is then given as follows.

95% CI for $e^{\eta_i}$: $\exp(\hat{\eta}_i \pm 1.96 \cdot SE(\hat{\eta}_i))$

Questions:

1. Find a 95% CI for the effect of a 1-year Age difference.

2. Find a 95% CI for the effect of a 5-year Age difference.
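One way to set up these two calculations is sketched below, using the rounded estimate and standard error for Age from the summary output. Because the inputs are rounded, the log-scale limits agree with `confint.default(ele.glm)` only to about four decimal places.

```r
est <- 0.06869   # Age estimate from summary(ele.glm)
se  <- 0.01375   # its standard error

# Wald interval on the log scale: estimate +/- z * SE
ci_log <- est + c(-1, 1) * qnorm(0.975) * se
ci_log

# Exponentiate to get intervals for the multiplicative effect
exp(ci_log)       # 1-year difference in Age
exp(5 * ci_log)   # 5-year difference in Age (scale the limits by 5 first)
```

For the 5-year effect, the limits are multiplied by 5 on the log scale before exponentiating; subtracting 1 and multiplying by 100 converts either interval to a percent change.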

Using a large-sample test for significance of the “slope” parameter ($\eta_i$):

$$H_0: \eta_i = 0 \qquad H_a: \eta_i \neq 0$$

$$z = \frac{\hat{\eta}_i}{SE(\hat{\eta}_i)} \approx N(0,1)$$

$$z^2 \sim \chi^2_1$$
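As a check on the Age row of the coefficient table, the Wald statistic and its two-sided p-value can be computed by hand. The sketch below uses the rounded estimate and standard error, so the results match the printed z value and p-value only approximately.

```r
est <- 0.06869   # Age estimate from summary(ele.glm)
se  <- 0.01375   # its standard error

z <- est / se                     # Wald z statistic
p_z <- 2 * pnorm(-abs(z))         # two-sided p-value from N(0, 1)

# Equivalent chi-square form: z^2 compared to a chi-square with 1 df
p_chisq <- pchisq(z^2, df = 1, lower.tail = FALSE)

c(z = z, p_z = p_z, p_chisq = p_chisq)
```

The two p-values agree because the square of a standard normal variable follows a chi-square distribution with 1 degree of freedom.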

Example 2: Reproduction of Ceriodaphnia Organisms

In this study, the number of Ceriodaphnia organisms is counted in a controlled environment in which reproduction occurs among the organisms. Two different strains of organisms are involved, and the environment is changed by adding varying amounts of a chemical component intended to impair reproduction. Initial population sizes are the same. The data can be found in the file Ceriodaphnia.csv.

> head(Ceriodaph)
  Cerio Conc Strain
1    82  0.0      0
2   106  0.0      0
3    63  0.0      0
4    99  0.0      0
5   101  0.0      0
6    45  0.5      0
…     …    …      …

> cerio.glm = glm(Cerio~Conc+Strain,family="poisson")
> summary(cerio.glm)

Call:
glm(formula = Cerio ~ Conc + Strain, family = "poisson")

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.6800  -0.6766   0.1528   0.6787   2.0774

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  4.72961    0.07676  61.617  < 2e-16 ***
Conc        -1.54308    0.04660 -33.111  < 2e-16 ***
Strain      -0.27497    0.04837  -5.684 1.31e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1359.381  on 69  degrees of freedom
Residual deviance:   86.376  on 67  degrees of freedom
AIC: 415.95

Number of Fisher Scoring iterations: 4

Interpret the coefficients:
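A possible starting point is to exponentiate the estimates so they read as multiplicative effects on the mean count, as in Example 1. The sketch below hard-codes the rounded estimates from the output above; the units of Conc are not stated in the handout, so the effect is described per one-unit increase in Conc.

```r
# Rounded estimates copied from summary(cerio.glm)
coefs <- c(Conc = -1.54308, Strain = -0.27497)

rate_ratio <- exp(coefs)            # multiplicative change in the mean count
pct_change <- 100 * (rate_ratio - 1)

rate_ratio   # Conc: per one-unit increase; Strain: Strain = 1 versus Strain = 0
pct_change   # negative values correspond to a decrease in the mean count
```

Each effect is interpreted with the other predictor held fixed.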

Poisson Regression in JMP

To fit a Poisson regression model for the number of ceriodaphnia as a function of the concentration and strain, we again use Analyze > Fit Model to set up the model as shown below:

The results of the model fit are shown below:

Example 3: Caesarean Sections in Private vs. Public Hospitals

Births by caesarean sections are said to be more frequent in private (fee paying) hospitals (coded as Type=0) as compared to non-fee paying public hospitals (coded as Type=1). Data on total annual births and the number of caesarean sections carried out were obtained from the records of 4 private hospitals and 16 public hospitals. These are tabulated in the file Caesarean_data.csv.

As the number of caesareans performed at a hospital is clearly a count of the number of occurrences, a Poisson regression for these data is appropriate. Our focus is on what role the type of hospital plays in the number of caesareans; however, the number of caesarean births is clearly going to depend on the total number of births at the hospital. We will therefore fit a Poisson regression model for the number of caesarean births using both the total number of births and hospital type as predictors in the model.

> cb.glm = glm(Caesareans~Type+Births,family="poisson")
> summary(cb.glm)

Call:
glm(formula = Caesareans ~ Type + Births, family = "poisson")

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.3270  -0.6121  -0.0899   0.5398   1.6626

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.351e+00  2.501e-01   5.402 6.58e-08 ***
Type        1.045e+00  2.729e-01   3.830 0.000128 ***
Births      3.261e-04  6.032e-05   5.406 6.45e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 99.990  on 19  degrees of freedom
Residual deviance: 18.039  on 17  degrees of freedom
  (50 observations deleted due to missingness)
AIC: 110.8

Number of Fisher Scoring iterations: 4

Questions:

1. Are type of hospital and number of births both significant predictors of the number of caesarean births? Explain.

2. Since Births is a continuous predictor, we need to pick an incremental value (c) to use when interpreting the parameter estimate. Suppose we use c = 1000 births. Use the appropriate parameter estimate to describe the predicted relationship between number of births and number of caesarean births (after adjusting for hospital type).

3. Use the appropriate parameter estimate to describe the predicted relationship between hospital type and the number of caesarean births (after adjusting for the total number of births).

4. Find and interpret confidence intervals to address the above questions, as well.
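The calculations behind questions 2 through 4 can be organized as in the sketch below. It hard-codes the rounded estimates and standard errors from the summary output, so results will differ slightly from those computed on the fitted object; the variable names are hypothetical.

```r
# Rounded estimates and standard errors copied from summary(cb.glm)
b_births <- 3.261e-04; se_births <- 6.032e-05
b_type   <- 1.045;     se_type   <- 0.2729

inc   <- 1000            # increment c = 1000 births
zcrit <- qnorm(0.975)    # approximately 1.96

# Multiplicative effect of 1000 additional births, with a 95% Wald interval
rr_births <- exp(inc * b_births)
ci_births <- exp(inc * (b_births + c(-1, 1) * zcrit * se_births))

# Multiplicative effect of Type = 1 versus Type = 0, with a 95% Wald interval
rr_type <- exp(b_type)
ci_type <- exp(b_type + c(-1, 1) * zcrit * se_type)

rbind(Births_per_1000 = c(estimate = rr_births, lower = ci_births[1], upper = ci_births[2]),
      Type_1_vs_0     = c(estimate = rr_type,   lower = ci_type[1],   upper = ci_type[2]))
```

Each ratio is adjusted for the other predictor, and an interval that excludes 1 corresponds to a significant Wald test at the 5% level.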

Appendix: Code for some useful R functions for OLS Regression

# Internally studentized residuals for a fitted lm object
Studresid = function (lm1, lms = summary(lm1), lmi = lm.influence(lm1))
{
    y <- resid(lm1)
    y2 <- y^2
    sy2 <- sum(y2)
    npred <- lm1$rank
    l <- length(resid(lm1))
    rse <- sy2/(l - npred)          # residual mean square
    rses <- sqrt(rse)               # residual standard error
    h <- lmi$hat                    # leverages
    (resid(lm1))/(rses * (1 - h)^0.5)
}

# Residual diagnostic plots for a fitted lm object (uses Studresid above)
resplot = function (lm1, lms = summary(lm1))
{
    par(mfrow = c(2, 2), pty = "m")
    y <- resid(lm1)
    # Normal probability plot of the studentized residuals
    qqnorm(Studresid(lm1), main = "Normal Probability Plot",
        ylab = "Residuals")
    abline(0, sqrt(var(Studresid(lm1))))
    # Studentized residuals vs. fitted values with lowess trend and scatter bands
    plot(fitted(lm1), Studresid(lm1), xlab = "Fitted Values",
        ylab = "Studentized Residuals",
        main = "Plot of Studentized Residuals vs. Fitted", cex = 0.65)
    x <- fitted(lm1)
    y <- Studresid(lm1)
    f <- 0.5
    xs <- sort(x, index.return = TRUE)
    x <- xs$x
    ix <- xs$ix
    y <- y[ix]
    trend <- lowess(x, y, f)
    e2 <- (y - trend$y)^2
    scatter <- lowess(x, e2, f)
    uplim <- trend$y + sqrt(abs(scatter$y))
    lowlim <- trend$y - sqrt(abs(scatter$y))
    lines(trend$x, trend$y, col = "Blue")
    lines(scatter$x, uplim, col = "Red")
    lines(scatter$x, lowlim, col = "Red")
    abline(h = 0, lty = 2, col = 2)
    # Square root of absolute studentized residuals vs. fitted values
    plot(fitted(lm1), sqrt(abs(Studresid(lm1))),
        main = "Loess Fit of Residuals",
        ylab = "Absolute Stud. Residuals", xlab = "Fitted Values", cex = 0.7)
    lines(lowess(fitted(lm1), sqrt(abs(Studresid(lm1)))), lty = 1, col = 3)
    abline(h = mean(sqrt(abs(Studresid(lm1)))), col = "blue", lty = 3)
    # Side-by-side quantile plots of centered fitted values and residuals
    par(mfrow = c(1, 2))
    par(ask = TRUE)
    yl <- c(min(resid(lm1), fitted(lm1) - mean(fitted(lm1))),
        max(resid(lm1), fitted(lm1) - mean(fitted(lm1))))
    fit <- fitted(lm1)
    p <- sort(fit - mean(fit))
    pp <- ppoints(p)
    res <- resid(lm1)
    pr <- sort(res)
    ppr <- ppoints(pr)
    plot(pp, p, pch = "o", xlim = c(0, 1), ylim = yl, xlab = "f-value",
        ylab = "", main = "Fitted values", cex = 0.7)
    plot(ppr, pr, pch = "o", xlim = c(0, 1), ylim = yl, xlab = "f-value",
        ylab = "", main = "Residuals", cex = 0.7)
    # Restore the default plotting layout
    par(mfrow = c(1, 1))
    par(ask = FALSE)
    invisible()
}
