Poisson Regression

STAT 405 – BIOSTATISTICS (Fall 2016)
Handout 17 – Poisson Regression
_____________________________________________________________________________

Example 1: Mating Success of African Elephants

In this study, 41 male African elephants were followed over a period of 8 years. The age of each elephant at the beginning of the study and the number of successful matings during the 8 years were recorded. The objective was to learn whether older animals are more successful at mating or whether they have diminished success after reaching a certain age. The data set Elephants.csv contains the following variables:

• Y = Number of matings in the 8-year follow-up period
• X = Age (yrs.) of elephant at the start of the study

> plot(Matings~Age)

Consider fitting a model using ordinary least squares (OLS) regression:

> ele.lm = lm(Matings~Age, data=Elephants)
> summary(ele.lm)

Call:
lm(formula = Matings ~ Age, data = Elephants)

Residuals:
    Min      1Q  Median      3Q     Max
-4.1158 -1.3087 -0.1082  0.8892  4.8842

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.50589    1.61899  -2.783  0.00826 **
Age          0.20050    0.04443   4.513 5.75e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.849 on 39 degrees of freedom
Multiple R-squared:  0.343,  Adjusted R-squared:  0.3262
F-statistic: 20.36 on 1 and 39 DF,  p-value: 5.749e-05

> abline(ele.lm)
> resplot(ele.lm)

Do these plots suggest any violations of the OLS regression assumptions?

One approach to correcting the problem is to transform the response using a variance-stabilizing transformation.
> elesq.lm = lm(sqrt(Matings)~Age,data=Elephants)
> summary(elesq.lm)

Call:
lm(formula = sqrt(Matings) ~ Age, data = Elephants)

Residuals:
     Min       1Q   Median       3Q      Max
-1.90532 -0.33654  0.07767  0.45871  1.09468

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.81220    0.56867  -1.428 0.161187
Age          0.06320    0.01561   4.049 0.000236 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6493 on 39 degrees of freedom
Multiple R-squared:  0.296,  Adjusted R-squared:  0.2779
F-statistic:  16.4 on 1 and 39 DF,  p-value: 0.0002362

> resplot(elesq.lm)

While this may seem like a satisfactory model, the interpretation of the model coefficients is difficult since the response is now on the square root scale. In this case, Poisson regression may be a better option.

Introduction to Poisson Regression

Recall the Poisson distribution is given by

$$P(Y = y) = \frac{e^{-\mu}\mu^{y}}{y!}, \quad y = 0, 1, 2, \ldots, \quad \mu > 0.$$

The response Y is a discrete random variable that represents the number of occurrences per time or space unit. In Poisson regression, we seek a model for the mean of the response $E(Y) = \mu$ as a function of terms based upon a set of predictors $X_1, X_2, \ldots, X_p$. For a Poisson random variable, the mean and variance are both µ, so traditional OLS regression will not be adequate because the constant error variance assumption would be violated.

The logistic regression model that we have been studying is one type of a broader class of models called Generalized Linear Models. Generalized linear models are an extension of regular linear models that allow: (1) the mean of a population to depend on a linear function of terms through a nonlinear link function and (2) the response probability distribution to be any member of a special class of distributions referred to as the exponential family.
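The statement above that a Poisson random variable has mean and variance both equal to µ can be checked numerically. The following is a minimal sketch in base R and is not part of the elephant analysis; the value mu = 3 and the truncation of the support at 100 are arbitrary choices made only for illustration.

```r
# Illustrative value of the Poisson mean (an assumption, not taken from the data)
mu <- 3

# Support truncated at 100; the omitted tail probability is negligible for mu = 3
y <- 0:100
p <- dpois(y, lambda = mu)

# The same probabilities written out from the formula exp(-mu) * mu^y / y!
p.formula <- exp(-mu) * mu^y / factorial(y)

pois.mean <- sum(y * p)                  # E(Y)
pois.var  <- sum((y - pois.mean)^2 * p)  # Var(Y)

# Both quantities should agree with mu
c(mean = pois.mean, variance = pois.var)
```

Because the variance grows with the mean, counts with larger expected values are more spread out, which is the pattern that conflicts with the constant-variance assumption of OLS.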
The exponential family contains the normal distribution (used in OLS), the binomial distribution (used in logistic regression), and the Poisson distribution. The link function is a function that relates the mean of the response $\mu = E(Y)$ linearly to a set of terms based on the explanatory variables (i.e., the predictors).

OLS Regression

For a normally distributed response, the link function is the identity function, $g(\mu) = \mu$; thus,

$$g(\mu) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}.$$

We typically write the model for the mean as follows:

$$E(Y \mid X) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}.$$

Logistic Regression

For a binomial response we know that

$$g(\pi) = \ln\left(\frac{\pi}{1-\pi}\right) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}.$$

We expressed this as:

$$\ln\left(\frac{\pi(x)}{1-\pi(x)}\right) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}.$$

Poisson Regression

For a Poisson distributed response variable, the link function is $g(\mu) = \ln(\mu)$; so,

$$\ln(\mu) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}.$$

Thus,

$$\mu = \exp(\eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}).$$

Fitting the Poisson Regression Model in R

As the number of matings per 8-year period is likely to be well modeled by a Poisson distribution, we will now consider a Poisson regression model for the Elephants data.

> ele.glm = glm(Matings~Age,family="poisson")
> summary(ele.glm)

Call:
glm(formula = Matings ~ Age, family = "poisson")

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.80798  -0.86137  -0.08629   0.60087   2.17777

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.58201    0.54462  -2.905  0.00368 **
Age          0.06869    0.01375   4.997 5.81e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 75.372  on 40  degrees of freedom
Residual deviance: 51.012  on 39  degrees of freedom
AIC: 156.46

Number of Fisher Scoring iterations: 5

> par(mfrow=c(2,2))
> plot(ele.glm)
> par(mfrow=c(1,1))
> plot(Age,Matings,xlab="Age of Elephant",ylab="Num. of Matings")
> lines(Age,fitted(ele.glm))
> title(main="Plot of Matings vs. Age of Elephant w/ Poisson Fit")

Interpretation of Coefficients in the Poisson Regression Model

The coefficients in the Poisson regression model can be interpreted as follows. Assume that we change one of the explanatory terms (for example, the first one) by one unit, from $u_1$ to $u_1 + 1$, while holding all other terms fixed. The percent increase (or decrease) in the mean response can then be calculated as follows:

$$100 \times \frac{\exp(\eta_0 + \eta_1 (u_1 + 1) + \cdots + \eta_{k-1} u_{k-1}) - \exp(\eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1})}{\exp(\eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1})} = 100[\exp(\eta_1) - 1]\%.$$

Alternatively, we can simply take the ratio

$$\frac{\exp(\eta_0 + \eta_1 (u_1 + 1) + \cdots + \eta_{k-1} u_{k-1})}{\exp(\eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1})} = \exp(\eta_1),$$

which says the mean of the response is multiplied by $\exp(\eta_1)$ for each one-unit increase in the term $u_1$.

Interpretation of the estimated coefficient for age:

The estimated coefficient for Age is $\hat{\eta}_1 = 0.0687$. Thus, we have a $100[e^{0.0687} - 1] = 7.11\%$ increase in the number of matings in the 8-year period per one year of age at the start of the study. Expressed as a multiplicative increase, this would be 1.0711.

For a 5-year difference in initial age, we would expect a $100[e^{5 \times 0.0687} - 1] = 40.99\%$ increase in the number of matings in the following 8-year period. Expressed as a multiplicative increase, this would be 1.4099.
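The percent-change and multiplicative interpretations above can be reproduced with a few lines of R. This is a sketch that hard-codes the rounded Age estimate 0.0687 quoted in the text; the variable names are illustrative and do not come from the handout.

```r
# Estimated Age coefficient, rounded as in the text
eta1.hat <- 0.0687

# Multiplicative change in mean matings for a 1-year and a 5-year age difference
mult.1yr <- exp(eta1.hat)
mult.5yr <- exp(5 * eta1.hat)

# Corresponding percent changes: 100 * [exp(c * eta1.hat) - 1]
pct.1yr <- 100 * (mult.1yr - 1)
pct.5yr <- 100 * (mult.5yr - 1)

round(c(mult.1yr = mult.1yr, mult.5yr = mult.5yr), 4)
round(c(pct.1yr = pct.1yr, pct.5yr = pct.5yr), 2)
```

With the fitted model object in the workspace, coef(ele.glm)["Age"] could be used in place of the hard-coded value; the results may then differ slightly in the last digit because the estimate is no longer rounded.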
Wald Intervals and Tests for Parameters

95% CI for $\eta_i$:

$$\hat{\eta}_i \pm 1.96 \cdot SE(\hat{\eta}_i)$$

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.58201    0.54462  -2.905  0.00368 **
Age          0.06869    0.01375   4.997 5.81e-07 ***

> confint.default(ele.glm)
                  2.5 %      97.5 %
(Intercept) -2.64944613 -0.51456980
Age          0.04175158  0.09563404

A confidence interval for the multiplicative increase in the response is then given as follows.

95% CI for $\exp(\eta_i)$:

$$\exp\left(\hat{\eta}_i \pm 1.96 \cdot SE(\hat{\eta}_i)\right)$$

Questions:

1. Find a 95% CI for the effect of a 1-year Age difference.
2. Find a 95% CI for the effect of a 5-year Age difference.

Using a large-sample test for significance of the “slope” parameter ($\eta_i$):

$$H_0: \eta_i = 0 \qquad H_a: \eta_i \neq 0$$

$$z = \frac{\hat{\eta}_i}{SE(\hat{\eta}_i)} \approx N(0,1)$$

Example 2: Reproduction of Ceriodaphnia Organisms

In this study, the number of Ceriodaphnia organisms is counted in a controlled environment in which reproduction occurs among the organisms. Two different strains of organisms are involved, and the environment is changed by adding varying amounts of a chemical component intended to impair reproduction. Initial population sizes are the same. The data can be found in the file Ceriodaphnia.csv.

> head(Ceriodaph)
  Cerio Conc Strain
1    82  0.0      0
2   106  0.0      0
3    63  0.0      0
4    99  0.0      0
5   101  0.0      0
6    45  0.5      0
…     …    …      …

> cerio.glm = glm(Cerio~Conc+Strain,family="poisson")
> summary(cerio.glm)

Call:
glm(formula = Cerio ~ Conc + Strain, family = "poisson")

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.6800  -0.6766   0.1528   0.6787   2.0774

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  4.72961    0.07676  61.617  < 2e-16 ***
Conc        -1.54308    0.04660 -33.111  < 2e-16 ***
Strain      -0.27497    0.04837  -5.684 1.31e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1359.381  on 69  degrees of
