Negative Binomial Regression

STATISTICS COLUMN Negative Binomial Regression Shengping Yang PhD ,Gilbert Berdine MD In the hospital stay study discussed recently, it was mentioned that “in case overdispersion exists, Poisson regression model might not be appropriate.” I would like to know more about the appropriate modeling method in that case. Although Poisson regression modeling is wide- regression is commonly recommended. ly used in count data analysis, it does have a limita- tion, i.e., it assumes that the conditional distribution The Poisson distribution can be considered to of the outcome variable is Poisson, which requires be a special case of the negative binomial distribution. that the mean and variance be equal. The data are The negative binomial considers the results of a se- said to be overdispersed when the variance exceeds ries of trials that can be considered either a success the mean. Overdispersion is expected for contagious or failure. A parameter ψ is introduced to indicate the events where the first occurrence makes a second number of failures that stops the count. The negative occurrence more likely, though still random. Poisson binomial distribution is the discrete probability func- regression of overdispersed data leads to a deflated tion that there will be a given number of successes standard error and inflated test statistics. Overdisper- before ψ failures. The negative binomial distribution sion may also result when the success outcome is will converge to a Poisson distribution for large ψ. not rare. To handle such a situation, negative binomial 0 5 10 15 20 25 0 5 10 15 20 25 Figure 1. Comparison of Poisson and negative binomial distributions. Figure 1 shows that when ψ is small (e.g., ψ=5), Corresponding author: Shengping Yang a negative binomial distribution is more spread than Contact Information: [email protected]. a Poisson distribution with the same mean. However, DOI: 10.12746/swrccc2015.0310.135 when ψ is large (e.g., ψ =500), the two distributions mostly overlap. 50 The Southwest Respiratory and Critical Care Chronicles 2015;3(10) Shengping Yang et al. Negative Binomial Regression 1. The negative binomial regression model. where the conditional variance is greater than the conditional mean of the outcome variable (overdisper- Negative binomial distribution is defined as a sion). In addition, it is clear that when ψ is very large, discrete distribution of the number of successes in a µ 2 i → 0, Var ( y ; ψ, µ )~ ~ µ = E ( y ; ψ, µ ) , and the sequence of independent and identically distributed ψ i i i i i Bernoulli trials before a specified number of failures expected value or mean converges to the variance. are observed. More intuitively, it can be viewed as a Poisson distribution with parameter λ, where λ itself is 2. Application of negative binomial regression in not fixed but a random variable which follows a Gam- count data analysis. ma distribution. The Gamma distribution is a contin- uous probability function with a shape parameter ψ, Applying a negative binomial regression to a rate parameter δ, and the Gamma function Г. The model count data with overdispersion is straightfor- physical meaning of the Gamma distribution is an av- ward using available statistical software. For exam- erage or expected waiting time for a random event. ple, the SAS code (see below) for negative binomial regression modeling is very similar to that for Poisson Now, suppose that variable vi follows a Gamma distribution that regression. In the hospital length of stay study, the outcome variable is LOS (length of stay), and the risk factors of interest are corticosteroid (whether or not k(v ;ψ;δ) = δ ψ v ψ-1 e -vi δ, where v > 0, δ > 0, ψ > 0, i i i a patient was treated with corticosteroids within 60 Г (ψ) minutes), race, and gender. Comparing with Poisson regression, the only change is to replace the dist (dis- 2 and E(vi) = ψ / δ , Var(vi) = ψ / δ . By setting ψ = δ, tribution) option “poisson” with “nb”, which is the ab- and transforming the one parameter Gamma distribu- breviation for “negative binomial”. tion as a function of the Poisson mean λ , we get ψ i ψ ψ-1 -λi /µ proc genmod; g(λi ;ψ; µi ) = ( ψ / µi ) λi e i . Г (ψ) class corticosteriods race gender; model LOS = corticosteriods race gender To get the marginal distribution of the outcome <other risk factors>/dist=nb; run; variable yi , we have As previously described, the class statement is used to define that corticosteriods, race, and gender- are all categorical variables. And the dist=nb option is used to indicate that the conditional distribution of LOS is assumed to be negative binomial. ψ yi = yi + ψ - 1 ψ µi , Applying a negative binomial regression using R ( ψ - 1 ) µ + ψ µ + ψ software is also straightforward. Instead of using the ( i ) ( i ) glm function previously used for applying Poisson re- which is the probability mass function of a neg- gression, the glm.nb function, which is modified ver- ative binomial distribution (NB2). Note that sion of the glm function, can be readily used. Since the glm.nb function is specifically developed for nega- E(y µ ) = µ , and i ; ψ, i i tive binomial regression modeling, we do not need to 2 include the “family=” option. Var (yi ; ψ, µi ) = µi + µi , ψ The Southwest Respiratory and Critical Care Chronicles 2015;3(10) 51 Shengping Yang et al. Negative Binomial Regression glm.nb (LOS ~ corticosteroid + race + gender + of re-admission is 0 for these patients). If this is the <other risk factors>, data=data). cause of overdispersion, then zero-inflated regression is appropriate to model these data (Lambert, 1992). 3. Assumptions of Negative binomial regression. Considering that zero-inflated count data are Negative binomial regression shares many generated by a mixture of two statistical processes, common assumptions with Poisson regression, such i.e., the first one always generates zero counts and as linearity in model parameters, independence of the second one generates both zero and nonzero individual observations, and the multiplicative effects counts. Correspondingly, a logit model can be used to of independent variables. However, comparing with determine if the outcome of an individual observation Poisson regression, negative binomial regression al- is from the always-zero or not-always-zero groups, lows the conditional variance of the outcome variable and then a Poisson or negative binomial model can to be greater than its conditional mean, which offers be used to model the counts for the not-always-zero greater flexibility in model fitting. Note that negative group. We will not discuss the details of zero-inflated binomial regression does not handle the underdis- regression here; however, the implementation of ze- persion situation, where the conditional variance is ro-inflated regression is straightforward in SAS, and smaller than the conditional mean. Fortunately, un- the example code is derdispersion is rare in practice. proc genmod; 4. Interpretation of the result of a Poisson regres- class <corticosteriods risk factors>; sion. model count = <risk factors> /dist=zip; zeromodel = <risk factors>/link=logit; A statistical test is highly recommended after run; running a Poisson regression to determine whether overdispersion exists. One such test is to apply an The dist (distribution) option is defined as “zip” ordinary least square regression without the intercept (zero-inflated Poisson), and an addition statement, term (Cameron and Trivedi, 1996). The dependent zeromodel is used to model which group (either the 2 always-zero or the not-always-zero) an individual ob- variable is defined as (yi - yi ) - yi ,where yi = exp (Xi b) servation comes from. yi is the predicted value obtained using the Poisson re- In contrast, sometimes, the data-generating gression. By using the yi as the independent variable in the ordinary least square regression, a t test can be process does not allow 0s; for example, if we are in- performed, and a small p value (p<0.05) implies that terested in modeling the number of wounds for pa- overdispersion exists. tients at a trauma center, we might find that all the patients have at least one wound. Under this situation, 5. Zero-inflated and zero-truncated Poisson/nega- a zero-truncated negative binomial regression can be tive binomial regressions. used. There are many potential causes of overdispersion; one is the excessive zeros in the data. For example, suppose that the outcome of interest is the number of hospital re-admissions, and then it is quite obvious that there might be a large number of patients who do not have any re-admission (the number 52 The Southwest Respiratory and Critical Care Chronicles 2015;3(10) Shengping Yang et al. Negative Binomial Regression Author affiliation : Shengping Yang is a biostatistician in the Department of Pathology at TTHUSC. Gilbert Berdine is a pulmonary physician in the Department of Internal Medicine at TTUHSC. Submitted: 3/26/2015 Accepted: 4/7/2015 Published electronically: 4/15/2015 References 1. Cameron AC, Trivedi P K. Count data models for financial data. Handbook of Statistics 1996, 14, Statistical Methods in Finance, 363-392, Amsterdam 2. Lambert D. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 1992; 34(1): 1-14. 3. North-Holland C, Greenwood M, Yule GU. An enquiry into the nature of frequency distributions representative of multiple happenings with particular reference of multiple attacks of disease or of repeated accidents. J R Statist Soc 1920; 83: 25–279 The Southwest Respiratory and Critical Care Chronicles 2015;3(10) 53.

Negative Binomial Regression

Lecture 22: Bivariate Normal Distribution Distribution

An Introduction to Poisson Regression Russ Lavery, K&L Consulting Services, King of Prussia, PA, U.S.A

Overdispersion, and How to Deal with It in R and JAGS (Requires R-Packages AER, Coda, Lme4, R2jags, Dharma/Devtools) Carsten F

Poisson Versus Negative Binomial Regression

Section 1: Regression Review

Generalized Linear Models

(Introduction to Probability at an Advanced Level) - All Lecture Notes

Modeling Overdispersion with the Normalized Tempered Stable Distribution

Variance Partitioning in Multilevel Logistic Models That Exhibit Overdispersion

Meta-Analysis of Count Data What Is Count Data Complicated by What Is

Generalized Linear Models and Point Count Data: Statistical Considerations for the Design and Analysis of Monitoring Studies

Large Scale Conditional Covariance Matrix Modeling, Estimation and Testing