Annotated Output for Binomial Regression

SUPPLEMENTAL MATERIAL

Appendix 1: Annotated output for logistic regression
Appendix 2: Annotated output for binomial regression
Appendix 3: Annotated output for proportional odds logistic regression
Appendix 4: Annotated output for multinomial regression
Appendix 5: Annotated output for Poisson regression
Appendix 6: Special cases of Poisson Regression

Appendix 1: Annotated output for logistic regression

Understanding how to interpret the output R presents can be challenging. The following figures are designed to help demystify the R output for logistic regression. All data and code can be found here: https://github.com/ejtheobald/BeyondLinearRegression. In addition, a table, akin to what a researcher might publish, is displayed based on the R output.

Figure A1.1: Logistic regression annotated output

Presenting logistic regression results in manuscripts

Table A1.1: Regression coefficients (β), standard error (SE), p-value, and odds-ratio for the logistic regression predicting students reporting being likely to take a mathematical modeling in biology course. The odds of a student reporting being likely to take a mathematical modeling in biology course increased with increasing interest in using math to understand biology and decreased with increasing cost of incorporating math into biology courses. Additionally, the odds of a fourth-year student reporting being likely to take a mathematical modeling course are lower than those of a first-year student.
Predictor                                           β       SE      p-value     Odds-ratio
Interest                                            0.71    0.07    < 0.0001    2.04
Utility value                                       0.11    0.08    0.18        1.12
Cost                                               -0.27    0.06    < 0.0001    0.77
Gender (ref: Male)
  Female                                           -0.22    0.19    0.26        0.81
Year in School (ref: 1st-year)
  2nd-year                                         -0.16    0.23    0.49        0.85
  3rd-year                                         -0.19    0.22    0.39        0.82
  4th-year                                         -0.52    0.24    0.03        0.60
Highest High School Math Course (ref: Calculus)
  Pre-calculus                                     -0.03    0.21    0.87        0.97
  Algebra/Geometry                                 -0.06    0.38    0.87        0.94
  Stats                                             0.11    0.34    0.75        1.12

Appendix 2: Annotated output for binomial regression

Understanding how to interpret the output R presents can be challenging. The following figures are designed to help demystify the R output for binomial regression. All data and code can be found here: https://github.com/ejtheobald/BeyondLinearRegression. In addition, a table, akin to what a researcher might publish, is displayed based on the R output.

Figure A2.1: Binomial regression annotated output

Presenting binomial regression results in manuscripts

Table A2.1: Regression coefficients (β), standard error (SE), p-value, and odds-ratio for the binomial regression predicting the odds of a student completing an optional practice exam. The odds of a student completing a practice exam decreased with increasing GPA. Additionally, the odds of a female student completing a practice exam are higher than those of a male student.

Predictor                                           β       SE      p-value     Odds-ratio
GPA                                                -1.40    0.09    < 0.0001    0.25
Gender (ref: Male)
  Female                                            0.17    0.06    < 0.0001    1.19
First-generation (ref: Continuing-generation)
  First-generation                                 -0.04    0.09    0.64        0.96

Appendix 3: Annotated output for proportional odds logistic regression

Understanding how to interpret the output R presents can be challenging. The following figures are designed to help demystify the R output for proportional odds logistic regression. All data and code can be found here: https://github.com/ejtheobald/BeyondLinearRegression. In addition, a table, akin to what a researcher might publish, is displayed based on the R output.
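
The odds-ratio columns in Tables A1.1 and A2.1 are the exponentiated regression coefficients (odds ratio = e^β), and the proportional odds table in this appendix reports the same kind of quantity. As a minimal sketch of that arithmetic, the Python snippet below recomputes the Table A2.1 odds ratios from the published β values. It is not the authors' R code: the dictionary and function names are illustrative assumptions, and Python is used only for the calculation.

```python
import math

# Coefficients (beta) copied from Table A2.1. The names used here are
# illustrative assumptions and do not come from the original R code.
TABLE_A2_1_BETAS = {
    "GPA": -1.40,
    "Female (ref: Male)": 0.17,
    "First-generation (ref: Continuing-generation)": -0.04,
}


def odds_ratio(beta: float) -> float:
    """Convert a log-odds coefficient into an odds ratio."""
    return math.exp(beta)


def percent_change_in_odds(beta: float) -> float:
    """Percent change in the odds for a one-unit increase in the predictor."""
    return (math.exp(beta) - 1.0) * 100.0


for predictor, beta in TABLE_A2_1_BETAS.items():
    print(
        f"{predictor}: odds ratio = {odds_ratio(beta):.2f}, "
        f"change in odds = {percent_change_in_odds(beta):+.0f}%"
    )
```

Because the published coefficients are rounded to two decimals, odds ratios recomputed this way can differ from a published table in the last digit. Several Table A1.1 entries do (for example, e^0.71 rounds to 2.03 rather than 2.04), which would be expected if the table was built from unrounded estimates.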

Figure A3.1: Proportional Odds Logistic Regression annotated output

Presenting proportional odds results, using model selection, in manuscripts

When model selection is conducted to test hypotheses, it is important to indicate the starting model, the best model, and the comparison of the best model to the null model (to show how much better the best model fits; done here with ΔAIC). Here is one way to do that for the hypothesis we tested.

Table A3.1: Students were less likely to report a dominator after the interactive activity, compared to the constructive activity. Additionally, students with higher course grades were less likely to report a dominator than students with lower course grades. The table shows odds ratios.

Outcome       Ethnicity¹               Activity Type²    Course Grade    ΔAIC³
Dominator⁴    Asian American⁵  1.67    0.56              0.76            20.207
              International    3.35
              URM              1.39

1 Reference group is White students.
2 Reference group is Constructive activity; effect shown of Interactive activity.
3 In comparison to the null model.
4 Outcome was measured on a 5-point Likert scale. Higher numbers indicate more agreement that one person dominated the group.
5 Bold face coefficients show statistically significant relationships, but note that interpreting t-values for significance testing is unreliable with a small sample.
6 Units are shown as odds ratios.

Appendix 4: Annotated output for multinomial regression

Understanding how to interpret the output R presents can be challenging. The following figures are designed to help demystify the R output for multinomial regression. All data and code can be found here: https://github.com/ejtheobald/BeyondLinearRegression. In addition, a table, akin to what a researcher might publish, is displayed based on the R output.

Figure A4.1: Multinomial regression annotated R output

Presenting multinomial regression results in manuscripts

Table A4.1: Model selection using backwards model selection and the likelihood ratio test.
In each comparison the bolded term is the one being tested.

Models                                                     Degrees of    Likelihood Ratio    p-value
                                                           Freedom       Statistic

Gender + Class Standing + University GPA
  + Gender x Class Standing + Gender x University GPA
vs.                                                        8             11.1                0.195
Gender + Class Standing + University GPA
  + Gender x University GPA

Gender + Class Standing + University GPA
  + Gender x University GPA
vs.                                                        4             9.6                 0.048
Gender + Class Standing + College GPA

Gender + Class Standing + University GPA
  + Gender x University GPA
vs.                                                        8             5.2                 0.736
Gender + University GPA + Gender x University GPA

Gender + University GPA + (Gender x University GPA)
vs.                                                        20            50.0                < 0.0001
Null (no predictors)

Best model: Gender + University GPA + (Gender x University GPA)

Table A4.2: Regression coefficients from multinomial regression exploring the impact of gender, class standing, and university GPA on the role students assume in groups. Collaborator is the reference level for the outcome variable, so all the other roles are compared to it, and the estimates are the log odds of being in that role versus being a collaborator. The table shows each estimate ± its standard error, with the p-value from the Wald statistic in parentheses.

Outcome                       Intercept         Gender: Male      University GPA       Gender x University GPA
                                                (ref: Female)     at start of class    (ref: Female)
Leader vs. Collaborator       -1.17 ± 0.272     1.4 ± 0.366       -1.9 ± 0.723         2.7 ± 0.903
                              (< 0.001)         (< 0.001)         (0.009)              (0.002)
Listener vs. Collaborator     -1.54 ± 0.308     0.12 ± 0.522      -1.2 ± 0.839         0.76 ± 1.13
                              (< 0.001)         (0.812)           (0.156)              (0.502)
Recorder vs. Collaborator     -1.75 ± 0.354     -1.0 ± 0.848      0.53 ± 0.955         0.63 ± 2.06
                              (< 0.001)         (0.227)           (0.578)              (0.760)
Other vs. Collaborator        -1.29 ± 0.286     -0.12 ± 0.514     -2.0 ± 0.753         1.2 ± 1.03
                              (< 0.001)         (0.820)           (0.007)              (0.255)

Summarizing the output from the effects package verbally: an example

Figure 2D visually summarizes the output from the effects package, but sometimes researchers may want to explain a particular variable in more detail.
Here we present how one might write up the effects package output for the gender x GPA interaction found in our example. Women with a college GPA that is 0.25 points below the mean have a 42% chance of reporting being a collaborator; at the mean they have a 50% chance, and at 0.5 points above the mean they have a 64% chance (table below). Men with the same range of GPAs do not see the upward shift in percent chance of being a collaborator (0.25 points below the mean: 37%; at the mean: 36%; 0.5 points above the mean: 29%). Instead, as GPA increases, men become increasingly likely to report preferring to be leaders (0.25 points below the mean: 38%; at the mean: 45%; 0.5 points above the mean: 57%). The table below shows these probabilities as well as the 95% confidence intervals (shown in parentheses). No other outcome categories see shifts based on gender (as indicated by the 95% confidence intervals on all their estimates overlapping).

Table A4.3: The probability of preferring to be a collaborator varied with student reported gender and with GPA.

                  Mean GPA - 0.25 pts    Mean GPA          Mean GPA + 0.5 pts
Collaborator
  Women           42%                    50%               64%
                  (31 – 53%)             (42 – 60%)        (50 – 76%)
  Men             37%                    36%               29%
                  (27 – 50%)             (26 – 46%)        (18 – 45%)
Leader
  Men             38%                    45%               57%
                  (28 – 51%)             (34 – 56%)        (41 – 71%)

Appendix 5: Annotated output for Poisson regression

Understanding how to interpret the output R presents can be challenging. The following figures are designed to help demystify the R output for Poisson regression. All data and code can be found here: https://github.com/ejtheobald/BeyondLinearRegression. In addition, a table, akin to what a researcher might publish, is displayed based on the R output.

Figure A5.1: Poisson Regression annotated output

Presenting Poisson regression results in manuscripts

Table A5.1: Regression coefficients (β), standard error (SE), p-value, and the effect of the predictor on the outcome variable (e^β) for the Poisson regression predicting the number of times students raise their hands in class.
The number of times a student raises their hand in class increases if they have higher total exam points and if they are a physics major.
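
The exponentiated coefficient that the Table A5.1 caption reports as the effect of the predictor can be read as a multiplier on the expected count. With a log link, a one-unit increase in a predictor multiplies the expected number of hand raises by e^β. The Python sketch below illustrates this under stated assumptions: the intercept and coefficient are hypothetical numbers chosen for illustration, not values from Table A5.1, and the function names are not from the authors' R code.

```python
import math


def rate_ratio(beta: float) -> float:
    """Multiplicative change in the expected count per one-unit increase."""
    return math.exp(beta)


def expected_count(intercept: float, beta: float, x: float) -> float:
    """Expected count from a one-predictor Poisson model with a log link."""
    return math.exp(intercept + beta * x)


# Hypothetical values for illustration only (NOT taken from Table A5.1).
intercept = 0.5
beta_exam_points = 0.02

base = expected_count(intercept, beta_exam_points, 50)      # 50 exam points
plus_one = expected_count(intercept, beta_exam_points, 51)  # one more point

# The ratio of the two expected counts equals e^beta regardless of the
# starting value of the predictor.
print(f"ratio of expected counts: {plus_one / base:.4f}")
print(f"e^beta:                   {rate_ratio(beta_exam_points):.4f}")
```

A positive β gives e^β above 1 (the expected count increases), a negative β gives e^β below 1, and β = 0 gives exactly 1 (no effect).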