Measures of Explained Variation and the Base-Rate Problem for Logistic Regression
Total Page:16
File Type:pdf, Size:1020Kb
American Journal of Biostatistics 2 (1): 11-19, 2011 ISSN 1948-9889 © 2011 Science Publications Measures of Explained Variation and the Base-Rate Problem for Logistic Regression 1Dinesh Sharma, 2Dan McGee and 3B.M. Golam Kibria 1Department of Mathematics and Statistics, James Madison University, Harrisonburg, VA 22801 2Department of Statistics, Florida State University, Tallahassee, FL 32306 3Department of Mathematics and Statistics, Florida International University, Miami, FL 33199 Abstract: Problem statement: Logistic regression, perhaps the most frequently used regression model after the General Linear Model (GLM), is extensively used in the field of medical science to analyze prognostic factors in studies of dichotomous outcomes. Unlike the GLM, many different proposals have been made to measure the explained variation in logistic regression analysis. One of the limitations of these measures is their dependency on the incidence of the event of interest in the population. This has clear disadvantage, especially when one seeks to compare the predictive ability of a set of prognostic factors in two subgroups of a population. Approach: The purpose of this article is to study the base-rate sensitivity of several R2 measures that have been proposed for use in logistic regression. We compared the base-rate sensitivity of thirteen R2 type parametric and nonparametric statistics. Since a theoretical comparison was not possible, a simulation study was conducted for this purpose. We used results from an existing dataset to simulate populations with different base-rates. Logistic models are generated using the covariate values from the dataset. Results: We found nonparametric R2 measures to be less sensitive to the base-rate as compared to their parametric counterpart. Logistic regression is a parametric tool and use of the nonparametric R2 may result inconsistent results. Among the parametric R2 measures, the likelihood ratio R2 appears to be least dependent on the base-rate and has relatively superior interpretability as a measure of explained variation. Conclusion/Recommendations: Some potential measures of explained variation are identified which tolerate fluctuations in base-rate reasonably well and at the same time provide a good estimate of the explained variation on an underlying continuous variable. It would be, however, misleading to draw strong conclusions based only on the conclusions of this research only. Key words: Base-rate sensitivity, coefficient of determinant, latent scale linear model, R2 statistic INTRODUCTION variable is dichotomous, logistic regression model is the most popular choice. In most instances, interest lies in The Search for R2 analogs in logistic regression: determining how well the model predicts the probability Prediction of future outcomes based on a given set of of group membership with respect to the dependent covariates is a key component of regression analysis. In variable. Unlike the OLS regression, more than a dozen Ordinary Least Squares (OLS) regression analysis, the of R2 measures have been suggested for the logistic predictive accuracy of a linear model is often judged regression model (Mittlbock and Schemper, 1996; using the R2 statistic. This statistic has several Menard, 2000; DeMaris, 2002; Liao and McGee, 2003). mathematically equivalent definitions and multiple But the best form of R2 is not clear yet. Mittlbock and interpretations such as the proportion of variation in the Schemper (1996) reviewed 12 measures of explained dependent variable explained by the regressors, a variation for logistic regression, Menard (2000) six and measure of the strength of relationship between the DeMaris (2002) seven, with some overlap. Other authors covariate(s) and the response and the squared have proposed adjusted R2 analogs (see, Liao and correlation between the observed and the predicted McGee, 2003; Mittlbock and Schemper, 2002), for response. This statistic is usually not used as a measure exmaple). Recommendations based on various of goodness-of-fit as other tools are better suited to that researches were different as different criteria were used purpose (Hosmer et al ., 2011). When the outcome to evaluate the R2 analogs. Corresponding Author: Dinesh Sharma, Department of Mathematics and Statistics, James Madison University, Harrisonburg, VA 22801 11 Am. J. Biostatistics 2 (1): 11-19, 2011 Kvalseth's sixth criteria for a “good” R2 statistic for R2-measures in logistic regression: We present some the linear model (Kvalseth, 1985) requires an R2 of the R2 measures which have been proposed in the measure to be comparable across different models fitted literature to estimate explained variation in logistic to the same data. Menard (2000) extended this criteria regression. Consider n observations (y i, x i) on a binary 2 requiring an R measure to be comparable not only response variable y and a covariate vector x = (x 1…x p). across different predictors but also across different The relationship between y and x is modeled by a dependent variables and different subsets of the dataset. logistic model Eq. 1: With the help of an empirical example Menard (2000) β 2 o + β demonstrated that R measures in logistic regression are = ≡π = e 'x Pr(yi 1x) i ii (x) β (1) sensitive to the incidence of the event of interest in the 1e+o + β 'x population. Even if the coefficients associating particular variables to the outcome are the same in where β is a (p+1)-dimensional parameter vector. We different populations, the value of the R2 for denote the estimates from a logistic regression by ⌢ ⌢ populations with different incidence rates tend to be = = π⌢ = ≡ = n Pr(y 1|x) (x) and Pr(yi 1) y∑ (y/n) i . different. This phenomena is sometimes referred as the i i i i= 1 “base-rate” problem (Menard, 2000). For logistic model with binary y it can be shown that 2 Having an R measure that depends on the y = π , the mean of conditional probability of success incidence of the response has disadvantage if one is for all possible combinations of the covariate values. seeking to compare the predictive ability of two different sets of prognostic factors or to compare the Ordinary Least Squares R2 (R2 ) : It is a natural same set of factors in two subgroups of a population or OLS 2 extension of the coefficient of determination in OLS in two different populations. If a R measure depends on regression to the case of a binary y and is given by Eq. 2: the underlying incidence of the disease under study then the R2 values for these two cases could differ because n2 n 2 =− −π⌢ − 2 of the difference in the underlying incidence and not ROLS 1∑ (y i i )/ ∑ (y i y) (2) because of different predictive abilities. This i1= i1 = phenomena is illustrated with the help of an empirical 2 (R2 ) study in the following subsection. Gini's Concentration R G : Gini's concentration π=−s π 2 measure C( ) 1 ∑ j= 1 j is proposed as a measure of An empirical example: The data used in this example are a subset of the Framingham Heart Study with dispersion of a nominal random variable γ that assumes known values of the covariates (age, systolic blood the integral values j, 1 ≤ j ≤ s, with probability πj pressure, serum cholesterol, current cigarette smoking (Haberman, 1982). If the outcome variable is binary, status and diabetic status). A logistic model with the C(π) reduces to 2π (1-π), where π is the probability that ten-year incidence of Coronary Heart Disease (CHD) Y=1. The Gini's Concentration R2 is the given by Eq. 3: was estimated and thirteen different R2 measures were 2 n calculated. Table 1 presents estimated R s for each of 2 =−π−π⌢ ⌢ − 2 RG 1∑ i (1 i )/[ny(1 y)] (3) the thirteen R measures. The measures are larger for i= 1 the female group than the male group, with only a few 2 2 exceptions. If we had performed OLS regression, we The Likelihood Ratio R (RL ) : Let L0 be the would claim that we are able to predict CHD better in likelihood of the model containing only the intercept women than in men. However, women developed CHD and LM be the likelihood of the model containing all of the at only half the rate that men did and if our measures predictors. The quantity D =-2logL represents the SSE are affected by the underlying rate of disease, then it M M for the full model and D0 = -2logL 0 represents the SSE of would be misleading to make such a claim. the model with only the intercept included, analogs to the For a more detailed examination of the effect of the total sum of squares (SST) in OLS. Thus the likelihood base-rate on potential measures of explained variance, ratio R2 for a logistic model becomes Eq. 4: we conducted a simulation study. The purpose of this article is to study the base-rate 2 2 = − sensitivity of several R measures in logistic regression. RL 1 log(L M ) / log(L o ) (4) We use an actual dataset to simulate populations with different base-rate. Logistic models are generated using R2 Based Upon Geometric Mean Squared actually occurring covariate values. The organization of 2 2 Improvement (R ) : In the linear regression model with the study is as follows: We introduce the R measures M to be examined. Simulation methods and simulation normally distributed errors with zero mean and constant 2 2/n results are discussed. Summary and concluding remarks variance it can be shown that R =1-(L o/L M) are given. (DeMaris, 2002). 12 Am. J. Biostatistics 2 (1): 11-19, 2011 2 Table 1: Male and female R from a single cohort study 2 2 2 2 2 2 2 2 2 2 2 2 R R R R R R R ACU τ τ R R p OLS G L M N C CS a b D R s Females ( π = 0.058) 0.060 0.060 0.062 0.110 0.048 0.133 0.047 0.151 0.756 0.003 0.029 0.512 0.043 Males ( π = 0.119) 0.040 0.040 0.048 0.059 0.042 0.081 0.041 0.097 0.686 0.006 0.029 0.372 0.043 ⌢ Since the method of maximum likelihood is the primary squared correlation between y and y , its sample fitted method of parameter estimation in the logistic value according to the model.