
Evaluating goodness-of-fit for a model using the Hosmer-Lemeshow test on samples from a large set

THESIS

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By

Adam Bartley

Graduate Program in Public Health

The Ohio State University

2014

Master's Examination Committee:

Michael Pennell, Advisor

Stanley Lemeshow

Copyright by

Adam Bartley

2014

Abstract

The Hosmer-Lemeshow test is a commonly used assessment of goodness-of-fit for logistic regression models. As widely used as the Hosmer-Lemeshow test is, it can yield a high rejection rate of acceptable models when large samples are used. Several studies have suggested that one way around this would be to perform the test on random samples of fewer observations from the original data. This procedure would be easy to do and certainly reduce the power of the test, but no guidelines were given for how to implement the procedure or how to interpret results. At least two studies have used this technique with little justification for their conclusions. The purpose of this thesis was to evaluate the method proposed by others and give a recommendation for implementation. Results of a simulation study suggested that when one hundred subsets of five thousand observations were taken, if more than 10 had significant Hosmer-Lemeshow tests then the fit of the model should be considered suspect.


Acknowledgments

I would like to thank Dr. Pennell, Dr. Lemeshow, and Gary Phillips for their assistance during my time as a student. They have been generous with their time and advice. This thesis would not have been possible without them and all of their support.


Vita

June 2011 ...... B.S. Food, Agricultural, and Biological Engineering, The Ohio State University

Fields of Study

Major Field: Public Health


Table of Contents

Abstract

Acknowledgments

Vita

List of Tables

List of Figures

Chapter 1: Introduction

Chapter 2: Simulation Study

2.1 Introduction

2.2 Simulation Scenarios

2.3 Methods

2.4 Results

Chapter 3: Discussion

References


List of Tables

Table 1. Simulation Study Scenarios

Table 2. Number of significant subsets when subset size = 1000 observations

Table 3. Number of significant subsets when subset size = 2000 observations

Table 4. Number of significant subsets when subset size = 5000 observations


List of Figures

Figure 1. True vs. Fitted models

Figure 2. Simulation Procedure


Chapter 1: Introduction

Logistic regression is a statistical technique often used when studying binary outcomes.

This procedure models the log-odds of an event occurring, g(X) (sometimes referred to as the logit), given a set of predictors X = (x_1, x_2, ..., x_p):

$$g(X) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p \qquad (1)$$

The logistic regression model is attractive because each coefficient \beta_j has a meaningful interpretation: e^{\beta_j} is the odds ratio between 2 observations with a 1 unit difference in x_j, controlling for all other predictors. In logistic regression the probability of an event, \pi(X), given X, can be calculated by:

$$\pi(X) = \frac{e^{g(X)}}{1 + e^{g(X)}} \qquad (2)$$
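To make equations (1) and (2) concrete, here is a minimal Python sketch; the coefficient values are arbitrary illustrations, not estimates from any model discussed in this thesis:

```python
import math

def logit(x1, x2, b0=-1.5, b1=0.8, b2=0.3):
    # g(X) = beta_0 + beta_1*x1 + beta_2*x2, as in equation (1)
    return b0 + b1 * x1 + b2 * x2

def event_probability(g):
    # pi(X) = e^{g(X)} / (1 + e^{g(X)}), as in equation (2)
    return math.exp(g) / (1.0 + math.exp(g))

g = logit(x1=1.0, x2=0.0)
print(event_probability(g))   # P(event) for this observation
print(math.exp(0.8))          # e^{beta_1}: odds ratio for a 1-unit increase in x1
```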

As is the case with all modeling, goodness-of-fit is of interest in logistic regression. The most common method for assessing goodness-of-fit in logistic regression is the Hosmer-Lemeshow test. In the Hosmer-Lemeshow test, estimated probabilities of an event are ordered and put into g groups, usually based on \hat{\pi}. The way that observations are grouped can vary, but creating 10 groups based on deciles of \hat{\pi} is most common. Within each group, observed and expected event frequencies are compared using the following [1]:

$$\hat{C} = \sum_{k=1}^{g} \left[ \frac{(o_{1k} - \hat{e}_{1k})^2}{\hat{e}_{1k}} + \frac{(o_{0k} - \hat{e}_{0k})^2}{\hat{e}_{0k}} \right] \qquad (3)$$

where o_{1k} = the number of observed events in group k, o_{0k} = the number of observed nonevents in group k, \hat{e}_{1k} = the number of expected events in group k, and \hat{e}_{0k} = the number of expected nonevents in group k. The Hosmer-Lemeshow statistic \hat{C} follows a chi-square distribution with (g - 2) degrees of freedom [1].
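As an illustration, a minimal Python implementation of the statistic in equation (3) might look like the following sketch. The function name and the equal-size grouping by ordered probability are my own choices, and the df argument anticipates the validation-sample setting used later in this thesis, where the statistic is referred to 10 degrees of freedom:

```python
import numpy as np
from scipy import stats

def hosmer_lemeshow_test(y, p_hat, g=10, df=None):
    """Hosmer-Lemeshow test of equation (3).
    y: 0/1 outcomes; p_hat: estimated event probabilities."""
    y, p_hat = np.asarray(y), np.asarray(p_hat)
    order = np.argsort(p_hat)          # sort by estimated probability
    groups = np.array_split(order, g)  # g near-equal groups (deciles when g = 10)
    C = 0.0
    for idx in groups:
        o1 = y[idx].sum()        # observed events in group k
        e1 = p_hat[idx].sum()    # expected events in group k
        o0 = len(idx) - o1       # observed nonevents in group k
        e0 = len(idx) - e1       # expected nonevents in group k
        C += (o1 - e1) ** 2 / e1 + (o0 - e0) ** 2 / e0
    if df is None:
        df = g - 2               # g - 2 df when the model was fit to the same data
    return C, stats.chi2.sf(C, df)   # statistic and p-value
```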

As widely used as the Hosmer-Lemeshow test is, it can yield a high rejection rate when large samples are used. Though a good model may be specified, the true model is always unknown, so there will usually be small differences between the observed and expected number of events within a group. When many observations are collapsed into 10 groups, these differences quickly add up and increase the power of the Hosmer-Lemeshow test, resulting in the rejection of incorrect though passable models. Kramer and Zimmerman [2] demonstrated this effect directly through a simulation study. A sample of 50,000 observations was simulated from a logistic regression model. For probabilities in the upper and lower third, a small deviation was induced in the probability used to assign values for the dependent variable. This deviation created a 0.4% difference between observed and expected outcomes in the tails. The true logistic regression model was fit to the data and the Hosmer-Lemeshow test was significant (p < 0.05). This process was repeated 1000 times and the small differences between observed and expected values were enough to reject the model 100% of the time. When the simulation was repeated with 5000 observations per data set, only 9.7% of the replications were rejected [2]. Large data sets are generally desirable, but excessive power is an unwelcome consequence for goodness-of-fit tests.
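A rough Python sketch of this kind of experiment, reusing the hosmer_lemeshow_test function above, is shown below; the logit, the 0.02 tail deviation, and the |x| > 1 cutoff are arbitrary choices of mine, and for brevity the true coefficients are used rather than re-fitting the model as Kramer and Zimmerman did:

```python
import numpy as np
# reuses hosmer_lemeshow_test from the sketch above

rng = np.random.default_rng(0)

def rejection_rate(n, reps=200, alpha=0.05):
    rejections = 0
    for _ in range(reps):
        x = rng.normal(size=n)
        p_true = 1 / (1 + np.exp(-(-1 + x)))       # probabilities from the model
        p_used = np.where(np.abs(x) > 1,           # perturb the tails slightly
                          np.clip(p_true + 0.02, 0, 1), p_true)
        y = rng.binomial(1, p_used)                # outcomes drawn with the deviation
        _, pval = hosmer_lemeshow_test(y, p_true)  # tested against the unperturbed model
        rejections += pval < alpha
    return rejections / reps

print(rejection_rate(50_000))  # large n: the small tail deviation is rejected routinely
print(rejection_rate(5_000))   # same deviation, smaller n: far fewer rejections
```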

Ideally, a goodness-of-fit test performed on the same model with different sample sizes would reach similar conclusions. Paul et al. [3] suggested that the power of the Hosmer-Lemeshow test could be standardized across two sample sizes by adjusting the number of groups used in the test. Assuming the non-centrality parameter, \lambda, is independent of the number of groups g, if power is to remain constant with sample size n, then

$$\frac{\lambda}{\sqrt{2(g-2)}} = \text{constant} \qquad (4)$$

(equation 12 from [3]) must hold. Thus if there are two samples of sizes n_1 and n_2, then

$$\frac{n_1}{\sqrt{g_1 - 2}} = \frac{n_2}{\sqrt{g_2 - 2}} \qquad (5)$$

(equation 13 from [3]) must be true for the Hosmer-Lemeshow test to have the same power for the two samples. In their simulation, two models with limited lack-of-fit had a low rejection rate when 10 groups were used with 1000 observations. This suggested that power is not excessive when 10 groups are used for a sample of 1000. Using this as a standard, a Hosmer-Lemeshow test with n observations would need the following number of groups to have power comparable to the test with 1000 observations using 10 groups:

$$g = 2 + 8\left(\frac{n}{1000}\right)^2 \qquad (6)$$


Unfortunately, they found this rule cannot be used universally because the chi-square assumption was violated when the number of groups was approximately equal to the number of events. To prevent violations of the chi-square assumption, it was proposed that the ceiling for the number of groups be half the number of events. This criterion was combined with their previous recommendation to give a general guideline for selecting the number of groups to use in the Hosmer-Lemeshow test:

$$g = \max\left(10,\ \min\left(\frac{m}{2},\ 2 + 8\left(\frac{n}{1000}\right)^2\right)\right) \qquad (7)$$

(equation 15 from [3]) where m = the number of subjects with an event.
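Equation (7) translates directly into code; a minimal sketch follows (whether to round the result to an integer is left to the analyst):

```python
def hl_groups(n, m):
    """Suggested number of Hosmer-Lemeshow groups per equation (7).
    n = sample size, m = number of subjects with an event."""
    return max(10, min(m / 2, 2 + 8 * (n / 1000) ** 2))

print(hl_groups(1000, 300))    # 10: the baseline of 10 groups
print(hl_groups(5000, 1000))   # 202.0: the quadratic term dominates
print(hl_groups(25000, 500))   # 250.0: capped at half the number of events
```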

Since large sample size is often cited as the reason for significant Hosmer-Lemeshow tests [4, 5, 6], this approach can help, but it may not be widely accepted. The fact that the number of groups can be changed makes some uneasy about using the Hosmer-Lemeshow test because conclusions can be sensitive to even minor changes in this choice [7, 8]. Instead of changing the number of groups or abandoning the test altogether, it may be helpful to employ the test in an alternative way that can be used on data sets with large sample sizes.

In the same study where they recommended standardizing power by changing the number of groups [3], Paul et al. proposed an interesting strategy where the Hosmer-Lemeshow test is repeatedly done using 10 groups on random subsets of the observations. Kramer and Zimmerman [2] offered the same recommendation and suggested subsets of 5000 observations. If significant Hosmer-Lemeshow tests are the result of excessive power from large samples and not a poor model, then it is worth exploring how the Hosmer-Lemeshow test would evaluate the same model applied to fewer observations. This procedure, which is easy to do and sure to reduce the power, is tempting to pursue and has been tried a couple of times.

In a study conducted at the Philip R. Lee Institute for Health Policy Studies at the University of California, San Francisco, logistic regression was used to predict mortality in intensive care units [9]. Goodness-of-fit was assessed on a validation set of 27,187 observations by performing the Hosmer-Lemeshow test (using 10 groups) on 11 random samples of 5000 observations. Nine of 11 (81.8%) were non-significant and the model was deemed adequately calibrated. Nine out of 11 tests seems like evidence of a good model, but it may have been more convincing to take more than 11 samples; though it should be noted that the authors also assessed goodness-of-fit using a graphical comparison of observed and expected numbers of events. No justification was given for why 11 samples were used.

This procedure was also performed in a study by Gomes et al. [10]. Logistic regression was used to build a predictive model for mortality using data from the hospital information system of the Brazilian National Health System (SUS). The validation set contained almost 150,000 observations, so to avoid the excessive power of the Hosmer-Lemeshow test, it was decided that the test would be done on random samples of 5000 observations. The study concluded adequate calibration, but only one p-value of 0.256 was given; it is not clear if this came from a single sample or an average across multiple samples. No other method was used to justify the conclusion of good fit for the model.

The subsampling procedure suggested by the earlier studies [2, 3] might provide a way to assess the fit of models built with large samples, but neither Kramer and Zimmerman nor Paul et al. provided sufficient guidelines for implementation. Their suggestions did not state how many subsets to draw, though they both offered guidelines on the size of the subsets. Paul et al. did include the caveat that conclusions would be subjective, but since this method has been used in several instances already, it would help to have a reference on which to base decisions. While potentially useful, it is important to investigate how this method performs and determine its capability as an alternative to current goodness-of-fit methods. This thesis presents a simulation study that was done to investigate the feasibility of assessing goodness-of-fit by performing the Hosmer-Lemeshow test on random subsets from a large data set.


Chapter 2: Simulation Study

2.1 Introduction

Simulated data sets were used to investigate the behavior of a subsampling strategy. The best-fitting logistic regression model may contain interactions and polynomial terms, but oftentimes it is preferable to exclude such terms in favor of a simpler model that is easier to interpret. Excluding these predictors can contribute to lack-of-fit. A variety of scenarios were tested to represent situations that may occur in logistic regression modeling, such as excluding interactions and nonlinear predictors. This simulation study will show how the subsampling approach works in these scenarios. Specifically, the number of subsets with significant tests, and how the number of observations in each subset affects these results, will be examined. A recommendation for implementation of this procedure will then be given.

2.2 Simulation Scenarios

Eight scenarios, shown in Table 1, were considered. Except for the first scenario, where the model fitted to the data was identical to the true model, the fitted model differed from the true model by omitting interactions or nonlinear terms. Observations in the data set included the covariates appearing in the true logits.

Table 1: Simulation Study Scenarios

Scenario   Fitted model relative to the true logit
1          Identical to the true model
2          Omits a binary predictor
3          Omits an interaction with a small coefficient
4          Omits an interaction with a large coefficient
5          Omits a quadratic term with a small coefficient
6          Omits a quadratic term with a large coefficient
7          Omits both a quadratic term (small coefficient) and an interaction (large coefficient)
8          Fits well except in the tails

Though not perfect, some of the fitted models that were specified may be good and would ideally not be rejected by the Hosmer-Lemeshow test. To help determine which models were acceptable, the consequences of mis-modeling were depicted graphically. This was done by simulating 50,000 observations using the true logit for each scenario in Table 1 and then fitting the corresponding incorrect model to the data. The event probabilities of the fitted and true models for each observation were graphed as a function of x so they could be compared visually in Figure 1. No plot is shown for Scenario 1 since the fitted model was the same as the true model. No plot is shown for Scenario 2 either; in the case where a binary predictor is omitted, the estimated probability is averaged between the two levels of the binary predictor.

Scenarios 1, 2, 3, and 5 are cases where the Hosmer-Lemeshow test would ideally be non-significant, regardless of sample size. The coefficient of the interaction omitted in Scenario 3 was small, which resulted in small differences between the true and fitted models (Scenario 3, Figure 1). When the fitted model omitted an interaction with a larger coefficient in Scenario 4, the estimated probabilities of the fitted model were noticeably different from the truth (Scenario 4, Figure 1). Similarly, between Scenarios 5 and 6, when the coefficient of the quadratic term increased, the difference between the true and fitted models became more apparent (Scenarios 5 and 6, Figure 1). Scenario 7 is a case where the fitted model omits both a quadratic term and an interaction; the coefficient is small on the quadratic term but large on the interaction. The result is an obvious deviation between the true and fitted models (Scenario 7, Figure 1). Unfortunately, lack-of-fit is not always so apparent. The fitted model in Scenario 8 was a good match except in the tails (Scenario 8, Figure 1). It may often be the case that the fit of a simpler model with an easier interpretation is borderline.


Figure 1: Estimated probability of true vs. fitted models as a function of x (dashed line = true model, solid line = fitted model)


2.3 Methods

For each scenario, data sets of 25,000, 50,000, 100,000, and 250,000 observations were simulated. Logistic regression was performed using the corresponding "Fitted model" in Table 1 and the model coefficients were obtained. Next, a subset of observations was taken (sampled with replacement) and the Hosmer-Lemeshow test was conducted using the coefficients from the model that was fit to the complete data. Ten degrees of freedom were used since coefficients were not recalculated on the subsets, and p-values < 0.05 were considered significant. One hundred subsets were taken and the number of subsets that had significant Hosmer-Lemeshow tests was counted. This entire process was repeated 100 times for every scenario and each of the sample sizes. Figure 2 helps illustrate the procedure. Since one suggestion for subset size was 5000 [2] and the other was 1000 [3], the procedure was tried with subset sizes of 1000, 2000, 5000, and 2% of the original sample.

Figure 2: Simulation procedure. For each of the 100 simulated data sets, 100 subsets are drawn and a Hosmer-Lemeshow test is performed on each subset. Repeated for all scenarios at all 4 sample sizes.
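The procedure for a single simulated data set can be sketched in Python as follows; it reuses the hosmer_lemeshow_test function from Chapter 1, and the true and working models shown here are illustrative stand-ins rather than the actual Table 1 scenarios:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
# reuses hosmer_lemeshow_test from Chapter 1

rng = np.random.default_rng(1)

def count_significant_subsets(n, subset_size=5000, n_subsets=100, alpha=0.05):
    # simulate one data set from an illustrative true logit with a quadratic term
    x = rng.normal(size=n)
    p = 1 / (1 + np.exp(-(-1 + x + 0.3 * x ** 2)))
    y = rng.binomial(1, p)
    # fit the working model (linear in x, so the quadratic term is omitted) to the full data
    model = LogisticRegression().fit(x.reshape(-1, 1), y)
    significant = 0
    for _ in range(n_subsets):
        idx = rng.choice(n, size=subset_size, replace=True)        # subset with replacement
        p_hat = model.predict_proba(x[idx].reshape(-1, 1))[:, 1]   # coefficients from full data
        _, pval = hosmer_lemeshow_test(y[idx], p_hat, g=10, df=10) # 10 df: no re-fitting
        significant += pval < alpha
    return significant

print(count_significant_subsets(n=100_000))   # number of significant subsets out of 100
```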


2.4 Results

Tables 2-4 show the effects of using the Hosmer-Lemeshow test on subsets of a large data set. Each subset size that was tried is presented in a separate table since this was a parameter of interest. Columns are not mutually exclusive; i.e., if a data set had more than 10 significant subsets it also had more than 5, and so it is represented in each column.

In general, the number of significant tests was smaller when a good model was fit (Scenarios 1, 2, 3, 5), though that number did start to get larger in Scenario 5 when a quadratic term with a small coefficient was omitted. In Scenario 6, when the coefficient on the quadratic term being omitted in the true model was larger, the number of significant tests increased noticeably. Scenario 7 was interesting because when the model omitted an interaction with a large coefficient, 100% of the subsets were significant for each subset size tried. The Hosmer-Lemeshow test's lack of power to detect interactions has been previously observed [3, 7, 11] and was seen to a degree here in Scenario 4: the coefficient on the interaction being omitted was big enough to appear important graphically (Figure 1), but lack-of-fit was non-significant for most subsets. This did improve when the subset size was 5000.

Comparing results from Tables 2-4, it can be seen that taking larger subsets generally resulted in more significant Hosmer-Lemeshow tests, which is not surprising. Regardless of how much the fitted model deviated from the true model, samples frequently had more than 5% of their subsets produce significant tests, though it was rarely the case that more than 50% of the subsets were significant. The exception is Scenario 7, where the estimated model was missing the interaction with a large coefficient and the model appeared to be highly mischaracterized.

The size of the sample that the subset was taken from had a varying effect on results. For example, in Table 3 under Scenario 4, when the original sample was 25,000, 70% of the data sets had more than 10 significant subsets; when the original sample was 50,000, only 55% of the data sets had more than 10 significant subsets. Results from the alternative 2% subset size approach are not shown, but those results also varied across sample sizes.


Table 2: Number of significant subsets when subset size = 1000 observations

Scenario   Original Size   > 5   > 10   > 15   > 20   > 50
1          25,000           56      7      0      0      0
           50,000           27      2      0      0      0
           100,000          47      0      0      0      0
           250,000          39      0      0      0      0
2          25,000           46      2      0      0      0
           50,000           47      2      0      0      0
           100,000          45      0      0      0      0
           250,000          37      2      0      0      0
3          25,000           54      2      0      0      0
           50,000           42      1      0      0      0
           100,000          39      1      0      0      0
           250,000          33      1      0      0      0
4          25,000           83     17      1      0      0
           50,000           72     17      1      0      0
           100,000          76     16      1      0      0
           250,000          72     14      0      0      0
5          25,000           74     11      1      0      0
           50,000           64     10      0      0      0
           100,000          60      4      0      0      0
           250,000          60      3      0      0      0
6          25,000           99     60     16      1      0
           50,000           98     54      7      0      0
           100,000          94     39      7      0      0
           250,000          95     44      5      0      0
7          25,000          100    100    100    100    100
           50,000          100    100    100    100    100
           100,000         100    100    100    100    100
           250,000         100    100    100    100    100
8          25,000           76     11      0      0      0
           50,000           67      5      0      0      0
           100,000          67      6      0      0      0
           250,000          53      1      0      0      0


Table 3: Number of significant subsets when subset size = 2000 observations

Scenario   Original Size   > 5   > 10   > 15   > 20   > 50
1          25,000           76      7      0      0      0
           50,000           51      4      0      0      0
           100,000          46      4      0      0      0
           250,000          37      1      0      0      0
2          25,000           73      9      0      0      0
           50,000           52      4      0      0      0
           100,000          50      1      0      0      0
           250,000          32      0      0      0      0
3          25,000           65     12      0      0      0
           50,000           46      4      1      0      0
           100,000          44      0      0      0      0
           250,000          42      2      0      0      0
4          25,000           97     70     34      3      0
           50,000           97     55     16      4      0
           100,000          94     46     10      2      0
           250,000          92     48      9      0      0
5          25,000           88     40      8      0      0
           50,000           86     25      5      0      0
           100,000          85     16      1      0      0
           250,000          80     12      1      0      0
6          25,000          100     98     81     56      0
           50,000          100     98     79     34      0
           100,000         100     95     68     29      0
           250,000         100     95     68     25      0
7          25,000          100    100    100    100    100
           50,000          100    100    100    100    100
           100,000         100    100    100    100    100
           250,000         100    100    100    100    100
8          25,000           89     48      8      1      0
           50,000           91     29      5      0      0
           100,000          87     22      2      0      0
           250,000          84     22      2      0      0


Table 4: Number of significant subsets when subset size = 5000 observations

Scenario   Original Size   > 5   > 10   > 15   > 20   > 50
1          25,000           89     43      9      4      0
           50,000           76      7      0      0      0
           100,000          60      1      0      0      0
           250,000          40      1      0      0      0
2          25,000           92     41      9      2      0
           50,000           76     23      2      0      0
           100,000          62      3      0      0      0
           250,000          45      1      0      0      0
3          25,000           89     41     13      3      0
           50,000           71     13      0      0      0
           100,000          50      6      1      0      0
           250,000          48      3      0      0      0
4          25,000          100     97     90     79      3
           50,000          100    100     94     75      0
           100,000         100    100     93     76      0
           250,000         100     99     95     67      0
5          25,000          100     91     63     36      1
           50,000           97     83     46     19      0
           100,000          97     72     34      8      0
           250,000          99     71     24      4      4
6          25,000          100    100    100     99     59
           50,000          100    100    100    100     46
           100,000         100    100    100    100     31
           250,000         100    100    100    100     21
7          25,000          100    100    100    100    100
           50,000          100    100    100    100    100
           100,000         100    100    100    100    100
           250,000         100    100    100    100    100
8          25,000          100     91     78     51      0
           50,000          100     91     65     35      0
           100,000          98     84     55     17      0
           250,000          98     89     38     11      0


Chapter 3: Discussion

The goal of subsampling is to show evidence of good fit for a model built with large data sets where differences between observed and predicted outcomes are small, but still be able to identify mis-modeling when the differences are larger. The decision as to how many significant tests indicates lack-of-fit is ultimately up to the researcher, but this simulation offers some guidelines. Even when a perfect model was fit, it was not unusual for a data set to have more than 5% of its subsets give significant tests. This may seem odd for a test with a 5% type I error rate, but the subsets were taken from the same sample and thus were not independent. Using 5% as the threshold is therefore too low, but a slightly higher number would work. Researchers should also not expect to have a large number of significant tests every time there is lack-of-fit, especially when considering interactions. Even for a poorly fit model, the majority of subsets may not be significant unless the model is very inaccurate (as in Scenario 7).

Not surprisingly, the number of significant tests depended on how many observations were in the subsets. Taking 1000 observations may not offer enough power to detect nonlinear or interaction terms. Table 2 shows that data sets simulated under Scenario 4 rarely had more than 10% of their subsets give significant tests even though the model was missing an interaction. Similarly, in Scenario 6, where a quadratic term was missing, the procedure did not show many instances where the data sets had more than 10% of their subsets give significant tests. Some increase in sensitivity was seen when the subset size increased to 2000 observations, and even more so when the size increased to 5000.

The fact that results differed across the original sample size is concerning, although this effect was not as drastic when only samples of 100,000 or more were considered. A possible explanation is that there is more similarity among subsets from the smaller populations: taking a sample of 2000 observations from a population of 25,000 is not the same as taking a sample of 2000 observations from a population of 250,000. To address this, equally proportioned subset sizes of 2% were tried, but this did not resolve the issue and results still differed across population sizes. Again, this is not surprising since subsets with 500 observations (2% of 25,000) have much less power than subsets with 5000 observations (2% of 250,000). This approach suffers greatly when the original sample is too small or too large. If, for example, the sample is 2 million, then a subset of even 2% would be 40,000 observations.

To give subsets sufficient power to be effective but ensure that the correlation between them is small, the subsampling strategy should only be considered when there are 100,000 or more observations. One recommendation is to take 100 subsets of 5000 observations, and if more than 10 of the subsets give significant Hosmer-Lemeshow tests, then suspect lack-of-fit and consider adjusting the model. Results from this study show that this approach can identify missing interactions and nonlinear terms when they have a large effect on the outcome (Scenarios 4 and 6). When omitting the interaction with the small effect (Scenario 3), the recommendation did not result in 10 significant subsets very often. The Hosmer-Lemeshow test appeared to be sensitive to omitting the quadratic term with the small coefficient (Scenario 5), and results show that these guidelines will result in a rejection more often than desired in a situation like this. Choosing a different threshold is an easy way around this: if the recommendation were that 15 significant results indicate lack-of-fit, then the rejection rate of Scenario 5 would be lower. The problem is that this alternative threshold decreased the chance of concluding lack-of-fit for Scenario 8. The threshold of 10 significant subsets is conservative in terms of concluding good fit, but a cautious approach seems justified when assessing goodness-of-fit.
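Applied to a fitted model in practice, the recommendation amounts to the following sketch, again reusing the hosmer_lemeshow_test function from Chapter 1; the function name and interface are illustrative:

```python
import numpy as np
# reuses hosmer_lemeshow_test from Chapter 1

def suspect_lack_of_fit(y, p_hat, n_subsets=100, subset_size=5000,
                        threshold=10, alpha=0.05, seed=0):
    """Recommended rule: draw 100 subsets of 5000 observations and suspect
    lack-of-fit if more than 10 yield significant Hosmer-Lemeshow tests.
    Intended for data sets with 100,000 or more observations."""
    rng = np.random.default_rng(seed)
    y, p_hat = np.asarray(y), np.asarray(p_hat)
    significant = 0
    for _ in range(n_subsets):
        idx = rng.choice(len(y), size=subset_size, replace=True)
        _, pval = hosmer_lemeshow_test(y[idx], p_hat[idx], g=10, df=10)
        significant += pval < alpha
    return significant > threshold, significant
```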

One limitation of this study is that only a few scenarios were tested. However, these models are common in applications and represent a common situation where the simplest models may be chosen to provide easy interpretations. Another limitation is that there is no recommendation for subsampling data sets with 50,000 observations or fewer, even though excessive power can still be an issue for models being tested on populations of this size. The method of changing the number of groups proposed by Paul et al. can provide a way to assess goodness-of-fit with samples up to 25,000. Although the recommendation requires more than 100,000 observations, studies are becoming increasingly capable of obtaining such samples. Feudtner et al. [12] built a logistic regression model to predict pediatric death using 678,861 observations. Kemper et al. [13] used 441,584 observations from children enrolled in Medicaid to study vision care. Improvements in data collection have made it possible to perform studies with large samples.

As stated earlier, some studies have already tried this technique but provided little justification for their methods or conclusions. Performing the test on smaller subsets can avoid the excessive power of large data sets, but blindly reducing the power should not be the objective of the approach. This simulation shows that simply taking a small number of subsets and seeing that a majority of them are not significant is not necessarily conclusive of adequate fit. Researchers must give careful consideration when deciding whether a significant Hosmer-Lemeshow test is due to sample size or lack-of-fit. As with other statistical procedures, conclusions will depend on what an individual considers significant. The recommendation of 10 significant subsets indicating lack-of-fit is a conservative choice, but it is based on at least some simulation results. This study showed that performing the Hosmer-Lemeshow test on random samples from a large data set can reduce power but still detect lack-of-fit in a variety of scenarios.


References

1. Hosmer DW, Lemeshow S. Goodness of fit tests for the multiple logistic regression model. Communications in Statistics - Theory and Methods. 1980; 9(10):1043-1069. DOI: 10.1080/03610928008827941.

2. Kramer AA, Zimmerman JE. Assessing the calibration of mortality benchmarks in critical care: The Hosmer-Lemeshow test revisited. Critical Care Medicine. Sept 2007; 35(9):2052-2056. DOI: 10.1097/01.CCM.0000275267.64078.B0.

3. Paul P, Pennell ML, Lemeshow S. Standardizing the power of the Hosmer-Lemeshow goodness of fit test in large data sets. Statistics in Medicine. Jan 2013; 32:67-80. DOI: 10.1002/sim.5525.

4. Seymour CW, Kahn JM, Cooke CR, Watkins TR, Heckbert SR, Rea TD. Prediction of Critical Illness During Out-of-Hospital Emergency Care. JAMA. Aug 2010; 304(7):747-754. DOI: 10.1001/jama.2010.1140.

5. Glance LG, Dick AW, Osler TM, Mukamel DB, Li Y, Stone PW. The association between nurse staffing and hospital outcomes in injured patients. BMC Health Services Research. Aug 2012; 12:247. DOI: 10.1186/1472-6963-12-247.

6. Afessa B, Gajic O, Morales IJ, Keegan MT, Peters SG, Hubmayr RD. Association Between ICU Admission During Morning Rounds and Mortality. Chest. Dec 2009; 136(6):1489-95. DOI: 10.1378/chest.09-0529.

7. Allison PD. Logistic Regression Using SAS: Theory and Application. Cary, NC: SAS Institute, 2012.

8. Pigeon JG, Heyse JF. A cautionary note about assessing the fit of logistic regression models. Journal of Applied Statistics. 1999; 26(7):847-853. DOI: 10.1080/02664769922089.

9. ICU Outcomes (Mortality and Length of Stay) Methods, Data Collection Tool and Data. Philip R. Lee Institute for Health Policy Studies. Retrieved June 10, 2013, from http://healthpolicy.ucsf.edu/content/icu-outcomes


10. Gomes AS, Kluck MM, Riboldi J, Fachel JM. Mortality prediction model using data from the Hospital Information System. Revista de Saude Publica. Oct 2010; 44(5):934-41. DOI: 10.1590/S0034-89102010005000037.

11. Hosmer DW, Hosmer T, le Cessie S, Lemeshow S. A Comparison of Goodness-of-Fit Tests for the Logistic Regression Model. Statistics in Medicine. May 1997; 16(9):965-80. DOI: 10.1002/(SICI)1097-0258(19970515)16:9<965::AID-SIM509>3.0.CO;2-O.

12. Feudtner C, Hexem KR, Shabbout M, Feinstein JA, Sochalski J, Silber JH. Prediction of Pediatric Death in the Year after Hospitalization: A Population-Level Retrospective Cohort Study. Journal of Palliative Medicine. Feb 2009; 12(2):160-169. DOI: 10.1089/jpm.2008.0206.

13. Kemper AR, Cohn LM, Dombkowski KJ. Patterns of Vision Care Among Medicaid-Enrolled Children. Pediatrics. 2004; 113(3):e190-e196.
