
Evaluating goodness-of-fit for a model using the Hosmer-Lemeshow test on samples from a large set

THESIS

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By

Adam Bartley

Graduate Program in Public Health

The Ohio State University

2014

Master's Examination Committee:

Michael Pennell, Advisor

Stanley Lemeshow

Copyright by

Adam Bartley

2014

Abstract

The Hosmer-Lemeshow test is a commonly used assessment of goodness-of-fit for logistic regression models. As widely used as the Hosmer-Lemeshow test is, it can yield a high rejection rate of acceptable models when large samples are used. Several studies have suggested that one way around this would be to perform the test on random samples of fewer observations from the original data. This procedure would be easy to do and certainly reduce the power of the test, but no guidelines were given for how to implement the procedure or how to interpret results. At least two studies have used this technique with little justification for their conclusions. The purpose of this thesis was to evaluate the method proposed by others and give a recommendation for implementation. Results of a simulation study suggested that when one hundred subsets of five thousand observations were taken, if more than 10 had significant Hosmer-Lemeshow tests then the fit of the model should be considered suspect.


Acknowledgments

I would like to thank Dr. Pennell, Dr. Lemeshow, and Gary Phillips for their assistance during my time as a student. They have been generous with their time and advice. This thesis would not have been possible without them and all of their support.


Vita

June 2011 ...... B.S. Food, Agricultural, and Biological Engineering, The Ohio State University

Fields of Study

Major Field: Public Health


Table of Contents

Abstract

Acknowledgments

Vita

List of Tables

List of Figures

Chapter 1: Introduction

Chapter 2: Simulation Study

2.1 Introduction

2.2 Simulation Scenarios

2.3 Methods

2.4 Results

Chapter 3: Discussion

References


List of Tables

Table 1. Simulation Study Scenarios

Table 2. Number of significant subsets when subset size = 1000 observations

Table 3. Number of significant subsets when subset size = 2000 observations

Table 4. Number of significant subsets when subset size = 5000 observations


List of Figures

Figure 1. True vs. Fitted models

Figure 2. Simulation Procedure


Chapter 1: Introduction

Logistic regression is a statistical technique often used when studying binary outcomes.

This procedure models the log-odds of an event occurring, g(X) (sometimes referred to as the logit), given a set of predictors X = (x_1, x_2, ..., x_p):

$$g(X) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p \qquad (1)$$

The logistic regression model is attractive because each coefficient \beta_j has a meaningful interpretation: e^{\beta_j} is the odds ratio between 2 observations with a 1 unit difference in x_j, controlling for all other predictors. In logistic regression the probability of an event, \pi(X), given X, can be calculated by:

$$\pi(X) = \frac{e^{g(X)}}{1 + e^{g(X)}} \qquad (2)$$
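To make equations (1) and (2) concrete, here is a minimal Python sketch; the coefficient values are arbitrary illustrations, not estimates from any model discussed in this thesis:

```python
import math

def logit(x1, x2, b0=-1.5, b1=0.8, b2=0.3):
    # g(X) = beta_0 + beta_1*x1 + beta_2*x2, as in equation (1)
    return b0 + b1 * x1 + b2 * x2

def event_probability(g):
    # pi(X) = e^{g(X)} / (1 + e^{g(X)}), as in equation (2)
    return math.exp(g) / (1.0 + math.exp(g))

g = logit(x1=1.0, x2=0.0)
print(event_probability(g))   # P(event) for this observation
print(math.exp(0.8))          # e^{beta_1}: odds ratio for a 1-unit increase in x1
```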

As is the case with all modeling, goodness-of-fit is of interest in logistic regression. The most common method for assessing goodness-of-fit in logistic regression is the Hosmer-Lemeshow test. In the Hosmer-Lemeshow test, estimated probabilities of an event are ordered and put into g groups, usually based on \hat{\pi}. The way that observations are grouped can vary, but creating 10 groups based on deciles of \hat{\pi} is most common. Within each group, observed and expected event frequencies are compared using the following [1]:

$$\hat{C} = \sum_{k=1}^{g} \left[ \frac{(o_{1k} - \hat{e}_{1k})^2}{\hat{e}_{1k}} + \frac{(o_{0k} - \hat{e}_{0k})^2}{\hat{e}_{0k}} \right] \qquad (3)$$

where o_{1k} = the number of observed events in group k, o_{0k} = the number of observed nonevents in group k, \hat{e}_{1k} = the number of expected events in group k, and \hat{e}_{0k} = the number of expected nonevents in group k. The Hosmer-Lemeshow statistic \hat{C} follows a chi-square distribution with (g - 2) degrees of freedom [1].
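As an illustration, a minimal Python implementation of the statistic in equation (3) might look like the following sketch. The function name and the equal-size grouping by ordered probability are my own choices, and the df argument anticipates the validation-sample setting used later in this thesis, where the statistic is referred to 10 degrees of freedom:

```python
import numpy as np
from scipy import stats

def hosmer_lemeshow_test(y, p_hat, g=10, df=None):
    """Hosmer-Lemeshow test of equation (3).
    y: 0/1 outcomes; p_hat: estimated event probabilities."""
    y, p_hat = np.asarray(y), np.asarray(p_hat)
    order = np.argsort(p_hat)          # sort by estimated probability
    groups = np.array_split(order, g)  # g near-equal groups (deciles when g = 10)
    C = 0.0
    for idx in groups:
        o1 = y[idx].sum()        # observed events in group k
        e1 = p_hat[idx].sum()    # expected events in group k
        o0 = len(idx) - o1       # observed nonevents in group k
        e0 = len(idx) - e1       # expected nonevents in group k
        C += (o1 - e1) ** 2 / e1 + (o0 - e0) ** 2 / e0
    if df is None:
        df = g - 2               # g - 2 df when the model was fit to the same data
    return C, stats.chi2.sf(C, df)   # statistic and p-value
```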

As widely used as the Hosmer-Lemeshow test is, it can yield a high rejection rate when large samples are used. Though a good model may be specified, the true model is always unknown, so there will usually be small differences between the observed and expected number of events within a group. When many observations are collapsed into 10 groups, these differences quickly add up and increase the power of the Hosmer-Lemeshow test, resulting in the rejection of incorrect though passable models. Kramer and Zimmerman [2] demonstrated this effect directly through a simulation study. A sample of 50,000 observations was simulated from a logistic regression model. For probabilities in the upper and lower third, a small deviation was induced in the probability used to assign values for the dependent variable. This deviation created a 0.4% difference between observed and expected outcomes in the tails. The true logistic regression model was fit to the data and the Hosmer-Lemeshow test was significant (p < 0.05). This process was repeated 1000 times and the small differences between observed and expected values were enough to reject the model 100% of the time. When the simulation was repeated with 5000 observations per data set, only 9.7% of the replications were rejected [2]. Large data sets are generally desirable, but excessive power is an unwelcome consequence for goodness-of-fit tests.
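A rough Python sketch of this kind of experiment, reusing the hosmer_lemeshow_test function above, is shown below; the logit, the 0.02 tail deviation, and the |x| > 1 cutoff are arbitrary choices of mine, and for brevity the true coefficients are used rather than re-fitting the model as Kramer and Zimmerman did:

```python
import numpy as np
# reuses hosmer_lemeshow_test from the sketch above

rng = np.random.default_rng(0)

def rejection_rate(n, reps=200, alpha=0.05):
    rejections = 0
    for _ in range(reps):
        x = rng.normal(size=n)
        p_true = 1 / (1 + np.exp(-(-1 + x)))       # probabilities from the model
        p_used = np.where(np.abs(x) > 1,           # perturb the tails slightly
                          np.clip(p_true + 0.02, 0, 1), p_true)
        y = rng.binomial(1, p_used)                # outcomes drawn with the deviation
        _, pval = hosmer_lemeshow_test(y, p_true)  # tested against the unperturbed model
        rejections += pval < alpha
    return rejections / reps

print(rejection_rate(50_000))  # large n: the small tail deviation is rejected routinely
print(rejection_rate(5_000))   # same deviation, smaller n: far fewer rejections
```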

Ideally, a goodness-of-fit test performed on the same model with different sample sizes would reach similar conclusions. Paul et al. [3] suggested that the power of the Hosmer-Lemeshow test could be standardized across two sample sizes by adjusting the number of groups used in the test. Assuming the non-centrality parameter, \lambda, is independent of the number of groups g, if power is to remain constant with sample size n, then

$$\frac{\lambda}{\sqrt{2(g-2)}} = \text{constant} \qquad (4)$$

(equation 12 from [3]) must hold. Thus if there are two samples of sizes n_1 and n_2, then

$$\frac{n_1}{\sqrt{g_1 - 2}} = \frac{n_2}{\sqrt{g_2 - 2}} \qquad (5)$$

(equation 13 from [3]) must be true for the Hosmer-Lemeshow test to have the same power for the two samples. In their simulation, two models with limited lack-of-fit had a low rejection rate when 10 groups were used with 1000 observations. This suggested that power is not excessive when 10 groups are used for a sample of 1000. Using this as a standard, a Hosmer-Lemeshow test with n observations would need the following number of groups to have power comparable to the test with 1000 observations using 10 groups:

$$g = 2 + 8\left(\frac{n}{1000}\right)^2 \qquad (6)$$


Unfortunately, they found this rule cannot be used universally because the chi-square assumption was violated when the number of groups was approximately equal to the number of events. To prevent violations of the chi-square assumption, it was proposed that the ceiling for the number of groups be half the number of events. This criterion was combined with their previous recommendation to give a general guideline for selecting the number of groups to use in the Hosmer-Lemeshow test:

$$g = \max\left(10,\ \min\left(\frac{m}{2},\ 2 + 8\left(\frac{n}{1000}\right)^2\right)\right) \qquad (7)$$

(equation 15 from [3]) where m = the number of subjects with an event.
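Equation (7) translates directly into code; a minimal sketch follows (whether to round the result to an integer is left to the analyst):

```python
def hl_groups(n, m):
    """Suggested number of Hosmer-Lemeshow groups per equation (7).
    n = sample size, m = number of subjects with an event."""
    return max(10, min(m / 2, 2 + 8 * (n / 1000) ** 2))

print(hl_groups(1000, 300))    # 10: the baseline of 10 groups
print(hl_groups(5000, 1000))   # 202.0: the quadratic term dominates
print(hl_groups(25000, 500))   # 250.0: capped at half the number of events
```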

Since large sample size is often cited as the reason for significant Hosmer-Lemeshow tests [4, 5, 6], this approach can help, but it may not be widely accepted. The fact that the number of groups can be changed makes some uneasy about using the Hosmer-Lemeshow test because conclusions can be sensitive to even minor changes in this choice [7, 8]. Instead of changing the number of groups or abandoning the test altogether, it may be helpful to employ the test in an alternative way that can be used on data sets with large sample sizes.

In the same study where they recommended standardizing power by changing the number of groups [3], Paul et al. proposed an interesting strategy where the Hosmer-Lemeshow test is repeatedly done using 10 groups on random subsets of the observations. Kramer and Zimmerman [2] offered the same recommendation and suggested subsets of 5000 observations. If significant Hosmer-Lemeshow tests are the result of excessive power from large samples and not a poor model, then it is worth exploring how the Hosmer-Lemeshow test would evaluate the same model applied to fewer observations. This procedure, which is easy to do and sure to reduce the power, is tempting to pursue and has been tried a couple of times.

In a study conducted at the Philip R. Lee Institute for Health Policy Studies at the University of California, San Francisco, logistic regression was used to predict mortality in intensive care units [9]. Goodness-of-fit was assessed on a validation set of 27,187 observations by performing the Hosmer-Lemeshow test (using 10 groups) on 11 random samples of 5000 observations. Nine of 11 (81.8%) were non-significant and the model was deemed adequately calibrated. Nine out of 11 tests seems like evidence of a good model, but it may have been more convincing to take more than 11 samples; though it should be noted that the authors also assessed goodness-of-fit using a graphical comparison of observed and expected numbers of events. No justification was given for why 11 samples were used.

This procedure was also performed in a study by Gomes et al. [10]. Logistic regression was used to build a predictive model for mortality using data from the hospital information system of the Brazilian National Health System (SUS). The validation set contained almost 150,000 observations, so to avoid the excessive power of the Hosmer-Lemeshow test, it was decided that the test would be done on random samples of 5000 observations. The study concluded adequate calibration, but only one p-value of 0.256 was given; it is not clear if this came from a single sample or an average across multiple samples. No other method was used to justify the conclusion of good fit for the model.

The subsampling procedure suggested by the earlier studies [2, 3] might provide a way to assess the fit of models built with large samples, but neither Kramer and Zimmerman nor Paul et al. provided sufficient guidelines for implementation. Their suggestions did not state how many subsets to draw, though they both offered guidelines on the size of the subsets. Paul et al. did include the caveat that conclusions would be subjective, but since this method has been used in several instances already, it would help to have a reference on which to base decisions. While potentially useful, it is important to investigate how this method performs and determine its capability as an alternative to current goodness-of-fit methods. This thesis presents a simulation study that was done to investigate the feasibility of assessing goodness-of-fit by performing the Hosmer-Lemeshow test on random subsets from a large data set.


Chapter 2: Simulation Study

2.1 Introduction

Simulated data sets were used to investigate the behavior of a subsampling strategy. The best-fitting logistic regression model may contain interactions and polynomial terms, but oftentimes it is preferable to exclude such terms in favor of a simpler model that is easier to interpret. Excluding these predictors can contribute to lack-of-fit. A variety of scenarios were tested to represent situations that may occur in logistic regression modeling, such as excluding interactions and nonlinear predictors. This simulation study will show how the subsampling approach works in these scenarios. Specifically, the number of subsets with significant tests, and how the number of observations in each subset affects these results, will be examined. A recommendation for implementation of this procedure will then be given.

2.2 Simulation Scenarios

Eight scenarios, shown in Table 1, were considered. Except for the first scenario, where the model fitted to the data was identical to the true model, the fitted model differed from the true model by omitting interactions or nonlinear terms. Observations in the data set included the covariates appearing in the true logits.

Table 1: Simulation Study Scenarios

Scenario   Fitted model relative to the true logit
1          Identical to the true model
2          Omits a binary predictor
3          Omits an interaction with a small coefficient
4          Omits an interaction with a large coefficient
5          Omits a quadratic term with a small coefficient
6          Omits a quadratic term with a large coefficient
7          Omits both a quadratic term (small coefficient) and an interaction (large coefficient)
8          Fits well except in the tails

Though not perfect, some of the fitted models that were specified may be good and would ideally not be rejected by the Hosmer-Lemeshow test. To help determine which models were acceptable, the consequences of mis-modeling were depicted graphically. This was done by simulating 50,000 observations using the true logit for each scenario in Table 1 and then fitting the corresponding incorrect model to the data. The event probabilities of the fitted and true models for each observation were graphed as a function of x so they could be compared visually in Figure 1. No plot is shown for Scenario 1 since the fitted model was the same as the true model. No plot is shown for Scenario 2 either; in the case where a binary predictor is omitted, the estimated probability is averaged between the two levels of the binary predictor.

Scenarios 1, 2, 3, and 5 are cases where the Hosmer-Lemeshow test would ideally be non-significant, regardless of sample size. The coefficient of the interaction omitted in Scenario 3 was small, which resulted in small differences between the true and fitted models (Scenario 3, Figure 1). When the fitted model omitted an interaction with a larger coefficient in Scenario 4, the estimated probabilities of the fitted model were noticeably different from the truth (Scenario 4, Figure 1). Similarly, between Scenarios 5 and 6, when the coefficient of the quadratic term increased, the difference between the true and fitted models became more apparent (Scenarios 5 and 6, Figure 1). Scenario 7 is a case where the fitted model omits both a quadratic term and an interaction; the coefficient is small on the quadratic term but large on the interaction. The result is an obvious deviation between the true and fitted models (Scenario 7, Figure 1). Unfortunately, lack-of-fit is not always so apparent. The fitted model in Scenario 8 was a good match except in the tails (Scenario 8, Figure 1). It may often be the case that the fit of a simpler model with an easier interpretation is borderline.


Figure 1: Estimated probability of true vs. fitted models as a function of x (dashed line = true model, solid line = fitted model)


2.3 Methods

For each scenario, data sets of 25,000, 50,000, 100,000, and 250,000 observations were simulated. Logistic regression was performed using the corresponding "Fitted model" in Table 1 and the model coefficients were obtained. Next, a subset of observations was taken (sampled with replacement) and the Hosmer-Lemeshow test was conducted using the coefficients from the model that was fit to the complete data. Ten degrees of freedom were used since coefficients were not recalculated on the subsets, and p-values < 0.05 were considered significant. One hundred subsets were taken and the number of subsets that had significant Hosmer-Lemeshow tests was counted. This entire process was repeated 100 times for every scenario and each of the sample sizes. Figure 2 helps illustrate the procedure. Since one suggestion for subset size was 5000 [2] and the other was 1000 [3], the procedure was tried with subset sizes of 1000, 2000, 5000, and 2% of the original sample.

Figure 2: Simulation procedure. For each of the 100 simulated data sets, 100 subsets are drawn and a Hosmer-Lemeshow test is performed on each subset. Repeated for all scenarios at all 4 sample sizes.
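The procedure for a single simulated data set can be sketched in Python as follows; it reuses the hosmer_lemeshow_test function from Chapter 1, and the true and working models shown here are illustrative stand-ins rather than the actual Table 1 scenarios:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
# reuses hosmer_lemeshow_test from Chapter 1

rng = np.random.default_rng(1)

def count_significant_subsets(n, subset_size=5000, n_subsets=100, alpha=0.05):
    # simulate one data set from an illustrative true logit with a quadratic term
    x = rng.normal(size=n)
    p = 1 / (1 + np.exp(-(-1 + x + 0.3 * x ** 2)))
    y = rng.binomial(1, p)
    # fit the working model (linear in x, so the quadratic term is omitted) to the full data
    model = LogisticRegression().fit(x.reshape(-1, 1), y)
    significant = 0
    for _ in range(n_subsets):
        idx = rng.choice(n, size=subset_size, replace=True)        # subset with replacement
        p_hat = model.predict_proba(x[idx].reshape(-1, 1))[:, 1]   # coefficients from full data
        _, pval = hosmer_lemeshow_test(y[idx], p_hat, g=10, df=10) # 10 df: no re-fitting
        significant += pval < alpha
    return significant

print(count_significant_subsets(n=100_000))   # number of significant subsets out of 100
```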


2.4 Results

Tables 2-4 show the effects of using the Hosmer-Lemeshow test on subsets of a large data set. Each subset size that was tried is presented in a separate table since this was a parameter of interest. Columns are not mutually exclusive; i.e., if a data set had more than 10 significant subsets it also had more than 5, and so it is represented in each column.

In general, the number of significant tests was smaller when a good model was fit (Scenarios 1, 2, 3, 5), though that number did start to get larger in Scenario 5 when a quadratic term with a small coefficient was omitted. In Scenario 6, when the coefficient on the quadratic term being omitted in the true model was larger, the number of significant tests increased noticeably. Scenario 7 was interesting because when the model omitted an interaction with a large coefficient, 100% of the subsets were significant for each subset size tried. The Hosmer-Lemeshow test's lack of power to detect interactions has been previously observed [3, 7, 11] and was seen to a degree here in Scenario 4: the coefficient on the interaction being omitted was big enough to appear important graphically (Figure 1), but lack-of-fit was non-significant for most subsets. This did improve when the subset size was 5000.

Comparing results from Tables 2-4, it can be seen that taking larger subsets generally resulted in more significant Hosmer-Lemeshow tests, which is not surprising. Regardless of how much the fitted model deviated from the true model, samples frequently had more than 5% of their subsets produce significant tests, though it was rarely the case that more than 50% of the subsets were significant. The exception is Scenario 7, where the estimated model was missing the interaction with a large coefficient and the model appeared to be highly mischaracterized.

The size of the sample that the subset was taken from had a varying effect on results. For example, in Table 3 under Scenario 4, when the original sample was 25,000, 70% of the data sets had more than 10 significant subsets; when the original sample was 50,000, only 55% of the data sets had more than 10 significant subsets. Results from the alternative 2% subset size approach are not shown, but those results also varied across sample sizes.


Table 2: Number of significant subsets when subset size = 1000 observations

Scenario   Original Size   > 5   > 10   > 15   > 20   > 50
1          25,000           56      7      0      0      0
           50,000           27      2      0      0      0
           100,000          47      0      0      0      0
           250,000          39      0      0      0      0
2          25,000           46      2      0      0      0
           50,000           47      2      0      0      0
           100,000          45      0      0      0      0
           250,000          37      2      0      0      0
3          25,000           54      2      0      0      0
           50,000           42      1      0      0      0
           100,000          39      1      0      0      0
           250,000          33      1      0      0      0
4          25,000           83     17      1      0      0
           50,000           72     17      1      0      0
           100,000          76     16      1      0      0
           250,000          72     14      0      0      0
5          25,000           74     11      1      0      0
           50,000           64     10      0      0      0
           100,000          60      4      0      0      0
           250,000          60      3      0      0      0
6          25,000           99     60     16      1      0
           50,000           98     54      7      0      0
           100,000          94     39      7      0      0
           250,000          95     44      5      0      0
7          25,000          100    100    100    100    100
           50,000          100    100    100    100    100
           100,000         100    100    100    100    100
           250,000         100    100    100    100    100
8          25,000           76     11      0      0      0
           50,000           67      5      0      0      0
           100,000          67      6      0      0      0
           250,000          53      1      0      0      0


Table 3: Number of significant subsets when subset size = 2000 observations

Scenario   Original Size   > 5   > 10   > 15   > 20   > 50
1          25,000           76      7      0      0      0
           50,000           51      4      0      0      0
           100,000          46      4      0      0      0
           250,000          37      1      0      0      0
2          25,000           73      9      0      0      0
           50,000           52      4      0      0      0
           100,000          50      1      0      0      0
           250,000          32      0      0      0      0
3          25,000           65     12      0      0      0
           50,000           46      4      1      0      0
           100,000          44      0      0      0      0
           250,000          42      2      0      0      0
4          25,000           97     70     34      3      0
           50,000           97     55     16      4      0
           100,000          94     46     10      2      0
           250,000          92     48      9      0      0
5          25,000           88     40      8      0      0
           50,000           86     25      5      0      0
           100,000          85     16      1      0      0
           250,000          80     12      1      0      0
6          25,000          100     98     81     56      0
           50,000          100     98     79     34      0
           100,000         100     95     68     29      0
           250,000         100     95     68     25      0
7          25,000          100    100    100    100    100
           50,000          100    100    100    100    100
           100,000         100    100    100    100    100
           250,000         100    100    100    100    100
8          25,000           89     48      8      1      0
           50,000           91     29      5      0      0
           100,000          87     22      2      0      0
           250,000          84     22      2      0      0


Table 4: Number of significant subsets when subset size = 5000 observations

Scenario   Original Size   > 5   > 10   > 15   > 20   > 50
1          25,000           89     43      9      4      0
           50,000           76      7      0      0      0
           100,000          60      1      0      0      0
           250,000          40      1      0      0      0
2          25,000           92     41      9      2      0
           50,000           76     23      2      0      0
           100,000          62      3      0      0      0
           250,000          45      1      0      0      0
3          25,000           89     41     13      3      0
           50,000           71     13      0      0      0
           100,000          50      6      1      0      0
           250,000          48      3      0      0      0
4          25,000          100     97     90     79      3
           50,000          100    100     94     75      0
           100,000         100    100     93     76      0
           250,000         100     99     95     67      0
5          25,000          100     91     63     36      1
           50,000           97     83     46     19      0
           100,000          97     72     34      8      0
           250,000          99     71     24      4      4
6          25,000          100    100    100     99     59
           50,000          100    100    100    100     46
           100,000         100    100    100    100     31
           250,000         100    100    100    100     21
7          25,000          100    100    100    100    100
           50,000          100    100    100    100    100
           100,000         100    100    100    100    100
           250,000         100    100    100    100    100
8          25,000          100     91     78     51      0
           50,000          100     91     65     35      0
           100,000          98     84     55     17      0
           250,000          98     89     38     11      0


Chapter 3: Discussion

The goal of subsampling is to show evidence of good fit for a model built with large data sets where differences between observed and predicted outcomes are small, but still be able to identify mis-modeling when the differences are larger. The decision as to how many significant tests indicates lack-of-fit is ultimately up to the researcher, but this simulation offers some guidelines. Even when a perfect model was fit, it was not unusual for a data set to have more than 5% of its subsets give significant tests. This may seem odd for a test with a 5% type I error rate, but the subsets were taken from the same sample and thus were not independent. Using 5% as the threshold is therefore too low, but a slightly higher number would work. Researchers should also not expect to have a large number of significant tests every time there is lack-of-fit, especially when considering interactions. Even for a poorly fit model, the majority of subsets may not be significant unless the model is very inaccurate (as in Scenario 7).

Not surprisingly, the number of significant tests depended on how many observations were in the subsets. Taking 1000 observations may not offer enough power to detect nonlinear or interaction terms. Table 2 shows that data sets simulated under Scenario 4 rarely had more than 10% of their subsets give significant tests even though the model was missing an interaction. Similarly, in Scenario 6, where a quadratic term was missing, the procedure did not show many instances where the data sets had more than 10% of their subsets give significant tests. Some increase in sensitivity was seen when the subset size increased to 2000 observations, and even more so when the size increased to 5000.

The fact that results differed across the original sample size is concerning, although this effect was not as drastic when only samples of 100,000 or more were considered. A possible explanation is that there is more similarity among subsets from the smaller populations: taking a sample of 2000 observations from a population of 25,000 is not the same as taking a sample of 2000 observations from a population of 250,000. To address this, equally proportioned subset sizes of 2% were tried, but this did not resolve the issue and results still differed across population sizes. Again, this is not surprising since subsets with 500 observations (2% of 25,000) have much less power than subsets with 5000 observations (2% of 250,000). This approach suffers greatly when the original sample is too small or too large. If, for example, the sample is 2 million, then a subset of even 2% would be 40,000 observations.

To give subsets sufficient power to be effective but ensure that the correlation between them is small, the subsampling strategy should only be considered when there are 100,000 or more observations. One recommendation is to take 100 subsets of 5000 observations, and if more than 10 of the subsets give significant Hosmer-Lemeshow tests, then suspect lack-of-fit and consider adjusting the model. Results from this study show that this approach can identify missing interactions and nonlinear terms when they have a large effect on the outcome (Scenarios 4 and 6). When omitting the interaction with the small effect (Scenario 3), the recommendation did not result in 10 significant subsets very often. The Hosmer-Lemeshow test appeared to be sensitive to omitting the quadratic term with the small coefficient (Scenario 5), and results show that these guidelines will result in a rejection more often than desired in a situation like this. Choosing a different threshold is an easy way around this: if the recommendation were that 15 significant results indicate lack-of-fit, then the rejection rate of Scenario 5 would be lower. The problem is that this alternative threshold decreased the chance of concluding lack-of-fit for Scenario 8. The threshold of 10 significant subsets is conservative in terms of concluding good fit, but a cautious approach seems justified when assessing goodness-of-fit.
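Applied to a fitted model in practice, the recommendation amounts to the following sketch, again reusing the hosmer_lemeshow_test function from Chapter 1; the function name and interface are illustrative:

```python
import numpy as np
# reuses hosmer_lemeshow_test from Chapter 1

def suspect_lack_of_fit(y, p_hat, n_subsets=100, subset_size=5000,
                        threshold=10, alpha=0.05, seed=0):
    """Recommended rule: draw 100 subsets of 5000 observations and suspect
    lack-of-fit if more than 10 yield significant Hosmer-Lemeshow tests.
    Intended for data sets with 100,000 or more observations."""
    rng = np.random.default_rng(seed)
    y, p_hat = np.asarray(y), np.asarray(p_hat)
    significant = 0
    for _ in range(n_subsets):
        idx = rng.choice(len(y), size=subset_size, replace=True)
        _, pval = hosmer_lemeshow_test(y[idx], p_hat[idx], g=10, df=10)
        significant += pval < alpha
    return significant > threshold, significant
```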

One limitation of this study is that only a few scenarios were tested. However, these models are common in applications and represent a common situation where the simplest models may be chosen to provide easy interpretations. Another limitation is that there is no recommendation for subsampling data sets with 50,000 observations or fewer, even though excessive power can still be an issue for models being tested on populations of this size. The method of changing the number of groups proposed by Paul et al. can provide a way to assess goodness-of-fit with samples up to 25,000. Although the recommendation requires more than 100,000 observations, studies are becoming increasingly capable of obtaining such samples. Feudtner et al. [12] built a logistic regression model to predict pediatric death using 678,861 observations. Kemper et al. [13] used 441,584 observations from children enrolled in Medicaid to study vision care. Improvements in data collection have made it possible to perform studies with large samples.

As stated earlier, some studies have already tried this technique but provided little justification for their methods or conclusions. Performing the test on smaller subsets can avoid the excessive power of large data sets, but blindly reducing the power should not be the objective of the approach. This simulation shows that simply taking a small number of subsets and seeing that a majority of them are not significant is not necessarily conclusive of adequate fit. Researchers must give careful consideration when deciding whether a significant Hosmer-Lemeshow test is due to sample size or lack-of-fit. As with other statistical procedures, conclusions will depend on what an individual considers significant. The recommendation of 10 significant subsets indicating lack-of-fit is a conservative choice, but it is based on at least some simulation results. This study showed that performing the Hosmer-Lemeshow test on random samples from a large data set can reduce power but still detect lack-of-fit in a variety of scenarios.


References

1. Hosmer DW, Lemeshow S. Goodness of fit tests for the multiple logistic regression model. Communications in Statistics - Theory and Methods. 1980; 9(10):1043-1069. DOI: 10.1080/03610928008827941.

2. Kramer AA, Zimmerman JE. Assessing the calibration of mortality benchmarks in critical care: The Hosmer-Lemeshow test revisited. Critical Care Medicine. Sept 2007; 35(9):2052-2056. DOI: 10.1097/01.CCM.0000275267.64078.B0.

3. Paul P, Pennell ML, Lemeshow S. Standardizing the power of the Hosmer-Lemeshow goodness of fit test in large data sets. Statistics in Medicine. Jan 2013; 32:67-80. DOI: 10.1002/sim.5525.

4. Seymour CW, Kahn JM, Cooke CR, Watkins TR, Heckbert SR, Rea TD. Prediction of Critical Illness During Out-of-Hospital Emergency Care. JAMA. Aug 2010; 304(7):747-754. DOI: 10.1001/jama.2010.1140.

5. Glance LG, Dick AW, Osler TM, Mukamel DB, Li Y, Stone PW. The association between nurse staffing and hospital outcomes in injured patients. BMC Health Services Research. Aug 2012; 12:247. DOI: 10.1186/1472-6963-12-247.

6. Afessa B, Gajic O, Morales IJ, Keegan MT, Peters SG, Hubmayr RD. Association Between ICU Admission During Morning Rounds and Mortality. Chest. Dec 2009; 136(6):1489-95. DOI: 10.1378/chest.09-0529.

7. Allison PD. Logistic Regression Using SAS: Theory and Application. Cary, NC: SAS Institute, 2012.

8. Pigeon JG, Heyse JF. A cautionary note about assessing the fit of logistic regression models. Journal of Applied Statistics. 1999; 26(7):847-853. DOI: 10.1080/02664769922089.

9. ICU Outcomes (Mortality and Length of Stay) Methods, Data Collection Tool and Data. Philip R. Lee Institute for Health Policy Studies. Retrieved June 10, 2013, from http://healthpolicy.ucsf.edu/content/icu-outcomes


10. Gomes AS, Kluck MM, Riboldi J, Fachel JM. Mortality prediction model using data from the Hospital Information System. Revista de Saude Publica. Oct 2010; 44(5):934-41. DOI: 10.1590/S0034-89102010005000037.

11. Hosmer DW, Hosmer T, le Cessie S, Lemeshow S. A Comparison of Goodness-of-Fit Tests for the Logistic Regression Model. Statistics in Medicine. May 1997; 16(9):965-80. DOI: 10.1002/(SICI)1097-0258(19970515)16:9<965::AID-SIM509>3.0.CO;2-O.

12. Feudtner C, Hexem KR, Shabbout M, Feinstein JA, Sochalski J, Silber JH. Prediction of Pediatric Death in the Year after Hospitalization: A Population-Level Retrospective Cohort Study. Journal of Palliative Medicine. Feb 2009; 12(2):160-169. DOI: 10.1089/jpm.2008.0206.

13. Kemper AR, Cohn LM, Dombkowski KJ. Patterns of Vision Care Among Medicaid-Enrolled Children. Pediatrics. 2004; 113(3):e190-e196.
