
On Assessing Binary Regression Models Based on Ungrouped Data

Chunling Lu

Division of Global Health, Brigham and Women’s Hospital

& Department of Global Health and Social Medicine

Harvard University, Boston, U.S.

email: chunling [email protected]

and

Yuhong Yang

School of Statistics, University of Minnesota, Minnesota, U.S.

email: [email protected]

Summary: Assessing a binary regression model based on ungrouped data is a commonly encountered but very challenging problem. Although tests, such as the Hosmer-Lemeshow test and the le Cessie-van Houwelingen test, have been devised and widely used in applications, they often have low power in detecting lack of fit, and not much theoretical justification has been given on when they can work well. In this paper, we propose a new approach based on a cross validation voting system to address the problem. In addition to a theoretical guarantee that the probabilities of type I and type II errors both converge to zero as the sample size increases for the new method under proper conditions, our simulation results demonstrate that it performs very well.

Key words: Goodness of fit; Hosmer-Lemeshow test; Model assessment; Model selection diagnostics.


1. Introduction

1.1 Motivation

Parametric binary regression is one of the most widely used statistical tools in real applications. A central component in parametric regression is assessment of a candidate model before accepting it as a satisfactory description of the data. In that regard, goodness of fit tests are needed to reveal significant lack-of-fit of the model to assess (MTA), if any. When the observations are grouped (i.e., multiple independent cases at the same set of predictor values) naturally, a goodness of fit test (GOFT) can be done based on the deviance of the model or the Pearson chi-square statistic. Both methods are asymptotically valid and thus are theoretically justified. See, e.g., Agresti (2012); Hosmer et al (2013); Harrell (2015) for more details and references.

When the binary regression data are ungrouped, the deviance or chi-square based tests are no longer proper. Hosmer and Lemeshow (1980) proposed a method that divides the observations into a number of groups (usually 10) based on their estimated success probabilities from the fitted model and then uses the Pearson chi-square statistic to decide if the observed counts are too different from what the model predicts. Numerical results show that the Hosmer-Lemeshow (HL) statistic approximately has a chi-square distribution under the null hypothesis in various settings. The HL test has been routinely used in logistic regression analysis and is commonly included/taught in textbooks/courses on the subject (e.g., Agresti, 2012; Hosmer et al, 2013). In the particular application of calibration of mortality benchmarks in critical care, for instance, the HL test is the most widely used method (Serrano, 2012). Issues of possible low power of the HL test and its sensitivity to data grouping have been raised (Hosmer et al, 1997). See Hosmer et al (Chapter 5, 2013) for later versions of the HL test and new developments. For example, le Cessie and van Houwelingen (1991) proposed an alternative to the HL test based on smoothing the residuals. In the special case of GOFT of a logistic regression model, Bondell (2007) advocated the approach of examining a discrepancy between two competing kernel density estimators of the underlying conditional distributions of the covariates given case-control status. The asymptotic null distribution of the test statistic was well identified. Since kernel density estimators are used in the construction of the test statistics, a potential challenge of this approach is dealing with a large number of covariates.

Bootstrap based methods have been studied for GOFT. Yin and Ma (2013) proposed a Pearson type of GOFT based on bootstrap maximum likelihood estimation for generalized linear modeling. See the references in Yin and Ma (2013) and Vexler et al (2016) for earlier bootstrap and Bayes factor type methods for related GOFT problems.

The present work is motivated by a medical cost reduction data set, where the interest lies in finding which factors may have played important roles in whether an individual had a medical cost reduction of at least 25% after a treatment. The model recommended by BIC (Schwarz, 1978) is quite simple and interpretable. Residual analysis reveals no major concerns. The HL test and the le Cessie-van Houwelingen (CH) test give relatively large p-values (0.43 and 0.13, respectively), which is typically interpreted to mean that the model fits the data properly. To have more confidence in this model, we considered the nonparametric methods of random forest and bagging (Breiman, 1996, 2001). It turned out that their importance measures of the covariates are not consistent with the BIC model: two covariates outside of the model are the most important from the view of the random forest and bagging methods.

Which approach should we believe? This prompted our research on model assessment in the binary regression context. We show in this paper that by suitably comparing the model in question with a nonparametric method, we can have satisfactory performance in assessing the goodness of fit of the model. In particular, unlike the other methods mentioned earlier, our approach may employ statistical or machine learning tools that can deal with multi- or even high-dimensional classification as the aforementioned nonparametric method.

1.2 Formulation of the problem

Consider the binary regression setting:

$$P(Y_i = 1 \mid X_i) = f(X_i), \quad 1 \le i \le n,$$

where $(X_i, Y_i)_{i=1}^{n}$ are iid observations with $X_i$ taking values in a $d$-dimensional Borel set $\mathcal{X} \subset \mathbb{R}^d$ for some $d \ge 1$ and $f$ is the true conditional probability function of the Bernoulli random variable $Y_i$ given the covariate value. The distribution of $X_i$ is unknown.

For conditional probability functions $f$ and $g$, with $P_X$ being the distribution of $X_1$, let
$$L(f, g) = \int_{\mathcal{X}} \left\{ f(x) \log \frac{f(x)}{g(x)} + (1 - f(x)) \log \frac{1 - f(x)}{1 - g(x)} \right\} P_X(dx)$$
be the Kullback-Leibler divergence. This divergence is naturally considered for testing problems (e.g., Veldkamp and van der Linden, 2002) and, when used as a loss function, is closely related to the square loss $\int (f(x) - g(x))^2 P_X(dx)$ under mild regularity conditions. Note that $L(f, g) = 0$ if and only if $f = g$ almost surely under $P_X$.

Suppose a parametric model is being considered to model f:

$$f_\theta(x), \quad \theta \in \Theta,$$
where the parameter space $\Theta$ is finite-dimensional. Our goal is to conduct a goodness of fit test, i.e.,
$$H_0: f(x) = f_{\theta^*}(x) \ \text{for all } x, \text{ for some unknown } \theta^* \in \Theta, \quad \text{versus} \quad H_1: \inf_{\theta \in \Theta} L(f, f_\theta) > 0. \tag{1}$$

Note that $H_0$ is equivalent to $L(f, f_{\theta^*}) = 0$ for some (again, unknown) $\theta^* \in \Theta$.

For simplicity, we focus on the logistic regression models in our numerical work. The results are expected to be similar when other link functions (e.g., probit or complementary log-log link) are used. We also point out that for binary regression data with few replications (i.e., the group sizes are rather small), the issues identified in this paper also exist, to a lesser extent. Due to space limitation, this matter is not explored.

The rest of the paper is organized as follows. In Section 2, we propose our method DRY-V, with a theoretical justification. Its performance is examined via simulations in Section 3.

The application of our approach is illustrated in Section 4 on a real data set. Concluding remarks are given in Section 5. The assumptions, proof of our main theorem in Section 2, and additional analyses of the real data are in Web Appendices A, B, and C, respectively.

2. Assessing lack of fit via a comparison with nonparametric estimates based on cross validation voting

Cross validation (CV) is a widely used tool for estimating prediction accuracy, for selection of a tuning parameter of a statistical procedure, or for identifying the best candidate method (Stone, 1974; Geisser, 1975; see Arlot and Celisse, 2010, for a comprehensive review). Consistency of CV for comparing models/procedures has been investigated by Shao (1993) for linear models and by Yang (2006, 2007) and Zhang and Yang (2015) for classification and general regression with continuous response.

In this section, we propose the use of CV with voting to address the goodness of fit of a binary regression model. Specifically, the model in question is compared to a proper nonparametric method in terms of the estimation of the conditional probability function f(x) and a decision is made accordingly on the goodness of fit of the parametric model.

The basic idea is that with a suitable data splitting ratio for CV, when the MTA holds, the parametric estimator is more efficient (converging at the familiar parametric rate of $n^{-1/2}$ instead of a slower nonparametric rate), leading to better performance in the CV competition; in contrast, if the MTA is inadequate, then due to its non-convergence its predictive likelihood in the CV comparison is most likely much worse than that based on the more flexible nonparametric estimation, resulting in the rejection of the MTA. Various nonparametric methods can be used in the aforementioned methodology, including random forests (as is used in our numerical work), local likelihood based nonparametric estimation and series expansion based methods.

When nonparametric function estimation is the goal, general likelihood based CV methods have been successfully developed for asymptotically optimal performance in terms of estimation accuracy (van der Laan, Dudoit and Keles, 2004). In the context of binary regression, the general approach of the local maximum likelihood method can be used for efficient nonparametric estimation of the conditional probability function f(x) via optimal tuning parameter selection (Fan, Farmen and Gijbels, 1998).

2.1 The DRY-V method

We split the data into two parts of size $n_1$ and $n_2 = n - n_1$: the estimation part is $Z^1 = (X_i, Y_i)_{i=1}^{n_1}$ and the testing data is $Z^2 = (X_i, Y_i)_{i=n_1+1}^{n}$. Fitting the parametric model on $Z^1$, we obtain the estimator $\hat{f}_{n_1,1}(x) = f_{\hat{\theta}_{n_1}}(x)$. Similarly, applying the nonparametric estimator, we get $\hat{f}_{n_1,2}(x)$. Then we compute the predictive log-likelihood on $Z^2$:
$$CV(\hat{f}_{n_1,j}) = \sum_{i=n_1+1}^{n} \left\{ Y_i \log \hat{f}_{n_1,j}(X_i) + (1 - Y_i) \log (1 - \hat{f}_{n_1,j}(X_i)) \right\}, \quad j = 1, 2. \tag{2}$$

If $CV(\hat{f}_{n_1,1}) < CV(\hat{f}_{n_1,2})$, then the nonparametric estimator is preferred over the parametric estimator.

Let $\pi$ denote a permutation of the orders of the observations. Let $CV_\pi(\hat{f}_{n_1,j})$ be the criterion value as defined in (2) except that the data splitting is done after the permutation $\pi$. If $CV_\pi(\hat{f}_{n_1,1}) \ge CV_\pi(\hat{f}_{n_1,2})$, then let $\tau_\pi = 1$ and otherwise let $\tau_\pi = 0$.

Let $\Pi$ denote a subset of all $n!$ permutations of the observations with size $|\Pi|$. The set $\Pi$ may be obtained based on a fixed number (e.g., 100) of random permutations, as is done in our numerical work. Or one may arrange a well-designed set of non-random permutations.

If $\sum_{\pi \in \Pi} \tau_\pi \ge |\Pi|/2$, then we accept the parametric model ($H_0$) and otherwise reject it. We call this specific approach of comparing a parametric model with a nonparametric estimator for the main purpose of evaluating the parametric model divide, re-fit/re-evaluate, and say yes or yield with voting, abbreviated as DRY-V or simply DRYV. Note that for a traditional cross validation, the CV values from different data splittings are usually averaged before comparing the candidate methods.
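To make the procedure concrete, the following is a minimal R sketch of one DRY-V run. The data frame name, the variable names y and x1-x6, the 50:50 splitting ratio, the 100 random permutations, and the use of random forest as the nonparametric competitor are assumptions for illustration, not requirements of the method.

```r
## Minimal sketch of DRY-V on an assumed data frame `dat` with 0/1 response y
## and covariates x1, ..., x6; random forest plays the nonparametric role.
library(randomForest)

# Predictive log-likelihood (2) on the held-out part; probabilities are
# truncated away from 0 and 1 for numerical stability.
pred_loglik <- function(p, y, eps = 1e-6) {
  p <- pmin(pmax(p, eps), 1 - eps)
  sum(y * log(p) + (1 - y) * log(1 - p))
}

dryv <- function(dat, formula_mta, n_perm = 100, train_frac = 0.5) {
  n  <- nrow(dat)
  n1 <- floor(train_frac * n)
  tau <- replicate(n_perm, {
    idx   <- sample(n)                              # one random permutation
    train <- dat[idx[1:n1], ]
    test  <- dat[idx[(n1 + 1):n], ]
    fit_mta <- glm(formula_mta, family = binomial, data = train)            # parametric MTA
    fit_np  <- randomForest(y ~ ., data = transform(train, y = factor(y)))  # nonparametric
    cv_mta <- pred_loglik(predict(fit_mta, test, type = "response"), test$y)
    cv_np  <- pred_loglik(predict(fit_np,  test, type = "prob")[, "1"], test$y)
    cv_mta >= cv_np                                 # tau_pi = 1 if the MTA is not worse
  })
  list(win_fraction_mta = mean(tau),                # accept H0 if this is >= 0.5
       accept_H0 = mean(tau) >= 0.5)
}

# Example call on assumed data: dryv(dat, y ~ x1 + x2 + x3 + x6)
```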

2.2 Property of DRYV

Theorem 1: Under Assumptions 1.1-1.3 in Web Appendix A, if the data splitting ratio satisfies $b \le n_2/n_1 \le B$ for some constants $0 < b < B < \infty$, then as $n \to \infty$, the DRY-V method for the testing problem (1) has both probabilities of type I and type II errors approaching 0.

Remark 1: The assumptions are natural. See Web Appendix A for discussions and references for parametric and nonparametric rates of convergence for estimating f.

Remark 2: The result shows that with n large, DRYV can tell apart the null and alternative hypotheses with probability going to 1. Thus, the DRYV method, used as a goodness of fit test, asymptotically makes the right decision. The numerical results in the next section will examine its finite-sample performance.

It should be pointed out that the DRYV approach is certainly not without any weakness.

First, since DRYV depends on random data splittings, it handles only iid observations for the current theoretical arguments to succeed. Extensions seem possible to relax this restrictive assumption, but additional technical developments (e.g., concentration inequalities for cluster based data splittings or stratified sampling) are needed to deal with longitudinal data or cases where the covariates may not be random or have changing distributions. Second, clearly, as is the case for any cross-validation based method, due to data division, the performance evaluated on the test part of the data pertains to the behaviors of the relevant estimators at the reduced sample size, which may not always be representative of those at the full sample size. Since the nonparametric method has larger complexity, if the parametric model is already rejected by DRYV (with estimation at the training size of n1 < n), at the full sample size n, the parametric model would still not be favored. In the reverse direction, in contrast, there may be the possibility that the relative performance of the MTA and the nonparametric alternative switches direction between the reduced and full sample sizes. If so, the DRYV method induces a mismatch (due to the sample size reduction from n to n1 in estimating f). Fortunately, in such a situation, since the relative performance of the two methods is in a transition stage, the DRYV result will typically reflect that by having a winning fraction for the MTA not too much above 50%, hinting that the parametric model is not very strong. One can investigate this matter further by having a higher data splitting ratio in favor of training. If indeed the nonparametric alternative gains significant ground, we have more confidence that the data is rich enough to begin to show the lack of fit of the parametric model. If the relative performance is not much changed, a plausible interpretation is that the data is rich enough to show that there is a (not strong) lack of fit in the parametric model but not rich enough to support the complicated nonparametric approach. The transition to the extent that the nonparametric method performs convincingly better may take many more observations than the current sample size.

Intuitively, the winning fraction of the nonparametric estimator may be loosely interpreted as the strength of evidence against the parametric model. In this sense, it is somewhat similar to the usual p-value.

3. Simulation results

All the numerical results in this work were obtained in R (unless stated otherwise). Two packages, namely, rms by Harrell (2015) and randomForest by Breiman, Cutler, Liaw and Wiener, were used to compute the CH test and random forest/bagging, respectively. To avoid presenting results with cherry-picking tuning to favor one's own method, the default values in these packages were used.

We compare the HL and CH tests with the DRY-V methods based on random forest or bagging. For the HL and CH tests, we reject the model being checked if the p-value is smaller than or equal to 0.05.
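For concreteness, the following minimal R sketch shows one common way to obtain the two p-values on an assumed data frame dat with binary response y. The rms package is the one used in this work for the CH test; the use of the ResourceSelection package and hoslem.test for the HL statistic, and the formula shown, are our illustrative assumptions rather than a description of the exact code used here.

```r
## Hedged sketch: HL and CH p-values for one candidate model on assumed data `dat`.
library(rms)                  # lrm() and the le Cessie-van Houwelingen test
library(ResourceSelection)    # hoslem.test() for the Hosmer-Lemeshow statistic (our choice)

fit_glm <- glm(y ~ x1 + x2 + x3 + x6, family = binomial, data = dat)
hl <- hoslem.test(fit_glm$y, fitted(fit_glm), g = 10)   # conventional 10 groups
hl$p.value                                              # reject the MTA if <= 0.05

fit_lrm <- lrm(y ~ x1 + x2 + x3 + x6, data = dat, x = TRUE, y = TRUE)
ch <- resid(fit_lrm, type = "gof")                      # CH (unweighted sum of squares) test
ch["P"]                                                 # reject the MTA if <= 0.05
```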

In all the simulations in this section, the responses are generated based on the model with success probability of the form
$$P(Y = 1 \mid X) = \frac{\exp(\eta)}{1 + \exp(\eta)},$$
where the linear component is $\eta = \beta_0 + \beta_1 x_1 + \ldots + \beta_6 x_6$ with possible addition of a quadratic or interaction term in some situations. In this context, clearly, we only need to specify the terms in $\eta$ to determine a model.

The 6 covariates are generated as follows. The first 4 covariates are independent standard normal and the last two are independent of the first 4 and have a bivariate normal distribution with covariance Cov(X5, X6) = 0.5.

The random forest and bagging are based on all the 6 covariates for the results in this section. It is worth pointing out that if the MTA correctly involves only a subset of the 6 covariates and the random forest and bagging are built using this subset of the covariates, then they provide even higher power to detect any lack of fit than when all the 6 covariates are used.

3.1 Missing covariates

Here we study the behaviors of the methods when some independent covariates are wrongly dropped. Besides the HL and CH tests, the method proposed by Bondell (2007) is also included.

3.1.1 Data generation. Two models are used to generate data. The larger and smaller models respectively have the linear expressions
$$\eta = 0.5 + 4x_1 + 2x_2 + x_3 - 5x_4 - x_5 + 3x_6, \tag{3}$$
$$\eta = 0.5 + 4x_1 + 2x_2 + 3x_6. \tag{4}$$
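As an illustration, the data-generating step for the larger model (3) may be coded as follows; the function name gen_data, the use of MASS::mvrnorm, and unit variances for (x5, x6) are our assumptions, since only the covariance 0.5 is specified above.

```r
## Minimal sketch of generating one data set from model (3).
library(MASS)

gen_data <- function(n) {
  X14 <- matrix(rnorm(n * 4), n, 4)                            # x1-x4: iid N(0, 1)
  X56 <- mvrnorm(n, mu = c(0, 0),
                 Sigma = matrix(c(1, 0.5, 0.5, 1), 2, 2))      # Cov(x5, x6) = 0.5 (unit variances assumed)
  X <- data.frame(cbind(X14, X56))
  names(X) <- paste0("x", 1:6)
  eta <- with(X, 0.5 + 4*x1 + 2*x2 + x3 - 5*x4 - x5 + 3*x6)    # linear component (3)
  X$y <- rbinom(n, 1, plogis(eta))                             # P(Y = 1 | X) = exp(eta)/(1 + exp(eta))
  X
}

dat <- gen_data(500)    # e.g., one simulated data set of size n = 500
```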

3.1.2 Three GOF assessment scenarios. For Scenario I, intended to examine the type I error probability, the MTA is y ∼ x1 + x2 + x3 + x4 + x5 + x6 based on data from (3). For Scenario II, intended to examine the type II error probability, the data generating model is the same, but the MTA is y ∼ x1 + x2 + x3 + x6. For Scenario III, intended to examine the type I error probability when the true model uses only a subset of the available covariates, the data are generated from (4) and the model y ∼ x1 + x2 + x6 is assessed. The nonparametric estimators for DRYV, namely random forest (RF) and bagging (BAG), consider all the 6 covariates.

3.1.3 Results. The outputs in Table 1 are based on 100 replications (simulation runs).

[Table 1 about here.]

We have the following observations.

(1) For the probability of type I error, at the simulation settings, the DRYV methods have no difficulty at all in rejecting the nonparametric estimators. For the HL, CH and Bondell tests, the error probabilities are mostly acceptable (compared to the nominal 5% level).

(2) For the probability of type II error, in Scenario II, the DRYV methods have no trouble knowing that the MTA is inadequate. When x4 and x5 are ignored, the HL, CH and Bondell tests, however, have error probabilities around 95%. Even at n = 5000, the HL test does not improve, but the CH test performs much better (around 31%).

From the results, the approach of using the HL or CH tests on a model based on a subset of the available covariates may not be able to identify lack of fit due to missing (available) covariates. The DRYV approach very successfully addresses the problem, while improving the probability of type I error as well.

3.2 Missing quadratic or interaction term

In this subsection, we investigate the capabilities of the different methods in detecting lack of fit due to missing a quadratic/interaction term. A contrast between the case of a severely wrong model (the model being too small) and the case with just the quadratic/interaction term missing is made. We also demonstrate that when the missing quadratic term is corre- lated with some term in the model, in comparison with the non-correlated case, the lack of

fit can be detected much more easily by all the methods with the larger (less wrong) model, but not so for the HL/CH tests with the smaller (severely wrong) model.

3.2.1 Three scenarios. In the scenario Missing Quadratic I, the data are generated by
$$\eta = 0.5 + 4x_1 + 2x_2 + 3x_3 + x_6 + 4x_1^2. \tag{5}$$
The MTA is 1) y ∼ x1 + x2 + x3 + x6, labeled Larger; 2) y ∼ x6, labeled Smaller. For the former case, clearly there is no missing covariate in the MTA.

In the scenario Missing Quadratic II, the only changes from the above are i) $X_1$ is such that $X_1 + 4$ has a $\chi^2$ distribution with 4 degrees of freedom and is independent of the other covariates; ii) $\eta = 0.5 + 4x_1 + 2x_2 + 3x_3 + x_6 - x_1^2$.

In the scenario Missing Interaction, the data are generated with
$$\eta = 0.5 + 4x_1 + 2x_2 + 3x_3 + x_6 + 4x_1 x_2. \tag{6}$$
The other aspects remain the same as in Missing Quadratic I.
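To make the covariate change in Missing Quadratic II concrete, one way to generate such an X1 (our coding choice) and to see why the missing quadratic term becomes correlated with X1 is:

```r
## X1 + 4 ~ chi-square with 4 df, so X1 = rchisq(n, 4) - 4 has mean zero but is skewed.
n  <- 300
x1 <- rchisq(n, df = 4) - 4
cor(x1, x1^2)   # clearly nonzero here, unlike the standard normal X1 in Missing Quadratic I
```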

3.2.2 Results. Figure 1 presents the simulation outcomes.

[Figure 1 about here.]

(1) When the MTA misses only the quadratic or interaction term (the larger model case), all the methods perform better as the sample size increases. When the model has only one term that is independent of the other terms in the true model (the smaller model case), both HL and CH have not shown power to detect the deficiency of the model even at n = 500 (the situation is unchanged with 1000 observations). We see that a worse model may look acceptable by the HL and CH tests while the better but still wrong model is rejected more frequently.

(2) Comparing the two missing quadratic term scenarios with the larger incorrect MTA, we see that the lack of fit in the second scenario is much more likely to be found out by the HL test (also for CH, but to a lesser extent). Note that in the first scenario, the missing predictor is uncorrelated with all the terms in the model, but for the second scenario with the $\chi^2$ distribution for $X_1$, the missing term $X_1^2$ is correlated with $X_1$. This supports the point that when the missing term is uncorrelated with the terms in the MTA and there are no interaction effects between the term and those in the MTA, the HL and CH tests have little power to find deficiency of the MTA.

(3) For the DRYV methods, the power of detecting lack of fit seems strong. For the smaller model scenario, they reject the model basically all the time even at n = 100. For the larger model case, the defect of missing only the quadratic/interaction term is harder to detect, and the DRYV methods have some trouble when n is small, except in the case that the missing term $X_1^2$ is correlated with the variable itself ($X_1$, included in the model).

3.3 Randomly generated models

Regarding the simulation results above, since the coefficients in the true models are some specific choices, one naturally wonders how representative the outcomes are. Below we show the relative performances of the methods with randomly generated coefficients.

3.3.1 Model generation. The true logit model has the form
$$\eta = 0.5 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_6 x_6 + \gamma x_1^2. \tag{7}$$
In 100 replications, each time, all the 5 coefficients are independently generated from Unif[0, 5]. Then 5 independent data sets at the sample size 200 are generated.
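A minimal sketch of one replication, with the covariate setup of Section 3.1 and our own variable names (unit variances for x5 and x6 assumed), is:

```r
## One replication in Section 3.3: draw the five coefficients from Unif[0, 5],
## then generate 5 independent data sets of size 200 from model (7).
library(MASS)

sim_x <- function(n) {
  X <- cbind(matrix(rnorm(n * 4), n, 4),
             mvrnorm(n, c(0, 0), matrix(c(1, 0.5, 0.5, 1), 2, 2)))
  colnames(X) <- paste0("x", 1:6)
  as.data.frame(X)
}

beta <- runif(5, 0, 5)            # (beta1, beta2, beta3, beta6, gamma), iid Unif[0, 5]
datasets <- replicate(5, {
  X   <- sim_x(200)
  eta <- with(X, 0.5 + beta[1]*x1 + beta[2]*x2 + beta[3]*x3 +
                 beta[4]*x6 + beta[5]*x1^2)   # model (7); set beta[5] to 0 for "No Lack-of-Fit"
  cbind(y = rbinom(200, 1, plogis(eta)), X)
}, simplify = FALSE)
```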

3.3.2 Models to assess. The first case is assessing the model y ∼ x6 (denoted "Smaller" in Table 2) and the second case is on the larger model y ∼ x1 + x2 + x3 + x6 (denoted "Larger" in Table 2). This model uses the right set of covariates but without the quadratic term. To investigate type I error behaviors, for the third case, we remove the quadratic component by setting γ to be 0 in equation (7) (denoted "No Lack-of-Fit" in Table 2). The other aspects are unchanged. The random forest and bagging methods are based on all the 6 covariates (if the irrelevant covariates X4 and X5 are dropped, the DRYV methods perform even better).

3.3.3 Results. The percentages of rejecting the model under consideration by the two tests (i.e., when the p-value is no bigger than 0.05) or by the DRYV methods (i.e., choosing the nonparametric estimators) are recorded. The averages over the 100 replications are then reported in Table 2. The numbers in the parentheses are the corresponding standard errors.

[Table 2 about here.]

The results show that both HL and CH tests have little power to detect the lack of fit for both wrong models, even if the MTA has included all the needed covariates (the “Larger” case). For the DRYV methods, for the worse model (smaller model), they have no difficulty determining that the nonparametric estimators are way better. For the larger (but still wrong) model, they are much better than the two tests but still have difficulty in rejecting the model. The reason is that with γ randomly chosen, it can be small and the single missing term is much less detectable.

For type I error, the DRYV methods performed perfectly, while CH has error probability around 5%. For HL, unfortunately, the test fails to run due to non-unique grouping issues (hence no HL entry for this case in Table 2).

4. Application on low birth weight data

A real data set, low birth weight data from Hosmer et al (2013), is used to demonstrate the utility of the proposed DRYV method for parametric fitness assessment. The data contain 189 observations, among which 130 babies had normal birth weight (59 low weight). The response indicates whether a baby was underweight or not and there are 8 covariates for explaining the response. Logistic regression modeling with these covariates is considered. A subset of the covariates is to be chosen to understand the risk factors for low birth weight.

For illustration, we consider two models, selected by AIC (Akaike, 1973) and BIC (Schwarz, 1978) via backward deletion, respectively. BIC selects the model with three covariates (besides the intercept): weight of the mother at the last menstrual period, history of hypertension (binary), and history of premature labor (binary). AIC selects 3 more covariates: smoking status during pregnancy, race (three levels), and presence of uterine irritability (two levels). The two models are quite different. For illustration, we also add a poor model that includes only the age of the mother (plus the intercept) into consideration.

The HL tests on the age-only, BIC and AIC models yield p-values 0.317, 0.409 and 0.424, respectively (note that Stata gives 0.317, 0.788 and 0.324, respectively; the tests in R and Stata were both done with the conventional use of 10 groups). Since these values are quite large, the tests do not reveal any significant lack of fit of the three models.

We next show that our DRYV approach provides a different conclusion on GOF of the age-only model. With 300 data splittings at 94:95 and 170:19 ratios of training versus evaluation, for a given pair of models/methods, the number of times one gives a higher predictive likelihood than the other is recorded, and the percentages can be presented in a table. We call such a table a DRYV pairwise comparison table. Here, in Tables 3-4, the (i, j)-th entry records the percentage of times among the random data splittings that the i-th model/method has a strictly larger predictive likelihood than the j-th model/method. Clearly the (i, j)-th entry plus the (j, i)-th entry may not equal 100% for j ≠ i if there are practical ties.
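A minimal sketch of how such a table can be assembled is given below; the helper name pairwise_table and the fitter interface (each candidate is a function returning held-out predicted probabilities of y = 1) are our illustrative assumptions.

```r
## Sketch of building a DRYV pairwise comparison table (cf. Tables 3-4).
pairwise_table <- function(dat, fitters, n_train, n_splits = 300) {
  n <- nrow(dat); k <- length(fitters)
  wins <- matrix(0, k, k, dimnames = list(names(fitters), names(fitters)))
  for (s in seq_len(n_splits)) {
    idx   <- sample(n)
    train <- dat[idx[1:n_train], ]
    test  <- dat[idx[(n_train + 1):n], ]
    ll <- sapply(fitters, function(f) {            # predictive log-likelihood per candidate
      p <- pmin(pmax(f(train, test), 1e-6), 1 - 1e-6)
      sum(test$y * log(p) + (1 - test$y) * log(1 - p))
    })
    wins <- wins + outer(ll, ll, ">")              # (i, j): does candidate i strictly beat j?
  }
  100 * wins / n_splits                            # percentages, as reported in Tables 3-4
}

# Example fitter for the age-only model (column name `age` assumed):
# fitters <- list(age_only = function(tr, te)
#   predict(glm(y ~ age, family = binomial, data = tr), te, type = "response"))
# pairwise_table(dat, fitters, n_train = 170)      # 90% training, as in Table 4
```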

[Table 3 about here.]

[Table 4 about here.]

From Tables 3-4, both the AIC and BIC models are clearly preferred to the nonparametric alternatives, justifying their goodness of fit. In contrast, with the higher percentage of training, the age-only model is less favored than random forest, signaling a lack of fit.

Besides the main conclusion above, the other entries in the tables (together with another one at 75% training, shown in Web Appendix C) add support to the AIC and BIC models.

This gives a strong sense that the parametric modeling is sufficient for the data. The choice between the AIC and BIC models, which is not a GOF issue, may need additional aspects

(e.g., practical considerations) to settle. More analyses are in Web Appendix C.

5. Concluding remarks

Goodness of fit refers to the soundness of a model for the available data. Hosmer et al (2013, p. 154) state: "In assessing goodness of fit we compare the fitted model to the largest possible model, the saturated model." With this objective, methods that depend only on the fitted values from the MTA and the actual response values simply cannot be generally successful because the comparison of the response and fitted values, however it is done, ignores the regression information in the data that goes beyond the covariates used in the model. Unfortunately, such methods are routinely used in applications where model choices may have real world consequences.

The newly proposed DRYV approach is theoretically justified in terms of both types of error probabilities. In contrast, as stated already, the earlier tests for the GOF problem with theoretical investigations either lack an understanding in terms of power of the tests or have difficulties dealing with multi-dimensional covariates. These shortcomings suggest limitations of their scope of applicability and possible low power of detecting lack of fit in applications.

Indeed, based on the simulations, our method compares favorably with the previous HL, CH and the other tests mentioned earlier in both the type I and type II errors.

This work continues the spirit of the developments summarized in Hosmer et al (2013, Chapter 5) on overcoming some weaknesses of the HL test, but in a very different way. The DRYV approach can be used for higher detection power without the prerequisite that all the important covariates are known to be included in the MTA. Such methods can help prevent false justification of overly simplified or defective models for better scientific understanding.

There are different aspects in which a binary regression model may suffer from a lack of fit: 1. The model has mistakenly missed important covariates. 2. The model has included all the relevant covariates, but the form of the linear combination is incorrect (e.g., due to lack of a needed quadratic or interaction term). 3. The model has included all the relevant covariates, but the link function is misspecified. Obviously, these issues may be present concurrently.

DRYV can be used to gain insight on why a model fails the GOF assessment. Suppose that the MTA involves only a subset of the available covariates. We can first apply DRYV to compare the model with a nonparametric method that uses all the variables in the data. If the MTA is rejected, we know either some covariates are missing or the model assumptions

(i.e., specification of the linear component and the link function in our context) are not right.

Then we apply DRYV with the nonparametric methods based on only the covariates used in the MTA. If the model is also rejected, we conclude that the structural assumptions of the MTA (relative to the variables it uses) are not sound. If the MTA is actually favored by the new DRYV comparison, we conclude that the MTA has missed important covariates.

The traditional rejection region of a test typically depends on the intended size of the test, and the data analyst plays an active role in controlling the probability of type I error, which is certainly not without controversy. In our DRYV approach, no such choice is needed (one may set a larger percentage of having higher predictive likelihood for the nonparametric estimator than the 50% used in this paper, but it has no obvious relation to error probabilities). This may also be regarded as a disadvantage in terms of active control of the probability of type I error. In any event, as shown in Theorem 1, under sensible conditions, both types of error probabilities converge to zero, and the numerical results strongly support it.

For the use of the DRYV method, one must specify a suitable nonparametric estimator of the conditional probability function. Although the tree based random forest and bagging worked very well in our simulations, alternative nonparametric estimation tools for binary regression may help increase the power of detecting specific aspects of lack of fit.

One major limitation of the DRYV method is that it deals with iid observations. Extensions are needed to handle longitudinal or other non-iid data that are frequently encountered in real applications.

6. Supplementary Materials

The assumptions, proof of Theorem 1 in Section 2 and some additional data analysis results for Section 4 are available with this paper at the Biometrics website on Wiley Online Library. For details of implementations of the R packages rms and randomForest, see http://cran.r-project.org/web/packages/rms/index.html and http://cran.r-project.org/web/packages/randomForest/index.html.

Acknowledgements

The authors are truly grateful to the Co-Editor, the AE and two reviewers for very constructive suggestions. We thank Dr. Howard Bondell for sharing his R codes.

References

Agresti, A. (2012). Categorical Data Analysis, 3rd Edition. Wiley.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Proc. 2nd Int. Symp. Info. Theory, Ed. B. N. Petrov and F. Csaki, 267-281. Budapest.
Arlot, S. and Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4, 40-79.
Bondell, H. D. (2007). Testing goodness-of-fit in logistic case-control studies. Biometrika, 94, 487-495.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32.
Burman, P. (1990). Estimation of optimal transformations using v-fold cross validation and repeated learning-testing methods. Sankhya, Ser. A, 52, 314-345.
Fan, J., Farmen, M. and Gijbels, I. (1998). Local maximum likelihood estimation and inference. Journal of the Royal Statistical Society, Ser. B, 60, 591-608.
Geisser, S. (1975). The predictive sample reuse method with applications. Journal of the American Statistical Association, 70, 320-328.
Harrell, F. (2015). Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. New York: Springer.
Hosmer, D. W. and Lemeshow, S. (1980). A goodness-of-fit test for the multiple logistic regression model. Communications in Statistics, A10, 1043-1069.
Hosmer, D. W., Hosmer, T., le Cessie, S. and Lemeshow, S. (1997). A comparison of goodness-of-fit tests for the logistic regression model. Statistics in Medicine, 16, 965-980.
Hosmer, D. W., Lemeshow, S. and Sturdivant, R. X. (2013). Applied Logistic Regression, 3rd Edition. New York: Wiley.
le Cessie, S. and van Houwelingen, J. C. (1991). A goodness-of-fit test for binary regression models, based on smoothing methods. Biometrics, 47, 1267-1282.
Nan, Y. and Yang, Y. (2014). Variable selection diagnostics measures for high-dimensional regression. Journal of Computational and Graphical Statistics, 23, 636-656.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.
Serrano, N. (2012). Calibration strategies to validate predictive models: is new always better? Intensive Care Medicine, 38, 1246-1248.
Shao, J. (1993). Linear model selection by cross-validation. Journal of the American Statistical Association, 88, 486-494.
Stodden, V. (2015). Reproducing statistical results. Annual Review of Statistics and Its Application, 2, 1-19.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Ser. B, 36, 111-147.
van der Laan, M. J., Dudoit, S. and Keles, S. (2004). Asymptotic optimality of likelihood-based cross-validation. Statistical Applications in Genetics and Molecular Biology, 3, Article 4.
Veldkamp, B. P. and van der Linden, W. J. (2002). Multidimensional adaptive testing with constraints on test content. Psychometrika, 67, 575-588.
Vexler, A., Hutson, A. D. and Chen, X. (2016). Statistical Testing Strategies in the Health Sciences. Chapman & Hall.
Yang, Y. (2006). Comparing learning methods for classification. Statistica Sinica, 16, 635-657.
Yang, Y. (2007). Consistency of cross validation for comparing regression procedures. Annals of Statistics, 35, 2450-2473.
Yin, G. and Ma, Y. (2013). Pearson-type goodness-of-fit test with bootstrap maximum likelihood estimation. Electronic Journal of Statistics, 412-427.
Zhang, Y. and Yang, Y. (2015). Cross-validation for selecting a model selection procedure. Journal of Econometrics, 187, 95-112.

[Figure 1: six panels (Missing Quadratic 1, Missing Quadratic 2, and Missing Interaction, each with the Larger and Smaller MTA), plotting Freq. of Rejecting H0 against Sample Size (100 to 500).]

Figure 1. Power Comparison of HL, CH and DRYV with Missing Quadratic/Interaction Term. HL: solid line; CH: dashed line; DRYV-RF: dotted line; DRYV-BAG: dot dash line. This figure appears in color in the electronic version of this article. 20 Biometrics, 000 0000

Table 1
Percentages of Rejecting the MTA

Scenario        n     HL   CH   Bondell   DRYV-RF   DRYV-BAG
I             200      9    8      7         0          0
              300      6    3      4         0          0
              500      5    5      5         0          0
              700      9    9      2         0          0
             1000      8    6      8         0          0
II (Wrong)    200      3    3      2       100        100
              300      6    4      5       100        100
              500      5    7      2       100        100
              700      6    3      4       100        100
             1000      4    8      4       100        100
III           200      7    1      2         0          0
              300      4    1      5         0          0
              500      8    7      4         0          0
              700      8    6      9         0          0
             1000      7    4      7         0          0

Table 2
Average Percentages of Rejecting the MTA (standard errors in parentheses)

MTA               HL      CH      DRYV-RF   DRYV-BAG
Smaller           4.6     5.0      99.8      99.2
                 (0.9)   (0.9)     (0.2)     (0.5)
Larger            7.6     8.2      27.6      30.2
                 (1.4)   (1.2)     (3.5)     (3.5)
No Lack-of-Fit     -      4.0       0          0
                         (0.9)     (0)        (0)

Table 3
DRYV Pairwise Comparison with 50% Training. The (i, j)-th entry records the percentage of times among the random data splittings that the i-th model/method has a strictly larger predictive likelihood than the j-th model/method.

           Age-only   BIC    AIC    RF     BAG
Age-only      0       18.6   29.0   52.3   81.3
BIC          81.3      0     52.0   79.0   93.0
AIC          71.0     48.0    0     75.0   89.0
RF           47.7     21.0   25.0    0     95.7
BAG          18.7      7.0   11.0    4.3    0

Table 4
DRYV Pairwise Comparison with 90% Training. The (i, j)-th entry records the percentage of times among the random data splittings that the i-th model/method has a strictly larger predictive likelihood than the j-th model/method.

           Age-only   BIC    AIC    RF     BAG
Age-only      0       31.0   25.3   47.7   62.0
BIC          69.0      0     37.0   73.0   82.7
AIC          74.7     63.0    0     81.7   89.3
RF           52.3     27.0   18.3    0     78.0
BAG          38.0     17.3   10.7   22.0    0