
Florida State University Libraries

Electronic Theses, Treatises and Dissertations
The Graduate School

2018

A Study of Some Issues of Goodness-of-Fit Tests for Logistic Regression

Wei Ma

FLORIDA STATE UNIVERSITY

COLLEGE OF ARTS AND SCIENCES

A STUDY OF SOME ISSUES OF GOODNESS-OF-FIT TESTS

FOR LOGISTIC REGRESSION

By

WEI MA

A Dissertation submitted to the Department of in partial fulfillment of the requirements for the degree of Doctor of Philosophy

2018

Copyright © 2018 Wei Ma. All Rights Reserved.

Wei Ma defended this dissertation on July 17, 2018. The members of the supervisory committee were:

Dan McGee Professor Co-Directing Dissertation

Qing Mai Professor Co-Directing Dissertation

Cathy Levenson University Representative

Xufeng Niu Committee Member

The Graduate School has verified and approved the above-named committee members, and certifies that the dissertation has been approved in accordance with university requirements.

ACKNOWLEDGMENTS

First of all, I would like to express my sincere gratitude to my advisors, Dr. Dan McGee and Dr. Qing Mai, for their encouragement, continuous support of my PhD study, and patient guidance. I could not have completed this dissertation without their help and immense knowledge. I have been extremely lucky to have them as my advisors. I would also like to thank the rest of my committee members, Dr. Cathy Levenson and Dr. Xufeng Niu, for their support, comments and help with my thesis. I would like to thank all the staff and graduate students in my department. During the five years of my PhD, they helped me so much in my study, research and life. I have been very lucky to attend FSU for my doctorate degree. Last but not least, I would like to thank my family members for their support and encouragement during my PhD study, especially my parents, who gave me life and have supported me spiritually throughout it.

TABLE OF CONTENTS

List of Tables ...... vi
List of Figures ...... viii
Abstract ...... x

1 Introduction 1
1.1 Review of Logistic Regression ...... 1
1.2 Goodness-of-fit Test ...... 2
1.3 Two Issues of Hosmer-Lemeshow Test ...... 3

2 Goodness-of-fit Test 6
2.1 Pearson's Chi-square and Deviance Tests ...... 7
2.2 Tsiatis Test ...... 9
2.3 Unweighted Sum of Squares Test ...... 11
2.4 Hosmer-Lemeshow Test ...... 12
2.5 Motivating Example ...... 15

3 Grouping Test 17
3.1 Majority Vote Method ...... 17
3.2 Minimum P Method ...... 18
3.3 P Values Combined Method ...... 19
3.4 Averaging Statistics Method ...... 22

4 Simulation Studies of Grouping Test 24
4.1 Type I Error ...... 24
4.2 Power ...... 32
4.3 Conclusion ...... 43

5 Interaction Test 44
5.1 Global Interaction Test ...... 45
5.2 Local Interaction Test ...... 47
5.3 Generalization of the Binary Covariate ...... 48

6 Simulation Studies of Interaction Test 51
6.1 Type I Error ...... 51
6.2 Power ...... 52

7 Analysis of Real Data 65

8 Summary and Future Work 75

Appendices
A Calculation of the Expectation for Five Grouping Tests 77

B IRB Approval 79

Bibliography ...... 83
Biographical Sketch ...... 88

LIST OF TABLES

2.1 Contingency table for goodness-of-fit tests ...... 8

2.2 HL tests for bone mineral density data ...... 16

4.1 Average rate of HL tests agreeing with each other across number of groups ...... 26

4.2 Type I error of HL tests for different number of groups with 500 replications, *m = 802 is not applicable for n = 500, 1000, 2000, 4000 as n/m is too small, so the results under these combinations are empty ...... 26

4.3 Type I error of majority vote with 500 replications ...... 27

4.4 Type I error of minimum p method and Bonferroni correction with 500 replications . 28

4.5 Type I error of p values combined methods with independent assumption with 500 replications ...... 29

4.6 Type I error of p values combined and averaging statistics using bootstrap approach with 500 replications and 1000 bootstrap samples ...... 31

4.7 Power of HL tests for different number of groups with 500 replications, *m = 802 is not applicable for n = 500, 1000, 2000, 4000 as n/m is too small, so the results under these combinations are empty ...... 33

4.8 Power of majority vote with 500 replications ...... 35

4.9 Power of minimum p method with 500 replications ...... 37

4.10 Power of p values combined and averaging statistics using bootstrap approach with 500 replications and 1000 bootstrap samples ...... 42

4.11 Summary of the performance of grouping tests ...... 43

5.1 Summary of five grouping tests and their expectations ...... 46

5.2 Summary of five local LR and their expectations ...... 48

5.3 Summary of five local LR and their expectations for ...... 49

6.1 Type I error of interaction tests with 500 replications and 1000 bootstrap samples . . 52

6.2 Power of different tests for Model 1, case 1 with 500 replications and 1000 bootstrap samples ...... 54

6.3 Power of different tests for Model 2, case 1 with 500 replications and 1000 bootstrap samples ...... 56

6.4 Power of different tests for Model 3, case 1 with 500 replications and 1000 bootstrap samples ...... 57

6.5 Power of different tests for Model 4, case 1 with 500 replications and 1000 bootstrap samples ...... 59

6.6 Power of different tests for Model 1, case 2 with 500 replications and 1000 bootstrap samples ...... 61

6.7 Power of different tests for Model 2, case 2 with 500 replications and 1000 bootstrap samples ...... 62

6.8 Power of different tests for Model 3, case 2 with 500 replications and 1000 bootstrap samples ...... 63

7.1 Descriptive summary for BMD ...... 65

7.2 Grouping tests for model (7.2) ...... 70

7.3 Goodness-of-fit tests of model (7.4) for bone mineral density data, used 6 to 12 groups and 1000 bootstrap samples ...... 71

7.4 Likelihood ratio test for bone mineral density data ...... 72

7.5 Grouping tests of model (7.5) for bone mineral density data, used 6 to 12 groups and 1000 bootstrap samples ...... 73

LIST OF FIGURES

4.1 Type I error of minimum p method and Bonferroni correction with 500 replications . 28

4.2 Comparison of distribution of T between bootstrap and true samples with N = 1000 . 30

4.3 Type I error of p values combined and averaging statistics using bootstrap approach with 500 replications and 1000 bootstrap samples ...... 31

4.4 Power of majority vote with 500 replications ...... 36

4.5 Power comparison of majority vote and minimum p with Bonferroni Correction using 3 to 12, 18 groups with 500 replications ...... 38

4.6 Power of minimum p with BC using 3 to 12 and 18 groups with 500 replications . . . 39

4.7 Power of minimum p with BC using 6 to 12 groups with 500 replications ...... 40

4.8 Power of p values combined and averaging statistics with 500 replications and 1000 bootstrap samples ...... 41

6.1 Type I error of interaction tests with 500 replications and 1000 bootstrap samples . . 53

6.2 Power of different tests for Model 1, case 1 with 500 replications and 1000 bootstrap samples ...... 55

6.3 Power of different tests for Model 2, case 1 with 500 replications and 1000 bootstrap samples ...... 56

6.4 Power of different tests for Model 3, case 1 with 500 replications and 1000 bootstrap samples ...... 58

6.5 Power of different tests for Model 4, case 1 with 500 replications and 1000 bootstrap samples ...... 59

6.6 Power of different tests for Model 1, case 2 with 500 replications and 1000 bootstrap samples ...... 61

6.7 Power of different tests for Model 2, case 2 with 500 replications and 1000 bootstrap samples ...... 62

6.8 Power of different tests for Model 3, case 2 with 500 replications and 1000 bootstrap samples ...... 64

7.1 of BMD ...... 66

7.2 of BMD versus age ...... 67

7.3 of BMD versus gender ...... 67

7.4 Box plot of BMD vs race-ethnicity ...... 68

7.5 Logit plot for bone mineral density data with 10 groups ...... 69

7.6 Goodness-of-fit tests of model (7.4) for bone mineral density data, used 6 to 12 groups and 1000 bootstrap samples ...... 72

7.7 Grouping tests of model (7.5) for bone mineral density data, used 6 to 12 groups and 1000 bootstrap samples ...... 74

ABSTRACT

Goodness-of-fit tests are important for assessing how well a model fits a set of observations. The Hosmer-Lemeshow (HL) test is a popular and commonly used method for assessing goodness-of-fit in logistic regression. However, there are two issues with using the HL test. The first is that we must specify the number of partition groups, and different choices often suggest different decisions. In this study, we therefore propose several grouping tests that combine multiple HL tests with varying numbers of groups, rather than using one arbitrary grouping or searching for the optimum one; the best choice of groups is data-dependent and not easy to find. The other drawback of the HL test is that it has low power to detect the omission of interactions between continuous and dichotomous covariates. Therefore, we propose global and local interaction tests in order to capture such violations. Simulation studies are carried out to assess the Type I error and power of all the proposed tests. The tests are illustrated with the bone mineral density data from NHANES III.

CHAPTER 1

INTRODUCTION

1.1 Review of Logistic Regression

When the response variable is binary or dichotomous, logistic regression is a standard method to model the relationship between a response variable (dependent variable) and a set of predictors (independent variables). Suppose we are in the binary case and have a sample of n independent pairs (x_i, Y_i), i = 1, 2, ..., n, where x_i = (1, x_{i1}, ..., x_{ip})' is the covariate vector for the i-th subject and Y_i = 0, 1 is the outcome. Denote π_i = P(Y_i = 1 | x_i) as the probability of the event. Then the logistic regression model is

π_i = e^{g(x_i)} / (1 + e^{g(x_i)}) .   (1.1)

The logit is given by the equation

g(x_i) = log[ π_i / (1 − π_i) ] = β_0 + β_1 x_{i1} + ... + β_p x_{ip} ,   (1.2)

where β = (β_0, ..., β_p)' are the regression coefficients. Maximum likelihood is used to estimate the parameters; denote the estimates by β̂ = (β̂_0, β̂_1, ..., β̂_p)'. Since the observations are assumed to be independent, the log-likelihood function is

l(β) = Σ_{i=1}^n [ Y_i log(π_i) + (1 − Y_i) log(1 − π_i) ] .   (1.3)

This log-likelihood function is maximized using an iterative algorithm such as the Newton-Raphson method. The variances and covariances of the estimated coefficients are obtained from the inverse of the information matrix, Var(β̂) = Î^{-1}(β̂) = (X'V̂X)^{-1}, where X is the n × (p + 1) design matrix containing the covariates for each subject and V̂ is an n × n diagonal matrix with general element π̂_i(1 − π̂_i). The logistic regression model is one of the classic models for estimating the probability of an event based on one or more features, and it can also be used for classification.
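The fitting procedure described above can be sketched in a few lines. The example below (Python with NumPy; the data and all names are illustrative, not from the dissertation) implements the Newton-Raphson iteration for the logistic log-likelihood and recovers Var(β̂) = (X'V̂X)^{-1} at convergence.

```python
import numpy as np

def fit_logistic(X, y, tol=1e-8, max_iter=50):
    """Fit a logistic regression by Newton-Raphson.

    X is the n x (p+1) design matrix (first column all ones for the
    intercept) and y the vector of 0/1 outcomes.  Returns the MLE
    beta_hat and its estimated covariance (X'VX)^{-1}.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))   # fitted probabilities
        W = pi * (1.0 - pi)                      # diagonal of V
        # Newton step: beta <- beta + (X'VX)^{-1} X'(y - pi)
        step = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - pi))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
    W = pi * (1.0 - pi)
    cov = np.linalg.inv(X.T @ (W[:, None] * X))  # Var(beta_hat)
    return beta, cov

# Simulated illustration
rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1])
true_beta = np.array([-0.5, 1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ true_beta))))
beta_hat, cov_hat = fit_logistic(X, y)
```

Because the logistic log-likelihood is concave, the Newton iteration started from zero typically converges in a handful of steps.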

The logistic regression model offers a nice interpretation. For a logistic regression, exp(β̂_k), k = 1, 2, ..., p, is the odds ratio that approximates how much more likely the event is to occur when x_k increases by one unit with the other explanatory variables held fixed. It allows one to identify whether the presence of a risk factor increases the odds of a given outcome. Logistic regression remains popular today. Nashef et al. 1999 [32] built a scoring system for predicting the mortality of cardiac surgical patients, obtaining the score for each variable from the logistic regression coefficients. Witt et al. 2004 [56] used logistic regression to examine the association between participation in cardiac rehabilitation after myocardial infarction (MI) and survival, age, and gender. Bensic et al. 2005 [7] compared the accuracy of credit scoring for small businesses using logistic regression, neural networks (NNs), and CART decision trees. Abreu et al. 2008 [2] measured quality of life based on two categorical variables, gender and marital status, using logistic regression. Fullerton 2009 [18] analyzed ordinal dependent variables in social science research with logistic regression, using examples from the General Social Survey and the American National Election Study. Cole et al. 2010 [11] developed a logistic regression model to predict the survival of preterm babies using gestation, birth weight and base deficit from umbilical cord blood as predictors. Hauser and Booth 2011 [21] predicted bankrupt firms by logistic regression using financial ratio data from 2006 to 2007.

1.2 Goodness-of-fit Test

After fitting the logistic regression, it is important to assess the adequacy of the fitted model in describing the relationship between the independent and dependent variables using goodness-of-fit tests. Goodness-of-fit tests are crucial for logistic regression for the following reasons. If the logistic regression is misspecified, the interpretation of the relationship between the response variable and the explanatory variables could be wrong, so we should evaluate the model with goodness-of-fit tests before interpreting the results. In addition, important statistical inferences are drawn from the fitted model, and the validity of any inference depends on the assumption that the fitted model is correct. For example, missing one or more important covariates can bias the coefficient estimates, leading to a biased odds ratio, which is crucial for estimating a treatment effect. Besides, logistic regression is often used to predict the probability of an event, and a model that lacks fit will have low accuracy. Therefore, it is important and worthwhile to study goodness-of-fit tests for logistic regression. Goodness-of-fit tests mainly examine three violations: whether the logistic link is appropriate, whether the linear combination of the explanatory variables is sufficient, and whether the underlying distribution of the outcome variable is Bernoulli.

1.3 Two Issues of Hosmer-Lemeshow Test

In this dissertation, we focus on the Hosmer-Lemeshow (HL) test, which is very popular and widely used for logistic regression. The HL test employs a very straightforward way to partition subjects into groups based on the estimated probabilities. It can be performed in most statistical software packages, so it is very easy to implement. However, there are two issues with using this test. To obtain the HL test statistic, we construct a contingency table and assign the estimated probabilities to different groups; the statistic is then computed in the usual chi-square test form. One drawback is the difficulty of finding the best number of groups. Table 2.2, which we discuss in Chapter 2, indicates that the results for different numbers of groups are not the same, and we have no general guidance for selecting the best one with confidence. Hence, instead of trying to find the optimum number of groups, we propose a set of grouping tests that use different combination methods to combine HL tests with varying numbers of groups. In this way, we do not have to debate which number of groups to choose; we simply combine multiple groupings to produce one single statistic. The other issue is that the HL test is not powerful for detecting missing interaction terms. Hosmer et al. 1997 [25] and Hosmer and Hjort 2002 [22] indicated in their papers that the power of goodness-of-fit tests, including the HL test, to detect the omission of interaction terms between continuous and binary covariates in logistic regression is generally low. The grouping tests address the problem of choosing the number of groups, but the power for detecting missing interactions remains uniformly low regardless of the number of groups, so the grouping tests do not solve this second problem either.
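The idea of combining HL tests across several numbers of groups can be illustrated generically. The snippet below uses hypothetical p-values (the dissertation's actual combination rules are developed in Chapter 3) and applies a Bonferroni-corrected minimum-p decision across HL tests computed with different numbers of groups.

```python
def min_p_bonferroni(p_values, alpha=0.05):
    """Combine several HL tests (one per choice of the number of
    groups g) by the minimum-p rule with a Bonferroni correction:
    reject the fitted model at level alpha iff
    min(p) < alpha / (number of tests combined)."""
    k = len(p_values)
    p_min = min(p_values)
    return p_min, p_min < alpha / k

# Hypothetical p-values from HL tests with g = 6, ..., 12 groups
p_vals = [0.21, 0.04, 0.002, 0.11, 0.35, 0.08, 0.015]
p_min, reject = min_p_bonferroni(p_vals)
print(p_min, reject)  # 0.002, True since 0.002 < 0.05/7
```

The Bonferroni factor controls the overall Type I error without assuming independence among the HL statistics, at some cost in power.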
The simulation results presented in later chapters, such as those in Section 6.2, support this point: the power does not improve when HL tests with different numbers of groups are combined. Pulkstenis and Robinson 2002 [40] proposed two goodness-of-fit tests based on a two-stage modification of the HL test. The subjects are first grouped by all the categorical covariates, and then within each group two subgroups are formed according to the median of the estimated probabilities. The ordinary Pearson's chi-square or deviance test is then performed to compare the observed and expected counts in the contingency table, similar to the HL test. The authors pointed out that the proposed tests may be useful for detecting lack of fit due to missing higher-order interaction terms. Nevertheless, this approach requires a strict covariate structure in the model, and if there are too many or too few levels of the categorical covariates, the test may be troublesome to implement. In addition, these tests may not detect other types of violation. To address the specific violation of missing interactions in a fitted logistic regression, a natural choice is the likelihood ratio (LR) test. The LR test can be used to compare the goodness-of-fit of two models in which the null model is a special case of the alternative model; it measures whether the data are more likely under one model than the other based on the likelihood ratio. The LR test is very powerful and easy to implement. According to the Neyman-Pearson lemma (Neyman and Pearson 1933 [33]), the LR test is a uniformly most powerful (UMP) test, having the greatest power among all possible tests of a given size α. In the logistic regression setting, we can include the interactions in one model but not in the other in order to examine the violation of missing interactions. Peña and Slate 2006 [37] constructed a test statistic for validating the four assumptions of the linear model as a function of four independent statistics, each focusing on a particular violation.
The four assumptions of linear regression are summarized by Peña and Slate as: (A1) Linearity: the relationship between the independent and dependent variables is linear; (A2) the variances of the error terms stay constant across the values of the independent variables; (A3) Uncorrelatedness: there is no correlation among the dependent variables; (A4) Normality: the data have a multivariate normal distribution. Many formal tests address only a specific assumption: for example, Anscombe and Tukey 1963 [4] for the normality assumption, Cook and Weisberg 1983 [12] for heterogeneity of variances, Tukey 1949 [51] for link misspecification, and Durbin and Watson 1950 [14] for uncorrelatedness. Peña and Slate therefore proposed a test of all the assumptions (A1)-(A4), which can be used especially when we do not know which specific assumption is violated. The test is constructed by summing four component statistics, each designed for one specific assumption. Simulation studies indicated that the test can detect different types of violations of the model assumptions. In the logistic regression setting, the LR test is utilized for the violation of missing interactions, and the HL test is powerful for detecting other types of violation. This paper therefore motivates us to propose an interaction test that combines the LR test and the HL test, and the proposed test is used to detect different types of violation of the fitted logistic regression. To address these two issues of the HL test, we propose grouping tests and interaction tests, respectively. We also conduct numerical studies to assess the performance of the proposed tests. This dissertation is structured as follows. In Chapter 2, we review some popular goodness-of-fit tests for logistic regression and apply them to a real dataset to motivate our study. In Chapter 3, several approaches to combining Hosmer-Lemeshow tests into grouping tests are discussed. Simulation studies of the grouping tests are presented in Chapter 4. In Chapter 5 and Chapter 6, we discuss the methodology and simulation results for the global and local interaction tests. A real dataset is analyzed as an illustration of all proposed methods in Chapter 7. Finally, in Chapter 8, we discuss our planned future work.

CHAPTER 2

GOODNESS-OF-FIT TEST

The goodness-of-fit test assesses whether the fitted model is correct. The null hypothesis is that the logistic regression model is true, and the alternative hypothesis is that the fitted model is not sufficient for the data. In the logistic regression context, the essential components of fit are given by the following three assumptions (Hosmer et al. 1997 [25]):

i) the logit link function is appropriate, g(x) = Xβ;

ii) there is no omission, transformation, or interaction of predictors, so the linear combination Xβ is sufficient;

iii) the variance is Bernoulli, Var(Y_i | x_i) = π_i(1 − π_i).

The consequences of violating the above assumptions are serious. The estimated coefficients and corresponding standard errors will be biased, so significance tests and confidence intervals may be misleading. Violations can also lead to poor estimation of other quantities; for example, an inaccurate odds ratio estimate will influence the interpretation of a treatment effect. If the model is misspecified, the accuracy of prediction and classification for new subjects is reduced, and the interpretation of the relationship between the response variable and the independent variables may also be inaccurate. Many goodness-of-fit tests have been proposed, and most of them examine the overall fit of the fitted model (Pearson 1900 [36], Tsiatis 1980 [50], Hosmer and Lemeshow 1980 [23], Brown 1982 [9], Stukel 1988 [46], Copas 1989 [13], Azzalini et al. 1989 [6], Su and Wei 1991 [48], Le Cessie and Van Houwelingen 1991, 1995 [27] [28], Osius and Rojek 1992 [34], Farrington 1996 [16], Pigeon and Heyse 1999 [39] [38], Hosmer and Hjort 2002 [22], Pulkstenis and Robinson 2002 [40]). Some tests examine a specific lack of fit (Royston 1992, 1993 [43] [44]). In the following sections, we describe some commonly used goodness-of-fit tests for logistic regression.

2.1 Pearson's Chi-square and Deviance Tests

We first define the term covariate pattern, which we will use for two goodness-of-fit tests. A covariate pattern describes a particular configuration of values of the covariates in a model. For example, if the model contains only two categorical predictors, race and sex, each coded with two levels, then there are only four distinct combinations of covariates and thus four possible covariate patterns. When continuous covariates are present in the model, each observation can potentially have a different covariate pattern; then the asymptotic chi-square distribution of the test statistic does not hold, because the counts in the cells of the contingency table are too small. Pearson's chi-square test and the deviance test are constructed from two measures of the difference between the observed outcome Y and the fitted value π̂ (the estimated probability of the event Y = 1): the Pearson residual and the deviance residual. The observation matrix X is an n × (p + 1) covariate matrix whose first column is 1 for the intercept. Let J denote the number of covariate patterns of X; if some subjects share the same covariate values, then J < n. Denote the number of subjects with x = x_j by m_j, j = 1, 2, ..., J, so that Σ_{j=1}^J m_j = n. Let y_j be the number of events (y = 1) among the m_j subjects with the j-th covariate pattern, so that Σ_{j=1}^J y_j = n_1, the total number of subjects with Y = 1. In a logistic regression, the fitted values are calculated for each covariate pattern and depend on the estimated probability for that covariate pattern. So for the j-th covariate pattern, the fitted value is

ŷ_j = m_j π̂_j = m_j [ e^{ĝ(x_j)} / (1 + e^{ĝ(x_j)}) ] ,   (2.1)

where ĝ(x_j) = β̂_0 + β̂_1 x_{j1} + β̂_2 x_{j2} + ... + β̂_p x_{jp} is the estimated logit.

2.1.1 Pearson’s Chi-square Test

For a particular covariate pattern, the Pearson residual is

r(y_j, π̂_j) = (y_j − m_j π̂_j) / sqrt( m_j π̂_j (1 − π̂_j) ) .   (2.2)

The Pearson's chi-square statistic based on these residuals is

χ² = Σ_{j=1}^J [ r(y_j, π̂_j) ]² .   (2.3)

The Pearson residuals r(y_j, π̂_j) measure the difference between the observed outcomes and the fitted probabilities within each covariate pattern, and χ² sums their squares over the J covariate patterns. A small value implies that the model fits well, while a large value indicates evidence of deviation. We can think of χ² as the Pearson's chi-square statistic that results from the 2 × J contingency table shown as Table 2.1. The columns of the table are the two possible values of the response,

Y = 1, 0, and the rows are the J covariate patterns. Let O_{1j} denote the observed count and E_{1j} the expected count with Y = 1 for the j-th covariate pattern; similarly, O_{0j} and E_{0j} are the observed and expected counts for Y = 0. Under the hypothesis that the logistic model is correct, the estimated expected value is m_j π̂_j for the Y = 1 column and m_j (1 − π̂_j) for the Y = 0 column. The statistic χ² is then calculated as:

χ² = Σ_{i=0}^{1} Σ_{j=1}^{J} (O_{ij} − E_{ij})² / E_{ij} .   (2.4)

Covariate pattern | Y = 0 | Y = 1 | Total
1 | O_{01} (E_{01}) | O_{11} (E_{11}) | m_1
2 | O_{02} (E_{02}) | O_{12} (E_{12}) | m_2
... | ... | ... | ...
J | O_{0J} (E_{0J}) | O_{1J} (E_{1J}) | m_J

Table 2.1: Contingency table for goodness-of-fit tests

2.1.2 Deviance Test

The deviance test, a likelihood ratio test comparing the fitted model to a saturated model, also measures the difference between O_{ij} and E_{ij}, but it sums deviance residuals instead of Pearson residuals. The deviance residual is

d(y_j, π̂_j) = ± { 2 [ y_j log( y_j / (m_j π̂_j) ) + (m_j − y_j) log( (m_j − y_j) / (m_j (1 − π̂_j)) ) ] }^{1/2} ,   (2.5)

where the sign, + or −, is the same as the sign of (y_j − m_j π̂_j). The summary statistic based on the deviance residuals is

D = Σ_{j=1}^J d(y_j, π̂_j)² .   (2.6)

In the context of the contingency table, we can think of D as the log-likelihood chi-square statistic. Then the statistic D is

D = 2 Σ_{i=0}^{1} Σ_{j=1}^{J} O_{ij} log( O_{ij} / E_{ij} ) .   (2.7)

Under the null hypothesis that the fitted logistic regression is correct, the statistics χ² and D follow a chi-square distribution with J − (p + 1) degrees of freedom. We reject the null hypothesis of no significant lack of fit when the statistic exceeds the corresponding upper-α critical value. The Pearson's chi-square and deviance tests defined in Equations (2.4) and (2.7) are classical goodness-of-fit tests, and they are conceptually elegant. In practice, however, there are issues with using these two tests for logistic regression. The p-values computed from the contingency table are correct only when the estimated expected values are "large" in each cell (i.e., m_j π̂_j > 5) and the number of cells remains fixed. Thus, p-values calculated for χ² and D using the χ²_{J−p−1} distribution will not be valid when the number of covariate patterns is close to the number of subjects (i.e., J ≈ n). As a result, these two test statistics may not be appropriate when continuous covariates are present. Since the approximation to the chi-square distribution for Pearson's chi-square test is not valid in this case, Osius and Rojek 1992 [34] derived a large-sample normal approximation to the distribution of the Pearson's chi-square statistic. Farrington 1996 [16] proposed approximations to the first three moments of Pearson's statistic with better sparse-data properties. However, this test has the structural deficiency of never rejecting H_0 when the number of covariate patterns equals the number of observations, i.e., J = n.
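The computation of χ² and D from grouped data can be sketched as follows. The counts and fitted probabilities below are hypothetical (illustrative, not output from a real fit).

```python
import numpy as np

# Hypothetical grouped data: J covariate patterns with m_j subjects,
# y_j observed events, and fitted probabilities pi_hat_j from some
# logistic model.
m   = np.array([30, 25, 40, 35, 28, 42])
y   = np.array([ 4,  6, 15, 20, 21, 36])
pih = np.array([0.15, 0.25, 0.40, 0.55, 0.72, 0.85])

# Pearson residuals and chi-square statistic, eqs. (2.2)-(2.3)
r = (y - m * pih) / np.sqrt(m * pih * (1 - pih))
chi2 = np.sum(r ** 2)

# Deviance residuals and deviance statistic, eqs. (2.5)-(2.6);
# terms with y_j = 0 or y_j = m_j would contribute 0 (0*log 0 = 0)
with np.errstate(divide="ignore", invalid="ignore"):
    t1 = np.where(y > 0,       y * np.log(y / (m * pih)), 0.0)
    t2 = np.where(y < m, (m - y) * np.log((m - y) / (m * (1 - pih))), 0.0)
D = np.sum(2 * (t1 + t2))

# Both statistics are referred to a chi-square distribution with
# J - (p + 1) degrees of freedom when the expected counts are large.
print(chi2, D)
```

For these illustrative counts the observed events track the fitted probabilities closely, so both statistics come out small.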

2.2 Tsiatis Test

In a contingency table such as Table 2.1, the expected values are always quite small if the model contains at least one continuous covariate, since the number of rows increases as n increases. One way to deal with this problem is to collapse the rows into a fixed number of groups and then calculate observed and expected frequencies. Tsiatis 1980 [50] proposed an approach based on fixed groups in the "x" space. The Tsiatis statistic partitions the covariate space into G regions and then performs a score test. It adds a constant term in each region via a series of indicator functions, so that the conditional probability of a successful outcome can be expressed as

log[ π_i / (1 − π_i) ] = x_i'β + γ'I_i ,   (2.8)

where x_i = (1, x_{i1}, ..., x_{ip})', β = (β_0, β_1, ..., β_p)', γ = (γ_1, γ_2, ..., γ_G)' and I_i = (I_i^{(1)}, I_i^{(2)}, ..., I_i^{(G)})'. The indicator functions I_i^{(j)}, j = 1, 2, ..., G, are defined by I_i^{(j)} = 1 if x_i lies in the j-th region and I_i^{(j)} = 0 otherwise. The null hypothesis for the goodness-of-fit test is H_0: γ_1 = γ_2 = ... = γ_G = 0, and the alternative H_1: at least one γ_j ≠ 0. The test compares the observed to the estimated expected frequency within each partition group, which yields a score test. The Tsiatis statistic is

U = X'V⁻X ,   (2.9)

where X is the G-dimensional score vector (∂l/∂γ_1, ..., ∂l/∂γ_G)' and l is the log-likelihood function of the logistic regression. The G × G matrix V can be expressed as

V = A − B C^{-1} B' .   (2.10)

In terms of the k-th covariate and the j-th group,

A_{jj'} = −∂²l / ∂γ_j ∂γ_{j'}   (j, j' = 1, 2, ..., G) ,   (2.11)

B_{jk} = −∂²l / ∂γ_j ∂β_k   (j = 1, 2, ..., G; k = 0, 1, ..., p) ,   (2.12)

C_{kk'} = −∂²l / ∂β_k ∂β_{k'}   (k, k' = 0, 1, ..., p) .   (2.13)

X and V are evaluated at γ = 0 and β = β̂, and V⁻ is any generalized inverse of the matrix V. For logistic regression, Tsiatis 1980 [50] demonstrated that the statistic is a quadratic form in the vector of observed counts minus expected counts. The j-th element of X is

Σ_{i=1}^n Y_i I_i^{(j)} − Σ_{i=1}^n I_i^{(j)} exp(x_i'β̂) / (1 + exp(x_i'β̂))   (2.14)

and the components for computing the matrix V are

A_{jj'} = Σ_{i ∈ ξ_j} π̂_i (1 − π̂_i) if j = j', and A_{jj'} = 0 if j ≠ j'   (j, j' = 1, 2, ..., G) ,   (2.15)

B_{jk} = Σ_{i ∈ ξ_j} x_{ik} π̂_i (1 − π̂_i)   (j = 1, 2, ..., G; k = 0, 1, ..., p) ,   (2.16)

C_{kk'} = Σ_{i=1}^n x_{ik} x_{ik'} π̂_i (1 − π̂_i)   (k, k' = 0, 1, ..., p) ,   (2.17)

where ξ_j denotes the set of subjects whose covariates lie in the j-th region. The Tsiatis statistic is asymptotically distributed as chi-square with degrees of freedom equal to the rank of V. If the covariates are discrete, the Pearson's chi-square test coincides with the Tsiatis statistic. With a complicated model, however, it is difficult to partition the covariates into meaningful groups.
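The partition idea behind the Tsiatis test can be illustrated with a likelihood-ratio analogue, which for nested models is asymptotically equivalent to the score test. The sketch below partitions a single covariate into G = 4 quantile regions, adds G − 1 region indicators (one region is absorbed by the intercept to keep the design full rank, a simplification relative to the formulation above), and compares the maximized log-likelihoods; the data and all settings are simulated for illustration.

```python
import numpy as np

def max_loglik_logistic(X, y):
    """Maximize the logistic log-likelihood by Newton-Raphson and
    return the maximized log-likelihood value."""
    beta = np.zeros(X.shape[1])
    for _ in range(50):
        pi = 1 / (1 + np.exp(-(X @ beta)))
        W = pi * (1 - pi)
        # tiny ridge guards against a numerically singular Hessian
        H = X.T @ (W[:, None] * X) + 1e-10 * np.eye(X.shape[1])
        step = np.linalg.solve(H, X.T @ (y - pi))
        beta += step
        if np.max(np.abs(step)) < 1e-9:
            break
    pi = 1 / (1 + np.exp(-(X @ beta)))
    return np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))

rng = np.random.default_rng(1)
n = 1500
x = rng.normal(size=n)
X0 = np.column_stack([np.ones(n), x])
y = rng.binomial(1, 1 / (1 + np.exp(-(-0.3 + 0.8 * x))))

# Partition the covariate into G quantile regions
G = 4
edges = np.quantile(x, np.linspace(0, 1, G + 1))
region = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, G - 1)
# G-1 indicator columns (first region absorbed by the intercept)
I = (region[:, None] == np.arange(1, G)[None, :]).astype(float)
X1 = np.column_stack([X0, I])

# LR statistic, referred to chi-square with G-1 degrees of freedom
lr = 2 * (max_loglik_logistic(X1, y) - max_loglik_logistic(X0, y))
print(lr)
```

Because the data here are generated under the null (no region effect), the statistic should be an unremarkable draw from roughly a chi-square with 3 degrees of freedom.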

2.3 Unweighted Sum of Squares Test

If the expected frequencies are small, the approximation to the chi-square distribution for Pearson's test is not valid. Copas 1989 [13] proposed an unweighted sum-of-squares test that retains all the individual observations. The test statistic is

Ŝ = Σ_{i=1}^n (Y_i − π̂_i)² .   (2.18)

The moments and asymptotic distribution of the statistic Ŝ are provided in Hosmer et al. 1997 [25]. Under the null hypothesis that the fitted logistic regression is true, the asymptotic moments of Ŝ are

E( Ŝ − trace(V) ) ≅ 0 ,   (2.19)

Var( Ŝ − trace(V) ) ≅ d'(I − M)V d ,   (2.20)

where d is a vector with d_i = 1 − 2π_i, V = diag[ v_i = π_i(1 − π_i) ] is an n × n diagonal matrix, and M = VX(X'VX)^{-1}X' is the logistic regression version of the hat matrix. The matrix X is the n × (p + 1) design matrix. We obtain estimates of these moments by plugging in V̂. The standardized statistic [ Ŝ − trace(V̂) ] / sqrt( V̂ar[ Ŝ − trace(V̂) ] ) is then referred to the standard normal distribution to assess significance.
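A direct implementation of the unweighted sum-of-squares statistic and its standardization follows. The data are simulated, and for brevity the statistic is evaluated at the true coefficients rather than the MLE (an assumption made only to keep the sketch short; in practice π̂ comes from the fitted model).

```python
import numpy as np

# Simulated logistic data (hypothetical, for illustration only)
rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
pi_hat = 1 / (1 + np.exp(-(0.2 + 0.6 * x)))   # stands in for fitted probs
y = rng.binomial(1, pi_hat)

S = np.sum((y - pi_hat) ** 2)                 # eq. (2.18)
v = pi_hat * (1 - pi_hat)
V = np.diag(v)
d = 1 - 2 * pi_hat
M = V @ X @ np.linalg.inv(X.T @ V @ X) @ X.T  # logistic "hat" matrix
var = d @ (np.eye(n) - M) @ V @ d             # eq. (2.20)
z = (S - np.trace(V)) / np.sqrt(var)          # compare to N(0, 1)
print(z)
```

Since E(Ŝ) ≈ trace(V) under a correctly specified model, z should behave roughly like a standard normal draw here.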

There is some evidence supporting the unweighted sum-of-squares test. Hosmer et al. 1997 [25] compared a series of goodness-of-fit tests by simulation and recommended the Pearson's chi-square test with the standard normal approximation and the unweighted sum of squares because of their superior power over other tests. Sturdivant and Hosmer Jr 2007 [47] developed a modification of the unweighted sum-of-squares goodness-of-fit test with kernel-smoothed residuals for hierarchical logistic regression models, where the standardized statistic is compared to the standard normal distribution; however, the optimal bandwidth choice for this test is not very clear. Other methods involving residuals have been proposed. Su and Wei 1991 [48] proposed a test based on cumulative sums of residuals, but the p-values are determined by complicated and time-consuming simulations. Le Cessie and Van Houwelingen 1991, 1995 [27] [28] proposed tests based on sums of squares of smoothed residuals. However, neither test is available in software packages at this time.

2.4 Hosmer-Lemeshow Test

Hosmer and Lemeshow 1980 [23] applied the idea of a contingency table (see Table 2.1) and proposed a goodness-of-fit test now referred to as the Hosmer-Lemeshow (HL) test. They partition the table based on the ranked estimated logistic probabilities \hat{\pi}_i using two grouping strategies: (1) form groups of equal numbers of subjects, and (2) use fixed cut points on the [0, 1] interval. The statistic using 10 equal-sized groups is a special case of the first grouping strategy called "deciles of risk" and can be computed in many statistical software packages. The basis of this goodness-of-fit statistic is to construct a 2 × g contingency table. First, define a grouping variable W where w_i = j if c_{j-1} \le \hat{\pi}_i < c_j, j = 1, ..., g, i = 1, ..., n. The c_j's are known constants and 0 = c_0 < c_1 < ... < c_{g-1} < c_g = 1. For the deciles of risk grouping strategy, force the marginal distribution of W to be uniform to form the 2 × g table. That is, the cut points depend on the estimated probabilities \hat{\pi}_i and are determined so that the first group contains the n_1 = n/g subjects having the smallest estimated probabilities and the last group contains the n_g = n/g subjects having the largest estimated probabilities. For the fixed cut points strategy with, for example, g = 10 groups, the first group contains all the subjects whose estimated probabilities are less than or equal to 0.1, while the tenth group contains those subjects whose estimated probability is greater than 0.9. Denote the observed frequency of the pair (Y_i = k, w_i = j), k = 0, 1, j = 1, 2, ..., g,

as O_{kj} and the expected frequency as E_{kj}. The expected frequency is computed by E_{1j} = \sum_{i=1}^{n_j} \hat{\pi}_i and E_{0j} = n_j - E_{1j}, where n_j is the number of subjects in the j-th group, j = 1, 2, ..., g. Table 2.1 provides the data classification for the Hosmer-Lemeshow test, but using g groups partitioned on the fitted probabilities instead of the J covariate patterns. The goodness-of-fit statistic is obtained by comparing the observed frequencies to those expected if the hypothesized logistic regression model holds, and it is constructed in the usual chi-square goodness-of-fit form. The test statistic is:

\hat{C} = \sum_{j=1}^{g} \sum_{k=0}^{1} \frac{(O_{kj} - E_{kj})^2}{E_{kj}} .   (2.21)

The distribution of the statistic \hat{C} is well approximated by the chi-square distribution with g - 2 degrees of freedom, i.e. \chi^2_{g-2}, under the null hypothesis that the fitted logistic regression is the correct model. Hosmer et al. 1988 [24] recommended, based on simulation results, avoiding the equal-sized grouping strategy for the HL test when the estimated probabilities are small and avoiding fixed cut points when the range of estimated probabilities is narrow. They also point out that the grouping method based on percentiles of the estimated probabilities is preferable to the one based on fixed cut points for better adherence to the \chi^2_{g-2} distribution, especially when many of the estimated probabilities are small. The Hosmer-Lemeshow test is widely used due to some desirable properties. It employs a straightforward grouping method that partitions the ranked fitted probabilities, which resolves the issue of invalid p-value calculation for Pearson's chi-square test. This grouping strategy does not depend on the covariate patterns; it is based only on the estimated probabilities and thus can be used for any structure of the predictors. The HL test provides an easily interpretable and computable number that can be used to assess fit. Besides, it can be performed in most statistical software packages. However, the HL test also has some deficiencies. The first is that the HL test statistic, as well as the associated p-value, is certain to vary with the selected number of groups. As a result, it brings ambiguity to decision-making when quite different p-values are calculated from different groupings: we have to debate which grouping to trust with no extra information. This issue makes the process of decision-making very difficult. The second point is that the HL test examines the overall fit of the model but has low power to detect specific types

of lack-of-fit, such as omission of interaction terms (Hosmer et al. 1997 [25], Hosmer and Hjort 2002 [22]). Aside from these two major issues, it has some additional minor issues. Kramer and Zimmerman 2007 [26] point out that the HL test is very sensitive to the sample size: when the sample size is large, the HL test almost always rejects the null hypothesis. The distribution of the HL statistic has not been derived rigorously, and the approximation to the chi-square distribution rests only on simulation results. When using the deciles of risk grouping method, the subjects belonging to the same group may have quite different values of the covariates but similar estimated probabilities, so we get no information about where the lack of fit occurs in terms of the covariates. There is also no advice on how to improve the model when the HL statistic is statistically significant. When ties exist in the estimated probabilities, different software may use different methods to handle them. Bertolini et al. 2000 [8] found that the order of observations could be another possible explanation of inconsistent HL test statistics, besides the choice of how to assign ties, when the number of covariate patterns is less than the number of subjects. The popularity and deficiencies of the HL test have been studied by many researchers in recent years. A class of tests was developed by Hosmer and Hjort 2002 [22] based on weighted partial sums of residuals, referred to as Hjort-Hosmer tests. These tests had the highest power to detect departures due to omission of a quadratic term but, like the HL test, had low power to detect lack of fit due to omission of an interaction term. Quinn et al. 2015 [42] compared several goodness-of-fit tests for a log-link regression. They pointed out that the Hjort-Hosmer [22] statistic with weights equal to 1 outperformed the others, with HL the next best option.
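For concreteness, the statistic of Equation (2.21) with the "deciles of risk" grouping can be sketched in a few lines. This is an illustrative implementation, not any particular package's: ties in the fitted probabilities are handled simply by the sort order, and the group sizes are made as equal as possible.

```python
import numpy as np
from scipy import stats

def hosmer_lemeshow(y, pi_hat, g=10):
    """HL test, Eq. (2.21), with 'deciles of risk' grouping: sort the
    fitted probabilities and split the subjects into g equal-sized groups."""
    order = np.argsort(pi_hat)
    C = 0.0
    for idx in np.array_split(order, g):       # near-equal group sizes
        n_j = len(idx)
        e1 = pi_hat[idx].sum()                 # E_1j: expected successes
        e0 = n_j - e1                          # E_0j: expected failures
        o1 = y[idx].sum()                      # O_1j: observed successes
        C += (o1 - e1) ** 2 / e1 + ((n_j - o1) - e0) ** 2 / e0
    return C, stats.chi2.sf(C, g - 2)          # compare to chi^2 with g-2 df
```

Varying g in this sketch is exactly the source of the group-selection sensitivity discussed below.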
Pulkstenis and Robinson 2002 [40] proposed two tests, \chi^{*2} and G^*, for regression models containing both categorical and continuous covariates, which first use the categorical covariates to form groups and then, within each of these groups, partition the estimated probabilities to form two more subgroups and perform the HL test. Simulation results showed that the proposed tests are more powerful than the HL test in detecting the omission of interaction terms. Archer et al. 2007 [5] proposed a series of goodness-of-fit tests for logistic regression models when data were collected using a complex sampling design. They found that the complex-sampling-design-based goodness-of-fit tests were more powerful than the iid-based HL tests in detecting departures due to misspecified covariates and interaction terms. Xie et al. 2008 [58] proposed two goodness-of-fit tests which compute a Pearson chi-square statistic or score statistic after partitioning the covariate space into groups using clustering. The proposed

tests outperform the HL test in all simulated scenarios and can identify which part of the covariate space does not fit well when the statistic is significant. Fagerland et al. 2008 [15] generalized the HL test to multinomial logistic regression. Abdul Hamid et al. 2017 [1] compared the goodness-of-fit test using a clustering partition of the covariate space, a direct generalization of the approach suggested by Xie et al. 2008 [58] for binary logistic regression, with the test by Fagerland et al. 2008 [15] for multinomial logistic regression. Abdul Hamid et al. 2017 [1] recommended using the clustering partition test for multinomial logistic regression when the model does not contain highly skewed covariates. Canary et al. 2017 [10] compared the Tsiatis statistic and the HL test using simulations and concluded that their performance was similar.

2.5 Motivating Example

The HL test is well received nowadays for assessing the goodness-of-fit of logistic regression, and it can be performed in most statistical software packages. However, there is no general rule for choosing the number of groups when performing the HL test, and the magnitude of the HL statistic is certain to vary with the number of groups. The multinomial version of the HL test statistic appeared to vary considerably with the selected number of groups in Fagerland et al. 2008 [15] when analyzing data from a study of cytological criteria for the diagnosis of breast tumors: the p-values for the tests were 0.154, 0.376 and 0.803 for 8 groups, 10 groups and 12 groups respectively. Pigeon and Heyse 1999 [39] suggested using different groupings to assess the sensitivity of \hat{C}. Paul et al. 2013 [35] studied methods for specifying the number of groups so that the power can be standardized across different sample sizes. They suggested using 10 groups for sample size n < 1000 and using g = max{10, min(m/2, (n - m)/2, 2 + 8(n/1000)^2)} for 1000 < n < 25000, where m denotes the number of events. As shown in Fagerland et al. 2008 [15], the p-values of the multinomial version of the HL test are quite different for 8, 10 and 12 groups. We use an example to illustrate that the performance of the HL test is very sensitive to the selected number of groups and could suggest different conclusions for different numbers of groups. The data are provided by the Third National Health and Nutrition Examination Survey. As one of the co-investigators, the research has been reviewed and approved by an IRB and the required documents are attached in Appendix B. The outcome is the mortality status of each subject, coded 0 and 1, and the predictor is the bone mineral density (BMD, g/cm²) at the femur neck. The goal is to study the relationship between mortality and the

bone mineral density. We choose to fit the logistic regression and perform the HL test to assess the fit of the model. Table 2.2 shows the results of HL tests with different numbers of groups.

Table 2.2: HL tests for bone mineral density data

Number of Groups          6              7              8              9             10             11             12
Statistic              6.0796         8.5773        10.6444        16.9208        23.8340        17.1968        18.9178
P-value                0.1933         0.1272         0.1000         0.0179         0.0024         0.0457         0.0413
Decision          not reject H0  not reject H0  not reject H0     reject H0      reject H0      reject H0      reject H0

It should be noted that the p-values differ greatly, ranging from 0.0024 to 0.1933, when using 6 to 12 groups. The tests also give different decisions: 3 tests suggest not rejecting the null hypothesis while the other 4 recommend rejecting H0. Therefore, it can be very challenging to use the HL test, since the magnitude of the statistic \hat{C} varies with the selected number of groups and thus gives different decisions, which brings ambiguity. This is concordant with the simulation results in Chapter 4, Table 4.1, which indicate that the grouping selection affects the results and the tests do not always agree with each other. So further research on this behavior is meaningful. In the following sections, we propose some tests that combine the HL tests with different numbers of groups.

CHAPTER 3

GROUPING TEST

As discussed in Chapter 2, the application of the Hosmer-Lemeshow (HL) test depends on the group selection. In later Chapters, the simulations show that the best selection is data-dependent, so it may not be easy to find. Therefore, we take an alternative approach that combines the HL tests with different numbers of groups instead of searching for the optimal grouping. Combination methods are often more robust (Qian and Yang 2016 [41], Wei and Yang 2012 [53], Liu and Yang 2012 [29]). In this Chapter, we introduce a set of grouping tests that use different methods to combine the HL test results; in this way, we expect the combined statistics to be more robust, with performance uniformly close to that of the best number of groups. For a data set, after fitting the logistic regression, we apply multiple HL tests with varying numbers of groups and then combine them to make a single decision. To introduce the approach, we start by setting up notation. Suppose we perform J HL tests with numbers of groups chosen from the set {m_1, m_2, ..., m_J}. That is, we use m_1 groups for the first HL test, m_2 for the second HL test, and so on. Each test yields a statistic and a corresponding p-value, so there are J statistics and J p-values, denoted {c_1, c_2, ..., c_J} and {p_1, p_2, ..., p_J}. We consider the following combination methods in separate sections: majority vote, minimum p, minimum p with Bonferroni correction, Tippett, Stouffer, Fisher, Logit and averaging statistics.

3.1 Majority Vote Method

The intuition for the majority vote method is that we may trust the decision made by most of groups. In our context, the method is that if more than half of the HL tests with different selection of groups reject the null hypothesis, then we make the decision to reject the null hypothesis. In other words, mathematically, we reject H0 if

\sum_{i=1}^{J} I_{(p_i < \alpha)} > \frac{J}{2} ,   (3.1)

where J is the total number of tests.

Majority vote has many advantages for making a binary decision and delivers the decision in a fair way. Because it is easy to understand and implement, we adopt this idea as one of our methods for combining the HL test results. We expect the majority vote method to be more robust because it combines the information from all the tests. Consider the HL results for the bone mineral density data in Table 2.2 at significance level \alpha = 0.05. Since four tests out of seven are significant, which is more than half of the total number of tests, we reject H0 based on the majority vote method. Majority vote can control the Type I error but may sometimes lead to low power, based on the simulation results we present in Chapter 4; we therefore consider the minimum p method in the next section.
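The rule in Equation (3.1), applied to the Table 2.2 p-values, can be written in a few lines (an illustrative sketch):

```python
def majority_vote(p_values, alpha=0.05):
    """Reject H0 when more than half of the J tests reject (Eq. 3.1)."""
    J = len(p_values)
    return sum(p < alpha for p in p_values) > J / 2

# p-values for 6 to 12 groups from Table 2.2
p_values = [0.1933, 0.1272, 0.1000, 0.0179, 0.0024, 0.0457, 0.0413]
print(majority_vote(p_values))   # True: 4 of 7 tests reject, so reject H0
```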

3.2 Minimum P Method

The simulation results in Chapter 4 indicate that the majority vote has low power in some situations, so we suggest a more aggressive method. The majority vote method makes the decision based on the majority of the tests, but the minimum p method is more aggressive: it rejects the null hypothesis if any one of the tests rejects. That is, we reject H0 if

min(p_1, p_2, ..., p_J) = p_{[1]} < \alpha .   (3.2)

The minimum p method rejects H0 if any of the tests is significant, which raises the power compared to the majority vote; in the meantime, however, the chance of incorrectly rejecting a true null hypothesis (making a Type I error) increases. We use the Bonferroni correction to compensate for that increase by testing each individual HL test at a significance level of \alpha/J. Therefore, the minimum p method with Bonferroni correction (BC) rejects H0 if

min(p_1, p_2, ..., p_J) = p_{[1]} < \alpha/J .   (3.3)

We expect the Type I error to be controlled by the Bonferroni correction, but at the same time it will lead to a conservative test. Therefore, we will study in the simulations whether the minimum p method with BC remains as aggressive as we expect.

In Table 2.2, the minimum p-value is p_{[1]} = 0.0024, which is less than \alpha = 0.05, so we reject the null hypothesis using the minimum p method. With the Bonferroni correction, p_{[1]} = 0.0024 < \alpha/J \approx 0.007, so we also reject H0.
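Both decision rules (Equations 3.2 and 3.3) amount to a single comparison; a minimal sketch, again using the Table 2.2 p-values:

```python
def min_p_reject(p_values, alpha=0.05, bonferroni=False):
    """Minimum p method (Eq. 3.2); with bonferroni=True each test is
    instead compared at level alpha/J (Eq. 3.3)."""
    level = alpha / len(p_values) if bonferroni else alpha
    return min(p_values) < level

p_values = [0.1933, 0.1272, 0.1000, 0.0179, 0.0024, 0.0457, 0.0413]
print(min_p_reject(p_values))                    # True: 0.0024 < 0.05
print(min_p_reject(p_values, bonferroni=True))   # True: 0.0024 < 0.05/7
```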

3.3 P Values Combined Method

In the previous two sections, the majority vote and minimum p methods both combine HL tests by comparing each p-value to the significance level \alpha. In this section, we construct several statistics as functions of {p_1, p_2, ..., p_J}. Compared to the previous two methods, the p-values combined methods use more of the information in the p-values. These methods are widely used in meta-analysis under the assumption that all the tests are independent. The independence assumption is crucial. Hartung et al. 2008 [20] mentioned that p-values derived from multiple independent tests follow a uniform distribution when the null hypothesis is true, regardless of the form of the test statistic, the underlying testing problem, and the nature of the parent population from which the samples are drawn. The distributions of all the proposed combined statistics, which are transformations of independent p-values, are derived from this fact. If the independence assumption is violated, the derived distributions of the test statistics no longer hold. In practice, the result of Hartung et al. 2008 [20] may still work approximately for weakly dependent tests. However, in our context the HL tests are highly dependent, since all of them are performed on the same data set. Therefore, it is inappropriate to use the derived distributions to calculate the p-values of the combined methods when applying them to HL tests. Our solution is to use a bootstrap method instead to approximate the distributions and obtain p-values from them. In the simulation part, we will examine the null distributions of the combined methods under the independence assumption and compare the results between assuming independence and using the bootstrap method. The performance of the bootstrap method is summarized in the simulation part.
We are now ready to present the combined methods for summarizing p-values: the Tippett, Stouffer, Fisher and Logit methods.

3.3.1 Tippett’s Method

Tippett 1931 [49] proposed a test based on the minimum of the independent p-values that rejects the null hypothesis if any of the p-values is less than \alpha^*, where \alpha^* = 1 - (1 - \alpha)^{1/J} and J is the total number of tests. That is, we reject H0 when

min(p_1, p_2, ..., p_J) = p_{[1]} < \alpha^* = 1 - (1 - \alpha)^{1/J} .   (3.4)

This method is a special case of Wilkinson 1951 [54] using the smallest p-value. When H0 is true, the statistic follows a beta distribution, thus the p-value for this test is 1 - (1 - p_{[1]})^J. In fact, Tippett's method is numerically close to the minimum p method with Bonferroni correction. By a first order Taylor expansion, the rejection criterion for Tippett's method is

\alpha^* = 1 - (1 - \alpha)^{1/J} \approx 1 - (1 - \alpha/J) = \alpha/J ,   (3.5)

which is the rejection criterion for the Bonferroni correction. However, we recommend using the Bonferroni correction rather than Tippett's method, because the Bonferroni correction does not require the independence assumption while Tippett's method needs it for validity. The Bonferroni correction is unlikely to inflate the Type I error, but its power remains to be seen. For completeness, we will still include the results of Tippett's method in the simulation studies and real data analysis in the following sections. When combining the HL tests for the bone mineral density data in Table 2.2 using Tippett's method, the test statistic is p_{[1]} = 0.0024.

3.3.2 Stouffer’s Method

Stouffer's method (Stouffer et al. 1949 [45]) is based on Z scores rather than using the p-values directly. Under the null hypothesis and the independence assumption, each p_i follows a uniform distribution, so z(p_i) = \Phi^{-1}(p_i) is a standard normal variable, where \Phi(\cdot) is the cumulative distribution function of the standard normal. We then convert the p-values {p_1, p_2, ..., p_J} to z scores

{z(p1), z(p2), ... z(pJ )} and obtain iid standard normal variables. Thus, the test statistic is

Z = \sum_{i=1}^{J} \frac{z(p_i)}{\sqrt{J}} ,   (3.6)

where J is the total number of tests. The statistic Z is then a standard normal variable, and

Stouffer's method rejects the null hypothesis when |Z| > z_\alpha.

In our context, the multiple HL tests are highly correlated, so each z(p_i) still follows the normal distribution but the variance of the sum is no longer correct due to the dependence among the HL tests. Hence the statistic in Equation (3.6) will not follow a standard normal distribution. Therefore, we will simply calculate the combined statistic and obtain the critical value by the bootstrap method.

We revisit Table 2.2 to obtain Stouffer's statistic. First, we compute the z scores of the seven p-values; then Z = \sum_i z(p_i)/\sqrt{7} = -4.3935.

3.3.3 Fisher’s Method

Fisher's method (Fisher 1934 [17]) is another widely known p-values combined method. It combines the p-values into one test statistic X^2 using the formula:

X^2 = -2 \sum_{i=1}^{J} \log(p_i) .   (3.7)

If the independence assumption holds, the statistic follows a chi-square distribution with 2J degrees of freedom, where J is the number of tests being combined. This is because, under the null hypothesis, each p_i follows a uniform distribution, so -2\log(p_i) follows a chi-square distribution with 2 degrees of freedom, and the sum of J independent chi-square variables follows a chi-square distribution with 2J degrees of freedom. The test rejects H0 when the statistic exceeds the 100(1 - \alpha)% critical value of the chi-square distribution with 2J degrees of freedom. This logarithmic transformation of the p-values is very straightforward and easy to implement. It is widely used in meta-analysis, but due to the dependence structure of the HL tests, the chi-square distribution no longer holds here. We will simply calculate Fisher's statistic using Equation (3.7) as another p-values combined method for HL tests; the bootstrap method is still needed to obtain the p-value. On the basis of Fisher's method, the combined statistic for the bone mineral density data in Table 2.2 is X^2 = 44.6333.

3.3.4 Logit’s Method

The last p-values combined technique we consider is the Logit method. George 1978 [19] proposed this test based on a logit transformation of the p-values. The statistic is

G = -\sum_{i=1}^{J} \log\left(\frac{p_i}{1 - p_i}\right) \left[\frac{J\pi^2(5J + 2)}{3(5J + 4)}\right]^{-1/2} .   (3.8)

Under the null hypothesis for independent tests, G is approximately t distributed with 5J + 4 degrees of freedom, so the test rejects H0 when the statistic exceeds the 100(1 - \alpha)% critical value of the t distribution. As with the previous three combined methods, the application to HL tests violates the independence assumption, so the approximate t distribution

may not be correct in our situation; we will therefore still use the bootstrap method and compare the two sets of results in the simulation. Referring to Table 2.2, the Logit test statistic calculated by Equation (3.8) for these data is G = 4.6534.
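The four combined statistics can be reproduced from the Table 2.2 p-values with the Python standard library alone; this is an illustrative sketch, and the small discrepancies from the values quoted in the text (-4.3935, 44.6333, 4.6534) come only from the rounding of the tabled p-values.

```python
import math
from statistics import NormalDist

p = [0.1933, 0.1272, 0.1000, 0.0179, 0.0024, 0.0457, 0.0413]  # Table 2.2
J = len(p)

# Tippett (Eq. 3.4): the smallest p-value is the test statistic
tippett = min(p)

# Stouffer (Eq. 3.6): sum of normal quantiles over sqrt(J)
inv_cdf = NormalDist().inv_cdf
stouffer = sum(inv_cdf(pi) for pi in p) / math.sqrt(J)

# Fisher (Eq. 3.7): -2 times the sum of log p-values
fisher = -2.0 * sum(math.log(pi) for pi in p)

# Logit (Eq. 3.8): scaled sum of logit-transformed p-values
scale = (J * math.pi ** 2 * (5 * J + 2) / (3.0 * (5 * J + 4))) ** -0.5
logit_stat = -sum(math.log(pi / (1.0 - pi)) for pi in p) * scale
```

With these inputs, stouffer, fisher and logit_stat land within rounding error of the worked values above.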

3.3.5 Bootstrap Algorithm

The distributions of the four p-values combined methods described above are approximated by the bootstrap method. We use W to denote the general statistic for these tests. The steps are the same for the different methods except that the specific statistic must be calculated for each test. The bootstrap steps are as follows.

Algorithm 1 Bootstrap algorithm
Input: a data set with n observations, (X, Y)
Output: p-value for W
1: Fit the logistic regression and obtain the estimated coefficients, \hat{\beta}
2: Apply multiple Hosmer-Lemeshow tests and calculate W
3: for i = 1 to N do
4:     Generate a data set using \hat{\beta} and the original X with n observations, (X, Y^{(i)})
5:     Apply the Hosmer-Lemeshow tests and calculate W^{(i)}
6: end for
7: The p-value is the proportion of W^{(i)} > W, i.e. (1/N) \sum_{i=1}^{N} I(W^{(i)} > W)

The bootstrap method uses the estimated coefficients to generate N data sets and N new statistics to approximate the distribution of the test statistic. The same steps apply to all p-values combined methods by replacing W with the corresponding test statistic.
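Algorithm 1 can be sketched generically as below. This is an illustrative implementation: stat_fn is a hypothetical user-supplied function (not from the text) that refits the model, runs the multiple HL tests, and returns the combined statistic W, while pi_hat holds the probabilities fitted on the original data.

```python
import numpy as np

def bootstrap_pvalue(X, y, pi_hat, stat_fn, N=1000, seed=None):
    """Parametric bootstrap p-value for a combined statistic W (Algorithm 1).
    pi_hat: probabilities from the model fitted to the original data (step 1);
    stat_fn(X, y): computes the combined statistic on a data set (steps 2, 5)."""
    rng = np.random.default_rng(seed)
    W = stat_fn(X, y)                      # observed statistic
    exceed = 0
    for _ in range(N):
        y_star = rng.binomial(1, pi_hat)   # step 4: regenerate outcomes
        if stat_fn(X, y_star) > W:         # step 5: bootstrap statistic
            exceed += 1
    return exceed / N                      # step 7: bootstrap p-value
```

Any of the combined statistics above (Stouffer, Fisher, Logit, ...) can be plugged in as stat_fn.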

3.4 Averaging Statistics Method

All the methods proposed in the previous sections are based on the p-values, but combining p-values can be tricky. In this section, we construct three statistics as functions of the HL test statistics {c_1, c_2, ..., c_J} with different weights based on the p-values. Hosmer and Lemeshow (1980) [23] showed, based on simulations, that if the null hypothesis is true, the distribution of the HL test statistic for g groups is well approximated by a chi-square distribution with g - 2 degrees of freedom, \chi^2_{g-2}, and so the mean of the statistic is g - 2. Based on this fact, we can scale the HL test statistics by their corresponding degrees of freedom for different

selections of groups and then combine the information from the J HL tests by averaging the scaled test statistics. That is, after dividing each c_i by its degrees of freedom, m_i - 2, the new statistic, denoted T, is the mean of the scaled c_i:

T = \frac{1}{J} \sum_{i=1}^{J} \frac{c_i}{m_i - 2} .   (3.9)

We also apply weights to each c_i by dividing by the corresponding p_i; in this way, the more significant tests get more weight. Denote this statistic T_1:

T_1 = \frac{1}{J} \sum_{i=1}^{J} \frac{c_i}{p_i(m_i - 2)} .   (3.10)

Instead of using p_i directly, we also try the square of p_i. Denote this test T_2:

T_2 = \frac{1}{J} \sum_{i=1}^{J} \frac{c_i}{p_i^2(m_i - 2)} .   (3.11)

The averaging statistics combine the HL tests using both the test statistics and the p-values. In this way, they capture more information while remaining easy to interpret and implement. However, since we do not know the distributions of these statistics, we will use the bootstrap method of Algorithm 1 to calculate their p-values in all the simulation studies, just as for the p-values combined methods. In this Chapter we have proposed a set of grouping tests using different methods to combine HL tests. In the next Chapter, we will examine the performance of the proposed methods based on simulations in two settings: when the null hypothesis is true and when the alternative hypothesis is true.
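The three averaging statistics of Equations (3.9) to (3.11) are simple functions of the triples (c_i, p_i, m_i); a minimal sketch, using the Table 2.2 values for illustration:

```python
def averaging_statistics(c, p, m):
    """T, T1, T2: HL statistics c_i scaled by degrees of freedom m_i - 2,
    with weights 1, 1/p_i and 1/p_i^2 respectively (Eqs. 3.9-3.11)."""
    J = len(c)
    T = sum(ci / (mi - 2) for ci, mi in zip(c, m)) / J
    T1 = sum(ci / (pi * (mi - 2)) for ci, pi, mi in zip(c, p, m)) / J
    T2 = sum(ci / (pi ** 2 * (mi - 2)) for ci, pi, mi in zip(c, p, m)) / J
    return T, T1, T2

# HL statistics, p-values, and group counts for 6 to 12 groups from Table 2.2
c = [6.0796, 8.5773, 10.6444, 16.9208, 23.8340, 17.1968, 18.9178]
p = [0.1933, 0.1272, 0.1000, 0.0179, 0.0024, 0.0457, 0.0413]
m = [6, 7, 8, 9, 10, 11, 12]
T, T1, T2 = averaging_statistics(c, p, m)
```

Since every p_i < 1, the weighting guarantees T < T_1 < T_2 for these data.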

CHAPTER 4

SIMULATION STUDIES OF GROUPING TEST

In this Chapter, we consider a number of different scenarios to examine the performance of the grouping tests. We divide the study into two parts. One is to examine whether the proposed methods control the Type I error when the fitted logistic model is the correct model. The other is to assess the power of the tests to detect departures from the logistic model in different situations.

4.1 Type I Error

In this section, we are concerned with the rejection rates of the statistics when the correct model is fit to the data. We used four different distributions of covariates as the true models to generate data. The covariate distributions were taken from Hosmer et al. 1997 [25]. The four models are as follows:

Model 1: logit(π) = 0.8X1

Model 2: logit(π) = 0.8X2

Model 3: logit(π) = −4.9 + 0.65X3

Model 4: logit(\pi) = -1.3 + (0.8/3)X_1 + (0.8/3)X_2 + (0.65/3)X_3 ,

where X_1 ~ U(-6, 6), X_2 ~ N(0, 1.5), X_3 ~ \chi^2(4), and X_1, X_2 and X_3 are independent. These models are classical models for examining the performance of goodness-of-fit tests. Hosmer and Hjort 2002 [22], Fagerland et al. 2008 [15], Abdul Hamid et al. 2017 [1] and Canary et al. 2017 [10] also used all or some of these models in their simulation studies. These models cover various scenarios. Models 1 to 3 use univariate predictors and Model 4 has three covariates. The uniform distribution U(-6, 6) in Model 1 produces mostly small or large probabilities \pi with a symmetric distribution. The distribution of the probabilities \pi is right skewed, with a long right tail, using the chi-square covariate in Model 3: the chi-square distribution generates mostly small probabilities and only a few large ones. The covariate structures of Model 2 and Model 4 produce probabilities roughly uniform on the interval

(0, 1). The probabilities of Y = 1 are 0.53, 0.508, 0.191 and 0.451 for the four models respectively, using sample size n = 1000. Model 3 has an unbalanced outcome distribution with a success rate of only 0.191; the event rates of the other three models are nearly 50 percent. In all simulations, we generated data with different sample sizes, n = 500, 1000, 2000, 4000 and 10000, for each model. Algorithm 2 gives the specific steps to generate the data sets and fit the logistic model. After fitting the logistic regression, we performed several HL goodness-of-fit tests with varying numbers of groups, m = 3 to 12, 18, 34, 66, 130, 802. Some of the numbers of groups are chosen from study 2 of the simulations in Yu et al. 2017 [59]. In all situations, we used 500 replicates. Table 4.2 shows the Type I error, that is, the proportion of times each statistic rejects the null hypothesis at the \alpha = 0.05 level, for HL tests with varying numbers of groups. We expect the proportions to be close to 0.05, but sometimes the values exceed 0.05 by chance. Therefore, we performed a one-proportion z test with sample size equal to the number of replicates, 500, for each setting. The highlighted values are statistically significant, indicating combinations for which the Type I error is not controlled.

Algorithm 2 Generate a data set
1: for i = 1 to n do
2:     Generate x_i from the specified covariate distribution
3:     Calculate the probability \pi_i = 1 / (1 + e^{-\beta' x_i})
4:     Generate the outcome y_i as a Bernoulli random variable with probability \pi_i
5: end for
6: Fit the logistic regression, log[P(Y = 1|x) / P(Y = 0|x)] = X\hat{\beta}
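Algorithm 2 can be sketched as below, here instantiated for Model 1; this is an illustrative implementation, and the helper draw_x is a hypothetical name, not from the text.

```python
import numpy as np

def generate_data(n, beta, draw_x, seed=None):
    """Algorithm 2: simulate (X, y) from a true logistic model.
    draw_x(rng, n) returns the n x (p+1) design matrix, intercept included."""
    rng = np.random.default_rng(seed)
    X = draw_x(rng, n)
    pi = 1.0 / (1.0 + np.exp(-X @ beta))   # step 3: logistic probabilities
    y = rng.binomial(1, pi)                # step 4: Bernoulli outcomes
    return X, y

# Model 1: logit(pi) = 0.8 * X1 with X1 ~ U(-6, 6)
X, y = generate_data(
    1000, np.array([0.0, 0.8]),
    lambda rng, n: np.column_stack([np.ones(n), rng.uniform(-6, 6, n)]),
    seed=0)
```

The fitting step (step 6) would then be done with any logistic regression routine before applying the HL tests.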

From Table 4.2 we observe that the Type I errors of the HL tests differ across the numbers of groups and thus affect the decisions. Table 4.1 also supports this observation: we calculated the average agreement rate over the 500 replicates across the numbers of groups for each scenario, and the values are all below 100 percent, which means that the HL tests do not agree with each other completely. That is why the grouping choice is so important when using the HL test. The main issues arise for very small and very large numbers of groups, for which the Type I errors are not well controlled. Even within the range of groups where the Type I error is controlled well (i.e. 6 to 12 groups), the tests have different powers, and there is no unique optimal selection

based on the simulation results that we present in the next section. In the following, we examine the performance of the grouping tests when H0 is true.

Table 4.1: Average rate of HL tests agreeing with each other across number of groups

                          Sample Size n
Model    n=500     n=1000    n=2000    n=4000    n=10000
1        94.91%    95.10%    95.63%    95.64%    94.87%
2        95.41%    95.33%    94.97%    95.97%    94.99%
3        94.83%    95.74%    95.04%    95.14%    95.00%
4        95.99%    95.53%    96.19%    93.90%    95.23%

Table 4.2: Type I error of HL tests for different number of groups with 500 replications, *m = 802 is not applicable for n = 500, 1000, 2000, 4000 as n/m is too small, so the results under these combinations are empty

                                                  Number of Groups m
Model  Sample Size     3      4      5      6      7      8      9     10     11     12     18     34     66    130    802
1      n=500        0.050  0.050  0.054  0.050  0.056  0.042  0.038  0.060  0.054  0.052  0.054  0.060  0.070  0.130
       n=1000       0.058  0.036  0.040  0.040  0.040  0.050  0.062  0.044  0.060  0.046  0.054  0.080  0.066  0.094
       n=2000       0.050  0.052  0.046  0.054  0.046  0.044  0.052  0.034  0.048  0.050  0.036  0.054  0.040  0.058
       n=4000       0.050  0.048  0.048  0.062  0.048  0.050  0.048  0.046  0.038  0.042  0.048  0.052  0.068  0.082
       n=10000      0.054  0.052  0.050  0.050  0.056  0.058  0.048  0.042  0.066  0.058  0.044  0.034  0.064  0.056  0.084
2      n=500        0.054  0.064  0.060  0.052  0.044  0.052  0.040  0.050  0.058  0.056  0.050  0.042  0.046  0.034
       n=1000       0.060  0.058  0.052  0.040  0.044  0.046  0.052  0.056  0.048  0.046  0.042  0.034  0.038  0.054
       n=2000       0.060  0.060  0.044  0.064  0.056  0.056  0.040  0.048  0.052  0.062  0.062  0.052  0.042  0.066
       n=4000       0.042  0.056  0.032  0.032  0.040  0.042  0.038  0.034  0.042  0.042  0.060  0.038  0.042  0.040
       n=10000      0.050  0.060  0.062  0.054  0.050  0.056  0.042  0.054  0.052  0.046  0.078  0.062  0.046  0.042  0.034
3      n=500        0.054  0.042  0.052  0.046  0.058  0.042  0.054  0.048  0.056  0.058  0.076  0.066  0.072  0.104
       n=1000       0.070  0.070  0.036  0.046  0.040  0.054  0.038  0.042  0.046  0.040  0.028  0.042  0.048  0.052
       n=2000       0.072  0.056  0.052  0.058  0.052  0.056  0.054  0.036  0.046  0.048  0.048  0.056  0.054  0.062
       n=4000       0.062  0.060  0.072  0.052  0.036  0.056  0.040  0.038  0.056  0.042  0.042  0.048  0.056  0.084
       n=10000      0.076  0.066  0.080  0.052  0.046  0.078  0.044  0.042  0.064  0.064  0.046  0.040  0.050  0.042  0.074
4      n=500        0.054  0.078  0.044  0.046  0.050  0.062  0.040  0.034  0.034  0.034  0.034  0.034  0.036  0.022
       n=1000       0.046  0.054  0.058  0.048  0.048  0.060  0.056  0.056  0.056  0.048  0.064  0.038  0.044  0.042
       n=2000       0.052  0.038  0.040  0.044  0.022  0.038  0.040  0.040  0.040  0.034  0.046  0.046  0.046  0.048
       n=4000       0.066  0.084  0.078  0.058  0.088  0.072  0.062  0.064  0.066  0.052  0.074  0.056  0.076  0.058
       n=10000      0.054  0.038  0.048  0.038  0.056  0.058  0.072  0.068  0.058  0.042  0.040  0.042  0.044  0.038  0.048

4.1.1 Majority Vote Results

Table 4.3 shows the Type I error results for the majority vote method. All the tests consistently reject the null hypothesis less than 5 percent of the time when the correct logistic regression is specified, which means the majority vote method controls the Type I error well. The small rejection probabilities, however, may imply low power for this method, so we also consider a more aggressive test, the minimum p method, which rejects H0 if any of the individual HL tests is significant.
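The majority-vote decision rule just described can be sketched in a few lines (a minimal Python sketch; the function name is ours, not from the dissertation):

```python
def majority_vote_reject(p_values, alpha=0.05):
    """Majority vote: reject H0 only if more than half of the
    individual HL tests are significant at level alpha."""
    rejections = sum(p < alpha for p in p_values)
    return rejections > len(p_values) / 2

# Example: only 2 of 5 tests are significant, so H0 is not rejected.
print(majority_vote_reject([0.01, 0.03, 0.20, 0.40, 0.60]))  # False
```

Because more than half of the tests must agree, the rule is conservative, which matches the small rejection rates in Table 4.3.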

Table 4.3: Type I error of majority vote with 500 replications

Model   n=500   n=1000  n=2000  n=4000  n=10000
1       0.020   0.018   0.010   0.022   0.014
2       0.012   0.008   0.012   0.004   0.012
3       0.018   0.008   0.010   0.012   0.018
4       0.008   0.018   0.012   0.022   0.008

4.1.2 Minimum P Results

Table 4.4 shows the Type I errors for the four models with two different combinations of HL tests. The first combines HL tests with 3 to 12 and 18 groups; we also present the combined results for HL tests with 6 to 12 groups. The "minimum p" column uses the minimum p method with significance level α = 0.05. The "minimum p BC" column uses the Bonferroni correction, setting the significance level at 0.05/J, where J is the total number of tests. As expected, for the minimum p method, narrowing the range of tests (combining HL tests with 6 to 12 groups rather than 3 to 12 and 18 groups) lowers the Type I errors for all four models, because small numbers of groups tend to reject H0 more often. However, even within the smaller range, the Type I errors are not controlled by the minimum p method. Each individual HL test controls the Type I error well, but if any single test makes a Type I error, combining the multiple HL tests through the minimum p method increases the probability of a Type I error; the minimum p method is more aggressive in this way. With the Bonferroni correction, however, the values are all less than 0.05, as indicated in Figure 4.1. Figure 4.1 shows the scatter plot comparing the minimum p method with and without the Bonferroni correction for all the scenarios. The y axis is the value of the Type I error and the red horizontal reference line is where α equals 0.05. We can clearly observe that all the points are below 0.05 with the Bonferroni correction. Therefore, the minimum p with Bonferroni correction method works well at controlling the Type I error.
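The minimum p rule with and without the Bonferroni correction can be sketched as follows (a minimal Python sketch; the function name is ours):

```python
def minimum_p_reject(p_values, alpha=0.05, bonferroni=False):
    """Minimum p: reject H0 if the smallest p-value falls below the
    threshold, which is alpha without correction or alpha/J under the
    Bonferroni correction, where J is the number of combined tests."""
    threshold = alpha / len(p_values) if bonferroni else alpha
    return min(p_values) < threshold

pvals = [0.02, 0.10, 0.30, 0.45, 0.60]           # J = 5 HL tests
print(minimum_p_reject(pvals))                    # True:  0.02 < 0.05
print(minimum_p_reject(pvals, bonferroni=True))   # False: 0.02 >= 0.01
```

The example illustrates the trade-off discussed above: the uncorrected rule rejects aggressively, while the Bonferroni threshold of α/J restores control of the Type I error.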

4.1.3 P Values Combined and Averaging Statistics Results

For the p values combined and averaging statistics methods, we combine HL tests with 6 to 12 groups, because Table 4.2 shows that the Type I errors are controlled well for these groups, and we then study

Table 4.4: Type I error of minimum p method and Bonferroni correction with 500 replications

Selection of Groups:        3:12, 18                    6:12
Model  Sample Size   Minimum p  Minimum p BC   Minimum p  Minimum p BC
1      n=500         0.240      0.030          0.184      0.040
1      n=1000        0.222      0.034          0.168      0.044
1      n=2000        0.222      0.034          0.158      0.034
1      n=4000        0.216      0.030          0.152      0.038
1      n=10000       0.254      0.036          0.198      0.034
2      n=500         0.244      0.028          0.170      0.034
2      n=1000        0.272      0.024          0.200      0.022
2      n=2000        0.244      0.032          0.170      0.040
2      n=4000        0.208      0.024          0.140      0.018
2      n=10000       0.254      0.038          0.162      0.034
3      n=500         0.236      0.040          0.168      0.036
3      n=1000        0.234      0.018          0.158      0.028
3      n=2000        0.254      0.020          0.184      0.020
3      n=4000        0.226      0.032          0.154      0.040
3      n=10000       0.262      0.026          0.186      0.030
4      n=500         0.246      0.016          0.172      0.022
4      n=1000        0.238      0.038          0.166      0.046
4      n=2000        0.200      0.024          0.140      0.028
4      n=4000        0.296      0.044          0.230      0.048
4      n=10000       0.238      0.034          0.184      0.032

Figure 4.1: Type I error of minimum p method and Bonferroni correction with 500 replications

the performance of the proposed combined methods within this range. We first assume that all the HL tests are independent, then apply all the p-values combined methods (Tippett, Stouffer, Fisher, Logit) to the four models and use the distributions of these statistics stated in Chapter 3 to obtain the p-values and Type I errors for these approaches. We display the results in Table 4.5. From the table, we observe that except for the Tippett method, the other three statistics yield rejection rates far above the significance level α = 0.05, which indicates a violation of the independence assumption. This motivates us to consider a different approach to find their true distributions.
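Under the independence assumption, the four combining statistics and their classical reference distributions can be sketched as follows (a minimal sketch using NumPy and SciPy; the exact null distributions follow Chapter 3, and the logit method here uses the standard Mudholkar-George t approximation):

```python
import numpy as np
from scipy import stats

def combine_pvalues(pvals):
    """Combine J independent p-values with the Tippett, Stouffer,
    Fisher and logit methods; returns the combined p-values."""
    p = np.asarray(pvals, dtype=float)
    J = len(p)
    # Tippett: the minimum p-value is Beta(1, J) under H0.
    tippett = 1.0 - (1.0 - p.min()) ** J
    # Stouffer: the sum of probits is N(0, J) under H0.
    z = stats.norm.isf(p)                      # Phi^{-1}(1 - p)
    stouffer = stats.norm.sf(z.sum() / np.sqrt(J))
    # Fisher: -2 * sum(log p) is chi-square with 2J df under H0.
    fisher = stats.chi2.sf(-2.0 * np.log(p).sum(), df=2 * J)
    # Logit: the scaled sum of logits is approximately t with 5J + 4 df.
    c = np.sqrt(J * np.pi ** 2 * (5 * J + 2) / (3 * (5 * J + 4)))
    logit = stats.t.sf(-np.log(p / (1.0 - p)).sum() / c, df=5 * J + 4)
    return {"tippett": tippett, "stouffer": stouffer,
            "fisher": fisher, "logit": logit}

combined = combine_pvalues([0.01, 0.04, 0.20, 0.50])
```

These closed-form reference distributions are exactly the ones that break down when the HL tests are correlated, which is what Table 4.5 demonstrates.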

Table 4.5: Type I error of p values combined methods under the independence assumption with 500 replications

Model  Sample Size   Tippett  Stouffer  Fisher  Logit
1      n=500         0.040    0.214     0.160   0.202
1      n=1000        0.044    0.194     0.134   0.176
1      n=2000        0.034    0.210     0.146   0.194
1      n=4000        0.038    0.200     0.162   0.190
1      n=10000       0.034    0.230     0.156   0.214
2      n=500         0.034    0.226     0.162   0.210
2      n=1000        0.022    0.238     0.152   0.216
2      n=2000        0.040    0.224     0.172   0.208
2      n=4000        0.018    0.186     0.130   0.176
2      n=10000       0.034    0.186     0.136   0.180
3      n=500         0.036    0.218     0.156   0.198
3      n=1000        0.030    0.204     0.152   0.186
3      n=2000        0.020    0.240     0.176   0.226
3      n=4000        0.042    0.200     0.146   0.190
3      n=10000       0.034    0.250     0.182   0.228
4      n=500         0.024    0.206     0.148   0.194
4      n=1000        0.046    0.198     0.154   0.180
4      n=2000        0.028    0.182     0.124   0.156
4      n=4000        0.050    0.258     0.214   0.244
4      n=10000       0.032    0.232     0.158   0.208

For both the p values combined and averaging statistics methods, the true distributions of the statistics are not easy to find, so we use the bootstrap method to approximate them. Since we do not know whether the bootstrap approximates the true distributions well, we perform some preliminary analysis and draw plots to examine whether the bootstrap works. Figure 4.2 displays the comparison of the distribution plots of T between the bootstrap method and random samples generated from the true coefficients for the four models. The plots in the first row use the bootstrap method with N = 1000 for each model. The second row is done by generating

1000 data sets with the true coefficients β and obtaining 1000 values of T. The plots with the bootstrap method are very similar to the plots in the second row. Therefore, we claim that the bootstrap method is a good approximation for the distribution of T. This also holds for the other p values combined statistics and averaging statistics methods, but we do not present those plots here. Therefore, we use the bootstrap method to obtain their distributions in all the simulation studies.
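The parametric bootstrap used here can be sketched as follows (a simplified sketch using NumPy; function names are ours, and in the actual study the logistic model is also refitted to each bootstrap sample before the statistic is recomputed):

```python
import numpy as np

def parametric_bootstrap_null(stat_fn, fit_probs, X, n_boot=1000, seed=0):
    """Approximate the null distribution of a goodness-of-fit statistic.

    Resamples binary responses from the fitted probabilities `fit_probs`
    and recomputes `stat_fn(y_boot, X)` for each bootstrap sample."""
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_boot):
        y_boot = rng.binomial(1, fit_probs)   # parametric resample of Y
        draws.append(stat_fn(y_boot, X))
    return np.array(draws)

def bootstrap_pvalue(t_obs, null_draws):
    """Right-tail bootstrap p-value: fraction of draws at least t_obs."""
    return float(np.mean(null_draws >= t_obs))
```

With N = 1000 bootstrap draws, the empirical distribution of the recomputed statistics stands in for the unknown null distribution, exactly as in Figure 4.2.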

(a) Bootstrap, model 1 (b) Bootstrap, model 2 (c) Bootstrap, model 3 (d) Bootstrap, model 4

(e) Samples, model 1 (f) Samples, model 2 (g) Samples, model 3 (h) Samples, model 4

Figure 4.2: Comparison of distribution of T between bootstrap and true samples with N = 1000

Table 4.6 shows the Type I error results for the p values combined methods and the averaging statistics. For all simulations, we use N = 1000 bootstrap samples and r = 500 replicates to compute the Type I error. The highlighted cells are significantly greater than 0.05 using a one-proportion z test. From the table, we observe that with a few exceptions, the empirical rejection percentages are within 2 percent of the desired 5 percent level of significance, and all of the tests control the Type I errors well in general. There are only 6 highlighted cells and they are only a little beyond 0.05. Figure 4.3 is the box plot of the Type I error for the seven proposed tests. The x axis presents the different tests and the y axis is the Type I error under all combinations for each test. T1, T2, Fisher and Logit have outliers in a few situations, while the other tests control the Type I error very well. Therefore, overall, all the tests perform well at controlling the Type I error. In summary, all the proposed tests perform well at controlling the Type I error except the minimum p method, but the Bonferroni correction compensates for its over-sensitivity, and the simulation results demonstrate that this correction works well. The majority

Table 4.6: Type I error of p values combined and averaging statistics using bootstrap approach with 500 replications and 1000 bootstrap samples

Model  Sample Size   T      T1     T2     Tippett  Stouffer  Fisher  Logit
1      n=500         0.050  0.048  0.052  0.062    0.060     0.044   0.044
1      n=1000        0.040  0.048  0.048  0.028    0.022     0.052   0.054
1      n=2000        0.050  0.044  0.044  0.050    0.052     0.052   0.050
1      n=4000        0.036  0.050  0.052  0.042    0.046     0.038   0.038
1      n=10000       0.032  0.040  0.040  0.054    0.040     0.058   0.050
2      n=500         0.064  0.076  0.072  0.046    0.046     0.056   0.058
2      n=1000        0.046  0.042  0.046  0.046    0.036     0.032   0.032
2      n=2000        0.056  0.042  0.042  0.052    0.050     0.058   0.062
2      n=4000        0.054  0.042  0.050  0.048    0.036     0.054   0.050
2      n=10000       0.042  0.036  0.034  0.048    0.058     0.046   0.044
3      n=500         0.052  0.056  0.052  0.048    0.048     0.042   0.040
3      n=1000        0.050  0.050  0.050  0.058    0.056     0.054   0.052
3      n=2000        0.062  0.048  0.056  0.044    0.040     0.042   0.040
3      n=4000        0.050  0.052  0.056  0.048    0.056     0.076   0.074
3      n=10000       0.052  0.064  0.066  0.050    0.048     0.030   0.032
4      n=500         0.046  0.046  0.048  0.050    0.050     0.062   0.060
4      n=1000        0.062  0.056  0.062  0.048    0.054     0.068   0.072
4      n=2000        0.036  0.048  0.050  0.048    0.042     0.052   0.054
4      n=4000        0.046  0.048  0.044  0.038    0.032     0.044   0.044
4      n=10000       0.054  0.046  0.042  0.040    0.046     0.050   0.050

Figure 4.3: Type I error of p values combined and averaging statistics using bootstrap approach with 500 replications and 1000 bootstrap samples

vote method is very conservative and does not reject often enough, so we expect low power to detect deficiencies for this method. In the next section, we present the simulation results when the fitted model is not correct.

4.2 Power

In this section, we run a series of simulations in which the fitted model is not the correct model and then examine the power of the grouping tests. The true models used to simulate the data are from Paul et al. 2013 [35]. The models are as follows:

Model 1: logit(π) = −2 + X1 + 0.2X1^2 + Z − 2(Z × X1)

Model 2: logit(π) = 2 + X1 + Z + 0.5(Z × X1)

Model 3: logit(π) = X1 + Z + 0.5(Z × X1)

Model 4: logit(π) = X1 + 0.2X1^2

Model 5: logit(π) = X1 + 0.2X1^2 + X2

Model 6: logit(π) = −3 − X1 − 0.2X1^2,

where X1 and X2 are N(0,1), Z ∼ Bernoulli(0.5), and X1, X2 and Z are independent. All simulated data are used to fit the following logistic regression:

logit(π) = β0 + β1X1 . (4.1)

These 6 models cover different situations. Model 1 is the most complicated model, involving the quadratic term X1^2 and the large interaction term 2(Z × X1), and it has the greatest deviation from the fitted model. Models 2 and 3 have a small interaction term, 0.5(Z × X1). Models 4 to 6 add a quadratic term, X1^2, to the fitted model. The probability of Y = 1 generated by these models is 0.256, 0.874, 0.585, 0.529, 0.528 and 0.055 for Models 1 to 6 respectively. Models 3, 4 and 5 generate relatively balanced data, with success rates close to 50 percent. Model 1 has a moderate rate of positive response at 25.6% and Model 2 has a very high rate at 87.4%. Model 6, however, generates the most unbalanced data set, with only 5.5% for Y = 1. In the simulations, for each model, we generated data with different sample sizes, n = 500, 1000, 2000, 4000 and 10000, using 500 replicates for each setting. After fitting the logistic regression, we run the HL tests with varying numbers of groups, m = 3 to 12, 18, 34, 66, 130, 802.
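As an illustration, data from Model 1 can be generated as follows (a minimal sketch using NumPy; the function name is ours, and the empirical event rate should be near the 0.256 reported above):

```python
import numpy as np

def sigmoid(t):
    """Inverse logit link."""
    return 1.0 / (1.0 + np.exp(-t))

def simulate_model1(n, seed=0):
    """Draw (y, X1, Z) from Model 1 of Paul et al. (2013):
    logit(pi) = -2 + X1 + 0.2*X1^2 + Z - 2*(Z*X1),
    with X1 ~ N(0,1) and Z ~ Bernoulli(0.5) independent."""
    rng = np.random.default_rng(seed)
    x1 = rng.standard_normal(n)
    z = rng.binomial(1, 0.5, n)
    pi = sigmoid(-2 + x1 + 0.2 * x1 ** 2 + z - 2 * z * x1)
    y = rng.binomial(1, pi)
    return y, x1, z

y, x1, z = simulate_model1(20000)
```

Fitting the misspecified model (4.1) to such data and running the HL tests gives one replicate of the power simulations in Table 4.7.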

Table 4.7: Power of HL tests for different numbers of groups with 500 replications. *m = 802 is not applicable for n = 500, 1000, 2000, 4000 as n/m is too small, so the results for these combinations are empty

Number of Groups m
Model  Sample Size    3     4     5     6     7     8     9    10    11    12    18    34    66   130   802
1      n=500      0.586 0.680 0.674 0.688 0.678 0.678 0.652 0.656 0.614 0.630 0.594 0.450 0.290 0.148
1      n=1000     0.864 0.940 0.944 0.948 0.940 0.948 0.950 0.956 0.958 0.952 0.930 0.844 0.682 0.484
1      n=2000     0.990 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.978 0.916
1      n=4000     1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
1      n=10000    1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
2      n=500      0.050 0.062 0.048 0.044 0.038 0.048 0.058 0.072 0.060 0.048 0.060 0.066 0.066 0.086
2      n=1000     0.078 0.080 0.060 0.040 0.056 0.054 0.050 0.054 0.070 0.056 0.062 0.066 0.076 0.112
2      n=2000     0.078 0.060 0.046 0.056 0.058 0.058 0.052 0.054 0.062 0.062 0.056 0.072 0.076 0.106
2      n=4000     0.078 0.064 0.080 0.084 0.052 0.062 0.072 0.082 0.070 0.066 0.070 0.058 0.070 0.086
2      n=10000    0.110 0.066 0.100 0.090 0.074 0.076 0.066 0.084 0.084 0.066 0.074 0.078 0.060 0.084 0.118
3      n=500      0.068 0.050 0.058 0.052 0.064 0.058 0.048 0.056 0.060 0.054 0.050 0.044 0.036 0.036
3      n=1000     0.060 0.072 0.064 0.046 0.050 0.054 0.078 0.068 0.052 0.068 0.064 0.040 0.058 0.038
3      n=2000     0.088 0.086 0.078 0.078 0.074 0.080 0.070 0.074 0.092 0.070 0.068 0.072 0.064 0.060
3      n=4000     0.106 0.114 0.106 0.088 0.094 0.110 0.098 0.100 0.082 0.078 0.082 0.088 0.062 0.062
3      n=10000    0.204 0.228 0.216 0.216 0.208 0.196 0.192 0.152 0.156 0.168 0.150 0.114 0.096 0.080 0.066
4      n=500      0.170 0.216 0.222 0.210 0.200 0.186 0.188 0.178 0.190 0.188 0.162 0.112 0.084 0.066
4      n=1000     0.374 0.424 0.420 0.430 0.404 0.426 0.410 0.388 0.386 0.364 0.320 0.256 0.180 0.120
4      n=2000     0.592 0.698 0.746 0.742 0.764 0.768 0.756 0.756 0.742 0.754 0.692 0.572 0.438 0.294
4      n=4000     0.872 0.950 0.968 0.984 0.982 0.982 0.984 0.978 0.972 0.978 0.978 0.918 0.802 0.636
4      n=10000    0.998 0.998 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.996 0.706
5      n=500      0.162 0.188 0.162 0.196 0.172 0.176 0.178 0.182 0.188 0.168 0.134 0.102 0.072 0.044
5      n=1000     0.264 0.352 0.354 0.364 0.362 0.348 0.324 0.320 0.336 0.300 0.268 0.194 0.146 0.106
5      n=2000     0.488 0.618 0.652 0.698 0.680 0.704 0.688 0.682 0.658 0.686 0.616 0.522 0.400 0.244
5      n=4000     0.824 0.926 0.938 0.962 0.950 0.946 0.966 0.962 0.956 0.950 0.940 0.888 0.736 0.530
5      n=10000    0.994 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.998 0.992 0.572
6      n=500      0.092 0.076 0.068 0.072 0.060 0.066 0.052 0.046 0.038 0.048 0.048 0.038 0.024 0.026
6      n=1000     0.098 0.100 0.112 0.108 0.086 0.086 0.108 0.086 0.082 0.078 0.074 0.046 0.036 0.008
6      n=2000     0.214 0.194 0.194 0.170 0.162 0.176 0.138 0.160 0.150 0.146 0.136 0.092 0.038 0.042
6      n=4000     0.410 0.386 0.414 0.410 0.376 0.336 0.324 0.382 0.304 0.302 0.272 0.158 0.090 0.044
6      n=10000    0.806 0.822 0.826 0.836 0.838 0.828 0.834 0.840 0.812 0.810 0.772 0.640 0.394 0.228 0.008

The results in Table 4.7 show the power of HL tests for different numbers of groups, which is the proportion of the tests correctly rejecting the null hypothesis at the α = 0.05 level when the alternative hypothesis is true. In general, Models 1, 4, 5 and 6 perform well as the sample size increases. The HL tests are especially powerful on Model 1, and the power reaches 1 when the sample size increases to 2000, since Model 1 is the most complicated model and the tests easily detect the departure. Model 6 has relatively low power at small sample sizes compared with the other three models, possibly because the data generated from Model 6 are unbalanced. However, Models 2 and 3 have low power, close to the significance level. This indicates that the HL tests are not powerful at detecting the departure caused by missing a small interaction with Z. Even though Model 1 also involves an interaction term, its power is still large, possibly due to the relatively large coefficient and the quadratic term. The power gets higher as the sample size increases, and it tends to be low

when the number of groups is relatively small or large in general. With too few groups, we do not have the sensitivity needed to distinguish observed from expected occurrences. The low power for large m may be due to the small number of subjects within each group, which makes it difficult to determine whether differences between observed and expected counts are due to chance or to model misspecification. The performance of HL tests depends on the choice of partition groups. The power differs greatly across different numbers of groups for the same model and sample size, and it is very challenging to find a generally optimal number of groups for the HL test. For example, m = 6 has the highest power for Model 1 with sample size 500, but this is not the case for other settings. This is the main reason why we consider combining the HL test results across groups; we expect the combined methods to be more robust, with power comparable to the best individual HL test. We examine the performance of all the proposed grouping tests when H1 is true in the following. We expect the power of the grouping tests to be either superior to the HL tests with different selections of groups or very close to the optimum.
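For reference, the individual HL statistic used throughout these simulations can be sketched as follows (a minimal sketch using NumPy and SciPy; observations are grouped by quantiles of the fitted probabilities, which are assumed to lie strictly between 0 and 1, and the conventional chi-square reference distribution with m − 2 degrees of freedom is used):

```python
import numpy as np
from scipy import stats

def hosmer_lemeshow(y, p_hat, m=10):
    """Hosmer-Lemeshow chi-square statistic with m groups.

    Sorts observations by fitted probability, splits them into m
    near-equal-size groups, and compares observed vs expected events."""
    order = np.argsort(p_hat)
    groups = np.array_split(order, m)
    stat = 0.0
    for g in groups:
        n_g = len(g)
        o_g = y[g].sum()          # observed events in the group
        e_g = p_hat[g].sum()      # expected events in the group
        pbar = e_g / n_g          # mean fitted probability
        stat += (o_g - e_g) ** 2 / (n_g * pbar * (1.0 - pbar))
    pval = stats.chi2.sf(stat, df=m - 2)
    return stat, pval
```

Running this for each m in {3, ..., 12, 18, 34, 66, 130, 802} and recording rejections at α = 0.05 reproduces one cell of Table 4.7 per replicate.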

4.2.1 Majority Vote Results

The power of the majority vote method appears in Table 4.8 under different combinations. As with the individual HL tests, the statistic is most powerful on Model 1. Models 4, 5 and 6 perform well, while Models 2 and 3 have relatively low power. As the sample size increases, the power goes up for each model except for Model 2, which shows very erratic behavior with its lowest point at n = 2000. Figure 4.4 shows the power plots of majority vote for each model. The first row is for Models 1 and 2, the second row includes Models 3 and 4, and the other two models appear in the last row. For each model, the x axis is the category of sample size and the y axis represents the power. For a fixed model and sample size, each bar covers the range of the power of the HL tests with different groups, and the point denotes the power of majority vote for that specific setting. Overall, the majority vote power is close to the optimum for large sample sizes (n = 4000 and n = 10000) except for Models 2 and 3, possibly due to their interaction terms. For small sample sizes, Model 1 shows comparable power, but the majority vote power only reaches the mean of the individual HL tests for some settings. The performance on Models 2 and 3 is very poor; most of the majority vote powers appear at the lowest point. Given the small Type I error of the majority vote indicated in the last section, the low power of this method is expected. The majority vote method is not as powerful as most of the individual HL tests. Therefore, we

consider a more powerful test, the minimum p method, and present its simulation results in the following.

Table 4.8: Power of majority vote with 500 replications

Model   n=500   n=1000  n=2000  n=4000  n=10000
1       0.592   0.944   1.000   1.000   1.000
2       0.024   0.022   0.018   0.026   0.040
3       0.012   0.016   0.024   0.042   0.120
4       0.098   0.304   0.710   0.980   1.000
5       0.092   0.236   0.620   0.946   1.000
6       0.016   0.038   0.074   0.238   0.802

4.2.2 Minimum P Results

The majority vote method is too conservative, so we use simulations to confirm whether the minimum p method, which rejects the null hypothesis as long as any single grouping rejects it, would increase the power. When the partition groups are too large or too small, the Type I errors are out of control based on the simulations in Table 4.2. If we include these groups in the minimum p method, it may reject the null hypothesis too often. Therefore, we try different combinations of HL tests. The results appear in Table 4.9. We use two different combinations of HL tests for the simulation. The column "3:12, 18" means using HL tests with 3 to 12 and 18 groups, and the column "6:12" gives the results of combining HL tests with 6 to 12 groups. The power of the minimum p method appears in the "minimum p" column and the power of minimum p with Bonferroni correction is presented in the "minimum p BC" column. For both combinations of tests, the power greatly improves compared to the majority vote method. However, in the simulations under the null hypothesis, we demonstrated that the minimum p method is too liberal and more likely to make a Type I error. To control the Type I error, we used the Bonferroni correction, setting the criterion for the HL tests to reject the null hypothesis at level α/J. The Type I error was controlled by the correction as shown in Table 4.4, but consequently, we observe from Table 4.9 that the power decreases compared to the minimum p method.

Figure 4.4: Power of majority vote with 500 replications

Figure 4.6 and Figure 4.7 are used to compare the performance of the minimum p with Bonferroni correction against the multiple HL tests for six models under the two different combinations of HL tests. For a fixed model, the x axis is the category of sample size and the y axis represents the power. The dot plot is for the HL tests with different numbers of groups for each setting, and the blue reference line for each setting is the power of minimum p with Bonferroni correction. When using 3 to 12 and 18 groups (Figure 4.6), the Bonferroni correction only removes the extreme low power from the HL tests

Table 4.9: Power of minimum p method with 500 replications

Selection of Groups:        3:12, 18                    6:12
Model  Sample Size   Minimum p  Minimum p BC   Minimum p  Minimum p BC
1      n=500         0.910      0.626          0.856      0.616
1      n=1000        1.000      0.944          0.990      0.938
1      n=2000        1.000      1.000          1.000      1.000
1      n=4000        1.000      1.000          1.000      1.000
1      n=10000       1.000      1.000          1.000      1.000
2      n=500         0.234      0.028          0.168      0.032
2      n=1000        0.262      0.040          0.160      0.038
2      n=2000        0.242      0.032          0.170      0.040
2      n=4000        0.276      0.048          0.210      0.046
2      n=10000       0.318      0.050          0.234      0.048
3      n=500         0.266      0.022          0.190      0.036
3      n=1000        0.256      0.032          0.202      0.034
3      n=2000        0.296      0.058          0.224      0.060
3      n=4000        0.350      0.066          0.264      0.066
3      n=10000       0.510      0.148          0.414      0.132
4      n=500         0.548      0.148          0.436      0.140
4      n=1000        0.776      0.342          0.682      0.332
4      n=2000        0.944      0.730          0.910      0.720
4      n=4000        1.000      0.972          0.998      0.968
4      n=10000       1.000      1.000          1.000      1.000
5      n=500         0.494      0.132          0.404      0.116
5      n=1000        0.706      0.288          0.596      0.276
5      n=2000        0.914      0.630          0.882      0.626
5      n=4000        0.998      0.946          0.996      0.950
5      n=10000       1.000      1.000          1.000      1.000
6      n=500         0.262      0.042          0.176      0.044
6      n=1000        0.318      0.066          0.244      0.068
6      n=2000        0.532      0.122          0.394      0.112
6      n=4000        0.750      0.298          0.642      0.268
6      n=10000       0.966      0.818          0.950      0.806

for most of the settings, but it performs badly on Models 2 and 3, where it lies below all the points, possibly due to the interaction terms in these two models. Figure 4.7 indicates that the Bonferroni correction is not powerful at detecting the departure from the fitted model, and its power is not comparable with the individual HL tests on any model. In short, the Bonferroni correction controls the Type I error well but is not powerful enough compared to the HL tests with different groups. When compared with the majority vote method, the minimum p with Bonferroni correction has higher power. Figure 4.5 displays the comparison of majority vote and minimum p with Bonferroni correction. The y axis represents the power across different sample sizes and the various models are categorized on the x axis. Overall, the minimum p with Bonferroni correction is more powerful than the majority vote method for all the models.

4.2.3 P Values Combined and Averaging Statistics Results

The power of the p values combined and averaging statistics methods is summarized in Table 4.10. We combined HL tests with 6 to 12 groups. For all simulations, we used N = 1000 bootstrap samples and r = 500 replicates. From Table 4.10, we observe that the powers of Tippett and Stouffer

Figure 4.5: Power comparison of majority vote and minimum p with Bonferroni Correction using 3 to 12, 18 groups with 500 replications

are very low overall, and there is no power at all in many situations. These two methods are not powerful in our context of combining multiple HL tests. Therefore, we suggest not using Tippett's method and Stouffer's method. We compare the other five methods listed in Table 4.10 with the HL tests for 6 to 12 groups, as shown in Figure 4.8. Figure 4.8 displays the power comparison of the HL tests with 6 to 12 groups and the five proposed grouping tests across the different sample sizes for Models 1 to 6. The x axis represents the sample size from 500 to 10000 and the y axis presents the power. The box plots summarize the power of the HL tests with 6 to 12 groups for a fixed sample size and model, and the five lines represent the different combined methods for each model. In general, all tests perform well on Models 1, 4, 5 and 6. For Model 1, the power of all tests is superior to the HL tests even at the small sample sizes (n = 500, n = 1000). On Models 4 and 5, the proposed grouping tests either have superior power or are very close to the optimum of the HL tests. Model 6 has relatively low power in small samples for the HL tests, possibly because of the low event rate and unbalanced data. Model 2 shows very erratic behavior and reaches its lowest point at n = 2000 for T1. The power of the HL tests is very low for Models 2 and 3 overall, possibly due to the interaction terms in the models, and the combined methods do not improve the performance either. The T, Fisher and Logit methods only improve the power when the sample sizes are large for Models 2 and 3. Generally, Fisher, Logit and T perform very well compared to

Figure 4.6: Power of minimum p with BC using 3 to 12 and 18 groups with 500 replications

the individual HL tests for all the combinations. T1 and T2 are comparable with the optimum of the HL tests. Therefore, except for Tippett and Stouffer, all the grouping tests are recommended. In conclusion, majority vote and minimum p with Bonferroni correction are very conservative and do not have enough power to detect the discrepancies. Tippett's and Stouffer's methods are not suitable for our setting and have essentially no power in all scenarios. The T, Fisher and

Logit methods perform very well, and T1 and T2 are also comparable with the individual HL tests.

Figure 4.7: Power of minimum p with BC using 6 to 12 groups with 500 replications

Therefore, we recommend using these five grouping tests to combine multiple HL tests. Based on the simulations, the grouping tests perform well on Models 4 and 5, which indicates that the tests are powerful at detecting the omission of a quadratic term. In contrast, the HL tests are not sensitive to a missing interaction term, and the combination methods do not improve this either. Another point worth noting is that for both the individual HL tests and the grouping tests, the powers are relatively low for small sample sizes on Model 6. This behavior is possibly

(Figure 4.8 contains six panels, Model 1 to Model 6, each plotting power on the y axis against sample size on the x axis for the Fisher, Logit, T, T1 and T2 methods.)

Figure 4.8: Power of p values combined and averaging statistics with 500 replications and 1000 bootstrap samples

Table 4.10: Power of p values combined and averaging statistics using bootstrap approach with 500 replications and 1000 bootstrap samples

Model  Sample Size   T      T1     T2     Tippett  Stouffer  Fisher  Logit
1      n=500         0.766  0.710  0.690  0.004    0.002     0.742   0.748
1      n=1000        0.982  0.974  0.964  0.000    0.000     0.978   0.978
1      n=2000        1.000  1.000  1.000  0.000    0.000     1.000   1.000
1      n=4000        1.000  1.000  1.000  0.000    0.000     1.000   1.000
1      n=10000       1.000  1.000  1.000  0.000    0.000     1.000   1.000
2      n=500         0.062  0.046  0.044  0.052    0.044     0.070   0.072
2      n=1000        0.074  0.078  0.076  0.046    0.046     0.050   0.048
2      n=2000        0.062  0.048  0.052  0.052    0.058     0.062   0.060
2      n=4000        0.062  0.050  0.046  0.040    0.050     0.066   0.068
2      n=10000       0.090  0.074  0.076  0.042    0.032     0.106   0.108
3      n=500         0.066  0.064  0.064  0.034    0.040     0.046   0.046
3      n=1000        0.062  0.068  0.070  0.030    0.042     0.068   0.070
3      n=2000        0.090  0.084  0.080  0.032    0.032     0.086   0.088
3      n=4000        0.100  0.090  0.084  0.036    0.038     0.114   0.114
3      n=10000       0.246  0.212  0.206  0.018    0.008     0.220   0.222
4      n=500         0.300  0.256  0.244  0.020    0.014     0.278   0.288
4      n=1000        0.510  0.456  0.424  0.004    0.006     0.506   0.510
4      n=2000        0.812  0.768  0.744  0.000    0.000     0.826   0.836
4      n=4000        0.992  0.986  0.982  0.000    0.000     0.990   0.990
4      n=10000       1.000  1.000  1.000  0.000    0.000     1.000   1.000
5      n=500         0.254  0.216  0.200  0.018    0.012     0.250   0.256
5      n=1000        0.428  0.394  0.390  0.004    0.006     0.506   0.510
5      n=2000        0.778  0.738  0.710  0.002    0.000     0.760   0.766
5      n=4000        0.984  0.974  0.966  0.000    0.000     0.986   0.986
5      n=10000       1.000  1.000  1.000  0.000    0.000     1.000   1.000
6      n=500         0.056  0.062  0.060  0.034    0.042     0.066   0.064
6      n=1000        0.098  0.092  0.086  0.028    0.028     0.112   0.112
6      n=2000        0.210  0.178  0.172  0.008    0.006     0.228   0.226
6      n=4000        0.436  0.374  0.344  0.004    0.004     0.434   0.442
6      n=10000       0.892  0.864  0.848  0.000    0.000     0.902   0.904

caused by the unbalanced data with low event rate generated by Model 6.

4.3 Conclusion

In this chapter, we studied the properties of the grouping tests based on simulations. The performance on controlling the Type I error and achieving high power for the different tests is summarized in Table 4.11. We observe from Table 4.11 that, except for the minimum p method, all the methods control the Type I error well, and with the Bonferroni correction the Type I error is controlled for the minimum p method as well. The simulations also show that minimum p, Fisher, Logit and the averaging statistics are powerful at detecting departures from the fitted model, although the minimum p method performs poorly at controlling the Type I error. In summary, we recommend using Fisher, Logit and the averaging statistics because they control the Type I error well and have high power compared to the individual HL tests.

Table 4.11: Summary of the performance of grouping tests

Test            Control Type I error   High power
Majority Vote   Yes                    No
Minimum P       No                     Yes
Minimum P BC    Yes                    No
Tippett         Yes                    No
Stouffer        Yes                    No
Fisher          Yes                    Yes
Logit           Yes                    Yes
T               Yes                    Yes
T1              Yes                    Yes
T2              Yes                    Yes

CHAPTER 5

INTERACTION TEST

As discussed in Chapter 2, the Hosmer-Lemeshow (HL) test is not powerful at detecting the violation of missing interactions between continuous and dichotomous predictors. The grouping tests, which combine the information from multiple HL tests, do not solve the problem either: since the individual HL test cannot capture the departure caused by the omission of interactions, the combination of these HL tests cannot do the job as well. Therefore, in this chapter, we discuss two interaction tests which combine the likelihood ratio (LR) test under the logistic regression setting with the grouping tests in order to capture this specific interaction violation. The idea is motivated by Peña and Slate [37], who proposed a test for linear regression that detects different kinds of violations by summing up four individual statistics. Thus, in the context of logistic regression, we expect that by combining the LR test and the grouping tests, the interaction test will be able to detect the violation of missing interactions. We consider two possible LR tests. The first one is the global LR test. The general idea behind it is to compute the likelihood ratio between two models, one of which includes all the possible two-way interactions between continuous and binary covariates while the other has no interactions. However, one potential issue for the global LR test is that it creates a large number of interaction terms when the covariates are high dimensional. At the same time, if the true model contains only a few interactions, this method introduces too much noise and may result in a loss of power. Thus, we also consider the local LR test and expect it to be more powerful when the true model contains only a few interactions. Instead of using all possible interactions, the local LR test conducts multiple LR tests, each containing interactions for only one continuous variable, and then applies the combination methods to combine the test results. We expect that in this way, the local LR test will capture some main missing terms and be more sensitive to this specific interaction violation.

We then combine the global and local LR statistics with the grouping tests, which results in the global and local interaction tests respectively. In the following sections, we present these two interaction tests in detail.

5.1 Global Interaction Test

The interaction test is defined by adding the LR component to the grouping tests. Specifically, let W1 denote the LR test and W2 the grouping test. The interaction test W is constructed as the sum of the two standardized statistics,

log odds combination:  W = (W1 − µ1)/σ1 + (W2 − µ2)/σ2,    (5.1)

where µj and σj are the mean and standard deviation of Wj. We expect the interaction test to capture more information and to be able to detect different types of violations. Let us take a closer look at the LR test under the logistic regression setting. Suppose Y is the response taking values 0 and 1. There are p continuous covariates,

X = (X1, ..., Xp), and a dichotomous variable Z with values 1 and 0. The reduced model, which contains only the main effects of the continuous and binary covariates, is specified as

log[π/(1 − π)] = (β0 + β1X1 + ... + βpXp) + θZ,    (5.2)

where π = P(Y = 1|X) is the probability of the event, and β = (β0, β1, β2, ..., βp)′ and θ are the coefficients of the main effects. For the global LR test, the complex model with all possible two-way interactions between continuous and binary covariates is

log[π/(1 − π)] = (β0 + β1X1 + ... + βpXp) + θZ + [δ1(X1 × Z) + ... + δp(Xp × Z)],    (5.3)

where δ = (δ1, δ2, ..., δp)′ are the coefficients of the interaction terms. The LR test compares how well the complex model fits the data relative to the simple model through the log-likelihood, which for logistic regression is

l = Σ_{i=1}^{n} [ Yi log(πi) + (1 − Yi) log(1 − πi) ].    (5.4)
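As a concrete illustration, the log-likelihood in Equation (5.4) can be evaluated directly from the responses and the fitted probabilities. The following is a minimal Python sketch (the function name is illustrative, not from the dissertation):

```python
import math

def logistic_log_likelihood(y, pi):
    """Equation (5.4): sum over i of Yi*log(pi_i) + (1 - Yi)*log(1 - pi_i)."""
    return sum(yi * math.log(pi_i) + (1.0 - yi) * math.log(1.0 - pi_i)
               for yi, pi_i in zip(y, pi))
```

For example, responses (1, 0) with fitted probabilities (0.8, 0.3) give log 0.8 + log 0.7.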

Therefore, the statistic for the global LR test is W1 = 2(l1 − l0), where l0 is the log-likelihood in Equation (5.4) evaluated for the simple model in Equation (5.2) and l1 is that for the complex model

in Equation (5.3). By Wilks' theorem [55], W1 approximately follows a chi-squared distribution with q degrees of freedom, where q is the number of predictors added to the complex model. Therefore the mean of W1 is q and the standard deviation is √(2q).
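The standardized global LR component of Equation (5.1) then follows immediately from these chi-square moments; a small sketch (the function name is illustrative):

```python
import math

def standardized_lr(l0, l1, q):
    """W1 = 2*(l1 - l0), standardized by its approximate chi-square moments:
    mean q and standard deviation sqrt(2q) (Wilks' theorem)."""
    w1 = 2.0 * (l1 - l0)
    return (w1 - q) / math.sqrt(2.0 * q)
```

For instance, l0 = −520, l1 = −515 and q = 2 give W1 = 10 and a standardized value of (10 − 2)/2 = 4.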

For the grouping test part, referred to as W2 in Equation (5.1), we studied different approaches to combining multiple HL tests in previous Chapters. Simulation studies show that the three average statistics and the Fisher and Logit methods perform well, so we apply these grouping tests in the global interaction tests. Recall that we perform the HL test J times with the numbers of groups chosen from the set {m1, m2, ..., mJ}. There are thus J HL statistics and p-values, denoted {c1, c2, ..., cJ} and

{p1, p2, ..., pJ}. Based on Equation (5.1), we need the mean and standard deviation of W2 for each grouping test. To this end, we compute the expectation of all five grouping tests. However, their standard deviations are very hard to obtain because of the dependence among the multiple HL tests, so we estimate them with the bootstrap algorithm. The means of the grouping tests are summarized in Table 5.1, and the detailed calculations are presented in Appendix A.
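The bootstrap step for the standard deviation can be sketched as follows; `statistic` stands in for any of the grouping statistics, and all names are illustrative rather than the dissertation's own code:

```python
import random
import statistics

def bootstrap_sd(data, statistic, n_boot=1000, seed=0):
    """Estimate the standard deviation of `statistic` by recomputing it on
    n_boot resamples of the data drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    values = [statistic([data[rng.randrange(n)] for _ in range(n)])
              for _ in range(n_boot)]
    return statistics.stdev(values)
```

For the sample mean of 100 equally spaced values, the estimate lands near the theoretical standard error of about 2.9.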

Table 5.1: Summary of five grouping tests and their expectations

W2                                                                          µ2
T  = (1/J) Σ_{i=1}^{J} c_i                                                  (1/J) Σ_{i=1}^{J} (m_i − 2)
T1 = (1/J) Σ_{i=1}^{J} c_i / [p_i (m_i − 2)]                                (1/J) Σ_{i=1}^{J} 1/[1 − F(m_i − 2)]
T2 = (1/J) Σ_{i=1}^{J} c_i / [p_i^2 (m_i − 2)]                              (1/J) Σ_{i=1}^{J} 1/[1 − F(m_i − 2)]^2
Fisher: X^2 = −2 Σ_{i=1}^{J} log(p_i)                                       2J
Logit:  G = −[Σ_{i=1}^{J} log(p_i/(1 − p_i))] [Jπ^2(5J + 2)/(3(5J + 4))]^{−1/2}    0

where F(s) is the CDF of the chi-square distribution with s degrees of freedom. Since there are five grouping tests, we can construct five global interaction test statistics by adding the global LR component to each grouping test according to Equation (5.1). We denote them WT, WT1, WT2, WFisher and WLogit, respectively. Their distributions are not easy to derive, so we use the bootstrap algorithm to approximate the distributions and obtain the p-values.
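The five combinations in Table 5.1 are straightforward to compute from the J HL statistics, their p-values and group counts. The sketch below follows my reading of the table and uses illustrative names:

```python
import math

def grouping_statistics(c, p, m):
    """Five combinations of J HL tests as in Table 5.1: c are the HL
    statistics, p their p-values, m the numbers of groups (df = m - 2)."""
    J = len(c)
    df = [mi - 2.0 for mi in m]
    T = sum(c) / J
    T1 = sum(ci / (pi * di) for ci, pi, di in zip(c, p, df)) / J
    T2 = sum(ci / (pi**2 * di) for ci, pi, di in zip(c, p, df)) / J
    fisher = -2.0 * sum(math.log(pi) for pi in p)
    scale = (J * math.pi**2 * (5 * J + 2) / (3.0 * (5 * J + 4))) ** -0.5
    logit = -sum(math.log(pi / (1.0 - pi)) for pi in p) * scale
    return {"T": T, "T1": T1, "T2": T2, "Fisher": fisher, "Logit": logit}
```

For a single HL test with c = 4, p = 0.5 and m = 6 groups this gives T = 4, T1 = 2, T2 = 4, Fisher = −2 log 0.5 and Logit = 0.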

In the following section, we discuss the local interaction test, which conducts multiple LR tests and combines the results using different combination approaches.

5.2 Local Interaction Test

In the global interaction test, we apply the global LR test, whose complex model includes all two-way continuous-binary interactions, and we expect it to capture the violation of omitted interactions in logistic regression. However, if the true model contains only a few interactions, the global LR test may lose power because of the noise created in the complex model. To address this, we further propose the local interaction test, which uses the local LR test instead. Here each LR test includes the interactions for only a single continuous covariate, so we expect the local interaction test to capture the main interactions and be more sensitive to departures when the true model has few interactions. Recall that there are p continuous variables and a dichotomous variable Z. Each LR test selects one continuous predictor as the interaction candidate, so there are p LR tests in total. The simple model for each test is the same as in the global LR test, Equation (5.2), and the sequence of complex models is

log[π/(1 − π)] = (β0 + β1X1 + ... + βpXp) + θZ + δj(Xj × Z),    (5.5)

where j = 1, 2, ..., p. We thus obtain p test statistics, denoted {L1, L2, ..., Lp}, with corresponding p-values {pL1, pL2, ..., pLp}. Since we have multiple LR tests, we apply different combination methods to the p test results. As in the global interaction tests, we use five combination methods to combine the local LR tests and compute their expectations; they are summarized in Table 5.2. Combining them with the corresponding grouping tests via Equation (5.1) yields five local interaction tests, denoted LT, LT1, LT2, LFisher and LLogit. However, the highly dependent structure makes the distributions of the local interaction statistics hard to derive explicitly, so we apply the bootstrap algorithm to account for the dependence and approximate their distributions.
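For reference, under independence the Fisher combination of the p local p-values has a closed-form chi-square tail: its degrees of freedom 2p are even, so the survival function reduces to a finite sum. The dissertation instead bootstraps the reference distribution because the local LR tests are dependent. A stdlib sketch (names illustrative):

```python
import math

def fisher_combine(pvalues):
    """Fisher's method: X2 = -2 * sum(log p_i), referred to a chi-square
    with 2p df. For even df = 2p the survival function is
    exp(-x/2) * sum_{i<p} (x/2)^i / i!  (exact under independence)."""
    x = -2.0 * sum(math.log(p) for p in pvalues)
    k = len(pvalues)                       # chi-square df is 2k
    half = x / 2.0
    sf = math.exp(-half) * sum(half**i / math.factorial(i) for i in range(k))
    return x, sf
```

A single p-value of 0.5 is returned unchanged, which is a quick sanity check on the closed form.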


Table 5.2: Summary of five local LR tests and their expectations

Local LR                                                                    Expectation
T  = (1/p) Σ_{i=1}^{p} L_i                                                  1
T1 = (1/p) Σ_{i=1}^{p} L_i / p_{Li}                                         (1/p) Σ_{i=1}^{p} 1/[1 − F(1)]
T2 = (1/p) Σ_{i=1}^{p} L_i / p_{Li}^2                                       (1/p) Σ_{i=1}^{p} 1/[1 − F(1)]^2
Fisher: −2 Σ_{i=1}^{p} log(p_{Li})                                          2p
Logit:  −[Σ_{i=1}^{p} log(p_{Li}/(1 − p_{Li}))] [pπ^2(5p + 2)/(3(5p + 4))]^{−1/2}    0

5.3 Generalization of the Binary Covariate

In previous sections we considered a binary Z, but in practice we may have a categorical variable with multiple levels. In this section we therefore discuss the generalization of the interaction tests to a model with a categorical variable. Suppose Z has k levels. When fitting the logistic regression, we create k − 1 dummy variables z1, z2, ..., z_{k−1}, each taking the value 1 if the observation falls in the corresponding category and 0 otherwise.
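The dummy coding, and the interaction columns built from it, can be sketched in pure Python (names illustrative):

```python
def dummy_columns(z, k):
    """k-1 indicator columns for a categorical z with levels 1..k
    (level k taken as the reference)."""
    return [[1.0 if zi == level else 0.0 for level in range(1, k)] for zi in z]

def interaction_columns(X, D):
    """All p*(k-1) products between the continuous columns of X (rows of
    length p) and the dummy columns D (rows of length k-1)."""
    return [[xj * dl for xj in xi for dl in di] for xi, di in zip(X, D)]
```

With p = 2 continuous covariates and k = 3 levels, each row gains p(k − 1) = 4 interaction columns.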

5.3.1 Global Interaction Test

When there is a categorical covariate with k levels in the model, the simple model for the global LR test includes all the main effects of the continuous and the k − 1 dichotomous variables. It can be specified as:

log[π/(1 − π)] = (β0 + β1X1 + ... + βpXp) + (θ1z1 + θ2z2 + ... + θ_{k−1}z_{k−1}).    (5.6)

The complex model includes all the possible two-way continuous-binary interactions and can be expressed as:

log[π/(1 − π)] = β0 + Σ_{j=1}^{p} βjXj + Σ_{i=1}^{k−1} θizi + Σ_{j=1}^{p} Σ_{i=1}^{k−1} δij Xj zi.    (5.7)

The complex model in Equation (5.7) has p(k − 1) more terms than the simple model in Equation (5.6), so the LR test statistic has p(k − 1) degrees of freedom and standard deviation √(2p(k − 1)). We can then combine the LR test with the grouping tests using Equation (5.1) from the previous section.

5.3.2 Local Interaction Test

For the local interaction test, the simple model for each test is the same as for the global LR test in Equation (5.6), containing only the main effects of all covariates. The complex model for each test contains k − 1 additional terms (Xj × z1, Xj × z2, ..., Xj × z_{k−1}) compared to the null model. Specifically, the sequence of alternative models is:

log[π/(1 − π)] = β0 + Σ_{j=1}^{p} βjXj + Σ_{i=1}^{k−1} θizi + Σ_{i=1}^{k−1} δij Xj zi,    (5.8)

where j = 1, 2, ..., p. Thus each statistic Lj follows a chi-squared distribution with k − 1 degrees of freedom. We can then apply the different combination methods to the p LR tests, which yields a set of local interaction tests. The local LR tests and their expectations are summarized in Table 5.3.

Table 5.3: Summary of five local LR tests and their expectations for a categorical variable

Local LR                                                                    Expectation
T  = (1/p) Σ_{i=1}^{p} L_i                                                  k − 1
T1 = (1/p) Σ_{i=1}^{p} L_i / [p_{Li} (k − 1)]                               (1/p) Σ_{i=1}^{p} 1/[1 − F(k − 1)]
T2 = (1/p) Σ_{i=1}^{p} L_i / [p_{Li}^2 (k − 1)]                             (1/p) Σ_{i=1}^{p} 1/[1 − F(k − 1)]^2
Fisher: −2 Σ_{i=1}^{p} log(p_{Li})                                          2p
Logit:  −[Σ_{i=1}^{p} log(p_{Li}/(1 − p_{Li}))] [pπ^2(5p + 2)/(3(5p + 4))]^{−1/2}    0

where F(s) is the CDF of the chi-square distribution with s degrees of freedom.

In this Chapter, we presented the global and local interaction tests for detecting omitted interactions in a fitted logistic regression model. The global interaction test adds the global LR component to the grouping tests and yields five statistics: WT, WT1, WT2, WFisher and WLogit. We expect this method to be more powerful when the true model has a large number of interactions, because the global LR component includes every continuous predictor as an interaction candidate. The local interaction test instead conducts p LR tests, each selecting a single continuous predictor for the interactions, and then combines the test results. It also yields five statistics, denoted LT, LT1, LT2, LFisher and LLogit. Since each of these tests contains interactions based on only one continuous predictor, we expect this method to be more sensitive when the true model has only a few interactions. In the next Chapter, we examine the performance of the interaction tests through simulations under various scenarios, assessing whether the proposed tests have reasonable Type I errors and high power at the same time.

CHAPTER 6

SIMULATION STUDIES OF INTERACTION TEST

We discussed two interaction tests in the last Chapter for detecting omitted interactions in fitted logistic regressions. In this Chapter, we examine the performance of these tests through simulations. The study has two parts: one examines whether the proposed methods control the Type I error when the null hypothesis is true, and the other assesses the power of the tests under the alternative hypothesis.

6.1 Type I Error

In this section, we are concerned with the Type I error, the rejection rate of the statistics when the fitted model is the true model. The true model used to generate the data is as follows:

logit(π) = X1 + X2 + X3 + Z,    (6.1)

where X1, X2 and X3 are independent N(0,1) and Z ∼ Bernoulli(0.5). All simulations used 1000 bootstrap samples to approximate the distributions and 500 replicates. We carried out our methods with different sample sizes, set α = 0.05, and expect the Type I errors to be around 0.05. Table 6.1 presents the Type I errors, that is, the rejection rates out of 500 replications, for the interaction tests with sample sizes from 500 to 10000. The top part contains the results for the global interaction test and the bottom part those for the local interaction test; the highlighted numbers are significantly greater than 0.05. We observe that the Type I errors are generally under control for the different sample sizes, with only a few exceptions. Figure 6.1 plots the Type I errors for the interaction tests under all sample sizes: the y axis is the Type I error and the x axis represents the different test statistics. The orange points in the left panel are the Type I errors for the global interaction tests and the blue points are for the local interaction tests. The black horizontal line is where the Type I error

equals 0.05. The figure shows that most Type I errors are close to 0.05. Generally, LT2 has higher Type I errors than the other tests.
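The data-generating step under the null model (6.1) can be sketched as follows (illustrative names; the dissertation's own simulation code is not shown):

```python
import math
import random

def simulate_null(n, seed=0):
    """One dataset from Equation (6.1): X1..X3 i.i.d. N(0,1),
    Z ~ Bernoulli(0.5), and logit(pi) = X1 + X2 + X3 + Z."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(3)]
        z = 1 if rng.random() < 0.5 else 0
        pi = 1.0 / (1.0 + math.exp(-(sum(x) + z)))
        y = 1 if rng.random() < pi else 0
        data.append((x, z, y))
    return data
```

Each replicate draws one such dataset, fits model (6.2) below, and records whether the test rejects.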

Table 6.1: Type I error of interaction tests with 500 replications and 1000 bootstrap samples

Global Interaction Test

Sample Size   WT      WT1     WT2     WFisher   WLogit
n=500         0.056   0.062   0.062   0.052     0.056
n=1000        0.060   0.042   0.046   0.066     0.056
n=2000        0.040   0.042   0.048   0.036     0.040
n=4000        0.050   0.034   0.030   0.054     0.048
n=10000       0.050   0.052   0.056   0.046     0.052

Local Interaction Test

Sample Size   LT      LT1     LT2     LFisher   LLogit
n=500         0.050   0.046   0.064   0.052     0.050
n=1000        0.066   0.048   0.058   0.070     0.078
n=2000        0.036   0.038   0.066   0.030     0.034
n=4000        0.048   0.050   0.058   0.050     0.046
n=10000       0.046   0.042   0.072   0.050     0.044

We conclude that, in general, the Type I errors of the proposed tests are under control. In the next section, we assess whether the interaction tests are powerful enough to detect missing interactions in different scenarios.

6.2 Power

In this section, we run a series of simulations in which the fitted model is not the correct model and examine the power of the interaction tests in two cases. In case 1, the true models have only a few continuous variables and one dichotomous predictor; in case 2, there are a large number of continuous variables and one categorical predictor with multiple levels. We use various models for each case and examine whether the interaction tests have high power to detect the departure and whether they are robust in different situations.

6.2.1 Case 1: Low Dimensional and One Dichotomous Covariate

We used four different models to simulate the data and they are summarized as follows:

Figure 6.1: Type I error of interaction tests with 500 replications and 1000 bootstrap samples

Model 1: logit(π) = X1 + X2 + X3 + Z + 0.5(Z × X1)

Model 2: logit(π) = X1 + X2 + X3 + Z + 0.5(Z × X1) + 0.2(Z × X2)

Model 3: logit(π) = X1 + X2 + X3 + Z + 0.2(Z × X1) + 0.2(Z × X2) + 0.2(Z × X3)

Model 4: logit(π) = −1.5 + X1 + X2 + X3 + X3^2 + Z,

where X1, X2 and X3 are independent N(0,1) and Z ∼ Bernoulli(0.5). All simulated data were used to fit the following logistic regression:

logit(π) = X1 + X2 + X3 + Z. (6.2)

The first three models use different numbers of interactions. Model 1 contains only one interaction term, (Z × X1), with coefficient 0.5. Model 2 has two interactions, and Model 3 contains all three possible interactions with mild coefficients. Model 4 has a quadratic term, included to examine whether the interaction tests can also detect violations other than missing interactions. Both global and local interaction tests were applied in the simulation. The competing statistics include the grouping tests T, T1, T2, Fisher and Logit, and the standard Hosmer-Lemeshow (HL) test with 10 groups. For the grouping tests, we used 6 to 12 groups for the individual HL

test and then combined their results. In all simulations, 1000 bootstrap samples were used to approximate the distributions of the corresponding statistics, and 500 replicates were simulated to compute the power. Sample sizes from 500 to 10000 were used to generate the data. For Model 1 and Model 2, when the sample size reaches 10000 the global and local interaction tests perform very similarly and both have high power, so the plots that summarize these results only show sample sizes up to 4000.
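Given the replicate p-values of any test, the reported power is simply the rejection rate at α = 0.05; a one-line sketch:

```python
def empirical_power(pvalues, alpha=0.05):
    """Rejection rate over simulation replicates: the fraction of
    p-values that fall below alpha."""
    return sum(1 for p in pvalues if p < alpha) / len(pvalues)
```

Under the null this quantity estimates the Type I error; under an alternative it estimates the power.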

Table 6.2: Power of different tests for Model 1, case 1 with 500 replications and 1000 bootstrap samples

Sample Size     n=500   n=1000  n=2000  n=4000  n=10000
Global Interaction Test
WT              0.232   0.484   0.824   0.986   1.000
WT1             0.282   0.570   0.884   0.998   1.000
WT2             0.296   0.592   0.890   0.998   1.000
WFisher         0.232   0.458   0.820   0.986   1.000
WLogit          0.244   0.510   0.820   0.990   1.000
Local Interaction Test
LT              0.250   0.494   0.840   0.992   1.000
LT1             0.244   0.502   0.876   0.992   1.000
LT2             0.300   0.608   0.908   0.996   1.000
LFisher         0.226   0.432   0.808   0.986   1.000
LLogit          0.192   0.386   0.678   0.948   1.000
Grouping Test
T               0.032   0.064   0.050   0.060   0.082
T1              0.034   0.056   0.050   0.046   0.086
T2              0.036   0.054   0.046   0.046   0.078
Fisher          0.032   0.068   0.048   0.058   0.080
Logit           0.034   0.066   0.048   0.058   0.084
HL
HL10Groups      0.032   0.052   0.066   0.066   0.078

Table 6.2 shows the power of the different tests for Model 1; the traditional HL test with 10 groups is denoted HL10Groups. The powers of the HL test with 10 groups are very low, remaining around the significance level even for large sample sizes. The grouping tests do not improve the power and are even lower in some situations. The two interaction tests, however, have higher power, and their power increases with the sample size. Figure 6.2 depicts the power of the various tests for Model 1: the x axis represents the tests and the y axis the power, and each point is the power for a specific test and sample size. The powers of the interaction tests are much higher than those of the grouping tests and the HL test with 10 groups, and they rise as the sample size increases. WT2 always has the best power across sample sizes among the global interaction tests, and LT2 has the highest

Figure 6.2: Power of different tests for Model 1, case 1 with 500 replications and 1000 bootstrap samples

power among all tests for small sample sizes. For large sample sizes, all the tests perform uniformly well. LLogit has relatively lower power at small sample sizes compared to the other interaction tests. Table 6.3 shows the power of the different tests for Model 2, and Figure 6.3 is the corresponding power plot. As in the previous model, the powers of the grouping tests and the HL test with

10 groups are still very low, although they are higher than for Model 1 at large sample sizes. WT2 still has the highest power among the global interaction tests and the power of LT2 is superior; all other interaction tests also have comparable power. Table 6.4 and Figure 6.4 show the simulation results for Model 3. The true model in this setting contains three interactions with small coefficients. With more interaction terms but smaller coefficients, the power is lower at small sample sizes: the powers are below 0.1 for most tests at n = 500, whereas in the previous two models they were around 0.2 at the same sample size. For Model 3, the powers rise with the sample size, and all the tests have ample power when the sample size is large. Overall, the global interaction tests are more powerful than the local interaction tests in this case. This meets our expectation, because Model 3 has three interactions, which is a large number given that there are only three continuous variables in

Table 6.3: Power of different tests for Model 2, case 1 with 500 replications and 1000 bootstrap samples

Sample Size     n=500   n=1000  n=2000  n=4000  n=10000
Global Interaction Test
WT              0.232   0.467   0.790   0.980   1.000
WT1             0.274   0.540   0.880   0.988   1.000
WT2             0.278   0.560   0.886   0.990   1.000
WFisher         0.222   0.460   0.796   0.980   1.000
WLogit          0.240   0.490   0.802   0.978   1.000
Local Interaction Test
LT              0.220   0.430   0.768   0.970   1.000
LT1             0.196   0.414   0.784   0.968   1.000
LT2             0.268   0.510   0.844   0.986   1.000
LFisher         0.198   0.382   0.722   0.964   1.000
LLogit          0.176   0.314   0.612   0.902   1.000
Grouping Test
T               0.034   0.054   0.054   0.054   0.102
T1              0.032   0.056   0.054   0.062   0.090
T2              0.034   0.060   0.054   0.056   0.092
Fisher          0.034   0.050   0.052   0.056   0.102
Logit           0.034   0.052   0.056   0.056   0.104
HL
HL10Groups      0.056   0.064   0.062   0.084   0.090

Figure 6.3: Power of different tests for Model 2, case 1 with 500 replications and 1000 bootstrap samples

the model. With this many interactions, we expect the global interaction tests to be powerful in detecting the departure, whereas the local interaction tests, with only one interaction per LR test, leave out some information. Figure 6.4 clearly shows that the powers of the local interaction tests are generally lower than those of the global interaction tests, but both are more powerful than the grouping tests and the HL test with 10 groups.

Table 6.4: Power of different tests for Model 3, case 1 with 500 replications and 1000 bootstrap samples

Sample Size     n=500   n=1000  n=2000  n=4000  n=10000
Global Interaction Test
WT              0.094   0.148   0.266   0.538   0.924
WT1             0.086   0.160   0.346   0.620   0.960
WT2             0.090   0.162   0.366   0.630   0.956
WFisher         0.082   0.140   0.258   0.542   0.924
WLogit          0.086   0.156   0.290   0.552   0.932
Local Interaction Test
LT              0.054   0.092   0.132   0.244   0.692
LT1             0.048   0.092   0.136   0.214   0.500
LT2             0.076   0.130   0.202   0.318   0.580
LFisher         0.046   0.086   0.118   0.216   0.660
LLogit          0.060   0.088   0.106   0.210   0.585
Grouping Test
T               0.042   0.054   0.036   0.080   0.074
T1              0.036   0.052   0.046   0.064   0.072
T2              0.032   0.048   0.044   0.066   0.072
Fisher          0.046   0.054   0.040   0.088   0.074
Logit           0.040   0.054   0.044   0.086   0.070
HL
HL10Groups      0.054   0.048   0.052   0.052   0.086

Table 6.5 and Figure 6.5 show the simulation results for Model 4. The true model for Model

4 does not include interaction terms but has a quadratic term in X3, in order to examine whether the interaction tests are powerful against other types of violation besides missing interactions. The simulation results show that the grouping tests generally have superior power across all sample sizes. Since the HL test is powerful in detecting a violated quadratic term, as shown in the simulations in Chapter 4, the grouping tests, which combine HL tests with different numbers of groups, do the job as well. The commonly used HL test with 10 groups also has ample power, but not as much as the grouping tests; combining the HL tests therefore has advantages. WT2 has slightly lower power than the other tests at small sample sizes. Overall, the interaction tests have power comparable to the grouping tests, which means they can also detect the violation of the quadratic term. When the true model has no interaction terms, the interaction tests lose some power. However, when the true model contains interactions, as in Model 1, Model 2 and

Figure 6.4: Power of different tests for Model 3, case 1 with 500 replications and 1000 bootstrap samples

Model 3, the power improves greatly compared to the grouping tests. Thus, the interaction tests are very robust. In the first three models, the two interaction tests are powerful in detecting the departure caused by missing interaction terms. Overall, WT2 and LT2 are superior to the other tests, although the powers of the others are very close to these two. With a small number of interactions, some local interaction tests perform better than the global interaction tests, and with a large number of interactions the global tests have the advantage. Model 4 verifies that the interaction tests can detect different types of violations. This case used a small number of continuous variables and a binary predictor; in the next section we examine the performance of these tests with a large number of continuous variables and a categorical variable with multiple levels.

6.2.2 Case 2: Moderately High Dimensional and One Categorical Covariate

We generated 10 independent standard normal random variables X1, X2, ..., X10 and one categorical predictor Z with 5 levels, 1 − 5, for different sample sizes. So there are 4 dummy variables:

Table 6.5: Power of different tests for Model 4, case 1 with 500 replications and 1000 bootstrap samples

Sample Size     n=500   n=1000  n=2000  n=4000  n=10000
Global Interaction Test
WT              0.206   0.426   0.736   0.976   1.000
WT1             0.132   0.296   0.568   0.946   1.000
WT2             0.102   0.222   0.460   0.888   1.000
WFisher         0.218   0.426   0.766   0.982   1.000
WLogit          0.188   0.402   0.692   0.972   1.000
Local Interaction Test
LT              0.202   0.428   0.736   0.978   1.000
LT1             0.178   0.396   0.676   0.968   1.000
LT2             0.186   0.396   0.654   0.964   1.000
LFisher         0.226   0.438   0.762   0.980   1.000
LLogit          0.216   0.407   0.722   0.976   1.000
Grouping Test
T               0.252   0.482   0.762   0.986   1.000
T1              0.232   0.446   0.732   0.980   1.000
T2              0.222   0.426   0.720   0.978   1.000
Fisher          0.244   0.482   0.766   0.986   1.000
Logit           0.248   0.478   0.770   0.986   1.000
HL
HL10Groups      0.176   0.338   0.704   0.952   1.000

Figure 6.5: Power of different tests for Model 4, case 1 with 500 replications and 1000 bootstrap samples

z1, z2, z3, z4, each taking the value 1 if the observation is in the corresponding category and 0 otherwise. The full two-way interaction model is then specified as:

log[π/(1 − π)] = β0 + Σ_{j=1}^{10} βjXj + Σ_{j=1}^{4} θjzj + Σ_{i=1}^{10} Σ_{j=1}^{4} δij Xi zj,    (6.3)

where β = (β0, β1, ..., β10) are the coefficients for the intercept and the 10 continuous predictors,

θ = (θ1, θ2, θ3, θ4) are the coefficients of the 4 dummy variables, and δi = (δi1, δi2, δi3, δi4) are the coefficients of the interaction between Xi and Z. There are 55 parameters in total. We generated data from three models with different numbers of interaction terms between the continuous variables and the categorical predictor. For all three models, we chose β = (0, 1, 1, ..., 1) and θ = (0.4, 0.4, 0.4, 0.4). The true models for the three scenarios had 2 interactions (X1 × Z, X2 × Z), 5 interactions (X1 × Z, ..., X5 × Z) and 8 interactions (X1 × Z,

..., X8 × Z). Therefore, the interaction coefficients for the different models are specified as:

Model 1 (2 interactions): δij = 0.4 for i = 1, 2; j = 1, 2, 3, 4, and 0 otherwise.

Model 2 (5 interactions): δij = 0.4 for i = 1, 2, 3, 4, 5; j = 1, 2, 3, 4, and 0 otherwise.

Model 3 (8 interactions): δij = 0.4 for i = 1, 2, ..., 8; j = 1, 2, 3, 4, and 0 otherwise.

We used different sample sizes, and the power of the various tests was computed over 500 replicates with 1000 bootstrap samples. The results are summarized in the following tables and plots. Table 6.6 and Figure 6.6 present the power comparison of the various tests for Model 1, where the true model has only two interactions. With a large number of continuous covariates and one categorical predictor, the power of the interaction tests is relatively low at sample size 500; at this small sample size, only some of the local interaction tests, such as LT2 and LFisher, have higher power than the traditional HL test with 10 groups. As the sample size increases, the interaction tests perform increasingly better than the grouping tests, and at n = 10000 they all have very high power, around 0.9. Generally, the local interaction tests have higher power than the global interaction tests, as Figure 6.6 shows: most of the green dots lie above the orange ones. This is exactly what we expected: when there are few interactions in the true model (here, 2 interactions), the local interaction tests are more powerful than the global ones. Model 2 includes five continuous covariates in interactions, and the results are displayed in Table 6.7 and Figure 6.7. From Table 6.7 we observe that the grouping tests and the HL test with 10 groups

Table 6.6: Power of different tests for Model 1, case 2 with 500 replications and 1000 bootstrap samples

Sample Size     n=500   n=1000  n=2000  n=4000  n=10000
Global Interaction Test
WT              0.040   0.076   0.160   0.370   0.866
WT1             0.040   0.082   0.220   0.498   0.960
WT2             0.040   0.082   0.224   0.510   0.960
WFisher         0.036   0.070   0.150   0.352   0.866
WLogit          0.040   0.082   0.178   0.394   0.874
Local Interaction Test
LT              0.050   0.064   0.178   0.368   0.884
LT1             0.053   0.072   0.222   0.500   0.956
LT2             0.125   0.120   0.287   0.568   0.972
LFisher         0.071   0.064   0.188   0.416   0.920
LLogit          0.057   0.068   0.164   0.320   0.790
Grouping Test
T               0.042   0.038   0.070   0.066   0.064
T1              0.032   0.038   0.068   0.066   0.058
T2              0.032   0.038   0.068   0.060   0.062
Fisher          0.038   0.040   0.070   0.064   0.066
Logit           0.040   0.040   0.074   0.068   0.062
HL
HL10Groups      0.066   0.040   0.076   0.060   0.046

Figure 6.6: Power of different tests for Model 1, case 2 with 500 replications and 1000 bootstrap samples

61 Table 6.7: Power of different tests for Model 2, case 2 with 500 replications and 1000 bootstrap samples

Sample Size     n=500   n=1000  n=2000  n=4000  n=10000
Global Interaction Test
WT              0.062   0.134   0.284   0.612   0.992
WT1             0.056   0.168   0.416   0.806   0.998
WT2             0.056   0.166   0.424   0.810   0.998
WFisher         0.064   0.136   0.280   0.614   0.994
WLogit          0.064   0.134   0.300   0.646   0.992
Local Interaction Test
LT              0.072   0.116   0.262   0.544   0.978
LT1             0.077   0.107   0.204   0.448   0.878
LT2             0.161   0.147   0.263   0.502   0.880
LFisher         0.071   0.100   0.257   0.554   0.980
LLogit          0.077   0.096   0.247   0.514   0.958
Grouping Test
T               0.042   0.040   0.076   0.056   0.074
T1              0.036   0.036   0.074   0.054   0.064
T2              0.036   0.036   0.072   0.048   0.064
Fisher          0.046   0.036   0.078   0.052   0.070
Logit           0.046   0.038   0.074   0.056   0.072
HL
HL10Groups      0.078   0.068   0.072   0.042   0.056

Figure 6.7: Power of different tests for Model 2, case 2 with 500 replications and 1000 bootstrap samples

have very low power, around the significance level, and the power does not improve much as the sample size increases. Overall, WT1 and WT2 have superior power; the other interaction tests also have comparable power, and all are far beyond the HL tests except at the small sample sizes.

Table 6.8: Power of different tests for Model 3, case 2 with 500 replications and 1000 bootstrap samples

Sample Size     n=500   n=1000  n=2000  n=4000  n=10000
Global Interaction Test
WT              0.052   0.124   0.248   0.568   0.984
WT1             0.050   0.164   0.334   0.710   0.998
WT2             0.052   0.162   0.346   0.710   0.998
WFisher         0.060   0.114   0.236   0.556   0.988
WLogit          0.058   0.136   0.262   0.584   0.982
Local Interaction Test
LT              0.050   0.096   0.162   0.332   0.862
LT1             0.040   0.118   0.143   0.299   0.738
LT2             0.080   0.151   0.207   0.387   0.768
LFisher         0.027   0.092   0.155   0.345   0.858
LLogit          0.053   0.101   0.153   0.315   0.808
Grouping Test
T               0.048   0.054   0.060   0.086   0.106
T1              0.052   0.050   0.054   0.078   0.102
T2              0.054   0.048   0.050   0.084   0.098
Fisher          0.056   0.056   0.064   0.090   0.104
Logit           0.058   0.056   0.064   0.092   0.104
HL
HL10Groups      0.096   0.096   0.076   0.078   0.074

Table 6.8 and Figure 6.8 display the power for Model 3, which has 8 continuous-variable interactions. With this large number of interactions, the power of the global interaction tests is generally higher than that of the local tests, except at the small sample size (n = 500). When the sample size is small, the powers of both global and local interaction tests are relatively low, so it is hard to compare their performance. Overall, WT1 and WT2 have superior power across all sample sizes. In this Chapter, we examined the performance of the interaction tests under various scenarios. The interaction tests are generally reliable for detecting missing interactions in a fitted logistic regression, and they can also detect other types of violations. In the following Chapter, we use a real data example to illustrate all the proposed tests.

Figure 6.8: Power of different tests for Model 3, case 2 with 500 replications and 1000 bootstrap samples

CHAPTER 7

ANALYSIS OF REAL DATA

In this Chapter, all the proposed tests are illustrated using the bone mineral density data from the Third National Health and Nutrition Examination Survey (NHANES III, 1988-1994). NHANES III is a program of studies designed to assess the health and nutritional status of adults and children in the United States. Several papers have used and discussed the NHANES III data set. For example, Looker et al. 1997 [30] estimated the overall scope of osteoporosis in the older U.S. population based on femoral bone mineral density (BMD) from NHANES III. Alexander et al. 2003 [3] studied this data set to quantify the increased prevalence of coronary heart disease (CHD) among people with metabolic syndrome. McEvoy et al. 2005 [31] analyzed the prevalence of the metabolic syndrome in patients with schizophrenia, using baseline data from the Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) Schizophrenia Trial in comparison with NHANES III. Xiao et al. 2017 [57] used NHANES III to study the fat mass to fat-free mass ratio using bioelectrical impedance analysis. One interesting question for these data is to predict the mortality of subjects based on the bone mineral density (BMD, g/cm2) at the femur neck. A bone mineral density test uses X-rays to measure the amount of minerals, namely calcium, in the bones. The target variable Y = 1 means that the person is dead and Y = 0 indicates the subject is still alive. The predictor, BMD, records the bone mineral density at the femur neck for each subject. We randomly selected 1500 subjects from the whole sample, 373 of whom have Y = 1, constituting 24.87% of the sample. Table 7.1 shows a descriptive summary of BMD: the values range between 0.3 and 1.6 and the median is 0.823. Figure 7.1 gives the histogram of BMD with a density curve overlaid.

Table 7.1: Descriptive summary for BMD

Min.    1st Qu.   Median   Mean     3rd Qu.   Max.
0.316   0.714     0.823    0.8325   0.9403    1.509

Figure 7.1: Histogram of BMD

We first draw some plots to gain insight into BMD. Figure 7.2 shows the scatter plot of BMD versus age in years. There is a slight linear trend with a negative slope, implying that older people tend to have lower BMD. The box plot in Figure 7.3 shows that males tend to have higher BMD than females, for physiological reasons. There are slight differences in BMD among the four race-ethnicity groups, as indicated in Figure 7.4; subjects classified as Non-Hispanic White tend to have the lowest BMD. Since the response is binary, we use logistic regression to fit the model. We draw a logit plot to examine the relationship between mortality and BMD. To draw the logit plot, first split the data into 10 equal-sized groups by ascending order of BMD, then calculate the mean of BMD in each group and the estimated logit according to the formula

\[
\widehat{\text{logit}} = \log\left[\frac{(m+1)/M}{(M-m+1)/M}\right], \tag{7.1}
\]
where M is the bin size (the number of subjects in each group) and m is the number of subjects with Y = 1 in that group.
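As an illustration only (no code appears in the dissertation itself), the grouped empirical logits of Equation (7.1) can be computed as follows; the function name and inputs here are hypothetical:

```python
import numpy as np

def empirical_logits(x, y, n_groups=10):
    """Split the data into n_groups equal-sized bins by ascending x and
    return (mean of x, estimated logit) per bin, following Eq. (7.1)."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    bins = np.array_split(np.arange(len(x)), n_groups)
    means, logits = [], []
    for idx in bins:
        M = len(idx)               # bin size
        m = int(y[idx].sum())      # number of events (Y = 1) in the bin
        means.append(x[idx].mean())
        # Eq. (7.1): log[((m+1)/M) / ((M-m+1)/M)] = log((m+1)/(M-m+1))
        logits.append(np.log((m + 1) / (M - m + 1)))
    return np.array(means), np.array(logits)
```

Plotting `logits` against `means` reproduces the construction of a logit plot such as Figure 7.5.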

Figure 7.2: Scatter plot of BMD versus age

Figure 7.3: Box plot of BMD versus gender

Figure 7.4: Box plot of BMD vs race-ethnicity

Figure 7.5 shows the plot of the estimated logit versus the mean of BMD in each group. Overall it shows a decreasing trend: subjects with higher BMD have a lower estimated logit of death and thus a better chance of survival. However, the curve rises around 0.9 and 1.1. Since the trend is somewhat curved, the logistic model may not fit the data well. We fit the logistic regression model with mortality as the response variable and BMD as the predictor. The fitted model is
\[
\log\left[\frac{\pi_i}{1-\pi_i}\right] = 3.0748 - 5.2502\,\mathrm{BMD}. \tag{7.2}
\]
The estimated coefficient for BMD is significant, and the negative slope indicates that subjects with higher BMD tend to have a higher probability of being alive. We then apply HL tests with varying numbers of groups to examine the goodness of fit of the fitted model. As discussed in Chapter 2, Table 2.2 shows the results of HL tests with 6 to 12 groups for the bone mineral density data. Based on Figure 7.5, in which the logit plot does not follow a straight line, we expect the goodness-of-fit tests to reject the null hypothesis. However, not all of the tests are as powerful as expected, and the choice of the number of groups for the HL test indeed influences the decision.
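For concreteness, the HL statistic applied here can be sketched generically as follows; this is an assumed deciles-of-risk style illustration in Python, not the dissertation's own implementation:

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p_hat, g=10):
    """HL statistic with g groups formed by sorting the fitted probabilities.
    Under H0 it is referred to a chi-squared distribution with g - 2 df."""
    order = np.argsort(p_hat)
    groups = np.array_split(order, g)
    stat = 0.0
    for idx in groups:
        n = len(idx)
        obs = y[idx].sum()          # observed events in the group
        exp = p_hat[idx].sum()      # expected events in the group
        stat += (obs - exp) ** 2 / (exp * (1 - exp / n))
    return stat, chi2.sf(stat, g - 2)
```

Varying `g` from 6 to 12 reproduces the kind of group-dependent disagreement reported in Table 2.2.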

Figure 7.5: Logit plot for bone mineral density data with 10 groups

Since the results under different choices of the number of groups do not agree, it is hard to decide which one to trust. We therefore apply to the bone mineral density data all the grouping tests that combine the HL tests: majority vote, minimum p, minimum p with Bonferroni correction, the combined p-value methods, and the averaging statistics; the results are summarized in Table 7.2. We use significance level α = 0.05 for all tests; for minimum p with Bonferroni correction, each individual HL test is conducted at α/7 ≈ 0.007 so that the joint significance level remains α = 0.05. We combine the HL results for 6 to 12 groups. One advantage of the grouping tests is that each produces a single statistic, so we do not need to debate which individual HL result to trust. From Table 7.2, most of the tests reject the null hypothesis; the exceptions are the Tippett and Stouffer methods, which agrees with the low power these two methods showed in the simulations, and all other methods are generally more powerful than an individual HL test. We therefore conclude that most of the grouping tests detect the discrepancy from the fitted model. Figure 7.5 also confirms the lack of fit of the logistic regression, so most of the grouping tests are powerful methods here. This real-data example supports using multiple HL tests and combining their results instead of relying on an HL test with an arbitrary number of groups or trying to find the optimal one.
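The classical combination rules behind several of these grouping tests are available in `scipy.stats.combine_pvalues`. The sketch below uses hypothetical p-values; note that the chi-squared and normal reference distributions it uses assume independent tests, an assumption that fails for HL statistics computed on the same data, which is why bootstrap calibration is used in this dissertation instead:

```python
import numpy as np
from scipy.stats import combine_pvalues

# Hypothetical p-values from HL tests with 6 to 12 groups (J = 7 tests).
hl_pvalues = np.array([0.041, 0.080, 0.027, 0.061, 0.019, 0.090, 0.033])

for method in ("fisher", "stouffer", "tippett"):
    stat, p = combine_pvalues(hl_pvalues, method=method)
    print(f"{method:8s} statistic = {stat:9.4f}  combined p = {p:.4f}")

# Fisher: -2 * sum(log p_i), nominally chi-squared with 2J df;
# Stouffer: sum of z-scores / sqrt(J); Tippett: based on min p_i.
```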

Table 7.2: Grouping tests for model (7.2)

Method          Statistic     P-value   Significance Level α   Decision
Majority Vote   *             *         0.05                   reject H0
Minimum P       *             0.0024    0.05                   reject H0
Minimum P BC    *             0.0024    0.05                   reject H0
T               2.0298        0.035     0.05                   reject H0
T1              211.5650      0.024     0.05                   reject H0
T2              72704.6900    0.022     0.05                   reject H0
Tippett         0.0024        0.981     0.05                   not reject H0
Stouffer        -4.3935       0.971     0.05                   not reject H0
Fisher          44.6333       0.028     0.05                   reject H0
Logit           4.6534        0.029     0.05                   reject H0

The goodness-of-fit tests suggest that the fitted logistic regression in Equation (7.2) is not appropriate for the data set. Since most of the grouping tests indicate a lack of fit for the simple model in Equation (7.2), we must modify the model. One possible modification is to add an interaction term. From Figure 7.5 we observe that when BMD is greater than 1.0, the logit plot behaves differently, showing an increasing trend. This erratic behavior suggests that, once BMD is large enough, it has little further impact on the probability of mortality; in other words, such subjects are relatively healthy and have a large probability of being alive. To model this behavior, we introduce an indicator function I, taking the value 1 when BMD is greater than 1.0 and 0 otherwise. Specifically,
\[
I = \begin{cases} 1, & \text{if BMD} > 1, \\ 0, & \text{otherwise.} \end{cases} \tag{7.3}
\]
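In code, the indicator of Equation (7.3) and the corresponding design matrix can be formed directly; this is a sketch with hypothetical BMD values:

```python
import numpy as np

bmd = np.array([0.45, 0.82, 1.05, 0.97, 1.31])    # hypothetical BMD values
I = (bmd > 1.0).astype(int)                        # Eq. (7.3): 1 if BMD > 1
X = np.column_stack([np.ones_like(bmd), bmd, I])   # intercept, BMD, I
print(I)   # -> [0 0 1 0 1]
```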

Figure 7.1 depicts the histogram of BMD; about 15.2% of the data have BMD values greater than 1.0. Including the indicator function I, the fitted logistic regression becomes

\[
\log\left[\frac{\pi_i}{1-\pi_i}\right] = 3.3357 - 5.6242\,\mathrm{BMD} + 0.4274\,I. \tag{7.4}
\]
To examine whether the updated model (7.4) is appropriate for the bone mineral density data, we run two sets of goodness-of-fit tests: grouping tests and global interaction tests. The null hypothesis

and alternative hypothesis for the goodness-of-fit tests are specified as follows:

H0: Mortality = BMD + I
H1: The model is not sufficient

We use 6 to 12 groups when combining the individual HL tests and run 1000 bootstrap samples to compute the p-value of each test. The results are summarized in Table 7.3: the left panel presents the p-value of each grouping test and the right panel lists the global interaction tests. All the p-values on the left are greater than 0.05 and hence not significant. However, the tests on the right indicate a lack of fit for the fitted logistic regression. Figure 7.6 shows the same conclusion: the p-values of the grouping tests are plotted on the left and those of the global interaction tests on the right, with the tests on the x-axis and the corresponding p-values on the y-axis; the red horizontal line marks a p-value of 0.05. All the grouping-test p-values lie above 0.05, but the p-values of the global interaction tests are extremely small, implying a lack of fit. Since the grouping tests are not significant while the violation is detected once the LR component is added, this suggests that the model is missing an interaction term between BMD and I.
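The bootstrap p-values referred to here can be obtained with a generic parametric bootstrap. The following is a schematic sketch in which `fit_probs` (refit the model and return fitted probabilities) and `stat_fn` (compute a goodness-of-fit statistic) are placeholders, not functions from the dissertation:

```python
import numpy as np

def bootstrap_pvalue(stat_fn, fit_probs, x, y, B=1000, seed=0):
    """Right-tail parametric bootstrap p-value for a goodness-of-fit statistic."""
    rng = np.random.default_rng(seed)
    p_hat = fit_probs(x, y)                  # fit on the observed data
    t_obs = stat_fn(y, p_hat)
    t_boot = np.empty(B)
    for b in range(B):
        y_b = rng.binomial(1, p_hat)         # simulate a response from the fit
        p_b = fit_probs(x, y_b)              # refit on the bootstrap sample
        t_boot[b] = stat_fn(y_b, p_b)
    return np.mean(t_boot >= t_obs)
```

With B = 1000 this mirrors the 1000 bootstrap samples used throughout the chapter.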

Table 7.3: Goodness-of-fit tests of model (7.4) for bone mineral density data, used 6 to 12 groups and 1000 bootstrap samples

Grouping Tests         Global Interaction Tests
Test      P-value      Test       P-value
T         0.067        WT         0.000
T1        0.071        WT1        0.002
T2        0.065        WT2        0.001
Fisher    0.072        WFisher    0.000
Logit     0.066        WLogit     0.000

To verify this statement, we conduct a likelihood ratio test against the more complex model that includes the interaction between BMD and I. Table 7.4 presents the results of the LR test. The null model contains only the main effects of BMD and I, whereas the alternative model adds the interaction term. The p-value is 2.98e-05, much smaller than 0.05. The LR

Figure 7.6: Goodness-of-fit tests of model (7.4) for bone mineral density data, used 6 to 12 groups and 1000 bootstrap samples

test is highly significant, which demonstrates that the model is improved by including the interaction term.

Table 7.4: Likelihood ratio test for bone mineral density data

Likelihood Ratio Test
Model 1: Mortality = BMD + I
Model 2: Mortality = BMD + I + BMD*I

          -2 Log likelihood   DF   Chi-square   P(>chisq)
Model 1   1499.8
Model 2   1482.4              1    17.4         2.98e-05
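The LR test arithmetic in Table 7.4 is simply the difference of the two deviances referred to a chi-squared distribution with one degree of freedom; a quick check using only the numbers reported in the table:

```python
from scipy.stats import chi2

dev_null = 1499.8   # -2 log likelihood, Model 1: Mortality = BMD + I
dev_alt = 1482.4    # -2 log likelihood, Model 2: adds BMD*I
lr_stat = dev_null - dev_alt          # 17.4; one extra parameter -> df = 1
p_value = chi2.sf(lr_stat, df=1)
print(lr_stat, p_value)               # p is on the order of 3e-05, as in Table 7.4
```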

Since the global interaction tests and the LR test detect a lack of fit for model (7.4), we refit the logistic regression with the interaction added. The fitted model is:

\[
\log\left[\frac{\pi_i}{1-\pi_i}\right] = 3.6534 - 6.0503\,\mathrm{BMD} - 10.2799\,I + 9.7812\,\mathrm{BMD}*I. \tag{7.5}
\]

We then run the grouping tests to assess whether the updated model is sufficient; the null and alternative hypotheses are stated as follows:

H0: Mortality = BMD + I + BMD * I
H1: The model is not sufficient

The results are presented in Table 7.5 and Figure 7.7. The p-values in the table are all well above 0.05, and correspondingly all the bars in the figure lie above the red line. Thus we fail to reject the null hypothesis that model (7.5) is sufficient for the bone mineral density data. Compared with Figure 7.6(A), which shows the grouping tests for model (7.4) without the interaction term, the p-values in Figure 7.7 are much larger. This also indicates that model (7.5), which includes the interaction, is sufficient.

Table 7.5: Grouping tests of model (7.5) for bone mineral density data, used 6 to 12 groups and 1000 bootstrap samples

Test      T       T1      T2      Fisher   Logit
P-value   0.402   0.457   0.467   0.429    0.394

For this bone mineral density data, we first fit a simple logistic regression with BMD as the only predictor, and the grouping tests detected a lack of fit. We then added an indicator I to the model, and the global interaction tests indicated a violation. We used the LR test to confirm that the model would be better with an interaction between BMD and I. Finally, we fit the model with the two main effects, BMD and I, and their interaction, and the goodness-of-fit tests all suggest this model is sufficient. Therefore the final model is specified in Equation (7.5). This model indicates that BMD is negatively related to mortality: subjects with large BMD values have a higher chance of being alive. Moreover, the interaction between BMD and I implies that when BMD is greater than 1, the subject has a large probability of being alive.

Figure 7.7: Grouping tests of model (7.5) for bone mineral density data, used 6 to 12 groups and 1000 bootstrap samples

CHAPTER 8

SUMMARY AND FUTURE WORK

The Hosmer-Lemeshow (HL) goodness-of-fit test for assessing the fit of logistic regression is widely used and available in most statistical packages. However, there are two issues in performing the HL test. The first is that we must specify the number of partition groups for calculating the statistic. Different choices of the number of groups show different Type I error and power performance and often suggest different decisions. A good choice should control the Type I error under the null hypothesis and have high power under the alternative, and finding a generally optimal number of groups is very challenging. Therefore, we considered combining the HL tests. In this study, we proposed several grouping tests that combine multiple HL tests with varying numbers of groups instead of using one arbitrary choice or searching for the optimum. In the simulation studies, we considered various models and carried out Type I error and power calculations for sample sizes from 500 to 10000 at significance level α = 0.05. All the grouping tests except the minimum p method have correct size and control the Type I error well. We also used bootstrap approaches to obtain p-values for some tests: for the combined p-value methods, the derived distributions based on the independence assumption are not correct in our context, so we applied the bootstrap, which approximates the distributions of the tests well. We considered different types of violations of the logistic regression and examined the power of all the tests to detect departures from the fitted model. Because of the ample power and good Type I error of the grouping tests, we recommend performing multiple HL tests and combining the results instead of using only one number of groups. The other issue with the HL test is its low power to detect missing interactions between continuous and dichotomous covariates.
Even the grouping tests, which combine multiple HL tests with varying numbers of groups, do not solve this problem. Therefore, we proposed global and local interaction tests that utilize the likelihood ratio (LR) test to capture the violation of missing interactions. Simulation studies show that the interaction tests can detect different types of violations besides missing interactions, which is exactly what we wanted. We also use

the interaction tests in the real data analysis, and they help develop a more reasonable model for the data. Our study sheds light on future projects, and the topic certainly warrants more research. Our comprehensive numerical study also suggests some directions to explore. We used only simple models in the simulations; we would like to include more complicated models to assess the performance of the proposed methods. In all the simulations we used either univariate predictors or independent covariates; we may consider generating data sets with high-dimensional or correlated predictors to assess the power of the proposed tests and to investigate whether the methods are robust in more complex settings. For power, we only studied the ability of the proposed methods to detect the omission of an interaction term or a quadratic term; we would also like to study the power of the tests against an incorrect link function, i.e., when the logit link is not correct. The power for Model 6 in the simulation of grouping tests is not as good as for the other models, possibly due to the low event rate; we could generate a series of data sets with varying event rates to examine the power of the proposed methods. We developed the interaction tests to detect missing interactions in the fitted logistic regression, and we could also explore alternative approaches to goodness-of-fit tests with high power in the presence of interactions. One possible method is to apply the idea of the conditional distance independence test (Wang et al. 2015 [52]) to test whether X is independent of Y conditioning on Xβ. If the test suggests independence, we could claim the model is sufficient; otherwise, terms may be missing from the logistic regression model. There is also one application we could pursue for the interaction tests.
We could add an interaction term between a continuous variable and an indicator function to find a threshold for the continuous predictor in the logistic regression. If the model is fit correctly in this way, we could find the turning point beyond which the probability of the event stays constant.
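As a sketch of this idea (an assumed implementation, not from the dissertation), one can profile candidate cutpoints c for the model logit = b0 + b1 x + b2 1{x>c} + b3 x 1{x>c} and pick the cutpoint with the smallest negative log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize

def logistic_nll(beta, X, y):
    """Negative log-likelihood of a logistic regression, numerically stable."""
    eta = X @ beta
    return np.sum(np.logaddexp(0.0, eta) - y * eta)

def fitted_nll(X, y):
    res = minimize(logistic_nll, np.zeros(X.shape[1]), args=(X, y), method="BFGS")
    return res.fun

def best_threshold(x, y, candidates):
    """Grid search over candidate cutpoints c; returns the c whose
    interaction model attains the smallest negative log-likelihood."""
    scores = []
    for c in candidates:
        ind = (x > c).astype(float)
        X = np.column_stack([np.ones_like(x), x, ind, x * ind])
        scores.append(fitted_nll(X, y))
    return candidates[int(np.argmin(scores))]
```

This grid search is only one crude way to locate such a turning point; profile-likelihood confidence statements would require further work.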

APPENDIX A

CALCULATION OF THE EXPECTATION FOR FIVE GROUPING TESTS

Recall that we perform the Hosmer-Lemeshow (HL) test J times, with the numbers of groups chosen from the set {m_1, m_2, ..., m_J}. There are then J HL statistics and p-values, denoted {c_1, c_2, ..., c_J} and {p_1, p_2, ..., p_J}.

Averaging Statistics. The averaging statistics average the multiple HL statistics, adjusted by their degrees of freedom and the corresponding p-values. The first one does not involve the p-values:
\[
T = \frac{1}{J}\sum_{i=1}^{J}\frac{c_i}{m_i-2}.
\]
To standardize this test, we need to compute its mean and standard deviation. In this formula the only random variables are the HL statistics c_i, whose means are the numbers of groups minus two. Therefore the mean of T can be computed as
\[
E[T] = \frac{1}{J}\sum_{i=1}^{J}\frac{E[c_i]}{m_i-2} = \frac{1}{J}\sum_{i=1}^{J}\frac{m_i-2}{m_i-2} = 1.
\]
The second statistic includes the p-values in the denominator and thus assigns more weight to the relatively significant tests:
\[
T_1 = \frac{1}{J}\sum_{i=1}^{J}\frac{c_i}{p_i(m_i-2)}.
\]
The expectation of T_1 is
\[
E[T_1] = \frac{1}{J}\sum_{i=1}^{J}\frac{1}{m_i-2}\,E\!\left[\frac{c_i}{1-F(c_i)}\right]
       = \frac{1}{J}\sum_{i=1}^{J}\frac{1}{m_i-2}\cdot\frac{m_i-2}{1-F(m_i-2)}
       = \frac{1}{J}\sum_{i=1}^{J}\frac{1}{1-F(m_i-2)},
\]
where F(s) is the CDF of the chi-squared distribution with s degrees of freedom. The statistic T_2 involves the squares of the p-values:
\[
T_2 = \frac{1}{J}\sum_{i=1}^{J}\frac{c_i}{p_i^2(m_i-2)}.
\]
Similarly, the expectation of T_2 is
\[
E[T_2] = \frac{1}{J}\sum_{i=1}^{J}\frac{1}{m_i-2}\,E\!\left[\frac{c_i}{[1-F(c_i)]^2}\right]
       = \frac{1}{J}\sum_{i=1}^{J}\frac{1}{m_i-2}\cdot\frac{m_i-2}{[1-F(m_i-2)]^2}
       = \frac{1}{J}\sum_{i=1}^{J}\frac{1}{[1-F(m_i-2)]^2}.
\]
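The identity E[T] = 1 can be checked numerically by drawing each c_i from its null chi-squared distribution; a small simulation sketch using groups of 6 to 12 as in the dissertation:

```python
import numpy as np

rng = np.random.default_rng(42)
m = np.arange(6, 13)        # numbers of groups m_1, ..., m_J (J = 7)
reps = 20000

# Under H0 each HL statistic c_i is approximately chi-squared with m_i - 2 df.
c = rng.chisquare(df=m - 2, size=(reps, len(m)))
T = (c / (m - 2)).mean(axis=1)
print(T.mean())             # close to 1, in line with E[T] = 1
```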

Fisher's Method. Fisher's method applies the logarithm transformation to the p-values and is constructed as
\[
X^2 = -2\sum_{i=1}^{J}\log(p_i).
\]
Here the expectation of X^2 can be expressed as
\[
E[X^2] = \sum_{i=1}^{J}E[-2\log(p_i)] = 2J.
\]
This is because p_i follows U(0, 1), so −2 log(p_i) follows a chi-squared distribution with 2 degrees of freedom.
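Similarly, the claim E[X^2] = 2J follows from −2 log(p_i) having a chi-squared distribution with 2 degrees of freedom and can be verified by simulation (J = 7 as an example):

```python
import numpy as np

rng = np.random.default_rng(0)
J = 7
p = rng.uniform(size=(50000, J))      # null p-values are U(0, 1)
X2 = -2 * np.log(p).sum(axis=1)       # Fisher's statistic per replication
print(X2.mean())                      # close to 2J = 14
```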

Logit's Method. The logit method uses the logit transformation of the p-values and is specified as
\[
G = -\left[\sum_{i=1}^{J}\log\frac{p_i}{1-p_i}\right]\left[\frac{J\pi^2(5J+2)}{3(5J+4)}\right]^{-1/2}.
\]
The mean of G is
\[
E[G] = -\left[\frac{J\pi^2(5J+2)}{3(5J+4)}\right]^{-1/2}\sum_{i=1}^{J}E[\log(p_i)-\log(1-p_i)] = 0,
\]
since log(p_i) and log(1 − p_i) have the same expectation when p_i follows U(0, 1).

APPENDIX B

IRB APPROVAL

BIBLIOGRAPHY

[1] Hamzah Abdul Hamid, Bee Wah Yap, Xie Xian-Jin, and Seng Huat Ong. Investigating the power of goodness-of-fit tests for multinomial logistic regression. Communications in Statistics- Simulation and Computation, (just-accepted), 2017.

[2] Mery Natali Silva Abreu, Arminda Lucia Siqueira, Clareci Silva Cardoso, and Waleska Teixeira Caiaffa. Ordinal logistic regression models: application in quality of life studies. Cadernos de Saúde Pública, 24:s581–s591, 2008.

[3] Charles M Alexander, Pamela B Landsman, Steven M Teutsch, and Steven M Haffner. NCEP-defined metabolic syndrome, diabetes, and prevalence of coronary heart disease among NHANES III participants age 50 years and older. Diabetes, 52(5):1210–1214, 2003.

[4] Frank J Anscombe and John W Tukey. The examination and analysis of residuals. Technometrics, 5(2):141–160, 1963.

[5] Kellie J Archer, Stanley Lemeshow, and David W Hosmer. Goodness-of-fit tests for logistic regression models when data are collected using a complex sampling design. Computational Statistics & Data Analysis, 51(9):4450–4464, 2007.

[6] Adelchi Azzalini, Adrian W Bowman, and Wolfgang Härdle. On the use of nonparametric regression for model checking. Biometrika, 76(1):1–11, 1989.

[7] Mirta Bensic, Natasa Sarlija, and Marijana Zekic-Susac. Modelling small-business credit scor- ing by using logistic regression, neural networks and decision trees. Intelligent Systems in Accounting, Finance and Management, 13(3):133–150, 2005.

[8] G Bertolini, R D'Amico, D Nardi, A Tinazzi, and G Apolone. One model, several results: the paradox of the Hosmer-Lemeshow goodness-of-fit test for the logistic regression model. Journal of Epidemiology and Biostatistics, 5(4):251–253, 2000.

[9] Charles C Brown. On a goodness of fit test for the logistic model based on score statistics. Communications in Statistics-Theory and Methods, 11(10):1087–1105, 1982.

[10] Jana D Canary, Leigh Blizzard, Ronald P Barry, David W Hosmer, and Stephen J Quinn. A comparison of the hosmer–lemeshow, pigeon–heyse, and tsiatis goodness-of-fit tests for binary logistic regression under two grouping methods. Communications in Statistics-Simulation and Computation, 46(3):1871–1894, 2017.

[11] Tim J Cole, Edmund Hey, and Sam Richmond. The prem score: a graphical tool for predicting survival in very preterm births. Archives of Disease in Childhood-Fetal and Neonatal Edition, 95(1):F14–F19, 2010.

[12] R Dennis Cook and Sanford Weisberg. Diagnostics for heteroscedasticity in regression. Biometrika, 70(1):1–10, 1983.

[13] JB Copas. Unweighted sum of squares test for proportions. Applied Statistics, pages 71–80, 1989.

[14] James Durbin and Geoffrey S Watson. Testing for serial correlation in regression: I. Biometrika, 37(3/4):409–428, 1950.

[15] Morten W Fagerland, David W Hosmer, and Anna M Bofin. Multinomial goodness-of-fit tests for logistic regression models. Statistics in medicine, 27(21):4238–4253, 2008.

[16] CP Farrington. On assessing goodness of fit of generalized linear models to sparse data. Journal of the Royal Statistical Society. Series B (Methodological), pages 349–360, 1996.

[17] Ronald Aylmer Fisher. Statistical methods for research workers. Edinburgh, 1934.

[18] Andrew S Fullerton. A conceptual framework for ordered logistic regression models. Sociolog- ical methods & research, 38(2):306–347, 2009.

[19] Ebenezer Olusegun George. Combining independent one-sided and two-sided statistical tests– some theory and applications. 1978.

[20] Joachim Hartung, Guido Knapp, and Bimal K Sinha. Statistical meta-analysis with applications. Hoboken, NJ: John Wiley, 2008.

[21] Richard P Hauser and David Booth. Predicting bankruptcy with robust logistic regression. Journal of Data Science, 9(4):565–584, 2011.

[22] David W Hosmer and Nils Lid Hjort. Goodness-of-fit processes for logistic regression: simula- tion results. Statistics in Medicine, 21(18):2723–2738, 2002.

[23] David W Hosmer and Stanley Lemeshow. Goodness of fit tests for the multiple logistic regres- sion model. Communications in statistics-Theory and Methods, 9(10):1043–1069, 1980.

[24] David W Hosmer, Stanley Lemeshow, and J Klar. Goodness-of-fit testing for the logistic regression model when the estimated probabilities are small. Biometrical Journal, 30(8):911– 924, 1988.

[25] David W Hosmer, Trina Hosmer, Saskia Le Cessie, Stanley Lemeshow, et al. A comparison of goodness-of-fit tests for the logistic regression model. Statistics in medicine, 16(9):965–980, 1997.

[26] Andrew A Kramer and Jack E Zimmerman. Assessing the calibration of mortality benchmarks in critical care: The Hosmer-Lemeshow test revisited. Critical care medicine, 35(9):2052–2056, 2007.

[27] S Le Cessie and JC Van Houwelingen. A goodness-of-fit test for binary regression models, based on smoothing methods. Biometrics, pages 1267–1282, 1991.

[28] Saskia Le Cessie and Hans C Van Houwelingen. Testing the fit of a regression model via score tests in random effects models. Biometrics, pages 600–614, 1995.

[29] Song Liu and Yuhong Yang. Combining models in longitudinal data analysis. Annals of the Institute of Statistical Mathematics, 64(2):233–254, 2012.

[30] Anne C Looker, Eric S Orwoll, C Conrad Johnston, Robert L Lindsay, Heinz W Wahner, William L Dunn, Mona S Calvo, Tamara B Harris, and Stephen P Heyse. Prevalence of low femoral bone density in older us adults from nhanes iii. Journal of Bone and Mineral Research, 12(11):1761–1768, 1997.

[31] Joseph P McEvoy, Jonathan M Meyer, Donald C Goff, Henry A Nasrallah, Sonia M Davis, Lisa Sullivan, Herbert Y Meltzer, John Hsiao, T Scott Stroup, and Jeffrey A Lieberman. Prevalence of the metabolic syndrome in patients with schizophrenia: baseline results from the clinical antipsychotic trials of intervention effectiveness (catie) schizophrenia trial and comparison with national estimates from nhanes iii. Schizophrenia research, 80(1):19–32, 2005.

[32] Samer AM Nashef, François Roques, Philippe Michel, E Gauducheau, S Lemeshow, R Salamon, and EuroSCORE Study Group. European system for cardiac operative risk evaluation (EuroSCORE). European journal of cardio-thoracic surgery, 16(1):9–13, 1999.

[33] J Neyman and ES Pearson. On the problem of the most efficient tests of statistical hypotheses. Biometrika A, 20:175–240, 1933.

[34] Gerhard Osius and Dieter Rojek. Normal goodness-of-fit tests for multinomial models with large degrees of freedom. Journal of the American Statistical Association, 87(420):1145–1152, 1992.

[35] Prabasaj Paul, Michael L Pennell, and Stanley Lemeshow. Standardizing the power of the hosmer–lemeshow goodness of fit test in large data sets. Statistics in medicine, 32(1):67–80, 2013.

[36] Karl Pearson. X. on the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 50(302):157–175, 1900.

[37] Edsel A Peña and Elizabeth H Slate. Global validation of linear model assumptions. Journal of the American Statistical Association, 101(473):341–354, 2006.

[38] Joseph G Pigeon and Joseph F Heyse. A cautionary note about assessing the fit of logistic regression models. 1999.

[39] Joseph G Pigeon and Joseph F Heyse. An improved goodness of fit statistic for probability prediction models. Biometrical Journal, 41(1):71–82, 1999.

[40] Erik Pulkstenis and Timothy J Robinson. Two goodness-of-fit tests for logistic regression models with continuous covariates. Statistics in medicine, 21(1):79–93, 2002.

[41] Wei Qian and Yuhong Yang. Kernel estimation and model combination in a bandit problem with covariates. Journal of Machine Learning Research, 2016.

[42] Stephen J Quinn, David W Hosmer, and C Leigh Blizzard. Goodness-of-fit statistics for log- link regression models. Journal of Statistical Computation and Simulation, 85(12):2533–2545, 2015.

[43] Patrick Royston. The use of cusums and other techniques in modelling continuous covariates in logistic regression. Statistics in Medicine, 11(8):1115–1129, 1992.

[44] Patrick Royston et al. Cusum plots and tests for binary variables. Stata Technical Bulletin, 2 (12), 1993.

[45] Samuel A Stouffer, Edward A Suchman, Leland C DeVinney, Shirley A Star, and Robin M Williams Jr. The american soldier: Adjustment during army life.(studies in social psychology in world war ii), vol. 1. 1949.

[46] Thérèse A Stukel. Generalized logistic models. Journal of the American Statistical Association, 83(402):426–431, 1988.

[47] Rodney X Sturdivant and David W Hosmer Jr. A smoothed residual based goodness-of-fit statistic for logistic hierarchical regression models. Computational statistics & data analysis, 51(8):3898–3912, 2007.

[48] John Q Su and LJ Wei. A lack-of-fit test for the mean function in a generalized linear model. Journal of the American Statistical Association, 86(414):420–426, 1991.

[49] L. H. C. Tippett. The Methods of Statistics. London: Williams and Norgate, 1931.

[50] Anastasios A Tsiatis. A note on a goodness-of-fit test for the logistic regression model. Biometrika, 67(1):250–251, 1980.

[51] John W Tukey. One degree of freedom for non-additivity. Biometrics, 5(3):232–242, 1949.

[52] Xueqin Wang, Wenliang Pan, Wenhao Hu, Yuan Tian, and Heping Zhang. Conditional distance correlation. Journal of the American Statistical Association, 110(512):1726–1734, 2015.

[53] Xiaoqiao Wei and Yuhong Yang. Robust combination of methods for prediction. Statistica Sinica, pages 1021–1040, 2012.

[54] Bryan Wilkinson. A statistical consideration in psychological research. Psychological bulletin, 48(2):156, 1951.

[55] Samuel S Wilks. The large-sample distribution of the likelihood ratio for testing composite hypotheses. The Annals of Mathematical Statistics, 9(1):60–62, 1938.

[56] Brandi J Witt, Steven J Jacobsen, Susan A Weston, Jill M Killian, Ryan A Meverden, Thomas G Allison, Guy S Reeder, et al. Cardiac rehabilitation after myocardial infarction in the community. Journal of the American College of Cardiology, 44(5):988–996, 2004.

[57] J Xiao, SA Purcell, CM Prado, and MC Gonzalez. Fat mass to fat-free mass ratio reference values from nhanes iii using bioelectrical impedance analysis. Clinical Nutrition, 2017.

[58] Xian-Jin Xie, Jane Pendergast, and William Clarke. Increasing the power: A practical ap- proach to goodness-of-fit test for logistic regression models with continuous predictors. Com- putational Statistics & Data Analysis, 52(5):2703–2713, 2008.

[59] Wei Yu, Wangli Xu, and Lixing Zhu. A modified hosmer–lemeshow test for large data sets. Communications in Statistics-Theory and Methods, (just-accepted), 2017.

BIOGRAPHICAL SKETCH

Wei Ma was born in Jilin province, China. In 2009, Wei entered Northeast Normal University in Changchun, China. She received a Bachelor of Economics with a major in Finance from Northeast Normal University in June 2013. In August 2013, she entered the statistics graduate program at Florida State University in Tallahassee.
