A Simulation Study of Goodness-Of-Fit Tests for Binary Regression with Applications to Norwegian Intensive Care Registry Data

A Simulation Study of Goodness-of-Fit Tests for Binary Regression with Applications to Norwegian Intensive Care Registry Data Ellisif Nygaard Supervisor: Geir Drage Berentsen Øystein A. Haaland Department of Mathematics University of Bergen This dissertation is submitted in partial fulfillment of the requirements for the degree of Master of Science in Statistics (Data Analysis) The Faculty of Mathematics and Natural Sciences January 2019 Acknowledgements I would like to thank my advisers Geir Drage Berentsen and Øystein Ariansen Haaland for introducing me to this topic and their guidance throughout the process of writing this thesis. I am also grateful to the Department of Heart Disease at Haukeland University Hospital for valuable experience, and their patience and generosity. In addition, I would like to thank the Norwegian Intensive Care Registry (NIR) for providing an unique data set. Many thanks to friends and family for their indispensable cheers of encouragement. I am also grateful for the proofreading and feedback received from Ranveig Nygaard. And I am grateful that this process is finally over. Abstract When using statistical methods to fit a model, the consensus is that it is possible to represent a complex reality in the form of a simpler model. It is helpful to systematically measure a model’s ability to capture the underlying system which controls the data generation in the population being examined. One of the possible tools we can apply to evaluate model adequacy is goodness-of- fit (GOF) tests. Summary GOF statistics are computed for a specific fitted model,then attributed an asymptotic distribution, and finally the null hypothesis that the model fits the data adequately is tested. A great challenge, when the model is a binary regression model and it has one or several continuous covariates, is to verify which asymptotic distributions the GOF statistics in fact have (Hosmer et al., 1997). In this thesis, we will evaluate the validity of the distributions of some established GOF test statistics mentioned in the literature. We have chosen so-called global GOF tests, where user input is not necessary. Tests demanding user input, such as the Hosmer-Lemeshow test, have been shown to have some considerable disadvantages. Hosmer et al.(1997) states that number of groups (which are determined by user discretion) can influence whether the GOF test rejects the model fit or not. Binary regression models present a specific set of challenges with regards toGOF measures, especially in situations where at least one covariate is continuous. There appears to be no broad general agreement on which GOF statistics are reliable options when fitting such models. This thesis aims to extend the current knowledge in this area. A modified version of one of the statistics is introduced. The GOF tests studied are later applied in a data analysis on real data set from the Norwegian Intensive Care Registry (NIR). An exploration was performed in the attempt to suggest a suitable tool to evaluate the discrepancies between the estimated logistic probabilities and the outcome variable, and how different GOF tests will behave for different categories of discrepancies. Table of contents Symbols xi 1 Introduction to Binary Regression1 1.1 The Classical Linear Regression Model . .1 1.2 Generalised Linear Models (GLMs) . .2 1.3 Binary Regression Models . .4 1.3.1 Link Functions and Their Corresponding Models . .4 1.3.2 Interpretation of logistic regression models . .6 1.3.3 Maximum Likelihood Estimation . .7 2 Goodness-of-Fit Tests and Their Statistics9 2.1 The Standardized Pearson Test . .9 2.2 Unweighted Sum of Squares Test . 11 2.3 Stukel’s Test . 12 2.4 The Information Matrix Test . 14 3 Significance Level Study 19 3.1 The Goodness-of-Fit Statistics and Their Implementation . 20 3.2 A Weibull-based behaviour indicator . 22 3.3 Significance Level Study . 24 3.4 The results of the significance level study . 27 viii Table of contents 3.5 Study of the Weibull-based behaviour indicator applied to simulation data . 30 3.5.1 Principal Component Analysis . 34 3.5.2 The Weibull shape parameter and standard deviation as potential indicators of empirical significance levels . 39 3.5.3 Future work . 42 4 Power Study 43 4.1 Types of departure . 43 4.2 A General Outline of the Power Study Design . 44 4.3 Details of the power study design . 46 4.3.1 Departure type D1: Omission of a quadratic term . 47 4.3.2 Departure type D2: Omission of a log term . 49 4.3.3 Departure type D3: Omission of the main effect of a binary covariate and its interaction with a continuous covariate. 52 4.3.4 Departure type D4: Selection of an incorrect link function. 54 4.4 The Results of the Power Study . 58 4.4.1 Power assessment results for departure type D1 ........... 58 4.4.2 Power assessment results for departure type D2 ........... 60 4.4.3 Power assessment results for departure type D3 ........... 62 4.4.4 Power assessment results for departure type D4 ........... 63 5 Summary of Simulation Studies 65 5.1 The empirical significance levels of the GOF test statistics . 65 5.2 The empirical power of the GOF test statistics . 67 5.2.1 Omission of a quadratic term . 67 5.2.2 Omission of a log term . 68 5.2.3 Omission of the main effect of a binary covariate and its interaction with a continuous covariate . 68 5.2.4 Incorrectly specified link function . 69 Table of contents ix 5.2.5 Summary and recommendations . 70 6 Data Analysis 71 6.1 Background . 71 6.2 The Data Set . 72 6.3 Methods . 74 6.3.1 The models . 74 6.4 Statistical analysis . 76 6.5 Results . 77 Bibliography 85 Symbols Roman Symbols H0 The null hypothesis M0 The true logistic model Sˆst The standardised USS statistic 2 Xst The standardised Pearson chi-square statistic Greek Symbols j1, j2 The shape parameters of Stukel’s generalised model Acronyms / Abbreviations GOF Goodness-of-fit IMT Information matrix test LPM Linear probability model LRT Likelihood ratio test ML Maximum likelihood NIR The Norwegian Intensive Care Registry PRD Predicted risk of death USS Unweighted sum-of-squares Chapter 1 Introduction to Binary Regression 1.1 The Classical Linear Regression Model Let the data (yi;xi1;:::;xik), i = 1;:::;n, consist of n observations of the continuous response variable y and the k covariates x1;:::;xk. The covariates can be continuous or categorical. In Fahrmeir et al.(2013), the classical linear regression model is given by yi = b0 + b1xi1 + ::: + bkxik + ei; i = 1;:::;n; where the error terms e1;e2;:::;en are assumed to be independent and identically normally 2 distributed with E(ei) = 0 and Var(ei) = s . The model is linear in the parameters b0;:::;bk, whereas the covariates can be non-linear expressions. The following quantity, which represents the influence the covariates have on the model, is referred to as a linear predictor: T hi = b0 + b1xi1 + ::: + bkxik = xi b , (1.1) T T where xi = [1 xi1 xi2 ::: xik] and b = [b0 b1 ::: bk] . This can be expressed in vector form as y = X b + e, and (1.2) E(y) = m = X b , (1.3) 2 Introduction to Binary Regression T T where y = [y1 y2 ::: yn] , and m = [m1 m2 ::: mn] . The matrix X , which is often called the design matrix, is defined as 2 3 2 T3 1 x11 x12 ··· x1k x1 6 7 6 T7 61 x21 x22 ··· x2k7 6x2 7 X = 6. 7 = 6 . 7, 6. ... 7 6 . 7 4. 5 4 . 5 T 1 xn1 xn2 ··· xnk xn and X b is the linear component of the classical linear regression model. In this setting, the ordinary least squares (OLS) estimate and the maximum likelihood estimate (MLE) of b are identical, and given by bˆ = (X TX )−1X Ty. (1.4) Once the model parameters have been fitted to the data in X , the linear combinations of b and the rows of the design matrix comprise the estimated linear predictors: T ˆ ˆ ˆ ˆ hˆi = xi b = b0 + b1xi1 + ::: + bkxik. (1.5) Due to the model assumptions, the estimated linear predictors are suitable estimators for E(yi) = E(yijxi1;:::;xik) = E(b0 +b1xi1 +:::+bkxik +ei) = b0 +b1xi1 +:::+bkxik. Hence ˆ hˆi is used to predict yi, i.e. yˆ = X b . The classical linear regression is applied in prediction, forecasting, and when quantifying the strength of linear relationships. 1.2 Generalised Linear Models (GLMs) Classical linear regression models can be generalised in order to permit response variables whose errors are not normally distributed. GLMs are extensions of linear models which facilitate modelling non-normal response variables. A GLM consists of three components: 1) The random component, 2) the systematic component, and 3) the link function. The response variables Y1;:::;Yn are the random component. A standard assumption is that the response variables are random and independent, but not identically distributed. They each have a distribution in canonical form from the same exponential family. In some cases, 1.2 Generalised Linear Models (GLMs) 3 the observations of Y1;:::;Yn are binary, taking on values such as "0" and "1", or "success" and "failure". The systematic component is the function of the covariates x1;:::;xk which is related to the expected value of Y1;:::;Yn. Just as in classical linear regression, the function is called the linear predictor and takes the usual form: T hi = b0 + b1xi1 + ::: + bkxik = xi b , (1.6) T T where, as before, xi = [1 xi1 xi2 ::: xik] and b = [b0 b1 ::: bk] . Finally, the link function connects the random component and the systematic component by stating that T g(mi) = xi b , (1.7) where mi = E(Yi) and g(·) is the link function.

A Simulation Study of Goodness-Of-Fit Tests for Binary Regression with Applications to Norwegian Intensive Care Registry Data

On Assessing Binary Regression Models Based on Ungrouped Data

05 36534Nys130620 31

The Effects of Simplifying Assumptions in Power Analysis

Introduction to Hypothesis Testing

Power of a Statistical Test

Confidence Intervals and Hypothesis Tests

A Weakly Informative Default Prior Distribution for Logistic and Other

Understanding Statistical Hypothesis Testing: the Logic of Statistical Inference

Measures of Fit for Logistic Regression Paul D

Post Hoc Power: Tables and Commentary

BUGS Example 1: Linear Regression Length 1.8 2.0 2.2 2.4 2.6

STAT 141 11/02/04 POWER and SAMPLE SIZE Rejection & Acceptance Regions Type I and Type II Errors (S&W Sec 7.8) Power