Hierarchical Logistic Regression Modeling with SAS GLIMMIX Jian Dai, Zhongmin Li, David Rocke University of California, Davis, CA

ABSTRACT

Data often have hierarchical or clustered structures, such as patients clustered within hospitals or students nested within schools. Hierarchical models are statistical models that are used to analyze hierarchical or multilevel data. The SAS GLIMMIX procedure is a new and highly useful tool for hierarchical modeling with discrete responses. This paper focuses on hierarchical logistic regression modeling with GLIMMIX. We present several applications of these models and show how to use GLIMMIX to fit the models and test hypotheses. We illustrate the applications using sample data from a multi-institution database on coronary artery bypass grafting surgeries developed by the California Office of Statewide Health Planning and Development.

INTRODUCTION

In many applications, data have hierarchical or clustered structures, such as medical and health services research where patients are clustered within hospitals, or educational studies where students are nested within schools. These studies often involve the analysis of data with complex patterns of variability, such as multilevel, nested sources of variability. For example, in the quality analysis of healthcare providers, there is variability between the providers as well as between patients who are nested within the providers. Hierarchical models are statistical models that can be used to analyze nested sources of variability in hierarchical data, taking account of the variability associated with each level of the hierarchy. These models have also been referred to as multilevel models, mixed models, random coefficient models, and covariance component models (Breslow and Clayton, 1993; Longford, 1993; Snijders and Bosker, 1999; Hox, 2002; Goldstein, 2003).

In applications, the outcome variable is often binary.
For example, the outcome of a medical treatment might be a success or failure; the patient might have survived or died after a surgery; a tissue sample might be normal or cancerous. Binary outcomes lead to a generalized linear model with the logit link, which is the logistic regression model. In this paper we focus on hierarchical logistic regression models, which can be fitted using the new SAS procedure GLIMMIX (SAS Institute, 2005). Proc GLIMMIX is based on the GLIMMIX macro (Littell et al., 1996) and provides highly useful tools for fitting generalized linear mixed models, of which the hierarchical logistic model is a special case.

We will show how to use GLIMMIX to fit hierarchical logistic models. We will show that hierarchical models and mixed models are equivalent: they are two ways to express the same relationships. We will introduce random intercept models and demonstrate their usefulness using examples. We use GLIMMIX to fit the models and test hypotheses. The focus of the paper is on two-level hierarchical models, but it is straightforward to fit models with more levels.

The rest of the paper is organized as follows. We first motivate the application with an example from a public health study, then develop the models and describe the code for fitting them. We then describe the data and present the results, and finally conclude the study.

MOTIVATION

Health care quality analysis is an important area in public health study. One way to assess the quality of care is to analyze between-provider and within-provider variations in healthcare outcomes. This typically involves analysis of hierarchical data at different levels, for example, the patient level, the physician level, and the hospital level. In the simplest case, two levels of analysis are needed: the patient level and a single provider level. The patient level analysis is needed because the outcome measures are meaningful only when adequately adjusted for patient risks.
At the provider level, the risk due to the provider is analyzed; the differences between providers are compared and evaluated. It is clear that in such applications the data structures are hierarchical and the analyses are multi-level (Goldstein and Spiegelhalter, 1996; DeLong, 1997; Shahian et al., 2001).

There are different modeling approaches for such analysis. One approach is to proceed in stages, one per level of analysis: model the data at the lowest level, then aggregate the modeling results and perform the analysis on higher level units. For example, one may begin with patient-level data, estimate a patient-level risk model, compute the expected mortality of each provider by summing the risks of its patients, and then compare the expected mortality to the observed mortality of the provider.

An alternative and less approximate approach is to model the hierarchical data in one step. Hierarchical models take account of the variability at each level of the hierarchy and thus allow the provider effects to be analyzed within the models. Hierarchical models also have other advantages (Normand et al., 1997; Shahian et al., 2001). For example, they can account for clustering of observations, such as patients nested within hospitals, while traditional regression models assume independence of observations. Hierarchical models assume that higher level units are drawn from a population of units and produce posterior or predicted estimates of unit effects. These estimates are “shrunken” estimates, which have the useful property of moving higher level unit estimates towards the population mean (regression to the mean) and increasing the accuracy of prediction (Goldstein and Spiegelhalter, 1996).

One area in healthcare that has received much attention is quality assessment for coronary artery bypass grafting (CABG) surgeries.
Information on how well hospitals perform this surgery is critical for hospital quality improvement efforts and for helping patients and their families make informed decisions about where to receive the best care. For this purpose, many states collect data and publish reports on CABG surgeries. In this paper, we use sample data from the California CABG 2000-2002 database (Parker et al., 2005) to illustrate the application of hierarchical logistic regression models. We demonstrate the application with two examples: (1) assessing differences in hospital effects on in-hospital mortality; and (2) evaluating the effects of hospital teaching status on mortality. The purpose of the application is to illustrate the methods, not to perform policy analysis.

MODELS

We begin with the ordinary logistic regression model, which is a single level model but provides a starting point for developing multilevel models for binary outcomes. We then present the random intercept models, which have many applications in public health and other studies. For simplicity of presentation, we consider two-level models, for example, models accounting for patient-level and hospital-level effects. The two-level data structure is shown in Figure 1 below:

Figure 1. Data Structure for a Two-Level Hierarchical Model

In this data structure, level 1 is the patient level and level 2 is the hospital level. Each level-2 unit (hospital) $j$ contains $n_j$ patients. We further simplify the presentation by assuming there is a single patient-level predictor $x$ (e.g. a risk index) and one hospital-level factor $z$ (e.g. size or type of hospital). The models can be easily generalized to handle more complicated data structures (Hox, 2002; Goldstein, 2003; Demidenko, 2004).

ORDINARY LOGISTIC REGRESSION MODEL

Suppose that $y$ is a binary outcome variable (e.g. the patient survived or died after a surgery) that follows the Bernoulli distribution, $y \sim \mathrm{Bin}(1, \pi)$, and $x$ is a patient-level predictor.
Then, the ordinary logistic regression model (Hosmer and Lemeshow, 2000) is

$$y_{ij} = \pi_{ij} + e_{ij}, \qquad \mathrm{logit}(\pi_{ij}) = \log\left(\frac{\pi_{ij}}{1 - \pi_{ij}}\right) = \alpha + \beta x_{ij} \tag{1}$$

where $i = 1, \dots, I_j$ is the patient level indicator, $j = 1, \dots, J$ is the hospital level indicator, and $\pi_{ij}$ is the probability of death for patient $i$ in hospital $j$, conditional on the risk factor $x$. The logit model assumes that the patient level random errors $e_{ij}$ are independent with moments $E(e_{ij}) = 0$ and $\mathrm{Var}(e_{ij}) = \sigma_e^2 = \pi_{ij}(1 - \pi_{ij})$. The logit model is linear on the logit (log odds) scale. Equation (1) implies that the probability function is

$$\pi_{ij} = \frac{\exp(\alpha + \beta x_{ij})}{1 + \exp(\alpha + \beta x_{ij})} \tag{2}$$

We have used two subscripts $(i, j)$ to reflect the fact that the data have two levels and the patients are nested within hospitals. However, this model is a single-level model because it does not contain hospital level effects. Nor does it account for the variation between hospitals and the clustering of patients within hospitals.

HIERARCHICAL LOGISTIC REGRESSION MODEL

There are several ways to extend the single-level model to multilevel analysis. A simple way to account for effects of higher-level units is to add design variables (dummy variables) to Equation (1) so that each higher-level unit (in this case, each hospital) has its own intercept in the model. These hospital intercepts (subject-specific intercepts) are used to measure the differences between hospitals:

$$\mathrm{logit}(\pi_{ij}) = \alpha_j + \beta x_{ij} \tag{3}$$

Equation (3) is such a model specified on the logit scale, where each hospital $j$ has its own intercept $\alpha_j$. The intercepts can be specified as either fixed effects or random effects (Demidenko, 2004). The use of fixed intercepts, however, adds $J - 1$ parameters to the model, that is, the number of higher-level units minus 1. Thus, if the number of hospitals is large, the model carries a large number of nuisance parameters, which can lead to very poor estimation results.
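As a numerical illustration of Equations (1) to (3), the following Python sketch converts a linear predictor on the logit scale into a probability of death for one hypothetical patient treated at three hypothetical hospitals. The coefficient values, hospital labels, and risk index are assumptions made up for this example; they are not estimates from the California CABG data, and the paper itself fits the models with SAS rather than Python.

```python
import math


def logit(p):
    """Log odds of a probability p, the left-hand side of Equation (1)."""
    return math.log(p / (1.0 - p))


def inv_logit(eta):
    """Probability implied by a linear predictor eta, as in Equation (2)."""
    return math.exp(eta) / (1.0 + math.exp(eta))


# Hypothetical values for illustration only (not estimates from real data).
beta = 0.8                                              # slope of the risk index x
alpha_by_hospital = {"A": -4.0, "B": -3.5, "C": -3.0}   # one alpha_j per hospital, Equation (3)
x = 1.5                                                 # risk index of one hypothetical patient

for hospital, alpha_j in alpha_by_hospital.items():
    pi = inv_logit(alpha_j + beta * x)    # probability of death for this patient
    var_e = pi * (1.0 - pi)               # Var(e_ij) = pi_ij * (1 - pi_ij)
    print(f"hospital {hospital}: pi = {pi:.4f}, Var(e) = {var_e:.4f}")

# With fixed intercepts, J hospitals add J - 1 parameters
# beyond the single-intercept model of Equation (1).
extra_parameters = len(alpha_by_hospital) - 1
```

The same patient has a different predicted probability at each hospital only because the intercepts differ; the slope is shared. With hundreds of hospitals, the dictionary of intercepts would grow accordingly, which is the nuisance-parameter problem described above.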
A more sophisticated approach is to treat the hospital intercepts, $\alpha_j$ ($j = 1, \dots, J$), as random variables with a specified probability distribution, which leads to a random intercept model and more conservative estimates of hospital effects:

$$\mathrm{logit}(\pi_{ij}) = \alpha_j + \beta x_{ij}, \qquad \alpha_j = \alpha + u_j \tag{4}$$

In this model, the hospital effects are measured by the random intercepts $\alpha_j$ ($j = 1, \dots, J$), each the sum of a grand mean ($\alpha$) and a deviation ($u_j$) from that mean.
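To make the data-generating process behind Equation (4) concrete, the Python sketch below simulates patients nested within hospitals. It assumes that the deviations $u_j$ are drawn from a normal distribution with mean 0 and standard deviation `sigma_u`; the text above only requires a specified probability distribution, so normality is an assumption of this example. The function name `simulate`, the parameter values, and the standard normal risk index are likewise hypothetical.

```python
import math
import random


def inv_logit(eta):
    """Probability implied by a linear predictor on the logit scale."""
    return 1.0 / (1.0 + math.exp(-eta))


def simulate(n_hospitals, n_patients, alpha, beta, sigma_u, seed=0):
    """Simulate (hospital, x, y) rows from the random intercept model, Equation (4).

    Assumption: u_j ~ Normal(0, sigma_u^2) and x_ij ~ Normal(0, 1).
    """
    rng = random.Random(seed)
    rows = []
    for j in range(n_hospitals):
        u_j = rng.gauss(0.0, sigma_u)      # hospital deviation from the grand mean
        alpha_j = alpha + u_j              # random intercept for hospital j
        for _ in range(n_patients):
            x_ij = rng.gauss(0.0, 1.0)     # hypothetical patient risk index
            pi_ij = inv_logit(alpha_j + beta * x_ij)
            y_ij = 1 if rng.random() < pi_ij else 0   # Bernoulli outcome (1 = death)
            rows.append((j, x_ij, y_ij))
    return rows


# Hypothetical settings, for illustration only.
rows = simulate(n_hospitals=20, n_patients=200, alpha=-3.0, beta=0.7, sigma_u=0.5)

# Raw (unadjusted) mortality rate per hospital.
deaths = {}
for j, _, y in rows:
    deaths[j] = deaths.get(j, 0) + y
rates = {j: d / 200 for j, d in deaths.items()}
```

The raw rates in `rates` differ between hospitals for two reasons: the true deviations $u_j$ and binomial sampling noise within each hospital. Separating these two sources of variability is what the random intercept model is designed to do.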
