SUGI 26: Getting Started with PROC LOGISTIC
Total Page:16
File Type:pdf, Size:1020Kb
Statistics, Data Analysis, and Data Mining Paper 248-26 GETTING STARTED WITH PROC LOGISTIC Andrew H. Karp Sierra Information Services, Inc. Sonoma, California USA Introduction The logistic function is used to estimate, as a function of unit changes in the independent variable, Logistic Regression is an increasingly popular the probability that the event of interest will occur. analytic tool. Used to predict the probability that the This function is often called the link function in that it 'event of interest' will occur as a linear function of connects, or 'links' changes in values of the one (or more) continuous and/or dichotomous independent variables to increasing (or decreasing) independent variables, this technique is implemented probability of occurrence of the event being modeled in the SAS® System in PROC LOGISTIC. This by the dependent variable. paper gives an overview of how some common forms of logistic regression models can be Implementation in the SAS System implemented using PROC LOGISTIC as well as important changes and enhancements to the Techniques for implementing logistic regression are procedure in Releases 6.07 and above of the SAS® found in PROC LOGISTIC in the STAT module of System, as well as new features available in Version SAS System software, and is one of several 8. procedures in this module which can be used for categorical data analysis. Other procedures for Background categorical data analysis in the STAT module include: Logistic regression is commonly used to obtain predicted probabilities that a unit of the population • FREQ under analysis will acquire the event of interest as a • GENMOD linear function of one or more: • CATMOD • PROBIT • continuous-level variables • PHREG • dichotomous (binary) variables • LIFETEST • or, a combination of both continuous and • PROBIT binary independent variables. Data Preparation Many concepts in logistic regression will be familiar to people who already have experience with simple As with other forms of data analysis, the results of a and multiple regression models. In fact, much of the logistic regression analysis performed by PROC syntax in PROC LOGISTIC will be familiar to SAS LOGISTIC can be seriously compromised if the System users already experienced with using analyst does not take care to prepare their data PROCs REG and/or GLM. properly. Following important rules regarding construction and coding of the dependent variable In logistic regression, however, the dependent are critical to the SAS System's generation of variable is dichotomous and is usually coded as: accurate results. • zero (event did not occur) Dependent Variable • one (event did occur) Your dependent variable should be: for each particular subject in the data set upon which the analysis will be carried out. • dichotomous • coded zero for 'non-event' Statistics, Data Analysis, and Data Mining • coded one for 'event' • use the DESCENDING option in the PROC LOGISTIC statement. Although polytomous (or multinomial logistic) regression models (those with three or more levels, This DESCENDING option, new in Release 6.07, is or categories, of the dependent variable) can be probably the easiest and most straightforward implemented in the SAS System, discussion of these method by which to override the SAS System types of models, and how they are implemented in default, as it avoids potentially unnecessary work in a the SAS System, is beyond the scope of this paper DATA Step before applying PROC LOGISTIC. and will not be considered here. SAS System implementation of these types of models is Implementing a Logistic Regression Analysis discussed in the following SAS Institute publications: The structure and syntax of many features in PROC • Logistic Regression Examples Using the LOGISTIC are similar to those used in PROCs REG SAS System (1995) and GLM, which facilitates comparison of how to • SAS Technical Report R-109: Conjoint perform a logistic regression analysis with linear Analysis Examples (1993) models such as regression and analysis of variance. Effect of Coding Dependent Variable on how The important difference, for our purposes, between PROC LOGISTIC Works what is being estimated by a logistic regression model and that estimated by a linear model is: The zero/one coding scheme is the most commonly employed method by which events/non-events are • linear regression attempts to predict the classified for the purposes of conducting a logistic value of the dependent variable as a regression analysis. By default, however, PROC linear function of one (or more) LOGISTIC will attempt to model (that is, predict the independent variables probability of) the lower of the two values of the • logistic regression attempts to predict dependent variable. which is usually not the desired the probability that a unit under analysis result. will acquire the event of interest as a function of one or more independent For example, if a researcher were attempting to variables. determine the probability that a patient will die, the variable representing "outcome" (i.e., "dead” or Put another way, the logistic regression equation “alive") might be coded zero for patients who predicts the probability that the unit under analysis survived and one for patients who died. Since zero will, as a function of one or more independent "sorts lower" than one, PROC LOGISTIC will attempt variables, obtain the condition of interest which is to ‘model’ (that is, predict) the probability that the (usually) coded as 1 in a zero/one coding scheme. patient will be coded zero (lived) rather than the probability that the patient will be coded one (died). The general form of PROC LOGISTIC is: This is most likely the opposite of what the researcher desired. PROC LOGISTIC DATA=dsn [DESCENDING] ; MODEL depvar = indepvar(s)/options; Overriding SAS System Defaults RUN; Users can override the default attempt by PROC LOGISTIC to predict the probability of the non-event Interpretation of SAS System-Generated Results using one of three approaches: • re-coding the dependent variable Tests of the Global Null Hypothesis (e.g., 0 = ‘died’, 1 = ‘lived’) The default output generated by PROC LOGISTIC • creating and applying a FORMAT to the looks very similar to that generated by PROCs REG dependent variable where the formatted and/or GLM. This output includes several tests of value of the ‘event’ group ‘sorts higher’ overall model adequacy which test the global null hypothesis that none of the independent variables in than the ‘non-event’ group (i.e., the external representation of 0 = ‘Alive’ the model are related to changes in probability of and 1 = ‘Dead’) event occurrence. Of these, the -2 LOG L test is perhaps the most easy to interpret and is analogous Statistics, Data Analysis, and Data Mining to the “Global F” test used in a linear regression emergency room is about 9 times more likely to die analysis. The computation of and rationale for the -2 than a patient admitted from another service. LOG L test, among others, is found in Hosmer and Lemeshow (1989). Other global tests, such as the b) continuous level independent variable SCORE, Akaike Information Criterion, and Consider another analytic situation where the event Schwartz Bayesian Criterion are also provided but of interest to be predicted is patient’s survival after are beyond the scope of this paper. admission to a hospital intensive care unit (ICU), and the independent variable is age of patient in years. Tests of the Local Null Hypotheses Application of PROC LOGISTIC to the Hosmer and Tests of the ‘statistical significance’ of each Lemeshow data set yielded a parameter estimated independent variable are also provided. The Wald for the variable AGE as 0.0275; exponentiation of Chi-Square test (and its associated p-value) are that estimate gives an odds ratio of 1.028. In this printed along with the parameter estimate and example, a one unit (that is, one year) increase in a standardized parameter estimate. As with linear patient’s age increases by 2.8 percent the chance regression analysis, the parameter estimate can be they will die (i.e., acquire the event of interest). [Of conceptualized as how much mathematical impact a course, while this result may be ‘statistically unit changes in the value of the independent variable significant’, the clinical relevance to a health care has on increasing or decreasing the probability that provider of patient age, without regard to other the dependent variable will achieve the value of one prognostic factors (such as disease severity) may in the population from which the data are assumed limit the practical usefulness of the results.] to have been randomly sampled. Customized Odds Ratios The Odds Ratio As with the previous example, a unit change in the Exponentiation of the parameter estimate(s) for the values of the independent variable(s) may not be independent variable(s) in the model by the number substantively relevant or useful to the analyst. In the e (about 2.17) yields the odds ratio, which is a more ICU survival study, a five, ten, or twenty year change intuitive and easily understood way to capture the in patient age may be of more clinical relevance than relationship between the independent and dependent a change of just one year. Customized odds ratios variables. This quantity is automatically portrayed in can be obtained by: PROC LOGISTIC starting in Release 6.07; users • with earlier SAS System releases can easily hand, using a calculator • compute this quantity by hand. placing the parameter estimates generated by PROC LOGISTIC into an The odds ratio gives the increase or decrease in output SAS data set using the OUTEST probability that a unit change in the independent option and then working in the data step variable has in the probability that the event of using the UNITS option, which is available in interest will occur. Two analytic scenarios will be Release 6.10 and above.