Logistic Regression Maths and Statistics Help Centre
Many statistical tests require the dependent (response) variable to be continuous, so a different set of tests is needed when the dependent variable is categorical. One of the most commonly used tests for categorical variables is the Chi-squared test, which looks at whether or not there is a relationship between two categorical variables, but it makes no allowance for the potential influence of other explanatory variables on that relationship.

For continuous outcome variables, multiple regression can be used for
a) controlling for other explanatory variables when assessing relationships between a dependent variable and several independent variables
b) predicting outcomes of a dependent variable using a linear combination of explanatory (independent) variables

Logistic regression does the same, but the outcome variable is binary, and it leads to a model which can predict the probability of an event happening for an individual.

The maths:
For multiple regression, a model of the following form can be used to predict the value of a response variable y using the values of a number of explanatory variables:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q$$

where $\beta_0$ is the constant (intercept) and $\beta_1, \dots, \beta_q$ are the coefficients for the $q$ explanatory variables $x_1, \dots, x_q$.

The regression process finds the coefficients which minimise the sum of the squared differences between the observed and expected values of y (the residuals). As the outcome of logistic regression is binary, y needs to be transformed so that the regression process can be used. The logit transformation gives the following:

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q$$

where $p$ is the probability of the event occurring (e.g. a person dies following a heart attack) and $\frac{p}{1-p}$ is the odds of the event.

If probabilities of the event of interest happening for individuals are needed, the logistic regression equation can be written as:

$$p = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q)}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q)}, \quad 0 < p < 1$$
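The logit transformation and its inverse can be sketched in a few lines of Python. This is only an illustration of the two formulas above, not part of the SPSS procedure; the function names `logit`, `inverse_logit` and `linear_predictor` are made up for this example.

```python
import math

def logit(p):
    """Log odds ln(p / (1 - p)) for a probability strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))

def inverse_logit(eta):
    """Probability exp(eta) / (1 + exp(eta)), written to avoid overflow."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def linear_predictor(intercept, coefficients, values):
    """beta_0 + beta_1*x_1 + ... + beta_q*x_q for one individual."""
    return intercept + sum(b * x for b, x in zip(coefficients, values))

# Hypothetical coefficients and values, for illustration only
eta = linear_predictor(0.5, [1.0, -2.0], [1, 0])
p = inverse_logit(eta)
```

Applying `logit` to the result of `inverse_logit` returns the original linear predictor, which is why a model fitted on the log odds scale can always be turned back into a probability between 0 and 1.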
Titanic example:
On April 14th 1912, only 705 passengers and crew out of the 2228 on board the Titanic survived when the ship sank. Information on 1309 of those on board will be used to demonstrate logistic regression. The data can be downloaded from biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls

The key variables of interest are:
Dependent variable: Whether a passenger survived or not (survival is indicated by survived = 1).
Possible explanatory variables: Age, gender (recode so that sex = 1 for females and 0 for males), class (pclass = 1, 2 or 3), number of accompanying parents/children (parch) and number of accompanying siblings/spouses (sibsp).

Most of the variables can be investigated using crosstabulations with the dependent variable 'survived'. Another reason for the crosstabulation is to identify categories with small frequencies, as these can cause problems with the logistic regression procedure. The number of accompanying parents/children (parch) and number of accompanying siblings/spouses (sibsp) were used to create a new binary variable indicating whether the person was travelling alone or with family (1 = travelling with family, 0 = travelling alone).

| Group | % surviving |
| Male | 19.1% |
| Female | 72.7% |
| 1st class | 61.9% |
| 2nd class | 43% |
| 3rd class | 25.5% |
| Travelling alone | 30.3% |
| Travelling with family | 50.3% |

When tested separately, Chi-squared tests concluded that there was evidence of a relationship between survival and each of gender, class and whether an individual was travelling alone. Looking at the percentages surviving, it's clear that women, those in first class and those not travelling alone were much more likely to survive. Logistic regression will be carried out using these three variables first of all. Stage 1 of the following analysis will use logistic regression to control for other variables when assessing relationships, and stage 2 will look at producing a good model to predict from.
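The two recodes described above (sex as 0/1 and the new travelling-with-family indicator) can be sketched in Python as follows. This is an illustration only: the function name `recode_passenger` is hypothetical, and it assumes that sex is stored as the text "male" or "female" and that a passenger counts as travelling with family when parch + sibsp is greater than zero.

```python
def recode_passenger(sex, parch, sibsp):
    """Return the 0/1 codes used in the text for one passenger.

    sex    -- assumed to be the string "male" or "female"
    parch  -- number of accompanying parents/children
    sibsp  -- number of accompanying siblings/spouses
    """
    sex_clean = sex.strip().lower()
    if sex_clean not in ("male", "female"):
        raise ValueError("sex must be 'male' or 'female'")
    return {
        # sex = 1 for females and 0 for males
        "sex": 1 if sex_clean == "female" else 0,
        # 1 = travelling with family, 0 = travelling alone
        "family": 1 if (parch + sibsp) > 0 else 0,
    }

# Hypothetical passengers, for illustration only
lone_man = recode_passenger("male", 0, 0)
mother = recode_passenger("female", 2, 1)
```

Recoding before the analysis keeps the interpretation of the 0/1 values under your control rather than relying on how the software orders the categories.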
Use ANALYZE > Regression > Binary logistic to get the logistic regression screen.

Treatment of categorical explanatory variables
When interpreting SPSS output for logistic regression, it is important that binary variables are coded as 0 and 1. Also, categorical variables with three or more categories need to be recoded as dummy variables with 0/1 outcomes, e.g. class needs to appear as two variables: 1st/not 1st with 1 = yes, and 2nd/not 2nd with 1 = yes. Luckily SPSS does this for you! When adding a categorical variable to the list of covariates, click on the Categorical button and move all categorical variables to the right-hand box.

Where there are more than two categories, the last category is automatically the reference category. This means that all the other categories will be compared to the reference in the output, e.g. 1st and 2nd class will be compared to 3rd class.

The following table in the output shows the coding of these variables.

Categorical Variables Codings

| Variable | Category | Frequency | Parameter coding (1) | Parameter coding (2) |
| class | 1st class | 323 | 1.000 | .000 |
| class | 2nd class | 277 | .000 | 1.000 |
| class | 3rd class | 709 | .000 | .000 |
| Travelling alone | Alone | 790 | 1.000 | |
| Travelling alone | with family | 519 | .000 | |
| Gender | male | 843 | 1.000 | |
| Gender | female | 466 | .000 | |

For class, 3rd class is the reference class, so if 1st class = 0 and 2nd class = 0, the person must have been in 3rd class. Females and those not travelling alone are the references for the other groups.

Interpretation of the output
The output is split into two sections, block 0 and block 1. Block 0 assesses the usefulness of having a null model, which is a model with no explanatory variables. The 'Variables in the Equation' table only includes a constant, so each person has the same chance of survival.

The null model is:

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 = -0.481, \qquad p = \text{probability of survival} = \frac{\exp(-0.481)}{1 + \exp(-0.481)} = 0.382$$

SPSS calculates the probability of survival for each individual using the model for that block.
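The parameter coding in the Categorical Variables Codings table can be mimicked with a small Python sketch. SPSS does this internally; the function name `parameter_coding` and the dictionary keys are invented here purely to show how each person's categories map to 0/1 dummy variables, with the last category as the reference.

```python
def parameter_coding(pclass, travelling_alone, sex):
    """Dummy (0/1) codes matching the codings table in the text.

    pclass           -- 1, 2 or 3 (3rd class is the reference)
    travelling_alone -- True if alone (with family is the reference)
    sex              -- "male" or "female" (female is the reference)
    """
    if pclass not in (1, 2, 3):
        raise ValueError("pclass must be 1, 2 or 3")
    if sex not in ("male", "female"):
        raise ValueError("sex must be 'male' or 'female'")
    return {
        "class(1)": 1 if pclass == 1 else 0,   # 1st / not 1st
        "class(2)": 1 if pclass == 2 else 0,   # 2nd / not 2nd
        "alone(1)": 1 if travelling_alone else 0,
        "gender(1)": 1 if sex == "male" else 0,
    }

# A 3rd class woman travelling with family is the reference on every variable
reference_person = parameter_coding(3, False, "female")
```

A person who is in the reference category for every variable has all dummy variables equal to 0, so their log odds are given by the constant alone.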
If the probability of survival is 0.5 or more, it will predict survival (as survival = 1), and death if the probability is less than 0.5. As more people died than survived, the probability of survival is 0.382 and therefore everyone is predicted as dying (coded as 0). As 61.8% of people were correctly classified, classification from the null model is 61.8% accurate. The addition of explanatory variables should increase the percentage of correct classification significantly if the model is good.

[SPSS output: Block 0: Beginning Block]

[SPSS output: Block 1: Method = Enter]

Block 1 shows the results after the addition of the explanatory variables selected. The 'Omnibus Tests of Model Coefficients' table gives the result of the Likelihood Ratio (LR) test, which indicates whether the inclusion of this block of variables contributes significantly to model fit. A p-value (sig) of less than 0.05 for the block means that the block 1 model is a significant improvement on the block 0 model.

In standard regression, the coefficient of determination ($R^2$) gives an indication of how much variation in y is explained by the model. This cannot be calculated for logistic regression, but the 'Model Summary' table gives two pseudo $R^2$ values which try to measure something similar. From this table, we can conclude that between 31% and 42.1% of the variation in survival can be explained by the model in block 1. The correct classification rate has increased by 16.2 percentage points to 78%.

Finally, the 'Variables in the Equation' table summarises the importance of the explanatory variables individually whilst controlling for the other explanatory variables. The Wald test is similar to the LR test, but here it is used to test the hypothesis that each $\beta_i = 0$. In the sig column, the p-values are all below 0.05 apart from the test for the variable Alone (p = 0.286).
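The 0.5 cut-off rule and the accuracy of the null model can be checked with a short Python sketch. It uses only the null model constant quoted in the text (taken as -0.481, consistent with a survival probability of 0.382); the function name `classify` is made up for this example.

```python
import math

def classify(probability, cutoff=0.5):
    """Predict survival (1) if the probability is at or above the cut-off, else death (0)."""
    return 1 if probability >= cutoff else 0

# Null model: the constant is the only term, so everyone gets the same probability
b0 = -0.481
p_null = math.exp(b0) / (1.0 + math.exp(b0))

# Everyone receives the same prediction under the null model
prediction = classify(p_null)

# If everyone is predicted to die, the proportion correctly classified
# is the proportion who actually died, 1 - p_null
accuracy = p_null if prediction == 1 else 1.0 - p_null

print(round(p_null, 3), prediction, round(accuracy, 3))
```

Because the null model can only ever predict the more common outcome, its accuracy is the benchmark that the block 1 model needs to beat.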
This means that although the Chi-squared test for Survival vs Alone was significant, once the other variables were controlled for, there is not a strong enough relationship between that variable and survival. Class is tested as a whole (pclass) and then 1st and 2nd class are compared to the reference category, 3rd class.

When interpreting the differences, look at the Exp(B) column, which represents the odds ratio for the individual variable. For example, the odds of survival for those in 1st class were 5.493 times the odds for those in 3rd class. With gender, the odds ratio compares the odds of a male surviving with those of a female. The odds are a lot lower for men (0.084 times those of women). For ease of interpretation, calculate the odds of a female surviving relative to a male using 1/0.084 = 11.9. The odds of survival for females were 11.9 times those for males.

The coefficients for the model are contained in the column headed B. A negative value means that the odds of survival decrease, e.g. for males and those travelling alone. The full model being tested is:

$$\ln\left(\frac{p}{1-p}\right) = 0.460 + 1.703\,x_{\text{1st class}} + 0.832\,x_{\text{2nd class}} - 2.474\,x_{\text{male}} - 0.156\,x_{\text{alone}}$$

where $x_{\text{1st class}} = 1$ for 1st class, $x_{\text{2nd class}} = 1$ for 2nd class, $x_{\text{male}} = 1$ for men and $x_{\text{alone}} = 1$ for a person travelling alone.
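The link between the B column, the odds ratios and predicted probabilities can be illustrated with a Python sketch using the coefficients quoted for the full model (taking the constant as +0.460). The names `COEFS`, `odds_ratio` and `survival_probability` are invented for this example, and because the coefficients are rounded to three decimal places the results may differ slightly from the SPSS output.

```python
import math

# Coefficients (column B) as quoted in the text, rounded to 3 decimal places
COEFS = {
    "constant": 0.460,
    "first_class": 1.703,
    "second_class": 0.832,
    "male": -2.474,
    "alone": -0.156,
}

def odds_ratio(name):
    """Exp(B): the odds ratio for a one-unit change in the named variable."""
    return math.exp(COEFS[name])

def survival_probability(first_class, second_class, male, alone):
    """Predicted probability of survival from the 0/1 dummy variables."""
    log_odds = (
        COEFS["constant"]
        + COEFS["first_class"] * first_class
        + COEFS["second_class"] * second_class
        + COEFS["male"] * male
        + COEFS["alone"] * alone
    )
    return 1.0 / (1.0 + math.exp(-log_odds))

# Odds of a female surviving relative to a male: 1 / Exp(B) for male
female_vs_male = 1.0 / odds_ratio("male")

# Two contrasting passengers
p_first_class_woman_with_family = survival_probability(1, 0, 0, 0)
p_third_class_man_alone = survival_probability(0, 0, 1, 1)
```

Comparing the two predicted probabilities shows how the separate odds ratios combine: the model gives a far higher chance of survival to a 1st class woman travelling with family than to a 3rd class man travelling alone.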