22s:152 Applied Linear Regression
Ch. 14 (sec. 1) and Ch. 15 (sec. 1 & 4): Logistic Regression
————————————————————

Logistic Regression

• Used when the response variable is binary

• We model the log of the odds for Yi = 1:

  ln[ P(Yi = 1) / (1 − P(Yi = 1)) ] = β0 + β1X1i + ··· + βkXki

• Which can be converted to

  P(Yi = 1) = exp(β0 + β1X1i + ··· + βkXki) / [1 + exp(β0 + β1X1i + ··· + βkXki)]
            = 1 / (1 + exp[−(β0 + β1X1i + ··· + βkXki)])

• Unlike OLS regression, logistic regression does not assume...
  – linearity between the independent variables and the dependent variable
  – normally distributed errors

• It does assume...
  – we have independent observations
  – that the independent variables be linearly related to the logit of the dependent variable (somewhat difficult to check)

• Maximum likelihood estimation (MLE) is used to calculate the regression coefficient estimates
  – Ordinary least squares (OLS) minimizes the sum of the squared residuals
  – MLE finds the parameter estimates that maximize the log-likelihood
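The probability form is just the inverse logit of the linear predictor. A minimal sketch in R (the coefficient and x values here are made up): plogis() is R's built-in inverse logit and qlogis() is the logit.

> b0=-1; b1=0.5; x=2    ## hypothetical coefficients and covariate value
> p=plogis(b0+b1*x)     ## P(Y=1) = 1/(1+exp(-(b0+b1*x)))
> p
[1] 0.5
> qlogis(p)             ## the logit recovers b0+b1*x = 0
[1] 0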


Significance testing

• Testing individual coefficients (H0 : βj = 0)

  ∗ Wald tests (i.e. Z-tests) based on asymptotic normality of the β̂j's are provided in the summary output from R.

• Testing Full vs. Reduced (nested) models

  ∗ Likelihood ratio tests, which are chi-squared tests (χ² tests).

  ∗ We can use the anova() function in R to do these likelihood ratio nested tests.

  ∗ We will use the option test="Chisq" here, i.e. anova(lm.red, lm.full, test="Chisq")

  ∗ The Global Null model (the simplest possible model) has only an intercept and is the model:

      logit(πi) = c for all i and some constant c

      [i.e. covariates don't affect πi = P(Yi = 1)]

  ∗ Full model vs. Global Null model has the flavor of an overall F-test from multiple regression (i.e. are any of the variables in the model useful).

Example: Incidence of bird on islands

• Dichotomous response called incidence:

  incidence = 1 if island occupied by bird; 0 if bird did not breed there

• Two continuous predictor variables:

  area - area of island in km²
  isolation - distance from mainland in km

> attach(iso.data)
> head(iso.data)
  incidence  area isolation
1         1 7.928     3.317
2         0 1.925     7.554
3         1 2.045     5.883
4         0 4.781     5.932
5         0 1.536     5.308
6         1 7.369     4.934

• Look at a (jittered) scatterplot of each covariate vs. the response.

> plot(jitter(incidence,factor=.25)~area, ylab="incidence (jittered)")
> plot(jitter(incidence,factor=.25)~isolation, ylab="incidence (jittered)")

[Figure: jittered scatterplots of incidence vs. area and incidence vs. isolation]

We expect incidence to be lower for high isolation, and incidence to be higher for high area.


Fit the full additive model (2 covariates):

> n=nrow(iso.data)
> n
[1] 50

> glm.out.full=glm(incidence~area+isolation, family=binomial(logit))
> summary(glm.out.full)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   6.6417     2.9218   2.273  0.02302 *
area          0.5807     0.2478   2.344  0.01909 *
isolation    -1.3719     0.4769  -2.877  0.00401 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 68.029 on 49 degrees of freedom
Residual deviance: 28.402 on 47 degrees of freedom
AIC: 34.402

Number of Fisher Scoring iterations: 6

• Deviance (or residual deviance)
  – This is used to assess the model fit.
  – In logistic regression, the deviance has the flavor of RSS in ordinary regression.
  – The smaller the deviance, the better the fit.

• Null deviance - like RSS in ordinary regression when only an overall mean is fit (see R output: n=50, and df are 49).

• Residual deviance - like RSS from the full model fit (see R output: n=50, and df are 47).
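With these coefficient estimates, a predicted probability follows from the inverse-logit formula. A small sketch (the area and isolation values below are made up for illustration):

> predict(glm.out.full, newdata=data.frame(area=5, isolation=6), type="response")
## equals plogis(6.6417 + 0.5807*5 - 1.3719*6), roughly 0.79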

A comparison of the Null deviance and Residual deviance is used to test the global null hypothesis. In this case...

H0 : β1 = β2 = 0
HA : At least one βj not equal to 0

A likelihood ratio test is used for this nested test, which follows a central χ²_2 distribution under H0 being true.

χ²_q is a chi-squared distribution with q degrees of freedom, and q will be the number of restrictions being made in H0 (or the number of covariates in the full model if doing a global null hypothesis test).

χ²_q = −2[log likelihood_red − log likelihood_full]
     = (−2LL_red) − (−2LL_full)
     = (2LL_saturated − 2LL_red) − (2LL_saturated − 2LL_full)
     = Reduced model deviance − Full model deviance
     = Null deviance − Residual deviance

Global null hypothesis test for incidence data:

From the summary output...

    Null deviance: 68.029 on 49 degrees of freedom
Residual deviance: 28.402 on 47 degrees of freedom

χ²_2 test:

> chi.sq=68.029-28.402
> pchisq(chi.sq,2,lower.tail=FALSE)
[1] 2.483741e-09

This can also be done using the full vs. reduced likelihood ratio test (use test="Chisq"):

> glm.out.null=glm(incidence~1,family=binomial(logit))
> anova(glm.out.null,glm.out.full,test="Chisq")
Analysis of Deviance Table

Model 1: incidence ~ 1
Model 2: incidence ~ area + isolation
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1        49     68.029
2        47     28.402  2   39.627 2.484e-09

⇒ Reject H0.
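The same statistic can also be computed straight from the stored log-likelihoods, matching the identity above. A quick sketch:

> chi.sq = as.numeric(-2*(logLik(glm.out.null) - logLik(glm.out.full)))
> pchisq(chi.sq, 2, lower.tail=FALSE)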

The deviance is saved in the model fit output, and it can be requested...

χ²_2 = Reduced model deviance − Full model deviance

> chi.sq=glm.out.null$deviance-glm.out.full$deviance
> pchisq(chi.sq,2,lower.tail=FALSE)
[1] 2.483693e-09

Same p-value as in the previous output.

Individual tests for coefficients

• After rejecting the Global Null hypothesis, we can consider individual Z-tests for the predictors.

> summary(glm.out.full)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   6.6417     2.9218   2.273  0.02302 *
area          0.5807     0.2478   2.344  0.01909 *
isolation    -1.3719     0.4769  -2.877  0.00401 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

isolation is a significant predictor, given we’ve already accounted for area.

area is a significant predictor, given we’ve already accounted for isolation.
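As a quick check of where these Z-tests come from, each z value is simply the estimate divided by its standard error (numbers taken from the summary output above):

> 0.5807/0.2478            ## z value for area, about 2.344
> 2*pnorm(-0.5807/0.2478)  ## two-sided p-value, about 0.019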

Another way to view the data:

plot(area,isolation,pch=(incidence+1),col=(incidence+1))
title("Each point represents an island")
legend(6,9,c("incidence 0","incidence 1"), col=c(1,2),pch=c(1,2))

[Figure: "Each point represents an island" - isolation vs. area, with symbol and color marking incidence 0 vs. incidence 1]

Interpretation of the parameters:

β̂_area = 0.5807, or e^β̂_area = 1.7873

Holding isolation constant...

A 1 km² increase in area is associated with an increase in the odds of seeing a bird by a factor of 1.7873.

e^β̂_area represents the multiplicative effect (applied to the odds) of a 1-unit change in area. Increasing the area of an island by 1 km² increases the odds of seeing a bird by a multiplicative factor of 1.7873.

It increases the odds by 78.73%.
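These odds ratios can also be pulled directly from the fitted model. A small sketch (confint.default() gives Wald-type intervals, exponentiated here onto the odds-ratio scale):

> exp(coef(glm.out.full))            ## e^beta-hat for each term
> exp(confint.default(glm.out.full)) ## approximate 95% CIs as odds ratios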


Another way to express it,

Odds_(x1+1) km² = 1.7873 × Odds_(x1) km²

————————————————–

β̂_isolation = −1.3719, or e^β̂_isolation = 0.2536

Holding area constant...

A 1 km increase in isolation is associated with a decrease in the odds of seeing a bird by a factor of 0.2536.

Odds_(x2+1) km = 0.2536 × Odds_(x2) km

Diagnostics: Outliers, Influential data

• The car library diagnostics can also be used on generalized linear models... rstudent, hatvalues, cookd, vif, outlier.test, and av.plots.

> plot(cooks.distance(glm.out.full),pch=16)
> identify(1:n,cooks.distance(glm.out.full))

[Figure: index plot of Cook's distances for the fitted model; observations 19, 47, and 4 stand out]
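The other quantities listed above can be examined the same way; a brief sketch using the base-R accessors for a glm fit:

> plot(hatvalues(glm.out.full),pch=16)  ## leverage values
> plot(rstudent(glm.out.full),pch=16)   ## studentized residuals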

> vif(glm.out.full)
     area isolation
 1.040897  1.040897

> outlierTest(glm.out.full)
No Studentized residuals with Bonferonni p < 0.05
Largest |rstudent|:
   rstudent unadjusted p-value Bonferonni p
19 2.250205           0.024436           NA

> avPlot(glm.out.full,"area")

[Figure: Added-variable Plot of incidence | others vs. area | others]

> avPlot(glm.out.full,"isolation")

[Figure: Added-variable Plot of incidence | others vs. isolation | others]


Diagnostics: Goodness of fit

• Since the responses are all 0's and 1's, it is more difficult to check how well the model fits our data (compared to ordinary regression).

If you have data points fairly evenly spread across your x-values, you could try to check the fit using the Hosmer-Lemeshow Goodness of Fit Test.

We will return to our original example data on lead levels in children's blood relative to soil lead levels to show how this test works.

The fitted value is a probability (or p̂). The logistic regression provides a p̂ for every x-value.

To check the fit, we will partition the observations into 10 groups based on the x-values.

> break.points=quantile(soil,seq(0,1,0.1))
> group.soil=cut(soil,breaks=break.points)
> table(group.soil)
group.soil
  (40,89.4]  (89.4,151]   (151,239]   (239,361]   (361,527]
         11          14          14          14          14
  (527,750]   (750,891]  (891,1330] (1330,1780] (1780,5260]
         13          14          14          14          14

For each group g, we will estimate a p̂_g.

This p̂_g does not consider the other fitted p̂ values (unlike the logistic regression fitted values, which all fall along a smooth curve).

> group.est.p=tapply(highbld,group.soil,mean)
> group.est.p=as.vector(group.est.p)
> round(group.est.p,4)
 [1] 0.2727 0.1429 0.2143 0.0714 0.5714 0.7692 0.6429 0.7143 0.9286 1.0000

Now we will compare the 'freely fit' estimated probabilities with the logistic regression (restricted) fitted probabilities.

The vertical lines represent the grouping structure of the observations.

[Figure: highbld vs. soil, with grouped p-hat (points), logistic reg. p-hat (curve), and 95% CI for grouped p-hat (red bars)]

If the dots fall close to the fitted logistic curve, it's a reasonably good fit.

The short red lines represent +/− 2 standard errors of each p̂.


The Hosmer-Lemeshow Test takes these values and tests the goodness of fit using a χ² test.

H0 : Fit is sufficient
HA : Fit is not sufficient (the curve doesn't explain the data very well)

> hosmerlem(highbld,glm.out$fitted,g=10)
       X^2         Df    P(>Chi)
12.2422056  8.0000000  0.1407200

The p-value = 0.1407, so we do not reject the null. The logistic regression model describes the data reasonably well.

One criticism of this test is that it depends on the chosen breakpoints for the groups. See...

Hosmer, D.W., Hosmer, T., le Cessie, S., and Lemeshow, S. (1997). A Comparison of Goodness-of-Fit Tests for the Logistic Regression Model. Statistics in Medicine, vol. 16, 965-980.

The Hosmer-Lemeshow Test function isn't in an R library, but it is shown below. To use it, just copy and paste it into the R interpreter window.

## A function to do the Hosmer-Lemeshow test in R.
## R function is due to Peter D. M. Macdonald, McMaster University.
hosmerlem=function (y, yhat, g = 10) {
  ## y is the original response.
  ## yhat are the fitted values from the model.
  ## g is the number of groups requested.
  cutyhat = cut(yhat, breaks = quantile(yhat, probs = seq(0,1,1/g)),
                include.lowest = T)
  obs = xtabs(cbind(1 - y, y) ~ cutyhat)
  expect = xtabs(cbind(1 - yhat, yhat) ~ cutyhat)
  chisq = sum((obs - expect)^2/expect)
  P = 1 - pchisq(chisq, g - 2)
  c("X^2" = chisq, Df = g - 2, "P(>Chi)" = P)
}

Issues to consider

• Influential observations

• Multicollinearity
  ∗ When you reject H0, but none of the individual regression coefficients are significant
  ∗ High correlation among predictors
  ∗ Same solution as for linear regression

• Nonlinearity in the logit
  ∗ Awkward to check
  ∗ No consensus on how to deal with it

• Poor model fit
  ∗ Can consider a goodness-of-fit test (Hosmer-Lemeshow χ² test)

• High separation or discrimination...

What is high separation?

[Figure: plot of a binary Y vs. x in which the x-values giving Y=0 and the x-values giving Y=1 do not overlap]

If there's no overlap of the x's that lead to a 0 and the x's that lead to a 1.

Depending on the degree of separation, this can cause problems with fitting the model. It inflates the SEs of the regression coefficients.

If there's perfect separation, coefficients cannot be estimated (see the sketch below).
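A tiny made-up illustration of perfect separation (hypothetical data, not from the island example): every x at or below 5 gives a 0 and every x above 5 gives a 1, so the maximum likelihood estimates do not exist and R typically warns that fitted probabilities numerically 0 or 1 occurred.

> x=1:10
> y=as.numeric(x>5)               ## 0's and 1's separate perfectly
> glm(y~x,family=binomial(logit)) ## expect convergence/separation warnings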


## Code for graphic of Hosmer-Lemeshow Test:

## Break the x-axis into 10 groups:
break.points=quantile(soil,seq(0,1,0.1))
group.soil=cut(soil,breaks=break.points)
table(group.soil)

## For each group, get the estimated p-hat:
group.est.p=tapply(highbld,group.soil,mean)
group.est.p=as.vector(group.est.p)

se=as.vector(sqrt(group.est.p*(1-group.est.p)/table(group.soil)))

## Get mean x-values for each group for plotting purposes:
group.x.means=tapply(soil,group.soil,mean)
group.x.means=as.vector(group.x.means)

## Overlay plots:
xvalues=seq(0,5300)
yvalues=predict(glm.out,list(soil=xvalues),type="response")

plot(highbld~soil,type="n")
rug(jitter(soil[highbld==0]))
rug(jitter(soil[highbld==1]),side=3)
lines(xvalues,yvalues,col="blue",lwd=2)
points(group.x.means,group.est.p,cex=2,pch=16)
vlines=c(group.x.means[1]/2,group.x.means[-10]
         +(group.x.means[-1]-group.x.means[-10])/2)
abline(v=vlines)
legend(2700,.6,c("grouped p-hat","logistic reg. p-hat",
                 "95% CI for grouped p-hat"),
       col=c("black","blue","red"),lty=c(-1,1,1),
       pch=c(16,-1,-1),lwd=2)

up=group.est.p+2*se
down=group.est.p-2*se
for (i in 1:10){
  lines(c(group.x.means[i],group.x.means[i]),
        c(up[i],down[i]),col="red",lwd=2)
}
points(group.x.means,group.est.p,cex=2,pch=16)

hosmerlem=function (y, yhat, g = 10) {
  cutyhat = cut(yhat, breaks =
                quantile(yhat, probs = seq(0,1,1/g)), include.lowest = T)
  obs = xtabs(cbind(1 - y, y) ~ cutyhat)
  expect = xtabs(cbind(1 - yhat, yhat) ~ cutyhat)
  chisq = sum((obs - expect)^2/expect)
  P=1-pchisq(chisq,g-2)
  c("X^2" = chisq, Df = g - 2, "P(>Chi)" = P)
}

hosmerlem(highbld, glm.out$fitted, g = 10)
