
22S:152 Applied Linear Regression
Ch. 14 (sec. 1) and Ch. 15 (sec. 1 & 4): Logistic Regression

Logistic Regression

• Used when the response variable is binary.

• Unlike OLS regression, logistic regression does not assume:
  – linearity between the independent variables and the dependent variable
  – normally distributed errors
  – homoscedasticity

• It does assume:
  – that we have independent observations
  – that the independent variables are linearly related to the logit of the dependent variable (somewhat difficult to check)

• We model the log of the odds for Y_i = 1:

\[
\ln\left(\frac{P(Y_i = 1)}{1 - P(Y_i = 1)}\right) = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki},
\]

which can be converted to

\[
P(Y_i = 1) = \frac{\exp(\beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki})}{1 + \exp(\beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki})} = \frac{1}{1 + \exp[-(\beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki})]}.
\]

• Maximum likelihood estimation (MLE) is used to calculate the regression coefficient estimates:
  – Ordinary Least Squares (OLS) minimizes the sum of the squared residuals.
  – MLE finds the parameter estimates that maximize the log-likelihood function.

Significance testing

• Testing individual coefficients (H_0: \beta_j = 0):
  – Wald tests (i.e. Z-tests), based on the asymptotic normality of the \hat{\beta}_j's, are provided in the summary output from R.

• Testing Full vs. Reduced (nested) models:
  – Likelihood ratio tests, which are chi-squared (\chi^2) tests.
  – We can use the anova() function in R to do these likelihood ratio nested tests, with the option test="Chisq":
    anova(lm.red, lm.full, test="Chisq")
  – The Global Null model (the simplest possible model) has only an intercept and is the model
    logit(\pi_i) = c for all i, for some constant c
    [i.e. the covariates don't affect \pi_i = P(Y_i = 1)].
  – Testing the Full model vs. the Global Null model has the flavor of an overall F-test from multiple regression (i.e. are any of the variables in the model useful?).

Example: Incidence of bird on islands

• Dichotomous response called incidence:

\[
\text{incidence} = \begin{cases} 1 & \text{if the island is occupied by the bird} \\ 0 & \text{if the bird did not breed there} \end{cases}
\]

• Two continuous predictor variables:
  area — area of the island in km^2
  isolation — distance from the mainland in km

> attach(iso.data)
> head(iso.data)
  incidence  area isolation
1         1 7.928     3.317
2         0 1.925     7.554
3         1 2.045     5.883
4         0 4.781     5.932
5         0 1.536     5.308
6         1 7.369     4.934

We expect incidence to be lower for high isolation, and incidence to be higher for high area.

• Look at a (jittered) scatterplot of each covariate vs. the response:

> plot(jitter(incidence, factor=.25) ~ area, ylab="incidence (jittered)")
> plot(jitter(incidence, factor=.25) ~ isolation, ylab="incidence (jittered)")

[Figure: jittered scatterplots of incidence vs. area and of incidence vs. isolation]

Fit the full additive model (2 covariates):

> n = nrow(iso.data)
> n
[1] 50

> glm.out.full = glm(incidence ~ area + isolation, family=binomial(logit))
> summary(glm.out.full)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   6.6417     2.9218   2.273  0.02302 *
area          0.5807     0.2478   2.344  0.01909 *
isolation    -1.3719     0.4769  -2.877  0.00401 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 68.029  on 49  degrees of freedom
Residual deviance: 28.402  on 47  degrees of freedom
AIC: 34.402

Number of Fisher Scoring iterations: 6

Deviance (or residual deviance)

– This is used to assess the model fit.
– In logistic regression, the deviance has the flavor of RSS in ordinary regression.
– The smaller the deviance, the better the fit.

• Null deviance — like RSS in ordinary regression when only an overall mean is fit (see R output: n = 50, and df are 49).
• Residual deviance — like RSS from the full model fit (see R output: n = 50, and df are 47).
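To make the fitted equation concrete, here is a minimal sketch of turning the linear predictor into a probability by hand. It assumes the glm.out.full fit above; the covariate values (area = 3, isolation = 6) are made up for illustration.

# Compute P(Y=1) by hand from the fitted coefficients.
# Assumes glm.out.full from the fit above; the island with
# area = 3 and isolation = 6 is hypothetical.
b   <- coef(glm.out.full)          # (Intercept), area, isolation
eta <- b[1] + b[2]*3 + b[3]*6      # linear predictor (the log-odds)
p   <- exp(eta) / (1 + exp(eta))   # inverse logit; roughly 0.54 here
p

# The same probability straight from R:
predict(glm.out.full, newdata = data.frame(area = 3, isolation = 6),
        type = "response")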
A comparison of the Null deviance and the Residual deviance is used to test the global null hypothesis. In this case,

H_0: \beta_1 = \beta_2 = 0
H_A: at least one \beta_j is not equal to 0.

A likelihood ratio test is used for this nested test, which follows a central \chi^2_2 distribution under H_0. In general, \chi^2_q is a chi-squared distribution with q degrees of freedom, where q is the number of restrictions being made in H_0 (or the number of covariates in the full model, if doing the global null hypothesis test).

\[
\begin{aligned}
\chi^2_q &= -2\,[\log\text{likelihood}_{red} - \log\text{likelihood}_{full}] \\
         &= (-2LL_{red}) - (-2LL_{full}) \\
         &= (2LL_{saturated} - 2LL_{red}) - (2LL_{saturated} - 2LL_{full}) \\
         &= \text{Reduced model deviance} - \text{Full model deviance} \\
         &= \text{Null deviance} - \text{Residual deviance}
\end{aligned}
\]

Global null hypothesis test for incidence data, from the summary output:

    Null deviance: 68.029  on 49  degrees of freedom
Residual deviance: 28.402  on 47  degrees of freedom

\chi^2 test:

> chi.sq = 68.029 - 28.402
> pchisq(chi.sq, 2, lower.tail=FALSE)
[1] 2.483741e-09

⇒ Reject H_0.

This can also be done using the full vs. reduced likelihood ratio test (use test="Chisq"):

> glm.out.null = glm(incidence ~ 1, family=binomial(logit))
> anova(glm.out.null, glm.out.full, test="Chisq")
Analysis of Deviance Table
Model 1: incidence ~ 1
Model 2: incidence ~ area + isolation
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1        49     68.029
2        47     28.402  2   39.627 2.484e-09

The deviance is saved in the model fit output, and it can be requested:

> chi.sq = glm.out.null$deviance - glm.out.full$deviance
> pchisq(chi.sq, 2, lower.tail=FALSE)
[1] 2.483693e-09

Same p-value as in the anova() output above (up to rounding).

Individual tests for coefficients

After rejecting the global null hypothesis, we can consider individual Z-tests for the predictors.

> summary(glm.out.full)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   6.6417     2.9218   2.273  0.02302 *
area          0.5807     0.2478   2.344  0.01909 *
isolation    -1.3719     0.4769  -2.877  0.00401 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

isolation is a significant predictor, given we've already accounted for area. Likewise, area is a significant predictor, given we've already accounted for isolation.

Another way to view the data:

> plot(area, isolation, pch=(incidence+1), col=(incidence+1))
> title("Each point represents an island")
> legend(6, 9, c("incidence 0", "incidence 1"), col=c(1,2), pch=c(1,2))

[Figure: scatterplot of isolation vs. area, with each island plotted by its incidence (0 or 1)]

Interpretation of the parameters:

\hat{\beta}_{area} = 0.5807, or e^{\hat{\beta}_{area}} = 1.7873.

Holding isolation constant, a 1 km^2 increase in area is associated with an increase in the odds of seeing a bird by a factor of 1.7873. In general, e^{\hat{\beta}_{area}} represents the multiplicative effect (applied to the odds) of a 1-unit change in area: increasing the area of an island by 1 km^2 increases the odds of seeing a bird by a multiplicative factor of 1.7873, i.e. it increases the odds by 78.73%.

Another way to express it:

\[
\text{Odds}_{(x_1 + 1)\,\text{km}^2} = 1.7873 \times \text{Odds}_{(x_1)\,\text{km}^2}
\]

\hat{\beta}_{isolation} = -1.3719, or e^{\hat{\beta}_{isolation}} = 0.2536.

Holding area constant, a 1 km increase in isolation is associated with a decrease in the odds of seeing a bird by a factor of 0.2536:

\[
\text{Odds}_{(x_2 + 1)\,\text{km}} = 0.2536 \times \text{Odds}_{(x_2)\,\text{km}}
\]
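These odds-ratio calculations can be reproduced directly from the fit; here is a minimal sketch assuming the glm.out.full object from above. (confint.default() gives Wald intervals; the profile-likelihood confint() would also work.)

# Odds ratios: exponentiate the fitted coefficients.
exp(coef(glm.out.full))
# area      ~ 1.7873  (the odds rise 78.73% per extra km^2)
# isolation ~ 0.2536  (the odds fall about 75% per extra km)

# Wald confidence intervals, moved to the odds-ratio scale:
exp(confint.default(glm.out.full))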
Diagnostics: Outliers, Influential data

The car library diagnostics can also be used on generalized linear models: rstudent(), hatvalues(), cooks.distance(), vif(), outlierTest(), and avPlot().

> plot(cooks.distance(glm.out.full), pch=16)
> identify(1:n, cooks.distance(glm.out.full))

[Figure: Cook's distance vs. observation index; observations 19, 47, and 4 stand out as the most influential]

> vif(glm.out.full)
     area isolation
 1.040897  1.040897

> outlierTest(glm.out.full)
No Studentized residuals with Bonferonni p < 0.05
Largest |rstudent|:
   rstudent unadjusted p-value Bonferonni p
19 2.250205           0.024436           NA

> avPlot(glm.out.full, "isolation")
> avPlot(glm.out.full, "area")

[Figure: added-variable plots of incidence | others vs. isolation | others and vs. area | others]

Diagnostics: Goodness of fit

Since the responses are all 0's and 1's, it is more difficult to check how well the model fits our data (compared to ordinary regression). If you have data points fairly evenly spread across your x-values, you could try to check the fit using the Hosmer-Lemeshow Goodness of Fit Test.

We will return to our original example data on lead levels in children's blood relative to soil lead levels to show how this test works.

The fitted value is a probability (or \hat{p}), and the logistic regression provides a \hat{p} for every x-value. To check the fit, we will partition the observations into 10 groups based on the x-values:

> break.points = quantile(soil, seq(0, 1, 0.1))
> group.soil = cut(soil, breaks=break.points)
> table(group.soil)
group.soil
  (40,89.4]  (89.4,151]   (151,239]   (239,361]   (361,527]
         11          14          14          14          14
  (527,750]   (750,891]  (891,1330] (1330,1780] (1780,5260]
         13          14          14          14          14

For each group g, we will estimate a \hat{p}_g. This \hat{p}_g does not consider the other fitted \hat{p} values (unlike the logistic regression fitted values, which all fall along a smooth curve).

> group.est.p = tapply(highbld, group.soil, mean)
> group.est.p = as.vector(group.est.p)
> round(group.est.p, 4)
 [1] 0.2727 0.1429 0.2143 0.0714 0.5714 0.7692 0.6429 0.7143 0.9286 1.0000

Now we compare these 'freely fit' estimated probabilities with the logistic regression fit.

[Figure: highbld vs. soil, showing the grouped \hat{p}'s, the logistic regression \hat{p} curve, and 95% CIs for the grouped \hat{p}'s; the vertical lines represent the grouping structure of the observations]

If the dots fall close to the fitted logistic curve, it's a reasonably good fit.
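For reference, here is one way the Hosmer-Lemeshow statistic itself could be computed. This is a hand-rolled sketch, not the course's code: the helper hl.test() is ours, it assumes a fitted binomial glm (e.g. the blood-lead fit, with 0/1 response highbld), and it groups observations on the fitted probabilities rather than on soil directly. The ResourceSelection package's hoslem.test() offers a packaged alternative.

# Sketch of a Hosmer-Lemeshow goodness-of-fit test (hypothetical helper).
# fit: a fitted binomial glm; y: the 0/1 response; g: number of groups.
hl.test <- function(fit, y, g = 10) {
  p.hat <- fitted(fit)                       # model-based probabilities
  grp   <- cut(p.hat,
               breaks = quantile(p.hat, seq(0, 1, 1/g)),
               include.lowest = TRUE)        # g groups by fitted p-hat
  obs   <- tapply(y, grp, sum)               # observed successes per group
  expd  <- tapply(p.hat, grp, sum)           # expected successes per group
  n.g   <- tapply(y, grp, length)            # group sizes
  # HL statistic: sum over groups of (O - E)^2 / [n * pbar * (1 - pbar)]
  chi.sq <- sum((obs - expd)^2 / (expd * (1 - expd / n.g)))
  pchisq(chi.sq, df = g - 2, lower.tail = FALSE)   # approximate p-value
}

# Usage (assuming glm.out is the blood-lead fit): hl.test(glm.out, highbld)
# A small p-value suggests lack of fit.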