149 – Spring 2016 – Assignment 4 Solutions
Due Monday, April 4, 2016

1. For the Poisson distribution, $b(\theta) = e^\theta$ and thus $\mu = b'(\theta) = e^\theta$. Consequently, $\theta = g(\mu) = \log(\mu)$ and $b(\theta) = \mu$. Also, for the saturated model, $\mu_i^* = y_i$.

\begin{align*}
D(\mu \mid y) &= 2\sum_{i=1}^{n}\left[ y_i(\theta_i^* - \theta_i) - b(\theta_i^*) + b(\theta_i) \right] \\
&= 2\sum_{i=1}^{n}\left[ y_i\bigl(\log(\mu_i^*) - \log(\mu_i)\bigr) - \mu_i^* + \mu_i \right] \\
&= 2\sum_{i=1}^{n}\left[ y_i\log\!\left(\frac{y_i}{\mu_i}\right) - (y_i - \mu_i) \right]
\end{align*}
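As a quick numerical check of this formula, one can compare it with the residual deviance reported by glm() (a sketch with simulated data; the variable names here are ours, not part of the assignment):

> set.seed(1)
> x <- rnorm(50)
> y <- rpois(50, exp(0.5 + 0.3 * x))
> fit <- glm(y ~ x, family = poisson)
> mu <- fitted(fit)
> # term-by-term deviance, using the convention 0 * log(0) = 0
> 2 * sum(ifelse(y == 0, 0, y * log(y / mu)) - (y - mu))
> deviance(fit)   # should agree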

2. (a) After following the instructions for replacing 0 values with NAs, we summarize the data:

> summary(mypima2)
    pregnant         glucose        diastolic        triceps         insulin      
 Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   : 7.00   Min.   : 14.00  
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:22.00   1st Qu.: 76.25  
 Median : 3.000   Median :117.0   Median : 72.00   Median :29.00   Median :125.00  
 Mean   : 3.845   Mean   :121.7   Mean   : 72.41   Mean   :29.15   Mean   :155.55  
 3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:36.00   3rd Qu.:190.00  
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00   Max.   :846.00  
                  NA's   :5       NA's   :35       NA's   :227     NA's   :374     
      bmi           diabetes           age             test       
 Min.   :18.20   Min.   :0.0780   Min.   :21.00   Min.   :0.000  
 1st Qu.:27.50   1st Qu.:0.2437   1st Qu.:24.00   1st Qu.:0.000  
 Median :32.30   Median :0.3725   Median :29.00   Median :0.000  
 Mean   :32.46   Mean   :0.4719   Mean   :33.24   Mean   :0.349  
 3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00   3rd Qu.:1.000  
 Max.   :67.10   Max.   :2.4200   Max.   :81.00   Max.   :1.000  
 NA's   :11                                                      

We can see that the 0s have successfully been converted to NAs. We still see that test has a mean of 0.349, indicating that 34.9% tested positive, and so on, similar to the previous homework.

(b) First we use the na.convert.mean() function. Then we again summarize the data:

> summary(mypima2.na)
    pregnant         glucose         diastolic        triceps         insulin     
 Min.   : 0.000   Min.   : 44.00   Min.   : 24.00   Min.   : 7.00   Min.   : 14.0  
 1st Qu.: 1.000   1st Qu.: 99.75   1st Qu.: 64.00   1st Qu.:25.00   1st Qu.:121.5  
 Median : 3.000   Median :117.00   Median : 72.20   Median :29.15   Median :155.5  
 Mean   : 3.845   Mean   :121.69   Mean   : 72.41   Mean   :29.15   Mean   :155.5  
 3rd Qu.: 6.000   3rd Qu.:140.25   3rd Qu.: 80.00   3rd Qu.:32.00   3rd Qu.:155.5  
 Max.   :17.000   Max.   :199.00   Max.   :122.00   Max.   :99.00   Max.   :846.0  
      bmi           diabetes           age             test          glucose.na     
 Min.   :18.20   Min.   :0.0780   Min.   :21.00   Min.   :0.000   Min.   :0.00000  
 1st Qu.:27.50   1st Qu.:0.2437   1st Qu.:24.00   1st Qu.:0.000   1st Qu.:0.00000  
 Median :32.40   Median :0.3725   Median :29.00   Median :0.000   Median :0.00000  
 Mean   :32.46   Mean   :0.4719   Mean   :33.24   Mean   :0.349   Mean   :0.00651  
 3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00   3rd Qu.:1.000   3rd Qu.:0.00000  
 Max.   :67.10   Max.   :2.4200   Max.   :81.00   Max.   :1.000   Max.   :1.00000  
  diastolic.na       triceps.na       insulin.na        bmi.na       
 Min.   :0.00000   Min.   :0.0000   Min.   :0.000   Min.   :0.00000  
 1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.00000  
 Median :0.00000   Median :0.0000   Median :0.000   Median :0.00000  
 Mean   :0.04557   Mean   :0.2956   Mean   :0.487   Mean   :0.01432  
 3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.:0.00000  
 Max.   :1.00000   Max.   :1.0000   Max.   :1.000   Max.   :1.00000  

The fraction of missing values for glucose, diastolic, triceps, insulin, and bmi was 0.007, 0.046, 0.296, 0.487, and 0.014, respectively. The min, max, and quartiles are essentially unchanged for diastolic and bmi, which have few missing values. For glucose, triceps, and insulin the mean is unchanged, but the spread is slightly smaller. When using the na.convert.mean() function, we always expect (1) the mean, min, and max of each imputed variable to match those of the observed values, and (2) the spread to shrink, because the mean of the observed values is imputed for every missing value.
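For reference, here is a minimal sketch of the idea behind na.convert.mean(): fill each missing value with the observed mean and add a 0/1 missingness indicator. The helper impute.mean() below is ours (not the course function) and handles one variable at a time:

> impute.mean <- function(df, var) {
+   miss <- is.na(df[[var]])
+   df[[paste0(var, ".na")]] <- as.numeric(miss)      # missingness indicator
+   df[[var]][miss] <- mean(df[[var]], na.rm = TRUE)  # impute the observed mean
+   df
+ }
> tmp <- impute.mean(mypima2, "insulin")   # e.g., a single variable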

(c) We fit the logistic regression of test on all of the variables:

> glm.pima2.na = glm(test ~ ., data = mypima2.na, family = binomial)
> summary(glm.pima2.na)

Call:
glm(formula = test ~ ., family = binomial, data = mypima2.na)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7247  -0.7140  -0.3893   0.7147   2.4596  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -9.3826661  0.8313109 -11.287  < 2e-16 ***
pregnant      0.1244084  0.0325203   3.826  0.00013 ***
glucose       0.0378306  0.0039461   9.587  < 2e-16 ***
diastolic    -0.0104368  0.0087623  -1.191  0.23361    
triceps       0.0040094  0.0134387   0.298  0.76544    
insulin      -0.0006452  0.0011822  -0.546  0.58526    
bmi           0.0959924  0.0180916   5.306 1.12e-07 ***
diabetes      0.9765315  0.3059045   3.192  0.00141 ** 
age           0.0121485  0.0096162   1.263  0.20647    
glucose.na    0.4478017  1.0720840   0.418  0.67617    
diastolic.na  1.0150992  0.4814151   2.109  0.03498 *  
triceps.na   -0.0368346  0.2905943  -0.127  0.89913    
insulin.na    0.3372729  0.2683721   1.257  0.20885    
bmi.na       -0.9070556  0.8720426  -1.040  0.29827    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 993.48  on 767  degrees of freedom
Residual deviance: 704.02  on 754  degrees of freedom
AIC: 732.02

Number of Fisher Scoring iterations: 5

The only significant missing-value indicator is diastolic.na. This provides evidence of a difference in mean response between units that are missing diastolic and those that are not: those missing diastolic are estimated to be about one unit higher on the log-odds scale, adjusting for all other variables in the model. For the remaining variables, there is a lack of evidence of a difference in mean response between units with missing values and those without.
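The diastolic.na coefficient can also be read on the odds scale; a quick calculation from the estimates above (values are approximate):

> exp(1.0151)                              # about 2.8
> exp(1.0151 + c(-1, 1) * 1.96 * 0.4814)   # rough 95% CI, about (1.1, 7.1)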

Comparing the other coefficients of this model to those from Homework 3, pregnant is now significant, diastolic is more significant (but still not significant at the 0.05 level), and age is less significant (and still not significant at the 0.05 level).

Because pregnant is now significant and has increased in magnitude, there is likely a positive relationship between pregnant and a positive test among the units with missing values. Recall that we made an ignorability assumption when we used the na.convert.mean() function (i.e., the missingness depends only on observed values, not on the missing values themselves). This assumption is especially important if one wants to draw inferences about the relationship between pregnant and test. We cannot use the data to test this assumption, so one should think carefully about it before drawing conclusions from the above model.

3. (a) We fit the log-linear model that assumes independence of the four factors.

> summary(glm(Count ~ M + G + P + E, data = div, family = poisson))

            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  4.75654    0.06525  72.901   <2e-16 ***
MMarried     0.09273    0.06220   1.491    0.136    
GWomen       0.63009    0.06525   9.657   <2e-16 ***
PYes        -1.19355    0.07353 -16.231   <2e-16 ***
EYes        -2.02313    0.09673 -20.915   <2e-16 ***

    Null deviance: 1333.85  on 15  degrees of freedom
Residual deviance:  232.14  on 11  degrees of freedom
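As a check on the lack of fit, the residual deviance can be compared to its $\chi^2_{11}$ reference distribution (a quick calculation from the output above):

> pchisq(232.14, df = 11, lower.tail = FALSE)   # essentially 0, so the independence model fits poorly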

> anova(mod.no.E, mod.no.P, mod.no.G, mod.no.M, mod.full, test = "Chi")
Analysis of Deviance Table

Model 1: Count ~ M + G + P
Model 2: Count ~ M + G + E
Model 3: Count ~ M + P + E
Model 4: Count ~ G + P + E
Model 5: Count ~ M + G + P + E
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1        12     921.40                     
2        12     544.43  0   376.98         
3        12     330.08  0   214.35         
4        12     234.36  0    95.71         
5        11     232.14  1     2.22   0.1358

There is some indication that the model does not fit well, as the ratio of the residual deviance to its residual degrees of freedom is far above 1. Both the p-values and the drop-in-deviance tests confirm that M is not a significant predictor, while G, P, and E are significant predictors.

(b) Since we have three significant predictors, G, P, and E, we consider models with 2-way and 3-way interactions. Noting the deviances in Figure 1 (the sketch after the figure shows one way to reproduce them), the best model appears to be the one with the three predictor variables and the 2-way interactions G:P and E:P. This corresponds to G and P being dependent and E and P being dependent, with G and E conditionally independent given P. For instance, GWomen:PYes = -1.25341; this says that women are less likely to have engaged in pre-marital sex than men.

Model                                 Deviance  df
Count ~ G + P + E                       234.36  12
Count ~ G + P + E + G:P                 159.11  11
Count ~ G + P + E + G:E                 221.60  11
Count ~ G + E + P + E:P                 188.35  11
Count ~ G + E + P + G:P + E:P           113.09  10
Count ~ G + E + P + G:P + G:E + E:P     110.18   9
Count ~ G*E*P                           110.18   8

Figure 1: Deviances for divorce models
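A sketch of how the Figure 1 deviances can be reproduced (assuming the data frame div from part (a); mod.final is our name for the chosen model, reused below for the Figure 2 diagnostics):

> forms <- list(Count ~ G + P + E,
+               Count ~ G + P + E + G:P,
+               Count ~ G + P + E + G:E,
+               Count ~ G + P + E + E:P,
+               Count ~ G + P + E + G:P + E:P,
+               Count ~ G + P + E + G:P + G:E + E:P,
+               Count ~ G * P * E)
> sapply(forms, function(f) deviance(glm(f, data = div, family = poisson)))
> mod.final <- glm(Count ~ G + P + E + G:P + E:P, data = div, family = poisson)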

(c) Figure 2 shows the diagnostic plots. Both plots are worrying: there are several deviance residuals larger than 2 in absolute value, and many of the observations are influential (with Cook's distance greater than 1).


Figure 2: Deviance residuals vs fitted values and Cook’s distance
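The Figure 2 diagnostics can be produced along these lines (a sketch using mod.final from above):

> plot(fitted(mod.final), residuals(mod.final, type = "deviance"),
+      xlab = "fitted values", ylab = "deviance residuals",
+      main = "Residuals vs Fitted values")
> plot(cooks.distance(mod.final), type = "h",
+      xlab = "Observation index", ylab = "Cook's distances",
+      main = "Cook's distances")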

4. (a) We fit the model using all possible predictor variables. The residual deviance is 4379.5 on 5177 degrees of freedom; the ratio of residual deviance to degrees of freedom is less than one, suggesting no lack of fit.

            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.223848   0.189816 -11.716   <2e-16 ***
sex          0.156882   0.056137   2.795   0.0052 ** 
age          1.056299   1.000780   1.055   0.2912    
agesq       -0.848704   1.077784  -0.787   0.4310    
income      -0.205321   0.088379  -2.323   0.0202 *  
levyplus     0.123185   0.071640   1.720   0.0855 .  
freepoor    -0.440061   0.179811  -2.447   0.0144 *  
freerepa     0.079798   0.092060   0.867   0.3860    
illness      0.186948   0.018281  10.227   <2e-16 ***
actdays      0.126846   0.005034  25.198   <2e-16 ***
hscore       0.030081   0.010099   2.979   0.0029 ** 
chcond1      0.114085   0.066640   1.712   0.0869 .  
chcond2      0.141158   0.083145   1.698   0.0896 .  

    Null deviance: 5634.8  on 5189  degrees of freedom
Residual deviance: 4379.5  on 5177  degrees of freedom
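The model summarized above can be fit with a call of the following form (mod.full is our name for it):

> mod.full <- glm(doctorco ~ sex + age + agesq + income + levyplus + freepoor +
+                 freerepa + illness + actdays + hscore + chcond1 + chcond2,
+                 data = dvisits, family = poisson)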

(b) The lines in Figure 3 correspond to observations with the same observed response. For instance, the lowest line corresponds to observations with $y_i = 0$.

Figure 3: Deviance residuals vs fitted values and jackknifed residuals vs fitted values
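The Figure 3 panels can be reproduced along these lines (a sketch; rstudent() gives the jackknifed residuals, and mod.full is defined above):

> plot(fitted(mod.full), residuals(mod.full, type = "deviance"),
+      xlab = "fitted values", ylab = "deviance residuals")
> plot(fitted(mod.full), rstudent(mod.full),
+      xlab = "fitted values", ylab = "jackknifed residuals")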

(c) We calculate the VIFs. Note that we use a normal linear model to do so.

> mod_vif <- lm(formula = doctorco ~ sex + age + agesq + income + levyplus +
+               freepoor + freerepa + illness + actdays + hscore +
+               chcond1 + chcond2, data = dvisits)
> library(car)
> vif(mod_vif)
      sex       age     agesq    income  levyplus  freepoor 
 1.186308 71.782113 73.832199  1.517130  1.555639  1.148027 
 freerepa   illness   actdays    hscore   chcond1   chcond2 
 2.461726  1.362535  1.135603  1.237602  1.381006  1.348898 

Only age and agesq have large VIFs, and they are highly correlated, r = 0.99. This indicates that we can drop agesq and the model will not suffer (we could instead have dropped age, but age is more interpretable). Once agesq is removed, all VIFs are below 10. Following backwards elimination, we first drop freerepa, which has the largest p-value above 0.05, and then drop levyplus, chcond1, and chcond2. The final model is below.

            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.051963   0.099522 -20.618  < 2e-16 ***
sex          0.175529   0.055433   3.167  0.00154 ** 
age          0.433532   0.137140   3.161  0.00157 ** 
income      -0.171053   0.081926  -2.088  0.03681 *  
freepoor    -0.496325   0.175304  -2.831  0.00464 ** 
illness      0.196008   0.017585  11.146  < 2e-16 ***
actdays      0.127793   0.004899  26.088  < 2e-16 ***
hscore       0.032433   0.009938   3.263  0.00110 ** 
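A call of the following form produces the fit above (mod is the name used in the prediction code of part (e)):

> mod <- glm(doctorco ~ sex + age + income + freepoor + illness + actdays + hscore,
+            data = dvisits, family = poisson)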

(d) An older, poor, sick woman whose health insurance is not covered by the government would be predicted to visit the doctor the most.

(e) For the last person in the data set, we would predict $\hat{\mu} = 0.17$. The probabilities that the last person has 0, 1, and 2 consultations with the doctor are 0.85, 0.14, and 0.01, respectively. This person in fact visited the doctor 0 times.

> mu = predict(mod, type = "response")[nrow(dvisits)]
> round(dpois(0:2, lambda = mu), 2)
[1] 0.85 0.14 0.01

5. Our final model includes visits and an offset of log(hours). We carry out drop-in-deviance tests comparing the model with visits to models that additionally include residency, gender, or revenue; none of the drops in deviance is significant. The final model output is included below.

glm(formula = complaints ~ offset(log(hours)) + visits, family = poisson, 
    data = esdcomp)

              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -7.5331144  0.4030269  -18.69   <2e-16 ***
visits       0.0005716  0.0001469    3.89    1e-04 ***

    Null deviance: 69.659  on 43  degrees of freedom
Residual deviance: 54.380  on 42  degrees of freedom
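On the rate scale, the visits coefficient says that each additional 1,000 patient visits multiplies the expected number of complaints per hour worked by roughly 1.8 (a quick calculation from the estimates above; values are approximate):

> exp(1000 * 0.0005716)                                  # about 1.77
> exp(1000 * (0.0005716 + c(-1, 1) * 1.96 * 0.0001469))  # rough 95% CI, about (1.3, 2.4)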

Analysis of Deviance Table

Model 1: complaints ~ offset(log(hours)) + visits
Model 2: complaints ~ offset(log(hours)) + visits + residency
Model 3: complaints ~ offset(log(hours)) + visits + gender
Model 4: complaints ~ offset(log(hours)) + visits + revenue
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1        42     54.380                     
2        41     52.994  1  1.38556   0.2392
3        41     53.342  0 -0.34743         
4        41     54.241  0 -0.89886         
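Because Models 2-4 each add a single variable to Model 1 but are not nested within one another, the sequential table only reports a p-value for the first comparison. The one-degree-of-freedom drops for gender and revenue can be checked directly (approximate values):

> pchisq(54.380 - 53.342, df = 1, lower.tail = FALSE)   # gender, about 0.31
> pchisq(54.380 - 54.241, df = 1, lower.tail = FALSE)   # revenue, about 0.71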

6. > melanoma
   count         tumor      site   dev.resid
1     22       freckle      head  5.13537787
2     16   superficial      head -3.04533605
3     19       nodular      head -0.49711084
4     11 indeterminate      head  0.46798432
5      2       freckle     trunk -2.82829426
6     54   superficial     trunk  0.69899703
7     33       nodular     trunk -0.02173229
8     17 indeterminate     trunk  0.54787007
9     10       freckle extremity -2.31583297
10   115   superficial extremity  1.00813975
11    73       nodular extremity  0.28104581
12    28 indeterminate extremity -0.66016102

From the deviance residuals, we see that there is an interaction between tumor and site. Notice that when site = trunk or site = extremity, the freckle row has the lowest count. However, when site = head, the freckle row has the highest count. This interaction is missed by the model, which predicts a much lower value for the observation with tumor = freckle and site = head than it should. The drop-in-deviance test shows that the model with the interaction (the saturated model) is a better fit (p-value $= \Pr(\chi^2_6 > 51.795) \approx 0$).

                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)          2.9554     0.1770  16.696  < 2e-16 ***
tumorindeterminate   0.4990     0.2174   2.295   0.0217 *  
tumornodular         1.3020     0.1934   6.731 1.68e-11 ***
tumorsuperficial     1.6940     0.1866   9.079  < 2e-16 ***
sitehead            -1.2010     0.1383  -8.683  < 2e-16 ***
sitetrunk           -0.7571     0.1177  -6.431 1.27e-10 ***

    Null deviance: 295.203  on 11  degrees of freedom
Residual deviance:  51.795  on  6  degrees of freedom
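The p-value quoted above can be computed directly, comparing the additive model's residual deviance to its $\chi^2_6$ reference distribution:

> pchisq(51.795, df = 6, lower.tail = FALSE)   # about 2e-09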

7. We first summarize the data:

> summary(alli)
    Gender       Length      Choice
 Female:24   Min.   :1.240   F:33  
 Male  :39   1st Qu.:1.600   I:20  
             Median :1.800   O:10  
             Mean   :2.106         
             3rd Qu.:2.425         
             Max.   :3.890         

Our final model is the model with just Length.

multinom(formula = Food ~ Length, data = alli, maxit = 500)

Coefficients:
  (Intercept)   Length
2   -4.182746 2.473353
3   -5.181420 2.388611

Std. Errors:
  (Intercept)    Length
2    1.530199 0.8519014
3    1.745886 0.9214997
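For illustration, the fitted choice probabilities for a 2-metre alligator can be computed from the printed coefficients (the first level of Food is the baseline category; values are approximate):

> eta <- c(0, -4.182746 + 2.473353 * 2, -5.181420 + 2.388611 * 2)
> round(exp(eta) / sum(exp(eta)), 2)   # roughly (0.26, 0.56, 0.18) for levels 1, 2, 3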

We compare it to the model with Length and Gender, and to the model with their interaction.

Model            df  Deviance
Length           122      109
Length + Gender  120      105
Length * Gender  118      103
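The p-values for the comparisons discussed below can be computed directly from this table (a quick check):

> pchisq(109 - 105, df = 2, lower.tail = FALSE)   # Length vs Length + Gender, about 0.14
> pchisq(109 - 103, df = 4, lower.tail = FALSE)   # Length vs Length * Gender, about 0.20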

Comparing the first model to the second and the first model to the third, we see that the differences in deviance are not significant according to the corresponding $\chi^2_2$ and $\chi^2_4$ distributions, respectively.

8. Our final model is the model that includes only Gender.

> mod.gender = glm(Count ~ -1 + ResponseObs + Belief*(Gender), data = afterlife,
+                  family = poisson)
> summary(mod.gender)

Coefficients: (1 not defined because of singularities)
                           Estimate Std. Error z value Pr(>|z|)    
ResponseObs1                 4.3247     0.1074  40.255   <2e-16 ***
ResponseObs2                 4.3197     0.1104  39.122   <2e-16 ***
ResponseObs3                 2.5995     0.1445  17.989   <2e-16 ***
ResponseObs4                 2.1783     0.1809  12.043   <2e-16 ***
BeliefUndecided             -0.4282     0.1688  -2.537   0.0112 *  
BeliefYes                    1.5867     0.1163  13.639   <2e-16 ***
GenderMale                       NA         NA      NA       NA    
BeliefUndecided:GenderMale  -0.0906     0.2457  -0.369   0.7123    
BeliefYes:GenderMale        -0.4008     0.1705  -2.350   0.0188 *  

    Null deviance: 8055.7080  on 12  degrees of freedom
Residual deviance:    2.8481  on  4  degrees of freedom
AIC: 85.161

> anova(mod.race, mod.gender, mod.gender.race, test = "Chi")
Analysis of Deviance Table

Model 1: Count ~ -1 + ResponseObs + Belief * (Race)
Model 2: Count ~ -1 + ResponseObs + Belief * (Gender)
Model 3: Count ~ -1 + ResponseObs + Belief * (Gender + Race)
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1         4     8.0465                     
2         4     2.8481  0   5.1984         
3         2     0.8539  2   1.9942   0.3689

> prob.male <- c(exp(mod.gender$coef[5:6] + mod.gender$coef[8:9]), 1)
> prob.male <- prob.male/sum(prob.male)
> prob.male
BeliefUndecided       BeliefYes                 
      0.1222494       0.6723716       0.2053790 

For a white male, the estimated probabilities of yes, no, and undecided are 0.67, 0.21, and 0.12, respectively (Race does not appear in the final model, so this is the prediction for any male).
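The analogous calculation for females (the GenderMale interaction terms drop out) gives roughly 0.75, 0.15, and 0.10 for yes, no, and undecided:

> prob.female <- c(exp(mod.gender$coef[5:6]), 1)
> prob.female <- prob.female/sum(prob.female)
> prob.female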
