
Stat 139 Homework 10 Solutions, Fall 2015

Problem 1. (Based on Exercises 10 and 11 in Chapter 12). A, B, and C are three explanatory variables in a multiple regression with n = 28 cases. The following table shows the residual sum of squares and degrees of freedom for all models (note: this table can be found in the file “ABC.csv” to make life easier when doing the calculations in software like Excel or R):

Model      Residual sum  Degrees of
Variables  of squares    freedom     K   σ̂²   R²      adj-R²    BIC
None       8100          27          0   300   0       0         157.298
A          6240          26          1   240   0.23    0.1992    153.5499
B          5980          26          1   230   0.262   0.23248   152.40083
C          6760          26          1   260   0.165   0.1316    155.71109
AB         5500          25          2   220   0.321   0.26442   153.43750
AC         5250          25          2   210   0.352   0.298     152.18147
BC         5750          25          2   230   0.29    0.2308    154.63770
ABC        5160          24          3   215   0.363   0.2799    155.01043

(a) Calculate 3 statistics for each model: the estimate of σ², adjusted R², and BIC.

See the table above for the solutions (note: other statistics, like R², were also calculated, but didn’t need to be).
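For reference, here is a sketch of the part (a) calculations in R. The data frame below is typed in from the table above (the actual column layout of “ABC.csv” may differ), and the BIC line uses one common definition, n·log(RSS/n) + (K+1)·log(n); the constant in the course’s definition may differ slightly, but the resulting ordering of the models is the same:

abc <- data.frame(Model = c("None","A","B","C","AB","AC","BC","ABC"),
                  RSS   = c(8100, 6240, 5980, 6760, 5500, 5250, 5750, 5160),
                  df    = c(27, 26, 26, 26, 25, 25, 25, 24),
                  K     = c(0, 1, 1, 1, 2, 2, 2, 3))
n <- 28
abc$sigma2 <- abc$RSS / abc$df                       # estimate of sigma^2
abc$R2     <- 1 - abc$RSS / 8100                     # 8100 = SSE of the "None" model
abc$adjR2  <- 1 - (abc$RSS / abc$df) / (8100 / 27)   # adjusted R^2
abc$BIC    <- n * log(abc$RSS / n) + (abc$K + 1) * log(n)  # assumed BIC definition
abc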

(b) Summarize which model(s) is/are ranked best for each of the statistics from part (a). Based on the calculations in part (a) we see that the AC model is ranked best by all 3 of the criteria: it has the smallest σ̂², the largest adjusted R², and the smallest BIC. (Plain R², which can only increase as predictors are added, would rank the full ABC model best.)

(c) Using the residual sum of squares, find the model indicated by forward selection. Start with the model “None”, and identify the single-variable model that has the smallest residual sum of squares, then perform an extra-sum-of-squares F-test to determine if that variable is significant. If it is, continue with the 2-predictor model. Continue until no more significant predictors can be added. Is this procedure guaranteed to find the “best” model (best based on residual sum of squares)?

Based on this algorithm, the best 1-predictor model is B (it has the smallest residual sum of squares among the single-variable models), and it is significantly better than the “None” model:

$$F = \frac{(8100 - 5980)/1}{5980/26} = 9.217$$

which has p-value ≈ 0.005 (using 1-pf(9.217,1,26) in R). Then we consider the best 2-predictor model containing B, which is AB:

$$F = \frac{(5980 - 5500)/1}{5500/25} = 2.182$$

which has p-value ≈ 0.15 (using 1-pf(2.182,1,25) in R). We then stop and say the AB model is not significantly better than just using B, so the model with B alone is our best predictive model through forward selection (using the ESS F-statistic’s p-value as our criterion). No, the procedure is not guaranteed to find the true “best” model: here it never even considered AC, the model ranked best in part (b).
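The two extra-sum-of-squares tests above can also be scripted against the abc table from part (a). The helper function below is our own illustration, not part of the assignment:

ess.F.test <- function(rss.red, df.red, rss.full, df.full) {
  # F = [(SSE_reduced - SSE_full)/extra df] / [SSE_full/df_full]
  Fstat <- ((rss.red - rss.full) / (df.red - df.full)) / (rss.full / df.full)
  c(F = Fstat, p.value = 1 - pf(Fstat, df.red - df.full, df.full))
}
ess.F.test(8100, 27, 5980, 26)  # None vs. B: significant, so add B
ess.F.test(5980, 26, 5500, 25)  # B vs. AB: not significant, so stop with B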

Problem 2. Predicting Income in the GSS. What factors are associated with income in the US? Several variables from the General Social Survey (GSS) are stored in the data file “gssincome.csv”, and the codebook for the data can be found in the file “GSS Income Codebook.txt”.

(a) Explore the data graphically and decide whether the outcome, income, or any predictor(s) should be transformed. Make sure you define any categorical variables as factors in R.

Here are the histograms of untransformed income, sqrt(income), income^(1/4), and log(income). A transformation is definitely needed, and either sqrt(income) or log(income) could be acceptable. Something in between (like income^(1/4)) would be best, but interpretation of the coefficients would be difficult. We chose income^(1/4) for the rest of this problem because its histogram is the most symmetric, but interpretability of the results will suffer. Using log(income) or sqrt(income) would possibly be a better choice if interpretability of results is the most important consideration.

[Figure: histograms of gss$income, sqrt(gss$income), gss$income^(1/4), and log(gss$income), each with Frequency on the vertical axis.]
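Code of the sort used to produce the four histograms (a sketch; it assumes the data have been read into a data frame gss from “gssincome.csv”):

gss <- read.csv("gssincome.csv")
par(mfrow = c(1, 4))        # four panels side by side
hist(gss$income)
hist(sqrt(gss$income))
hist(gss$income^(1/4))
hist(log(gss$income))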

(b) Regress the outcome on educ, race and their interaction (call this Model 1). What is the estimated slope of the line for “black” subjects (i.e. race=2)? What is it for subjects who self-identify as “other race” (i.e., race=3)? Here is the relevant R output:

> summary(model1<-lm(inc4~(as.factor(race)+educ)^2,data=gss))

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)             8.01866    0.55358  14.485   <2e-16 ***
as.factor(race)2       -2.99906    1.46651  -2.045    0.041 *
as.factor(race)3        0.91033    1.20903   0.753    0.452
educ                    0.40211    0.03758  10.701   <2e-16 ***
as.factor(race)2:educ   0.14454    0.10240   1.412    0.158
as.factor(race)3:educ  -0.10530    0.08577  -1.228    0.220
---
Residual standard error: 3.253 on 1373 degrees of freedom
Multiple R-squared: 0.1214, Adjusted R-squared: 0.1182
F-statistic: 37.95 on 5 and 1373 DF, p-value: < 2.2e-16

This question (and the remaining questions) was intended to look at the slope relating transformed income to educ among the race categories, but the wording of the question was not clear, so throughout this problem 2 different answers were acceptable (though the interpretation of the main effects of race in the presence of the interaction term is not very important scientifically). The main effect for black subjects was −2.999 and for other race subjects it was 0.9103. This is just the estimated difference in average income^(1/4) between these racial groups and the reference group (the white group) for individuals who have zero years of education (which is not very meaningful). The estimate of the slope of the line relating income^(1/4) to educ for black subjects was 0.40211 + 0.14454 = 0.54665, and for the other racial group it was 0.40211 − 0.10530 = 0.29681.
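These slopes can be read straight off the coefficient vector (the coefficient names below are exactly as printed in the summary output above):

b <- coef(model1)
b["educ"]                               # slope for white subjects: 0.40211
b["educ"] + b["as.factor(race)2:educ"]  # slope for black subjects: 0.54665
b["educ"] + b["as.factor(race)3:educ"]  # slope for "other race" subjects: 0.29681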

(c) Perform a hypothesis test in R to determine whether the slope for “black” subjects is different from that for “other race” subjects. This can be done via a t-test for a linear combination of coefficients. The R function vcov will be useful.

Here’s the important R output in combination with Model 1’s summary output above:

> round(vcov(model1),5)
            (Intercept)    race2    race3     educ race2:educ race3:educ
(Intercept)     0.30645 -0.30645 -0.30645 -0.02045    0.02045    0.02045
race2          -0.30645  2.15064  0.30645  0.02045   -0.14803   -0.02045
race3          -0.30645  0.30645  1.46174  0.02045   -0.02045   -0.10050
educ           -0.02045  0.02045  0.02045  0.00141   -0.00141   -0.00141
race2:educ      0.02045 -0.14803 -0.02045 -0.00141    0.01049    0.00141
race3:educ      0.02045 -0.02045 -0.10050 -0.00141    0.00141    0.00736

Here’s the calculation performed manually:

H0 : β4 − β5 = 0 vs. HA : β4 − β5 ≠ 0

$$t = \frac{(\hat\beta_4 - \hat\beta_5) - 0}{\sqrt{\widehat{Var}(\hat\beta_4) + \widehat{Var}(\hat\beta_5) - 2\widehat{Cov}(\hat\beta_4, \hat\beta_5)}} = \frac{0.14454 - (-0.10530)}{\sqrt{0.01049 + 0.00736 - 2(0.00141)}} = \frac{0.24984}{0.12260} = 2.038$$

This t-statistic has df = 1373, the degrees of freedom associated with estimating the variance of the residuals in this model. The critical value is essentially 1.96, and the p-value = 0.0417 can be calculated in R: 2*(1-pt(2.038,df=1373)). Thus we can reject the null hypothesis; black and other race subjects have statistically significantly different slopes (every extra year of education is more important for black individuals than for other race individuals). The calculation can be done automatically in R using matrix multiplication:

> C=c(0,0,0,0,1,-1)
> t(C)%*%coef(model1)/sqrt((t(C)%*%vcov(model1)%*%C))
         [,1]
[1,] 2.038711

(d) Perform a hypothesis test in R to determine whether the slopes (for associating income with education) for the 3 race groups are all the same or not. This can be done via an Extra-Sum-of-Squares F-test. You will need to run a second regression model to perform this test.

We need to first fit the smaller model, not including the interaction terms. Here’s the important R output:

> summary(model1b<-lm(inc4~(race+educ),data=gss))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  8.04485    0.47245  17.028  < 2e-16 ***
race2       -0.96740    0.24635  -3.927 9.03e-05 ***
race3       -0.50956    0.29780  -1.711   0.0873 .
educ         0.40031    0.03186  12.564  < 2e-16 ***
---
Residual standard error: 3.255 on 1375 degrees of freedom
Multiple R-squared: 0.1187, Adjusted R-squared: 0.1168
F-statistic: 61.76 on 3 and 1375 DF, p-value: < 2.2e-16
> summary(model1)$sigma
[1] 3.252873
> summary(model1b)$sigma
[1] 3.255432

Here’s the calculation performed manually:

H0 : β4 = β5 = 0 vs. HA : β4 ≠ 0 or β5 ≠ 0

$$F = \frac{(SSE_{1b} - SSE_1)/\Delta df}{SSE_1/df_1} = \frac{\left(1375(3.255432^2) - 1373(3.252873^2)\right)/2}{3.252873^2} = 2.082$$

This F-statistic has df = 2, 1373 (note: your answer may differ slightly from R’s output because of rounding in the reported σ̂). The p-value = 0.125 was calculated in R: 1-pf(2.082,2,1373). Thus we cannot reject the null hypothesis; the slopes associating income with educ may be the same in all 3 race groups. Note, this seemingly contradicts our answer in the previous part: the white group’s slope sits in between the other two groups’, diluting the significant result from the previous part. And done automatically in R:

> anova(model1b,model1)
Analysis of Variance Table

Model 1: inc4 ~ (race + educ)
Model 2: inc4 ~ (race + educ)^2
  Res.Df   RSS Df Sum of Sq      F Pr(>F)
1   1375 14572
2   1373 14528  2    44.066 2.0823  0.125

(e) Use R to calculate the 95% prediction interval for a black person with a college degree (educ = 16) using Model 1. Be sure to transform back to the original units, $.

Automatically in R using the predict.lm command:

> newdata=data.frame(race=as.factor(2),educ=16)
> (predict(model1,new=newdata,interval="prediction"))^4
       fit      lwr      upr
1 35911.58 2933.609 165592.4

And performing the calculation “by hand” in R using matrices:

> X0=c(1,1,0,16,16,0)
> (t(X0)%*%(model1$coef))^4
         [,1]
[1,] 35911.58
> (c(t(X0)%*%(model1$coef)+qt(c(0.025,0.975),df=model1$df)*sqrt(summary(model1)$sigma^2+(t(X0)%*%(vcov(model1))%*%X0))))^4
[1]   2933.609 165592.383

(f) Fit a model with main effects of all available predictors (call this Model 2). Your model should have 1337 degrees of freedom associated with the residuals (unless you did exotic transformations on your predictors). Identify significant predictors (ignoring multiple comparisons issues).

> summary(model2<-lm(inc4~.,data=gss.lm))

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)            4.912699   0.709187   6.927 6.65e-12 ***
age                    0.043076   0.007289   5.910 4.33e-09 ***
foreignborn            0.339899   0.285801   1.189  0.23454
female                -1.358120   0.178604  -7.604 5.38e-14 ***
numchildren           -0.014722   0.063135  -0.233  0.81566
crack                 -0.931292   0.416649  -2.235  0.02557 *
evermarried            0.612812   0.222038   2.760  0.00586 **
educ                   0.366423   0.031772  11.533  < 2e-16 ***
race2                 -0.048055   0.250446  -0.192  0.84787
race3                 -0.120155   0.316385  -0.380  0.70417
siblings              -0.003502   0.031892  -0.110  0.91259
sexorient2            -1.000273   0.459771  -2.176  0.02976 *
sexorient3            -0.237336   0.532182  -0.446  0.65569
height                 0.082260   0.012750   6.452 1.54e-10 ***
marijuana              0.380080   0.176661   2.151  0.03162 *
zodiac2                0.430692   0.410225   1.050  0.29396
zodiac3                0.385790   0.438428   0.880  0.37905
zodiac4               -0.067421   0.432158  -0.156  0.87605
zodiac6                0.224012   0.411053   0.545  0.58586
zodiac7               -0.283652   0.405413  -0.700  0.48426
zodiac8               -0.110899   0.406100  -0.273  0.78483
zodiac9                0.105626   0.493785   0.214  0.83065
zodiac10              -0.331320   0.418897  -0.791  0.42912
zodiac11              -0.014109   0.422233  -0.033  0.97335
zodiac12              -0.550089   0.408340  -1.347  0.17816
zodiac13              -0.519113   0.434588  -1.194  0.23250
zodiac14               0.461270   0.402714   1.145  0.25225
political2             0.123641   0.199578   0.620  0.53569
political3             0.262680   0.222661   1.180  0.23832
happinesspretty happy  0.561383   0.281953   1.991  0.04668 *
happinessvery happy    0.589373   0.303714   1.941  0.05252 .
hispanic              -0.258292   0.284252  -0.909  0.36369
owngun                 0.419957   0.201994   2.079  0.03780 *
ownhome                0.555694   0.175342   3.169  0.00156 **
sexpartners1           0.314003   0.240262   1.307  0.19147
sexpartners2          -0.175385   0.315044  -0.557  0.57783
otherlang              0.127090   0.197011   0.645  0.51898
catholic               0.379994   0.204228   1.861  0.06302 .
veteran               -0.063158   0.274391  -0.230  0.81799
workgovt               0.255554   0.209410   1.220  0.22255
religious1             0.453493   0.202624   2.238  0.02538 *
religious2             0.168417   0.221906   0.759  0.44801
---
Residual standard error: 2.923 on 1337 degrees of freedom
Multiple R-squared: 0.3091, Adjusted R-squared: 0.2879
F-statistic: 14.59 on 41 and 1337 DF, p-value: < 2.2e-16

The significant predictors are (in order of smallest to largest p-value): years of education, being female (vs. being male), height, age, owning one’s home (vs. not), having ever been married (vs. not), attending church at a moderate level (vs. never attending church), having ever smoked crack, being gay (vs. being straight), having ever smoked marijuana, owning a gun, and being pretty happy (compared to not so happy).
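As a shortcut, the significant rows can be pulled out of the summary table programmatically (note that the intercept shows up in this list as well):

coefs <- summary(model2)$coefficients
coefs[coefs[, "Pr(>|t|)"] < 0.05, ]   # coefficients with p-value below 0.05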

(g) Use the backward variable selection procedure based on the AIC to build a prediction model for income (transformed appropriately), starting from a model with all main effects. You may find the R-function step() and the code introduced in Unit 12 helpful. There is no need to report the intermediate output of the step() function in your write-up, just include a summary of the chosen model and its AIC. Call the resulting model Model 3.

Note: the function lm() in R has some useful shortcuts for specifying formulas. For example,

lm(y ~ ., data = MyData)

is equivalent to including main effects of all variables in the dataset MyData as predictors. Also,

lm(y ~ .^2, data = MyData)

means “include all variables and their pairwise interactions”. Finally,

lm(y ~ 1, data = MyData)

means “fit an intercept-only model”.

Using AIC as the criterion, below is the final step of the backward selection, and the summary of the resulting model (AIC = 2978.68):

> model3=step(lm(inc4~.,data=gss.lm))
Step:  AIC=2978.68
inc4 ~ age + female + crack + evermarried + educ + sexorient +
    height + marijuana + happiness + owngun + ownhome + sexpartners +
    catholic + religious

              Df Sum of Sq   RSS    AIC
<none>                      11633 2978.7
- happiness    2     36.31  11669 2979.0
- sexpartners  2     40.25  11673 2979.4
- sexorient    2     44.99  11678 2980.0
- religious    2     45.04  11678 2980.0
- catholic     1     33.03  11666 2980.6
- marijuana    1     40.77  11674 2981.5
- owngun       1     43.89  11677 2981.9
- crack        1     57.77  11691 2983.5
- evermarried  1     78.50  11711 2986.0
- ownhome      1     97.50  11730 2988.2
- age          1    343.26  11976 3016.8
- height       1    350.56  11983 3017.6
- female       1    560.54  12193 3041.6
- educ         1   1417.40  13050 3135.2

> summary(model3)

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)            4.892122   0.568021   8.613  < 2e-16 ***
age                    0.044103   0.006962   6.335 3.22e-10 ***
female                -1.376508   0.170040  -8.095 1.26e-15 ***
crack                 -1.064978   0.409805  -2.599 0.009458 **
evermarried            0.634365   0.209397   3.029 0.002496 **
educ                   0.376211   0.029225  12.873  < 2e-16 ***
sexorient2            -1.026570   0.455519  -2.254 0.024378 *
sexorient3            -0.276911   0.525831  -0.527 0.598547
height                 0.080815   0.012624   6.402 2.11e-10 ***
marijuana              0.376913   0.172652   2.183 0.029200 *
happinesspretty happy  0.537339   0.279118   1.925 0.054421 .
happinessvery happy    0.597725   0.299556   1.995 0.046202 *
owngun                 0.443528   0.195801   2.265 0.023658 *
ownhome                0.580024   0.171794   3.376 0.000755 ***
sexpartners1           0.295031   0.237867   1.240 0.215071

sexpartners2          -0.197459   0.311274  -0.634 0.525955
catholic               0.374469   0.190576   1.965 0.049625 *
religious1             0.456546   0.200020   2.282 0.022614 *
religious2             0.234565   0.209902   1.117 0.263979
---
Residual standard error: 2.925 on 1360 degrees of freedom
Multiple R-squared: 0.2965, Adjusted R-squared: 0.2872
F-statistic: 31.84 on 18 and 1360 DF, p-value: < 2.2e-16

(h) Next, run a forward variable selection procedure starting with Model 3, with the scope for the final model set to include all the two-way interaction terms for the variables in Model 3. The predictors from Model 3 can be printed in a list in R via the command model3$terms[[3]]. This forward variable selection can be performed using the step() function as follows, where interactionModel is the lm fit with all variables from Model 3 and their interactions:

step(Model3, scope = list(upper = interactionModel), direction = "forward")

Report a summary of the chosen model and its AIC. Call this Model 4.

> model3b=lm(inc4~(age + female + crack + evermarried + educ + sexorient + height +
+   marijuana + happiness + owngun + ownhome + sexpartners +
+   catholic + religious)^2,data=gss.lm)

Step:  AIC=2877.46
inc4 ~ age + female + crack + evermarried + educ + sexorient +
    height + marijuana + happiness + owngun + ownhome + sexpartners +
    catholic + religious + female:evermarried + age:evermarried +
    educ:marijuana + female:owngun + crack:owngun + female:catholic +
    female:ownhome + female:sexorient + age:crack + crack:ownhome +
    height:owngun + owngun:ownhome + evermarried:ownhome + sexorient:owngun +
    age:educ + sexorient:catholic + ownhome:sexpartners + evermarried:height +
    educ:religious + age:marijuana + marijuana:ownhome + crack:sexorient +
    sexorient:height + evermarried:marijuana

> summary(model4)
Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)            6.603910   1.301504   5.074 4.45e-07 ***
age                    0.043111   0.033029   1.305 0.192035
female                 0.697954   0.337910   2.066 0.039069 *
crack                  1.178876   1.203601   0.979 0.327532
evermarried            3.939900   0.565335   6.969 5.01e-12 ***
educ                   0.099152   0.084859   1.168 0.242844
sexorient2            -0.861759   1.087332  -0.793 0.428185
sexorient3            -4.910313   1.578272  -3.111 0.001903 **
height                 0.110308   0.023296   4.735 2.42e-06 ***
marijuana             -2.970938   0.916685  -3.241 0.001221 **
happinesspretty happy  0.558260   0.271172   2.059 0.039718 *
happinessvery happy    0.594968   0.291591   2.040 0.041506 *
owngun                 1.958329   0.467997   4.184 3.05e-05 ***
ownhome                0.972724   0.536407   1.813 0.069994 .
sexpartners1           0.106884   0.286038   0.374 0.708708
sexpartners2           0.059430   0.361571   0.164 0.869469
catholic               0.678533   0.260605   2.604 0.009325 **
religious1            -1.508000   1.005539  -1.500 0.133931
religious2            -1.516388   1.058003  -1.433 0.152020

female:evermarried    -1.824715   0.364561  -5.005 6.33e-07 ***
age:evermarried       -0.081669   0.014717  -5.549 3.46e-08 ***
educ:marijuana         0.217941   0.060078   3.628 0.000297 ***
female:owngun         -1.254368   0.390389  -3.213 0.001345 **
crack:owngun           2.607957   1.124125   2.320 0.020492 *
female:catholic       -0.755179   0.362295  -2.084 0.037312 *
female:ownhome        -0.889751   0.323066  -2.754 0.005966 **
female:sexorient2     -2.080899   0.934254  -2.227 0.026092 *
female:sexorient3      0.535124   1.065823   0.502 0.615697
age:crack             -0.071000   0.035167  -2.019 0.043691 *
crack:ownhome         -1.835404   0.859673  -2.135 0.032943 *
height:owngun         -0.057258   0.028232  -2.028 0.042745 *
owngun:ownhome        -1.071566   0.394395  -2.717 0.006673 **
evermarried:ownhome    0.662249   0.403278   1.642 0.100793
sexorient2:owngun     -1.382249   1.411829  -0.979 0.327734
sexorient3:owngun      4.954325   1.484061   3.338 0.000866 ***
age:educ               0.003237   0.002147   1.508 0.131798
sexorient2:catholic    0.667331   1.153387   0.579 0.562968
sexorient3:catholic    3.288041   1.289486   2.550 0.010887 *
ownhome:sexpartners1   0.150461   0.450205   0.334 0.738277
ownhome:sexpartners2  -0.879761   0.615271  -1.430 0.152988
evermarried:height    -0.037008   0.027089  -1.366 0.172126
educ:religious1        0.135093   0.069275   1.950 0.051373 .
educ:religious2        0.118669   0.072319   1.641 0.101051
age:marijuana          0.028352   0.013294   2.133 0.033135 *
marijuana:ownhome     -0.518119   0.364864  -1.420 0.155833
crack:sexorient2       3.062586   1.788762   1.712 0.087107 .
crack:sexorient3       3.967990   2.059260   1.927 0.054205 .
sexorient2:height      0.082778   0.068171   1.214 0.224859
sexorient3:height      0.188111   0.092739   2.028 0.042718 *
evermarried:marijuana -0.562654   0.397205  -1.417 0.156854
---
Residual standard error: 2.789 on 1329 degrees of freedom
Multiple R-squared: 0.375, Adjusted R-squared: 0.352
F-statistic: 16.28 on 49 and 1329 DF, p-value: < 2.2e-16

(i) Finally, use a stepwise procedure to perform model selection. Start with a model with all main effects and specify the intercept-only model (Model 0) as a lower-limit model and a full model including all two-way interactions of ALL possible predictor variables as the upper-limit:

step(Model2, scope = list(lower = Model0, upper = FullModel), direction = "both")

Report a summary of the chosen model and its AIC. Call this Model 5.

> model5=step(lm(inc4~.,data=gss.lm),direction="both",scope=list(lower="~1",upper="~.^2"))

Step:  AIC=2805.66
inc4 ~ age + foreignborn + female + numchildren + crack + evermarried +
    educ + race + siblings + sexorient + height + marijuana +
    political + happiness + hispanic + owngun + ownhome + sexpartners +
    otherlang + catholic + veteran + workgovt + religious +
    female:evermarried + age:evermarried + age:otherlang + political:catholic +
    age:numchildren + crack:veteran + educ:marijuana + siblings:veteran +
    female:race + female:numchildren + female:owngun + age:veteran +
    female:ownhome +

    crack:ownhome + numchildren:crack + numchildren:workgovt +
    race:workgovt + foreignborn:catholic + educ:hispanic + female:sexorient +
    crack:sexpartners + height:political + age:political + female:catholic +
    female:hispanic + siblings:political + foreignborn:hispanic +
    ownhome:religious + educ:veteran + evermarried:height + height:religious +
    hispanic:otherlang + ownhome:sexpartners + owngun:ownhome +
    race:ownhome + siblings:owngun + educ:happiness + crack:owngun +
    happiness:owngun + numchildren:ownhome + numchildren:happiness +
    foreignborn:otherlang + crack:evermarried + political:hispanic +
    hispanic:catholic + otherlang:catholic + race:political

(j) Select a best final model among Models 1 through 5 based on their AICs and perform a brief model check of assumptions on your selected model.

Based on the outputs, the last model (Model 5) had the best (smallest) AIC of 2805.66. It may be overfit, though (it has a LOT of interaction terms), so a cross-validation study may lead to choosing one of the other models as best for new data. Here are the plots to check the assumptions of the model: linearity and normality seem fine, and while the constant variance assumption may be violated (the residuals are more spread out vertically where the fitted values are higher), this is not too extreme, since there is simply more data at those fitted values, which naturally makes the residuals there look more spread out. Most likely using Y^(1/4) as the response helped with the normality assumption (and the constant variance assumption as well).

[Figure: scatterplot of resid(model5) vs. fitted(model5), and histogram of resid(model5).]
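Code of the sort used to produce these diagnostic plots (a sketch):

par(mfrow = c(1, 2))
plot(fitted(model5), resid(model5))   # check linearity and constant variance
hist(resid(model5))                   # check (approximate) normality of residuals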

(k) Use your model to interpret the association of income with the variable female. Note: if female is not in your model, interpret what this means.

female              0.602016   0.371504  1.620 0.105375
female:evermarried -1.128865   0.399461 -2.826 0.004787 **
female:race2        0.980164   0.459570  2.133 0.033132 *
female:race3       -0.763402   0.573045 -1.332 0.183036
female:numchildren -0.287685   0.116504 -2.469 0.013667 *
female:owngun      -1.181516   0.371981 -3.176 0.001527 **
female:ownhome     -1.115433   0.321711 -3.467 0.000543 ***
female:sexorient2  -2.293394   0.885338 -2.590 0.009695 **
female:sexorient3   0.311684   1.034342  0.301 0.763207
female:catholic    -0.833805   0.372915 -2.236 0.025529 *
female:hispanic     0.875669   0.506637  1.728 0.084158 .

The variable female shows up as a main effect and in interactions with many of the other predictors (all the terms involving female are listed above). All else in the model being constant, females actually make more money than males on average (since the estimated main effect is positive: 0.6020) when all of the other variables appearing in interaction terms with female are zero (note: numchildren is rarely zero), though this effect is not significant (p = 0.1054). And compared to this, the effect of being a woman is worse for those who have ever been married: they make less than men, since −1.1289 + 0.6020 is less than zero. Similar interpretations can be made with the other interaction terms.
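For example, the combined female effect for an ever-married woman (with the other variables that interact with female set to zero) is just the sum of the main effect and the relevant interaction coefficient (names as in the output above):

coef(model5)["female"] + coef(model5)["female:evermarried"]
# 0.602016 + (-1.128865) = -0.526849: ever-married women earn less, on the
# transformed scale, than otherwise comparable ever-married men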

Problem 3. Run a cross-validation study of the 3 automatic variable selection models from Problem 2. For 2,000 iterations, randomly select 1,000 observations on which to train Models 3, 4, and 5 from Problem 2, and keep the remaining 379 observations as a test set in each iteration. Predict both income and your transformed income variable for all 379 observations in the testing set using each of the 3 models based on the training set, and save the residual sum of squares of your test set in each iteration. Summarize your results as to which of the 3 models performs best in predicting income and transformed income. Does this agree with your selection from Problem 2(j)?

Based on the R output below, the largest model does still work best at predicting new observations: it has the smallest error sum of squares, on average, of the three models (variance estimates of new residuals on the transformed scale are 7.759 for the largest model vs. 8.699 and 8.100 for the two smaller models), which agrees with the AIC-based selection in Problem 2(j). Note: the estimates of the variance of residuals for future predictions are much larger than the internal estimates obtained by using the same observations both to fit the model and to estimate the error variance (and this difference grows with the number of predictors involved). *Note: we used 1200 observations in the training set to alleviate “estimation errors”, which were flagged by R’s warning that “In predict.lm(fit2, new = test) : prediction from a rank-deficient fit may be misleading.”

> set.seed(420)
> nsims=2000
> n=nrow(gss)
> sse1=sse2=sse3=rep(NA,nsims)
> sse1b=sse2b=sse3b=rep(NA,nsims)
> n.train=1200
>
> for(i in 1:nsims){
+   reorder=sample(n)
+   train=gss[reorder[1:n.train],]
+   test=gss[reorder[(n.train+1):n],]
+   ### fit1: result of backward selection on main effects
+   fit1=lm(inc4 ~ age + female + crack + evermarried + educ + sexorient + height +
+     marijuana + happiness + owngun + ownhome + sexpartners +
+     catholic + religious,data=train)
+   ### fit2: result of forward selection on interaction terms
+   fit2=lm(inc4 ~ age + female + crack + evermarried + educ + sexorient + height +
+     marijuana + happiness + owngun + ownhome + sexpartners +
+     catholic + religious + female:evermarried + age:evermarried +

+     educ:marijuana + female:owngun + crack:owngun + female:catholic +
+     female:ownhome + female:sexorient + age:crack + crack:ownhome +
+     height:owngun + owngun:ownhome + evermarried:ownhome + sexorient:owngun +
+     age:educ + sexorient:catholic + ownhome:sexpartners + evermarried:height +
+     educ:religious + age:marijuana + marijuana:ownhome + crack:sexorient +
+     sexorient:height + evermarried:marijuana,data=train)
+   ### fit3: result of stepwise selection
+   fit3=lm(inc4 ~ age + foreignborn + female + numchildren +
+     crack + evermarried + educ + race + siblings + sexorient +
+     height + marijuana + political + happiness + hispanic + owngun +
+     ownhome + sexpartners + otherlang + catholic + veteran +
+     workgovt + religious + female:evermarried + age:evermarried +
+     age:otherlang + political:catholic + age:numchildren + crack:veteran +
+     educ:marijuana + siblings:veteran + female:race + female:numchildren +
+     female:owngun + age:veteran + female:ownhome + crack:ownhome +
+     numchildren:crack + numchildren:workgovt + race:workgovt +
+     foreignborn:catholic + educ:hispanic + female:sexorient +
+     crack:sexpartners + height:political + age:political + female:catholic +
+     female:hispanic + siblings:political + foreignborn:hispanic +
+     ownhome:religious + educ:veteran + evermarried:height + height:religious +
+     hispanic:otherlang + ownhome:sexpartners + owngun:ownhome +
+     race:ownhome + siblings:owngun + educ:happiness + crack:owngun +
+     happiness:owngun + numchildren:ownhome + numchildren:happiness +
+     foreignborn:otherlang + crack:evermarried + political:hispanic +
+     hispanic:catholic + otherlang:catholic + race:political,data=train)
+   sse1[i]=sum((test$inc4-predict(fit1,new=test))^2)
+   sse2[i]=sum((test$inc4-predict(fit2,new=test))^2)
+   sse3[i]=sum((test$inc4-predict(fit3,new=test))^2)
+   sse1b[i]=sum((test$income-predict(fit1,new=test)^4)^2)
+   sse2b[i]=sum((test$income-predict(fit2,new=test)^4)^2)
+   sse3b[i]=sum((test$income-predict(fit3,new=test)^4)^2)
+   if(i%%200==0){cat("Finished Iteration #",i,"\n")}
+ }
Finished Iteration # 200
Finished Iteration # 400
Finished Iteration # 600
Finished Iteration # 800
Finished Iteration # 1000
Finished Iteration # 1200
Finished Iteration # 1400
Finished Iteration # 1600
Finished Iteration # 1800
Finished Iteration # 2000
> c(mean(sse1),mean(sse2),mean(sse3))/(n-n.train)
[1] 8.698998 8.100079 7.758729
> c(mean(sse1b),mean(sse2b),mean(sse3b))/(n-n.train)
[1] 1623122656 1540419140 1471092509
> c(summary(model3)$sigma^2,summary(model4)$sigma^2,summary(model5)$sigma^2)
[1] 8.553567 7.776036 7.152661
