
Stat 139 Homework 10 Solutions, Fall 2015

Problem 1. (Based on Exercises 10 and 11 in Chapter 12). A, B, and C are three explanatory variables in a multiple regression with n = 28 cases. The following table shows the residual sum of squares and degrees of freedom for all models (note: this table can be found in the file "ABC.csv" to make the calculations easier in software like Excel or R):

Model   Residual SS   df   K   σ̂²    R²      adj-R²    BIC
None    8100          27   0   300   0       0         157.298
A       6240          26   1   240   0.23    0.1992    153.5499
B       5980          26   1   230   0.262   0.23248   152.40083
C       6760          26   1   260   0.165   0.1316    155.71109
AB      5500          25   2   220   0.321   0.26442   153.43750
AC      5250          25   2   210   0.352   0.298     152.18147
BC      5750          25   2   230   0.29    0.2308    154.63770
ABC     5160          24   3   215   0.363   0.2799    155.01043

(a) Calculate 3 statistics for each model: the estimate of σ², adjusted R², and BIC.

See the table above for the solutions (note: other statistics were calculated as well, but were not required).

(b) Summarize which model(s) is/are ranked best for each of the 4 statistics from part (a).

Based on the calculations in part (a), the AC model is ranked best by all three of the requested criteria: it has the smallest σ̂² (210), the largest adjusted R² (0.298), and the smallest BIC (152.18). (Plain R², which always favors the largest model, is maximized by the full ABC model.)

(c) Using the residual sum of squares, find the model indicated by forward selection. Start with the model "None", and identify the single-variable model that has the smallest residual sum of squares, then perform an extra-sum-of-squares F-test to determine if that variable is significant. If it is, continue with the 2-predictor models. Continue until no more significant predictors can be added. Is this procedure guaranteed to find the "best" model (best based on residual sum of squares)?

Based on this algorithm, the best 1-predictor model is B (RSS = 5980), which is significantly better than the "None" model:

F = \frac{(8100 - 5980)/1}{5980/26} = \frac{2120}{230} \approx 9.217

which has p-value ≈ 0.006 (using 1-pf(9.217,1,26) in R). Then we consider the best 2-predictor model containing B, which is AB:

F = \frac{(5980 - 5500)/1}{5500/25} = \frac{480}{220} \approx 2.182

which has p-value ≈ 0.15 (using 1-pf(2.182,1,25) in R). We therefore stop: the AB model is not significantly better than B alone, so the model with B only is our best predictive model through forward model selection (using the ESS F-statistic's p-value as our criterion).

Note: this sequential method did not find the true "best" model. Part (b) identified AC as the best model, but forward selection can never reach it, since it commits to B in the first step and thereafter only considers models containing B. So no, the procedure is not guaranteed to find the best model.

Problem 2. Predicting Income in GSS

What factors are associated with income in the US? Several variables from the General Social Survey (GSS) are stored in the data file "gssincome.csv", and the codebook for the data can be found in the file "GSS Income Codebook.txt".

(a) Explore the data graphically and decide whether the outcome, income, or any predictor(s) should be transformed. Make sure you define any categorical variables as factors in R.

Here are the histograms of untransformed income, sqrt(income), income^(1/4), and log(income). A transformation is definitely needed, and either sqrtinc or loginc could be acceptable. Something in between would be best (like income^(1/4)), but interpretation of coefficients would be difficult. We chose income^(1/4) for the rest of this problem because it has the most symmetric histogram, even though the interpretability of the results will suffer. Using log(income) or sqrt(income) would possibly be a better choice if interpretability of results is the most important consideration.
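The histograms below can be produced with code along the following lines (a minimal sketch: it assumes the data file "gssincome.csv" has been read into a data frame called gss, and it also creates the inc4 variable used in the model output later in this problem):

# Read the GSS data and compare four transformations of income
gss <- read.csv("gssincome.csv")
gss$inc4 <- gss$income^(1/4)   # transformed outcome used in the rest of the problem

par(mfrow = c(1, 4))           # four panels side by side
hist(gss$income)
hist(sqrt(gss$income))
hist(gss$income^(1/4))
hist(log(gss$income))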
[Figure: four histograms — gss$income, sqrt(gss$income), gss$income^(1/4), and log(gss$income)]

(b) Regress the outcome on educ, race and their interaction (call this Model 1). What is the estimated slope of the line for "black" subjects (i.e., race=2)? What is it for subjects who self-identify as "other race" (i.e., race=3)?

Here is the relevant R output:

> summary(model1<-lm(inc4~(as.factor(race)+educ)^2,data=gss))
Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)             8.01866    0.55358  14.485   <2e-16 ***
as.factor(race)2       -2.99906    1.46651  -2.045    0.041 *
as.factor(race)3        0.91033    1.20903   0.753    0.452
educ                    0.40211    0.03758  10.701   <2e-16 ***
as.factor(race)2:educ   0.14454    0.10240   1.412    0.158
as.factor(race)3:educ  -0.10530    0.08577  -1.228    0.220
---
Residual standard error: 3.253 on 1373 degrees of freedom
Multiple R-squared: 0.1214, Adjusted R-squared: 0.1182
F-statistic: 37.95 on 5 and 1373 DF, p-value: < 2.2e-16

This question (and the remaining questions) was intended to look at the slope relating transformed income to educ among the race categories, but the wording of the question was not clear. So throughout this problem, two different answers were acceptable (though interpreting the main effects of race in the presence of the interaction term is not very important scientifically). The main effect for black subjects was -2.999 and for other race subjects it was 0.9103; these are simply the estimated differences in average income^(1/4) between each of these racial groups and the reference group (white) for individuals with zero years of education (which is not a very meaningful comparison). The estimated slope of the line relating income^(1/4) to educ was 0.40211 + 0.14454 = 0.54665 for black subjects, and 0.40211 − 0.10530 = 0.29681 for the other race group.

(c) Perform a hypothesis test in R to determine whether the slope for "black" subjects is different than for "other race" subjects. This can be done via a t-test for a linear combination of coefficients. The R function vcov will be useful.

Here's the important R output, in combination with Model 1's summary output above:

> round(vcov(model1),5)
            (Intercept)    race2    race3     educ race2:educ race3:educ
(Intercept)     0.30645 -0.30645 -0.30645 -0.02045    0.02045    0.02045
race2          -0.30645  2.15064  0.30645  0.02045   -0.14803   -0.02045
race3          -0.30645  0.30645  1.46174  0.02045   -0.02045   -0.10050
educ           -0.02045  0.02045  0.02045  0.00141   -0.00141   -0.00141
race2:educ      0.02045 -0.14803 -0.02045 -0.00141    0.01049    0.00141
race3:educ      0.02045 -0.02045 -0.10050 -0.00141    0.00141    0.00736

Here's the calculation performed manually, testing H₀: β₄ − β₅ = 0 vs. H_A: β₄ − β₅ ≠ 0:

t = \frac{(\hat\beta_4 - \hat\beta_5) - 0}{\sqrt{\widehat{\mathrm{Var}}(\hat\beta_4) + \widehat{\mathrm{Var}}(\hat\beta_5) - 2\,\widehat{\mathrm{Cov}}(\hat\beta_4, \hat\beta_5)}} = \frac{0.14454 - (-0.10530)}{\sqrt{0.01049 + 0.00736 - 2(0.00141)}} = \frac{0.24984}{0.12260} = 2.038

This t-statistic has df = 1373, the degrees of freedom associated with estimating the variance of the residuals in this model. The critical value is essentially 1.96, and the p-value = 0.0417 can be calculated in R: 2*(1-pt(2.038,df=1373)).
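As a sanity check, a 95% confidence interval for the slope difference can be built from the same estimate and standard error. This is a small sketch that was not part of the original output; it only reuses the numbers from the manual calculation above:

# 95% CI for the difference in education slopes (black minus other race)
diff.hat <- 0.14454 - (-0.10530)                    # 0.24984
se.diff  <- sqrt(0.01049 + 0.00736 - 2 * 0.00141)   # 0.12260
diff.hat + c(-1, 1) * qt(0.975, df = 1373) * se.diff

The interval (approximately 0.01 to 0.49) excludes 0, which agrees with the two-sided p-value of 0.0417.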
Thus we can reject the null hypothesis: black and other race subjects have statistically significantly different slopes (each extra year of education is associated with a larger increase in income^(1/4) for black subjects than for other race subjects). The calculations can be done automatically in R using matrix multiplication:

> C=c(0,0,0,0,1,-1)
> t(C)%*%coef(model1)/sqrt((t(C)%*%vcov(model1)%*%C))
         [,1]
[1,] 2.038711

(d) Perform a hypothesis test in R to determine whether the slopes (for associating income with education) for the 3 race groups are all the same or not. This can be done via an Extra-Sum-of-Squares F-test. You will need to run a second regression model to perform this test.

We first need to fit the smaller model that does not include the interaction terms. Here's the important R output:

> summary(model1b<-lm(inc4~(race+educ),data=gss))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  8.04485    0.47245  17.028  < 2e-16 ***
race2       -0.96740    0.24635  -3.927 9.03e-05 ***
race3       -0.50956    0.29780  -1.711   0.0873 .
educ         0.40031    0.03186  12.564  < 2e-16 ***
---
Residual standard error: 3.255 on 1375 degrees of freedom
Multiple R-squared: 0.1187, Adjusted R-squared: 0.1168
F-statistic: 61.76 on 3 and 1375 DF, p-value: < 2.2e-16

> summary(model1)$sigma
[1] 3.252873
> summary(model1b)$sigma
[1] 3.255432

Here's the calculation performed manually, testing H₀: β₄ = β₅ = 0 vs. H_A: β₄ ≠ 0 or β₅ ≠ 0:

F = \frac{(SSE_{1b} - SSE_1)/\Delta df}{SSE_1/df_1} = \frac{\left(1375(3.255432^2) - 1373(3.252873^2)\right)/2}{3.252873^2} = 2.082

This F-statistic has df = (2, 1373); the p-value, 1-pf(2.082,2,1373) in R, is about 0.125, so there is no significant evidence that the three slopes differ (note: your answer may differ slightly from R's output because of rounding in the reported values of σ̂²).
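The same test can be run automatically with R's anova() function, which performs the ESS F-test for two nested models directly from their residual sums of squares and so avoids the rounding issue mentioned above. A minimal sketch, assuming model1 and model1b have been fit as shown:

# ESS F-test: no-interaction model vs. interaction model
anova(model1b, model1)

# Equivalently, by hand from the exact residual sums of squares:
sse.full    <- sum(resid(model1)^2)
sse.reduced <- sum(resid(model1b)^2)
F.stat <- ((sse.reduced - sse.full) / 2) / (sse.full / model1$df.residual)
1 - pf(F.stat, df1 = 2, df2 = model1$df.residual)   # p-value for the F-test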