Midterm I Solutions Stat 112, Fall 2004


1. (i) C; (ii) B; (iii) D; (iv) D; (v) C

2. (a) The assumptions of the simple linear regression model are: (1) Linearity; (2) Constant variance; (3) Normality; (4) Independence. Linearity can be checked by looking at whether there is a pattern in the mean of the residuals as X (Television) increases in the residual plot. There does not appear to be any pattern in the mean of the residuals, indicating that linearity is a reasonable assumption. Constant variance can be checked by looking at whether there is a pattern in the spread of the residuals as X increases in the residual plot. The spread of the residuals appears to be roughly constant, indicating that constant variance is a reasonable assumption. The normality assumption can be checked by looking at whether the histogram of the residuals appears bell shaped and whether the points in the normal quantile plot of the residuals all fall within the confidence bands. The histogram of the residuals appears roughly bell shaped, but the residuals appear somewhat skewed to the left, and several points fall outside of the confidence bands in the normal quantile plot. Thus, the normality assumption appears to be violated. The independence assumption cannot be checked because the data were not collected over time.

(b) The simple linear regression model is \( E(\text{Debt} \mid \text{Television}) = \beta_0 + \beta_1 \cdot \text{Television} \). To test whether there is strong evidence that number of hours the television is turned on is associated with debt, the null hypothesis is \( H_0: \beta_1 = 0 \) and the alternative hypothesis is \( H_a: \beta_1 \neq 0 \). The p-value for the test from the JMP output is <0.0001. Thus, there is strong evidence that number of hours the television is turned on is associated with debt.

(c) A 95% confidence interval for the difference between the mean debt of the subpopulation of families whose television is turned on 21 hours per week and the mean debt of the subpopulation of families whose television is turned on 20 hours per week is a 95% confidence interval for the slope of the simple linear regression model. A 95% confidence interval for the slope is approximately \( \hat{\beta}_1 \pm 2 \cdot SE(\hat{\beta}_1) = 2581.79 \pm 2 \cdot 187.54 = (2206.71,\ 2956.87) \).
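As a quick check of the arithmetic in (c), the sketch below recomputes the approximate interval from the slope estimate and standard error quoted in the solution. The function name `approx_ci` is ours, and the multiplier 2 is the rule-of-thumb value used in the solution rather than an exact t quantile.

```python
def approx_ci(estimate, std_error, multiplier=2.0):
    """Return (lower, upper) for estimate +/- multiplier * std_error."""
    half_width = multiplier * std_error
    return estimate - half_width, estimate + half_width


# Slope estimate and standard error as quoted in the solution to 2(c).
lower, upper = approx_ci(2581.79, 187.54)
print(f"({lower:.2f}, {upper:.2f})")
```

Because the whole interval lies above zero, it agrees with the conclusion in (b) that the slope differs from zero.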

(d) Your friend’s family’s predicted debt is \( \hat{E}(\text{Debt} \mid \text{Television} = 30) = \hat{\beta}_0 + \hat{\beta}_1 \cdot 30 = 48039.68 + 2581.79 \cdot 30 = \$125{,}493.39 \).

(e) Let Y = your friend’s family’s debt. Using the property of the simple linear regression model that Y | X has a normal distribution with mean \( \beta_0 + \beta_1 X \) and standard deviation \( \sigma \), and using the least squares estimates of \( \beta_0, \beta_1, \sigma \) (the least squares intercept, least squares slope and root mean square error, respectively), we have

\[
\begin{aligned}
P(Y > 150{,}000 \mid X = 30) &= P\left( \frac{Y - 125{,}493.39}{38{,}671.29} > \frac{150{,}000 - 125{,}493.39}{38{,}671.29} \right) \\
&= P\left( Z > \frac{150{,}000 - 125{,}493.39}{38{,}671.29} \right) = P(Z > 0.634) = 1 - 0.7357 = 0.2643.
\end{aligned}
\]

Thus, there is probability 0.2643 that your friend’s family will be in debt more than $150,000.
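The standardization in (e) can be reproduced numerically. The following is a minimal sketch that uses only the Python standard library; the helper names `normal_cdf` and `prob_exceeds` are ours, and the inputs are the predicted debt from (d) and the root mean square error quoted in (e).

```python
import math


def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_exceeds(threshold, mean, sd):
    """Return (z, P(Y > threshold)) for Y ~ Normal(mean, sd)."""
    z = (threshold - mean) / sd
    return z, 1.0 - normal_cdf(z)


# Predicted debt at 30 hours (part d) and root mean square error (part e).
z, p = prob_exceeds(150_000, 125_493.39, 38_671.29)
print(f"z = {z:.3f}, P(Y > 150,000) = {p:.4f}")
```

The computed probability should come out slightly below the 0.2643 in the solution, at roughly 0.263, because the table value 0.7357 corresponds to z rounded to 0.63.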

(f) The sociologist’s claim is not justified. There is strong evidence that television watching is associated with debt but this does not show that television watching causes debt. The association could be the result of debt causing television watching or the result of some lurking variable Z that is associated with both television watching and debt. Additional data that would be useful to collect would be data on potential lurking variables such as income of the family, socioeconomic status, employment status of the parents in the family and number of children in the family. We could then compare the debt of families who watch the same amount of television and also have the same income, socioeconomic status, employment status and number of children in the family.

3. (a) The predicted absentee difference is \( \hat{E}(\text{absentee\_difference} \mid \text{machine\_difference} = -564) = -100.30 + 0.1267 \cdot (-564) = -171.77 \), i.e., we predict the Republican to have 172 more absentee votes.

(b) iii (300)

(c) The interval that best supports the Republican party’s claim that the difference of absentee votes (Democrat minus Republican) was unreasonably high in view of the difference of machine count votes is the 95% prediction interval. The 95% prediction interval is a range of values in which the absentee difference is likely to lie for a given single election with a machine difference of -564. If an absentee difference falls outside of the 95% prediction interval, it is an unusual election. On the other hand, the 95% confidence interval for the mean response for a machine difference of -564 is a range of plausible values for the mean absentee difference for all elections with a machine difference of -564. It would not be unusual for an election with a machine difference of -564 to have an absentee difference that falls outside of the 95% confidence interval for the mean response.
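To illustrate why the prediction interval is always wider than the confidence interval for the mean response, here is a sketch based on the standard simple linear regression formulas. The function name is ours, and every input value is hypothetical because the exam's summary statistics are not reproduced in this solution. The multiplier 2 is the same rule-of-thumb value used in Problem 2(c).

```python
import math


def interval_half_widths(s, n, x0, x_bar, sxx, multiplier=2.0):
    """Approximate half-widths of the confidence interval for the mean
    response and of the prediction interval at x0.

    s     : root mean square error
    n     : number of observations
    x_bar : mean of the explanatory variable
    sxx   : sum of squared deviations of x from x_bar
    """
    mean_term = 1.0 / n + (x0 - x_bar) ** 2 / sxx
    ci_half = multiplier * s * math.sqrt(mean_term)
    # The extra 1 accounts for the variability of a single new observation.
    pi_half = multiplier * s * math.sqrt(1.0 + mean_term)
    return ci_half, pi_half


# HYPOTHETICAL inputs, for illustration only; none come from the exam.
ci_half, pi_half = interval_half_widths(
    s=250.0, n=21, x0=-564.0, x_bar=0.0, sxx=1.0e9
)
print(f"CI half-width: {ci_half:.1f}, PI half-width: {pi_half:.1f}")
```

The prediction interval half-width can never fall below `multiplier * s`, whereas the confidence interval for the mean shrinks as n grows. This is why a single election can plausibly land outside the confidence interval without being unusual.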

4. (a) The mammal with the most surprisingly high brain weight is the dolphin, with a residual of 1312.31. The mammal with the most surprisingly low brain weight is the hippo, with a residual of -1346.57.

(b) The elephant and hippo are influential observations because their Cook’s distances are greater than 1. The elephant and hippo are also high leverage observations because their leverages are greater than 6/n=6/96=0.0625.

(c) The removal of the elephant and hippo is justified because they are high leverage points. We should report that we have removed the elephant and hippo from our analysis and that our conclusions only apply to mammals with body sizes in the range 0-600 kilograms.

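(d)
\[
\begin{aligned}
\hat{E}(\text{brain\_weight} \mid \text{body\_weight} = 5) &= \exp\{\hat{E}[\log(\text{brain\_weight}) \mid \text{body\_weight} = 5]\} \\
&= \exp\{\hat{E}[\log(\text{brain\_weight}) \mid \log(\text{body\_weight}) = \log(5)]\} \\
&= \exp\{\hat{E}[\log(\text{brain\_weight}) \mid \log(\text{body\_weight}) = 1.609]\} \\
&= \exp\{2.33 + 0.72 \cdot 1.609\} = \exp\{3.49\} = 32.79.
\end{aligned}
\]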
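The back-transformation in (d) can be checked with a few lines of Python. This is a sketch using the intercept 2.33 and slope 0.72 quoted in the solution, with natural logarithms as in the solution (log(5) = 1.609). The function name `predict_brain_weight` is ours.

```python
import math


def predict_brain_weight(body_weight, intercept=2.33, slope=0.72):
    """Predict brain weight by exponentiating the fitted log-log line."""
    log_prediction = intercept + slope * math.log(body_weight)
    return math.exp(log_prediction)


prediction = predict_brain_weight(5)
print(f"{prediction:.2f}")
```

The result is close to, but not exactly, 32.79, because the solution rounds the exponent to 3.49 before exponentiating.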
