Regression Analysis: Adjusted Versus Unadjusted

COR1-GB.1305.03 FINAL EXAM

This is the question sheet. There are 10 questions, each worth 10 points. Please write all answers in the answer book, and justify your answers. Good Luck!

For questions 1 to 4, we consider data on state reading/math tests for the 10 most populous states in the US in 2013, based on 4th and 8th grades, combined.

The explanatory variable (x) is “Unadjusted”, the number of months ahead of the national average for the given state based on raw (unadjusted) scores on NAEP tests (National Assessment of Educational Progress). The states with the highest unadjusted scores are Ohio and Pennsylvania.

The response variable (y) is “Adjusted”, the number of months ahead of the national average for the given state based on test scores that are adjusted for student demographic factors such as poverty and share of students in special education.

1) Figure 1 below provides a scatterplot of Adjusted vs. Unadjusted, followed by the Minitab output from a simple regression of Adjusted on Unadjusted.

Fig 1: Scatterplot of Adjusted vs Unadjusted

[Scatterplot: Adjusted (vertical axis) vs. Unadjusted (horizontal axis)]

Regression Analysis: Adjusted versus Unadjusted

Analysis of Variance

Source      DF      SS      MS  F-Value  P-Value
Regression   1   55.27  55.269     5.82    0.042
Error        8   75.92   9.490
Total        9  131.19

Model Summary

      S    R-sq  R-sq(adj)  R-sq(pred)
3.08050  42.13%     34.90%       8.18%

Coefficients

Term         Coef  SE Coef  T-Value  P-Value
Constant    0.502    0.974     0.52    0.620
Unadjusted  0.746    0.309     2.41    0.042

Regression Equation

Adjusted = 0.502 + 0.746 Unadjusted

A) Give an interpretation of the intercept of the fitted model, in practical terms. (2 points).

B) Is there evidence of a positive linear relationship between Unadjusted and Adjusted? (3 points).

C) California had an Unadjusted score of −6.4 and an Adjusted score of −6.3. Is this data point above the fitted line? Justify your answer. (3 points).

D) Do you think that the true regression line goes through the origin, (0,0)? (2 points)
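The quantities in the Minitab output for Problem 1 are linked by simple arithmetic, which the sketch below reproduces as a cross-check. All numbers are copied from the output above; the variable and function names are illustrative only, and the fitted value uses the rounded coefficients, so results may differ from Minitab in the last digit.

```python
# Values copied from the simple-regression Minitab output (Adjusted on Unadjusted).
ssr, sse, sst = 55.27, 75.92, 131.19    # regression, error, and total sums of squares
df_error = 8                            # n - 2 with n = 10 states
b0, b1, se_b1 = 0.502, 0.746, 0.309     # intercept, slope, SE of slope

r_sq = ssr / sst          # R-sq = SSR / SST
mse = sse / df_error      # mean squared error
f_value = ssr / mse       # regression has 1 df, so MSR = SSR
s = mse ** 0.5            # S is the square root of MSE
t_slope = b1 / se_b1      # t-statistic for H0: slope = 0


def fitted(unadjusted):
    """Fitted Adjusted score from the (rounded) regression equation."""
    return b0 + b1 * unadjusted


# Residual = observed - fitted; a negative residual means the point lies below the line.
california_fit = fitted(-6.4)
california_resid = -6.3 - california_fit

print(f"R-sq = {r_sq:.4f}, F = {f_value:.2f}, S = {s:.4f}, t = {t_slope:.2f}")
print(f"California: fitted = {california_fit:.3f}, residual = {california_resid:.3f}")
```

In simple regression the F-value equals the square of the slope's t-statistic, which is why both carry the same p-value (0.042) in the output.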

2) Consider the same simple regression as in Problem 1.

A) Test the null hypothesis that the true slope is 1, at the 5% level of significance. (5 points).

B) Construct a 95% confidence interval for the true slope. (5 points).
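A minimal sketch of the arithmetic behind Problem 2, assuming the usual t-based inference for a regression slope with n - 2 = 8 degrees of freedom. The slope and its standard error come from the Minitab output in Problem 1; the critical value 2.306 is the standard t-table entry for a two-sided 5% test with 8 degrees of freedom, hardcoded here to avoid a dependency on a statistics library. Variable names are illustrative.

```python
# Slope estimate and standard error from the Minitab output in Problem 1.
b1, se_b1 = 0.746, 0.309
T_CRIT = 2.306          # t-table value t(0.025, df=8), assumed from a standard table

# Test H0: slope = 1 against a two-sided alternative.
hypothesized_slope = 1.0
t_stat = (b1 - hypothesized_slope) / se_b1
reject_h0 = abs(t_stat) > T_CRIT

# 95% confidence interval: estimate +/- critical value * standard error.
margin = T_CRIT * se_b1
ci_lower, ci_upper = b1 - margin, b1 + margin

print(f"t = {t_stat:.3f}, reject H0 at 5%: {reject_h0}")
print(f"95% CI for slope: ({ci_lower:.3f}, {ci_upper:.3f})")
```

The two parts agree with each other: a two-sided 5% test rejects a hypothesized slope exactly when that value falls outside the 95% confidence interval.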

3) Now, we introduce a second explanatory variable: the population of the 10 states, in Millions. Figure 2 gives a fitted line plot for the simple regression of Adjusted on Population, followed by the Minitab output for the multiple regression of Adjusted on Unadjusted and Population.

Fig 2: Fitted Line Plot for Adjusted vs. Population

Adjusted = 1.874 - 0.0825 Population

S = 3.96426, R-Sq = 4.2%, R-Sq(adj) = 0.0%

[Fitted line plot: Adjusted (vertical axis) vs. Population (horizontal axis)]

Regression Analysis: Adjusted versus Unadjusted, Population

Analysis of Variance

Source      DF       SS      MS  F-Value  P-Value
Regression   2   61.473  30.736     3.09    0.109
Error        7   69.712   9.959
Total        9  131.185

Model Summary

      S    R-sq  R-sq(adj)  R-sq(pred)
3.15577  46.86%     31.68%       0.00%

Coefficients

Term         Coef  SE Coef  T-Value  P-Value
Constant    -1.36     2.56    -0.53    0.612
Unadjusted  0.926    0.391     2.37    0.049
Population  0.109    0.137     0.79    0.456

Regression Equation

Adjusted = -1.36 + 0.926 Unadjusted + 0.109 Population

A) Does Population seem to be a useful variable for explaining the Adjusted score? (2 points).

B) Did the introduction of Population as an explanatory variable weaken or strengthen the overall regression model, compared to the simple regression on Unadjusted? (5 points).

C) Use the AICC to select between the simple regression model on Unadjusted, and the multiple regression on both Unadjusted and Population. (3 points).
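The AICC comparison in part C can be sketched as follows. The question sheet does not state which form of the criterion the course uses, so this is an assumption: AICC = n ln(SSE/n) + 2(m+1)n/(n-m-2), where m counts the regression coefficients including the intercept. Other common forms differ from this one only by a constant that does not depend on the model, so the ranking is unaffected. The SSE values are taken from the two ANOVA tables above; the function name is illustrative.

```python
import math


def aicc(sse, n, m):
    """Corrected AIC for a least-squares fit (assumed convention).

    sse: error sum of squares; n: sample size;
    m: number of regression coefficients, including the intercept.
    """
    return n * math.log(sse / n) + 2 * (m + 1) * n / (n - m - 2)


n = 10  # ten states

# Simple regression on Unadjusted: intercept + 1 slope.
aicc_simple = aicc(sse=75.92, n=n, m=2)

# Multiple regression on Unadjusted and Population: intercept + 2 slopes.
aicc_multiple = aicc(sse=69.712, n=n, m=3)

print(f"AICC (Unadjusted only):         {aicc_simple:.2f}")
print(f"AICC (Unadjusted + Population): {aicc_multiple:.2f}")
print("Preferred model:", "simple" if aicc_simple < aicc_multiple else "multiple")
```

The model with the smaller AICC is preferred. With n = 10 the penalty term grows quickly as coefficients are added, so a modest drop in SSE may not be enough to justify an extra variable.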

4) Figure 3 below gives the plot of residuals vs. fitted values for the multiple regression of Adjusted on Unadjusted and Population. Does this plot reveal any potential problems with the model?

Fig 3: Residuals vs. Fitted Values for Multiple Regression (response is Adjusted)

[Plot: Residual (vertical axis) vs. Fitted Value (horizontal axis)]

5) Weekly wages for individuals working in Manhattan have a skewed (asymmetrical) distribution with a population mean of $2,749. The right tail of this distribution is much heavier (longer) than the left tail. You are going to select a random sample of size 20 from this population and compute the sample mean weekly wage, X̄. Given this situation, is the expected value of the sample mean weekly wage greater than $2,749, less than $2,749, or equal to $2,749? Select one of these three scenarios, and defend your selection.
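The situation in Problem 5 can be explored with a small simulation. The sketch below is only illustrative: the true wage distribution is not given, so a gamma distribution with shape 2 and scale 1374.5 (mean 2 × 1374.5 = 2,749, right-skewed) is used as a hypothetical stand-in, and the number of repetitions and the seed are arbitrary choices.

```python
import random

POP_MEAN = 2749.0
SHAPE = 2.0                     # hypothetical shape parameter (right-skewed)
SCALE = POP_MEAN / SHAPE        # chosen so that shape * scale = population mean
SAMPLE_SIZE = 20
REPS = 5000                     # number of simulated samples (arbitrary)

rng = random.Random(0)          # fixed seed so the run is reproducible

# Draw many samples of size 20 and record each sample mean.
sample_means = []
for _ in range(REPS):
    sample = [rng.gammavariate(SHAPE, SCALE) for _ in range(SAMPLE_SIZE)]
    sample_means.append(sum(sample) / SAMPLE_SIZE)

# The average of the sample means estimates the expected value of the sample mean.
avg_of_means = sum(sample_means) / REPS
print(f"Population mean:         {POP_MEAN:.1f}")
print(f"Average of sample means: {avg_of_means:.1f}")
```

The simulation only illustrates the behaviour for one assumed distribution; the exam answer should rest on the general property of the sample mean as an estimator, not on a simulated value.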

6) In simple linear regression, if the least-squares line has a positive slope, is it possible to have R² = 0? Justify your answer.

7) Suppose you have a simple regression data set, where the least-squares line has a slope of 2.1. Consider any particular data point in the scatterplot of y versus x, and move this data point by one unit to the right. What will be the resulting change in the total sum of squares, SST? Give a numerical answer, if possible.

8) In simple linear regression, based on a data set with n = 50, if R² = 0, is it possible for the estimated slope to be statistically significantly different from zero at the 5% level of significance? Justify your answer.

9) If a discrete random variable has a standard deviation of zero, must it have a mean of zero? Justify your answer.

10) In testing H 0 :   1 versus H A :   1 based on a sample of size n=250, suppose you get a p-value of .006. If the sample standard deviation is 3.1, what is the value of the sample mean?