Covers Categorical Coding, Centering and VIF

Homework 6 - 4 Questions

Covers Categorical Coding, Centering and VIF

Use the Project Talent data set.

1. A lack-of-fit test regressing Math on Gender, SES, Sociability, Reading, and Mechanical Reasoning indicates that a possible curvature exists for the variables Reading and Mechanical Reasoning. Without centering, create new, second-order variables for Reading and for Mechanical Reasoning. Repeat the multiple regression analysis including these second order terms (will result in a model with seven (7) predictors). Select Options and select Variance Inflation Factor AND Lack-of-Fit > Data Subsetting and answer the following questions.

a. What are the VIF values for all of the predictors?

Predictor Coef SE Coef T P VIF

Gender -1.238 1.895 -0.65 0.522 1.542

Reading -1.6071 0.6963 -2.31 0.034 63.366

Mech -2.040 1.159 -1.76 0.096 43.422

Social 0.5179 0.2971 1.74 0.099 1.365

SES 0.22559 0.09012 2.50 0.023 1.543

r2 0.03581 0.01264 2.83 0.011 69.894

me2 0.14575 0.06323 2.30 0.034 52.568

b. Does this indicate the presence of Multicollinearity? Why or why not?

Yes, since the VIF values for Reading, Mechanical and their second order terms exceed 10

c. What is the p-value of the lack-of-fit tests and what does this indicate about model fit?

No evidence of lack of fit (P >= 0.1) indicating that there is no lack of model fit

2. Create new second order terms for Reading and Mechanical Reasoning after centering both the first order terms. Repeat the multiple regression analysis including these centered first and second order terms. Select Options and select Variance Inflation Factor AND Lack-of-Fit > Data Subsetting and answer the following questions.

a. What are the VIF values for all of the predictors?

Predictor Coef SE Coef T P VIF

Gender -1.238 1.895 -0.65 0.522 1.542

Social 0.5179 0.2971 1.74 0.099 1.365

SES 0.22559 0.09012 2.50 0.023 1.543

cR 0.5960 0.1435 4.15 0.001 2.692

cM 1.2242 0.4320 2.83 0.011 6.033

cR2 0.03581 0.01264 2.83 0.011 2.651

cM2 0.14575 0.06323 2.30 0.034 2.963

b. Does this indicate the presence of Multicollinearity and support your answer?

With all VIF values being less than 10 there is no indication of multicollinearity. The centering reduced the correlation between the first and second order terms for Reading and Mechanical Reasoning

c. What is the p-value of the lack-of-fit test and what does this indicate about model fit?

No evidence of lack of fit (P >= 0.1) indicating that there is no lack of model fit

3. The variable School Size is interpreted as follows:

1 = number of students is less than 100 (call this "small")

2 = number of students is from 100 to 399 (call this "medium")

3 = number of students is 400 or more (call this "large")

The data set includes dummy coding, effect coding (large school as reference group for both), and orthogonal coding for Small - Medium, and Small/Medium - Large. Regressing Math on just these coded variables (i.e. regress Math separately on the two DV, then the two effect variables, then the two orthogonal variables) and answer the following questions:

a: Provide an interpretation of the t-tests for regressing Math on the DV.

Predictor Coef SE Coef T P

Constant 31.571 3.495 9.03 0.000

Small_DV -13.429 4.943 -2.72 0.013

Medium_DV -8.481 4.471 -1.90 0.071

Since dummy coded, the T-tests are comparing the means for the categorical level in the model to the mean for the category not included in the model. In this case, the mean Math score of the Small schools and mean Math score of the Medium schools are being compared to the mean Math score of the Large schools. Using alpha of 0.05, we would conclude a significant difference between the mean math scores for the Small and Large schools. The negative slope for Small indicates that the mean math score for Small schools is less than the mean math score for Large schools. Since the p-value for Medium exceeds 0.05 we cannot conclude a difference in mean math scores between the Medium and Large schools. Note that in this model the intercept is the mean math score for the Large schools.

b: Provide an interpretation of the t-tests for regressing Math on the effect variables.

Predictor Coef SE Coef T P

Constant 24.268 1.892 12.83 0.000

Small_eff -6.126 2.766 -2.21 0.037

Med_eff -1.177 2.484 -0.47 0.640

Since effect coded AND unequal sample sizes, the T-tests are comparing the means for the categorical level in the model to the average of the three group means. That is, if you find the mean math score for the three school sizes (18.14, 23.09, and 31.57) this average is 24.3 – the intercept of the model. In this model, the mean Math score of the Small schools and mean Math score of the Medium schools are being compared to the this average mean. Using alpha of 0.05, we would conclude a significant difference between the mean math scores for the Small and average of the mean Math scores for the three sizes. The negative slope for Small indicates that the mean math score for Small schools is less than this average. Since the p-value for Medium exceeds 0.05 we cannot conclude a difference between the mean math scores for the Medium and average of the mean Math scores for the three sizes.

c: Provide an interpretation of the t-tests for regressing Math on the orthogonal variables.

Predictor Coef SE Coef T P

Constant 24.080 1.850 13.02 0.000

Small-Med -0.2749 0.2484 -1.11 0.280

SM-Large -0.4162 0.1648 -2.53 0.019

Since orthogonal coded, the T-tests are comparing the means for the categorical levels being compared in that specific contrast. For Small-Med this compares the mean Math score between these two school sizes; for SM-Large we are comparing the mean Math score of the Small/Medium schools to the mean math score of the Large schools. Again using alpha of 0.05, we only reject the hypothesis for the SM-Large contrast and conclude a difference in mean math scores for the Small/Medium schools compared to the Large schools. With a negative slope and the contrast difference being calculated by “SM minus Large” we can add that the Large school mean math scores are higher than the mean math scores of the Small/Medium sized schools. Note that the intercept in this model is the overall mean of the math scores, i.e. the grand mean for math.

4. Regress Math on each orthogonal coded variable separately (i.e. conduct two simple linear regressions) and answer the following:

a. What is the R-squared value for regressing Math on Small-Med?

R-sq = 4.1%

b. What is the R-squared value for regressing Math on SM-Large?

R-sq = 21.6%

c. What is the R-squared value for the multiple regression model when regressing Math on both orthogonal variables (i.e. from the model in 3c)?

R-sq = 25.7%

d. Did the sum of the R-squared values in parts a and b equal the R-squared value in c?

Yes they did. NOTE: this is the result of having orthogonal (i.e. correlation between predictors is zero). In such instances when all the model predictors are orthogonal, the R-squared values are additive.

1