SELECTING THE “BEST” MODEL FOR MULTIPLE REGRESSION

Introduction

• In multiple regression, a common goal is to determine which independent variables contribute significantly to explaining the variability in the dependent variable.

• A goal in determining the best model is to minimize the residual mean square, which would in turn maximize the multiple correlation value, R².

• The model that contains all independent variables will give the maximum R² value, but not all variables may contribute significantly to explaining the variability in the dependent variable.

• Another statistic used in determining the “best” model is the CP criterion. The CP values will decrease as the number of independent variables in the model increases; the model containing all of the independent variables always has CP = P + 1.

!"# • � = � − � − 1 − 1 + (� + 1), where ! !!

o N = number of observations
o P = number of independent variables in the model
o RMS_P = residual mean square of the model containing the P selected variables
o σ̂² = residual mean square when all independent variables are included in the model
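• As a quick check of this formula (a worked example using values that appear in the father’s output later in these notes): with N = 150, P = 1, RMS_P = 0.31787 (Step 1), and σ̂² = 0.27820 (the full model in Step 3), C_P = (150 − 1 − 1)(0.31787/0.27820 − 1) + (1 + 1) ≈ 23.1, which agrees, up to rounding, with the C(p) = 23.1084 reported in Step 1 of the father’s analysis.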

Stepwise Regression

• A variable selection method where various combinations of variables are tested together.

• The “first step” will identify the “best” one-variable model. Subsequent steps will identify the “best” two-variable, three-variable, etc. models.

• An F-test is performed on each independent variable in the model to determine whether it contributes significantly, given the other variables in that step.

• The “best” models are typically identified as those that maximize R², have a small CP value (close to P + 1), or both.

• Variations of stepwise regression include the Forward Selection Method and the Backward Elimination Method.

o Forward selection: a method of stepwise regression where one independent variable is added at a time, choosing at each step the variable that most increases the R² value.
   . Addition of variables to the model stops when the “minimum F-to-enter” exceeds a specified probability level.
   . The default “minimum F-to-enter” level in SAS is 0.15.
o Backward elimination: a method of stepwise regression where all independent variables begin in the model and variables are then eliminated one at a time.
   . The variables eliminated first are those that contribute the least to the model.
   . Elimination continues until the “minimum F-to-remove” drops below a specified probability level.
   . The default “minimum F-to-remove” level in SAS is 0.15.
(A sketch of SAS code that requests these two methods follows below.)
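• A minimal sketch of how these two methods might be requested is shown below. The data set name lung is an assumption (the handout does not show a DATA step), and PROC REG with the SELECTION= option is used here as the newer equivalent of PROC STEPWISE; the entry and stay levels are set explicitly to the 0.15 level mentioned above.

   /* Sketch only: the data set name LUNG is hypothetical */
   proc reg data=lung;
      /* Forward selection: a variable enters while its p-value is below SLENTRY */
      model ffev1 = fage fheight fweight / selection=forward slentry=0.15;
      /* Backward elimination: a variable is removed while its p-value exceeds SLSTAY */
      model ffev1 = fage fheight fweight / selection=backward slstay=0.15;
   run;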

Stepwise Regression Using SAS

• In this example, the lung function data set will be used again, with two separate analyses.

o Analysis 1: Determine which of the father’s independent variables (fage, fheight, fweight) contribute significantly to the variability in the father’s FEV1 (ffev1).

o SAS commands are:

   Proc STEPWISE;
      Model ffev1 = fage fheight fweight;
      Title 'Stepwise regression of father data';
   Run;

o Analysis 2: Determine which of the youngest child’s independent variables (ycage, ycheight, ycweight) contribute significantly to the variability in the youngest child’s FEV1 (ycfev1).

o SAS commands are:

   Proc STEPWISE;
      Model ycfev1 = ycage ycheight ycweight;
      Title 'Stepwise regression of youngest child data';
   Run;
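• As a side note (an assumption, not part of the original handout): in current SAS releases the same two analyses can be requested through PROC REG with SELECTION=STEPWISE, again assuming a data set named lung.

   /* Hypothetical data set name LUNG; both models can be fit in one PROC REG step */
   proc reg data=lung;
      model ffev1  = fage  fheight  fweight  / selection=stepwise slentry=0.15 slstay=0.15;
      model ycfev1 = ycage ycheight ycweight / selection=stepwise slentry=0.15 slstay=0.15;
   run;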

Stepwise regression of father data

The STEPWISE Procedure
Model: MODEL1
Dependent Variable: ffev1

Number of Observations Read    150
Number of Observations Used    150

Stepwise Selection: Step 1

Variable fheight Entered: R-Square = 0.2544 and C(p) = 23.1084

Analysis of Variance

                               Sum of        Mean
Source              DF       Squares      Square    F Value    Pr > F
Model                1      16.05317    16.05317      50.50    <.0001
Error              148      47.04513     0.31787
Corrected Total    149      63.09830

              Parameter    Standard
Variable       Estimate       Error    Type II SS    F Value    Pr > F
Intercept      -4.08670     1.15198       4.00046      12.59    0.0005
fheight         0.11811     0.01662      16.05317      50.50    <.0001

Bounds on condition number: 1, 1

Stepwise Selection: Step 2

Variable fage Entered: R-Square = 0.3337 and C(p) = 7.1217

Analysis of Variance

                               Sum of        Mean
Source              DF       Squares      Square    F Value    Pr > F
Model                2      21.05697    10.52848      36.81    <.0001
Error              147      42.04133     0.28600
Corrected Total    149      63.09830

              Parameter    Standard
Variable       Estimate       Error    Type II SS    F Value    Pr > F
Intercept      -2.76075     1.13775       1.68392       5.89    0.0165
fage           -0.02664     0.00637       5.00380      17.50    <.0001
fheight         0.11440     0.01579      15.01346      52.50    <.0001

Bounds on condition number: 1.0032, 4.0127

Stepwise Selection: Step 3

Variable fweight Entered: R-Square = 0.3563 and C(p) = 4.0000

Analysis of Variance

                               Sum of        Mean
Source              DF       Squares      Square    F Value    Pr > F
Model                3      22.48181     7.49394      26.94    <.0001
Error              146      40.61649     0.27820
Corrected Total    149      63.09830

              Parameter    Standard
Variable       Estimate       Error    Type II SS    F Value    Pr > F
Intercept      -3.38388     1.15541       2.38620       8.58    0.0040
fage           -0.02652     0.00628       4.95872      17.82    <.0001
fheight         0.13590     0.01824      15.43974      55.50    <.0001
fweight        -0.00478     0.00211       1.42484       5.12    0.0251

Bounds on condition number: 1.3767, 11.259

All variables left in the model are significant at the 0.1500 level.

All variables have been entered into the model.

Prediction Equation (Found in Step 3)

Predicted ffev1 = −3.384 − 0.027(fage) + 0.136(fheight) − 0.005(fweight)

• 35.6% of the variation in ffev1 is explained by having fage, fheight, and fweight in the model.
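• As an illustration (the predictor values here are hypothetical, and the heights and weights are assumed to be in inches and pounds): for a 40-year-old father who is 70 inches tall and weighs 160 pounds, the equation predicts ffev1 ≈ −3.384 − 0.027(40) + 0.136(70) − 0.005(160) ≈ 4.26.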

Summary of Stepwise Selection

        Variable    Variable    Number    Partial      Model
Step    Entered     Removed    Vars In   R-Square   R-Square       C(p)    F Value    Pr > F
   1    fheight                      1     0.2544     0.2544    23.1084      50.50    <.0001
   2    fage                         2     0.0793     0.3337     7.1217      17.50    <.0001
   3    fweight                      3     0.0226     0.3563     4.0000       5.12    0.0251

The Model R-Square column gives the collective (cumulative) R² values.
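• As a check, the Partial R-Square values accumulate to the Model R-Square: 0.2544 + 0.0793 + 0.0226 = 0.3563, the R² of the final three-variable model.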

Stepwise regression of youngest child data

The STEPWISE Procedure
Model: MODEL1
Dependent Variable: ycfev1

Number of Observations Read                    150
Number of Observations Used                     24
Number of Observations with Missing Values     126

Forward Selection: Step 1

Variable ycheight Entered: R-Square = 0.7801 and C(p) = 1.8825

Analysis of Variance

                               Sum of        Mean
Source              DF       Squares      Square    F Value    Pr > F
Model                1      11.96424    11.96424      78.05    <.0001
Error               22       3.37234     0.15329
Corrected Total     23      15.33658

              Parameter    Standard
Variable       Estimate       Error    Type II SS    F Value    Pr > F
Intercept      -5.04327     0.81916       5.81028      37.90    <.0001
ycheight        0.12608     0.01427      11.96424      78.05    <.0001

Bounds on condition number: 1, 1

Forward Selection: Step 2

Variable ycage Entered: R-Square = 0.7933 and C(p) = 2.5745

Analysis of Variance

                               Sum of        Mean
Source              DF       Squares      Square    F Value    Pr > F
Model                2      12.16581     6.08290      40.29    <.0001
Error               21       3.17078     0.15099
Corrected Total     23      15.33658

              Parameter    Standard
Variable       Estimate       Error    Type II SS    F Value    Pr > F
Intercept      -4.23968     1.06990       2.37097      15.70    0.0007
ycage           0.09168     0.07934       0.20157       1.33    0.2609
ycheight        0.09510     0.03033       1.48458       9.83    0.0050

Bounds on condition number: 4.5847, 18.339

Note: ycage is non-significant (Pr > F = 0.2609).

Forward Selection: Step 3

Variable ycweight Entered: R-Square = 0.7990 and C(p) = 4.0000

Analysis of Variance

                               Sum of        Mean
Source              DF       Squares      Square    F Value    Pr > F
Model                3      12.25435     4.08478      26.51    <.0001
Error               20       3.08223     0.15411
Corrected Total     23      15.33658

              Parameter    Standard
Variable       Estimate       Error    Type II SS    F Value    Pr > F
Intercept      -5.16902     1.63451       1.54127      10.00    0.0049
ycage           0.07990     0.08165       0.14759       0.96    0.3395
ycheight        0.11965     0.04459       1.10969       7.20    0.0143
ycweight       -0.00401     0.00528       0.08854       0.57    0.4573

Note: ycage and ycweight are non-significant.

Bounds on condition number: 9.7102, 57.311

All variables have been entered into the model.

Prediction Equation (Found in Step 1)

Predicted ycfev1 = −5.043 + 0.126(ycheight)

• About 78% (R² = 0.7801) of the variation in ycfev1 is explained by having ycheight in the model.
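• As an illustration (the height value is hypothetical, and heights are assumed to be in inches): for a child 50 inches tall, the equation predicts ycfev1 ≈ −5.043 + 0.126(50) ≈ 1.26.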

Summary of Forward Selection

        Variable    Number    Partial      Model
Step    Entered    Vars In   R-Square   R-Square      C(p)    F Value    Pr > F
   1    ycheight         1     0.7801     0.7801    1.8825      78.05    <.0001
   2    ycage            2     0.0131     0.7933    2.5745       1.33    0.2609
   3    ycweight         3     0.0058     0.7990    4.0000       0.57    0.4573

Summary of Identifying the “Best” Model

• The F-test for each independent variable tests whether that variable contributes significantly to the model, given that the other independent variables in that step are included in the model.

• For example, in Step 2 of the analysis of the father’s data, the null hypothesis tested by the F-test for fage is H0: βfage = 0, given that fheight is already in the model.

• In Step 3 of the analysis of the father’s data, the null hypothesis tested by the F-test for fage is H0: βfage = 0, given that fheight and fweight are already in the model.
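• Each of these partial F-values can be reproduced from the printed output as the variable’s Type II SS divided by the error mean square for that step. For example, for fage in Step 2 of the father’s analysis, F = 5.00380/0.28600 ≈ 17.50, the value shown in the table.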

• To determine which model “best” explains the variation in the dependent variable, find the first step at which one of the independent variables does not contribute significantly to the model.

• The model prior to this model is the one that should be used.

• The cumulative R² × 100 for this model tells you the percent of the variation in the dependent variable that is explained by having the identified independent variables in the model.

Considerations When Conducting Stepwise Regression

• The selection of the “best” model is only as good as the independent variables used in the analyses. If important independent variables are not considered or are left out of the analyses, the results obtained may have biased regression coefficients, low R² values, or both.

• The tests should be considered a screening method, not tests of significance, since the F-values calculated don’t necessarily match up with values in an F-table.

• Like multiple linear regression, results from stepwise regression are sensitive to violations of the assumptions underlying regression or problematic data.

• To test the robustness of the independent variables identified to be important, analyze subsets of the data to determine if the identified independent variables continue to be detected as significant.
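• A minimal sketch of one such check in SAS is given below: draw a simple random sample of roughly half of the observations, rerun the selection on that subset, and compare which variables are chosen. The data set name lung, the sampling rate, and the seed are illustrative assumptions.

   /* Draw a simple random sample of about half the observations (sketch only) */
   proc surveyselect data=lung out=lung_half method=srs samprate=0.5 seed=12345;
   run;

   /* Rerun the stepwise selection on the subset and compare the selected variables */
   proc reg data=lung_half;
      model ffev1 = fage fheight fweight / selection=stepwise slentry=0.15 slstay=0.15;
   run;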