Selecting the “Best” Model – Stepwise Regression


SELECTING THE “BEST” MODEL FOR MULTIPLE LINEAR REGRESSION

Introduction

• In multiple regression a common goal is to determine which independent variables contribute significantly to explaining the variability in the dependent variable.
• A goal in determining the best model is to minimize the residual mean square, which in turn maximizes the multiple correlation value, R2.
• The model that contains all independent variables will give the maximum R2 value, but not all variables may contribute significantly to explaining the variability in the dependent variable.
• Another statistic used in determining the “best” model is the CP criterion. CP generally decreases as important independent variables are added to the model, and models with a small CP (close to P + 1) are preferred.
• CP = (N − P − 1)(RMS/σ̂² − 1) + (P + 1), where
  o N = number of observations
  o P = number of independent variables in the model
  o RMS = residual mean square of the model containing the P selected variables
  o σ̂² = residual mean square when all independent variables are included in the model

Stepwise Regression

• A variable selection method where various combinations of variables are tested together.
• The “first step” identifies the “best” one-variable model. Subsequent steps identify the “best” two-variable, three-variable, etc. models.
• An F-test is carried out on each independent variable in the model at each step.
• The “best” models are typically identified as those that maximize R2, have a small CP, or both.
• Variations of stepwise regression include the Forward Selection Method and the Backward Elimination Method.
  o Forward selection: a method of stepwise regression where one independent variable is added at a time, chosen to increase the R2 value. Addition of variables to the model stops when the “minimum F-to-enter” exceeds a specified probability level. The default minimum F-to-enter in SAS is 0.15.
  o Backward elimination: a method of stepwise regression where all independent variables begin in the model and variables are then eliminated one at a time. The variables eliminated first are those that contribute the least to the model. Elimination continues until the “minimum F-to-remove” drops below a specified probability level. The default minimum F-to-remove in SAS is 0.15.

Stepwise Regression Using SAS

• In this example, the lung function data will be used again, with two separate analyses.
  o Analysis 1: Determine which of the father’s independent variables (fage, fheight, fweight) significantly contribute to the variability in the father’s FEV1 (ffev1).
  o SAS commands are:
      Proc STEPWISE;
      Model ffev1=fage fheight fweight;
      Title 'Stepwise regression of father data';
      Run;
  o Analysis 2: Determine which of the youngest child’s independent variables (ycage, ycheight, ycweight) significantly contribute to the variability in the youngest child’s FEV1 (ycfev1).
  o SAS commands are:
      Proc STEPWISE;
      Model ycfev1=ycage ycheight ycweight;
      Title 'Stepwise regression of youngest child''s data';
      Run;
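PROC STEPWISE is an older SAS procedure; in current SAS releases the same selection methods can be requested through the SELECTION= option of PROC REG. A minimal sketch of the equivalent calls is shown below; the data set name lungfun is only a placeholder for whatever name the lung function data were read into, and SLENTRY/SLSTAY are set to 0.15 to match the 0.15 level noted above.

      * Equivalent stepwise analyses via PROC REG (sketch; data set name lungfun is assumed);
      proc reg data=lungfun;
         model ffev1 = fage fheight fweight
               / selection=stepwise slentry=0.15 slstay=0.15;
         title 'Stepwise regression of father data';
      run;

      proc reg data=lungfun;
         model ycfev1 = ycage ycheight ycweight
               / selection=stepwise slentry=0.15 slstay=0.15;
         title 'Stepwise regression of youngest child data';
      run;

SELECTION=FORWARD or SELECTION=BACKWARD on the MODEL statement would request the forward selection or backward elimination variations instead.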
Stepwise regression of father data

The STEPWISE Procedure
Model: MODEL1
Dependent Variable: ffev1

Number of Observations Read    150
Number of Observations Used    150

Stepwise Selection: Step 1
Variable fheight Entered: R-Square = 0.2544 and C(p) = 23.1084

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1         16.05317      16.05317     50.50    <.0001
Error            148         47.04513       0.31787
Corrected Total  149         63.09830

Variable     Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept              -4.08670          1.15198      4.00046     12.59   0.0005
fheight                 0.11811          0.01662     16.05317     50.50   <.0001

Bounds on condition number: 1, 1

Stepwise Selection: Step 2
Variable fage Entered: R-Square = 0.3337 and C(p) = 7.1217

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2         21.05697      10.52848     36.81    <.0001
Error            147         42.04133       0.28600
Corrected Total  149         63.09830

Variable     Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept              -2.76075          1.13775      1.68392      5.89   0.0165
fage                   -0.02664          0.00637      5.00380     17.50   <.0001
fheight                 0.11440          0.01579     15.01346     52.50   <.0001

Bounds on condition number: 1.0032, 4.0127

Stepwise Selection: Step 3
Variable fweight Entered: R-Square = 0.3563 and C(p) = 4.0000

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3         22.48181       7.49394     26.94    <.0001
Error            146         40.61649       0.27820
Corrected Total  149         63.09830

Variable     Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept              -3.38388          1.15541      2.38620      8.58   0.0040
fage                   -0.02652          0.00628      4.95872     17.82   <.0001
fheight                 0.13590          0.01824     15.43974     55.50   <.0001
fweight                -0.00478          0.00211      1.42484      5.12   0.0251

Bounds on condition number: 1.3767, 11.259

All variables left in the model are significant at the 0.1500 level.
All variables have been entered into the model.

Prediction Equation (found in Step 3):
ffev1 = -3.384 - 0.027(fage) + 0.136(fheight) - 0.005(fweight)

• 35.6% of the variation in ffev1 is explained by having fage, fheight, and fweight in the model.
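As a check on the CP formula given in the introduction, the Step 2 value C(p) = 7.1217 can be reproduced (up to rounding of the printed mean squares) using N = 150, P = 2, RMS = 0.28600 (the Step 2 error mean square), and σ̂² = 0.27820 (the error mean square of the full model in Step 3):

\begin{align*}
C_P &= (N - P - 1)\left(\frac{RMS}{\hat{\sigma}^2} - 1\right) + (P + 1) \\
    &= (150 - 2 - 1)\left(\frac{0.28600}{0.27820} - 1\right) + 3 \\
    &\approx 147(0.0280) + 3 \approx 7.12 .
\end{align*}

For the full model in Step 3, RMS = σ̂², so CP reduces to P + 1 = 4, matching the printed C(p) = 4.0000.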
Summary of Stepwise Selection

        Variable   Variable   Number    Partial     Model
Step    Entered    Removed    Vars In   R-Square    R-Square      C(p)    F Value   Pr > F
  1     fheight               1         0.2544      0.2544      23.1084    50.50    <.0001
  2     fage                  2         0.0793      0.3337       7.1217    17.50    <.0001
  3     fweight               3         0.0226      0.3563       4.0000     5.12    0.0251

(The Model R-Square column gives the collective, i.e. cumulative, R2 values.)

Stepwise regression of youngest child data

The STEPWISE Procedure
Model: MODEL1
Dependent Variable: ycfev1

Number of Observations Read                   150
Number of Observations Used                    24
Number of Observations with Missing Values    126

Forward Selection: Step 1
Variable ycheight Entered: R-Square = 0.7801 and C(p) = 1.8825

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1         11.96424      11.96424     78.05    <.0001
Error             22          3.37234       0.15329
Corrected Total   23         15.33658

Variable     Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept              -5.04327          0.81916      5.81028     37.90   <.0001
ycheight                0.12608          0.01427     11.96424     78.05   <.0001

Bounds on condition number: 1, 1

Forward Selection: Step 2
Variable ycage Entered: R-Square = 0.7933 and C(p) = 2.5745

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2         12.16581       6.08290     40.29    <.0001
Error             21          3.17078       0.15099
Corrected Total   23         15.33658

Variable     Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept              -4.23968          1.06990      2.37097     15.70   0.0007
ycage                   0.09168          0.07934      0.20157      1.33   0.2609
ycheight                0.09510          0.03033      1.48458      9.83   0.0050

(Note: ycage is non-significant.)

Bounds on condition number: 4.5847, 18.339

Forward Selection: Step 3
Variable ycweight Entered: R-Square = 0.7990 and C(p) = 4.0000

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3         12.25435       4.08478     26.51    <.0001
Error             20          3.08223       0.15411
Corrected Total   23         15.33658

Variable     Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept              -5.16902          1.63451      1.54127     10.00   0.0049
ycage                   0.07990          0.08165      0.14759      0.96   0.3395
ycheight                0.11965          0.04459      1.10969      7.20   0.0143
ycweight               -0.00401          0.00528      0.08854      0.57   0.4573

(Note: ycage and ycweight are non-significant.)

Bounds on condition number: 9.7102, 57.311

All variables have been entered into the model.

Prediction Equation (found in Step 1):
ycfev1 = -5.043 + 0.126(ycheight)

• About 78% of the variation in ycfev1 is explained by having ycheight in the model.

Summary of Forward Selection

        Variable   Number    Partial     Model
Step    Entered    Vars In   R-Square    R-Square      C(p)    F Value   Pr > F
  1     ycheight   1         0.7801      0.7801       1.8825    78.05    <.0001
  2     ycage      2         0.0131      0.7933       2.5745     1.33    0.2609
  3     ycweight   3         0.0058      0.7990       4.0000     0.57    0.4573
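The F Value printed for each variable in the parameter tables above is a partial F statistic: the variable's Type II sum of squares (1 degree of freedom) divided by the error mean square for that step. For example, using the printed Step 2 values from the two analyses:

\begin{align*}
F_{\text{fage}}  &= \frac{\text{Type II SS}}{\text{error mean square}} = \frac{5.00380}{0.28600} \approx 17.50 && \text{(father, Step 2)} \\
F_{\text{ycage}} &= \frac{0.20157}{0.15099} \approx 1.33 && \text{(youngest child, Step 2)}
\end{align*}

These agree with the printed F values of 17.50 and 1.33: fage contributes significantly to the father's model, while ycage does not contribute significantly to the youngest child's model.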
Summary of Identifying the “Best” Model

• The F-test for each independent variable tests whether that variable contributes significantly to the model, given that the other independent variables in that step are already included in the model.
• For example, in Step 2 of the analysis of the father’s data, the null hypothesis tested by the F-test for fage is H0: βfage = 0, given that fheight is already in the model.
• For example, in Step 3 of the analysis of the father’s data, the null hypothesis tested by the F-test for fage is H0: βfage = 0, given that fheight and fweight are already in the model.
• To determine which model “best” explains the variation in the dependent variable, find the first step at which one of the independent variables in the model does not contribute significantly.
• The model from the step prior to that one is the model that should be used. In the father’s data all three variables remain significant, so the Step 3 model is used; in the youngest child’s data ycage is non-significant in Step 2, so the Step 1 model (ycheight only) is used.
• The cumulative (Model) R2 × 100 for the chosen model gives the percent of the variation in the dependent variable that is explained by having the identified independent variables in the model.
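Once the “best” model has been identified, a common follow-up is to refit only the selected variables so that the usual regression output is produced for that model alone. A minimal sketch (again using the placeholder data set name lungfun) that refits the models chosen above: the Step 3 model for the father and the Step 1 model (ycheight only) for the youngest child.

      * Refit the selected "best" models (sketch; data set name lungfun is assumed);
      proc reg data=lungfun;
         model ffev1 = fage fheight fweight;
         * father: Step 3 model with all three variables;
         title 'Final model for father data';
      run;

      proc reg data=lungfun;
         model ycfev1 = ycheight;
         * youngest child: Step 1 model with ycheight only;
         title 'Final model for youngest child data';
      run;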
