©EIJEAS 2016 Volume: 2 Issue: Special Issue, 60-81, Ohio, USA Electronic International Journal of Education, Arts, and Science http://www.eijeas.com

Stepwise Regression and All Possible Subsets Regression in Education

Ke Wang Texas A&M University, [email protected], Zhuo Chen Texas A&M University, [email protected]

Abstract
Stepwise methods are commonly reported in empirically based journal articles (Huberty, 1994). However, many researchers using stepwise methods fail to realize that the software packages implementing them have been programmed in error. The purpose of this study is to introduce the procedure of stepwise regression and to use Venn diagrams to illustrate the three main problems of stepwise regression: wrong degrees of freedom, capitalization on sampling error, and R² not optimized. The study also describes an alternative method, all-possible-subsets regression, and uses an example to illustrate how to apply it in a real, complex study. Finally, some matters needing attention when using both automatic procedures are discussed.
Keywords: Stepwise Regression, Degree of Freedom, Sampling Error, R², All-possible Subsets


Introduction
One of the common problems in regression research is deciding which variables to select. Researchers usually utilize or manipulate a number of predictor variables and try to create a best-fit model with a relatively small subset of predictors. During the process of building the model, researchers could try many combinations based on theory. However, it is noteworthy that the entire process may take a considerable amount of time. As one of the commonly used approaches to variable selection, the stepwise method is often reported in empirically based journal articles (Huberty, 1994). In fact, 260 peer-reviewed academic journal papers published after 1994 can be found in the EBSCOhost Online Research Databases by using the key words "stepwise regression". Although the stepwise method is extensively utilized in selecting variables, the stepwise procedure in commonly used software packages is programmed with errors. Therefore, in spite of the prevalence of stepwise methods in variable selection, the results or outcomes from the stepwise method may not be appropriate. In the present paper, we first describe the procedure of stepwise regression and its three major problems, with examples and Venn diagrams, and then illustrate the problem caused by collinearity in stepwise analysis. In addition, we introduce all-possible-subsets analysis as an alternative for selecting variables and illustrate how to use it.
What is stepwise regression?
Stepwise regression aims to select a model step by step, adding or deleting one predictor at a time based only on statistical significance. The result of this process is a single regression model. Stepwise analysis has either forward or backward progression; forward progression is more commonly encountered than backward analysis. Nowadays, researchers can control details of the process, including the significance level and variable manipulation (e.g., add or remove), in statistical software programs such as Minitab Statistical Software (Frost, 2012). Taking forward stepwise regression as an example, the stepwise process first computes the bivariate r² values between each independent variable and the dependent variable. Then, it selects the independent variable with the largest r². At the second stage, the remaining independent variables are added to the model separately with the "best" single independent variable, and the stepwise procedure selects the independent variable that yields the largest increase in R². At the third stage, the stepwise regression evaluates the contributions of the remaining variables to the model containing the best single variable and the second "biggest contribution" variable. Following the same rule, the variable that yields the largest increase in R² with the two selected predictors is chosen as the third predictor. The forward selection procedure in commonly used statistical packages stops when the increase in R² from one step versus the R² in the previous step is not statistically significant. It is noteworthy that, over the course of selecting variables, the stepwise process utilizes the equation F_calculated = [(R²_Larger - R²_Smaller)/(k_L - k_S)] / [(1 - R²_Larger)/(n - k_L - 1)] to test the null hypothesis H0: R²_Larger = R²_Smaller. Specifically, k_L is the number of predictors used to obtain R²_Larger, and k_S is the number of predictors used to obtain R²_Smaller. The degrees of freedom are (k_L - k_S) for the numerator and (n - k_L - 1) for the denominator (Thompson, 2006).
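A minimal Python sketch of this F test for an increase in R² (hypothetical input values; scipy is assumed for the p value):

from scipy.stats import f  # F distribution, used for the p value

def f_test_r2_change(r2_larger, r2_smaller, k_l, k_s, n):
    """Test H0: R2_Larger = R2_Smaller for two nested regression models."""
    df_num = k_l - k_s
    df_den = n - k_l - 1
    f_calc = ((r2_larger - r2_smaller) / df_num) / ((1 - r2_larger) / df_den)
    p_calc = f.sf(f_calc, df_num, df_den)  # right-tail probability
    return f_calc, p_calc

# Hypothetical example: a 3-predictor model versus a nested 2-predictor model, n = 200
print(f_test_r2_change(r2_larger=0.15, r2_smaller=0.12, k_l=3, k_s=2, n=200))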

Therefore, the sample size plays an essential role in calculating the F_calculated value for the above null hypothesis. In other words, researchers may evaluate the same increase in R² over the same number of stepwise steps but obtain different p_calculated values depending on the sample size (Thompson, 2006).
Why Stepwise Regression Does Not Work: Three Problems
Criticisms of the stepwise method have been raised by an increasing number of scholars (e.g., Huberty, 1989, 1995; Snyder, 1991; Thompson, 1989, 1995, 2001). The first major problem of the stepwise procedure is that commonly used statistical packages choose the wrong degrees of freedom in their statistical tests, yielding incorrect F_calculated and p_calculated values. The second major problem is that stepwise multiple regression packages tend to capitalize on small amounts of sampling error and may therefore offer final models with different selected variables. The third major problem is that stepwise multiple regression packages do not necessarily yield the best predictor set for a given size.
Wrong Degrees of Freedom
Commonly used statistical packages utilize incorrect degrees of freedom to calculate the mean square (MS), F value, and p value in stepwise regression. According to Walker (1940), the number of degrees of freedom is equal to the number of observations minus the number of necessary relations among these observations. In other words, the number of degrees of freedom is equal to the number of original observations minus the number of parameters estimated from them (Walker, 1940).



Based on the definition of degrees of freedom, df_total is n - 1 (where n is the sample size) in stepwise regression analysis; df_regression is the number of variables that stepwise has entered; and df_residual equals df_total minus the regression degrees of freedom (Thompson, 2006).

Therefore, df_regression is taken to be the number of predictor variables entered in a given model. However, stepwise regression uses the special procedures described in the first section to select variables into the model: in every selection step after the first, all remaining variables are separately entered into the model, and the predictor that yields the largest increase in R² is selected. As Thompson (2006) noted, software should charge 1 degree of freedom for every predictor "tasted", regardless of how many predictors are ultimately retained in the analysis. Table 1 presents a heuristic example regarding the wrong degrees of freedom. Suppose there are 526 participants, 5 steps of forward stepwise selection among 50 predictor variables, and an R² of 10%. The computer packages use the incorrect regression degrees of freedom of 5, and p_calculated is 0.022 (statistically significant). However, when the correct degrees of freedom of 50 are used, p_calculated is 0.369 (not statistically significant). Thus, this problem easily causes Type I errors (Cliff, 1987).
Table 1. Stepwise NHSST for Hypothetical Example

Analysis    Source       SOS       df    MS      F_calculated   p_calculated   R²
Incorrect   Regression   100.0     5     20.00   2.667          0.0215         10.00%
            Residual     900.0     520   7.50
            Total        1000.0    525   1.90
Correct     Regression   100.0     50    2.00    1.06           0.3686         10.00%
            Residual     900.0     475   1.89
            Total        1000.0    525   1.90
Note: The original table (Thompson, 2006, p. 273) was modified to create this table.
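The p values in Table 1 follow directly from the tabled F ratios and the two candidate degrees of freedom; a quick check in Python (scipy assumed):

from scipy.stats import f

# "Incorrect" analysis: df charged only for the 5 entered predictors
print(f.sf(2.667, 5, 520))   # roughly 0.02, appears statistically significant
# "Correct" analysis: df charged for all 50 predictors tasted
print(f.sf(1.06, 50, 475))   # roughly 0.37, not statistically significant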

Capitalization on Sampling Error
Because of the way stepwise statistical packages proceed, the packages cannot recognize mistakes caused by sampling error. Unfortunately, statistical software will select the predictor that contributes more to the increase in R² than other predictors in a given step, even if the difference between two predictors is very small and is caused by sampling error. Snyder (1991) presented a heuristic example of these dynamics. Any mistake in the sequence will influence the subsequent choices, because stepwise regression is a linear sequence of selections based on the rules mentioned in the previous section, and only one predictor is selected in each step. If one predictor enters the model because of an infinitesimal advantage caused by a small amount of sampling error, this may produce an incorrect regression model. For example, as Figure 1 shows, the common area of Y and X1 is 80 (B+C = 80), which is bigger than that of the X2 and X3 variables. Thus, X1 is the first predictor in the model. However, suppose the difference in contribution to R² between X1 and X2 is caused by sampling error, so that X2 should actually be the first predictor in the model. With X1 entered first, X2 will be the second predictor, because X2 can contribute a unique 39 to the overlap area with Y. But if the first predictor had been X2, the second predictor should be X3, because X3 can contribute 50 to R² whereas X1 can contribute only 40.

Figure 1. Venn diagram on sampling error.
Therefore, sampling error is another important issue in the process of stepwise regression. Less sampling error tends to be present in data sets involving (a) larger samples, (b) fewer predictor variables, and (c) larger effect sizes, as reflected in the factors involved in most statistical corrections for positive bias in uncorrected variance-accounted-for effect sizes (Snyder & Lawson, 1993; Thompson, 1990).
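A small simulation makes this point concrete (hypothetical population values; numpy assumed): when two predictors are almost equally correlated with Y in the population, the predictor that wins the first forward step flips from sample to sample.

import numpy as np

rng = np.random.default_rng(1)
# Hypothetical population: x1 and x2 correlate .40 and .41 with y and .50 with
# each other, so the winner of step 1 hinges largely on sampling error.
corr = np.array([[1.0, 0.5, 0.40],
                 [0.5, 1.0, 0.41],
                 [0.40, 0.41, 1.0]])
first_pick = {"x1": 0, "x2": 0}
for _ in range(1000):
    x1, x2, y = rng.multivariate_normal(np.zeros(3), corr, size=200).T
    r1, r2 = np.corrcoef(x1, y)[0, 1], np.corrcoef(x2, y)[0, 1]
    first_pick["x1" if r1 ** 2 > r2 ** 2 else "x2"] += 1
print(first_pick)  # both predictors are entered first in a sizable share of samples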



R² Not Optimized
Some researchers who do not fully understand the working procedures of stepwise may erroneously believe that the best predictor set of size two or more is included in the results of stepwise. However, the purpose of the stepwise method is to select a relatively small subset of predictors that may be almost as effective as the full set of predictors in yielding accurate Y_hat_i scores (Thompson, 2006). Most importantly, the stepwise method selects one predictor in each step, and each step must retain the variables already selected into the model. Nevertheless, the correct selection method should choose the best combination of predictors of a given size rather than identify a sequence of variables. In fact, the stepwise incremental selection standard is completely irrelevant to selecting the best combination. In other words, the best combination might not include the single best predictor when the best single predictor is highly collinear with the other predictors. As Figure 2 shows, stepwise regression will select X1 as the first predictor because X1 has the largest bivariate r² value with the Yi scores (B+C+D = 80). In the second step, based on the model with X1, stepwise regression will select X4 as the second predictor because X4 contributes a unique area of 15 to the multiple R². Then, given a predictor set of size 2, the result from stepwise regression is the combination of X1 and X4. However, the best combination of size 2 in Figure 2 is X2 and X3 rather than X1 and X4. The stepwise procedure defines an a posteriori order based solely on the relative uniqueness of variables in the sample at hand (Cohen et al., 2003, p. 161).

Figure 2. Venn diagram on optimized R².

In Figure 2, the X1 predictor's explanatory power is contained in the X2 and X3 predictors; that is, there is high collinearity among the predictors. Cohen et al. (2003, p. 116) stated, "when the IVs are substantially correlated with each other, the losers in the competition may not make a sufficiently large unique contribution to be entered at any subsequent step before the process is terminated by the absence of a variable making a statistically significant addition." The sum of the common areas between Y & X1 and Y & X4 is 95 (80+15 = 95), which is smaller than the sum of the common areas between Y & X2 and Y & X3 (10+30+10+40+10 = 100).
Additionally, there is another possible reason that may cause an "unoptimized" R² for a given subset size. In the process of predictor selection, stepwise packages may move a previously entered variable out of the model in a later step. Taking Figure 3 as an example, the areas of A and E are 18 and 20, and X1 is the first variable entered into the model. In the second step, X3 is selected and enters the model with X1, because X3 contributes a unique 20 to R². In the third step, when the computer package enters X2 into the model with X1 and X3, the program will remove the X1 variable from the model. In this situation, X1's explanatory power is contained in X2 and X3; thus, the R² of the model with X1, X2, and X3 is the same as that of the model with X2 and X3. In other words, the two-variable model with the largest R² contains X2 and X3. As Figure 3 shows, X2 and X3 contribute 118 to R², X1 and X2 contribute 98, and X1 and X3 contribute 100. Therefore, the model with X2 and X3 has the largest R² among predictor sets of size 2. This supports the statement from Thompson (1995): "The true best set may have considerably higher effect sizes and may even include none of the variables selected by the stepwise algorithm."
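A tiny enumeration over the size-2 subsets, using the overlap areas described for Figure 3 (hypothetical area values taken from the diagram description), shows that the best pair excludes the stepwise algorithm's first pick:

# Hypothetical R² contributions (Venn areas) of each size-2 subset from Figure 3
pair_r2 = {("X1", "X2"): 98, ("X1", "X3"): 100, ("X2", "X3"): 118}
best_pair = max(pair_r2, key=pair_r2.get)
print(best_pair, pair_r2[best_pair])  # ('X2', 'X3') 118 -- X1, entered first by stepwise, is not in the best pair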

Figure 3. Venn diagram on three predictors.
Experiment on Collinearity
In fact, all of the examples above involve the problem of collinearity. For example, X1, X2, and X3 have relatively high correlation coefficients in Figure 1. Here, another heuristic example is used to explain the problem of collinearity in stepwise regression. When all predictors are completely uncorrelated, the results of stepwise selection are completely correct, because the variable entry order corresponds exactly to the squared correlations of the predictors with Y. However, this case is nearly nonexistent in real studies. The data are from the Teaching and Learning International Survey (TALIS) for the United States. The present example selects only three independent variables (efficacy in classroom management (ECM), efficacy in instruction (EINS), and efficacy in student engagement (ESE)) and one dependent variable (teacher job satisfaction (TJS)).
Table 2. Descriptive Statistics on Variables

Variables   TJS    ECM    EINS   ESE    Mean    SD     Valid N (listwise)
TJS         1.00                        12.28   2.01   1845
ECM         .18    1.00                 13.06   1.94   1845
EINS        .16    .70    1.00          12.97   1.94   1845
ESE         .25    .64    .75    1.00   11.97   2.23   1845

Based on Table 2, the valid sample size is relatively large, with 1845 participants; the differences in means and SDs among the four variables are small; and the correlation coefficients between predictors are relatively high (from 0.64 to 0.75). This experiment provides some evidence for the statement that "Probably the most serious problems in the use of stepwise regression programs arises when a relatively large number of IVs is used" (Cohen, Cohen, West, & Aiken, 2003, p. 161). First of all, the output from stepwise regression shows that the final model includes only the ESE predictor, namely TJS_hat = 9.630 + 0.222*ESE; the p value of the ESE coefficient is 0.000, and R² is 6.1%. However, when all variables are entered into the model, the model equation with three variables is TJS_hat = 9.905 + 0.249*ESE - 0.101*EINS + 0.069*ECM. Surprisingly, the p values suggest that this model fits the data well. Moreover, its R² (6.45%) is 0.35 percentage points higher than that of the stepwise regression. Table 3 shows the detailed differences between the two methods.



Table 3. Results Comparison from Stepwise and Enter Regression

Source          Stepwise    Enter
DV              TJS         TJS
IV              ESE         ESE, EINS, ECM
R²              6.1%        6.4%
p value of IV   0.00        0.00 (ESE), 0.01 (EINS), 0.04 (ECM)

Less sampling error tends to be present in data sets involving (a) larger samples, (b) fewer predictor variables, and (c) larger effect sizes, as reflected in the factors involved in most statistical corrections for positive bias in uncorrected variance-accounted-for effect sizes (Snyder & Lawson, 1993; Thompson, 1990). Thus, the use of stepwise methods in these circumstances may be somewhat less sinful. However, stepwise regression still made a serious mistake in this example. In order to explore the reasons for the difference between stepwise and enter regression in this example, commonality analysis was used to answer two questions: (a) how much explanatory power is unique to the first predictor (ESE) and to the other two predictors (EINS and ECM)? (b) how much explanatory power is common to the predictors and could be derived from any of them? Knowing the explanatory power of the three predictors promotes better understanding of the differences between the two methods in this example.
Table 4. Coefficients Required for Unique and Common Components of Shared Variance

Predictors                   r² or R²
ESE (1)                      6.0527%
EINS (2)                     2.4427%
ECM (3)                      3.0668%
ESE, EINS (1, 2)             6.2338%
ESE, ECM (1, 3)              6.1070%
EINS, ECM (2, 3)             3.2924%
ESE, EINS, ECM (1, 2, 3)     6.4465%
Three independent variables:



U1 = R²(123) - R²(23) = 6.4465% - 3.2924% = 3.1541%
U2 = R²(123) - R²(13) = 6.4465% - 6.1070% = 0.3395%
U3 = R²(123) - R²(12) = 6.4465% - 6.2338% = 0.2127%
U12 = R²(13) + R²(23) - R²(3) - R²(123) = 6.1070% + 3.2924% - 3.0668% - 6.4465% = -0.1139%
U13 = R²(12) + R²(23) - R²(2) - R²(123) = 6.2338% + 3.2924% - 2.4427% - 6.4465% = 0.6370%
U23 = R²(12) + R²(13) - R²(1) - R²(123) = 6.2338% + 6.1070% - 6.0527% - 6.4465% = -0.1584%
U123 = R²(123) + R²(1) + R²(2) + R²(3) - R²(12) - R²(13) - R²(23) = 6.4465% + 6.0527% + 2.4427% + 3.0668% - 6.2338% - 6.1070% - 3.2924% = 2.3755%
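These partitions can be computed directly from the Table 4 values; a short Python sketch (pure arithmetic; the labels C12, C13, C23, and C123 are used here for the common components):

# R-squared values from Table 4, keyed by predictor combination (1=ESE, 2=EINS, 3=ECM)
r2 = {"1": 0.060527, "2": 0.024427, "3": 0.030668,
      "12": 0.062338, "13": 0.061070, "23": 0.032924, "123": 0.064465}
parts = {
    "U1": r2["123"] - r2["23"],
    "U2": r2["123"] - r2["13"],
    "U3": r2["123"] - r2["12"],
    "C12": r2["13"] + r2["23"] - r2["3"] - r2["123"],
    "C13": r2["12"] + r2["23"] - r2["2"] - r2["123"],
    "C23": r2["12"] + r2["13"] - r2["1"] - r2["123"],
    "C123": r2["123"] + r2["1"] + r2["2"] + r2["3"]
            - r2["12"] - r2["13"] - r2["23"],
}
for name, value in parts.items():
    print(name, f"{value:.4%}")
print("sum of partitions:", f"{sum(parts.values()):.4%}")  # equals R²(123) = 6.4465%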

Table 4 shows the calculation procedures of the commonality analysis. The results can easily be plugged into a spreadsheet program to produce the output reported in Table 5. Note that the sum of the unique (3.15%) and common (2.90%) partitions of ESE equals the squared correlation of ESE with TJS (0.246² = 0.0605, i.e., 6.05%). As Table 5 shows, there are seven non-overlapping partitions of R², and the sum of the seven partitions is 6.45%. However, two of the seven partitions (ESE and EINS; EINS and ECM) are negative, even though, as area-world statistics, variances theoretically have a minimum value of zero. Therefore, there are presumably suppressor effects in the model. By checking β_EINS (-0.098) and r_EINS (0.156) in the output for the final model (Table 6), the EINS predictor is judged to be a suppressor variable.
Table 5. Unique and Common Components of Shared Variance (R²)

Predictor combination    ESE        EINS       ECM        Partition
ESE                      3.1541%                          3.1541%
EINS                                0.3395%               0.3395%
ECM                                            0.2127%    0.2127%
ESE, EINS                -0.1139%   -0.1139%              -0.1139%
ESE, ECM                 0.6370%               0.6370%    0.6370%
EINS, ECM                           -0.1584%   -0.1584%   -0.1584%
ESE, EINS, ECM           2.3755%    2.3755%    2.3755%    2.3755%
Unique                   3.1541%    0.3395%    0.2127%
Common                   2.8986%    2.1032%    2.8541%
Total                    6.0527%    2.4427%    3.0668%    6.4465%

In addition, based on the regression outputs, the p value of EINS in the model with ESE and EINS is 0.059, and the p value of ECM in the model with ESE and ECM is 0.302. Therefore, stepwise regression will exclude these two variables based on its selection rules. However, when the two variables ECM and EINS are entered into the model together with ESE, the p values of all three predictors are statistically significant, precisely because of the effect of the suppressor variable EINS.
Table 6. Output for the Final Model with the Enter Method

Model      B          S.E.      Beta       S.E.      p value    r         rs
Constant   9.705046   .332622                        .000000
ESE        .249228    .031635   .276724    .035125   .000000    .246023   .968976
EINS       -.100719   .038964   -.097573   .037747   .009816    .156292   .615565
ECM        .068762    .033608   .066438    .032472   .040896    .175122   .689727
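The structure coefficients (rs) reported in Table 6 can be reproduced as each predictor's Pearson r with TJS divided by the multiple R of the full model; a quick check in Python:

import math

# rs = r(X, Y) / R; r values from Table 6, R²(123) from Table 4
R = math.sqrt(0.064465)  # multiple R of the three-predictor model
for name, r in [("ESE", 0.246023), ("EINS", 0.156292), ("ECM", 0.175122)]:
    print(name, round(r / R, 3))  # about 0.969, 0.616, 0.690, matching the rs column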

On the other hand, the three predictors' structure coefficients are 0.97, 0.62, and 0.69, respectively; that is, they all have substantial relationships with Y_hat, even though they are highly collinear. In a word, the stepwise packages make a serious mistake in selecting predictors when variables are collinear. As this example shows, the stepwise regression equation is TJS_hat = 9.630 + 0.222*ESE, but the full regression equation is TJS_hat = 9.905 + 0.249*ESE - 0.101*EINS + 0.069*ECM. Consequently, this example provides evidence that collinearity between predictor variables certainly affects the order of entry or deletion of variables, and it illustrates possible mistakes in predictor selection in a collinear situation. Meanwhile, this example also illustrates that an independent variable (EINS or ECM) with a high Pearson r with the first-entered variable (ESE) may never be entered into the model, even if that variable is the second-best single variable in the independent variable set. To better understand this statement, consider again the simple Venn diagram example. Based on Figure 2, stepwise will first select predictor X1 into the model. Then, the computer package selects the X4 variable. If the remaining variables cannot contribute an area of more than 10 to R², the "important" variables X2 and X3 will never be entered into the model because of the high collinearity among X1, X2, and X3, even though X2 and X3 are the best predictors in the real model.
All-possible-subsets Analysis
Huberty (1989) noted, "A user of stepwise analysis may be led to believe that the first q variables entered into the analysis would constitute a good subset (of size q) of the initial set of p variables, or even the best subset of size q." However, the correct method should begin by computing the R² for every combination of predictors at each variable set size.
What is all-possible-subsets regression?
All-possible-subsets regression runs all possible models using every different set of predictors and displays the models that contain one predictor, two predictors, and so on. The output is a collection of models and their summary statistics; the result does not single out a best model among all the models. Most importantly, all-possible-subsets regression goes beyond stepwise regression by testing all possible subsets of the set of predictors. For example, if the number of predictor variables is n, the program runs 2^n models covering all subsets of the predictors; with n = 10, the results show 1,024 models with their summary statistics. The entire analysis may sound tedious, but it can be done rapidly, accurately, and painlessly by readily available computer software (Thompson, 1991). In contrast, stepwise regression superficially works reasonably well as an automatic variable selection method, but the result is not guaranteed; sometimes stepwise takes a wrong turn and yields a suboptimal model (e.g., the experiment above). When using an all-possible-regressions procedure, researchers need to select the best model and rank the models themselves. Generally, researchers can draw a line plot of successive R² values (R² versus the number of variables) to find the number of variables at which returns diminish. "This plot can inform the researchers' subjective judgment regarding the optimal number of predictors to retain" (Thompson, 2006, p. 277). In addition, researchers can check whether one or two models stand out in the output and whether there are several almost-equally-good models; this can also give hints toward finding the optimal model. However, there is a real danger in letting automated data-mining packages select a model for researchers without theoretical guidance and practical experience. Therefore, a better approach is to select the predictor set based on theory or previous empirical results, or based on the accessibility of the variables in a given set (Thompson, 1991).

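A minimal sketch of the all-possible-subsets idea in Python (numpy assumed; the data below are randomly generated stand-ins, not the TALIS or job-satisfaction data):

import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, names = 300, ["x1", "x2", "x3", "x4"]
X = rng.standard_normal((n, len(names)))             # stand-in predictors
y = X @ np.array([0.4, 0.3, 0.0, 0.1]) + rng.standard_normal(n)

def r_squared(cols):
    """R² of an OLS model that uses the listed predictor columns plus an intercept."""
    A = np.column_stack([np.ones(n)] + [X[:, names.index(c)] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1 - resid.var() / y.var()

# every non-empty subset of predictors: 2**4 - 1 = 15 models
results = {subset: r_squared(subset)
           for k in range(1, len(names) + 1)
           for subset in combinations(names, k)}
for subset, r2 in sorted(results.items(), key=lambda kv: -kv[1])[:5]:
    print(subset, round(r2, 3))                      # the five best subsets overall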


Experiment on All-possible-subsets Analysis
In this section, a heuristic example is used to illustrate how to use all-possible-subsets regression to analyze data. The data set has eight variables. The dependent variable is the work satisfaction score (WSQ), and the independent variables are age, male, married, income, length of service in the current organization in years (length), total work experience in years (experience), total number of job changes (changes), and score on a love-of-money scale (LOMS). The sample size is 375. SPSS software (the syntax is shown in the Appendix) was used to run the all-possible-subsets analysis. All statistical results were recorded in an Excel spreadsheet. Figure 4 presents a line plot of the maximum R² value for each number of predictors. As Figure 4 shows, the plot seems to level off after the predictor set size reaches 5 (R² = 85.5% with n = 5, versus R² = 86.2% with n = 6). Thus, an effect size of 85.5% can be deemed sufficient, based on theory or previous empirical studies. The focus then turns to which five predictors should be retained for future use, namely age, male, income, changes, and LOMS. Moreover, the other five- and six-variable sets with slightly less-than-optimal effect sizes were checked, because such a model might be preferred for practical reasons; the purpose is to confirm the reasonableness of the model. The R² of the model with age, male, income, experience, and LOMS is 85.3%; the R² of the model with age, male, married, income, changes, and LOMS is 85.7%; the R² of the model with age, male, income, experience, length, and LOMS is 85.5%; and the R² of the model with age, male, income, experience, changes, and LOMS is 86.2%. We found that the variables male, income, and LOMS are included in all five models above. The other variables (age, experience, changes, married, and length) are distributed across the other four models. Length can be deleted and experience can be added based on theories of job satisfaction. Therefore, the final model with age, male, income, experience, changes, and LOMS was determined. As Thompson (2006) claimed, "A variant on this all-possible-subsets analysis would instead plot 'adjusted R²' values. This alternative will be most appealing when sample size is relatively small because sampling error variance is likely to be larger." In addition, Derksen and Keselman (1992) stated, "educational and psychological researchers who use automated subset selection procedures should be aware of operation characteristics of these procedures."
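Once the all-possible-subsets R² values have been recorded, the Figure 4 data can be produced with a few lines of Python (matplotlib assumed; the results dictionary below holds hypothetical placeholder values, to be replaced with the recorded output):

import matplotlib.pyplot as plt

# Hypothetical all-subsets output: {predictor subset: R²}; in practice these values
# come from recording every model's R² (e.g., from the SPSS macro in the Appendix).
results = {
    ("income",): 0.52, ("LOMS",): 0.40, ("age",): 0.18,
    ("income", "LOMS"): 0.70, ("income", "age"): 0.57,
    ("income", "LOMS", "age"): 0.78,
}
sizes = sorted({len(subset) for subset in results})
best_r2 = [max(r2 for subset, r2 in results.items() if len(subset) == size)
           for size in sizes]
plt.plot(sizes, best_r2, marker="o")
plt.xlabel("Number of predictors")
plt.ylabel("Maximum R squared")
plt.show()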



Figure 4. Line plot of successive R² values (x-axis: number of predictors, 1 to 8; y-axis: R² value).

Conclusion
Although many researchers (e.g., Huberty, 1989; Snyder, 1991; Thompson, 1985, 1989, 2001) have argued that stepwise methods should not be employed in psychological and behavioral research, many academic papers using stepwise methods are still being published. The issue is that only researchers who are aware of the main problems of the stepwise method can use it correctly as a tool to select variables in the process of building models, while many researchers who do not realize the weaknesses of stepwise are using it in their studies. The main problems of stepwise regression are that (1) statistical software uses the wrong degrees of freedom to compute the MS, (2) the stepwise method is very sensitive to sampling error and yields non-replicable results, and (3) stepwise cannot identify the best subset of predictors. These problems give rise to unreliable results from stepwise. Meanwhile, researchers need to consider the number of variables, the sample size, and multicollinearity in the data before running stepwise regression. It is noteworthy that stepwise regression can be a valuable tool in the early stages of building a model. However, when the procedure terminates, researchers should check the order in which variables were added and deleted, confirm whether the variables that were included or excluded are reasonable, and judge the model with the largest R² from a theoretical perspective. According to Huberty (1989), the order in which variables enter a stepwise analysis should not be used to assess relative variable contribution or importance. Likewise, as Cliff (1987, p. 187) stated, "in a sense all the variables are in the equation, even though some of them have (effectively) been given zero weights."



There is an alternative method, all-possible-subsets regression, which can assess all possible models and display the results for every subset. Even so, research reality is complex, and an automated algorithm cannot solve all problems for researchers. Researchers can run both stepwise and all-possible-subsets regressions to obtain the additional information that they need. The model that stepwise regression selects and the model with the largest R² that all-possible-subsets regression shows may not be the best from a practical and theoretical perspective. Therefore, researchers need to use their professional subject-area knowledge and common sense to analyze the output of all-possible-subsets regression. As Kerlinger (1986, p. 454) stated, "The research problem and the theory behind the problem (and not stepwise methods) should determine the order of entry of variables in multiple regression analysis." In a word, don't just blindly accept the computer's choice!



References
Cliff, N. (1987). Analyzing multivariate data. San Diego, CA: Harcourt Brace Jovanovich.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum.
Derksen, S., & Keselman, H. J. (1992). Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45(2), 265-282.
Frost, J. (2012). Regression smackdown: Stepwise versus best subsets. Retrieved from http://blog.minitab.com/blog/adventures-in-statistics/regression-smackdown-stepwise-versus-best-subsets
Huberty, C. J. (1989). Problems with stepwise methods—Better alternatives. In B. Thompson (Ed.), Advances in social science methodology (Vol. 1, pp. 43-70). Greenwich, CT: JAI Press.
Huberty, C. J. (1994). Applied discriminant analysis. New York: Wiley and Sons.
Kerlinger, F. N. (1986). Foundations of behavioral research (3rd ed.). New York: Holt, Rinehart and Winston.
Snyder, P. (1991). Three reasons why stepwise regression methods should not be used by researchers. In B. Thompson (Ed.), Advances in educational research: Substantive findings, methodological developments (Vol. 1, pp. 99-105). Greenwich, CT: JAI Press.
Thompson, B. (1989). Editorial: Why won't stepwise methods die? Measurement and Evaluation in Counseling and Development, 21, 146-148.
Thompson, B. (1995). Stepwise regression and stepwise discriminant analysis need not apply here: A guidelines editorial. Educational and Psychological Measurement, 55, 525-534.
Thompson, B. (2001). Significance, effect sizes, stepwise methods, and other issues: Strong arguments move the field. Journal of Experimental Education, 70(1), 80-93.
Thompson, B. (2006). Foundations of behavioral statistics: An insight-based approach. New York: Guilford Press.
Thompson, B., Smith, Q. W., Miller, L. M., & Thomson, W. A. (1991). Stepwise methods lead to bad interpretations: Better alternatives. Paper presented at the annual meeting of the Southwest Educational Research Association, San Antonio, TX. (ERIC Document Reproduction Service No. ED 327 573)
Walker, H. W. (1940). Degrees of freedom. Journal of Educational Psychology, 31(4), 253-269.
Welge, P. (1990). The reasons why stepwise regression methods should not be used by researchers. Paper presented at the annual meeting of the Southwest Educational Research Association, Austin, TX. (ERIC Document Reproduction Service No. ED 316 583)



Appendix
Note: This syntax is from Dr. Bruce Thompson.
* INSTRUCTIONS: First, make sure you have a subdirectory/folder on your "C:" drive named "C:\TEMP", and if you don't then create one.
* Second, search for "???" within this syntax file and then make all the noted changes in order to analyze your own data.

* (Q) How can I do an all-subsets regression using SPSS? Whereas a stepwise regression yields one final equation, the goal of all-subsets regression is to perform every possible combination of regressions and then let the user (rather than the stepwise regression) choose the "best" equation.
* So, if one had 5 independent variables, the all-subsets regression would perform 5 regressions of each single predictor on y, and then work up towards one final regression with all the predictors. The output can be any number of things, such as the r^2 for each equation, but I would rather use the adjusted predicted variables that SPSS can already create.

* (A) by [email protected] 2001/08/30; SPSS Dedicated web site: http://pages.infinit.net/rlevesqu/index.htm.

SET MPRINT=no.

*/////////////////////////////.
DEFINE !combine (n=!TOKENS(1) /m=!TOKENS(1) /dep=!TOKENS(1) /indepv=!CMDEND).



/* Find all combinations on n items out of m */ /* August 30,2001 [email protected] */

!DO !thisn=1 !TO !n
NEW FILE.
INPUT PROGRAM.
LOOP i=1 TO !thisn.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
LIST.
!LET !list=!NULL
!DO !cnt=1 !TO !thisn
!LET !list=!CONCAT(!list," ","j",!cnt)
!DOEND
COMPUTE n=!thisn.
* Calculate variable names for LOOP of the next WRITE command *.
STRING cntname cntbeg(A8).
COMPUTE cntname=CONCAT('j',LTRIM(STRING(i,F8.0))).
* Calculate first parameter for the LOOP of the next WRITE command *.
DO IF i=1.
COMPUTE cntbeg="1".
ELSE.
COMPUTE cntbeg=CONCAT('j',LTRIM(STRING(i-1,F8.0))," + 1").
END IF.
* Calculate second parameter for the LOOP of the next WRITE command *.
COMPUTE k=!m - !thisn + i.
FORMATS i k n(F8.0).



STRING quote(A1) strlist(A255).
COMPUTE quote='"'.
COMPUTE strlist=!QUOTE(!list).
* Write the syntax file which will store all the combinations in the list.txt file*.
WRITE OUTFILE "c:\temp\macro.sps" /"LOOP "cntname"="cntbeg" TO "k".".
DO IF i=!thisn.
+ WRITE OUTFILE "c:\temp\macro.sps" /"WRITE OUTFILE "quote"c:\temp\list.txt"quote "/" strlist "."
+ LOOP cnt=1 TO !thisn.
+ WRITE OUTFILE "c:\temp\macro.sps" /"END LOOP.".
+ END LOOP.
+ WRITE OUTFILE "c:\temp\macro.sps" /"EXECUTE.".
END IF.
EXECUTE.
INCLUDE FILE="c:\temp\macro.sps".

/* Convert data from list.txt to the corresponding sav file */.
DATA LIST FILE='c:\temp\list.txt' LIST /!list.
SAVE OUTFILE=!QUOTE(!CONCAT('c:\temp\list',!thisn,'.sav')).
!DOEND
/* Combine all the sav files */.
GET FILE='c:\temp\list1.sav'.
!DO !nb=2 !TO !n.
ADD FILES FILE=* /FILE=!QUOTE(!CONCAT('c:\temp\list',!nb,'.sav')).
!DOEND
/* Eliminate duplicates */.
SORT CASES BY ALL.
MATCH FILES FILE=* /BY=ALL /FIRST=first.
SELECT IF first.



SAVE OUTFILE='c:\temp\all_comb.sav'.
/* Find name of last variables */
!DO !var !IN (!indepv)
!LET !lastone=!var
!DOEND
VECTOR vnames(!m A8).
!LET !cnt=!BLANK(1)
/* Create variables containing the names of the indep variables */
!DO !var !IN (!indepv)
COMPUTE vnames(!LEN(!cnt))=!QUOTE(!var).
!LET !cnt=!CONCAT(!cnt,!BLANK(1))
!DOEND
/* Construct the string containing the list of indep var of each regression */.
STRING dep (A8) indepv(A255).
COMPUTE dep=!QUOTE(!dep).
VECTOR j=j1 TO !CONCAT('j',!n) /ind=vnames1 TO !CONCAT('vnames',!m).
COMPUTE nvar=NVALID(j1 TO !CONCAT('j',!n)).
LOOP cnt=1 TO nvar.
COMPUTE indepv=CONCAT(RTRIM(indepv)," ",vnames(j(cnt))).
END LOOP.
* Write the syntax file which will run all regressions */.
WRITE OUTFILE='c:\temp\syntax.sps' /"!regres dep=" dep "indepv=" indepv ".".
EXECUTE.
!ENDDEFINE.
*////////////////////////////.
DEFINE !regres (dep=!TOKENS(1) /indepv=!CMDEND)
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT !dep
/METHOD=ENTER !indepv .
*discriminant groups = !dep (1,2) / variables = !indepv / analysis = !indepv / method = direct / statistics = table crossvalid .
!ENDDEFINE.
*////////////////////////////.
*************************.
* EXAMPLE OF USE.
*************************.
*??? For regression, all possible subsets, change "n=" to the number of predictors.
* Change "m=" to the number of predictors; change "dep=" to the name of the outcome variable within your dataset; change "indepv=" to the names of the predictor variables in your dataset.
SET MPRINT=yes.
****** Run the following macro to do the preparatory work.
!combine n=8 m=8 dep=WSQ indepv=AGE Male Married INCOME LENGTH EXP CHANGE LOMS.
execute.
*???
* NOTE: The command below presumes that you put the data on a USB stick, and that the USB drive on your computer is "E:\" but it instead might be "F:\" or "G:\".
GET FILE='C:\Users\kewang\Downloads\LOMS.sav'.
DATASET NAME DataSet1 WINDOW=FRONT .
INCLUDE FILE='c:\temp\syntax.sps'.
execute.
