A Simulation Study of Alternatives to Ordinary Least Squares

Journal of the American Statistical Association ISSN: 0162-1459 (Print) 1537-274X (Online) Journal homepage: http://www.tandfonline.com/loi/uasa20 A Simulation Study of Alternatives to Ordinary Least Squares A. P. Dempster , Martin Schatzoff & Nanny Wermuth To cite this article: A. P. Dempster , Martin Schatzoff & Nanny Wermuth (1977) A Simulation Study of Alternatives to Ordinary Least Squares, Journal of the American Statistical Association, 72:357, 77-91 To link to this article: http://dx.doi.org/10.1080/01621459.1977.10479910 Published online: 06 Apr 2012. Submit your article to this journal Article views: 42 View related articles Citing articles: 47 View citing articles Full Terms & Conditions of access and use can be found at http://www.tandfonline.com/action/journalInformation?journalCode=uasa20 Download by: [University of Michigan] Date: 11 April 2017, At: 09:10 A Simulation Study of Alternatives to Ordinary Least Squares A. P. DEMPSTER, MARTIN SCHATZOFF, and NANNY WERMUTH* Estimated regression coefficients and errors in these estimates are work by Charles Stcin and his colleagues, (e.g., [l, 13, computed for 160 artificial data sets drawn from 160 normal linear 18]), has served to draw attention to the potential weak- models structured according to factorial designs. Ordinary multiple regression (OREG) is compared with 56 alternatives which pull ness of straightforward maximum likelihood estimation some or all estimated regression coefficients some or all the way when more than a very few parameters must be estimated. to zero. Substantial improvements over OREG are exhibited when col- Efron and Morris [4, 5, 6, 7, 8, 91 have recently extended linearity effects are present, noncentrality in the original model is and advocated the Stein approach at great length. Also small, and selected true regression coefficients are small. Ridge re- taking a mainly frequentist point of view, Hoerl and gression emerges as an important tool, while a Bayesian ex- Kennard [ll, 121 introduced and defended ridge regres- tension of variable selection proves valuable when the true regression coefficients vary widely in importance. sion as having good mean squared error properties in practically relevant regions of the parameter space, at KEY WORDS: Least squares; Multiple regression; Ridge regression; least when the independent variables multicorrelate Simulation; Variable selection. strongly. (See [2, 10,171 for various views of ridge regres- 1. INTRODUCTION sion.) From the standpoint of a subjectivist Bayesian theory, posterior mean squared error is substantially We report here summary results of a numerical study reduced if a flat prior distribution can be replaced by a undertaken to compare the properties of a collection of prior distribution which clusters about some prior mean, alternatives to ordinary least squares for multiple linear taken here to be zero. (See [15, 211 for recent Bayesian regression analysis. The approach is a broad-brush ex- discussions of Stein-type and ridge-type estimates.) ploration of the relative performance, from the stand- Alongside the alternative regression methods just points of estimation and prediction, of different tech- cited, the present study includes forward and backward niques over a range of conditions which are syst,ematically selection methods and Bayesian selection procedures varied according to factorial designs. The variable factor proposed in [S]. Thus, we hope to draw attention to the levels include different patterns of true regression coeffi- substantial improvements over ordinary least-squares cients, different amounts of noncent.rality, and different methods which are afforded by a wide variety of alterna- degrees of collinearity or multicollinearity among inde- tive methods. pendent variables. The substantive conclusions from the Our results are empirical results derived from numeri- study are indications of possible drastic improvements cal experiments. Specifically, we created two series of- over least squares, especially through the technique of artificial data sets. The first series, referred to as Experi- ridge regression, and especially when a high degree of ment 1, contains 32 data sets in the form of a Z5 factorial correlation exists among the independent variables. design, while the second series, or Experiment 2, contains We have not attempted to study alternatives to stan- 128 data sets in the form of a quarter replicate of a 2g dard regression analysis which are designed to be robust design. Each data set was drawn from a normal linear against failures of the normal error model. Instead we model with 6 regression coefficients to be estimated and have focused on the recently prominent difficulties with 14 degrees of freedom for error. Each of 57 estimation least squares under the normal model. From a frequentist procedures was applied to each of the 160 data sets, standpoint, it has long been recognized that good mean yielding a set of 6 estimated regression coefficients, which squared error properties do not necessarily follow from were compared to the true regression coefficients in the the celebrated minimum variance unbiasedness proper- simulated models, using mainly the two end-point ties of least squares, since in certain regions of the parameter space the loss from increasing the squared bias criteria SEB (sum of squared errors of betas) and SPE can be overcompensated by reducing variance. Important (sum of squared prediction errors). Basic notation, including precise definitions of SEB and * A.P. Dempster is Professor, Theoretical Statistics, Harvard University, SPE, appears in Appendix A. Further details of the 57 Cambridge, MA 02138. Martin Schatzoff is Manager of Operations Research. IBM Cambridge Scientific Center, Cambridge, MA 02139. Nanny Wermuth is estimation procedures appear in Section 2 and Appendix with the Institute fur Medizinische Statistik. UnivenitBt Maina, Federal Republic of Germany. The first author’s research waa supported in part by National Science Foundation Grant GP-31003X. The second author’s research was supported by @ Journal of the American Statistical Association the IBM Cambridge Scientific Center. The third author’s research wna supported March 1977. Volume 72, Number 357 by the Cambridge Project. Computing facilitien were provided by IBM. Theory and Methods Section 77 78 Journal of the American Statistical Association, March 1977 B, while the design of the study is described in Section 3 more or less clustered about the origin. The two extremes and Appendix C. The analysis of experimental data is in our list of procedures are OREG, or ordinary least presented in Section 4. Further detailed analyses may be squares, which does not shrink, and ZERO which achieves found in Wermuth [19]. total shrinkage by setting all estimated regression coeffi- - The study is broad in some ways, e.g., in its range of cients to zero. Between these extremes the procedures estimation procedures and range of underlying models. differ in the pattern and degree of shrinking. In Section In other ways, the study is narrowly focused, e.g., in its 2.2 we discuss procedures in the classes STEIN and RIDGE restriction to 6 and 14 degrees of freedom and its limita- which shrink each estimate according to a continuous tion to just one replication of each model. In view of the formula, while in Section 2.3 me describe the families latter restriction, it is clear that we are not attempting FSL, BSL, CP, REGF, RREG, and PRI which shrink discretely detailed analysis of the frequency properties of estimators in the sense that selected coefficients are pulled back to under each specified model, and so we are not meeting zero, or nearly to zero. The concepts used to define the objectives of the mathematical statistician who specific variants within the families are described in wishes to calculate such frequency properties. We would Section 2.4. prefer to shed light on the conceptually more difficult task of the data analyst, who knows only his data and 2.2 Continuous Shrinking Methods not the underlying parameters. Our data base enables us Following the notation est,ablished in Appendix A, the to study the actual errors which a data analyst using standard least-squares estimator b = (XTx)-'XTY can be specified rules of estimation would encounter under a generalized to simulated range of data sets which we believe could 8 = (XTX + kQ)"X'Y (2.1) typify certain types of real world experience. Any particular user of regression techniques may where Q is a positive definite symmetric matrix and k is legitimately criticize us for not including the specific a nonnegative scalar. In practice, Q is allowed to depend variant procedures which interest him in the context of only on the design matrix X, while k is generally allowed a specific class of real world situations. For such a reader, to depend on Y as well. we believe that we have provided a concrete illustration The choices Q = XTX and Q = I in (2.1) define the of a type of study which can produce interesting or even classes of estimators which we call STEIN and RIDGE. startling results. Given adequate computational facility, The principal components transformation C defined in whose availability continues to develop rapidly, the study Appendix A simultaneously diagonalizes both XTX and could be repeated holding the design matrix X fixed at 1, and therefore provides a simple representation for the values for a given data set and varying the factors RIDGE estimators. After transforming we deal with and procedures of greatest concern to a particular data &R = Ca,, a = CB, and a = C@ where C is defined by analyst, including of course the shape of the error distribu- (A.5) and (A.6). By transforming (2.1) we see that the tion and appropriate robust procedures, as may be in- components of ib and a are related by dicated either by prior understanding of circumstances or diR = fiai , i = 1, 2, . m -1 p 1 (2.2) by the properties of the given data set.

Load more