
Journal of the American Statistical Association

ISSN: 0162-1459 (Print) 1537-274X (Online) Journal homepage: http://www.tandfonline.com/loi/uasa20

A Simulation Study of Alternatives to Ordinary Least Squares

A. P. Dempster, Martin Schatzoff & Nanny Wermuth

To cite this article: A. P. Dempster, Martin Schatzoff & Nanny Wermuth (1977) A Simulation Study of Alternatives to Ordinary Least Squares, Journal of the American Statistical Association, 72:357, 77-91

To link to this article: http://dx.doi.org/10.1080/01621459.1977.10479910

Published online: 06 Apr 2012.


A Simulation Study of Alternatives to Ordinary Least Squares

A. P. DEMPSTER, MARTIN SCHATZOFF, and NANNY WERMUTH*

Estimated regression coefficients and errors in these estimates are computed for 160 artificial data sets drawn from 160 normal linear models structured according to factorial designs. Ordinary multiple regression (OREG) is compared with 56 alternatives which pull some or all estimated regression coefficients some or all the way to zero. Substantial improvements over OREG are exhibited when collinearity effects are present, noncentrality in the original model is small, and selected true regression coefficients are small. Ridge regression emerges as an important tool, while a Bayesian extension of variable selection proves valuable when the true regression coefficients vary widely in importance.

KEY WORDS: Least squares; Multiple regression; Ridge regression; Simulation; Variable selection.

* A. P. Dempster is Professor of Theoretical Statistics, Harvard University, Cambridge, MA 02138. Martin Schatzoff is Manager of Operations Research, IBM Cambridge Scientific Center, Cambridge, MA 02139. Nanny Wermuth is with the Institut für Medizinische Statistik, Universität Mainz, Federal Republic of Germany. The first author's research was supported in part by National Science Foundation Grant GP-31003X. The second author's research was supported by the IBM Cambridge Scientific Center. The third author's research was supported by the Cambridge Project. Computing facilities were provided by IBM.

© Journal of the American Statistical Association, March 1977, Volume 72, Number 357, Theory and Methods Section

1. INTRODUCTION

We report here summary results of a numerical study undertaken to compare the properties of a collection of alternatives to ordinary least squares for multiple linear regression. The approach is a broad-brush exploration of the relative performance, from the standpoints of estimation and prediction, of different techniques over a range of conditions which are systematically varied according to factorial designs. The variable factor levels include different patterns of true regression coefficients, different amounts of noncentrality, and different degrees of collinearity or correlation among independent variables. The substantive conclusions from the study are indications of possible drastic improvements over least squares, especially through the technique of ridge regression, and especially when a high degree of correlation exists among the independent variables.

We have not attempted to study alternatives to standard regression analysis which are designed to be robust against failures of the normal error model. Instead we have focused on the recently prominent difficulties with least squares under the normal model. From a frequentist standpoint, it has long been recognized that good mean squared error properties do not necessarily follow from the celebrated minimum variance unbiasedness properties of least squares, since in certain regions of the parameter space the loss from increasing the squared bias can be overcompensated by reducing variance. Important work by Charles Stein and his colleagues (e.g., [1, 13, 18]) has served to draw attention to the potential weakness of straightforward maximum likelihood estimation when more than a very few parameters must be estimated. Efron and Morris [4, 5, 6, 7, 8, 9] have recently extended and advocated the Stein approach at great length. Also taking a mainly frequentist point of view, Hoerl and Kennard [11, 12] introduced and defended ridge regression as having good mean squared error properties in practically relevant regions of the parameter space, at least when the independent variables multicorrelate strongly. (See [2, 10, 17] for various views of ridge regression.) From the standpoint of a subjectivist Bayesian theory, posterior variance is substantially reduced if a flat prior distribution can be replaced by a prior distribution which clusters about some prior mean, taken here to be zero. (See [15, 21] for recent Bayesian discussions of Stein-type and ridge-type estimates.)

Alongside the alternative regression methods just cited, the present study includes forward and backward selection methods and Bayesian selection procedures proposed in [3]. Thus, we hope to draw attention to the substantial improvements over ordinary least-squares methods which are afforded by a wide variety of alternative methods.

Our results are empirical results derived from numerical simulation. Specifically, we created two series of artificial data sets. The first series, referred to as Experiment 1, contains 32 data sets in the form of a 2^5 factorial design, while the second series, or Experiment 2, contains 128 data sets in the form of a quarter replicate of a 2^9 design. Each data set was drawn from a normal linear model with 6 regression coefficients to be estimated and 14 degrees of freedom for error. Each of 57 estimation procedures was applied to each of the 160 data sets, yielding a set of 6 estimated regression coefficients, which were compared to the true regression coefficients in the simulated models, using mainly the two end-point criteria SEB (sum of squared errors of betas) and SPE (sum of squared prediction errors).

Basic notation, including precise definitions of SEB and SPE, appears in Appendix A. Further details of the 57 estimation procedures appear in Section 2 and Appendix

B, while the design of the study is described in Section 3 and Appendix C. The analysis of experimental data is presented in Section 4. Further detailed analyses may be found in Wermuth [19].

The study is broad in some ways, e.g., in its range of estimation procedures and range of underlying models. In other ways, the study is narrowly focused, e.g., in its restriction to 6 and 14 degrees of freedom and its limitation to just one drawing from each model. In view of the latter restriction, it is clear that we are not attempting detailed analysis of the properties of estimators under each specified model, and so we are not meeting the objectives of the mathematical statistician who wishes to calculate such frequency properties. We would prefer to shed light on the conceptually more difficult task of the data analyst, who knows only his data and not the underlying parameters. Our data base enables us to study the actual errors which a data analyst using specified rules of estimation would encounter under a simulated range of data sets which we believe could typify certain types of real world experience.

Any particular user of regression techniques may legitimately criticize us for not including the specific variant procedures which interest him in the context of a specific class of real world situations. For such a reader, we believe that we have provided a concrete illustration of a type of study which can produce interesting or even startling results. Given adequate computational facility, whose availability continues to develop rapidly, the study could be repeated holding the design matrix X fixed at the values for a given data set and varying the factors and procedures of greatest concern to a particular data analyst, including of course the shape of the error distribution and appropriate robust procedures, as may be indicated either by prior understanding of circumstances or by the properties of the given data set. We hope, therefore, that we may be contributing to the development of a methodology which will ultimately be of broader use than the specific results of this paper.

2. ALTERNATIVES TO LEAST SQUARES

2.1 Overview

The 57 estimation procedures under study can be grouped into several major classes or families, each containing a number of variants. Both the classes and the variants within each class are denoted by capitalized abbreviations. For example, RIDGE denotes a family of ridge regression techniques, while within the family we study five specific procedures labelled SRIDG, RIDGM, CRIDG, 1CRIDG, and 2CRIDG.

The classes are distinguished by different technical approaches adopted in the attempt to reduce error of estimation, but all approaches produce estimates which shrink or pull back the least-squares estimates toward the origin. Shrinking can be justified either by the frequentist yardstick of improvement in the sum of variance and squared bias, or by the Bayesian device of a prior distribution more or less clustered about the origin. The two extremes in our list of procedures are OREG, or ordinary least squares, which does not shrink, and ZERO, which achieves total shrinkage by setting all estimated regression coefficients to zero. Between these extremes the procedures differ in the pattern and degree of shrinking. In Section 2.2 we discuss procedures in the classes STEIN and RIDGE which shrink each estimate according to a continuous formula, while in Section 2.3 we describe the families FSL, BSL, MCP, REGF, RREG, and PRI which shrink discretely in the sense that selected coefficients are pulled back to zero, or nearly to zero. The concepts used to define specific variants within the families are described in Section 2.4.

2.2 Continuous Shrinking Methods

Following the notation established in Appendix A, the standard least-squares estimator b = (X'X)^{-1}X'Y can be generalized to

    β̂ = (X'X + kQ)^{-1}X'Y ,   (2.1)

where Q is a positive definite symmetric matrix and k is a nonnegative scalar. In practice, Q is allowed to depend only on the data X, while k is generally allowed to depend on Y as well.

The choices Q = X'X and Q = I in (2.1) define the classes of estimators which we call STEIN and RIDGE. The principal components transformation C defined in Appendix A simultaneously diagonalizes both X'X and I, and therefore provides a simple representation for RIDGE estimators. After transforming we deal with ã_R = Cβ̂_R, a = Cb, and α = Cβ, where C is defined by (A.5) and (A.6). By transforming (2.1) we see that the components of ã_R and a are related by

    ã_iR = f_i a_i ,   i = 1, 2, ..., p ,   (2.2)

where

    f_i = λ_i/(λ_i + k) ,   (2.3)

and λ_1, λ_2, ..., λ_p denote as in (A.6) the eigenvalues of X'X. The analogs of (2.2) and (2.3) for the STEIN estimator ã_S = Cβ̂_S are likewise seen to be

    ã_iS = f a_i ,   i = 1, 2, ..., p ,   (2.4)

where

    f = 1/(1 + k) .   (2.5)

Note that the STEIN estimator β̂_S = f b, and so shrinks all components, whatever the coordinate system, by the same factor f.

As remarked in Appendix A, the a_i have independent N(α_i, σ²/λ_i) sampling distributions, whence from (2.2) and (2.4) the ã_iR have independent N(f_i α_i, f_i² σ²/λ_i) sampling distributions and the ã_iS have independent N(f α_i, f² σ²/λ_i) sampling distributions. For a given value of k, the f_i are smaller when the λ_i are smaller, and at the same time the error variance σ²/λ_i of the least squares a_i is larger. Thus the RIDGE approach applies more drastic shrinking where it has greater effect in reducing mean squared error.
The STEIN approach by contrast shrinks all components equally. The advantage of RIDGE is potentially large, therefore, when certain of the λ_i are close to zero, and when the loss function weights components equally, as does SEB defined in (A.10) or (A.12). We may anticipate less advantage of RIDGE over STEIN when the loss function weights the components by λ_i, as does SPE defined in (A.11) or (A.13).

The general estimator (2.1) has a simple Bayesian interpretation. If β has the multivariate normal prior distribution N(0, w²Q^{-1}), then the posterior distribution of β is N(β̂, (X'X/σ² + Q/w²)^{-1}), where β̂ is determined by (2.1) with k given by

    k = σ²/w² .   (2.6)

Thus β̂_R is a posterior mean corresponding to a prior N(0, w²I) distribution for β, and β̂_S is a posterior mean corresponding to a prior N(0, w²(X'X)^{-1}) distribution for β. In principal component terms, if the α_i are a priori independently N(0, w²) distributed, then they are a posteriori independently N(ã_iR, f_i σ²/λ_i) distributed for i = 1, 2, ..., p. The corresponding result for STEIN estimators is that if the α_i are a priori independently N(0, w²/λ_i) distributed, then the α_i are a posteriori independently N(ã_iS, f σ²/λ_i) distributed for i = 1, 2, ..., p.

For a Bayesian, therefore, the choice between RIDGE or STEIN hinges on whether he regards the prior variances of the α_i to be roughly equal or roughly inversely proportional to the λ_i. To assert that RIDGE is better in practice is equivalent to asserting that its prior assumptions are more nearly correct over the range of the statistician's experience. Note especially that if the RIDGE prior is correct then the RIDGE estimator is optimum for any quadratic loss function, including both SEB and SPE. Corresponding remarks can of course be made about the Bayesian view of STEIN.

The precise realization of a RIDGE or STEIN estimator requires a rule for determining k from the data. These rules are discussed in Section 2.4.

2.3 Discrete Shrinking Methods

We consider here six classes of estimation procedures which may be classified into three groups: Group 1 includes FSL, BSL, and MCP; Group 2 includes REGF and RREG; and Group 3 includes PRI.

Group 1 consists of methods which partition the components of β into two subsets. The components in one subset are estimated by least squares under the constraint that the components in the remaining subset are zero. The FSL or forward selection methods proceed by introducing independent variables into the least-squares procedure one at a time, choosing at each step the variable which produces the largest reduction in residual sum of squares at that step. An FSL method chooses among a set of p + 1 partitions of the p independent variables, i.e., one partition which fits r variables for each of r = 0, 1, 2, ..., p. The BSL or backward selection methods proceed by dropping variables one at a time from the complete least-squares fit in such a way that the increase in residual sum of squares is minimized at each step. A BSL method chooses among p + 1 partitions of the independent variables, one for each number r of variables selected, as do the FSL methods, but the set of partitions may differ between BSL and FSL, depending on the data set. The MCP methods consider all 2^p possible partitions of the p independent variables and select one for fitting. The abbreviation MCP is an oblique reference to the C_p statistic which Mallows [16] uses as a criterion for selecting among all possible regressions. The specific variants of FSL, BSL, and MCP used in our study are described in Section 2.4.

From the standpoint of a Bayesian whose prior distribution for β is centered about the zero vector, the FSL, BSL, and MCP methods have the weakness that zero posterior estimates for a subset of β components follow in general only from a prior judgment that those components are precisely zero. In practice it may be plausible to judge a priori that some subset of the β components are close to zero, but it would rarely be possible to match prior judgments with a subset chosen from the data by a somewhat ad hoc selection criterion. The REGF procedures proposed by Dempster [3] attempt to soften the difficulty by supposing that some subset of p − r independent variables have zero β components, and supposing for fixed r that all (p choose r) possible subsets are a priori equally likely. The equiprobability assumption might well be altered in real world practice, but is the most plausible assumption to make in a study of automatic data analysis procedures. The precise definition of a REGF procedure requires a device for choosing r, and a specific rule for computing a posterior probability that each of the (p choose r) subsets is the true subset with zero regression coefficients. The least-squares estimates for each postulated set of r nonzero regression coefficients are then averaged over the posterior distribution of subsets to yield the REGF estimator β̂.

The second class of procedures in Group 2, namely RREG, is a forward selection analog of REGF which avoids the necessity of computing least-squares estimates for all subsets of size r, a task which becomes increasingly onerous as r and p increase. An RREG procedure carries out the REGF procedure for r = 1, and then repeats the REGF r = 1 procedure on the residuals from the first pass, and so on until the residual sum of squares is judged to be small enough. The RREG estimate is formed by summing over the β̂ produced by each application of REGF. We do not recommend that RREG should necessarily be pursued as a practical tool, in part because of its ad hoc nature and in part because the iterations were observed to converge very slowly in the presence of even a few regressors which are highly collinear.

Finally, the PRI procedures of Group 3 are selection procedures based on the principal components representation of the model. Assuming that the principal components, as defined in Appendix A, are ordered so that λ_1 ≥ λ_2 ≥ ... ≥ λ_p, a commonly used procedure is to modify the least-squares estimate a of α so that the last

p − r components are set to zero, yielding

    ã = (a_1, a_2, ..., a_r, 0, 0, ..., 0) .   (2.7)

The corresponding β̂ is computed as C'ã. A specific variant of PRI is defined by a rule for determining r.
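The stepwise search underlying the FSL family can be sketched as follows. This is an illustrative Python fragment (ours, not the authors' code); the exact F statistic and cutoff belong to the specific variants of Section 2.4, so the form used below, with nominal degrees of freedom (1, n − r − 1), should be read as our assumption.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares after a least-squares fit on the given columns."""
    if not cols:
        return float(y @ y)
    Xs = X[:, cols]
    bhat, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ bhat
    return float(r @ r)

def forward_select(X, y, f_cutoff=1.0):
    """FSL sketch: add, at each step, the variable giving the largest RSS drop,
    and stop when the F statistic falls below f_cutoff.  f_cutoff = 1 mimics
    an 'include if F > 1' rule; a .05 critical value would give a milder
    analog of the nominal-level rules."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    rss_cur = rss(X, y, selected)
    while remaining:
        r = len(selected)
        # candidate achieving the largest reduction in RSS
        best = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        rss_new = rss(X, y, selected + [best])
        f1 = (rss_cur - rss_new) / (rss_new / (n - r - 1))
        if f1 <= f_cutoff:
            break
        selected.append(best)
        remaining.remove(best)
        rss_cur = rss_new
    return selected
```

A BSL analog would start from the full fit and drop, at each step, the variable whose removal increases the RSS least.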

2.4 Fifty-Seven Varieties

OREG and ZERO are single procedures, but each of the and families STEIN, RIDGE, FSL, BSL, MCP, REGF, RREG, and PRI (2.10) have several specific variants. Variants of three kinds are used. First, in the case of STEIN and RIDGE, there are where RSS; denotes the residual sum of squares after methods which aim directly at reducing squared error, fitting the best forward selected variables. F1 and F, whether through frequency concepts or through empirical t Bayes concepts. Second, in the case of the remaining are F statistics with nominal degrees of freedom (1, n - T - 1) and (p - T, n - p), respectively. The spe- families, there are methods related to F tests for the cific procedures FSLA, OFSL, and lFSL are based on the significance of variables not yet included in the fit. statistics computed at each stage of selection. FSLA Third, there are three methods associated with each Fl selects a further variable if the Fl test rejects at level family, indicated by the prefixes C, lC, and 2C, which .05/(p - r), OFSL selects if F1 rejects at level .05, and. control the maximum permissible deviation from the lFsL selects if Fl > 1. FSLA thus sets a fairly rigorous least-squares estimator b, where such deviation is mea- standard of significance for a variable to be included, sured in terms of standard confidence contours about b. OFSL uses a mild standard, and IFSL includes a variable 24.1. Continuous Shrinking Methods. In the RIDGE and if there is any indication at all of positive effect with no STEIN classes, the specific variants are SRIDG, RIDGM, requirement of a small tail area. Similarly, FSLN selects CRIDG, lCRIDG, ZCRIDG, and STEINM, CSTEIN, lCSTEIN, a further variable if F, exceeds its nominal .05 critical BCSTEIN. The precise definitions of the last three pro- value, while IFSLN selects if F2 > 1. The C, lC, and 2C cedures in each class are given in Section 2.4.5. The SRIDG variants will be defined in Section 2.4.5. 
procedure seeks to minimize the frequentist expectation The corresponding BSL variants are : BSLA, OBSL, lBsL ; of the criterion SEB defined by (A.12). It is easily shown BSLN, ~BSLN;and CBSL, lCBSL, 2CBSL. The first two sub- that this expectation is minimum when classes are again determined by the statistics Fl and F,, = Xi(krri2 - d) still defined from (2.9) and (2.10) except that RSS~refers c =o. to residuals from the backward selected fit of t inde- i=l (Xi + w3 pendent variables. At the stage of deciding whether to The SRIDG method is defined by choosing k to satisfy retain the (T + 1)st variable or drop it from the fit, a (2.8), after s2 from (A.3) and &R from (2.2) are sub- value of Fl less than its nominal .05/(p - T) critical stituted for 2 and a, in (2.8). The RIDGM and ATEINM value indicates dropping the variable under procedure methods are motivated by the Bayesian interpretation BSLA. Level .05 is used similarly for OBSL, and the cri- of RIDGE and STEIN discussed in Section 2.2. The prior terion F1 < 1 is used for IBSL. BSLN drops the (T + 1)st distribution for the a, which leads to the posterior mean variable if Fz is less than its nominal .05 level, while interpretation of RIDGE also implies that the observable 1Bsm drops a variable if F2 < 1. least-squares estimators ai are marginally independently In the case of CP procedures, t,he FI criterion is not N(0, w2 + a2/X,) distributed. It follows that the prior always sensible because the best variable set of size T + 1 expectation of C ur2/(w2+ $/hi) is p. RIDGM chooses k to need not contain the best variable set of size T. We there- make this quantity equal to its prior expectation, when fore consider only two subclasses : OMCP, lMcP ; and CMCP, s2 is substituted for u2 and k = u2/w2. Similarly, the STEIN IcMcP, zcmp. 
The procedures OMCP and lMcP use Fz at procedure is associated with a marginal N (0, (w2 + u2)/X;) nominal .05 level and F2 = 1, respectively, as criteria for distribution for the a,, and STEINM is defined by the choice passing from T to T + 1. of k such that C hi~;~/(w~+ 8) equals its marginal 2.4.3. REGF Methods. There are two parallel series of expectation p, where again s2 is substituted for u2 and REGF methods, typically labelled REGF and DRGF, which k = u2/w2. It is perhaps unfortunate that we did not are defined by alternative prior distributions of 0. The adopt the specific choice of k recommended by Stein. In two series each appear in three subclasses analogous to the retrospect, however, we can see that Stein’s method would three subclasses of FSL or BSL methods: FREGF, 1FREGF; have performed worse on our data sets than STEINM, the OREGF, IREGF ; CREGF, ICREGF, 2CREGF ; and FDRGF, reason being that improved estimates on our data sets IFDRGF ; ODRGF, IDRGF ; CDRGF, ICDRGF, PCDRGF. require STEIN to shrink more than STEINM provides, while The structure of each of these methods runs as follows. Stein’s recommendation shrinks considerably less. A non-Bayesian scheme is used to select an T on T = 0, 2.4.2. Subset Regression. The FSL variants group 1, 2, . . ., p, whereupon a Bayesian analysis takes over. naturally into the three subclasses : FSLA, OFSL, iFsL ; The Bayesian analysis assumes that exactly T of the p Simulation of Alternatives to Least Squares 81 components of @ are nonzero, but assumes that all (:) the guiding principle of all of the C, lC, and 2C methods. possible subsets are equally probable a priori. Suppose For example, the ICMCP method chooses the best subset that 9, denotes the class of (:) subsets consisting of r of of independent variables of size r, when T is as small as the p independent variables. 
For each I E g,, a posterior possible consistent with (2.12), choosing the right side in probability o(I) is computed for the event that I is the (2.12) to be unity. true subset. Also, for each I E 9, we compute an estimate 8 (I)whose components in the Z positions consist of least- 3. DESIGN AND EXECUTION squares estimatcs of the corresponding @ components, assuming that the remaining coefficients are all set to The plan of our study required drawing from the model zero. These remaining coefficients are of course all esti- (A.l) with p = 6 and n = 20. Conceptually, this meant mated at zero in @(Z). The final estimator @ is defined to be fixing X, @,and 2, and then drawing a random vector e using a standard normal random number generator. In B = c aBm . (2.11) practice, since all of our methods depend only on the 4 €9. sufficient statistics (A.2) and (A.3), we did not actually The details of how to compute w(I) for each of the REGF generate X, el or Y, but instead generated first XTX, and and DRGF series are given in Appendix B. then b and (n - p)s2, where the latter required ran- In Appendix B we also give reformulated definitions dom number generators to simulate the 6-variate appropriate for REGF of the Fl and Fz criteria defined in N(@,uZ(XTX)-l) and XIh2 distributions. The 6-variate (2.9) and (2.10). FREGF and RDRGF use a nominal .05 normal was found by linear transformation of 6 standard critical level for the Fl criterion, and lFREGF and lFDRGF normal deviates and the Xl2 was found by summing the use Fl = 1 as the cutoff point, where in both cases larger squares of 14 standard normal deviates. Uniformly dis- values of F1 force an increase from r to r + 1. The pairs tributed pseudorandom numbers were generated by the OREGF, ODRGF and 1REGF, 1DRw operate similarly in algorithm described in [14], and were transformed to relation to the Fz criterion. 
normal random deviates by a table look-up procedure The RREG Variants are: RREGl ; and CRREG, ICRREG, applied to the cumulative . Further ZCRREG. RREGl makcs repeated use of the FREGF tech- details may be found in [19, 201. nique with r = 1 and stops when the F1 criterion as- We made one drawing from each of 160 different sociated with FREGF fails to indicate proceeding from models. The factors and factor levels of the Z5 structure r = ltor = 2. of Experiment 1 are described. After creating and par- 2.4.4. Regression on Principal Components. The PRI tially analyzing these data, we decided not simply to variants are : PRIF, 1PRIF ; PRIB, lPRm ; and CPRI, 1cPRi, replicate Experiment 1, but instead to create a somewhat ~CPRI.The methods PRIF and PRIB both use the Fz cri- larger data set with more levels of certain factors and terion (2.11) at its nominal .05 level, while lPmF and generally less correlation among the independent vari- IPRIB use the critical value Fz = 1. The difference is that ables. The result u*asthe 128 simulated models of Experi- the F methods proceed through the estimators (2.7) in ment 2, also described. At this point we decided to the order r = 0, 1, . . ., p while the B methods use the analyze and report the results from Experiments 1 and 2 order r = p - 1, . . ., 0. without creating another series of models or drawing 2.4.5. Confidence Contour Constraints. Finally, we de- replicated data sets from the two available series. scribe the C, lC, and 2C variants which are associated The five factors in Experiment 1 are labelled EIG, ROT, with each family. As is well known, an ellipsoidal 1 - a COL, CEN, and BET, The first three of these define the for @ centered at the least-squares XTX matrix of a model. The two levels of factor EIG were estimate b is defined by determined by

= Fp,n-p,l-u (2.12) T diag (32, 25, 16, 9, 4, 1) (0 - b)TXq(@- b)/ps2 I (3.1) = diag (64, 16, 4, 1, .25, .0625) where Fp,n-p,l-udenotes the level a critical value for F on p and n - p degrees of freedom. The idea of C, lC, where diagonal vectors of the diagonal matrices T should and 2C methods is to limit the deviation of @ from b by be regarded as preliminary eigenvalues of XTX. The two requiring that @ lie within an ellipsoid of the form (2.12). levels of EIG specified by (3.1) constitute one device for This idea is similar to the limited risk proposal of Efron putting into the experiment variation in the amount of and Morris [4, 51. The C method uses a = .05 and the correlation among the independent variables. 1C method uses Fp,n--p,l-u= 1. The criterion in the 2C The two levels of ROT correspond to two replications of method is that the residual sum of squares is allowed to matrices with eigenvalues fixed by a given level of EIG. rise by at most 20 percent. When p = 6 and n = 20, the From a pair of 6 x 6 arrays of simulated standard normal 2C criterion is equivalent to the choice Fpsn--p,l--.! = .46. deviates, a pair of random 6 X 6 orthogonal matrices G Given any value of Fp,n--p,l-awe can adjust the k in was created by Gram-Schmidt orthogonalization followed RIDGE or STEIN methods, or the r in the selection methods, by scaling to unit length. We then formed the four inner in such a way that the resulting estimator B is shrunk as product matrices GTTG corresponding to the four levels much as possible subject to the constraint (2.12). This is of EIG X ROT. Finally, these four inner product matrices 82 Journal of the American Statistical Association, March 1977 were reduced to four correlation matrices via scalar same as in Experiment 1, changing (3.2) to division of each row and column by the square root of its diagonal member. 
These correlation matrices specify the four choices of XᵀX actually used at one level of COL. Note that the eigenvalues of GTGᵀ are given by T, but that these are no longer the eigenvalues of the final correlation matrices XᵀX. The second level of COL is defined by modifying the corresponding XᵀX at the first level so that it has .99 in the (1, 2) position, thus introducing substantial collinearity between the first two independent variables. The modification was not simply a replacement of the (1, 2) element by .99, which could have destroyed positive definiteness, but a linear transformation scheme described in Appendix C. We have now described, modulo G, the eight matrices XᵀX used in Experiment 1, corresponding to the eight levels of EIG × ROT × COL.

The factors CEN and BET jointly define β and σ². CEN refers to two levels of the noncentrality parameter, specifically

    βᵀ(XᵀX)β/σ² = 100   (3.2)
                = 200 .

The factor BET refers to two vectors of regression coefficients, specifically

    β = (32, 25, 16, 9, 4, 1)   (3.3)
      = (64, 16, 4, 1, .25, .0625) .

In practice, two values of σ² were determined from (3.2) for each of the two vectors β in (3.3). Since our endpoint criteria SEB (A.10) and SPE (A.11) are unaffected by scale changes β → cβ and σ → cσ, we could equally well have set σ arbitrarily and computed scalar multipliers from (3.2) for the β vectors in (3.3).

In Experiment 2, the factors are EIG at two levels, ROT at four levels, COL at two levels, MCL at two levels, CEN at four levels, and BET at four levels. The actual design is a quarter replicate of the 2³ × 4³ complete design. The preliminary eigenvalue levels of EIG in Experiment 2 were changed to

    T = diag (30, 30, 30, 20, 20, 20)   (3.1)'
      = diag (64, 16, 4, 2, 1, .5) .

For ROT we created four new random orthogonal matrices G by the same algorithm used in Experiment 1. The EIG and ROT levels were crossed as in Experiment 1, yielding now 2 × 4 = 8 correlation matrices XᵀX, which were further crossed with the factors COL and MCL to obtain 8 × 2 × 2 = 32 correlation matrices XᵀX altogether. COL denotes the presence or absence of a deliberately introduced correlation .95 between X₁ and X₂, while MCL means the presence or absence of a deliberately introduced correlation .92 between X₁ − X₂ and X₃, thus providing a partially hidden substantial correlation among the first three independent variables X₁, X₂, and X₃. The COL and MCL algorithms are described in Appendix C. Given XᵀX, the procedure for fixing β and σ² is the same as in Experiment 1, after changing (3.2) to

    βᵀ(XᵀX)β/σ² = 100   (3.2)'
                = 500
                = 10
                = 50

and changing (3.3) to

    β = (1, 1, 1, 1, 1, 1)   (3.3)'
      = (32, 16, 8, 8, 8, 8)
      = (1, 1, 1, 0, 0, 0)
      = (32, 16, 8, 0, 0, 0) .

The experimental data sets were created and analyzed using the APL computer language as implemented under the CP-67 system at the IBM Cambridge Scientific Center. An advantage of APL is that a large number of small but mathematically complex program units can be written and put together with relative ease. A disadvantage is that a large number of routine repetitions of the programs, as would be required for standard large-sample Monte Carlo, becomes prohibitively expensive due to the interpretive nature of APL. The programs described in [20] make it feasible to reproduce much of the data generation and analysis, or to replicate the experiments if so desired.

4. NUMERICAL RESULTS

4.1 Overall Comparisons of 57 Methods

Many different analyses were carried out for purposes of comparing the properties of the various estimators, and relating them to the design factors. Overall comparisons of the methods under study are provided in Table 1, which shows the mean and median values of the two criteria, SEB and SPE, together with their ranks among the 57 methods, for each of the experiments. The methods are arranged so as to put together different versions of the same method, as indicated by the extra space separating the ten different groups.

Examination of Table 1 leads to a number of interesting observations.

1. Ordinary regression (OREG) is inferior to all nontrivial methods of estimation with respect to observed SEB averaged over each series of data sets. In the first experiment, it is even worse than the trivial method of estimating all coefficients to be equal to zero.

2. The reductions in SEB, on average over the observed data sets, achieved by some of the methods under study are as large as 90 percent.

3. Average reductions in SPE are at most 20-30 percent, and ordinary least squares performs better than a number of its competitors on this criterion. These results corroborate our observation in Section 2.3 to the effect that the advantage of RIDGE over STEIN would be less on SPE than on SEB, since SPE weights individual components by λᵢ.

4. The methods which produced the best overall results were not the customary ones such as ordinary least squares, selection of variables, or regression on principal components, but rather versions of RIDGE and REGF. In particular, it is interesting to note that RIDGM was best with respect to mean

Simulation of Alternatives to Least Squares 83
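The data-generating scheme of Section 3 is compact enough to sketch. The helper below is our own illustration, not the APL programs of [20]: it rotates an eigenvalue pattern T by a random orthogonal matrix (a QR-based rotation stands in for the paper's algorithm), standardizes to unit diagonal, and then fixes σ from the noncentrality constraint (3.2).

```python
import numpy as np

rng = np.random.default_rng(0)

def make_xtx(eigs, rng):
    """Rotate diag(eigs) by a random orthogonal G, then rescale to a
    correlation matrix (unit diagonal).  After standardization the
    eigenvalues are no longer exactly `eigs`, as noted in the text."""
    p = len(eigs)
    G, _ = np.linalg.qr(rng.standard_normal((p, p)))
    A = G @ np.diag(eigs) @ G.T
    d = np.sqrt(np.diag(A))
    return A / np.outer(d, d)

def sigma_for_ncp(beta, xtx, ncp):
    """Solve beta' (X'X) beta / sigma^2 = ncp for sigma, as in (3.2)."""
    return float(np.sqrt(beta @ xtx @ beta / ncp))

beta = np.array([32., 25., 16., 9., 4., 1.])    # first BET vector of (3.3)
xtx = make_xtx([30, 30, 30, 20, 20, 20], rng)   # an EIG level, cf. (3.1)'
sigma = sigma_for_ncp(beta, xtx, 100.0)         # first CEN level of (3.2)
```

The scale invariance of SEB and SPE noted in the text is what licenses fixing the noncentrality and solving for σ rather than the reverse.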

1. Means and Ranks and Medians and Ranks of 57 Methods

          Experiment 1              Experiment 2              Experiment 1              Experiment 2

Method    Mean   Mean  Rank Rank    Mean   Mean  Rank Rank    Median Median Rank Rank   Median Median Rank Rank

          SEB    SPE   SEB  SPE     SEB    SPE   SEB  SPE     SEB    SPE   SEB  SPE     SEB    SPE   SEB  SPE

a. Means and ranks b. Medians and ranks

OREG     542.86    5.70  57 27    78.37    6.26  56 21   178.83    5.57  57 33    25.94    5.05  46 24
ZERO     143.92  150.00  35 57   134.19  165.00  57 57   128.87  150.00  54 57    55.42   75.00  57 57

FSLA      93.59    8.26  29 47    68.42   16.37  52 55    62.15    5.28  50 27    31.11    8.23  54 52
CFSL      84.35    8.12  27 46    61.63   10.38  39 50    39.55    6.01  30 40    30.36    7.86  51 49
1CFSL     93.13    4.56  28 10    51.53    6.79  22 29    33.23    3.02  24  3    23.24    5.19  39 26
2CFSL    466.92    4.82  52 17    63.42    6.04  42 16    58.77    4.36  49 19    20.22    4.84  30 16
OFSL      78.61    4.78  22 16    56.27    7.69  30 38    40.06    3.53  31 10    24.12    5.96  40 36
1FSL     215.22    4.89  36 20    60.09    6.08  35 17    56.71    4.79  44 25    19.47    4.83  27 15
FSLN      80.03    6.86  25 39    52.46    8.88  23 43    45.09    6.12  36 42    26.80    7.76  47 47
1FSLN    479.95    4.93  53 21    66.02    8.03  48 14    49.95    4.72  41 22    20.31    4.87  31 17

BSLA      76.22    5.96  19 29    64.34    9.94  43 48    25.20    4.86   8 26    33.76    7.74  56 46
CBSL      78.43    7.97  21 45    67.83   11.37  51 53    31.11    6.27  18 46    33.34    8.78  55 53
1CBSL    273.01    5.06  39 24    65.58    7.33  46 34    27.97    3.31  14  7    28.51    5.90  49 35
2CBSL    499.55    4.87  56 19    60.84    6.33  37 24    37.29    4.23  27 17    25.32    5.32  45 29
OBSL      78.88    4.77  23 13    58.39    7.23  32 33    31.87    3.92  21 13    28.99    5.62  50 33
1BSL     461.50    5.16  51 26    71.31    6.23  53 20    54.93    5.43  43 28    25.30    4.90  43 20
BSLN      77.50    7.38  20 41    60.86    9.82  38 47    27.15    6.69  13 47    31.11    7.91  53 51
1BSLN    421.26    4.97  49 23    74.20    6.36  54 25    37.29    4.76  28 23    25.31    5.43  44 30
CMCP      78.91    7.55  24 43    65.17   10.52  45 52    31.11                   30.36    7.91  52 50
1CMCP    272.53    4.83  38 18    66.50    7.02  49 31    46.40    3.31  37       24.39    5.43  41 31
2CMCP    493.32    4.76  55 12    64.48    6.36  44 26    49.07    4.13  39 16    24.96    5.29  42 28
OMCP      81.47    6.74  26 37    56.46    9.18  31 44    31.11    5.97  20 39    27.70    7.78  48 48
1MCP     416.98    4.78  48 15    75.89    6.17  55 19    39.53    4.35  29 18    20.31    5.26  32 27
CPRI      73.75   11.46  18 53    28.12    8.58   7 42    57.82   11.63  47 54    13.05    7.04  10 42
1CPRI     66.10    6.82  16 38    27.72    6.02   6 13    43.36    5.46  35 30     9.41    5.03   4 22
2CPRI    131.44    6.01  33 30    51.09    5.94  21 12    47.27    5.69  38 35    12.93    5.02   8 21
PRIB     100.25    8.51  30 49    39.36    7.13  11 32    53.40    6.99  42 49    11.24    6.58   6 37
1PRIB    355.37    6.11  43 31    61.92    6.15  40 18    49.37    5.57  40 34    15.00    5.14  14 25
PRIF     102.37    9.21  31 52    27.34    7.50   5 37    57.75    8.62  46 52    11.92    6.72   7 39
1PRIF    363.24    6.35  44 33    55.57    6.03  20 15    58.36    5.71  48 36    12.93    5.03   9 23
STEINM   482.16    5.94  54 28    63.40    5.84  41  9   161.79    5.50  56 31    19.42    4.68  26 11
CSTEI    219.56   24.17  37 56    40.26   16.41  12 56    91.07   24.55  52 56    18.36   14.99  21 56
1CSTEI   325.61   11.76  40 54    46.86    8.54  19 41   113.48   11.39  53 53    15.87    7.51  17 43
2CSTEI   386.46    8.30  46 48    53.94    6.55  27 28   129.22    7.23  55 50    17.60    5.51  19 32

SRIDG     65.02    6.57  15 36    25.76    5.51   4  3    42.75    6.25  34 45     8.16    4.54   3 10
RIDGM     45.16    4.95   1 22    17.59    4.98   2  1    31.87    3.93  22 14     7.43    4.11   1  1
CRIDG     69.51   21.17  17 55    29.29   14.48   8 54    57.69   20.72  45 55    16.98   11.94  18 55
1CRIDG    53.85    9.00   6 51    18.71    6.84   3 30    42.59    8.29  33 51    10.13    5.71   5 34
2CRIDG    52.25    6.13   4 32    17.48    5.25   1  2    36.72    5.54  26 32     8.08    4.53   2  9

CREGF     57.67    7.54  11 42    52.97    9.73  24 46    25.43    6.18   9 43    22.80    7.67  36 45
1CREGF    60.76    4.01  13  2    46.53    6.40  18 27    21.21    3.09   3  4    18.58    4.45  23  8
2CREGF   137.73    3.90  34  1    38.60    5.54  10  4    24.62    3.27   7  6    15.47    4.11  16  2
OREGF     56.76    6.50   9 35    44.86    8.08  16 40    25.43    5.87  10 38    19.99    7.03  29 41
1REGF    355.31    4.39  42  8    58.76    5.62  33  7    28.68    4.47  15 20    14.38    4.35  11  7
FREGF     49.32    4.15   2  6    42.89    7.45  14 35    18.88    2.77   1  1    18.59    4.89  24 19
1FREGF   379.35    4.75  45 11    65.87    5.92  47 11    32.59    4.66  23 21    19.09    4.82  25 14

CDRGF     56.97    7.56  10 44    53.32    9.66  26 45    25.49    6.21  11 44    23.08    7.61  37 44
1CDRGF   126.37    4.12  32  4    48.88    6.30  20 22    23.21    3.17   5  5    19.92    4.35  28  6
2CDRGF   327.46    4.02  41  3    43.95    5.55  15  5    35.67    3.33  25  9    15.35    4.18  15  3
ODRGF     56.52    6.50   8 34    45.13    7.92  17 39    25.49    5.78  12 37    21.07    6.85  33 40
1DRGF    416.96    4.42  47  9    58.92    5.62  34  6    42.03    4.05  32 15    14.95    4.28  13  4
FDRGF     52.20    4.14   3  5    54.80    7.48  28 36    19.77    2.78   2  2    17.65    4.79  20 12
1FDRGF   457.61    4.78  50 14    67.59    5.90  50 10    77.45    4.76  51 24    21.11    4.81  34 13

RREG      59.00    6.96  12 40    60.53   10.49  36 51    29.51    5.44  16 29    21.56    6.59  35 38
CRREG     62.15    8.97  14 50    53.18   10.22  25 49    30.69    6.94  17 48    23.17    9.24  38 54
1CRREG    52.61    5.11   5 25    42.64    6.31  13 23    23.57    3.90   6 12    18.53    4.87  22 18
2CRREG    54.30    4.35   7  7    38.23    5.73   9  8    22.30    3.57   4 11    14.91    4.29  12  5

Journal of the American Statistical Association, March 1977

SEB in the first experiment and second best in the second. In the latter instance, another version of RIDGE, namely 2CRIDG, was slightly better than RIDGM. On the basis of medians rather than mean values, FREGF was best on both criteria in the first experiment, while RIDGM was the number one performer in the second experiment. Comparisons of RIDGM against OREG indicated improvements in average SEB of approximately 92 and 78 percent in the two experiments. RIDGM was also the best method on average SPE in Experiment 2, providing a reduction of 20 percent. In the first experiment it ranked only 22nd on mean SPE although it still provided an average reduction of about 14 percent.

5. The data do not indicate any consistent patterns of behavior with respect to variations in the confidence levels employed in the confidence contour procedures, or in the significance levels used in the various selection-type methods. In most instances, changes in level produced opposite effects on the two criteria. For example, referring to Table 1, for the PRI and STEIN methods, variations from C to 1C to 2C were generally accompanied by improved SPE, but degraded SEB. In other cases, such as BSL, REGF, and DRGF, the directions of change on SEB were opposite in the two experiments. Similar observations could be made with respect to choice of level for the various selection methods. Since the performance of these techniques is generally sensitive to the particular choice of level, their use can be risky.

4.2 A Closer Look at Selected Methods

In this section, we focus our attention on six particular methods, with the objective of learning more about their detailed behavior patterns. The selected methods are representative of the total spectrum of 57 methods in that they include a member of each major class of pullback procedures, as follows:

    Type of pullback           Selected method
    None                       OREG
    Selection of variables     OFSL
    Principal components       PRIF
    STEIN                      STEINM
    RIDGE                      RIDGM
    REGF                       FREGF

The specific choice of a member within each class was based on performance within the group, and similarity to procedures used in current practice. For example, RIDGM was chosen to represent the RIDGE class of estimators, because it appeared to be the best performer within that class. However, PRIF was chosen to represent the principal component class of estimators, even though 1CPRI seemed to be a better performer, because the type of selection procedure used by PRIF is more commonly used in practice. The OFSL, PRIF, and FREGF methods are comparable to one another in the sense that they all employ forward selection procedures, and use stopping rules based on significance at the .05 level.

The reason for simulating the performance of these methods is, of course, that the desired distributions of estimation error can usually not be calculated analytically. In one instance, however, that of RIDGE regression, we can calculate stochastic lower bounds for frequentist expectations of our two criteria. It is of interest then to determine how closely RIDGM approximates these lower bounds, and to compare the performance of the other methods with these bounds as well. A description of the bounds follows.

In terms of the principal components transformation C defined in Appendix A, minimization of SEB (or SPE) in the direction of the ith principal axis, for estimators of the form fᵢaᵢ, results in

    fᵢ = 1/(1 + kᵢ) ,   (4.1)

where kᵢ = (σ/αᵢ)²/λᵢ. One cannot use this estimator in practice, because the kᵢ are unknown. However, we can calculate the kᵢ for every sample in our simulation study, since we know the values of σ and α. The estimates based on (4.1) are referred to below as OPT.

OPT is similar in form to RIDGE except that there is now a separate factor kᵢ along each principal axis. OPT provides a stochastic lower bound for RIDGE methods because it has additional degrees of freedom and also uses true, rather than estimated, values of the adjustment factors. The frequentist expectations of our two criteria for OPT are given by

    E_OPT(SEB) = Σᵢ₌₁ᵖ fᵢ/λᵢ   (4.2)

and

    E_OPT(SPE) = Σᵢ₌₁ᵖ fᵢ ,   (4.3)

whereas the corresponding expectations for ordinary regression are

    E_OREG(SEB) = Σᵢ₌₁ᵖ 1/λᵢ   (4.4)

and

    E_OREG(SPE) = p .   (4.5)

In evaluating the performance of different regression methods, we can use (4.1)-(4.5) in two ways to provide stochastic lower bound type comparisons. First, if we pretended that we knew the values of the fᵢ, we could apply these factors to the sample data, and evaluate the actual performance of OPT on such data. Second, the ratios of the corresponding expected values, (4.2)/(4.4) and (4.3)/(4.5), provide information as to the expected gains in SEB and SPE which could be realized by using the optimal pull-back procedure.

The shrinking factor fᵢ can be interpreted in terms of the noncentrality parameter, Aᵢ², along the ith principal axis, as

    fᵢ = Aᵢ²/(1 + Aᵢ²) .   (4.6)

Thus, when the noncentrality parameter is large, the shrinking factor is close to one, showing that least squares is asymptotically efficient.

A. Cumulative Distributions of E_OPT(SEB)/E_OREG(SEB) by Noncentrality Level, Experiment 1

B. Cumulative Distributions of E_OPT(SPE)/E_OREG(SPE) by Noncentrality Level, Experiment 1

C. Cumulative Distributions of E_OPT(SEB)/E_OREG(SEB) by Noncentrality Level, Experiment 2
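The behavior of the optimal factors is easy to see numerically. The sketch below uses made-up λᵢ and αᵢ values (not the experimental ones) to show that fᵢ from (4.6) agrees with the 1/(1 + kᵢ) form of (4.1), and that the expected-gain ratios (4.2)/(4.4) and (4.3)/(4.5) fall as the noncentralities shrink.

```python
import numpy as np

def opt_factors(alpha, lam, sigma):
    """Optimal shrinkage along each principal axis, (4.6):
    f_i = A_i^2 / (1 + A_i^2), with A_i^2 = lam_i * alpha_i^2 / sigma^2."""
    A2 = lam * alpha ** 2 / sigma ** 2
    return A2 / (1.0 + A2)

lam = np.array([30.0, 20.0, 10.0, 5.0, 2.0, 1.0])   # illustrative eigenvalues
alpha = np.array([3.0, 2.0, 1.0, 0.5, 0.25, 0.1])   # illustrative alphas

for sigma in (1.0, 4.0, 16.0):     # larger sigma means smaller noncentrality
    f = opt_factors(alpha, lam, sigma)
    seb_ratio = np.sum(f / lam) / np.sum(1.0 / lam)  # (4.2)/(4.4)
    spe_ratio = np.sum(f) / len(lam)                 # (4.3)/(4.5)
    print(f"sigma={sigma:5.1f}  SEB ratio={seb_ratio:.3f}  SPE ratio={spe_ratio:.3f}")
```

As in Figures A-D, the ratios drop toward zero as the noncentralities get small, which is where shrinking pays off most.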


One view of the particular parameter sets chosen for our two series of data sets is given in Figures A-D, which present cumulative sample distributions of the ratio of E_OPT to E_OREG for each of the criteria, by noncentrality level. It is readily apparent from these graphs that opportunities for large improvements via shrinking are greatest for small noncentralities. This property is borne out in our observed data, as illustrated in Table 2, which shows for the various methods the ratios of observed mean criterion values (SEB and SPE) to corresponding least-squares values, separately by noncentrality level. Generally, the largest relative gains have been achieved at the lowest noncentralities. This is particularly evident for SEB, which has larger expected gains than does SPE.

Further insight into the relative performance properties of the selected estimators is provided by Figures E-H, which show the cumulative sample distributions of the ratios of the observed values of the criteria for each of the methods to their corresponding least-squares observed values. Examination of the SEB graphs (Figures E and G) shows that STEINM is the most conservative procedure of those being compared, in that it most closely approximates OREG. It provides improved performance in most cases (80 percent of the cases for Experiment 1, and 90 percent for Experiment 2), although the improvements are not as large on the average as those produced by the other methods. It also is less risky than the others in the sense that it seldom produces large degradations with respect to least squares.

D. Cumulative Distributions of E_OPT(SPE)/E_OREG(SPE) by Noncentrality Level, Experiment 2

E. Cumulative Distributions of SEB Ratios, Experiment 1


2. Ratios of Average Criteria Values for Individual Methods to Those for Least Squares, by Noncentrality Level

                 Experiment 1          Experiment 2
    Method NCP    100     200      10      50     100     500
    SEB
    OFSL          .07     .46      .22     .49    1.03     .94
    PRIF          .14     .38      .13     .31     .39     .46
    STEINM        .88     .92      .38     .78     .90     .97
    RIDGM         .05     .25      .09     .21     .27     .27
    FREGF         .04     .31      .12     .37     .65     .84
    EOPT          .03     .04      .03     .09     .12     .20
    OPT           .03     .21      .06     .09     .17     .19
    SPE
    OFSL          .65    1.06      .88    1.37    1.43    1.17
    PRIF         1.70    1.54     1.14    1.26    1.13    1.22
    STEINM       1.02    1.06      .75    1.03     .93     .98
    RIDGM         .80     .94      .61     .88     .78     .88
    FREGF         .57     .91      .73    1.50    1.30    1.10
    EOPT          .72     .59      .66     .34     .53     .61
    OPT           .54     .77      .35     .53     .61     .71

The FREGF procedure appears best on both criteria in Experiment 1, while RIDGM is the clear winner in Experiment 2. These conclusions are borne out also by Table 1, which shows that FREGF and RIDGM have the smallest median values on the two criteria for Experiments 1 and 2, respectively. The apparent reason for the reversal in performance from Experiment 1 to 2 is that FREGF is affected by the distribution of beta values, whereas RIDGM is not. In general, the performance of FREGF improves as the distribution of the beta values becomes more skewed, and as the degree of truncation (i.e., number of zero or near-zero values) increases. In Experiment 1, the mean SEB for RIDGM was approximately the same (46.2 and 44.1) at both levels of BETA, while the corresponding values for FREGF decreased from 75.1 to 23.5, for BET0 and BET1, respectively. Thus, the sharply improved performance of FREGF at BET1, corresponding to the extreme distribution (64, 16, 4, 1, .25, .0625), appears to be responsible for the better overall performance of FREGF in the first experiment.

F. Cumulative Distributions of SPE Ratios, Experiment 1

G. Cumulative Distributions of SEB Ratios, Experiment 2

Comparisons of the distributions of SPE ratios (Figures F and H) yield similar results. Again, FREGF appears to be the best choice in the first experiment, while RIDGM looks best in the second.

4.3 Effects of Design Factors

In this section we assess further how the design factors affect the performance of the various methods. Logarithmic transformations were carried out on both SEB and SPE values because they have highly skewed chi-square distributions. Normal probability plots of the log-transformed sample values indicated reasonably normal behavior.

Initial analyses treated the data simply by computing analyses of variance and looking for significant main effects and first-order interactions. A typical analysis of this type for log SEB, Experiment 2, is presented in Table 3 for OREG, OFSL, STEINM, RIDGM, and FREGF. It shows F values for main effects and first-order interactions, together with signs of the effects, for all effects which are significant at the .05 level. The error sums of squares are based on pooling of second- and higher-order interactions. We note that there are relatively few significant interactions, and that they generally have much smaller F values than the significant main effects. In the case of OREG, there are no significant interactions at all.

For most methods, eigenvalue structure, multicollinearity, and noncentrality are significant. Noncentrality is a significant factor in all methods except for OREG, and the performance of each of these methods degrades, as previously noted in Section 4.2, as the degree of noncentrality increases. The largest effect in OREG is the eigenvalue structure, as might be expected from (4.4). Collinearity and multicollinearity also have significant effects in OREG, displaying approximately the same F values and directions. The effects of rotation are not significant, which provides some reassurance as to the validity of the experimental results, since the random rotations can be regarded as replications in the case of ordinary regression.

The effects of the regression coefficient structure are significant only for those methods which select variables (i.e., OFSL and FREGF). The coefficient structure has no effect on OREG, a result which should be expected since E_OREG(SEB) does not depend on the coefficient structure. Nor is it a significant factor in the smoother pull-back procedures RIDGM and STEINM. A closer examination of the data is provided by Table 4.
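The initial analyses just described reduce, for any single factor, to a standard balanced ANOVA F ratio on the log criterion. A minimal sketch with synthetic numbers (ours, not the experimental data; `main_effect_F` is our helper name):

```python
import numpy as np

def main_effect_F(y, factor):
    """F statistic for a two-level factor in a balanced one-way layout:
    MS(between groups) / MS(within groups)."""
    y, factor = np.asarray(y, float), np.asarray(factor)
    groups = [y[factor == g] for g in (0, 1)]
    grand = y.mean()
    ms_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups) / (2 - 1)
    ms_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (len(y) - 2)
    return ms_between / ms_within

# toy data: log SEB noticeably larger at the high noncentrality level
rng = np.random.default_rng(1)
low = rng.normal(2.0, 0.3, 16)
high = rng.normal(3.0, 0.3, 16)
logseb = np.concatenate([low, high])
cen = np.repeat([0, 1], 16)
F = main_effect_F(logseb, cen)
print(f"F = {F:.1f}")
```

In the paper's analyses the same idea is applied jointly to all factors and their first-order interactions, with higher-order interactions pooled into error.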

H. Cumulative Distributions of SPE Ratios, Experiment 2

gains over least-squares prediction (SPE) are either observed for FREGF (in Experiment 1) or for RIDGM (in Experiment 2). From the paired comparisons with RIDGM it can be seen that FREGF is its only serious competitor. STEINM gives significantly larger estimation and prediction errors in both experiments, and generally, OFSL and 1CPRI give worse results as well. The classical selection-of-variables method OFSL turns out to be clearly inferior to the Bayes approach to variable selection: the significantly larger estimation errors of OFSL relative to FREGF in both experiments are not counterbalanced by smaller prediction errors.

4. Mean Differences in Log SEB and Log SPE

                                              Compared methods
    Standard
    method    Experiment    OREG     OFSL     FREGF    1CPRI    RIDGM    STEINM

    SEB
    OREG      1                     -1.14b   -1.76b   -0.99b   -1.32b   -0.05
              2                     -0.14    -0.49b   -0.84b   -1.12b   -0.22b
    RIDGM     1            1.32b     0.12    -0.44     0.33              1.26b
              2            1.12b     0.98b    0.63b    0.27b             0.85b

    FREGF     1            1.76b     0.61              0.77     0.44     1.70b
              2            0.49b     0.36b           -0.35b   -0.63b    0.23
    SPE
    OREG      1                     -0.41a   -0.66b    0.10    -0.15     0.06
              2                     -0.04    -0.15    -0.10    -0.29    -0.08
    RIDGM     1            0.15     -0.27    -0.51b    0.25a             0.21b
              2            0.29b     0.25b    0.14     0.19b             0.20b

    FREGF     1            0.66b     0.25b             0.76b    0.51     0.72b
              2            0.15      0.11              0.05    -0.14     0.06

a Indicates significance at the 0.05 level.  b Indicates significance at the 0.01 level.

Table 4 presents the mean differences in log SEB and log SPE between each of five methods and each of the standards OREG, RIDGM, and FREGF. The results differ somewhat in the two experiments, but the overall picture is unchanged. Least-squares estimation (SEB) is strongly improved by FREGF, 1CPRI, and RIDGM in both experiments, and outstanding

3. F Values for .05 Level Significant Effects, Together with Signs of Effects, for Log SEB, Experiment 2

                         OREG     STEINM    RIDGM    FREGF     OFSL
    EIG                  57.42     65.17    40.14    24.94    28.85
    MCL                  16.81     15.91     6.00    10.69    13.90
    COL                  15.43     13.68
    CEN1                           12.92    28.80    27.16    13.74
    CEN2                           -8.00   -32.36   -59.46   -42.19
    Beta shape                                                -5.16
    Beta truncation                                          -15.45
    EIG × COL            -5.41
    EIG × CEN2                     -8.98             -7.65
    MCL × CEN1                              10.68
    MCL × CEN2                     -6.61            -13.97
    COL × CEN2           -6.71
    CEN1 × CEN2                     8.29     6.16    19.94    23.06
    CEN1 × beta trunc                                -4.28    -7.64
    MCL × ROT1                               4.53

The next level of analysis attempted to introduce the eigenvalue and noncentrality structures into the regression models in such a way as to reflect the ways in which they enter the expressions for the expected values of our two criteria, as given by (4.2)-(4.5). Specifically, using SEB as an example, we attempt to partition the total sum of squares of log SEB into components due to log E_OPT(SEB), (log E_OREG(SEB) − log E_OPT(SEB)), (log SEB_OREG − log E_OREG(SEB)), and components due to the various design factors.

The results of these analyses are presented in Table 5, which shows F values corresponding to terms in the models together with the squared multiple correlation coefficients. The rows of the tables are ordered according to the sequence of introduction of terms into the regression models. In all cases, error sums of squares are based on pooled interactions of all orders. In all of these analyses, the salient feature is that most of the explained variation is associated with the partitioning of the covariate (i.e., log SEB_OREG or log SPE_OREG) into its indicated components. Thus, after introducing these terms, which represent specific formulations of the ways in which the eigenvalue and noncentrality structures affect the re-
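The covariate partition just described rests on a telescoping identity: log SEB_OREG = log E_OPT(SEB) + log[E_OREG(SEB)/E_OPT(SEB)] + log[SEB_OREG/E_OREG(SEB)]. A minimal check with illustrative numbers of our own:

```python
import numpy as np

lam = np.array([30.0, 20.0, 10.0, 5.0, 2.0, 1.0])   # illustrative eigenvalues
f = np.array([0.99, 0.97, 0.9, 0.8, 0.5, 0.2])      # illustrative shrinkage factors

e_opt = np.sum(f / lam)       # E_OPT(SEB), (4.2)
e_oreg = np.sum(1.0 / lam)    # E_OREG(SEB), (4.4)
seb_oreg = 1.7 * e_oreg       # a hypothetical observed OREG criterion value

parts = (np.log(e_opt),
         np.log(e_oreg) - np.log(e_opt),
         np.log(seb_oreg) - np.log(e_oreg))
# the three components reassemble log SEB_OREG exactly
assert np.isclose(sum(parts), np.log(seb_oreg))
```

Because the three components sum identically to the covariate, any design-factor effect they absorb is removed from the residual terms, which is why the remaining factor effects in Table 5 are mostly insignificant.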

5. F Values for Regression Models

                                        Experiment 1                                         Experiment 2
Regression model          OPT    PRIF   STEINM   RIDGM   FREGF   OFSL     OPT      PRIF    STEINM    RIDGM    FREGF   OFSL

a. Log SEB
Rotations                4.68a    .81     .93    9.24b   2.51    2.63      .26     1.05     5.10b      .84      .76    .16
Log (Σfᵢ/λᵢ)            64.43b  16.43b  19.13b  60.06b   2.83    5.29a  290.89b  227.11   591.53b  173.73b  105.24b  76.32b
Log [(Σ1/λᵢ)/(Σfᵢ/λᵢ)]    .03     .18    3.41    7.99b   1.60    5.70a   21.16b    1.20      1.93   762.22b    4.47a  3.93a
Log [SEB OREG/(Σ1/λᵢ)]    .74    3.65  2329.73b  4.21    3.45    4.73a    8.38b   11.22b  1180.22b   46.04b   14.31   3.92a
Log NCP                   .00    2.07     .36    3.05     .07     .19      .04     1.77     43.91     4.83a   11.20b  4.70a
EIG                       .06     .05     .01     .46     .00     .66     1.33     1.30       .82      .85      .90   2.09
MCL                                                                       2.92      .59       .02     3.31     1.20    .08
COL                       .82    2.15     .01     .78    3.20    3.48      .13      .39       .01      .00      .25    .02
Beta level                                                                 .11     3.48       .35     1.08    11.54b 12.80b
Beta shape                                                                1.85      .05       .31      .02     4.38a  2.06
Beta trunc                                                                 .18      .33      6.41a    2.76     3.30  14.15b
R²                        .75     .56    1.00     .78     .56     .69      .72      .68       .96      .67      .76    .50

b. Log SPE
Rotations                8.51b   4.63a    .55     .95    1.12     .05      .91      .42      2.05     1.49      .05    .23
Log SPE OREG             2.74   63.70b            .18    1.03     .35   120.99b   22.77b   257.80b  136.61    56.07b 32.52a
Log Σfᵢ                 76.07b    .03     .98    1.81   22.56b  49.80b   70.26b    3.00      8.93b   20.21b   14.72b 13.48b
Log NCP                  6.96a    .03     .98     .33    1.08    1.21      .03      .87      4.01a     .66     1.04   1.56
EIG                       .64     .12     .01     .07     .33     .43      .56      .29      1.25      .47     1.33   1.51
MCL                                                                       3.01     2.24      7.44a    5.07a     .00    .30
COL                       .03     .15     .27     .87    3.81    7.69a     .00      .10       .00     1.53     1.54   1.59
Beta level                                                                2.49      .15       .00      .03      .89    .07
Beta shape                                                                 .58      .81       .45      .17    19.37b 15.31b
Beta trunc                                                                 .05     1.18       .46     2.70     1.90  10.62b
R²                        .80     .74     .11     .15     .56     .71      .63      .22       .71      .60      .45    .40

a Indicates significance at the 0.05 level.  b Indicates significance at the 0.01 level.

spective criteria, the residual effects of the original design factors eigenvalue structure, collinearity, and multicollinearity usually become insignificant. In most cases, noncentrality level also disappears as a significant factor, although it has a sizeable effect on log SEB for the STEINM and FREGF methods. The effects of beta structure are highly significant only for FREGF and OFSL, which select variables corresponding to significant beta values.

5. CONCLUDING REMARKS

We have illustrated the large improvements in SEB and SPE which are theoretically known to be possible through the substitution of either smooth or discontinuous pull-back procedures for ordinary least squares. The potential relative gains in accuracy for individual estimated regression coefficients are seen to be typically much larger than those for predicted Y values. The expected gains in SEB and SPE are fairly small for STEINM, while for RIDGM and PRIF they are larger and exhibit dependencies on design variables similar to those of the artificial method OPT, i.e., dependencies mainly on eigenvalues and noncentrality factors. Methods such as OFSL or FREGF, which select variables, are highly dependent additionally on the pattern of true regression coefficients. The relatively conservative STEINM procedure has one advantage, in that it less often does worse than OREG. The relative accuracy of traditional selection-of-variables methods based on significance testing behaves erratically in relation to the significance levels adopted. While these results could have been predicted in a general way from either Bayesian or frequentist theoretical considerations, we believe numerical depictions make the results both concrete and vivid.

We would like to see the methodology of the study moved closer to helping the statistician with a specific body of data under analysis. For example, as remarked in Section 1, we could perform similar studies with the design matrix X fixed at the values of a particular data set. Another device would be to apply regression methods to n − 1 rows of (Y, X) and to evaluate predictions for the complementary row, repeating with different selections of rows, much in the spirit of jackknifing. Yet another, and more Bayesian, way would be to ask the statistician to specify 5 or 10 prior distributions covering a plausible range, and then to compare the closeness of various alternative regression procedures to the 5 or 10 posterior means. The Bayesian view offers some clarification of the real problem posed by a given set of data: if the correct analysis depends critically on the model and prior adopted, over some reasonable range, then the statistician should not expect any favorite procedure taken from his kit of tools to be automatically applicable. Our study suggests that conflicts will often appear among REGF, RIDGE, and STEIN estimates, which should cause statisticians to reexamine both their data and their prior understanding for clues.

APPENDIX A

The normal linear model is written in the form

    Y = Xβ + e   (A.1)

where Y is an n × 1 vector of values on the response or dependent variable, X is an n × p matrix giving corresponding values on p independent variables, and e is an n × 1 vector of independent N(0, σ²) disturbances. The least-squares estimator b of β is expressible in terms of Y and X according to

    b = (XᵀX)⁻¹(XᵀY) ,   (A.2)

which together with the residual mean square

    s² = (Y − Xb)ᵀ(Y − Xb)/(n − p)   (A.3)

defines sufficient statistics for the parameters β and σ² of (A.1). We can summarize the sampling aspects of the model by asserting that b and (n − p)s² are independently distributed as N(β, σ²(XᵀX)⁻¹) and σ² times a χ² variable on n − p degrees of freedom, given fixed nonsingular XᵀX.

The RIDGE and PRI methods are closely related to a principal components analysis of the p independent variables. Accordingly, it is convenient to have notation for replacing the given independent variables X by p linear combinations

    X* = XCᵀ   (A.4)

where C is chosen such that

    X*ᵀX* = C(XᵀX)Cᵀ = diag (λ₁, λ₂, ..., λₚ)   (A.5)

and

    CCᵀ = I .   (A.6)

The model (A.1) is correspondingly reexpressed as

    Y = X*α + e

where

    α = Cβ .

The least-squares estimates a = Cb of α have a simple sampling distribution. Specifically, the components aᵢ are independently distributed with N(αᵢ, σ²/λᵢ) distributions for i = 1, 2, ..., p.

The principal components transformation C is not invariant in general under linear transformations of the p independent variables, and in particular is not invariant under linear changes of scale in each variable separately. In our work we have rescaled the variables so that the diagonal elements of XᵀX are all unity, so that our principal components are, apart from a single scale factor, the principal components of the correlation matrix among the independent variables. This convention is an essential part of the definition of the RIDGE and PRI methods which we study.

Two criteria are used throughout our study to measure the deviation of an estimated vector β̂ from its true value β, namely

    SEB = (β̂ − β)ᵀ(β̂ − β)/σ²   (A.10)

and

    SPE = (β̂ − β)ᵀXᵀX(β̂ − β)/σ² .   (A.11)

SEB abbreviates sum of errors of betas, while SPE abbreviates sum of prediction errors. The connection between SPE and prediction is as follows. Suppose that a new set of responses Y* is drawn from the linear model Y* = Xβ + e*, using the same design matrix X appearing in the original data set which yielded β̂, and using the same β, but using new e* independent of the original e. If Xβ̂ is used to predict Y*, then the sum of squares of prediction errors averaged over e* for fixed β̂ is nσ² + σ²·SPE.

The formulas expressing SEB and SPE in terms of principal components are

    SEB = Σᵢ₌₁ᵖ (α̂ᵢ − αᵢ)²/σ²   (A.12)

and

    SPE = Σᵢ₌₁ᵖ λᵢ(α̂ᵢ − αᵢ)²/σ² ,   (A.13)

where α̂ = Cβ̂.

APPENDIX B

Further details of the REGF methods are sketched here. In particular, we derive the posterior weights w used in (2.11), and we define the stopping criteria F₁ and F₂. The notation established in Section 2.4 represents by I a subset of the p independent variables, and by 𝒢ᵣ the class of all such subsets of size r. Given that only the components β(I) of β are nonzero, (A.1) can be written in the form

    Y = X(I)β(I) + e   (B.1)

where X(I) and β(I) denote the parts of X and β corresponding to I. The obvious generalizations of (A.2) and (A.3) define b(I) and s(I)².

For a given value of r, the REGF procedures assume that the (p choose r) members of 𝒢ᵣ are a priori equally likely candidates to specify the nonzero components of β. Given any particular I ∈ 𝒢ᵣ, it remains to specify a prior density for β(I) and σ². The REGF and DRGF subfamilies are specified by flat prior density elements of the form K(I)dβ(I) where

    K(I) = 1  and  K(I) = [det (X(I)ᵀX(I))]^(1/2) ,   (B.2)

respectively. The alternative forms of K(I) arise from considering β(I) to have a multivariate normal prior distribution with alternative covariance matrices cI and c(X(I)ᵀX(I))⁻¹, respectively, and then letting c → ∞. We also use the common form of flat gamma prior density for h = 1/σ², namely the density element h^(a/2−1)dh, where a = 0, 1, 2 are typical choices. Thus the joint prior density element for I, β(I), and h is

    [K(I)dβ(I)] × [h^(a/2−1)dh] ,   (B.3)

where a remains to be chosen.

The likelihood of I, β(I), and h from (B.1) is proportional to

    h^(n/2) exp (−(h/2)(Q₁ + Q₂))   (B.4)

where

    Q₁ = (b(I) − β(I))ᵀX(I)ᵀX(I)(b(I) − β(I))   (B.5)

and

    Q₂ = (n − r)s(I)² .   (B.6)

Multiplying (B.3) and (B.4) and integrating out β(I) and h, we find that the posterior probabilities are proportional to

    w*(I) = K(I)[det (X(I)ᵀX(I))]^(−1/2) ((n − r)s(I)²)^((r−a−n)/2) ,   (B.7)

whence the actual posterior weights used in (2.11) are

    w(I) = w*(I) / Σ_{J∈𝒢ᵣ} w*(J) .   (B.8)

The choices

    a = r  and  a = r − 1   (B.9)

were associated with the respective choices (B.2) for K(I) to define REGF and DRGF.

The selection criterion F₁ used by FREGF, FDRGF, 1FREGF, and 1FDRGF is an analog of a 2 log-likelihood ratio statistic. Specifically, the statistic used to judge whether r should be increased from r to r + 1 is

    F₁ = L_{r+1} − L_r   (B.10)

where

    L_r = −Σ_{I∈𝒢ᵣ} w(I) log ((n − r)s(I)²) .   (B.11)

We treated F₁ as nominally an F on (1, ∞) degrees of freedom.

The F₂ criterion used by OREGF, ODRGF, 1REGF, and 1DRGF is also defined by posterior averaging. Specifically,

    F₂ = Σ_{I∈𝒢ᵣ} w(I)F₂(I) ,   (B.12)

where F₂(I) is defined by applying (2.11) to the fitted subset I.
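The equivalence of (A.10)-(A.11) with the principal-components forms (A.12)-(A.13) is easy to verify numerically. In this sketch (illustrative values of our own), C is built from the eigenvectors of XᵀX so that (A.5) and (A.6) hold; the paper's standardization to unit diagonal is not needed for the identity itself.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 6
M = rng.standard_normal((40, p))
xtx = M.T @ M                        # a positive definite X'X
lam, C = np.linalg.eigh(xtx)         # columns of C are eigenvectors
C = C.T                              # now C (X'X) C' = diag(lam), C C' = I

beta = rng.standard_normal(p)
betahat = beta + 0.1 * rng.standard_normal(p)   # some estimate of beta
sigma2 = 1.0

seb = (betahat - beta) @ (betahat - beta) / sigma2          # (A.10)
spe = (betahat - beta) @ xtx @ (betahat - beta) / sigma2    # (A.11)

d = C @ (betahat - beta)             # estimation errors on the principal axes
assert np.isclose(seb, np.sum(d ** 2) / sigma2)             # (A.12)
assert np.isclose(spe, np.sum(lam * d ** 2) / sigma2)       # (A.13)
```

The identity is what lets the paper analyze every pull-back method axis by axis, with each axis contributing (α̂ᵢ − αᵢ)²/σ² to SEB and λᵢ times that to SPE.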

APPENDIX C

The devices used for introducing prespecified collinearity and multicollinearity into a given correlation matrix R consist of repeated applications of transformations of the type

R → CRCᵀ ,   (C.1)

where C is square and nonsingular. C may be naturally interpreted as a mapping from one basis of the linear space of six independent variables to another basis, where R and CRCᵀ are the initial and final covariance matrices. For example, in Experiment 1, to introduce .99 into the (1, 2) position of R we first used a C which replaced the standardized variables by their sum and difference, so that (C.1) took the form (in the relevant 2 × 2 block)

[ 1     r_12 ]      [ 1 + r_12      0      ]
[ r_12  1    ]  →   [ 0          1 − r_12  ] .   (C.2)

We then rescaled the new variables by [1.99/(1 + r_12)]^(1/2) and [.01/(1 − r_12)]^(1/2) so that the right side of (C.2) achieved a similar form with r_12 replaced by .99. Finally, we reversed the transformation used in (C.2) to get the desired result.

In Experiment 2, the first step (C.2) was the same. We then rescaled the difference variable to have unit variance, and in order to introduce a correlation .92 between this variable and the third variable we applied the same technique as in Experiment 1. Finally, we unstandardized the difference variable to get back the form on the right side of (C.2), and we repeated the technique of Experiment 1 to introduce correlation .95 between the first two variables.

[Received May 1974. Revised September 1976.]

REFERENCES

[1] Baranchik, Alvin J., "A Family of Minimax Estimators of the Mean of a Multivariate Normal Distribution," Annals of Mathematical Statistics, 41, 2 (1970), 642-5.
[2] Conniffe, D., and Stone, J., "A Critical View of Ridge Regression," The Statistician, 22 (1973), 181-7.
[3] Dempster, Arthur P., "Alternatives to Least Squares in Multiple Regression," in D. Kabe and R.P. Gupta, eds., Multivariate Statistical Inference, Amsterdam: North-Holland Publishing Co., 1973, 25-40.
[4] Efron, Bradley, and Morris, Carl, "Limiting the Risk of Bayes and Empirical Bayes Estimators, Part I: The Bayes Case," Journal of the American Statistical Association, 66 (December 1971), 807-15.
[5] ———, and Morris, Carl, "Limiting the Risk of Bayes and Empirical Bayes Estimators, Part II: The Empirical Bayes Case," Journal of the American Statistical Association, 67 (March 1972), 130-9.
[6] ———, and Morris, Carl, "Empirical Bayes on Vector Observations: An Extension of Stein's Method," Biometrika, 59, 2 (1972), 335-47.
[7] ———, and Morris, Carl, "Stein's Estimation Rule and Its Competitors: An Empirical Bayes Approach," Journal of the American Statistical Association, 68 (March 1973), 117-30.
[8] ———, and Morris, Carl, "Combining Possibly Related Estimation Problems," Journal of the Royal Statistical Society, Ser. B, 35, 3 (1973), 379-421.
[9] ———, and Morris, Carl, "Data Analysis Using Stein's Estimator and Its Generalizations," Journal of the American Statistical Association, 70 (June 1975), 311-9.
[10] Goldstein, M., and Smith, Adrian F.M., "Ridge-Type Estimators for Regression Analysis," Journal of the Royal Statistical Society, Ser. B, 36, 2 (1974), 284-91.
[11] Hoerl, Arthur, and Kennard, Robert, "Ridge Regression: Biased Estimation for Nonorthogonal Problems," Technometrics, 12 (February 1970), 55-67.
[12] ———, and Kennard, Robert, "Ridge Regression: Applications to Nonorthogonal Problems," Technometrics, 12 (February 1970), 69-82.
[13] James, W., and Stein, Charles, "Estimation with Quadratic Loss," in J. Neyman, ed., Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley: University of California Press, 1961, 361-79.
[14] Lewis, Peter A.W., Goodman, A.S., and Miller, J.M., "A Pseudo-Random Number Generator for the System/360," IBM Systems Journal, 8, 2 (1969), 136-46.
[15] Lindley, Dennis V., and Smith, Adrian F.M., "Bayes Estimates for the Linear Model," Journal of the Royal Statistical Society, Ser. B, 34, 1 (1972), 1-41.
[16] Mallows, Colin L., "Some Comments on Cp," Technometrics, 15 (November 1973), 661-75.
[17] Rolph, John E., "Choosing Shrinkage Estimators for Regression Problems," Communications in Statistics, Theory and Methods, A5(9) (1976), 789-801.
[18] Sclove, Stanley, "Improved Estimators for Coefficients in Linear Regression," Journal of the American Statistical Association, 63 (June 1968), 596-606.
[19] Wermuth, Nanny, An Empirical Comparison of Regression Methods, unpublished Ph.D. thesis, Department of Statistics, Harvard University, Cambridge, Mass., 1972.
[20] ———, APL-Functions for Data Simulation, Regression Methods and Data Analysis Techniques, Research Report CP-15, Department of Statistics, Harvard University, Cambridge, Mass., 1972.
[21] Zellner, Arnold, and Vandaele, Walter, "Bayes-Stein Estimators for k-Means, Regression and Simultaneous Equations," H.G.B. Alexander Research Foundation, Graduate School of Business, University of Chicago, 1972.
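The Appendix C device translates into a few lines of code. The following sketch (the function name and the final symmetrization step are ours) applies one sum-and-difference transformation to inject a target correlation into one position of a correlation matrix:

```python
import numpy as np

def inject_correlation(R, i, j, rho):
    """One application of R -> C R C^T as in (C.1): rotate (x_i, x_j) to
    their sum and difference, rescale those to variances 1 + rho and
    1 - rho, then reverse the rotation, following (C.2)."""
    p = R.shape[0]
    r = R[i, j]
    S = np.eye(p)                    # sum/difference rotation on coordinates i, j
    S[i, i], S[i, j], S[j, i], S[j, j] = 1.0, 1.0, 1.0, -1.0
    D = np.eye(p)                    # rescale the rotated variances
    D[i, i] = np.sqrt((1.0 + rho) / (1.0 + r))
    D[j, j] = np.sqrt((1.0 - rho) / (1.0 - r))
    C = np.linalg.inv(S) @ D @ S     # composite transformation
    out = C @ R @ C.T
    return 0.5 * (out + out.T)       # symmetrize against round-off

R = inject_correlation(np.eye(6), 0, 1, 0.99)
# starting from the identity, R[0, 1] is now .99 and the diagonal is
# still 1, as in the first step of Experiment 1
```

Repeated applications, as the appendix describes, build up the multicollinear structures of Experiments 1 and 2.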

Comment

BRADLEY EFRON and CARL MORRIS*

Drs. Dempster, Schatzoff, and Wermuth deserve commendation for their substantial and novel efforts, which have resulted in a comparison of 57 competing estimation methods over a variety of situations. We applaud their effort to shed light on the task of the data analyst, and share their hope that their methodology also will be of use in other problems.

*Bradley Efron is Professor and Chairman, Department of Statistics, Stanford University, Stanford, CA 94305. Carl Morris is Senior Statistician, The RAND Corporation, Santa Monica, CA 90406.


In these comments, we shall be concerned with just two variants of ordinary regression, RIDGE and STEIN. Our interest is focused in this way because we have given considerable thought to estimators of these types in our own research over recent years on empirical Bayes and Stein-type estimation, because more analytical power can be brought to bear on these methods than suggested in the paper, and because for the loss functions used, the performance of these estimators depends on the experimental inputs only through the eigenvalues of the matrix X'X. This last reason simplifies matters greatly for understanding RIDGE and STEIN, although it does not apply to most of the other 57 varieties. For lack of space, and because we haven't studied them carefully, we will not discuss the confidence contour constraint variants of Section 2.4.5 of RIDGE and STEIN.

This leaves STEINM, RIDGM, and SRIDG as the only three rules under consideration. When the eigenvalues {λ_i} of X'X are equal, STEINM and RIDGM reduce to the same rule, but SRIDG behaves very badly, being highly nonminimax. By making use of arguments in [3], it easily can be shown to be substantially and uniformly improved upon by the James-Stein rule. We also expect SRIDG to perform poorly when the eigenvalues are unequal, which happened in the simulation. Since SRIDG is so complicated in general, and has not been recommended elsewhere, we shall take RIDGM to be the only interesting version of RIDGE considered in the study.

Certainly the most dramatic conclusion of the study is that a version of the ridge method, in the form of RIDGM, is the best method used in the study and dominates a Stein-type method, in the form of STEINM. While we agree with this conclusion, and have made similar statements in our own work [2, 3], we believe different language should be used: language based on an understanding of why this is so. In particular, we see justification for RIDGM arising out of the empirical Bayes literature in combination with the implicit assumptions of this experiment, and not from the ridge-trace graphical technique used for estimation of regression coefficients.

The reader should note that STEINM is not the James-Stein rule (which we shall term JSTEIN) [4], and overshrinks JSTEIN by a factor of [(n − p + 2)/(n − p)]·p/(p − 2), or 12/7 in this case. This sacrifices about 51 percent of the improvement in risk of JSTEIN over the ordinary regression estimator (OREG) for the loss function SPE. We make this statement for the data of these experiments, knowing that the authors claim the contrary at the end of Section 2.4.1. The risk function for SPE loss is a function only of CEN, the noncentrality parameter, which takes on the five values CEN = 10, 50, 100, 200, 500 in this experiment. The approximate value of the risk at these points (actually an upper bound) is 5.00, 5.74, 5.87, 5.93, and 5.972 for JSTEIN, while STEINM has 5.51, 5.87, 5.93, 5.97, and 5.986. Ordinary regression has SPE risk of 6.00 for all values of CEN, and so even JSTEIN would not improve least squares substantially in these experiments. The James-Stein estimator should not be applied with the loss SEB, and is known not to be minimax in this case.

But RIDGM also is better than JSTEIN in these experiments. This is expected for experiments or problems where the regression parameters {α_i} have an exchangeable prior distribution. In the Dempster, Schatzoff, and Wermuth experiments, the random rotation matrix G tends to symmetrize the prior distribution, and this fact combined with the dispersed set of eigenvalues insures that RIDGM will be better. Professor Thisted's comments are much more complete on this issue.

In relation to the preceding paragraph, one of our concerns about the widespread application of ridge regression is that, being a data-based Bayes rule against an exchangeable prior, it is necessary to feel confident in the exchangeability assumption, while in most real regression problems the statistician couldn't be. In certain nonregression problems, for instance in the examples of [2], exchangeability seems plausible a priori, but when this assumption is violated significantly, ridge rules can be much worse than the ordinary regression rule. That is, ridge rules are not minimax in general. We will put these issues in more mathematical form to make the argument clearer.

If we allow ourselves the luxury of thinking of n as large (instead of 20) so that σ² may be assumed known (this could be relaxed at the price of additional mathematical complexity), the Dempster, Schatzoff, and Wermuth estimation problem may be expressed in canonical form as the problem of observing

Y_i ~ N(θ_i, V_i) , i = 1(1)p ,   (1.1)

independently, with V_i = σ²/λ_i and σ² known, and Y_i = α̂_i = (Cβ̂)_i, C being the principal components orthogonal transformation defined in Section 2.2. In this notation θ_i = (Cβ)_i is to be estimated with risk function

R = E Σ_{i=1}^p L_i(θ̂_i − θ_i)² ,   (1.2)

where L_i = 1 for SEB and L_i = 1/V_i for SPE.

Letting A = σ²/k, the independent prior distributions

θ_i ~ N(0, A) , i = 1(1)p ,   (1.3)

lead to the Bayes estimator

E(θ_i | Y_i) = (1 − B_i)Y_i , B_i = V_i/(V_i + A) .   (1.4)

We say that (1.4) is an "empirical Bayes estimator" if A is estimated from the independent marginal distributions (derived by integrating out the θ_i in (1.1) with respect to (1.3))

Y_i² ~ (A + V_i)χ₁² , i = 1(1)p ,   (1.5)

and then the estimate Â is used in (1.4) in place of the unknown value A. Defining Â_i = Y_i² − V_i, we have EÂ_i = A, so these are p independent unbiased statistics which may be used to estimate the unknown value A.
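The canonical model (1.1), (1.3), and (1.5) is easy to simulate. The sketch below (parameter values and names are ours) checks the unbiasedness of the statistics Â_i = Y_i² − V_i numerically, and then applies the shrinkage rule (1.4) with A estimated by a weighted average of the Â_i iterated to a fixed point; this is meant only to illustrate the construction, not to reproduce RIDGM exactly:

```python
import numpy as np

rng = np.random.default_rng(2)
p, A = 6, 4.0                                   # dimension and prior variance A = sigma^2/k
V = np.array([0.25, 0.5, 1.0, 2.0, 4.0, 8.0])   # unequal variances V_i = sigma^2/lambda_i

reps = 20000
theta = rng.normal(0.0, np.sqrt(A), (reps, p))  # prior draws as in (1.3)
Y = rng.normal(theta, np.sqrt(V))               # observations as in (1.1)

A_unb = Y**2 - V          # A_hat_i = Y_i^2 - V_i; each column mean should be near A = 4

# one replication: estimate A by weights proportional to 1/(V_i + A),
# solving the implied fixed point by simple iteration
y, a_i = Y[0], A_unb[0]
A_hat = max(a_i.mean(), 0.0)
for _ in range(200):
    w = 1.0 / (V + A_hat)
    A_hat = max((w / w.sum() * a_i).sum(), 0.0)

B = V / (V + A_hat)       # estimated shrinkage factors as in (1.4)
theta_eb = (1.0 - B) * y  # empirical Bayes estimate of theta
```

Since 0 ≤ 1 − B_i ≤ 1, every component is pulled toward zero, which is the shared mechanism behind all the rules the discussion compares.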

Obvious unbiased estimates of A are of the form

Â = Σ_i Â_i W_i , Σ_i W_i = 1 .   (1.6)

The choice W_i = W_i(A) ∝ [Var(Â_i)]^(−1/2) ∝ 1/(V_i + A) results in the RIDGM estimate if W_i(A) is replaced by W_i(Â) in (1.6). The rule we proposed in [2] for the toxoplasmosis data, which we label EBMLE, derives from W_i(A) ∝ 1/Var(Â_i) ∝ 1/(V_i + A)², with W_i(A) replaced by W_i(Â) in (1.6). This is the optimal linear estimator in (1.6), and is equivalent to the maximum likelihood estimator of A from (1.5). Hence the term EBMLE signifies the empirical Bayes model with maximum likelihood estimation of A.

Both of the estimators of the preceding paragraph take advantage of the exchangeability of (1.3) and therefore are superior to STEINM and JSTEIN for priors of the form (1.3). It is essential to note, however, that if (1.3) were replaced by θ_i ~ N(0, V_i A), then JSTEIN, which in this notation is (1 − (p − 2)/(Σ_j Y_j²/V_j))Y_i, and STEINM would outperform EBMLE and RIDGM. The experiment therefore has proved that the array of experimental regression coefficients is better represented by Var(θ_i) = A than by Var(θ_i) = V_i A.

Carter and Rolph used their own version of RIDGM quite successfully in an empirical Bayes application to spatial analysis [1]. (Actually, both the Carter-Rolph rule and EBMLE [2] were modified slightly so that they reduce to the James-Stein rule when the V_i all are equal.) While we don't know whether EBMLE or RIDGM is better for small or moderate p, EBMLE must be better for large p if (1.3) holds, because of its relation to the maximum likelihood estimator. It would be interesting to compare these rules on the 160 data sets of the experiment.

If the V_i are sufficiently unequal, then for certain configurations of the parameters θ_1, ..., θ_p, both EBMLE and RIDGM can be much worse than OREG for both losses SEB and SPE. (The problem rests precisely with the component having the large V_i, or small λ_i, i.e., the component most ridge papers are concerned about.) JSTEIN, of course, is guaranteed to improve upon OREG for SPE, while it is not minimax for SEB.

It should be clear from the preceding discussion about SRIDG, RIDGM, and EBMLE that there are many ways to estimate the constant k from the data. Although almost every ridge paper published, including this one, has presented a different method, the expression "the ridge estimator" continues to be used. In fact, ridge estimators are a class of Bayes rules against normal priors indexed by k, and the effectiveness of a given rule depends upon how k is estimated. Some published ridge estimators are drastically different from others, and some are disastrously bad. We believe that the important problem now is to find estimators of k which have good risk properties in the class of all possible estimators.

Because most applications of Stein's rule require its generalization to the unequal variances situation, and because ridge regression formally reduces to this situation, we have given a great deal of thought to the problem framed by (1.1)-(1.5) over the past several years. This includes derivation of a wide class of minimax estimators which encompasses most estimators already proven to be minimax by other writers. We also have considered numerous empirical Bayes rules, partly in light of a necessary condition for minimaxity. In the equal variance situation of James and Stein [4], minimax rules with Bayesian properties against exchangeable priors exist, but when orthogonal invariance is sacrificed this happy result disappears. When the variances V_i are sufficiently unequal, our current understanding is as follows. A fundamental tension exists between minimax and empirical Bayes (or ridge) requirements, and no rules appear to exist which are satisfactory from both standpoints. One cannot approximate the Bayes rule against the prior (1.3) without risk of doing worse than the Gauss-Markov estimator. Improvement on the Gauss-Markov estimator in regression situations therefore can be guaranteed only with external information about the prior distribution of the regression coefficients. Unfortunately, such information is not available for many applications.

To summarize, the statistician has a choice of shrinkage rules to consider in applications to real data, and must be careful in exercising this choice because, while the rewards can be great, so also can be the penalties. No choice uniformly dominates the Gauss-Markov estimator for all loss functions. The statistician, therefore, must know enough about his data and about the properties of the alternative estimators available to him to make an intelligent choice of rule. For the Dempster, Schatzoff, and Wermuth experiments, the exchangeable prior, and therefore the use of RIDGM or EBMLE, seems to be justified in aggregate, although the statistician who looks at the 160 individual problems might choose not to use the same rule in all of these situations. Other experiments could give opposite conclusions, so the reader's faith in the results of this experiment ultimately depends on how much he believes the Dempster, Schatzoff, and Wermuth data sets typify real world experience.

REFERENCES

[1] Carter, Grace M., and Rolph, John E., "Empirical Bayes Methods Applied to Estimating Fire Alarm Probabilities," Journal of the American Statistical Association, 69 (December 1974), 880-5.
[2] Efron, Bradley, and Morris, Carl, "Data Analysis Using Stein's Estimator and Its Generalizations," Journal of the American Statistical Association, 70 (June 1975), 311-9.
[3] ———, and Morris, Carl, "Families of Minimax Estimators of the Mean of a Multivariate Normal Distribution," The Annals of Statistics, 4, 1 (1976), 11-21.
[4] James, W., and Stein, Charles, "Estimation with Quadratic Loss," Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, Berkeley: University of California Press, 1961, 361-79.

Comment

ARTHUR E. HOERL*

The authors are to be congratulated for so clearly presenting such a breadth of material. In addition, the theoretical Bayesian development which leads to the RIDGM algorithm is most ingenious and goes a long way in attaining a realistic share of the ridge potential in reducing mean square error.

*Arthur E. Hoerl is Professor, Department of Statistics and Computer Science, University of Delaware, Newark, DE 19711.

In the joint papers [2, 3], where simulation was used to examine ridge algorithms, orthogonal Z matrices of various dimensions and conditionings were defined. Random vectors α with specified norm (designated as α² = α'α) and e were generated, and the resultant least squares and ridge regressions were characterized through a quadratic loss function. In these simulations our original algorithm relied on the minimum quadratic loss estimate k_i = σ²/α_i². This is, as the authors state, an interesting identity, since it depends on α_i² (and σ²) only and not on its corresponding eigenvalue λ_i. However, one can readily satisfy oneself that when using sample estimates the resultant loss is relatively large. Therefore, this motivated us to use a single k, equal to the harmonic mean of the p values k_1, ..., k_p, i.e.,

k_e = ps²/α̂'α̂ .

Subsequent to this simulation publication [2], an iterative algorithm with an empirical stopping rule based on successive estimates of α'α has been published [3]. Since the sample value α̂'α̂ overestimates α'α, k_e tends to underestimate its goal value. Therefore, since the first ridge cut results in an improved estimate, α̂_R0, of α, a value k_e1 based on the squared length of α̂_R0 would perhaps be a better estimate. The successive estimates k_ei can be repeated until the rate of change in k_ei has stabilized. This can be specified under an empirical stopping rule.

For the purposes of multiple comparisons of many estimation techniques, the authors were wise in formulating their comparisons by averaging their results over a range of signal-to-noise. However, for the purposes of evaluating a few algorithms it is perhaps more illustrative to display the relative effectiveness of each at specific values of signal-to-noise, since it is not exclusively the frequentists who would argue that in formulating a simulation strategy the following criterion be used: if it can be shown that one estimator is superior to another for all specified values of α'α over a wide spectrum of conditioning and a range of p, then regardless of the real world distribution of α'α (assuming a finite domain), that estimator is preferable to the other. If the two estimators vacillate in superiority as a function of α'α, some subjective judgment would then be required.

As an example of this approach, the following is presented. Using the Gorman-Toman data [1] with two spectrums of eigenvalue structure used in [2, 3], a series of 200 simulations was performed for each α'α value. The uniform random number generator described in the Appendix served as the basis for the simulation. Unit normals were generated by summing 12 random uniforms on (-.5, .5). Least-squares estimates α̂_i were generated by α̂_i = α_i + ε_i, with ε_i defined by the sum of 12 uniforms divided by √λ_i. Sample values s², of σ² = 1, were defined by the sum of 25 squared unit normals divided by the df 25.

For the 10-factor basic (the same eigenvalue structure as originally published) the results in Section a of the table were obtained. As a guide, the expectation for the F ratio is included. The critical value of F(10, 25) with α = .05 is 2.24. For Section a, the expectation for the least-squares error is 32.58, and the maximum potential is defined to be the minimum possible square error for each data set using a single k. This is defined by

MaxPot = min_{k>0} Σ_i [λ_i α̂_i/(λ_i + k) − α_i]² .

Average Square Error

  α'α      E(F ratio)   Ordinary L.S.   Basic k_e   Iterative k_ei   RIDGM   Maximum potential

a. 10-Factor Basic
  1          1.20         34.12           7.99        6.80            2.18       .88
  5          1.63         31.85           8.83        8.57            5.15      3.47
  10         2.17         30.03           9.66        9.89            7.53      5.52
  15         2.72         31.55          11.06       11.15            9.10      7.01
  25         3.80         30.70          12.75       12.81           12.05      9.51
  50         6.52         34.28          18.23       18.23           17.83     13.82
  100        12.0         31.84          20.76       20.76           20.56     16.53
  200        22.8         30.12          24.69       24.69           24.92     19.55
  500        55.4         32.99          30.33       30.33           30.45     23.84
  1000       110.         30.08          30.41       30.41           30.51     24.67
  2500       273.         32.09          30.64       30.64           30.64     23.99
  10000      1088.        32.77          32.73       32.73           32.73     27.78
  100000     10871.       35.44          35.38       35.38           35.38     28.29

b. 10-Factor Wide
  1          1.20         530            50.1         .99             2.92      .87
  5          1.63         584            64.4        4.93             6.71     3.52
  10         2.17         602            55.3        9.64             7.68     5.87
  15         2.72         544            54.9        14.1            10.1      7.88
  25         3.80         589            57.7        21.4            12.9     10.7
  50         6.52         559            62.8        26.2            22.1     17.6
  100        12.0         572            71.6        32.0            31.6     27.5
  200        22.8         615            97.5        49.8            54.0     43.2
  500        55.4         561            120.        88.7            89.4     62.2
  1000       110.         568            160.        133.            138.     85.3
  2500       273.         555            253.        240.            243.     138.
  10000      1088.        597            448.        448.            449.     218.
  100000     10871.       565            546.        546.            545.     282.

Similarly, for the second simulation, with an assumed eigenstructure (4.25, 1.50, 1.25, 1.00, .778, .6, .4, .2, .02, .002), the 200 trial averages are given in Section b of the table. Here, the expectation for the least-squares error is 563.15.

These results suggest, then, the effectiveness similarity between the authors' RIDGM and the current iterative algorithm. This prompts a need for a broad comparison, over a significant range of p and conditioning, of these ridge algorithms together with other ridge algorithms which are currently under study. This suggested study would also need to be concerned with the broader aspects of estimation, including weighting and prediction. Means for disseminating this information at an early date would help to reduce unnecessary duplication and computational effort. Perhaps even a group of interested participants could pool their resources and talents in defining, formulating, and carrying out the detailed simulations.

With the increasing reliance on simulation in regression it would seem propitious to develop a standardized procedure that would be reasonably acceptable. Here it is suggested that sampling conditioned on an α norm is one approach. This has the major attribute of avoiding the question of what constitutes a typical regression problem.

APPENDIX: UNIFORM RANDOM NUMBERS ON [-.5, .5]

Define an arbitrary irrational number A truncated to 12 significant digits, with a normalized floating-point form

A = (0.zzz···z)10^a .

The generation of the uniform random numbers was obtained by the following steps.

1. Form the normalized floating-point number B = 1/A = (0.zzz···z)10^b with the remainder R = (0.zzz···z)10^r, where r ≤ b − 12 and A × B + R = 1 with a 24-digit product A × B. The mantissas of the respective numbers R and A satisfy the condition 0 < MANT(R) < MANT(A). The assumption here is that the mantissa of the remainder R after 12 significant places is independent of the divisor.
2. Form C = R/A as (0.zzz···z)10^c. Set c to zero and store as the new A.
3. The new C is assumed uniform on [.1, 1.].
4. Subtract 0.1 from C and divide by 0.9 for [0, 1].
5. Uniform numbers on [-.5, .5] are then obtained by subtracting 0.5.

No more than 100,000 consecutive numbers should be used with the same original seed to be well assured of a nonrepeated chain. An unlimited run (with no repeating chain) can be achieved by adding one to A on a fixed count of, say, every 50,000 numbers. In no instance has the algorithm degenerated in over 10^8 uses.
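An integer-arithmetic sketch of the Appendix generator follows. The function name, the normalization details, and the guard against an exact divisor are our own assumptions about the intended arithmetic; it is an illustration of the reciprocal-remainder idea, not the discussant's production code:

```python
def hoerl_uniforms(seed_mantissa, n):
    """Reciprocal-remainder generator sketched from the Appendix: A is held
    as a 12-digit integer mantissa a.  B = 1/A truncated to 12 digits makes
    the remainder of the 24-digit product A*B equal to (10**24 mod a)/10**24,
    and the mantissa of C = R/A (exponent dropped) becomes the next A.
    Returns n values mapped to [-.5, .5] by steps 4 and 5."""
    M = 10**12
    a = seed_mantissa
    out = []
    for _ in range(n):
        r = 10**24 % a or 1        # remainder mantissa (guarded against 0)
        c = (r * M) // a           # mantissa of C = R/A
        while c < M // 10:         # renormalize into [0.1, 1.0)
            c *= 10
        a = c                      # step 2: store C as the new A
        u = c / M                  # step 3: assumed uniform on [.1, 1.)
        out.append((u - 0.1) / 0.9 - 0.5)   # steps 4 and 5
    return out

vals = hoerl_uniforms(314159265358, 1000)   # first 12 digits of pi as the seed
```

All outputs fall in [-.5, .5] by construction; the statistical quality claimed in the Appendix would of course need the kind of empirical checking the discussant describes.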
Dempster, Schatzoff, and Wermuth suggest that another device to evaluate prediction might be a jackknifing type of technique. This has been extensively investigated by Hoerl and Kennard and found deficient. In fact, the basic idea was extended to include what we called duplex (splitting the data into two groups under a variety of criteria) and multiplex (defining all possible subsets and selecting all or some fractionated proportion of all nondegenerate sets). In every instance over a variety of simulations, the technique was found wanting.

REFERENCES

[1] Gorman, John W., and Toman, R.J., "Selection of Variables for Fitting Equations to Data," Technometrics, 8 (1966), 27-51.
[2] Hoerl, Arthur E., Kennard, Robert W., and Baldwin, Kent F., "Ridge Regression: Some Simulations," Communications in Statistics, 4 (1975), 105-23.
[3] ———, and Kennard, Robert W., "Ridge Regression: Iterative Estimation of the Biasing Parameter," Communications in Statistics, A5 (1976), 77-88.

Comment

DAVID M. ALLEN*

To establish some points for reference, I will begin by expressing some of my own thoughts regarding regression. We have a random vector y and a full-rank, nonstochastic matrix X. The matrix X is said to be ill-conditioned if there exists at least one vector C such that ||C|| = 1 and ||XC|| is "small." We denote E(y) by μ, and we suppose there exists a vector β such that μ = Xβ. I will make some harsh statements about β and then illustrate them using a simple example. The example depends on X = (x_1|x_2|x_3), X* = (x_1|x_2|x_3*), and μ, where the vectors are given in the table.

*David M. Allen is Associate Professor, Department of Statistics, University of Kentucky, Lexington, KY 40506.

What is the interpretation of β? Let μ_i and x_ij denote the ith elements of μ and x_j. Since μ_i = Σ_j x_ij β_j for all i, it is often appealing to suppress the subscript i and regard

μ as a linear function of continuous variables x_1, x_2, and x_3. A common interpretation is: β_j is the change in μ accompanying a unit change in x_j with all other x's constant. The problem with this interpretation is that an ill-conditioned X precludes all other x's constant.

Vectors Used in Examples

    x_1        x_2        x_3       x_3*        μ
   87.36     -87.36      -.07       .07       -84.
  -43.17     -47.97    -65.08    -65.12       -69.
  -51.68     -26.72    -55.99    -56.01       -68.
  -54.32       -.56    -39.20    -39.20       -56.
  -28.74     -59.46    -62.97    -63.03       -54.
  -51.14      25.66    -18.21    -18.19       -38.
   50.31     -12.09     27.24     27.36       -33.
  -42.19      47.09      3.48      3.52       -19.
   -8.34     -56.34    -46.16    -46.24       -18.
   18.78      34.14     37.75     37.85        -6.
  -27.52      58.88     22.37     22.43        -4.
   -7.18      56.18     34.96     35.04         2.
   18.08     -33.76    -11.15    -11.25        44.
   50.57      13.13     45.56     45.44       137.
   89.18      89.18    127.47    127.33       266.

For the X of the example the relationship

|.5x_1 + .5x_2 − .7x_3| < .05   (1)

is always satisfied. This requires x_2 or x_3 to change if x_1 changes as much as .2. If x_1 increases by one and the sum of absolute changes in x_2 and x_3 is kept as small as possible subject to (1), then x_2 does not change and x_3 increases by .5714. Thus the change in μ accompanying a unit change in x_1 with minimum changes in x_2 and x_3 is β_1 + .5714β_3 and not β_1. If X is ill-conditioned, then, β has little or no interpretation.

The vector μ is unique. Its elements are expected values of observable random variables and are interpretable. If X has not been correctly specified (who really knows X?), then there may not exist a β such that μ = Xβ. If X is correctly specified but ill-conditioned, then β is fickle with regard to small perturbations of the elements of X. In the example, both X and X* are correct specifications in that their columns span the same vector space and μ is in that space. The maximum

The estimation of a linear combination of the elements of β where the variance of the least-squares estimator is greater than σ² will be termed an extrapolation. That is, the precision of the estimator is less than the precision of a direct observation on that linear combination (if such observation were possible). If X is ill-conditioned, then estimation of an individual element of β is often an extreme form of extrapolation. For the X of our example the respective variances of estimators of β_1, β_2, and β_3 are 18.22σ², 18.22σ², and 35.716σ².

Because of the high dimensionality of typical regression problems it is impossible to conduct a comprehensive simulation study. However, the authors have come closer than any other study I have seen. I believe interpretation of regression is most difficult when X is highly ill-conditioned, and thus regard Experiment 1 as being more valuable than Experiment 2. In view of my harsh statements about β, I would rather have α (A.8) than β as a factor in the study design. This would systematically generate different β in the appropriate vector space. For similar reasons, I place more credence on the analysis by SPE (A.11) than on the analysis by SEB (A.10). I am impressed by the potential of REGF methods and look forward to studying them further.

The authors mention conflicts among different methods. If we do not extrapolate, these apparent conflicts may be of less consequence than they indicate. Evaluation of SPE cannot involve extrapolation, while evaluation of SEB often is extrapolation. This statement is supported by the fact that the coefficient of variation of SPE is less than that of SEB.

The authors caution against expecting any favorite procedure to be automatically applicable. I emphatically agree. In my example, the variance of the least-squares estimator of (-51.14, 25.66, -18.21)β is .097554σ², which is quite good for 15 observations. However, for (-50.74, 26.06, -18.77)β the variance is 46.62σ². Except for unrealistically large values of σ², ridge is worse than least squares in mean square error for both cases. We should recognize the existence of situations where no estimation, by any method, is warranted. The second

Downloaded by [University of Michigan] at 08:22 03 October 2017 absolute difference between corresponding elements of X and X* is .14, yet @ = (-41950/49, -41950/49, 1200)' linear combination just mentioned is such a case. The and @*= (42050/49, 42050/49, - 1200)' are drastically data simply does not contain much information about different. that linear combination.
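Allen's notion of extrapolation can be checked numerically: the variance of the least-squares estimate of c'β is σ² · c'(X'X)⁻¹c, so a linear combination is an extrapolation exactly when this factor exceeds one. The sketch below uses a hypothetical ill-conditioned design (not Allen's actual X; the near-collinearity merely mimics his relation (1)):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 15
# Hypothetical ill-conditioned design: x3 is nearly a linear
# combination of x1 and x2, echoing |.5x1 + .5x2 - .7x3| "small".
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = (0.5 * x1 + 0.5 * x2) / 0.7 + 0.001 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

XtX_inv = np.linalg.inv(X.T @ X)

def varfactor(c):
    """Variance of the LS estimate of c'beta, in units of sigma^2."""
    c = np.asarray(c, float)
    return float(c @ XtX_inv @ c)

# Estimating an individual coefficient is an extreme extrapolation here:
print(varfactor([1.0, 0.0, 0.0]))          # far larger than 1
# Fitted values are well estimated on average: the variance factors of
# the n rows of X sum to p (the trace of the hat matrix), here 3.
print(sum(varfactor(row) for row in X))
```

The contrast between the two printed numbers is exactly Allen's point: the same data that pin down every fitted value can carry almost no information about an individual coefficient.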

Comment

A. F. M. SMITH*

Interest in alternatives to ordinary least-squares procedures for the analysis of the normal linear model is now widespread among both Bayesian and frequentist statisticians, and the authors are to be congratulated on their timely and stimulating contribution.

* A.F.M. Smith is Lecturer, Department of Statistics and Computer Science, University College London, Gower Street, London WC1E 6BT, England.

I shall confine my comments to just two aspects of this wide-ranging study: the first concerns the relationship between continuous and discrete shrinking methods; the second relates to the authors' formulation of the prediction problem.

A link between continuous and discrete shrinking methods is implicit in a result established by Leamer and Chamberlain [4, Theorem 1], which shows that the RIDGE estimator (given here by (2.1) with Q = kI) can be written as a weighted average of the 2^p least-squares estimators corresponding to all possible ways of constraining subsets of the p regression coefficients to be zero. The weighted averages forming REGF and DRGF are not quite of this form, since only (p choose r) such least-squares estimators are combined (for some chosen r), but it would, perhaps, be worthwhile exploring the connection further. In particular, an examination of the weights given by Leamer and Chamberlain [4, Eq. (4)] and those discussed in Appendix B of this paper reveals great similarity, especially for DRGF. Indeed, apart from the fact that Leamer and Chamberlain assume the variance σ² to be known, it appears that DRGF could be derived by expressing RIDGE as a weighted average and then forming a renormed weighted average using only those terms corresponding to r nonzero-constrained components (see [4, Eqs. (3) and (4)] and Appendix B, (B.2), (B.7), (B.8), (B.9) with a = r). A closer study of this connection might well give some insight into the comparative performances of these estimators. (A similar analysis could be made of the relationship between PRI and an alternative weighted-average representation of RIDGE given in [4, Theorem 2].)

I find the comparison of estimators using prediction mean square error rather difficult to interpret. Indeed, it seems to me that this part of the study is both misleading and misguided in so far as it identifies the prediction problem with that of predicting a set of future values at precisely the same design points as have been used in estimating the regression coefficients. This obscures many of the features of interest that are present in the more general prediction problem and leads the authors to conclude that there is less scope for improvement over least squares in the prediction context. Brown [1] has shown that this is not necessarily true, and his arguments have been extended to a more general shrinkage estimation setting by Goldstein [3].

The point at issue can be summarized as follows. Let us assume that b* is an estimator of β from (A.1), and that we wish to predict m future values of Y corresponding to a design matrix X0 (m × p), the latter scaled in such a way that X0 b* is the desired predictor. (Note that the authors consider only the special case m = n, X0 = X.) It can be shown that if b* is taken to be the RIDGE estimator, then the derivative, with respect to k, of the predictive mean square error, evaluated at k = 0, is equal to

    -2σ² Σi (Bii/λi²) ,

where Bii = (C X0'X0 C')ii, and C and the λi are defined by (A.5). In this more general setting, it is the relationship between the Bii and the λi which determines the scope for saving in predictive mean square error. The case considered by the authors has the special form Bii = λi, and gives no insight into the greater scope for improvement which occurs when the larger values of Bii correspond to small values of λi, i.e., when the directions in which predictions are required turn out to be those which are poorly estimated on the basis of the original design matrix.

Finally, I should like to draw attention to a splendid example of RIDGE in action, that of "Election Night Forecasting" in the U.K. [2], where a version of RIDGE triumphs over all-comers, including OREG (k = 0) and ZERO (k = ∞).

REFERENCES

[1] Brown, P.J., "Predicting by Ridge Regression," unpublished report, Department of Mathematics, Imperial College, London, 1974.
[2] ------, and Payne, C.D., "Election Night Forecasting (with discussion)," Journal of the Royal Statistical Society, Ser. A, 138 (1975), 463-98.
[3] Goldstein, M., "Aspects of Linear Statistical Inference," unpublished Ph.D. thesis, Mathematical Institute, University of Oxford, 1974.
[4] Leamer, E.E., and Chamberlain, G., "A Bayesian Interpretation of Pretesting," Journal of the Royal Statistical Society, Ser. B, 38, 85-94.
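Smith's derivative formula can be checked numerically. The sketch below (an arbitrary simulated design and future-design matrix, not data from the paper; dimensions are illustrative assumptions) compares a finite-difference derivative of the ridge predictive mean square error at k = 0 with -2σ² Σ Bii/λi²:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, p, sigma2 = 20, 8, 3, 1.0
X = rng.normal(size=(n, p))        # estimation design
X0 = rng.normal(size=(m, p))       # future design, X0 != X
beta = rng.normal(size=p)

XtX = X.T @ X
lam, V = np.linalg.eigh(XtX)       # XtX = V diag(lam) V', so C = V'
B = V.T @ (X0.T @ X0) @ V          # B_ii = (C X0'X0 C')_ii

def pmse(k):
    """Predictive MSE E||X0 b*(k) - X0 beta||^2 for ridge b*(k)."""
    S = np.linalg.inv(XtX + k * np.eye(p))
    var = sigma2 * np.trace(X0.T @ X0 @ S @ XtX @ S)   # trace(X0'X0 Cov)/1
    bias = k * S @ beta                                 # bias of ridge
    return var + bias @ (X0.T @ X0) @ bias

h = 1e-6
numeric = (pmse(h) - pmse(-h)) / (2 * h)                # central difference
analytic = -2 * sigma2 * np.sum(np.diag(B) / lam**2)
print(numeric, analytic)
```

The negative derivative at k = 0 is the formal statement that a little shrinkage always reduces predictive mean square error, with the size of the gain governed by how the Bii line up with the λi.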

Comment

CHRISTOPHER BINGHAM and KINLEY LARNTZ*

The methods of estimation compared in the paper fall into two classes: those which are best understood in terms of the coordinates defined by the given independent variables, and those which are best understood in terms of canonical variables defined by eigenvectors of the cross product or correlation matrix of the independent variables.

* Christopher Bingham is Associate Professor and Kinley Larntz is Assistant Professor, both at the Department of Applied Statistics, University of Minnesota, St. Paul, MN 55108. The research of the second author was facilitated by a single quarter leave granted by the Regents of the University of Minnesota.

What seems to be important in both these cases is the orientation of the true coefficient vector relative to the relevant coordinate system.

In the case of best-subset methods or Bayesian mixtures of them, it seems to us that the relevant coordinates are in terms of the orthogonal basis in variable space defined by the independent variables as given, either standardized or not. The coefficient βj can be considered the length of the projection of β on the jth basis vector. Coefficient vectors oriented near the space spanned by a small set of the basis vectors will be well approximated by subset regression models, and methods assuming that they are so oriented should be an improvement over methods, such as least squares, that do not. In effect, subset methods shrink the coefficient vector in the direction of planes determined by a subset of the basis vectors. If that is appropriate, they do well.

The other class of estimators, ridge methods and their generalizations, can be defined as

    B̂ = (X'X + Q)⁻¹X'Y ,

where Q is a positive-semidefinite matrix, or as a limit of such estimators. The independent variables matrix X is almost always assumed to be corrected for the mean, and is usually assumed to be standardized so that X'X is a correlation matrix. The properties of such an estimator are best understood in terms of the eigenvectors of X'X relative to Q. It is well known that X'X and Q can be simultaneously diagonalized. That is, there is a nonsingular matrix A such that

    A'X'XA = Λ = diag[λ1, λ2, ..., λp]   and   A'QA = K = diag[k1, k2, ..., kp] .

The rows of A⁻¹ are proportional to the eigenvectors of X'X relative to Q. For ordinary ridge regression, Q = K = kI, and A is orthogonal. For generalized ridge regression, A is orthogonal and Q is the matrix having the same eigenvectors as X'X and eigenvalues k1, ..., kp. For Marquardt's generalized inverse, Q can be taken as a limit of matrices of this form, with the ki corresponding to the smallest eigenvalues λj of X'X approaching infinity and those corresponding to larger eigenvalues being zero.

This diagonalization induces transformations of the parameters and independent variables:

    β → A⁻¹β = α   or   β = Aα ,      X → X̃ = XA .

We can express the estimator B̂ as

    B̂ = [(A')⁻¹ΛA⁻¹ + (A')⁻¹KA⁻¹]⁻¹X'Y = A(Λ + K)⁻¹A'X'Y = A(Λ + K)⁻¹X̃'Y = Aα̂ ,
    where α̂ = A⁻¹B̂ = (Λ + K)⁻¹X̃'Y .

Thus β and B̂ can be expressed in terms of α and α̂, which represent the coordinates relative to the rows of A⁻¹ (the columns of A, if A is orthogonal). These characterize the orientation of β and β̂ relative to these eigenvectors.

From the point of view of the paper, as well as of many other authors, the relevant property of an estimator is a generalized mean square error criterion

    τ² = E[(B̂ - β)'W(B̂ - β)] = E[(α̂ - α)'W̃(α̂ - α)] ,

where W = W' and W̃ = A'WA. In the usual cases W̃ = diag[w1, w2, ..., wp], i.e., W is also diagonalized by A. Both SEB (W = W̃ = I) and SPE (W = X'X, W̃ = Λ) are of this form. Then

    τ² = tr(W̃ Cov[α̂]) + tr W̃(E[α̂ - α]E[α̂ - α]') .

But Cov[α̂] = σ²(Λ + K)⁻¹Λ(Λ + K)⁻¹ and

    E[α̂ - α] = ((Λ + K)⁻¹Λ - I)α = -(Λ + K)⁻¹Kα .

Thus

    τ² = σ² Σi wi λi/(λi + ki)² + Σi wi αi² ki²/(λi + ki)² .

It is easy to check that τ² is minimized for any choice of wi > 0 [1] by ki = σ²/αi², in accordance with Hoerl and Kennard's result for W = I [3]. The minimized value, which in some sense represents the best that one could do using any estimator in this class of estimators, is

    τmin² = Σi wi(σ²λi + σ⁴/αi²)/(λi + σ²/αi²)² = Σi wi αi²/(1 + λi αi²/σ²) .

For least squares (K = 0), we have

    τLS² = Σi wi σ²/λi .

Thus in each canonical direction the ratio of the minimized risk to the least-squares risk is (λi αi²/σ²)/(1 + λi αi²/σ²). For fixed αi/σ the improvement is greatest for smallest λi. This provides much of the motivation for Marquardt's generalized inverse estimator. However, for fixed λi, no matter how small, this ratio can be made as close to one as desired if the corresponding canonical coefficient αi is large enough. In summary, we may conclude that the closer the coefficient vector is to the space spanned by the eigenvectors corresponding to the larger eigenvalues λi, the more improvement ought to be possible over least squares.

The above considerations indicate to us that any experiment designed to explore the merits of various adaptive ridge estimators, i.e., estimators with K chosen depending on the data (usually on β̂LS and s), should have, as one of the primary factors, variation of β relative to the eigenvectors of X'X, i.e., variation of α. The second important factor is, of course, the pattern of eigenvalues of X'X. The direction of the eigenvectors is meaningful only with respect to their relationship with β. This is why variation of α should be a factor rather than the eigenvectors.
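The per-direction trade-off in the last displays can be seen in a few lines of code. This sketch (illustrative numbers only) evaluates the canonical-direction risk wi[σ²λi + αi²ki²]/(λi + ki)² over a grid of k and confirms that the minimum falls at ki = σ²/αi² with minimized value αi²/(1 + λiαi²/σ²):

```python
import numpy as np

def direction_mse(lam, k, alpha, sigma2=1.0, w=1.0):
    """Risk of generalized ridge in one canonical direction."""
    return w * (sigma2 * lam + alpha**2 * k**2) / (lam + k) ** 2

lam, alpha, sigma2 = 0.05, 4.0, 1.0     # small eigenvalue, sizable alpha_i
k_opt = sigma2 / alpha**2               # Hoerl-Kennard optimal k_i = 1/16

ks = np.linspace(0.0, 1.0, 10001)
mses = direction_mse(lam, ks, alpha, sigma2)
print(ks[np.argmin(mses)], k_opt)       # grid minimum sits at k_opt

# Minimized value equals alpha^2 / (1 + lam * alpha^2 / sigma^2) ...
print(direction_mse(lam, k_opt, alpha, sigma2),
      alpha**2 / (1 + lam * alpha**2 / sigma2))
# ... compared with the least-squares risk sigma^2 / lam in this direction:
print(sigma2 / lam)
```

With λ = 0.05 and α = 4 the best ridge risk is well under half the least-squares risk, illustrating why small-λ directions with small α are where ridge pays off.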

A. Orientation of Simulated Alpha’s

A1. EIG = 64.0 16.0 4.0 2.0 1.0 0.5, COR = no, MCL = no; BETA = 32.0 16.0 8.0 8.0 8.0 8.0
A2. EIG = 64.0 16.0 4.0 2.0 1.0 0.5, COR = no, MCL = yes; BETA = 1.0 7.0 1.0 0.0 0.0 0.0

A3. EIG = 30.0 30.0 30.0 20.0 20.0 20.0, COR = yes, MCL = no; BETA = 1.0 1.0 1.0 1.0 1.0 1.0
A4. EIG = 30.0 30.0 30.0 20.0 20.0 20.0, COR = yes, MCL = yes; BETA = 32.0 16.0 8.0 0.0 0.0 0.0

NOTE: EIG denotes eigenvalues of the X'X matrix, and BETA denotes the "true" regression coefficients.

One difficulty in evaluating the results of the present experiment is that the eigenvectors, or more importantly, the α's, are not given. The orientation is left to chance, without much indication of how the construction of patterned correlation matrices constrains α. Even when the eigenvector matrix of the nonstandardized form of X'X is chosen randomly (uniformly over the orthogonal group), the eigenvectors of R, the correlation matrix, are not random. Still less random are the eigenvectors of R after it is massaged to have high collinearity and/or multicollinearity. To get a clearer picture of what might have happened in the experiments described in the paper, we conducted a small simulation study to investigate the distribution of α, for fixed β, when the eigenvectors of X'X were chosen randomly and X'X was standardized and massaged exactly as described in the paper. There is considerable difficulty in presenting the results because the orientation of α is best expressed as a point on the unit sphere in 6-space. One simplification followed from the observation that we could always take α as being in the orthant defined by αj ≥ 0, j = 1, ..., 6. To reduce the dimensionality, we looked at the orientation of α in some of the twenty three-dimensional subspaces defined by sets of three coordinate axes. An effective way of displaying such three-dimensional orientations is by means of an equiareal plot of a hemisphere (in this case, an octant) on a disk (quadrant).

Eigenvalue Patterns for SEB Simulation Study

Pattern no.       1      2      3      4      5      6      7       8       9       10

p = 2:
  λ1             1.0    2.0    5.0   10.0   25.0   50.0  100.0   200.0   500.0   1000.0
  λ2             1.0    1.0    1.0    1.0    1.0    1.0    1.0     1.0     1.0      1.0

p = 6:
  λ1             1.0   1.23   2.05   2.00   3.40   4.63   1.96    1.994   3.55     3.14
  λ2             1.0   1.20   1.20   1.15   1.88   .733   1.22    1.010   1.50     1.66
  λ3             1.0   1.11   .958   1.02   .436   .333   1.06    1.005   .583     .877
  λ4             1.0   .847   .921   .962   .159   .164   .936    .995    .228     .257
  λ5             1.0   .827   .790   .816   .0738  .0927  .823    .990    .126     .0615
  λ6             1.0   .778   .0803  .0484  .0502  .0420  .00716  .00600  .00733   .000850
  Condition no.  1.0   1.59   25.5   41.3   67.7   110    274     332     484      3694
  EIG             -    30     30     30     64     64     30       -      64       64
  COL             -    no     yes    no     no     yes    yes      -      no       yes
  MCL             -    no     no     yes    no     no     yes      -      yes      yes
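As a check on the transcription, the condition numbers in the table's bottom numeric row can be recomputed from the p = 6 spectra; the sketch below (spectra copied from the table above) reproduces the left-to-right ordering by λmax/λmin:

```python
import numpy as np

# p = 6 spectra from the table; rows are lambda_1..lambda_6,
# columns are patterns 1..10.
lams = np.array([
    [1.0, 1.23, 2.05, 2.00, 3.40, 4.63, 1.96, 1.994, 3.55, 3.14],
    [1.0, 1.20, 1.20, 1.15, 1.88, 0.733, 1.22, 1.010, 1.50, 1.66],
    [1.0, 1.11, 0.958, 1.02, 0.436, 0.333, 1.06, 1.005, 0.583, 0.877],
    [1.0, 0.847, 0.921, 0.962, 0.159, 0.164, 0.936, 0.995, 0.228, 0.257],
    [1.0, 0.827, 0.790, 0.816, 0.0738, 0.0927, 0.823, 0.990, 0.126, 0.0615],
    [1.0, 0.778, 0.0803, 0.0484, 0.0502, 0.0420, 0.00716, 0.00600, 0.00733, 0.000850],
])

cond = lams.max(axis=0) / lams.min(axis=0)
# Approximately 1.0, 1.6, 25.5, 41.3, 67.7, 110, 274, 332, 484, 3694,
# strictly increasing left to right as the text states.
print(cond)
```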

Figure A shows four typical plots. Each circle is actually four plots, one for each of the four three-dimensional subspaces containing the eigenvectors a1 and a6 corresponding to the largest eigenvalue and the smallest eigenvalue of the correlation matrix, respectively. Thus, starting at the upper right and proceeding clockwise, the four quadrants display the orientations of the simulated α's in the spaces spanned by a1, a2, and a6; a1, a3, and a6; a1, a4, and a6; and a1, a5, and a6, respectively, where the aj's are the eigenvectors corresponding to λj, the jth eigenvalue in decreasing order of magnitude.

Figure A1 corresponds to a situation in which there is no introduced collinearity or multicollinearity. In this case the distribution of the αj's is exchangeable, although probably not completely random (isotropic), and hence we would not expect to see any marked pattern. In fact the display shows a fairly uniform distribution of directions. In the other three cases, with either collinearity, multicollinearity, or both, the distributions are clearly not random. Figures A2 and A4 display a marked tendency for α1 to be large relative to α6, exactly the situation for which we would expect ridge methods to be an improvement over least squares. The degree of consistency displayed by Figure A4 is, indeed, quite remarkable. The "inner" αj's are, however, quite random. Figure A3 shows a case for which both α1 and α6 tend to be bounded away from zero, and to be rather highly correlated. Again the "inner" αj's are relatively random.

For emphasis, we would like to repeat that these α's were chosen as described in the paper, using 100 different random sets of eigenvectors. For the data sets discussed in the paper corresponding to Figure A4, for which α's are not given, it is clear we can say quite a lot about the orientation of α relative to a1 and a6, even though the original eigenvectors were chosen randomly.

[Figure B. Ratio of SEB for ridge methods to SEB for least squares; number of regressors p = 2. Semi-log plots of the ratio against eigenvalue pattern number (1-10) for the ridge-type methods compared below. Panels: B1, α' = (10, 0); B2, α' = (√50, √50); B3, α' = (0, 10).]

To study the effect of varying the α's more explicitly than was done in the paper, we conducted another small simulation study. We restricted our investigation to a comparison of least squares with a few ridge-type estimators. No best-subset methods or their relatives were included. The particular procedures selected were (with the mnemonics used in our plots):

B: RIDGM in the paper, ridge with empirical Bayes k;
C: CRIDG in the paper, ridge with shrinkage to the F = 1 contour;
I: PRIF in the paper, adaptive form of Marquardt's generalized inverse [5];
L: ridge regression with k estimated as ps²/(β̂LS'β̂LS), as suggested by Hoerl, Kennard, and Baldwin [4];
O: generalized ridge regression as proposed by Hoerl and Kennard [3] with K = diag[k1, ..., kp] computed using a method of Hemmerle [2];
M: OPT in the paper, generalized ridge with the correct (unrealizable) optimal ki's; yields a lower bound for ridge-type estimators.

Computations were carried out for p = 2 and p = 6 with a variety of canonical regression coefficients α and eigenvalues Λ. The α's were standardized so that α'α = 100. The variance σ² was assumed to be 1. For each combination (α, Λ), 1000 regressions with n = 20 were simulated and the average SEB calculated. The eigenvalue patterns for p = 2 and p = 6 are given in the table, with the patterns ordered by the ratio of the largest eigenvalue to the smallest eigenvalue (i.e., the condition number of the correlation matrix). For p = 6, eight of the eigenvalue combinations correspond to correlation matrices constructed according to the 2³ combinations of the factors EIG, COL, and MCL in Experiment 2 of the paper.

[Figure C. Ratio of SEB for ridge methods to SEB for least squares; number of regressors p = 6. Panels: C1, α' = (√99, 0.5, 0.5, 0.5, 0.5, 0); C2, α' = (3, √2.5, √2.5, √2.5, √2.5, 9); C3, α' = (0, 0.5, 0.5, 0.5, 0.5, √99).]
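The flavor of this Monte Carlo can be reproduced in a few lines. The sketch below works directly in canonical coordinates (α̂LS ~ N(α, σ² Λ⁻¹)) and compares least squares with ridge using the Hoerl-Kennard-Baldwin k; for simplicity σ² is treated as known, which the study above did not assume, and the spectrum and α are illustrative choices from the favorable case where α concentrates on the largest-eigenvalue direction:

```python
import numpy as np

rng = np.random.default_rng(3)
p, sigma2 = 6, 1.0

# Hypothetical ill-conditioned spectrum; alpha on the largest-eigenvalue
# direction with alpha'alpha = 100, as in the standardization above.
lam = np.array([64.0, 16.0, 4.0, 2.0, 1.0, 0.5])
alpha = np.zeros(p)
alpha[0] = 10.0

def hkb_ridge_seb_ratio(reps=2000):
    """Ratio of average SEB for HKB ridge to average SEB for LS."""
    seb_ls = seb_ridge = 0.0
    for _ in range(reps):
        a_ls = alpha + rng.normal(size=p) * np.sqrt(sigma2 / lam)
        k = p * sigma2 / (a_ls @ a_ls)      # Hoerl-Kennard-Baldwin k
        a_r = lam / (lam + k) * a_ls        # per-direction ridge shrinkage
        seb_ls += np.sum((a_ls - alpha) ** 2)
        seb_ridge += np.sum((a_r - alpha) ** 2)
    return seb_ridge / seb_ls

print(hkb_ridge_seb_ratio())   # below 1: ridge beats least squares here
```

Moving mass of α onto the small-eigenvalue directions in this sketch pushes the ratio back toward (and past) one, which is the basic point of Figures B and C.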

Different randomly chosen rotations were used in constructing each of these matrices.

The results are too lengthy to give in full here. The general flavor is given in Figures B and C. Figure B (p = 2) and Figure C (p = 6) are semi-log plots of the ratio of the average SEB for each method to the average SEB for least squares. A point above the ratio = 1 line indicates the superiority of least squares. The abscissa is simply the eigenvalue pattern. Thus, the condition number of the correlation matrix increases from left to right. For both p = 2 and p = 6 there are clear gains for the ridge methods relative to least squares when α1 (the canonical regression coefficient associated with the largest eigenvalue) is large. However, the gains become losses when there are substantial αi's associated with the smaller eigenvalues, provided the condition number of the matrix is not too large. For extreme eigenvalue patterns, there appear to be guaranteed gains from the ridge methods, irrespective of the α's, at least within the range of α patterns we studied. The basic point is that for moderately ill-conditioned matrices (say corresponding to the degree of collinearity and multicollinearity studied in the paper) it is not at all clear that ridge methods offer a clear-cut improvement over least squares except for particular orientations of β relative to the eigenvectors of X'X.

Looking again more closely at Figure A (as well as other similar plots not given here), we see that there were cases where α1 was in fact the dominant component, even though no explicit decision was made to make it so. This is a result of the choice of particular levels of factors BETA, MCL, and COR. Perhaps a more suitable procedure would have been to choose the orientations of the α's randomly, or even better, to choose combinations (α, Λ) in a systematic experimental design.

The Monte Carlo computations just discussed were performed using FORTRAN programs on a CDC 6400 computer. The random normal deviates used were generated using a library routine NORMAL based on a method proposed by Marsaglia and Bray [6]. The uniform random numbers used by NORMAL were produced by a multiplicative congruential generator using modulus 2^48 and multiplier 5^13. Because the simulations were intended to be illustrative and preliminary, no attempt has been made to determine standard errors for the ratios of SEB in Figures B and C. All the curves in a plot were based on the same sets of randomly generated least-squares estimates α̂LS. However, different plots were based on independent samples of random deviates.

REFERENCES

[1] Goldstein, M., and Smith, A.F.M., "Ridge-Type Estimators for Regression Analysis," Journal of the Royal Statistical Society, Ser. B, 36 (1974), 284-91.
[2] Hemmerle, W.J., "An Explicit Solution for Generalized Ridge Regression," Technometrics, 17 (1975), 309-14.
[3] Hoerl, A., and Kennard, R., "Ridge Regression: Biased Estimation for Nonorthogonal Problems," Technometrics, 12 (1970), 55-67.
[4] ------, Kennard, R., and Baldwin, K., "Ridge Regression: Some Simulations," Communications in Statistics, 4 (1975), 105-23.
[5] Marquardt, D., "Generalized Inverses, Ridge Regression, Biased Linear Estimation, and Nonlinear Estimation," Technometrics, 12 (1970), 591-612.
[6] Marsaglia, G., and Bray, T.A., "A Convenient Method for Generating Normal Variables," SIAM Review, 6 (1964), 260-4.

Comment

RONALD A. THISTED*

Dempster, Schatzoff, and Wermuth have taken on the task of determining how best to achieve in practice the gains over least squares that are guaranteed to us in theory when the regression coefficients number three or more. They have given us a catalog of rules and, within the limitations of their study, have given us much insight into the behavior that these rules display and their performance relative to one another. Their conclusions are striking, particularly their assertion that ridge-regression rules are markedly superior to Stein-type estimators, and it is primarily toward this result that I shall direct my attention. Several remarks are in order which perhaps will clarify the scope and generality of their findings. These comments are primarily concerned with the relative merits of RIDGE estimators and Stein-type estimators.

The study attempts to separate the effects of collinearity, multicollinearity, and eigenvalue pattern by including separate factors for each of them in the experiments. However, both SPE and SEB for OREG, RIDGM, and STEINM depend upon X'X only through its eigenvalues. Furthermore, higher levels of COL (and MCL in Experiment 2) simply represent additional broadening of the eigenvalue spectrum.

* Ronald A. Thisted is Assistant Professor, Department of Statistics, University of Chicago, Chicago, IL 60637. This work was funded in part by a National Science Foundation graduate fellowship.

Consequently, it is not surprising to see significant main effects for each of these factors and nonsignificant interactions in Table 3. It is important to recognize that these factors are not different effects but one and the same: the effect of highly unequal eigenvalues. As the reduction to principal components shows, multicollinearity and collinearity affect SEB and SPE only to the extent that they spread out the eigenvalues of X'X.

After discussing the optimality of RIDGE and STEIN rules, each of which is Bayes for a particular prior distribution on α and any quadratic loss, the authors proceed to require a "rule for determining k from the sample data." Of course this simply won't do for the subjectivist Bayesian, for whom k represents a judgment on the precisions of components of α. Further, it is important to note that we may forfeit the previously mentioned optimality if k is a function of the data.

It is curious that STEINM does so badly with respect to SPE. It is well known [2] that the problem of estimating regression coefficients with SPE loss is equivalent to estimating the mean of a multivariate normal distribution with equal variances and loss function L(δ, β) = ||δ - β||²/σ². Furthermore, Efron and Morris [1] show that in the latter problem the James-Stein positive-part rule cannot be substantially improved upon in very much of the parameter space. From the fact that STEINM is so badly beaten in SPE by RIDGM we must conclude that: STEINM is not equivalent to the James-Stein rule; the parameters chosen in the study are restricted to regions of the parameter space more favorable to RIDGE rules than to STEIN-type rules; or that in this particular trip to Monte Carlo the house has taken its cut, and that the results we see are not representative. This observation brings us to our final point.

Bayes rules are not optimal only when the statistician has quantified his prior beliefs about α by specifying a prior distribution for it. They are also optimal, for instance, when the parameters in each experiment actually are generated by some random mechanism, the distributional properties of which are known to the statistician. In the latter case it makes sense to speak of a "correct" prior distribution for α. The authors are correct in their remark (p. 80) that:

    To assert that RIDGE is better [than STEIN] in practice is equivalent to asserting that its prior assumptions are more nearly correct over the range of the statistician's experience. Note especially that if the RIDGE prior is correct then the RIDGE estimator is optimum for any quadratic loss function, including SEB and SPE.

Consider, then, the random mechanism by which α is selected in this study. First of all, β is fixed, then a random orthonormal matrix G is generated. For any fixed vector u, Gu is uniformly distributed on the p-sphere of radius ||u||. The matrix G corresponds to C' of Appendix A. Consequently, α = G'β has a uniform distribution on the p-sphere of radius ||β||. It is easy to see that, since α, -α, and (-α1, α2, ..., αp) have the same distribution,

    E[αi] = 0 ,   E[αiαj] = 0 (i ≠ j) ,   E[αi²] = ||β||²/p .

Hence, the method used to generate α has mean zero and equal component variances. Thus the prior variances of the αi, by which we mean the variances of the random mechanism generating the αi in this study, are equal and not proportional to the inverse eigenvalues.

As the authors point out in the quoted passage, this setup is highly favorable to RIDGE, and it ought not to be surprising that RIDGM beats STEINM even on SPE, the loss function most favorable to STEIN-type estimators. Furthermore, the more disparate the eigenvalues, the worse STEIN-type rules will do in this experiment compared to RIDGE rules, since the STEIN prior is less like the "correct" prior. It is easy to predict on these grounds that STEINM will improve its performance in Experiment 2, since there are two vectors of eigenvalues added to those of the first experiment, each of which is less extreme than one of the vectors from the first experiment, so that the average spread in the eigenvalues is reduced. STEINM does improve dramatically.

Consequently, when we observe RIDGM to be the big winner both in SEB and SPE, a rough application of Bayes theorem leads us to conclude, with high posterior probability, that for these data the RIDGE prior and not the STEIN prior is "more nearly correct."

Let us return then to the data analyst, "who knows only his data and not the underlying parameters," and let us leave him with two words of caution. The conditions represented in the present experiment may not represent those likely to occur in practice. Further, it is perhaps still too early to recommend ridge regression for routine use in data analysis.

REFERENCES

[1] Efron, Bradley, and Morris, Carl, "Stein's Estimation Rule and Its Competitors: An Empirical Bayes Approach," Journal of the American Statistical Association, 68 (March 1973), 117-30.
[2] Sclove, Stanley L., "Improved Estimators for Coefficients in Linear Regression," Journal of the American Statistical Association, 63 (June 1968), 596-606.

Rejoinder

A. P. DEMPSTER, MARTIN SCHATZOFF, and NANNY WERMUTH

1. INTRODUCTION

Since the discussants' comments fall into a few distinct categories, we shall organize our responses by topic rather than respond to each discussant separately. We preface our remarks with some comments of a general nature.

Our first observation on reading the six sets of comments is that all of the discussants are primarily interested in RIDGE methods, or in the comparison of RIDGE and STEIN procedures. Thus, despite years of widespread use of techniques such as variable selection and regression on principal components, there is no mention of these methods by the discussants. We believe that the indicated direction of interest is due to a combination of the lack of theoretical understanding of the latter classes of procedures, and the analytical and philosophical attractiveness of the RIDGE and STEIN approaches. Possibly the performance comparisons produced by our study are such as to dampen interest in many common methods. Two of the discussants, Allen and Smith, expressed interest in the REGF methods, but did not offer substantive remarks.

A second observation is that there are no comments on the analysis of the results of the experiments. We were worried that someone might question our use of OREG in producing Table 5, while we clearly suggested in the paper that methods such as RIDGM and FREGF offered possibilities for improved estimation.

Third, we are impressed by the variety and seriousness of the commenters' views on the broad classes of methods covered by the labels RIDGE and STEIN. We believe that the state of the art in these areas is well reflected in the discussion.

In considering the specific points raised by the discussants, it appears that most of these may be appropriately classified as follows:

1. Design of the experiment,
2. Theoretical aspects of RIDGE and STEIN methods,
3. Estimation of the RIDGE parameter (k),
4. Criteria for evaluating alternate methods, and
5. What to do with real data.

We discuss in the next section what we consider to be the relevant aspects of the various comments pertaining to these issues.

2. DISCUSSION OF SPECIFIC COMMENTS

2.1 Design of the Experiment

As with any Monte Carlo type of study, hard conclusions must usually be confined to the domain of investigation, with extrapolation to unexplored regions of the parameter space difficult at best, and often hazardous. Accordingly, we have not made sweeping claims as to the general applicability of our results, but rather have attempted to explore the effects of some parameters of interest on a large number of different estimation procedures.

Two of the discussants' papers (Allen; and Bingham and Larntz) argued for variation of α rather than β in the experimental design, while a third (Thisted) stressed that the design factors COL and MCL affect the risk functions based on SPE and SEB only through the λ's, for OREG, RIDGM, and STEINM. In both instances, our rationale was to provide comparative evaluation of these procedures with various types of stepwise selection of variables. We expected these comparisons to be sensitive to variation in the β's as well as to the pattern and degree of correlations in the independent variables. It should be pointed out that in the simulation examples presented by Hoerl, based on random selection of the α's with specified norm, RIDGM had very high performance relative to the maximum potential, over wide ranges of the norm.

A second comment on the design, made in both the Thisted and Efron and Morris papers, has to do with our use of a random rotation matrix. Their claim is that this type of randomization would tend to symmetrize the prior distribution of the β's, resulting in exchangeable prior distributions that would naturally favor RIDGM against other methods. This idea is intriguing, but is not made very precise in the comments. Perhaps it means that random rotation makes the coordinates α distribute in a way which appears exchangeable over the 160 data sets. Note that both RIDGE and REGF assume prior exchangeability among the components of β, but are very different methods which dominate each other in different situations, so exchangeability is not a sufficient description of a prior distribution of β to guide the data analyst. In any case, the actual distribution of β over our 160 data sets is a very simple, known, discrete distribution, as opposed to the symmetrized distribution, whatever that is. It seems a small swindle to base interpretations on an artificially scrambled distribution of α's rather than the simple known distribution of β. An interesting question remains: how should we have systematically varied our factor β to produce fairer comparisons of the relative strengths and weaknesses of RIDGE and STEIN?
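The kind of artificial data set discussed above can be sketched as follows. The dimensions, collinearity level, and coefficient pattern used here are illustrative assumptions only, not the factorial settings of the original study; the sketch simply shows how a collinear design, a random rotation, and the canonical coordinates α fit together.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, rho, sigma = 20, 6, 0.9, 1.0   # illustrative values, not the paper's settings

# Equicorrelated predictors to induce collinearity (in the spirit of factor COL).
cov = rho * np.ones((p, p)) + (1 - rho) * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), cov, size=n)

# A random orthogonal rotation, as discussed in Section 2.1.
G, _ = np.linalg.qr(rng.standard_normal((p, p)))
X = X @ G

beta = np.array([4.0, 2.0, 1.0, 0.5, 0.0, 0.0])  # one hypothetical coefficient pattern
y = X @ beta + sigma * rng.standard_normal(n)

# Canonical coordinates: X'X = Q diag(lam) Q', alpha = Q' beta.
# The lam are the eigenvalues through which, per Thisted's comment,
# COL and MCL affect the risk functions for OREG, RIDGM, and STEINM.
lam, Q = np.linalg.eigh(X.T @ X)
alpha = Q.T @ beta
```

Because the rotation is orthogonal, the norm of α equals the norm of β; only the direction of the coefficient vector relative to the principal axes is scrambled, which is the symmetrization the discussants had in mind.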

2.2 Theoretical Aspects of RIDGE and STEIN Methods

Efron and Morris state that STEINM is not the James-Stein rule, as we have been careful to note in Section 2.4.1. We wish to point out that JSTEIN and STEINM are both STEIN-type procedures in that they shrink uniformly on all principal axes, and that they differ only in the degree of shrinkage. We maintain our contention that JSTEIN would have performed worse on our data than STEINM. It thus appears that the theory is in conflict with our empirical results.

We believe that statisticians should no longer accept without question the assumptions of Efron, Morris, and Thisted that statistical techniques should be evaluated theoretically by means of frequentist risk functions depending on unknown parameter values. We were careful in our paper to define SPE and SEB in terms of actual errors of estimation, and not in terms of theoretical averages of such errors, whether frequentist or Bayesian. Our comment may be illustrated by Thisted's statement, "both SPE and SEB for OREG, RIDGM, and STEINM depend upon X'X only through its eigenvalues," which is literally false according to our definitions. It is also false, in general, when a prior distribution of β is available, whether SEB and SPE are reinterpreted as posterior expectations given a data set, or as prior expectations of such Bayes risks. The statement is true for prior expectations of a game player who knows β, but the relevance and applicability of this game to data analysis is a matter of current dispute.

Having expressed serious reservations about the meaning of the Efron-Morris-Thisted theory, we do wish to express our admiration for their efforts, and our wish to understand the insights which they feel the theory gives. Their comment about the general incompatibility of minimax and empirical Bayes seems to us to capture a real dilemma of much of statistics: except in rare, mathematically nice, and overly taught, circumstances, there is no sure-thing principle to protect us against the need for hard prior judgments.

2.3 Estimation of the RIDGE Parameter

Efron and Morris's statement that ". . . ridge estimators are a class of Bayes rules against normal priors indexed by k, and the effectiveness of a given rule depends upon how k is estimated" summarizes the situation very concisely.

We believe that we have demonstrated remarkable empirical properties for the RIDGM rule for estimating k. We have received a letter from Professor Hoerl, written after his original commentary on our paper, in which he alludes to a recent comparative evaluation of a number of ridge estimators over a spectrum of signal-to-noise ratios. He states, "Based on a broad comparison of all the algorithms, with p = 10, yours is the most effective. In fact, the degree to which your algorithm achieves near potential is startling." We are not sure whether he is referring to the study presented in his discussion of our paper, or to a further exploration not yet reported.

We agree with Efron and Morris that it would have been desirable to include EBMLE in our set of RIDGE methods. Our failure to do so was due to an error which led us to believe, until too late in the study, that RIDGM was equivalent to EBMLE. We would conjecture that EBMLE should be slightly better than RIDGM on our criteria.

Perhaps there are some Bayesian statisticians, as Thisted states, "for whom k represents a judgment on the precisions of components of α." A more usual contemporary Bayesian formulation would be to regard k as an unknown which needs a prior distribution just like other unknowns. The use of an estimated β associated with an estimated k is a crude approximation to the center of a posterior distribution, which is reasonably stable across a plausible range of smooth priors on k. We did not spell this out because our paper is not primarily Bayesian in outlook. We do feel, however, and Efron, Morris, and Thisted apparently agree, that the success of RIDGM must relate to some type of fit between the design of our study and the Bayesian assumptions which make RIDGM a near-optimum technique.

2.4 Criteria for Evaluation of Alternate Methods

Smith, and Efron and Morris, have addressed themselves to the question of criteria for comparing different methods. Specifically, Smith is concerned about our use of SPE as a measure of predictive error, because it is defined only at the same design points used in the experiment. He correctly points out that it ". . . gives no insight into the greater scope for improvement which occurs . . . when the directions in which predictions are required turn out to be those which are poorly estimated on the basis of the original design matrix." It would have been interesting to expand the design to incorporate evaluation of predictive errors at points other than those included in the original design.

Efron and Morris state that JSTEIN should not be applied with the loss SEB, because it is not minimax in this case. We believe, however, that SEB is a very important criterion, since it often happens that the principal objective of a regression study is to estimate the values of the regression coefficients.

2.5 What to Do with Real Data

The problem of what to do with real data is not solved by our study. Efron and Morris, and Thisted, correctly caution against the routine application of any shrinkage rule, and indeed we have adopted exactly the same posture in our paper. None of the discussants offered any concrete proposals, however, as to how one should proceed when analyzing real data. Nor were there any comments on our suggestions other than those by Hoerl, who indicated that he has experimented with a number of versions of our suggestion to divide data into subsets as a basis for comparing different estimation techniques.
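The subset-comparison suggestion can be sketched as follows: split the data into two halves, fit each candidate estimator on one half, and score its squared prediction error on the other. The ridge formula below (with k = 0 recovering ordinary least squares, i.e., OREG) and the fixed grid of k values are generic illustrations; neither the grid nor the splitting scheme is the RIDGM rule or the exact procedure proposed in the paper.

```python
import numpy as np

def ridge(X, y, k):
    """Ridge estimate (X'X + kI)^{-1} X'y; k = 0 gives ordinary least squares."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

def subset_compare(X, y, ks, rng):
    """Score each k by fitting on a random half and predicting the held-out half."""
    n = X.shape[0]
    idx = rng.permutation(n)
    fit, hold = idx[: n // 2], idx[n // 2 :]
    scores = {}
    for k in ks:
        beta_hat = ridge(X[fit], y[fit], k)
        scores[k] = np.mean((y[hold] - X[hold] @ beta_hat) ** 2)  # predictive MSE
    return scores

# Illustrative data with strong collinearity between the first two columns.
rng = np.random.default_rng(1)
n, p = 40, 6
X = rng.standard_normal((n, p))
X[:, 1] = X[:, 0] + 0.05 * rng.standard_normal(n)
beta = np.array([1.0, 1.0, 0.5, 0.0, 0.0, 0.0])
y = X @ beta + rng.standard_normal(n)

scores = subset_compare(X, y, ks=[0.0, 0.1, 1.0, 10.0], rng=rng)
best_k = min(scores, key=scores.get)   # k = 0 corresponds to OREG
```

In practice one would average such comparisons over several random splits before selecting a method, and the same scheme extends to comparing ridge against variable-selection or Stein-type estimators rather than only across values of k.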

Although he claims to have found such techniques to be deficient, we would be most interested in seeing the results. In terms of a predictive error criterion such as SPE, or the predictive mean square error advocated by Smith, it would seem that comparison of the predictive capabilities of various methods from one subset to another would provide a reasonable empirical basis for selecting a particular method in a given situation.

3. CONCLUSION

We feel that a number of interesting and useful points have emerged from the various discussions of our paper, and believe that the combined effect will be to stimulate further research, both theoretical and experimental. We view the problem of what to do with real data as being of paramount importance, and we hope that some of the suggestions made in the concluding section of our paper will be followed up. This is not meant to preclude independent approaches, for there is certainly ample room for development and exploration of new ideas on many facets of the problem. The potential for large gains clearly exists. We need to develop tools for better exploiting this potential.