
Shrinkage regression

Rolf Sundberg

Volume 4, pp 1994–1998

in

Encyclopedia of Environmetrics (ISBN 0471 899976)

Edited by

Abdel H. El-Shaarawi and Walter W. Piegorsch

John Wiley & Sons, Ltd, Chichester, 2002

Shrinkage regression

Shrinkage regression refers to shrinkage methods of estimation or prediction in regression situations, useful in particular when there is multicollinearity among the regressors. With a term borrowed from approximation theory, these methods are also called regularization methods. Such situations occur frequently in environmetric studies, when many chemical, biological or other explanatory variables are measured, as when Branco et al. [3] wanted to model the toxicity of n = 19 industrial wastewater samples by pH and seven metal concentrations, and found it necessary to use dimensionality-reducing techniques. As another example, to be discussed further below, Andersson et al. [1] constructed a quantitative structure–activity relationship (QSAR) to model polychlorinated biphenyl (PCB) effects on cell-to-cell communication inhibition in the liver, having a sample of size n = 27 of PCB congeners that represented a set of 154 different PCBs, and using 52 physicochemical properties of the PCBs as explanatory variables.

Suppose we have n independent observations of (x, y) = (x_1, ..., x_p, y) from a standard multiple regression model

    Y = α + β'x + ε,    var(ε) = σ²    (1)

(see Linear models). The ordinary least squares (OLS) estimator can be written

    b_OLS = S_xx^{-1} s_xy    (2)

where S_xx is the sum of squares and products matrix of the centered x variables and s_xy is the vector of their sums of products with y. The OLS estimator is the best fitting and minimum variance linear unbiased estimator (best linear unbiased estimator, BLUE), with variance

    var(b_OLS) = σ² S_xx^{-1}    (3)

That OLS yields the best fit does not say, however, that b_OLS is best in a wider sense, or even a good choice. We will discuss here alternatives to be used when the x variables are (near) multicollinear, that is, when there are linear combinations among them that show little variation. The matrix S_xx is then near singular, so var(b_OLS) will have very large elements. Correspondingly, the components of b_OLS may show unrealistically large values. Under exact collinearity, b_OLS is not even uniquely defined. In these situations it can pay substantially to use shrinkage methods that trade bias for variance. Not only can they give more realistic estimates of β, they are also even more strongly motivated when a predictor of y from x is to be constructed.

A simple and, as we will see, fundamental shrinkage construction is ridge regression (RR), according to which the OLS estimator is replaced by

    b_RR = (S_xx + δI)^{-1} s_xy    (4)

for a suitable ridge constant δ > 0 (to be specified). Addition of whatever δ > 0 along the diagonal of S_xx clearly makes this matrix factor less nearly singular. To see more precisely how RR shrinks, consider an orthogonal transformation that replaces the components of x by the principal components (PCs), t_i = c_i'x, i = 1, ..., p. Here the coefficient vectors c_i are the eigenvectors of S_xx, with eigenvalues λ_i, say. Correspondingly, S_xx is replaced by the diagonal matrix of these eigenvalues λ_i, and RR adds δ to them all. Where OLS divides the ith component of the vector s_ty (the vector s_xy expressed in the PC basis) by λ_i, RR divides by λ_i + δ. Hence, RR shrinks b_OLS much in principal directions with small λ_i, but only little in other principal directions.
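To make (2)–(4) concrete, here is a minimal numerical sketch in Python/numpy. The toy data, the ridge constant δ = 1 and the variable names are made up for illustration; the point is the per-direction shrinkage factors λ_i/(λ_i + δ).

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data with two nearly collinear regressors
n, p = 50, 3
x = rng.normal(size=(n, p))
x[:, 2] = x[:, 1] + 0.01 * rng.normal(size=n)        # x3 ~ x2: near-multicollinearity
y = 1.0 + x @ np.array([2.0, 1.0, 0.0]) + 0.5 * rng.normal(size=n)

# Centered sums of squares and products, as in (2)
xc = x - x.mean(axis=0)
yc = y - y.mean()
Sxx = xc.T @ xc
sxy = xc.T @ yc

b_ols = np.linalg.solve(Sxx, sxy)                    # (2): unstable under near-collinearity

delta = 1.0                                          # ridge constant, to be tuned (e.g. by cross-validation)
b_rr = np.linalg.solve(Sxx + delta * np.eye(p), sxy) # (4)

# Shrinkage seen in the principal directions of Sxx: where OLS divides by
# lambda_i, RR divides by lambda_i + delta.
lam, C = np.linalg.eigh(Sxx)                         # eigenvalues lambda_i, eigenvectors c_i
print("b_OLS:", b_ols)
print("b_RR :", b_rr)
print("per-direction shrinkage factors:", lam / (lam + delta))
```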
Variable selection might appear to shrink, by restricting attention to a lower-dimensional subspace U_r (r < p) of the model space of the xs. Variable selection improves precision when it omits variables with negligible influence, but if it is used to eliminate multicollinearities, then it does not necessarily shrink, and it will make interpretation difficult. More sophisticated ways of restricting attention to lower-dimensional subspaces U_r of regressors are represented by methods such as principal components regression (PCR) and partial least squares (or projection to latent structures; PLS) regression. Both methods construct new regressors, linear in the old regressors, and y is regressed (by least squares, LS) on a relatively small number of these PCs or PLS factors. This number r, as well as the ridge constant δ in RR, is typically chosen by a cross-validation criterion. As an example, Andersson et al. [1] used PLS regression with two PLS factors to replace the 52 original variables in the regression. This model explained R² = 84% of the variability, and in cross-validation it predicted 80% (necessarily less than R², but quite close in this case). They could also interpret their model within the context of their chemometric application, as referring basically to the position pattern of the chlorine atoms.

In the standard form of PCR, the new regressors t_i = c_i'x are the first PCs of the (centered) x from a PC analysis. This means that the c_i are the eigenvectors of S_xx that explain as much variation as possible in x. The space U_r is then spanned by the first r PCs. In this way directions in x with little variation are shrunk to 0. The connection with variation in y is only indirect, however. PLS regression is more efficient in this respect, because it constructs a first regressor t_1 = c_1'x to maximize the covariance between t_1 and y (among coefficient vectors c_1 of fixed length), and then successively t_2, t_3, etc., so as to maximize the covariance with the y residuals after LS regression of y on the previous set of regressors. As an algorithm for construction of the PLS regression subspaces U_r of the x space, spanned by t_1, ..., t_r, this is one of several possible. For a long time PLS was only algorithmically defined, until Helland [10] gave a more explicit characterization by showing that U_r for PLS is spanned by the sequence of vectors s_xy, S_xx s_xy, ..., S_xx^{r-1} s_xy. This representation is of more theoretical than practical interest, however, because there are numerically better algorithms.
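Helland's characterization lends itself to a quick numerical check. The sketch below (Python/numpy, reusing the centered toy data xc and yc from the previous sketch; the function name is ours) fits an r-factor PLS regression simply by LS regression of y on an orthonormal basis of the Krylov sequence spanning U_r. As the text notes, this is not how PLS should be computed in practice.

```python
import numpy as np

def pls_via_krylov(xc, yc, r):
    """r-factor PLS coefficients via Helland's representation [10]:
    LS regression of y on the subspace spanned by s_xy, Sxx s_xy, ..., Sxx^(r-1) s_xy."""
    Sxx = xc.T @ xc
    sxy = xc.T @ yc
    # Krylov sequence, orthonormalized by QR for numerical sanity
    K = np.column_stack([np.linalg.matrix_power(Sxx, j) @ sxy for j in range(r)])
    Q, _ = np.linalg.qr(K)                     # columns span U_r
    scores = xc @ Q                            # r new regressors, linear in x
    gamma = np.linalg.lstsq(scores, yc, rcond=None)[0]
    return Q @ gamma                           # implied coefficients for the centered x

b_pls = pls_via_krylov(xc, yc, r=2)            # xc, yc as in the previous sketch
print("2-factor PLS coefficients:", b_pls)
```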
A step towards a general view on the choice of shrinkage regression was taken by Stone and Brooks [13], who introduced continuum regression (CR), with the criterion function

    R²(t, y) V(t)^γ    (5)

where R(t, y) is the sample correlation between t = c'x and y, V(t) is the sample variance of t, and γ ≥ 0 is a parameter to be specified. This function of t (or c) should be maximized under the constraint |c| = 1. In previous terms

    R²(t, y) ∝ (c's_xy)² / (c'S_xx c)    (6)

    V(t) ∝ c'S_xx c    (7)

As in PLS, when a first c = c_1, t = t_1, has been found, the criterion is applied again to the residuals from LS regression of y on t_1 to find the next c, c = c_2, etc. The new c will automatically be orthogonal to the previous ones. To make a regression method out of this algorithm, the number r of regressors and the parameter γ are chosen so as to optimize some cross-validation measure.

The most important role of CR is perhaps not as a practical regression method, however, but as a theoretical device embracing several methods. For γ = 0 we get back OLS, and the algorithm will stop after one step (r = 1). For γ = 1 the criterion function (5) is proportional to the squared covariance and will yield PLS, and as γ → ∞, V(t) will dominate and we get PCR. A much more general yet simple result, which relates to RR and provides an even wider understanding, has been given in [2], as follows. Suppose we are given a criterion that is a function of R²(t, y) and V(t), nondecreasing in each of these arguments. The latter demand is quite natural, because we want both of these quantities to have high values. The regressor maximizing any such criterion under the constraint |c| = 1 is then proportional to the RR estimator (4) for some δ, possibly negative or infinite. Conversely, for each δ ≥ 0 or δ < −λ_max (λ_max being the largest eigenvalue of S_xx) there is a criterion for which this is the solution. The first PLS factor is obtained as δ → ±∞, and δ → −λ_max corresponds to the first PCR factor. The first CR factor with varying γ ≥ 0 usually (but not always) requires the whole union of δ ≥ 0 and δ < −λ_max.
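The result in [2] also suggests a crude way to compute a first CR factor numerically: search for the maximizer of (5) along the ridge path c(δ) ∝ (S_xx + δI)^{-1} s_xy. The sketch below (Python/numpy, reusing Sxx and sxy from the first sketch; the γ values and the δ grid are arbitrary choices) illustrates how γ = 0, γ = 1 and a large γ recover roughly the OLS, PLS and PCR first factors.

```python
import numpy as np

def first_cr_factor(Sxx, sxy, gamma, deltas):
    """Maximize criterion (5), R^2(t, y) * V(t)^gamma, over unit-length c,
    searching along the ridge path c(delta) ~ (Sxx + delta I)^(-1) s_xy
    suggested by the result in [2].  Crude grid search, for illustration only."""
    p = Sxx.shape[0]
    best_val, best_c = -np.inf, None
    for delta in deltas:
        c = np.linalg.solve(Sxx + delta * np.eye(p), sxy)
        c = c / np.linalg.norm(c)
        r2 = (c @ sxy) ** 2 / (c @ Sxx @ c)    # squared correlation up to a constant, cf. (6)
        v = c @ Sxx @ c                        # variance of t up to a constant, cf. (7)
        val = r2 * v ** gamma
        if val > best_val:
            best_val, best_c = val, c
    return best_c

# delta grid: delta >= 0, a huge delta (-> first PLS factor) and delta < -lambda_max
lam_max = np.linalg.eigvalsh(Sxx).max()
deltas = np.concatenate([np.linspace(0.0, 20 * lam_max, 400),
                         [1e9 * lam_max],
                         -lam_max * (1.0 + np.logspace(-6, 2, 60))])

c_ols = first_cr_factor(Sxx, sxy, gamma=0.0, deltas=deltas)   # ~ direction of b_OLS
c_pls = first_cr_factor(Sxx, sxy, gamma=1.0, deltas=deltas)   # ~ direction of s_xy
c_pcr = first_cr_factor(Sxx, sxy, gamma=50.0, deltas=deltas)  # ~ first PC of Sxx
```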

The use of c = b_RR from (4) to form a single regressor in LS regression is called least squares ridge regression (LSRR) in [2], and it differs from RR by a scalar factor that will be near 1 when δ is small. Hence RR is closely related to the other constructions. Furthermore, experience shows that RR and LSRR, for some best small δ, typically yield a similar regression to (the best r) PLS and (the best r) PCR.

If the different constructions yield quite similar estimators/predictors, which shrinkage method is then preferable? Here is some general advice.

• RR and LSRR have the advantage that they only involve a single factor.
• LSRR only shrinks to compensate for multicollinearity, whereas RR always shrinks.
• RR has a Bayesian interpretation as the posterior mean for a suitable prior for β (see Bayesian methods and modeling).
• LSRR (like PLS and PCR) yields residuals orthogonal to the fitted relation, unlike RR.
• PLS and PCR have interpretations in terms of latent factors, such that for PCR essentially all x variability is in U_r for suitable r, whereas for PLS all x variability that influences y is in the corresponding U_r, for a typically somewhat smaller r than for PCR.
• The latent factor structure in PCR and PLS is convenient for outlier detection and classification (cf. below and [12, Chapter 5]). Also x components missing at random in prediction of y for a new observation are easily handled.

Multicollinearity imposes estimation identifiability problems when we want to find the true β. If x_1 + x_2 is (near) constant, say, we cannot tell whether x_1 or x_2 or both jointly are responsible for the influence of x on y. This is less serious for prediction, provided that new observations satisfy the same relation between y and x, and that new xs are like the calibration xs, in particular satisfying the same multicollinearity. To judge the predictive ability, some version of cross-validation is natural, based on repeated splits of the data into calibration and test sets. But we must always keep in mind the risk that the calibration (x, y) data are particular in some respect, and that the cross-validation criteria are therefore too optimistic. An extreme example, from a study concerning determination of nitrogen in wastewater, is discussed in [14]. In the QSAR example above [1], the model was fitted on a sample of 27 PCBs from a population group of 154 PCBs. If we want to predict the activity y for a new PCB from this population, then we must allow and check for the possibility that it is an outlier in x space, compared with the calibration sample. Implications differ according to whether it is an outlier in a direction orthogonal to the two-dimensional latent subspace U_r, or only within U_r. There are also other PCBs not even satisfying the criteria for belonging to the population group under study, and for them we must doubt the prediction equation for y even if x is not an outlier.
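One simple way to implement such a check is sketched below (Python/numpy). It assumes, for illustration, that U_r is taken as the span of the first r PCs of the calibration x data; the function name and the 95% cut-off are arbitrary. A new x is split into its score within U_r and its residual orthogonal to U_r, and the residual is compared with those of the calibration samples.

```python
import numpy as np

def x_space_check(X_cal, x_new, r):
    """Rough outlier check of a new observation against the calibration x data:
    score within the r-dimensional latent subspace U_r (here the first r PCs)
    and squared distance orthogonal to U_r."""
    mean = X_cal.mean(axis=0)
    Xc = X_cal - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V_r = Vt[:r].T                                    # orthonormal basis of U_r
    def split(v):
        score = v @ V_r                               # position within U_r
        resid = v - V_r @ score                       # component orthogonal to U_r
        return score, float(resid @ resid)
    cal_orth = np.array([split(row)[1] for row in Xc])
    score_new, orth_new = split(np.asarray(x_new) - mean)
    flagged = orth_new > np.percentile(cal_orth, 95)  # arbitrary cut-off
    return score_new, orth_new, flagged
```

An extreme score within U_r would be judged analogously, against the spread of the calibration scores.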
Shrinkage regression constructions are intrinsically scale noninvariant. This is because 'near' in near-collinear is metric-dependent. Correlation is invariant, but covariance and variance are not. Individual rescaling of the x components, or any other nonorthogonal transformation, will affect the shrinkage. It is often recommended that the x components be autoscaled to unit variance as a pretreatment. This is natural when the x components are on incomparable scales, as in QSAR studies, but silly in many other situations, for example in spectroscopy. As another example of frequent pretreatments, second-order differencing of a spectrum x corresponds to a nonorthogonal linear transformation, and thus has effects on shrinkage regression. The choice of scaling is often more crucial than the choice between the different methods (RR, LSRR, PCR, PLS).

In multivariate regression, when the response is a vector y and not a scalar y, and the regression coefficients form a matrix B, we could try shrinkage also in y space. In this way we might be able to borrow strength for the prediction of one component of y from the others. A variety of methods has been proposed in the literature.

There are PLS-type algorithms (PLS2, SIMPLS), in each step constructing a pair of PLS factors, one in y and one in x. The first factors are selected equally by PLS2 and SIMPLS, but their criteria for the subsequent factors differ. There is a multivariate version of CR, joint continuum regression (JCR [5]), of which SIMPLS is a special case. At one of the extreme ends of JCR we find PCR, which is not affected by the dimension of y. At the other end is the LS version of reduced-rank regression (RRR). RRR, imposing a rank constraint on B, is an example of a principle that essentially only shrinks in y space and therefore is inadequate under near-collinearity in x. The maximum likelihood (ML) (≠ LS) estimator in an RRR model (standard multivariate regression with a fixed rank constraint added) is equivalent to using a number of canonical variates from a canonical correlation analysis, so for each canonical y variate we use OLS regression. This method has become an important tool in econometrics for modeling cointegration (i.e. common trends) in nonstationary vector autoregressive time series models [11], with potential applications also in environmetrics.
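For the LS version of RRR mentioned above, a rank-r coefficient matrix can be obtained by projecting the OLS fitted values onto their first r principal directions in y space (the Eckart–Young construction). The sketch below (Python/numpy, made-up data and an arbitrary rank) shows this LS variant only; it is not the ML/canonical-variate estimator discussed in the text.

```python
import numpy as np

def reduced_rank_ls(X, Y, rank):
    """LS reduced-rank regression: fit Y ~ X B with rank(B) <= rank by projecting
    the OLS fitted values on their leading principal directions in y space."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    B_ols = np.linalg.lstsq(Xc, Yc, rcond=None)[0]   # unconstrained OLS coefficient matrix
    Yhat = Xc @ B_ols
    _, _, Vt = np.linalg.svd(Yhat, full_matrices=False)
    V_r = Vt[:rank].T                                # leading y-space directions
    return B_ols @ V_r @ V_r.T                       # rank-constrained coefficient matrix

# Illustration: 5 regressors, 3 responses, rank-2 constraint on B
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
Y = X @ rng.normal(size=(5, 3)) + 0.1 * rng.normal(size=(40, 3))
print(np.linalg.matrix_rank(reduced_rank_ls(X, Y, rank=2)))   # 2
```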

Burnham et al. [8] formulated a latent variable multivariate regression model, in which both y and x have latent structures, more or less overlapping. The overlap yields a regression relationship when y is to be predicted from x. A constrained ML method in this model is equivalent to a method known as principal covariates regression (PcovR [9]), which is a shrinkage method that will often yield results similar to those of PLS2 and SIMPLS. See also [7], which gives characterizations of, and relationships between, several different methods with respect to their optimality criteria of construction.

Further methods were discussed and compared by Breiman and Friedman [4], who constructed and advocated a method they called 'curds and whey', primarily shrinking in y space. The article caused a lively debate and much criticism.

Despite the many shrinkage methods proposed and used for multivariate regression, it is not clear if and when they are really better than separate univariate regressions. From the literature it seems as if one seldom gains much precision from replacing separate univariate predictions by a multivariate prediction method; cf. the discussion in [5].

For fully implemented PCR or PLS, with cross-validation to help select the number of factors and various diagnostic tools, there are commercial packages from the field of chemometrics doing the job, such as SIMCA (Umetrics) and UNSCRAMBLER (CAMO). For those who want more flexibility or more lucid software, programming in Matlab can be recommended. The PLS Toolbox (Eigenvector Research) contains a large number of Matlab subroutines, and there is also much free code available on the web.

Recommended literature on shrinkage regression includes the books by Brown [6], not least for covering Bayesian aspects, and by Martens and Næs [12].

References

[1] Andersson, P., Haglund, P., Rappe, C. & Tysklind, M. (1996). Ultra violet absorption characteristics and calculated semi-empirical parameters as chemical descriptors in multivariate modelling of polychlorinated biphenyls, Journal of Chemometrics 10, 171–185.
[2] Björkström, A. & Sundberg, R. (1999). A generalized view on continuum regression, Scandinavian Journal of Statistics 26, 17–30.
[3] Branco, J.A., Pires, A.M., Mendonça, E. & Picado, A.M. (1994). A statistical evaluation of toxicity tests of waste water, in Statistics for the Environment, Vol. 2, V. Barnett & K.F. Turkman, eds, Wiley, Chichester.
[4] Breiman, L. & Friedman, J.H. (1997). Predicting multivariate responses in multiple linear regression (with discussion), Journal of the Royal Statistical Society, Series B 59, 3–54.
[5] Brooks, R.J. & Stone, M. (1994). Joint continuum regression for multiple predictands, Journal of the American Statistical Association 89, 1374–1377.
[6] Brown, P.J. (1993). Measurement, Regression, and Calibration, Oxford University Press, Oxford.
[7] Burnham, A.J., Viveros, R. & MacGregor, J.F. (1996). Frameworks for latent variable multivariate regression, Journal of Chemometrics 10, 31–45.
[8] Burnham, A.J., MacGregor, J.F. & Viveros, R. (1999). A statistical framework for multivariate latent variable regression methods based on maximum likelihood, Journal of Chemometrics 13, 49–65.
[9] De Jong, S. & Kiers, H.A.L. (1992). Principal covariates regression. Part I. Theory, Chemometrics and Intelligent Laboratory Systems 14, 155–164.
[10] Helland, I.S. (1988). On the structure of partial least squares regression, Communications in Statistics – Simulation and Computation 17, 581–607.
[11] Johansen, S. (1995). Likelihood-based Inference in Cointegrated Vector Autoregressive Models, Oxford University Press, Oxford.
[12] Martens, H. & Næs, T. (1989). Multivariate Calibration, Wiley, Chichester.
[13] Stone, M. & Brooks, R.J. (1990). Continuum regression: cross-validated sequentially constructed prediction embracing ordinary least squares, partial least squares and principal components regression (with discussion), Journal of the Royal Statistical Society, Series B 52, 237–269.
[14] Sundberg, R. (1999). Multivariate calibration – direct and indirect regression methodology (with discussion), Scandinavian Journal of Statistics 26, 161–207.

(See also Collinearity)

ROLF SUNDBERG