A NEW ROBUST PARTIAL LEAST SQUARES REGRESSION METHOD BASED ON A ROBUST AND AN EFFICIENT ADAPTIVE REWEIGHTED ESTIMATOR OF COVARIANCE

Authors: Esra Polat
– Department of Statistics, Hacettepe University, Turkey ([email protected])
Suleyman Gunay
– Department of Statistics, Hacettepe University, Turkey ([email protected])

Abstract:
• Partial Least Squares Regression (PLSR) is a linear regression technique developed as an incomplete or "partial" version of the least squares estimator of regression, applicable when high or perfect multicollinearity is present in the predictor variables. Robust methods are introduced to reduce or remove the effects of outlying data points. Previous studies have shown that if the sample covariance matrix is properly robustified, further robustification of the linear regression steps of the PLS1 algorithm (PLSR with a univariate response variable) becomes unnecessary. Therefore, we propose a new robust PLSR method based on robustification of the covariance matrix used in the classical PLS1 algorithm. We select a reweighted estimator of covariance, with the Minimum Covariance Determinant (MCD) as initial estimator and with weights adaptively computed from the data. We compare this new robust PLSR method with classical PLSR and four other well-known robust PLSR methods. Both simulation results and the analysis of a real data set show the effectiveness and robustness of the newly proposed robust PLSR method.

Key-Words:
• Efficient estimation, Minimum Covariance Determinant (MCD), partial least squares regression, robust covariance matrix, robust estimation

AMS Subject Classification:
• 62F35, 62H12, 62J05

1. INTRODUCTION

Classical PLSR is a well-established technique in multivariate data analysis. It is used to model the linear relation between a set of regressors and a set of response variables, and the fitted model can then be used to predict the value of the response variables for a new sample. A typical example is multivariate calibration, where the x-variables are spectra and the y-variables are the concentrations of certain constituents. Since classical PLSR is known to be severely affected by the presence of outliers in the data or deviations from normality, several PLSR methods with robust behaviour towards data contamination have been proposed (Hubert and Vanden Branden, 2003; Liebmann et al., 2010). NIPALS and SIMPLS are the popular algorithms for PLSR, and both are very sensitive to outliers in the dataset. For univariate or multivariate response variables, several robustified versions of these algorithms have already been proposed (González et al., 2009).

The two main strategies in the literature for robust PLSR are (1) the downweighting of outliers and (2) robust estimation of the covariance matrix. The early approaches for robust regression by downweighting of outliers are considered semi-robust: they had, for instance, non-robust initial weights, or weights that were not resistant to leverage points (Hubert and Vanden Branden, 2003). Following the first strategy, for example, Wakeling and Macfie (1992) worked with PLS with multivariate response variables (which will be called PLS2); their idea was to replace the set of regressions involved in the standard PLS2 algorithm by M-estimates based on weighted regressions. Griep et al. (1995) compared least median of squares (LMS), Siegel's repeated median (RM) and iteratively reweighted least squares (IRLS) for PLS with a univariate response variable (the PLS1 algorithm), but these methods are not resistant to high leverage outliers (González et al., 2009).
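To make the downweighting idea of strategy (1) concrete, the sketch below shows a generic iteratively reweighted least squares loop with Huber-type weights, in which observations with large standardized residuals receive weights below one. It is a textbook illustration only, not the LMS, RM or IRLS procedures compared in the studies cited above; the tuning constant c = 1.345 and the MAD scale estimate are conventional choices, and the least squares starting value is itself non-robust, which is precisely the weakness of the early semi-robust proposals.

```python
import numpy as np

def irls_huber(X, y, c=1.345, n_iter=20):
    """Generic IRLS with Huber weights: a sketch of the downweighting
    strategy, not a PLS-specific robust method."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # LS start (non-robust)
    for _ in range(n_iter):
        r = y - X @ beta
        s = np.median(np.abs(r)) / 0.6745        # MAD estimate of scale
        u = r / max(s, 1e-12)                    # standardized residuals
        # Huber weights: 1 for small residuals, c/|u| for large ones
        w = np.minimum(1.0, c / np.maximum(np.abs(u), 1e-12))
        Xw = X * w[:, None]
        beta = np.linalg.solve(Xw.T @ X, Xw.T @ y)  # weighted LS step
    return beta
```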
Following the second strategy, robust covariance estimation, robust PLSR methods provide resistance to all types of outliers, including leverage points (Hubert and Vanden Branden, 2003). For instance, Gil and Romera (1998) proposed a robust PLSR method for the PLS1 algorithm based on statistical procedures for covariance matrix robustification; they selected the well-known Stahel-Donoho estimator (SDE) (Gil and Romera, 1998). Since SIMPLS is based on the empirical cross-covariance matrix between the y-variables and the x-variables and on linear least squares (LS) regression, its results are affected by outliers in the data set. Hence, Hubert and Vanden Branden (2003) suggested a robust version of this method, called RSIMPLS, which can be used with both univariate and multivariate response variables. RSIMPLS starts by applying ROBPCA on the x- and y-variables in order to replace the covariance matrices $\mathbf{S}_{xy}$ and $\mathbf{S}_x$ by robust estimates, and then proceeds analogously to the SIMPLS algorithm; a robust regression method (ROBPCA regression) is performed in the second stage. ROBPCA is a robust PCA method which combines projection pursuit ideas with Minimum Covariance Determinant (MCD) covariance estimation in lower dimensions (Engelen et al., 2004; Hubert and Vanden Branden, 2003). Serneels et al. (2005) proposed a method called Partial Robust M (PRM) regression, which is conceptually different from the other robust PLSR methods: instead of a robust partial least squares estimator, a partial robust regression estimator is proposed. This method uses the SIMPLS algorithm and can be used in the case of a univariate response. With an appropriately chosen weighting scheme, both vertical outliers and leverage points are downweighted (Serneels et al., 2005). As the name suggests, it is a partial version of robust M-regression: in an iterative scheme, weights ranging between zero and one are calculated to reduce the influence of deviating observations in the y space as well as in the space of the regressor variables. PRM is very efficient in terms of computational cost and statistical properties (Liebmann et al., 2010). González et al. (2009) also concentrated on the case of a univariate response (PLS1) and showed that if the sample covariance matrix is properly robustified, the PLS1 algorithm will be robust; therefore, further robustification of the linear regression steps of the PLS1 algorithm is unnecessary.

In this paper, we concentrate on the case of a univariate response (PLS1) and present a procedure which applies the standard PLS1 algorithm to a robust covariance matrix, similar to the studies of Gil and Romera (1998) and González et al. (2009). In our study, we estimate the covariance matrix used in the PLS1 algorithm robustly by using an adaptive reweighted estimator of covariance, with Minimum Covariance Determinant (MCD) estimators in the first step as robust initial estimators of location and covariance.
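As a rough illustration of this construction, the sketch below computes a one-step reweighted covariance estimate starting from the raw MCD fit, using scikit-learn's MinCovDet. It is a simplified sketch, not the exact ARWMCD estimator presented in Section 3: in particular, the fixed chi-square quantile cutoff used here merely stands in for the adaptive, data-driven threshold that gives the proposed estimator its name, and the function name is illustrative.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

def reweighted_mcd_cov(Z, quantile=0.975, random_state=0):
    """One-step reweighted covariance with raw MCD initial estimates.

    Simplified sketch: a fixed chi-square cutoff replaces the adaptive,
    data-driven threshold of the ARWMCD estimator of Section 3.
    """
    d = Z.shape[1]
    mcd = MinCovDet(random_state=random_state).fit(Z)
    # Squared Mahalanobis distances to the raw (unweighted) MCD fit
    diff = Z - mcd.raw_location_
    d2 = np.sum(diff * np.linalg.solve(mcd.raw_covariance_, diff.T).T, axis=1)
    # Hard-rejection weights: 1 inside the cutoff, 0 outside
    w = (d2 <= chi2.ppf(quantile, df=d)).astype(float)
    # Weighted mean and covariance over the retained points
    mu = np.average(Z, axis=0, weights=w)
    Zc = (Z - mu) * np.sqrt(w)[:, None]
    S = Zc.T @ Zc / (w.sum() - 1)
    return mu, S
```

In the proposed method, the blocks of such a robustly estimated covariance matrix of $\mathbf{z} = (y, \mathbf{x}')'$ replace the sample quantities $s_y^2$, $\mathbf{s}_{y,x}$ and $\mathbf{S}_X$ used by the PLS1 algorithm reviewed in Section 2.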
The rest of the paper is organized as follows. Section 2 briefly reviews the PLS1 algorithm (PLS with a univariate response variable). Section 3 presents the newly proposed robust PLSR method, PLS-ARWMCD. Section 4 contains a simulation study in which the performance of the new robust PLSR method is compared to the classical PLSR method and to four other robust PLSR methods from the literature. Section 5 illustrates the performance of PLS-ARWMCD on a well-known real data set from the robust PLSR literature. Finally, Section 6 collects some conclusions.

2. THE CLASSICAL PLS1 ALGORITHM

Suppose we have a sample of size $n$ of a $(1+p)$-dimensional vector $\mathbf{z} = (y, \mathbf{x}')'$, which can be decomposed into a set of $p$ independent variables $\mathbf{x}$ and a univariate response variable $y$. Throughout this paper, matrices are denoted by bold capital letters and vectors by bold lowercase letters. Let $\mathbf{S}_z$ be the sample covariance matrix of $\mathbf{z}$, consisting of the elements
$$
\mathbf{S}_z = \begin{pmatrix} s_y^2 & \mathbf{s}_{y,x}' \\ \mathbf{s}_{y,x} & \mathbf{S}_X \end{pmatrix},
$$
where $\mathbf{s}_{y,x}$ is the $p \times 1$ vector of covariances between $y$ and the $x$-variables. The aim is to estimate the linear regression $\hat{y} = \hat{\boldsymbol{\beta}}'\mathbf{x}$, and it is assumed that the response variable can be linearly explained by a set of $k$ components $\mathbf{t}_1, \ldots, \mathbf{t}_k$ with $k \ll p$, which are linear functions of the $x$-variables. Hence, calling $\mathbf{X}$ the $n \times p$ data matrix of the independent variables and $\mathbf{x}_i'$ its $i$-th row, the model given by (2.1) and (2.2) holds (González et al., 2009):
$$
(2.1)\qquad \mathbf{x}_i = \mathbf{P}\,\mathbf{t}_i + \boldsymbol{\varepsilon}_i ,
$$
$$
(2.2)\qquad y_i = \mathbf{q}'\mathbf{t}_i + \eta_i .
$$
Here, $\mathbf{P}$ is the $p \times k$ matrix of the loadings of the vector $\mathbf{t}_i = (t_{i1}, \ldots, t_{ik})'$ and $\mathbf{q}$ is the $k$-dimensional vector of the $y$-loadings. The error terms $\boldsymbol{\varepsilon}_i$ and $\eta_i$ have zero mean, follow normal distributions and are uncorrelated. The component matrix $\mathbf{T}$, whose $i$-th row is $\mathbf{t}_i'$, is not directly observed and must be estimated. It can then be shown that the maximum likelihood estimate of $\mathbf{T}$ is given by (2.3) (González et al., 2009):
$$
(2.3)\qquad \mathbf{T} = \mathbf{X}\mathbf{W}_k .
$$
Here, the loading matrix $\mathbf{W}_k = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_k]$ is the $p \times k$ matrix of coefficients, and the vectors $\mathbf{w}_i$, $1 \le i \le k$, are the solutions of (2.4) under the constraints in (2.5), with $\mathbf{w}_1 \propto \mathbf{s}_{y,x}$. Consequently, the components $\mathbf{t}_1, \ldots, \mathbf{t}_k$ are orthogonal (González et al., 2009):
$$
(2.4)\qquad \mathbf{w}_i = \arg\max_{\mathbf{w}} \operatorname{cov}^2(\mathbf{X}\mathbf{w}, \mathbf{y}),
$$
$$
(2.5)\qquad \mathbf{w}'\mathbf{w} = 1 \quad \text{and} \quad \mathbf{w}_i'\mathbf{S}_X\mathbf{w}_j = 0 \ \ \text{for} \ 1 \le j < i .
$$
It can be shown that the vectors $\mathbf{w}_i$ are the eigenvectors linked to the largest eigenvalues of the matrix given in (2.6):
$$
(2.6)\qquad \left(\mathbf{I} - \mathbf{P}_{x(i)}\right)\mathbf{s}_{y,x}\mathbf{s}_{y,x}' ,
$$
where $\mathbf{P}_{x(i)}$ is the projection matrix onto the space spanned by $\mathbf{S}_X\mathbf{W}_i$, given by $\mathbf{P}_{x(i)} = (\mathbf{S}_X\mathbf{W}_i)\left[(\mathbf{S}_X\mathbf{W}_i)'(\mathbf{S}_X\mathbf{W}_i)\right]^{-1}(\mathbf{S}_X\mathbf{W}_i)'$. From these results it is easy to see that the vectors $\mathbf{w}_i$ can be computed recursively as follows:
$$
(2.7)\qquad \mathbf{w}_1 \propto \mathbf{s}_{y,x} ,
$$
$$
(2.8)\qquad \mathbf{w}_{i+1} \propto \mathbf{s}_{y,x} - \mathbf{S}_X\mathbf{W}_i\left(\mathbf{W}_i'\mathbf{S}_X\mathbf{W}_i\right)^{-1}\mathbf{W}_i'\mathbf{s}_{y,x} , \qquad 1 \le i < k .
$$
Note that with the expressions given by (2.7) and (2.8), it is not necessary to calculate the PLS components $\mathbf{t}_i$ explicitly.
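Because only $\mathbf{S}_X$ and $\mathbf{s}_{y,x}$ enter these expressions, the whole fitting step can be written as a function of the covariance matrix alone, which is what makes the plug-in robustification of Section 3 possible. The sketch below is a minimal covariance-based PLS1 implementation of the recursion (2.7) and (2.8); the closing coefficient step, $\hat{\boldsymbol{\beta}} = \mathbf{W}_k(\mathbf{W}_k'\mathbf{S}_X\mathbf{W}_k)^{-1}\mathbf{W}_k'\mathbf{s}_{y,x}$, follows from the least squares regression of $y$ on the $k$ components expressed through covariances and is supplied here for completeness; the function name is illustrative.

```python
import numpy as np

def pls1_from_cov(S_x, s_yx, k):
    """PLS1 weight vectors and regression coefficients computed from
    the covariance blocks only, via the recursion (2.7)-(2.8).

    S_x  : (p, p) covariance matrix of the x-variables
    s_yx : (p,)   covariances between y and the x-variables
    k    : number of PLS components, k << p
    """
    p = S_x.shape[0]
    W = np.empty((p, k))
    # (2.7): first weight vector proportional to s_yx, normalized
    W[:, 0] = s_yx / np.linalg.norm(s_yx)
    for i in range(1, k):
        Wi = W[:, :i]
        # (2.8): remove from s_yx the part explained by previous weights
        A = Wi.T @ S_x @ Wi
        w = s_yx - S_x @ Wi @ np.linalg.solve(A, Wi.T @ s_yx)
        W[:, i] = w / np.linalg.norm(w)
    # LS regression of y on the k components, through covariances only
    q = np.linalg.solve(W.T @ S_x @ W, W.T @ s_yx)
    return W, W @ q  # weight matrix W_k and coefficient vector beta-hat
```

Running the same function on robustly estimated covariance blocks instead of the sample blocks yields the plug-in robust fit that the proposed PLS-ARWMCD method of Section 3 builds on.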