Partial least squares regression

Journal: Wiley Interdisciplinary Reviews: Computational Statistics
Manuscript ID: EOCS-039.R1
Manuscript type: Focus article
Date submitted by the author: 08-Jan-2009
Complete list of authors: Abdi, Hervé
Keywords: Partial least squares, Projection to latent structures, Principal component analysis, multicollinearity, Singular value decomposition


Partial Least Squares Regression
and
Projection on Latent Structure Regression
(PLS-Regression)

Hervé Abdi

Abstract

Partial least squares (PLS) regression (a.k.a. projection on latent structures) is a recent technique that combines features from and generalizes principal component analysis (PCA) and multiple linear regression. Its goal is to predict a set of dependent variables from a set of independent variables or predictors. This prediction is achieved by extracting from the predictors a set of orthogonal factors called latent variables which have the best predictive power. These latent variables can be used to create displays akin to PCA displays. The quality of the prediction obtained from a PLS regression model is evaluated with cross-validation techniques such as the bootstrap and jackknife. There are two main variants of PLS regression: the most common one separates the rôles of dependent and independent variables; the second one (used mostly to analyze brain imaging data) gives the same rôle to dependent and independent variables.

Keywords: Partial least squares, Projection to latent structures, Principal component analysis, Principal component regression, Multiple regression, Multicollinearity, NIPALS, Eigenvalue decomposition, Singular value decomposition, Bootstrap, Jackknife, small N large P problem.

1 Introduction

PLS regression is an acronym which originally stood for Partial Least Squares Regression, but, recently, some authors have preferred to develop this acronym as Projection to Latent Structures.


In any case, PLS regression combines features from and generalizes principal component analysis and multiple linear regression. Its goal is to analyze or predict a set of dependent variables from a set of independent variables or predictors. This prediction is achieved by extracting from the predictors a set of orthogonal factors called latent variables which have the best predictive power.

PLS regression is particularly useful when we need to predict a set of dependent variables from a (very) large set of independent variables (i.e., predictors). It originated in the social sciences (specifically, economics; Herman Wold, 1966) but became popular first in chemometrics (i.e., computational chemistry), due in part to Herman's son Svante (Wold, 2001), and in sensory evaluation (Martens & Naes, 1989). But PLS regression is also becoming a tool of choice in the social sciences as a multivariate technique for non-experimental (e.g., Fornell, Lorange, & Roos, 1990; Hulland, 1999; Graham, Evenko, & Rajan, 1992) and experimental (e.g., neuroimaging; see Worsley, 1997; McIntosh & Lobaugh, 2004; Giessing et al., 2007; Kovacevic & McIntosh, 2007; Wang et al., 2008) data alike. It was first presented as an algorithm akin to the power method (used for computing eigenvectors) but was rapidly interpreted in a statistical framework (see, e.g., Burnham et al., 1996; Garthwaite, 1994; Höskuldson, 2001; Phatak & de Jong, 1997; Tenenhaus, 1998; Ter Braak & de Jong, 1998).

Recent developments, including extensions to multiple table analysis, are explored in Höskuldson (in press) and in the volume edited by Esposito Vinzi, Chin, Henseler, and Wang (2009).

2 Prerequisite notions and notations

The I observations described by K dependent variables are stored in an I × K matrix denoted Y; the values of J predictors collected on these I observations are collected in an I × J matrix X.

3 Goal of PLS regression: Predict Y from X

The goal of PLS regression is to predict Y from X and to describe their common structure. When Y is a vector and X is a full rank matrix, this goal could be accomplished using ordinary multiple regression.


When the number of predictors is large compared to the number of observations, X is likely to be singular and the regression approach is no longer feasible (i.e., because of multicollinearity). This data configuration has recently often been called the "small N large P problem." It is characteristic of recent data analysis domains such as bio-informatics, brain imaging, chemometrics, data mining, and genomics.

3.1 Principal Component Regression

Several approaches have been developed to cope with the multicollinearity problem. For example, one approach is to eliminate some predictors (e.g., using stepwise methods; see Draper & Smith, 1998); another one is to use ridge regression (Hoerl & Kennard, 1970). One method, closely related to PLS regression, is called principal component regression (PCR): it performs a principal component analysis (PCA) of the X matrix and then uses the principal components of X as the independent variables of a multiple regression model predicting Y. Technically, in PCA, X is decomposed using its singular value decomposition (see Abdi, 2007a,b for more details) as

X = R∆Vᵀ  (1)

with

RᵀR = VᵀV = I,  (2)

(where R and V are the matrices of the left and right singular vectors), and ∆ being a diagonal matrix with the singular values as diagonal elements. The singular vectors are ordered according to their corresponding singular values, which are the square roots of the variances (i.e., eigenvalues) of X explained by the singular vectors. The columns of V are called the loadings. The columns of G = R∆ are called the factor scores or principal components of X, or simply scores or components. The matrix R of the left singular vectors of X (or the matrix G of the principal components) is then used to predict Y using standard multiple linear regression. This approach works well because the orthogonality of the singular vectors eliminates the multicollinearity problem. But the problem of choosing an optimum subset of predictors remains. A possible strategy is to keep only a few of the first components. But these components were originally chosen to explain X rather than Y, and so nothing guarantees that the principal components, which "explain" X optimally, will be relevant for the prediction of Y.
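
The following short Python sketch illustrates the PCR idea just described (SVD of the centered predictors, then ordinary regression on a subset of the components). It is only an illustration under the assumptions stated in the comments; the function and variable names (pcr_fit, n_keep) are not from the article.

```python
import numpy as np

# Minimal principal component regression (PCR) sketch for comparison with
# PLS regression: decompose X by its SVD, keep the first n_keep components,
# and regress Y on those scores. Centering conventions are an assumption.
def pcr_fit(X, Y, n_keep):
    Xc = X - X.mean(axis=0)                                  # center the predictors
    Yc = Y - Y.mean(axis=0)
    R, delta, Vt = np.linalg.svd(Xc, full_matrices=False)    # X = R Delta V'
    G = R * delta                                            # factor scores G = R Delta
    G_kept = G[:, :n_keep]                                   # components chosen to explain X, not Y
    coeffs, *_ = np.linalg.lstsq(G_kept, Yc, rcond=None)     # ordinary regression on the scores
    Y_hat = G_kept @ coeffs + Y.mean(axis=0)
    return Y_hat, Vt.T[:, :n_keep]                           # predictions and loadings V
```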


3.2 Simultaneous decomposition of predictors and dependent variables

Principal component regression decomposes X in order to obtain components which best explain X. By contrast, PLS regression finds components from X that best predict Y. Specifically, PLS regression searches for a set of components (called latent vectors) that performs a simultaneous decomposition of X and Y with the constraint that these components explain as much as possible of the covariance between X and Y. This step generalizes PCA. It is followed by a regression step where the latent vectors obtained from X are used to predict Y.

PLS regression decomposes both X and Y as a product of a common set of orthogonal factors and a set of specific loadings. So, the independent variables are decomposed as

X = TPᵀ with TᵀT = I,  (3)

with I being the identity matrix (some variations of the technique do not require T to have unit norms; these variations differ mostly by the choice of the normalization, they do not differ in their final prediction, but the differences in normalization may make comparisons between different implementations of the technique delicate). By analogy with PCA, T is called the score matrix, and P the loading matrix (in PLS regression the loadings are not orthogonal). Likewise, Y is estimated as

Ŷ = TBCᵀ,  (4)

where B is a diagonal matrix with the "regression weights" as diagonal elements and C is the "weight matrix" of the dependent variables (see below for more details on the regression weights and the weight matrix). The columns of T are the latent vectors. When their number is equal to the rank of X, they perform an exact decomposition of X. Note, however, that the latent vectors provide only an estimate of Y (i.e., in general Ŷ is not equal to Y).

4 PLS regression and covariance

The latent vectors could be chosen in a lot of different ways. In fact, in the previous formulation, any set of orthogonal vectors spanning the column space of X could be used to play the rôle of T.


In order to specify T, additional conditions are required. For PLS regression this amounts to finding two sets of weights, denoted w and c, in order to create (respectively) a linear combination of the columns of X and Y such that these two linear combinations have maximum covariance. Specifically, the goal is to obtain a first pair of vectors

t = Xw and u = Yc  (5)

with the constraints that wᵀw = 1, tᵀt = 1, and tᵀu is maximal. When the first latent vector is found, it is subtracted from both X and Y and the procedure is re-iterated until X becomes a null matrix (see the algorithm section for more).

5 NIPALS: A PLS algorithm

The properties of PLS regression can be analyzed from a sketch of the original algorithm (called NIPALS). The first step is to create two matrices: E = X and F = Y. These matrices are then column centered and normalized (i.e., transformed into Z-scores). The sums of squares of these matrices are denoted SS_X and SS_Y. Before starting the iteration process, the vector u is initialized with random values. The NIPALS algorithm then performs the following steps (in what follows the symbol ∝ means "to normalize the result of the operation"):

Step 1. w ∝ Eᵀu (estimate X weights).
Step 2. t ∝ Ew (estimate X factor scores).
Step 3. c ∝ Fᵀt (estimate Y weights).
Step 4. u = Fc (estimate Y scores).

If t has not converged, then go to Step 1; if t has converged, then compute the value of b, which is used to predict Y from t, as b = tᵀu, and compute the factor loadings for X as p = Eᵀt. Now subtract (i.e., partial out) the effect of t from both E and F as follows: E = E − tpᵀ and F = F − btcᵀ. This subtraction is called a deflation of the matrices E and F. The vectors t, u, w, c, and p are then stored in the corresponding matrices, and the scalar b is stored as a diagonal element of B. The sums of squares of X (respectively Y) explained by the latent vector are computed as pᵀp (respectively b²), and the proportion of variance explained is obtained by dividing the explained sum of squares by the corresponding total sum of squares (i.e., SS_X and SS_Y). If E is a null matrix, then the whole set of latent vectors has been found; otherwise the procedure can be re-iterated from Step 1 on.
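
A minimal Python sketch of these steps is given below. It is an illustration, not the code used for the article; the function name, the convergence tolerance, and the initialization of u from the first column of F (the text uses random values) are assumptions.

```python
import numpy as np

def nipals_pls(X, Y, n_components, tol=1e-10, max_iter=500):
    """Minimal NIPALS sketch; E and F are the standardized X and Y."""
    E = (X - X.mean(0)) / X.std(0, ddof=1)        # Z-scores of the predictors
    F = (Y - Y.mean(0)) / Y.std(0, ddof=1)        # Z-scores of the dependent variables
    T, U, W, C, P, b = [], [], [], [], [], []
    for _ in range(n_components):
        u = F[:, 0].copy()                        # initialization (text: random values)
        t_old = 0
        for _ in range(max_iter):
            w = E.T @ u; w /= np.linalg.norm(w)   # Step 1: X weights
            t = E @ w;   t /= np.linalg.norm(t)   # Step 2: X factor scores
            c = F.T @ t; c /= np.linalg.norm(c)   # Step 3: Y weights
            u = F @ c                             # Step 4: Y scores
            if np.linalg.norm(t - t_old) < tol:   # convergence of t
                break
            t_old = t
        b_l = t @ u                               # regression weight b = t'u
        p = E.T @ t                               # X loadings
        E = E - np.outer(t, p)                    # deflation of E
        F = F - b_l * np.outer(t, c)              # deflation of F
        T.append(t); U.append(u); W.append(w); C.append(c); P.append(p); b.append(b_l)
    return (np.column_stack(T), np.column_stack(U), np.column_stack(W),
            np.column_stack(C), np.column_stack(P), np.diag(b))
```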


6 PLS regression and the singular value decomposition

The NIPALS algorithm is obviously similar to the power method (for a description see, e.g., Abdi, Valentin, & Edelman, 1999), which finds eigenvectors. So PLS regression is likely to be closely related to the eigen- and singular value decompositions (see Abdi, 2007a,b for an introduction to these notions), and this is indeed the case. For example, if we start from Step 1 of the algorithm, which computes w ∝ Eᵀu, and substitute the rightmost term iteratively, we find the following series of equations:

w ∝ Eᵀu ∝ EᵀFc ∝ EᵀFFᵀt ∝ EᵀFFᵀEw.  (6)

This shows that the weight vector w is the first left singular vector of the matrix

S = EᵀF.  (7)

Similarly, the first weight vector c is the first right singular vector of S. The same argument shows that the first vectors t and u are the first eigenvectors of EEᵀFFᵀ and FFᵀEEᵀ, respectively. This last observation is important from a computational point of view because it shows that the weight vectors can also be obtained from matrices of size I by I (Rännar, Lindgren, Geladi, & Wold, 1994). This is useful when the number of variables is much larger than the number of observations (e.g., as in the "small N, large P problem").

7 Prediction of the dependent variables

The dependent variables are predicted using the multivariate regression formula as

Ŷ = TBCᵀ = XB_PLS with B_PLS = (Pᵀ)⁺BCᵀ  (8)

(where (Pᵀ)⁺ is the Moore-Penrose pseudo-inverse of Pᵀ; see Abdi, 2001). This last equation assumes that both X and Y have been standardized prior to the prediction. In order to predict a non-standardized matrix Y from a non-standardized matrix X, we use B*_PLS, which is obtained by re-introducing the original units into B_PLS and adding a first column corresponding to the intercept (when using the original units, X needs to be augmented with a first column of 1s, as in multiple regression).

If all the latent variables of X are used, this regression is equivalent to principal component regression. When only a subset of the latent variables is used, the prediction of Y is optimal for this number of predictors.

The interpretation of the latent variables is often facilitated by examining graphs akin to PCA graphs (e.g., by plotting observations in a t₁ × t₂ space, see Figure 1).
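
As a hedged illustration of Equations 6-8, the sketch below relates the NIPALS weights to the SVD of S = EᵀF and forms B_PLS. It assumes E and F are the standardized X and Y and that T, W, C, P, B come from the nipals_pls sketch above; the function name is invented for the example.

```python
import numpy as np

def pls_prediction(E, F, T, W, C, P, B):
    # First weight vectors from the SVD of S = E'F: w1 is the first left
    # singular vector and c1 the first right singular vector (up to sign).
    S = E.T @ F
    left, _, right_t = np.linalg.svd(S, full_matrices=False)
    w1_svd, c1_svd = left[:, 0], right_t[0, :]
    # Regression coefficients for standardized data (Eq. 8).
    B_pls = np.linalg.pinv(P.T) @ B @ C.T
    Y_hat = E @ B_pls                    # predicted (standardized) dependent variables
    return B_pls, Y_hat, w1_svd, c1_svd
```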


8 Statistical inference: Evaluating the quality of the prediction

8.1 Fixed effect model

The quality of the prediction obtained from the PLS regression described so far corresponds to a fixed effect model (i.e., the set of observations is considered as the population of interest, and the conclusions of the analysis are restricted to this set). In this case, the analysis is descriptive and the amount of variance (of X and Y) explained by a latent vector indicates its importance for the set of data under scrutiny. In this context, latent variables are worth considering if their interpretation is meaningful within the research context.

For a fixed effect model, the overall quality of a PLS regression model using L latent variables is evaluated by first computing the predicted matrix of dependent variables, denoted Ŷ^[L], and then measuring the similarity between Ŷ^[L] and Y. Several coefficients are available for the task. The squared coefficient of correlation is sometimes used, as well as its matrix-specific cousin the RV coefficient (Abdi, 2007c). The most popular coefficient, however, is the residual sum of squares, abbreviated as RESS. It is computed as

RESS = ‖Y − Ŷ^[L]‖²,  (9)

(where ‖ ‖ is the matrix norm, i.e., the square root of the sum of squares of the elements of the matrix). The smaller the value of RESS, the better the prediction, with a value of 0 indicating perfect prediction. For a fixed effect model, the larger L (i.e., the number of latent variables used), the better the prediction.


8.2 Random effect model

In most applications, however, the set of observations is a sample from some population of interest. In this context, the goal is to predict the value of the dependent variables for new observations originating from the same population as the sample. This corresponds to a random effect model. In this case, the amount of variance explained by a latent variable indicates its importance in the prediction of Y. In this context, a latent variable is relevant only if it improves the prediction of Y for new observations. And this, in turn, opens the problem of which and how many latent variables should be kept in the PLS regression model in order to achieve optimal generalization (i.e., optimal prediction for new observations). In order to estimate the generalization capacity of PLS regression, standard parametric approaches cannot be used, and therefore the performance of a PLS regression model is evaluated with computer-based resampling techniques such as the bootstrap and cross-validation techniques where the data are separated into a learning set (to build the model) and a testing set (to test the model). A popular example of this last approach is the jackknife (sometimes called the "leave-one-out" approach). In the jackknife (Quenouille, 1956; Efron, 1982), each observation is, in turn, dropped from the data set; the remaining observations constitute the learning set and are used to build a PLS regression model that is applied to predict the left-out observation, which then constitutes the testing set. With this procedure, each observation is predicted according to a random effect model. These predicted observations are then stored in a matrix denoted Ỹ.

For a random effect model, the overall quality of a PLS regression model using L latent variables is evaluated by using L variables to compute, according to the random model, the matrix denoted Ỹ^[L], which stores the predicted values of the observations for the dependent variables. The quality of the prediction is then evaluated as the similarity between Ỹ^[L] and Y. As for the fixed effect model, this can be done with the squared coefficient of correlation (sometimes called, in this context, the "cross-validated r"; Wakeling & Morris, 1993) as well as the RV coefficient. By analogy with the RESS coefficient, one can also use the predicted residual sum of squares, abbreviated PRESS. It is computed as

PRESS = ‖Y − Ỹ^[L]‖².  (10)

The smaller the value of PRESS, the better the prediction for a random effect model, with a value of 0 indicating perfect prediction.


8.3 How many latent variables?

By contrast with the fixed effect model, the quality of prediction for a random model does not always increase with the number of latent variables used in the model. Typically, the quality first increases and then decreases. If the quality of the prediction decreases when the number of latent variables increases, this indicates that the model is overfitting the data (i.e., the information useful to fit the observations from the learning set is not useful to fit new observations). Therefore, for a random model, it is critical to determine the optimal number of latent variables to keep for building the model. A straightforward approach is to stop adding latent variables as soon as the PRESS stops decreasing. A more elaborate approach (see, e.g., Tenenhaus, 1998) starts by computing, for the ℓth latent variable, the ratio Q²_ℓ defined as

Q²_ℓ = 1 − PRESS_ℓ / RESS_(ℓ−1),  (11)

with PRESS_ℓ (resp. RESS_(ℓ−1)) being the value of PRESS (resp. RESS) for the ℓth (resp. (ℓ−1)th) latent variable [where RESS₀ = K × (I − 1)]. A latent variable is kept if its value of Q²_ℓ is larger than some arbitrary value, generally set equal to (1 − .95²) = .0975 (an alternative set of values sets the threshold to .05 when I ≤ 100 and to 0 when I > 100; see Tenenhaus, 1998; Wold, 1995). Obviously, the choice of the threshold is important from a theoretical point of view, but, from a practical point of view, the values indicated above seem satisfactory.

8.4 Bootstrap confidence intervals for the dependent variables

When the number of latent variables of the model has been decided, confidence intervals for the predicted values can be derived using the bootstrap (Efron & Tibshirani, 1993). When using the bootstrap, a large number of samples is obtained by drawing, for each sample, observations with replacement from the learning set. Each sample provides a value of B_PLS which is used to estimate the values of the observations in the testing set. The distribution of the values of these observations is then used to estimate the sampling distribution and to derive confidence intervals.
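
A sketch of the leave-one-out evaluation of Equations 9-11 is given below. The helper fit_predict(X_train, Y_train, X_test, L), which fits a PLS regression model with L latent variables and predicts Y for X_test (for instance built on the NIPALS sketch above), is an assumed placeholder, not part of the article.

```python
import numpy as np

def press_q2(X, Y, fit_predict, max_L):
    """Jackknife sketch: RESS (Eq. 9), PRESS (Eq. 10), and Q^2 (Eq. 11)."""
    I, K = Y.shape
    ress_prev = K * (I - 1)                     # RESS_0 as defined in the text
    for L in range(1, max_L + 1):
        Y_hat = fit_predict(X, Y, X, L)         # fixed effect (fitted) values
        ress = np.sum((Y - Y_hat) ** 2)         # Eq. 9
        Y_tilde = np.vstack([                   # jackknife: predict each left-out row
            fit_predict(np.delete(X, i, 0), np.delete(Y, i, 0), X[i:i + 1], L)
            for i in range(I)])
        press = np.sum((Y - Y_tilde) ** 2)      # Eq. 10
        q2 = 1 - press / ress_prev              # Eq. 11
        print(f"L={L}  RESS={ress:.2f}  PRESS={press:.2f}  Q2={q2:.4f}")
        ress_prev = ress                        # becomes RESS_(l-1) for the next step
    # keep a latent variable if Q2 > 1 - .95**2 = .0975, or stop when PRESS rises
```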


9 A small example

We want to predict the subjective evaluation of a set of five wines. The dependent variables that we want to predict for each wine are its likeability, and how well it goes with meat or dessert (as rated by a panel of experts; see Table 1). The predictors are the price, sugar, alcohol, and acidity content of each wine (see Table 2).

The different matrices created by PLS regression are given in Tables 3 to 13. From Table 9, one can find that two latent vectors explain 98% of the variance of X and 85% of Y. This suggests that these two dimensions should be kept for the final solution as a fixed effect model. The examination of the two-dimensional regression coefficients (i.e., B_PLS, see Table 10) shows that sugar is mainly responsible for choosing a dessert wine, and that price is negatively correlated with the perceived quality of the wine (at least in this example . . . ), whereas alcohol is positively correlated with it. Looking at the latent vectors shows that t₁ expresses price and t₂ reflects sugar content. This interpretation is confirmed and illustrated in Figures 1a and 1b, which display in (a) the projections on the latent vectors of the wines (matrix T) and the predictors (matrix W), and in (b) the correlation between the original dependent variables and the projection of the wines on the latent vectors.

From Table 9, we find that PRESS reaches its minimum value for a model including only the first latent variable and that Q² is larger than .0975 only for the first latent variable. So, both PRESS and Q² suggest that a model including only the first latent variable is optimal for generalization to new observations. Consequently, we decided to keep one latent variable for the random PLS regression model. Tables 12 and 13 display the predicted values Ŷ and Ỹ when the prediction uses one latent vector.
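
Readers who want to reproduce the computations can enter the data of Tables 1 and 2 directly and run the nipals_pls sketch given earlier (again, an illustration rather than the code used for the article):

```python
import numpy as np

# Wine data from Tables 1 (Y) and 2 (X).
Y = np.array([[14, 7, 8],
              [10, 7, 6],
              [ 8, 5, 5],
              [ 2, 4, 7],
              [ 6, 2, 4]], dtype=float)       # hedonic, goes with meat, goes with dessert
X = np.array([[ 7, 7, 13, 7],
              [ 4, 3, 14, 7],
              [10, 5, 12, 5],
              [16, 7, 11, 3],
              [13, 3, 10, 3]], dtype=float)   # price, sugar, alcohol, acidity

T, U, W, C, P, B = nipals_pls(X, Y, n_components=3)   # compare with Tables 3 to 8
```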


Table 1: The matrix Y of the dependent variables.

Wine   Hedonic   Goes with meat   Goes with dessert
1      14        7                8
2      10        7                6
3       8        5                5
4       2        4                7
5       6        2                4

Table 2: The X matrix of predictors.

Wine   Price   Sugar   Alcohol   Acidity
1       7       7       13        7
2       4       3       14        7
3      10       5       12        5
4      16       7       11        3
5      13       3       10        3

Table 3: The matrix T.

Wine     t1         t2         t3
1        0.4538    −0.4662     0.5716
2        0.5399     0.4940    −0.4631
3        0          0          0
4       −0.4304    −0.5327    −0.5301
5       −0.5633     0.5049     0.4217


Table 4: The matrix U.

Wine     u1         u2         u3
1        1.9451    −0.7611     0.6191
2        0.9347     0.5305    −0.5388
3       −0.2327     0.6084     0.0823
4       −0.9158    −1.1575    −0.6139
5       −1.7313     0.7797     0.4513

Table 5: The matrix P.

           p1         p2         p3
Price     −1.8706    −0.6845    −0.1796
Sugar      0.0468    −1.9977     0.0829
Alcohol    1.9547     0.0283    −0.4224
Acidity    1.9874     0.0556     0.2170

Table 6: The matrix W.

           w1         w2         w3
Price     −0.5137    −0.3379    −0.3492
Sugar      0.2010    −0.9400     0.1612
Alcohol    0.5705    −0.0188    −0.8211
Acidity    0.6085     0.0429     0.4218

Table 7: The matrix C.

                      c1         c2         c3
Hedonic               0.6093     0.0518     0.9672
Goes with meat        0.7024    −0.2684    −0.2181
Goes with dessert     0.3680    −0.9619    −0.1301


[Figure 1 near here. Panel (a): the five wines and the four predictors (Price, Sugar, Alcohol, Acidity) plotted in the space of the first two latent vectors (LV1 × LV2). Panel (b): circle of correlations for the dependent variables (Hedonic, Meat, Dessert).]

Figure 1: PLS regression. (a) Projection of the wines and the predictors on the first two latent vectors (respectively matrices T and W). (b) Circle of correlations showing the correlation between the original dependent variables (matrix Y) and the latent vectors (matrix T).

Table 8: The b vector.

b1        b2        b3
2.7568    1.6272    1.1191

10 Symmetric PLS regression: BPLS regression

Interestingly, two different, but closely related, techniques exist under the name of PLS regression. The technique described so far originated from the work of Wold and Martens. In this version of PLS regression, the latent variables are computed from a succession of singular value decompositions followed by deflation of both X and Y. The goal of the analysis is to predict Y from X and therefore the rôles of X and Y are asymmetric. As a consequence, the latent variables computed to predict Y from X are different from the latent variables computed to predict X from Y.

A related technique, also called PLS regression, originated from the work of Bookstein (1994; see also Tucker, 1958, for early related ideas, and McIntosh, Bookstein, Haxby, & Grady, 1996, and Bookstein, Streissguth, Sampson, Connor, & Barr, 2002, for later applications). To distinguish this version of PLS regression from the previous one, we will call it BPLS regression. This technique is particularly popular for the analysis of brain imaging data (probably because it requires much less computational time, which is critical taking into account the very large size of brain imaging data sets). Just like standard PLS regression (cf. Equations 6 and 7), BPLS regression starts with the matrix

S = XᵀY.  (12)

The matrix S is then decomposed using its singular value decomposition as

S = WΘCᵀ with WᵀW = CᵀC = I,  (13)

(where W and C are the matrices of the left and right singular vectors of S, and Θ is the diagonal matrix of the singular values; cf. Equation 1). In BPLS regression, the latent variables for X and Y are obtained as (cf. Equation 5)

T = XW and U = YC.  (14)
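
A minimal sketch of Equations 12-14 in Python follows (illustrative only; whether X and Y are centered or standardized beforehand is left to the analyst, and the function name is invented for the example):

```python
import numpy as np

def bpls(X, Y):
    """BPLS regression sketch: one SVD of S = X'Y gives W, Theta, and C."""
    S = X.T @ Y                                              # Eq. 12
    W, theta, Ct = np.linalg.svd(S, full_matrices=False)     # Eq. 13: S = W Theta C'
    C = Ct.T
    T, U = X @ W, Y @ C                                      # Eq. 14: latent variables
    return T, U, W, C, theta
```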


Table 9: Variance of X and Y explained by the latent vectors, RESS, PRESS, and Q².

Latent   % Explained      Cumulative %     % Explained      Cumulative %
Vector   Variance for X   for X            Variance for Y   for Y            RESS      PRESS       Q²
1        70               70               63               63               32.11      95.11        7.93
2        28               98               22               85               25.00     254.86     −280
3         2              100               10               95                1.25     101.56     −202.89

Table 10: The matrix B_PLS when two latent vectors are used.

           Hedonic    Goes with meat   Goes with dessert
Price     −0.2662    −0.2498            0.0121
Sugar      0.0616     0.3197            0.7900
Alcohol    0.2969     0.3679            0.2568
Acidity    0.3011     0.3699            0.2506

Table 11: The matrix B*_PLS when two latent vectors are used.

             Hedonic    Goes with meat   Goes with dessert
Intercept   −3.2809    −3.3770           −1.3909
Price       −0.2559    −0.1129            0.0063
Sugar        0.1418     0.3401            0.6227
Alcohol      0.8080     0.4856            0.2713
Acidity      0.6870     0.3957            0.1919

Table 12: The matrix Ŷ when one latent vector is used.

Wine   Hedonic   Goes with meat   Goes with dessert
1      11.4088   6.8641           6.7278
2      12.0556   7.2178           6.8659
3       8.0000   5.0000           6.0000
4       4.7670   3.2320           5.3097
5       3.7686   2.6860           5.0965


Table 13: The matrix Ỹ when one latent vector is used.

Wine   Hedonic   Goes with meat   Goes with dessert
1       8.5877   5.7044           5.5293
2      12.7531   7.0394           7.6005
3       8.0000   5.0000           6.2500
4       6.8500   3.1670           4.4250
5       3.9871   4.1910           6.5748

Because BPLS regression uses a single singular value decomposition to compute the latent variables, they will be identical if the rôles of X and Y are reversed: BPLS regression treats X and Y symmetrically. So, while standard PLS regression is akin to multiple regression, BPLS regression is akin to correlation or canonical correlation (Gittins, 1985). BPLS regression, however, differs from canonical correlation because BPLS regression extracts the variance common to X and Y, whereas canonical correlation seeks linear combinations of X and Y having the largest correlation. In fact, the name partial least squares covariance analysis or canonical covariance analysis would probably be more appropriate for BPLS regression.

10.1 Varieties of BPLS regression

BPLS regression exists in three main varieties, one of which is specific to brain imaging. The first variety of BPLS regression is used to analyze experimental results; it is called Behavior BPLS regression if the Y matrix consists of measures, or Task BPLS regression if the Y matrix consists of contrasts or describes the experimental conditions with dummy coding.

The second variety is called mean centered task BPLS regression and is closely related to barycentric discriminant analysis (e.g., discriminant correspondence analysis; see Abdi, 2007d). Like discriminant analysis, this approach is suited for data in which the observations originate from groups defined a priori, but, unlike discriminant analysis, it can be used for small N, large P problems. The X matrix contains the deviations of the observations to the average vector of all the observations, and the Y matrix uses a dummy code to identify the group to which each observation belongs (i.e., Y has as many columns as there are groups, with a value of 1 at the intersection of the ith row and the kth column indicating that the ith row belongs to the kth group, whereas a value of 0 indicates that it does not). With this coding scheme, the S matrix contains the group barycenters, and the BPLS regression analysis of this matrix is equivalent to a PCA of the matrix of the barycenters (which is the first step of barycentric discriminant analysis). A short illustration of this coding step is sketched below.
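
The sketch below illustrates the coding step of mean centered task BPLS regression and reuses the bpls() sketch given above. The data matrix and the group labels are invented for the example; they are not from the article.

```python
import numpy as np

# Illustrative inputs only: a small observations-by-variables matrix and the
# a priori group label of each observation (both invented for this example).
data = np.array([[1.0, 2.0, 0.5],
                 [0.8, 2.2, 0.4],
                 [2.1, 0.9, 1.5],
                 [2.0, 1.1, 1.6]])
groups = np.array([0, 0, 1, 1])

X = data - data.mean(axis=0)             # deviations from the grand mean
Y = np.eye(groups.max() + 1)[groups]     # dummy code: one column per group
T, U, W, C, theta = bpls(X, Y)           # S = X'Y holds the group barycenters
```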


The third variety, which is specific to brain imaging, is called Seed PLS regression. It is used to study patterns of connectivity between brain regions. Here the columns of a matrix of brain measurements (where rows are scans and columns are voxels) are partitioned into two sets: a small one called the seed and a larger one representing the rest of the brain. In this context, the S matrix contains the correlations between the columns of the seed and the rest of the brain. The analysis of the S matrix reveals the pattern of connectivity between the seed and the rest of the brain.

11 Relationship with other techniques

PLS regression is obviously related to canonical correlation (see Gittins, 1985), STATIS, and multiple factor analysis (see Abdi & Valentin, 2007a,b for an introduction to these techniques). These relationships are explored in detail by Tenenhaus (1998), Pagès and Tenenhaus (2001), Abdi (2003b), Rosipal and Krämer (2006), and in the volume edited by Esposito Vinzi, Chin, Henseler, and Wang (2009). The main originality of PLS regression is to preserve the asymmetry of the relationship between predictors and dependent variables, whereas these other techniques treat them symmetrically.

By contrast, BPLS regression is a symmetric technique and therefore is closely related to canonical correlation, but BPLS regression seeks to extract the variance common to X and Y whereas canonical correlation seeks linear combinations of X and Y having the largest correlation (some connections between BPLS regression and other multivariate techniques relevant for brain imaging are explored in Kherif et al., 2002; Friston & Büchel, 2004; Lazar, 2008). The relationships between BPLS regression and STATIS or multiple factor analysis have not been analyzed formally, but these techniques are likely to provide similar conclusions.


12 Software

PLS regression necessitates sophisticated computations and therefore its application depends on the availability of software. For chemistry, two main programs are used: the first one, called SIMCA-P, was developed originally by Wold; the second one, called the Unscrambler, was first developed by Martens. For brain imaging, SPM, which is one of the most widely used programs in this field, has recently (2002) integrated a PLS regression module. Outside these domains, several standard commercial statistical packages (e.g., SAS, SPSS, Statistica) include PLS regression. The public domain R language also includes PLS regression. A dedicated public domain program called SmartPLS is also available.

In addition, interested readers can download a set of MATLAB programs from the author's home page (www.utdallas.edu/~herve). Also, a public domain set of MATLAB programs is available from the home page of the N-Way project (www.models.kvl.dk/source/nwaytoolbox/) along with tutorials and examples. Staying with MATLAB, the Statistics Toolbox includes a PLS regression routine. For brain imaging (a domain where the Bookstein approach is, by far, the most popular PLS regression approach), a special toolbox written in MATLAB (by McIntosh, Chau, Lobaugh, & Chen) is freely available from www.rotman-baycrest.on.ca:8080. And, finally, a commercial MATLAB toolbox has also been developed by Eigenresearch.

References

[1] Abdi, H. (2001). Linear algebra for neural networks. In N.J. Smelser & P.B. Baltes (Eds.), International encyclopedia of the social and behavioral sciences. Oxford (UK): Elsevier.
[2] Abdi, H. (2003a). PLS regression. In M. Lewis-Beck, A. Bryman, & T. Futing (Eds.), Encyclopedia for research methods for the social sciences (pp. 792–795). Thousand Oaks: Sage.
[3] Abdi, H. (2003b). Multivariate analysis. In M. Lewis-Beck, A. Bryman, & T. Futing (Eds.), Encyclopedia for research methods for the social sciences (pp. 699–702). Thousand Oaks: Sage.
[4] Abdi, H. (2007a). Singular value decomposition (SVD) and generalized singular value decomposition (GSVD). In N.J. Salkind (Ed.), Encyclopedia of measurement and statistics (pp. 907–912). Thousand Oaks: Sage.


[5] Abdi, H. (2007b). Eigen-decomposition: Eigenvalues and eigenvectors. In N.J. Salkind (Ed.), Encyclopedia of measurement and statistics (pp. 304–308). Thousand Oaks: Sage.
[6] Abdi, H. (2007c). RV coefficient and congruence coefficient. In N.J. Salkind (Ed.), Encyclopedia of measurement and statistics (pp. 849–853). Thousand Oaks: Sage.
[7] Abdi, H. (2007d). Discriminant correspondence analysis. In N.J. Salkind (Ed.), Encyclopedia of measurement and statistics (pp. 270–275). Thousand Oaks: Sage.
[8] Abdi, H., & Valentin, D. (2007a). STATIS. In N.J. Salkind (Ed.), Encyclopedia of measurement and statistics (pp. 955–962). Thousand Oaks: Sage.
[9] Abdi, H., & Valentin, D. (2007b). Multiple factor analysis. In N.J. Salkind (Ed.), Encyclopedia of measurement and statistics (pp. 651–657). Thousand Oaks: Sage.
[10] Abdi, H., Valentin, D., & Edelman, B. (1999). Neural networks. Thousand Oaks (CA): Sage.
[11] Burnham, A.J., Viveros, R., & MacGregor, J.F. (1996). Frameworks for latent variable multivariate regression. Journal of Chemometrics, 10, 31–45.
[12] Draper, N.R., & Smith, H. (1998). Applied regression analysis (3rd edition). New York: Wiley.
[13] Efron, B. (1982). The jackknife, the bootstrap and other resampling plans. Philadelphia: SIAM.
[14] Efron, B., & Tibshirani, R.J. (1993). An introduction to the bootstrap. New York: Chapman and Hall.
[15] Escofier, B., & Pagès, J. (1988). Analyses factorielles multiples. Paris: Dunod.
[16] Esposito Vinzi, V., Chin, W.W., Henseler, J., & Wang, H. (Eds.) (2009). Handbook of partial least squares: Concepts, methods and applications in marketing and related fields. New York: Springer Verlag.
[17] Fornell, C., Lorange, P., & Roos, J. (1990). The cooperative venture formation process: A latent variable structural modeling approach. Management Science, 36, 1246–1255.
[18] Frank, I.E., & Friedman, J.H. (1993). A statistical view of chemometrics regression tools. Technometrics, 35, 109–148.


[19] Friston, K., & Büchel, C. (2004). Functional integration. In R.S.J. Frackowiak, K.J. Friston, C.D. Frith, R.J. Dolan, C.J. Price, S. Zeki, J.T. Ashburner, & W.D. Penny (Eds.), Human brain function (pp. 999–1019). New York: Elsevier.
[20] Kovacevic, N., & McIntosh, R. (2007). Groupwise independent component decomposition of EEG data and partial least square analysis. NeuroImage, 35, 1103–1112.
[21] Garthwaite, P. (1994). An interpretation of partial least squares. Journal of the American Statistical Association, 89, 122–127.
[22] Graham, J.L., Evenko, L.I., & Rajan, M.N. (1992). An empirical comparison of Soviet and American business negotiations. Journal of International Business Studies, 5, 387–418.
[23] Giessing, C., Fink, G.R., Rösler, F., & Thiel, C.M. (2007). fMRI data predict individual differences of behavioral effects of nicotine: A partial least square analysis. Journal of Cognitive Neuroscience, 19, 658–670.
[24] Gittins, R. (1985). Canonical analysis: A review with applications in ecology. New York: Springer.
[25] Helland, I.S. (1990). PLS regression and statistical models. Scandinavian Journal of Statistics, 17, 97–114.
[26] Hoerl, A.E., & Kennard, R.W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.
[27] Höskuldson, A. (1988). PLS regression methods. Journal of Chemometrics, 2, 211–228.
[28] Höskuldson, A. (in press). Modelling procedures for directed network of data blocks. Chemometrics and Intelligent Laboratory Systems.
[29] Höskuldson, A. (2001). Weighting schemes in multivariate data analysis. Journal of Chemometrics, 15, 371–396.
[30] Hulland, J. (1999). Use of partial least squares in strategic management research: A review of four recent studies. Strategic Management Journal, 20, 195–204.
[31] Geladi, P., & Kowalski, B. (1986). Partial least square regression: A tutorial. Analytica Chimica Acta, 35, 1–17.
[32] Kherif, F., Poline, J.B., Flandin, G., Benali, H., Simon, O., Dehaene, S., & Worsley, K.J. (2002). Multivariate model specification for fMRI data. NeuroImage, 16, 1068–1083.
[33] Lazar, N.A. (2008). The statistical analysis of functional MRI data. New York: Springer.


[34] McIntosh, A.R., Bookstein, F.L., Haxby, J.V., & Grady, C.L. (1996). Spatial pattern analysis of functional brain images using partial least squares. NeuroImage, 3, 143–157.
[35] McIntosh, A.R., & Lobaugh, N.J. (2004). Partial least squares analysis of neuroimaging data: Applications and advances. NeuroImage, 23, 250–263.
[36] Martens, H., & Naes, T. (1989). Multivariate calibration. London: Wiley.
[37] Martens, H., & Martens, M. (2001). Multivariate analysis of quality: An introduction. London: Wiley.
[38] Pagès, J., & Tenenhaus, M. (2001). Multiple factor analysis combined with PLS regression path modeling. Application to the analysis of relationships between physicochemical variables, sensory profiles and hedonic judgments. Chemometrics and Intelligent Laboratory Systems, 58, 261–273.
[39] Phatak, A., & de Jong, S. (1997). The geometry of partial least squares. Journal of Chemometrics, 11, 311–338.
[40] Quenouille, M. (1956). Notes on bias in estimation. Biometrika, 43, 353–360.
[41] Rännar, S., Lindgren, F., Geladi, P., & Wold, S. (1994). A PLS regression kernel algorithm for data sets with many variables and fewer objects. Part 1: Theory and algorithms. Journal of Chemometrics, 8, 111–125.
[42] Rosipal, R., & Krämer, N. (2006). Overview and recent advances in partial least squares. In C. Saunders et al. (Eds.), Subspace, latent structure and feature selection: Statistical and optimization perspectives workshop (SLSFS 2005) (pp. 34–51). New York: Springer Verlag.
[43] Tenenhaus, M. (1998). La régression PLS. Paris: Technip.
[44] Ter Braak, C.J.F., & de Jong, S. (1998). The objective function of partial least squares regression. Journal of Chemometrics, 12, 41–54.
[45] Tucker, L.R. (1958). Determination of parameters of a functional relation by factor analysis. Psychometrika, 23, 19–23.
[46] Wang, J.Y., Bakhadirov, K., Devous, M.D. Sr., Abdi, H., McColl, R., Moore, C., Marquez de la Plata, C.D., Ding, K., Whittemore, A., Babcock, E., Rickbeil, E.T., Dobervich, J., Kroll, D., Dao, B., Mohindra, N., & Diaz-Arrastia, R. (2008). Diffusion tensor tractography of traumatic diffuse axonal injury. Archives of Neurology, 65, 619–626.
[47] Wakeling, I.N., & Morris, J. (1993). A test for significance for partial least squares regression. Journal of Chemometrics, 7, 291–304.


[48] Wold, H. (1966). Estimation of principal components and related models by iterative least squares. In P.R. Krishnaiah (Ed.), Multivariate analysis (pp. 391–420). New York: Academic Press.
[49] Wold, S. (2001). Personal memories of the early PLS development. Chemometrics and Intelligent Laboratory Systems, 58, 83–84.
[50] Worsley, K.J. (1997). An overview and some new developments in the statistical analysis of PET and fMRI data. Human Brain Mapping, 5, 254–258.
