PLS-Regression
Chemometrics and Intelligent Laboratory Systems 58 (2001) 109–130
www.elsevier.com/locate/chemometrics

PLS-regression: a basic tool of chemometrics

Svante Wold a,*, Michael Sjöström a, Lennart Eriksson b

a Research Group for Chemometrics, Institute of Chemistry, Umeå University, SE-901 87 Umeå, Sweden
b Umetrics AB, Box 7960, SE-907 19 Umeå, Sweden

Abstract

PLS-regression (PLSR) is the PLS approach in its simplest, and in chemistry and technology, most used form (two-block predictive PLS). PLSR is a method for relating two data matrices, X and Y, by a linear multivariate model, but goes beyond traditional regression in that it models also the structure of X and Y. PLSR derives its usefulness from its ability to analyze data with many, noisy, collinear, and even incomplete variables in both X and Y. PLSR has the desirable property that the precision of the model parameters improves with the increasing number of relevant variables and observations. This article reviews PLSR as it has developed to become a standard tool in chemometrics and used in chemistry and engineering. The underlying model and its assumptions are discussed, and commonly used diagnostics are reviewed together with the interpretation of resulting parameters. Two examples are used as illustrations: First, a Quantitative Structure–Activity Relationship (QSAR)/Quantitative Structure–Property Relationship (QSPR) data set of peptides is used to outline how to develop, interpret and refine a PLSR model. Second, a data set from the manufacturing of recycled paper is analyzed to illustrate time series modelling of process data by means of PLSR and time-lagged X-variables. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: PLS; PLSR; Two-block predictive PLS; Latent variables; Multivariate analysis
* Corresponding author. Tel.: +46-90-786-5563; fax: +46-90-13-88-85. E-mail address: [email protected] (S. Wold).

1. Introduction

In this article we review a particular type of multivariate analysis, namely PLS-regression, which uses the two-block predictive PLS model to model the relationship between two matrices, X and Y. In addition, PLSR models the "structure" of X and of Y, which gives richer results than the traditional multiple regression approach. PLSR and similar approaches provide quantitative multivariate modelling methods, with inferential possibilities similar to multiple regression, t-tests and ANOVA.

The present volume contains numerous examples of the use of PLSR in chemistry, and this article is merely an introductory review, showing the development of PLSR in chemistry until around the year 1990.

1.1. General considerations

PLS-regression (PLSR) is a recently developed generalization of multiple linear regression (MLR) [1–6]. PLSR is of particular interest because, unlike MLR, it can analyze data with strongly collinear (correlated), noisy, and numerous X-variables, and also simultaneously model several response variables, Y, i.e., profiles of performance. For the meaning of the PLS acronym, see Section 1.2.

The regression problem, i.e., how to model one or several dependent variables, responses, Y, by means of a set of predictor variables, X, is one of the most common data-analytical problems in science and technology. Examples in chemistry include relating Y = properties of chemical samples to X = their chemical composition, relating Y = the quality and quantity of manufactured products to X = the conditions of the manufacturing process, and Y = chemical properties, reactivity or biological activity of a set of molecules to X = their chemical structure (coded by means of many X-variables). The latter models are often called QSPR or QSAR. Abbreviations are explained in Section 1.3.

Traditionally, this modelling of Y by means of X is done using MLR, which works well as long as the X-variables are fairly few and fairly uncorrelated, i.e., X has full rank. With modern measuring instrumentation, including spectrometers, chromatographs and sensor batteries, the X-variables tend to be many and also strongly correlated. We shall therefore not call them "independent", but instead "predictors", or just X-variables, because they usually are correlated, noisy, and incomplete.

In handling numerous and collinear X-variables, and response profiles (Y), PLSR allows us to investigate more complex problems than before, and analyze available data in a more realistic way. However, some humility and caution is warranted; we are still far from a good understanding of the complications of chemical, biological, and economic systems. Also, quantitative multivariate analysis is still in its infancy, particularly in applications with many variables and few observations (objects, cases).

1.2. A historical note

The PLS approach was originated around 1975 by Herman Wold for the modelling of complicated data sets in terms of chains of matrices (blocks), so-called path models, reviewed in Ref. [1]. This included a simple but efficient way to estimate the parameters in these models called NIPALS (Non-linear Iterative PArtial Least Squares). This led, in turn, to the acronym PLS for these models (Partial Least Squares). This relates to the central part of the estimation, namely that each model parameter is iteratively estimated as the slope of a simple bivariate regression (least squares) between a matrix column or row as the y-variable, and another parameter vector as the x-variable. So, for instance, the PLS weights, w, are iteratively re-estimated as X'u/(u'u) (see Section 3.1). The "partial" in PLS indicates that this is a partial regression, since the x-vector (u above) is considered as fixed in the estimation. This also shows that we can see any matrix–vector multiplication as equivalent to a set of simple bivariate regressions. This provides an intriguing connection between two central operations in matrix algebra and statistics, as well as giving a simple way to deal with missing data.

Gerlach et al. [7] applied multi-block PLS to the analysis of analytical data from a river system in Colorado with interesting results, but this was clearly ahead of its time.

Around 1980, the simplest PLS model with two blocks (X and Y) was slightly modified by Svante Wold and Harald Martens to better suit data from science and technology, and shown to be useful to deal with complicated data sets where ordinary regression was difficult or impossible to apply. To give PLS a more descriptive meaning, H. Wold et al. have also recently started to interpret PLS as Projection to Latent Structures.

1.3. Abbreviations

AA       Amino Acid
ANOVA    ANalysis Of VAriance
AR       AutoRegressive (model)
ARMA     AutoRegressive Moving Average (model)
CV       Cross-Validation
CVA      Canonical Variates Analysis
DModX    Distance to Model in X-space
EM       Expectation Maximization
H-PLS    Hierarchical PLS
LDA      Linear Discriminant Analysis
LV       Latent Variable
MA       Moving Average (model)
MLR      Multiple Linear Regression
MSPC     Multivariate SPC
NIPALS   Non-linear Iterative Partial Least Squares
NN       Neural Networks
PCA      Principal Components Analysis
PCR      Principal Components Regression
PLS      Partial Least Squares projection to latent structures
PLSR     PLS-Regression
PLS-DA   PLS Discriminant Analysis
PRESD    Predictive RSD
PRESS    Predictive Residual Sum of Squares
QSAR     Quantitative Structure–Activity Relationship
QSPR     Quantitative Structure–Property Relationship
RSD      Residual SD
SD       Standard Deviation
SDEP, SEP  Standard error of prediction
SECV     Standard error of cross-validation
SIMCA    Simple Classification Analysis
SPC      Statistical Process Control
SS       Sum of Squares
VIP      Variable Influence on Projection

1.4. Notation

We shall employ the common notation where column vectors are denoted by bold lower case characters, e.g., v, and row vectors are shown as transposed, e.g., v'. Bold upper case characters denote matrices, e.g., X.

*        multiplication, e.g., A*B
'        transpose, e.g., v', X'
a        index of components (model dimensions); a = 1, 2, ..., A
A        number of components in a PC or PLS model
i        index of objects (observations, cases); i = 1, 2, ..., N
N        number of objects
C        the (M*A) Y-weight matrix; c_a are columns in this matrix
E        the (N*K) matrix of X-residuals
f_m      residuals of the mth y-variable; (N*1) vector
F        the (N*M) matrix of Y-residuals
G        the number of CV groups (g = 1, 2, ..., G)
p_a      PLSR X-loading vector of component a
P        loading matrix; p_a are columns of P
R²       multiple correlation coefficient; amount Y "explained" in terms of SS
R²X      amount X "explained" in terms of SS
Q²       cross-validated R²; amount Y "predicted"
t_a      X-scores of component a
T        score matrix (N*A), where the columns are t_a
u_a      Y-scores of component a
U        score matrix (N*A), where the columns are u_a
w_a      PLSR X-weights of component a
W        the (K*A) X-weight matrix; w_a are columns in this matrix
w*_a     PLSR weights transformed to be independent between components
W*       the (K*A) matrix of transformed PLSR weights; w*_a are columns in W*

2. Example 1, a quantitative structure–property relationship (QSPR)

We use a simple example from the literature with one Y-variable and seven X-variables. The problem is one of QSPR or QSAR, which differ only in that the response(s) Y are chemical properties in the former and biological activities in the latter. In both cases, X contains a quantitative description of the
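To make the preceding machinery concrete for a problem of this shape (one y-variable, a handful of collinear X-variables), the following NumPy sketch implements a common single-y PLS variant ("PLS1") with deflation, using the notation of Section 1.4. It is an illustration on hypothetical, centered stand-in data, not the literature data set of this example nor the article's reference implementation; note how each step is a simple bivariate regression in the NIPALS sense (w proportional to X'y, c the least-squares slope of y on t), and how W(P'W)^{-1} corresponds to the transformed weights W*.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Minimal single-y PLS ("PLS1") sketch: returns coefficients b with
    y_hat = X @ b, assuming column-centered X (N*K) and y (N*1)."""
    Xa, ya = X.copy(), y.copy()
    W, P, C = [], [], []
    for _ in range(n_components):
        w = Xa.T @ ya                    # X-weights: regress X's columns on y
        w /= np.linalg.norm(w)           # normalize the weight vector w_a
        t = Xa @ w                       # X-scores t_a
        tt = (t.T @ t).item()
        p = Xa.T @ t / tt                # X-loadings p_a (slopes of X on t)
        c = (ya.T @ t).item() / tt       # Y-weight c_a (slope of y on t)
        Xa = Xa - t @ p.T                # deflate X by the modelled part
        ya = ya - t * c                  # deflate y
        W.append(w); P.append(p); C.append(c)
    W, P = np.hstack(W), np.hstack(P)
    C = np.array(C).reshape(-1, 1)
    b = W @ np.linalg.solve(P.T @ W, C)  # b = W (P'W)^{-1} c, i.e., b = W* c
    return b
```

For full-rank X, using as many components as X-variables reproduces the MLR solution; the practical interest of PLSR lies in stopping at a smaller A, letting a few score vectors t_a summarize the many collinear X-variables.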