Chemometrics and Intelligent Laboratory Systems 58 (2001) 109–130

PLS-regression: a basic tool of chemometrics

Svante Wold (a,*), Michael Sjöström (a), Lennart Eriksson (b)

(a) Research Group for Chemometrics, Institute of Chemistry, Umeå University, SE-901 87 Umeå, Sweden
(b) Umetrics AB, Box 7960, SE-907 19 Umeå, Sweden

Abstract

PLS-regression (PLSR) is the PLS approach in its simplest, and in chemistry and technology, most used form (two-block predictive PLS). PLSR is a method for relating two data matrices, X and Y, by a linear multivariate model, but goes beyond traditional regression in that it models also the structure of X and Y. PLSR derives its usefulness from its ability to analyze data with many, noisy, collinear, and even incomplete variables in both X and Y. PLSR has the desirable property that the precision of the model parameters improves with the increasing number of relevant variables and observations.

This article reviews PLSR as it has developed to become a standard tool in chemometrics and used in chemistry and engineering. The underlying model and its assumptions are discussed, and commonly used diagnostics are reviewed together with the interpretation of resulting parameters. Two examples are used as illustrations: First, a Quantitative Structure–Activity Relationship (QSAR)/Quantitative Structure–Property Relationship (QSPR) data set of peptides is used to outline how to develop, interpret and refine a PLSR model. Second, a data set from the manufacturing of recycled paper is analyzed to illustrate time series modelling of process data by means of PLSR and time-lagged X-variables.

© 2001 Elsevier Science B.V. All rights reserved.

Keywords: PLS; PLSR; Two-block predictive PLS; Latent variables; Multivariate analysis

1. Introduction

In this article we review a particular type of multivariate analysis, namely PLS-regression, which uses the two-block predictive PLS model to model the relationship between two matrices, X and Y. In addition, PLSR models the "structure" of X and of Y, which gives richer results than the traditional multiple regression approach. PLSR and similar approaches provide quantitative multivariate modelling methods, with inferential possibilities similar to multiple regression, t-tests and ANOVA.

The present volume contains numerous examples of the use of PLSR in chemistry, and this article is merely an introductory review, showing the development of PLSR in chemistry until around the year 1990.

1.1. General considerations

PLS-regression (PLSR) is a recently developed generalization of multiple linear regression (MLR) [1–6]. PLSR is of particular interest because, unlike MLR, it can analyze data with strongly collinear (correlated), noisy, and numerous X-variables, and also simultaneously model several response variables, Y, i.e., profiles of performance. For the meaning of the PLS acronym, see Section 1.2.

* Corresponding author. Tel.: +46-90-786-5563; fax: +46-90-13-88-85. E-mail address: [email protected] (S. Wold).


The regression problem, i.e., how to model one or several dependent variables, responses, Y, by means of a set of predictor variables, X, is one of the most common data-analytical problems in science and technology. Examples in chemistry include relating Y = properties of chemical samples to X = their chemical composition, relating Y = the quality and quantity of manufactured products to X = the conditions of the manufacturing process, and Y = chemical properties, reactivity or biological activity of a set of molecules to X = their chemical structure (coded by means of many X-variables). The latter models are often called QSPR or QSAR. Abbreviations are explained in Section 1.3.

Traditionally, this modelling of Y by means of X is done using MLR, which works well as long as the X-variables are fairly few and fairly uncorrelated, i.e., X has full rank. With modern measuring instrumentation, including spectrometers, chromatographs and sensor batteries, the X-variables tend to be many and also strongly correlated. We shall therefore not call them "independent", but instead "predictors", or just X-variables, because they usually are correlated, noisy, and incomplete.

In handling numerous and collinear X-variables, and response profiles (Y), PLSR allows us to investigate more complex problems than before, and analyze available data in a more realistic way. However, some humility and caution is warranted; we are still far from a good understanding of the complications of chemical, biological, and economical systems. Also, quantitative multivariate analysis is still in its infancy, particularly in applications with many variables and few observations (objects, cases).

1.2. A historical note

The PLS approach was originated around 1975 by Herman Wold for the modelling of complicated data sets in terms of chains of matrices (blocks), so-called path models, reviewed in Ref. [1]. This included a simple but efficient way to estimate the parameters in these models called NIPALS (Non-linear Iterative PArtial Least Squares). This led, in turn, to the acronym PLS for these models (Partial Least Squares). This relates to the central part of the estimation, namely that each model parameter is iteratively estimated as the slope of a simple bivariate regression (least squares) between a matrix column or row as the y-variable, and another parameter vector as the x-variable. So, for instance, the PLS weights, w, are iteratively re-estimated as X'u/(u'u) (see Section 3.10). The "partial" in PLS indicates that this is a partial regression, since the x-vector (u above) is considered as fixed in the estimation. This also shows that we can see any matrix-vector multiplication as equivalent to a set of simple bivariate regressions. This provides an intriguing connection between two central operations in matrix algebra and statistics, as well as giving a simple way to deal with missing data.

Gerlach et al. [7] applied multi-block PLS to the analysis of analytical data from a river system in Colorado with interesting results, but this was clearly ahead of its time.

Around 1980, the simplest PLS model with two blocks (X and Y) was slightly modified by Svante Wold and Harald Martens to better suit data from science and technology, and shown to be useful to deal with complicated data sets where ordinary regression was difficult or impossible to apply. To give PLS a more descriptive meaning, H. Wold et al. have also recently started to interpret PLS as Projection to Latent Structures.

1.3. Abbreviations

AA          Amino Acid
ANOVA       ANalysis Of VAriance
AR          AutoRegressive (model)
ARMA        AutoRegressive Moving Average (model)
CV          Cross-Validation
CVA         Canonical Variates Analysis
DModX       Distance to Model in X-space
EM          Expectation Maximization
H-PLS       Hierarchical PLS
LDA         Linear Discriminant Analysis
LV          Latent Variable
MA          Moving Average (model)
MLR         Multiple Linear Regression
MSPC        Multivariate SPC
NIPALS      Non-linear Iterative Partial Least Squares
NN          Neural Networks
PCA         Principal Components Analysis
PCR         Principal Components Regression
PLS         Partial Least Squares projection to latent structures
PLSR        PLS-Regression
PLS-DA      PLS Discriminant Analysis
PRESD       Predictive RSD
PRESS       Predictive Residual Sum of Squares
QSAR        Quantitative Structure–Activity Relationship
QSPR        Quantitative Structure–Property Relationship
RSD         Residual SD
SD          Standard Deviation
SDEP, SEP   Standard error of prediction
SECV        Standard error of cross-validation
SIMCA       Simple Classification Analysis
SPC         Statistical Process Control
SS          Sum of Squares
VIP         Variable Influence on Projection

1.4. Notation

We shall employ the common notation where column vectors are denoted by bold lower case characters, e.g., v, and row vectors are shown as transposed, e.g., v'. Bold upper case characters denote matrices, e.g., X.

*        multiplication, e.g., A*B
'        transpose, e.g., v', X'
a        index of components (model dimensions); a = 1,2,...,A
A        number of components in a PC or PLS model
i        index of objects (observations, cases); i = 1,2,...,N
N        number of objects (cases, observations)
k        index of X-variables; k = 1,2,...,K
m        index of Y-variables; m = 1,2,...,M
X        matrix of predictor variables, size (N x K)
Y        matrix of response variables, size (N x M)
b_m      regression coefficient vector of the m-th y; size (K x 1)
B        matrix of regression coefficients of all Y's; size (K x M)
c_a      PLSR Y-weights of component a
C        the (M x A) Y-weight matrix; c_a are columns in this matrix
E        the (N x K) matrix of X-residuals
f_m      residuals of the m-th y-variable; (N x 1) vector
F        the (N x M) matrix of Y-residuals
G        number of CV groups; g = 1,2,...,G
p_a      PLSR X-loading vector of component a
P        loading matrix; p_a are columns of P
R²       multiple correlation coefficient; amount of Y "explained" in terms of SS
R²X      amount of X "explained" in terms of SS
Q²       cross-validated R²; amount of Y "predicted"
t_a      X-scores of component a
T        score matrix (N x A), where the columns are t_a
u_a      Y-scores of component a
U        score matrix (N x A), where the columns are u_a
w_a      PLSR X-weights of component a
W        the (K x A) X-weight matrix; w_a are columns in this matrix
w*_a     PLSR weights transformed to be independent between components
W*       (K x A) matrix of transformed PLSR weights; w*_a are columns in W*

2. Example 1, a quantitative structure–property relationship (QSPR)

We use a simple example from the literature with one Y-variable and seven X-variables. The problem is one of QSPR or QSAR, which differ only in that the response(s) Y are chemical properties in the former and biological activities in the latter. In both cases, X contains a quantitative description of the variation in chemical structure between the investigated compounds.

The objective is to understand the variation of y = DDGTS = the free energy of unfolding of a protein (tryptophane synthase α unit of bacteriophage T4 lysozyme) when position 49 is modified to contain each of the 19 coded amino acids (AA's) except arginine. The AA's are described by seven highly correlated X-variables as shown in Table 1. Computational and other details are given in Ref. [8]. For alternative ways, see Hellberg et al. [9] and Sandberg et al. [10].

Table 1
Raw data of example 1

            PIE     PIF     DGR     SAC     MR      Lam     Vol     DDGTS
 1  Ala     0.23    0.31   -0.55   254.2   2.126   -0.02    82.2     8.5
 2  Asn    -0.48   -0.60    0.51   303.6   2.994   -1.24   112.3     8.2
 3  Asp    -0.61   -0.77    1.20   287.9   2.994   -1.08   103.7     8.5
 4  Cys     0.45    1.54   -1.40   282.9   2.933   -0.11    99.1    11.0
 5  Gln    -0.11   -0.22    0.29   335.0   3.458   -1.19   127.5     6.3
 6  Glu    -0.51   -0.64    0.76   311.6   3.243   -1.43   120.5     8.8
 7  Gly     0.00    0.00    0.00   224.9   1.662    0.03    65.0     7.1
 8  His     0.15    0.13   -0.25   337.2   3.856   -1.06   140.6    10.1
 9  Ile     1.20    1.80   -2.10   322.6   3.350    0.04   131.7    16.8
10  Leu     1.28    1.70   -2.00   324.0   3.518    0.12   131.5    15.0
11  Lys    -0.77   -0.99    0.78   336.6   2.933   -2.26   144.3     7.9
12  Met     0.90    1.23   -1.60   336.3   3.860   -0.33   132.3    13.3
13  Phe     1.56    1.79   -2.60   366.1   4.638   -0.05   155.8    11.2
14  Pro     0.38    0.49   -1.50   288.5   2.876   -0.31   106.7     8.2
15  Ser     0.00   -0.04    0.09   266.7   2.279   -0.40    88.5     7.4
16  Thr     0.17    0.26   -0.58   283.9   2.743   -0.53   105.3     8.8
17  Trp     1.85    2.25   -2.70   401.8   5.755   -0.31   185.9     9.9
18  Tyr     0.89    0.96   -1.70   377.8   4.791   -0.84   162.7     8.8
19  Val     0.71    1.22   -1.60   295.1   3.054   -0.13   115.6    12.0

Correlation matrix

        PIE     PIF     DGR     SAC     MR      Lam     Vol     DDGTS
PIE     1.000   0.967  -0.970   0.518   0.650   0.704   0.533   0.645
PIF     0.967   1.000  -0.968   0.416   0.555   0.750   0.433   0.711
DGR    -0.970  -0.968   1.000  -0.463  -0.582  -0.704  -0.484  -0.648
SAC     0.518   0.416  -0.463   1.000   0.955  -0.230   0.991   0.268
MR      0.650   0.555  -0.582   0.955   1.000  -0.027   0.945   0.290
Lam     0.704   0.750  -0.704  -0.230  -0.027   1.000  -0.221   0.499
Vol     0.533   0.433  -0.484   0.991   0.945  -0.221   1.000   0.300
DDGTS   0.645   0.711  -0.648   0.268   0.290   0.499   0.300   1.000

The lower half of the table shows the pair-wise correlation coefficients of the data. PIE and PIF are the lipophilicity constants of the AA side chain according to El Tayar et al. [8], and Fauchère and Pliska, respectively; DGR is the free energy of transfer of an AA side chain from protein interior to water according to Radzicka and Wolfenden; SAC is the water-accessible surface area of AA's calculated by MOLSV; MR is the molecular refractivity (from the Daylight data base); Lam is a polarity parameter according to El Tayar et al. [8]; Vol is the molecular volume of AA's calculated by MOLSV. All the data except MR are from Ref. [8].

3. PLSR and the underlying scientific model

PLSR is a way to estimate parameters in a scientific model, which basically is linear (see Section 4.3 for non-linear PLS models). This model, like any scientific model, consists of several parts: the philosophical, the conceptual, the technical, the numerical, the statistical, and so on. We here illustrate these using the QSPR/QSAR model of example 1 (see above), but the arguments are similar in most other modelling in science and technology.

Our chemical thinking makes us formulate the influence of structural change on activity (and other properties) in terms of "effects"—lipophilic, steric, polar, polarizability, hydrogen bonding, and possibly others. Similarly, the modelling of a chemical process is interpreted using "effects" of thermodynamics (equilibria, heat transfer, mass transfer, pressures, temperatures, and flows) and chemical kinetics (reaction rates).

Although this formulation of the scientific model is not of immediate concern for the technicalities of PLSR, it is of interest that PLSR modelling is consistent with "effects" causing the changes in the investigated system. The concept of latent variables (Section 4.1) may be seen as directly corresponding to these effects.

3.1. The data—X and Y

The PLSR model is developed from a training set of N observations (objects, cases, compounds, process time points) with K X-variables denoted by x_k (k = 1,...,K), and M Y-variables y_m (m = 1,2,...,M). These training data form the two matrices X and Y of dimensions (N x K) and (N x M), respectively, as shown in Fig. 1. In example 1, N = 19, K = 7, and M = 1.

Later, predictions for new observations are made based on their X-data. This gives predicted X-scores (t-values), X-residuals, their residual SD's, and y-values with confidence intervals.

3.2. Transformation, scaling and centering

Before the analysis, the X- and Y-variables are often transformed to make their distributions fairly symmetrical. Thus, variables with a range of more than one magnitude of 10 are often logarithmically transformed. If the value zero occurs in a variable, the fourth root transformation is a good alternative to the logarithm. The response variable in example 1 is already logarithmically transformed, i.e., expressed in thermodynamic units. No further transformation was made of the example 1 data.

Results of projection methods such as PLSR depend on the scaling of the data. With an appropriate scaling, one can focus the model on more important Y-variables, and use experience to increase the weights of more informative X-variables. In the absence of knowledge about the relative importance of the variables, the standard multivariate approach is to (i) scale each variable to unit variance by dividing it by its SD, and (ii) center it by subtracting its average, so-called auto-scaling. This corresponds to giving each variable (column) the same weight, the same prior importance in the analysis.

In example 1, all variables were auto-scaled, except x6 = Lam, the auto-scaling weight of which was multiplied by 1.5 to make it not be masked by the others (so-called block-scaling).

In CoMFA and GRID-QSAR, as well as in multivariate calibration in analytical chemistry, auto-scaling is often not the best scaling of X, but non-scaled X-data or some intermediate between auto-scaled and non-scaled may be appropriate [5]. In process data, the acceptable interval of variation of each variable can form the basis for the scaling.

For ease of interpretation and for numerical stability, it is recommended to center the data before the analysis. This is done—either before or after scaling—by subtracting the averages from all variables both in X and Y. In process data, other reference values such as set point values may be subtracted instead of averages. Hence, the analysis concerns the deviations from the reference points and how they are related. The centering leaves the interpretation of the results unchanged.

In some applications it is customary to normalize also the observations (objects). In chemistry, this is often done in the analysis of chromatographic or spectral profiles. The normalization is typically done by making the sum of all peaks of one profile be 100 or 1000. This removes the size of the observations (objects), which may be desirable if size is irrelevant. This is closely related to correspondence analysis [11].

Fig. 1. Data of a PLSR can be arranged as two tables, matrices, X and Y. Note that the raw data may have been transformed (e.g., logarithmically), and usually have been centered and scaled before the analysis. In QSAR, the X-variables are descriptors of chemical structure, and the Y-variables measurements of biological activity.
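To make the scaling step concrete, here is a minimal numpy sketch of auto-scaling (centering plus division by the column SD) with an optional weight vector that can be used for block-scaling-type adjustments such as the up-weighting of Lam in example 1; the function name and the weight values are only illustrative, not a prescribed procedure.

```python
import numpy as np

def autoscale(X, weights=None):
    """Center each column and scale it to unit variance (auto-scaling).

    'weights' allows block-scaling-type adjustments, e.g. multiplying the
    auto-scaling weight of one variable (such as Lam) by 1.5.
    """
    X = np.asarray(X, dtype=float)
    means = X.mean(axis=0)
    sds = X.std(axis=0, ddof=1)
    Xs = (X - means) / sds
    if weights is not None:
        Xs = Xs * np.asarray(weights, dtype=float)
    return Xs, means, sds

# e.g. for the seven descriptors of example 1, up-weighting the 6th variable (Lam):
# X_scaled, mx, sx = autoscale(X_raw, weights=[1, 1, 1, 1, 1, 1.5, 1])
```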

3.3. The PLSR model

The linear PLSR model finds a few "new" variables, which are estimates of the LV's or their rotations. These new variables are called X-scores and denoted by t_a (a = 1,2,...,A). The X-scores are predictors of Y and also model X (Eqs. (4) and (2) below), i.e., both Y and X are assumed to be, at least partly, modelled by the same LV's.

The X-scores are "few" (A in number), and orthogonal. They are estimated as linear combinations of the original variables x_k with the coefficients, "weights", w*_ka (a = 1,2,...,A). These weights are sometimes denoted by r_ka [12,13]. Below, formulas are shown both in element and matrix form (the latter in parentheses):

t_ia = Σ_k w*_ka x_ik;   (T = XW*)   (1)

The X-scores (t_a's) have the following properties:

(a) They are, multiplied by the loadings p_ak, good "summaries" of X, so that the X-residuals, e_ik, in Eq. (2) are "small":

x_ik = Σ_a t_ia p_ak + e_ik;   (X = TP' + E)   (2)

With multivariate Y (when M > 1), the corresponding "Y-scores" (u_a) are, multiplied by the weights c_am, good "summaries" of Y, so that the residuals, g_im, in Eq. (3) are "small":

y_im = Σ_a u_ia c_am + g_im;   (Y = UC' + G)   (3)

(b) The X-scores are good predictors of Y, i.e.:

y_im = Σ_a c_ma t_ia + f_im;   (Y = TC' + F)   (4)

The Y-residuals, f_im, express the deviations between the observed and modelled responses, and comprise the elements of the Y-residual matrix, F.

Because of Eq. (1), Eq. (4) can be rewritten to look like a multiple regression model:

y_im = Σ_a c_ma Σ_k w*_ka x_ik + f_im = Σ_k b_mk x_ik + f_im;   (Y = XW*C' + F = XB + F)   (5)

The "PLS-regression coefficients", b_mk (B), can be written as:

b_mk = Σ_a c_ma w*_ka;   (B = W*C')   (6)

Note that these b's are not independent unless A (the number of PLSR components) equals K (the number of X-variables). Hence, their confidence intervals according to the traditional statistical interpretation are infinite.

An interesting special case is at hand when there is a single y-variable (M = 1) and X'X is diagonal, i.e., X originates from an orthogonal design (fractional factorial, Plackett–Burman, etc.). In this case there is no correlation structure in X, and PLSR arrives at the MLR solution in one component [14], and the MLR and PLS-regression coefficients are equal to w1c1.

After each component, a, the X-matrix is "deflated" by subtracting t_ia p_ak from x_ik (t_a p_a' from X). This makes the PLSR model alternatively be expressed in weights w_a referring to the residuals after the previous dimension, E_(a-1), instead of relating to the X-variables themselves. Thus, instead of Eq. (1), we can write:

t_ia = Σ_k w_ka e_ik,a-1;   (t_a = E_(a-1) w_a)   (7a)

e_ik,a-1 = e_ik,a-2 - t_i,a-1 p_(a-1),k;   (E_(a-1) = E_(a-2) - t_(a-1) p_(a-1)')   (7b)

e_ik,0 = x_ik;   (E_0 = X)   (7c)

However, the weights, w, can be transformed to w*, which directly relate to X, giving Eq. (1) above. The relation between the two is given as [14]:

W* = W(P'W)^(-1)   (8)

The Y-matrix can also be "deflated" by subtracting t_a c_a', but this is not necessary; the results are equivalent with or without Y-deflation.

From the PLSR algorithm (see below), one can see that the first weight vector (w1) is the first eigenvector of the combined variance–covariance matrix, X'YY'X, and the following weight vectors (component a) are eigenvectors of the deflated versions of the same matrix, i.e., Z_a'YY'Z_a, where Z_a = Z_(a-1) - T_(a-1)P_(a-1)'. Similarly, the first score vector (t1) is an eigenvector of XX'YY', and later X-score vectors (t_a) are eigenvectors of Z_a Z_a'YY'.
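The relations in Eqs. (1), (6) and (8) can be checked numerically. The following sketch uses random stand-in data (not the AA data of example 1) and a bare-bones single-y PLS fit with deflation; it then forms W* = W(P'W)^(-1) and B = W*C' and verifies that T = XW*. It is one possible illustration, not a complete PLSR implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
X0 = rng.normal(size=(19, 7))          # stand-in for a centered, scaled X (N x K)
y  = rng.normal(size=(19, 1))          # stand-in for a centered y (N x 1)

A = 2
E = X0.copy()
T = np.zeros((19, A)); W = np.zeros((7, A)); P = np.zeros((7, A)); c = np.zeros((1, A))
for a in range(A):
    w = E.T @ y; w /= np.linalg.norm(w)           # weights for the residual matrix, Eq. (7a)
    t = E @ w
    c[0, a] = float(y.T @ t / (t.T @ t))          # Y-weight for this component
    p = E.T @ t / (t.T @ t)                       # X-loading
    E = E - t @ p.T                               # deflation, Eq. (7b)
    T[:, [a]], W[:, [a]], P[:, [a]] = t, w, p

W_star = W @ np.linalg.inv(P.T @ W)               # Eq. (8): weights referring to the original X
B = W_star @ c.T                                  # Eq. (6): PLS-regression coefficients
print(np.allclose(X0 @ W_star, T))                # Eq. (1): T = XW*  -> True
print(np.allclose(X0 @ B, T @ c.T))               # fitted part of Eq. (5) -> True
```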

These eigenvector relationships also show that the vectors w_a form an ortho-normal set, and that the vectors t_a are orthogonal to each other. The loading vectors (p_a) are not orthogonal to each other, and neither are the Y-scores, u_a. The u's and the p's are orthogonal to the t's and the w's, respectively, one and more components earlier, i.e., u_b't_a = 0 and p_b'w_a = 0, if b > a. Also, w_a'p_a = 1.0.

3.4. Interpretation of the PLSR model

One way to see PLSR is that it forms "new x-variables" (LV estimates), t_a, as linear combinations of the old x's, and thereafter uses these new t's as predictors of Y. Hence, PLSR is based on a linear model (see, however, Section 4.3). Only as many new t's are formed as are needed, as are predictively significant (Section 3.8).

All parameters, t, u, w (and w*), p, and c (see Fig. 1), are determined by a PLSR algorithm as described below. For the interpretation of the PLSR model, the scores, t and u, contain the information about the objects and their similarities/dissimilarities with respect to the given problem and model.

The weights w_a or the closely similar w*_a (see below), and c_a, give information about how the variables combine to form the quantitative relation between X and Y, thus providing an interpretation of the scores, t_a and u_a. Hence, these weights are essential for the understanding of which X-variables are important (numerically large w_a-values), and which X-variables provide the same information (similar profiles of w_a-values).

The PLS weights w_a express both the positive correlations between X and Y, and the "compensation correlations" needed to predict Y from X clear of the secondary variation in X. The latter is everything varying in X that is not primarily related to Y. This makes w_a difficult to interpret directly, especially in later components (a > 1). By using an orthogonal expansion of the X-parameters in O-PLS, one can get the part of w_a that primarily relates to Y, thus making the PLS interpretation more clear [15].

The part of the data that is not explained by the model, the residuals, is of diagnostic interest. Large Y-residuals indicate that the model is poor, and a normal probability plot of the residuals of a single Y-variable is useful for identifying outliers in the relationship between T and Y, analogously to MLR. In PLSR we also have residuals for X; the part not used in the modelling of Y. These X-residuals are useful for identifying outliers in the X-space, i.e., molecules with structures that do not fit the model, and process points deviating from "normal" process operations. This, together with control charts of the X-scores, t_a, is used in multivariate SPC (MSPC) [16].

3.5. Geometric interpretation

PLSR is a projection method and thus has a simple geometric interpretation as a projection of the X-matrix (a swarm of N points in a K-dimensional space) down on an A-dimensional hyper-plane in such a way that the coordinates of the projection (t_a, a = 1,2,...,A) are good predictors of Y. This is indicated in Fig. 2.

Fig. 2. The geometric representation of PLSR. The X-matrix can be represented as N points in the K-dimensional space where each column of X (x_k) defines one coordinate axis. The PLSR model defines an A-dimensional hyper-plane, which in turn is defined by one line, one direction, per component. The direction coefficients of these lines are p_ak. The coordinates of each object, i, when its data (row i in X) are projected down on this plane, are t_ia. These positions are related to the values of Y.

The direction of the plane is expressed as the slopes, p_ak, of each PLS direction of the plane (each component) with respect to each coordinate axis, x_k. This slope is the cosine of the angle between the PLS direction and the coordinate axis.

Thus, PLSR develops an A-dimensional hyper-plane in X-space such that this plane well approximates X (the N points, row vectors of X), and at the same time, the positions of the projected data points on this plane, described by the scores t_ia, are related to the values of the responses, activities, y_im (see Fig. 2).

3.6. Incomplete X and Y matrices (missing data)

Projection methods such as PLSR tolerate moderate amounts of missing data both in X and in Y. To have missing data in Y, it must be multivariate, i.e., have at least two columns. The larger the matrices X and Y are, the higher the proportion of missing data that is tolerated. For small data sets with around 20 observations and 20 variables, around 10% to 20% missing data can be handled, provided that they are not missing according to some systematic pattern.

The NIPALS PLSR algorithm automatically accounts for the missing values, in principle by iteratively substituting the missing values by predictions by the model. This corresponds to, for each component, giving the missing data values that have zero residuals and thus have no influence on the component parameters t_a and p_a. Other approaches based on the EM algorithm have been developed, and often work better than NIPALS for large percentages of missing data [17,18]. One should remember, however, that with much missing data, any resulting parameters and predictions are highly uncertain.

3.7. One Y at a time, or all in a single model?

PLSR has the ability to model and analyze several Y's together, which has the advantage of giving a simpler overall picture than one separate model for each Y-variable. Hence, when the Y's are correlated, they should be analyzed together. If the Y's really measure different things, and thus are fairly independent, a single PLSR model tends to have many components and be difficult to interpret. Then a separate modelling of the Y's gives a set of simpler models with fewer dimensions, which are easier to interpret.

Hence, one should start with a PCA of just the Y-matrix. This shows the practical rank of Y, i.e., the number of resulting components, A_PCA. If this is small compared to the number of Y-variables (M), the Y's are correlated, and a single PLSR model of all Y's is warranted. If, however, the Y's cluster in strong groups, which is seen in the PCA loading plots, separate PLSR models should be developed for these groups.

3.8. The number of PLS components, A

In any empirical modelling, it is essential to determine the correct complexity of the model. With numerous and correlated X-variables there is a substantial risk of "over-fitting", i.e., getting a well fitting model with little or no predictive power. Hence, a strict test of the predictive significance of each PLS component is necessary, stopping when components start to be non-significant.

Cross-validation (CV) is a practical and reliable way to test this predictive significance [1–6]. It has become the standard in PLSR analysis, and is incorporated in one form or another in all available PLSR software. Good discussions of the subject were given by Wakeling and Morris [19] and Clark and Cramer [20].

Basically, CV is performed by dividing the data into a number of groups, G, say, five to nine, and then developing a number of parallel models from reduced data with one of the groups deleted. We note that having G = N, i.e., the leave-one-out approach, is not recommendable [21].

After developing a model, differences between actual and predicted Y-values are calculated for the deleted data. The sum of squares of these differences is computed and collected from all the parallel models to form the predictive residual sum of squares (PRESS), which estimates the predictive ability of the model.

When CV is used in the sequential mode, CV is performed on one component after the other, but the peeling off (Eq. (7b), Section 3.3) is made only once on the full data matrices, whereafter the resulting residual matrices E and F are divided into groups for the CV of the next component. The ratio PRESS_a/SS_(a-1) is calculated after each component, and a component is judged significant if this ratio is smaller than around 0.9 for at least one of the Y-variables. Slightly sharper bounds can be obtained from the results of Wakeling and Morris [19]. Here SS_(a-1) denotes the (fitted) residual sum of squares before the current component (index a). The calculations continue until a component is non-significant.

Alternatively, with "total CV", one first divides the data into groups, and then calculates PRESS for each component up to, say, 10 or 15, with separate "peeling" (Eq. (7b), Section 3.3) of the data matrices of each CV group. The model with the number of components giving the lowest PRESS/(N - A - 1) is then used. This "total" approach is computationally more taxing, but gives similar results.

Both with the "sequential" and the "total" mode, a PRESS is calculated for the final model with the estimated number of significant components. This is often re-expressed as Q² (the cross-validated R²), which is (1 - PRESS/SS), where SS is the sum of squares of Y corrected for the mean. This can be compared with R² = (1 - RSS/SS), where RSS is the fitted residual sum of squares. In models with several Y's, one obtains also R²_m and Q²_m for each Y-variable, y_m.

These measures can, of course, be equivalently expressed as residual SD's (RSD's) and predictive residual SD's (PRESD's). The latter is often called the standard error of prediction (SDEP or SEP), or the standard error of cross-validation (SECV). If one has some knowledge of the noise in the investigated system, for example ±0.3 units for log(1/C) in QSAR's, these predictive SD's should, of course, be similar in size to the noise.

3.9. Model validation

Any model needs to be validated before it is used for "understanding" or for predicting new events such as the biological activity of new compounds or the yield and impurities at other process conditions. The best validation of a model is that it consistently and precisely predicts the Y-values of observations with new X-values—a validation set. But an independent and representative validation set is rare.

In the absence of a real validation set, two reasonable ways of model validation are given by cross-validation (CV, see Section 3.8), which simulates how well the model predicts new data, and model re-estimation after data randomization, which estimates the chance (probability) to get a good fit with random response data.
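A minimal sketch of the group-wise cross-validation discussed in Sections 3.8 and 3.9 is given below. It assumes scikit-learn's PLSRegression as the PLSR engine and random group assignment; the function name and group handling are illustrative choices, not the exact procedure of any particular package.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def press_and_q2(X, y, n_components, n_groups=7, seed=0):
    """PRESS and Q2 = 1 - PRESS/SS from cross-validation with n_groups groups."""
    X, y = np.asarray(X, float), np.asarray(y, float).ravel()
    idx = np.random.default_rng(seed).permutation(len(y))
    press = 0.0
    for g in np.array_split(idx, n_groups):
        train = np.setdiff1d(idx, g)
        model = PLSRegression(n_components=n_components).fit(X[train], y[train])
        press += np.sum((y[g] - model.predict(X[g]).ravel()) ** 2)
    ss = np.sum((y - y.mean()) ** 2)                 # SS of y corrected for the mean
    return press, 1.0 - press / ss

# evaluate Q2 for A = 1, 2, 3, ... and stop adding components when it no longer improves
```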
3.10. PLSR algorithms

The algorithms for calculating the PLSR model are mainly of technical interest; we here just point out that there are several variants developed for different shapes of the data [2,22,23]. Most of these algorithms tolerate moderate amounts of missing data. Either the algorithm, like the original NIPALS algorithm below, works with the original data matrices, X and Y (scaled and centered), or, alternatively, so-called kernel algorithms work with the variance–covariance matrices, X'X, Y'Y, and X'Y, or the association matrices, XX' and YY', which is advantageous when the number of observations (N) differs much from the number of variables (K and M).

For extensions of PLSR, the results of Höskuldsson [3] regarding the possibilities to modify the NIPALS PLSR algorithm are of great interest. Höskuldsson shows that as long as the steps (C) to (G) below are unchanged, modifications can be made of w in step (B). Central properties remain, such as orthogonality between model components, good summarizing properties of the X-scores, t_a, and interpretability of the model parameters. This can be used to introduce smoothness in the PLSR solution [24], to develop a PLSR model where a majority of the PLSR coefficients are zero [25], to align w with a priori specified vectors (similar to "target rotation" of Kvalheim et al. [26]), and more.

The simple NIPALS algorithm of Wold et al. [2] is shown below. It starts with optionally transformed, scaled, and centered data, X and Y, and proceeds as follows (note that with a single y-variable, the algorithm is non-iterative).

(A) Get a starting vector of u, usually one of the Y columns. With a single y, u = y.

(B) The X-weights, w:
w = X'u/(u'u)
(here w can now be modified)
Norm w to ||w|| = 1.0.

(C) Calculate the X-scores, t:
t = Xw

(D) The Y-weights, c:
c = Y't/(t't)

(E) Finally, an updated set of Y-scores, u:
u = Yc/(c'c)

(F) Convergence is tested on the change in t, i.e., ||t_old - t_new||/||t_new|| < ε, where ε is "small", e.g., 10^-6 or 10^-8. If convergence has not been reached, return to (B); otherwise continue with (G), and then (A). If there is only one y-variable, i.e., M = 1, the procedure converges in a single iteration, and one proceeds directly with (G).

(G) Remove (deflate, peel off) the present component from X and Y, and use these deflated matrices as X and Y in the next component. Here the deflation of Y is optional; the results are equivalent whether Y is deflated or not.
p = X't/(t't)
X = X - tp'
Y = Y - tc'

(H) Continue with the next component (back to step (A)) until cross-validation (see above) indicates that there is no more significant information in X about Y.

Chu et al. [27] recently reviewed the attractive properties of matrix decompositions of the Wedderburn type. The PLSR NIPALS algorithm is such a Wedderburn decomposition, and hence is numerically and statistically stable.
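Steps (A) through (H) translate almost line by line into code. The following numpy sketch is one possible implementation (the function name, variable names and convergence tolerance are choices made here, not prescribed by the algorithm); it deflates Y, although, as noted above, the results are equivalent without Y-deflation, and it does not include the missing-data handling of full NIPALS implementations.

```python
import numpy as np

def nipals_pls(X, Y, n_components, tol=1e-10, max_iter=500):
    """NIPALS PLSR roughly following steps (A)-(H) of Section 3.10.

    X (N x K) and Y (N x M) are assumed to be centered (and scaled).
    Returns the X-scores T, Y-scores U, X-weights W, X-loadings P and Y-weights C.
    """
    X, Y = X.astype(float).copy(), Y.astype(float).copy()
    N, K = X.shape
    M = Y.shape[1]
    T, U = np.zeros((N, n_components)), np.zeros((N, n_components))
    W, P = np.zeros((K, n_components)), np.zeros((K, n_components))
    C = np.zeros((M, n_components))
    for a in range(n_components):
        u = Y[:, [0]]                          # (A) start u as one of the Y columns
        t_old = np.zeros((N, 1))
        for _ in range(max_iter):
            w = X.T @ u / (u.T @ u)            # (B) X-weights
            w /= np.linalg.norm(w)             #     norm w to length 1.0
            t = X @ w                          # (C) X-scores
            c = Y.T @ t / (t.T @ t)            # (D) Y-weights
            u = Y @ c / (c.T @ c)              # (E) updated Y-scores
            if np.linalg.norm(t - t_old) / np.linalg.norm(t) < tol:
                break                          # (F) convergence tested on t
            t_old = t
        p = X.T @ t / (t.T @ t)                # (G) X-loadings, then deflation
        X = X - t @ p.T
        Y = Y - t @ c.T                        #     Y-deflation (optional)
        T[:, [a]], U[:, [a]] = t, u
        W[:, [a]], P[:, [a]], C[:, [a]] = w, p, c
    return T, U, W, P, C                       # (H) components 1..A

# usage with centered/scaled arrays X (N x K) and Y (N x M):
# T, U, W, P, C = nipals_pls(X, Y, n_components=2)
```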

3.11. Standard errors and confidence intervals

Numerous efforts have been made to theoretically derive confidence intervals of the PLSR parameters, see, e.g., Ref. [28]. Most of these are, however, based on regression assumptions, seeing PLSR as a biased regression model. Only recently, in the work of Burnham et al. [12], have these matters been investigated with PLSR as a latent variable regression model.

A way to estimate standard errors and confidence intervals directly from the data is to use jack-knifing [29]. This was recommended by Wold [1] in his original PLS work, and has recently been revived by Martens and Martens [30] and others. The idea is simple: the variation in the parameters of the various sub-models obtained during cross-validation is used to derive their standard deviations (called standard errors), followed by using the t-distribution to give confidence intervals. Since all PLSR parameters (scores, loadings, etc.) are linear combinations of the original data (possibly deflated), these parameters are close to normally distributed, and hence jack-knifing works well.

4. Assumptions underlying PLSR

4.1. Latent variables

In PLS modelling, we assume that the investigated system or process actually is influenced by just a few underlying variables, latent variables (LV's). The number of these LV's is usually not known, and one aim with the PLSR analysis is to estimate this number. Also, the PLS X-scores, t_a, are usually not direct estimates of the LV's, but rather they span the same space as the LV's. Thus, the latter (denoted by V) are related to the former (T) by a, usually unknown, rotation matrix, R, with the property R'R = 1:

V = TR, or T = VR'

Both the X- and the Y-variables are assumed to be realizations of these underlying LV's, and are hence not assumed to be independent. Interestingly, the LV assumptions closely correspond to the use of microscopic concepts such as molecules and reactions in chemistry and molecular biology, thus making PLSR philosophically suitable for the modelling of chemical and biological data. This has been discussed by, among others, Wold et al. [31,32], Kvalheim [33], and recently, from a more fundamental perspective, by Burnham et al. [12,13]. In spectroscopy, it is clear that the spectrum of a sample is the sum of the spectra of the constituents multiplied by their concentrations in the sample. Identifying the latter with t (Lambert–Beer's "law"), and the spectra with p, we get the latent variable model X = t1p1' + t2p2' + ... = TP' + noise. In many applications this interpretation, with the data explained by a number of "factors" (components), makes sense.

As discussed below, we can also see the scores, T, as comprised of derivatives of an unknown function underlying the investigated system. The choice of interpretation depends on the amount of knowledge about the system. The more knowledge we have, the more likely it is that we can assign a latent variable interpretation to the X-scores or their rotation.

If the number of LV's actually equals the number of X-variables, K, then the X-variables are independent, and PLSR and MLR give identical results. Hence, we can see PLSR as a generalization of MLR, containing the latter as a special case in situations when the MLR solution exists, i.e., when the number of X- and Y-variables is fairly small in comparison to the number of observations, N. In most practical cases, except when X is generated according to an experimental design, however, the X-variables are not independent. We then call X rank deficient. Then PLSR gives a "shrunk" solution which is statistically more robust than the MLR solution, and hence gives better predictions than MLR [34].

PLSR gives a model of X in terms of a bilinear projection, plus residuals. Hence, PLSR assumes that there may be parts of X that are unrelated to Y. These parts can include noise and/or regularities not related to Y. Hence, unlike MLR, PLSR tolerates noise in X.

4.2. Alternative derivation

The second theoretical foundation of LV-models is one of Taylor expansions [35]. We assume the data X and Y to be generated by a multi-dimensional function F(u,v), where the vector variable u describes the change between observations (rows in X) and the vector variable v describes the change between variables (columns in X). Making a Taylor expansion of the function F in the u-direction, and discretizing for i = observation and k = variable, gives the LV-model. Again, the smaller the interval of u that is modelled, the fewer terms we need in the Taylor expansion, and the fewer components we need in the LV-model. Hence, we can interpret PCA and PLS as models of similarity. Data (variables) measured on a set of similar observations (samples, items, cases, ...) can always be modelled (approximated) by a PC- or PLS model. And the more similar the observations are, the fewer components we need in the model.

We hence have two different interpretations of the LV-model. Thus, real data well explained by these models can be interpreted as either being a linear combination of "factors" or, according to the latter interpretation, as being measurements made on a set of similar observations. Any mixture of these two interpretations is, of course, often applicable.

4.3. Homogeneity

Any data analysis is based on an assumption of homogeneity. This means that the investigated system or process must be in a similar state throughout all the investigation, and the mechanism of influence of X on Y must be the same. This, in turn, corresponds to having some limits on the variability and diversity of X and Y.

Hence, it is essential that the analysis provides diagnostics about how well these assumptions indeed are fulfilled. Much of the recent progress in applied statistics has concerned diagnostics [36], and many of these diagnostics can be used also in PLSR modelling, as discussed below. PLSR also provides additional diagnostics beyond those of regression-like methods, particularly those based on the modelling of X (score and loading plots and X-residuals).

In the first example, the first PLSR analysis indicated that the data set was inhomogeneous—three aromatic amino acids (AA's) are indicated to have a different type of effect than the others on the modelled property. This type of information is difficult to obtain in ordinary regression modelling, where only large residuals in Y provide diagnostics about inhomogeneities.

4.4. Non-linear PLSR

For non-linear situations, simple solutions have been published by Höskuldsson [4], and Berglund and Wold [37]. Another approach, based on transforming selected X-variables or X-scores to qualitative variables coded as sets of dummy variables, the so-called GIFI approach [38,39], is described elsewhere in this volume [15].

5. Results of example 1

5.1. The initial PLSR analysis of the AA data

The first PLSR analysis (linear model) of the AA data gives one significant component explaining 43% of the Y-variance (R² = 0.435, Q² = 0.299).

In contrast, MLR gives an R² of 0.788, which is equivalent to PLSR with A = 7 components. The full MLR solution, however, has a Q² of -0.215, indicating that the model is poor, and does not predict better than chance.

With just one significant PLS component, the only meaningful score plot is that of y against t1 (Fig. 3). The aromatic AA's, Trp, Phe, and maybe Tyr, show a much worse fit than the others, indicating an inhomogeneity in the data. To investigate this, a second round of analysis is made with a reduced data set, N = 16, without the aromatic AA's.

Fig. 3. The PLS scores u1 and t1 of the AA example, 1st analysis.

5.2. Second analysis

The modelling of N = 16 AA's with the same linear model as before gives a substantially better result, with A = 2 significant components and R² = 0.783, Q² = 0.706. The MLR model for these 16 objects gives an R² of 0.872, and a Q² of 0.608. This strong improvement indicates that the data set now indeed is more homogeneous and can be properly modelled.

The plot in Fig. 4 of u1 vs. t1 shows, however, a fairly strong curvature, indicating that squared terms in lipophilicity and, maybe, polarity are warranted. In the final analysis, the squares of these four variables were included in the model, which indeed gave better results. Two significant PLS components and one additional with border-line significance were obtained. The resulting R² and Q² values are for A = 2: 0.90 and 0.80, and for A = 3: 0.925 and 0.82, respectively. The A = 3 values correspond to RSD = 0.92 and PRESD (SDEP) = 1.23, since the SD of Y is 2.989. The full MLR model gives R² = 0.967, but with a much worse Q² = 0.09.

Fig. 4. The PLS scores u1 and t1 of the AA example, 2nd analysis.

5.2.1. X-scores (t_a) show object similarities and dissimilarities

The plot of the X-scores (t1 vs. t2, Fig. 5) shows the 16 amino acids grouped according to polarity from upper left to lower right, and inside each group, according to size and lipophilicity.

Fig. 5. The PLS scores t1 and t2 of the AA example, 2nd analysis. The overlapping points up to the right are Ile and Leu.

5.2.2. PLSR weights w and c

For the interpretation of PLSR models, the standard is to plot the w*'s and c's of one model dimension against another. Alternatively, one can plot the w's and c's; the results and interpretation are similar. This plot shows how the X-variables combine to form the scores t_a; X-variables important for the a-th component fall far from the origin along the a-th axis in the wc-plot. Analogously, Y-variables well modelled by the a-th component have large coefficients, c_am, and hence fall far from the origin along the a-th axis in the same plot.

The example 1 weight plot (Fig. 6, often also called a loading plot) shows the first PLS component dominated by lipophilicity and polarity (PIF, PIE, and Lam on the positive side, and DGR on the negative), and the second component being a mixture of size and polarity with MR, Vol, and SAC on the positive (high) side, and Lam on the negative (low) side.

Fig. 6. The PLS weights w* and c for the first two dimensions of the 2nd AA model.

Fig. 7. PLS-regression coefficients after A = 2 components (second analysis). The bars indicate 95% confidence intervals based on jack-knifing.

The c-values of the response, y, are proportional to the linear variation of Y explained by the corresponding dimension, i.e., R². They define one point per response; in the example with a single response, this point (DDGTS) sits far to the right in the first quadrant of the plot.

The importance of a given X-variable for Y is proportional to its distance from the origin in the loading space. These lengths correspond to the PLS-regression coefficients after A = 2 dimensions (Fig. 7).

5.3. Comparison with multiple linear regression (MLR)

Comparing the PLS model (A = 2) and the MLR model (equivalent to the PLS model with A = 7) shows that the coefficients of x3 (DGR) and x4 (SAC) change sign between the PLSR model (Fig. 7) and the MLR model (Fig. 8). Moreover, the coefficients of x4, x5, and x7, which are moderate in the PLSR (A = 2) model, become large and with opposite signs in the MLR model (A = 7), although they are strongly positively correlated to each other in the raw data.

The MLR coefficients clearly are misleading and un-interpretable, due to the strong correlations between the X-variables. PLSR, however, stops at A = 2, and gives reasonable coefficient values both for A = 2 and A = 3. With correlated variables one cannot assign "correct" values to the individual coefficients; the only thing we can estimate is their joint contribution to y.

Fig. 8. The MLR coefficients corresponding to analysis 2. The bars indicate 95% confidence intervals based on jack-knifing. Note the difference in the vertical scale compared to Fig. 7.
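The contrast between the two coefficient vectors can be reproduced on any strongly collinear data set. The sketch below uses synthetic data driven by two latent factors (a stand-in for the AA data, not the published values) and compares the least-squares (MLR) coefficients with two-component PLS1 coefficients computed with the deflation scheme of Section 3.3; it is an illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
# synthetic, strongly collinear predictors: 7 variables driven by 2 latent factors
t_true = rng.normal(size=(19, 2))
X = t_true @ rng.normal(size=(2, 7)) + 0.05 * rng.normal(size=(19, 7))
y = t_true @ np.array([1.0, -0.5]) + 0.1 * rng.normal(size=19)
X = X - X.mean(0); y = y - y.mean()

b_mlr = np.linalg.lstsq(X, y, rcond=None)[0]      # full MLR solution

# two-component PLS1 coefficients via deflation
E, W, P, c = X.copy(), [], [], []
for _ in range(2):
    w = E.T @ y; w /= np.linalg.norm(w)
    t = E @ w
    c.append(y @ t / (t @ t)); P.append(E.T @ t / (t @ t)); W.append(w)
    E = E - np.outer(t, P[-1])
W, P, c = np.column_stack(W), np.column_stack(P), np.array(c)
b_pls = W @ np.linalg.inv(P.T @ W) @ c            # B = W*C' for a single y

print(np.round(b_mlr, 3))                         # typically large, erratic values
print(np.round(b_pls, 3))                         # typically much more stable on collinear data
```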

Fig. 9. VIP of the X-variables of the three-component PLSR model, 3rd analysis. The squares of x1 = PIE, x2 = PIF, x3 = DGR, and x6 = Lam are denoted by S1*1, S2*2, S3*3, and S6*6, respectively.

5.4. Measures of variable importance

In PLSR modelling, a variable (x_k) may be important for the modelling of Y. Such variables are identified by large PLS-regression coefficients, b_mk. However, a variable may also be important for the modelling of X, which is identified by large loadings, p_ak. A summary of the importance of an X-variable for both Y and X is given by VIP_k (variable importance for the projection, Fig. 9). This is a weighted sum of squares of the PLS-weights, w*_ak, with the weights calculated from the amount of Y-variance of each PLS component, a.
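One common formulation of this weighted sum of squares is sketched below; the exact definition is not given in the text, so the function (its name, the use of c_a² t_a't_a as the explained Y-sum of squares, and the interpretation threshold mentioned in the comment) should be read as an assumption of this example rather than as the article's prescription.

```python
import numpy as np

def vip_scores(W, T, C):
    """One common VIP formulation for a PLSR model with a single y.

    W: (K x A) X-weights (columns of length 1)
    T: (N x A) X-scores
    C: (1 x A) or (A,) Y-weights
    The Y-variance explained by component a is taken here as c_a^2 * t_a't_a.
    """
    W = np.asarray(W, float)
    T = np.asarray(T, float)
    c = np.asarray(C, float).ravel()
    K, A = W.shape
    ssy = (c ** 2) * (T * T).sum(axis=0)            # explained Y-SS per component
    wnorm2 = (W / np.linalg.norm(W, axis=0)) ** 2   # squared normalized weights
    return np.sqrt(K * (wnorm2 @ ssy) / ssy.sum())

# W, T, C can be taken from a fitted PLSR model, e.g. the NIPALS sketch in
# Section 3.10; variables with VIP around 1 or larger are usually considered influential.
```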

5.5. Residuals

The residuals of Y and X (F and E in Eqs. (4) and (2) above) are of diagnostic value for the quality of the model. A normal probability plot of the Y-residuals (Fig. 10) of the final AA model shows a fairly straight line with all values within ±3 SD's. To be a serious outlier, a point should clearly deviate from the line, and be outside ±4 SD's.

Fig. 10. Y-residuals of the three-component PLSR model, 3rd analysis, normal probability plot.

Since the X-residuals are many (N x K), one needs a summary for each observation (compound) not to drown in details. This is provided by the residual SD of the X-residuals of the corresponding row of the residual matrix E. Because this SD is proportional to the distance between the data point and the model plane in X-space, it is also often called DModX (distance to the model in X-space). A DModX larger than around 2.5 times the overall SD of the X-residuals (corresponding to an F-value of 6.25) indicates that the observation is an outlier. Fig. 11 shows that none of the 16 compounds in example 1 has a large DModX. Here the overall SD = 0.34.

Fig. 11. RSD's of each compound's X-residuals (DModX), 3rd analysis.

5.6. Conclusions of example 1

The initial PLSR analysis gave diagnostics (score plots) that indicated in-homogeneities in the data. A much better model was obtained for the N = 16 non-aromatic AA's. A remaining curvature in the score plot of u1 vs. t1 led to the inclusion of squared terms, which gave a good final model. Only the squares of the lipophilicity variables are significant in the final model.

If additional aromatic AA's had been present, a second separate model could have been developed for this type of AA's. This would provide insight into how this aromatic group differs from the non-aromatic AA's. This is, in a way, a non-linear model of the changes in the relationship between structure and activity when going from non-aromatic to aromatic AA's. These changes are too large and non-linear to be modelled by a linear or low degree polynomial model. The use of two separate models, which do not directly model the change from one group to another, provides a simple approach to deal with these non-linearities.

6. Example 2 (SIMCODM)

High and consistent product quality combined with "green" plant operation is important in today's competitive industrial climate. The goal of process data modelling is often to reduce the amount of down time and eliminate sources of undesired and deleterious process variability. The second example shows the investigation of possibilities to operate a process industry in an environment-friendly manner [40].

At the Aylesford Newsprint paper-mill in Kent, UK, premium quality newsprint is produced from 100% recycled fiber, i.e., reclaimed newspapers and magazines. The first step of the process is one of de-inking. This has stages of cleaning, screening, and flotation. First the wastepaper is fed into rotating drum pulpers, where it is mixed with warm water and chemicals to initiate the fiber separation and ink removal. In the next stage, the pulp undergoes pre-screening and cleaning to remove light and heavy contaminants (staples, paper clips, sand, plastics, etc.). Thereafter the pulp goes to a pre-flotation stage. After the pre-flotation the pulp is thickened and dispersed and the brightness is adjusted.

Fig. 11. RSD’s of each compound’s X-residualsŽ. DModX , 3rd analysis. S. Wold et al.rChemometrics and Intelligent Laboratory Systems 58() 2001 109–130 125 storage towers to the paper machines for production of premium quality newsprint. The wastewater discharged from the de-inking process is the main source of chemical oxygen de- mandŽ. COD of the mill effluent. COD estimates the oxygen requirements of organic matter present in the effluent. The COD in the effluent is reduced by 85% prior to discharge to the river Medway. Stricter envi- ronmental regulations and costs of COD reduction made the investigation of the COD sources and pos- sible control strategies one of the company’s envi- ronmental objectives for 1999. A multivariate ap- proach was adopted. Thus, data from more than a year were used in the multivariate study. The example data set contains 384 daily observations, with one response variableŽ COD load, y .Ž.and 54 process variables x yx . Half of 1255Fig. 13. Example 2. PCA loading plot Ž.p1rp2 corresponding to the observationsŽ. process time points are used for Fig. 12. The position of the COD load, variable number 1, is high- model training and half for model validation. lighted by an increased font size.

6.1. Overview by PCA

PCA applied to the entire data set (X and Y) gave an eight-component model explaining R² = 79% and predicting Q² = 68% of the data variation. Scores and loadings of the first two components, accounting for 54% of the sum of squares, are plotted in Figs. 12 and 13.

Fig. 12 reveals a main cluster in the right-hand part of the score plot, where the process has spent most of its time. Occasionally, the process has drifted away towards the left-hand area of the score plot, corresponding to a lower production rate, including planned shut downs in one of the process lines. The loading plot shows most of the variables having low numerical values (i.e., low material flow, low addition of process chemicals, etc.) for the low production rates.

This analysis shows that the data are not strongly clustered. The low production process time points deviate from the main cluster, but not seriously. Rather, the inclusion of the low production time points in the subsequent PLSR modelling gives a good spanning of key features such as brightness, residual ink, and addition of external chemicals.

Fig. 12. Example 2. PCA score plot (t1/t2) of the overview model. Each point represents one process time point.

Fig. 13. Example 2. PCA loading plot (p1/p2) corresponding to Fig. 12. The position of the COD load, variable number 1, is highlighted by an increased font size.
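One way to get such a PCA overview is a singular value decomposition of the auto-scaled data table. The sketch below uses synthetic random data as a stand-in for the 384 x 55 process table and reports only the cumulative explained sum of squares (the cross-validated Q² of the paper would require an additional CV loop); all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
Z = rng.normal(size=(384, 55))                  # stand-in for the scaled (X, Y) data table
Z = (Z - Z.mean(0)) / Z.std(0, ddof=1)

U, s, Vt = np.linalg.svd(Z, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)             # proportion of SS per PCA component
print(np.round(np.cumsum(explained[:8]), 2))    # cumulative R2X for the first 8 components

scores = U[:, :2] * s[:2]                       # t1/t2 for a score plot like Fig. 12
loadings = Vt[:2].T                             # p1/p2 for a loading plot like Fig. 13
```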

6.2. Selection of training set for PLSR and lagging of data

The parent data set, with preserved time order, was split into two halves, one training set and one prediction set. Only forward prediction was applied. Process insight, initial PLSR modelling, and inspection of cross-correlations indicated that lagging 10 of the variables was warranted to catch the process dynamics. These were: lags 1–4 of COD, and lags 1 and 2 of variables x23, x24 (addition of two chemicals) and x35 (a flow), in total 10 lagged X's. Lagging a variable L steps means that the same variable, shifted L steps in time, is included as an additional variable in the data set.
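A minimal sketch of this lagging step is shown below, using pandas; the column names and lag choices are only illustrative of the construction, not the actual variable identities of the mill data set.

```python
import numpy as np
import pandas as pd

def add_lags(df, lags):
    """Append time-shifted copies of selected columns as new variables.

    'lags' maps a column name to its list of lag steps, e.g. {"COD": [1, 2, 3, 4]}.
    """
    out = df.copy()
    for col, steps in lags.items():
        for L in steps:
            out[f"{col}_lag{L}"] = out[col].shift(L)
    return out.dropna()        # the first max(L) rows have no lagged value

# illustrative daily process data; the real table has 54 process variables plus COD
df = pd.DataFrame({"COD": np.arange(10.0), "x23": 2.0 * np.arange(10.0)})
lagged = add_lags(df, {"COD": [1, 2, 3, 4], "x23": [1, 2]})
```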

6.3. PLSR modelling

PLSR was utilized to relate the 54 + 10 process variables to the COD load. A two-component PLSR model was obtained, with R²X = 0.49, R²Y = 0.60, Q²Y(CV) = 0.57, and Q²Y(ext) = 0.59 (external validation set). Fig. 14 shows the relationship between observed and predicted levels of COD for the validation set. The picture for the training set is very similar.

Fig. 14. Example 2. Observed versus predicted COD for the prediction set.

6.4. Results

The collected process variables together with their lags predict the COD load with Q² > 0.6. This is a satisfying result considering the complexity of the process and the great variations that occur in the starting material (recycled paper).

The PLS-regression coefficients (Fig. 15) show the process variables with the strongest correlation to the COD load to be x23, x24 (addition of chemicals), x34, x35 (flows), and x49 and x50 (temperatures), with coefficients exceeding 0.04. Only some of these can be controlled, namely x23, x24, and x49.

Fig. 15. Example 2. Plot of regression coefficients of scaled and centered variables of the PLS model. Coefficients with numbers 2 to 55 represent the 54 original process variables. Coefficients 57–60 depict lags 1–4 of COD, and coefficients 61–66 the two first lags of variables x23, x24, and x35.

over, the lagged variables are important. Variables 57–60 are the four first lags of the COD variable. Obviously, past measurements of COD are useful for modelling and predicting future levels of COD—the change in this variable is slow. The variables 61–66 are lags 1 and 2 of x23, x24, and x35. The dynamics of these process variables are thus related to COD.

Unfortunately, none of the three controlled variables (x23, x24, and x49) offer a means to decrease COD. The former two variables describe the addition of caustic and peroxide to the dispergers, and cannot be significantly reduced without endangering pulp quality. The last variable is the raw effluent temperature, which cannot be lowered without negative effects on pulp quality.

This result is typical for the analysis of historical process data. The value of the analysis is mainly to detect deviations from normal operation, not as a means of process optimization. The latter demands data from a statistically designed set of experiments with the purpose of optimizing the process.

7. Summary; how to develop and interpret a PLSR model

(1) Have a good understanding of the stated problem, particularly which responses (properties), Y, are of interest to measure and model, and which predictors, X, should be measured and varied. If possible, i.e., if the X-variables are subject to experimental control, use statistical experimental design [41] for the construction of X.

(2) Get good data, both Y (responses) and X (predictors). Multivariate Y's provide much more information because they can first be separately analyzed by a PCA. This gives a good idea about the amount of systematic variation in Y, which Y-variables should be analyzed together, etc.

(3) The first piece of information about the model is the number of significant components, A—the complexity of the model and hence of the system. This number of components gives a lower bound on the number of effects we need to postulate in the system.

(4) The goodness of fit of the model is given by R2 and Q2 (cross-validated R2). With several Y's, one also obtains R2m and Q2m for each ym. The R2's give an upper bound of how well the model explains the data and predicts new observations, and the Q2's give lower bounds for the same things (see the short sketch after this list).

(5) The (u,t) score plots for the first two or three model dimensions will show the presence of outliers, curvature, or groups in the data. The (t,t) score plots—windows in X-space—show inhomogeneities, groups, or other patterns. Together with this, the (w*c) weight plots give an interpretation of these patterns.

(6a) If problems are seen, i.e., small R2 and Q2 values, or outliers, groups, or curvature in the score plots, one should try to remedy the problem. Plots of residuals (normal probability, DModX, and DModY) may give additional hints about the sources of the problems. Single outliers should be inspected for correctness of the data and, if this does not help, be excluded from the analysis only if they are non-interesting (i.e., low activity). Curvature in (u,t) plots may be improved by transforming parts of the data (e.g., log), or by including squared and/or cubic terms in the model. After possibly having transformed the data, modified the model, divided the data into groups, deleted outliers, or whatever else is warranted, one returns to step 1.

(6b) If no problems are seen, i.e., R2 and Q2 are OK, and the model is interpretable, one may try to mildly prune the model by deleting unimportant terms, i.e., those with small regression coefficients and low VIP values. Thereafter a final model is developed, interpreted, and validated, predictions are made, etc. For the interpretation, the (w*c) weight plots, coefficient plots, and contour or 3D plots with dominating X's as plot coordinates are invaluable.
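
To make the special coding of Y concrete, the following minimal sketch takes the dummy-variable route to PLS-DA described above: one 0/1 indicator column per class is used as Y. The three-class simulated data and the use of scikit-learn's PLSRegression are illustrative assumptions only.

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(2)
    # Hypothetical training set: three classes, 10 observations each, 15 collinear X-variables
    centers = rng.normal(scale=2.0, size=(3, 15))
    X = np.vstack([c + rng.normal(scale=0.5, size=(10, 15)) for c in centers])
    classes = np.repeat([0, 1, 2], 10)

    # Special coding of Y: one dummy (0/1) column per class
    Y = np.zeros((30, 3))
    Y[np.arange(30), classes] = 1.0

    pls = PLSRegression(n_components=2).fit(X, Y)

    # A new observation is assigned to the class whose dummy variable is predicted largest
    x_new = centers[1] + rng.normal(scale=0.5, size=15)
    print(pls.predict(x_new.reshape(1, -1)).argmax())   # expected: class 1

The (t1, t2) score plot of such a two-component model serves as the discriminant plane referred to above.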

Analogously, when X contains qualitative variables (the ANOVA situation), these can be coded using dummy variables, and the data can then be analyzed using MLR or PLSR. The latter is advantageous if the coded X is unbalanced and/or rank deficient (or close to it). With several Y-variables, this provides a simple approach to MANOVA (multiple responses ANOVA) [11].

Auto-regressive and transfer function models in time-series analysis are easily analyzed by first constructing an expanded X-matrix that contains the appropriate lags of the Y- and X-variables, and then calculating a PLSR solution. If the disturbances a_t are included in the models (ARMA, etc.), a two-step analysis is needed, first calculating estimates of a_t, and then using these in lagged form in the final PLSR model.
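
Below is a minimal sketch of this lag-expansion step; in spirit it is the same arrangement as the lagged COD and process variables behind Fig. 15. The data, the lag depth of two, and the helper function add_lags are hypothetical, and scikit-learn's PLSRegression again stands in for dedicated chemometrics software.

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    def add_lags(M, lags):
        # Append lagged copies (lag 1 ... lags) of the columns of M and drop the
        # first `lags` rows so that all columns are aligned in time.
        n = M.shape[0]
        blocks = [M[lags:]] + [M[lags - k: n - k] for k in range(1, lags + 1)]
        return np.hstack(blocks)

    rng = np.random.default_rng(3)
    n, p = 200, 5
    X_proc = rng.normal(size=(n, p))                 # hypothetical process variables
    y = np.zeros(n)
    for t in range(1, n):                            # slowly changing response (AR-like)
        y[t] = 0.8 * y[t - 1] + 0.3 * X_proc[t, 0] + 0.1 * rng.normal()

    lags = 2
    # Expanded X: current and lagged process variables plus lagged y (the auto-regressive part)
    X_exp = np.hstack([add_lags(X_proc, lags), add_lags(y.reshape(-1, 1), lags)[:, 1:]])
    y_resp = y[lags:]

    pls = PLSRegression(n_components=3).fit(X_exp, y_resp.reshape(-1, 1))
    print(pls.coef_.shape)    # one coefficient per current or lagged variable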

In the modelling of mixtures, for instance of chemical ingredients in paints, pharmaceutical, cosmetic, and polymer formulations, beverages, etc., the sum of the X-variables is 1.0, since the ingredients sum to 100%. This makes X rank deficient, and a number of special regression approaches have been developed for the analysis of mixture data. With PLSR, however, the rank deficiency presents no difficulties, and the data analysis becomes straightforward, as shown by Kettaneh-Wold [44].
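
A small sketch makes the closure constraint visible: because the ingredient columns of each row add up to one, the mean-centered X (and hence an MLR design with an intercept) is exactly rank deficient, while a PLSR fit proceeds without trouble. The four-ingredient formulation below is hypothetical.

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(4)
    # Hypothetical mixture data: 25 formulations of 4 ingredients, each row summing to 1.0
    raw = rng.uniform(size=(25, 4))
    X = raw / raw.sum(axis=1, keepdims=True)
    y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + 0.05 * rng.normal(size=25)   # some mixture property

    print(np.linalg.matrix_rank(X - X.mean(axis=0)))   # 3, not 4: closure removes one dimension

    pls = PLSRegression(n_components=2).fit(X, y.reshape(-1, 1))     # fits without difficulty
    print(pls.predict(X[:3]))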

The iterative calculations in non-linear regression often involve a regression-like updating step (e.g., the Gauss–Newton approach). The X-matrix of this step is often highly rank-deficient, and ridge regression is used to remedy the situation (Marquardt–Levenberg algorithms). PLSR provides a simple and interesting alternative, which is also computationally very fast.

9. Conclusions and discussion

PLSR provides an approach to the quantitative modelling of the often complicated relationships between predictors, X, and responses, Y, which, with complex problems, often is more realistic than MLR including stepwise selection variants. This is because the assumptions underlying PLS—correlations among the X's, noise in X, model errors—are more realistic than the MLR assumptions of independent and error-free X's.

The diagnostics of PLSR, notably cross-validation and the score plots (u, t and t, t) with corresponding loading plots, provide information about the correlation structure of X that is not obtained in ordinary MLR. In particular, PLSR results showing that the data are inhomogeneous (like the AA example looked at here) are hard to obtain by MLR. In complicated systems, non-linearities so strong that a single polynomial model cannot be constructed seem to be rather common. Hence, a flexible approach to modelling is often warranted, with separate models for different mechanistic classes. And there is no loss of information with this approach in comparison with the single-model approach. A new observation (object) is first classified with respect to its X-values, and predicted response values are then obtained from the appropriate class model.

The ability of PLSR to analyze profiles of responses makes it easier to devise response measurements that are relevant to the stated objective of the investigation; it is easier to capture the behavior of a complicated system by a battery of measurements than by a single variable.

PLSR can be extended in various directions, to non-linear modelling, hierarchical modelling when the variables are very numerous, multi-way data, PLS time series, PLS-DA, etc. Many recently developed PLS-based approaches are discussed in other articles in this volume. The statistical understanding of PLSR has recently improved substantially [12,13,34].

We feel that the flexibility of the PLS approach, its graphical orientation, and its inherent ability to handle incomplete and noisy data with many variables (and observations) make PLS a simple but powerful approach for the analysis of data of complicated problems.

Acknowledgements

Support from the Swedish Natural Science Research Council (NFR), and from the Center for Environmental Research in Umeå (CMF), is gratefully acknowledged. Dr. Erik Johansson is thanked for helping with the examples.

References

[1] H. Wold, Soft modelling, The basic design and some extensions, in: K.-G. Jöreskog, H. Wold (Eds.), Systems Under Indirect Observation, vols. I and II, North-Holland, Amsterdam, 1982.
[2] S. Wold, A. Ruhe, H. Wold, W.J. Dunn III, The collinearity problem in linear regression, The partial least squares approach to generalized inverses, SIAM J. Sci. Stat. Comput. 5 (1984) 735–743.
[3] A. Höskuldsson, PLS regression methods, J. Chemom. 2 (1988) 211–228.
[4] A. Höskuldsson, Prediction Methods in Science and Technology, vol. 1, Thor Publishing, Copenhagen, 1996, ISBN 87-985941-0-9.
[5] S. Wold, E. Johansson, M. Cocchi, PLS—partial least squares projections to latent structures, in: H. Kubinyi (Ed.), 3D QSAR in Drug Design, Theory, Methods, and Applications, ESCOM Science Publishers, Leiden, 1993, pp. 523–550.
[6] M. Tenenhaus, La Régression PLS: Théorie et Pratique, Technip, Paris, 1998.
[7] R.W. Gerlach, B.R. Kowalski, H. Wold, Partial least squares modelling with latent variables, Anal. Chim. Acta 112 (1979) 417–421.
[8] N.E. El Tayar, R.-S. Tsai, P.-A. Carrupt, B. Testa, Octan-1-ol-water partition coefficients of zwitterionic α-amino acids, Determination by centrifugal partition chromatography and factorization into steric/hydrophobic and polar components, J. Chem. Soc., Perkin Trans. 2 (1992) 79–84.
[9] S. Hellberg, M. Sjöström, S. Wold, The prediction of bradykinin potentiating potency of pentapeptides, An example of a peptide quantitative structure–activity relationship, Acta Chem. Scand., Ser. B 40 (1986) 135–140.
[10] M. Sandberg, L. Eriksson, J. Jonsson, M. Sjöström, S. Wold, New chemical dimensions relevant for the design of biologically active peptides, A multivariate characterization of 87 amino acids, J. Med. Chem. 41 (1998) 2481–2491.
[11] J.E. Jackson, A User's Guide to Principal Components, Wiley, New York, 1991.
[12] A.J. Burnham, J.F. MacGregor, R. Viveros, Latent variable regression tools, Chemom. Intell. Lab. Syst. 48 (1999) 167–180.
[13] A. Burnham, R. Viveros, J. MacGregor, Frameworks for latent variable multivariate regression, J. Chemom. 10 (1996) 31–45.
[14] R. Manne, Analysis of two partial least squares algorithms for multivariate calibration, Chemom. Intell. Lab. Syst. 1 (1987) 187–197.
[15] J. Trygg et al., This issue.
[16] P. Nomikos, J.F. MacGregor, Multivariate SPC charts for monitoring batch processes, Technometrics 37 (1995) 41–59.
[17] P.R.C. Nelson, P.A. Taylor, J.F. MacGregor, Missing data methods in PCA and PLS: score calculations with incomplete observations, Chemom. Intell. Lab. Syst. 35 (1996) 45–65.
[18] B. Grung, R. Manne, Missing values in principal component analysis, Chemom. Intell. Lab. Syst. 42 (1998) 125–139.
[19] I.N. Wakeling, J.J. Morris, A test of significance for partial least squares regression, J. Chemom. 7 (1993) 291–304.
[20] M. Clark, R.D. Cramer III, The probability of chance correlation using partial least squares (PLS), Quant. Struct.-Act. Relat. 12 (1993) 137–145.
[21] J. Shao, Linear model selection by cross-validation, J. Am. Stat. Assoc. 88 (1993) 486–494.
[22] F. Lindgren, P. Geladi, S. Wold, The kernel algorithm for PLS I, Many observations and few variables, J. Chemom. 7 (1993) 45–59.
[23] S. Rännar, P. Geladi, F. Lindgren, S. Wold, The kernel algorithm for PLS II, Few observations and many variables, J. Chemom. 8 (1994) 111–125.
[24] K.H. Esbensen, S. Wold, SIMCA, MACUP, SELPLS, GDAM, SPACE and UNFOLD: the ways towards regionalized principal components analysis and subconstrained N-way decomposition—with geological illustrations, in: O.J. Christie (Ed.), Proc. Nord. Symp. Appl. Statist., Stavanger, 1983, pp. 11–36, ISBN 82-90496-02-8.
[25] N. Kettaneh-Wold, J.F. MacGregor, B. Dayal, S. Wold, Multivariate design of process experiments (M-DOPE), Chemom. Intell. Lab. Syst. 23 (1994) 39–50.
[26] O.M. Kvalheim, A.A. Christy, N. Telnaes, A. Bjoerseth, Maturity determination of organic matter in coals using the methylphenanthrene distribution, Geochim. Cosmochim. Acta 51 (1987) 1883–1888.
[27] M.T. Chu, R.E. Funderlic, G.H. Golub, A rank-one reduction formula and its applications to matrix factorizations, SIAM Rev. 37 (1995) 512–530.
[28] M.C. Denham, Prediction intervals in partial least squares, J. Chemom. 11 (1997) 39–52.
[29] B. Efron, G. Gong, A leisurely look at the bootstrap, the jackknife, and cross-validation, Am. Stat. 37 (1983) 36–48.
[30] H. Martens, M. Martens, Modified jack-knife estimation of parameter uncertainty in bilinear modeling (PLSR), Food Qual. Preference 11 (2000) 5–16.
[31] S. Wold, C. Albano, W.J. Dunn III, U. Edlund, B. Eliasson, E. Johansson, B. Nordén, M. Sjöström, The indirect observation of molecular chemical systems, in: K.-G. Jöreskog, H. Wold (Eds.), Systems Under Indirect Observation, vols. I and II, North-Holland, Amsterdam, 1982, pp. 177–207, Chapter 8.
[32] S. Wold, M. Sjöström, L. Eriksson, PLS in chemistry, in: P.v.R. Schleyer, N.L. Allinger, T. Clark, J. Gasteiger, P.A. Kollman, H.F. Schaefer III, P.R. Schreiner (Eds.), The Encyclopedia of Computational Chemistry, Wiley, Chichester, 1999, pp. 2006–2020.
[33] O. Kvalheim, The latent variable, an editorial, Chemom. Intell. Lab. Syst. 14 (1992) 1–3.
[34] I.E. Frank, J.H. Friedman, A statistical view of some chemometrics regression tools, With discussion, Technometrics 35 (1993) 109–148.
[35] S. Wold, A theoretical foundation of extrathermodynamic relationships (linear free energy relationships), Chem. Scr. 5 (1974) 97–106.
[36] D.A. Belsley, E. Kuh, R.E. Welsch, Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, Wiley, New York, 1980.
[37] A. Berglund, S. Wold, INLR, implicit non-linear latent variable regression, J. Chemom. 11 (1997) 141–156.
[38] S. Wold, A. Berglund, N. Kettaneh, N. Bendwell, D.R. Cameron, The GIFI approach to non-linear PLS modelling, J. Chemom. 15 (2001) 321–336.
[39] L. Eriksson, E. Johansson, F. Lindgren, S. Wold, GIFI-PLS: modeling of non-linearities and discontinuities in QSAR, QSAR 19 (2000) 345–355.
[40] L. Eriksson, P. Hagberg, E. Johansson, S. Rännar, D. Sarney, O. Whelehan, A. Åström, T. Lindgren, Multivariate process monitoring of a newsprint mill, Application to modelling and predicting COD load resulting from de-inking of recycled paper, J. Chemom. 15 (2001) 337–352.
[41] G.E.P. Box, W.G. Hunter, J.S. Hunter, Statistics for Experimenters, Wiley, New York, 1978.
[42] M. Sjöström, S. Wold, B. Söderström, PLS discriminant plots, Proceedings of PARC in Practice, Amsterdam, June 19–21, 1985, Elsevier, North-Holland, 1986, pp. 461–470.
[43] L. Ståhle, S. Wold, Partial least squares analysis with cross-validation for the two-class problem: a Monte Carlo study, J. Chemom. 1 (1987) 185–196.
[44] N. Kettaneh-Wold, Analysis of mixture data with partial least squares, Chemom. Intell. Lab. Syst. 14 (1992) 57–69.