TONY DAVIES COLUMN

Introduction

I would like to welcome readers who are new to this column with a few words of explanation. Chemometrics is a subject which has generated (and continues to generate) much interest and excitement in analytical spectroscopy. While there is a rather small band of experts who are developing new techniques, it is not necessary to be an expert to utilise chemometrics, given some basic understanding of the limitations and potential pitfalls for the unwary user of chemometric computer software. I am not a chemometrician, but I have friends who are. The aim of this column is to provide a bridge between chemometricians (who are expert mathematicians) and potential spectroscopic users who are probably not mathematicians. The intention is that users should develop an understanding which will provide a suitable balance between rejection of unknown methods and unqualified enthusiasm for "black-box" software. This should enable them to make successful use of these powerful enhancements to good spectroscopy.

Photocopies of the previous articles in the Chemometrics Column series in Spectroscopy World are available from the publishers at a very modest cost, and after this issue it will be assumed that readers have the knowledge or the reprints. Former readers of Spectroscopy World will recognise my article on Principal Component Analysis (PCA), which is repeated to enable new readers to fully comprehend the new article by Ian Cowe. While Ian may claim NOT to be a chemometrician, his paper on the utilisation of PCA is probably one of the most frequently referenced papers in near infrared spectroscopy. It is a great pleasure to welcome him to the first of these Columns in Spectroscopy Europe.

The principles of principal component analysis* by Tony Davies, Column Editor

Since the beginning of this column we have been taking a fairly relaxed tour of chemometric concepts while attempting to exclude mathematics as far as possible. I do not intend to change this approach, but in future columns we will have to be able to assume comprehension of some key topics. Principal Component Analysis (PCA) is one of the fundamental methods of multivariate analysis and hence of chemometrics. It was introduced in an early column [Spectroscopy World 2(2), 32 (1990)] but it is so important that this and the next column will be devoted to it.

PCA is a method of data analysis which requires a matrix of samples and variables. It finds the maximum variations in the data and forms new variables [known as Principal Components (PCs)] such that each successive PC accounts for as much of the remaining variability as possible, except that each new variable must be orthogonal (at right angles) to all other new variables.

PCA is easily defined by matrix algebra but the intention of this column is to present ideas in diagrammatic forms. This makes life difficult, because we are visually restricted to three dimensions and thus we can only illustrate the working of PCA in terms of three variables. It is important to realise that the power of PCA is in being able to examine large numbers of variables and to compute many principal components which are mathematically orthogonal to each other. In some discussions of PCA this ability is not emphasised because of the difficulties of demonstrating it, and the reader could be left with the impression that we only use two or three principal components. Except for very simple data, the number of principal components is more likely to be between 10 and 20.

The output from PCA is in the form of two tables and some statistical information. The first of these contains values for each sample on each Principal Component. These are known as scores. The other contains the coefficients used to compute the components from the original variables; these are known as weights (or sometimes coefficients). Both contain useful information. The scores are mainly concerned with the samples and can be used in place of the original variables, while the weights show how the components are formed and tell us about the distribution of information in the data set. If you remember the article about cutting the data cake [Spectroscopy World 2(1), 35 (1990)] then the weights are represented by the shape of the cutter and the table of scores holds the new slices of the computed cake.

One of the important statistics from a PCA is the total percentage of variance explained. This should be very close to 100%. The first few PCs will contain the majority of the variance, but experience with PCA soon leads one to take notice of the later PCs which may explain only very small variances; sometimes this can be the crucial information in your data. Not retaining sufficient PCs can be like throwing out the baby with the bath-water.

Figure 1 shows a three-dimensional plot for three variables measured on a set of 13 samples. In Figure 2 PCA has found the vector which contains the maximum amount of variability and this will form the first PC. In Figure 3 the PCA has found the position and orientation of a vector which is at right angles to the first PC and contains the maximum amount of variability compared with all the vectors which conform to the specification; this is the second PC. Figure 4 shows the scores for the samples as a plot of the two PCs and Figure 5 shows plots for the PC weights. Figure 4 contains 98% of the variation present in the original three variables. The first component accounted for 73% and the second for 25% of the total variance. It can be seen from the weights plot that the first component is dominated by the second variable, while the second component is largely a product of the first and third variables.

Figure 1. 3D plot.
Figure 2. First PC.
Figure 3. Second PC.
Figure 4. Scores plot.
Figure 5. Weights plot (horizontal axis: X variable number).

Notes

I have tried to keep this explanation as simple as possible. Perhaps I should make the point that before carrying out PCA it is usually necessary to transform the data. This involves correcting for the mean (i.e. subtracting the mean value of that variable) and sometimes standardising by making the variance of each variable equal to 1. Most software packages will do this for you so that my simple model is sufficient until you want to check that your program is giving correct answers!

The orthogonality of PCA is actually a dual orthogonality. Not only are the vectors orthogonal but also the scores are uncorrelated (i.e. orthogonal).

Acknowledgement

I am grateful to Tom Fearn for making sure that during my efforts to obtain simplicity I have not strayed from a valid description of PCA.

*Reprinted from Spectroscopy World 4(1), 23 (1992).

Spectroscopy Europe 4/2 (1992)
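For readers who want to check that a program is giving correct answers, the steps described in this article can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 13 x 3 data set is synthetic (the values behind Figure 1 are not given), the function name pca is invented here, and the singular value decomposition from NumPy is used as one convenient route to the components rather than as the method of any particular package.

```python
import numpy as np

def pca(X, standardise=False):
    """Return scores, weights and percentage of variance explained per PC."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)               # correct for the mean of each variable
    if standardise:
        Xc = Xc / Xc.std(axis=0, ddof=1)  # optional: unit variance per variable
    # SVD of the centred data: the rows of Vt are orthogonal unit vectors
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                        # value of each sample on each PC
    weights = Vt                          # one row of weights per PC
    pct_var = 100.0 * s**2 / np.sum(s**2)
    return scores, weights, pct_var

# Synthetic stand-in for the example: 13 samples measured on 3 variables.
rng = np.random.default_rng(0)
mixing = np.array([[1.0, 0.2, 0.0],
                   [0.2, 2.0, 0.3],
                   [0.0, 0.3, 0.5]])
X = rng.normal(size=(13, 3)) @ mixing

scores, weights, pct_var = pca(X)
print("percentage of variance per PC:", np.round(pct_var, 1))

# Dual orthogonality: the weight vectors are orthonormal ...
print(np.allclose(weights @ weights.T, np.eye(3)))
# ... and the scores on different PCs are uncorrelated.
cross = scores.T @ scores
print(np.allclose(cross - np.diag(np.diag(cross)), 0.0))
```

Multiplying the scores by the weights returns the mean-corrected data, which is a quick way to confirm that nothing has been lost when all the components are retained.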

Applications using principal component analysis by Ian A. Cowe, 10 Buddon Drive, Monifieth, Dundee DD2 5DA, Scotland.

In a previous article, Tony Davies explained how principal components are derived and defined some of their basic properties. In this article, I will look at one application of Principal Component Regression (PCR) to predict composition and also consider applications where components are used as an assessment of some aspect of functionality without direct use of constituent data. Although the applications discussed will relate to near infrared diffuse reflectance spectroscopy, the same general principles apply in other fields.

PCR is a chemometric technique which uses all the spectral data to predict composition. It provides two new variates: "weights", which represent the relative importance of each of the original data values to the components and which can be used for spectral interpretation, and "scores", which condense the original data into a few uncorrelated values which can either be regressed against chemical values, or examined by other techniques such as discriminant analysis to reveal some underlying trend or relationship.

Scores are derived solely from the spectral data and we obtain a score for each sample on each principal component. Each orthogonal vector (or PC) represents in turn a decreasing amount of the spectral variation. By monitoring the cumulative variance as each component is derived we can determine easily how many components are needed to model the variation that relates to major physical and chemical effects.

A real application, in this case wheat flour with values for protein and moisture [1], shows how easy it is to use PCR. Table 1 shows a summary for the first few components. Although we normally derive between 10 and 20 components, in this case only the first few correlated with moisture and protein. The remainder had uniformly low correlations and together represented less than 0.02% of the spectral variation.

To be included in a model, a component should have a significant correlation with the constituent of interest and express an amount of spectral variation in proportion to its concentration and absorption coefficient. This avoids the inclusion of later components which express practically zero variation but have statistically significant correlations due to random chance.

In Table 1, only two components (PC1 and PC2) correlate with oven dried moisture. Water is one of the strongest absorbers and should be present at about 12% in these samples. So we should expect that early components would be dominated by water. In fact, the second component (r = 0.97) alone would be enough to adequately predict moisture content. With protein, a weaker absorber, the first, fourth and, to a lesser extent, the third components showed some correlation.

One of the main advantages of PCR over conventional wavelength regression is that spectral interpretation of a model is much easier. When all the "x" data are spectral values then plots of the weights become analogous to spectra.

Table 1. Statistics for wheat flour.

PC No.   % Var.   % Cum. Var.   r (moisture)   r (protein)
1        98.60    98.60         -0.16          -0.71
2         0.99    99.59          0.97          -0.08
3         0.22    99.81          0.09           0.21
4         0.11    99.92         -0.05           0.66
5         0.05    99.97          0.07           0.10
6         0.01    99.98         -0.02          -0.01
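The bookkeeping behind Table 1 is simple enough to reproduce. The sketch below takes the published percentages and correlations, rebuilds the cumulative variance column and ranks the components by the size of their correlation with each constituent. The function names are invented for this illustration, and ranking by absolute correlation is only one plausible way of screening components; it does not apply the second criterion in the text (variation in proportion to concentration and absorption coefficient).

```python
# Table 1 values: (PC number, % variance, r with moisture, r with protein)
table1 = [
    (1, 98.60, -0.16, -0.71),
    (2, 0.99, 0.97, -0.08),
    (3, 0.22, 0.09, 0.21),
    (4, 0.11, -0.05, 0.66),
    (5, 0.05, 0.07, 0.10),
    (6, 0.01, -0.02, -0.01),
]

def cumulative_variance(rows):
    """Running total of the % variance column, rounded to two decimals."""
    total, out = 0.0, []
    for _, pct_var, _, _ in rows:
        total += pct_var
        out.append(round(total, 2))
    return out

def rank_by_correlation(rows, column):
    """PC numbers ordered by decreasing |r|; column 2 = moisture, 3 = protein."""
    return [row[0] for row in sorted(rows, key=lambda row: -abs(row[column]))]

cum = cumulative_variance(table1)
print("cumulative % variance:", cum)
print("moisture ranking:", rank_by_correlation(table1, 2))
print("protein ranking: ", rank_by_correlation(table1, 3))
```

The rankings put PC2 first for moisture and PC1, PC4 and PC3 first for protein, which agrees with the discussion of Table 1 above.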


Figure 1 shows the shapes of the first four components. These are plots of the weights against wavelength. Typical NIR spectra consist of approximately 700 data points covering the range 1100 to 2500 nm and so we have 700 weights. For each component we get a weight at each wavelength and the weights are scaled in such a way that the sum of the squared weights across the spectrum always equals one. This means that, for any component, wavelengths with large weights are proportionally more important in determining the sample score on that component than wavelengths with weights close to zero.

Figure 1. First four principal components for wheat flour.

If we look at Figure 1, we see that PC2 (which correlated highly with water) has a shape similar to a water spectrum. PC4, which correlated highly (r = 0.66) with protein, shows protein bands as high positive weights at 1980, 2050, 2180 and 2210 nm. But how do we interpret PC1? It expressed almost all the spectral variation, correlated highly with protein, yet showed no evidence of protein bands. In fact, PC1 relates mainly to baseline shifts caused by variation in particle size between samples. The particle size of the ground flour is determined largely by grain hardness, and protein is one of the factors which affect hardness. Thus, PC1 relates to protein through a secondary correlation with a physical factor. Normally the use of an indirect correlation should be avoided but, as this relationship is likely to be stable, here we can exploit it in our protein model.

The orthogonality of principal components makes regression modelling a simple and predictable process. The multiple correlation for any combination of components relates directly to the individual component correlations. We simply sum the squared individual correlations and take the square root to obtain the multiple correlation (see Table 2). Adding PC3 only marginally improved the model for moisture, while for protein adding PC5 made little difference. Thus we would predict moisture using the first two components and protein using PCs 1, 3 and 4. The form of the models is as follows:

% Protein = 10.99 + 2.219 x PC1 + 13.72 x PC3 + 60.55 x PC4
% Moisture = 13.47 + 0.24 x PC1 + 14.47 x PC2

Table 2. Building regression models for moisture and protein in wheat.

For moisture
PC2 + PC1 = √(0.97² + 0.16²) = 0.983
PC2 + PC1 + PC3 = √(0.97² + 0.16² + 0.09²) = 0.987

For protein
PC1 + PC4 = √(0.71² + 0.66²) = 0.969
PC1 + PC4 + PC3 = √(0.71² + 0.66² + 0.21²) = 0.992
PC1 + PC4 + PC3 + PC5 = √(0.71² + 0.66² + 0.21² + 0.10²) = 0.997

One strength of PCR is that, because of orthogonality, values of regression coefficients do not change when terms are added to or subtracted from the model. Thus for protein, the largest coefficient (PC4) always has a value of 60.55.

One strength of principal components is that they are derived solely from the spectral data. They can be used even where no suitable reference values are available. Take, for example, the problem of monitoring progress of a batch process. An example was presented recently by Griffin, Kohn and Cowie [2]. Using the sample scores we can represent each sample as a single point in a p dimensional space (where p is the number of components). The scores are Cartesian co-ordinates defining where each point lies within the space. As we cannot visualise more than three dimensions we normally select two components to provide a suitable two dimensional "window" on the p dimensional space.

If, for an imaginary example, we took samples every few minutes throughout the life of a process to a point beyond where it normally would be stopped, we might find that the scores form a "track" across a plane defined by two components (Figure 2). This is not surprising as the samples form a time series and adjacent samples are closely related. If the batch process were repeated several times, then we could measure the errors associated with "normal" operation throughout the process. Finally, we could identify a small area of the two dimensional space which represents an acceptable end point for the reaction. When subsequent batches are run, the operating conditions can be modified to keep the reaction "on track", and when scores within the end point space are encountered the process can be stopped. When linked with feedback control systems this forms a powerful system.

Figure 2. Scores/scores plot for process control (axes: PCm and PCn). S = start or initial value, F = final value. The dashed line represents the normal end point for the reaction.

These examples show two contrasting ways in which principal components can provide a solution to basic chemometric problems. There are several statistical programs currently available for personal computers which provide PCA as an option.

References

1. I.A. Cowe and J.W. McNicol, "The use of principal components in the analysis of near-infrared spectra", Appl. Spectrosc. 39, 257 (1985).
2. J.A. Griffin, W. Kohn and J. Cowie, in Making Light Work: Advances in Near Infrared Spectroscopy, I.A. Cowe and I. Murray (Eds), Proc. of the 4th Int. Conf. on NIR Spectrosc., 13-19 Aug, 1991, Aberdeen, Scotland. VCH, Weinheim, Germany (1992).
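The multiple correlations in Table 2 can be checked directly from the individual correlations in Table 1. The short sketch below does the arithmetic described in the text: sum the squared correlations and take the square root. The dictionary and function names are invented for this illustration, and the shortcut holds only because the scores on different components are uncorrelated.

```python
import math

# Individual correlations from Table 1 (PC number -> r)
r_moisture = {1: -0.16, 2: 0.97, 3: 0.09, 4: -0.05, 5: 0.07, 6: -0.02}
r_protein = {1: -0.71, 2: -0.08, 3: 0.21, 4: 0.66, 5: 0.10, 6: -0.01}

def multiple_correlation(r, components):
    """Root sum of squares of the individual correlations.

    Valid only when the predictors (here the PC scores) are uncorrelated.
    """
    return math.sqrt(sum(r[pc] ** 2 for pc in components))

# Components are added in the same order as in Table 2.
for label, r, order in [("moisture", r_moisture, [2, 1, 3]),
                        ("protein", r_protein, [1, 4, 3, 5])]:
    for k in range(2, len(order) + 1):
        chosen = order[:k]
        print(label, chosen, round(multiple_correlation(r, chosen), 3))
```

Run as written, this reproduces the five values in Table 2 (0.983 and 0.987 for moisture; 0.969, 0.992 and 0.997 for protein).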

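The claim that regression coefficients do not change when terms are added to or removed from a PCR model can also be demonstrated numerically. The sketch below is a hedged illustration, not the wheat flour calculation: the "spectra" and the constituent values y are random numbers invented for the purpose, the helper fit is hypothetical, and ordinary least squares from NumPy stands in for whatever regression routine a given package uses.

```python
import numpy as np

rng = np.random.default_rng(1)
spectra = rng.normal(size=(30, 8))            # hypothetical data: 30 samples, 8 variables
centred = spectra - spectra.mean(axis=0)      # correct for the mean of each variable
U, s, Vt = np.linalg.svd(centred, full_matrices=False)
scores = U * s                                # zero-mean, mutually uncorrelated scores

# Invented constituent values that depend on PC1 and PC4 plus a little noise.
y = 10.0 + 2.0 * scores[:, 0] + 5.0 * scores[:, 3] + rng.normal(scale=0.1, size=30)

def fit(pcs):
    """Least-squares fit of y on an intercept plus the chosen score columns."""
    A = np.column_stack([np.ones(len(y)), scores[:, pcs]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    names = ["intercept"] + [f"PC{p + 1}" for p in pcs]
    return dict(zip(names, coef))

small = fit([0, 3])       # model using PC1 and PC4
large = fit([0, 2, 3])    # same model with PC3 added
print(small)
print(large)
```

The intercept and the coefficients for PC1 and PC4 are the same in both fits, to within rounding error, because each score column is orthogonal to the others and to the constant term. With correlated predictors, such as raw wavelengths, adding a term would generally change every coefficient.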