
Multivariate Data Analysis
Introduction to Multivariate Data Analysis
Principal Component Analysis (PCA)
Multivariate Linear Regression (MLR, PCR and PLSR)

Laboratory exercises: Introduction to MATLAB. Examples of PCA (cluster analysis of samples, identification and geographical distribution of contamination sources/patterns, ...). Examples of Multivariate Regression (prediction of concentration of chemicals from spectral analysis, investigation of correlation patterns and of the relative importance of variables, ...).

Romà Tauler (IDAEA, CSIC, Barcelona) [email protected]

Univariate and Multivariate Data

Univariate:
• Only one measurement per sample (pH, absorbance, peak height or area).
• The property is defined by one measurement.
• Total selectivity is needed; interferences should be eliminated before the measurement (separation).
• Numerical treatment is easy.

Multivariate:
• Multiple measurements per sample: instrumental measurements (spectra, chromatograms, ...) or multiple measured variables (constituent concentrations, sensorial variables, ...).
• Total selectivity is not needed; interferences can be present.
• Different complexity levels (vectors, matrices, tensors, ...).

Univariate Statistics

One-variable summary:
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i \;(\to\mu), \qquad s^2 = \frac{\sum_{i=1}^{n}(x_i-\bar{X})^2}{n-1} \;(\to\sigma^2), \qquad s = \sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar{X})^2}{n-1}} \;(\to\sigma,\ \text{standard deviation})$$

Relation between two variables:
$$s_{x,y} = \frac{\sum_{i=1}^{n}(x_i-\bar{X})(y_i-\bar{Y})}{n-1} \;(\to\sigma_{x,y},\ \text{covariance}), \qquad r_{x,y} = \frac{s_{x,y}}{s_x s_y} \;(\text{correlation})$$

Univariate Statistics: description of one variable.
Zn: 4.10, 4.04, 3.82, 3.96, 4.07, 4.23, 3.73, 3.80, 4.23
• Mean: 4.00 (mean value of the variable; size of the scale of the variable).
• Standard deviation: 0.18; variance: 0.032 (dispersion around the mean; spread of the scale of the variable).

Univariate Statistics: description of the relation between 2 variables.
Sn: 0.20, 0.20, 0.15, 0.61, 0.57, 0.58, 0.30, 0.60, 0.10
Ni: 0.0022, 0.0020, 0.0015, 0.0062, 0.0056, 0.0055, 0.0033, 0.0056, 0.0014
• Covariance: 0.00043 (high values indicate a linear relationship between x and y; sign: direct (+) or inverse (-) relation; depends on the scale sizes of x and y).
• Correlation: 0.996 (|1| total linearity, 0 absence of linear relationship; sign: direct (+) or inverse (-); independent of the scale sizes of x and y).
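These summaries can be reproduced with built-in MATLAB functions; a minimal sketch using the example data above (the variable names are illustrative, not from the course material):

% Zn, Sn and Ni example data from the slides
Zn = [4.10 4.04 3.82 3.96 4.07 4.23 3.73 3.80 4.23]';
Sn = [0.20 0.20 0.15 0.61 0.57 0.58 0.30 0.60 0.10]';
Ni = [0.0022 0.0020 0.0015 0.0062 0.0056 0.0055 0.0033 0.0056 0.0014]';
% one-variable summary (mean ~ 4.00, standard deviation ~ 0.18, variance ~ 0.032)
mean(Zn), std(Zn), var(Zn)
% relation between two variables
covSnNi  = cov(Sn, Ni);       % 2x2 covariance matrix; off-diagonal ~ 0.00043
corrSnNi = corrcoef(Sn, Ni);  % 2x2 correlation matrix; off-diagonal ~ 0.996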

High Correlation, redundant information. Low Correlation, complementary Information. Multivariate Data Example

Sample   Sn     Zn     Fe     Ni
1        0.20   4.10   0.06   0.0022
2        0.20   4.04   0.04   0.0020
3        0.15   3.82   0.08   0.0015
4        0.61   3.96   0.09   0.0062
5        0.57   4.07   0.08   0.0056
6        0.58   4.23   0.07   0.0055
7        0.30   3.73   0.02   0.0033
8        0.60   3.80   0.07   0.0056
9        0.10   4.23   0.05   0.0014

Rows are the objects (samples) and columns are the variables: X(9,4), i.e. X(n,m).

Multivariate Statistics

Description of variables

• Vector of means: the mean value of each variable.
  Sn, Zn, Fe, Ni: (0.37  4.00  0.06  0.0037)
  – Sample differences are partly due to the differences in the size scales of the variables.
  – Variables with higher means have a higher influence on the data description.
  – If the scale sizes are very different, a data pretreatment will probably be needed.

$$\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$$

Multivariate Statistics

Description of variables

• Vector of standard deviations, s = (s1, s2, s3, ..., sm): dispersion of the different variables.
  Sn, Zn, Fe, Ni: (0.22  0.18  0.02  0.0020)
  – Shows the differences in the scale ranges (dispersion, spread) among variables.
  – Variables with higher dispersions will have a higher influence on the data description.
  – If the scale ranges of the different variables are very different, a data pretreatment will probably be needed.

$$s_j = \sqrt{\frac{\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2}{n-1}}$$

Multivariate Data

Robust Parameters

Each variable sorted by value (used to obtain the median and the interquartile spread):

Sn     Zn     Fe     Ni
0.10   3.73   0.02   0.0014
0.15   3.80   0.04   0.0015
0.20   3.82   0.05   0.0020
0.20   3.96   0.06   0.0022
0.30   4.04   0.07   0.0033
0.57   4.07   0.07   0.0055
0.58   4.10   0.08   0.0056
0.60   4.23   0.08   0.0056
0.61   4.23   0.09   0.0062

• Robust parameters are less sensitive to the presence of outliers.
• The spread of the interquartile range is not symmetric with respect to the median.
• They show the data structure.

Introduction to Multivariate Data Analysis

Descriptive Statistics (Excel)

                     ppDDD          opDDT          ppDDT          Total DDX      Total POCs     PCB#28
Mean                 6.096511644    2.16580938     7.086504433    66.19020173    78.70933531    7.600267864
Median               1.928828322    0.035          0.099439743    18.10346088    23.83508898    2.1
Mode                 0.01           0.035          0.02           0.145          0.3215         0.9
Standard deviation   20.01177892    12.61189239    59.35342594    199.7332731    229.5487184    23.05609066
Sample variance      400.4712955    159.0598297    3522.829171    39893.38037    52692.61412    531.5833166
Kurtosis             45.08257127    52.62597243    100.808786     37.034318      41.9234197     37.79168548
Skewness             6.553464204    7.205359571    10.01488076    5.826281824    6.155842663    6.004360386
Range                158.99         103.565        598.939        1536.855       1856.1785      165.535
Minimum              0.01           0.035          0.02           0.145          0.3215         0.065
Maximum              159            103.6          598.959        1537           1856.5         165.6
Sum                  621.8441877    220.9125568    722.8234522    6751.400577    8028.352202    752.4265185
Count                102            102            102            102            102            99
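A minimal MATLAB sketch of such a column-wise summary (X is assumed to be an n x m data matrix with the samples in rows; skewness and kurtosis would need the Statistics and Machine Learning Toolbox):

% column-wise descriptive statistics, similar to the Excel table above
colMean   = mean(X);          % mean of each variable
colMedian = median(X);        % median (robust location)
colStd    = std(X);           % standard deviation
colVar    = var(X);           % sample variance
colMin    = min(X);  colMax = max(X);
colRange  = colMax - colMin;
colSum    = sum(X);
% colSkew = skewness(X); colKurt = kurtosis(X);   % Statistics Toolbox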

Probability distributions: Box plots
A box plot summarizes the information on the data distribution primarily in terms of the median, the upper quartile and the lower quartile. The "box" by definition extends from the upper to the lower quartile, and a dot or line within the box marks the median. The width of the box, i.e. the distance between the upper and lower quartiles, is equal to the interquartile range and is a measure of spread. The median is a measure of location, and the relative distances of the median from the upper and lower quartiles are a measure of symmetry "in the middle" of the distribution. For example, the median is approximately in the middle of the box for a symmetric distribution, and is positioned toward the lower part of the box for a positively skewed distribution.

Box plots, Tucson Precipitation

[Figure: box plots of Tucson precipitation P (in) for January and July, with the upper quartile, median, lower quartile and interquartile range (iqr) labelled.]

Probability distributions: Box plots

"Whiskers" are drawn outside the box at what are called the "adjacent values." The upper adjacent value is the largest observation that does not exceed the upper quartile plus 1.5 iqr, where iqr is the interquartile range. The lower adjacent value is the smallest observation that is not less than the lower quartile minus 1.5 iqr. If no data fall outside this 1.5 iqr buffer around the box, the whiskers mark the data extremes. The whiskers also give information about symmetry in the tails of the distribution.

Box plots, Tucson Precipitation
[Figure: the same January/July box plots, with the whiskers and the interquartile range (iqr) marked.]
For example, if the distance from the top of the box to the upper whisker exceeds the distance from the bottom of the box to the lower whisker, the distribution is positively skewed in the tails. Skewness in the tails may be different from skewness in the middle of the distribution; for example, a distribution can be positively skewed in the middle and negatively skewed in the tails.

Probability distributions: Box plots

Any points lying outside the 1.5 iqr buffer around the box are marked by individual symbols as "outliers". These points are outliers in comparison to what is expected from a normal distribution with the same mean and variance as the data sample. For a standard normal distribution, the median and mean are both zero, and
$$q_{0.25} = -0.67449, \qquad q_{0.75} = 0.67449, \qquad iqr = q_{0.75} - q_{0.25} = 1.349,$$
where $q_{0.25}$ and $q_{0.75}$ are the first and third quartiles and iqr is the interquartile range. The whiskers for a standard normal distribution are therefore at the data values: upper whisker = 2.698, lower whisker = -2.698.
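A quick numerical check of these whisker positions (a sketch; norminv and normcdf belong to the Statistics and Machine Learning Toolbox):

q25 = norminv(0.25);              % ~ -0.67449
q75 = norminv(0.75);              % ~  0.67449
iqr_n = q75 - q25;                % ~  1.349
upperWhisker = q75 + 1.5*iqr_n;   % ~  2.698
lowerWhisker = q25 - 1.5*iqr_n;   % ~ -2.698
pTail = normcdf(lowerWhisker);    % ~ 0.0035, i.e. about 0.35% expected outliers per tail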

Box plots, Tucson Precipitation
[Figure: the same January/July box plots, with points beyond the whiskers marked individually as outliers.]

Probability distributions: Box plots
From the cdf of the standard normal distribution, we see that the probability of a value lower than x = -2.698 is 0.0035. This result shows that for a normal distribution, roughly 0.35 percent of the data are expected to fall below the lower whisker. By symmetry, 0.35 percent of the data are expected above the upper whisker. These data values are classified as outliers. Exactly how many outliers might be expected in a sample of normally distributed data depends on the sample size. For example, with a sample size of 100, we expect no outliers, as 0.35 percent of 100 is much less than 1. With a sample size of 10,000, however, we would expect about 35 positive outliers and 35 negative outliers for a normal distribution:
>> varnorm = randn(10000,3);
>> boxplot(varnorm)
0.35% of 10000 is approximately 35 outliers at each whisker side.

Parametric vs Robust Statistics
[Figure: a box plot annotated with the parametric descriptors (mean, standard deviation) and the robust descriptors (median, interquartile range (IQR), minimum and maximum).]

[Figure: for the Sn, Zn, Fe and Ni example data, mean and standard deviation plots (parametric) next to median and IQR plots (box plots, robust).]
They help to see the size and range scale differences between variables, and they suggest the use of appropriate data pretreatments to handle these differences.

Multivariate Statistics: Description of the variable relationships

Covariance and variance:
$$s_{jl}^2 = \frac{\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)(x_{il}-\bar{x}_l)}{n-1}, \qquad s_{jj}^2 = \frac{\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2}{n-1}$$

$$\mathbf{S} = \begin{pmatrix} s_{11}^2 & s_{12}^2 & \cdots & s_{1m}^2 \\ s_{21}^2 & s_{22}^2 & \cdots & s_{2m}^2 \\ \vdots & & \ddots & \vdots \\ s_{m1}^2 & \cdots & \cdots & s_{mm}^2 \end{pmatrix}$$

Covariance matrix, S(m,m): it contains all the possible pairwise combinations between variables.

Multivariate Statistics: Description of the variable relationships

Correlation:
$$r_{ij} = \frac{s_{ij}^2}{s_i s_j}, \qquad s_{ii} = s_i^2, \qquad \mathbf{C} = \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1m} \\ r_{21} & r_{22} & \cdots & r_{2m} \\ \vdots & & \ddots & \vdots \\ r_{m1} & \cdots & \cdots & r_{mm} \end{pmatrix}$$

• Correlation matrix, C(m,m): it contains all the possible pairwise correlations between variables.
  – Diagonal elements are 1.

Multivariate Statistics: Description of the variable relationships

Covariance matrix
        Sn          Zn          Fe          Ni
Sn      0.047319   -0.002593    0.002518    0.000434
Zn     -0.002593    0.033644    0.000581   -0.000022
Fe      0.002518    0.000581    0.000494    0.000022
Ni      0.000434   -0.000022    0.000022    0.000004

Correlation matrix
        Sn       Zn       Fe       Ni
Sn      1.000   -0.065    0.521    0.995
Zn     -0.065    1.000    0.142   -0.060
Fe      0.521    0.142    1.000    0.502
Ni      0.995   -0.060    0.502    1.000

Introduction to Multivariate Data Analysis: Pair-wise correlations between variables

CORRMAP (correlation map with variable grouping): CORRMAP produces a pseudocolor map which shows the correlation between the variables (columns) of a data set.
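CORRMAP belongs to the PLS_Toolbox; as a rough stand-in (without the variable-grouping step), a similar pseudocolor map can be sketched with built-in MATLAB functions, assuming X is the n x m data matrix:

C = corrcoef(X);          % m x m correlation matrix between variables
imagesc(C, [-1 1]);       % colour-code correlations between -1 and +1
colorbar; axis square;
xlabel('variables'); ylabel('variables');
title('Pairwise correlations between variables');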

Multivariate Data Analysis: Need
• Nature is multivariate:
  – Climate = f(T, rain, winds, seasons, ...)
  – Health = f(genetics, diet, climate, habits, ...)
  – Analyte absorbance = f(solvent, T, interferences, matrix, ...)
• Few properties depend on only one variable.
• Many chemical measurements are multivariate.

Multivariate Data Analysis

Need
• Many times measurements are indirect:
  – Temperature: low values (thermometer, univariate data); high values (emission spectra, FTIR, f(T)).
  – Nitrogen concentration: Kjeldahl chemical method (univariate data); NIR spectra (f(C)).

Multivariate Data Analysis

Need
• The studied property is selectively correlated to a single variable only on a few occasions (lack of total selectivity).
• The studied property is determined by a set of variables with which it presents a high correlation:

P = f(x1, x2, ..., xn)

Multivariate Data Analysis

Need Experimental measures contain information which is not relevant to the property of interest

Observations = Structure + Noise

Structure = part of the signal correlated with the sought property

Noise = all the other contributions: instrumental noise, experimental errors, other components, ...

Multivariate Data Analysis

Causality vs Correlation

Correlation is a statistical concept which measures the linear relation between two variables

Causality relationship is a deterministic interpretation from the problem or application

Example: the number of storks and the number of newborn children in a geographical area.

Multivariate Data Analysis: Information

Initial hypothesis: the data contain the sought information. There exists a relationship between the measured variables and the measured property which can be modelled; when the variables change their value, the property will also change:
X (variables) ------> Y (property), model Y = f(X)

X is a vector or a matrix (e.g. spectral measures) Y is a scalar, vector or matrix (e.g. analytic concentrations) Visualization of original data Plot of the matrix rows and/or columns Spectra set

[Figure: a set of spectra plotted as matrix rows (samples), as columns (variables), and as a 3D rows-and-columns surface; one outlier sample stands out.]

Detection of outlier samples/variables Detection of scale and range variable differences Systematic information (structure) is easily detected (instrumental responses). Difficult to interpret when the number of samples is high. Visualization of original data Map of samples In the column space (variables)

Samples are drawn as points in the variable space; similarities among samples can be detected (distances among samples):

        v1   v2   v3
m1    (  2    3    4 )
m2    (  1    0    6 )

[Figure: samples m1 = (2,3,4) and m2 = (1,0,6) plotted as points in the (v1, v2, v3) space.]

Visualization of original data: Map of variables in the row space (samples)
Variables are drawn as vectors in the sample subspace; the correlation among variables can be estimated from the angle between them:
r(vi, vj) = cos(vi, vj);  r = 1 gives an angle of 0°;  r = 0 gives an angle of 90°.
[Figure: variables v1 = (2,1), v2 = (3,0) and v3 = (4,6) plotted as vectors in the (m1, m2) plane.]

Visualization of original data

Graphical representation of multivariate data in the variable space (3D):

Sample   Sn     Zn     Fe     Ni
1        0.20   3.4    0.06   0.08
2        0.20   2.4    0.04   0.06
3        0.15   2.0    0.08   0.16
4        0.61   6.0    0.09   0.02
5        0.57   4.2    0.08   0.06
6        0.58   4.82   0.07   0.02
7        0.30   5.60   0.02   0.01
8        0.60   6.60   0.07   0.06
9        0.10   1.60   0.05   0.19

[Figure: the nine samples plotted in the (Zn, Fe, Sn) space.]
There are two sample groups. Is the representation of a 4th variable critical?

Visualization of original data

For spaces with more than 3 dimensions?

• Qualitative approximations for a few variables – Chernoff faces.

• Efficient compression of the original space of variables: Principal Component Analysis (PCA).

Visualization of original data: For spaces with more than 3 dimensions?
• Chernoff faces.
  – It is easy to distinguish different features in human faces; each sample is a Chernoff face.
  – Each face feature is a variable.

V.1: upper front of the face
V.2: lower front of the face
V.3: eyebrows
V.4: smile
[Figure: Chernoff face of sample 7.]

Multivariate Data Analysis

Methods Classification

According to their goal: Exploration methods; Discrimination and Classification methods; Correlation and Regression methods; Resolution methods.

According to the data type: based on the original data; based on latent variables (factors).

Multivariate Data Analysis

Exploration Methods

• Visualization of the information. • Sample similarities and clusters • Correlations among variables. • Outlier detection. • Measured variables relevance. Selection. • Principal Component Analysis (PCA). Multivariate Data Analyis

Discriminant and Classification Methods

• Separation of the objects (samples) in defined groups or clusters (classes). • Assignation of new objects to predefined classes • Detection of outlier objects not belonging to any group (classes). •PCA, SIMCA, LDA, PLS-DA, SVM..... Multivariate Data Analyis

Correlation and Regression Methods

• Finding relations between two blocks of variables.
• Modeling property changes from a group of variables.
• Prediction of a property from the indirect measurement of a group of variables correlated to it.
• Multilinear Regression (MLR), Principal Components Regression (PCR), Partial Least Squares Regression (PLSR).
• Non-linear regression methods: Kernel, SVM, ...

Multivariate Data Analysis

Factor Analysis based methods

•Factor:source of the observed data variance of independent and defined nature. • Extraction of the relevant factors (structure) of the data set. Noise Filtering. • Description of the data variance from basic factors. • Identification of the chemical nature of these relevant factors. • PCA, PLS, PCR, SIMCA, • Multivariate Curve Resolution. MCR. PMF, ICA,.... Multivariate Data Analyis

Data pre-processing

• Modify the size and the range of the scale of the variables.
• They can be applied in the direction of the columns (variables) or of the rows (objects, samples).
• They are selected as a function of the data nature and of the information to be obtained.
• There is no optimal treatment; it depends on the chemical problem to be investigated.

Multivariate Data Analysis: Data pre-processing

1) Mean centering (axis translation)

on the data matrix columns: $x^*_{ik} = x_{ik} - \bar{x}_k$, with $\bar{x}_k = \frac{1}{I}\sum_{i=1}^{I} x_{ik}$

on the data matrix rows: $x^*_{ik} = x_{ik} - \bar{x}_i$, with $\bar{x}_i = \frac{1}{K}\sum_{k=1}^{K} x_{ik}$

[Figure: effect of centering.]

Multivariate Data Analysis: Data pre-processing

2) Scaling

on the data matrix columns: $x^*_{ik} = \frac{x_{ik}}{s_k}$, with $s_k = \sqrt{\frac{\sum_{i=1}^{I}(x_{ik}-\bar{x}_k)^2}{I-1}}$

on the data matrix rows: $x^*_{ik} = \frac{x_{ik}}{s_i}$, with $s_i = \sqrt{\frac{\sum_{k=1}^{K}(x_{ik}-\bar{x}_i)^2}{K-1}}$

3) Autoscaling = mean centering + scaling:
$$x^*_{ik} = \frac{x_{ik}-\bar{x}_k}{s_k} \quad\text{(columns)}, \qquad x^*_{ik} = \frac{x_{ik}-\bar{x}_i}{s_i} \quad\text{(rows)}$$

4) Normalization:
$$x^*_{ik} = \frac{x_{ik}}{c_i}, \qquad c_i = \sum_{k=1}^{K} x_{ik}^2, \quad c_i = \sum_{k=1}^{K} x_{ik}^N, \quad \ldots$$

5) Rotation: $X^* = R^T X$; e.g. in two dimensions:
$$\begin{pmatrix} x_1^* \\ x_2^* \end{pmatrix} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad R^T \text{ is the rotation matrix.}$$
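A minimal MATLAB sketch of pretreatments 1 to 4 (illustrative names; a recent MATLAB release is assumed for the implicit array expansion):

% X: (I x K) data matrix, rows = samples, columns = variables
Xc = X - mean(X);                    % 1) column mean centering
Xs = X ./ std(X);                    % 2) column scaling
Xa = (X - mean(X)) ./ std(X);        % 3) autoscaling (mean centering + scaling)
Xn = X ./ sqrt(sum(X.^2, 2));        % 4) row normalization to unit length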

Multivariate Data Analysis: Data pre-processing
• Original data (without pretreatment): the scale, size and range of the variables are kept.
[Figure: metal concentrations (Sn, Zn, Fe, Ni) vs sample number for the raw data.]

Multivariate Data Analysis: Data pre-processing
• Centered data: each variable value is subtracted by the mean of all the values of that variable.

[Figure: centered metal concentrations vs sample number.]
Differences among variables due to scale size are eliminated.

Multivariate Data Analysis: Data pre-processing

• Autoscaled data: each value of the variable is centered and divided by the standard deviation of the values of that variable.
[Figure: autoscaled metal concentrations vs sample number.]
Differences among variables due to size and range are eliminated.

Multivariate Data Analysis: Data pre-processing

[Figure: the same metal data shown side by side as original, centered and autoscaled values.]
Which samples/variables have higher values? Which variables discriminate better? What is the correlation among the different variables?

Multivariate Data Analysis: Data pre-processing
• Centered data: each variable value is subtracted by its mean value.

– The mean value of all variables is zero

$$X_C(n,m) = \begin{pmatrix} x_{11}-\bar{x}_1 & x_{12}-\bar{x}_2 & \cdots & x_{1m}-\bar{x}_m \\ x_{21}-\bar{x}_1 & x_{22}-\bar{x}_2 & \cdots & x_{2m}-\bar{x}_m \\ \vdots & & & \vdots \\ x_{n1}-\bar{x}_1 & x_{n2}-\bar{x}_2 & \cdots & x_{nm}-\bar{x}_m \end{pmatrix}$$

(n = number of samples, m = number of variables)

Centered data and covariance:
$$\mathbf{S}_{(m,m)} = \frac{1}{n-1}\, X_{C\,(m,n)}^T\, X_{C\,(n,m)}$$

Multivariate Data Analysis: Data pre-processing
• Autoscaled data: each value of the variable is centered and divided by the standard deviation of the values of that variable.
  – The mean of all variables is 0.
  – The variance (dispersion) of all variables is 1.
$$xt_{ij} = \frac{x_{ij}-\bar{x}_j}{s_j}, \qquad X_T(n,m)$$

Autoscaled data and correlation:

$$\mathbf{C}_{(m,m)} = \frac{1}{n-1}\, X_{T\,(m,n)}^T\, X_{T\,(n,m)}$$
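A small numerical check of the two relations above (a sketch; a recent MATLAB release is assumed for the implicit expansion):

[n, m] = size(X);
Xc = X - mean(X);                 % centered data
Xt = (X - mean(X)) ./ std(X);     % autoscaled data
S  = (Xc' * Xc) / (n - 1);        % equals cov(X)
C  = (Xt' * Xt) / (n - 1);        % equals corrcoef(X)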

Multivariate Data Analysis: Data pre-processing, normalization
Each vector (spectrum) value is divided by its length (norm):
$$xn_{ij} = \frac{x_{ij}}{\|x_i\|}, \qquad \|x_i\| = \sqrt{\sum_j x_{ij}^2}, \qquad X_N(n,m)$$
[Figure: a spectrum X (absorbance vs wavelength) before and after normalization to unit length.]
Normalization equalizes the response intensities and allows a better comparison of the spectral shapes.

Multivariate Data Analysis: Data pre-processing

In instrumental responses
• To eliminate changes due to instrumental variations without chemical information:
  – Smoothing: noise correction.
  – 1st derivative: correction of constant variations.
  – 2nd derivative: correction of linear variations.
  – Peak alignments.
  – Baseline corrections.
  – Warping.
  – ...
• Different pretreatments can be combined.

Multivariate Data Analysis: Data pre-processing (NIR spectra of samples)

[Figure: NIR spectra of samples shown as original data and after several pretreatments: vertical offset, baseline, 2nd derivative, row autoscaling, and row autoscaling + baseline correction.]

Multivariate Data Analysis: Data pre-processing

Other data pretreatments

– Baseline and background correction
– Noise filtering
– Shift alignment
– Warping
– ...

Multivariate Data Analysis

Data example: environmental monitoring. Data table or matrix X(30,50): 50 parameters are measured on 30 samples.

[Figure: plot of the rows (samples) and of the columns (variables) of the data matrix X.]

Multivariate Data Analysis

Use of: 1) individual sample plots; 2) individual variable plots; 3) descriptive statistics (Excel statistics); 4) box plots; 5) binary correlation between variables; 6) ...

[Figures: row (sample) and column (variable) plots of X; the sum, average and standard deviation of every variable, with variables 6, 9, 11, 17, 29, 35, 39, 41 and 47 standing out; and the sum, average and standard deviation of every sample, with samples 21 and 26 standing out.]

Multivariate Data Analysis

Descriptive Statistics (Excel)

(The same descriptive-statistics table shown above: mean, median, mode, standard deviation, sample variance, kurtosis, skewness, range, minimum, maximum, sum and count for ppDDD, opDDT, ppDDT, Total DDX, Total POCs and PCB#28.)

Multivariate Data Analysis
[Figure: box plots of the 50 variables (values vs column number), with outliers marked by asterisks.]
The box has lines at the lower quartile, median and upper quartile values. The whiskers are lines extending from each end of the box to show the extent of the rest of the data. Outliers are data with values beyond the ends of the whiskers.

Correlation between variables

[Console output: the full 50 x 50 correlation matrix between the variables, displayed block by block (columns 1-7, 8-14, ..., 50).]

Pairwise correlations are difficult to interpret when many variables are involved ⇒ need of multivariate data analysis tools.

Multivariate Data Analysis: Pair-wise correlations between variables

CORRMAP (correlation map with variable grouping) produces a pseudocolor map which shows the correlation between the variables (columns) of the data set.

Multivariate Data Analysis
• What is the SVD of a data matrix X?
  – Singular value decomposition.
  – The singular values are the square roots of the eigenvalues.
  – X = USV^T, where U and V^T are orthonormal matrices and S is a diagonal matrix with the singular values.
  – SVD is an orthogonal matrix decomposition.
  – The elements in S are ordered according to the variance explained by each component.
  – Variance is concentrated in the first components; this allows reducing the number of variables explaining the variance structure and filtering the noise.

$$x_{ij} = \sum_{k=1}^{K} u_{ik}\, s_k\, v_{kj} + e_{ij}, \qquad i = 1,\ldots,I, \quad j = 1,\ldots,J, \quad K < I \text{ or } J$$
$$\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^T + \mathbf{E}$$

Multivariate Data Analysis: Effect of data pretreatments on SVD (plots of the singular values)
[Figure: singular values of the raw, mean-centered, scaled and autoscaled data matrix X; after mean-centering, 4 larger components stand out.]
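A minimal MATLAB sketch of this decomposition and of the effect of the pretreatment on the singular values (illustrative, assuming X is the data matrix):

[U, S, V] = svd(X, 'econ');        % X = U*S*V'
sv   = diag(S);                    % singular values, in decreasing order
evar = 100 * sv.^2 / sum(sv.^2);   % % variance explained by each component
% singular values of raw, mean-centered and autoscaled data
subplot(1,3,1); plot(svd(X), 'o-');                     title('raw');
subplot(1,3,2); plot(svd(X - mean(X)), 'o-');           title('mean-centered');
subplot(1,3,3); plot(svd((X - mean(X))./std(X)), 'o-'); title('autoscaled');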

[Figure: singular values of the raw data X and of the log10-transformed data X; how many components? about 4; non-linearity?]

Multivariate Data Analysis
Introduction to Multivariate Data Analysis, Principal Component Analysis (PCA), Multivariate Linear Regression (MLR, PCR and PLSR)

Laboratory exercises: Introduction to MATLAB Examples of PCA (cluster analysis of samples, identification and geographical distribution of contamination sources/patterns…) Examples of Multivariate Regression (prediction of concentration of chemicals from spectral analysis, investigation of correlation patterns and of the relative importance of variables,…

Romà Tauler (IDAEA, CSIC, Barcelona) [email protected] Principal Components (PCA) Data Representation

[Figure: data points in the original variable space (x1, x2); PC1 is the direction of maximum variation.]
Principal components reduce the dimensions of the original space and keep the relevant information about the data variance.

Principal Components (PCA): Data Representation
[Figure: PC2 is the direction of maximum remaining variance, orthogonal to PC1; the data points are shown in the original (x1, x2, x3) space and in the (PC1, PC2) plane.]
Principal components reduce the dimensions of the original space, keep the relevant information about the data variance and do not repeat information among PCs (they are orthogonal).

Principal Components (PCA): Geometrical interpretation

Example: six objects (a to f) measured on three variables (λ1, λ2, λ3):

Object   λ1   λ2   λ3
a        1    2    3
b        3    3    5
c        4    5    8
d        5    4    7
e        2    1    2
f        0    0    0

[Figure: the six objects plotted in the (λ1, λ2, λ3) space and, after PCA, in the (PC1, PC2) plane.]
[Figure: PC1 is the direction of maximum variance; PC2 is orthogonal to PC1.]

Principal Components (PCA)

• They are mathematical variables which describe efficiently the data variance. – The relevant variance of the original data is described by a reduced number of components (PCs) – Visualization of large data sets (many variables) in the PC space of reduced dimensions. • Information is not repeated (overloaded, PCs are orthogonal). • Describe the main directions of data variance in decreasing order. • They are linear combination of the original variables. Principal Components (PCA)

• They are linear combinations of the original variables:
$$t_1 = x_1 p_{11} + x_2 p_{21}$$
where p11 and p21 are the loadings of the original variables on the first PC, t1.
[Figure: the projection of a sample onto PC1 in the (x1, x2) plane, decomposed into the contributions x1·p11 and x2·p21.]

PCA Model: relationship between the data in the PC space and in the original space. PCs are linear combinations of the original variables.

$$\begin{pmatrix} t_{j1} & t_{j2} \end{pmatrix} = \begin{pmatrix} x_{j1} & x_{j2} & x_{j3} \end{pmatrix} \begin{pmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \\ p_{31} & p_{32} \end{pmatrix}$$
$$\mathbf{T}_{(n,2)} = \mathbf{X}_{(n,3)}\,\mathbf{P}_{(3,2)}, \qquad t_{j1} = x_{j1} p_{11} + x_{j2} p_{21} + x_{j3} p_{31}, \qquad \mathbf{T} = \mathbf{XP}$$

PCA Model
T = X P, with dimensions (n, npc) = (n, m) x (m, npc); n samples, m variables, npc principal components.
• T (scores matrix): describes the samples in the principal component space; the score vectors are orthogonal, $t_i^T t_j = 0$.
• P (loadings matrix): describes the original variables in the principal component space; the loading vectors are orthonormal, $p_i^T p_j = 0$, $\|p_i\| = 1$, i.e. $P^T P = I$.
X = TP^T

PCA Model: X = T P^T (scores x loadings, projections), X = TP^T + E

$$\mathbf{X} = t_1 p_1^T + t_2 p_2^T + \cdots + t_n p_n^T + \mathbf{E}$$
where n is the number of components (<< number of variables in X) and each term $t_k p_k^T$ is a rank-1 matrix.

PCA Model

Model: X = T PT + E X = structure + noise

It is an approximation to the experimental data matrix X

• Loadings, Projections: P^T gives the relationships between the original variables and the principal components (eigenvectors of the covariance matrix). The vectors in P^T (loadings) are orthonormal (orthogonal and normalized).

• Scores, Targets: T gives the relationships between the samples (coordinates of the samples or objects in the space defined by the principal components). The vectors in T (scores) are orthogonal.

• Noise, E: experimental error, non-explained variance.

PCA Model: Determination of the number of components
• When the expected experimental error is known:
  – Plots of explained or residual variance as a function of the number of components of the model.
    • E.g., are models explaining 95% of the variance satisfactory?
  – Compare the mean residual value with the size of the experimental error.
    • E.g., absorbance errors in UV are approximately 0.002.

$$\bar{e} = \frac{\sum_{i,j} e_{ij}^2}{n \times m}, \qquad n \times m = \text{number of elements of the X matrix}$$

Number of PCs → chosen so that ē ≤ error (0.002).
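A minimal sketch of building the PCA model by SVD and checking the residuals against the expected error (illustrative names; the number of components is only an example):

Xc = X - mean(X);                            % centered data
[U, S, V] = svd(Xc, 'econ');
npc = 4;                                     % number of components kept (example)
T = U(:,1:npc) * S(1:npc,1:npc);             % scores,   T = Xc*P
P = V(:,1:npc);                              % loadings, orthonormal columns
E = Xc - T * P';                             % residuals (non-explained part)
explained = 100 * diag(S).^2 / sum(diag(S).^2);   % % variance per component
meanResidual = sqrt(mean(E(:).^2));          % RMS residual, to compare with the expected error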

PCA Model: Determination of the number of components
• When the experimental error is unknown:
  – Plot of the singular values (or of the eigenvalues).
  – Empirical functions related to experimental errors.
  – Cross-validation methods.

PCA Model: Determination of the number of components. Example: n chromatograms x m spectra.
How many components coeluted? The number of coeluting components is deduced from the number of principal components.

PCA Model: Determination of the number of components

– Plot of singular values(sk) or of functions of eigenvalues (λk).

• Singular values or eigenvalues vs. number of PCs (sk).

• Log(eigenvalues) vs. number of PCs (λk).

• Log(reduced eigenvalues) vs. number of PCs (REVk).

$$REV(k) = \frac{\lambda_k}{(r-k+1)(c-k+1)}, \qquad \lambda_k = (s_k)^2, \qquad r = \text{number of rows}, \quad c = \text{number of columns}$$

The size of s_k / λ_k / REV_k is proportional to the importance of the associated PC.

PCA Model: Determination of the number of components

[Figure: log(eigenvalues) and log(REV) vs number of PCs.]
4 significant components; the rest of the components are used to explain the experimental noise.

PCA Model: Determination of the number of components

Use of empirical functions related to the error:
• Eigenvalue functions take advantage of the relation between the explained variance and the size of the eigenvalues.
• These functions have minima, or considerable size changes, at the optimal number of PCs.

PCA Model: Determination of the number of components (Malinowski error functions)

$$RSD = \sqrt{\frac{\sum_{k=n+1}^{c} \lambda_k^0}{r\,(c-n)}}, \qquad IND = \frac{RSD}{(c-n)^2} \quad\text{(Indicator Function)}$$
where c is the number of columns, r the number of rows, n the number of components and $\lambda_k^0$ the eigenvalue of component k.

Minimum of IND → optimal number of PCs.

PCA Model: Determination of the number of components (Malinowski error functions)
$$RSD = \sqrt{\frac{\sum_{k=n+1}^{c} \lambda_k^0}{r\,(c-n)}}, \qquad IND = \frac{RSD}{(c-n)^2}, \qquad IE = RSD\sqrt{\frac{n}{c}} \;\text{(imbedded error)}, \qquad XE = RSD\sqrt{\frac{c-n}{c}} \;\text{(extracted error)}$$

Statistical test of the eigenvalues (Malinowski):

$$F(1, s-n) = \frac{\displaystyle \lambda_n \sum_{k=n+1}^{s} (r-k+1)(c-k+1)}{\displaystyle (r-n+1)(c-n+1) \sum_{k=n+1}^{s} \lambda_k^0}$$

PCA Model: Determination of the number of components

n    Eigenvalue     RE             IND            REV            %SL (F-test)
1    5.4068e+001    1.6716e-002    6.6864e-006    1.1043e-002    0
2    1.2172e+000    5.1353e-003    2.1388e-006    2.5625e-004    -3.5756e-005
3    6.6753e-002    3.5263e-003    1.5305e-006    1.4493e-005    1.6084e-003
4    1.8613e-002    2.9281e-003    1.3255e-006    4.1695e-006    3.7223e-001
5    2.1991e-003    2.8744e-003    1.3584e-006    5.0858e-007    2.9015e+001
6    2.0762e-003    2.8223e-003    1.3937e-006    4.9599e-007    2.9475e+001
7    1.9640e-003    2.7715e-003    1.4316e-006    4.8495e-007    2.9895e+001
8    1.9102e-003    2.7198e-003    1.4710e-006    4.8779e-007    2.9620e+001
9    1.7319e-003    2.6728e-003    1.5152e-006    4.5770e-007    3.1087e+001
10   1.6755e-003    2.6254e-003    1.5618e-006    4.5852e-007    3.0982e+001

Eigenvalues and REVs beyond component 4 are much smaller; IND has a minimum at 4; RE decreases only slowly after 4; and the eigenvalue of PC 4 is significantly larger than those of the higher components.
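A small sketch of the RSD and IND functions defined above, computed from the eigenvalues of a data matrix (illustrative, not the original course code):

% Malinowski RSD (real error) and IND (indicator) functions
% X: (r x c) data matrix with r >= c
[r, c] = size(X);
ev = svd(X).^2;                          % eigenvalues lambda_k^0
RSD = zeros(c-1,1);  IND = zeros(c-1,1);
for n = 1:c-1
    RSD(n) = sqrt(sum(ev(n+1:c)) / (r*(c-n)));
    IND(n) = RSD(n) / (c-n)^2;
end
[~, nOpt] = min(IND);                    % minimum of IND suggests the number of PCs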

PCA Model: Determination of the number of components
– Cross-validation methods:
• A part of the data is used to build the model and another part of the data is described by this model. The optimal number of components is the one giving the lowest residuals in the description of new data.
• This procedure is repeated until all the samples have been used both to build the model and as the external data set.
• The final results are the mean over all repetitions of the modeling/description of the non-included samples.

PCA Model: Determination of the number of components
1. Divide the data sample set into q subsets.
2. Build PCA models with q - 1 data subsets (Xmodel).
3. Use these PCA models to explain the external data subset (Xextern):
   i. Scores: $T_{extern} = X_{extern} P$.
   ii. Reproduction of Xextern: $\hat{X}_{extern} = T_{extern} P^T$.
4. Calculate PRESS (Predictive Residual Sum of Squares) using different numbers of PCs; for PC k:
   $$PRESS(k) = \sum_{i,j} (\hat{x}_{ij} - x_{ij})^2$$
5. Repeat steps 1 to 4 until all q subsets have been used as external data sets.
6. Plot PRESScum vs. number of PCs; for PC k:
   $$PRESS_{cum}(k) = \sum_{q} PRESS(k), \qquad q = \text{number of PCA models}$$

PCA Model: Determination of the number of components

Each subset Xextern is left out in turn while the remaining subsets (Xmodel1 ... Xmodeln) build the model:

             1 PC         2 PC         ...   m PC
Xmodel1      PRESS11      PRESS12      ...   PRESS1m
Xmodel2      PRESS21      PRESS22      ...   PRESS2m
...
Xmodeln      PRESSn1      PRESSn2      ...   PRESSnm
             PRESScum,1   PRESScum,2   ...   PRESScum,m

$$PRESS_{cum,i} = \sum_j PRESS_{ji}$$
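A hedged MATLAB sketch of this cross-validation scheme (simple sample-wise q-fold splitting; all names are illustrative):

% X: (n x m) mean-centered data matrix; q cross-validation subsets
[n, m] = size(X);
q = 5;  maxPC = 10;
groups = repmat(1:q, 1, ceil(n/q));  groups = groups(1:n);
PRESS = zeros(q, maxPC);
for g = 1:q
    Xmodel  = X(groups ~= g, :);            % samples used to build the model
    Xextern = X(groups == g, :);            % left-out samples
    [~, ~, V] = svd(Xmodel, 'econ');
    for k = 1:maxPC
        P = V(:, 1:k);                      % loadings with k PCs
        Xhat = (Xextern * P) * P';          % Textern = Xextern*P, then reproduction
        PRESS(g, k) = sum(sum((Xhat - Xextern).^2));
    end
end
PRESScum = sum(PRESS, 1);                   % its minimum suggests the optimal nr. of PCs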

PCA Model: Determination of the number of components (cross-validation)

PC Nr.   PRESScum
1        3.4772
2        0.2320
3        0.1117
4        0.0505
5        0.0515
6        0.0517
7        0.0521
8        0.0524
9        0.0535
10       0.0531

[Figure: PRESScum vs number of PCs.]
Optimal number of PCs → minimum value of PRESS.

PCA Model: Determination of the number of components, model fitting

How many principal components? Model fitting vs PC reliability.
[Figure: explained variance and model reliability as a function of the number of PCs.]
Higher PCs explain data noise. A PCA model with noisy PCs is less reliable when it describes new data. With more PCs in the model the data fitting is better, but the reliability of the model when it is applied to new data may be worse (overfitting).

PCA model visualization

X = T P^T
Scores plot: map of the samples in the PC space; sample j is plotted at the point (t_j1, t_j2) taken from the rows of T.
Loadings plot: map of the variables in the PC space; variable x_j is plotted at the point (p_j1, p_j2) taken from the rows of P.

Example: PCA model visualization

(The same X(9,4) data matrix of Sn, Zn, Fe and Ni concentrations for nine samples shown earlier.)

Example: PCA model visualization

Original data, scores plot (map of samples):
[Figure: scores plot of the original data; samples 1, 2, 3, 7 and 9 form group I, samples 4, 5, 6 and 8 form group II.]
• The distance among samples shows their similarity (5, 6 and 4, 8 are very similar).
• Detection of sample groups (clusters) (I and II).
• External information can help to identify the nature of the detected groups (e.g. origin of the samples, ...).
• Very distant samples are extreme samples.

Example: PCA model visualization. Original data, loadings plot (map of variables):
[Figure: loadings plot of the original data; Zn and Sn lie far from the origin, Fe and Ni close to it.]
• Variables that are relevant for the model have high loadings, far from the origin (Zn, Sn).
• Variables close to the origin do not give information about the data variance (Fe, Ni).
• A high loading in one PC shows a high weight in this component (Zn in PC1, Sn in PC2).
• Variables correlated with lower PCs have more importance.
• Correlations between variables are described by their angle (Zn and Sn are little correlated).
• Positive (direct) and negative (indirect) correlations between variables can be detected.

Example: PCA model visualization. Centered data:
[Figure: scores and loadings plots of the centered data.]
• Samples close to the origin are similar to the average sample; this cannot be distinguished when the samples are separated in groups.
• The importance of the variables changes because centering eliminates the weight of the scale size; Sn, related with PC1, is now more important than Zn.

Example: PCA model visualization. Autoscaled data:
[Figure: scores and loadings plots of the autoscaled data.]
• Autoscaling eliminates the scale size and range; the effect of all variables is enhanced.
• The correlation information among variables is seen more clearly.

Principal Components (PCA): Interpretation of the scores
• Objects (samples) close to the origin are 'ordinary' objects.
• Objects (samples) far from the origin are 'extreme' objects or 'outliers'.
• Objects close to each other are 'similar'; objects far from each other are 'different'.
• Objects can be grouped in 'clusters' which share common characteristics and differ from other 'clusters' ==> Cluster Analysis.
• The set of objects should cover the whole scores plot; otherwise there are 'clusters'.
• The principal components may be identified from the external identification of the clusters of objects.
• Simultaneous analysis of the 'loadings' and 'scores' plots helps to identify/interpret the principal components.

Principal Components (PCA): Scores and Loadings

• ‘scores’ show the relationships between samples

• ‘loadings’ show the relationships between variables

• ‘scores’ and ‘loadings’ should be interpreted in pairs

• 'scores' and 'loadings' are plotted one against the other.

Principal Components (PCA)

Outliers

• Outlier samples can have a great influence (leverage) in the PCA model

• They can be detected

• To detect them, look for:
  – isolated samples in the scores plots;
  – samples with large values of Q or T², or both.

Principal Components (PCA): Extreme objects and outlier objects
• Extreme objects: different from the rest of the objects.
• Outlier objects: extreme objects that cannot be fitted by the model.

[Figure: data points in the (x1, x2) plane with PC1; an extreme object and an outlier object are marked.]
Detection of anomalous objects: an object (sample) is extreme or an outlier when one or more variables have values very different from those of the other samples.

Principal Components (PCA): Why is outlier detection needed?
• Outliers distort the model.
• They hide the structure of the rest of the data.
When should they be eliminated, and how?
• If it is justified mathematically and chemically.
• It should be done gradually, starting with the most outstanding ones.
[Figure: the same (x1, x2) plot with the outlier object marked.]

Outlier Detection: Mathematical indicators
• Extreme objects: Mahalanobis distance (to the mean model) and leverage, both related to the size of the scores.

$$d_i^2 = (n_s - 1)\sum_{k=1}^{n} \frac{t_{i,k}^2}{\lambda_k}, \qquad h_i = \frac{1}{n_s} + \sum_{k=1}^{n} \frac{t_{i,k}^2}{\lambda_k} \quad\text{(for centered data)}$$
where $d_i$ is the distance of object (sample) i, $h_i$ the 'leverage' of sample i, $n_s$ the number of samples, $t_{i,k}$ the score of sample i on component k, $\lambda_k$ the eigenvalue of component k and n the number of PCs.

• T²: related to the size of the scores. $$T_i^2 = \sum_{k=1}^{n} \frac{t_{i,k}^2}{\lambda_k}$$
• Outlier objects: size of the residual. $$Q_i = e_i^2 = \sum_{j} e_{ij}^2, \qquad j = \text{variable index}$$

Detection of outliers: leverage plot.
[Figure: residual Q vs leverage h_i; objects poorly described by the model have a high residual, extreme objects have a high leverage, and outliers show both.]
To be used with a low number of PCs (higher PCs are used to describe the outliers).
Similar plots of Q as a function of T² are used in process control.

Principal Component Analysis (PCA)

PCA Statistics
Residuals measure the lack of fit (large residuals):
$$Q_i = e_i e_i^T = x_i (I - P_k P_k^T)\, x_i^T$$
Samples with large $Q_i$ values are unusual (they are out of the model).

Hotelling statistic T²:
$$T_i^2 = t_i\, \Lambda_k^{-1}\, t_i^T = x_i\, P_k\, \Lambda_k^{-1}\, P_k^T\, x_i^T$$
Samples with large $T_i^2$ values are unusual (they are inside the model, with a high leverage).

These statistics are used to develop control charts and limits in Statistical Process Control.

Principal Component Analysis (PCA): PCA Statistics
$$Q_i = e_i e_i^T = x_i (I - P_k P_k^T)\, x_i^T \quad\text{: variation outside the PCA model}$$
$$T_i^2 = t_i\, \Lambda_k^{-1}\, t_i^T = x_i\, P_k\, \Lambda_k^{-1}\, P_k^T\, x_i^T \quad\text{: variation inside the PCA model}$$

• Need of data pretreatment. • Determination of the number of principal components of the PCA model. • Detection and elimination of outliers • Repetition of previous steps • Assesment of the model quality and reliability Singular Value Decomposition (SVD)

• What is SVD?
  – PCA is most often computed using SVD.
  – SVD is an orthogonal matrix decomposition: X = U S V', where U and V' are orthonormal matrices and S is a diagonal matrix of singular values.
  – The elements of S are ordered according to the variance explained by each component.
  – Variance is concentrated in the first components; this allows reducing the number of variables that explain the variance structure and filtering the noise.
  – The singular values are the square roots of the eigenvalues.

x_ij = Σ_{k=1..K} u_ik s_k v_kj,  i.e. X = U S V' = T P', with scores T = U S and loadings P = V.
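For concreteness, a minimal MATLAB sketch of this decomposition (illustrative data; the variable names are not from the original slides):

  % Illustrative sketch: PCA computed via SVD
  X  = randn(9,4);                       % example data matrix (samples x variables)
  Xc = X - mean(X);                      % mean centering
  [U,S,V] = svd(Xc,'econ');              % Xc = U*S*V'
  T  = U*S;                              % scores (T = U*S = Xc*V)
  P  = V;                                % loadings (orthonormal columns)
  sv = diag(S);                          % singular values, sqrt of the eigenvalues
  explained = 100*sv.^2/sum(sv.^2);      % % of variance explained by each component
  Xhat = T(:,1:2)*P(:,1:2)';             % rank-2 approximation using the first 2 PCs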

Principal Component Analysis (PCA)
Principal Components in m dimensions

t1 = p11 x1 + p12 x2 + ... + p1m xm
t2 = p21 x1 + p22 x2 + ... + p2m xm
...
tm = pm1 x1 + pm2 x2 + ... + pmm xm

Linear combinations of the original variables (already mean centered).

In matrix form: (t1, t2, ..., tm)' = P (x1, x2, ..., xm)', i.e. principal components = loadings x original variables.

Principal Component Analysis (PCA)
PCA Model: X = T P' + E   (scores T, loadings/projections P')

X = t1 p1' + t2 p2' + ... + tn pn' + E,   n = number of components (<< number of variables in X)
Each term ti pi' is a rank-1 matrix.

Principal Component Analysis (PCA)

Model: X = T P' + E   (X = structure + noise)

It is an approximation to the experimental data matrix X.

• Loadings, Projections: P' gives the relationships between the original variables and the principal components (eigenvectors of the covariance matrix). The vectors in P' (loadings) are orthonormal (orthogonal and normalized).

• Scores, Targets: T gives the relationships between the samples (coordinates of the samples or objects in the space defined by the principal components). The vectors in T (scores) are orthogonal.

• Noise: E, experimental error, unexplained variance.

Principal Component Analysis (PCA)
PCA Model: X = T P' + E
Determination of the number of principal components, A

X(n,m): when n > m, m is the maximum number of PCs (Amax = m); when m > n, n is the maximum number of PCs (Amax = n).

In general, a much smaller number of PCs is used ('data compression', 'data reduction'): A << n or m.

A is chosen so that T P' retains most of the relevant structure (variance) of X, whereas noise remains in E (noise does not interest us; we want to filter it out).

To select the appropriate number of PCs, A, the residuals E (lack of fit) have to be studied => quantitation of the variance in E, e.g. as residual variance in %.

Principal Component Analysis (PCA)
Determination of the number of principal components
a) Visual inspection of the magnitude of the singular values; graphical representation (search for an inflection point).
b) Representation of the explained/residual variance with respect to the number of principal components.
c) For autoscaled data, keep components until their λ ≈ 1-2.
d) When the noise level is known, select the number of PCs until the residual variance is similar to the noise variance.
e) Consider PCs as long as the 'loadings' have structural features (not noise).
f) Use statistical tests and methods based on previous knowledge of the experimental noise size.
g) Approximate methods when the experimental noise is not known:
   • Malinowski error functions
   • cross-validation
   • ...

Principal Component Analysis (PCA)

Determination of the number of PCs from eigenvalue/singular value plots

[Scree plot: eigenvalue/singular value magnitude versus component number (0 to 12); the curve flattens after 4 components, suggesting a 4-component model]

Principal Component Analysis (PCA)

How many principal components?
[Plot: explained variance (model fitting) and PC/model reliability versus the number of PCs]
With more PCs in the model the data fitting improves, but the reliability of the model when it is applied to new data may become worse (overfitting).

Principal Component Analysis (PCA)
Determination of the number of principal components: Cross-validation Methods

- a data subset is eliminated from the original data matrix X => Xr
- a number of components k is estimated for Xr
- the eliminated data subset is predicted with k components and the predicted values are compared with the actual values

[Scheme: rows are eliminated from X; PCA with k PCs is computed on the remaining data, Xr = Tk Pk'; each eliminated sample x is projected, x_hat' = x' Pk Pk', and (x − x_hat) is evaluated]

Principal Component Analysis (PCA)
Determination of the number of principal components
Cross-validation methods:
- a data subset is eliminated from the original data matrix X => Xr
- a number of components k is estimated for Xr
- the eliminated data subset is predicted with k components and the predicted values are compared with the actual values

PRESS(k) = Σ_{i=1..r} Σ_{j=1..c} (x_ij − x̂_ij(k))²
PRESS is plotted against the number of considered components k; the minimum of PRESS, or the point where it stops decreasing, is looked for.
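A minimal MATLAB sketch of this cross-validation scheme (leave-one-sample-out version, with illustrative data; the centering is done once here only to keep the sketch short):

  % Illustrative sketch: leave-one-out PRESS(k) for choosing the number of PCs
  X = randn(15,6);                         % example data (samples x variables)
  X = X - mean(X);                         % mean centering
  [ns,m] = size(X);
  kmax = 5;
  PRESS = zeros(kmax,1);
  for k = 1:kmax
    for i = 1:ns
      Xr = X([1:i-1 i+1:ns],:);            % data without sample i
      [~,~,V] = svd(Xr,'econ');            % loadings of the reduced data
      Pk = V(:,1:k);
      xhat = X(i,:)*Pk*Pk';                % projection of the left-out sample
      PRESS(k) = PRESS(k) + sum((X(i,:) - xhat).^2);
    end
  end
  % plot(1:kmax, PRESS)                    % look for the minimum / flattening point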

Principal Component Analysis (PCA)
[Geometric illustration: data points in the (x1, x2) plane with the PC1 direction drawn through them; the loadings are the coordinates (x1 loading, x2 loading) of that direction]

Loadings are orthonormal: P'P = I and P' = P^-1

Loadings interpretation

• Determination of the more important variables in the formation of the principal components (those variables with large loadings are important, either neg. or pos.)

• Multivariate correlation between variables – positive correlation (common variation) – negative correlation (contrary variation)

• Identification and qualitative information (fingerprinting) on the variation sources Principal Component Analysis (PCA)

[Geometric illustration: data points in the (x1, x2) plane; the score t1 of a sample is its projection onto PC1]

Projection of X onto the PCs (loadings) gives the 'scores': T = X P
Principal Component Analysis (PCA)

Interpretation of the 'scores' (targets)
• Objects (samples) close to the origin are 'ordinary' objects
• Objects (samples) far from the origin are 'extreme' objects or 'outliers'
• Objects close to each other are 'similar' objects
• Objects far from each other are 'different' objects
• Objects can be grouped in 'clusters' which share common characteristics and differ from other 'clusters' ==> Cluster Analysis
• The set of objects should cover the whole scores plot; otherwise there are 'clusters'
• Principal components may be identified from the external identification of the clusters of objects
• Simultaneous analysis of the 'loadings' and 'scores' plots helps to identify/interpret the principal components
Principal Component Analysis (PCA)

Scores and Loadings

• ‘scores’ show the relationships between samples

• ‘loadings’ show the relationships between variables

• ‘scores’ and ‘loadings’ should be interpreted in pairs

• 'scores' and 'loadings' are plotted one against the other
Principal Component Analysis (PCA)

PCA Statistics
Residual statistic, to measure the lack of fit (large residuals):
Q_i = e_i' e_i = x_i' (I − P_k P_k') x_i
Samples with large Q_i values are unusual (they are out of the model!)

Hotelling statistic T²:
T_i² = t_i' Λ^-1 t_i = x_i' P_k Λ^-1 P_k' x_i   (Λ = diagonal matrix of eigenvalues)
Samples with large T_i² values are unusual (they are inside the model, but with high leverage!)

These statistics are used to develop control charts and limits in Statistical Process Control.

Principal Component Analysis (PCA)
PCA Statistics
Q_i = e_i'e_i = x_i'(I − P_kP_k')x_i : variation outside the PCA model
T_i² = t_i'Λ^-1 t_i = x_i'PΛ^-1 P'x_i : variation inside the PCA model

Principal Component Analysis (PCA)

Outliers

• Outlier samples can have a great influence (leverage) on the PCA model

• They can be detected

• To detect them, look for:
  - isolated samples in the scores plots
  - samples with large values of Q or T², or both

Principal Component Analysis (PCA)
Outliers in scores plots

[Scores plot (scores on PC1 versus scores on PC2): most samples form a cloud, while outlier samples appear isolated]
Principal Component Analysis (PCA)

Detection of 'outliers'
From scores plots ==> outlier samples
From loadings plots ==> outlier variables
Leverage: samples or variables that strongly affect the PCA model. It is evaluated from the expression:

h_i = 1/ns + Σ_{k=1..n} t_i,k² / λ_k
h_i: 'leverage' of sample i; t_i,k: 'score' of sample i on component k; λ_k: eigenvalue of component k; ns: number of samples; n: number of considered components.

Multivariate Data Analysis
Introduction to Multivariate Data Analysis; Principal Component Analysis (PCA); Multivariate Linear Regression (MLR, PCR and PLSR)

Laboratory exercises: Introduction to MATLAB Examples of PCA (cluster analysis of samples, identification and geographical distribution of contamination sources/patterns…) Examples of Multivariate Regression (prediction of concentration of chemicals from spectral analysis, investigation of correlation patterns and of the relative importance of variables,…

Romà Tauler (IDAEA, CSIC, Barcelona) [email protected]

Multivariate (Multiple) Linear Regression (MLR)
Multiple linear regression (MLR) is a method used to model the linear relationship between a dependent variable (predictand) and one or more independent variables (predictors).

MLR is based on least squares: the model is fit such that the sum of squares of the differences between observed and predicted values is minimized.

The performance of the model on data not used to fit the model is usually checked in some way by a process called validation.

The reconstruction is a "prediction" in the sense that the regression model is applied to generate estimates of the predictand variable different from those used to fit the model. The uncertainty in the reconstruction is summarized by confidence intervals, which can be computed in various alternative ways.
Multivariate Linear Regression

MLR Model
y_i = b_0 + b_1 x_i,1 + b_2 x_i,2 + ... + b_K x_i,K + e_i

x_i,j = value of the jth predictor in sample i
b_0 = regression constant
b_j = coefficient on the jth predictor
K = total number of predictors
y_i = predictand in sample i
e_i = error term

In vector-matrix form: y = X b + e
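A minimal MATLAB sketch of fitting this model by least squares (synthetic, illustrative data):

  % Illustrative sketch: fit y = X*b + e by least squares
  n = 20; K = 3;
  X = randn(n,K);                            % predictors (n samples x K predictors)
  y = X*[0.5; -1.0; 2.0] + 0.1*randn(n,1);   % predictand with noise (synthetic)
  Xa = [ones(n,1) X];                        % add a column of ones for the constant b0
  b  = Xa \ y;                               % least-squares estimate of [b0; b1; ...; bK]
  yhat = Xa*b;                               % fitted values
  e = y - yhat;                              % residuals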

Multivariate Linear Regression
MLR Predictions

ŷ_i = b̂_0 + b̂_1 x_i,1 + b̂_2 x_i,2 + ... + b̂_K x_i,K

x_i,k = value of the kth predictor in new sample i
b̂_0, b̂_1, ..., b̂_K = estimated regression constant and coefficients
ŷ_i = predicted value for new sample i

In matrix-vector form: ŷ = X b̂
Multivariate Linear Regression

MLR Prediction

ŷ_i = b̂_0 + b̂_1 x_i,1 + b̂_2 x_i,2 + ... + b̂_K x_i,K

Measurement i might be outside the range used for calibration or validation Multivariate Linear Regression

MLR Residuals

ê_i = y_i − ŷ_i
y_i = observed value of the predictand in sample i
ŷ_i = predicted value of the predictand in sample i
Multivariate Linear Regression

MLR Assumptions

1. Relationships are linear
2. Predictors are nonstochastic
3. Residuals have zero mean
4. Residuals have constant variance
5. Residuals are not autocorrelated
6. Residuals are normally distributed
Multivariate Linear Regression

Caveats/Problems to interpretation

1. The relationship may be nonlinear or outlier-driven
2. Correlation ≠ causation
3. Statistical significance ≠ practical significance
4. Lagged effects are not measured
5. Problems arise when the X variables (predictors) are strongly correlated (common in practice)
Multivariate Linear Regression

Alternatives to MLR
[Decision scheme:
- Correlation among predictors? => reduce the number of variables (stepwise regression) or use factor-analysis based methods
- Nonlinearity? => data transformation (then try MLR again), kernel methods (e.g., kernel regression), neural networks, quadratic response surfaces
- Categorical predictand? => discriminant analysis, classification trees]
Multivariate Linear Regression

MLR Statistics

• R² -- explanatory power
• Adjusted R²: R² adjusted for the loss of degrees of freedom due to the number of predictors in the model
• F and its p-value -- significance of the equation
• se of the estimate; equivalent to the "root mean square error" (RMSEc); the subscript "c" denotes "calibration"
• … for parameters X
Multivariate Linear Regression

MLR ANOVA Table (testing linearity)

Source              df       Sum of squares   Mean squares
Total               n-1      SST
Regression (model)  K        SSR              MSR = SSR/K
Residual            n-K-1    SSE              MSE = SSE/(n-K-1)
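A minimal MATLAB sketch of these quantities (continuing the illustrative X and y from the earlier sketch; this is only one way to compute them):

  % Illustrative sketch: MLR statistics and the ANOVA sums of squares
  n = size(X,1);  K = size(X,2);
  Xa   = [ones(n,1) X];
  b    = Xa \ y;
  yhat = Xa*b;
  SST  = sum((y - mean(y)).^2);            % total sum of squares (df = n-1)
  SSE  = sum((y - yhat).^2);               % residual sum of squares (df = n-K-1)
  SSR  = SST - SSE;                        % regression (model) sum of squares (df = K)
  R2   = 1 - SSE/SST;                      % explanatory power
  R2a  = 1 - (SSE/(n-K-1))/(SST/(n-1));    % adjusted R2
  F    = (SSR/K)/(SSE/(n-K-1));            % F = MSR/MSE
  se   = sqrt(SSE/(n-K-1));                % standard error of the estimate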

Multivariate Linear Regression
Validating the MLR regression model

• Regression R-squared, even if adjusted for the loss of degrees of freedom due to the number of predictors in the model, can give a misleading, overly optimistic view of the accuracy of prediction when the model is applied outside the calibration period.

•Several approaches to validation are available. Among these are cross-validation and split-sample validation.

•In cross-validation, a series of regression models is fit, each time deleting a different observation from the calibration set and using the model to predict the predictand for the deleted observation. The merged series of predictions for deleted observations is then checked for accuracy against the observed data.
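A minimal MATLAB sketch of the cross-validation just described (leave-one-out, again using the illustrative X and y assumed in the earlier sketches):

  % Illustrative sketch: leave-one-out cross-validation of an MLR model
  n   = size(X,1);
  ycv = zeros(n,1);
  for i = 1:n
    idx = [1:i-1 i+1:n];                 % all samples except sample i
    Xa  = [ones(n-1,1) X(idx,:)];
    b   = Xa \ y(idx);                   % model fitted without sample i
    ycv(i) = [1 X(i,:)]*b;               % prediction for the deleted sample
  end
  RMSECV = sqrt(mean((y - ycv).^2));     % accuracy of the merged predictions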

•In split-sample calibration, the model is fit to some portion of the data (say, the second half), and accuracy is measured on the predictions for the other half of the data. The calibration and validation periods are then exchanged and the process repeated. Multivariate Linear Regression

Model Calibration vs Validation

Calibration                                        Validation
1. Fitting the model to the data                   1. Testing the model on data not used to fit the model
2. "calibration", "construction", "estimation"     2. "validation", "verification", "independent" data
   data
3. Accuracy statistics:                            3. Accuracy statistics:
   R², Ra², SSEc, MSEc, RMSEc                      RE, SSEv, MSEv, RMSEv

Definitions: validation, cross-validation, split-sample validation, mean square error (MSE), root-mean-square error (RMSE), standard error of prediction, PRESS statistic, "hat" matrix, extrapolation vs interpolation. Advantages of cross-validation over alternative validation methods.

Multivariate Linear Regression
Cross-validation stopping rule
[Plot of cross-validation error versus number of terms, annotated 'Stop here' at the point where the error stops decreasing]

Multivariate Linear Regression

Error bars for MLR predictions: Hierarchy

1. Standard error of the estimate (calibration statistic)
2. Standard error of prediction (calibration statistic)
3. Root-mean-square error of validation (validation statistic)
Multivariate Linear Regression

Standard Error of MLR Prediction

s_ŷ = s_e [ 1 + 1/n + (x* − x̄)² / Σ_{i=1..n}(x_i − x̄)² ]^(1/2)

where s_e = sqrt(MSE) = RMSEc is the standard error of the estimate, the term 1/n reflects the uncertainty in the estimate of the predictand mean (a function of the calibration sample size n), and the last term reflects the departure of the predictor value x* of the predicted observation from the predictor mean of the calibration period.

Least Squares and linear regression
Univariate linear regression: y_i = b_0 + b_1 x_i + e_i,  y = f(x),  s_y >> s_x ≈ 0

b_1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²
s²_b1 = s²_y / Σ(x_i − x̄)² = s²_y / S_XX
s_b0 = s_y [ Σ x_i² / (n Σ(x_i − x̄)²) ]^(1/2)
(OLS)

Multivariate Linear regression: n experimental measures, m variables

X = {x_ij}, n x m, independent variables
y = {y_1, y_2, ..., y_n}, dependent variables
b = {b_0, b_1, b_2, ..., b_m}, parameters of the linear model
Assumption: experimental errors are only important for y_i

y = X b,   b = (X'X)^-1 X'y
s²(b) = (A'A)^-1 s²(y) = (X'X)^-1 s²(y)
where A = {A_ij}, A_ij = ∂r_i/∂b_j = {−x_ij}; r_i = y_i − y_ci are the residuals, and s²(y) is estimated from:
s²(y) = Σ r_i² / (n − m)

Weighted Least Squares (WLS)
Multivariate Linear regression, n experimental measures

X = {x_ij}, n x m, independent variables
y = {y_1, y_2, ..., y_n}, dependent variables
b = {b_0, b_1, b_2, ..., b_m}, parameters of the linear model
W = {w_ij} weights for each x_ij value, considering the errors in X (error standard deviations s_ij of x_ij; w_ij = 1/s_ij)

unweighted:  y = X b,  b = (X'X)^-1 X'y,   s²(b) = (A'A)^-1 s²(y)
weighted:    y = X b,  b = (X'WX)^-1 X'Wy,  s²(b) = (A'WA)^-1 s²(y)
where A = {A_ij}, A_ij = ∂r_i/∂b_j; r_i = y_i − y_ci are the residuals, and s²(y) is estimated from:
s²(y) = Σ w_i r_i² / (n − m)
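A minimal MATLAB sketch of the weighted fit, assuming the common case of one weight per sample (so that W is a diagonal n x n matrix); the data and weights are illustrative:

  % Illustrative sketch: ordinary vs weighted least squares
  n = 30; m = 3;
  X = randn(n,m);
  y = X*[1; -2; 0.5] + 0.05*randn(n,1);
  s = 0.05*ones(n,1);                    % assumed error standard deviations of y
  W = diag(1./s);                        % weights w_i = 1/s_i, as defined above
  b_ols = (X'*X) \ (X'*y);               % unweighted: b = (X'X)^-1 X'y
  b_wls = (X'*W*X) \ (X'*W*y);           % weighted:   b = (X'WX)^-1 X'Wy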

Generalized Least Squares (GLS)
Multivariate Linear regression, n experimental measures

X = {x_ij}, n x m, independent variables
y = {y_1, y_2, ..., y_n}, dependent variables
b = {b_0, b_1, b_2, ..., b_m}, parameters of the linear model
M = {m_ij} weights for each x_ij value, calculated from the errors in X and y (it is more complex!)

y = X b,   b = (X'MX)^-1 X'My
s²(b) = (A'MA)^-1 s²(y)
where A = {A_ij}, A_ij = ∂r_i/∂b_j; r_i = y_i − y_ci are the residuals, and s²(y) is estimated from:
s²(y) = Σ r_i² / (n − m)
Multivariate Linear Regression

Interpolation vs Extrapolation

Interpolation: prediction based on predictor data "similar" to that in the calibration range.
Extrapolation: prediction based on predictor data "unlike" that in the calibration range.

Multivariate Linear Regression

Classifying MLR predicted values

"Hat" matrix, computed from the calibration-only predictor data:  H = X (X'X)^-1 X'

Classification statistic:  h* = x*' (X'X)^-1 x*,  where x* is the vector of predictor data for some observation outside the calibration range

Rule identifying "extrapolation":  h* > h_max

h_max = maximum value along the diagonal of the hat matrix
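A minimal MATLAB sketch of this classification rule (illustrative calibration data and new observation):

  % Illustrative sketch: interpolation vs extrapolation via the hat matrix
  n = 30; K = 2;
  X  = randn(n,K);                       % calibration predictor data
  H  = X*((X'*X)\X');                    % hat matrix H = X (X'X)^-1 X'
  hmax = max(diag(H));                   % maximum leverage among calibration samples
  xs = [3 3];                            % predictor vector of a new observation
  hstar = xs*((X'*X)\xs');               % classification statistic h*
  isExtrapolation = hstar > hmax;        % rule identifying "extrapolation"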

Multivariate Linear Regression
Conventions & Notation in Calibration
Data are arranged in two blocks/tables/matrices X and Y, where:
X = matrix of predictor variables
Y = matrix of predicted (predictand) variables
ns = number of samples/observations
nx = number of variables in X
ny = number of variables in Y
n = number of PCs / latent variables / components

[Scheme: matrix X (ns samples x nx predictor variables) and matrix Y, y (ns samples x ny predicted variables); find f such that Y = f(X)]

Multivariate Linear Regression
Causal vs Predictive Model
Causal models: X = f(Y)   (1)
Predictive models (inverse): Y = f(X) or y = f(X)   (2)

Independent (predictors) vs dependent (predictands)

Example:
X (R) is the matrix of multivariate (instrumental) responses for the different samples
Y is the matrix of concentrations of one (or more) chemical components in the different samples
y (c) is the concentration of one component in the samples
f is the calibration function: causal in (1) and predictive in (2); in linear models, f is a linear function

Multivariate Linear Regression
Multicomponent Analysis: Bilinear Model
R = C S' + E
(ns,nw) (ns,nc) (nc,nw) (ns,nw)

R: matrix of sensor responses (ns samples, nw wavelengths)
C: matrix of concentrations (ns samples, nc components)
S': matrix of sensitivities, i.e. pure responses (nc components, nw wavelengths)
E: matrix of experimental errors (ns samples, nw wavelengths)

Written element-wise, each response is r_ij = Σ_{k=1..nc} c_ik s_kj + e_ij.

• Advantages
  • Total selectivity is not needed
  • Allows multicomponent analysis
  • Outlier detection is possible

• Most used methods
  – MLR, Multilinear Regression
    • Classical Least Squares (CLS)
    • Inverse Least Squares (ILS)
  – Factor-based Linear Regression (biased)
    • Principal Components Regression (PCR)
    • Partial Least Squares Regression (PLSR)
  – Non-linear Regression

CLS Model: R = C S' + E
(ns,nw) (ns,nc) (nc,nw) (ns,nw)

The responses are modelled as a function of the concentrations. It is the same causal model as for the generalized Beer’s law (generalized multilinear model)

Calibration step:
a) direct: the pure component spectra (sensitivities) of the components are previously known; S' is known
b) indirect: the pure component spectra of the components are not previously known; S' is unknown and has to be estimated in the calibration step: S' = C+ R, where C+ = (C'C)^-1 C' (pseudoinverse)

Prediction step: for a set of 'nunk' samples with unknown analyte concentrations:
C_unk = R_unk (S')+
(nunk,nc) (nunk,nw) (nw,nc)

C_unk: matrix of concentrations of the nunk unknown samples (nunk samples, nc components)
R_unk: matrix of their instrumental responses (nunk samples, nw wavelengths)
(S')+: pseudoinverse of the sensitivity matrix (pure spectra) (nw wavelengths, nc components), (S')+ = S(S'S)^-1

Prediction step (one sample): the concentrations of several analytes in one sample:
c'_unk = r'_unk (S')+
(1,nc) (1,nw) (nw,nc)
or, equivalently:
c_unk = S+ r
(nc,1) (nc,nw) (nw,1)
where S+ = (S'S)^-1 S'

Multivariate Linear Regression
Classical (Causal) Least Squares (CLS)
Advantages (compared to univariate least squares):
1. Increased precision of the estimations (signal averaging)
2. Allows the estimation of the pure responses (pure spectra) => qualitative information, identification
3. Allows multicomponent quantitative analysis
Disadvantages:
- Requires knowing and introducing the information of all the components contributing to the measured analytical response
- It does not allow calibration in the presence of unknown interferents (so it is not used in the analysis of natural samples)
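A minimal MATLAB sketch of the indirect CLS calibration and prediction steps described above (all data are simulated for illustration only):

  % Illustrative sketch: CLS calibration (S' = C+ R) and prediction (Cunk = Runk (S')+)
  nc = 2; nw = 50; ns = 10;
  Spure = abs(randn(nc,nw));                    % simulated pure responses (only to build data)
  C     = rand(ns,nc);                          % known calibration concentrations
  R     = C*Spure + 0.01*randn(ns,nw);          % calibration responses
  St    = C \ R;                                % calibration: least-squares estimate of S' (nc x nw)
  Runk  = rand(3,nc)*Spure + 0.01*randn(3,nw);  % responses of 3 unknown samples
  Cunk  = Runk*pinv(St);                        % prediction of their concentrations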

2. Inverse Calibration. Inverse Least Squares (ILS)

Model: c = R b + e   (y = X b)
(ns,1) (ns,nw) (nw,1) (ns,1)

The concentrations are modeled as a function of the instrumental responses. It is not a causal model. It is a predictive model. Only needs knowing the analyte concentration in the calibration samples.

Calibration: b = R+ c   (b = X+ y)
(nw,1) (nw,ns) (ns,1)
b is the calibration vector, evaluated from the responses R of the calibration samples, for which the analyte concentration c is known.
R+ = (R'R)^-1 R', the pseudoinverse of R;  X+ = (X'X)^-1 X', the pseudoinverse of X

Prediction: c_unk = r'_unk b   (y_unk = x'_unk b)
(1,1) (1,nw) (nw,1)
c_unk is the concentration of the analyte in a new sample
r'_unk is the instrumental response given by this sample
b is the calibration vector previously evaluated

R is not square => calculation of the generalized inverse or pseudoinverse R+ = (R'R)^-1 R'
Problem: in the evaluation of the calibration vector b = (R'R)^-1 R' c, the (nw,nw) matrix (R'R)^-1 has to be evaluated. Its nw rows and nw columns should be linearly independent, and ns > nw (number of calibration samples > number of wavelengths).
Multivariate Linear Regression

Inverse Least squares, ILS (inverse model)

Advantages
- Allows the determination of one analyte in the presence of unknown interferences (this is not possible with CLS!)
- Only needs the calibration information for one analyte
Disadvantages
- It does not use all the variables; it only uses a reduced number of selected variables (sensors or wavelengths)
- There is no increase of measurement precision (there is no signal averaging)
Multivariate Linear Regression

Methods based on Factor Analysis
• Factor decomposition of the matrix X
• Resolve the collinearity problem in X
• Background noise filtering
• Improve precision (signal averaging)
• 'Compression' of the information into a reduced number of new variables or factors
Multivariate Linear Regression

Pretreatment Methods in Multivariate Regression/Calibration
- as in PCA
- linearization, if possible
- mean centering
- variance scaling, when the variables are in different units or differ considerably in magnitude
- outlier elimination is critical
Multivariate Linear Regression

Principal Component Regression PCR

1. Decomposition of X in factors by PCA:
   X = T P' + E
   (ns,nw) (ns,nc) (nc,nw) (ns,nw)
2. Multilinear regression (MLR) on the scores T (instead of on the original variables in X):
   y = T b
   (ns,1) (ns,nc) (nc,1)
Multivariate Linear Regression

Principal Component Regression PCR

3. Evaluation of the regression vector b

b = T+ y = (T'T)^-1 T' y;  the (PCA) scores are orthogonal

(T'T)^-1 = diag(1/λ_i),  i = 1, ..., nc
Multivariate Linear Regression

Principal Component Regression PCR
4. Prediction of a new sample with response x'_unk
score of the new sample:  t'_unk = x'_unk P,   (1,nc) = (1,nw)(nw,nc)
prediction of its concentration:  y_unk = t'_unk b,   (1,1) = (1,nc)(nc,1)
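A minimal MATLAB sketch of these PCR steps (illustrative data; new samples are centered with the calibration means):

  % Illustrative sketch: Principal Component Regression
  ns = 15; nw = 40;
  X  = randn(ns,nw);  y = randn(ns,1);       % calibration responses and property
  mx = mean(X);  my = mean(y);
  Xc = X - mx;   yc = y - my;                % mean centering
  [U,S,V] = svd(Xc,'econ');
  nc = 3;                                    % number of PCs kept
  T  = U(:,1:nc)*S(1:nc,1:nc);               % scores
  P  = V(:,1:nc);                            % loadings
  b  = T \ yc;                               % MLR on the scores: b = (T'T)^-1 T' yc
  bcal = P*b;                                % regression vector in the original variables
  xunk = randn(1,nw);                        % a new sample
  yunk = (xunk - mx)*bcal + my;              % predicted property for the new sample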

Multivariate Linear Regression
Principal Component Regression PCR
Direct calculation:  y_cal = X b_cal,   b_cal = X+ y_cal
X+ <=> X+_PCA = (T P')+ = P (T'T)^-1 T'
P' orthonormal => P P' = I => (P')^-1 = P
T orthogonal => T'T = diag(λ_i)
y_unk = x'_unk b_cal,   (1,1) = (1,nw)(nw,1)
[Scheme: X (ns x nw) is decomposed by PCA into scores T (ns x nc) and loadings P' (nc x nw), with nc << nw]

Multivariate Linear Regression

Principal Components Regression, PCR
PCR is one of the ways to solve the inversion of ill-conditioned matrices in linear regression.
The property of interest is modelled (regressed) on the PCA scores:

y = T_k b_PC + e = X P_k b_PC + e
b_PC = (T'_k T_k)^-1 T'_k y   (regression vector from the PCA 'scores')

b = P_k b_PC;  y = X b   (regression vector in the original variables)
X+ = P_k (T'_k T_k)^-1 T'_k   (the inverse matrix is calculated from orthogonal matrices)
Multivariate Linear Regression

Possible problems with PCR
Some of the PCs may not be relevant for the prediction of y; they may only be relevant for the description of X.
The PCs are estimated without considering the property to be predicted, y.
Solution: find the components using the information in y, not only the information in X => PLS
The number of components has to be estimated using validation methods. Diagnostics are used to find outliers (Q residuals, T² values, leverage plots).
Multivariate Linear Regression

Partial Least Squares Regression PLSR

The responses matrix X is decomposed and ‘truncated’ in a similar way to PCR, but in the decomposition the information of y in the calibration samples is considered

X= T PT + E Y= U QT + F

u = y (in the case of a single component)

X <-> T <-> U <-> Y
(variance)  (covariance)  (variance), through the weights W

[Diagrams comparing the models: CLS, X = f(y); ILS/MLR, y = f(X); PCR = PCA (x variables -> scores t) + MLR; PLSR, scores t computed taking y into account]
Multivariate Linear Regression

Partial Least Squares, PLS
• PLS is a mixture of PCR and MLR
• PCR captures the maximum variance in X
• MLR obtains the maximum correlation between X and y
• PLS tries both things: it maximizes the covariance
• PLS requires an additional matrix of weights W to keep the orthogonality of the scores and to make the matrix inversion easier
• The factors are evaluated sequentially by projection of y on X
• The expression to evaluate the matrix inverse is more complex than for PCR:
  X+ = W_k (P'_k W_k)^-1 (T'_k T_k)^-1 T'_k
(a small PLS1 sketch is given after the comparison below)

Comparison of predictive (inverse) linear calibration models
The different methods differ in the way they solve the same calibration equation (calculation of the inverse of the matrix X): b = X+ y

MLR (ILS):  X+ = (X'X)^-1 X'
Maximum correlation between X and y is achieved, but the direct inversion of X is problematic

PCR:  X+ = P_k (T'_k T_k)^-1 T'_k
Maximum variance in X is achieved

PLSR:  X+ = W_k (P'_k W_k)^-1 (T'_k T_k)^-1 T'_k
Maximum covariance between X and y is achieved
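The PLS1 sketch announced above: a minimal MATLAB implementation of a sequential (NIPALS-type) algorithm for a single y, with illustrative data. Other PLS variants exist, so this is only one possible formulation, not the algorithm of the original slides:

  % Illustrative sketch: PLS1 (NIPALS) for a single y variable
  ns = 20; nw = 30; A = 3;
  X  = randn(ns,nw);  y = randn(ns,1);
  Xc = X - mean(X);   yc = y - mean(y);          % mean centering
  W = zeros(nw,A); P = zeros(nw,A); T = zeros(ns,A); q = zeros(A,1);
  ssy = zeros(A+1,1); ssy(1) = yc'*yc;           % y sum of squares after 0..A factors
  Xa = Xc; ya = yc;
  for a = 1:A
    w = Xa'*ya;  w = w/norm(w);                  % weights: direction of maximum covariance
    t = Xa*w;                                    % scores
    p = Xa'*t/(t'*t);                            % X loadings
    q(a) = ya'*t/(t'*t);                         % y loading
    Xa = Xa - t*p';  ya = ya - t*q(a);           % deflation of X and y
    W(:,a) = w;  P(:,a) = p;  T(:,a) = t;
    ssy(a+1) = ya'*ya;
  end
  b = W*((P'*W)\q);                              % regression vector: b = W (P'W)^-1 q
  yhat = Xc*b + mean(y);                         % fitted calibration values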

The inverses in PCR and PLS are calculated from orthogonal matrices: orthogonal => A'A = D (diagonal); orthonormal => A'A = I, so A^-1 = A'.

Multivariate Linear Regression
Validation of the calibration models (regression)
Calibration step: ycal, Xcal

=> Calculation of the model; obtain the number of components
Validation step: yval, Xval
=> Test the model; test the number of components
Multivariate Linear Regression

Validation of the model with the same calibration samples:  Xcal + Model => ŷcal

Model error: ŷcal − ycal
Residual variance (calibration):  Σ_{i=1..ns} (y_i,cal − ŷ_i,cal)² / ns
RMSEC (Root Mean Square Error of Calibration):  sqrt( Σ_{i=1..ns} (y_i,cal − ŷ_i,cal)² / ns )
Multivariate Linear Regression

Validation of the Model; with new validation samples

Xval + Model => ŷval

Model error: ŷval − yval
Residual variance (prediction):  Σ_{i=1..ns} (y_i,val − ŷ_i,val)² / ns
RMSEP (Root Mean Square Error of Prediction):  sqrt( Σ_{i=1..ns} (y_i,val − ŷ_i,val)² / ns )
Multivariate Linear Regression

Validation of the calibration model

The ability of the calibration model to predict has to be evaluated using a new sample set not used in the development of the calibration model:

Samples:
Training set: Calibration of the Model, Xcal <=> ycal, fcal; calculation of RMSEC

Validation of the Model
Test set: Xval => ŷval, fval; calculation of RMSEP
Multivariate Linear Regression

Validation Methods
1) With a calibration set of samples and a different validation set of samples. Both data sets should be representative. It is the best method.
2) Cross-Validation
2A) Two groups of data, X and y:
    XA, yA for calibration and XB, yB for validation; then XB, yB for calibration and XA, yA for validation.
    The prediction error is evaluated for A and B, and the average is taken.

2B) Full cross-validation or leave-one-out validation

As many models as samples are built: successively, one sample is removed, a new model is built and the left-out sample is predicted. This is repeated for every sample and the prediction error is calculated.

2C) Segmented validation (leaving out small groups of samples, e.g. 10% of the samples)

3) Leverage correction (leverage, h_i)

The residuals f_i = y_val − ŷ_val are weighted according to h_i:

f_i^corr = f_i / (1 − h_i)
Multivariate Linear Regression

h_i = 1/ns + Σ_{k=1..n} t_i,k² / λ_k,  where:

• h_i values are between 0 and 1
• samples with low 'leverage':  h_i -> 0,  f_i^corr -> f_i
• samples with high 'leverage': h_i -> 1,  f_i^corr >> f_i

This procedure uses only the calibration-sample data; it gives a first approximation of the future prediction ability.
Multivariate Linear Regression

Cross-Validation

Data are divided into q subsets. Build the model with q−1 subsets. Calculate PRESS (Predictive Residual Sum of Squares):

PRESS = Σ_i Σ_j (y_ij − ŷ_ij)²

Repeat until all the groups have been left out once. Find the minimum (or the inflection point) in the plot of PRESS vs the number of components.
Multivariate Linear Regression
Determination of the number of components using PRESS plots

[PRESS plot: cumulative PRESS versus number of components (1 to 10); the minimum at 5 components indicates a 5-component model]

Multivariate Linear Regression
Model evaluation and validation

Prediction Error Sum of Squares, PRESS:  PRESS = Σ_i Σ_j (y_ij − ŷ_ij)²
Root Mean Square Error in Prediction:  RMSEP = sqrt( Σ_i Σ_j (y_ij − ŷ_ij)² / n.samples )
Standard Error in Prediction:  SEP = sqrt( Σ_i Σ_j (y_ij − ŷ_ij − bias)² / (n.samples − 1) )
bias = Σ_i Σ_j (ŷ_ij − y_ij) / n.samples
Relative Error:  RE = 100 · sqrt( Σ_i Σ_j (ŷ_ij − y_ij)² / Σ_i Σ_j y_ij² )
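A minimal MATLAB sketch of these statistics, assuming yval holds the measured validation values and yhat their model predictions (both names and the placeholder data are illustrative):

  % Illustrative sketch: prediction-error statistics for a validation set
  yval = randn(10,1);  yhat = yval + 0.1*randn(10,1);   % placeholder data
  nval  = numel(yval);
  PRESS = sum((yhat - yval).^2);                        % prediction error sum of squares
  RMSEP = sqrt(PRESS/nval);                             % root mean square error of prediction
  bias  = sum(yhat - yval)/nval;                        % systematic error
  SEP   = sqrt(sum((yhat - yval - bias).^2)/(nval-1));  % standard error of prediction
  RE    = 100*sqrt(PRESS/sum(yval.^2));                 % relative error (%)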

Multivariate Linear Regression
Model evaluation and validation
• Comparison of experimental values versus model-predicted values
  – for the calibration samples
  – for the external validation samples
• Plot and calculate the regression line of predicted versus actual values: predicted values = slope × experimental values + offset; the slope should be one, the offset should be zero, and r² = 1

Multivariate Linear Regression
PLS model interpretation
Partial Least Squares Regression (PLSR) models

X = T P' + E   (P': X loadings)
Y = U Q' + F   (Q': Y loadings)

u = y (in the case of a single component)

X <-> T <-> U <-> Y   (T: X scores, U: Y scores)
(variance)  (covariance)  (variance), through the weights W
Multivariate Linear Regression
PLS model interpretation

Interpretation of PLS models. • Physical interpretation of PLS models can be obtained from plots of scores and loadings like in PCA.

• More interestingly, PLS models provide the weights (W_k), which describe the covariance structure between the X and y blocks. Plots of the weights are extremely useful for the interpretation of PLS models:
  X+ = W_k (P'_k W_k)^-1 (T'_k T_k)^-1 T'_k,   b = X+ y,   y = X b
• Other measures exist, like the variable influence (importance) on projection (VIP) parameter. Plots of the VIPs are also very useful for PLS model interpretation.

Multivariate Linear Regression
PLS model interpretation
Variable influence (importance) on projection, VIP, parameter

VIP_Ak = [ K · Σ_{a=1..A} w_ak² (SSY_{a−1} − SSY_a) / (SSY_0 − SSY_A) ]^(1/2)

A: total number of factors considered in the model
a: considered factor
k: considered variable
w_ak: weight of variable k in factor a
SSY_a: sum of squares of Y remaining (unexplained) after a factors

The variables with larger VIPs (larger than one) are more influential and important for explaining Y.
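A minimal MATLAB sketch of the VIP calculation, reusing the W and ssy quantities stored in the PLS1 sketch given earlier (so these variable names are assumptions of that sketch, not of the original slides):

  % Illustrative sketch: VIP values from the stored PLS1 quantities
  K  = size(W,1);                                 % number of X variables
  A  = size(W,2);                                 % number of factors in the model
  dSSY = ssy(1:A) - ssy(2:A+1);                   % SSY explained by each factor
  VIP  = sqrt( K * (W.^2*dSSY) / (ssy(1) - ssy(A+1)) );   % one VIP value per variable
  important = find(VIP > 1);                      % variables most important for explaining y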