Fundamentals of Chemometrics and Modeling

Dr. Tom Dearing CPAC, University of Washington Outline • Fundamentals of Chemometrics – Introduction to Chemometrics – Measurements – The Analysis Procedure • Basic Modeling – Principal Component Analysis – Scores and Loadings • Advanced Modeling – Partial – Latent Variables – Scores and Loadings – Calibration and Validation – Prediction • Case Study Section 1

Through the looking glass….. Chemometrics

• Chemometrics is:

The science of extracting information from measurements made on chemical systems with the use of mathematical and statistical procedures.

• Keywords and phrases:

data analysis, data processing, univariate, multivariate, , modeling, scores, loadings, calibration and validations, predictions, real time decision making. Near IR Tablet Data Measurements 7

6.5

• Measurements come in many 6 different forms. 5.5

– Spectroscopic 5 Signal Signal Intensity • Near IR, Fluorescence, Raman. 4.5 – Chromatographic 4 • Gas , HPLC. 3.5

3 600 800 1000 1200 1400 1600 1800 2000 – Physical Wavenumber cm-1 • Temperature, Pressure, Flow rate, Melting Points, Viscosity, Concentrations.

• All measurements yield data. • NIR data set containing 255 spectra measured at 650 (counts) Intensity different wavenumbers has

165750 data points!! Wavelength (nm) Two Types Of Data

• Univariate • Multivariate – One variable to measure – Multiple variables – One variable to predict – Multiple predictions

– Typically select one – Typically use entire wavelength and monitor spectra. change of absorbance – Allows investigation into over time. the relationship – Wavelength must not between variables. have contributions or – Allows revealing of overlapping from other latent variation within a peaks. set of spectra. Multivariate Analysis

• Analysis performed on multiple sets of measurements, wavelengths, samples and data sets.

and dependence between variables in crucial to multivariate analysis. The Chemometrics Process • All chemometrics begin with taking a measurement and collecting 5. Understanding data. • Mathematical and statistical methods are employed to extract 4. Knowledge relevant information from the data. • The information is related to the chemical process to extract 3. Information knowledge about a system. • Finally, the knowledge provided 2. Data allows comprehension and understanding of a system. • Understanding facilitates decision 1. Measurement making. Converting Data to Information

• Advances in measurement science rate of is extremely fast. • Large amounts of data produced. • Data rich, information poor. • Chemometrics used to remove redundant data, reduce variation not relating to the analytical signal and build models. Data Analysis Flow Chart

OUTLIER INPUT PREPROCESSING REMOVAL

DATA ANALYSIS

OUTPUT Input

• Most overlooked stage of data analysis. • Most critical stage of all.

• Data must be converted or transferred into the analysis software. • Proprietary collection software make this task difficult. • However, some analysis software have excellent data importing functionality Outliers – Problems and Removal

• Removing outliers is a delicate procedure. • Grubbs test used to detect outliers. • Frequently requires knowledge about the process being examined. • False outliers, samples at extremes of the system that appear infrequently within the data. – These are NOT REMOVED • True outliers, samples or variable that is statistically different from the other samples. – These ARE REMOVED Preprocessing

Near IR Tablet Data 7

• Preprocessing 6.5 – Main goal of the preprocessing 6 5.5

stage is to remove variation 5 Signal Signal Intensity within the data that does not 4.5 pertain to the analytical 4 information. 3.5 3 600 800 1000 1200 1400 1600 1800 2000 Wavenumber cm-1

MEAN • Typical preprocessing methods CENTRING

– Baseline Correction Centred NIR Spectra – Mean Centering 0.8 0.6 – Normalization 0.4 0.2 – Orthogonal Signal Correction 0

-0.2 Mean Centred Signal Intensity Centred Mean – Multiplicative Scatter Correction -0.4

-0.6

– Savitsky-Golay Derivatisation -0.8

600 800 1000 1200 1400 1600 1800 Wavenumber cm-1 Data Analysis

Scores Plot • Many different methods for 3 performing multivariate data 2 1

analysis. 0

-1 • Principal Component Analysis Scores PC on 2 (12.88%) – Section 2 -2 -3 -15 -10 -5 0 5 10 • Scores on PC 1 (81.38%) Partial Least Squares 60

– Section 3 50 • MCR 40 30

• Neural Networks 20

10

0 4.65 4.7 4.75 4.8 4.85 4.9 4.95 5 5.05 5.1 5.15 Output

• Qualitative • Quantitative – Classification models. – Prediction models – Does a sample belong to – What is the a group or not?? concentration of the – Calibration and sample?? Validations – Calibration and – Classifications Validations – Classification error – Predictions – Number of samples – Calibration and classified correctly Prediction Errors – RMSEC and RMSEP Error • Many different methods of calculating errors. • Method used is critical as model quality determined by the error. • Procedure used can heavily influence model errors. (Discussed later in PCA section).

• The choice of error metric depends on many different factors • Top Three – What are you showing? – What is the of data? – How many samples do you have? Summary

• Chemometrics is a method of extracting relevant information from complex chemical data. • Multivariate data allows analysis robust investigation of overlapping signals. • Multivariate analysis allows investigation of the relationship between variables. • The chemometrics process yields understanding and comprehension of the process under investigation. Summary

• Data analysis is a multistep procedure involving many algorithms and many different paths to go down. • The end results of data analysis are commonly a model that could provide qualitative or quantitative information. • MatLab and PLS_Toolbox are software packages used to perform chemometrics analysis. Section 2

Principal Component Analysis P.C.A. PCA

• Method of reducing a set of data into three new sets of variables – Principal Components (PC’s) – Scores – Loadings • Using these three new variables latent variation can be developed and examined. • Incredibly important for investigating the relationships between samples and variables PCA • NIR spectra run through a PCA routine without any form of preprocessing. • Scores produced show apparent variation in concentration. • Loadings illustrate the mean spectra, suggesting that preprocessing should be used.

Near IR Tablet Data Samples/Scores Plot Variables/Loadings Plot 7 8 0.055

6.5 6 0.05

6 4

0.045 5.5 PCA 2

5 0 0.04 Signal Signal Intensity 4.5 -2

0.035

Scores PC on 2 (0.05%) Loadings on PC on 1 (99.93%) Loadings 4 -4

0.03 3.5 -6

3 -8 0.025 600 800 1000 1200 1400 1600 1800 2000 118 120 122 124 126 128 130 132 134 136 100 200 300 400 500 600 -1 Wavenumber cm Scores on PC 1 (99.93%) Variable SPECTRAL DATA SCORES LOADINGS Principal Components

• Each principal component calculated captures as much of the variation within the data as possible. • This variation is removed and a new principal component is determined. • The first PC describes the greatest source of variation within the data Scores

• The scores are organized in a column fashion. • The first column denotes the scores relating to the variation captured on PC1. • Intra-sample relationships can be observed by plotting the scores from PC1 against PC2. • This can be expanded to the scores of the first three PC’s. Scores

Samples/Scores Plot of aldat Samples/Scores Plot of aldat Samples/Scores Plot of aldat 400 300 150

300 200 100

200 100 50

100 0 0 0 -100 -50

-100

Scores PC on 2 (29.86%) Scores PC on 3 (11.16%) Scores PC on 1 (52.89%) -200 -100 -200

-300 -300 -150

-400 -400 -200 5 10 15 20 25 30 35 40 45 50 55 5 10 15 20 25 30 35 40 45 50 55 5 10 15 20 25 30 35 40 45 50 55 Sample Sample Sample Scores on PC1 Scores on PC2 Scores on PC3

Samples/Scores Plot of aldat 300 Samples/Scores Plot of aldat 150 Samples/Scores Plot of aldat

200 100 100 100 50 50

0 0 0 -50 -100 -50 -100

Scores PC on 2 (29.86%) -200 -150 Scores PC on 3 (11.16%) Scores PC on 3 (11.16%) -100 -200 -300 -150 200 0 -200 -400 -400 200 -400 -300 -200 -100 0 100 200 300 400 -200 -200 0 Scores on PC 1 (52.89%) -400 -300 -200 -100 0 100 200 300 400 Scores on PC 2 (29.86%) Scores on PC 1 (52.89%) Scores on PC 1 (52.89%) Scores of PC1 vs. Scores of PC1 vs. Scores of PC2 PC3 PC1 vs. PC2 vs PC3 Loadings

• Illustrate the weight or importance of each variable within the original data. • From loadings it is possible to see the most significant variables. • Loadings can be used to track the process of a reaction e.g. monitor reactant consumption. • Deduce variables responsible for the clustering in the scores. Loadings Variables/Loadings Plot Variables/Loadings Plot 0.055 0.07

0.06 0.05 0.05

0.045 0.04

0.03 0.04 0.02

0.035 0.01

Loadings on PC on 1 (81.38%) Loadings Loadings on PC on 1 (99.93%) Loadings 0 0.03 -0.01

0.025 -0.02 100 200 300 400 500 600 100 200 300 400 500 600 Variable Variable NO PREPROCESSING MEAN CENTRING

Variables/Loadings Plot 0.05

0.04

0.03

0.02

0.01

0

Loadings on PC on 1 (62.61%) Loadings -0.01

-0.02

-0.03 100 200 300 400 500 600 Variable AUTO SCALING Outlier Removal

• PCA can be used in conjunction with confidence intervals to identify outliers within a set of data.

Samples/Scores Plot Samples/Scores Plot

4 6

3 4 2

2 1

0 0

-1

-2 Scores PC on 2 (12.88%) Scores PC on 2 (12.88%) -2 -4 -3

-4 -6

-15 -10 -5 0 5 10 -15 -10 -5 0 5 10 15 Scores on PC 1 (81.38%) Scores on PC 1 (81.38%) 95% 99.9% Confidence Interval Summary

• PCA used to decompose the data into scores and loadings • Scores reveal information about between sample variation. • Loadings tell us which variables from within the original data contribute most to the scores. • PCA can also be used to analyze and investigate data to perform tasks such as outlier removal. • PCA facilitates process understanding. Section 3

Partial Least Squares Inverse Calibration

• Calibration Equation: y = Xb y is concentration data, X is spectra and b is the produced model. • Calibration requires only spectra and calibration property, such as a concentration. • Demanding strategy as assumption made about errors. • Requires good lab data. PLS

• Partial Least Squares (PLS) is an extension of the PCA method. • PCA extracts PC’s describing the sources of variation within the data. • PLS takes the PC’s and correlates them with Y-Block information to calculate Latent Variables (LV’s). • Y-Block information is typically sample concentrations, physical properties. • PLS is a quantitative procedure and can be used to model and predict y-block information for future samples The X- and Y-Block

• PLS uses X-Block and Y-Block information. • X-Block tends to refer to spectra. • Y-Block relates to the information you want to predict, such as concentration or some physical property. • Y-Block data is normally collected offline in a lab. • Y-Block is often referred to as the reference method. PLS Data Analysis

INPUT SPECTRA PREPROCESSING

NEW X-Block PLS CALIBRATION MODEL MEASUREMENT DATA CONCENTRATIONS PREPROCESSING

Y-Block PLS PREDICTION MODEL

CONCETRATIONS FOR NEW MEASURMENTS Difference between PLS and PCA

• PCA • PLS • Classification • Quantification • Exploratory analysis of • Prediction data. • Modeling of current • PC’s extracted describe and future samples. sources of variation in • Latent variables order of significance. important factor in • Used for the removal of determining model outliers performance. Calibration

• Building a calibration model, requires retaining as much relevant variation as possible. • Whilst removing as much irrelevant variation as possible. • Selecting calibration data VITAL to final predictions. • Use Design of (DoE) to effectively map a data space or series of experiments. • Quality of calibration determine by calculating the Root Mean Square Error in Calibration (RMSEC) Selecting Samples For Calibrations

– Use optimal methods to effectively map the data – Methods such as D-Optimal, E-Optimal and Kennard- Stone. – These methods only need to be run once. • Random Subsets – Select a set of samples entirely at random. – Perform analysis and calculate errors. – Re-select a new random subset and repeat procedure for a number of iterations – Calculate average errors at the end. Selecting Samples For Calibrations

• Visual depiction of data DATA SET Selecting Samples For Calibrations

• D-Optimal

• Samples selected according to D-Optimal criteria. Selecting Samples For Calibrations

• Kennard-Stone

• Samples selected in an attempt to uniformly map the data. Validation

• Validation data is used to check the predictive performance of the model. • Validation can be performed using subsets of the calibration data (Cross Validation). • Separate validation sets of data can be collected (True Validation). • Cross validation leads to overly positive results. • Quality of validation calculated using the Root Mean Square Error in Prediction (RMSEP). • Quality of predictions determines quality of model. Modeling

• The quality of calibrations and validations can vary significantly with the number of LV’s included in the model. • Too few and the model will make poor predictions as there is insufficient information in the calibration • Too many and the model has become overly focused and contains too much variation making it not robust to small amounts of variation. Modeling

Ideal Number of Latent Variables RMSEP

for model. Error

RMSEC

Number of Latent Variables Model Maintenance

• We’ve built the model: So what next? MODEL MAINTENANCE • Collect lab data weekly to re-validate the model. – Are model results within significant error? – If not what do we do? • Re-evaluate calibration samples – Is the calibration model still relevant? • Perform DoE to re-select more data. • Check LV model to make sure appropriate LV’s being used. • Continual improvement. Summary

• PLS implements inverse calibration to incorporate concentration information into a model. • Makes quantitative predictions of unseen samples • Requires calibration and validation • Latent variables have significant effect on model. • Quality of model determined by prediction and the RMSEP Case Study

Model Building From Beginning to End Case Study 1 • Near IR spectra of tablets collected over a period of 4 years. • GC analysis of tablets showed active pharmaceutical ingredient within specification for all samples.

7 60

6.5 50

6

40 5.5

5 30 Signal Signal Intensity

4.5 of Samples Number 20

4

10 3.5

3 0 600 800 1000 1200 1400 1600 1800 2000 4.65 4.7 4.75 4.8 4.85 4.9 4.95 5 5.05 5.1 5.15 Wavenumber cm-1 Tablet API Concentration The Problem

• The NIR calibration model produced has determined 32% samples are out of specification.

• The Plan: Use PCA to investigate and examine the spectra to improve the NIR calibration. Data Analysis Plan

NIR TABLET OUTLIER MEAN DATA REMOVAL CENTRING

PCA

SCORES LOADINGS NIR Data – Visual Inspection

7

6.5 VARIATION IN

6 BASELINE

5.5 DETECTOR

5 NOISE Signal Signal Intensity 4.5

4

3.5

3 600 800 1000 1200 1400 1600 1800 2000 Wavenumber cm-1 Pre-processing Mean Centred NIR Spectra 7 1

0.8 6.5

0.6 6 0.4

5.5 0.2

5 0

Signal Signal Intensity -0.2

4.5 Mean Centred Signal Intensity Centred Mean -0.4 4 -0.6

3.5 -0.8

3 -1 600 800 1000 1200 1400 1600 1800 2000 600 800 1000 1200 1400 1600 1800 2000 Wavenumber cm-1 Wavenumber cm-1 • Data mean centered to reduce the magnitude of some variables. • After mean centering large peak between 1350cm-1 and 1700cm-1 Mean Centered Scores

Samples/Scores Plot • Strange distribution of 6 scores. 4 • For samples that should 2 all be the same 0 theoretically should form

-2 one group. Scores PC on 2 (12.88%) -4 • However 6 clusters

-6 formed. -15 -10 -5 0 5 10 15 Scores on PC 1 (81.38%) • Further investigation found 6 different tablet presses had been used. Mean Centered Loadings

Variables/Loadings Plot • Loadings on PC1 0.07 0.06 show that the 0.05 variables after 400 0.04 0.03 contribute little 0.02

0.01 information or noise Loadings on PC on 1 (81.38%) Loadings 0 to the scores. -0.01

-0.02 100 200 300 400 500 600 • Spectra truncated at Variable variable 400, which is 1398cm-1 Scatter Correction

• Investigation into the manufacturing procedure reveal tablets made using different presses. • This cause minor variations in the tablet depth. • This altered the pathlength and scattering of the NIR radiation. • Preprocessing must be applied to minimize the variation in the data due to the change in tablet depth. Data Analysis Plan 2

NIR TABLET VARIABLE MULTIPLICATIVE MEAN SCATTER DATA REMOVAL CORRECTION CENTRING

PCA

SCORES LOADINGS Scatter Correction

UNCORRECTED SPECTRA SCATTER CORRECTED SPECTRA 7 7

6.5 6.5

6 6

5.5 5.5

5

5

Signal Signal Intensity Signal Intensity 4.5

4.5 4

4 3.5

3 3.5 600 700 800 900 1000 1100 1200 1300 1400 600 700 800 900 1000 1100 1200 1300 1400 Wavenumber cm-1 Wavenumber cm-1 New Scores

Samples/Scores of Original and Scatter Corrected Data • After performing the 3 new stages of 2 preprocessing the new scores (red 1 triangles) have 0 formed one tight

-1 cluster showing that Scores PC on 2 (11.24%) variation not -2 relating to the API

-3 concentration has -15 -10 -5 0 5 10 Scores on PC 1 (88.42%) been removed. What Next?

Partial Least Squares PLS Modeling Strategy

• Stage One: Build calibration model

CALIBRATION PREPROCESSING SPECTRA

INPUT SPECTRA

7 VALIDATION 6.5 PLS CALIBRATION 6 SPECTRA 5.5 MODEL

5 Signal Signal Intensity

4.5

4

3.5 600 700 800 900 1000 1100 1200 1300 1400 CALIBRATION Wavenumber cm-1 CONCENTRATION PREPROCESSING CONCENTRATIONS

5.05 VALIDATION

5 4.95 CONCENTRATION 4.9

4.85 API Concentration

4.8

4.75

4.7 50 100 150 200 250 Sample Number PLS Calibration Model

• Large number of Samples/Scores Plot LV’s used to 5.15 produce the best 5.1 calibration model. 5.05 5

• Too many LV’s can 4.95

cause ‘over- 4.9 API Calibrated fitting’. 4.85

• RMSEC = 0.03539 4.8 • Error of 0.723% of 4.75 4.7 the mean API 4.65 4.7 4.75 4.8 4.85 4.9 4.95 5 5.05 5.1 5.15 concentration. API Measured PLS Modeling Strategy

• Stage Two: Test Validate Calibration Model.

VALIDATION PREPROCESSING SPECTRA PLS CALIBRATION VARY LATENT MODEL VARIABLES VALIDATION CONCENTRATION PREPROCESSING

RMSEC

RMSEP LV Model

0.065

0.06

0.055

0.05

Error RMSEC 0.045 RMSEP

0.04

0.035

0.03 0 2 4 6 8 10 12 14 16 Number of Latent Variables • Varying number of LV’s to use in the model, lead to the conclusion that 7 LV’s will give the best predictions. PLS Validation Model

Samples/Scores Plot of Predicted v.s. Actual For API Concentration • Using 7 LV’s the 5.05 validation data was applied to the 5 calibration model to

4.95 determine the RMSEP. • Sacrifice calibration to 4.9 ensure better

API Predicted predictions 4.85 • RMSEC = 0.050381 4.8 • RMSEP = 0.053719 • Prediction error 1.087% 4.75 4.65 4.7 4.75 4.8 4.85 4.9 4.95 5 5.05 5.1 5.15 API Measured of the mean API concentration. PLS Future Modeling Strategy

• Stage Three: Predict new samples.

NEW SPECTRA PLS CALIBRATION PREDICTED MEASUREMENTS MODEL CONCENTRATIONS

6

5.5 CONCENTRATION

5 PLS CALIBRATION 4.75 Signal Signal Intensity 4.5 MODEL 4.79

4 4.9

3.5 600 700 800 900 1000 1100 1200 1300 1400 Wavenumber cm-1 Case Study Summary

• PCA used to explore variation within the spectra • Samples and variables selected for calibration. • Scatter correction and mean centering used to preprocess data. • PLS model built and validated using calibration and validation data. • RMSEC and RMSEP calculated. • Concentrations determined for new sample measurements. Acknowledgements