Chemometrics in Analytical Chemistry—Part II: Modeling, Validation, and Applications

Analytical and Bioanalytical Chemistry (2018) 410:6691–6704
https://doi.org/10.1007/s00216-018-1283-4

FEATURE ARTICLE

Richard G. Brereton (1), Jeroen Jansen (2), João Lopes (3), Federico Marini (4), Alexey Pomerantsev (5), Oxana Rodionova (5), Jean Michel Roger (6), Beata Walczak (7), Romà Tauler (8)

Received: 8 June 2018 / Accepted: 18 July 2018 / Published online: 2 August 2018
© Springer-Verlag GmbH Germany, part of Springer Nature 2018

* Romà Tauler, [email protected]

(1) School of Chemistry, University of Bristol, Cantocks Close, Bristol BS8 1TS, UK
(2) Institute for Molecules and Materials, Radboud University Nijmegen, Postvak 61, P.O. Box 9010, 6500 GL Nijmegen, The Netherlands
(3) Research Institute for Medicines (iMed.ULisboa), Faculdade de Farmácia, Universidade de Lisboa, Av. Prof. Gama Pinto, 1649-003 Lisbon, Portugal
(4) Department of Chemistry, University of Rome La Sapienza, Piazzale Aldo Moro 5, 00185 Rome, Italy
(5) Institute of Chemical Physics RAS, 4, Kosygin Str, 119991 Moscow, Russia
(6) Irstea, UMR ITAP, 361 Rue Jean-François Breton, 34000 Montpellier, France
(7) Institute of Chemistry, University of Silesia, 40-006 Katowice, Poland
(8) IDAEA-CSIC, Jordi Girona 18-26, 08034 Barcelona, Spain

Abstract

The contribution of chemometrics to important stages throughout the entire analytical process, such as experimental design, sampling, and explorative data analysis, including data pretreatment and fusion, was described in the first part of the tutorial "Chemometrics in analytical chemistry." This is the second part of a tutorial article on chemometrics, which is devoted to the supervised modeling of multivariate chemical data, i.e., to the building of calibration and discrimination models, their quantitative validation, and their successful applications in different scientific fields. This tutorial provides an overview of the popularity of chemometrics in analytical chemistry.

Keywords: Calibration · Discrimination · Validation · Prediction · Omics · Hyperspectral imaging

Introduction

Chemometrics is a highly interdisciplinary field (see Fig. 1) whose relevance among the chemical disciplines in general, and analytical chemistry in particular, has grown considerably over the years. Despite this, it is still largely unknown to many analytical chemists and is often misused or not completely understood by practitioners who rely on the very few techniques that are implemented in the most widespread commercial software. However, chemometrics represents a wealth of possibilities for (analytical) chemists, as it is a discipline which accompanies the analytical workflow at all stages of the pipeline. In this context, the present paper represents the second part of a feature article aiming at highlighting the fundamental role that chemometrics has within analytical chemistry and the different tools which could and should be used to address key issues in the field. Starting from these premises, in the first part of this tutorial article [1], topics like sampling, experimental design, data preprocessing, projection methods for data exploration and factor analysis, and data fusion strategies were covered and contextualized in the framework of specific goals within analytical chemistry. The present paper, on the other hand, deals with aspects related to predictive modeling (both for quantitative and qualitative responses) and validation, and presents some successful examples of the application of chemometric strategies to different "hot topics," together with a few considerations about how the discipline will evolve in the coming years.

Fig. 1 Chemometrics as an interdisciplinary field

Predictive chemometrics modeling

Many applications of chemometrics in analytical chemistry are associated with so-called predictive modeling [2], where a mathematical model relates a chemical and/or physical property of a sample to instrumental signals that are generally easier and cheaper to acquire. Typical examples are modeling of protein content in wheat samples based on their NIR signals, modeling of octane number of fuel based on the NIR signals, modeling of antioxidant properties of tea extract based on their chromatographic fingerprints, modeling of biological activity of studied compounds based on their descriptive parameters, medical diagnostics based on genomic, proteomic and/or metabolomics fingerprints, etc.

Chemometric models may predict both quantitative (continuous-valued, e.g., protein content) and qualitative (discrete, such as healthy/ill or compliant/non-compliant class membership) properties. Calibration, classification, and discrimination problems (see Fig. 2) are common in chemometrics. The same considerations hold in other predictive modeling situations. In other words, the main goal of multivariate modeling is the prediction of parameter(s), the measurement of which would be time- and cost-intensive, based on fast and cheap measurements such as NIR, or on calculated parameters (energy of interactions, topological indices, etc.). The responses to be predicted are called dependent variables and the variables used to perform this prediction are called independent variables.

Fig. 2 Schematic representation of calibration and discrimination principles

Multivariate calibration

Assuming that the set of centered dependent variables is denoted as Y (m × k) and a set of centered independent variables is denoted as X (m × n), the constructed calibration model can be represented as:

$$\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E} \tag{1}$$

where the matrix B (n × k) collects the k vectors of regression coefficients, and E (m × k) denotes the residuals.

If only one parameter is to be predicted, the above equation becomes:

$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e} \tag{2}$$

where y and e are vectors of dimensionality m × 1 and b is a vector of dimensionality n × 1. In the following, it is assumed that only one parameter is predicted for the sake of parsimony, but this approach may be extended to the prediction of multiple dependent variables.

How the model is constructed depends mainly on data dimensionality and its correlation structure. If matrix X contains fewer variables than samples (n < m) and these variables are uncorrelated, then the vector of regression coefficients may be calculated as:

$$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \tag{3}$$

If matrix X contains more variables than samples (n > m), and/or if these predictor variables are highly correlated, then classical multivariate methods like multiple linear regression (MLR) cannot be applied. This is because matrix (X'X) cannot be inverted due to rank deficiency. In this case, Eq. (3) may be replaced by a more general equation:

$$\mathbf{b} = \mathbf{X}^{+}\mathbf{y} \tag{4}$$

where X+ represents a "generalized inverse" of X.

Several multivariate methods in common use, such as ridge regression (RR), principal components regression (PCR), and partial least squares regression (PLSR), differ in the estimation of this inverse X+. RR is more popular among statisticians than in chemometrics [3], probably due to the widespread use of PCR and PLS loading and score plots for model visualization and interpretation in chemometrics.

One-class classification (OCC) methods (also known as class-modeling methods) are directed toward modeling of individual classes [4] independently of others. They are focused on similarities among the samples from the same class rather than on the differences between classes.

In chemometrics, the most popular OCC method is soft independent modeling of class analogy (SIMCA) [5]. It forms a closed acceptance area that delineates the target class in the multivariate space using disjoint PCA (performed just on the training set in the target class rather than on all objects studied). SIMCA calculates two types of distance, the D statistic (within the disjoint PC space) and the Q statistic (from an object to the disjoint PC space), which are also known by various other names depending on the author. There are various ways to interpret these distances, depending on the author and software package [6].

In contrast, two-class or multi-class classifiers (often known as supervised discriminant methods) are used to form boundaries between classes rather than around classes. The oldest and best known is linear discriminant analysis (LDA), proposed by Fisher [7]. It looks for linear surfaces (hyperplanes) which best separate the samples from different categories in the multidimensional space, the identification of which relies on the relative position of the barycenters of the groups (between-class scatter B) and on the within-class variance/covariance W. However, as in regression, a higher number of variables than samples (n > m) or collinearity among the predictors makes the method inapplicable, so that alternatives should be found to deal with the multi- or megavariate data commonly encountered in modern chemical problems. A possible way of overcoming this limitation, analogously to what was already discussed for calibration, is to project X onto a suitable low-dimensional subspace of orthogonal variables, where discriminant analysis can be performed. This can be done, for instance, by a preliminary PCA decomposition of the matrix X (PCA-DA) or a PLS between X and a suitably defined Y (PLS-DA) [7–9].

To perform discrimination among k classes via PLS-DA, information about class membership of each
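
As an illustration of the multivariate calibration discussion (Eq. (3) and its rank-deficiency problem), the following is a minimal, dependency-free sketch rather than the authors' implementation. It solves the normal equations directly when n < m, shows that the same route fails when n > m, and then uses a ridge-type estimate, b = (X'X + λI)^-1 X'y, as one possible stand-in for the generalized inverse. The function names (`solve`, `fit`), the parameter `lam` (λ), and the toy data are all assumptions made for this example.

```python
# Sketch only: names, toy data, and the value of lam are illustrative assumptions.

def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def solve(A, rhs, tol=1e-12):
    """Solve A x = rhs by Gaussian elimination with partial pivoting.

    Raises ValueError when A is (numerically) rank deficient.
    """
    n = len(A)
    M = [list(row) + [rhs[i]] for i, row in enumerate(A)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))   # pivot row
        if abs(M[p][c]) < tol:
            raise ValueError("X'X is rank deficient and cannot be inverted")
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for c in reversed(range(n)):                           # back substitution
        s = sum(M[c][k] * x[k] for k in range(c + 1, n))
        x[c] = (M[c][n] - s) / M[c][c]
    return x

def fit(X, y, lam=0.0):
    """Return b = (X'X + lam*I)^-1 X'y; lam = 0 is Eq. (3), lam > 0 is ridge."""
    Xt = transpose(X)
    XtX = [[sum(a * b for a, b in zip(ci, cj)) for cj in Xt] for ci in Xt]
    for i in range(len(XtX)):
        XtX[i][i] += lam
    Xty = [sum(a * b for a, b in zip(ci, y)) for ci in Xt]
    return solve(XtX, Xty)

# Case 1: n = 2 variables < m = 4 samples, centered and uncorrelated columns.
X_tall = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
y_tall = [-1, -5, 5, 1]            # generated as y = 2*x1 - 3*x2
b_ols = fit(X_tall, y_tall)        # recovers the coefficients 2 and -3

# Case 2: n = 3 variables > m = 2 samples, so X'X (3 x 3) has rank 1.
X_wide = [[1, 1, 2], [-1, -1, -2]]
y_wide = [1, -1]
try:
    fit(X_wide, y_wide)            # Eq. (3) is not applicable here
    ols_failed = False
except ValueError:
    ols_failed = True

b_ridge = fit(X_wide, y_wide, lam=1e-3)   # regularized estimate exists
```

The small `lam` keeps the ridge solution close to a minimum-norm fit of the training data. PCR and PLSR address the same rank deficiency differently, by regressing on a few latent variables instead of adding a penalty, and are not shown in this sketch.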
