Université d’Aix-Marseille

École Doctorale 250 – Sciences Chimiques

Institut Méditerranéen de Biodiversité et d'Écologie marine et continentale

TRACEABILITY AND AUTHENTICATION OF MONOVARIETAL OLIVE OILS USING CHEMOMETRIC TOOLS

Thesis submitted for the degree of Doctor of Science

Speciality: Chemical Sciences

Presented by Astrid MALÉCHAUX

Defended on 10 January 2020 before the jury:

E. VIGNEAU, Professor, Oniris (Reviewer)

C. CORDELLA, Research Engineer, INRA (Reviewer)

M. SERVILI, Professor, University of Perugia (President of the jury)

F. MARINI, Associate Professor, University of Rome (Examiner)

N. DUPUY, Professor, Aix-Marseille Université (Thesis supervisor)

Y. LE DRÉAU, Senior Lecturer, Aix-Marseille Université (Thesis co-supervisor)

J. ARTAUD, Professor Emeritus, Aix-Marseille Université (Invited member)

ORIGINAL ILLUSTRATION BY FILIP RIZOV, 2019

« Are you interested in science by any chance? I'm interested in molecules. The Sufis say each one of us is a planet spinning in ecstasy. But I say each one of us is a set of shifting molecules. Spinning in ecstasy. »

THE LIMITS OF CONTROL (JIM JARMUSCH, 2009)


ACKNOWLEDGEMENTS

First of all, I thank the members of the jury, who took the time to evaluate this work and agreed to attend my thesis defence: Évelyne Vigneau, professor and director of the Statistique-Sensométrie-Chimiométrie unit at Oniris; Christophe Cordella, research engineer at the UMR Physiologie de la Nutrition et du Comportement Alimentaire of INRA; Maurizio Servili, professor in the Department of Food Science and Technology at the University of Perugia; and Federico Marini, associate professor in the Department of Chemistry at the University of Rome “La Sapienza”.

Thanks also to all the members of the Biotechnologie Environnementale et Chimiométrie team at IMBE for their warm welcome, and more particularly:

My thesis supervisor Nathalie Dupuy, professor and head of the BEC team, and my co-supervisor Yveline Le Dréau, senior lecturer, who placed their trust in me, shared their knowledge of chemometrics and analytical techniques, guided me in writing articles, advised me on my career plans and showed patience throughout these three years. Thank you also for giving me the opportunity to grow beyond day-to-day activities, through participation in various training courses and conferences, teaching experience, and involvement in the life of the laboratory and in associations (a thought for my friends at MODOCC and IMBEST).

Pierre Vanloot, senior lecturer, for his sound advice on chemometrics and spectral data processing, and for welcoming me at the IUT de Chimie (whose entire teaching team I also thank).

Jacques Artaud, professor emeritus and olive oil specialist, for sharing his passion, his knowledge and his expertise in fatty acid analysis, and for his careful proofreading.

Sandrine Amat, research engineer and morale-boosting specialist, for sharing her good humour, her coffee breaks, and “incidentally” her mastery of the laboratory instruments.


I also thank everyone else who contributed to the work presented in this thesis, in particular: the staff of France Olive (formerly AFIDOL) for their welcome, for sharing their know-how and for supplying the olive oil samples; the former Aix-Marseille Université professors and researchers Henri Dou and Jacky Kister for their training in bibliometrics; and the interns Cécile Grapeloup, Théo Brunet and Tracy Richard-Smith for their help in carrying out the spectroscopic and chromatographic analyses.

More personal thanks go to my friends, old and new, thanks to whom these three years were rich in experiences that were not only scientific but also sporting, musical, culinary… and cinematic! A special mention to:

Mathilde, Solène and Filip, the first to make me want to come to Marseille, and then to stay (thanks also to Filip for the superb drawings and for introducing me to Macedonia).

Louisanne, Romain and Etienne for the traditional festivals in Lyon, Nîmes or Avignon.

My fellow PhD students: Lucie (aperitif coordinator), Nina (team coffee), Rayhanne (miss smile), Quentin (“a cheerful pessimist”), Inès (head gourmet), Barhoum (best sports coach), Enrique and Maya (the soap opera of the year, and thanks for the songs!). To this list must of course be added the two other founding members of the Social Team:

Sébastien, my film-loving sidekick with a boundless imagination: thank you for the arthouse film screenings that nobody else wanted to see, the introduction to climbing, the Italian road trip, the delicious cakes and the crazy adventure of the Fungi project (whose participants will recognise themselves and be doubly thanked!).

Lise, my thesis “twin”: a huge thank you for sharing all the good (and not so good) times, for your unfailing motivation (not always followed onto the dance floor), the scientific (and other) discussions, the Friday Pépés, the TV series evenings, the spring travel weekends, the cat stories… and so many other memories!

Finally, I thank my family and especially my parents, Nathalie and Olivier, who passed on to me their curiosity and love of learning, and who encouraged and supported me through to the completion of this final stage of my student journey. Thank you also for the wonderful trips, whose change of scenery let me escape the routine and come back with ever more motivation!


CONTENTS

ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
LIST OF SCIENTIFIC OUTPUTS
GENERAL INTRODUCTION
CHAPTER 1: SCIENTIFIC CONTEXT – BIBLIOMETRIC ANALYSIS AND LITERATURE REVIEW
  Exploring the scientific interest for olive oil origin: a bibliometric study from 1991 to 2018
    Abstract
    1. Introduction
    2. Methodology
    3. Results and discussion
    4. Conclusion
    References
  Applications of Vibrational Spectroscopy Techniques
    Abstract
    1. Introduction
    2. Bibliometrics
    3. Spectroscopy
    4. Chemometrics
    5. Near infrared spectroscopy
    6. Mid infrared spectroscopy
    7. Raman spectroscopy
    8. Multiblock analysis - concatenation of spectral data
    9. Conclusion
    References
CHAPTER 2: PROPOSAL OF A NEW APPROACH APPLYING THE CONTROL CHART PRINCIPLE TO CHEMOMETRIC MODELS
  Discrimination of extra virgin olive oils from five French cultivars: en route to a control chart approach
    Abstract
    1. Introduction
    2. Material and methods
    3. Results and discussion
    4. Conclusion
    References
CHAPTER 3: APPLICATION OF DIFFERENT STRATEGIES FOR FUSING DATA FROM SEVERAL ANALYTICAL TECHNIQUES
  Multiblock chemometrics for the discrimination of three extra virgin olive oil varieties
    Abstract
    1. Introduction
    2. Materials and methods
    3. Results and discussion
    4. Conclusion
    References
    Supporting Information
  Comparison of near- and mid-infrared data fusion strategies: are two better than one?
    Abstract
    1. Introduction
    2. Material and methods
    3. Results and discussion
    4. Conclusion
    References
    Supporting Information
GENERAL CONCLUSION AND PERSPECTIVES
APPENDICES


LIST OF FIGURES

CHAPTER 1: SCIENTIFIC CONTEXT – BIBLIOMETRIC ANALYSIS AND LITERATURE REVIEW

Exploring the scientific interest for olive oil origin: a bibliometric study from 1991 to 2018

Figure 1. Evolution of the total number of articles and number of articles containing keywords related to specific subjects between 1991 and 2018
Figure 2. Annual volumes of olive oil production and consumption for the countries having published more than 10 articles between 1991 and 2018
Figure 3. Network of the main keywords associated at least ten times with one or more of the thematic clusters
Figure 4. Network of the main authors, having published at least 6 articles and associated at least 3 times with one or more of the thematic clusters, with their country

Applications of Vibrational Spectroscopy Techniques

Figure 1. Number of articles containing the words “olive oil”, “authentication”, “spectroscopy” or “chromatography” and their combinations
Figure 2. Evolution of the number of publications found for the query “olive oil” and authentic* and (NIR or “near infrared”) or (MIR or “mid infrared”) or Raman
Figure 3. Word cloud generated by the titles of the articles from Web of Science query
Figure 4. Spectroscopic techniques related to the infrared region of the electromagnetic spectrum
Figure 5. Near infrared spectrum of EVOO with identification of main absorbance bands
Figure 6. Mid infrared spectrum of EVOO with identification of main absorbance bands
Figure 7. Raman spectrum of EVOO with identification of main absorbance bands

CHAPTER 2: PROPOSAL OF A NEW APPROACH APPLYING THE CONTROL CHART PRINCIPLE TO CHEMOMETRIC MODELS

Discrimination of extra virgin olive oils from five French cultivars: en route to a control chart approach

Figure 1. Scores and loadings for the first two PCs of the PCA using fatty acids and squalene composition
Figure 2. Predicted Y scores with decision rules for the two thresholds


CHAPTER 3: APPLICATION OF DIFFERENT STRATEGIES FOR FUSING DATA FROM SEVERAL ANALYTICAL TECHNIQUES

Multiblock chemometrics for the discrimination of three extra virgin olive oil varieties

Figure 1. Definition of the thresholds indicating the true, false or uncertain attribution of predicted samples to the modelled variety
Figure 2. Example of a chromatogram from VOO with identification of the peaks
Figure 3. Example of a MIR spectrum from VOO with identification of the bands
Figure 4. Weights of the GC and MIR blocks for each latent variable of the MB-PLS models with the first version of the calibration and prediction sets, with weighted block scores
SI 2. Weights of the GC and MIR blocks for each latent variable of the MB-PLS models with the second version of the calibration and prediction sets, with weighted block scores

Comparison of near- and mid-infrared data fusion strategies: are two better than one?

Figure 1. A: PCA scores, B: loadings for PC1 and C: loadings for PC3, obtained with the concatenated NIR (6100-4500 cm⁻¹) and MIR (1800-700 cm⁻¹) data, with samples represented according to their cultivar on the score plot and most influential bands identified on the loading plots
Figure 2. VIP scores of the PLS1-DA models using the concatenated NIR (6100-4500 cm⁻¹) and MIR (1800-700 cm⁻¹) data
Figure 3. VIP scores of the MB-PLS1-DA models, calculated from A: the initial NIR (6100-4500 cm⁻¹) and MIR (1800-700 cm⁻¹) variables weights, or B: the NIR and MIR block weights
Figure 4. VIP scores of the PCA-PLS1-DA models with 15 PCs from NIR and 15 PCs from MIR
Figure 5. VIP scores of the PLS-PLS1-DA models with 15 LVs from NIR and 15 LVs from MIR
SI 1. Compilation of a: NIR spectra and b: MIR spectra of the olive oil samples


LIST OF TABLES

CHAPTER 1: SCIENTIFIC CONTEXT – BIBLIOMETRIC ANALYSIS AND LITERATURE REVIEW

Exploring the scientific interest for olive oil origin: a bibliometric study from 1991 to 2018

Table 1. Number of articles containing keywords related to specific analytical techniques and most cited article for each technique, with citation count up to 2018
Table 2. Journals having published at least 10 articles between 1991 and 2018, sorted by their respective number of articles, with their associated impact factor (IF), subject category(ies), ranking in each category notified by their quartile (Q), most cited article and citation count up to 2018
Table 3. Number of publications over time for the main authors (having a total of at least 10 articles), and most cited article for each author with its citation count up to 2018
Table 4. Evolution of the number of publications in the thematic clusters between 1991 and 2018

Applications of Vibrational Spectroscopy Techniques

Table 1. Examples of NIR spectroscopy applications to differentiate olive oils from other oils
Table 2. Examples of NIR spectroscopy applications to analyse VOOs adulterated with other oils
Table 3. Examples of NIR spectroscopy applications to determine the origin of VOOs
Table 4. Examples of MIR spectroscopy applications to differentiate VOOs from other oils
Table 5. Examples of MIR spectroscopy applications to analyse VOOs adulterated with other oils
Table 6. Examples of MIR spectroscopy applications to determine the origin of VOOs
Table 7. Examples of Raman spectroscopy applications to differentiate VOOs from other oils
Table 8. Examples of Raman spectroscopy applications to analyse VOOs adulterated with other oils
Table 9. Examples of Raman spectroscopy applications to determine the origin of VOOs
Table 10. Examples of concatenated data applications to analyse VOOs adulterated with other oils
Table 11. Examples of concatenated data applications to determine the origin of VOOs


CHAPTER 2: PROPOSAL OF A NEW APPROACH APPLYING THE CONTROL CHART PRINCIPLE TO CHEMOMETRIC MODELS

Discrimination of extra virgin olive oils from five French cultivars: en route to a control chart approach

Table 1. Mean, maximum and minimum fatty acid and squalene percentages in the five French olive oil cultivars
Table 2. Statistical parameters, mean and standard deviation (SD) of the Y scores of the modelled cultivar for each PLS1-DA calibration model
Table 3. Confusion matrices and statistical parameters of the PLS1-DA models predicting the origin of each cultivar for the samples from all years but last without outliers
Table 4. Confusion matrices and statistical parameters of the PLS1-DA models predicting the origin of each cultivar for the samples from the final year

CHAPTER 3: APPLICATION OF DIFFERENT STRATEGIES FOR FUSING DATA FROM SEVERAL ANALYTICAL TECHNIQUES

Multiblock chemometrics for the discrimination of three extra virgin olive oil varieties

Table 1. Mean, maximum and minimum proportions (%) of fatty acids and squalene for the three varieties
Table 2. Statistical parameters and results (sensitivity, specificity and correct classification rates) of the PLS1-DA models using the first version of the calibration and prediction sets of either GC, MIR, weighted multiblock or non-weighted multiblock data to discriminate the three EVOO varieties
SI 1. Statistical parameters and results (sensitivity, specificity and correct classification rates) of the PLS1-DA models using the second version of the calibration and prediction sets of either GC, MIR, weighted multiblock or non-weighted multiblock data to discriminate the three EVOO varieties

Comparison of near- and mid-infrared data fusion strategies: are two better than one?

Table 1. Number of factors, quality parameters and correct classification rate for the PLS1-DA models on individual datasets
Table 2. Number of factors, quality parameters and correct classification rate for the PLS1-DA models on fused datasets
Table 3. Number of factors, quality parameters and correct classification rate for the hierarchical models with different number of variables in the first step


SI 2. Detailed confusion matrices for the PLS1-DA models on individual datasets
SI 3. Detailed confusion matrices for the PLS1-DA models on fused datasets
SI 4. Detailed confusion matrices for the hierarchical PCA-PLS1-DA models with different number of variables in the first step
SI 5. Detailed confusion matrices for the hierarchical PLS-PLS1-DA models with different number of variables in the first step


LIST OF ABBREVIATIONS

AFIDOL : Association Française Interprofessionnelle de l'Olive

AG : cultivar

ANN : Artificial Neural Networks

AOP : Appellation d’Origine Protégée (French equivalent of PDO)

ATR : Attenuated Total Reflectance

CA : cultivar Cailletier

CG : Chromatographie Gazeuse (French equivalent of GC)

CM : cultivar Chemlali

CT : cultivar Chetoui

CVA : Canonical Variate Analysis

DNA : Deoxyribonucleic Acid

DTGS : Deuterated Triglycine Sulfate

EVOO : Extra Virgin Olive Oil

FT : Fourier Transform

GA : Genetic Algorithm

GC : Gas Chromatography

HCA : Hierarchical Cluster Analysis

HPLC : High Performance Liquid Chromatography

IF : Impact Factor

IOC : International Olive Council

LDA : Linear Discriminant Analysis

LOD : Limit of Detection

LV : Latent Variable

MB : Multiblock

MCS : Mean Calibration Score

MCTA : Mercury Cadmium Telluride


MIR : Mid-Infrared (French: Moyen-Infrarouge)

MLR : Multiple Linear Regression

MS : Mass Spectrometry

MSC : Multiplicative Scatter Correction

MUFA : Mono-Unsaturated Fatty Acids

NIR : Near-Infrared

NMR : Nuclear Magnetic Resonance

OL : cultivar Olivière

OSC : Orthogonal Signal Correction

OU : cultivar Oueslati

PC : Principal Component

PCA : Principal Component Analysis

PCR : Principal Component Regression

PDO : Protected Designation of Origin

PLS : Partial Least Squares

PLSR : Partial Least Squares Regression

PLS-DA : Partial Least Squares Discriminant Analysis

PI : cultivar

PIR : Proche-Infrarouge (French equivalent of NIR)

PUFA : Poly-Unsaturated Fatty Acids

Q : Quartile

Q2 : coefficient of determination of prediction

R2 : coefficient of determination of calibration

RMN : Résonance Magnétique Nucléaire (French equivalent of NMR)

RMSEC : Root Mean Square Error of Calibration

RMSECV : Root Mean Square Error of Cross-Validation

RMSEP : Root Mean Square Error of Prediction

SA : cultivar


SD : Standard Deviation

SEC : Standard Error of Calibration

SECV : Standard Error of Cross-Validation

SEP : Standard Error of Prediction

SFA : Saturated Fatty Acids

SG : Savitzky-Golay

SIMCA : Soft Independent Modelling of Class Analogy

SNV : Standard Normal Variate

SVM : Support Vector Machines

TA : cultivar

UV : Ultra-Violet

VIP : Variable Importance in Projection

VOO : Virgin Olive Oil


LIST OF SCIENTIFIC OUTPUTS

Accepted publications

Food Control, Volume 106, December 2019: Discrimination of extra virgin olive oils from five French cultivars: En route to a control chart approach. Astrid Maléchaux, Yveline Le Dréau, Pierre Vanloot, Jacques Artaud, Nathalie Dupuy. IF=4.248 (https://doi.org/10.1016/j.foodcont.2019.06.017)

Food Chemistry, Volume 309, March 2020: Multiblock chemometrics for the discrimination of three extra virgin olive oil varieties. Astrid Maléchaux, Sonda Laroussi-Mezghani, Yveline Le Dréau, Jacques Artaud, Nathalie Dupuy. IF=5.399 (https://doi.org/10.1016/j.foodchem.2019.125588)

Submitted publications

Food Research International: Exploring the scientific interest for olive oil origin: a bibliometric study from 1991 to 2018. Astrid Maléchaux, Yveline Le Dréau, Jacques Artaud, Nathalie Dupuy.

Analytica Chimica Acta: Comparison of near- and mid-infrared data fusion strategies: are two better than one? Astrid Maléchaux, Yveline Le Dréau, Jacques Artaud, Nathalie Dupuy.

Book chapter

Nova Science Publishers, 2019: Applications of Vibrational Spectroscopy Techniques, in Authentication and Detection of the Adulteration of Olive Oil (editor: Michael Kontominas). Astrid Maléchaux, Nathalie Dupuy, Jacques Artaud.


Oral communications

International conference

Chemometrics in Analytical Chemistry (CAC-2018), Halifax (Canada), June 2018: « Varietal origin discrimination of three Tunisian virgin olive oils by multiblock partial least squares - discriminant analysis ». Astrid Maléchaux, Rabia Korifi, Sonda Laroussi, Yveline Le Dréau, Jacques Artaud, Nathalie Dupuy.

National conferences

Journée du Club d’Expertise Chimique de Méditerranée (CECM), Toulon (France), June 2019: « Chimiométrie et carte de contrôle : une nouvelle approche pour la détermination de l'origine variétale des huiles d'olive ». Astrid Maléchaux, Yveline Le Dréau, Pierre Vanloot, Jacques Artaud, Nathalie Dupuy.

Rencontres Scientifiques de l’École Doctorale de Chimie (ED250), Marseille (France), May 2019: « Varietal discrimination of olive oils: a control chart approach ». Astrid Maléchaux, Yveline Le Dréau, Nathalie Dupuy.

Rencontres Scientifiques de l’École Doctorale de Chimie (ED250), Marseille (France), June 2018: « Varietal discrimination of extra virgin olive oils using vibrational spectroscopy analyses and chemometrics ». Astrid Maléchaux, Yveline Le Dréau, Nathalie Dupuy.

18th Rencontres HélioSPIR, Montpellier (France), November 2017: « Varietal discrimination of extra virgin olive oils – terroir effect ». Astrid Maléchaux, Yveline Le Dréau, Maria João Cabrita, Jacques Artaud, Nathalie Dupuy.

Journée des Doctorants de l’IMBE, Marseille (France), July 2017: « Identifying chemical markers of olive oils traceability and authenticity ». Astrid Maléchaux. Winner of the 2017 Christopher Augur Prize for the best oral presentation.


Posters

International conferences

Olivebioteq 2018, Seville (Spain), October 2018: « Olive oil authenticity: a bibliometric study ». Astrid Maléchaux, Yveline Le Dréau, Jacques Artaud, Nathalie Dupuy.

Chemometrics in Analytical Chemistry (CAC-2018), Halifax (Canada), June 2018: « Variable selection and dimension reduction applied to the varietal discrimination of extra virgin olive oils using MIR, NIR and Raman spectroscopies ». Astrid Maléchaux, Yveline Le Dréau, Pierre Vanloot, Nathalie Dupuy.

National conference

Chimiométrie 2019, Montpellier (France), January 2019: « Chemometrics applied to the varietal discrimination of French extra virgin olive oils using fatty acid and squalene compositions ». Astrid Maléchaux, Yveline Le Dréau, Pierre Vanloot, Jacques Artaud, Nathalie Dupuy.


GENERAL INTRODUCTION

Olive oil has an ambivalent image among the general public. On the positive side, it is valued for its taste [1] and for its nutritional virtues as one of the essential ingredients of the Mediterranean diet. [2] Several epidemiological and clinical studies have explored the health benefits of the Mediterranean diet as a whole and of olive oil consumption in particular. The protective effect against cardiovascular diseases of monounsaturated fatty acids, the major constituents of olive oils, has thus been demonstrated. Diets rich in monounsaturated fatty acids are also thought to benefit people suffering from obesity or diabetes. In addition, many minor compounds have favourable effects on human health, such as the anti-inflammatory and antioxidant activities associated in particular with phenolic compounds. Finally, although the results of the studies conducted to date are less certain, olive oil consumption could also help to reduce the incidence of certain cancers and to slow cognitive ageing. [3] However, this good reputation is tarnished by frequent cases of fraud. [4] For example, 30 incidents were recorded in various countries by the European Commission's Joint Research Centre between September 2016 and September 2019. This places olive oil among the foods most affected by fraud, behind fish, meat, wine and dairy products, but ahead of spices, fruit and vegetables, honey and eggs. [5] Such food fraud is an intentional deception of the consumer for economic gain, and can take the form either of substitution, dilution or addition of ingredients, or of counterfeiting or mislabelling.
[1] https://www.lemonde.fr/m-gastronomie/article/2017/01/16/l-olive-fait-tache-d-huile_5063374_4497540.html
[2] https://www.pourlascience.fr/sd/chimie/huile-dolive-et-sante-4458.php
[3] López-Miranda, J., Pérez-Jiménez, F., Ros, E., et al. Olive oil and health: Summary of the II international conference on olive oil and health consensus report, Jaén and Córdoba (Spain) 2008. Nutrition, Metabolism & Cardiovascular Diseases, 2010, 20, 284-294.
[4] https://www.quechoisir.org/actualite-huiles-d-olive-a-nouveau-pointees-du-doigt-n59641/
[5] https://ec.europa.eu/knowledge4policy/publication/food-fraud-summary-month-reports

This problem arises particularly for oils with high added value, such as extra virgin olive oils produced from a single cultivar (i.e. cultivated variety) or benefiting from a protected designation of origin (PDO). Indeed, many factors related to geographical origin, varietal origin, olive growing conditions and oil production methods can influence the chemical composition and therefore modify the sensory and nutritional properties of the final product. [6] Thus, even though fraudulent oils do not generally present any particular health risk, they do not offer the expected qualities in terms of taste and health benefits. [7] These frauds therefore cause financial losses for honest producers, who face unfair competition from fraudsters and a loss of consumer confidence, as confirmed by a recent European Commission report. [8] To address this problem, it is important to develop and optimise methods for verifying that the quality and origin of an oil match those stated on its label. At present, official controls mainly measure quality criteria, involving the determination of oleic acidity and peroxide value, ultraviolet absorbance measurements and an organoleptic assessment, as well as purity criteria comprising the analysis of fatty acid, sterol, stigmastadiene, wax, erythrodiol and uvaol contents by gas chromatography (GC) and the determination of triglyceride composition by liquid chromatography. [9] These analyses specifically quantify certain characteristic molecules that confirm the nature of the oil (olive or another oilseed plant) and the olive oil category (extra virgin, virgin, lampante or refined). However, ensuring the traceability of olive oils also requires confirming their geographical and varietal origins.
[6] https://huiles-et-olives.fr/degustation/pourquoi-differents-gouts/
[7] Esteki, M., Regueiro, J. and Simal-Gándara, J. Tackling Fraudsters with Global Strategies to Expose Fraud in the Food Chain. Comprehensive Reviews in Food Science and Food Safety, 2019, 18, 425-440.
[8] https://ec.europa.eu/food/sites/food/files/safety/docs/food-fraud_network_activity_report_2018.pdf
[9] Commission Regulation (EEC) No 2568/91 of 11 July 1991 on the characteristics of olive oil and olive-residue oil and on the relevant methods of analysis, as updated on 16.10.2015.
[10] Bajoub, A., Bendini, A., Fernández-Gutiérrez, A. et al. Olive oil authentication: A comparative analysis of regulatory frameworks with especial emphasis on quality and authenticity indices, and recent analytical techniques developed for their assessment. A review. Critical Reviews in Food Science and Nutrition, 2018, 58, 832-857.

Many scientific studies address this subject in order to develop and optimise different analytical techniques, with the long-term objective of updating regulatory analyses to include origin control. [10] Some studies analyse specific compounds by gas or liquid chromatography, which requires sample preparation that can be lengthy and solvent-intensive. Others use faster analyses with electronic sensors or spectroscopic techniques such as near-infrared (NIR), mid-infrared (MIR), Raman or nuclear magnetic resonance (NMR), which record an overall “fingerprint” of each sample. [11] These chemical analysis techniques generate large volumes of complex data, so developing multivariate statistical models is essential to extract the relevant information. This application of statistics to chemical data is at the heart of chemometrics, in which various algorithms can be used depending on the nature of the data and the objectives, such as regression models for predicting quantitative parameters or classification and discrimination models for predicting qualitative parameters. [12] For recognising the origin of foods, one of the most widely used models is partial least squares discriminant analysis (PLS-DA), which uses an arbitrary coding to indicate whether or not samples have the characteristics of the modelled class. [13] However, although the performance of this model has been demonstrated for several decades in scientific research, its implementation and the interpretation of its results remain complex, prompting proposals for improvement to encourage its application to real-world cases. Questions remain in particular about the ability of PLS-DA to handle classes with unbalanced numbers of samples, and about the choice of the most suitable decision rule for predicting the class membership of new samples. [14] In addition, the same sample can be analysed by different techniques in order to improve its recognition.
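The arbitrary coding used by PLS-DA can be illustrated with a minimal sketch. It is not the implementation used in this thesis: it keeps a single latent variable, uses only the Python standard library, and assumes a 1/0 coding with a decision threshold of 0.5, which is one common convention among several. The function names, the two-variable samples and the class labels "A" and "B" are all made up for illustration.

```python
def dummy_code(labels, target):
    """Arbitrary coding: 1.0 for samples of the modelled class, 0.0 otherwise."""
    return [1.0 if label == target else 0.0 for label in labels]


def fit_pls1_one_lv(X, y):
    """Fit a PLS1 model with a single latent variable on mean-centred data."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [value - y_mean for value in y]
    # Weight vector: direction of maximum covariance between X and y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    # Scores of the samples on the latent variable, then regression of y on them.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    q = sum(t[i] * yc[i] for i in range(n)) / sum(v * v for v in t)
    return {"x_mean": x_mean, "y_mean": y_mean, "w": w, "q": q}


def predict_score(model, x):
    """Predicted Y score of a new sample x."""
    t_new = sum((x[j] - model["x_mean"][j]) * model["w"][j] for j in range(len(x)))
    return model["y_mean"] + model["q"] * t_new


def belongs_to_class(score, threshold=0.5):
    """Assumed decision rule: attribute the sample to the class above the threshold."""
    return score >= threshold


# Made-up two-variable samples (hypothetical values, not data from the thesis).
X = [[70.0, 10.0], [71.0, 9.5], [69.5, 10.5],
     [60.0, 15.0], [61.0, 14.5], [59.5, 15.5]]
labels = ["A", "A", "A", "B", "B", "B"]
model = fit_pls1_one_lv(X, dummy_code(labels, "A"))
print(belongs_to_class(predict_score(model, [70.5, 9.8])))
print(belongs_to_class(predict_score(model, [60.5, 15.2])))
```

The fixed 0.5 threshold is the simplest possible decision rule; the choice of that rule is precisely one of the open questions raised above.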
When several techniques are applied to the same sample, chemometrics makes it possible to combine the complementary data obtained, so as to account for the complexity of sample composition better than each technique used individually. Most current studies fuse the data by simple concatenation of the individual matrices, but other strategies that fuse the data after dimension reduction, or at the level of the final prediction, could be more effective. [15]

[11] Valli, E., Bendini, A., Berardinelli, A. et al. Rapid and innovative instrumental approaches for quality and authenticity of olive oils. European Journal of Lipid Science and Technology, 2016, 118, 1601-1619.
[12] Brereton, R.G., Jansen, J., Lopes, J. et al. Chemometrics in analytical chemistry – part II: modeling, validation, and applications. Analytical and Bioanalytical Chemistry, 2018, 410, 6691-6704.
[13] Granato, D., Putnik, P., Kovačević, D.B. et al. Trends in Chemometrics: Food Authentication, Microbiology, and Effects of Processing. Comprehensive Reviews in Food Science and Food Safety, 2018, 17, 663-677.
[14] Lee, L.C., Liong, C.-Y. and Jemain, A.A. Partial least squares-discriminant analysis (PLS-DA) for classification of high-dimensional (HD) data: a review of contemporary practice strategies and knowledge gaps. Analyst, 2018, 143, 3526-3539.
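The three fusion levels mentioned above (concatenation of the raw matrices, fusion after dimension reduction, fusion of the final predictions) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, `row_means` is a deliberately trivial placeholder standing in for a real reduction step such as PCA or PLS scores, averaging the scores is just one possible way to combine predictions, and the numbers are made up.

```python
def low_level_fusion(blocks):
    """Concatenate, sample by sample, the rows of every data block."""
    n = len(blocks[0])
    if any(len(block) != n for block in blocks):
        raise ValueError("every block must describe the same samples")
    return [[v for block in blocks for v in block[i]] for i in range(n)]


def mid_level_fusion(blocks, reducers):
    """Reduce each block to a few features, then concatenate the reduced blocks."""
    reduced = [reducer(block) for block, reducer in zip(blocks, reducers)]
    return low_level_fusion(reduced)


def high_level_fusion(scores_per_model):
    """Average, sample by sample, the scores predicted by separate models."""
    n_models = len(scores_per_model)
    return [sum(scores) / n_models for scores in zip(*scores_per_model)]


def row_means(block):
    """Placeholder reducer (assumption): one feature per sample, its mean value."""
    return [[sum(row) / len(row)] for row in block]


# Two made-up blocks describing the same two samples (hypothetical values).
gc_block = [[70.0, 10.0], [60.0, 15.0]]
mir_block = [[0.25, 0.75, 0.5], [0.5, 0.5, 0.5]]

fused_low = low_level_fusion([gc_block, mir_block])
fused_mid = mid_level_fusion([gc_block, mir_block], [row_means, row_means])
fused_high = high_level_fusion([[1.0, 0.0], [0.5, 0.5]])
```

In the low-level case the fused matrix has as many columns as all blocks together, so a block with many variables (such as a spectrum) can dominate a small one; this is one reason block weighting or prior reduction may be considered.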

The aim of this thesis is therefore to develop chemometric tools that meet the needs identified for ensuring the traceability and authenticity of olive oils, namely:
- models able to handle situations close to real cases, with large numbers of samples that are unbalanced between classes,
- a rule for interpreting the results that incorporates a metrological aspect, to encourage its application in industrial quality control,
- models allowing data fusion, to benefit from the complementarity of several analytical techniques.
The research focuses on the application of PLS-DA models to spectra obtained by NIR and MIR spectroscopy and to fatty acid proportions obtained by gas chromatography (GC), with the aim of predicting the varietal origin of olive oils. Each chapter consists of one or more articles, already published or under revision. The first chapter presents the context of this study, with on the one hand a bibliometric analysis detailing the main journals, authors and countries involved and the strategic orientation of research in the field of olive oil origin, and on the other hand a literature review centred on the applications of vibrational spectroscopy techniques, which are very often combined with chemometric modelling. The second chapter deals with the analysis of fatty acids by GC and proposes a new approach combining the PLS-DA chemometric model with a control chart system. The addition of this metrological tool, frequently used in routine industrial analyses to detect drifts from a reference value16, should facilitate the interpretation of the results and optimise the prediction of varietal origin.
The third chapter develops different algorithms based on the PLS-DA model in order to explore the improvements brought by combining complementary data, first from the analysis of fatty acids by GC and the global analysis by MIR spectroscopy, and then from the two spectroscopic techniques, MIR and NIR. The performances of several fusion strategies are compared with each other and with the models using the data from each analytical technique separately, to identify the most effective method. Finally, the main conclusions drawn from this work are highlighted and perspectives for future studies are proposed.

15 Borràs, E., Ferré, J., Boqué, R. et al. Data fusion methodologies for food and beverage authentication and quality assessment - A review. Analytica Chimica Acta, 2015, 891, 1-14.
16 Vasconcellos, J.A., Tamborero-Arnal, J.F., Araiz-Jam, A. Statistical methods of quality control in the food industry, in Quality assurance for the food industry: a practical approach, CRC Press (US), 2003, 141-174.


CHAPTER 1: THE SCIENTIFIC CONTEXT – BIBLIOMETRIC ANALYSIS AND LITERATURE REVIEW

Studying the research context is an essential step in any new project, in order to position the subject with respect to scientific and socio-economic issues and to identify the work already done and the avenues still to be explored. To this end, the first part of this chapter presents a bibliometric study giving an overall view of the scientific context in the field of olive oil origin determination. Unlike a literature review, bibliometrics is not concerned with the detailed content of the articles but applies statistical methods to analyse indicators of influence (number of citations, impact factors), provenance (journals, authors, countries of origin) or theme (keywords).17 Through this study, it is possible to observe how the interest in the varietal or geographical origin of olive oils has evolved over time, and to identify the current research trends on this subject as well as possible future lines of research. This article has been submitted to the journal Food Research International. In the second part of the chapter, a literature review surveys the various applications of vibrational spectroscopy techniques, namely near-infrared, mid-infrared and Raman, for detecting fraud and ensuring the traceability of olive oils. The particular attention paid to these three techniques is due to their strong association with chemometric methods for exploiting the spectral data obtained.18 This review forms chapter 7 of the book “Authentication and Detection of the Adulteration of Olive Oil” published by Nova Science.

17 Hood, W.W. and Wilson, C.S. The literature of bibliometrics, scientometrics and informetrics. Scientometrics, 2001, 52, 291-314. 18 Wang, P., Sun, J., Zhang, T. et al. Vibrational spectroscopic approaches for the quality evaluation and authentication of virgin olive oil. Applied Spectroscopy Reviews, 2016, 51, 763-790.


Exploring the scientific interest for olive oil origin: a bibliometric study from 1991 to 2018

Astrid Maléchaux, Yveline Le Dréau, Jacques Artaud, Nathalie Dupuy Aix Marseille Univ, Avignon Université, CNRS, IRD, IMBE, Marseille, France

Abstract

The authenticity and traceability of olive oils have been a growing concern over the past decades, generating numerous scientific studies. This article applies the tools of bibliometric analysis to explore the evolution and strategic orientation of the research focused on olive oil geographical and varietal origins. A corpus of 732 papers published in 178 different journals between 1991 and 2018 is considered. The most productive journals, authors and countries are highlighted, as well as the most cited articles associated with specific analytical techniques. A cluster analysis on the keywords generates 8 main themes of research, each focused on different analytical techniques or compounds of interest. A network between these thematic clusters and the main authors indicates their area of expertise. The metabolomics methods are drawing increasing interest and studies focused on the relationships between the origin and the sensory or nutritional properties provided by minor compounds of olive oils appear to be future lines of research.

Keywords

olive oil, bibliometrics, clustering, keywords analysis, citation analysis, authors network


1. Introduction

Olive oils and their composition in relation to their geographical or varietal origins have been extensively studied in recent years, as part of the food authenticity topic. Indeed, consumers have been paying more and more attention to the quality of the food products they buy. Marketing and sociology studies show that consumers rely on their sensory perceptions and on external information, such as nutritional values or certifications shown on the label, to assess food quality. Furthermore, the concepts of “quality”, “safety”, “traceability” and “authenticity” are not clearly defined by consumers and are often related to each other [1]. Authenticity is also associated with more “natural” and “healthy” products, for which consumers are willing to pay a higher price. Many aspects, such as the production method, geographical origin or variety of the ingredients, play a role in the perception of food authenticity and can also be viewed as part of a cultural heritage, encouraging consumers to buy local or traditional food products [2]. Therefore, ensuring the consistency between the label and the actual content of a food product is crucial to maintaining consumers’ trust. In the case of olive oil, various pieces of information can be present on the label, including a quality grade (virgin or extra virgin) or a protected designation of origin (PDO). Authenticity and traceability are essential since the attribution of a PDO is based on specific rules regarding the geographical and varietal origins for each designation [3]. Thus, there is a wealth of scientific studies focusing on the determination of olive oil origin, and some challenges in this area have been introduced in a previous literature review [4]. However, the evolution of this research topic should be further analysed by means of bibliometric methods [5]. Indeed, as explained in previous bibliometric studies, the exploration of the structure of a research field can provide strategic insight on a subject [6,7].
The literature associated with the fields of olive oil and determination of origin is extensive. Rapid screening methods using vibrational spectroscopy and/or specific methods based on chromatography can be used to detect adulterations, and various compounds including triglycerides, fatty acids, phenolic compounds, vitamins, etc., can be analysed. Different methodologies have been applied to conduct this research, and they have evolved over time. It is thus essential to perform a bibliometric analysis to understand the publishing and citation trends of this research field. The articles which associate olive oils with their varietal or geographical origins have been identified, and show that the research pathways chosen to treat the problem vary over time. Moreover, the most cited articles as well as the authors who have contributed to a large number of papers have been highlighted. Finally, the main keywords have been used to conduct a cluster analysis, making it possible to represent the network of relationships between the authors and their main themes of research. The aim of this study is to strategically position future research through the identification of declining and emerging thematic clusters.

2. Methodology

The references were obtained using the Web of Science database, on June 12th, 2019. The following terms were searched in the “Topic” field, which gathers the “Title”, “Abstract” and “Keywords” fields: (“olive oil” OR “olive oils”) AND origin AND (cultivar OR variet* OR geograph*). The timespan was limited to studies published between 1991 and 2018. This query yielded 732 references, classified by Web of Science® into 653 scientific articles, 51 reviews, 25 proceedings papers and 3 notes or corrections, all of which will be referred to as “articles” for this study. The full citation records were exported in order to be processed by the Matheo Analyzer® software [8]. This bibliometric study can provide strategic orientation for future research, although it should be considered with caution due to the limits of the database used to retrieve the studied articles. The results from different databases could not be merged because of the differences in the methods used to collect citations, which were incompatible with the processing by Matheo Analyzer®.
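The logic of the Boolean query can be illustrated with a minimal sketch that tests whether a piece of text satisfies the same three conditions. This is only an approximation written for illustration: the function name and the example titles are hypothetical, and the indexing and word-matching rules actually applied by Web of Science are not reproduced (for instance, plural forms such as “origins” are not matched here).

```python
import re

def matches_query(topic_text):
    """Return True if the text satisfies the three AND-ed conditions of the query."""
    text = topic_text.lower()
    # ("olive oil" OR "olive oils")
    has_olive_oil = re.search(r"\bolive oils?\b", text) is not None
    # origin
    has_origin = re.search(r"\borigin\b", text) is not None
    # (cultivar OR variet* OR geograph*): '*' stands for any word ending.
    has_scope = re.search(r"\b(cultivar|variet\w*|geograph\w*)\b", text) is not None
    return has_olive_oil and has_origin and has_scope

# Hypothetical titles used only to exercise the function.
examples = [
    "Geographical origin of olive oils assessed by spectroscopy",
    "Olive oil oxidation during storage",
    "Varietal origin of sunflower oil",
    "Origin of olive oil: a cultivar study",
]
selected = [title for title in examples if matches_query(title)]
print(selected)
```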

3. Results and discussion

3.1. Evolution of keywords

Figure 1 shows the number of articles studying the varietal or geographical origins of olive oils. It has been steadily increasing since the first publication in 1991 by Alberghina et al. [9]. This growth has been especially marked since 2007, leading to the publication of more than 60 articles each year between 2015 and 2018. Figure 1 also shows the evolution of the number of articles during the studied period according to the types of keywords used. Articles have been grouped together under generic terms. The number of articles grouped under the term “analytical techniques”, which encompasses articles using either spectral methods or chromatographic methods, has grown much more than the number of articles mentioning other keywords. In addition, a significant number of articles mention the use of chemometric modelling or target particular compounds of olive oils that could be markers of their origin. Some articles employ general keywords indicating their interest in olive oil authenticity, traceability or quality, while other publications specify whether they focus on geographical or varietal origins. In this case, a stronger increase in the number of articles dealing with geographical origin compared to varietal origin can be observed since 2007.

FIGURE 1. EVOLUTION OF THE TOTAL NUMBER OF ARTICLES (BARS) AND NUMBER OF ARTICLES CONTAINING KEYWORDS RELATED TO SPECIFIC SUBJECTS (■: ANALYTICAL TECHNIQUES, ♦: CHEMOMETRICS, ▲: CHEMICAL COMPOUNDS, ●: AUTHENTICITY AND QUALITY, ×: GEOGRAPHICAL ORIGIN, +: VARIETAL ORIGIN) BETWEEN 1991 AND 2018

Table 1 focuses on the number of articles containing keywords related to specific analytical techniques. These categories were obtained after a time-consuming manual grouping of many keywords, most of which only appeared in one article, showing that spelling disparities and the choice of very specific keywords can be an issue for bibliometric studies. Moreover, 67 articles did not have any keywords and were thus not taken into account. Nuclear magnetic resonance spectroscopy is the most popular technique, mentioned in 75 articles. There appears to be a strong interest in sensory analysis with 57 articles, but olfactometric measurements are almost as often conducted with electronic sensors (e-nose and e-tongue, found in 41 articles). DNA analysis is also at the centre of many studies, with 49 articles containing keywords related to this subject. Gas and liquid chromatography are roughly equally popular, with 34 and 31 articles respectively, and are often associated with detection by mass spectrometry. Another important application of mass spectrometry is isotope ratio analysis, with 40 articles. Regarding vibrational spectroscopic techniques, mid- and near-infrared are often used, with 40 and 28 articles respectively. Finally, there is a marginal interest in UV-visible, Raman and fluorescence spectroscopies, each appearing in fewer than 20 articles. The most cited articles [10-21] for each technique can be retrieved (Table 1). Their publication dates range from 1997 for NMR [10] to 2011 for fluorescence spectroscopy [21], and the number of citations by the end of 2018 ranges from 80 for UV-Visible spectroscopy [19] to 301 for electronic sensors [13].
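The grouping of keyword variants under generic categories can be sketched as follows. The grouping in this study was done manually, so the mapping, the normalisation rule and the toy articles below are assumptions that only show the principle: spelling variants are reduced to a common form, and each category is counted at most once per article.

```python
import re
from collections import Counter

# Hypothetical mapping from normalised keyword variants to a generic category.
CATEGORY_OF = {
    "nmr": "Nuclear magnetic resonance",
    "1h nmr": "Nuclear magnetic resonance",
    "nuclear magnetic resonance": "Nuclear magnetic resonance",
    "ftir": "Mid-infrared spectroscopy",
    "ft ir": "Mid-infrared spectroscopy",
    "e nose": "Electronic sensors",
    "electronic nose": "Electronic sensors",
}

def normalise(keyword):
    """Lower-case the keyword and replace punctuation or hyphens with single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", keyword.lower()).strip()

def count_articles_per_category(articles):
    """Count, for each category, the number of articles mentioning it at least once."""
    counts = Counter()
    for keywords in articles:
        normalised = (normalise(k) for k in keywords)
        categories = {CATEGORY_OF[k] for k in normalised if k in CATEGORY_OF}
        counts.update(categories)
    return counts

# Toy corpus: one keyword list per article; the last article has no keywords.
articles = [
    ["1H-NMR", "NMR", "chemometrics"],
    ["FTIR", "E-nose"],
    ["Electronic nose"],
    [],
]
counts = count_articles_per_category(articles)
print(counts.most_common())
```

Articles without keywords contribute to no category, which mirrors the exclusion of such articles from the counts.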

TABLE 1. NUMBER OF ARTICLES CONTAINING KEYWORDS RELATED TO SPECIFIC ANALYTICAL TECHNIQUES AND MOST CITED ARTICLE FOR EACH TECHNIQUE, WITH CITATION COUNT UP TO 2018

| Analytical technique | Articles | Most cited article | Citation count |
|---|---|---|---|
| Nuclear magnetic resonance | 75 | Sacchi et al. (1997) [10] | 142 |
| Sensory analysis | 57 | Kiritsakis (1998) [11] | 189 |
| DNA analysis | 49 | Busconi et al. (2003) [12] | 83 |
| Electronic sensors | 41 | Peris & Escuder-Gilabert (2009) [13] | 301 |
| Isotope ratio | 40 | Gonzalvez et al. (2009) [14] | 96 |
| Mid-infrared spectroscopy | 40 | Lerma-Garcia et al. (2010) [15] | 181 |
| Gas chromatography | 34 | Luna et al. (2006) [16] | 141 |
| Liquid chromatography | 31 | Vinha et al. (2005) [17] | 201 |
| Near-infrared spectroscopy | 28 | Galtier et al. (2007) [18] | 129 |
| UV-Visible spectroscopy | 16 | Casale et al. (2010) [19] | 80 |
| Raman spectroscopy | 9 | Baeten et al. (2005) [20] | 117 |
| Fluorescence spectroscopy | 7 | Karoui & Blecker (2011) [21] | 84 |

3.2. Core journals

The 732 articles of the corpus were published in 178 different journals between 1991 and 2018, with 57% of the journals having only one article dealing with olive oil varietal or geographical origins. The 15 journals having published at least 10 articles on the subject are presented in Table 2. The impact factors (IF) and subject category quartile ranks (Q, indicating the rank of the considered journal compared to the other journals in its subject category) reported therein were retrieved from InCites Journal Citation Reports (Clarivate Analytics). Amongst these main journals, Food Chemistry and the Journal of Agricultural and Food Chemistry lead the rankings, with 94 and 66 articles respectively. The journal with the highest IF as of 2018 is Critical Reviews in Food Science and Nutrition, with an IF over 6, followed by Food Chemistry and Analytica Chimica Acta, with IF over 5. Grasas y Aceites and Rivista Italiana delle Sostanze Grasse are the only journals having published more than 10 articles with an IF under 1, indicating the strong scientific interest in olive oil authenticity in their respective countries of origin (Spain and Italy). The subject categories assigned in the Journal Citation Reports indicate the general research area of each journal. Some journals focus only on one area, while others are multidisciplinary since they are included in several subject categories. The most common category appears to be “food science”, with 12 of the 15 journals concerned. Other subjects of interest include “applied chemistry”, “analytical chemistry”, “nutrition”, “agriculture” and “biochemical research methods”. The most cited articles from each journal [11, 13, 22-34] are also presented in Table 2. The citation count by the end of 2018 indicates that journals with the highest IF do not necessarily have the highest number of citations. Indeed, the most cited article was a kinetic study of the radical scavenger capacity of different oils published in the Journal of Agricultural and Food Chemistry [23], with an IF of 3.571 and 405 citations by the end of 2018, followed by a review on electronic sensors in Analytica Chimica Acta [13], with an IF of 5.256 and 301 citations by the end of 2018. Among the most cited articles, the oldest one was published in 1993 in the Journal of the Science of Food and Agriculture and deals with the classification of geographical origin based on fatty acid profiles obtained by gas chromatography [27].
The two most recent articles were published in 2014 in the Journal of Oleo Science, studying the influence of olive ripening on the fatty alcohol composition analysed by gas chromatography [34], and in Food Analytical Methods, using liquid chromatography and mass spectrometry to analyse polyphenols [32]. This last article is also the one with the fewest citations (only 13 by the end of 2018 for an IF of 2.413).


TABLE 2. JOURNALS HAVING PUBLISHED AT LEAST 10 ARTICLES BETWEEN 1991 AND 2018, SORTED BY THEIR RESPECTIVE NUMBER OF ARTICLES, WITH THEIR ASSOCIATED IMPACT FACTOR (IF), SUBJECT CATEGORY(IES), RANKING IN EACH CATEGORY NOTIFIED BY THEIR QUARTILE (Q), MOST CITED ARTICLE AND CITATION COUNT UP TO 2018

| Journal | Articles | IF 2018 | Most cited article | Citation count | Category: Q 2018 |
|---|---|---|---|---|---|
| Food Chemistry | 94 | 5.399 | Luykx & Van Ruth (2008) [22] | 269 | Chemistry (Applied): Q1, Food Science: Q1, Nutrition: Q1 |
| Journal of Agricultural and Food Chemistry | 66 | 3.571 | Espin et al. (2000) [23] | 405 | Agriculture: Q1, Chemistry (Applied): Q1, Food Science: Q1 |
| Journal of the American Oil Chemists Society | 43 | 1.720 | Kiritsakis (1998) [11] | 183 | Chemistry (Applied): Q2, Food Science: Q2 |
| European Journal of Lipid Science and Technology | 38 | 1.852 | Gurdeniz et al. (2007) [24] | 40 | Food Science: Q2, Nutrition: Q3 |
| Food Research International | 25 | 3.579 | Marcone et al. (2013) [25] | 95 | Food Science: Q1 |
| European Food Research and Technology | 22 | 2.056 | Consolandi et al. (2008) [26] | 70 | Food Science: Q2 |
| Journal of the Science of Food and Agriculture | 21 | 2.422 | Tsimidou & Karakostas (1993) [27] | 51 | Agriculture: Q1, Chemistry (Applied): Q2, Food Science: Q2 |
| Analytica Chimica Acta | 20 | 5.256 | Peris & Escuder-Gilabert (2009) [13] | 301 | Chemistry (Analytical): Q1 |
| Grasas y Aceites | 18 | 0.891 | Tous et al. (1997) [28] | 46 | Chemistry (Applied): Q4, Food Science: Q4 |
| Talanta | 15 | 4.916 | Mannina et al. (2010) [29] | 61 | Chemistry (Analytical): Q1 |
| Rivista Italiana delle Sostanze Grasse | 14 | 0.694 | Cerretani et al. (2008) [30] | 15 | Food Science: Q4 |
| Journal of Chromatography A | 10 | 3.858 | Parcerisa et al. (2000) [31] | 108 | Biochemical Research Methods: Q1, Chemistry (Analytical): Q1 |
| Food Analytical Methods | 10 | 2.413 | Gilbert-Lopez et al. (2014) [32] | 13 | Food Science: Q2 |
| Critical Reviews in Food Science and Nutrition | 10 | 6.704 | Tzouros & Arvanitoyannis (2001) [33] | 113 | Food Science: Q1, Nutrition: Q1 |
| Journal of Oleo Science | 10 | 1.208 | Giuffre (2014) [34] | 23 | Chemistry (Applied): Q3, Food Science: Q3 |

Q1: first quartile (rank in top 25%), Q2: second quartile (rank between 25% and 50%), Q3: third quartile (rank between 50% and 75%), Q4: fourth quartile (rank over 75%)


TABLE 3. NUMBER OF PUBLICATIONS OVER TIME FOR THE MAIN AUTHORS (HAVING A TOTAL OF AT LEAST 10 ARTICLES), AND MOST CITED ARTICLE FOR EACH AUTHOR WITH ITS CITATION COUNT UP TO 2018

| Author | 1991-1994 | 1995-1998 | 1999-2002 | 2003-2006 | 2007-2010 | 2011-2014 | 2015-2018 | Total | Most cited article | Citation count |
|---|---|---|---|---|---|---|---|---|---|---|
| Fernandez-Gutierrez, A | - | - | - | - | 2 | 4 | 13 | 19 | Ouni et al. (2011) [35] | 63 |
| Mannina, L | - | 1 | 3 | 3 | 5 | 1 | 5 | 18 | Mannina et al. (2001) [36] | 127 |
| Perri, E | - | - | 3 | - | 5 | 6 | 3 | 17 | Benincasa et al. (2007) [37] | 119 |
| Pereira, JA | - | - | 1 | 2 | 3 | 2 | 8 | 16 | Vinha et al. (2005) [17] | 203 |
| Bendini, A | - | - | - | - | 7 | 4 | 4 | 15 | Sinelli et al. (2010) [38] | 96 |
| Carrasco-Pancorbo, A | - | - | - | - | 1 | 1 | 13 | 15 | Bajoub et al. (2015) [39] | 28 |
| Zarrouk, M | - | - | - | 1 | 5 | 5 | 4 | 15 | Ouni et al. (2011) [35] | 63 |
| Artaud, J | - | - | - | 3 | 5 | 2 | 4 | 14 | Galtier et al. (2007) [18] | 134 |
| Bajoub, A | - | - | - | - | - | 1 | 13 | 14 | Bajoub et al. (2015) [39] | 28 |
| Cerretani, L | - | - | - | - | 8 | 5 | - | 13 | Sinelli et al. (2010) [38] | 96 |
| Camin, F | - | - | - | - | 3 | 1 | 7 | 11 | Camin et al. (2010) [40] | 85 |
| Casale, M | - | - | - | 1 | 7 | 1 | 2 | 11 | Casale et al. (2010) [19] | 86 |
| Fanizzi, FP | - | - | - | - | - | 4 | 7 | 11 | Del Coco et al. (2012) [41] | 24 |
| Guillou, C | - | - | 4 | 1 | 4 | 1 | 1 | 11 | Rezzi et al. (2005) [42] | 120 |
| Longobardi, F | - | - | - | - | - | 6 | 5 | 11 | Longobardi et al. (2012) [43] | 72 |
| Aparicio, R | - | - | - | 5 | 2 | 1 | 2 | 10 | Luna et al. (2006) [16] | 151 |
| Del Coco, L | - | - | - | - | - | 4 | 6 | 10 | Del Coco et al. (2012) [41] | 24 |
| Downey, G | - | - | - | 1 | 7 | 2 | - | 10 | Karoui et al. (2010) [44] | 185 |
| Dupuy, N | - | - | - | 1 | 5 | 2 | 2 | 10 | Galtier et al. (2007) [18] | 134 |
| Forina, M | 1 | - | - | 1 | 6 | 1 | 1 | 10 | Casale et al. (2010) [19] | 86 |
| Marini, F | - | - | 1 | 2 | 3 | 3 | 1 | 10 | Marini (2009) [45] | 100 |
| Reniero, F | 1 | - | 5 | 1 | 2 | - | 1 | 10 | Rezzi et al. (2005) [42] | 120 |


3.3. Main authors

A total of 2168 authors have contributed to at least one of the 732 studied articles. However, the vast majority (73%) do not show a strong interest in the subject of olive oil origin since they only appear in a single article between 1991 and 2018. Only 22 authors have published 10 articles or more in this period. Table 3 shows the evolution of the number of publications for these most productive authors. Some of them have taken an early interest in the subject: Reniero published an article in 1993, followed by Forina in 1994. On the contrary, Bajoub demonstrates the most recent but quite strong interest with a contribution to 14 articles since 2014. Most of these main authors appear to still be active with at least one publication since 2015, except for Cerretani and Downey whose activity is restricted to the period between 2003 and 2014. Looking at the most cited articles from each author [16-19, 35-45] indicates that some of them have worked in collaboration with each other. This is the case for Artaud and Dupuy using near-infrared spectroscopy and chemometrics [18], Casale and Forina applying chemometrics to combine data from e-nose, UV-visible and near-infrared spectroscopy [19], Fernandez-Gutierrez and Zarrouk [35] as well as Bajoub and Carrasco-Pancorbo [39] analysing phenolic compounds by liquid chromatography and mass spectrometry, Bendini and Cerretani predicting sensory attributes with chemometric models using near- and mid-infrared spectroscopy [38], Del Coco and Fanizzi [41], or Guillou and Reniero [42], applying chemometrics to nuclear magnetic resonance data.

3.4. Main countries

The number of articles published by each country can be compared to the volumes of olive oil production and consumption obtained from the International Olive Oil Council report [46]. Figure 2 presents the annual volumes of olive oil production and consumption for the countries having published more than 10 articles between 1991 and 2018. It indicates that Italy and Spain are by far the most productive countries, both in terms of articles and volume of olive oil produced. However, there is a stronger scientific interest in olive oil origin in Italy, which may be related to the higher number of PDOs: 42 olive oils with a PDO are produced in Italy versus 29 in Spain [47]. The interest of Tunisia, Greece, Portugal, Turkey and Morocco in studying olive oil origin seems consistent with their position as olive oil producing countries, even though Greece has fewer articles and Portugal more articles than expected from their respective production volumes and numbers of PDOs (19 in Greece and 6 in Portugal). The number of articles from France, the USA, and to a lesser extent Germany and the UK, can be explained by their relatively high consumption of olive oil, and for France by the existence of 7 PDO olive oils despite a low production. However, other countries such as Belgium, Argentina, Ireland, the Netherlands, Croatia and more importantly China have a higher number of publications than would be expected from their volumes of olive oil production or consumption.

FIGURE 2. ANNUAL VOLUMES OF OLIVE OIL PRODUCTION (GREEN BARS) AND CONSUMPTION (YELLOW BARS) FOR THE COUNTRIES HAVING PUBLISHED MORE THAN 10 ARTICLES (BLUE DIAMONDS) BETWEEN 1991 AND 2018

3.5. Data clustering

In order to reveal the existence of some thematic groups which structure this research, keywords that were present in at least ten articles, excluding the words used in the search query (i.e. “olive oil”, “geographical origin” and “cultivar”), were subjected to a K-means analysis [48] in Matheo Analyzer® resulting in their partition into eight clusters.
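The principle of a K-means partition of keywords can be sketched as follows. The implementation used by Matheo Analyzer® is not described here, so this sketch rests on several assumptions: each keyword is represented by a binary profile indicating the articles in which it appears, distances are Euclidean, the first k profiles serve as initial centroids, and the keyword profiles themselves are invented toy data.

```python
def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm; the first k points are used as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy profiles: 1 if the keyword appears in the article, 0 otherwise (6 articles).
keyword_profiles = {
    "near-infrared": [1, 1, 1, 0, 0, 0],
    "phenolic compounds": [0, 0, 0, 1, 1, 1],
    "mid-infrared": [1, 1, 0, 0, 0, 0],
    "liquid chromatography": [0, 0, 0, 1, 1, 0],
}

labels = kmeans(list(keyword_profiles.values()), k=2)
for keyword, label in zip(keyword_profiles, labels):
    print(f"cluster {label}: {keyword}")
```

Keywords that tend to appear in the same articles end up in the same cluster, which is what allows a theme to be read from each group. In practice the result of K-means depends on the initialisation and on the chosen number of clusters.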


FIGURE 3. NETWORK OF THE MAIN KEYWORDS ASSOCIATED AT LEAST TEN TIMES WITH ONE OR MORE OF THE THEMATIC CLUSTERS (□: TOTAL NUMBER OF ARTICLES IN THE CORPUS ASSOCIATED WITH THIS KEYWORD OR CLUSTER, ○: NUMBER OF ARTICLES ASSOCIATED WITH THIS KEYWORD IN THE SPECIFIED CLUSTER)


Figure 3 shows the resulting network of keywords and their distribution into the clusters. To keep it easily readable, only the keywords associated at least ten times with a cluster are represented. Table 4 presents the evolution of the number of publications in the 8 thematic clusters between 1991 and 2018 and Figure 4 shows the network of the main authors (with the indication of their nationality), having published at least 6 articles and associated at least 3 times with one or more of the thematic clusters. Some keywords are present in several clusters. For instance, “chemometrics” is present in all but two clusters, although it is mainly related to cluster 8. Similarly, “fatty acids” is mostly found in cluster 6 and to a lesser extent in clusters 1, 7 and 8, while “fats and oils” is divided between clusters 6, 7 and 8. However, since most of the keywords can be mainly attributed to one specific cluster, the theme of each cluster can be identified. The network of relationships between thematic clusters and the main authors (Figure 4) indicates the orientation of their research. Most authors are associated with a single topic, although some of them appear to create bridges between two or three thematic clusters. As could be expected, authors from the two most productive countries, Italy and Spain, are present in a wide range of themes.

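TABLE 4. EVOLUTION OF THE NUMBER OF PUBLICATIONS IN THE THEMATIC CLUSTERS BETWEEN 1991 AND 2018

| Cluster | 1991-1994 | 1995-1998 | 1999-2002 | 2003-2006 | 2007-2010 | 2011-2014 | 2015-2018 | Total |
|---|---|---|---|---|---|---|---|---|
| Cluster 1 | 2 | 1 | 10 | 8 | 22 | 31 | 18 | 92 |
| Cluster 2 | - | - | 1 | 4 | 5 | 7 | 20 | 37 |
| Cluster 3 | 2 | 1 | 3 | 12 | 19 | 28 | 29 | 94 |
| Cluster 4 | - | - | 4 | 5 | 9 | 11 | 21 | 50 |
| Cluster 5 | 1 | 1 | 2 | 2 | 7 | 5 | 16 | 34 |
| Cluster 6 | 1 | 1 | 2 | 13 | 26 | 27 | 28 | 98 |
| Cluster 7 | - | 2 | 2 | 5 | 13 | 22 | 38 | 82 |
| Cluster 8 | - | 5 | 11 | 10 | 24 | 44 | 56 | 150 |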
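The recent trend of each cluster can be read from the last two periods of Table 4. The short sketch below computes the relative change between 2011-2014 and 2015-2018 and labels each cluster accordingly. The counts are copied from Table 4, whereas the function name and the 10% threshold separating “stable” from “growing” or “declining” are arbitrary assumptions made for the illustration.

```python
# Article counts per cluster for 2011-2014 and 2015-2018, copied from Table 4.
LAST_TWO_PERIODS = {
    1: (31, 18), 2: (7, 20), 3: (28, 29), 4: (11, 21),
    5: (5, 16), 6: (27, 28), 7: (22, 38), 8: (44, 56),
}

def classify_trend(previous, latest, threshold=0.10):
    """Label a cluster from its relative change between two periods.

    The 10% threshold is an arbitrary choice made for this example.
    """
    change = (latest - previous) / previous
    if change > threshold:
        return "growing"
    if change < -threshold:
        return "declining"
    return "stable"

trends = {cluster: classify_trend(*counts) for cluster, counts in LAST_TWO_PERIODS.items()}
for cluster, trend in trends.items():
    print(f"Cluster {cluster}: {trend}")
```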


FIGURE 4. NETWORK OF THE MAIN AUTHORS, HAVING PUBLISHED AT LEAST 6 ARTICLES AND ASSOCIATED AT LEAST 3 TIMES WITH ONE OR MORE OF THE THEMATIC CLUSTERS, WITH THEIR COUNTRY (ESP: SPAIN, FRA: FRANCE, GRC: GREECE, IRL: IRELAND, ITA: ITALY, MAR: MOROCCO, NLD: THE NETHERLANDS, PRT: PORTUGAL, TUN: TUNISIA, TUR: TURKEY, □: TOTAL NUMBER OF ARTICLES IN THE CORPUS ASSOCIATED WITH THIS AUTHOR OR CLUSTER, ○: NUMBER OF ARTICLES ASSOCIATED WITH THIS AUTHOR IN THE SPECIFIED CLUSTER)


3.5.1. Cluster 1

This cluster is focused on near- and mid-infrared analytical techniques, combined with chemometric classification and discrimination models (Figure 3). It has been studied consistently throughout the years, but its popularity has declined since 2015 (Table 4). Authors from several countries take an interest in this subject, including Casale and others in Italy, Artaud and colleagues in France, Downey in Ireland, Tokatli in Turkey and Kontominas in Greece (Figure 4). This cluster is characterized by an intense use of chemometric methods, which explains the strong presence of specialists in this field like Downey, Dupuy and Marini.

3.5.2. Cluster 2

This small group of articles deals with the application of liquid chromatography to analyse phenolic compounds. Interest in this subject is more recent, with a first article in 2002, and has seen a strong increase in the 2015-2018 period. This field benefits from technological advances, with instrumentation allowing the analysis of compounds present at very low concentrations. It is mostly studied by Spanish researchers such as Fernandez-Gutierrez and others, but also by Ajal in Morocco and Oliveira in Portugal.

3.5.3. Cluster 3

The articles of cluster 3 are centred on sensory analysis and the use of electronic sensors, related to the analysis of volatile compounds. They also study the quality and physico-chemical parameters of olive oil. These themes were increasingly studied between 1991 and 2014, but the number of articles has stagnated since 2015 (28 and 29 articles for the two most recent periods). The subject attracts researchers from various countries like Pereira and others in Portugal, Pardo and Garcia-Gonzalez in Spain, Tura and colleagues in Italy, Kontominas in Greece and Zarrouk in Tunisia. Like cluster 1, this cluster is strongly connected to chemometric methods.

3.5.4. Cluster 4

This isolated group covers the use of DNA analysis to ensure the genetic traceability of Olea europaea subsp. europaea L. This rather recent subject has been increasingly studied since the first article published in 2000. This type of analysis concerns only a few authors and remains the responsibility of specialists. It is specifically studied by Italian researchers, with Montemurro and colleagues.


3.5.5. Cluster 5

This small and isolated cluster, containing only 34 articles, seems at first sight to be somewhat outside the main theme, since it concerns olive fruit and ripening. However, a closer look indicates that these articles actually deal with the influence of the degree of maturity on the composition and quality of the resulting olive oil. It opens the door to a much broader body of work on the nutritional impact of olive oils in relation to their chemical composition, and more particularly to phenolic compounds, antioxidants and vitamins. The interest in this subject has been growing sharply in recent years, even though none of the main authors appears to be strongly connected to this subject so far due to its novelty.

3.5.6. Cluster 6

This theme gathers the analysis of various compounds such as fatty acids, triacylglycerols, sterols, tocopherols and phenolic compounds, and the study of antioxidant activity. It has been studied since 1991, with a strong increase in publications between 2003 and 2010 and stagnation since 2011. It is mostly studied by researchers who are also interested in another subject, including Zarrouk and others in Tunisia, Artaud and Le Dréau in France, Aparicio and Fernandez-Gutierrez in Spain, Pereira, Oliveira and Amaral in Portugal, and Cerretani, Bendini and Chiavaro in Italy.

3.5.7. Cluster 7

This cluster is mainly related to isotope ratio, mass spectrometry and gas chromatography to analyse trace elements, as well as fatty acids. Its popularity has increased steadily since 1998, and it has become the second most studied theme in recent years. It attracts a large number of Italian researchers, like Camin and colleagues, but also Cuadros-Rodriguez in Spain and van Ruth in the Netherlands.

3.5.8. Cluster 8

The 150 articles that compose this largest cluster are part of the “omics” movement (metabolomics), with the processing of infrared and nuclear magnetic resonance data by chemometric methods such as partial least squares discriminant analysis, principal component analysis or linear discriminant analysis in order to solve problems of adulteration and determination of quality parameters of olive oils. It is the most popular theme, with an especially strong increase in publications since 2007. Once again, this subject is largely studied by Italian researchers who focus specifically on this area, like Fanizzi and colleagues, or who have an interest in other clusters, like Mannina and others who are also involved in specific analytical techniques (cluster 7), Marini who is also connected to vibrational spectroscopic analyses (cluster 1), or Bendini and others who are also concerned with sensory analyses and quality (cluster 3) and with target compounds (cluster 6). A team of Spanish researchers with Simo-Alfonso, Lerma-Garcia and Herrero-Martinez is also connected to cluster 8.

4. Conclusion

It is clear that the scientific interest in olive oil origin has been consistently increasing since the early 1990s, concurrently with the growing consumption of this product and awareness of authenticity issues. This bibliometric study highlights the core journals in which research articles on this topic are most likely to be published, the most prominent authors with their specific areas of expertise, and the relationships between the scientific and economic interests of the most productive countries. The 732 references published between 1991 and 2018 can be distributed into eight clusters by a K-means analysis performed on their keywords, which makes it possible to identify the main research themes. A shift in popularity seems to be occurring from chemical fingerprinting using vibrational spectroscopy towards biological phenotyping using genetic and metabolomic techniques. Chemometric tools are now well established and are expected to be increasingly applied to process the results from various analytical techniques. Moreover, a trend towards focusing on the sensory and nutritional properties brought by the minor compounds of olive oils appears to be emerging.

Acknowledgements

The authors thank Dr. Jacky Kister and Pr. Henri Dou for their insight on bibliometric methods and support with the use of Matheo Analyzer®.

Funding sources

This work was supported by the French National Agency for Research (ANR) as part of the European Union’s Seventh Framework Program for research, technological development and demonstration (grant agreement number 618127).

References

[1] Van Rijswijk, W., & Frewer, L.J. (2008). Consumer perceptions of food quality and safety and their relation to traceability, Brit. Food J., 110:1034-1046. [2] Dimara, E., & Dimitris, S. (2005). Consumer demand for informative labeling of quality food and drink products: a European Union case study, J. Consum. Market., 22:90-100. [3] Likudis, Z. (2016). Olive oils with protected designation of origin (PDO) and protected geographical indication (PGI), in Products from Olive Tree, London, UK: IntechOpen, 175-190. [4] Garcia-Gonzalez, D.L., & Aparicio, R. (2010). Research in olive oil: challenges for the near future, J. Agric. Food Chem., 58:12569-12577. [5] Hood, W., & Wilson, C. (2001). The literature of bibliometrics, scientometrics, and informetrics, Scientometrics, 52:291-314. [6] Aleixandre, J.L., Aleixandre-Tudó, J.L., Bolaños-Pizzaro, M., & Aleixandre-Benavent, R. (2013). Mapping the scientific research on wine and health (2001–2011), J. Agric. Food Chem., 61:11871-11880. [7] Dias, C., & Mendes, L. (2018). Protected Designation of Origin (PDO), Protected Geographical Indication (PGI) and Traditional Speciality Guaranteed (TSG): A bibiliometric analysis, Food Res. Int., 103:492-508. [8] Matheo Software. https://www.matheo-software.com/matheo-analyzer/ (accessed July 2019) [9] Alberghina, G., Caruso, L., Fisichella, S., & Musumarra, G. (1991). Geographical classification of Sicilian olive oils in terms of sterols and fatty-acids content, J. Sci. Food Agric., 56:445-455. [10] Sacchi, R., Addeo, F., & Paolillo, L. (1997). (1)H and (13)C NMR of virgin olive oil. An overview, Magn. Reson. Chem., 35:133-145. [11] Kiritsakis, A.K. (1998). Flavor components of olive oil – A review, J. Am. Oil Chem. Soc., 75:673-681. [12] Busconi, M., Foroni, C., Corradi, M., Bongiorni, C., Cattapan, F., & Fogher, C. (2003). DNA extraction from olive oil and its use in the identification of the production cultivar, Food Chem., 83:127-134. [13] Peris, M., & Escuder-Gilabert, L. 
(2009). A 21st century technique for food control: Electronic noses, Anal. Chim. Acta, 638:1-15.

42

[14] Gonzalvez, A., Armenta, S., & De La Guardia, M. (2009). Trace-element composition and stable-isotope ratio for discrimination of foods with Protected Designation of Origin, TrAC, Trends Anal. Chem., 28:1295-1311. [15] Lerma-García, M.J., Ramis-Ramos, G., Herrero-Martínez, J.M., & Simó-Alfonso, E.F. (2010). Authentication of extra virgin olive oils by Fourier-transform infrared spectroscopy, Food Chem., 118:78-83. [16] Luna, G., Morales, M.T., & Aparicio, R. (2006). Characterisation of 39 varietal virgin olive oils by their volatile compositions, Food Chem., 98:243-252. [17] Vinha, A.F., Ferreres, F., Silva, B.M., Valentao, P., Gonçalves, A., Pereira, J.A., & Andrade, P.B. (2005). Phenolic profiles of Portuguese olive fruits (Olea europaea L.): Influences of cultivar and geographical origin, Food Chem., 89:561-568. [18] Galtier, O., Dupuy, N., Le Dréau, Y., Ollivier, D., Pinatel, C., Kister, J., & Artaud, J. (2007). Geographic origins and compositions of virgin olive oils determinated by chemometric analysis of NIR spectra, Anal. Chim. Acta, 595:136-144. [19] Casale, M., Casolino, C., Oliveri, P., & Forina, M. (2010). The potential of coupling information using three analytical techniques for identifying the geographical origin of extra virgin olive oil, Food Chem., 118:163-170. [20] Baeten, V., Fernández Pierna, J.A., Dardenne, P., Meurens, M., García-González, D.L., & Aparicio-Ruiz, R. (2005). Detection of the presence of hazelnut oil in olive oil by FT-Raman and FT-MIR spectroscopy, J. Agric. Food Chem., 53:6201-6206. [21] Karoui, R., & Blecker, C. (2011). Fluorescence spectroscopy measurement for quality assessment of food systems—a review, Food Bioprocess Tech., 4:364-386. [22] Luykx, D. M. A. M.; Van Ruth, S. M. An overview of analytical methods for determining the geographical origin of food products, Food Chem., 2008, 107, 897-911. [23] Espin, J. C.; Soler-Rivas, C.; Wichers, H. J. 
Characterization of the total free radical scavenger capacity of vegetable oils and oil fractions using 2.2-diphellyl-1-picrylhydrazyl radical, J. Agric. Food Chem., 2000, 48, 648-656. [24] Gurdeniz, G.; Tokatli, F.; Ozen, B. Differentiation of mixtures of monovarietal olive oils by mid-infrared spectroscopy and chemometrics, Eur. J. Lipid. Sci. Tech., 2007, 109, 1194-1202. [25] Marcone, M. F.; Wang, S. A.; Albabish, W.; Nie, S. P.; Somnarain, D.; Hill, A. Diverse food- based applications of nuclear magnetic resonance (NMR) technology, Food Res. Int., 2013, 51, 729-747.

43

[26] Consolandi, C.; Palmieri, L.; Severgnini, M.; Maestri, E.; Marmiroli, N.; Agrimonti, C.; Baldoni, L.; Donini, P.; De Bellis, G.; Castiglioni, B. A procedure for olive oil traceability and authenticity: DNA extraction, multiplex PCR and LDR-universal array analysis, Eur. Food Res. Technol., 2008, 227, 1429-1438. [27] Tsimidou, M.; Karakostas, K. X. Geographical classification of Greek virgin olive oil by nonparametric multivariate evaluation of fatty-acid composition, J. Sci. Food Agric., 1993, 62, 253-257. [28] Tous, J.; Romero, A.; Plana, J.; Guerrero, L.; Diaz, I.; Hermoso, J. F. Chemical and sensory characteristics of “” olive oil obtained in different growing areas of Spain, Grasas Aceites, 1997, 48, 415-424. [29] Mannina, L.; Marini, F.; Gobbino, M.; Sobolev, A. P.; Capitani, D. NMR and chemometrics in tracing European olive oils: The case study of Ligurian samples, Talanta, 2010, 80, 2141-2148. [30] Cerretani, L.; Bendini, A.; Barbieri, S.; Lercker, G. Preliminary observations on the change of some chemical characteristics of virgin olive oils subjected to a “soft deodorization” process, Riv. Ital. Sostanze Gr., 2008, 85, 75-82. [31] Parcerisa, J.; Casals, I.; Boatella, J.; Codony, R.; Rafecas, M. Analysis of olive and hazelnut oil mixtures by high-performance Liquid chromatography-atmospheric pressure chemical ionisation mass spectrometry of triacylglycerols and gas-liquid chromatography of non- saponifiable compounds (tocopherols and sterols), J. Chromatogr. A, 2000, 881, 149-158. [32] Gilbert-Lopez, B.; Valencia-Reyes, Z. L.; Yufra-Picardo, V. M.; Garcia-Reyes, J. F.; Ramos- Martos, N.; Molina-Diaz, A. Determination of Polyphenols in Commercial Extra Virgin Olive Oils from Different Origins (Mediterranean and South American Countries) by Liquid Chromatography-Electrospray Time-of-Flight Mass Spectrometry, Food Anal. Method., 2014, 7, 1824-1833. [33] Tzouros, N. E.; Arvanitoyannis, I. S. 
Agricultural produces: Synopsis of employed quality control methods for the authentication of foods and application of chemometrics for the classification of foods according to their variety or geographical origin, Crit. Rev. Food Sci. Nutr., 2001, 41, 287-319. [34] Giuffre, A. M. Evolution of Fatty Alcohols in Olive Oils produced in Calabria (Southern Italy) during Fruit Ripening, J. Oleo Sci., 2014, 63, 485-496. [35] Ouni, Y., Taamalli, A., Gómez-Caravaca, A.M., Segura-Carretero, A., Fernández- Gutiérrez, A., & Zarrouk, M. (2011). Characterisation and quantification of phenolic compounds

44

of extra-virgin olive oils according to their geographical origin by a rapid and resolutive LC–ESI- TOF MS method, Food Chem., 127:1263-1267. [36] Mannina, L., Patumi, M., Proietti, N., Bassi, D., & Segre, A.L.. (2001). Geographical characterization of Italian extra virgin olive oils using high-field H-1 NMR spectroscopy, J. Agric. Food Chem., 49:2687-2696. [37] Benincasa, C., Lewis, J., Perri, E., Sindona, G., & Tagarelli, A. (2007). Determination of trace element in Italian virgin olive oils and their characterization according to geographical origin by statistical analysis, Anal. Chim. Acta, 585:366-370. [38] Sinelli, N., Cerretani, L., Di Egidio, V., Bendini, A., & Casiraghi, E. (2010). Application of near (NIR) infrared and mid (MIR) infrared spectroscopy as a rapid tool to classify extra virgin olive oil on the basis of fruity attribute intensity, Food Res. Int., 43:369-375. [39] Bajoub, A., Carrasco-Pancorbo, A., Ouazzani, N., & Fernández-Gutiérrez, A. (2015). Potential of LC–MS phenolic profiling combined with multivariate analysis as an approach for the determination of the geographical origin of north Moroccan virgin olive oils, Food Chem., 166:292-300. [40] Camin, F., Larcher, R., Perini, M., Bontempo, L., Bertoldi, D., Gagliano, G., & Versini, G. (2010). Characterisation of authentic Italian extra-virgin olive oils by stable isotope ratios of C, O and H and mineral composition, Food Chem., 118:901-909. [41] Del Coco, L., Schena, F.P., & Fanizzi, F.P. (2012). 1H nuclear magnetic resonance study of olive oils commercially available as Italian products in the United States of America, Nutrients, 4:343-355. [42] Rezzi, S., Axelson, D.E., Héberger, K., Reniero, F., Mariani, C., & Guillou, C. (2005). Classification of olive oils using high throughput flow 1H NMR fingerprinting with principal component analysis, linear discriminant analysis and probabilistic neural networks, Anal. Chim. Acta, 552:13-24. 
[43] Longobardi, F., Ventrella, A., Napoli, C., Humpfer, E., Schütz, B., Schäfer, H., & Sacco, A. (2012). Classification of olive oils according to geographical origin by using 1H NMR fingerprinting combined with multivariate analysis, Food Chem., 130:177-183. [44] Karoui, R., Downey, G., & Blecker, C. (2010). Mid-infrared spectroscopy coupled with chemometrics: A tool for the analysis of intact food systems and the exploration of their molecular structure− Quality relationships− A review, Chem. Rev., 110:6144-6168.

45

[45] Marini, F. (2009). Artificial neural networks in foodstuff analyses: Trends and perspectives A review, Anal. Chim. Acta, 635:121-131. [46] International Olive Council. World Olive Oil Figures. http://www.internationaloliveoil.org/estaticos/view/131-world-olive-oil-figures (accessed June 2019). [47] European Commission, Agriculture and Rural Development. Database Of Origin & Registration. http://ec.europa.eu/agriculture/quality/door/list.html (accessed June 2019). [48] Jain, A.K. (2010). Data clustering: 50 years beyond K-means, Pattern Recognition Letters, 31:651-666.

46

Applications of Vibrational Spectroscopy Techniques

Astrid Maléchaux, Nathalie Dupuy, Jacques Artaud

Aix Marseille Univ, Univ Avignon, CNRS, IRD, IMBE, Marseille, France

Abstract

A literature study focused on olive oil adulteration indicates that the use of vibrational spectroscopy (near infrared, mid infrared, Raman) has been steadily rising between 1990 and 2016. Spectral data are often subjected to supervised or unsupervised chemometric treatments, and can also be concatenated or analysed by a multiblock approach. The application of the three spectroscopic techniques to the identification of olive oils and their quality, the detection of adulteration by other oils, and the recognition of varietal or geographical origin, are reviewed in this chapter.

Keywords

spectroscopy, NIR, MIR, Raman, chemometrics

1. Introduction

One of the main issues facing the food industry to this day is the authentication of its products. Due to their high price compared to other edible oils, especially when they benefit from a certification like the Protected Designation of Origin (PDO), Extra Virgin Olive Oils (EVOOs) and Virgin Olive Oils (VOOs) are an attractive target for fraudsters. They can indeed be subjected to more or less sophisticated fraudulent practices, the most common ones being the falsification or adulteration of VOOs with lower-price oils such as seed oils, refined olive oil or . Many studies have thus been conducted in order to fight the fraud that disrupts the market and damages the positive image of VOOs. Firstly, the quality criteria set by the International Olive Council (IOC) allow the classification of olive oils into different categories (extra virgin, virgin, lampante virgin) according to their free acidity, peroxide value, UV absorbance, alkyl ester content and sensory properties. Secondly, molecular markers including fatty acids (Z and E), sterols, triterpene dialcohols, waxes or stigmastadienes are used to detect possible fraud. However, the authentication of varietal or geographical origins, as well as the affiliation of a VOO to a PDO, often represents a real analytical challenge. Numerous research works, based on various physicochemical determinations associated with chemometric data processing, have sought to address this problem. These studies can be classified into two main groups: those analysing the chemical composition of the oil, and those relying on spectroscopic techniques like nuclear magnetic resonance, infrared, Raman or fluorescence spectroscopies. For instance, vibrational spectroscopic analyses, namely Near Infrared (NIR), Mid Infrared (MIR) and Raman, coupled with the predictive chemometric methods of Partial Least Squares (PLS) regression and PLS discriminant analysis (PLS-DA), have been successfully applied to the authentication of French VOOs from different PDOs [1, 2].

2. Bibliometrics

A quick search of the terms “olive oil,” “authentication” and “spectroscopy” in Google Scholar, restricted to articles published between 1990 and 2016, gives an idea of the vast number of studies on these subjects. Figure 1 also indicates that “olive oil” is almost 3 times more often associated with “spectroscopy” than with “authentication”; however, “spectroscopy” is present in 94% of the articles containing both “olive oil” and “authentication.” This tends to show that olive oil authentication is often studied in relation to spectroscopic analyses, but that these analytical techniques also serve other purposes, such as the characterisation of oil components or the measurement of quality parameters. It can also be noted that the number of articles containing “olive oil” and “chromatography” is higher than that for “olive oil” and “spectroscopy.” However, this is no longer the case when the term “authentication” is added.

FIGURE 1. NUMBER OF ARTICLES CONTAINING THE WORDS “OLIVE OIL”, “AUTHENTICATION”, “SPECTROSCOPY” OR “CHROMATOGRAPHY” AND THEIR COMBINATIONS (GOOGLE SCHOLAR, 20TH MARCH 2017, FIGURE NOT TO SCALE).

A more specific search on Web of Science confirms that the authentication of virgin olive oil using vibrational spectroscopy has been a subject of interest since the 1990s, and even more so during the past 10 years. This is evidenced by the growing number of publications that are reported in Figure 2. The number of studies focused on NIR has been steadily increasing since 2002, while MIR has seen a more recent and sharper rise of interest. Raman spectroscopy used to be the most popular in the late 1990s and early 2000s, but has since then been overtaken by the other two techniques. On average, around 20% of the articles included experiments with at least two of the analytical methods of interest.

FIGURE 2. EVOLUTION OF THE NUMBER OF PUBLICATIONS FOUND FOR THE QUERY “OLIVE OIL” AND AUTHENTIC* AND (NIR OR “NEAR INFRARED”) OR (MIR OR “MID INFRARED”) OR RAMAN (WEB OF SCIENCE, 20TH MARCH 2017).

In the year 2016 alone, six reviews dealing with the applications of spectroscopic and/or chemometric methods for the quality control and authentication of VOOs were published [3–8]. Moreover, a book summing up the latest advances in food authenticity was also published; it contains chapters on vibrational spectroscopy, chemometrics, the confirmation of the geographical origin of food and the analysis of adulterated vegetable oils [9]. The free software Wordle was used to identify the most frequently used keywords in the titles of the articles from the previous Web of Science search, and the result is presented in Figure 3. The terms “olive oil” and “spectroscopy” were removed in order to give a better view of the other words. The importance of Fourier-transform instruments and the predominance of studies using MIR over NIR and Raman spectroscopies thus become apparent. Other analytical techniques are mentioned, such as UV-visible, fluorescence or NMR spectroscopies, as well as the possibility of combining several methods. The association with chemometrics for multivariate analysis is also highlighted and a few specific models are cited, the most prominent one being PLS. The detection and quantification of extra-virgin or virgin olive oil adulteration with other vegetable or edible oils seems to be the main application, followed by the authentication or determination of geographical and varietal origins.

FIGURE 3. WORD CLOUD GENERATED BY THE TITLES OF THE ARTICLES FROM WEB OF SCIENCE QUERY (WORDLE, 20TH MARCH 2017, FONT SIZE REPRESENTATIVE OF FREQUENCY OF APPEARANCE).

3. Spectroscopy

Vibrational spectroscopic techniques, such as infrared and Raman spectroscopies, have gained in popularity during the past decades, and their applications to food analysis have been extensively studied. Compared to chromatographic methods, they allow simple, non-destructive, time- and cost-saving analyses. Moreover, technological advances like the introduction of interferometers, attenuated total reflection instruments or detectors with increased sensitivity and resolution have made them more user-friendly. The growing use of chemometrics has also significantly improved the ability to extract meaningful information from spectral data, and to obtain reliable quantitative results.

FIGURE 4. SPECTROSCOPIC TECHNIQUES RELATED TO THE INFRARED REGION OF THE ELECTROMAGNETIC SPECTRUM.

Vibrational spectroscopy relies on changes in the energy levels of the molecules, due to the interaction between a sample and electromagnetic radiation. Each bond between two atoms has a characteristic vibration frequency depending on parameters such as the reduced mass of these two atoms and the force constant of the bond. The excitation brought by the radiation causes the bonds to stretch or bend. In the case of infrared absorption, the molecular vibration is related to a change in the dipole moment, while Raman inelastic scattering depends on a change in the electronic polarizability of the molecule. The amount of energy absorbed by the sample also influences the vibrations, as summarised in Figure 4. In the MIR region (4000-400 cm-1), the transitions between energy levels correspond mainly to fundamental vibrations and a few overtones, whereas in the more energetic NIR region (12500-4000 cm-1) lower-intensity bands of overtones and combinations of the fundamental vibrations can be observed. As a consequence, these three techniques provide complementary information about the chemical composition and physical state of a sample. For instance, some infrared absorption bands arise from polar groups such as C=O and O-H, while Raman spectra show more pronounced scattering bands for nonpolar groups like C=C or C-C. It is also worth noting that Raman is prone to fluorescence interference, which can be reduced by using a Fourier Transform (FT) interferometer and a laser source of lower energy [10, 11].
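The link between vibration frequency, reduced mass and force constant can be illustrated with the harmonic-oscillator approximation, a standard textbook model that is not taken from the studies cited here. In the minimal sketch below, the function name is hypothetical and the force constant of 1200 N/m for a C=O bond is an assumed order-of-magnitude value chosen only for illustration.

```python
import math

AMU_KG = 1.66053906660e-27   # atomic mass unit in kg
C_CM_PER_S = 2.99792458e10   # speed of light in cm/s


def harmonic_wavenumber(m1_amu, m2_amu, k_newton_per_m):
    """Fundamental wavenumber (cm-1) of a diatomic harmonic oscillator."""
    # Reduced mass of the two atoms, converted to kg.
    mu = (m1_amu * m2_amu) / (m1_amu + m2_amu) * AMU_KG
    # Angular frequency sqrt(k / mu), divided by 2*pi*c to obtain cm-1.
    return math.sqrt(k_newton_per_m / mu) / (2.0 * math.pi * C_CM_PER_S)


# Assumed illustrative force constant for a C=O double bond.
carbonyl = harmonic_wavenumber(12.0, 16.0, 1200.0)
# Same bond with a heavier carbon isotope: larger reduced mass, lower wavenumber.
carbonyl_13c = harmonic_wavenumber(13.0, 16.0, 1200.0)
```

With these assumed inputs the fundamental C=O vibration falls at roughly 1700 cm-1, that is, inside the MIR region quoted above, and increasing the reduced mass shifts the band to a lower wavenumber.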

4. Chemometrics

Chemometrics is the use of multivariate statistical analyses to extract information from chemical data. Since its creation by Svante Wold and Bruce Kowalski in the 1970s [12, 13], different methods have been developed to serve various purposes, such as data pre-processing and qualitative or quantitative analyses.

Pre-treatment of raw spectra is often necessary to reduce the effect of interferences and artefacts on the subsequent development of a predictive model. Wavelet filtering [14] or Savitzky-Golay (SG) smoothing [15] can be used to improve the signal-to-noise ratio, while detrending or SG 1st and 2nd derivatives correct the baseline shift. Moreover, the 2nd derivative can resolve overlapping peaks. Other algorithms, like Standard Normal Variate (SNV) [16] and Multiplicative Scatter Correction (MSC) [17], are useful when both additive and multiplicative effects caused by light scattering are present. Normalisation or scaling can also be applied to ensure that each spectrum has the same importance in the model.

Before the development of analytical models, the spectral data can be explored through Principal Component Analysis (PCA) [18, 19], which decomposes the initial matrix into sets of scores and loadings and thereby reduces its dimensionality. When enough variability is taken into account by the principal components (PCs), the loadings show which variables have the most influence on the PCs, and a representation of the scores can provide insight into the similarities among samples or the presence of outliers.

The discrimination between oils of different botanical, varietal or geographical origins involves the use of qualitative analyses. Unsupervised classification methods, such as Hierarchical Cluster Analysis (HCA) [20], separate the samples into different groups without prior knowledge of their category membership. On the other hand, supervised methods like classification by Linear Discriminant Analysis (LDA) [21] or class-modelling by Soft Independent Modelling of Class Analogy (SIMCA) [22] assign new samples to previously defined categories. LDA reduces the dimensionality of the space by selecting directions that maximise the separation between classes, whereas SIMCA builds a separate PCA model for each class and assigns samples according to their distance to each class model. More recently, artificial intelligence algorithms such as Artificial Neural Networks (ANN) [23] have been developed to categorise samples after a phase of training by iterative adjustments.

The development of quantitative models is required to determine the amount of adulterant that may have been added to a sample. Multiple Linear Regression (MLR) [24], Partial Least Squares (PLS) [25] or Principal Component Regression (PCR) [26] are the most commonly used methods. They are based on the construction of a linear relationship between the variations of the spectral data and the chemical parameter to be explained. However, other methods using non-linear models, such as ANN or Support Vector Machines (SVM) [27], can also perform quantitative analyses [3, 11, 28].
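As an illustration of the scatter-correction step described above, the following minimal sketch implements SNV and MSC in plain Python. The function names and the seven-point "spectra" are hypothetical, and the formulas follow the usual textbook definitions (row-wise centring and scaling for SNV, regression against a reference spectrum for MSC) rather than any specific implementation used in the cited studies.

```python
from statistics import mean, stdev


def snv(spectrum):
    """Standard Normal Variate: centre and scale one spectrum by its own mean and std."""
    m = mean(spectrum)
    s = stdev(spectrum)
    return [(x - m) / s for x in spectrum]


def msc(spectrum, reference):
    """Multiplicative Scatter Correction against a reference (often the mean spectrum)."""
    # Least-squares fit of: spectrum ~ a + b * reference
    mx, my = mean(reference), mean(spectrum)
    sxx = sum((x - mx) ** 2 for x in reference)
    sxy = sum((x - mx) * (y - my) for x, y in zip(reference, spectrum))
    b = sxy / sxx
    a = my - b * mx
    # Remove the additive (a) and multiplicative (b) scatter effects.
    return [(y - a) / b for y in spectrum]


# Hypothetical reference spectrum and the same spectrum distorted by scattering.
reference = [0.10, 0.35, 0.80, 0.40, 0.15, 0.55, 0.20]
scattered = [0.05 + 1.8 * x for x in reference]

snv_reference = snv(reference)
snv_scattered = snv(scattered)
msc_scattered = msc(scattered, reference)
```

In this idealised case the distortion is purely additive and multiplicative, so both corrections remove it completely: the SNV-treated spectra coincide, and the MSC-treated spectrum matches the reference. Real spectra also contain chemical differences and noise, so the correction is only approximate.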

5. Near infrared spectroscopy

5.1. Spectra Interpretation

As can be seen in Figure 5, characteristic NIR absorbance bands arise in several regions of the EVOO spectrum. Region A (8700-8000 cm-1) is attributed to the 2nd overtone of C-H stretching vibrations, while B (7400-6700 cm-1) results from combinations of C-H stretching and bending, and C (6000-5500 cm-1) corresponds to the 1st overtone of C-H stretching vibrations. These three regions contain information regarding the degree of unsaturation of the fatty acids and triacylglycerols present in a sample. The two bands in region D (5300-5100 cm-1) have been attributed to the 2nd overtone of C=O stretching vibration from carbonyl functional groups. Finally, region E (5000-4500 cm-1) presents combination bands of =C-H and C=C stretching vibrations [9, 11, 29, 30].

FIGURE 5. NEAR INFRARED SPECTRUM OF EVOO WITH IDENTIFICATION OF MAIN ABSORBANCE BANDS.
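The band assignments listed above can be collected in a small lookup table, which is convenient when annotating spectra or selecting variables. The sketch below is only an illustration: the dictionary and function names are hypothetical, and the limits are simply the region boundaries quoted in the text.

```python
# Region limits (cm-1) and assignments as quoted in the text for the EVOO NIR spectrum.
NIR_REGIONS = {
    "A": (8000, 8700, "2nd overtone of C-H stretching"),
    "B": (6700, 7400, "combinations of C-H stretching and bending"),
    "C": (5500, 6000, "1st overtone of C-H stretching"),
    "D": (5100, 5300, "2nd overtone of C=O stretching"),
    "E": (4500, 5000, "combinations of =C-H and C=C stretching"),
}


def assign_region(wavenumber_cm1):
    """Return the region label (A to E) containing the wavenumber, or None."""
    for label, (low, high, _assignment) in NIR_REGIONS.items():
        if low <= wavenumber_cm1 <= high:
            return label
    return None
```

For example, `assign_region(5814)` returns `"C"`, and a wavenumber outside the five regions returns `None`.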

5.2. Identification of Virgin Olive Oils vs Other Oils

The first step of authentication is to differentiate olive oil from other oils and fats. This can be achieved through the analysis of their major compounds, such as fatty acids and triacylglycerols, usually conducted by gas chromatography and high performance liquid chromatography respectively. However, differences in the composition of the samples are also reflected in their NIR spectra, as can be seen in the examples presented in Table 1. Hourant et al. [31] indeed showed that the absorption intensity of the bands around 5814 cm-1 (1720 nm), 4668 cm-1 (2142 nm) and 4595 cm-1 (2176 nm) could be related to the degree of total unsaturation in the sample. This allowed the classification of eighteen different oils and fats with the modelling of a dendroid structure based on seven linear discriminant functions. Yang et al. [32] confirmed that LDA could discriminate pure edible oils and fats using FT-NIR spectra, but obtained more satisfying classification rates with Canonical Variate Analysis (CVA).
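The paired values quoted above (wavenumber in cm-1 and wavelength in nm) follow from the relation wavelength (nm) = 10^7 / wavenumber (cm-1). The short sketch below, with a hypothetical function name, reproduces the three conversions given for the bands reported by Hourant et al.

```python
def wavenumber_to_nm(wavenumber_cm1):
    """Convert a wavenumber in cm-1 to a wavelength in nm (1 cm = 1e7 nm)."""
    return 1.0e7 / wavenumber_cm1


# Bands related to the degree of total unsaturation, as quoted in the text.
bands_cm1 = (5814, 4668, 4595)
bands_nm = {band: round(wavenumber_to_nm(band)) for band in bands_cm1}
# bands_nm == {5814: 1720, 4668: 2142, 4595: 2176}
```

The same relation gives 800 to 2500 nm for the NIR range of 12500 to 4000 cm-1.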

TABLE 1. EXAMPLES OF NIR SPECTROSCOPY APPLICATIONS TO DIFFERENTIATE OLIVE OILS FROM OTHER OILS

| Ref | Other oils | Materials | Chemometrics | Results |
|---|---|---|---|---|
| [31] | , Brazil nut, coconut, grape seed, high oleic sunflower, hydrogenated fish, maize, palm, peanut, rapeseed, safflower, sesame, soya, sunflower, tallow, walnut | NIR, 1 mm quartz cell, range: 9090-4000 cm-1 | Canonical discrimination after variable selection by SLDA | Combination of 7 equations gives 90% correct classification |
| [32] | Butter, coconut, cod liver oil, lard, maize, peanut, rapeseed, safflower, soya | FT-NIR, DTGS detector, quartz cell, range: 8000-2000 cm-1, resolution: 16 cm-1 | CVA after normalisation and data compression by PCA | 92.2% correct classification |

5.3. Adulteration of Virgin Olive Oils with Other Oils

Several articles focusing on the ability of NIR to analyse binary mixtures of VOOs with other kinds of oils have been published over the past 20 years (Table 2). Dispersive and FT-NIR have been equally used in these studies, and three of them report results obtained with a fibre optic probe although not in an on-line setting [33–35].

Downey et al. [36] developed a SIMCA model that gave 100% correct classification for VOOs versus adulterated samples containing 1 to 5% of sunflower oil. Karunathilaka et al. [37] also applied SIMCA to FT-NIR spectra to successfully detect the addition of 10 to 20% of various vegetable oils to EVOOs. Mignani et al. [33] obtained spectra through an integrating sphere and fibre optic detector. In this study, the application of PCA followed by LDA enabled the discrimination between EVOOs adulterated with refined olive oil, deodorised olive oil, olive pomace oil and refined olive pomace oil, with 75% correct classification.

In addition to the detection of adulteration, most of the articles also investigate the use of regression models to quantify the amount of adulterant. For instance, Downey et al. [36], Wesley et al. [38] and Christy et al. [39] applied PLS regression after various pre-treatments to predict the amount of sunflower oil added to olive oil. They all obtained R2 values above 0.9 and Standard Errors of Prediction (SEP) under 2%. The analysis of VOOs adulterated with maize, soya, rapeseed, safflower, peanut, walnut, hazelnut or palm oils yielded similar results according to Azizian et al. [34], Wesley et al. [38], Christy et al. [39] and Mendes et al. [40]. The latter constructed different models to quantify the addition of high linoleic oils, high oleic oils or palm olein, based on the absorption ratio at 5280 and 5180 cm-1, attributed respectively to volatile and non-volatile compounds. Mignani et al. [33], Azizian et al. [34], Yang and Irudayaraj [35], Wesley et al. [38] and Wojcicki et al. [41] also tried to quantify the adulteration of EVOOs by refined olive oil or olive pomace oil. These studies tend to show higher errors of prediction, ranging from 1.78 to 13%, which may be due to the greater similarity between the compositions of pure and adulterated samples. Finally, Ozdemir and Ozturk [42] developed a Genetic Inverse Least Squares model capable of predicting the composition of ternary mixtures with SEP values of 1.42%, 5.42% and 6.38% for the amounts of VOO, sunflower oil and maize oil respectively.
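The figures of merit quoted in this section (R2, RMSEP, SEP) can be computed from reference and predicted adulterant levels as sketched below. The definitions used here are common chemometric conventions (SEP taken as the bias-corrected standard deviation of the prediction errors); individual articles may use slightly different formulas, and the function name and data are hypothetical.

```python
import math


def prediction_metrics(reference, predicted):
    """Return R2, RMSEP, bias-corrected SEP and bias for a validation set."""
    n = len(reference)
    errors = [p - r for r, p in zip(reference, predicted)]
    bias = sum(errors) / n
    # Root mean square error of prediction (includes the bias).
    rmsep = math.sqrt(sum(e * e for e in errors) / n)
    # Standard error of prediction, corrected for the bias.
    sep = math.sqrt(sum((e - bias) ** 2 for e in errors) / (n - 1))
    # Coefficient of determination relative to the mean of the reference values.
    mean_ref = sum(reference) / n
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    r2 = 1.0 - sum(e * e for e in errors) / ss_tot
    return {"R2": r2, "RMSEP": rmsep, "SEP": sep, "bias": bias}


# Hypothetical adulterant levels (%) for five validation samples.
reference = [0.0, 5.0, 10.0, 20.0, 30.0]
predicted = [1.0, 4.0, 11.0, 19.0, 31.0]
metrics = prediction_metrics(reference, predicted)
```

With these made-up values every prediction is off by one percentage point, so RMSEP is 1.0% while R2 remains above 0.99. This illustrates why a high R2 alone does not guarantee a low prediction error when the calibration range is wide.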

TABLE 2. EXAMPLES OF NIR SPECTROSCOPY APPLICATIONS TO ANALYSE VOOS ADULTERATED WITH OTHER OILS

| Ref | Adulterants | Materials | Chemometrics | Results |
|---|---|---|---|---|
| [33] | Olive pomace, refined olive pomace, refined olive, deodorised olive oils (5 to 95%) | NIR, fibre optic source and detector, integrating sphere, range: 25000-5880 cm-1 | LDA and PLS regression after SG smoothing | LDA: 75% correct classification; PLS: R2 = 0.932 to 0.997, RMSEP = 2% to 13% |
| [34] | Refined olive oil (3 to 60%) and soya, sunflower, maize, rapeseed, hazelnut, safflower, peanut, palm oils (3 to 30%) | FT-NIR, fibre optic probe, InGaAs detector, range: 8000-4500 cm-1, resolution: 8 cm-1 | PLS regression on the absorption ratio 5280/5180 cm-1 | R2 = 97.6 to 99.9, RMSECV = 3.7% to 0.9% |
| [35] | Olive pomace oil (5 to 100%) | NIR, fibre optic probe, InGaAs DAD, range: 25000-5880 cm-1 | PLS regression after MSC | R2 = 0.990, SECV = 3.48%, SEP = 3.27% |
| [36] | Sunflower oil (1 and 5%) | NIR, 0.1 mm camlock cell, range: 25000-4000 cm-1 | SIMCA and PLS regression after SG 1st derivative | SIMCA: 100% correct classification; PLS: R2 = 0.93, RMSEP = 0.8%, LOD = 1.6% |
| [37] | Sunflower, soya, rapeseed, maize, hazelnut, safflower, peanut oils, palm olein (10 and 20%) | FT-NIR, 8 mm glass vials, range: 12500-4000 cm-1, resolution: 8 cm-1 | SIMCA after SG smoothing, SG 1st derivative and SNV | 100% correct classification |
| [38] | Refined olive oil, maize, sunflower oils (5 to 30%) | NIR, 1 mm quartz cell, range: 12500-4000 cm-1 | PLS regression after SG smoothing and 1st derivative | R2 = 0.97, SECV = 1.31%, SEP = 1.78% |
| [39] | Hazelnut, walnut, maize, soya, sunflower oils (0 to 100%) | FT-NIR, Ge diode detector, 4 mm quartz cell, range: 12000-4000 cm-1, resolution: 4 cm-1 | PLS regression after MSC and SG smoothing | R2 = 0.999, SEP = 0.56% to 1.32% |
| [40] | Soya oil (1.5 to 100%) | FT-NIR, Te-InGaAs detector, 8 mm glass vials, range: 12000-4000 cm-1, resolution: 4 cm-1 | PLS regression | R2 = 0.998, RMSECV = 1.71, RMSEP = 1.76 |
| [42] | Sunflower and maize oils (4 to 96%) | FT-NIR, PbSe detector, 2 mm quartz cell, range: 10000-4000 cm-1 | Genetic Inverse Least Squares | SEP = 1.42% to 6.38% for ternary mixtures |


5.4. Authentication of Geographical or Varietal Origins

The most recent and prominent application of NIR spectroscopy has been the classification of VOOs according to their geographical or varietal origins. Table 3 summarises some of the articles published on this subject, most of which used FT-NIR rather than dispersive instruments. The potential of PLS-DA modelling applied to NIR spectra to discriminate VOOs from different cultivars or regions of origin has been highlighted by several authors, including Dupuy et al. [1], Sinelli et al. [43], Woodcock et al. [44], Galtier et al. [45] and Bevilacqua et al. [46], who all obtained correct classification rates of 85 to 100%. Other discriminant analysis algorithms, such as FDA or LDA, have also been tested with reasonable success by Downey et al. [47], Casale et al. [48] and Sinelli et al. [49]. Class modelling techniques such as SIMCA seem to give less satisfactory results overall, although Casale et al. [50], Oliveri et al. [51] and Laroussi-Mezghani et al. [52] managed to correctly predict the origin of 84.5 to 98.5% of their samples. Oliveri et al. [51], Casale et al. [53] and Forina et al. [54] also used POTFUN or UNEQ class models, giving 83 to 100% correct classification. In another study, Oliveri et al. [55] developed a novel Multivariate Range Modelling technique yielding a classification rate of 94.9%. Devos et al. [56] achieved a classification rate of 86.3% with an SVM supervised learning model coupled with a genetic algorithm for pre-treatment selection.
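The correct classification rates quoted in this section can be computed per class as well as overall, which matters when results are reported separately for a target origin and for "other" oils. The following is a minimal sketch of that calculation; the function name, the labels and the ten-sample validation set are invented for illustration and do not come from any of the cited studies.

```python
from collections import Counter


def correct_classification_rates(true_labels, predicted_labels):
    """Return (overall_rate, per_class_rates), both expressed in percent."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    totals = Counter(true_labels)
    # Count, for each true class, how many samples were predicted correctly.
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    per_class = {c: 100.0 * hits[c] / n for c, n in totals.items()}
    overall = 100.0 * sum(hits.values()) / len(true_labels)
    return overall, per_class


# Hypothetical validation set: 4 oils from a target origin and 6 other oils.
y_true = ["Liguria"] * 4 + ["other"] * 6
y_pred = ["Liguria", "Liguria", "Liguria", "other",
          "other", "other", "other", "other", "other", "Liguria"]

overall, per_class = correct_classification_rates(y_true, y_pred)
print(f"overall: {overall:.1f}%")
for label, rate in per_class.items():
    print(f"{label}: {rate:.1f}%")
```

With unbalanced classes the overall rate can hide a weak result for the smaller class, which is presumably why several of the studies in Table 3 report a separate rate for each group.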

TABLE 3. EXAMPLES OF NIR SPECTROSCOPY APPLICATIONS TO DETERMINE THE ORIGIN OF VOOS

Ref | Origins | Materials | Chemometrics | Results
[1] | 6 French PDOs, 5 harvest years | FT-NIR, 2 mm quartz cell, range: 10000-4500 cm-1, resolution: 4 cm-1 | PLS-DA | 85% correct classification for PDOs
[43] | 3 Italian regions | FT-NIR, 8 mm vials, range: 12500-4500 cm-1, resolution: 8 cm-1 | PLS-DA after SG 2nd derivative | 93% correct classification with commercial oils
[44] | Liguria and other European regions, 3 harvest years | NIR, 0.1 mm camlock cell, range: 9090-4000 cm-1 | PLS-DA after SG 1st derivative | 92.8% correct classification for Ligurian oils, 81.5% for other oils
[45] | 5 French PDOs, 4 harvest years | FT-NIR, 2 mm quartz cell, range: 10000-4500 cm-1, resolution: 4 cm-1 | PLS-DA | 100% correct classification for PDOs
[46] | PDO Sabina and other Mediterranean regions, 2 harvest years | FT-NIR, integrating sphere, 19 mm glass cell, range: 10000-4000 cm-1, resolution: 4 cm-1 | PLS-DA after MSC, detrend, or SG 1st derivative | 100% correct classification for Sabina and 95.5% for other origins
[47] | 3 Greek regions | NIR, 0.1 mm camlock cell, range: 25000-4000 cm-1 | FDA | 94% correct classification for geographic origin
[48] | 3 cultivars from 3 Italian regions | FT-NIR, 8 mm vials, range: 12500-4500 cm-1, resolution: 8 cm-1 | LDA after SNV, SG 1st derivative and variable selection | 82.9% correct classification for cultivars
[49] | 3 cultivars from 3 Italian regions | FT-NIR, 8 mm vials, range: 12500-4500 cm-1, resolution: 8 cm-1 | LDA after SNV, SG 1st derivative and variable selection | 83% correct classification
[50] | Liguria and other Italian regions | FT-NIR, 5 mm quartz cell, range: 10000-4000 cm-1, resolution: 8 cm-1 | SIMCA after SG 1st derivative and variable selection | 92.4% correct classification for Ligurian oils
[51] | Liguria and other European regions, 3 harvest years | NIR, 0.1 mm camlock cell, range: 9090-4000 cm-1 | SIMCA or POTFUN after SG 1st derivative | 84.5% correct classification with SIMCA
[52] | 6 Tunisian cultivars and other countries, 2 harvest years | FT-NIR, 2 mm quartz cell, range: 10000-4500 cm-1, resolution: 4 cm-1 | SIMCA after SNV and SG 1st derivative | 89.55 to 98.50% correct classification for cultivars
[53] | PDO Chianti Classico and other Italian regions | FT-NIR, 5 mm quartz cell, range: 10000-4000 cm-1, resolution: 4 cm-1 | UNEQ after SNV, SG 1st derivative and variable selection (SELECT) | 97.5% correct classification
[54] | PDO Chianti Classico and other Italian regions | FT-NIR, 5 mm quartz cell, range: 10000-4000 cm-1, resolution: 4 cm-1 | QDA-UNEQ after SG 1st derivative and variable selection (STEP-LDA) | 100% correct classification
[55] | PDO Chianti Classico and other Italian regions | FT-NIR, 5 mm quartz cell, range: 10000-4000 cm-1, resolution: 4 cm-1 | MRM after SNV | 94.9% correct classification
[56] | Liguria and other Italian regions, 3 harvest years | NIR, 0.1 mm camlock cell, range: 9090-4000 cm-1 | SVM after detrend | 86.3% correct classification
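Two of the scatter-correction pre-treatments listed in Table 3, SNV and MSC, are simple enough to sketch directly. The code below is an illustrative pure-Python version under stated assumptions: each spectrum is a plain list of absorbances, MSC uses the mean spectrum of the set as its reference, and the two toy spectra are invented (the second is the first with an added offset and a multiplicative factor). It is not taken from any of the cited works.

```python
import math


def snv(spectrum):
    """Standard Normal Variate: centre and scale each spectrum individually."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in spectrum) / (n - 1))
    return [(v - mean) / std for v in spectrum]


def msc(spectra):
    """Multiplicative Scatter Correction against the mean spectrum of the set."""
    n_points = len(spectra[0])
    reference = [sum(s[j] for s in spectra) / len(spectra) for j in range(n_points)]
    ref_mean = sum(reference) / n_points
    sxx = sum((r - ref_mean) ** 2 for r in reference)
    corrected = []
    for s in spectra:
        s_mean = sum(s) / n_points
        # Least-squares fit of the model: spectrum = offset + slope * reference.
        slope = sum((r - ref_mean) * (v - s_mean) for r, v in zip(reference, s)) / sxx
        offset = s_mean - slope * ref_mean
        corrected.append([(v - offset) / slope for v in s])
    return corrected


# Invented toy data: the same band shape recorded with different scatter effects.
base = [0.1, 0.4, 0.9, 0.3]
scattered = [2.0 * v + 0.5 for v in base]

snv_a, snv_b = snv(base), snv(scattered)
msc_a, msc_b = msc([base, scattered])
print(snv_a)
print(msc_a)
```

In this idealised case both corrections make the two spectra coincide; on real spectra they only reduce, rather than remove, additive and multiplicative scatter effects.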


6. Mid infrared spectroscopy

6.1. Spectra Interpretation

Figure 6 shows a characteristic MIR spectrum of EVOO, presenting sharper absorption bands than the NIR spectrum. Band A, around 3005 cm-1, is associated with the =C-H stretching vibrations of cis (Z) double bonds. Bands B and C (2920 and 2850 cm-1) arise respectively from the asymmetric and symmetric stretching vibrations of aliphatic C-H bonds. Band D (1740 cm-1) corresponds to the C=O stretching of carbonyl groups, and band E (1650 cm-1) to C=C stretching vibrations. The fingerprint region, below 1500 cm-1, presents overlapping peaks that are less easily assigned. However, region F between 1500 and 1300 cm-1 can be related to C-H aliphatic bending vibrations and region G (1250-1000 cm-1) to C-C and C-O bending vibrations. Finally, band H (700 cm-1) is attributed to the C-H bending of CH2 [9, 11, 29, 30].
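The band assignments above can be collected into a small lookup table, for example to annotate a peak list. This is only an illustrative sketch: the positions and assignments are transcribed from the paragraph above, whereas the function name and the 15 cm-1 matching tolerance are arbitrary assumptions.

```python
# (label, lower bound, upper bound, assignment), wavenumbers in cm-1,
# transcribed from the text above.
MIR_BANDS = [
    ("A", 3005, 3005, "=C-H stretching of cis (Z) double bonds"),
    ("B", 2920, 2920, "C-H aliphatic asymmetric stretching"),
    ("C", 2850, 2850, "C-H aliphatic symmetric stretching"),
    ("D", 1740, 1740, "C=O stretching of carbonyl groups"),
    ("E", 1650, 1650, "C=C stretching"),
    ("F", 1300, 1500, "C-H aliphatic bending"),
    ("G", 1000, 1250, "C-C and C-O bending"),
    ("H", 700, 700, "C-H bending of CH2"),
]


def assign_band(wavenumber, tolerance=15.0):
    """Return (label, assignment) for a peak position, or None if unassigned.

    The tolerance (in cm-1) is a hypothetical matching window, not a value
    taken from the source.
    """
    for label, low, high, assignment in MIR_BANDS:
        if low - tolerance <= wavenumber <= high + tolerance:
            return label, assignment
    return None


for peak in (3006, 2925, 1745, 1400, 2000):
    print(peak, assign_band(peak))
```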

FIGURE 6. MID INFRARED SPECTRUM OF EVOO WITH IDENTIFICATION OF MAIN ABSORBANCE BANDS.

6.2. Identification of Virgin Olive Oils vs Other Oils

The discrimination between VOOs and other fats and oils has been more extensively studied using MIR than NIR spectroscopy, and always with FT instruments (Table 4). Several authors, such as Lai et al. [57], Marigheto et al. [58], Tay et al. [59], Obeidat et al. [60], Lerma-Garcia et al. [61] and de la Mata et al. [62], reported classification rates of 100% with the use of various discriminant analysis techniques including PLS-DA and LDA. Javidnia et al. [63] reached the same result by using interval extended canonical variate analysis (iECVA). Yang et al. [32] obtained better results with CVA applied to the MIR spectra of olive and sunflower oils than with NIR spectra, since 98.9% of the samples were correctly classified versus 92.2% with NIR. In two separate studies, Baeten and co-workers distinguished refined olive oil from hazelnut oil using either ANN [64] or stepwise linear discriminant analysis (SLDA) [65].

TABLE 4. EXAMPLES OF MIR SPECTROSCOPY APPLICATIONS TO DIFFERENTIATE VOOS FROM OTHER OILS

Ref | Other oils | Materials | Chemometrics | Results
[32] | Butter, coconut, cod liver oil, lard, maize, peanut, rapeseed, safflower, soya | FT-MIR, DTGS detector, ZnSe ATR crystal, range: 4000-400 cm-1, resolution: 16 cm-1 | CVA on 1800-1400 cm-1 region, after normalisation and data compression by PCA or PLS | 98.9% correct classification
[57] | Grapeseed, groundnut, maize, rapeseed, refined olive, walnut | FT-MIR, DTGS detector, ZnSe ATR crystal, range: 4800-800 cm-1, resolution: 4 cm-1 | DA on PC scores | 100% correct classification for extra virgin vs refined olive oil
[58] | Coconut, grapeseed, hazelnut, maize, mustard, palm, peanut, rapeseed, refined olive, safflower, sesame, soya, sunflower, sweet almond, walnut | FT-MIR, DTGS detector, ZnSe ATR crystal, range: 4000-800 cm-1, resolution: 4 cm-1 | LDA after normalisation, baseline correction and data compression by PLS | 100% correct classification
[59] | Maize, peanut, rapeseed, sesame, soya, sunflower, walnut | FT-MIR, MCTA detector, ZnSe ATR crystal, range: 4000-700 cm-1, resolution: 2 cm-1, 128 averaged scans | DA | 100% correct classification
[60] | Cottonseed, maize, sunflower | FT-MIR, DTGS detector, range: 4000-400 cm-1 | PLS-DA after mean centring and normalisation | 100% correct classification
[61] | Hazelnut, maize, soya, sunflower | FT-MIR, KBr disks, range: 4000-500 cm-1, resolution: 4 cm-1 | LDA after normalisation and variable selection | 100% correct classification
[62] | Flaxseed, grapeseed, maize, peanut, rapeseed, safflower, sesame, soya, sunflower | FT-MIR, MCTA detector, diamond ATR crystal, range: 3800-600 cm-1, resolution: 2 cm-1 | PLS-DA after normalisation, detrend and SG 1st derivative | 100% correct classification
[63] | Butter, maize, rapeseed, soya, sunflower | FT-MIR, range: 4000-450 cm-1, transmittance mode | iECVA after MSC | 100% correct classification
[64] | Hazelnut | FT-MIR, ZnSe ATR crystal, range: 4000-400 cm-1, resolution: 4 cm-1 | CP-ANN | Good classification for olive and hazelnut oils
[65] | Hazelnut | FT-MIR, ZnSe ATR crystal, range: 4000-900 cm-1, resolution: 4 cm-1 | SLDA after SG smoothing, SG 1st derivative and selection of variables related to unsaponifiable matter | 95.5% correct classification for olive vs hazelnut oil

6.3. Adulteration of Virgin Olive Oils with Other Oils

Numerous articles, gathered in Table 5, focus on the qualitative or quantitative analysis of mixtures of olive oil and other oils based on MIR data. Once again, only FT-MIR instruments were used. Marigheto et al. [58] applied LDA after data compression by PLS and obtained 99% correct classification for olive oil adulterated with as little as 5% of various vegetable oils. Similarly, Oussama et al. [66] used PLS-DA after variable selection to correctly classify 100% of the samples containing 1 to 24% of soya or sunflower oils, and de la Mata et al. [62] used it to discriminate between VOOs adulterated with more and less than 50% of other oils. Discriminant analyses also allowed Tay et al. [59] to successfully detect the addition of 2 to 10% of sunflower oil, while Rohman and Che Man reached 100% correct classification for samples adulterated with palm oil [67], lard [68], rice bran oil [69], maize and sunflower oils [70] and 97.4% with rapeseed oil [71]. Other techniques also seem to give satisfactory results: for instance, Sun et al. [72] reached 96.6% correct classification with a Nearest Centroid algorithm after dimension reduction. Mixtures of hazelnut oil in VOO appear to be more difficult to detect. Indeed, Ozen and Mauer [73] achieved a correct classification rate of 100% with DA, but only for samples containing at least 25% of hazelnut oil. Baeten et al. [65] reached a LOD of 8% for Turkish hazelnut oil in refined olive oil by applying SLDA on variables characterising the unsaponifiable matter. Georgouli et al. [74] obtained a correct classification rate of 75% for samples adulterated with as little as 1% of hazelnut oil, with the use of k-NN after Continuous Locality Preserving Projections. The application of CP-ANN by Baeten and Novi [64] only resulted in a partial separation between VOOs with and without the addition of 2 to 20% of hazelnut oil.

As for the quantification of adulterants, most authors found that PLS regression after various pre-treatments gave satisfactory results. For instance, Wojcicki et al. [41], Tay et al. [59], Oussama et al. [66], Sun et al. [72], Rohman and Che Man [75], Lai et al. [76], Küpper et al. [77], Gurdeniz et al. [78] and Nigri and Oumeddour [79] all obtained R2 above 0.97 and RMSECV or RMSEP below 2.5% when predicting the concentrations of diverse vegetable oils mixed with olive oil. However, Yang and Irudayaraj [35], Mendes et al. [40] and Maggio et al. [80] had higher errors of prediction for the analysis of added olive pomace oil, soya oil and hazelnut oil respectively. PCR was usually shown to be less efficient than PLS regression, except for Jovic et al. [81], who managed to quantify the amounts of olive oil, sunflower, high oleic sunflower and rapeseed oils in binary and ternary mixtures with R2 over 0.99 and RMSEP under 2.3%. Another method, based on linear regression between the amount of adulterant and a ratio of peak heights, was applied by Vlachos et al. [82] and Poiana et al. [83] using the absorbances at 3006 and 2925 cm-1, which can be related to the degree of unsaturation. Allam and Hamed [84] employed a similar method, but focused on the peaks at 1118 and 1097 cm-1 that were assigned to C-O stretching.
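The peak-height ratio approach described above amounts to a univariate calibration: a straight line is fitted between the adulterant content of known standards and the ratio of two absorbances, then inverted for unknown samples. The sketch below illustrates the idea with ordinary least squares. The calibration values are invented, and the helper names are hypothetical; it is not a reproduction of the procedure of the cited authors.

```python
def fit_line(x, y):
    """Ordinary least squares fit y = slope * x + intercept; also returns R2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Hypothetical calibration standards: adulterant content (%) and the measured
# ratio of peak heights A(3006 cm-1) / A(2925 cm-1). All values are invented.
adulterant_pct = [0.0, 20.0, 40.0, 60.0, 80.0]
peak_ratio = [0.100, 0.121, 0.139, 0.160, 0.180]

slope, intercept, r2 = fit_line(adulterant_pct, peak_ratio)


def predict_adulterant(ratio):
    """Invert the calibration line to estimate the adulterant content (%)."""
    return (ratio - intercept) / slope


print(f"R2 = {r2:.4f}")
print(f"estimated adulterant for ratio 0.150: {predict_adulterant(0.150):.1f}%")
```

A single ratio uses far less of the spectrum than PLS regression, which may explain the comparatively high limits of detection reported for this approach in Table 5.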

TABLE 5. EXAMPLES OF MIR SPECTROSCOPY APPLICATIONS TO ANALYSE VOOS ADULTERATED WITH OTHER OILS

Ref | Adulterants | Materials | Chemometrics | Results
[40] | Soya oil (1.5 to 100%) | FT-MIR, RT-DLaTGS detector, range: 4000-350 cm-1, resolution: 4 cm-1 | PLS regression | R2 = 0.986, RMSECV = 14.71, RMSEP = 4.89
[35] | Olive pomace oil (0 to 100% in 5% increments) | FT-MIR, DTGS detector, ZnSe ATR crystal, range: 3200-600 cm-1, resolution: 4 cm-1 | PLS regression after MSC | R2 = 0.991, SECV = 4.74%, SEP = 3.28%
[41] | Mild deodorised and refined olive oils (2.5 to 75%) | FT-MIR, ATR crystal, range: 4000-650 cm-1, resolution: 4 cm-1 | PLS after MSC and 1st derivative | R2 = 0.99, RMSEP = 2.1%
[58] | Refined olive oil, sunflower, rapeseed, peanut, soya, maize oils (5 to 45%) | FT-MIR, DTGS detector, ZnSe ATR crystal, range: 4000-800 cm-1, resolution: 4 cm-1 | LDA after normalisation, baseline correction and data compression by PLS | 99% correct classification, LOD = 5%
[59] | Sunflower oil (2 to 10%) | FT-MIR, MCTA detector, ZnSe ATR crystal, range: 4000-700 cm-1, resolution: 2 cm-1 | DA, PLS regression | DA: 100% correct classification; PLS: R2 = 0.974, RMSECV < 1%
[60] | Sunflower, maize oils (25 to 75%) | FT-MIR, DTGS detector, range: 4000-400 cm-1 | PLS-DA after mean centring and normalisation | Good separation between pure and adulterated samples
[61] | Sunflower, maize, soya, hazelnut oils (5 to 100%) | FT-MIR, KBr disks, range: 4000-500 cm-1, resolution: 4 cm-1 | MLR after normalisation | R2 = 0.91 to 0.99, SEP = 1.5 to 2%, LOD = 1.3 to 4.8%
[62] | Rapeseed, maize, flaxseed, grape seed, peanut, safflower, sesame, soya, sunflower oils (10 to 90%) | FT-MIR, MCTA detector, diamond ATR crystal, range: 3800-600 cm-1, resolution: 2 cm-1 | PLS-DA and PLS regression after normalisation, detrend and SG 1st derivative | PLS-DA: 95% correct classification for samples >50% adulterant; PLS: R2 = 0.79, RMSECV = 8.28
[64] | Hazelnut oil (2 to 20%) | FT-MIR, ZnSe ATR crystal, range: 4000-400 cm-1, resolution: 4 cm-1 | CP-ANN | Partial separation between mixtures and olive oil
[65] | Hazelnut oil (2 to 20%) | FT-MIR, ZnSe ATR crystal, range: 4000-900 cm-1, resolution: 4 cm-1 | SLDA after SG smoothing, SG 1st derivative and selection of variables related to unsaponifiable matter | 100% correct classification, LOD = 8% of Turkish hazelnut oil in olive oil
[66] | Soya, sunflower oils (1 to 24%) | FT-MIR, DTGS detector, ATR crystal, range: 4000-600 cm-1, resolution: 4 cm-1 | PLS-DA and PLS regression after variable selection (VIP) | PLS-DA: 100% correct classification; PLSR: R2 = 0.996, RMSECV = 0.63, RMSEP = 0.41, LOD = 1.2%
[67] | Palm oil (1 to 50%) | FT-MIR, DTGS detector, ATR crystal, range: 4000-650 cm-1, resolution: 4 cm-1 | LDA and PLS regression after SG 1st derivative | LDA: 100% correct classification; PLSR: R2 = 0.998, RMSECV = 0.285, RMSEP = 0.616
[68] | Lard (1 to 50%) | FT-MIR, DTGS detector, ZnSe ATR crystal, range: 4000-650 cm-1, resolution: 4 cm-1 | DA and PLS regression after SG 1st derivative | DA: 100% correct classification; PLSR: R2 = 0.987, RMSEC = 0.070, RMSEP = 1.99
[69] | Rice bran oil (1 to 50%) | FT-MIR, DTGS detector, ZnSe ATR crystal, range: 4000-650 cm-1, resolution: 4 cm-1 | LDA and PLS regression after normalisation | LDA: 100% correct classification; PLSR: R2 = 0.981, RMSECV = 1.34%, RMSEP = 2.15%
[70] | Maize and sunflower oils (1 to 50%) | FT-MIR, DTGS detector, ZnSe ATR crystal, range: 4000-650 cm-1, resolution: 4 cm-1 | DA and PLS regression after SG 1st derivative | DA: 100% correct classification; PLSR: R2 = 0.987 to 0.997, RMSEC = 0.034 to 0.404, RMSEP = 1.13 to 2.02
[71] | Rapeseed oil (1 to 50%) | FT-MIR, DTGS detector, ZnSe ATR crystal, range: 4000-650 cm-1, resolution: 4 cm-1 | DA and PLS regression after SG 1st derivative | DA: 97.4% correct classification; PLSR: R2 = 0.997, RMSEC = 0.108, RMSEP = 1.52
[72] | Camellia, soya, sunflower, maize oils (1 to 90%) | FT-MIR, DTGS detector, ZnSe ATR crystal, range: 4000-400 cm-1, resolution: 2 cm-1 | Nearest centroid classification after SLLE dimension reduction, PLS regression after mean centring, normalisation and SG 1st derivative | NCC: 96.6% correct classification; PLSR: R2 = 0.971 to 0.999, RMSECV = 0.095 to 0.017
[73] | Hazelnut oil (5 to 50%) | FT-MIR, MCTA detector, ZnSe ATR crystal, range: 3200-800 cm-1, resolution: 4 cm-1 | DA | 100% correct classification for hazelnut adulteration > 25%
[74] | Refined and crude hazelnut oils (1 to 90%) | FT-MIR, DTGS detector, diamond ATR crystal, range: 4000-550 cm-1, resolution: 4 cm-1 | kNN after SNV, SG smoothing and Continuous Locality Preserving Projections | 75% correct classification
[75] | Virgin coconut oil (1 to 50%) | FT-MIR, DTGS detector, ATR crystal, range: 4000-650 cm-1, resolution: 4 cm-1 | PLS regression | R2 = 0.997, RMSEC = 0.756, RMSEP = 0.823
[76] | Refined olive oil, walnut oil (0 to 22%) | FT-MIR, DTGS detector, ZnSe ATR crystal, range: 4800-800 cm-1, resolution: 4 cm-1 | PLS regression after mean centring and variance scaling | SEP = 0.68 to 0.92
[77] | Sunflower oil (2 to 10%) | FT-MIR, silver halide probe, range: 3000-600 cm-1, resolution: 4 cm-1 | PLS regression after variable selection | SEP = 1.2%
[78] | Rapeseed, cotton, maize, sunflower oils (2 to 20%) | FT-MIR, DTGS detector, ZnSe ATR crystal, range: 4000-650 cm-1, resolution: 2 cm-1 | PLS regression after mean centring and wavelet analysis | R2 = 0.93 to 0.98, SEP = 1.04 to 1.4, LOD = 5%
[79] | Olive pomace oil | FT-MIR, KBr disk, range: 4000-450 cm-1, resolution: 4 cm-1 | PLS regression | R2 = 0.98
[80] | Olive pomace, oleic and linoleic sunflower, rapeseed, hazelnut oils (5 to 40%) | FT-MIR, ZnSe ATR crystal, range: 4000-700 cm-1, resolution: 4 cm-1 | PLS regression after mean centring and SG 1st derivative | R2 = 0.935 to 0.999, SEP = 1.13 to 20.8%
[81] | Sunflower, high oleic sunflower, rapeseed oils (10 to 90%) | FT-MIR, diamond ATR crystal, range: 4000-600 cm-1, resolution: 2 cm-1 | QDA and PCR after mean-centring | QDA: 89% correct classification for binary and ternary mixtures; PCR: R2 = 0.992 to 0.998, RMSEP = 2.27% to 1.22%
[82] | Olive pomace, sunflower, soya, sesame, maize oils (2 to 90%) | FT-MIR, DTGS detector, KBr disks, range: 4000-400 cm-1, resolution: 4 cm-1 | Linear regression on the ratio of peak height 3006/2925 cm-1 | R2 = 0.991 to 0.996, LOD = 6 to 9%
[83] | Refined soya oil (10 to 90%) | FT-MIR, ATR crystal, range: 4000-400 cm-1, resolution: 4 cm-1 | Linear regression on the ratio of peak height 3006/2925 cm-1 | R2 = 0.998, LOD = 6%
[84] | Refined sunflower, soya, maize oils (25 to 100%) | FT-MIR, DTGS detector, KBr disks, range: 4000-400 cm-1, resolution: 4 cm-1 | Linear regression on the ratio of peak height 1118/1097 cm-1 | R2 = 0.963 to 0.985

6.4. Authentication of Geographical or Varietal Origins

As with NIR, the ability of FT-MIR spectroscopy to differentiate VOOs from various origins has been the subject of numerous research works, as can be seen in Table 6. EVOOs from three different Italian regions were correctly classified by Sinelli et al. [43] using PLS-DA, while Galtier et al. [85] discriminated virgin olive oils from France and other countries with the same technique. Moreover, PLS-DA allowed Galtier et al. [85] and Dupuy et al. [1] to reach correct classifications of 96% and 98% respectively between VOOs from the six French PDOs, with samples collected over several harvest years. Bevilacqua et al. [46] also correctly identified 92.3% of the samples from PDO Sabina versus other Mediterranean regions by applying PLS-DA to MIR data, even though NIR data provided better results. De Luca et al. [86] built a model based on PLS-DA after cluster analysis and variable selection by Martens test to separate VOOs from 4 Moroccan regions, and obtained satisfactory results with R2 over 0.986 and RMSEP under 0.049. LDA has also been used by several authors. For instance, Tapp et al. [87] applied it after variable selection by genetic algorithm (GA), resulting in a correct classification rate of 100% for the country of origin of VOO samples. Casale et al. [48] and Sinelli et al. [49] both obtained a correct classification of 86.6% between three Italian cultivars with LDA after variable selection, and Abdallah et al. [88] correctly classified 100% of the samples from seven Tunisian cultivars. Additionally, in this last study the concentrations of binary mixtures of cultivars were predicted by MLR, giving R2 over 0.956 and SEP under 3.88%. Although supposedly less efficient than discriminant analyses, SIMCA was applied by Gurdeniz and co-workers in several studies [89–91] and allowed the discrimination of Turkish olive oils according to their region of origin, harvest year and cultivar. PLS regression was also used to predict the concentrations of cultivars in binary mixtures with R2 between 0.84 and 0.91 and RMSEP between 3.14 and 20.9%. In another study, Casale et al. [53] developed a UNEQ model and achieved a correct classification of 92.5% between olive oils from PDO Chianti Classico and other Italian regions. This was, however, a less satisfactory result than that obtained with NIR data. Finally, SVM analyses were employed by Devos et al. [56] and Caetano et al. [92], resulting in mixed outcomes.
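Table 6 expresses the SVM results of Caetano et al. [92] as sensitivity and selectivity rather than as a single classification rate. The sketch below shows how such figures are commonly derived for one target class. It assumes that "selectivity" is used in the sense usually called specificity (the share of non-target samples correctly rejected), which is an interpretation and not something stated in the source; the labels and predictions are invented.

```python
def sensitivity_specificity(true_labels, predicted_labels, target):
    """Return (sensitivity, specificity) in percent for the class `target`."""
    tp = fn = tn = fp = 0
    for t, p in zip(true_labels, predicted_labels):
        if t == target:
            if p == target:
                tp += 1  # target sample accepted by the model
            else:
                fn += 1  # target sample wrongly rejected
        else:
            if p == target:
                fp += 1  # foreign sample wrongly accepted
            else:
                tn += 1  # foreign sample correctly rejected
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    return sensitivity, specificity


# Invented example: 5 oils from the target country and 5 from elsewhere.
y_true = ["Italy"] * 5 + ["other"] * 5
y_pred = ["Italy", "Italy", "Italy", "Italy", "other",
          "Italy", "Italy", "other", "other", "other"]

sens, spec = sensitivity_specificity(y_true, y_pred, target="Italy")
print(f"sensitivity = {sens:.1f}%, specificity = {spec:.1f}%")
```

A model can thus accept most genuine samples while still letting many foreign oils through, which is the kind of mixed outcome the two figures make visible.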

TABLE 6. EXAMPLES OF MIR SPECTROSCOPY APPLICATIONS TO DETERMINE THE ORIGIN OF VOOS

Ref | Origins | Materials | Chemometrics | Results
[1] | 6 French PDOs, 5 harvest years | FT-MIR, DTGS detector, diamond ATR crystal, range: 4000-600 cm-1, resolution: 4 cm-1 | PLS-DA after mean centring and normalisation | 98% correct classification for PDO
[43] | 3 Italian regions | FT-MIR, DTGS detector, Ge ATR crystal, range: 4000-700 cm-1, resolution: 4 cm-1 | PLS-DA after SG 2nd derivative | 100% correct classification
[46] | PDO Sabina and other Mediterranean regions, 2 harvest years | FT-MIR, DTGS detector, ZnSe ATR crystal, range: 4000-630 cm-1, resolution: 2 cm-1 | PLS-DA after MSC and detrend | 92.3% correct classification for Sabina, 95.5% for other origins
[48] | 3 cultivars from 3 Italian regions | FT-MIR, DTGS detector, Ge ATR crystal, range: 4000-700 cm-1, resolution: 4 cm-1 | LDA after SNV, SG 1st derivative and variable selection (SELECT) | 86.6% correct classification for cultivars
[49] | 3 cultivars from 3 Italian regions | FT-MIR, DTGS detector, Ge ATR crystal, range: 4000-700 cm-1, resolution: 4 cm-1 | LDA after SNV, SG 1st derivative and variable selection (SELECT) | 86.6% correct classification
[53] | PDO Chianti Classico and other Italian regions | FT-MIR, DTGS detector, Ge ATR crystal, range: 4000-700 cm-1, resolution: 4 cm-1 | UNEQ after SNV, SG 1st derivative and variable selection (SELECT) | 92.5% correct classification
[56] | Liguria and other Italian regions, 3 harvest years | FT-MIR, Ge ATR crystal, range: 4000-600 cm-1, resolution: 4 cm-1 | SVM after SG smoothing, SG 1st derivative and normalisation | 82.2% correct classification
[85] | 6 French PDOs and other countries, 4 harvest years | FT-MIR, DTGS detector, diamond ATR crystal, range: 4000-600 cm-1, resolution: 4 cm-1 | PLS-DA after MSC | 100% correct classification for France vs other countries, 96% correct classification for PDOs
[86] | 4 Moroccan regions | FT-MIR, DTGS detector, range: 4000-600 cm-1, resolution: 4 cm-1 | PLS-DA after variable selection by Martens test | R2 = 0.986 to 0.993, RMSEP = 3.55 to 4.90%
[87] | Spain, Italy, Greece, Portugal | FT-MIR, DTGS detector, ZnSe ATR crystal, range: 4000-800 cm-1, resolution: 4 cm-1 | LDA after variable selection by genetic algorithm | 100% correct classification
[88] | 7 Tunisian cultivars, 2 harvest years / binary mixtures | FT-MIR, ATR crystal, range: 4000-600 cm-1, resolution: 4 cm-1 | LDA and MLR (binary mixtures) after normalisation | LDA: 100% correct classification for cultivars; MLR: R2 = 0.956 to 0.998, RMSEC = 2.40 to 5.90, SEP = 1.09 to 3.88%
[89] | 3 Turkish cultivars / binary mixtures | FT-MIR, DTGS detector, ZnSe ATR crystal, range: 4000-650 cm-1, resolution: 2 cm-1 | PLS regression | R2 = 0.84 to 0.91, RMSE = 3.14 to 2.09%
[90] | 5 cultivars from 2 Turkish regions, 2 harvest years | FT-MIR, DTGS detector, ZnSe ATR crystal, range: 4000-650 cm-1, resolution: 2 cm-1 | Coomans plot on PCA after wavelet compression | R2 = 0.759 to 0.953 for geographical origin, effect of harvest year and cultivar
[91] | Turkey, 2 harvest years | FT-MIR, DTGS detector, ZnSe ATR crystal, range: 4000-650 cm-1, resolution: 2 cm-1 | SIMCA after Orthogonal Signal Correction and wavelet analysis | Discrimination for area of origin and harvest year
[92] | Italy, Greece, Spain, France, Turkey, Cyprus, 2 harvest years | FT-MIR, Ge ATR crystal, range: 4000-600 cm-1, resolution: 4 cm-1 | SVM after SG 1st derivative | 88.7 to 94.2% sensitivity and 50 to 76.9% selectivity for Italian vs other countries / 58.5 to 65.2% sensitivity and 91.4 to 94.8% selectivity for Ligurian vs other regions


7. Raman spectroscopy

7.1. Spectra Interpretation

The Raman spectrum of EVOO presented in Figure 7 gives complementary information compared to the MIR spectrum. Peak A (1750 cm-1) results from C=O stretching vibrations, and peak B (1660 cm-1) is related to cis C=C stretching. They correspond to the peaks D and E of the MIR spectrum, although their relative intensities are reversed. The two peaks labelled C (1450-1300 cm-1) are associated with C-H aliphatic bending vibrations, thus matching the region F of the MIR spectrum. Peak D, at 1270 cm-1, is attributed to =C-H bending vibrations of cis double bonds and is not identified on the MIR spectrum. Region E (1150-800 cm-1) is also characteristic of the Raman spectrum and related to C-C stretching vibrations [9, 11, 29, 30].

FIGURE 7. RAMAN SPECTRUM OF EVOO WITH IDENTIFICATION OF MAIN ABSORBANCE BANDS.

7.2. Identification of Virgin Olive Oils vs Other Oils

Although it is less frequently used than NIR or MIR, several authors have studied the potential of Raman spectroscopy to authenticate olive oils (Table 7). In this case, as for MIR, only FT-Raman instruments were used. Baeten et al. [93, 94] demonstrated the ability of Raman spectra to discriminate between various oils and fats, including VOO. SLDA indeed allowed the samples to be classified according to their saturated, mono-unsaturated and poly-unsaturated fatty acid content. In another study [65], SLDA on selected variables related to unsaponifiable matter gave a correct classification of 95% between refined olive oil and hazelnut oil, which is a similar result to that obtained with MIR data. Marigheto et al. [58] reached a correct classification rate of 93% for EVOO versus other vegetable oils with LDA after data compression by PCA, although the same method applied to MIR spectra correctly identified 100% of the samples. Similar results were obtained by Yang et al. [32] using CVA after Raman data treatment by PLS, which gave 94.4% correct classification.

TABLE 7. EXAMPLES OF RAMAN SPECTROSCOPY APPLICATIONS TO DIFFERENTIATE VOOS FROM OTHER OILS

Ref | Other oils | Materials | Chemometrics | Results
[32] | Butter, coconut, cod liver oil, lard, maize, peanut, rapeseed, safflower, soya | FT-Raman, laser: HeNe, 2 W, InGaAs detector, range: 3700-400 cm-1, resolution: 32 cm-1 | CVA after normalisation and data compression by PLS | 94.4% correct classification
[58] | Coconut, grapeseed, hazelnut, maize, mustard, palm, peanut, rapeseed, refined olive, safflower, sesame, soya, sunflower, sweet almond, walnut | FT-Raman, laser: Topaz, 1064 nm, 0.9 W, Ge detector, range: 3500-500 cm-1, resolution: 4 cm-1 | LDA after normalisation, baseline correction and data compression by PCA | 93% correct classification
[65] | Hazelnut | FT-Raman, laser: Nd:YAG, 1064 nm, 0.6 W, InGaAs detector, range: 4000-900 cm-1, resolution: 4 cm-1 | SLDA after SG smoothing, SG 1st derivative and selection of variables related to unsaponifiable matter | 95% correct classification
[93] | Almond, Brazil nut, butter, coconut, grapeseed, hazelnut, high oleic sunflower, hydrogenated fish, maize, margarine, palm, peanut, rapeseed, safflower, sesame, soya, sunflower, tallow, walnut | FT-Raman, laser: Nd:YAG, 1064 nm, 0.5 W, Ge detector, range: 3250-0 cm-1, resolution: 4 cm-1 | SLDA after SG smoothing, normalisation and variable selection | Classification by type of oil according to their fatty acid contents
[94] | Coconut, high oleic sunflower, hydrogenated fish, maize, palm, peanut, rapeseed, soya, sunflower, tallow | FT-Raman, laser: Nd:YAG, range: 3250-0 cm-1, resolution: 4 cm-1 | SLDA | Discrimination of oils depending on their fatty acid contents


7.3. Adulteration of Virgin Olive Oils with Other Oils

Table 8 presents some articles studying the ability of Raman spectroscopy to detect and quantify the adulteration of VOOs. A majority of these works used FT-Raman, but an interest in confocal benchtop and handheld instruments can also be noted. Marigheto et al. [58] employed Raman spectroscopy to detect the adulteration of EVOOs with different vegetable oils and reached a correct classification of 97% with PLSR, but these results were less satisfactory than with MIR spectra. Baeten et al. [65, 94] also showed that SLDA could discriminate genuine olive oil from adulterated samples, and even obtained a correct classification of 97.5% for samples of refined olive oil adulterated with as little as 2% of hazelnut oil. A method involving Raman measurements at increasing temperatures to enhance spectral differences between pure and adulterated samples was successfully tested by Kim et al. [95]. Temperatures of 80 and 90°C allowed a correct classification of 100% by applying LDA on the PCA scores of the spectra. Regarding quantitative analyses, several authors such as Mendes et al. [40], Yang and Irudayaraj [35], El-Abassy et al. [96], Davies et al. [97], Lopez-Diez et al. [98] and Heise et al. [99] applied PLS regression to Raman spectra to predict the concentrations of sunflower, soya, hazelnut or olive pomace oils added to VOO. They obtained quite satisfactory results, with R2 over 0.97 and SEP below 3.6%. Yang and Irudayaraj [35] concluded that Raman spectroscopy was slightly more efficient than NIR and MIR to quantify the adulteration of EVOO with olive pomace oil, whereas Mendes et al. [40] detected no statistically significant difference between the three techniques for the analysis of soya and olive oil mixtures. Baeten et al. [100] used stepwise linear regression analysis (SLRA) to measure the amount of trilinolein added to VOO, yielding an R2 of 0.998 for concentrations between 1 and 10% of adulterant. The same method applied to VOOs adulterated with maize, soya or olive pomace oils gave an R2 of 0.92. Zhang et al. [101] developed an external standard method (ESM) resulting in R2 over 0.99 and RMSE below 3.2%, while Dong et al. [102] generated an LS-SVM model after parameter optimisation within a Bayesian framework that gave an R2 of 0.997 and RMSEP of 0.051.
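The figures of merit quoted for these regression models (R2, RMSEP, SEP) can be computed from the reference and predicted concentrations of a validation set. The sketch below uses common chemometric definitions: RMSEP as the root mean square of the prediction residuals, SEP as their bias-corrected standard deviation, and R2 as the coefficient of determination. These definitions are assumptions, since the cited papers do not all use the same conventions (some report R2 as a squared correlation coefficient, for example), and the data are invented.

```python
import math


def prediction_metrics(reference, predicted):
    """Return a dict with bias, RMSEP, SEP and R2 for a validation set."""
    n = len(reference)
    residuals = [p - r for r, p in zip(reference, predicted)]
    bias = sum(residuals) / n
    # RMSEP: root mean square error of prediction (includes the bias).
    rmsep = math.sqrt(sum(e ** 2 for e in residuals) / n)
    # SEP: standard deviation of the residuals once the bias is removed.
    sep = math.sqrt(sum((e - bias) ** 2 for e in residuals) / (n - 1))
    mean_ref = sum(reference) / n
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    r2 = 1.0 - sum(e ** 2 for e in residuals) / ss_tot
    return {"bias": bias, "rmsep": rmsep, "sep": sep, "r2": r2}


# Invented validation set: adulterant content (%) known vs predicted.
reference = [2.0, 4.0, 6.0, 8.0, 10.0]
predicted = [2.5, 3.5, 6.5, 7.5, 10.5]

metrics = prediction_metrics(reference, predicted)
for name, value in metrics.items():
    print(f"{name} = {value:.4f}")
```

Because RMSEP and SEP differ only by the bias term, they are close whenever the model is nearly unbiased, which makes values reported under either name roughly comparable across the studies in Table 8.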


TABLE 8. EXAMPLES OF RAMAN SPECTROSCOPY APPLICATIONS TO ANALYSE VOOS ADULTERATED WITH OTHER OILS

Ref | Adulterants | Materials | Chemometrics | Results
[40] | Soya oil (1.5 to 100%) | FT-Raman, laser: Nd:YAG, 1064 nm, 0.2 W, Ge detector, range: 3500-50 cm-1, resolution: 4 cm-1 | PLS regression | R2 = 0.998, RMSECV = 1.61, RMSEP = 1.57
[35] | Olive pomace oil (0 to 100% in 5% increments) | FT-Raman, laser: 1064 nm, 0.5 W, InGaAs detector, range: 4000-400 cm-1, resolution: 8 cm-1 | PLS regression after MSC | R2 = 0.997, SECV = 2.23%, SEP = 1.72%
[58] | Refined olive oil, sunflower, rapeseed, peanut, soya, maize oils (5 to 45%) | FT-Raman, laser: Topaz, 1064 nm, 0.9 W, Ge detector, range: 3500-500 cm-1, resolution: 4 cm-1 | PLS after normalisation, baseline correction and data compression by PCA | 97% correct classification, LOD = 45% for refined olive oil, 5% for others
[65] | Hazelnut oil (2 to 20%) | FT-Raman, laser: Nd:YAG, 1064 nm, 0.6 W, InGaAs detector, range: 4000-900 cm-1, resolution: 4 cm-1 | SLDA after SG smoothing, SG 1st derivative and selection of variables related to unsaponifiable matter | 97.5% correct classification
[94] | Olive pomace oil, maize, sunflower, soya oils (1 to 10%) | FT-Raman, laser: Nd:YAG, range: 3250-0 cm-1, resolution: 4 cm-1 | SLDA | Discrimination of genuine vs adulterated samples
[95] | Soya oil (5%) | Raman, laser: 785 nm, 0.1 W, 8 temperatures (20 to 90°C), range: 1500-690 cm-1, resolution: 4 cm-1 | LDA after normalisation, baseline correction and data compression by PCA | 80 or 90°C gives 100% correct classification
[96] | Sunflower oil (5 to 100%) | Raman, laser: Ar, 514 nm, 0.01 W, CCD detector, range: 3100-700 cm-1 | PLS regression after baseline correction | R2 = 0.971 to 0.988, RMSECV = 1.33 to 3.59, LOD = 0.05%
[97] | Sunflower oil (2 to 10%) | FT-Raman, laser: Nd:YAG, 1064 nm, 1 W, range: 3600-100 cm-1 | PLS regression | RMSEC = 2.40%, RMSEP = 2.86%
[98] | Hazelnut oil (5 to 100%) | Raman, laser: 780 nm, 0.02 W, range: 3000-1000 cm-1, resolution: 6 cm-1 | PLS regression after baseline correction, normalisation and SG smoothing | R2 = 0.979, RMSEP = 0.94
[99] | Sunflower oil (1 to 10%) | FT-Raman, laser: Nd:YAG, 1064 nm, 1 W, resolution: 4 cm-1 | PLS regression after SG 1st derivative and variable selection | SEP = 1.26%
[100] | Trilinolein, olive pomace, maize, soya oils (1 to 10%) | FT-Raman, laser: Nd:YAG, 1064 nm, 0.5 W, Ge detector, range: 3250-100 cm-1, resolution: 4 cm-1 | SLRA after SG smoothing, SG 1st derivative and variable selection | R2 = 0.998 for trilinolein, R2 = 0.92 for oils
[101] | Soya, sunflower, maize oils (1 to 100%) | Handheld Raman, laser: 785 nm, 0.2 W, range: 2000-200 cm-1, resolution: 8 cm-1 | External standard method after normalisation | R2 = 0.996 to 0.991, RMSE = 1.40 to 3.13%
[102] | Soya, maize, sunflower oils (2 to 100%) | Handheld Raman, laser: 785 nm, 0.375 W, 10 mm quartz cell, range: 2100-150 cm-1, resolution: 6 cm-1 | LS-SVM with Bayesian network | R2 = 0.997, RMSEC = 0.020, RMSEP = 0.051
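The Results column of Table 8 reports figures of merit such as R2, RMSEC, RMSECV and RMSEP. These share the same formula and differ only in the samples used (calibration, cross-validation or external prediction set). The following is a minimal sketch of the usual definitions; the cited studies may use slightly different conventions, and the function names and numerical values are hypothetical, chosen for illustration only.

```python
import math


def rmse(reference, predicted):
    """Root mean square error between reference and predicted values."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(reference, predicted):
    """Coefficient of determination, computed as 1 - SS_res / SS_tot."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot


# Hypothetical adulteration levels (%) and model predictions, not taken from any cited study.
reference = [0.0, 10.0, 20.0, 30.0, 40.0]
predicted = [1.0, 9.0, 21.0, 29.0, 41.0]

print(rmse(reference, predicted))
print(r_squared(reference, predicted))
```

Applied to an external prediction set, `rmse` corresponds to RMSEP; applied to cross-validation predictions, it corresponds to RMSECV.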

7.4. Authentication of Geographical or Varietal Origins

Few studies have been published on the confirmation of the declared geographical origin or cultivar of VOOs with Raman spectroscopy, all of them using confocal instruments, as shown in Table 9. Korifi et al. [2] applied PLS-DA to Raman spectra and obtained 92.3% correct classification for six French PDOs, with samples collected over several harvest years. With a similar method, Sánchez-López et al. [103] obtained 86.6% correct classification for three Andalusian PDOs. In the same study, PLS-DA on Raman data also discriminated the EVOOs according to their harvest year, region of origin and olive variety, with correct classification rates of 94.3%, 89% and 84% respectively. Finally, Gouvinhas et al. [104] used LDA to correctly classify 81.9% of Portuguese EVOO samples according to their maturation stage.

TABLE 9. EXAMPLES OF RAMAN SPECTROSCOPY APPLICATIONS TO DETERMINE THE ORIGIN OF VOOS

Ref | Origins | Materials | Chemometrics | Results
[2] | 6 French PDOs, 6 harvest years | Raman, laser: Nd:YVO4 DPSS, 532 nm, 0.15 W, CCD detector, range: 1800-440 cm-1 | PLS-DA after SNV and MSC | 92.3% correct classification for PDOs
[103] | 3 Andalusian PDOs and other Spanish regions, 6 harvest years | Raman, laser: Nd:YAG, 1064 nm, 0.3 W, range: 3100-100 cm-1, resolution: 4 cm-1 | PLS-DA after SG smoothing and normalisation | 94.3% correct classification for harvest year, 89% for geographical origin, 86.6% for PDOs, 84% for olive variety
[104] | 3 Portuguese cultivars, 3 maturity stages | Raman, laser: Ar, 488 nm, 0.1 W, CCD detector, range: 3050-250 cm-1 | LDA after SNV and data compression by PCA | 81.9% correct classification for maturation stage

8. Multiblock analysis - concatenation of spectral data

8.1. Adulteration of Virgin Olive Oils with Other Oils

Two studies combining data from several analytical methods have recently been published and are presented in Table 10. Wójcicki et al. [41] applied PLS regression to concatenated data from NIR, MIR, visible and fluorescence spectra, yielding an R2 of 0.96 and an RMSEP of 4.1%. However, these results showed no significant improvement compared to those obtained with the separate spectra. On the other hand, Nigri and Oumeddour [105] obtained better results with concatenated MIR and fluorescence data than with the individual datasets. In this case, PLS regression gave an R2 of 0.992 and an RMSECV of 2.67.

TABLE 10. EXAMPLES OF CONCATENATED DATA APPLICATIONS TO ANALYSE VOOS ADULTERATED WITH OTHER OILS

Ref | Adulterants | Materials | Chemometrics | Results
[41] | Mild deodorised and refined olive oils (2.5 to 75%) | NIR, 2 mm quartz cell, range: 6150-4500 cm-1; FT-MIR, ATR crystal, range: 4000-650 cm-1, resolution: 4 cm-1; Fluorescence, 10 mm quartz cell, range: 40000-14285 cm-1 | PLS regression | No improvement vs separate spectra, R2 = 0.96, RMSEP = 4.1%
[105] | Sunflower, olive pomace oils (5 to 50%) | FT-MIR, DTGS detector, KBr disks, range: 4000-450 cm-1, resolution: 4 cm-1; Fluorescence, xenon lamp source, 10 mm quartz cell, range: 45455-11110 cm-1 | PLS regression after normalisation and SG 1st derivative | Better results vs separate spectra, R2 = 0.992, RMSECV = 2.67
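The concatenation step behind the studies in Table 10 can be sketched as follows. This is a minimal illustration and not the procedure of any cited study: it assumes that each block is mean-centred and scaled to unit Frobenius norm before being joined, which is one common way to stop the block with the most variables from dominating the model. The function names and toy data are hypothetical.

```python
import math


def block_scale(block):
    """Mean-centre each variable, then divide the whole block by its Frobenius norm."""
    n_samples = len(block)
    n_vars = len(block[0])
    means = [sum(row[j] for row in block) / n_samples for j in range(n_vars)]
    centred = [[row[j] - means[j] for j in range(n_vars)] for row in block]
    norm = math.sqrt(sum(v * v for row in centred for v in row))
    if norm == 0.0:
        raise ValueError("block has no variance")
    return [[v / norm for v in row] for row in centred]


def concatenate_blocks(*blocks):
    """Low-level data fusion: join the scaled blocks sample by sample."""
    n_samples = len(blocks[0])
    if any(len(b) != n_samples for b in blocks):
        raise ValueError("all blocks must describe the same samples")
    scaled = [block_scale(b) for b in blocks]
    return [sum((blk[i] for blk in scaled), []) for i in range(n_samples)]


# Hypothetical toy data: 3 samples, 4 NIR variables and 2 MIR variables.
nir = [[0.10, 0.20, 0.30, 0.40], [0.20, 0.10, 0.40, 0.30], [0.30, 0.30, 0.20, 0.50]]
mir = [[10.0, 20.0], [12.0, 18.0], [11.0, 25.0]]
fused = concatenate_blocks(nir, mir)

print(len(fused), len(fused[0]))
```

The fused matrix would then be passed to the regression or classification model (PLS, PLS-DA, LDA) in place of a single-technique spectrum.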

8.2. Authentication of Geographical or Varietal Origins

Diverging conclusions have been drawn regarding the usefulness of spectral data concatenation for the authentication of virgin olive oils, as can be seen in the articles from Table 11. Harrington et al. [106] reached 100% correct classification between oils from five French PDOs by applying Principal Component Orthogonal Signal Correction (PC-OSC) and PLS-DA to fused NIR and MIR data. However, this result was not compared with that obtained with each technique alone. In another study, Dupuy et al. [1] obtained 99% correct classification for six French PDOs with PLS-DA on concatenated NIR and MIR spectra, but this did not significantly improve the result compared to MIR data alone. In contrast, in three different articles [48, 53, 107], Casale et al. obtained an improved rate of correct classification by combining data from various analytical instruments. For instance, LDA on fused NIR and MIR spectra gave a correct classification rate of 90.2% for three Italian cultivars, versus 82.9% and 86.6% for NIR and MIR data alone, respectively [48]. UNEQ class modelling applied to combined NIR, MIR and UV-visible spectral data gave a correct classification of 100% for PDO olive oil Chianti Classico and improved the predictive ability of the model [53]. Concatenation of NIR, UV-visible and MS data also resulted in 100% discrimination between EVOOs from Liguria and other Italian regions, which was not possible with each separate technique [107].

TABLE 11. EXAMPLES OF CONCATENATED DATA APPLICATIONS TO DETERMINE THE ORIGIN OF VOOS

Ref | Origins | Materials | Chemometrics | Results
[1] | 6 French PDOs, 5 harvest years | FT-NIR, 2 mm quartz cell, range: 10000-4500 cm-1, resolution: 4 cm-1; FT-MIR, DTGS detector, diamond ATR crystal, range: 4000-600 cm-1, resolution: 4 cm-1 | PLS-DA after mean centring and normalisation | No improvement vs separate spectra, 99% correct classification for PDO
[48] | 3 cultivars from 3 Italian regions | FT-NIR, 8 mm vials, range: 12500-4500 cm-1, resolution: 8 cm-1; FT-MIR, DTGS detector, Ge ATR crystal, range: 4000-700 cm-1, resolution: 4 cm-1 | LDA after SNV, SG 1st derivative and variable selection (SELECT) | Better results vs separate spectra, 90.2% correct classification for cultivars
[53] | PDO Chianti Classico and other Italian regions | FT-NIR, 5 mm quartz cell, range: 10000-4000 cm-1, resolution: 4 cm-1; FT-MIR, DTGS detector, Ge ATR crystal, range: 4000-700 cm-1, resolution: 4 cm-1; UV-Visible, 5 mm quartz cell, range: 52360-9090 cm-1 | UNEQ after SNV, SG 1st derivative and variable selection (SELECT) | Better results vs separate spectra, 100% correct classification
[106] | 5 French PDOs, 5 harvest years | FT-NIR, 2 mm quartz cell, range: 10000-4500 cm-1, resolution: 4 cm-1; FT-MIR, DTGS detector, diamond ATR crystal, range: 4000-600 cm-1, resolution: 4 cm-1 | PLS2-DA after PC-OSC | 100% correct classification for PDO
[107] | Liguria and other regions | FT-NIR, 5 mm quartz cell, range: 10000-4000 cm-1, resolution: 4 cm-1, transmittance mode; Headspace mass spectrometer; UV-Visible, 10 mm quartz cell, range: 52630-9090 cm-1 | UNEQ-DA after SG 1st derivative and variable selection (SELECT) | Better results vs separate data, 100% correct classification

9. Conclusion

Bibliometric results show a growing interest in vibrational spectroscopic techniques as alternative methods for the authentication of VOOs and EVOOs. The ability of NIR, MIR and Raman spectroscopies to detect and quantify the adulteration of VOOs with cheaper oils, and to identify the geographical or varietal origins of samples, has been demonstrated in numerous research works. Even though MIR is more often studied, no significant difference appears in the quality of the results obtained with the three techniques. This apparent preference may therefore be due to the greater availability of MIR instruments. Despite these promising results, vibrational spectroscopic techniques are not currently recognised as reference analytical methods by international standards and regulations. The importance of chemometric pre-treatment and modelling, which make it possible to process the large amount of complex data generated by vibrational spectroscopic analyses, should also be noted. Indeed, NIR, MIR and Raman spectra represent “fingerprints” of the samples, and only chemometrics can reveal the slight differences between the spectra of two VOOs. In the future, more studies could focus on multiblock models to explore the value of combining complementary information from several analytical techniques. The use of on-line instruments, for instance with fibre optic probes, could be an interesting way to monitor the varietal origin and quality parameters during production. However, the issue of NIR, MIR and Raman instrumental drift should be addressed if these instruments are to be used on a routine basis.

References

[1] Dupuy, N., Galtier, O., Ollivier, D., Vanloot, P., & Artaud, J. (2010). Comparison between NIR, MIR, concatenated NIR and MIR analysis and hierarchical PLS model. Application to virgin olive oil analysis. Anal. Chim. Acta, 666:23–31.
[2] Korifi, R., Le Dréau, Y., Molinet, J., Artaud, J., & Dupuy, N. (2011). Composition and authentication of virgin olive oil from French PDO regions by chemometric treatment of Raman spectra. J. Raman Spectrosc., 42:1540–1547.
[3] Gómez-Caravaca, A.M., Maggio, R.M., & Cerretani, L. (2016). Chemometric applications to assess quality and critical parameters of virgin and extra-virgin olive oil. A review. Anal. Chim. Acta, 913:1–21.
[4] Messai, H., Farman, M., Sarraj-Laabidi, A., Hammami-Semmar, A., & Semmar, N. (2016). Chemometrics Methods for Specificity, Authenticity and Traceability Analysis of Olive Oils: Principles, Classifications and Applications. Foods, 5:77.
[5] Nenadis, N., & Tsimidou, M.Z. (2016). Perspective of vibrational spectroscopy analytical methods in on-field/official control of olives and virgin olive oil. Eur. J. Lipid Sci. Technol.
[6] Sørensen, K.M., Khakimov, B., & Engelsen, S.B. (2016). The use of rapid spectroscopic screening methods to detect adulteration of food raw materials and ingredients. Curr. Opin. Food Sci., 10:45–51.
[7] Valli, E., Bendini, A., Berardinelli, A., Ragni, L., Riccò, B., Grossi, M. et al. (2016). Rapid and innovative instrumental approaches for quality and authenticity of olive oils: Innovative approaches for quality of virgin olive oils. Eur. J. Lipid Sci. Technol., 118:1601–1619.
[8] Wang, P., Sun, J., Zhang, T., & Liu, W. (2016). Vibrational spectroscopic approaches for the quality evaluation and authentication of virgin olive oil. Appl. Spectrosc. Rev., 51:763–790.
[9] Downey, G. (2016). Advances in food authenticity testing. Boston, MA: Elsevier.
[10] Sun, D.-W. (2008). Modern techniques for food authentication. Boston, MA: Elsevier/Academic Press.
[11] Aparicio, R., & Harwood, J. (2013). Handbook of Olive Oil. Boston, MA: Springer US.
[12] Wold, S. (1974). Spline Functions in Data Analysis. Technometrics, 16:1–11.
[13] Kowalski, B.R. (1975). Chemometrics: Views and Propositions. J. Chem. Inf. Model., 15:201–203.

[14] Barclay, V.J., Bonner, R.F., & Hamilton, I.P. (1997). Application of Wavelet Transforms to Experimental Spectra: Smoothing, Denoising, and Data Set Compression. Anal. Chem., 69:78–90.
[15] Savitzky, A., & Golay, M.J.E. (1964). Smoothing and Differentiation of Data by Simplified Least Squares Procedures. Anal. Chem., 36:1627–1639.
[16] Barnes, R.J., Dhanoa, M.S., & Lister, S.J. (1989). Standard Normal Variate Transformation and De-trending of Near-Infrared Diffuse Reflectance Spectra. Appl. Spectrosc., 43:772–777.
[17] Geladi, P., MacDougall, D., & Martens, H. (1985). Linearization and Scatter-Correction for Near-Infrared Reflectance Spectra of Meat. Appl. Spectrosc., 39:491–500.
[18] Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. J. Educ. Psychol., 24:417–441.
[19] Wold, S., Esbensen, K., & Geladi, P. (1987). Principal component analysis. Chemom. Intell. Lab. Syst., 2:37–52.
[20] Bridges, C.C. (1966). Hierarchical Cluster Analysis. Psychol. Rep., 18:851–854.
[21] Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Ann. Eugen., 7:179–188.
[22] Wold, S., & Sjöström, M. (1977). SIMCA: A Method for Analyzing Chemical Data in Terms of Similarity and Analogy. In: Kowalski BR, editor. Chemom. Theory Appl., Washington, D.C.: American Chemical Society, 52:243–282.
[23] Kohonen, T. (1990). The self-organizing map. Proc. IEEE, 78:1464–1480.
[24] Brown, C.W., Lynch, P.F., Obremski, R.J., & Lavery, D.S. (1982). Matrix representations and criteria for selecting analytical wavelengths for multicomponent spectroscopic analysis. Anal. Chem., 54:1472–1479.
[25] Geladi, P., & Kowalski, B.R. (1986). Partial least-squares regression: a tutorial. Anal. Chim. Acta, 185:1–17.
[26] Jolliffe, I.T. (1982). A Note on the Use of Principal Components in Regression. Appl. Stat., 31:300.
[27] Cortes, C., & Vapnik, V. (1995). Support-vector networks. Mach. Learn., 20:273–297.
[28] Engel, J., Gerretzen, J., Szymańska, E., Jansen, J.J., Downey, G., Blanchet, L. et al. (2013). Breaking with trends in pre-processing? Trends Anal. Chem., 50:96–106.
[29] Baeten, V., & Dardenne, P. (2002). Spectroscopy: developments in instrumentation and analysis. Grasas Aceites, 53:45–63.

[30] Bertrand, D., & Dufour, É. (2006). La spectroscopie infrarouge et ses applications analytiques. 2e ed. Paris, France: Lavoisier [Infrared spectroscopy and its analytical applications. 2nd ed. Paris, France: Lavoisier].
[31] Hourant, P., Baeten, V., Morales, M.T., Meurens, M., & Aparicio, R. (2000). Oil and Fat Classification by Selected Bands of Near-Infrared Spectroscopy. Appl. Spectrosc., 54:1168–1174.
[32] Yang, H., Irudayaraj, J., & Paradkar, M.M. (2005). Discriminant analysis of edible oils and fats by FTIR, FT-NIR and FT-Raman spectroscopy. Food Chem., 93:25–32.
[33] Mignani, A.G., Ciaccheri, L., Ottevaere, H., Thienpont, H., Conte, L., Marega, M. et al. (2011). Visible and near-infrared absorption spectroscopy by an integrating sphere and optical fibers for quantifying and discriminating the adulteration of extra virgin olive oil from Tuscany. Anal. Bioanal. Chem., 399:1315–1324.
[34] Azizian, H., Mossoba, M.M., Fardin-Kia, A.R., Delmonte, P., Karunathilaka, S.R., & Kramer, J.K.G. (2015). Novel, Rapid Identification, and Quantification of Adulterants in Extra Virgin Olive Oil Using Near-Infrared Spectroscopy and Chemometrics. Lipids, 50:705–718.
[35] Yang, H., & Irudayaraj, J. (2001). Comparison of near-infrared, Fourier transform-infrared, and Fourier transform-Raman methods for determining olive pomace oil adulteration in extra virgin olive oil. J. Am. Oil Chem. Soc., 78:889–895.
[36] Downey, G., McIntyre, P., & Davies, A.N. (2002). Detecting and Quantifying Sunflower Oil Adulteration in Extra Virgin Olive Oils from the Eastern Mediterranean by Visible and Near-Infrared Spectroscopy. J. Agric. Food Chem., 50:5520–5525.
[37] Karunathilaka, S.R., Kia, A.-R.F., Srigley, C., Chung, J.K., & Mossoba, M.M. (2016). Nontargeted, Rapid Screening of Extra Virgin Olive Oil Products for Authenticity Using Near-Infrared Spectroscopy in Combination with Conformity Index and Multivariate Statistical Analyses. J. Food Sci., 81:C2390–2397.
[38] Wesley, I.J., Barnes, R.J., & McGill, A.E.J. (1995). Measurement of adulteration of olive oils by near-infrared spectroscopy. J. Am. Oil Chem. Soc., 72:289–292.
[39] Christy, A.A., Kasemsumran, S., Du, Y., & Ozaki, Y. (2004). The Detection and Quantification of Adulteration in Olive Oil by Near-Infrared Spectroscopy and Chemometrics. Anal. Sci., 20:935–940.
[40] Mendes, T.O., da Rocha, R.A., Porto, B.L.S., de Oliveira, M.A.L., dos Anjos, V. de C., & Bell, M.J.V. (2015). Quantification of Extra-virgin Olive Oil Adulteration with Soybean Oil: a Comparative Study of NIR, MIR, and Raman Spectroscopy Associated with Chemometric Approaches. Food Anal. Methods, 8:2339–2346.
[41] Wójcicki, K., Khmelinskii, I., Sikorski, M., Caponio, F., Paradiso, V.M., Summo, C. et al. (2015). Spectroscopic techniques and chemometrics in analysis of blends of extra virgin with refined and mild deodorized olive oils: Spectroscopic techniques and chemometrics in analysis of blends of olive oils. Eur. J. Lipid Sci. Technol., 117:92–102.
[42] Ozdemir, D., & Ozturk, B. (2007). Near infrared spectroscopic determination of olive oil adulteration with sunflower and corn oil. J. Food Drug Anal., 15:40.
[43] Sinelli, N., Casiraghi, E., Tura, D., & Downey, G. (2008). Characterisation and classification of Italian virgin olive oils by near- and mid-infrared spectroscopy. J. Infrared Spectrosc., 16:335.
[44] Woodcock, T., Downey, G., & O’Donnell, C.P. (2008). Confirmation of Declared Provenance of European Extra Virgin Olive Oil Samples by NIR Spectroscopy. J. Agric. Food Chem., 56:11520–11525.
[45] Galtier, O., Abbas, O., Le Dréau, Y., Rebufa, C., Kister, J., Artaud, J. et al. (2011). Comparison of PLS1-DA, PLS2-DA and SIMCA for classification by origin of crude petroleum oils by MIR and virgin olive oils by NIR for different spectral regions. Vib. Spectrosc., 55:132–140.
[46] Bevilacqua, M., Bucci, R., Magrì, A.D., Magrì, A.L., & Marini, F. (2012). Tracing the origin of extra virgin olive oils by infrared spectroscopy and chemometrics: A case study. Anal. Chim. Acta, 717:39–51.
[47] Downey, G., McIntyre, P., & Davies, A.N. (2003). Geographic Classification of Extra Virgin Olive Oils From the Eastern Mediterranean by Chemometric Analysis of Visible and Near-Infrared Spectroscopic Data. Appl. Spectrosc., 57:158–163.
[48] Casale, M., Sinelli, N., Oliveri, P., Di Egidio, V., & Lanteri, S. (2010). Chemometrical strategies for feature selection and data compression applied to NIR and MIR spectra of extra virgin olive oils for cultivar identification. Talanta, 80:1832–1837.
[49] Sinelli, N., Casale, M., Di Egidio, V., Oliveri, P., Bassi, D., Tura, D. et al. (2010). Varietal discrimination of extra virgin olive oils by near and mid infrared spectroscopy. Food Res. Int., 43:2126–2131.
[50] Casale, M., Casolino, C., Ferrari, G., & Forina, M. (2008). Near infrared spectroscopy and class modelling techniques for the geographical authentication of Ligurian extra virgin olive oil. J. Infrared Spectrosc., 16:39.

[51] Oliveri, P., Di Egidio, V., Woodcock, T., & Downey, G. (2011). Application of class-modelling techniques to near infrared data for food authentication purposes. Food Chem., 125:1450–1456.
[52] Laroussi-Mezghani, S., Vanloot, P., Molinet, J., Dupuy, N., Hammami, M., Grati-Kamoun, N. et al. (2015). Authentication of Tunisian virgin olive oils by chemometric analysis of fatty acid compositions and NIR spectra. Comparison with Maghrebian and French virgin olive oils. Food Chem., 173:122–132.
[53] Casale, M., Oliveri, P., Casolino, C., Sinelli, N., Zunin, P., Armanino, C. et al. (2012). Characterisation of PDO olive oil Chianti Classico by non-selective (UV–visible, NIR and MIR spectroscopy) and selective (fatty acid composition) analytical techniques. Anal. Chim. Acta, 712:56–63.
[54] Forina, M., Oliveri, P., Bagnasco, L., Simonetti, R., Casolino, M.C., Nizzi Grifi, F. et al. (2015). Artificial nose, NIR and UV–visible spectroscopy for the characterisation of the PDO Chianti Classico olive oil. Talanta, 144:1070–1078.
[55] Oliveri, P., Casale, M., Casolino, M.C., Baldo, M.A., Nizzi Grifi, F., & Forina, M. (2011). Comparison between classical and innovative class-modelling techniques for the characterisation of a PDO olive oil. Anal. Bioanal. Chem., 399:2105–2113.
[56] Devos, O., Downey, G., & Duponchel, L. (2014). Simultaneous data pre-processing and SVM classification model selection based on a parallel genetic algorithm applied to spectroscopic data of olive oils. Food Chem., 148:124–130.
[57] Lai, Y.W., Kemsley, E.K., & Wilson, R.H. (1994). Potential of Fourier Transform Infrared Spectroscopy for the Authentication of Vegetable Oils. J. Agric. Food Chem., 42:1154–1159.
[58] Marigheto, N.A., Kemsley, E.K., Defernez, M., & Wilson, R.H. (1998). A comparison of mid-infrared and Raman spectroscopies for the authentication of edible oils. J. Am. Oil Chem. Soc., 75:987–992.
[59] Tay, A., Singh, R.K., Krishnan, S.S., & Gore, J.P. (2002). Authentication of Olive Oil Adulterated with Vegetable Oils Using Fourier Transform Infrared Spectroscopy. Leb.-Wiss U-Technol., 35:99–103.
[60] Obeidat, S.M., Khanfar, M.S., & Obeidat, W.M. (2009). Classification of edible oils and uncovering adulteration of virgin olive oil using FTIR with the aid of chemometrics. Aust. J. Basic Appl. Sci., 3:2048–2053.

[61] Lerma-García, M.J., Ramis-Ramos, G., Herrero-Martínez, J.M., & Simó-Alfonso, E.F. (2010). Authentication of extra virgin olive oils by Fourier-transform infrared spectroscopy. Food Chem., 118:78–83.
[62] de la Mata, P., Dominguez-Vidal, A., Bosque-Sendra, J.M., Ruiz-Medina, A., Cuadros-Rodríguez, L., & Ayora-Cañada, M.J. (2012). Olive oil assessment in edible oil blends by means of ATR-FTIR and chemometrics. Food Control, 23:449–455.
[63] Javidnia, K., Parish, M., Karimi, S., & Hemmateenejad, B. (2013). Discrimination of edible oils and fats by combination of multivariate pattern recognition and FT-IR spectroscopy: A comparative study between different modeling methods. Spectrochim. Acta. A. Mol. Biomol. Spectrosc., 104:175–181.
[64] Baeten, V., & Novi, M. (2008). The use of FT-MIR spectroscopy and counter-propagation artificial neural networks for tracing the adulteration of olive oil. Acta Chim. Slov., 55:935–941.
[65] Baeten, V., Fernández Pierna, J.A., Dardenne, P., Meurens, M., García-González, D.L., & Aparicio-Ruiz, R. (2005). Detection of the Presence of Hazelnut Oil in Olive Oil by FT-Raman and FT-MIR Spectroscopy. J. Agric. Food Chem., 53:6201–6206.
[66] Oussama, A., Elabadi, F., Platikanov, S., Kzaiber, F., & Tauler, R. (2012). Detection of Olive Oil Adulteration Using FT-IR Spectroscopy and PLS with Variable Importance of Projection (VIP) Scores. J. Am. Oil Chem. Soc., 89:1807–1812.
[67] Rohman, A., & Man, Y.B.C. (2010). Fourier transform infrared (FTIR) spectroscopy for analysis of extra virgin olive oil adulterated with palm oil. Food Res. Int., 43:886–892.
[68] Rohman, A., Che Man, Y.B., Hashim, P., & Ismail, A. (2011). FTIR spectroscopy combined with chemometrics for analysis of lard adulteration in some vegetable oils. CyTA - J. Food, 9:96–101.
[69] Rohman, A., & Man, Y.B.C. (2012). The chemometrics approach applied to FTIR spectral data for the analysis of rice bran oil in extra virgin olive oil. Chemom. Intell. Lab. Syst., 110:129–134.
[70] Rohman, A., & Che Man, Y.B. (2012). Quantification and Classification of Corn and Sunflower Oils as Adulterants in Olive Oil Using Chemometrics and FTIR Spectra. Sci. World J., 2012:1–6.
[71] Rohman, A., Che Man, Y.B., & Yusof, F.M. (2014). The Use of FTIR Spectroscopy and Chemometrics for Rapid Authentication of Extra Virgin Olive Oil. J. Am. Oil Chem. Soc., 91:207–213.

[72] Sun, X., Lin, W., Li, X., Shen, Q., & Luo, H. (2015). Detection and quantification of extra virgin olive oil adulteration with edible oils by FT-IR spectroscopy and chemometrics. Anal. Methods, 7:3939–3945.
[73] Ozen, B.F., & Mauer, L.J. (2002). Detection of Hazelnut Oil Adulteration Using FT-IR Spectroscopy. J. Agric. Food Chem., 50:3898–3901.
[74] Georgouli, K., Martinez Del Rincon, J., & Koidis, A. (2017). Continuous statistical modelling for rapid detection of adulteration of extra virgin olive oil using mid infrared and Raman spectroscopic data. Food Chem., 217:735–742.
[75] Rohman, A., Che Man, Y.B., Ismail, A., & Hashim, P. (2010). Application of FTIR Spectroscopy for the Determination of Virgin Coconut Oil in Binary Mixtures with Olive Oil and Palm Oil. J. Am. Oil Chem. Soc., 87:601–606.
[76] Lai, Y.W., Kemsley, E.K., & Wilson, R.H. (1995). Quantitative analysis of potential adulterants of extra virgin olive oil using infrared spectroscopy. Food Chem., 53:95–98.
[77] Küpper, L., Heise, H.M., Lampen, P., Davies, A.N., & McIntyre, P. (2001). Authentication and Quantification of Extra Virgin Olive Oils by Attenuated Total Reflectance Infrared Spectroscopy Using Silver Halide Fiber Probes and Partial Least-Squares Calibration. Appl. Spectrosc., 55:563–570.
[78] Gurdeniz, G., & Ozen, B. (2009). Detection of adulteration of extra-virgin olive oil by chemometric analysis of mid-infrared spectral data. Food Chem., 116:519–525.
[79] Nigri, S., & Oumeddour, R. (2013). Fourier transform infrared and fluorescence spectroscopy for analysis of vegetable oils. MATEC Web Conf., 5:4028.
[80] Maggio, R.M., Cerretani, L., Chiavaro, E., Kaufman, T.S., & Bendini, A. (2010). A novel chemometric strategy for the estimation of extra virgin olive oil adulteration with edible oils. Food Control, 21:890–895.
[81] Jović, O., Smolić, T., Primožič, I., & Hrenar, T. (2016). Spectroscopic and Chemometric Analysis of Binary and Ternary Edible Oil Mixtures: Qualitative and Quantitative Study. Anal. Chem., 88:4516–4524.
[82] Vlachos, N., Skopelitis, Y., Psaroudaki, M., Konstantinidou, V., Chatzilazarou, A., & Tegou, E. (2006). Applications of Fourier transform-infrared spectroscopy to edible oils. Anal. Chim. Acta, 573–574:459–465.

[83] Poiana, M.A., Alexa, E., Munteanu, M.F., Gligor, R., Moigradean, D., & Mateescu, C. (2015). Use of ATR-FTIR spectroscopy to detect the changes in extra virgin olive oil by adulteration with soybean oil and high temperature heat treatment. Open Chem., 13.
[84] Allam, M.A., & Hamed, S.F. (2007). Application of FTIR Spectroscopy in the Assessment of Olive Oil Adulteration. J. Appl. Sci. Res., 3:102–108.
[85] Galtier, O., Le Dréau, Y., Ollivier, D., Kister, J., Artaud, J., & Dupuy, N. (2008). Lipid Compositions and French Registered Designations of Origins of Virgin Olive Oils Predicted by Chemometric Analysis of Mid-Infrared Spectra. Appl. Spectrosc., 62:583–590.
[86] De Luca, M., Terouzi, W., Ioele, G., Kzaiber, F., Oussama, A., Oliverio, F. et al. (2011). Derivative FTIR spectroscopy for cluster analysis and classification of morocco olive oils. Food Chem., 124:1113–1118.
[87] Tapp, H.S., Defernez, M., & Kemsley, E.K. (2003). FTIR Spectroscopy and Multivariate Analysis Can Distinguish the Geographic Origin of Extra Virgin Olive Oils. J. Agric. Food Chem., 51:6110–6115.
[88] Abdallah, M., Vergara-Barberán, M., Lerma-García, M.J., Herrero-Martínez, J.M., Simó-Alfonso, E.F., & Guerfel, M. (2016). Cultivar discrimination and prediction of mixtures of Tunisian extra virgin olive oils by FTIR: EVOO prediction of cultivar and mixtures by FTIR and LDA. Eur. J. Lipid Sci. Technol., 118:1236–1242.
[89] Gurdeniz, G., Tokatli, F., & Ozen, B. (2007). Differentiation of mixtures of monovarietal olive oils by mid-infrared spectroscopy and chemometrics. Eur. J. Lipid Sci. Technol., 109:1194–1202.
[90] Gurdeniz, G., Ozen, B., & Tokatli, F. (2008). Classification of Turkish olive oils with respect to cultivar, geographic origin and harvest year, using fatty acid profile and mid-IR spectroscopy. Eur. Food Res. Technol., 227:1275–1281.
[91] Gurdeniz, G., Ozen, B., & Tokatli, F. (2010). Comparison of fatty acid profiles and mid-infrared spectral data for classification of olive oils. Eur. J. Lipid Sci. Technol., 112:218–226.
[92] Caetano, S., Üstün, B., Hennessy, S., Smeyers-Verbeke, J., Melssen, W., Downey, G. et al. (2007). Geographical classification of olive oils by the application of CART and SVM to their FT-IR. J. Chemom., 21:324–334.
[93] Baeten, V., Hourant, P., Morales, M.T., & Aparicio, R. (1998). Oil and Fat Classification by FT-Raman Spectroscopy. J. Agric. Food Chem., 46:2638–2646.

[94] Baeten, V., & Aparicio, R. (2000). Edible oils and fats authentication by Fourier transform Raman spectrometry. Biotechnol. Agron. Soc. Environ., 4:196–203.
[95] Kim, M., Lee, S., Chang, K., Chung, H., & Jung, Y.M. (2012). Use of temperature dependent Raman spectra to improve accuracy for analysis of complex oil-based samples: Lube base oils and adulterated olive oils. Anal. Chim. Acta, 748:58–66.
[96] El-Abassy, R.M., Donfack, P., & Materny, A. (2009). Visible Raman spectroscopy for the discrimination of olive oils from different vegetable oils and the detection of adulteration. J. Raman Spectrosc., 40:1284–1289.
[97] Davies, A.N., McIntyre, P., & Morgan, E. (2000). Study of the Use of Molecular Spectroscopy for the Authentication of Extra Virgin Olive Oils. Part I: Fourier Transform Raman Spectroscopy. Appl. Spectrosc., 54:1864–1867.
[98] López-Díez, E.C., Bianchi, G., & Goodacre, R. (2003). Rapid Quantitative Assessment of the Adulteration of Virgin Olive Oils with Hazelnut Oils Using Raman Spectroscopy and Chemometrics. J. Agric. Food Chem., 51:6145–6150.
[99] Heise, H.M., Damm, U., Lampen, P., Davies, A.N., & McIntyre, P.S. (2005). Spectral variable selection for partial least squares calibration applied to authentication and quantification of extra virgin olive oils using Fourier transform Raman spectroscopy. Appl. Spectrosc., 59:1286–1294.
[100] Baeten, V., Meurens, M., Morales, M.T., & Aparicio, R. (1996). Detection of Virgin Olive Oil Adulteration by Fourier Transform Raman Spectroscopy. J. Agric. Food Chem., 44:2225–2230.
[101] Zhang, X.F., Zou, M.Q., Qi, X.H., Liu, F., Zhang, C., & Yin, F. (2011). Quantitative detection of adulterated olive oil by Raman spectroscopy and chemometrics. J. Raman Spectrosc., 42:1784–1788.
[102] Dong, W., Zhang, Y., Zhang, B., & Wang, X. (2012). Quantitative analysis of adulteration of extra virgin olive oil using Raman spectroscopy improved by Bayesian framework least squares support vector machines. Anal. Methods, 4:2772.
[103] Sánchez-López, E., Sánchez-Rodríguez, M.I., Marinas, A., Marinas, J.M., Urbano, F.J., Caridad, J.M. et al. (2016). Chemometric study of Andalusian extra virgin olive oils Raman spectra: Qualitative and quantitative information. Talanta, 156–157:180–190.
[104] Gouvinhas, I., Machado, N., Carvalho, T., de Almeida, J.M.M.M., & Barros, A.I.R.N.A. (2015). Short wavelength Raman spectroscopy applied to the discrimination and characterization of three cultivars of extra virgin olive oils in different maturation stages. Talanta, 132:829–835.
[105] Nigri, S., & Oumeddour, R. (2016). Detection of extra virgin olive oil adulteration using Fourier transform infrared, synchronous fluorescence spectroscopy and multivariate analysis. Riv. Ital. Delle Sostanze Grasse, 93:125–131.
[106] de B. Harrington, P., Kister, J., Artaud, J., & Dupuy, N. (2009). Automated Principal Component-Based Orthogonal Signal Correction Applied to Fused Near Infrared−Mid-Infrared Spectra of French Olive Oils. Anal. Chem., 81:7160–7169.
[107] Casale, M., Casolino, C., Oliveri, P., & Forina, M. (2010). The potential of coupling information using three analytical techniques for identifying the geographical origin of Liguria extra virgin olive oil. Food Chem., 118:163–170.

CHAPTER 2: PROPOSAL OF A NEW APPROACH APPLYING THE PRINCIPLE OF CONTROL CHARTS TO CHEMOMETRIC MODELS

The review of the literature showed that chemometric discriminant analysis methods, PLS-DA in particular, have been used for many years in scientific studies to build varietal recognition models. However, the application of these models in an industrial context remains limited because they are complex to develop and interpret. It therefore seems sensible to introduce a metrological approach based on the principle of control charts, statistical tools that have been well known in industry since their creation in the 1920s.¹⁹ These control charts, based on the calculation of confidence intervals, are used in particular to detect drifts from a reference value during the quality control of formulated products or raw materials by routine chemical analyses.²⁰ Developing a control chart requires references or, in our case, samples typical of each cultivar. In addition, these samples must be collected over a large number of years in order to obtain robust models that take into account the annual variations in growing conditions, which influence the chemical composition of the oils. This chapter focuses on the proportions of fatty acids analysed by gas chromatography, since a large database obtained by applying this reference method to samples collected over more than ten harvest years is available at the IMBE laboratory.²¹ A control chart system was developed for the interpretation of PLS-DA models in order to confirm the origin of monovarietal oils from five French cultivars. This work was published as an article in the journal Food Control.

19 Bayart, D. Des objets qui solidifient une théorie : l’histoire du contrôle statistique de fabrication, in Des savoirs en action - Contributions de la recherche en gestion, L’Harmattan (Fr), 1995, 139-173.
20 Vasconcellos, J.A., Tamborero-Arnal, J.F., Araiz-Jam, A. Statistical methods of quality control in the food industry, in Quality assurance for the food industry: a practical approach, CRC Press (US), 2003, 141-174.
21 Pinatel, C., Ollivier, D., Ollivier, V., Artaud, J. New approach to the determination of the origin of olive oils: morphograms and morphotypes (Part II). Olivae, 2014, 119, 48-62.


Discrimination of extra virgin olive oils from five French cultivars: en route to a control chart approach

Astrid Maléchaux, Yveline Le Dréau, Pierre Vanloot, Jacques Artaud, Nathalie Dupuy Aix Marseille Univ, Avignon Université, CNRS, IRD, IMBE, Marseille, France

Abstract

The control of varietal origin is an important issue in ensuring the authenticity of olive oils. In this study, extra virgin olive oils from five French cultivars were discriminated by applying partial least squares discriminant analysis (PLS1-DA) to their fatty acid and squalene compositions obtained by gas chromatography. Two decision rules were compared to determine the varietal origin of predicted samples: either a classical PLS-DA approach with uncertainty zones, or a control chart approach with warning and control limits. The control chart approach, being focused on characteristic samples from each modelled cultivar, is able to deal with classes containing unbalanced numbers of samples and to identify atypical samples.

Keywords

olive oil, fatty acids, cultivars, quality control, control chart, chemometrics


1. Introduction

Olive oil is an emblematic product of the Mediterranean area, which has gained increasing worldwide popularity due to its sensory and nutritional properties. In France, to make the most of the very small production volumes and to meet consumers’ demands for quality and authenticity, producers promote high-value extra virgin olive oils (EVOO) made from specific cultivars and possibly certified by a protected designation of origin. However, these products are an attractive target for fraudsters and their origin claims must thus be verified [1]. Several analytical methods have been developed for this purpose, studying either the global composition of oils with spectroscopic techniques or specific markers with genetic or chromatographic techniques [2,3]. Multivariate statistical analyses, known as chemometrics, are often necessary to extract the relevant information from these complex data and to discriminate authentic from non-authentic samples [4]. Among the molecular markers, fatty acids are important for the determination of the purity of olive oils, with acceptable contents defined by trade standards [5]. Moreover, beyond these purity criteria, fatty acid composition can be associated with chemometrics for the determination of the varietal origin of olive oils. For instance, linear discriminant analysis models were developed to classify Sicilian and French cultivars respectively [6,7], while another study used SIMCA models to discriminate between Turkish cultivars [8]. However, to our knowledge, partial least squares discriminant analysis (PLS1-DA) has yet to be applied to predict the varietal origin of EVOO based on their fatty acid compositions. This algorithm requires the assignment of a binary coding to indicate whether or not each sample belongs to the modelled cultivar. 
However, since PLS was originally built for the quantitative analysis of continuous variables, the predicted values are not binary and so it is necessary to define a rule indicating whether or not the predicted sample can be attributed to the modelled cultivar. A recent review presents different methods for determining the classification threshold, such as the choice of an arbitrary value, the determination of an optimal value using receiver operating characteristic curves, or the estimation of a probability density function [9]. The latter is more flexible and can deal with unbalanced class sizes but requires more complex calculations. In this study, PLS1-DA was applied to predict the varietal origin of EVOO samples from five French cultivars. Two kinds of thresholds were considered. The first one is a classical approach currently used in chemometrics, defining an arbitrary threshold with an uncertainty zone between the target values 0 and 1. The second one is a novel approach based on quality control charts. Indeed, control charts are a common statistical tool for monitoring the conformity of products or processes with a reference value [10,11]. This approach may be more user-friendly than the probability density function since control charts are built using the simple computation of confidence intervals and are already a common tool in the food industry [12].

2. Material and methods

2.1. Samples

Three hundred monovarietal EVOO samples produced between 2002 and 2017 were used for this study. An equal number of samples (n=60) came from each of these five French cultivars, which are among the most typical in the Provence region: Aglandau (AG), Cailletier (CA), Picholine (PI), Salonenque (SA) and Tanche (TA). For each cultivar, samples were obtained from several harvest years to represent the annual variability due to external parameters such as climatic conditions:
- AG: 2006 (n=6), 2007 (n=3), 2008 (n=1), 2011 (n=4), 2013 (n=1), 2016 (n=20), 2017 (n=25)
- CA: 2002 (n=1), 2006 (n=7), 2007 (n=8), 2008 (n=4), 2009 (n=2), 2010 (n=4), 2011 (n=4), 2012 (n=4), 2016 (n=16), 2017 (n=10)
- PI: 2003 (n=9), 2005 (n=7), 2006 (n=7), 2007 (n=5), 2008 (n=2), 2010 (n=1), 2011 (n=1), 2016 (n=13), 2017 (n=15)
- SA: 2002 (n=10), 2003 (n=12), 2004 (n=5), 2005 (n=2), 2006 (n=1), 2008 (n=1), 2011 (n=2), 2016 (n=14), 2017 (n=13)
- TA: 2003 (n=7), 2004 (n=7), 2005 (n=8), 2006 (n=4), 2007 (n=4), 2008 (n=3), 2011 (n=1), 2016 (n=15), 2017 (n=11)

2.2. Sample preparation

The transmethylation of the triacylglycerols from the extra virgin olive oil (EVOO) samples was conducted following the method described in a previous article prior to GC analysis [13].


2.3. Gas chromatography

GC analyses of the fourteen fatty acid methyl esters and squalene were performed using an Agilent 7890A gas chromatograph (Agilent Technologies Inc., Santa Clara, California). Hydrogen was used as the carrier gas at a flow rate of 1 mL/min. The instrument was equipped with a split/splitless injector (split ratio 1:60), a flame ionization detector and a Supelcowax 10 (Merck KGaA, Darmstadt, Germany) silica capillary column coated with polyethylene glycol (L×I.D. 60 m × 0.25 mm, df 0.25 μm). The following temperature programme was applied: 210 °C for 20 min, then from 210 to 245 °C at 6 °C/min, and 245 °C for 20 min. The percentages of the fourteen fatty acids and squalene were mean-centred and scaled by their respective standard deviations before chemometric analysis.
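As an illustration of the pre-processing step described above, the following minimal sketch mean-centres each variable and divides it by its standard deviation. The function name `autoscale`, the use of the sample standard deviation (n - 1 denominator) and the input values are assumptions made for this example only; the study itself performed this step in The Unscrambler.

```python
from statistics import mean, stdev


def autoscale(columns):
    """Mean-centre each variable and divide it by its standard deviation."""
    scaled = []
    for values in columns:
        m = mean(values)
        s = stdev(values)  # sample SD (n - 1); an assumption of this sketch
        scaled.append([(v - m) / s for v in values])
    return scaled


# Hypothetical percentages of two variables measured on four oils
oleic = [69.8, 73.3, 63.3, 78.2]
squalene = [0.87, 0.39, 0.65, 0.79]

scaled_oleic, scaled_squalene = autoscale([oleic, squalene])
```

After this transformation every variable has a mean of zero and a unit standard deviation, so that minor compounds such as squalene are not dominated by major ones such as oleic acid.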

2.4. Chemometric analysis

The Unscrambler X software (version 10.4, CAMO Software) was used to conduct the chemometric processing. First, principal component analysis (PCA) was used as an exploratory tool to represent the dispersion of the samples and identify outliers. Indeed, this unsupervised pattern-recognition technique projects the data from a large number of variables into a space defined by a small number of principal components (PCs) which describe most of the variance of the dataset. This results in scores plots indicating the similarities and differences among the samples, and loadings indicating which initial variables contribute to the construction of each PC [14,15]. Then, partial least squares discriminant analysis (PLS1-DA) models were developed to predict the varietal origin of the samples. In this supervised pattern-recognition method, a different model is built to discriminate each cultivar against all the others. This method was chosen over PLS2-DA, which predicts all cultivars simultaneously, and over soft independent modelling of class analogy (SIMCA), since a previous article comparing the results of these three algorithms indicated that PLS1-DA gave more satisfactory results [16]. The main sources of variability in the dataset are modelled by latent variables (LVs) and the scores are computed to maximize their covariance with the predicted variables. A full cross-validation procedure is applied during the calibration of each model in order to select the optimal number of LVs that minimizes the root mean square error of cross validation (RMSECV). Moreover, PLS1-DA derives from the PLS regression built for quantitative analysis, so it is necessary to assign a binary coding to indicate whether a sample belongs (value of 1) or not (value of 0) to the modelled cultivar. Since the predicted values are not binary but rather continuous, a predicted sample is recognized as belonging to the modelled cultivar if its value is above a determined prediction threshold, or as belonging to the other cultivars otherwise [9,17]. The quality of the models was evaluated by the root mean square error of calibration (RMSEC) and coefficient of determination R2 for the calibration models, and root mean square error of prediction (RMSEP) and coefficient of determination Q2 for the predictions [18].
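The calibration procedure described in this section can be sketched as follows. This is a simplified illustration, not the implementation used by The Unscrambler: it assumes a textbook NIPALS PLS1 algorithm, takes full cross-validation to mean leave-one-out, only mean-centres the variables, and uses hypothetical fatty acid percentages. All function and variable names are invented for the example.

```python
from math import sqrt


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def column(matrix, j):
    return [row[j] for row in matrix]


def fit_pls1(X, y, n_lv):
    """NIPALS PLS1 on mean-centred data; returns the means and (w, p, q) per LV."""
    n, m = len(X), len(X[0])
    x_mean = [sum(column(X, j)) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]
    f = [v - y_mean for v in y]
    components = []
    for _ in range(n_lv):
        w = [dot(column(E, j), f) for j in range(m)]  # covariance direction
        norm = sqrt(dot(w, w))
        if norm < 1e-12:  # nothing left to model
            break
        w = [v / norm for v in w]
        t = [dot(row, w) for row in E]  # scores
        tt = dot(t, t)
        p = [dot(column(E, j), t) / tt for j in range(m)]  # X loadings
        q = dot(f, t) / tt  # y loading
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, p, q))
    return x_mean, y_mean, components


def predict_pls1(model, x):
    x_mean, y_mean, components = model
    e = [v - mu for v, mu in zip(x, x_mean)]
    y_hat = y_mean
    for w, p, q in components:
        t = dot(e, w)
        y_hat += q * t
        e = [ej - t * pj for ej, pj in zip(e, p)]
    return y_hat


def rmsecv(X, y, n_lv):
    """Leave-one-out root mean square error of cross-validation."""
    press = 0.0
    for i in range(len(X)):
        model = fit_pls1(X[:i] + X[i + 1:], y[:i] + y[i + 1:], n_lv)
        press += (y[i] - predict_pls1(model, X[i])) ** 2
    return sqrt(press / len(X))


# Hypothetical palmitic, oleic and linoleic percentages for eight oils
labels = ["TA", "TA", "TA", "TA", "AG", "CA", "PI", "SA"]
X = [[8.5, 78.9, 6.1], [9.2, 77.5, 6.9], [8.1, 79.8, 5.6], [9.6, 76.8, 7.3],
     [14.2, 69.5, 7.8], [12.1, 73.0, 7.4], [11.0, 73.4, 9.1], [15.6, 62.9, 13.5]]
y = [1 if label == "TA" else 0 for label in labels]  # binary coding, Tanche model

errors = {n_lv: rmsecv(X, y, n_lv) for n_lv in (1, 2, 3)}
best_lv = min(errors, key=errors.get)  # number of LVs minimizing the RMSECV
```

Each modelled cultivar would get its own binary vector `y` and its own model, which is what distinguishes PLS1-DA from PLS2-DA.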

3. Results and discussion

3.1. Fatty acid and squalene compositions

As can be seen in Table 1, each of the five studied cultivars has a characteristic composition with some fatty acids or squalene in higher or lower proportions than the other cultivars.
- Aglandau oils have higher margaric (17:0) and margaroleic (17:1ω8) acid contents
- Cailletier oils differ by their lower squalene (Squa) content
- Picholine oils have a higher linolenic (18:3ω3) acid content
- Salonenque oils contain more palmitic (16:0), linoleic (18:2ω6) and arachidic (20:0) acids but less oleic (18:1ω9) acid
- Tanche oils are richer in oleic (18:1ω9) acid but poorer in palmitic (16:0), palmitoleic (16:1ω7) and vaccenic (18:1ω7) acids
These results support and complete the findings from a previous study [7].


TABLE 1. MEAN, MAXIMUM AND MINIMUM FATTY ACID AND SQUALENE PERCENTAGES IN THE FIVE FRENCH OLIVE OIL CULTIVARS (AG: AGLANDAU, CA: CAILLETIER, PI: PICHOLINE, SA: SALONENQUE, TA: TANCHE, N: NUMBER OF SAMPLES)

Cultivar   Stat  16:0   16:1ω9  16:1ω7  17:0  17:1ω8  18:0  18:1ω9  18:1ω7  18:2ω6  18:3ω3  20:0  20:1ω9  22:0  24:0  Squa
AG (n=60)  Mean  14.05  0.12    1.16    0.20  0.35    2.65  69.81   2.54    7.64    0.65    0.42  0.24    0.12  0.06  0.87
           Max   16.07  0.17    1.56    0.48  0.74    3.74  75.14   3.20    10.20   0.98    0.50  0.29    0.15  0.07  1.32
           Min   11.29  0.08    0.69    0.12  0.18    2.21  65.51   1.96    5.88    0.49    0.37  0.19    0.10  0.04  0.55
CA (n=60)  Mean  11.93  0.09    0.82    0.05  0.09    2.39  73.31   2.34    7.54    0.62    0.38  0.29    0.12  0.05  0.39
           Max   14.51  0.12    1.37    0.06  0.11    3.08  76.83   2.75    10.62   0.84    0.42  0.36    0.13  0.06  0.60
           Min   10.32  0.05    0.53    0.04  0.06    1.83  68.53   1.87    5.81    0.50    0.34  0.22    0.09  0.02  0.19
PI (n=60)  Mean  11.32  0.12    0.62    0.06  0.09    2.39  73.02   1.91    8.84    0.82    0.37  0.30    0.09  0.05  0.72
           Max   14.40  0.15    0.91    0.10  0.16    3.03  77.52   2.61    13.22   1.06    0.44  0.35    0.12  0.07  0.93
           Min   9.17   0.09    0.38    0.04  0.06    1.74  67.60   1.46    6.39    0.55    0.28  0.26    0.07  0.04  0.47
SA (n=60)  Mean  15.41  0.10    1.10    0.06  0.09    2.82  63.33   2.41    13.19   0.58    0.46  0.24    0.13  0.07  0.65
           Max   18.03  0.16    1.56    0.13  0.22    4.06  70.75   3.15    19.47   0.96    0.58  0.28    0.16  0.09  0.96
           Min   11.91  0.06    0.70    0.04  0.05    2.14  54.11   1.77    9.21    0.46    0.39  0.19    0.10  0.05  0.39
TA (n=60)  Mean  8.83   0.14    0.43    0.04  0.06    2.81  78.21   1.55    6.49    0.62    0.38  0.29    0.10  0.04  0.79
           Max   10.88  0.17    0.61    0.05  0.09    3.71  81.49   1.97    9.64    0.78    0.43  0.32    0.12  0.05  0.98
           Min   7.38   0.11    0.31    0.03  0.05    2.24  74.46   1.28    4.92    0.52    0.31  0.25    0.07  0.02  0.51

With 16:0: palmitic acid, 16:1ω9: hypogeic acid, 16:1ω7: palmitoleic acid, 17:0: margaric acid, 17:1ω8: margaroleic acid, 18:0: stearic acid, 18:1ω9: oleic acid, 18:1ω7: vaccenic acid, 18:2ω6: linoleic acid, 18:3ω3: linolenic acid, 20:0: arachidic acid, 20:1ω9: gondoic acid, 22:0: behenic acid, 24:0: lignoceric acid and Squa: squalene


3.2. Principal component analysis

Since the studied samples are commercial EVOOs, they are subject to annual variations resulting from uncontrolled weather and farming conditions. It is thus important to assess the variability of the samples and to select only those that are representative of the typical composition of each cultivar in order to obtain reliable control charts. For this purpose, outliers were removed from the samples used for the calibration and validation of the models. Only the samples from the most recent production year (2017), including possible outliers, were kept as a final control set. In order to identify outliers among the samples from all production years but the last, a separate PCA was built for each cultivar. Based on the influence plots representing the F-residuals versus Hotelling’s T2 statistics [14], three Aglandau, four Cailletier, six Picholine, six Salonenque and one Tanche samples that did not comply with the characteristics of their cultivar were removed. A global PCA was then performed on the remaining samples, i.e. those from all years but the last, without outliers. Figure 1-A shows the scores obtained on the first two PCs, which represent 62% of the variability. The samples are grouped according to their varietal origin. However, Picholine and Tanche overlap slightly on these two PCs and the Aglandau and Salonenque groups display a rather large dispersion. These observations indicate that chemometric models should be able to discriminate the varietal origin of these oils, but that some samples may be more difficult to classify. The corresponding loadings (Figure 1-B), which indicate the most influential variables, give complementary information in relation to Table 1. The characteristic fatty acids of each cultivar have a significant influence on the first two PCs. 
For instance, palmitic (16:0), palmitoleic (16:1ω7), vaccenic (18:1ω7), linoleic (18:2ω6) and arachidic (20:0) acids are all positively correlated to PC1, while oleic (18:1ω9) acid is negatively correlated. This is why Salonenque samples have positive scores while Tanche samples have negative scores on PC1. Similarly, margaric (17:0) and margaroleic (17:1ω8) acids are negatively correlated to PC2, which is consistent with the negative scores of Aglandau samples on PC2. Squalene is also negatively correlated to PC2, explaining the positive scores of Cailletier samples. Furthermore, behenic (22:0) and lignoceric (24:0) acids, that were not considered characteristic in Table 1, appear to bring some significant information to the first PC.


FIGURE 1. SCORES (A) AND LOADINGS (B) FOR THE FIRST TWO PCS OF THE PCA USING FATTY ACIDS AND SQUALENE COMPOSITION (■: AGLANDAU, ●: CAILLETIER, ∆: PICHOLINE, □: SALONENQUE, ▲: TANCHE)


3.3. Prediction of varietal origin by partial least square discriminant analysis

In order to build the models with a sufficient number of representative samples from each cultivar, two thirds of the samples without outliers, from all harvest years but the last, were randomly selected and used as a calibration set to train the models (Aglandau, n=22; Cailletier, n=31; Picholine, n=26; Salonenque, n=27; Tanche, n=32). The remaining third was used as a prediction set to test the performances of the models with samples typical of each cultivar (Aglandau, n=11; Cailletier, n=15; Picholine, n=13; Salonenque, n=14; Tanche, n=16). The final control set from the last production year was used to assess the performances of the models with sample sets containing possible outliers (Aglandau, n=25; Cailletier, n=10; Picholine, n=15; Salonenque, n=13; Tanche, n=11). A separate model was built to predict each cultivar against all the others, with a binary coding indicating whether a sample belonged (value of 1) or not (value of 0) to the modelled cultivar. However, when using these “full” calibration sets, the number of samples from the modelled cultivar was much smaller than the sum of samples from the other cultivars. This situation causes the mean calibration scores of the modelled cultivar to be lower than the expected value of 1, as can be seen in Table 2. This result concurs with the observation of Borràs et al., who reported the difficulty of PLS-DA models in accurately recognizing the class with fewer samples [19]. Thus, to avoid the issues caused by unbalanced classes, other models were developed using “balanced” calibration sets, in which samples were randomly selected from the four other cultivars to reach the same number as that of the modelled cultivar (Aglandau model: 22 AG, 6 CA, 5 PI, 5 SA, 6 TA; Cailletier model: 7 AG, 31 CA, 8 PI, 8 SA, 8 TA; Picholine model: 6 AG, 7 CA, 26 PI, 6 SA, 7 TA; Salonenque model: 6 AG, 7 CA, 7 PI, 27 SA, 7 TA; Tanche model: 8 AG, 8 CA, 8 PI, 8 SA, 32 TA). 
However, in this case the variability of the four other cultivars in each model is not so well represented, which tends to increase the dispersion of the predicted scores. In order to overcome this other issue, the random selection was conducted five times for each model and the final results were obtained by averaging the five predicted scores and quality parameters. Moreover, two kinds of thresholds were tested to determine the attribution of the predicted samples to one class or the other, as illustrated in Figure 2 :


- The first approach was a classical PLS-DA arbitrary threshold with uncertainty zones taking into account the samples that were not clearly recognized by the model (Figure 2-A). Considering that fatty acid proportions are good markers of the varietal origin of olive oils [7], conservative thresholds close to the expected values of 1 and 0 were selected. Samples were recognized as belonging to the modelled cultivar if their predicted value was between 0.7 and 1.3, or belonging to the other cultivars if their predicted value was between -0.3 and 0.3. Samples predicted outside of these zones could not be clearly assigned to either the modelled cultivar or the other cultivars and were thus considered as uncertain. These uncertain samples should be analysed again with a different technique to confirm their origin.

- The second threshold was built as a control chart to verify if a sample labelled as belonging to the modelled cultivar was authentic or not (Figure 2-B). For this purpose, warning limits and control limits were established as confidence intervals at 95% and 99% respectively around the mean calibration scores (MCS) for the modelled cultivar only.

$$\text{95\% warning limits} = \mathrm{MCS} \pm (2 \times \text{standard deviation})$$

$$\text{99\% control limits} = \mathrm{MCS} \pm (3 \times \text{standard deviation})$$

Samples were accepted as belonging to the modelled cultivar if their predicted value was inside the 95% warning limits, and rejected if their predicted value was outside the 99% control limits. Samples with a predicted value in the warning zone (between the 95% and 99% limits) were considered as uncertain and should also be analysed again.
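The two decision rules listed above can be sketched as follows. The function names are hypothetical, and treating the limits as inclusive is an assumption of this example, since the text only states that values lie "between" or "inside" the limits. The numerical example uses the mean and standard deviation of the Y scores reported in Table 2 for the Aglandau model built with the full calibration set (0.94 and 0.14).

```python
def classical_rule(y_pred):
    """Classical PLS-DA thresholds with uncertainty zones."""
    if 0.7 <= y_pred <= 1.3:
        return "modelled"
    if -0.3 <= y_pred <= 0.3:
        return "other"
    return "uncertain"


def control_limits(mcs, sd):
    """Warning (MCS +/- 2 SD) and control (MCS +/- 3 SD) limits."""
    return {
        "warning": (mcs - 2 * sd, mcs + 2 * sd),
        "control": (mcs - 3 * sd, mcs + 3 * sd),
    }


def control_chart_rule(y_pred, mcs, sd):
    """Accept, reject or flag a sample labelled as the modelled cultivar."""
    limits = control_limits(mcs, sd)
    warn_low, warn_high = limits["warning"]
    ctrl_low, ctrl_high = limits["control"]
    if warn_low <= y_pred <= warn_high:
        return "accepted"
    if y_pred < ctrl_low or y_pred > ctrl_high:
        return "rejected"
    return "uncertain"  # warning zone between the two sets of limits


# Aglandau model, full calibration set (mean and SD of Y scores from Table 2)
ag_limits = control_limits(0.94, 0.14)
decision_classical = classical_rule(0.68)
decision_chart = control_chart_rule(0.68, 0.94, 0.14)
```

With these values a predicted score of 0.68 falls just below the classical acceptance zone but inside the warning limits of the control chart, which illustrates how centring the limits on the mean calibration score rather than on 1 can change the decision.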


FIGURE 2. PREDICTED Y SCORES WITH DECISION RULES FOR THE TWO THRESHOLDS (A: PLS-DA CLASSICAL APPROACH, B: CONTROL CHART APPROACH, M: MODELLED CULTIVAR, O: OTHER CULTIVARS)


With the full calibration sets, four LVs are sufficient to build the models discriminating the Cailletier, Picholine, Salonenque and Tanche samples, and only two LVs for the Aglandau model (Table 2). Indeed, the Aglandau samples may be easier to recognize, as indicated by their good separation from the other cultivars on both PC1 and PC2 of the exploratory PCA (Figure 1-A). When using the balanced calibration sets, the optimal number of LVs varies depending on the randomly selected samples used to build each model (Table 2). The models built using the full calibration set have satisfactory quality parameters, as shown in Table 2, with RMSEC ranging from 0.09 for Salonenque to 0.16 for Picholine and R2 between 0.83 for Picholine and 0.94 for Salonenque. Moreover, due to the smaller number of samples from the modelled cultivar compared to the other ones, the mean of the Y scores for each modelled cultivar tends to be lower than the expected value of 1. This shift is less marked when using the models built with balanced calibration sets, which could thus improve the results. Using the balanced calibration sets does not significantly influence the RMSEC but improves the R2, which reaches 0.90 or higher for all the cultivars.
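The construction of the balanced calibration sets can be illustrated with the sketch below. It keeps all the samples of the modelled cultivar and draws, without replacement, an equal total number of samples from the four other cultivars, repeating the draw five times. How the remainder was distributed among the other cultivars is not stated in the text, so the random allocation used here is an assumption, as are the function name and the sample identifiers.

```python
import random

# Calibration set sizes reported for each cultivar
CALIBRATION_SIZES = {"AG": 22, "CA": 31, "PI": 26, "SA": 27, "TA": 32}


def draw_balanced_set(samples_by_cultivar, modelled, rng):
    """Keep the modelled cultivar and draw as many samples from the others."""
    others = [c for c in samples_by_cultivar if c != modelled]
    base, extra = divmod(len(samples_by_cultivar[modelled]), len(others))
    # Cultivars that receive one additional sample (assumed to be random)
    with_extra = set(rng.sample(others, extra))
    selection = {modelled: list(samples_by_cultivar[modelled])}
    for cultivar in others:
        k = base + (1 if cultivar in with_extra else 0)
        selection[cultivar] = rng.sample(samples_by_cultivar[cultivar], k)
    return selection


# Hypothetical sample identifiers, e.g. "AG-01"
samples = {
    cultivar: [f"{cultivar}-{i:02d}" for i in range(1, n + 1)]
    for cultivar, n in CALIBRATION_SIZES.items()
}

rng = random.Random(0)  # fixed seed so that the draws can be reproduced
draws = [draw_balanced_set(samples, "AG", rng) for _ in range(5)]
```

A model would then be calibrated on each of the five draws, and the predicted scores and quality parameters averaged over the five models.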

TABLE 2. STATISTICAL PARAMETERS, MEAN AND STANDARD DEVIATION (SD) OF THE Y SCORES OF THE MODELLED CULTIVAR FOR EACH PLS1-DA CALIBRATION MODEL (AG: AGLANDAU, CA: CAILLETIER, PI: PICHOLINE, SA: SALONENQUE, TA: TANCHE, N: NUMBER OF SAMPLES)

Modelled cultivar  Calibration  LV       RMSEC  R2    Y scores mean  Y scores SD
AG (n=22)          Full         2        0.10   0.93  0.94           0.14
                   Balanced     2 to 3   0.11   0.95  0.97           0.12
CA (n=31)          Full         4        0.12   0.91  0.93           0.11
                   Balanced     3 to 5   0.11   0.95  0.97           0.11
PI (n=26)          Full         4        0.16   0.83  0.85           0.20
                   Balanced     4 to 10  0.15   0.90  0.94           0.20
SA (n=27)          Full         4        0.09   0.94  0.95           0.14
                   Balanced     3 to 4   0.11   0.95  0.98           0.14
TA (n=32)          Full         4        0.15   0.87  0.90           0.19
                   Balanced     2 to 5   0.15   0.91  0.95           0.18


The results obtained when the models are applied to the prediction set without outliers are presented in Table 3. For the models built with the full calibration set the quality parameters are still good, with RMSEP between 0.11 for Aglandau and 0.19 for Picholine, and Q2 between 0.91 for Aglandau and 0.77 for Picholine. Using the models with balanced calibration sets does not bring significant changes to the RMSEP and Q2, except for the model predicting Picholine which has lower quality parameters (RMSEP of 0.22 and Q2 of 0.72). Thus, the smaller number of samples used in balanced calibration sets has a limited impact on the classification accuracy of the models. When using the classical approach, the confusion matrices indicate good prediction results for the models built with full calibration set, as could be expected after removal of the outliers. There were no misclassified samples with any of the models, only some samples in the uncertainty zones. The model predicting the Picholine cultivar was the least satisfying, with a total of eight uncertain samples. The other models had fewer uncertain samples: five for the model predicting the Tanche cultivar, three for the Cailletier cultivar, two for the Salonenque and one for the Aglandau model. Using the models built with balanced calibration sets brings little improvement to the recognition of the samples from each modelled cultivar. However, even with the repeated random selection, the variability of the other samples is not so well taken into account and more uncertain samples are detected from the other cultivars. For the models with full calibration sets, the control chart limits are more tightly centred around the mean predicted value of each modelled cultivar, which allows a better recognition of the samples for the Picholine and Tanche models. Moreover, the identification of samples deviating from the average composition of the modelled cultivar is facilitated. 
Thus, one Cailletier sample and one Salonenque sample are considered as being too far from the average of their group although they were not identified as outliers in their respective PCA. On the other hand, the removal of the limits around the other cultivars results in overall fewer uncertain samples compared to the classical threshold: none with the Cailletier model, one with the Aglandau, Salonenque and Tanche models, and six in the Picholine model. With the control chart approach, the balanced calibration set does not seem to improve the results since the threshold is already centred around the mean predicted value of the modelled class rather than the expected value of 1.
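The quality parameters discussed above can be illustrated with a short sketch. It assumes that the RMSEP is the root of the mean squared difference between reference and predicted values and that Q2 is computed as one minus the ratio of the residual to the total sum of squares; the exact formulas implemented in The Unscrambler may differ [18]. The labels and predicted values are hypothetical.

```python
from math import sqrt


def rmse(reference, predicted):
    """Root mean square error between reference and predicted values."""
    squared = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return sqrt(sum(squared) / len(squared))


def determination_coefficient(reference, predicted):
    """1 - residual sum of squares / total sum of squares."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1 - ss_res / ss_tot


# Hypothetical prediction set for an Aglandau model
labels = ["AG", "AG", "CA", "PI", "SA", "TA"]
y_ref = [1 if label == "AG" else 0 for label in labels]  # binary coding
y_pred = [0.9, 1.1, 0.1, -0.1, 0.2, 0.0]

rmsep = rmse(y_ref, y_pred)
q2 = determination_coefficient(y_ref, y_pred)
```

The same two functions give the RMSEC and R2 when they are applied to the calibration samples instead of the prediction samples.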


TABLE 3. CONFUSION MATRICES AND STATISTICAL PARAMETERS OF THE PLS1-DA MODELS PREDICTING THE ORIGIN OF EACH CULTIVAR FOR THE SAMPLES FROM ALL YEARS BUT LAST WITHOUT OUTLIERS (AG: AGLANDAU, CA: CAILLETIER, PI: PICHOLINE, SA: SALONENQUE, TA: TANCHE, N: NUMBER OF SAMPLES)

                                      Predicted class, [-0.3 ; 0.3] / [0.7 ; 1.3]   Predicted class, control chart
Modelled  Calibration  Real class     Modelled  Other  Uncertain                    Accepted  Rejected  Uncertain   95% limits   99% limits   RMSEP  Q2
AG        Full         AG (n=11)      10        0      1                            10        0         1           0.65 ; 1.22  0.51 ; 1.36  0.11   0.91
                       Other (n=58)   0         58     0                            0         58        0
AG        Balanced     AG (n=11)      10        0      1                            10        0         1           0.73 ; 1.21  0.61 ; 1.33  0.12   0.89
                       Other (n=58)   0         57     1                            0         58        0
CA        Full         CA (n=15)      14        0      1                            14        1         0           0.71 ; 1.14  0.60 ; 1.25  0.16   0.84
                       Other (n=54)   0         52     2                            0         54        0
CA        Balanced     CA (n=15)      14        0      1                            14        1         0           0.75 ; 1.19  0.64 ; 1.30  0.18   0.83
                       Other (n=54)   0         50     4                            0         54        0
PI        Full         PI (n=13)      9         0      4                            13        0         0           0.45 ; 1.26  0.25 ; 1.46  0.19   0.77
                       Other (n=56)   0         52     4                            1         49        6
PI        Balanced     PI (n=13)      11        0      2                            12        0         1           0.55 ; 1.34  0.35 ; 1.53  0.22   0.72
                       Other (n=56)   0         49     7                            1         51        4
SA        Full         SA (n=14)      12        0      2                            12        1         1           0.68 ; 1.22  0.54 ; 1.36  0.13   0.89
                       Other (n=55)   0         55     0                            0         55        0
SA        Balanced     SA (n=14)      12        0      2                            13        1         0           0.69 ; 1.26  0.55 ; 1.40  0.14   0.87
                       Other (n=55)   0         53     2                            0         55        0
TA        Full         TA (n=16)      14        0      2                            16        0         0           0.52 ; 1.28  0.32 ; 1.47  0.16   0.85
                       Other (n=53)   0         50     3                            0         52        1
TA        Balanced     TA (n=16)      15        0      1                            15        0         1           0.60 ; 1.30  0.42 ; 1.48  0.18   0.83
                       Other (n=53)   0         49     4                            0         53        0


Looking at the results obtained for the prediction of the varietal origin of the samples from the final year (Table 4), the quality parameters are slightly lower. This was expected since the final control set has not been cleared of its outlying samples. Models built with the full calibration set yield RMSEP between 0.14 for Cailletier and 0.21 for Tanche, and Q2 between 0.69 for Tanche and 0.91 for Aglandau. Using models with balanced calibration sets slightly worsens the quality parameters, but they remain acceptable, with RMSEP between 0.17 for Cailletier and 0.26 for Picholine and Q2 between 0.70 for Tanche and 0.90 for Aglandau. Confusion matrices obtained with the classical threshold indicate a fairly good predictive ability of the models based on the full calibration set, with no misclassified samples but some uncertain samples for each model: four in the Salonenque model and five in each of the Aglandau and Cailletier models. The Picholine and Tanche models give less satisfactory results, with twelve and thirteen uncertain samples respectively. Using models built with balanced calibration sets improves the results for the Cailletier and Tanche models, giving only two and seven uncertain samples respectively. However, more samples are found in the uncertainty zone for the Aglandau, Picholine and Salonenque models. Similarly to what was observed with the prediction set cleared of outliers, the use of the control chart approach allows a better prediction of the Tanche cultivar, and fewer uncertain samples for all models but the Picholine one. Indeed, with the full calibration model for Picholine, twelve uncertain samples are found, in addition to one outlying Picholine sample and one other sample falsely predicted as Picholine. The Aglandau model also finds three uncertain and four outlying samples. The other models present no misclassified samples, and only three uncertain samples for Cailletier and one for Tanche. 
The Salonenque model results in perfect prediction with no misclassified or uncertain sample. Finally, contrary to what was observed with the prediction set, the results obtained with the control set indicate that using balanced calibration models could further improve the prediction with the control chart thresholds. The combination of both gives fewer uncertain samples for all the models, even if three outliers are still detected with the Aglandau model. Perfect predictions are obtained for the Cailletier, Salonenque and Tanche cultivars when applying the control chart approach to the models built with balanced calibration sets.


TABLE 4. CONFUSION MATRICES AND STATISTICAL PARAMETERS OF THE PLS1-DA MODELS PREDICTING THE ORIGIN OF EACH CULTIVAR FOR THE SAMPLES FROM THE FINAL YEAR (AG: AGLANDAU, CA: CAILLETIER, PI: PICHOLINE, SA: SALONENQUE, TA: TANCHE, N: NUMBER OF SAMPLES)

                                      Predicted class, [-0.3 ; 0.3] / [0.7 ; 1.3]   Predicted class, control chart
Modelled  Calibration  Real class     Modelled  Other  Uncertain                    Accepted  Rejected  Uncertain   95% limits   99% limits   RMSEP  Q2
AG        Full         AG (n=25)      20        0      5                            18        4         3           0.65 ; 1.22  0.51 ; 1.36  0.17   0.91
                       Other (n=49)   0         49     0                            0         49        0
AG        Balanced     AG (n=25)      20        0      5                            19        3         3           0.73 ; 1.21  0.61 ; 1.33  0.20   0.90
                       Other (n=49)   0         44     5                            0         49        0
CA        Full         CA (n=10)      7         0      3                            7         0         3           0.71 ; 1.14  0.60 ; 1.25  0.14   0.85
                       Other (n=64)   0         62     2                            0         64        0
CA        Balanced     CA (n=10)      10        0      0                            10        0         0           0.75 ; 1.19  0.64 ; 1.30  0.17   0.82
                       Other (n=64)   0         62     2                            0         64        0
PI        Full         PI (n=15)      14        0      1                            14        1         0           0.45 ; 1.26  0.25 ; 1.46  0.20   0.76
                       Other (n=59)   0         48     11                           1         46        12
PI        Balanced     PI (n=15)      13        0      2                            13        0         2           0.55 ; 1.34  0.35 ; 1.53  0.26   0.73
                       Other (n=59)   0         48     11                           1         54        4
SA        Full         SA (n=13)      13        0      0                            13        0         0           0.68 ; 1.22  0.54 ; 1.36  0.16   0.84
                       Other (n=61)   0         57     4                            0         61        0
SA        Balanced     SA (n=13)      13        0      0                            13        0         0           0.69 ; 1.26  0.55 ; 1.40  0.19   0.82
                       Other (n=61)   0         54     7                            0         61        0
TA        Full         TA (n=11)      5         0      6                            11        0         0           0.52 ; 1.28  0.32 ; 1.47  0.21   0.69
                       Other (n=63)   0         56     7                            0         62        1
TA        Balanced     TA (n=11)      10        0      1                            11        0         0           0.60 ; 1.30  0.42 ; 1.48  0.21   0.70
                       Other (n=63)   0         57     6                            0         63        0


4. Conclusion

PLS1-DA models can predict the varietal origin of olive oils from five main French cultivars based on their fatty acid and squalene percentages obtained by GC analysis. The classical PLS-DA approach is not well suited to unbalanced classes, which create a shift in the predicted values of the modelled cultivar. Building the calibration models with balanced classes results in a better prediction of the modelled cultivar, but the variability of the other cultivars is then less well taken into account. This issue can be avoided by using the control chart approach proposed in this article. This approach focuses only on the recognition of the modelled cultivar, thus resulting in a more accurate discrimination. Samples that deviate from the typical characteristics of their cultivar can be uncovered by the control chart. These samples should be analysed again with a different method, such as sensory analysis, infrared spectroscopy or genotyping, to confirm their origins. Moreover, future studies could focus on the application of the control chart approach to the detection of monovarietal olive oil adulteration with cheaper oils.

Acknowledgements

The authors thank Christian Pinatel from the AFIDOL (French interprofessional association of olive, Aix en Provence, France) for providing the olive oil samples. Thanks also go to the trainee students, Cécile Grapeloup and Théo Brunet, for their invaluable help in carrying out the GC analyses.

Funding

This work was supported by the French National Agency for Research (ANR) as part of the European Union’s Seventh Framework Program for research, technological development and demonstration (grant agreement number 618127).


References

[1] Garcia-Gonzalez, D.L., & Aparicio, R. (2010). Research in Olive Oil: Challenges for the Near Future. J. Agric. Food Chem., 58:12569-12577.
[2] Bajoub, A., Bendini, A., Fernandez-Gutierrez, A., & Carrasco-Pancorbo, A. (2018). Olive oil authentication: A comparative analysis of regulatory frameworks with especial emphasis on quality and authenticity indices, and recent analytical techniques developed for their assessment. A review. Crit. Rev. Food Sci. Nutr., 58:832-857.
[3] Kontominas, M.G. (2019). Authentication and Detection of the Adulteration of Olive Oil. New York: Nova Science Publishers, Inc.
[4] Callao, M.P., & Ruisanchez, I. (2018). An overview of qualitative methods for food fraud detection. Food Control, 86:283-293.
[5] International Olive Council. (2018). Trade Standard Applying to Olive Oils and Olive Pomace Oils, COI/T.15/NC No 3/Rev. 12.
[6] Mannina, L., Dugo, G., Salvo, F., Cicero, L., Ansanelli, G., Calcagni, C., & Segre, A. (2003). Study of the Cultivar-Composition Relationship in Sicilian Olive Oils by GC, NMR, and Statistical Methods. J. Agric. Food Chem., 51:120-127.
[7] Ollivier, D., Artaud, J., Pinatel, C., Durbec, J.P., & Guérère, M. (2003). Triacylglycerol and Fatty Acid Compositions of French Virgin Olive Oils. Characterization by Chemometrics. J. Agric. Food Chem., 51:5723-5731.
[8] Gurdeniz, G., Ozen, B., & Tokatli, F. (2008). Classification of Turkish olive oils with respect to cultivar, geographic origin and harvest year, using fatty acid profile and mid-IR spectroscopy. Eur. Food Res. Technol., 227:1275-1281.
[9] Lee, L.C., Liong, C-Y., & Jemain, A.A. (2018). Partial least squares-discriminant analysis (PLS-DA) for classification of high-dimensional (HD) data: a review of contemporary practice strategies and knowledge gaps. Analyst, 143:3526-3539.
[10] Shewhart, W.A. (1926). Quality Control Charts. Bell Syst. Tech. J., 5:593-603.
[11] Kourti, T., & MacGregor, J.F. (1995). Process analysis, monitoring and diagnosis, using multivariate projection methods. Chemometr. Intell. Lab. Syst., 28:3-21.
[12] Alli, I. (2004). Vocabulary of food quality assurance. In Food quality assurance: principles and practices. Boca Raton: CRC Press LLC. 1-26.
[13] Laroussi-Mezghani, S., Vanloot, P., Molinet, J., Dupuy, N., Hammami, M., Grati-Kamoun, N., & Artaud, J. (2015). Authentication of Tunisian virgin olive oils by chemometric analysis of fatty acid compositions and NIR spectra. Comparison with Maghrebian and French virgin olive oils. Food Chem., 173:122-132.
[14] Bro, R., & Smilde, A.K. (2014). Principal component analysis. Anal. Methods, 6:2812-2831.
[15] Wold, S. (1987). Principal Component Analysis. Chemometr. Intell. Lab. Syst., 2:37-52.
[16] Galtier, O., Abbas, O., Le Dréau, Y., Rébufa, C., Kister, J., Artaud, J., & Dupuy, N. (2011). Comparison of PLS1-DA, PLS2-DA and SIMCA for classification by origin of crude petroleum oils by MIR and virgin olive oils by NIR for different spectral regions. Vib. Spectrosc., 55:132-140.
[17] Barker, M., & Rayens, W. (2003). Partial least squares for discrimination. J. Chemometr., 17:166-173.
[18] CAMO Software AS. (2016). The Unscrambler® Appendices: Method References. https://www.camo.com/TheUnscrambler/Appendices/ Accessed 07 March 2019.
[19] Borràs, E., Ferré, J., Boqué, R., Mestres, M., Aceña, L., Calvo, A., & Busto, O. (2016). Olive oil sensory defects classification with data fusion of instrumental techniques and multivariate analysis (PLS-DA). Food Chem., 203:314-322.


CHAPTER 3: APPLICATION OF DIFFERENT STRATEGIES FOR FUSING DATA FROM SEVERAL ANALYTICAL TECHNIQUES

Current reference methods rely on specific chromatographic analyses to confirm the authenticity of olive oils; however, global analyses by vibrational spectroscopy offer attractive alternatives in terms of speed, cost and environmental impact.22 In addition, thanks to advances in computing power, large databases from different analytical methods can be combined to improve the performance of chemometric models.23 With this in mind, the first part of this chapter deals with the results of global MIR analyses and specific GC analyses of olive oil samples from three Tunisian varieties.24 The aim of this study is to develop a multiblock chemometric model that exploits the complementarity of the information contained in the spectroscopic and chromatographic data and thereby improves the discrimination of varietal origins. This article has been accepted for publication in the journal Food Chemistry.

The second part of the chapter presents a study focused on the analyses of the French monovarietal olive oil samples obtained during this thesis, from the 2016, 2017 and 2018 harvests. The global composition of the samples was analysed by NIR and MIR spectroscopy. The performances of several data fusion algorithms are compared in order to optimise the discrimination of the varietal origin of the olive oils: simple concatenation of the individual matrices, hierarchical models with a first level of dimension reduction by PCA or PLS followed by a second PLS level, the multiblock model presented in the first part, and finally majority voting combining the results of the individual models. This article has been submitted to the journal Analytica Chimica Acta.

22 Valli, E., Bendini, A., Berardinelli, A., Ragni, L., Riccò, B., Grossi, M., Gallina Toschi, T. Rapid and innovative instrumental approaches for quality and authenticity of olive oils. European Journal of Lipid Science and Technology, 2016, 118, 1601-1619.
23 Borràs, E., Ferré, J., Boqué, R., Mestres, M., Aceña, L., Busto, O. Data fusion methodologies for food and beverage authentication and quality assessment–A review. Analytica Chimica Acta, 2015, 891, 1-14.
24 Laroussi-Mezghani, S. Biodiversité des huiles d'olive vierges tunisiennes : valorisation à travers une démarche de qualité (Tunolival). https://www.theses.fr/2015AIXM4393.


Multiblock chemometrics for the discrimination of three extra virgin olive oil varieties

Astrid Maléchaux, Sonda Laroussi-Mezghani, Yveline Le Dréau, Jacques Artaud, Nathalie Dupuy

Aix Marseille Univ, Univ Avignon, CNRS, IRD, IMBE, Marseille, France

Abstract

To discriminate samples from three varieties of Tunisian extra virgin olive oils, weighted and non-weighted multiblock partial least squares – discriminant analysis (MB-PLS1-DA) models were compared to simple PLS1-DA models using data from either specific fatty acid and squalene contents obtained by gas chromatography (GC), or global composition through mid-infrared (MIR) spectra. The performance of each model was assessed using statistical parameters and percentages of sensitivity, specificity and total correct classification. The choice of threshold level for the interpretation of PLS1-DA results was also considered. Overall, PLS1-DA models using GC data gave better results than those using MIR data. Indeed, even with the most conservative threshold, PLS1-DA on GC data gave very good predictions for the Chemlali variety (99% correct classification), but had more difficulty discriminating Chetoui and Oueslati samples (95% and 84% correct classification respectively). Furthermore, non-weighted MB-PLS1-DA models, which benefit from the synergy between the two sources of data, were slightly more discriminative than simple PLS1-DA, yielding better predictions for the Chetoui and Oueslati varieties (98% and 90% correct classification respectively).

Keywords cultivars, data fusion, MB-PLS-DA, gas chromatography, fatty acids, mid-infrared


1. Introduction

Olive oil is known for health-promoting effects that depend, among other factors, on its cultivar, and its authentication has therefore been a growing concern for consumers for many years. Food fraud is a major challenge for both regulatory agencies and producers, as it can undermine consumer trust and cause significant losses of revenue [1]. High-value products benefiting from quality or origin certifications are an especially attractive target for fraudsters. This is the case for extra virgin olive oils (EVOO) with a Protected Designation of Origin (PDO), which must comply with defined specifications regarding their varietal and geographic origins. Studies aiming to determine whether an EVOO complies with a reference, defined by the characteristics of a cultivar or the specifications of a PDO, can be divided into two main categories. In the first, samples are treated in order to determine their content of specific constituents such as triacylglycerols, fatty acids, sterols or volatile compounds [2]. The second approach is based on spectroscopic analyses requiring no sample treatment, namely 1H and 13C nuclear magnetic resonance (NMR) [3], near infrared (NIR), mid infrared (MIR) and Raman spectroscopies [4], or fluorescence spectroscopy [5].

Furthermore, over the past few decades, the growing amount of data available from increasingly sophisticated analytical techniques, together with improvements in computational power that make it possible to process this information with multivariate statistical analyses, has spurred the development of methods capable of analysing several blocks of data simultaneously [6]. In the field of food chemistry, data can be obtained from different techniques such as electronic sensors, mass spectrometry, gas or liquid chromatography or vibrational spectroscopy. Combining information from complementary analyses can be a way to obtain more reliable classification and prediction results [7]. In this regard, three types of data fusion strategies are described in a recent review by Borràs et al. [8]: low-level fusion, with simple concatenation of the data; mid-level fusion, using hierarchical models introduced by Wold, Kettaneh & Tjessem [9] or multiblock models developed by Wangen & Kowalski [10]; and high-level fusion, which combines the results of separate models to provide a final prediction using probability estimations.

Most of the articles applying chemometrics to olive oil authenticity consider each analytical technique separately [11]. To date, few studies have applied data fusion to the discrimination of EVOO origin, and even fewer have combined data from spectroscopic and chromatographic analyses. Some articles have studied the simple concatenation of data from two or more sources [12-21]. Hierarchical models have also been developed based on data from NIR and MIR [15], spectroscopy and mass spectrometry [14], artificial nose, NIR and UV-visible [22] or liquid chromatography with two detectors [21]. To our knowledge, mid-level data fusion approaches using multiblock models have not yet been applied to the discrimination of EVOO varietal origin. Moreover, combining GC data, which give specific information on the major compounds, with MIR data, which reflect the global composition of the oils, should provide complementary information and is expected to refine the discrimination of EVOO origin. This is the purpose of this work: multiblock partial least squares – discriminant analysis (MB-PLS1-DA) models were developed from GC and MIR datasets, with and without weighting of the block scores, in order to compare their performance with that of PLS1-DA models applied separately to each dataset. The study was conducted using Tunisian monovarietal EVOO from three cultivars.

2. Materials and methods

2.1. Extra virgin olive oil samples

Sampling was carried out during the 2011-2012 and 2012-2013 harvest years. Three hundred and thirty-four monovarietal EVOO samples from three Tunisian varieties were used for this study: Chemlali (n=187), Chetoui (n=102) and Oueslati (n=45). The oils were obtained in the laboratory with an oleodoseur extraction system, from handpicked olives that were not stored before extraction [23]. The quality criteria of all samples fell within the ranges established for the “Extra Virgin Olive Oil” category by the trade standard of the International Olive Council [24].


2.2. Gas chromatography

The transmethylation of the EVOO triacylglycerols and the subsequent GC analyses were conducted following the method described by Laroussi-Mezghani et al. [23]. The analyses used an Agilent Technologies 7890A gas chromatograph equipped with a split/splitless injector, a flame ionization detector and a Supelcowax silica capillary column coated with polyethylene glycol (60 m × 0.25 mm i.d., 0.25 μm film thickness).

2.3. Mid infrared spectroscopy

MIR spectra were recorded between 700 and 4000 cm-1 by the accumulation of 64 scans per spectrum with a resolution of 4 cm-1 on a Thermo Nicolet Avatar spectrometer equipped with an ATR accessory (Goldengate, Specac), using the same protocol as Galtier et al. [25].

2.4. Chemometrics

2.4.1. Pre-treatment

Prior to data analysis, the noisy and non-informative regions 1880-2600 cm-1 and 3200-4000 cm-1 were removed from the MIR spectra. In order to optimise the models, normalisation followed by a standard normal variate (SNV) pre-treatment was applied to the spectra to correct the distortions caused by additive and multiplicative effects. The GC data were also normalised.
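As an illustration of the spectral pre-treatment described above, the following minimal sketch removes the two excluded regions and then applies SNV row by row. It is not the Unscrambler implementation used in the study: the function names, the inclusive region boundaries, the wavenumber grid and the synthetic spectra are all assumptions made for the example, and the normalisation step is omitted because its exact variant is not specified in the text.

```python
import numpy as np

def drop_regions(wavenumbers, spectra, regions=((1880.0, 2600.0), (3200.0, 4000.0))):
    """Remove the columns whose wavenumber lies inside an excluded region (cm-1).

    Boundaries are treated as inclusive, which is an assumption of this sketch.
    """
    keep = np.ones(wavenumbers.shape, dtype=bool)
    for low, high in regions:
        keep &= ~((wavenumbers >= low) & (wavenumbers <= high))
    return wavenumbers[keep], spectra[:, keep]

def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) on its own."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

# Synthetic stand-in for real spectra: 5 samples on a hypothetical 700-4000 cm-1 grid.
rng = np.random.default_rng(0)
wavenumbers = np.linspace(700.0, 4000.0, 1712)
spectra = rng.random((5, wavenumbers.size)) + 1.0

kept_wavenumbers, kept_spectra = drop_regions(wavenumbers, spectra)
pretreated = snv(kept_spectra)
```

After SNV, every spectrum has zero mean and unit standard deviation, which is what removes the additive offset and the multiplicative scaling between samples.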

2.4.2. Data analysis

PLS1-DA models were applied to GC and MIR data separately. Each sample is assigned a binary code indicating whether or not it belongs to each class, and a separate model is built to predict each class against all the others. During calibration, the PLS1-DA model is trained to compute “membership values”, and a sample is then assigned to the modelled class when its value is above a given threshold [26]. However, because PLS was originally designed for continuous variables, the predicted values are not binary, and several methods have been proposed to select a threshold that separates the results between the expected values 1 and 0: using an arbitrary value of 0.5, determining the optimal threshold with receiver operating characteristic curves, estimating a probability density function to handle unbalanced group sizes, or defining an interval to take into account the uncertainty of PLS predictions [27].

In this study three thresholds were considered for the calculation of the percentage of correct classification, as presented in Figure 1 (a, b and c). First, samples with a predicted value over 0.5 were considered positive (belonging to the modelled class) and those with a predicted value under 0.5 negative (outside of the modelled class). However, some samples may have predicted values close to 0.5 or too far from the reference values 0 and 1, indicating that they are not clearly recognised by the model. Thus, in a second approach, uncertainty zones were defined to address this issue. Samples were considered positive if their predicted value was between 0.6 and 1.4, negative if it was between -0.4 and 0.4, and uncertain if it was between 0.4 and 0.6, under -0.4 or over 1.4. Finally, following the same reasoning, more conservative uncertainty zones were tested, with samples considered positive if predicted between 0.7 and 1.3, negative between -0.3 and 0.3, and uncertain otherwise. A sample was counted as a true positive (or true negative) if its predicted value was consistent with its expected value of 1 (or 0). Conversely, if the predicted value did not match the expected value, the sample was counted as a false negative or false positive (or as uncertain, where applicable). The total percentage of correct classification, as well as the sensitivity and specificity, were calculated according to Equations (1) to (3).

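$$\%\ \text{correct classification} = \frac{\text{true positive} + \text{true negative}}{\text{total number of predicted samples}} \times 100 \tag{1}$$

$$\%\ \text{sensitivity} = \frac{\text{true positive}}{\text{expected positive}} \times 100 \tag{2}$$

$$\%\ \text{specificity} = \frac{\text{true negative}}{\text{expected negative}} \times 100 \tag{3}$$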
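The three threshold schemes and Equations (1) to (3) can be sketched as follows. This is only an illustration: the function names, the label codes and the handling of values falling exactly on a boundary are assumptions, and the predicted values are invented for the example.

```python
import numpy as np

POSITIVE, NEGATIVE, UNCERTAIN = 1, 0, -1

def assign(predicted, half_width=None):
    """Label each predicted membership value.

    half_width=None applies the single 0.5 cut-off.
    half_width=0.4 gives the 0.4-0.6 scheme (positive in [0.6, 1.4], negative in [-0.4, 0.4]).
    half_width=0.3 gives the 0.3-0.7 scheme (positive in [0.7, 1.3], negative in [-0.3, 0.3]).
    """
    predicted = np.asarray(predicted, dtype=float)
    if half_width is None:
        return np.where(predicted > 0.5, POSITIVE, NEGATIVE)
    labels = np.full(predicted.shape, UNCERTAIN)
    labels[np.abs(predicted - 1.0) <= half_width] = POSITIVE
    labels[np.abs(predicted) <= half_width] = NEGATIVE
    return labels

def rates(predicted, expected, half_width=None):
    """Return (% sensitivity, % specificity, % correct classification)."""
    labels = assign(predicted, half_width)
    expected = np.asarray(expected)
    true_pos = np.sum((labels == POSITIVE) & (expected == 1))
    true_neg = np.sum((labels == NEGATIVE) & (expected == 0))
    sensitivity = 100.0 * true_pos / np.sum(expected == 1)
    specificity = 100.0 * true_neg / np.sum(expected == 0)
    correct = 100.0 * (true_pos + true_neg) / expected.size
    return float(sensitivity), float(specificity), float(correct)

# Invented predictions for three samples of the modelled class and three others.
predicted = [0.95, 0.65, 0.55, 0.10, 0.35, 1.50]
expected = [1, 1, 1, 0, 0, 0]
for half_width in (None, 0.4, 0.3):
    print(half_width, rates(predicted, expected, half_width))
```

Uncertain samples count against both sensitivity and specificity, since they are neither true positives nor true negatives; this is why the rates can only stay equal or decrease as the uncertainty zone widens.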


FIGURE 1 DEFINITION OF THE THRESHOLDS INDICATING THE TRUE, FALSE OR UNCERTAIN ATTRIBUTION OF PREDICTED SAMPLES TO THE MODELLED VARIETY (A: 0.5 THRESHOLD, B: 0.4-0.6 THRESHOLD, C: 0.3-0.7 THRESHOLD)

MB-PLS1-DA was then applied with one predictor block X1 consisting of the 15 variables of the GC data and a second predictor block X2 comprising the 948 variables of the MIR data, after their respective pre-treatments and mean-centring. Autoscaling was not applied since it could cause the information from the large MIR block to dominate that of the small GC block [28]. However, two scaling strategies were tested: one in which the block scores are weighted to take into account the number of variables in each block, as indicated in Equation (7), and the other without any weighting. The response matrix Y contained the 3 varietal origins of the 334 EVOO samples, and three independent models were built to predict each origin against the other two combined. The MB-PLS algorithm used is the one developed by Westerhuis, Kourti & MacGregor [29], detailed in Equations (4) to (14).

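$$\mathbf{X}_i = \mathbf{T}_i \mathbf{P}_i^{\mathsf{T}} + \mathbf{E}_i \tag{4}$$

$$\mathbf{Y} = \mathbf{T}_s \mathbf{Q}^{\mathsf{T}} + \mathbf{F} \tag{5}$$

where Xi is the matrix of predictors for block i, Ti the block scores matrix, Pi the block loadings matrix, Ei the block residuals, Y the response matrix, Ts the super-scores matrix, Q the response loadings matrix and F the response residuals.

The variable weights (wi) in Equation (6) are calculated separately for each block using the response scores (u) and then normalised.

$$\mathbf{w}_i = \mathbf{X}_i^{\mathsf{T}} \mathbf{u} \tag{6}$$

The scores (ti) in Equation (7) are also computed for each block Xi so that the covariance between the response Y and the scores is maximised, and are scaled by the square root of the number of variables in the block (mi).

$$\mathbf{t}_i = \frac{\mathbf{X}_i \mathbf{w}_i}{\sqrt{m_i}} \quad \text{to maximise} \quad \left| \mathbf{Y}^{\mathsf{T}} \sum \mathbf{t}_i \right|^2 \tag{7}$$

The block scores are then combined into a super-matrix (T) and a PLS is performed between T and Y. The super-weights (ws) are normalised before calculation of the super-scores (ts), response loadings (q) and response scores (u), in Equations (8) to (11).

$$\mathbf{w}_s = \mathbf{T}^{\mathsf{T}} \mathbf{u} \tag{8}$$

$$\mathbf{t}_s = \mathbf{T} \mathbf{w}_s \tag{9}$$

$$\mathbf{q} = \frac{\mathbf{Y}^{\mathsf{T}} \mathbf{t}_s}{\mathbf{t}_s^{\mathsf{T}} \mathbf{t}_s} \tag{10}$$

$$\mathbf{u} = \frac{\mathbf{Y} \mathbf{q}}{\mathbf{q}^{\mathsf{T}} \mathbf{q}} \tag{11}$$

Equations (6) to (11) are repeated until convergence of u. Then the block loadings (pi) are calculated, and the Xi and Y matrices are deflated using the super-scores as indicated in Equations (12) to (14). This process, from Equation (6) to (14), is repeated for the next latent variable (LV) with the new Xi and Y.

$$\mathbf{p}_i = \frac{\mathbf{X}_i^{\mathsf{T}} \mathbf{t}_s}{\mathbf{t}_s^{\mathsf{T}} \mathbf{t}_s} \tag{12}$$

$$\mathbf{X}_i \leftarrow \mathbf{X}_i - \mathbf{t}_s \mathbf{p}_i^{\mathsf{T}} \tag{13}$$

$$\mathbf{Y} \leftarrow \mathbf{Y} - \mathbf{t}_s \mathbf{q}^{\mathsf{T}} \tag{14}$$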
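To make the iteration concrete, the sketch below extracts a single latent variable by following Equations (6) to (14) literally. It is a NumPy illustration under stated assumptions, not the Matlab routines or the Multi-block Toolbox used in the study: the function name, the convergence tolerance, the starting value of u and the random test data are hypothetical, and the `weighted` flag simply switches the division by the square root of mi on or off.

```python
import numpy as np

def mbpls_component(blocks, y, weighted=True, tol=1e-10, max_iter=500):
    """Extract one latent variable from mean-centred blocks and response.

    blocks: list of arrays of shape (n_samples, m_i); y: array of shape (n_samples, 1).
    Returns super-scores, super-weights, response loadings, block loadings,
    deflated blocks and deflated response.
    """
    u = y[:, [0]].copy()                         # starting response scores (assumption)
    for _ in range(max_iter):
        block_scores = []
        for X in blocks:
            w = X.T @ u                          # Eq. (6), then normalised
            w = w / np.linalg.norm(w)
            t = X @ w                            # Eq. (7)
            if weighted:
                t = t / np.sqrt(X.shape[1])      # block scaling by sqrt(m_i)
            block_scores.append(t)
        T = np.hstack(block_scores)              # super-matrix of block scores
        ws = T.T @ u                             # Eq. (8), then normalised
        ws = ws / np.linalg.norm(ws)
        ts = T @ ws                              # Eq. (9)
        q = (y.T @ ts) / (ts.T @ ts)             # Eq. (10)
        u_new = (y @ q) / (q.T @ q)              # Eq. (11)
        converged = np.linalg.norm(u_new - u) <= tol * np.linalg.norm(u_new)
        u = u_new
        if converged:
            break
    loadings = [(X.T @ ts) / (ts.T @ ts) for X in blocks]        # Eq. (12)
    deflated = [X - ts @ p.T for X, p in zip(blocks, loadings)]  # Eq. (13)
    y_deflated = y - ts @ q.T                                    # Eq. (14)
    return ts, ws, q, loadings, deflated, y_deflated

# Random stand-in data: a small block and a large block, one binary response.
rng = np.random.default_rng(1)
X1 = rng.normal(size=(30, 4))
X2 = rng.normal(size=(30, 50))
y = (X1[:, [0]] + 0.1 * rng.normal(size=(30, 1)) > 0).astype(float)
X1, X2, y = X1 - X1.mean(axis=0), X2 - X2.mean(axis=0), y - y.mean(axis=0)
ts, ws, q, loadings, deflated, y_deflated = mbpls_component([X1, X2], y)
```

Calling the function again on `deflated` and `y_deflated` would give the next latent variable. The two entries of `ws` show how much each block contributes to the super-score for that latent variable.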

For both PLS1-DA and MB-PLS1-DA models, given the large number of samples, these were randomly divided into a calibration set and a prediction set of equal size (167 samples each). A first version of the models was built with these sets, then a second version was built after swapping them. For each version of the PLS1-DA and MB-PLS1-DA models, a “leave-one-out” cross-validation procedure was applied to the calibration set to find the optimal number of LV. This number had to be large enough to minimise the root mean square error of cross-validation (RMSECV), but not so large as to cause over-fitting. Thus, the optimal number was chosen as the highest number of LV (n) meeting the criterion in Equation (15).

$$\frac{\mathrm{RMSECV}(n) - \mathrm{RMSECV}(n-1)}{\mathrm{RMSECV}(n)} > 5\% \tag{15}$$
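The selection rule of Equation (15) can be sketched as below. One assumption is made explicit here: the criterion is read as the relative decrease in RMSECV obtained by adding the n-th LV, that is, the magnitude of the ratio in Equation (15), since RMSECV normally falls as useful LV are added. The function name and the RMSECV values are hypothetical.

```python
import numpy as np

def select_n_lv(rmsecv, min_gain=0.05):
    """Return the highest number of LV whose addition still lowers RMSECV by more than min_gain.

    rmsecv[k] is the cross-validation error of the model with k + 1 latent variables.
    """
    rmsecv = np.asarray(rmsecv, dtype=float)
    best = 1
    for n in range(2, rmsecv.size + 1):
        # Relative decrease when going from n - 1 to n latent variables
        # (sign convention assumed, see the lead-in).
        gain = (rmsecv[n - 2] - rmsecv[n - 1]) / rmsecv[n - 1]
        if gain > min_gain:
            best = n
    return best

# Invented RMSECV curve that levels off after three latent variables.
print(select_n_lv([0.40, 0.25, 0.20, 0.195, 0.194]))
```

The function follows the wording "highest number of LV meeting the criterion", so a late drop in RMSECV after a plateau would still be selected; stopping at the first LV that fails the criterion would be a stricter alternative.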

The calibration model was then computed with the selected number of LV. Finally, the prediction set was used to calculate the predicted response according to this calibration model. In addition to the number of LV, determination coefficients of calibration (R2) and prediction (Q2), as well as root mean square errors of calibration (RMSEC) and prediction (RMSEP) were calculated.
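The quality parameters listed above can be computed as in the following sketch. The text does not give the formulas used by the software, so the usual definitions are assumed here: the root mean square error of the residuals, and a determination coefficient equal to one minus the ratio of the residual to the total sum of squares. Applied to the calibration set these give RMSEC and R2, and applied to the prediction set they give RMSEP and Q2. The function names and values are hypothetical.

```python
import numpy as np

def rmse(expected, predicted):
    """Root mean square error between expected (0/1) and predicted values."""
    residuals = np.asarray(expected, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(residuals ** 2)))

def determination_coefficient(expected, predicted):
    """1 - SS_residual / SS_total (assumed definition of R2 and Q2)."""
    expected = np.asarray(expected, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((expected - predicted) ** 2)
    ss_tot = np.sum((expected - expected.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Invented prediction-set values for two positive and two negative samples.
expected = [1, 1, 0, 0]
predicted = [0.9, 1.1, 0.1, -0.1]
print(rmse(expected, predicted), determination_coefficient(expected, predicted))
```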

2.4.3. Software

Chemometric pre-treatments were applied using The Unscrambler® X (version 10.4, CAMO Software). The pre-treated matrices were then imported into Matlab® (version 7.8 R2009a, MathWorks) for data analysis. MB-PLS1-DA routines, including a calibration step with cross-validation followed by a prediction step, were developed based on the Multi-block Toolbox by Frans van den Berg [30].

3. Results and discussion

Tunisian EVOO from the three varieties Chemlali, Chetoui and Oueslati were analysed by GC and MIR spectroscopy.

3.1. Gas chromatography profiles

The GC profiles obtained after the transmethylation of triacylglycerols, which represent around 98% of the total content of olive oils, indicate the relative proportions of the major compounds (fourteen fatty acids and squalene). An example chromatogram, with peak identification from Ollivier et al. [31], is shown in Figure 2. The predominant peaks are due to oleic (18:1ω9), palmitic (16:0) and linoleic (18:2ω6) acids, and two other fatty acids, namely stearic (18:0) and z-vaccenic (18:1ω7) acids, are present in intermediate amounts. Fatty acids present in smaller amounts may nevertheless also play an important part in the discrimination of varietal origin. The mean, maximum and minimum percentages of the fifteen major compounds, as well as the sums of saturated fatty acids (SFA), monounsaturated fatty acids (MUFA) and polyunsaturated fatty acids (PUFA) for each of the three VOO varieties are compiled in Table 1. Most of the SFA percentages, namely those of margaric (17:0), stearic (18:0), arachidic (20:0), behenic (22:0) and lignoceric (24:0) acids, do not differ significantly between the three studied varieties. Margaroleic (17:1ω8) acid and the only measured ω3, linolenic acid (18:3ω3), are not discriminant either. Nevertheless, Chemlali samples are characterised by a generally higher SFA content, mainly due to their high levels of palmitic (16:0) acid. They contain more palmitoleic (16:1ω7) and z-vaccenic (18:1ω7) acids, but less oleic (18:1ω9) and gondoic (20:1ω9) acids than the other two varieties, resulting in an overall lower MUFA content. The amount of squalene in Chemlali samples is also lower. Chetoui samples have intermediate total SFA, MUFA and PUFA contents compared to the other two varieties, but they are distinguished by their higher levels of hypogeic (16:1ω9) acid and lower levels of palmitoleic (16:1ω7) acid. Finally, Oueslati samples contain less PUFA because of their low levels of linoleic (18:2ω6) acid.

FIGURE 2 EXAMPLE OF A CHROMATOGRAM FROM VOO WITH IDENTIFICATION OF THE PEAKS

1: PALMITIC ACID (16:0), 2: HYPOGEIC ACID (16:1 ω9), 3: PALMITOLEIC ACID (16:1 ω7), 4: MARGARIC ACID (17:0), 5: MARGAROLEIC ACID (17:1 ω8), 6: STEARIC ACID (18:0), 7: OLEIC ACID (18:1 ω9), 8: Z-VACCENIC ACID (18:1 ω7), 9: LINOLEIC ACID (18:2 ω6), 10: LINOLENIC ACID (18:3 ω3), 11: ARACHIDIC ACID (20:0), 12: GONDOIC ACID (20:1 ω9), 13: BEHENIC ACID (22:0), 14: LIGNOCERIC ACID (24:0) AND 15: SQUALENE


TABLE 1 MEAN, MAXIMUM AND MINIMUM PROPORTIONS (%) OF FATTY ACIDS AND SQUALENE FOR THE THREE VARIETIES (CM: CHEMLALI, CT: CHETOUI, OU: OUESLATI)

| Variety | Value | 16:0 | 16:1ω9 | 16:1ω7 | 17:0 | 17:1ω8 | 18:0 | 18:1ω9 | 18:1ω7 | 18:2ω6 | 18:3ω3 | 20:0 | 20:1ω9 | 22:0 | 24:0 | Squa | SFA | MUFA | PUFA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CM | Mean | 17.40 | 0.06 | 2.20 | 0.04 | 0.07 | 2.44 | 57.61 | 3.11 | 15.28 | 0.67 | 0.44 | 0.20 | 0.12 | 0.07 | 0.28 | 20.52 | 62.38 | 15.95 |
| CM | Max | 22.75 | 0.12 | 3.44 | 0.07 | 0.10 | 2.99 | 64.32 | 3.82 | 21.31 | 1.17 | 0.54 | 0.27 | 0.16 | 0.09 | 0.46 | 25.77 | 68.41 | 22.48 |
| CM | Min | 13.13 | 0.02 | 1.57 | 0.03 | 0.05 | 1.99 | 47.59 | 2.47 | 10.75 | 0.51 | 0.36 | 0.14 | 0.10 | 0.05 | 0.11 | 15.80 | 53.10 | 11.47 |
| CT | Mean | 11.37 | 0.13 | 0.30 | 0.05 | 0.05 | 2.82 | 66.46 | 1.31 | 15.28 | 0.69 | 0.48 | 0.39 | 0.13 | 0.05 | 0.49 | 14.90 | 69.77 | 15.97 |
| CT | Max | 14.47 | 0.17 | 0.56 | 0.09 | 0.11 | 3.43 | 73.53 | 1.71 | 21.16 | 0.90 | 0.54 | 0.47 | 0.16 | 0.08 | 0.67 | 17.76 | 76.86 | 21.87 |
| CT | Min | 8.71 | 0.10 | 0.17 | 0.04 | 0.03 | 2.34 | 59.39 | 0.90 | 10.14 | 0.54 | 0.42 | 0.31 | 0.10 | 0.03 | 0.37 | 12.35 | 62.94 | 10.92 |
| OU | Mean | 11.50 | 0.09 | 0.62 | 0.04 | 0.05 | 2.36 | 70.79 | 1.79 | 10.59 | 0.63 | 0.44 | 0.35 | 0.14 | 0.07 | 0.53 | 14.55 | 73.92 | 11.22 |
| OU | Max | 15.75 | 0.12 | 0.99 | 0.04 | 0.06 | 3.22 | 74.47 | 2.40 | 15.47 | 0.75 | 0.50 | 0.40 | 0.15 | 0.09 | 0.67 | 18.63 | 77.48 | 16.09 |
| OU | Min | 9.61 | 0.07 | 0.43 | 0.03 | 0.04 | 1.88 | 64.41 | 1.31 | 7.97 | 0.50 | 0.39 | 0.28 | 0.12 | 0.05 | 0.36 | 13.25 | 67.67 | 8.72 |

16:0: palmitic acid, 16:1 ω9: hypogeic acid, 16:1 ω7: palmitoleic acid, 17:0: margaric acid, 17:1 ω8: margaroleic acid, 18:0: stearic acid, 18:1 ω9: oleic acid, 18:1 ω7: z-vaccenic acid, 18:2 ω6: linoleic acid, 18:3 ω3: linolenic acid, 20:0: arachidic acid, 20:1 ω9: gondoic acid, 22:0: behenic acid, 24:0: lignoceric acid, Squa: squalene, SFA: saturated fatty acids, MUFA: monounsaturated fatty acids, PUFA: polyunsaturated fatty acids


3.2. MIR spectra

MIR spectra contain information on the global composition of the samples, including potential variations in the concentrations of triacylglycerols but also of different families of minor compounds. Some of these constituents, such as squalene, carotenoids, tocopherols, phytosterols and phenolic compounds, have beneficial nutritional properties. An example MIR spectrum is presented in Figure 3, with band assignments according to Aparicio & Harwood [32]. The bands do not result from a single molecule but rather from the vibrations of chemical bonds that are present in all the compounds of the sample, so their interpretation is less straightforward than for GC peaks. Well-defined bands between 3100 and 1700 cm-1 are attributed to C-H, C=O and C=C stretching vibrations, whereas some overlapping bands between 1500 and 700 cm-1 have been assigned to C-H, C-O and C-C bending vibrations. Contrary to the noticeable differences in the fatty acid profiles, variations in the global composition are not readily perceptible, since the MIR spectra of the VOO samples look similar to the naked eye. The use of chemometric pre-treatments and modelling is therefore necessary to extract the relevant information from these data.

FIGURE 3 EXAMPLE OF A MIR SPECTRUM FROM VOO WITH IDENTIFICATION OF THE BANDS

1: =C-H CIS STRETCHING, 2: C-H STRETCHING, 3: C=O STRETCHING, 4: C=C CIS STRETCHING, 5: C-H BENDING, 6: C-O AND C-C BENDING, 7: C-H BENDING (LONG CHAINS)


TABLE 2 STATISTICAL PARAMETERS AND RESULTS (SENSITIVITY, SPECIFICITY AND CORRECT CLASSIFICATION RATES) OF THE PLS1-DA MODELS USING THE FIRST VERSION OF THE CALIBRATION AND PREDICTION SETS OF EITHER GC, MIR, WEIGHTED MULTIBLOCK OR NON-WEIGHTED MULTIBLOCK DATA TO DISCRIMINATE THE THREE EVOO VARIETIES (CM: CHEMLALI, CT: CHETOUI, OU: OUESLATI)

Number of samples of the modelled variety: CM (Cal: 93, Pred: 94), CT (Cal: 51, Pred: 51), OU (Cal: 23, Pred: 22)

| Threshold | Parameter | CM GC | CM MIR | CM MB weight | CM MB no weight | CT GC | CT MIR | CT MB weight | CT MB no weight | OU GC | OU MIR | OU MB weight | OU MB no weight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | LV | 3 | 4 | 3 | 4 | 6 | 7 | 6 | 7 | 5 | 7 | 5 | 7 |
| | RMSEC | 0.10 | 0.21 | 0.09 | 0.09 | 0.12 | 0.21 | 0.12 | 0.11 | 0.18 | 0.19 | 0.17 | 0.16 |
| | RMSEP | 0.11 | 0.26 | 0.11 | 0.11 | 0.16 | 0.26 | 0.15 | 0.13 | 0.24 | 0.21 | 0.23 | 0.20 |
| | R2 | 0.98 | 0.91 | 0.98 | 0.98 | 0.97 | 0.89 | 0.97 | 0.97 | 0.86 | 0.84 | 0.87 | 0.89 |
| | Q2 | 0.97 | 0.85 | 0.97 | 0.97 | 0.94 | 0.84 | 0.95 | 0.96 | 0.74 | 0.79 | 0.77 | 0.82 |
| 0.5 | %Sens | 100 | 94 | 100 | 100 | 100 | 90 | 100 | 100 | 95 | 95 | 95 | 100 |
| 0.5 | %Spec | 100 | 92 | 100 | 100 | 99 | 96 | 99 | 100 | 100 | 100 | 100 | 100 |
| 0.5 | %CC | 100 | 93 | 100 | 100 | 99 | 94 | 99 | 100 | 99 | 99 | 99 | 100 |
| 0.4-0.6 | %Sens | 100 | 91 | 100 | 100 | 98 | 86 | 98 | 100 | 91 | 82 | 91 | 86 |
| 0.4-0.6 | %Spec | 99 | 89 | 99 | 99 | 97 | 93 | 97 | 100 | 93 | 98 | 94 | 94 |
| 0.4-0.6 | %CC | 99 | 90 | 99 | 99 | 98 | 91 | 98 | 100 | 93 | 96 | 94 | 93 |
| 0.3-0.7 | %Sens | 100 | 78 | 100 | 98 | 94 | 80 | 94 | 98 | 55 | 36 | 64 | 77 |
| 0.3-0.7 | %Spec | 99 | 84 | 99 | 99 | 95 | 83 | 97 | 97 | 89 | 89 | 89 | 92 |
| 0.3-0.7 | %CC | 99 | 80 | 99 | 98 | 95 | 82 | 96 | 98 | 84 | 82 | 86 | 90 |

LV: number of latent variables, RMSEC: root mean square error of calibration, RMSEP: root mean square error of prediction, R2: determination coefficient of calibration, Q2: determination coefficient of prediction, %Sens: sensitivity rate, %Spec: specificity rate, %CC: correct classification rate


3.3. Prediction of olive oil variety

The statistical parameters (number of LV, RMSEC, RMSEP, R2 and Q2) and results (sensitivity, specificity and total correct classification percentages) from the PLS1-DA prediction models developed for each variety based on GC, MIR and Multiblock data with the first version of the calibration and prediction sets are presented in Table 2. Statistical parameters and results obtained with the second version of the calibration and prediction sets can be found in the Supporting Information (SI 1).

3.3.1. PLS1-DA on GC data

The GC model for the Chemlali variety performs very well, with an RMSEP of 0.11 and Q2 of 0.97 for 3 LV, and can perfectly discriminate Chemlali samples from the others using the 0.5 threshold. Regarding the Chetoui variety, the GC model also gives good results, with an RMSEP of 0.16 and Q2 of 0.94 for 6 LV, and 99% correct classification with the 0.5 threshold. The model for Oueslati samples is slightly less efficient, yielding an RMSEP of 0.24 and Q2 of 0.74 for 5 LV, but still reaches a correct classification rate of 99% with the 0.5 threshold. However, with the 0.4-0.6 and 0.3-0.7 thresholds, a drop in the percentages of correctly classified samples indicates that some of them actually fall in the uncertainty zones. This is especially true for Oueslati samples: sensitivity decreases from 95% to 91% with the 0.4-0.6 threshold and drops dramatically to 55% with the 0.3-0.7 threshold, while specificity is less affected, going from 100% to 93% and 89% respectively. The models predicting the other two cultivars appear to be more discriminative. Indeed, for Chetoui the sensitivity and specificity are still 98% and 97% with the 0.4-0.6 threshold, and only drop to 94% and 95% respectively with the 0.3-0.7 threshold. The Chemlali model is still very good, with 100% sensitivity and 99% specificity with both thresholds, showing that most samples are predicted close to their expected values. The poorer performance of the Oueslati model may be due to the smaller number of samples from this cultivar, which creates a strong imbalance between the positive and negative classes in this model.

The permutation of calibration and prediction sets brings some modifications to these results, especially for the Chetoui and Oueslati models, which do not use the same number of LV. The calibration quality parameters are slightly lower, but the prediction parameters are improved. The Chetoui model gives slightly better results with only 3 LV, instead of 6 LV for the previous version. The Oueslati model is built with 6 LV, versus 5 LV in the first version, and also has better results overall. Thus, the samples selected for the calibration and prediction sets have some influence on the performance of the models.

3.3.2. PLS1-DA on MIR data

PLS1-DA on MIR data is less satisfactory than PLS1-DA on GC data, especially for the Chemlali and Chetoui models, which were very good with GC data. Indeed, the model predicting the Chemlali variety has an RMSEP of 0.26 and Q2 of 0.85 with 4 LV, while for Chetoui, with 7 LV, the RMSEP and Q2 are 0.26 and 0.84 respectively. The Oueslati model also uses 7 LV, but its prediction quality parameters are better than those of the GC-based model, with an RMSEP of 0.21 and Q2 of 0.79. Using the 0.5 threshold, the three varieties are still quite well predicted, with total correct classification rates of 93% for Chemlali, 94% for Chetoui and 99% for Oueslati. The 0.4-0.6 and 0.3-0.7 thresholds identify even more uncertain samples with MIR than with GC data. For Oueslati samples, the sensitivity drops to 82% with the former and to 36% with the latter, while the specificity is less affected and only decreases to 98% and 89%. With MIR data, unlike with GC data, the model predicting the Chemlali origin is more affected by the change of threshold than the Chetoui model. The sensitivity and specificity of the Chemlali model decrease to 91% and 89% respectively with the 0.4-0.6 threshold, then to 78% and 84% with the 0.3-0.7 threshold. For the Chetoui model the sensitivity and specificity decrease first to 86% and 93% respectively, and then to 80% and 83%. These results indicate that GC data are more discriminative than MIR data, since samples are more clearly predicted as belonging or not to the modelled variety with the former. However, MIR data still contain valuable information that could be used when GC-based models do not yield good results, as evidenced by the performance of the MIR-based Oueslati model.

The permutation of calibration and prediction sets also leads to variations in the performance of the models. As for the GC-based models, the permutation worsens the quality parameters for the calibration but improves them for the prediction. In this case, the prediction results are better for the Oueslati and Chemlali models, but not for the Chetoui model.


3.3.3. Multiblock PLS1-DA

When applying the MB-PLS1-DA models with the scores weighting, the results are similar to those obtained with GC data for all three varieties with the different thresholds. The models are built with the same number of LV (3 for Chemlali, 6 for Chetoui and 8 for Oueslati) and have the same values of quality parameters. The only noticeable difference is an improvement in the Q2 for the Oueslati model, from 0.74 with GC to 0.77 with MB-PLS. Moreover, the sensitivity with the 0.3-0.7 threshold is improved and goes from 55% with GC alone and 36% with MIR alone to 64% with the multiblock model. Thus, the use of MB-PLS could reduce the influence of the imbalanced number of samples. The Chemlali and Chetoui models, for which the GC data alone gave better results, are not negatively influenced by the addition of the MIR data, and the Oueslati model, for which the MIR data alone gave slightly better results, benefits from the combination of the two sources of information. The permutation of calibration and prediction sets slightly changes the results of the individual models, but for the multiblock models with weighting of the scores the results remain close to those of the GC models. These results suggest that the GC data, containing information on the major compounds of olive oil, has a stronger influence on the weighted multiblock models than the MIR data, which represents all the major and minor compounds. This can be highlighted by a study of the contribution of each block to the final model, as presented in Figure 4 (a, b and c). Indeed, GC data is predominant on all the LVs for each variety and is the main contributor to the first LV. MIR data brings some contribution to the following LVs, which indicates a possible synergy between the two sources of information. However, the scaling applied by the MB-PLS algorithm to compensate for the much larger number of variables in the MIR block (948 variables for MIR versus 15 variables for GC) strongly reduces the influence of MIR data on the models. Block weights for the Chemlali variety are not affected by the permutation of calibration and prediction sets. However, for the Chetoui variety the second version of the model only uses three LV, which mostly contain information from the GC block. On the contrary, for the Oueslati variety MIR data has more influence on the second version of the model.
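To make the effect of the score weighting concrete, the sketch below computes the block contributions to the first latent variable of a multiblock PLS1 model, with and without dividing each block score by the square root of its number of variables. This follows the multiblock PLS formulation commonly attributed to Westerhuis et al., which is assumed here to be close to the algorithm used in the study; the function name and the synthetic data are hypothetical, and only the block sizes (15 and 948 variables) come from the text.

```python
import numpy as np

def first_lv_block_weights(blocks, y, scale_scores=True):
    """Squared super-weights of each block on the first latent variable.
    For a single response (PLS1), the y-score is the centred response itself,
    so no iteration is needed for this first component."""
    u = y - y.mean()
    block_scores = []
    for X in blocks:
        Xc = X - X.mean(axis=0)            # column mean-centring
        w = Xc.T @ u
        w /= np.linalg.norm(w)             # unit-length block weight vector
        t = Xc @ w                         # block score
        if scale_scores:
            t = t / np.sqrt(X.shape[1])    # compensate for the block size
        block_scores.append(t)
    T = np.column_stack(block_scores)      # "super-matrix" of block scores
    w_super = T.T @ u
    w_super /= np.linalg.norm(w_super)
    return w_super ** 2                    # contributions summing to 1

# Synthetic stand-ins for a small GC block and a large MIR block.
rng = np.random.default_rng(0)
n = 60
y = np.repeat([0.0, 1.0], n // 2)
gc = rng.normal(size=(n, 15)) + 1.0 * y[:, None]
mir = rng.normal(size=(n, 948)) + 0.2 * y[:, None]

weighted = first_lv_block_weights([gc, mir], y, scale_scores=True)
unweighted = first_lv_block_weights([gc, mir], y, scale_scores=False)
print("weighted   (GC, MIR):", weighted)
print("unweighted (GC, MIR):", unweighted)
```

With the scaling, the ratio of the large block's super-weight to the small block's is multiplied by sqrt(15/948), so the relative contribution of the large block necessarily decreases, in line with the reduced influence of the MIR data reported above.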

FIGURE 4. WEIGHTS OF THE GC (BLUE) AND MIR (YELLOW) BLOCKS FOR EACH LATENT VARIABLE OF THE MB-PLS MODELS WITH THE FIRST VERSION OF THE CALIBRATION AND PREDICTION SETS, WITH WEIGHTED BLOCK SCORES (A: CHEMLALI, B: CHETOUI, C: OUESLATI) AND NON-WEIGHTED BLOCK SCORES (D: CHEMLALI, E: CHETOUI, F: OUESLATI)

In order to take better advantage of the complementary information brought by the MIR data, MB-PLS1-DA models without any scores weighting were built to give more importance to this additional source. Indeed, the number of selected LV for these multiblock models is the same as for the MIR-based models for each cultivar. However, the results are improved compared to the previous models, especially for the Oueslati and Chetoui cultivars. For the Chemlali model, the results are close to those of the already effective GC-based model, with only a slightly lower sensitivity of 98% (versus 100%) for the 0.3-0.7 threshold. The prediction quality parameters are improved for the Chetoui model, with an RMSEP of 0.13 and a Q2 of 0.96. Moreover, the MB-PLS model without weighting gives perfect predictions with both the 0.5 and the 0.4-0.6 thresholds for this cultivar. Using the 0.3-0.7 threshold, the sensitivity and specificity are also better than with GC or MIR data alone, reaching a total correct classification of 98% versus 95% for the GC-based model. The quality parameters are also improved with the MB-PLS model predicting the Oueslati cultivar, reaching an RMSEP of 0.20 and a Q2 of 0.82. This model also results in a perfect prediction with the 0.5 threshold. The sensitivity and specificity observed with the 0.4-0.6 threshold are intermediate between those of the GC-based and MIR-based models. Nevertheless, with the 0.3-0.7 threshold the sensitivity and specificity are much better with this MB-PLS model, reaching 77% and 92% respectively. Again, the multiblock approach seems able to correct the issue caused by the imbalanced number of samples in the classes. After the permutation of the calibration and prediction sets, the improvement brought by the non-weighted MB-PLS1-DA models is somewhat lost, since the results appear to lie between those obtained with GC and MIR data alone.

The contributions of the blocks without weighting presented in Figure 4 (d, e and f) show that, despite its much larger number of variables, the MIR block does not overshadow the GC block. On the contrary, there seems to be a good balance between the two sources of information. GC data is still predominant on the first LV for each model, but MIR data has a greater influence on the later components. A similar pattern can be observed after the permutation of the calibration and prediction sets (SI 2), but with less influence of the GC data on the later LVs, which might contribute to the poorer results. Finally, the study of block weights confirms that, although variations in the contents of the major compounds of olive oils measured by GC can discriminate most of the samples from the three studied varieties, complementary information about the global composition of the samples detected in their MIR spectra can play a part in the improvement of the prediction models. Thus, the MB-PLS1-DA models without any weighting of the scores can be useful for samples whose origin is difficult to certify from their MIR spectrum or their fatty acid profile only.

4. Conclusion

This study shows that PLS1-DA models using GC data alone give very good results for the discrimination of olive oil origin, especially for the Chemlali variety. Using MIR data alone is less efficient, even though it reaches more than 80% of correct classification with the most conservative threshold. Moreover, combining specific information on the major compounds from GC with global information on all major and minor compounds from MIR data can improve the prediction results for the varieties that were not well discriminated with GC data only. In this regard, scaling the block scores to take into account their number of variables strongly reduces the influence of the MIR data. Thus, the results from the weighted MB-PLS1-DA models are close to those of the GC-based models. On the contrary, using non-weighted MB-PLS1-DA models allows for a synergy between the two sources of information and results in better quality parameters and higher sensitivity, specificity and total correct classification percentages for the Chetoui and Oueslati varieties. These results should nevertheless be considered with caution, since the permutation of calibration and prediction sets indicated that the performance of the different models depends on the samples used to develop and test them. From a food control perspective, MIR analysis is cheaper and faster than GC and could be used as a first screening tool. In a second phase, multiblock models combining MIR and GC data can strongly improve the discrimination for the samples that were in the uncertainty zones with the first model.

Acknowledgements

The authors thank the Olive Tree Institute of Sfax, Tunisia, for providing the olive oil samples.

Funding

This work was financially supported by the French National Agency for Research (ANR) as part of the MedOOmics project, included in the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement number 618127 (ARIMNet2). Financial support was also obtained from the “PHC Utique” program of the Tunisian Ministry of Higher Education and Scientific Research and the French Ministry of Foreign Affairs, in the Committee for University Cooperation (CMCU) project number 11G1214.

Conflict of interest

The authors have no conflict of interest to declare.

References

[1] Charlebois, S., Schwab, A., Henn, R., & Huck, C. V. (2016). Food fraud: An exploratory study for measuring consumer perception towards mislabeled food products and influence on self-authentication intentions. Trends in Food Science & Technology, 50:211-218.
[2] Tena, N., Wang, S. C., Aparicio-Ruiz, R., García-González, D. L., & Aparicio, R. (2015). In-Depth Assessment of Analytical Methods for Olive Oil Purity, Safety, and Quality Characterization. Journal of Agricultural and Food Chemistry, 63:4509-4526.
[3] Dais, P., & Hatzakis, E. (2013). Quality Assessment and Authentication of Virgin Olive Oil by NMR Spectroscopy: A Critical Review. Analytica Chimica Acta, 765:1-27.
[4] Nenadis, N., & Tsimidou, M. Z. (2017). Perspective of vibrational spectroscopy analytical methods in on-field/official control of olives and virgin olive oil. European Journal of Lipid Science and Technology, 119:1600148.
[5] Guzmán, E., Baeten, V., Pierna, J. A. F., & García-Mesa, J. A. (2015). Evaluation of the Overall Quality of Olive Oil Using Fluorescence Spectroscopy. Food Chemistry, 173:927-934.
[6] Brereton, R. G., Jansen, J., Lopes, J., Marini, F., Pomerantsev, A., Rodionova, O., Roger, J. M., Walczak, B., & Tauler, R. (2017). Chemometrics in analytical chemistry-part I: history, experimental design and data analysis tools. Analytical and Bioanalytical Chemistry, 409:5891-5899.
[7] Callao, M. P., & Ruisánchez, I. (2018). An overview of multivariate qualitative methods for food fraud detection. Food Control, 86:283-293.
[8] Borràs, E., Ferré, J., Boqué, R., Mestres, M., Aceña, L., & Busto, O. (2015). Data Fusion Methodologies for Food and Beverage Authentication and Quality Assessment – A Review. Analytica Chimica Acta, 891:1-14.
[9] Wold, S., Kettaneh, N., & Tjessem, K. (1996). Hierarchical Multiblock PLS and PC Models for Easier Model Interpretation and as an Alternative to Variable Selection. Journal of Chemometrics, 10:463-482.
[10] Wangen, L. E., & Kowalski, B. R. (1989). A Multiblock Partial Least Squares Algorithm for Investigating Complex Chemical Systems. Journal of Chemometrics, 3:3-20.
[11] Gómez-Caravaca, A. M., Maggio, R. M., & Cerretani, L. (2016). Chemometric applications to assess quality and critical parameters of virgin and extra-virgin olive oil. A review. Analytica Chimica Acta, 913:1-21.
[12] de B Harrington, P., Kister, J., Artaud, J., & Dupuy, N. (2009). Automated Principal Component-Based Orthogonal Signal Correction Applied to Fused Near Infrared−Mid-Infrared Spectra of French Olive Oils. Analytical Chemistry, 81:7160-7169.
[13] Casale, M., Sinelli, N., Oliveri, P., Di Egidio, V., & Lanteri, S. (2010). Chemometrical Strategies for Feature Selection and Data Compression Applied to NIR and MIR Spectra of Extra Virgin Olive Oils for Cultivar Identification. Talanta, 80:1832-1837.
[14] Casale, M., Casolino, C., Oliveri, P., & Forina, M. (2010). The Potential of Coupling Information Using Three Analytical Techniques for Identifying the Geographical Origin of Liguria Extra Virgin Olive Oil. Food Chemistry, 118:163-170.
[15] Dupuy, N., Galtier, O., Ollivier, D., Vanloot, P., & Artaud, J. (2010). Comparison between NIR, MIR, Concatenated NIR and MIR Analysis and Hierarchical PLS Model. Application to Virgin Olive Oil Analysis. Analytica Chimica Acta, 666:23-31.
[16] Casale, M., Oliveri, P., Casolino, C., Sinelli, N., Zunin, P., Armanino, C., Forina, M., & Lanteri, S. (2012). Characterisation of PDO Olive Oil Chianti Classico by Non-Selective (UV–Visible, NIR and MIR Spectroscopy) and Selective (Fatty Acid Composition) Analytical Techniques. Analytica Chimica Acta, 712:56-63.
[17] Haddi, Z., Alami, H., El Bari, N., Tounsi, M., Barhoumi, H., Maaref, A., Jaffrezic-Renault, N., & Bouchikhi, B. (2013). Electronic Nose and Tongue Combination for Improved Classification of Moroccan Virgin Olive Oil Profiles. Food Research International, 54:1488-1498.
[18] Pizarro, C., Rodríguez-Tecedor, S., Pérez-del-Notario, N., Esteban-Díez, I., & González-Sáiz, J. M. (2013). Classification of Spanish Extra Virgin Olive Oils by Data Fusion of Visible Spectroscopic Fingerprints and Chemical Descriptors. Food Chemistry, 138:915-922.
[19] Dias, L. G., Rodrigues, N., Veloso, A. C. A., Pereira, J. A., & Peres, A. M. (2016). Monovarietal extra-virgin olive oil classification: a fusion of human sensory attributes and an electronic tongue. European Food Research and Technology, 242:259-270.
[20] Kosma, I., Badeka, A., Vatavali, K., Kontakos, S., & Kontominas, M. (2016). Differentiation of Greek Extra Virgin Olive Oils According to Cultivar Based on Volatile Compound Analysis and Fatty Acid Composition. European Journal of Lipid Science and Technology, 118:849-861.
[21] Bajoub, A., Medina-Rodríguez, S., Gómez-Romero, M., Bagur-González, M. G., Fernández-Gutiérrez, A., & Carrasco-Pancorbo, A. (2017). Assessing the Varietal Origin of Extra-Virgin Olive Oil Using Liquid Chromatography Fingerprints of Phenolic Compound, Data Fusion and Chemometrics. Food Chemistry, 215:245-255.
[22] Forina, M., Oliveri, P., Bagnasco, L., Simonetti, R., Casolino, M. C., Grifi, F. N., & Casale, M. (2015). Artificial nose, NIR and UV–visible spectroscopy for the characterisation of the PDO Chianti Classico olive oil. Talanta, 144:1070-1078.
[23] Laroussi-Mezghani, S., Vanloot, P., Molinet, J., Dupuy, N., Hammami, M., Grati-Kamoun, N., & Artaud, J. (2015). Authentication of Tunisian Virgin Olive Oils by Chemometric Analysis of Fatty Acid Compositions and NIR Spectra. Comparison with Maghrebian and French Virgin Olive Oils. Food Chemistry, 173:122-132.
[24] International Olive Council (2016). COI/T.15/NC No 3/Rev. 11 - Trade Standard Applying to Olive Oils and Olive Pomace Oils. http://www.internationaloliveoil.org/estaticos/view/222-standards. Accessed July 04, 2018.
[25] Galtier, O., Le Dréau, Y., Ollivier, D., Kister, J., Artaud, J., & Dupuy, N. (2008). Lipid Compositions and French Registered Designations of Origins of Virgin Olive Oils Predicted by Chemometric Analysis of Mid-Infrared Spectra. Applied Spectroscopy, 62:583-590.
[26] Granato, D., Putnik, P., Kovačević, D. B., Santos, J. S., Calado, V., Rocha, R. S., Da Cruz, A. G., Jarvis, B., Rodionova, O. Y., & Pomerantsev, A. (2018). Trends in chemometrics: Food authentication, microbiology, and effects of processing. Comprehensive Reviews in Food Science and Food Safety, 17:663-677.
[27] Lee, L. C., Liong, C.-Y., & Jemain, A. A. (2018). Partial Least Squares-Discriminant Analysis (PLS-DA) for Classification of High-Dimensional (HD) Data: A Review of Contemporary Practice Strategies and Knowledge Gaps. The Analyst, 143:3526-3539.
[28] Westerhuis, J. A., & Coenegracht, P. M. J. (1997). Multivariate Modelling of the Pharmaceutical Two-Step Process of Wet Granulation and Tableting with Multiblock Partial Least Squares. Journal of Chemometrics, 11:379-392.
[29] Westerhuis, J. A., Kourti, T., & MacGregor, J. F. (1998). Analysis of Multiblock and Hierarchical PCA and PLS Models. Journal of Chemometrics, 12:301-321.
[30] van den Berg, F. (2004). Multi-block Toolbox for MATLAB. http://www.models.life.ku.dk/mbtoolbox. Accessed July 18, 2018.
[31] Ollivier, D., Artaud, J., Pinatel, C., Durbec, J. P., & Guérère, M. (2003). Triacylglycerol and Fatty Acid Compositions of French Virgin Olive Oils. Characterization by Chemometrics. Journal of Agricultural and Food Chemistry, 51:5723-5731.
[32] Aparicio, R., & Harwood, J. (2013). Handbook of Olive Oil. 2nd ed. Boston, MA: Springer US.

Supporting Information

SI 1. STATISTICAL PARAMETERS AND RESULTS (SENSITIVITY, SPECIFICITY AND CORRECT CLASSIFICATION RATES) OF THE PLS1-DA MODELS USING THE SECOND VERSION OF THE CALIBRATION AND PREDICTION SETS OF EITHER GC, MIR, WEIGHTED MULTIBLOCK OR NON-WEIGHTED MULTIBLOCK DATA TO DISCRIMINATE THE THREE EVOO VARIETIES

| | | CM (Cal: 93, Pred: 94) | | | | CT (Cal: 51, Pred: 51) | | | | OU (Cal: 23, Pred: 22) | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Threshold | Parameter | GC | MIR | MB weight | MB no weight | GC | MIR | MB weight | MB no weight | GC | MIR | MB weight | MB no weight |
| | LV | 3 | 4 | 3 | 3 | 3 | 7 | 3 | 5 | 6 | 8 | 6 | 5 |
| | RMSEC | 0.11 | 0.24 | 0.10 | 0.11 | 0.15 | 0.23 | 0.15 | 0.13 | 0.20 | 0.19 | 0.19 | 0.19 |
| | RMSEP | 0.10 | 0.22 | 0.09 | 0.11 | 0.16 | 0.24 | 0.16 | 0.15 | 0.18 | 0.19 | 0.19 | 0.20 |
| | R2 | 0.98 | 0.88 | 0.98 | 0.97 | 0.94 | 0.87 | 0.95 | 0.96 | 0.80 | 0.83 | 0.83 | 0.82 |
| | Q2 | 0.98 | 0.90 | 0.98 | 0.98 | 0.94 | 0.86 | 0.94 | 0.95 | 0.85 | 0.84 | 0.85 | 0.82 |
| 0.5 | %Sens | 100 | 96 | 100 | 100 | 100 | 88 | 100 | 100 | 96 | 100 | 100 | 100 |
| 0.5 | %Spec | 100 | 96 | 100 | 100 | 99 | 97 | 99 | 99 | 100 | 99 | 100 | 100 |
| 0.5 | %CC | 100 | 96 | 100 | 100 | 99 | 95 | 99 | 99 | 99 | 99 | 100 | 100 |
| 0.4-0.6 | %Sens | 100 | 94 | 100 | 100 | 100 | 80 | 100 | 98 | 83 | 96 | 87 | 83 |
| 0.4-0.6 | %Spec | 100 | 89 | 100 | 100 | 98 | 94 | 99 | 99 | 99 | 98 | 98 | 97 |
| 0.4-0.6 | %CC | 100 | 92 | 100 | 100 | 99 | 90 | 99 | 99 | 96 | 98 | 96 | 95 |
| 0.3-0.7 | %Sens | 100 | 87 | 100 | 100 | 94 | 71 | 96 | 92 | 70 | 70 | 70 | 65 |
| 0.3-0.7 | %Spec | 100 | 81 | 100 | 100 | 96 | 86 | 97 | 97 | 94 | 90 | 94 | 92 |
| 0.3-0.7 | %CC | 100 | 84 | 100 | 100 | 95 | 81 | 96 | 95 | 90 | 87 | 91 | 88 |

SI 2. WEIGHTS OF THE GC (BLUE) AND MIR (YELLOW) BLOCKS FOR EACH LATENT VARIABLE OF THE MB-PLS MODELS WITH THE SECOND VERSION OF THE CALIBRATION AND PREDICTION SETS, WITH WEIGHTED BLOCK SCORES (A: CHEMLALI, B: CHETOUI, C: OUESLATI) AND NON-WEIGHTED BLOCK SCORES (D: CHEMLALI, E: CHETOUI, F: OUESLATI)

Comparison of near- and mid-infrared data fusion strategies: are two better than one?

Astrid Maléchaux, Yveline Le Dréau, Jacques Artaud, Nathalie Dupuy

Aix Marseille Univ, Avignon Université, CNRS, IRD, IMBE, Marseille, France

Abstract

Combining data from different analytical sources could improve the performance of chemometric models by extracting relevant and complementary information for food authentication. In this study, several data fusion strategies, including concatenation (low-level), multiblock and hierarchical models (mid-level), and majority vote (high-level), are applied to near- and mid-infrared (NIR and MIR) spectral data for the varietal discrimination of olive oils from six French cultivars by partial least squares discriminant analysis (PLS1-DA). The performance of the data fusion models is compared between strategies and with the results obtained with NIR or MIR data alone. Concatenation and multiblock PLS1-DA fail to improve the prediction results compared to individual models, since the complementary information appears to be lost in the very large number of variables. Hierarchical models with a dimension reduction involving unsupervised PCA projections deteriorate the predictions, whereas hierarchical models using a PLS1-DA step for dimension reduction provide a more efficient differentiation for most, but not all, of the cultivars. The high-level models using a majority vote benefit from the complementary results of the individual NIR and MIR models, leading to less strongly but more consistently improved results for all cultivars. This strategy supports combining analytical techniques to achieve synergies for an optimized discrimination of origin.

Keywords

Data fusion, chemometrics, vibrational spectroscopy, olive oil, cultivars

1. Introduction

The increasing availability of multivariate data from various analytical sources and large numbers of samples makes it necessary to develop statistical tools that can extract the relevant information from these big datasets. For this purpose, a variety of chemometric models have been developed to predict either quantitative or qualitative properties. In the field of food authentication, the goal is often to determine if the actual characteristics of the samples are in agreement with the information provided by their label. Thus, supervised methods can be used to classify new samples as authentic or non-authentic based on the known characteristics of previous samples. Several approaches have been developed, including class-modelling algorithms, which focus on the similarities within a class, and discriminant analysis algorithms, which focus on the differences between classes [1,2]. Moreover, since many different analytical techniques can be applied to assess the authenticity of food products [2,3], combining data from complementary analyses is expected to improve the performance of the statistical models.

Data fusion strategies are divided into three categories: low-level, mid-level and high-level [4]. Low-level fusion consists of the simple concatenation of the matrices containing the data from the different sources, followed by the analysis of the concatenated data by the chosen chemometric model [5-7]. However, this method suffers from the very large number of variables in the concatenated matrix, with the risk of increasing noise, which may cancel out the advantages of adding sources of information. To solve this issue, mid-level fusion uses a first step of dimension reduction to extract the relevant information from the original data matrices, and only the selected features are then combined to build the chemometric model [8-11]. Finally, high-level fusion builds separate models on each original dataset and combines their prediction results for the final decision, where the class assignment is made according to probability rules or majority vote [12-14].

Previous studies have reported mixed outcomes from the use of data fusion. In most cases the results were improved by low-level [5-7, 9, 10, 14], mid-level [9, 10, 14] or high-level [12-14] strategies, but some studies have also reported that data fusion failed to improve the results compared to models built with individual datasets [8, 10, 11]. In this article, one low-level strategy with concatenation, three mid-level strategies with hierarchical and multiblock models, and one high-level strategy with majority vote are applied to the discrimination of olive oil varietal origin by partial least squares discriminant analysis (PLS1-DA) using near- and mid-infrared (NIR and MIR) spectral data. The performance of the data fusion models is compared between strategies and with individual models using only NIR or MIR data.

2. Material and methods

2.1. Olive oil samples

A total of 218 samples from six monovarietal extra-virgin olive oils produced over three harvest years (2016, 2017 and 2018) were used for this study. The samples came from six typical French cultivars: Aglandau (AG, n=61), Cailletier (CA, n=27), Olivière (OL, n=28), Picholine (PI, n=32), Salonenque (SA, n=36) and Tanche (TA, n=34).

2.2. Near-infrared spectroscopy

FT-NIR spectra were obtained with an Antaris II spectrometer (Thermo Scientific, Waltham, MA, USA) in transmission mode, in a temperature-controlled room at 21°C. The oil was poured into a QX Quartz Suprasil 300 cell (Hellma Analytics, Mülheim, Germany) with an optical path of 2 mm, and an empty quartz cell was used to take a background reference before each measurement. Between samples, the quartz cell was cleaned with isooctane, dried with air, rinsed with dichloromethane and dried again with air. Each spectrum was recorded between 10000 and 4500 cm-1 by the accumulation of 16 scans with a resolution of 4 cm-1. The analysis was repeated twice for each sample and the resulting spectra were averaged. The NIR range between 10000 and 6100 cm-1 was not included in the chemometric models, thus the remaining NIR spectra consisted of 831 variables between 6100 and 4500 cm-1.

2.3. Mid-infrared spectroscopy

FT-MIR spectra were obtained using a Nicolet Avatar spectrometer (Thermo Scientific, Waltham, MA, USA) with a nitrogen-cooled MCT detector, Ever-Glo source and KBr/Ge beam splitter. The measurements were conducted in a temperature-controlled room at 21°C and air was taken as a background reference before each spectrum. A drop of EVOO was placed on the diamond crystal of a Golden Gate ATR accessory (Specac, Orpington, UK). Its spectrum was recorded between 4000 and 600 cm-1 by the accumulation of 64 scans with a resolution of 4 cm-1. The ATR plate was cleaned with ethanol between acquisitions. This process was repeated three times for each EVOO sample and the three resulting spectra were averaged prior to data analysis. The MIR range between 4000 and 1800 cm-1 was not included in the models, so the remaining MIR spectra consisted of 571 variables between 1800 and 700 cm-1.

2.4. Chemometrics

Exploratory analyses were conducted with the Unscrambler X software (CAMO Software, Oslo, Norway) to visualise the distribution of the samples with a principal component analysis (PCA) [15]. Multivariate statistical analyses were performed using several models developed with Matlab R2014b software (The MathWorks, Natick, MA, USA). The spectra were normalised and mean-centred before being used in several variations of partial least squares discriminant analysis (PLS1-DA) [16]. For each model, the samples were assigned a binary code indicating whether they belonged (value of 1) or not (value of 0) to the modelled cultivar. Two thirds of the samples from each cultivar and each harvest year were randomly selected to compose a calibration set, and the remaining third served as a validation set to test the predictive abilities of the models. A sample was considered as belonging to the modelled cultivar if its predicted value was between 0.6 and 1.4, as belonging to the other cultivars if it was between -0.4 and 0.4, and as suspect if it fell outside these boundaries. First, individual PLS1-DA models were developed using either the NIR or the MIR data. Then, several data fusion strategies were applied:

- Low-level: NIR and MIR data were appended in a single matrix and PLS1-DA models were developed using this concatenated dataset.

- Mid-level:
  o In the multiblock models (MB-PLS1-DA), for each iteration the PLS scores were calculated from the individual NIR and MIR data blocks and combined into a “super-matrix” used in the final PLS1-DA step, with the super-matrix and the two blocks being deflated using the “super-scores” instead of the individual scores [17].
  o In the first hierarchical models (PCA-PLS1-DA), NIR and MIR data were subjected to a first step of dimension reduction using PCA and PLS1-DA models were then applied to the fused PCA scores [8].
  o In the second hierarchical models (PLS-PLS1-DA), the dimension reduction was conducted by PLS1-DA on the individual NIR and MIR data, and the final PLS1-DA models were built using the fused PLS scores [9].

- High-level: Separate PLS1-DA models were developed with the NIR and MIR data, and the minimum, maximum and average of the predicted values from both models were included in a majority vote to reach the final prediction [12].
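The decision boundaries and the high-level majority vote described above can be sketched as follows. The boundaries (0.6 to 1.4, -0.4 to 0.4) and the three voting values (minimum, maximum, average) come from the text; the function names, the inclusive boundaries and the handling of a vote without majority (treated as suspect) are assumptions for illustration only.

```python
def decide(pred):
    """Decision rule on a single PLS1-DA predicted value.
    1: modelled cultivar, 0: other cultivars, -1: suspect."""
    if 0.6 <= pred <= 1.4:
        return 1
    if -0.4 <= pred <= 0.4:
        return 0
    return -1

def high_level_vote(pred_nir, pred_mir):
    """Majority vote over the minimum, maximum and average of the two
    predicted values (one from the NIR model, one from the MIR model)."""
    candidates = [
        min(pred_nir, pred_mir),
        max(pred_nir, pred_mir),
        (pred_nir + pred_mir) / 2.0,
    ]
    votes = [decide(p) for p in candidates]
    for label in (1, 0, -1):
        if votes.count(label) >= 2:
            return label
    return -1  # assumption: three different votes are reported as suspect

# Hypothetical pairs of (NIR, MIR) predicted values for three samples.
for nir_value, mir_value in [(0.9, 0.5), (0.1, 0.5), (0.1, 0.9)]:
    print(nir_value, mir_value, high_level_vote(nir_value, mir_value))
```

In the first two pairs, one model gives a clear prediction and the other a borderline one, and the vote follows the clear prediction; in the third pair the two models contradict each other and the sample stays suspect.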

In order to explore the influence of both NIR and MIR data and their respective contribution to the discrimination of the studied olive oil varietal origin, variable importance in projection (VIP) values were calculated for each data fusion model using the formula from Mehmood et al. [18]. Since the average of the squared VIP values is equal to 1, variables with a VIP greater than 1 are usually considered as more relevant to the model, but some studies suggest that this threshold could vary [18, 19].
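A minimal sketch of a PLS1 fit followed by the VIP calculation is given here to show why the average of the squared VIP values equals 1. It uses the usual VIP definition, which is assumed to correspond to the formula cited from Mehmood et al.; the NIPALS implementation, the function names and the random data are illustrative and are not the Matlab code used in the study.

```python
import numpy as np

def pls1_fit(X, y, n_lv):
    """PLS1 by NIPALS on mean-centred data.
    Returns the weights W (variables x LV), scores T (samples x LV)
    and y-loadings q (one per LV)."""
    Xr = X - X.mean(axis=0)
    yr = y - y.mean()
    n, p = Xr.shape
    W = np.zeros((p, n_lv))
    T = np.zeros((n, n_lv))
    q = np.zeros(n_lv)
    for a in range(n_lv):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)           # unit-length weight vector
        t = Xr @ w                       # scores of this latent variable
        tt = t @ t
        p_load = Xr.T @ t / tt           # X-loadings
        q[a] = yr @ t / tt               # y-loading
        Xr = Xr - np.outer(t, p_load)    # deflate X
        yr = yr - q[a] * t               # deflate y
        W[:, a], T[:, a] = w, t
    return W, T, q

def vip_scores(W, T, q):
    """VIP_j = sqrt(p * sum_a(SS_a * w_ja^2) / sum_a(SS_a)), where
    SS_a = q_a^2 * t_a't_a is the y-variance explained by LV a."""
    p = W.shape[0]
    ss = q ** 2 * np.sum(T ** 2, axis=0)
    w_norm = W / np.linalg.norm(W, axis=0)
    return np.sqrt(p * (w_norm ** 2 @ ss) / ss.sum())

# Random stand-in data: 40 samples, 30 variables, binary class coding.
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 30))
y = np.repeat([0.0, 1.0], 20)
X[:, :5] += y[:, None]                   # make the first 5 variables informative

W, T, q = pls1_fit(X, y, n_lv=3)
vip = vip_scores(W, T, q)
print("mean of squared VIP:", np.mean(vip ** 2))
print("variables with VIP > 1:", np.flatnonzero(vip > 1))
```

Because each weight vector has unit length, summing the squared VIP values over the p variables gives exactly p, hence the mean of 1 that motivates the usual threshold.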

3. Results and discussion

The NIR and MIR spectra obtained from all the studied olive oil samples, with identification of the spectral bands, are shown as Supporting Information (SI 1).

FIGURE 1. A: PCA SCORES, B: LOADINGS FOR PC1 AND C: LOADINGS FOR PC3, OBTAINED WITH THE CONCATENATED NIR (6100-4500 CM-1) AND MIR (1800-700 CM-1) DATA, WITH SAMPLES REPRESENTED ACCORDING TO THEIR CULTIVAR ON THE SCORE PLOT (▲: AGLANDAU, Δ: CAILLETIER, ■: OLIVIÈRE, ●: PICHOLINE, □: SALONENQUE, ○: TANCHE) AND MOST INFLUENTIAL BANDS IDENTIFIED ON THE LOADING PLOTS

3.1. Exploratory analysis

Figure 1 presents the scores (Figure 1-A) and loadings (Figures 1-B and 1-C) of the PCA on concatenated NIR and MIR data. The best separation according to cultivars was obtained using the first and third PCs. The first component (PC1) represents 60% of the information. It is strongly influenced by the NIR band around 5865 cm-1 that corresponds to the first overtone of C-H bond vibrations (-CH3, -CH2) and could be related to the degree of unsaturation of triacylglycerols (-CH=CH-) [20]. PC1 separates OL and AG samples, which have mostly negative scores, from PI, SA and TA samples, which have mostly positive scores on this component, especially for PI. The third component (PC3) represents 12% of the information and is more influenced by MIR data, especially by the region between 1135 and 1082 cm-1 attributed to the fundamental vibrations of C-O (ester) and C-C bonds, and could also be influenced by the degree and type of unsaturation of fatty acids [20]. There is also a significant contribution from the NIR band around 5772 cm-1, which is in the area of the C-H first overtone vibration. PC3 separates SA samples, which have positive scores, from TA samples, which have negative scores, while the other cultivars have rather medium values on this component. Thus, even though the groups of cultivars are overlapping on these two PCs, chemometric models should be able to discriminate the samples according to their varietal origin. Samples from the CA cultivar may be more difficult to identify since their characteristics place them in the middle of all the other cultivars.
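As an illustration of this exploratory step, the sketch below concatenates two spectral blocks and extracts PCA scores, loadings and explained variance by singular value decomposition. The block sizes (218 samples, 831 NIR and 571 MIR variables) are taken from the text; the random matrices stand in for the real spectra, the normalisation step is omitted because its exact form is not specified here, and the function name is hypothetical. The study itself used the Unscrambler X software for this analysis.

```python
import numpy as np

def pca_svd(X, n_components):
    """PCA of a mean-centred matrix by SVD.
    Returns scores, loadings and the explained variance fraction
    of the first n_components principal components."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T
    explained = s ** 2 / np.sum(s ** 2)
    return scores, loadings, explained[:n_components]

# Placeholder spectra with the dimensions reported in the text.
rng = np.random.default_rng(1)
nir = rng.normal(size=(218, 831))    # stand-in for NIR, 6100-4500 cm-1
mir = rng.normal(size=(218, 571))    # stand-in for MIR, 1800-700 cm-1

fused = np.hstack([nir, mir])        # low-level concatenation: 1402 variables
scores, loadings, explained = pca_svd(fused, n_components=3)

# PC1 versus PC3 would be plotted from scores[:, 0] and scores[:, 2];
# loadings[:831] relate to the NIR block and loadings[831:] to the MIR block.
print(fused.shape, explained)
```

Splitting the loading vector at the block boundary is what allows each principal component to be attributed mainly to NIR or MIR bands, as done for PC1 and PC3 above.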

3.2. PLS1-DA on individual data

The condensed results of the prediction models are presented in Table 1, and more detailed results with confusion matrices are available in the Supporting Information (SI 2). Using individual datasets, NIR gives slightly better results than MIR, but often with a higher optimal number of LVs. Indeed, PLS1-DA models built with NIR data use 13 LVs to predict each cultivar, except CA for which 14 LVs are needed. The quality parameters are satisfactory, with SEP from 0.19 for the AG model down to 0.13 for the OL model and Q2 from 0.84 for the CA model up to 0.92 for the OL model. Only a few samples are not recognized as belonging to their proper cultivar, resulting in correct classification rates of 95% for AG and PI, 96% for OL and SA, and 97% for CA and TA. PLS1-DA models built with MIR data use 11 LVs to predict CA and TA, 12 LVs for PI and SA, and 13 LVs for AG and OL. The quality parameters are a little worse than with NIR, with SEP between 0.22 for the CA model and 0.15 for the OL model and Q2 between 0.74 for CA and 0.90 for AG, OL and SA. As a result, there are more misclassified samples and the correct classification rates are slightly lower than with NIR data, with 93% for AG and PI and 95% for CA and TA. Only OL still has 96% correct classification, and SA is improved with 97% correct classification.

TABLE 1. NUMBER OF FACTORS, QUALITY PARAMETERS AND CORRECT CLASSIFICATION RATE FOR THE PLS1-DA MODELS ON INDIVIDUAL DATASETS (LV: NUMBER OF LATENT VARIABLES, SEP: STANDARD ERROR OF PREDICTION, Q2: DETERMINATION COEFFICIENT, CC: CORRECT CLASSIFICATION RATE, AG: AGLANDAU, CA: CAILLETIER, OL: OLIVIÈRE, PI: PICHOLINE, SA: SALONENQUE, TA: TANCHE)

| | NIR (6100-4500 cm-1) | | | | MIR (1800-700 cm-1) | | | |
|---|---|---|---|---|---|---|---|---|
| Model | LV | SEP | Q2 | CC | LV | SEP | Q2 | CC |
| AG | 13 | 0.19 | 0.91 | 95% | 13 | 0.20 | 0.90 | 93% |
| CA | 14 | 0.17 | 0.84 | 97% | 11 | 0.22 | 0.74 | 95% |
| OL | 13 | 0.13 | 0.92 | 96% | 13 | 0.15 | 0.90 | 96% |
| PI | 13 | 0.17 | 0.89 | 95% | 12 | 0.21 | 0.82 | 93% |
| SA | 13 | 0.16 | 0.90 | 96% | 12 | 0.16 | 0.90 | 97% |
| TA | 13 | 0.18 | 0.87 | 97% | 11 | 0.20 | 0.86 | 95% |

3.3. PLS1-DA with data fusion

NIR and MIR spectra contain some redundant but also some complementary information and combining both data is expected to improve the prediction of olive oil cultivars. The condensed results of the prediction models are presented in Table 2, and confusion matrices are available in the Supporting Information (SI 3).

TABLE 2. NUMBER OF FACTORS, QUALITY PARAMETERS AND CORRECT CLASSIFICATION RATE FOR THE PLS1-DA MODELS ON FUSED DATASETS (LV: NUMBER OF LATENT VARIABLES, SEP: STANDARD ERROR OF PREDICTION, Q2: DETERMINATION COEFFICIENT, CC: CORRECT CLASSIFICATION RATE, AG: AGLANDAU, CA: CAILLETIER, OL: OLIVIÈRE, PI: PICHOLINE, SA: SALONENQUE, TA: TANCHE)

| | Low-level | | | | Mid-level Multiblock | | | | Mid-level Hierarchical 1 (1st level PCA: 30 var.) | | | | Mid-level Hierarchical 2 (1st level PLS: 30 var.) | | | | High-level | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Model | LV | SEP | Q2 | CC | LV | SEP | Q2 | CC | LV | SEP | Q2 | CC | LV | SEP | Q2 | CC | LV | SEP | Q2 | CC |
| AG | 13 | 0.21 | 0.89 | 97% | 15 | 0.20 | 0.90 | 96% | 11 | 0.26 | 0.82 | 89% | 8 | 0.17 | 0.92 | 99% | * | * | * | 97% |
| CA | 13 | 0.24 | 0.69 | 91% | 13 | 0.23 | 0.72 | 91% | 6 | 0.25 | 0.63 | 89% | 7 | 0.23 | 0.72 | 92% | * | * | * | 97% |
| OL | 15 | 0.17 | 0.88 | 96% | 14 | 0.17 | 0.87 | 96% | 6 | 0.23 | 0.75 | 93% | 6 | 0.16 | 0.89 | 99% | * | * | * | 96% |
| PI | 12 | 0.20 | 0.83 | 96% | 12 | 0.21 | 0.83 | 96% | 6 | 0.22 | 0.79 | 91% | 7 | 0.18 | 0.87 | 97% | * | * | * | 96% |
| SA | 11 | 0.17 | 0.89 | 97% | 10 | 0.16 | 0.90 | 97% | 5 | 0.21 | 0.82 | 93% | 5 | 0.18 | 0.88 | 97% | * | * | * | 97% |
| TA | 12 | 0.20 | 0.85 | 95% | 11 | 0.20 | 0.85 | 95% | 6 | 0.21 | 0.82 | 93% | 6 | 0.18 | 0.89 | 96% | * | * | * | 96% |

*No quality parameters were calculated for the high-level model, see the quality parameters of the individual models (Table 1)
var. = variables (PCs or LVs) selected in the dimension reduction step

3.3.1. Low-level concatenation

Applying simple PLS1-DA to a low-level concatenated dataset containing both NIR and MIR data does not bring much improvement to the results compared to the individual datasets. Indeed, the optimal numbers of LVs are still high, ranging from 11 for SA to 15 for OL, and the quality parameters remain similar, with SEP values between 0.24 for CA and 0.17 for OL and SA and Q2 between 0.69 for CA and 0.89 for AG and SA. The prediction results are slightly better for the AG and PI cultivars, reaching 97% and 96% of correct classification respectively, but worse for the CA cultivar with only 91% correct classification. The results after concatenation are identical to those obtained with MIR data alone for the SA, OL and TA cultivars (97%, 96% and 95% of correct classification respectively). The VIP scores for the PLS1-DA models using concatenated NIR and MIR data (Figure 2) indicate that both sources provide some useful information, but the most important variables differ depending on the modelled cultivar. MIR data seems to have more influence on the CA model and to a lesser extent on the AG and PI models, while NIR data appears to be preponderant in the OL model and to a lesser extent in the SA and TA models. The CA model, in particular, uses less information from the region of the NIR spectra between 4700 and 4550 cm-1, characteristic of the combination of C-H vibrations from unsaturation [20], which could partly explain its poorer results.

FIGURE 2. VIP SCORES OF THE PLS1-DA MODELS USING THE CONCATENATED NIR (6100-4500 CM-1) AND MIR (1800-700 CM-1) DATA. AG: AGLANDAU, CA: CAILLETIER, OL: OLIVIÈRE, PI: PICHOLINE, SA: SALONENQUE, TA: TANCHE.


The lack of improvement observed with the low-level concatenation model may arise because the relevant information is buried in the very large dataset: only a few of the 1402 variables are really useful for discriminating the cultivars. Thus, mid-level models with an additional step condensing the appropriate information could have better prediction abilities.

3.3.2. Mid-level multiblock MB-PLS1-DA

Applying a more complex mid-level MB-PLS1-DA algorithm gives results very similar to those from low-level concatenation (Table 2). The number of LVs ranges from 10 for SA to 15 for AG, while SEP values are between 0.23 for CA and 0.16 for SA, and Q2 between 0.72 for CA and 0.90 for AG and SA. The correct classification rates reach 97% for SA, 96% for AG, OL and PI, 95% for TA and 91% for CA.

The VIP scores for the MB-PLS1-DA models calculated using initial variable weights (Figure 3-A) give profiles similar to those obtained with concatenated data. Moreover, the VIP scores using block weights (Figure 3-B) confirm the good balance between NIR and MIR data, especially for the PI model, while MIR is more important in the CA and AG models and NIR more important in the OL, SA and TA models. The mid-level multiblock model therefore does not seem able to extract more useful information from the NIR and MIR data than low-level concatenation.

FIGURE 3. VIP SCORES OF THE MB-PLS1-DA MODELS, CALCULATED FROM A: THE INITIAL NIR (6100-4500 CM-1) AND MIR (1800-700 CM-1) VARIABLE WEIGHTS, OR B: THE NIR AND MIR BLOCK WEIGHTS. AG: AGLANDAU, CA: CAILLETIER, OL: OLIVIÈRE, PI: PICHOLINE, SA: SALONENQUE, TA: TANCHE.


TABLE 3. NUMBER OF FACTORS, QUALITY PARAMETERS AND CORRECT CLASSIFICATION RATE FOR THE HIERARCHICAL MODELS WITH DIFFERENT NUMBERS OF VARIABLES IN THE FIRST STEP (LV: NUMBER OF LATENT VARIABLES, SEP: STANDARD ERROR OF PREDICTION, Q2: DETERMINATION COEFFICIENT, CC: CORRECT CLASSIFICATION RATE, AG: AGLANDAU, CA: CAILLETIER, OL: OLIVIÈRE, PI: PICHOLINE, SA: SALONENQUE, TA: TANCHE)

Model          | 1st level: 40 var.  | 1st level: 30 var.  | 1st level: 20 var.  | 1st level: 10 var.
               | LV  SEP   Q2    CC  | LV  SEP   Q2    CC  | LV  SEP   Q2    CC  | LV  SEP   Q2    CC
PCA-PLS1-DA AG | 11  0.26  0.82  88% | 11  0.26  0.82  89% | 6   0.29  0.75  84% | 4   0.35  0.61  72%
PCA-PLS1-DA CA | 6   0.25  0.64  89% | 6   0.25  0.63  89% | 6   0.26  0.62  91% | 7   0.30  0.39  85%
PCA-PLS1-DA OL | 6   0.23  0.75  92% | 6   0.23  0.75  93% | 6   0.23  0.74  92% | 3   0.26  0.66  84%
PCA-PLS1-DA PI | 6   0.22  0.79  91% | 6   0.22  0.79  91% | 6   0.23  0.79  91% | 3   0.30  0.57  81%
PCA-PLS1-DA SA | 5   0.21  0.82  95% | 5   0.21  0.82  93% | 5   0.21  0.82  93% | 3   0.23  0.77  91%
PCA-PLS1-DA TA | 6   0.21  0.82  93% | 6   0.21  0.82  93% | 6   0.22  0.80  93% | 6   0.24  0.69  87%
PLS-PLS1-DA AG | 8   0.17  0.93  99% | 8   0.17  0.92  99% | 8   0.20  0.89  96% | 3   0.26  0.77  83%
PLS-PLS1-DA CA | 11  0.19  0.81  96% | 7   0.23  0.72  92% | 4   0.22  0.68  91% | 3   0.26  0.60  83%
PLS-PLS1-DA OL | 8   0.14  0.91  99% | 6   0.16  0.89  99% | 6   0.18  0.86  96% | 3   0.24  0.72  85%
PLS-PLS1-DA PI | 7   0.18  0.87  97% | 7   0.18  0.87  97% | 7   0.21  0.83  93% | 3   0.23  0.77  89%
PLS-PLS1-DA SA | 10  0.15  0.91  96% | 5   0.18  0.88  97% | 5   0.18  0.87  97% | 4   0.19  0.85  99%
PLS-PLS1-DA TA | 10  0.16  0.90  97% | 6   0.18  0.89  96% | 4   0.20  0.84  92% | 4   0.22  0.81  91%

var. = variables (PCs or LVs) selected in the dimension reduction step


Other mid-level strategies, involving a hierarchical algorithm with a first step using either PCA or PLS to reduce the dimension of the data, could yield better results. Since each individual PLS1-DA model obtained with either the NIR or the MIR data leads to an optimal number of latent variables around 15, in a first approach 30 variables (15 PCs or LVs from NIR and 15 PCs or LVs from MIR) were retained from the first step of dimension reduction. Then, to check that 30 variables was the optimal number, other options keeping a balanced number of reduced variables from each source were tested (20, 10 or 5 PCs or LVs from each of NIR and MIR). Detailed confusion matrices for each number of variables are given as Supplementary Information (Tables C and D).

3.3.3. Mid-level hierarchical PCA-PLS1-DA

Dimension reduction to 30 variables using unsupervised PCA projection appears to be much less effective at discriminating olive oils according to their cultivar (Table 2). Although the final number of LVs is reduced (between 5 for SA and 11 for AG), it leads to worse quality parameters and worse predictions than with NIR or MIR alone. The SEP values range from 0.26 for AG to 0.21 for SA and TA, and the Q2 values are between 0.63 for CA and 0.82 for AG, SA and TA. The correct classification rates reach only 89% for AG and CA, 91% for PI and 93% for OL, SA and TA.

The VIP scores for the PCA-PLS1-DA models (Figure 4) show once more that both NIR and MIR data are useful, with variations in importance depending on the modelled cultivar. Furthermore, most of the important information appears to be present in the first 10 PCs from each block of data, since the last PCs from both NIR and MIR data have very low VIP values.
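The bookkeeping of the two-step hierarchical scheme described above (a fixed number of components kept per block, then one PLS1-DA on the joined scores) can be sketched as follows. This is a minimal Python illustration, not the Matlab code of this thesis: the function name and the toy score matrices are assumptions, and the first-level PCA or PLS decomposition itself is taken as already computed.

```python
def fuse_block_scores(nir_scores, mir_scores, n_per_block):
    """Keep the first n_per_block score columns of each block and join them row by row.

    nir_scores, mir_scores: lists of rows (one row per sample) holding the
    first-level PCA or PLS scores of each block (hypothetical inputs).
    """
    if len(nir_scores) != len(mir_scores):
        raise ValueError("both blocks must describe the same samples")
    fused = []
    for nir_row, mir_row in zip(nir_scores, mir_scores):
        if len(nir_row) < n_per_block or len(mir_row) < n_per_block:
            raise ValueError("not enough first-level components available")
        fused.append(list(nir_row[:n_per_block]) + list(mir_row[:n_per_block]))
    return fused

# Toy first-level scores (assumed values): 3 samples, 20 components per block
toy_nir = [[i + 0.1 * j for j in range(20)] for i in range(3)]
toy_mir = [[-i + 0.2 * j for j in range(20)] for i in range(3)]

# The balanced settings compared in the text: 20, 15, 10 or 5 components per block
second_level_sizes = {k: len(fuse_block_scores(toy_nir, toy_mir, k)[0])
                      for k in (20, 15, 10, 5)}
```

The resulting second-level matrices have 40, 30, 20 and 10 columns, which correspond to the "1st level" settings listed in Table 3.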

FIGURE 4. VIP SCORES OF THE PCA-PLS1-DA MODELS WITH 15 PCS FROM NIR AND 15 PCS FROM MIR. AG: AGLANDAU, CA: CAILLETIER, OL: OLIVIÈRE, PI: PICHOLINE, SA: SALONENQUE, TA: TANCHE.


The results from the hierarchical PCA-PLS1-DA models are not improved by taking 40 PCs (20 from NIR data and 20 from MIR data) in the first step instead of 30, as can be seen in Table 3. Moreover, decreasing the number of variables in the first step to 20 PCs (10 from NIR and 10 from MIR) has little impact on the results of the PCA-PLS1-DA models, since only the model predicting the AG cultivar is negatively affected. Nevertheless, decreasing to only 5 PCs from NIR and 5 PCs from MIR gives worse results for all the cultivars. Thus, most of the information used by the PCA-PLS1-DA models seems to be present in the scores from the first 10 components of the PCAs for NIR and MIR data.

3.3.4. Mid-level hierarchical PLS-PLS1-DA

PLS-PLS1-DA models using 15 LVs from NIR and 15 LVs from MIR lead to better prediction results than PLS1-DA models on individual NIR or MIR data for most of the cultivars, except CA and TA, although they do not improve the quality parameters (Table 2). Indeed, with final numbers of LVs between 5 for SA and 8 for AG, the SEP values range from 0.23 for CA to 0.16 for OL and the Q2 values range from 0.72 for CA to 0.92 for AG. Only 92% correct classification is obtained for CA, but the rate reaches 96% for TA, 97% for PI and SA, and an excellent 99% for AG and OL. Therefore, PLS-PLS1-DA models give the best results among the mid-level fusion models.

The profiles of VIP scores for the PLS-PLS1-DA models (Figure 5) are close to those obtained for the PCA-PLS1-DA models. However, the first LVs of NIR data have more influence in all the models, which could play a part in the improvement of the PLS-PLS1-DA results, since better predictions were obtained with the individual NIR models than with the individual MIR models.

FIGURE 5. VIP SCORES OF THE PLS-PLS1-DA MODELS WITH 15 LVS FROM NIR AND 15 LVS FROM MIR. AG: AGLANDAU, CA: CAILLETIER, OL: OLIVIÈRE, PI: PICHOLINE, SA: SALONENQUE, TA: TANCHE.


Nevertheless, the last LVs from MIR data still seem to bring some information, since their VIP values are higher than those of the last LVs from NIR data, and they may thus still be useful to the models.

Yet again, the results from hierarchical PLS-PLS1-DA models are not improved by taking 40 variables instead of 30 in the first step (Table 3). The only noticeable change is for the model predicting the CA cultivar, which reaches 96% correct classification and better quality parameters (lower SEP and higher Q2), but only by using a higher optimal number of LVs (11 vs 7). The models for OL, SA and TA also have slightly better quality parameters with higher optimal numbers of LVs, but this does not improve their correct classification rates. In contrast, reducing the number of variables in the first step has a stronger influence on the results of the PLS-PLS1-DA models, since the loss of information is already visible when only 20 LVs (10 from NIR and 10 from MIR) are used, and even more so with only 10 LVs (5 from NIR and 5 from MIR). Thus, 15 LVs from each of the NIR and MIR datasets are necessary to retain the maximum of information in the first PLS dimension reduction step.

The mid-level hierarchical PLS-PLS1-DA model is able to improve the prediction for some of the cultivars, but high-level fusion with a majority vote may be a simpler and faster way to achieve better performance.

3.3.5. High-level majority vote

The high-level models, based on a majority vote between the predicted values from the individual NIR and MIR models and their average, give a good compromise for all the cultivars (Table 2). Since the fusion takes place at the decision level in these models, the quality parameters remain the same as for the separate NIR and MIR models. The final results are similar to the ones obtained with the best individual model for CA and SA, with correct classification rates of 97%, as well as for OL and TA, with 96% correct classification. However, the discrimination of the AG and PI cultivars is improved with the high-level model, reaching 97% and 96% correct classification respectively.
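One possible reading of the majority vote described above is sketched below in Python (not the Matlab used in this thesis). Three predicted values vote: the NIR prediction, the MIR prediction and their average. The intervals [-0.4 ; 0.4] and [0.6 ; 1.4] are taken from the confusion matrices in the Supporting Information; the tie rule (no two matching votes gives "suspect") and all function names are assumptions made for illustration.

```python
from collections import Counter

def label(pred, other_range=(-0.4, 0.4), target_range=(0.6, 1.4)):
    """Turn one predicted value into a vote: 'target', 'other' or 'suspect'."""
    if target_range[0] <= pred <= target_range[1]:
        return "target"
    if other_range[0] <= pred <= other_range[1]:
        return "other"
    return "suspect"

def majority_vote(nir_pred, mir_pred):
    """Vote between the NIR prediction, the MIR prediction and their average."""
    votes = [label(nir_pred), label(mir_pred), label((nir_pred + mir_pred) / 2)]
    winner, count = Counter(votes).most_common(1)[0]
    # Assumed tie rule: without at least two matching votes the sample stays suspect
    return winner if count >= 2 else "suspect"

# Toy predicted values (assumed)
decision_a = majority_vote(0.90, 0.50)   # votes: target, suspect, target
decision_b = majority_vote(0.05, 0.45)   # votes: other, suspect, other
decision_c = majority_vote(0.10, 0.95)   # votes: other, target, suspect
```

In the first two toy cases the averaged prediction settles a sample that one technique alone left in the suspect zone, which is the kind of complementarity the high-level strategy relies on.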

4. Conclusion

NIR and MIR spectral data can be used separately to discriminate monovarietal olive oils from six French cultivars. In this case, models using NIR data give slightly better results than those using MIR data, but with higher optimal numbers of latent variables. Combining NIR and MIR data does not always improve the performance of the models. Indeed, when using low-level concatenation, the complementary information appears to be lost in the very large number of variables. Even mid-level multiblock PLS1-DA fails to improve the prediction results. Thus, using mid-level hierarchical PLS1-DA with a first dimension reduction step could be more appropriate, provided that an adequate number of variables is selected in this first step. However, dimension reduction involving unsupervised PCA projections actually worsens the results. The best results for most of the cultivars are obtained with a hierarchical PLS1-DA using a first PLS1-DA for the dimension reduction step. A number of variables close to the optimal number of LVs for each individual model should be selected in this first step to take into account most of the relevant information. However, this strategy gives heterogeneous results, improving the discrimination for most cultivars but worsening it for others. Finally, the high-level models using a majority vote benefit from the complementary results of the individual NIR and MIR models. This strategy gives a smaller but more homogeneous improvement for all the cultivars, and is also easier and faster to implement than the mid-level strategies.

Acknowledgements

The authors thank Christian Pinatel from the French interprofessional association of the olive sector (Aix-en-Provence, France) for providing the olive oil samples.

Funding

This work was financially supported by the French National Agency for Research (ANR) as part of the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement number 618127.

Conflict of interest

The authors have no conflict of interest to declare.


References

[1] Granato, D., Putnik, P., Kovačević, D.B., Santos, J.S., Calado, V., Rocha, R.S., Da Cruz, A.G., Jarvis, B., Rodionova, O.Y., & Pomerantsev, A. (2018). Trends in Chemometrics: Food Authentication, Microbiology and Effects of Processing. Comprehensive Reviews in Food Science and Food Safety, 17:663-677.
[2] Gómez-Caravaca, A.M., Maggio, R.M., & Cerretani, L. (2016). Chemometrics applications to assess quality and critical parameters of virgin and extra-virgin olive oil. A review. Analytica Chimica Acta, 913:1-21.
[3] Danezis, G.P., Tsagkaris, A.S., Camin, F., Brusic, V., & Georgiou, C.A. (2016). Food authentication: Techniques, trends & emerging approaches. Trends in Analytical Chemistry, 85:123-132.
[4] Borràs, E., Ferré, J., Boqué, R., Mestres, M., Aceña, L., & Busto, O. (2015). Data fusion methodologies for food and beverage authentication and quality assessment – A review. Analytica Chimica Acta, 891:1-14.
[5] Casale, M., Armanino, A., Casolino, C., & Forina, M. (2007). Combining information from headspace mass spectrometry and visible spectroscopy in the classification of Ligurian olive oils. Analytica Chimica Acta, 589:89-95.
[6] Pizarro, C., Rodríguez-Tecedor, S., Pérez-del-Notario, N., Esteban-Díez, I., & González-Sáiz, J.M. (2013). Classification of Spanish extra virgin olive oils by data fusion of visible spectroscopic fingerprints and chemical descriptors. Food Chemistry, 138:915-922.
[7] Dias, L.G., Rodrigues, N., Veloso, A.C., Pereira, J.A., & Peres, A.M. (2016). Monovarietal extra-virgin olive oil classification: a fusion of human sensory attributes and an electronic tongue. European Food Research and Technology, 242:259-270.
[8] Dupuy, N., Galtier, O., Ollivier, D., Vanloot, P., & Artaud, J. (2010). Comparison between NIR, MIR, concatenated NIR and MIR analysis and hierarchical PLS model. Application to virgin olive oil analysis. Analytica Chimica Acta, 666:23-31.
[9] Biancolillo, A., Bucci, R., Magrì, A.L., Magrì, A.D., & Marini, F. (2014). Data-fusion for multiplatform characterization of an Italian craft beer aimed at its authentication. Analytica Chimica Acta, 820:23-31.
[10] Borràs, E., Ferré, J., Boqué, R., Mestres, M., Aceña, L., Calvo, A., & Busto, O. (2016). Olive oil sensory defects classification with data fusion of instrumental techniques and multivariate analysis (PLS-DA). Food Chemistry, 203:314-322.
[11] Bajoub, A., Medina-Rodríguez, S., Gómez-Romero, M., Bagur-González, M.G., Fernández-Gutiérrez, A., & Carrasco-Pancorbo, A. (2017). Assessing the varietal origin of extra-virgin olive oil using liquid chromatography fingerprints of phenolic compound, data fusion and chemometrics. Food Chemistry, 215:245-255.
[12] Di Anibal, C.V., Pilar Callao, M., & Ruisánchez, I. (2011). 1H NMR and UV-visible data fusion for determining Sudan dyes in culinary spices. Talanta, 84:829-833.
[13] Doeswijk, T.G., Smilde, A.K., Hagerman, J.A., Westerhuis, J.A., & van Eeuwijk, F.A. (2011). On the increase of predictive performance with high-level data fusion. Analytica Chimica Acta, 705:41-47.
[14] Ballabio, D., Robotti, E., Grisoni, F., Quasso, F., Bobba, M., Vercelli, S., Gosetti, F., Calabrese, G., Sangiorgi, E., Orlandi, M., & Marengo, E. (2018). Chemical profiling and multivariate data fusion methods for the identification of the botanical origin of honey. Food Chemistry, 266:79-89.
[15] Wold, S. (1987). Principal Component Analysis. Chemometrics and Intelligent Laboratory Systems, 2:37-52.
[16] Barker, M., & Rayens, W. (2003). Partial least squares for discrimination. Journal of Chemometrics, 17:166-173.
[17] Westerhuis, J.A., & Coenegracht, P.M.J. (1997). Multivariate modelling of the pharmaceutical two-step process of wet granulation and tableting with multiblock partial least squares. Journal of Chemometrics, 11:379-392.
[18] Mehmood, T., Liland, K.H., Snipen, L., & Sæbø, S. (2012). A review of variable selection methods in partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 118:62-69.
[19] Chong, I.G., & Jun, C.H. (2005). Performance of some variable selection methods when multicollinearity is present. Chemometrics and Intelligent Laboratory Systems, 78:103-112.
[20] García-González, D.L., Baeten, V., Fernández Pierna, J.A., & Tena, N. (2013). Infrared, Raman and Fluorescence Spectroscopies: Methodologies and Applications, in: R. Aparicio, J. Harwood (Eds.), Handbook of Olive Oil, Boston, MA: Springer US, pp. 335-394.


Supporting Information

SI 1. COMPILATION OF A: NIR SPECTRA (1: 6000-5500 CM-1, C-H STRETCHING 1ST OVERTONE; 2: 5300-5100 CM-1, C=O STRETCHING 2ND OVERTONE; 3: 5000-4500 CM-1, =C-H AND C=C STRETCHING COMBINATION) AND B: MIR SPECTRA (1: 1750-1740 CM-1, C=O STRETCHING; 2: 1660-1650 CM-1, C=C STRETCHING; 3: 1500-1300 CM-1, C-H BENDING; 4: 1250-1000 CM-1, C-C AND C-O BENDING; 5: 730-700 CM-1, C-H BENDING) OF THE OLIVE OIL SAMPLES


SI 2. DETAILED CONFUSION MATRICES FOR THE PLS1-DA MODELS ON INDIVIDUAL DATASETS (AG: AGLANDAU, CA: CAILLETIER, OL: OLIVIÈRE, PI: PICHOLINE, SA: SALONENQUE, TA: TANCHE)

Predicted-value intervals for every model: [-0.4 ; 0.4] / [0.6 ; 1.4]

Real class   | Predicted class NIR   | Predicted class MIR
             | AG   Other  Suspect   | AG   Other  Suspect
AG (n=20)    | 19   0      1         | 18   0      2
Other (n=55) | 0    52     3         | 0    52     3
             | CA   Other  Suspect   | CA   Other  Suspect
CA (n=9)     | 8    0      1         | 7    0      2
Other (n=66) | 0    65     1         | 0    64     2
             | OL   Other  Suspect   | OL   Other  Suspect
OL (n=10)    | 7    0      3         | 7    0      3
Other (n=65) | 0    65     0         | 0    65     0
             | PI   Other  Suspect   | PI   Other  Suspect
PI (n=12)    | 9    1      2         | 7    1      4
Other (n=63) | 0    62     1         | 0    63     0
             | SA   Other  Suspect   | SA   Other  Suspect
SA (n=12)    | 10   0      2         | 10   0      2
Other (n=63) | 0    62     1         | 0    63     0
             | TA   Other  Suspect   | TA   Other  Suspect
TA (n=12)    | 11   0      1         | 11   0      1
Other (n=63) | 0    62     1         | 0    60     3


SI 3. DETAILED CONFUSION MATRICES FOR THE PLS1-DA MODELS ON FUSED DATASETS (AG: AGLANDAU, CA: CAILLETIER, OL: OLIVIÈRE, PI: PICHOLINE, SA: SALONENQUE, TA: TANCHE)

Predicted-value intervals for every model: [-0.4 ; 0.4] / [0.6 ; 1.4]

Real class   | Predicted class       | Predicted class       | Predicted class       | Predicted class       | Predicted class
             | Low-level             | Mid-level             | Mid-level             | Mid-level             | High-level
             |                       | Multiblock            | 1st level PCA 30 var. | 1st level PLS 30 var. |
             | AG   Other  Suspect   | AG   Other  Suspect   | AG   Other  Suspect   | AG   Other  Suspect   | AG   Other  Suspect
AG (n=20)    | 20   0      0         | 20   0      0         | 20   0      0         | 19   0      1         | 20   0      0
Other (n=55) | 0    53     2         | 0    52     3         | 2    47     6         | 0    55     0         | 0    53     2
             | CA   Other  Suspect   | CA   Other  Suspect   | CA   Other  Suspect   | CA   Other  Suspect   | CA   Other  Suspect
CA (n=9)     | 5    1      3         | 7    1      1         | 2    2      5         | 6    1      2         | 8    0      1
Other (n=66) | 0    63     3         | 0    61     5         | 0    65     1         | 0    63     3         | 0    65     1
             | OL   Other  Suspect   | OL   Other  Suspect   | OL   Other  Suspect   | OL   Other  Suspect   | OL   Other  Suspect
OL (n=10)    | 8    0      2         | 8    0      2         | 6    0      4         | 9    0      1         | 7    0      3
Other (n=65) | 0    64     1         | 0    64     1         | 0    64     1         | 0    65     0         | 0    65     0
             | PI   Other  Suspect   | PI   Other  Suspect   | PI   Other  Suspect   | PI   Other  Suspect   | PI   Other  Suspect
PI (n=12)    | 9    1      2         | 9    1      2         | 7    2      3         | 10   0      2         | 9    1      2
Other (n=63) | 0    63     0         | 0    63     0         | 0    61     2         | 0    63     0         | 0    63     0
             | SA   Other  Suspect   | SA   Other  Suspect   | SA   Other  Suspect   | SA   Other  Suspect   | SA   Other  Suspect
SA (n=12)    | 10   0      2         | 10   0      2         | 8    1      3         | 10   0      2         | 10   0      2
Other (n=63) | 0    63     0         | 0    63     0         | 0    62     1         | 0    63     0         | 0    63     0
             | TA   Other  Suspect   | TA   Other  Suspect   | TA   Other  Suspect   | TA   Other  Suspect   | TA   Other  Suspect
TA (n=12)    | 11   0      1         | 11   0      1         | 9    0      3         | 11   0      1         | 11   0      1
Other (n=63) | 0    60     3         | 0    60     3         | 0    61     2         | 0    61     2         | 0    61     2

var. = variables (PCs or LVs) selected in the dimension reduction step
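The correct classification rates (CC) reported in Table 2 appear to follow directly from the confusion matrices above when suspect samples are counted as not correctly classified. The short Python sketch below (not the Matlab used in this thesis) checks this reading on the low-level model; the function name is hypothetical and the counting rule is an assumption inferred from the tables.

```python
def correct_classification_rate(target_row, other_row):
    """Percentage of samples assigned to their real class.

    Each row holds the counts (predicted target, predicted other, suspect).
    Assumption: suspect samples are counted as not correctly classified.
    """
    correct = target_row[0] + other_row[1]
    total = sum(target_row) + sum(other_row)
    return 100.0 * correct / total

# Low-level model, counts read from the matrices above
low_level_ag = correct_classification_rate((20, 0, 0), (0, 53, 2))
low_level_ca = correct_classification_rate((5, 1, 3), (0, 63, 3))
```

Rounded to the nearest percent, these give 97% for AG and 91% for CA, the values listed for the low-level model in Table 2.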


SI 4. DETAILED CONFUSION MATRICES FOR THE HIERARCHICAL PCA-PLS1-DA MODELS WITH DIFFERENT NUMBERS OF VARIABLES IN THE FIRST STEP (AG: AGLANDAU, CA: CAILLETIER, OL: OLIVIÈRE, PI: PICHOLINE, SA: SALONENQUE, TA: TANCHE)

Predicted-value intervals for every model: [-0.4 ; 0.4] / [0.6 ; 1.4]

Real class   | Predicted class       | Predicted class       | Predicted class       | Predicted class
             | 1st level PCA 40 var. | 1st level PCA 30 var. | 1st level PCA 20 var. | 1st level PCA 10 var.
             | AG   Other  Suspect   | AG   Other  Suspect   | AG   Other  Suspect   | AG   Other  Suspect
AG (n=20)    | 19   0      1         | 20   0      0         | 15   1      4         | 9    2      9
Other (n=55) | 2    47     6         | 2    47     6         | 4    48     3         | 5    45     5
             | CA   Other  Suspect   | CA   Other  Suspect   | CA   Other  Suspect   | CA   Other  Suspect
CA (n=9)     | 2    2      5         | 2    2      5         | 2    3      4         | 0    6      3
Other (n=66) | 0    65     1         | 0    65     1         | 0    66     0         | 0    64     2
             | OL   Other  Suspect   | OL   Other  Suspect   | OL   Other  Suspect   | OL   Other  Suspect
OL (n=10)    | 6    0      4         | 6    0      4         | 6    1      3         | 0    3      7
Other (n=65) | 0    63     2         | 0    64     1         | 0    63     2         | 0    63     2
             | PI   Other  Suspect   | PI   Other  Suspect   | PI   Other  Suspect   | PI   Other  Suspect
PI (n=12)    | 7    2      3         | 7    2      3         | 7    2      3         | 0    4      8
Other (n=63) | 0    61     2         | 0    61     2         | 0    61     2         | 0    61     2
             | SA   Other  Suspect   | SA   Other  Suspect   | SA   Other  Suspect   | SA   Other  Suspect
SA (n=12)    | 9    1      2         | 8    1      3         | 8    1      3         | 6    0      6
Other (n=63) | 0    62     1         | 0    62     1         | 0    62     1         | 0    62     1
             | TA   Other  Suspect   | TA   Other  Suspect   | TA   Other  Suspect   | TA   Other  Suspect
TA (n=12)    | 9    0      3         | 9    0      3         | 9    1      2         | 6    1      5
Other (n=63) | 0    61     2         | 0    61     2         | 0    61     2         | 1    59     3

var. = variables (PCs or LVs) selected in the dimension reduction step


SI 5. DETAILED CONFUSION MATRICES FOR THE HIERARCHICAL PLS-PLS1-DA MODELS WITH DIFFERENT NUMBERS OF VARIABLES IN THE FIRST STEP (AG: AGLANDAU, CA: CAILLETIER, OL: OLIVIÈRE, PI: PICHOLINE, SA: SALONENQUE, TA: TANCHE)

Predicted-value intervals for every model: [-0.4 ; 0.4] / [0.6 ; 1.4]

Real class   | Predicted class       | Predicted class       | Predicted class       | Predicted class
             | 1st level PLS 40 var. | 1st level PLS 30 var. | 1st level PLS 20 var. | 1st level PLS 10 var.
             | AG   Other  Suspect   | AG   Other  Suspect   | AG   Other  Suspect   | AG   Other  Suspect
AG (n=20)    | 19   0      1         | 19   0      1         | 19   0      1         | 15   0      5
Other (n=55) | 0    55     0         | 0    55     0         | 0    53     2         | 2    47     6
             | CA   Other  Suspect   | CA   Other  Suspect   | CA   Other  Suspect   | CA   Other  Suspect
CA (n=9)     | 8    0      1         | 6    1      2         | 3    1      5         | 0    2      7
Other (n=66) | 0    64     2         | 0    63     3         | 0    65     1         | 0    62     4
             | OL   Other  Suspect   | OL   Other  Suspect   | OL   Other  Suspect   | OL   Other  Suspect
OL (n=10)    | 9    0      1         | 9    0      1         | 7    0      3         | 2    0      8
Other (n=65) | 0    65     0         | 0    65     0         | 0    65     0         | 0    62     3
             | PI   Other  Suspect   | PI   Other  Suspect   | PI   Other  Suspect   | PI   Other  Suspect
PI (n=12)    | 10   0      2         | 10   0      2         | 7    0      5         | 6    1      5
Other (n=63) | 0    63     0         | 0    63     0         | 0    63     0         | 0    61     2
             | SA   Other  Suspect   | SA   Other  Suspect   | SA   Other  Suspect   | SA   Other  Suspect
SA (n=12)    | 11   0      1         | 10   0      2         | 10   1      1         | 11   1      0
Other (n=63) | 0    61     2         | 0    63     0         | 0    63     0         | 0    63     0
             | TA   Other  Suspect   | TA   Other  Suspect   | TA   Other  Suspect   | TA   Other  Suspect
TA (n=12)    | 11   0      1         | 11   0      1         | 9    0      3         | 9    1      2
Other (n=63) | 0    62     1         | 0    61     2         | 0    60     3         | 0    59     4

var. = variables (PCs or LVs) selected in the dimension reduction step


GENERAL CONCLUSION AND PERSPECTIVES

Although the authenticity and traceability of olive oils are long-standing concerns, changes in economic stakes, public health issues and consumer perception are renewing interest in the subject. Thus, as shown by the exploration of the scientific context presented in the first chapter, various analytical methods aiming to confirm the geographical or varietal origin of olive oils are being developed to make further use of, or to complement, the current reference techniques, which are used only to control quality parameters. Particular attention is paid to spectroscopic techniques allowing global and rapid analyses, as well as to integrated approaches of the metabolomic type. In addition, the optimisation and dissemination of chemometric tools are essential for processing the large volumes of complex data produced by these analytical techniques.

On the one hand, the application of chemometric models in industry could be encouraged by developing methods that make their results easier to interpret. To this end, combining discriminant analysis models with the control chart approach proposed in the second chapter of this thesis is well suited to a quality control process. Indeed, for the five French cultivars studied (Aglandau, Cailletier, Picholine, Salonenque and Tanche), the control chart decision rule improved the correct classification rates compared with the rule setting an arbitrary threshold. By focusing on the values actually calculated by the models rather than on theoretical values, this method improves the discrimination of the varietal origin of olive oils when the groups contain unbalanced numbers of samples. Moreover, the predictions are further improved when the model is calibrated with balanced groups. In both cases, the main advantage of the control chart lies in the definition of acceptance and rejection limits based on the calculation of confidence intervals, which make it easier to identify samples whose characteristics are not typical of the cultivar of interest.

On the other hand, when the data from a single analytical technique are not sufficient to discriminate the varietal origin effectively, the results can be improved by exploiting the synergy of several complementary analytical sources. Thus, the work presented in the third chapter of this thesis confirms that the results of global and rapid analyses by near- and mid-infrared spectroscopy can be used individually in initial screening models, then combined with each other, or with the results of the specific analysis of fatty acids by gas chromatography, in a model that clarifies the origin of doubtful samples. Indeed, the multiblock fusion model improved the recognition of three Tunisian cultivars (Chemlali, Chetoui and Oueslati) by combining the information obtained from GC and MIR analyses. In addition, normalising each data block without adding any further weighting gave a better balance between the two sources of information. The importance of developing suitable models, so that the relevant information is not masked by too large a volume of data unrelated to the parameter to be predicted, is also underlined by the comparison of the different strategies for fusing MIR and NIR data to discriminate six French cultivars (Aglandau, Cailletier, Olivière, Picholine, Salonenque and Tanche). In this case, fusion by concatenation and multiblock fusion prove rather ineffective, failing to improve the predictions compared with the models built on the separate MIR and NIR datasets. The discrimination is even degraded by fusion with a first unsupervised dimension reduction by PCA. Data fusion after a first supervised variable selection step by PLS appears more effective on average but produces heterogeneous results across cultivars, whereas fusion with a majority vote at the decision level is simpler to implement and gives a smaller but more homogeneous improvement in the predictions for all the cultivars.

Thus, the chemometric tools developed during this thesis are suitable for application to real cases:
- Each model developed takes into account several dozen samples of each cultivar studied, obtained over several harvest years. Although the performance of the PLS1-DA model can be negatively affected by a strong imbalance in the numbers of samples between the classes to be predicted, the control chart decision rule overcomes this drawback.
- The control chart also brings a metrological dimension, with the simple and rapid identification of samples deviating from the reference profile, which makes the PLS1-DA model better suited to routine analyses.


- The comparison of different data fusion strategies highlights the variable effectiveness of the multiblock model (MB-PLS1-DA) depending on the data sources considered (GC and MIR, or MIR and NIR). Fusion with dimension reduction by PLS (PLS-PLS1-DA) and fusion with a majority vote are identified as the most relevant for benefiting from the complementarity of the MIR and NIR analytical techniques. However, the improvements remain limited given the already satisfactory performance obtained by the models using the data from each analytical technique separately.

Future work could focus on extending the applications of PLS-DA with a control chart to other authentication cases. This model could be adapted to detect the adulteration of extra virgin olive oil with different proportions of cheaper oils, in order to establish its detection limits, for example for the addition of lampante olive oil or hazelnut oil, which are often difficult to detect. The detection of ternary blends, which has rarely been studied to date, could also be tested. The control chart could also be applied to commercial olive oils to try to recognise a monovarietal oil from a particular cultivar against a standard oil made from a blend of unknown varietal origins.

Another perspective concerns one of the trends identified by the bibliometric study, namely the analysis of the minor compounds of olive oils. This still little-explored line of research would make it possible to study the relationships between varietal origin and the organoleptic or nutritional qualities of olive oils. To this end, the application of metabolomic analyses could help to identify which compounds of interest (ω3 and ω6 polyunsaturated fatty acids, phenolic compounds, vitamins, etc.) are present in larger quantities, in order to establish the specific nutritional profile of an olive oil according to its origin.



APPENDICES

Appendix 1: Function developed with Matlab (version R2014b) to perform the calculations of the basic algorithm of the multiblock data fusion model (MB-PLS1-DA)

function MB = mbpls1(X,Y,Xin,nF,Xpp,Ypp,options)

% in:
% X (samples x variables) = matrix of explanatory variables
% Y (samples x classes) = matrix of variables to predict
% Xin = variable indices of the blocks
%   example: Xin={1:50 51:100} for 2 blocks: X1(:,1:50), X2(:,51:100)
% nF (1 x 1) maximum number of latent variables
% Xpp (blocks x 1) X-block scaling, one value per block
%   (-1 = interactive, 0 = none, 1 = mean center,
%   2 = autoscale, 3 = range 0 to 1 scale)
% Ypp (1 x 1) Y-block scaling (-1 = interactive, 0 = none, 1 = mean center,
%   2 = autoscale, 3 = range 0 to 1 scale)
% options =
%   1: convergence tolerance (default 1e-8)
%   2: maximum number of iterations (default 2000)
%   3: weighting of the scores by the number of variables in each block
%      (0=no, 1=yes->default)
%
% out:
% MB (structure) = parameters of the Multiblock model
%
% References:
% mbpls1 is inspired by the mbpls function and uses the functions meanc,
% autosc, rangesc provided by van den Berg
% (http://www.models.life.ku.dk/mbtoolbox)

MB.nF = nF;
MB.Xin = Xin;
MB.ssq = zeros(MB.nF,2);
[nX,mX] = size(X);
[nY,pY] = size(Y);
nbX = size(MB.Xin,2);
if nX ~= nY
    s = ['ERROR: number of objects in X (' num2str(nX) ') and Y-block '...
        '(' num2str(nY) ') is not the same'];
    error(s);
end
% default settings depending on the number of input arguments
if nargin == 4
    MB.Xpp = -ones(nbX,1);
    MB.Ypp = -ones(1,1);
    MB.options = [1e-8 2000 1];
elseif nargin == 5
    MB.Xpp = Xpp;
    MB.Ypp = -ones(1,1);
    MB.options = [1e-8 2000 1];
elseif nargin == 6
    MB.Xpp = Xpp;
    MB.Ypp = Ypp;
    MB.options = [1e-8 2000 1];
else
    MB.Xpp = Xpp;
    MB.Ypp = Ypp;
    MB.options = options;
end
clear nF Xin Xpp Ypp options
% preprocessing of each X-block and total sum of squares
ssqX = zeros(1,nbX+1);
for a=1:nbX
    coli = MB.Xin{a};
    if any(coli > mX)
        error('ERROR: block index is outside of X-block range')
    end
    if MB.Xpp(a) == -1
        inp = input(['X-block scaling (0=none, 1=mean center, '...
            '2=autoscale, 3=range(0-1)scale)?']);
        MB.Xpp(a) = inp;
    else
        inp = MB.Xpp(a);
    end
    switch inp
        case 0
        case 1
            [X(:,coli),MB.moyX(:,coli)] = meanc(X(:,coli));
        case 2
            [X(:,coli),MB.moyX(:,coli),MB.stdX(:,coli)] = ...
                autosc(X(:,coli));
        case 3
            [X(:,coli),MB.rgX(:,coli)] = rangesc(X(:,coli));
        otherwise
            error(['ERROR: X-block scaling must be 0(none), '...
                '1(mean center), 2(autoscale) or 3(range(0-1)scale)']);
    end
    ssqX(a+1) = sum(sum(X(:,coli).^2));
    ssqX(1) = ssqX(1) + ssqX(a+1);
end
% preprocessing of the Y-block
if MB.Ypp == -1
    inp = input(['Y-block scaling (0=none, 1=mean center, 2=autoscale, '...
        '3=range(0-1)scale)?']);
    MB.Ypp = inp;
else
    inp = MB.Ypp;
end
switch inp
    case 0
    case 1
        [Y,MB.moyY] = meanc(Y);
    case 2
        [Y,MB.moyY,MB.stdY] = autosc(Y);
    case 3
        [Y,MB.rgY] = rangesc(Y);
    otherwise
        error(['ERROR: Y-block scaling must be 0(none), '...
            '1(mean center), 2(autoscale) or 3(range(0-1)scale)']);
end
ssqY = sum(sum(Y.^2));


% extraction of the latent variables, one component at a time
for a=1:MB.nF
    iter = 0;
    % start from the Y and X columns with the largest sum of squares
    [Ym,Ymi] = max(sum(Y.^2,1));
    MB.U(:,a) = Y(:,Ymi);
    [Xm,Xmi] = max(sum(X.^2,1));
    MB.Tt(:,a) = X(:,Xmi);
    Tt_old = MB.Tt(:,a)*100;
    while (sum((Tt_old - MB.Tt(:,a)).^2)/sum(Tt_old.^2) > MB.options(1))...
            && (iter < MB.options(2))
        iter = iter + 1;
        Tt_old = MB.Tt(:,a);
        % block weights and block scores
        for aa=1:nbX
            coli = MB.Xin{aa};
            nvarX(aa)=size(X(:,coli),2);
            MB.Wb(coli,a) = X(:,coli)'*MB.U(:,a)/(MB.U(:,a)'*MB.U(:,a));
            MB.Wb(coli,a) = MB.Wb(coli,a)/norm(MB.Wb(coli,a));
            if MB.options(3) == 0
                MB.Tb(:,(a-1)*nbX+aa) = X(:,coli)*MB.Wb(coli,a);
            elseif MB.options(3) == 1
                MB.Tb(:,(a-1)*nbX+aa) = X(:,coli)*MB.Wb(coli,a)/...
                    sqrt(nvarX(aa));
            else
                error(['ERROR: option for X-block scores weighting must '...
                    'be 0(no) or 1(yes)']);
            end
        end
        % super weights, super scores, Y loadings and Y scores
        MB.Wt(:,a) = MB.Tb(:,(a-1)*nbX+1:a*nbX)'*MB.U(:,a)/...
            (MB.U(:,a)'*MB.U(:,a));
        MB.Wt(:,a) = MB.Wt(:,a)/norm(MB.Wt(:,a));
        MB.Tt(:,a) = MB.Tb(:,(a-1)*nbX+1:a*nbX)*MB.Wt(:,a)/...
            (MB.Wt(:,a)'*MB.Wt(:,a));
        MB.Q(:,a) = Y'*MB.Tt(:,a)/(MB.Tt(:,a)'*MB.Tt(:,a));
        MB.U(:,a) = Y*MB.Q(:,a)/(MB.Q(:,a)'*MB.Q(:,a));
    end
    if iter == MB.options(2)
        s = ['WARNING: maximum number of iterations ('...
            num2str(MB.options(2)) ') reached before convergence'];
        disp(s)
    end
    % block loadings and deflation of X and Y with the super scores
    for aa=1:nbX
        coli = MB.Xin{aa};
        MB.Pb(coli,a) = X(:,coli)'*MB.Tt(:,a)/(MB.Tt(:,a)'*MB.Tt(:,a));
        X(:,coli) = X(:,coli) - MB.Tt(:,a)*MB.Pb(coli,a)';
    end
    Y = Y - MB.Tt(:,a)*MB.Q(:,a)';
    MB.ssq(a,2) = (ssqY - sum(sum(Y.^2)))/ssqY;
end
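To make the role of the block weighting option (options(3)) easier to follow, one pass of the inner loop above can be sketched in plain Python. This is an illustrative transcription of the block weight, block score, super weight and super score steps only; the helper names and the toy data are assumptions, and preprocessing, the convergence test and deflation are left out.

```python
import math

def column_dot(block, vec):
    """Return block' * vec for a row-major block (samples x variables)."""
    n_vars = len(block[0])
    return [sum(row[j] * v for row, v in zip(block, vec)) for j in range(n_vars)]

def mat_vec(block, weights):
    """Return block * weights, one value per sample."""
    return [sum(x * w for x, w in zip(row, weights)) for row in block]

def normalise(vec):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def multiblock_pass(blocks, u, weight_by_size=True):
    """One pass: block weights, block scores, super weights, super score."""
    uu = sum(v * v for v in u)
    block_weights, block_scores = [], []
    for block in blocks:
        w_b = normalise([c / uu for c in column_dot(block, u)])
        t_b = mat_vec(block, w_b)
        if weight_by_size:
            # Same idea as options(3) == 1: divide by sqrt(number of variables)
            scale = math.sqrt(len(block[0]))
            t_b = [t / scale for t in t_b]
        block_weights.append(w_b)
        block_scores.append(t_b)
    # Super weights: one value per block, normalised to unit length
    w_t = normalise([sum(t * v for t, v in zip(t_b, u)) / uu
                     for t_b in block_scores])
    # Super score: weighted combination of the block scores
    t_t = [sum(w * t_b[i] for w, t_b in zip(w_t, block_scores))
           for i in range(len(u))]
    return block_weights, block_scores, w_t, t_t

# Toy data (assumed values): 4 samples, a 2-variable block and a 3-variable block
toy_blocks = [
    [[1.0, 0.2], [0.8, -0.1], [-0.9, 0.1], [-0.9, -0.2]],
    [[0.5, 0.1, -0.3], [0.4, 0.0, -0.2], [-0.6, 0.1, 0.3], [-0.3, -0.2, 0.2]],
]
toy_u = [0.5, 0.5, -0.5, -0.5]
w_blocks, t_blocks, w_super, t_super = multiblock_pass(toy_blocks, toy_u)
```

Dividing each block score by the square root of the block size keeps a block with many variables from dominating the super score simply because it is larger.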


Appendix 2: Function developed in Matlab (version R2014b) for calibration with cross-validation and computation of predictions for the multiblock data fusion model (MB-PLS1-DA)

function[Ypred,MBcal,RMSECV,Dif,RMSEC,RMSEP,R2,Q2]=...
    applyMBPLS1(Xcal,Ycal,Xval,Yval,Xin,nF,Xpp,Ypp,options)

% in:
% Xcal (samples x variables) = calibration matrix
% Ycal (samples x classes) = values to predict for the calibration matrix
% Xval (samples x variables) = validation matrix
% Yval (samples x classes) = values to predict for the validation matrix
% Xin = variable ranges of the blocks
%   example: Xin={1:50 51:100} for 2 blocks: X1(:,1:50), X2(:,51:100)
% nF (1 x 1) maximum number of latent variables
% Xpp (1 x blocks) X-block scaling, one value per block (-1 = interactive,
%   0 = none, 1 = mean center, 2 = autoscale, 3 = range 0 to 1 scale)
% Ypp (1 x 1) Y-block scaling (-1 = interactive, 0 = none, 1 = mean center,
%   2 = autoscale, 3 = range 0 to 1 scale)
% options =
%   1: convergence tolerance (default 1e-8)
%   2: maximum number of iterations (default 2000)
%   3: weighting of the scores by the number of variables in each block
%      (0=no, 1=yes->default)
%
% out:
% Ypred (samples x factors) = Y score vectors to compare with the
%   reference Yval
% MBcal (structure) = parameters of the Multiblock model
% RMSECV (1 x factors) = cross-validation error
% Dif (1 x factors) = relative decrease in RMSECV between 2 consecutive
%   latent variables (as a fraction, 0.05 = 5%)
% RMSEC (1 x factors) = calibration error
% RMSEP (1 x factors) = prediction error
% R2 (1 x factors) = coefficient of determination for calibration
% Q2 (1 x factors) = coefficient of determination for prediction
%
% applyMBPLS1 uses the functions meanc, autosc, rangesc and corr2_1
% provided by van den Berg (http://www.models.life.ku.dk/mbtoolbox)
% mbpls1 is based on the function mbpls provided by van den Berg
% (http://www.models.life.ku.dk/mbtoolbox)

[nX,mX] = size(Xcal);
[nY,pY] = size(Ycal);
nbX = size(Xin,2);

% a model predicts a single class
if pY ~= 1
    error('ERROR: this model can only predict one class at a time');
end

% Leave-one-out cross-validation to choose the optimal number of LVs
% for calibration
for a=1:nX
    indexc = [1:a-1 a+1:nX];


    Xc = Xcal(indexc,:);
    Yc = Ycal(indexc,:);
    xp = Xcal(a,:);
    yp = Ycal(a,:);
    MBcv = mbpls1(Xc,Yc,nF,Xin,Xpp,Ypp,options);
    for aa=1:nbX
        coli = Xin{aa};
        if Xpp(aa) == -1
            inp = input(['X-block scaling (0=none, 1=mean center, '...
                '2=autoscale, 3=range(0-1)scale)?']);
            Xpp(aa) = inp;
        else
            inp = Xpp(aa);
        end
        switch inp
            case 0
            case 1
                xp(:,coli) = meanc(xp(:,coli),MBcv.moyX(:,coli));
            case 2
                xp(:,coli) = autosc(xp(:,coli),MBcv.moyX(:,coli),...
                    MBcv.stdX(:,coli));
            case 3
                xp(:,coli) = rangesc(xp(:,coli),MBcv.rgX(:,coli));
            otherwise
                error(['ERROR: X-block scaling must be 0(none), '...
                    '1(mean center), 2(autoscale) or 3(range(0-1)scale)']);
        end
    end
    for aaa=1:nF
        for aa=1:nbX
            coli = Xin{aa};
            nvarX(aa)=size(xp(:,coli),2);
            if options(3) == 0
                Tbpcv(a,(aaa-1)*nbX+aa) = xp(:,coli)*MBcv.Wb(coli,aaa);
            elseif options(3) == 1
                Tbpcv(a,(aaa-1)*nbX+aa) = xp(:,coli)*MBcv.Wb(coli,aaa)/...
                    sqrt(nvarX(aa));
            else
                error(['ERROR: option for X-block scores weighting must '...
                    'be 0(no) or 1(yes)']);
            end
        end
        Ttpcv(a,aaa) = Tbpcv(a,(aaa-1)*nbX+1:aaa*nbX)*MBcv.Wt(:,aaa)/...
            (MBcv.Wt(:,aaa)'*MBcv.Wt(:,aaa));
        if aaa==1
            Ypcv(a,aaa) = Ttpcv(a,aaa)*MBcv.Q(:,aaa);
        else
            Ypcv(a,aaa) = Ypcv(a,aaa-1) + Ttpcv(a,aaa)*MBcv.Q(:,aaa);
        end
        for aa=1:nbX
            coli = Xin{aa};
            xp(:,coli) = xp(:,coli) - Ttpcv(a,aaa)*MBcv.Pb(coli,aaa)';
        end
    end
    if Ypp == -1
        inp = input(['Y-block scaling (0=none, 1=mean center, '...
            '2=autoscale, 3=range(0-1)scale)?']);
        Ypp = inp;
    else
        inp = Ypp;
    end


    switch inp
        case 0
        case 1
            for aaa=1:nF
                Ypcv(a,aaa) = meanc(Ypcv(a,aaa),MBcv.moyY,1);
            end
        case 2
            for aaa=1:nF
                Ypcv(a,aaa) = autosc(Ypcv(a,aaa),MBcv.moyY,MBcv.stdY,1);
            end
        case 3
            for aaa=1:nF
                Ypcv(a,aaa) = rangesc(Ypcv(a,aaa),MBcv.rgY,1);
            end
        otherwise
            error(['ERROR: Y-block scaling must be 0(none), '...
                '1(mean center), 2(autoscale) or 3(range(0-1)scale)']);
    end
end
% Compute the mean cross-validation error RMSECV
for aaa=1:nF
    for a=1:nX
        Ecv(a,aaa) = (Ycal(a,1)-Ypcv(a,aaa)).^2;
    end
    RMSECV(aaa) = sqrt(sum(Ecv(:,aaa))/nX);
    if aaa > 1
        Dif(aaa)=(RMSECV(aaa-1)-RMSECV(aaa))/RMSECV(aaa-1);
    end
end
Min=min(RMSECV);
for aaa=1:nF
    if RMSECV(aaa) == Min
        LVmin=aaa;
    end
end
% Choice of the optimal number of LVs, corresponding to a minimum RMSECV
% and/or %Dif > 5%
figure(1)
subplot(2,1,1);
title('Calibration: RMSECV=f(LV)','FontWeight','bold','FontSize',12);
xlabel('LV','FontWeight','bold','FontSize',12);
ylabel('RMSECV','FontWeight','bold','FontSize',12);
hold on
plot(RMSECV,'k')
plot(LVmin,Min,'r+')
subplot(2,1,2);
title('% difference of RMSECV between consecutive LVs','FontWeight',...
    'bold','FontSize',12);
xlabel('LV','FontWeight','bold','FontSize',12);
ylabel('%Dif','FontWeight','bold','FontSize',12);
hold on
plot(Dif,'k')
plot([0 nF],[0.05 0.05],'r:');
input_LV=inputdlg({['Number of latent variables to build the '...
    'model:']},'LV',[1 35],{'10'});
LV=str2num(input_LV{1});

% Calibration to build the prediction model with the chosen number of LVs
MBcal = mbpls1(Xcal,Ycal,LV,Xin,Xpp,Ypp,options);


% compute the quality parameters of the calibration model, R2 and RMSEC
for aa=1:nbX
    coli = Xin{aa};
    if Xpp(aa) == -1
        inp = input(['X-block scaling (0=none, 1=mean center, '...
            '2=autoscale, 3=range(0-1)scale)?']);
        Xpp(aa) = inp;
    else
        inp = Xpp(aa);
    end
    switch inp
        case 0
        case 1
            Xcal(:,coli) = meanc(Xcal(:,coli),MBcal.moyX(:,coli));
        case 2
            Xcal(:,coli) = autosc(Xcal(:,coli),MBcal.moyX(:,coli),...
                MBcal.stdX(:,coli));
        case 3
            Xcal(:,coli) = rangesc(Xcal(:,coli),MBcal.rgX(:,coli));
        otherwise
            error(['ERROR: X-block scaling must be 0(none), '...
                '1(mean center), 2(autoscale) or 3(range(0-1)scale)']);
    end
end
for a=1:LV
    for aa=1:nbX
        coli = Xin{aa};
        nvarX(aa)=size(Xcal(:,coli),2);
        if options(3) == 0
            Tbmodele(:,(a-1)*nbX+aa) = Xcal(:,coli)*MBcal.Wb(coli,a);
        elseif options(3) == 1
            Tbmodele(:,(a-1)*nbX+aa) = Xcal(:,coli)*MBcal.Wb(coli,a)/...
                sqrt(nvarX(aa));
        else
            error(['ERROR: option for X-block scores weighting must be '...
                '0(no) or 1(yes)']);
        end
    end
    Ttmodele(:,a) = Tbmodele(:,(a-1)*nbX+1:a*nbX)*MBcal.Wt(:,a)/...
        (MBcal.Wt(:,a)'*MBcal.Wt(:,a));
    if a==1
        Ymodele(:,a) = Ttmodele(:,a)*MBcal.Q(:,a);
    else
        Ymodele(:,a) = Ymodele(:,a-1) + Ttmodele(:,a)*MBcal.Q(:,a);
    end
    for aa=1:nbX
        coli = Xin{aa};
        Xcal(:,coli) = Xcal(:,coli) - Ttmodele(:,a)*MBcal.Pb(coli,a)';
    end
end
if Ypp == -1
    inp = input(['Y-block scaling (0=none, 1=mean center, 2=autoscale, '...
        '3=range(0-1)scale)?']);
    Ypp = inp;
else
    inp = Ypp;
end
switch inp
    case 0
    case 1


        for a=1:LV
            Ymodele(:,a) = meanc(Ymodele(:,a),MBcal.moyY,1);
        end
    case 2
        for a=1:LV
            Ymodele(:,a) = autosc(Ymodele(:,a),MBcal.moyY,MBcal.stdY,1);
        end
    case 3
        for a=1:LV
            Ymodele(:,a) = rangesc(Ymodele(:,a),MBcal.rgY,1);
        end
    otherwise
        error(['ERROR: Y-block scaling must be 0(none), 1(mean center), '...
            '2(autoscale) or 3(range(0-1)scale)']);
end
for a=1:LV
    R2(a)=corr2_1(Ycal(:,1),Ymodele(:,a));
    for aa=1:nX
        Ec(aa,a) = (Ycal(aa,1)-Ymodele(aa,a)).^2;
    end
    RMSEC(a) = sqrt(sum(Ec(:,a))/nX);
end

% Prediction of Ypred from Xval with the model MBcal
for aa=1:nbX
    coli = Xin{aa};
    if Xpp(aa) == -1
        inp = input(['X-block scaling (0=none, 1=mean center, '...
            '2=autoscale, 3=range(0-1)scale)?']);
        Xpp(aa) = inp;
    else
        inp = Xpp(aa);
    end
    switch inp
        case 0
        case 1
            Xval(:,coli) = meanc(Xval(:,coli),MBcal.moyX(:,coli));
        case 2
            Xval(:,coli) = autosc(Xval(:,coli),MBcal.moyX(:,coli),...
                MBcal.stdX(:,coli));
        case 3
            Xval(:,coli) = rangesc(Xval(:,coli),MBcal.rgX(:,coli));
        otherwise
            error(['ERROR: X-block scaling must be 0(none), '...
                '1(mean center), 2(autoscale) or 3(range(0-1)scale)']);
    end
end
for a=1:LV
    for aa=1:nbX
        coli = Xin{aa};
        nvarX(aa)=size(Xval(:,coli),2);
        if options(3) == 0
            Tbpred(:,(a-1)*nbX+aa) = Xval(:,coli)*MBcal.Wb(coli,a);
        elseif options(3) == 1
            Tbpred(:,(a-1)*nbX+aa) = Xval(:,coli)*MBcal.Wb(coli,a)/...
                sqrt(nvarX(aa));
        else
            error(['ERROR: option for X-block scores weighting must be '...
                '0(no) or 1(yes)']);
        end
    end


    Ttpred(:,a) = Tbpred(:,(a-1)*nbX+1:a*nbX)*MBcal.Wt(:,a)/...
        (MBcal.Wt(:,a)'*MBcal.Wt(:,a));
    if a==1
        Ypred(:,a) = Ttpred(:,a)*MBcal.Q(:,a);
    else
        Ypred(:,a) = Ypred(:,a-1) + Ttpred(:,a)*MBcal.Q(:,a);
    end
    for aa=1:nbX
        coli = Xin{aa};
        Xval(:,coli) = Xval(:,coli) - Ttpred(:,a)*MBcal.Pb(coli,a)';
    end
end
if Ypp == -1
    inp = input(['Y-block scaling (0=none, 1=mean center, 2=autoscale, '...
        '3=range(0-1)scale)?']);
    Ypp = inp;
else
    inp = Ypp;
end
switch inp
    case 0
    case 1
        for a=1:LV
            Ypred(:,a) = meanc(Ypred(:,a),MBcal.moyY,1);
        end
    case 2
        for a=1:LV
            Ypred(:,a) = autosc(Ypred(:,a),MBcal.moyY,MBcal.stdY,1);
        end
    case 3
        for a=1:LV
            Ypred(:,a) = rangesc(Ypred(:,a),MBcal.rgY,1);
        end
    otherwise
        error(['ERROR: Y-block scaling must be 0(none), 1(mean center), '...
            '2(autoscale) or 3(range(0-1)scale)']);
end
% compute the correlation coefficient Q2 and the prediction error (RMSEP)
k = size(Xval,1);
for a=1:LV
    Q2(a)=corr2_1(Yval(:,1),Ypred(:,a));
    for aa=1:k
        Ep(aa,a) = (Yval(aa,1)-Ypred(aa,a)).^2;
    end
    RMSEP(a) = sqrt(sum(Ep(:,a))/k);
end
% Comparison of Ypred with the reference Yval
figure(2)
title('Validation Set: Predicted vs. Reference','FontWeight','bold',...
    'FontSize',12);
xlabel('Reference Yval','FontWeight','bold','FontSize',12)
ylabel('Predicted Ypred','FontWeight','bold','FontSize',12)
hold on
plot(Yval(:,1),Ypred(:,LV),'bo');
hold on
plot([-0.2 1.2],[0.5 0.5],'k');
plot([-0.2 1.2],[-0.4 -0.4],'r:');
plot([-0.2 1.2],[0.4 0.4],'r:');
plot([-0.2 1.2],[0.6 0.6],'r');
plot([-0.2 1.2],[1.4 1.4],'r');
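
The choice of the number of latent variables in this function relies on two indicators, the minimum of RMSECV and a relative decrease of RMSECV larger than 5% between consecutive latent variables. The sketch below shows how these two indicators can be computed in vectorised form on a small, made-up matrix of cross-validated predictions. The data and the variable LVdif are hypothetical, and the automatic suggestion is only an aid: in the function above, the final choice is left to the user through a dialog box.

```matlab
% Hypothetical data: 4 samples, cross-validated predictions for 4 LVs.
Ycal = [1; 0; 1; 0];
Ypcv = [0.5 0.8 0.805 0.79;
        0.5 0.2 0.195 0.21;
        0.5 0.8 0.805 0.79;
        0.5 0.2 0.195 0.21];

% RMSECV for each number of latent variables
res    = bsxfun(@minus, Ycal, Ypcv);     % residuals, samples x LVs
RMSECV = sqrt(mean(res.^2,1));

% Relative decrease of RMSECV between consecutive LVs (Dif(1) = 0)
Dif = [0, (RMSECV(1:end-1) - RMSECV(2:end))./RMSECV(1:end-1)];

% Indicator 1: LV giving the minimum RMSECV (first one in case of ties)
[Min, LVmin] = min(RMSECV);

% Indicator 2 (assumed reading): last LV whose decrease exceeds 5%
LVdif = find(Dif > 0.05, 1, 'last');
```

In this example the minimum RMSECV is reached with 3 latent variables, whereas the last decrease larger than 5% occurs at 2 latent variables, which shows why both plots are inspected before a value is entered.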


Appendix 3: Function developed in Matlab (version R2014b) for calibration with cross-validation and computation of predictions for the PLS1-DA model

function[Ypred,PLScal,RMSECV,Dif,RMSEC,RMSEP,R2,Q2]=...
    applyPLS1(Xcal,Ycal,Xval,Yval,nF,Xpp,Ypp,options)

% in:
% Xcal (samples x variables) = calibration matrix
% Ycal (samples x classes) = values to predict for the calibration matrix
% Xval (samples x variables) = validation matrix
% Yval (samples x classes) = values to predict for the validation matrix
% nF (1 x 1) maximum number of latent variables
% Xpp (1 x 1) X-block scaling (-1 = interactive, 0 = none, 1 = mean center,
%   2 = autoscale, 3 = range 0 to 1 scale)
% Ypp (1 x 1) Y-block scaling (-1 = interactive, 0 = none, 1 = mean center,
%   2 = autoscale, 3 = range 0 to 1 scale)
% options (1 x 2) convergence tolerance and maximum number of iterations
%   (default 1e-8 and 2000)
%
% out:
% Ypred (samples x factors) = Y score vectors to compare with the
%   reference Yval
% PLScal (structure) = parameters of the PLS model
% RMSECV (1 x factors) = cross-validation error
% Dif (1 x factors) = relative decrease in RMSECV between 2 consecutive
%   latent variables (as a fraction, 0.05 = 5%)
% RMSEC (1 x factors) = calibration error
% RMSEP (1 x factors) = prediction error
% R2 (1 x factors) = coefficient of determination for calibration
% Q2 (1 x factors) = coefficient of determination for prediction
%
% References:
% applyPLS1 uses the functions mypls (adapted in pls1), meanc, autosc,
% rangesc and corr2_1 provided by van den Berg
% (http://www.models.life.ku.dk/mbtoolbox)

[nX,mX] = size(Xcal);
[nY,pY] = size(Ycal);

% a model predicts a single class
if pY ~= 1
    error('ERROR: this model can only predict one class at a time');
end

% Leave-one-out cross-validation to choose the optimal number of LVs
% for calibration
for a=1:nX
    indexc = [1:a-1 a+1:nX];
    Xc = Xcal(indexc,:);
    Yc = Ycal(indexc,:);
    xp = Xcal(a,:);
    yp = Ycal(a,:);
    PLScv = pls1(Xc,Yc,nF,Xpp,Ypp,options);
    if Xpp == -1
        inp = input(['X-block scaling (0=none, 1=mean center, '...
            '2=autoscale, 3=range(0-1)scale)?']);
        Xpp = inp;


    else
        inp = Xpp;
    end
    switch inp
        case 0
        case 1
            xp = meanc(xp,PLScv.moyX);
        case 2
            xp = autosc(xp,PLScv.moyX,PLScv.stdX);
        case 3
            xp = rangesc(xp,PLScv.rgX);
        otherwise
            error(['ERROR: X-block scaling must be 0(none), '...
                '1(mean center), 2(autoscale) or 3(range(0-1)scale)']);
    end
    for aa=1:nF
        Tpcv(a,aa) = xp*PLScv.W(:,aa)/(PLScv.W(:,aa)'*PLScv.W(:,aa));
        if aa == 1
            Ypcv(a,aa) = Tpcv(a,aa)*PLScv.Q(:,aa)';
        else
            Ypcv(a,aa) = Ypcv(a,aa-1) + Tpcv(a,aa)*PLScv.Q(:,aa)';
        end
        xp = xp - Tpcv(a,aa)*PLScv.P(:,aa)';
    end
    % back-transform the predictions with the Y scaling parameters
    for aa=1:nF
        if Ypp == 1
            Ypcv(a,aa) = meanc(Ypcv(a,aa),PLScv.moyY,1);
        elseif Ypp == 2
            Ypcv(a,aa) = autosc(Ypcv(a,aa),PLScv.moyY,PLScv.stdY,1);
        elseif Ypp == 3
            Ypcv(a,aa) = rangesc(Ypcv(a,aa),PLScv.rgY,1);
        end
    end
end
% Compute the mean cross-validation error RMSECV
for aa=1:nF
    for a=1:nX
        Ecv(a,aa) = (Ycal(a,:)-Ypcv(a,aa)).^2;
    end
    RMSECV(aa) = sqrt(sum(Ecv(:,aa))/nX);
    if aa > 1
        Dif(aa)=(RMSECV(aa-1)-RMSECV(aa))/RMSECV(aa-1);
    end
end
Min=min(RMSECV);
for aa=1:nF
    if RMSECV(aa) == Min
        LVmin=aa;
    end
end
% Choice of the optimal number of LVs, corresponding to a minimum RMSECV
% and/or %Dif > 5%
figure(1)
subplot(2,1,1);
title('Calibration: RMSECV=f(LV)','FontWeight','bold','FontSize',12);
xlabel('LV','FontWeight','bold','FontSize',12);
ylabel('RMSECV','FontWeight','bold','FontSize',12);
hold on
plot(RMSECV,'k')
plot(LVmin,Min,'r+')
subplot(2,1,2);


title('% difference of RMSECV between consecutive LVs','FontWeight',...
    'bold','FontSize',12);
xlabel('LV','FontWeight','bold','FontSize',12);
ylabel('%Dif','FontWeight','bold','FontSize',12);
hold on
plot(Dif,'k')
plot([0 nF],[0.05 0.05],'r:');
input_LV=inputdlg({['Number of latent variables to build the '...
    'model:']},'LV',[1 35],{'10'});
LV=str2num(input_LV{1});

% Calibration to build the prediction model with the chosen number of LVs
PLScal = pls1(Xcal,Ycal,LV,Xpp,Ypp,options);
% compute the quality parameters of the calibration model, R2 and RMSEC
if Xpp == -1
    inp = input(['X-block scaling (0=none, 1=mean center, 2=autoscale, '...
        '3=range(0-1)scale)?']);
    Xpp = inp;
else
    inp = Xpp;
end
switch inp
    case 0
    case 1
        Xcal = meanc(Xcal,PLScal.moyX);
    case 2
        Xcal = autosc(Xcal,PLScal.moyX,PLScal.stdX);
    case 3
        Xcal = rangesc(Xcal,PLScal.rgX);
    otherwise
        error(['ERROR: X-block scaling must be 0(none), 1(mean center), '...
            '2(autoscale) or 3(range(0-1)scale)']);
end
for aa=1:LV
    Tmodele(:,aa) = Xcal*PLScal.W(:,aa)/(PLScal.W(:,aa)'*PLScal.W(:,aa));
    if aa == 1
        Ymodele(:,aa) = Tmodele(:,aa)*PLScal.Q(:,aa)';
    else
        Ymodele(:,aa) = Ymodele(:,aa-1) + Tmodele(:,aa)*PLScal.Q(:,aa)';
    end
    Xcal = Xcal - Tmodele(:,aa)*PLScal.P(:,aa)';
end
% back-transform the predictions with the Y scaling parameters
for aa=1:LV
    if Ypp == 1
        Ymodele(:,aa) = meanc(Ymodele(:,aa),PLScal.moyY,1);
    elseif Ypp == 2
        Ymodele(:,aa) = autosc(Ymodele(:,aa),PLScal.moyY,PLScal.stdY,1);
    elseif Ypp == 3
        Ymodele(:,aa) = rangesc(Ymodele(:,aa),PLScal.rgY,1);
    end
end
for aa=1:LV
    R2(aa)=corr2_1(Ycal,Ymodele(:,aa));
    for a=1:nX
        Ec(a,aa) = (Ycal(a,:)-Ymodele(a,aa)).^2;
    end
    RMSEC(aa) = sqrt(sum(Ec(:,aa))/nX);
end


% Prediction of Ypred from Xval with the model PLScal
if Xpp == -1
    inp = input(['X-block scaling (0=none, 1=mean center, 2=autoscale, '...
        '3=range(0-1)scale)? : ']);
    Xpp = inp;
else
    inp = Xpp;
end
switch inp
    case 0
    case 1
        Xval = meanc(Xval,PLScal.moyX);
    case 2
        Xval = autosc(Xval,PLScal.moyX,PLScal.stdX);
    case 3
        Xval = rangesc(Xval,PLScal.rgX);
    otherwise
        error(['ERROR: X-block scaling must be 0(none), 1(mean center), '...
            '2(autoscale) or 3(range(0-1)scale)']);
end
for aa=1:LV
    Tpred(:,aa) = Xval*PLScal.W(:,aa)/(PLScal.W(:,aa)'*PLScal.W(:,aa));
    if aa == 1
        Ypred(:,aa) = Tpred(:,aa)*PLScal.Q(:,aa)';
    else
        Ypred(:,aa) = Ypred(:,aa-1) + Tpred(:,aa)*PLScal.Q(:,aa)';
    end
    Xval = Xval - Tpred(:,aa)*PLScal.P(:,aa)';
end
% back-transform the predictions with the Y scaling parameters
for aa=1:LV
    if Ypp == 1
        Ypred(:,aa) = meanc(Ypred(:,aa),PLScal.moyY,1);
    elseif Ypp == 2
        Ypred(:,aa) = autosc(Ypred(:,aa),PLScal.moyY,PLScal.stdY,1);
    elseif Ypp == 3
        Ypred(:,aa) = rangesc(Ypred(:,aa),PLScal.rgY,1);
    end
end
% compute the correlation coefficient Q2 and the prediction error (RMSEP)
k = size(Xval,1);
for aa=1:LV
    Q2(aa)=corr2_1(Yval,Ypred(:,aa));
    for a=1:k
        Ep(a,aa) = (Yval(a,:)-Ypred(a,aa)).^2;
    end
    RMSEP(aa) = sqrt(sum(Ep(:,aa))/k);
end
% Comparison of Ypred with the reference Yval
figure(2)
title('Validation Set: Predicted vs. Reference','FontWeight','bold',...
    'FontSize',12);
xlabel('Reference Yval','FontWeight','bold','FontSize',12)
ylabel('Predicted Ypred','FontWeight','bold','FontSize',12)
hold on
plot(Yval,Ypred(:,LV),'bo');
hold on
plot([-0.2 1.2],[0.5 0.5],'k');
plot([-0.2 1.2],[-0.4 -0.4],'r:');
plot([-0.2 1.2],[0.4 0.4],'r:');
plot([-0.2 1.2],[0.6 0.6],'r');
plot([-0.2 1.2],[1.4 1.4],'r');
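
The horizontal lines drawn in the last figure (at -0.4, 0.4, 0.6 and 1.4, around the usual 0.5 threshold) suggest how predicted values can be turned into a decision. The sketch below is one possible reading of these limits, given here only as an illustration: the interpretation of the two bands as acceptance zones, the label coding (1, -1, 0) and the example values are assumptions, not the decision rule as implemented in the thesis.

```matlab
% Hypothetical predicted values for five samples (class coded 1, others 0)
ypred = [0.95; 0.05; 0.5; 1.6; -0.3];

% Limits taken from the lines plotted in figure(2)
inClass  = ypred >= 0.6  & ypred <= 1.4;   % band around the target value 1
outClass = ypred >= -0.4 & ypred <= 0.4;   % band around the target value 0

% Assumed coding: 1 = assigned to the class, -1 = rejected,
% 0 = outside both bands (undecided or atypical sample)
label = zeros(size(ypred));
label(inClass)  = 1;
label(outClass) = -1;
```

Under this reading, a sample predicted at 0.5 or at 1.6 is flagged instead of being forced into one of the two groups, which a single threshold at 0.5 would not allow.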


Appendix 3: Function developed in Matlab (version R2014b) for computing predictions with the data fusion model using dimension reduction by PCA (PCA-PLS1-DA)

function[Ypred,HPLScal,RMSECV,Dif,RMSEC,RMSEP,R2,Q2,PCAcal,PCAval]=...
    applyPCAPLS1(Xcal,Ycal,Xval,Yval,Xin,nPC,optionsPCA,Xpp,Ypp,optionsPLS)

% WARNING: the model is designed for a single Y class (PLS1-DA); if several
% classes must be predicted, the algorithm has to be run again to build a
% model for each class
%
% in:
% Xcal (samples x variables) = calibration matrix
% Ycal (samples x classes) = values to predict for the calibration matrix
% Xval (samples x variables) = validation matrix
% Yval (samples x classes) = values to predict for the validation matrix
% Xin = variable ranges of the blocks
%   example: Xin={1:50 51:100} for 2 blocks: X1(:,1:50), X2(:,51:100)
% nPC = number of factors for the PCA computed on each block
% optionsPCA =
%   1: convergence tolerance (default = 1e-6)
%   2: maximum number of iterations (default = 1000)
%   example: [1e-8 2000]
% Xpp (1 x 1) X-block scaling (-1 = interactive, 0 = none, 1 = mean center,
%   2 = autoscale, 3 = range 0 to 1 scale)
% Ypp (1 x 1) Y-block scaling (-1 = interactive, 0 = none, 1 = mean center,
%   2 = autoscale, 3 = range 0 to 1 scale)
% optionsPLS =
%   1: convergence tolerance (default 1e-8)
%   2: maximum number of iterations (default 2000)
%   example: [1e-8 2000]
%
% out:
% Ypred (samples x factors) = Y score vectors to compare with the
%   reference Yval
% HPLScal (structure) = parameters of the Hierarchical model
% RMSECV (1 x factors) = cross-validation error
% Dif (1 x factors) = relative decrease in RMSECV between 2 consecutive
%   latent variables (as a fraction, 0.05 = 5%)
% RMSEC (1 x factors) = calibration error
% RMSEP (1 x factors) = prediction error
% R2 (1 x factors) = coefficient of determination for calibration
% Q2 (1 x factors) = coefficient of determination for prediction
% PCAcal (structure) = PCA models for each calibration block
% PCAval (structure) = PCA scores obtained with PCAcal for each
%   validation block
%
% References:
% applyPCA uses the pca function built into Matlab
% (https://fr.mathworks.com/help/stats/pca.html)
% applyPLS1 uses the functions mypls (adapted in pls1), meanc, autosc,
% rangesc and corr2_1 provided by van den Berg
% (http://www.models.life.ku.dk/mbtoolbox)

% Step 1: PCA on each block
[nX,mX] = size(Xcal);


nbX = size(Xin,2);
HXcal = [];
HXval = [];
for a=1:nbX
    coli = Xin{a};
    s = ['X-Block #' num2str(a)];
    disp(s);
    s = [num2str(length(coli)) ' variables'];
    disp(s)
    if any(coli > mX)
        error('ERROR: block index is outside of X-block range')
    end
    Xmv = sparse(isnan(Xcal(:,coli)));
    pXmv = sum(sum(Xmv))/(nX*length(coli))*100;
    s = [num2str(pXmv) '% missing values'];
    disp(s);
    % build the PCA with the calibration set
    PCAcal{a} = applyPCA(Xcal(:,coli),nPC,optionsPCA);
    % apply the PCA to the validation set
    PCAval{a} = (Xval(:,coli)-ones(size(Xval,1),1)*PCAcal{a}.mu)*...
        PCAcal{a}.P;
    % concatenate the scores obtained with the PCAs
    % +++ to add normalisation:
    % +++ PCAcal{a}.T = PCAcal{a}.T/norm(PCAcal{a}.T);
    % +++ PCAval{a} = PCAval{a}/norm(PCAval{a});
    HXcal = [HXcal PCAcal{a}.T];
    HXval = [HXval PCAval{a}];
end

% Step 2: PLS model on the PCA scores obtained in step 1
nF = nbX*nPC;
[Ypred,HPLScal,RMSECV,Dif,RMSEC,RMSEP,R2,Q2]=applyPLS1(HXcal,Ycal,HXval,...
    Yval,nF,Xpp,Ypp,optionsPLS);
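
The first step of this function projects each validation block onto the loadings obtained from the corresponding calibration block, after subtracting the calibration means. The sketch below reproduces this projection on a small made-up block. Since the function applyPCA is not listed here, the decomposition is done with the built-in svd function instead; this substitution, the example data and the local variable names are assumptions made for the illustration.

```matlab
% Hypothetical block: 4 calibration and 2 validation samples, 3 variables
Xcal = [2 1 0; 4 3 1; 6 5 1; 8 7 2];
Xval = [3 2 1; 7 6 1];
nPC  = 2;                                   % number of components kept

% PCA of the calibration block by singular value decomposition
mu = mean(Xcal,1);                          % calibration column means
Xc = Xcal - ones(size(Xcal,1),1)*mu;        % centered calibration block
[~,~,V] = svd(Xc,'econ');
P  = V(:,1:nPC);                            % loadings (variables x nPC)
T  = Xc*P;                                  % calibration scores

% Projection of the validation block with the calibration mean and loadings
Tval = (Xval - ones(size(Xval,1),1)*mu)*P;  % validation scores
```

The scores T and Tval obtained for each block would then be concatenated column-wise, as done with HXcal and HXval, before the second-level PLS model is built. The sign of each loading vector is arbitrary, so scores may be mirrored from one implementation to another without affecting the model.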


Appendix 4: Function developed in Matlab (version R2014b) for computing predictions with the data fusion model using dimension reduction by PLS (PLS-PLS1-DA)

function[Ypred,HPLScal,RMSECV,Dif,RMSEC,RMSEP,R2,Q2,PLScal,PLSval]=...
    applyPLSPLS1(Xcal,Ycal,Xval,Yval,Xin,nLV,Xpp,Ypp,options)

% WARNING: the model is designed for a single Y class (PLS1-DA); if several
% classes must be predicted, the algorithm has to be run again to build a
% model for each class
%
% in:
% Xcal (samples x variables) = calibration matrix
% Ycal (samples x classes) = values to predict for the calibration matrix
% Xval (samples x variables) = validation matrix
% Yval (samples x classes) = values to predict for the validation matrix
%
% Xin = variable ranges of the blocks
%   example: Xin={1:50 51:100} for 2 blocks: X1(:,1:50), X2(:,51:100)
% nLV: number of factors for the first-level PLS on each block
% Xpp (1 x 1) X-block scaling (-1 = interactive, 0 = none, 1 = mean center,
%   2 = autoscale, 3 = range 0 to 1 scale)
% Ypp (1 x 1) Y-block scaling (-1 = interactive, 0 = none, 1 = mean center,
%   2 = autoscale, 3 = range 0 to 1 scale)
% options =
%   1: convergence tolerance (default 1e-8)
%   2: maximum number of iterations (default 2000)
%   example: [1e-8 2000]
%
% out:
% Ypred (samples x factors) = Y score vectors to compare with the
%   reference Yval
% HPLScal (structure) = parameters of the Sequential model
% RMSECV (1 x factors) = cross-validation error
% Dif (1 x factors) = relative decrease in RMSECV between 2 consecutive
%   latent variables (as a fraction, 0.05 = 5%)
% RMSEC (1 x factors) = calibration error
% RMSEP (1 x factors) = prediction error
% R2 (1 x factors) = coefficient of determination for calibration
% Q2 (1 x factors) = coefficient of determination for prediction
% PLScal (structure) = PLS models for each calibration block
% PLSval (structure) = PLS scores obtained with PLScal for each
%   validation block
%
% References:
% applyPLS1 uses the functions mypls (adapted in pls1), meanc, autosc,
% rangesc and corr2_1 provided by van den Berg
% (http://www.models.life.ku.dk/mbtoolbox)

% Step 1: PLS on each block
[nX,mX] = size(Xcal);
nbX = size(Xin,2);
HXcal = [];
HXval = [];
for a=1:nbX
    coli = Xin{a};
    s = ['X-Block #' num2str(a)];


    disp(s);
    s = [num2str(length(coli)) ' variables'];
    disp(s)
    if any(coli > mX)
        error('ERROR: block index is outside of X-block range')
    end
    Xmv = sparse(isnan(Xcal(:,coli)));
    pXmv = sum(sum(Xmv))/(nX*length(coli))*100;
    s = [num2str(pXmv) '% missing values'];
    disp(s);
    % build the PLScal models with the calibration set
    PLScal{a} = pls1(Xcal(:,coli),Ycal,nLV,Xpp,Ypp,options);
    % apply PLScal to compute the scores of the validation set
    if Xpp == -1
        inp = input(['X-block scaling (0=none, 1=mean center, '...
            '2=autoscale, 3=range(0-1)scale)? : ']);
        Xpp = inp;
    else
        inp = Xpp;
    end
    switch inp
        case 0
        case 1
            Xval(:,coli) = meanc(Xval(:,coli),PLScal{a}.moyX);
        case 2
            Xval(:,coli) = autosc(Xval(:,coli),PLScal{a}.moyX,...
                PLScal{a}.stdX);
        case 3
            Xval(:,coli) = rangesc(Xval(:,coli),PLScal{a}.rgX);
        otherwise
            error(['ERROR: X-block scaling must be 0(none), '...
                '1(mean center), 2(autoscale) or 3(range(0-1)scale)']);
    end
    for aa=1:nLV
        PLSval{a}.Tpred(:,aa) = Xval(:,coli)*PLScal{a}.W(:,aa)/...
            (PLScal{a}.W(:,aa)'*PLScal{a}.W(:,aa));
        Xval(:,coli) = Xval(:,coli) - ...
            PLSval{a}.Tpred(:,aa)*PLScal{a}.P(:,aa)';
    end
    % concatenate the scores obtained with the first-level PLS
    % +++ to add normalisation:
    % +++ PLScal{a}.T = PLScal{a}.T/norm(PLScal{a}.T);
    % +++ PLSval{a}.Tpred = PLSval{a}.Tpred/norm(PLSval{a}.Tpred);
    HXcal = [HXcal PLScal{a}.T];
    HXval = [HXval PLSval{a}.Tpred];
end

% Step 2: PLS model on the PLS scores obtained in step 1
nF = nbX*nLV;
[Ypred,HPLScal,RMSECV,Dif,RMSEC,RMSEP,R2,Q2]=applyPLS1(HXcal,Ycal,HXval,...
    Yval,nF,Xpp,Ypp,options);
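
All the data fusion functions of these appendices expect a single concatenated matrix together with the cell array Xin giving the column range of each block (for example Xin={1:50 51:100}). The sketch below shows one way to build both from separate block matrices. The block names (GC, NIR and MIR), their sizes and the random content are hypothetical and serve only to illustrate the layout.

```matlab
% Hypothetical blocks measured on the same 5 samples
nS   = 5;
Xgc  = rand(nS,14);     % e.g. fatty acid composition
Xnir = rand(nS,100);    % e.g. near-infrared spectra
Xmir = rand(nS,80);     % e.g. mid-infrared spectra

% Column-wise concatenation of the blocks
blocks = {Xgc, Xnir, Xmir};
X = [blocks{:}];

% Column ranges of each block in the concatenated matrix
nvar  = cellfun(@(B) size(B,2), blocks);    % variables per block
last  = cumsum(nvar);                       % last column of each block
first = last - nvar + 1;                    % first column of each block
Xin   = arrayfun(@(f,l) f:l, first, last, 'UniformOutput', false);
```

Here Xin is a 1 x 3 cell array equal to {1:14 15:114 115:194}, so that size(Xin,2) gives the number of blocks and X(:,Xin{a}) returns block number a.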


Résumé
Il est important d’assurer l’authenticité et la traçabilité des produits alimentaires pour faire face aux préoccupations des consommateurs et aux pertes économiques engendrées par les cas de fraude. Or les huiles d’olive, en particulier celles dont l’origine leur apporte une valeur ajoutée supplémentaire, sont parmi les produits les plus touchés par les fraudes. Cependant, les analyses officielles actuelles s’intéressent seulement aux critères de pureté et de qualité des huiles, mais pas à la confirmation de leur origine variétale. Cette thèse propose donc des outils chimiométriques visant à faciliter l’application de la reconnaissance variétale des huiles d’olive lors de contrôles de routine. D’une part, en ajoutant au modèle PLS1-DA une règle de décision par carte de contrôle, les échantillons dont les caractéristiques s’éloignent du profil de référence peuvent être facilement et rapidement identifiés, même en présence d’effectifs déséquilibrés entre les classes à prédire. D’autre part, le développement de stratégies de fusion des données permet de bénéficier de la synergie entre les sources d’information complémentaires que sont l’analyse spécifique des acides gras par chromatographie gazeuse et les analyses globales par spectroscopies proche- et moyen-infrarouge.

Mots-clés chimiométrie, PLS-DA, huile d’olive, cultivars, carte de contrôle, fusion de données, GC, MIR, NIR

Abstract
Ensuring the authenticity and traceability of food products is important to address consumer concerns and the economic losses caused by food fraud. Olive oils, especially those whose origin gives them added value, are among the products most affected by fraud. However, current official analyses focus only on the purity and quality criteria of the oils, not on confirming their varietal origin. This thesis therefore proposes chemometric tools to facilitate the varietal recognition of olive oils in routine controls. On the one hand, adding a control chart decision rule to the PLS1-DA model makes it possible to identify easily and quickly the samples whose characteristics deviate from the reference profile, even when the classes to predict are unbalanced in size. On the other hand, the development of data fusion strategies makes it possible to benefit from the synergy between complementary sources of information, namely the specific analysis of fatty acids by gas chromatography and the global analyses by near- and mid-infrared spectroscopy.

Keywords chemometrics, PLS-DA, olive oil, cultivars, control chart, data fusion, GC, MIR, NIR