Università degli Studi di Udine
Dipartimento di Matematica e Informatica
Dottorato di Ricerca in Informatica, Ciclo XXV

Ph.D. Thesis

The PLS regression model: algorithms and application to chemometric data

Candidate: Del Zotto Stefania
Supervisor: prof. Roberto Vito

Anno Accademico 2012-2013

Author’s e-mail: [email protected]

Author’s address:

Dipartimento di Matematica e Informatica
Università degli Studi di Udine
Via delle Scienze, 206
33100 Udine
Italia

To all the kinds of learners.

Abstract

The core argument of the Ph.D. thesis is Partial Least Squares (PLS), a class of techniques for modelling relations between sets of observed quantities through latent variables. With this trick, PLS can extract useful information even from huge datasets and it manages the computational complexity brought on by such tall and fat data. Indeed, the main strength of PLS is that it performs accurately even with more variables than instances and in the presence of collinearity. The aim of the thesis is to give an incremental description of PLS, together with its tasks, advantages and drawbacks, starting from an overview of the discipline where PLS takes place, up to the application of PLS to a real dataset, moving through a critical comparison with alternative techniques. For this reason, after a brief introduction of both Machine Learning and a corresponding general working procedure, the first Chapters explain PLS theory and present competing methods. Then, PLS regression is computed on a measured dataset and the concrete results are evaluated. Conclusions are drawn with respect to both theoretical and practical topics and future perspectives are highlighted.

The first part starts by describing Machine Learning, its topics, approaches, and the techniques designed for them. Then, a general working procedure for dealing with such problems is presented with its three main phases. Secondly, a systematic overview of the PLS methods is given, together with their mathematical assumptions and algorithmic features. In its general form PLS, developed by Wold and co-workers, defines new components by maximizing the covariance between two different blocks of original variables. Then, the measured data are projected onto the more compact latent space in which the model is built. The classical computation of PLS is based on the Nonlinear Iterative PArtial Least Squares (NIPALS) algorithm, but different variants have been proposed. The PLS tasks include regression and classification, as well as dimension reduction. Nonlinear PLS can also be specified. Since PLS works successfully with large numbers of strictly linked variables with noticeable computational simplicity, it turns out to be a powerful and versatile tool that can be used in many areas of research and industrial applications. Thirdly, PLS can be compared with competing choices. As far as regression is concerned, PLS is connected with Ordinary Least Squares (OLS), Ridge Regression (RR) and Canonical Correlation Analysis (CCA). PLS can also be related to Principal Component Analysis and Regression (PCA and PCR). Moreover, Canonical Ridge Analysis is a single approach that collects all the previous methods as special cases. In addition, all these techniques can be cast under a unifying approach called continuum regression. Some solutions, such as OLS, are the most common estimation procedures and have desirable features, but at the same time they require strict constraints to be applied successfully. Since measured data from real applications do not usually comply with the needed conditions, alternatives should be used. PLS, RR, CCA and PCR try to solve this problem, but the performance of PLS overcomes the others. As regards classification and nonlinearity, PLS can be linked with Linear Discriminant Analysis (LDA) and Support Vector Machines (SVM). LDA, like PLS, is a supervised classification and dimensionality reduction technique, but LDA has a limited representation ability and makes assumptions that real data usually do not fulfil, causing the computation to fail. SVM has some interesting properties that make it a competitor with respect to PLS, but its advantages and drawbacks should be carefully analyzed. SVM also allows the PLS approach to be expanded to comprise nonlinearity.

In the second part PLS is applied to a chemometric problem. The aim is to define a model able to predict the fumonisins content of maize from Near InfraRed (NIR) spectra. A dataset collects NIR spectra of maize samples and their content of fumonisins (a group of mycotoxins). The NIR wavelengths are almost a thousand strictly collinear variables, whereas the observations are of the order of hundreds. A model is computed using PLS with full cross validation, after identifying the outliers among both the instances and the wavelengths. The results are promising as regards the correlation between measured and fitted values, with respect to both calibration and validation. In addition, the relation between the measured quantities is represented by a simpler model with about twenty latent variables and a decreasing residual variance. Moreover, the model assures screening ability with respect to the most important European legal limits, fixed as the tolerable daily intake for fumonisin consumption of maize-based food. Furthermore, it can also correctly classify other groups of contaminated maize. In conclusion, maize as well as maize-based feed and food are a common dietary staple for both cattle and people. However, fungal infection of maize is commonly reported throughout the world. Moreover, fungi may produce mycotoxins, and consumption of contaminated maize causes serious diseases. As a consequence, procedures to measure the fumonisins content of maize batches as early as possible are highly needed in order to safely separate dangerous maize from healthy maize. As a matter of fact, traditional techniques for the measurement of mycotoxins are expensive, slow, destructive, and require trained personnel, whereas NIR spectroscopy is cheaper, faster, simpler, and produces no waste materials. Therefore, PLS modelling techniques grounded on NIR spectroscopy data provide a quite promising methodology for fumonisins detection in maize. As usual, further systematic analyses are required in order to confirm the conclusions and improve the results.

Acknowledgments

During these years at the Department of Mathematics and Computer Science of the University of Udine I met people without whom my work would not have been possible. I would like to thank all of them for their insightful guidance and help. I would like to express my sincere gratitude to my supervisor prof. Vito Roberto for his continuous support, precious advice and welcome suggestions that allowed me to improve myself and my studies, capabilities and results. Words cannot describe how lucky I feel for the chance to work with prof. Giacomo Della Riccia. At the Research Center “Norbert Wiener” we developed the statistical analysis together, and his enthusiasm and profound experience were essential to overcome any obstacle. He is a very nice person who delighted me with his incredible and amazing stories, memories of a life, that make my thoughts travel from Alexandria to Moscow, from Paris to Boston. I would also like to thank my reviewers, professors Corrado Lagazio, Carlo Lauro and Ely Merzbach, as well as professors Ruggero Bellio, Francesco Fabris, Hans Lenz and Paolo Vidoni, for their invaluable feedback, encouragement and useful discussions. I am grateful to the director of the Ph.D. Course in Computer Science, prof. Alberto Policriti, who always answered my questions kindly and carefully followed my student career step by step. Furthermore, I would like to thank every professor of the courses I followed: not only were they helpful in lecturing, they also gave excellent suggestions for the final reports. The described case study is a part of “Micosafe”, a project funded by the Natural, Agricultural and Forestry Resources Office of the Friuli Venezia Giulia Region, proposed and developed by the Department of Agricultural and Environmental Sciences of the University of Udine in cooperation with the International Research Center for the Mountain (CirMont) and the Breeders Association of FVG. I thank professors Bruno Stefanon and Giuseppe Firrao for choosing the Department of Mathematics and Computer Science as their partner for the statistical analysis and for making the data available. Brigitta Gaspardo, Emanuela Torelli and Sirio Cividino are the researchers with whom I collaborated. I would like to express my gratitude to them. The fruitful and enjoyable experience of my studies could not have been accomplished without the daily presence of the people who love me and live every day with me, listening to me and sharing thoughts and feelings. I take this occasion to express my thanks to each of them. My parents, Laura and Renato, sustain me and highlight my shortcomings to make me a better woman. Their life, their behaviour and their choices are the first visible example and model for me. My brothers, Simone and Michele, fill my life with their “disorder”: they make me happy when I am sad or angry, they speak aloud when I am too silent, they ignore my recurring useless comments. Simone shares his life and time with me in all the possible ways. He understands me even without words, he sees deep inside me and he respects me and my moody moments with endless perseverance. The presence of my extended family is also essential: grandparents, aunts and uncles, cousins and other relatives make me feel glad and we stay joyfully together. Claudia and her group of teenagers warm my heart with their smiles, their hugs and their jokes. The presence of many other people enriches my days. My friend Cristina spent many hours near me, always listening to my questions and trying to help me.
My colleague Omar puts up with my grumbles and still asks me to join him for a coffee break. With Elena I have the most restful talks, where she gives me the best advice for every kind of situation. Furthermore, there are many other people whom I cannot list here but whose friendship is essential for me, with whom I spend pleasant moments of spare time and who can never be forgotten.

Contents

Introduction

I First Part

1 Preliminaries
  1.1 Machine Learning Taxonomy
    1.1.1 Unsupervised learning
    1.1.2 Supervised learning
  1.2 Working methodology
    1.2.1 Model fitting
    1.2.2 Model selection
    1.2.3 Decision making

2 Partial Least Squares (PLS)
  2.1 Introduction
  2.2 PLS theory
    2.2.1 Mathematical structure
    2.2.2 NIPALS algorithm
    2.2.3 Geometric representation
  2.3 PLS tasks
    2.3.1 Regression
    2.3.2 Classification
    2.3.3 Dimension reduction and nonlinear PLS
  2.4 PLS applications
    2.4.1 Chemometrics
    2.4.2 Computer vision
  2.5 Software for PLS

3 Related methods
  3.1 Introduction
  3.2 Ordinary Least Squares
  3.3 Canonical Correlation Analysis
  3.4 Ridge Regression
    3.4.1 Canonical Ridge Analysis
  3.5 Principal Component Analysis and Regression
  3.6 Continuum regression
  3.7 Linear Discriminant Analysis

4 Support Vector Machines
  4.1 Introduction
  4.2 Binary classifier
    4.2.1 Linearly separable data
    4.2.2 Soft margin classification
    4.2.3 Nonlinear classifier
  4.3 Kernel PLS

II Second Part

5 Fumonisins detection in maize
  5.1 Introduction
  5.2 Mycotoxins
    5.2.1 Fungi and mycotoxin development
    5.2.2 Risk management strategies
  5.3 Fumonisins
  5.4 Measurement methods
    5.4.1 Reference analytical method
    5.4.2 NIR measurements

6 Model fitting with PLS
  6.1 Dataset
    6.1.1 Sampling
    6.1.2 HPLC
    6.1.3 FT-NIR
  6.2 Results
  6.3 Discussion

Conclusions
  C.1 Theoretical background
  C.2 A case study

A Technical developments
  A.1 Eigenvalues and eigenvectors decomposition
  A.2 Singular values decomposition
  A.3 Mercer's Theorem
  A.4 Gram matrix

Bibliography

List of Figures

2.1 Tridimensional space for dataset.
2.2 Tridimensional plots of PLS results: (a) Formal labels; (b) Meaningful labels.
2.3 Bidimensional plot of PLS results: (a) Score plot; (b) Loading plot.
2.4 Graphical representation for wine dataset: (a) Tridimensional X space; (b) Tridimensional Y space.
2.5 Percentage of explained variance: (a) Explained variance for X; (b) Explained variance for Y.
2.6 Score and loading plots for X variables: (a) Score plot; (b) Loading plot.
4.1 Binary linearly separable data: (a) First hyperplane; (b) Second hyperplane; (c) Third hyperplane.
4.2 Support vectors: (a) First hyperplane; (b) Second hyperplane; (c) Third hyperplane.
4.3 Soft margin classification.
4.4 Input space and feature space.
5.1 Critical factors of fungi inoculum.
5.2 Chemical structure of fumonisins FB1 and FB2.
6.1 Histogram of fumonisins contamination level.
6.2 Examples of maize NIR spectra.
6.3 Residual variances.
6.4 Calibration plot.
6.5 Extreme observations.

List of Tables

1.1 Taxonomy of unsupervised learning.
1.2 Taxonomy of supervised learning.
2.1 NIPALS algorithm.
2.2 Dataset X.
2.3 PLS results: (a) Score matrix; (b) Loading matrix; (c) Residual matrix.
2.4 Dataset of a simple example of PLS regression.
2.5 Matrices of weights W and C: (a) W matrix; (b) C matrix.
2.6 Score matrix T and loading matrix P: (a) T matrix; (b) P matrix.
2.7 Score matrix for Y variable.
2.8 Variance of X and Y explained by the latent vectors.
4.1 Kernel form of NIPALS algorithm.
5.1 Mycotoxins, fungi and disease.
5.2 Table included in (EC) No 1126/2007.
5.3 Table included in 2006/576/EC.
6.1 Dependent variable statistics.
6.2 Model statistics.

Introduction

Data collection and analysis are two key activities for understanding real phenomena. Indeed, the more data are available, the more knowledge about the system can be gained. However, these tasks are usually expensive and time consuming, so that often only a restricted number of quantities is measured on a limited set of observations. Thus, it is a challenging problem to extract from such datasets useful information that is effective and meaningful not only for those data but that also holds in a more general way for all the subjects involved in the phenomenon. In order to manage such critical situations successfully, it is helpful to consider the twofold nature of real datasets. Indeed, data usually have an underlying regularity, but they are also characterized by a degree of uncertainty. In particular, individual observations are corrupted by random noise, which arises from intrinsically stochastic processes but more typically is due to sources of variability that are themselves unobserved. So, noise is caused by the finite size of datasets as well as by measurements. With these premises, the most efficient strategies and tools are supplied by disciplines such as Statistics and Decision Theory, both grounded on Probability Theory. Probability Theory provides a coherent framework for expressing data uncertainty in a precise and quantitative manner, defining suitable probability distributions. Secondly, Statistics gives the most complete description of the probability distribution that generates both measured and future unknown instances, through several methods based on data analysis. In particular, statistical inference assumes that the phenomenon is governed by a random process and that the observations of the dataset are extracted through random sampling from the whole set of subjects. As a consequence, the data can be seen as a realization of the random process; in other words, the measured observations are independent draws from a hidden and fixed probability distribution. Thirdly, Decision Theory exploits this probabilistic representation in order to make choices that are optimal according to suitable criteria, given all the available information, even though it may be incomplete or ambiguous. Indeed, Decision Theory provides methods that are able to decide whether or not to take an action given the appropriate probabilities, considering that the decision should be optimal in some appropriate sense. Applying the assumptions, concepts and techniques of the aforementioned domains, probabilistic algorithms can be designed to draw knowledge from data, with which subsequent actions may be taken. Probabilistic algorithms find the best output for a given input, also including the probability of the result, after defining an explicit underlying probability model. They indeed produce a confidence value associated with their choice.

The core argument of the Ph.D. thesis is Partial Least Squares (PLS), which belongs to the family of probabilistic algorithms. PLS was developed in order to extract useful information from large datasets characterized by a high number of strictly linked variables. In particular, PLS makes it possible to describe qualitatively and quantitatively asymmetric relations with collinear quantities by means of new latent variables built from the data. Since nowadays a wide range of tools tends to measure highly correlated quantities in several application domains, real datasets are frequently characterized by both collinearity and a number of observations lower than the number of variables. As a consequence, PLS becomes more and more attractive and can often be helpful in many common concrete situations. In particular, PLS can be used alone, but it can also be combined with traditional techniques. PLS indeed works first as a dimensionality reduction procedure, building a smaller set of uncorrelated latent variables, and then it is usefully applied as a regression or classification method. In this way, PLS offers computational simplicity and speed. At the same time, the accuracy of the results is also assured. PLS can also manage nonlinearity, covering all kinds of possible requests and becoming a complete and flexible tool. The aim of the thesis is to describe PLS and to provide a review of some alternative competitors belonging to different fields. First, an overview of Machine Learning is given, presenting its approaches and the corresponding techniques, together with the description of a whole working procedure. Then, PLS performance is verified on a challenging dataset and its advantages and drawbacks are listed. Conclusions on the applicability of PLS to realistic problems are drawn, the reasons for the increasing interest in PLS are highlighted and future perspectives are outlined. Regarding the evaluation of PLS applied to the considered case study, a preliminary analysis of the real dataset and its partial results are described in Gaspardo et al. (Gaspardo, 2012 [26]). Then, since those outcomes seemed promising and additional observations completed the dataset, further research was done and the corresponding results were presented in an article submitted to Food Chemistry (Della Riccia and Del Zotto [17]). Comparing these two works, the latter improves the most important parameters that assess the goodness and quality of a model, ensuring that the final result supports screening, classification and rather accurate prediction. Since the described work is part of a project developed in collaboration with the Friuli Venezia Giulia Region and its breeders association, the researchers have given prominence to promoting the results and good practices also in favour of the people directly involved in the production process. For this reason, they took part in local conferences and wrote specific guidelines to make as many farmers and breeders as possible aware of the issue (Gaspardo and Del Zotto, 2011 [25]).

The first part of the thesis opens with an overview of Machine Learning that defines its topics and lists the most common techniques developed in the corresponding approaches. Then, the focus is on supervised learning, the particular setting where PLS takes place. Thus, a general supervised working procedure is explored, considering all its phases. Model fitting, model selection and the decision stage are illustrated following an intuitive, incremental explanation. Subsequently, a formal description of the PLS method is provided, including its mathematical structure, algorithmic features and geometric representation. The PLS tasks are then evaluated, so that PLS and its behaviour in several situations can be compared with alternative techniques in what follows. Some PLS applications in different domains are also presented, as well as a review of the main software packages that include tools for PLS analysis. Thirdly, some statistical methods that share the goals and tasks of PLS are briefly explained. The aim is to bring all these techniques together into a common framework, attempting to identify their similarities and differences. First of all, Ordinary Least Squares is described, since it is the traditional estimation procedure used for linear regression and because it can be seen as the benchmark for all the following techniques. Then, PLS is related to Canonical Correlation Analysis, where latent vectors with maximal correlation are extracted. Moreover, Ridge Regression, Principal Component Analysis and Principal Component Regression are presented. As regression procedures, all these techniques can be cast under a unifying approach called continuum regression. Furthermore, as regards classification, there is a close connection between PLS and Linear Discriminant Analysis. The first part of the thesis ends with a Chapter concerning Support Vector Machines (SVM) that presents the intuitive ideas behind their algorithmic aspects, briefly summarizes their historical development, and lists some applications. The general theory of SVM is described in order to define the most important concepts that are involved in each form of the SVM algorithm. Even if the chosen approach considers the simplest case of binary output, it can easily be generalized to more complex situations. Three specific procedures are illustrated step by step. A distinction between linear and nonlinear classification is made. Moreover, two different possibilities for linear classifiers are analyzed, i.e. when they are applied to linearly separable data and to nonlinearly separable data. Finally, examples of nonlinear kernel PLS are given. They complete the approaches to nonlinear PLS regression with the kernel-based technique of SVM.

The second part of the thesis starts with a preliminary description of the main concepts involved in the considered PLS application on real data. A gradual but complete overview of the area of applicability is reported, introducing all the topics and showing the reasons that lie behind this kind of problem and that support further analyses. Then, the issues of interest are presented, illustrating in a simple way fungi and mycotoxins, their specific features, together with the most troublesome factors related to their development and the corresponding management strategies. Details are given explaining the core arguments, Fusarium and fumonisins, and their critical aspects. Finally, the applied measurement processes are described, which include both traditional reference analytical methods and some alternatives based on spectroscopic techniques and chemometric analysis. With this background in place, the specific PLS application can be presented. In this case, PLS is iteratively computed in order to define the relationship between the fumonisins content in maize and its Near InfraRed (NIR) spectra. The goal of the research is indeed the discovery of an accurate, fast, cheap, easy and nondestructive method able to detect fumonisins contamination in maize. First of all, the data are described, explaining the procedures of data collection, which involve sampling and the applied measurement methods. The variable of interest, which consists of the fumonisins contamination level of maize, is also analyzed. The final result is then shown, presenting the best model obtained during the regression analysis. Both numerical values and graphical outputs are given and discussed. Finally, some comments are made in order to highlight the most important aspects that should be taken into account more accurately in the future. The thesis ends with a critical review of the main concepts developed, which allows the reader to understand which future perspectives can be expected in the medium or long term for the described methods. In particular, the conclusions recall the core arguments of every previous Chapter, explaining the requirements they need in order to be applied correctly, the problems they solve, and the strengths and weaknesses that characterize them. Moreover, the conclusions recall the most interesting and successful applications developed so far for the described techniques and methodologies, and indicate which changes, improvements, further studies and new ideas can be pursued in the future in order to reach better results. These guidelines hold both for the theoretical background and for the real case study.

I First Part

1 Preliminaries

This Chapter consists of two parts that give an overview of Machine Learning, the discipline where Partial Least Squares (PLS), the core argument of the thesis, takes place, introducing the relevant terms, questions and assumptions. First of all, Machine Learning is defined, presenting the topics developed in this field and listing the techniques designed to solve the corresponding problems (Section 1.1). In particular, the focus is on the two most common approaches of Machine Learning, so unsupervised and supervised learning are described in more detail. Secondly, a general working procedure for dealing with supervised learning problems is summarized step by step, explaining in a simple way the most important concepts involved (Section 1.2). Linear regression is then chosen as the reference setting to work with, even if classification is also taken into account. So, the three main phases of every typical supervised learning procedure are presented: model fitting, model selection and the decision stage are illustrated following an intuitive approach that is meant to be as helpful as possible. The incremental explanation followed in this Chapter allows the reader to understand gradually how to satisfy the ultimate purpose of supervised learning, which provides the general background for the thesis. For this reason, the topics and results of this Chapter are recalled in the following Chapters, where references to the concepts described here are given in footnotes.

1.1 Machine Learning Taxonomy

Machine Learning belongs to Computer Science, where it was born as a branch of Artificial Intelligence1 during the sixties. It has since undergone substantial development over the past twenty years, so that its approach is nowadays easily and widely adopted in many applications in order to obtain far better results. Machine Learning is concerned with the “design and development of systems able to automatically learn from experience with respect to a task and a performance measure” (T. M. Mitchell, 1997). Indeed, it aims to define computer programs, called learners, whose performance at a defined task improves with experience. In particular, a learner should be able to process past observations, called instances and collected in the training set, with a twofold ability: to describe the data in some meaningful way (representation) and to perform accurately on new, unseen examples (generalization). Considering this simple and general definition, Machine Learning algorithms and methods turn out to be tools based on data collection and analysis. As a consequence, Machine Learning can be strictly related to both Statistics and Data Mining2, whose techniques seek to summarize and explain key features of the data. Indeed, the core arguments of all these disciplines usually overlap significantly. For instance, modern neural networks are a typical Machine Learning technique, but at the same time they can also be seen as a nonlinear statistical data modelling tool. Moreover, clustering is maybe the most interesting example, since it is i) a method of unsupervised learning, ii) an iterative process of exploratory Data Mining, and iii) a common technique for statistical data analysis. Similarly, unsupervised learning, which usually improves learner accuracy, is closely related to the problems dealt with in descriptive Statistics but is also based on Data Mining methods used to preprocess data. However, although the problems are often formally equivalent, different aspects of the solutions are emphasized in the specific fields. This sometimes leads to misunderstandings between researchers, because they employ the same terms, algorithms and methods but have slightly different goals in mind. This awkward situation across disciplines, characterized by words that denote different techniques or by the same method called by many names, should be carefully managed in order to be clear and to avoid errors. For this reason, Machine Learning is taken as the reference in the following, except where otherwise noted.

1 Artificial Intelligence was defined as “the science and engineering of making intelligent machines” (J. McCarthy, who coined the term in 1956 [43]). Indeed, it is based on the idea that human intelligence can be precisely described and simulated by a computer. As a consequence, it studies how to make machines behave like people, reproducing human activities and brain processes. For this reason, it deals with the intelligent agent, a system that perceives its environment and takes decisions in order to perform actions that maximize its chances of success. Since learning is a kind of human activity, Machine Learning can be considered as belonging to Artificial Intelligence.

With these premises, the wide variety of Machine Learning approaches, tasks and successful applications can now be investigated. In particular, Machine Learning algorithms can be organized by focusing on their outcome or on the type of data available during the training of the machine. Following this idea, five areas can be highlighted: unsupervised learning, supervised learning, semi-supervised learning, reinforcement learning and learning to learn. The last one induces general functions from previous experience. Indeed, this type of algorithm can change the way it generalizes. In reinforcement learning3, the machine interacts with its environment by producing

actions. These actions affect the state of the environment, which in turn provides feedback to the machine in the form of scalar rewards (or punishments) that guide the learning algorithm. The goal of the machine is to learn to act in a way that maximizes the future rewards it receives (or minimizes the punishments) over its lifetime. Semi-supervised learning combines instances characterized by supervised properties with data of an unsupervised nature in order to generate an appropriate function or classifier. Supervised learning is the task of producing a function from labelled training data. Indeed, here the dataset collects for every instance a pair consisting of both an input and an output. An input is a vector that records the sequence of measurements made on the individual properties of the observation. Each property is usually termed a feature4, also known in Statistics as an explanatory variable (or independent variable, regressor, predictor). An output or target stands for a category or a continuous real value and is called a label, because it is often provided by human experts hand-labelling the training examples. In Statistics, labels are known as outcomes, which are considered to be possible values of the dependent or response variable. So, supervised learning algorithms process labelled training data in order to produce a function able to accurately assign a label to a given input, for both observed and future instances. On the other hand, unsupervised learning refers to the problem of trying to find hidden structures in unlabelled data and includes any method that seeks to explain key properties of the data. In this case, the data record only inputs. Unsupervised and supervised learning are the most common and broad Machine Learning subfields. Thus, given their prominence, they will be explained thoroughly in the following, where the corresponding taxonomy of algorithms is described in more detail and is also most clearly shown in the tables.

2 Data Mining is the analysis step of Knowledge Discovery in Databases and deals with the extraction of unknown properties from the data. On the contrary, Machine Learning focuses on prediction, based on properties learned from the training data (so in some sense known).
3 Reinforcement learning (a term coined by Sutton and Barto in 1998 [66]) is closely related to the fields of both Decision Theory, which belongs to Statistics and Management Science, and Control Theory, a branch of Engineering.
4 The term “feature” is also used in feature extraction, a set of techniques that follow an unsupervised approach. In this case, features denote new quantities computed from the original measurements. These two slightly different concepts should not be confused. For this reason, in the thesis feature is only used when referring to feature extraction, whereas variable is preferred in the more general dissertation.

1.1.1 Unsupervised learning

Unsupervised learning includes two tasks that consist of cluster analysis and dimension reduction (Table 1.1).

Cluster analysis, also called clustering, aims to group similar objects into the same cluster. So, a cluster is a group of akin data objects built in such a way that objects in the same cluster are more similar, in some sense, to each other than to those in other groups. In this way, both the notion of cluster and that of similarity have a

general definition. As a consequence, different cluster models can be employed and many clustering algorithms exist. However, some of them often have properties in common, since they are developed following the same criteria. Taking these characteristics as guidelines, subsets of clustering algorithms can be identified as described hereinafter. First of all, according to the strategy, clustering techniques can be divided into bottom-up (agglomerative) and top-down (divisive). The latter assume that at the beginning all the elements belong to the same cluster. Then, the algorithm breaks this group into more subsets in order to obtain more homogeneous clusters. The algorithm ends when it satisfies a stopping rule, usually when it reaches a previously defined number of clusters. The former start by supposing that every element is a cluster. Then, the algorithm joins nearby clusters until it satisfies the stopping rule. Secondly, the membership relationship between object and cluster can be taken into account. Exclusive clustering, also called hard clustering, assigns each object to a single cluster, so that the resulting clusters cannot have elements in common. On the contrary, overlapping clustering (also called soft clustering) reflects the fact that an object can simultaneously belong to more than one group. It is often used when, for example, an object lies between two or more clusters and could reasonably be assigned to any of them. In fuzzy clustering each object belongs to each cluster with a certain degree, between 0 (it absolutely does not belong) and 1 (it absolutely belongs). Usually an additional constraint ensures that the sum of the weights for each object equals one. Fuzzy c-means (FCM) is the most common algorithm of this kind. This approach can be converted into an exclusive one by assigning each object to the cluster in which its weight is highest. Thirdly, considering the covering of the dataset, complete clustering assigns every object to a cluster, whereas partial clustering does not, because some observations may not belong to well-defined groups. However, the most important types of algorithms can be characterized according to the procedure they follow for deciding to which cluster an element belongs. In this case centroid-based, hierarchical and probabilistic clustering can be recognized. The first is based on the distance between the object and a representative point of the cluster, once the number of clusters has been fixed. This method generalizes the most famous clustering algorithm, called k-means. The second builds a series of clusters according to a statistic that quantifies how far apart (or how similar) two cases are. Then, a method for forming the groups should be selected and the number of clusters decided. Hierarchical clustering can be illustrated by a dendrogram and it can be either agglomerative or divisive. The most used algorithm of this type is simply called the hierarchical algorithm.
The last technique is based on a completely probabilistic approach that assumes a model-based strategy, which consists in using certain models for the clusters and attempting to optimize the fit between the data and the model. In practice, each cluster can be mathematically represented by a parametric distribution, either continuous or discrete. The entire dataset is therefore modelled by a mixture of these distributions. An individual distribution used to model a specific cluster is often referred to as a component distribution.
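To make the centroid-based procedure concrete, here is a minimal sketch of plain k-means in Python with NumPy, alternating the assignment and update steps described above; the toy two-blob dataset, the choice of k = 2 and the convergence test are illustrative assumptions, not choices taken from the thesis.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # initialize centroids by picking k distinct observations at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: each object goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each centroid becomes the mean of its cluster
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # stop when centroids settle
            break
        centroids = new_centroids
    return labels, centroids

# toy data: two well-separated Gaussian blobs in the plane
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)   # roughly the two blob centres
```

The same assign/update idea underlies the k-means generalizations mentioned in the text; fuzzy c-means, for instance, replaces the hard assignment with membership weights.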

The other task of unsupervised learning, dimension reduction, is the process of shrinking the number of variables under consideration. Dimension reduction is required when performing analysis of complex data, because in this case one of the major problems stems precisely from the number of properties. Indeed, a dataset with many variables generally involves either a large amount of memory and computation power, or an algorithm which overfits the training set and generalizes poorly to new observations. Dimension reduction gets around these problems, ensuring that data analysis can be done more accurately in the reduced space than in the original one. In particular, it reduces the amount of resources required for a large dataset while still describing it properly. As a consequence, it implies a shorter training time, it improves model interpretability, and it enhances generalization by reducing overfitting. Moreover, it is also useful as part of the data analysis process, showing which variables are important for prediction and how these variables are related. Dimension reduction is a general term that includes several methods and covers both Variable Subset Selection (VSS) and feature extraction. The former returns a subset of the original variables, whereas the latter transforms the data from the original high-dimensional space into a space of fewer dimensions, called features. VSS is also called variable selection or attribute selection, but feature selection is used too. In this case feature stands for an original variable and has a different meaning with respect to its use in feature extraction. A VSS algorithm is a combination of a search technique for proposing new variable subsets together with an evaluation measure which scores the different subsets. The evaluation metric heavily influences the algorithm and distinguishes the three main categories of VSS algorithms: wrappers, filters and embedded methods. The first uses a predictive model to score variable subsets. Each new subset is used to train a model, which is tested on a hold-out set. Counting the number of mistakes made on that hold-out set (the error rate of the model) gives the score for that subset. As wrapper methods train a new model for each subset, they are very computationally intensive and they risk overfitting, but they usually provide the best performing variable set for that particular type of model. The second uses a proxy measure to score a variable subset. This measure is chosen to be fast to compute, whilst still capturing the usefulness of the variable set. Filters are usually less computationally expensive than wrappers, but they produce a variable set which is not tuned to a specific type of predictive model. Many filters provide a variable ranking rather than an explicit best variable subset, and the cut-off point in the ranking is chosen via cross-validation. The third is a wide group of techniques which perform variable selection as part of the model construction process. The exemplar of this approach is the Lasso method for constructing a linear model, which penalizes the regression coefficients, shrinking many of them to zero. Any variables which have non-zero regression coefficients are selected by the Lasso algorithm. These approaches tend to lie between filters and wrappers in terms of computational complexity. In Statistics, the most popular form of variable selection is stepwise regression.
It is a greedy algorithm that adds the best variable (or deletes the worst one) at each round. The main control issue is deciding when to stop the algorithm; this is done by optimizing some criterion or, in Machine Learning, with cross-validation. Since this leads to the inherent problem of nesting, more robust methods have been explored, such as branch and bound and piecewise linear networks. As previously said, feature extraction builds new features from functions of the original variables. Indeed, when the input data are too large to be processed and are suspected to be redundant or irrelevant, they can be transformed into a reduced representation set of features. If the features are carefully chosen, it is expected that the feature set will extract the relevant information from the input data, so that the desired task can be performed successfully using this reduced representation instead of the full-size input. The data transformation is usually linear, but many nonlinear techniques also exist, such as nonlinear dimensionality reduction. Principal Component Analysis (PCA) is the main linear technique for dimensionality reduction. PCA is strictly related to the statistical method of factor analysis, even if they are not identical, because the latter uses regression modelling techniques (a supervised approach), while the former belongs to descriptive Statistics (an unsupervised approach). Factor analysis seeks to discover whether the observed variables can be explained largely or entirely in terms of a much smaller number of latent factors. Indeed, when the variations in the observed variables mainly reflect the variations in fewer unobserved factors, variability among observed correlated variables can be described in terms of a potentially lower number of unobserved factors. In particular, factor analysis searches for such joint variations in response to unobserved latent factors. The observed variables are thus modelled as linear combinations of the potential factors, plus error terms. Then, the information gained about the interdependencies between observed variables can later be used to reduce the set of variables in the dataset. Computationally, this technique is equivalent to a low-rank approximation of the matrix of observed variables. Among other tasks, Partial Least Squares (PLS) can also be used as a linear feature extraction technique that, like factor analysis, follows a supervised approach.

Table 1.1: A taxonomy for unsupervised Machine Learning methods. Synonyms and acronyms are written between brackets; the most used algorithms are included after an arrow.

Task: Cluster analysis (clustering)
  Strategy: Bottom-up (agglomerative); Top-down (divisive)
  Membership: Exclusive (hard); Overlapping (soft); Fuzzy → Fuzzy c-means (FCM)
  Covering: Complete; Partial
  Procedure: Centroid-based → k-means; Hierarchical → hierarchical; Probabilistic → mixture of Gaussians

Task: Dimension reduction
  Variable Subset Selection (VSS)1: Filter; Wrapper; Embedded methods → Lasso
  Feature extraction2: Linear → Principal Component Analysis (PCA); Nonlinear → nonlinear dimensionality reduction

1 In Statistics, the most popular form of VSS is stepwise regression. However, since it leads to the inherent problem of nesting, more robust methods have been explored, such as branch and bound and piecewise linear network.
2 Factor analysis is a statistical method for feature extraction even if it does not follow an unsupervised approach. Similarly, PLS can also be used as a supervised linear feature extraction method.

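As a concrete contrast between the two families just tabulated, the sketch below applies an embedded selection method (the Lasso) and a linear feature extraction method (PCA) to the same synthetic data; scikit-learn is assumed as the library, and the dataset, the regularization strength and the number of components are illustrative choices, not values taken from the thesis.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                       # 100 instances, 20 variables
y = X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=100)

# Embedded VSS: the Lasso shrinks many coefficients exactly to zero,
# so the remaining non-zero ones identify the selected variables.
lasso = Lasso(alpha=0.1).fit(X, y)
print("variables kept by the Lasso:", np.flatnonzero(lasso.coef_))

# Feature extraction: PCA builds new features (components) as linear
# combinations of all original variables, ordered by explained variance.
pca = PCA(n_components=5).fit(X)
print("variance explained by 5 components:",
      pca.explained_variance_ratio_.sum())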
1.1.2 Supervised learning

As previously said, supervised learning is based on the assumption that the input data are labelled. Depending on the type of label, which can represent either one of a finite number of discrete categories or a continuous real value, supervised learning consists respectively of classification and regression (Table 1.2).

Classification is the problem of identifying to which of a set of classes a new instance belongs, on the basis of a training set containing observations whose category membership is known. Indeed, in Machine Learning the possible categories are called classes and an algorithm that implements classification is known as a classifier. However, the term classifier sometimes also refers to the mathematical function, implemented by the algorithm, that maps input data to a category. First of all, classification can be thought of as two separate problems: binary classification and multiclass classification. In binary classification, a better understood task, there are only two classes, whereas multiclass classification involves assigning an object to one of several classes. Since many classification methods have been developed specifically for binary classification, multiclass classification often requires the combined use of multiple binary classifiers. Secondly, a large number of algorithms for classification can be phrased in terms of a linear function that assigns a score to each possible category by combining the input vector of an instance with a vector of weights, using a dot product. The predicted category is the one with the highest score. This type of score function is known as a linear predictor function. Algorithms with this basic setup are known as linear classifiers. The most widely used linear classifiers are Support Vector Machines (SVM), the perceptron, which is one of the simplest machine learning algorithms, and the neural network, which is a multi-layer perceptron. On the other hand, nonlinear classification is defined by algorithms with a nonlinear predictor function. Thirdly, a common subset of classifiers is composed of probabilistic algorithms. This kind of classification uses statistical inference to find the best class for a given instance. Unlike other algorithms, which simply output a class, probabilistic ones output the probability of the instance being a member of each of the possible classes. Then, the best class is normally selected as the one with the highest probability. Probabilistic algorithms have numerous advantages over nonprobabilistic classifiers. Indeed, a probabilistic classifier can output a confidence value associated with its choice (in general, a classifier that can do this is known as a confidence-weighted classifier). Correspondingly, it can abstain when its confidence in choosing any particular output is too low. Finally, because of their probability outputs, probabilistic classifiers can be more effectively incorporated into larger machine-learning tasks, in a way that partially or completely avoids the problem of error propagation. The opposite group is called nonprobabilistic classification. Fourth, classification algorithms can be parametric or nonparametric. A technique is nonparametric when it does not make any assumptions on the underlying data distribution. This is pretty useful, as in the real world most practical data do not obey the typical theoretical assumptions made. The k-Nearest Neighbour (KNN) algorithm is one of the simplest and most widely used Machine Learning algorithms and it is a nonparametric learning algorithm. In Statistics, logistic regression, which assumes a Logit Model, is the most common classification technique for binary outputs. It links the outcome's probability to the linear predictor function through the logit.
The logit, also named log-odds, is the inverse of the sigmoidal logistic function and it is equal to the natural logarithm of the odds. The odds are defined as the probability of one output divided by the probability of the other label. When the dependent variable takes ordinal multiple outputs, logistic regression is extended to Ordered Logit (or ordered logistic regression). Similarly, a Probit Model is a classification method for binary dependent variables that uses a probit link function. The probit function is defined as the inverse of the cumulative distribution function of a standard normal (Gaussian) random variable. Ordered Probit is the corresponding generalization to the case of an ordinal dependent variable with more than two classes. Moreover, Linear Discriminant Analysis (LDA) and Fisher Discriminant Analysis (FDA) can also be applied. The terms LDA and FDA are often used interchangeably, although Fisher's original article actually describes a slightly different method, which does not make some of the assumptions of LDA, such as normally distributed classes or equal class covariances (Fisher, 1936 [23]).
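To illustrate the logit link described above, the following sketch checks numerically that the log-odds is the inverse of the sigmoid and then fits a probabilistic binary classifier that outputs class probabilities rather than a bare label; the toy data and the use of scikit-learn's LogisticRegression are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))        # natural logarithm of the odds

p = 0.8
assert np.isclose(sigmoid(logit(p)), p)  # the logit inverts the sigmoid

# binary toy problem: one informative variable plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(X[:5]))          # probability of each class
print(clf.predict(X[:5]))                # hard labels derived from them
```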

Regression is the supervised learning task that should be applied when labels are continuous real values. In this case, supervised learning algorithms process the training data to produce a function that assigns the correct continuous label to any valid input, for both observed and future instances. The function is called the regression function. Regression is maybe the task where the disciplines of Machine Learning and Statistics overlap most clearly. Indeed, here supervised algorithms coincide with statistical methods. As a consequence, for this specific approach, the language used and the concepts described encompass both fields. A large body of techniques for carrying out regression analysis has been developed and there are several criteria for distinguishing them. First of all, regression methods can be parametric or nonparametric. The former assume that the regression function is defined in terms of a finite number of unknown parameters that are estimated from the data. On the contrary, the latter refer to techniques that allow the regression function to lie in a specified set of functions, which may be infinite-dimensional. Indeed, in nonparametric regression the functional form is not known in advance and so cannot be parameterized. Since the model does not take a predetermined form, it is constructed according to information derived from the data. As a consequence, it requires a large number of observations, because the data must supply the model structure as well as the model estimates, and it is computationally intensive. Two of the most commonly used approaches to nonparametric regression are smoothing splines and kernel regression. Smoothing splines minimize the sum of squared residuals plus a term which penalizes the roughness of the fit, whereas kernel regression involves making smooth composites by applying a weighted filter to the data. Kernel regression indeed estimates the continuous dependent variable from a limited set of data points by convolving the data points' locations with a kernel function. Then, considering parametric techniques and focusing on the type of regression function, linear and nonlinear regression can be distinguished. The former assumes that the relationship between output and input is a linear combination of the parameters. The methods that belong to this family are called General Linear Models (GLM) and are the most common approach for regression. On the contrary, in nonlinear regression the observational data are modelled by a function which is a nonlinear combination of the model parameters. The data are then fitted by a method of successive approximations. Thirdly, different regression methods exist depending on the nature of the independent variables. For example, if the input vector collects categorical values, Analysis of Variance (ANOVA) should be applied. In its simplest form, ANOVA provides a statistical test of whether or not the means of several groups are all equal, and therefore generalizes the t-test to more than two groups. Furthermore, regression techniques can relate the response variable to one single explanatory variable, which is the case of simple regression, or to more than one predictor, in the case of multiple regression. Combining this last property with linearity, a very wide family of techniques called Multiple Linear Regression (MLR) is obtained. Similarly, if there is only one output, univariate regression is defined, whereas in the presence of two or more responses, multivariate regression is used.
Finally, regression techniques can be distinguished by considering the applied estimation procedure. Usually the parameters of a linear regression model are estimated using the method of Ordinary Least Squares (OLS). But this method is based on the computation of the inverse of the covariance matrix of the explanatory variables. Thus, the input data are required to be linearly independent, i.e. neither variables nor instances should be equal to a linear combination of some other variables or instances, so that both the regressor matrix and the covariance matrix have full rank and the inverse exists. Moreover, OLS assumes that the error terms have null mean and constant variance (homoscedasticity) across observations and are uncorrelated with each other. These conditions are usually fulfilled by looking for independent and identically distributed (i.i.d.) data. This means that all observations are taken from a random sample, which makes all the assumptions listed earlier simpler and easier to interpret. Therefore, if and only if the data abide by these rather strong assumptions can OLS be applied and the parameters estimated with their good properties assured. But real data often do not comply with the OLS requirements, so that OLS cannot be applied expecting robust estimates. In this case, some other technique should be chosen. First of all, if the outcomes are uncorrelated but do not have constant variance (heteroscedastic data), weighted least squares should be used. It minimizes a sum of weighted squared residuals where each weight should ideally be equal to the inverse of the variance of the observation, but the weights may also be recomputed at each iteration, in an iteratively reweighted least squares algorithm. Then, if the predictors or the instances are not linearly independent, the dataset is characterized by collinearity. The traditional solution to this problem is Ridge Regression (RR), which includes a regularization term in the minimization problem. But some other methods exist that try to solve the fitting problem in the presence of collinearity, for example PLS, PCR, CCA or Lasso. All these methods are described in more detail in the thesis.
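As a small numerical illustration of the collinearity problem and of the ridge remedy mentioned above, the sketch below compares the closed-form OLS and ridge estimates on two almost identical predictors; the synthetic data and the regularization value lam are illustrative assumptions, not values used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)      # almost perfectly collinear with x1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=n)

XtX = X.T @ X
print("condition number of X'X:", np.linalg.cond(XtX))  # huge: near-singular

# OLS: b = (X'X)^{-1} X'y  -- numerically unstable on collinear data
b_ols = np.linalg.solve(XtX, X.T @ y)

# Ridge: b = (X'X + lam*I)^{-1} X'y  -- regularization restores stability
lam = 1.0
b_ridge = np.linalg.solve(XtX + lam * np.eye(2), X.T @ y)

print("OLS coefficients:  ", b_ols)      # typically huge, mutually offsetting
print("ridge coefficients:", b_ridge)    # both moderate, summing to about 1
```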

The decision tree is a very common tool used in Machine Learning, Statistics and Data Mining in order to describe data. It can be applied for both classification and regression, so that it is also called a classification tree or a regression tree, respectively. This technique uses a dendrogram as a model in order to predict outcomes, by starting at the root of the tree and moving through it until a leaf node is reached. Indeed, in the tree structure, each interior node corresponds to one of the input variables. Then, there are edges to children for each of the possible values of that input variable. So, each leaf represents a label given the values of the input variables represented by the path from the root to the leaf; a minimal example follows this paragraph. Decision trees have various advantages. Firstly, they are simple to understand and to apply after a brief explanation. Secondly, they require little data preparation, without needing data normalization, dummy variables or removal of blank values. Thirdly, they are able to handle both numerical and categorical data, whereas other techniques usually work with only one type of variable. Then, a decision tree can be validated using statistical tests, so that the reliability of the model can be accounted for. It is also robust, performing well even if its assumptions are somewhat violated by the true model from which the data were generated. Moreover, it works well with huge datasets in a short time, because large amounts of data can be analyzed using standard computing resources. However, decision trees also have some limitations. The problem of learning an optimal decision tree is known to be NP-complete even for simple concepts. Consequently, practical decision trees are based on heuristic algorithms, such as the greedy algorithm, where locally optimal decisions are made at each node. But such algorithms cannot guarantee to return the global optimum. Moreover, decision trees can create over-complex trees that do not generalize the data well; in other words, decision trees can be characterized by overfitting. In this case, mechanisms such as pruning are necessary. In addition, there exist concepts that are hard to learn because decision trees do not express them easily. In such cases, the decision tree becomes prohibitively large. Solutions involve either changing the representation of the domain or using learning algorithms based on more expressive representations. Furthermore, for data including categorical variables with different numbers of levels, the information is biased in favour of those attributes with more levels.
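The root-to-leaf prediction mechanism described above can be seen directly in a small fitted tree; the sketch below uses scikit-learn's DecisionTreeClassifier on toy data, and the dataset, the depth limit and the library choice are illustrative assumptions rather than choices made in the thesis.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = ((X[:, 0] > 0) & (X[:, 1] > 0)).astype(int)   # a simple axis-aligned concept

# a shallow tree: limiting the depth acts as a crude form of pruning
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

print(export_text(tree))            # each root-to-leaf path is a decision rule
print(tree.predict([[1.0, 1.0]]))   # follow the path for a new instance
```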

Table 1.2: A taxonomy for supervised Machine Learning methods. Synonyms and acronyms are written between brackets; the most used algorithms are included after an arrow.

Task (label type)        Criterium       Technique
Classification^{1,2}     N. of classes   Binary; Multiclass → Support Vector Machines (SVM)
(categorical label)      Function        Linear classifier → Perceptron; Nonlinear classifier → Neural network
                         Strategy        Probabilistic; Nonprobabilistic
                         Distribution    Parametric; Nonparametric → k-Nearest Neighbours (KNN)
Regression^2             Form            Parametric; Nonparametric → Smoothing splines, Kernel regression
(continuous label)       Function        Linear → General Linear Model (GLM); Nonlinear
                         Input           Categorical → Analysis of Variance (ANOVA); Continuous
                         N. of inputs    Simple; Multiple → Multiple Linear Regression (MLR)
                         N. of outputs   Univariate regression; Multivariate regression
                         Estimation      i.i.d. → Ordinary Least Squares (OLS);
                                         Heteroscedasticity → Weighted least squares;
                                         Collinearity → Ridge Regression (RR), Principal Component Regression (PCR),
                                                        Partial Least Squares (PLS-R), Canonical Correlation Analysis (CCA), Lasso

1 In Statistics, classification for binary dependent variables is done using Logit and Probit models. In the case of ordered multiclass labels, Ordered Logit and Ordered Probit models can be applied. Moreover, Linear Discriminant Analysis (LDA) and Fisher Discriminant Analysis (FDA) can also be alternative methods.
2 Decision trees are powerful and popular tools for both classification and prediction.

1.2 Working methodology

After the overall description of Machine Learning with its definition, its relations with other disciplines, its subfields and their techniques, this Section focuses on supervised learning. Here every step of a complete procedure for dealing with such problems is explained gradually, in order to define supervised learning more specifically and to better understand its tasks and aims. Supervised learning, as previously said, consists of analyzing labelled training data to learn a function able to accurately relate a label to a given input, for both observed and future instances. Usually, it is assumed that the training set collects n instances, where every entry is denoted by a k-dimensional vector {x_i}_{i=1}^n that carries the input. Then, in the most general case, labels are expressed using a d-dimensional vector {y_i}_{i=1}^n. Since the final result and the whole algorithm depend on the data, an analysis of their nature should be made first. Real datasets are characterized by a twofold nature that makes supervised learning somewhat more challenging. Indeed, data usually have an underlying regularity, but individual observations are also corrupted by random noise. First, the regularity of the data has to be learned. In the simplest case, considering a single continuous target variable and a parametric approach^5, this can be done by directly constructing an appropriate function

y = f(X, b)     (1.1)

where y is the n-dimensional vector that collects the values of the n training observations with respect to a single target variable, X is the (n × k) matrix of input values and b is the k-dimensional vector of coefficients. Second, noise is caused by the finite size of datasets as well as by measurements, since it arises from intrinsically stochastic processes but more typically is due to sources of variability that are themselves unobserved. Considering both elements, one can see data as generated from a structural component with random noise included in the target values. So, for a given {x_i}_{i=1}^n there is uncertainty about the appropriate value for {y_i}_{i=1}^n. This means that the previous result of Equation 1.1 is incomplete and should be widened by adding a term in the following way

y = f(X, b) + e     (1.2)

where e is a random variable that represents noise. Equation 1.2 is called an adaptive model, whereas curve fitting consists of defining the structural component, choosing the form of the function and tuning its parameters, which also include some unknown quantities related to the random noise. When Equation 1.2 has the property of being a linear function of the adjustable parameters b, the resulting broad class is called linear regression models. All the following results refer to this specific case.

5 With these assumptions, the following discussion refers and is restricted to univariate regression, unless otherwise specified.

Their simplest form involves a linear combination of the input variables {x_j}_{j=1}^k

y = b_0 + b_1 x_1 + ··· + b_j x_j + ··· + b_k x_k + e

so that they are also linear functions of the input quantities. However, a much more useful class can be obtained by taking linear combinations of a fixed set of nonlinear functions of the input variables, known as basis functions and denoted by φ_h(X). By using nonlinear basis functions, the model becomes a nonlinear function of the input vector. Nevertheless, such models are linear functions of the parameters, which gives them simple analytical properties. Although this also leads to some significant limitations for practical Machine Learning techniques, particularly for problems involving input spaces of high dimensionality, linearity in the parameters greatly simplifies the analysis of this class of models, which forms the foundation for more sophisticated results. Such models are generally formalized as

y = b_0 + Σ_{h=1}^{m−1} b_h φ_h(X) + e

where, by denoting the maximum value of the index h by m − 1, the total number of parameters in this model is m. The parameter b_0 allows for any fixed offset in the data. It is often convenient to define an additional dummy basis function φ_0(X) = 1, so that

y = Σ_{h=0}^{m−1} b_h φ_h(X) + e = φ(X)b + e     (1.3)

As regards the input variables that enter into a model, some form of fixed preprocessing is usually applied to the original predictors in most practical applications of Machine Learning. Indeed, data are preprocessed to transform them into some new space of variables where, it is hoped, the Machine Learning problem will be easier to solve. For instance, data are typically translated and scaled, and this greatly reduces variability within the data. When the location and scale of all data are the same, the subsequent Machine Learning algorithm distinguishes different labels in a simpler way. Moreover, the preprocessing stage might also be performed in order to speed up computation, in which case it involves dimensionality reduction. For example, a computer may have to handle huge numbers of data per second, but presenting these directly to a complex Machine Learning algorithm may be computationally infeasible. Instead, the aim is to find a reduced set of useful variables that are fast to compute, and yet that also preserve useful discriminatory information enabling the data to be analyzed. These variables can be expressed in terms of the basis functions φ_h(X) and are then used as inputs to the Machine Learning algorithm. However, care must be taken during preprocessing, because information is often discarded; if this information is important to the solution of the problem, then the overall accuracy of the system can suffer. Furthermore, new test data must be preprocessed using the same steps as the training data.
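A minimal sketch of the translate-and-scale preprocessing step might look as follows, with the key point that the test data are transformed with the statistics learned on the training data; the toy data and names are illustrative.

import numpy as np

def fit_standardizer(X_train):
    # Learn per-variable location and scale from the training data only.
    return X_train.mean(axis=0), X_train.std(axis=0, ddof=1)

def apply_standardizer(X, mean, std):
    # Apply the same translation and scaling to any dataset (training or test).
    return (X - mean) / std

rng = np.random.default_rng(1)
X_train = rng.normal(loc=[10.0, -3.0], scale=[5.0, 0.1], size=(100, 2))
X_test = rng.normal(loc=[10.0, -3.0], scale=[5.0, 0.1], size=(20, 2))

mean, std = fit_standardizer(X_train)
Z_train = apply_standardizer(X_train, mean, std)
Z_test = apply_standardizer(X_test, mean, std)   # reuse the training statistics, never refit on test data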

Finally, many practical applications of Machine Learning usually deal with spaces of high dimensionality, since they comprise a lot of input variables. The severe difficulty that can arise in spaces of many dimensions is sometimes called the curse of dimensionality (term coined by Bellman in 1961 [5]). This poses some serious challenges and is an important factor influencing the design of Machine Learning techniques. However, effective methods applicable to high-dimensional spaces can be found. The reason is twofold. First, real data are often confined to a region of space having lower dimensionality. As a consequence, the directions over which important variations in the target variables occur may be confined. Second, datasets typically exhibit some smoothness properties, at least locally. As a consequence, for the most part small changes in the input variables produce small changes in the target variables. So, local interpolation-like techniques can be exploited to make label predictions for new instances. Successful Machine Learning techniques use one or both of these properties.

In conclusion, supervised learning processes labelled data characterized by uncertainty in order to describe their regularity (representation) and to perform accurately on new observations (generalization). For this reason, the algorithms aim to define the adaptive model that automatically relates input and label of both observed and future instances. Thus, supervised learning should perform curve fitting during the training phase, also known as the learning phase. The following three Subsections will give more concrete details about this task, explaining both model fitting (Subsection 1.2.1) and model selection (Subsection 1.2.2) and the reasons that support their criteria (Subsection 1.2.3). In order to satisfy all these goals, learning algorithms look to both Statistics and Decision Theory, which are grounded in Probability Theory. Probability Theory provides a consistent framework for expressing data uncertainty in a precise and quantitative manner. Then, Statistics allows one to give the most complete probabilistic description of the data through several methods. Finally, Decision Theory exploits this probabilistic representation in order to take decisions that are optimal according to appropriate criteria, given all the available information, even though it may be incomplete or ambiguous.

1.2.1 Model fitting

Curve fitting from a finite dataset with instances corrupted by random noise is intrinsically a difficult problem. But variability is a key concept in the field of Probability Theory, which expresses uncertainty over a variable using a probability distribution. Formally, the probability distribution of a univariate random variable y is equal to P_y(A) = P(y ∈ A) and denotes the probability that the random variable y assumes values that belong to A, a subset of its support^6. However, in order to avoid a rather cumbersome notation, in the following the simple writing P(y) denotes the distribution over the random variable y. Then, modelling distributions from data lies at the heart of statistical inference. Here, the parametric approach of density modelling uses probability distributions with a specific functional form governed by a small number of parameters, whose values should be determined from a dataset. In particular, parametric supervised learning assumes that observations are drawn independently from the same distribution. This means that data are realizations of independent and identically distributed variables (often abbreviated to i.i.d.). As a consequence, the training set can be characterized by a joint probability distribution P(X, y) that records all the available information of the observed data and provides a complete summary of the uncertainty associated with these variables. Determination of P(X, y) from a set of training data gives the most complete probabilistic description of the situation and is an example of inference, which is typically a very difficult problem. From this joint distribution other quantities can be computed, such as the marginal distribution P(X) or the conditional distributions P(y|X) and P(X|y), which are linked to each other by the relation

P(y|X) = P(X|y) P(y) / P(X)

that comes from the fundamental result of Probability Theory called Bayes theorem

P(y|X) = P(y, X) / P(X)

Since the supervised learning goal is to assign a label y_i to a new input x_i on the basis of the training set, the most interesting quantity is intuitively^7 the distribution of y given the corresponding value of X, i.e. P(y|X). So, from a probabilistic perspective, the aim is to model this predictive distribution, which expresses uncertainty about the value of y given values of X. In agreement with the model of Equation 1.3, it is usually assumed that the target variable y is given by a deterministic function with additive Gaussian noise. This means that the noise is a random variable that follows a Gaussian distribution, usually assumed with zero mean, unknown variance σ² and with uncorrelated realizations. Thus, the conditional distribution is also Gaussian

P(y|X, μ, σ) = N(μ, σ²)

6 The support of a random variable is the set of all possible values that the random variable can assume.
7 In Subsection 1.2.3 the paragraph concerned with regression explains more formally the reasons for this statement, introducing the concept of loss function and describing the assumptions of Decision Theory that support the conditional mean, equivalent to the regression function φ(x_i)^T b, as the optimal solution for a regression setting (Equation 1.15).

In this case the conditional mean will be simply

μ = E[y|X] = E[φ(X)b + e] = φ(X)b     (1.4)

because of the properties of the expectation: the mean of a sum is equal to the sum of the means, and the mean of a constant is equal to the constant itself. Indeed, if the relation of Equation 1.3 holds, φ(X)b given X can be seen as a constant and the noise has null mean. So, it is supposed that data are drawn independently from

P(y|X, b, σ) = N(φ(X)b, σ²)
             = (1/√(2πσ²)) exp{−(1/2σ²) [y − φ(X)^T b]²}
             = ∏_{i=1}^n (1/√(2πσ²)) exp{−(1/2σ²) [y_i − φ(x_i)^T b]²}     (1.5)

a Gaussian distribution whose mean (Equation 1.4) and variance σ² are unknown and should be computed from the training set. The last equality holds since the probability of independent events is given by the product of the marginal probabilities of each event separately, so that the conditional distribution is equivalent to the product of the marginal probabilities P(y_i|x_i, b, σ). Note that the Gaussian noise assumption implies that the conditional distribution of y given X is unimodal, which may be inappropriate for some applications.

The parameters are unfortunately unknown, so that the problem is to compute them by fitting the function of Equation 1.3 to the training data {(y_i, x_i)}_{i=1}^n. There are several tools to determine the parameters of a probability distribution from a training set. An intuitive approach suggests minimizing an error function that measures the misfit between φ(x_i)^T b, the values of the function of Equation 1.3 for any given value of b, and the training set data values y_i^8. One simple choice of error function, which is widely used, is given by the sum of squares of the errors between φ(x_i)^T b for each data point x_i and the corresponding target value y_i, so that we minimize

RSS(b) = (1/2) Σ_{i=1}^n [φ(x_i)^T b − y_i]²     (1.6)

where the factor of 1/2 is included for later convenience. This function is called the residual sum of squares and it is a nonnegative quantity that would be zero if, and only if, the function y = φ(X)b was to pass exactly through each training data point. So, the curve fitting problem can be solved by choosing the value of b for which RSS(b)

8 In Subsection 1.2.3 the paragraph concerned with regression explains more formally the reasons for this statement, introducing the concept of loss function and describing the assumptions of Decision Theory that support the conditional mean, equivalent to the regression function φ(x_i)^T b, as the optimal solution for a regression setting (Equation 1.15).

is as small as possible. Because the error function is a quadratic function of the coefficients b, its derivatives with respect to the coefficients will be linear in the elements of b, and so the minimization of the error function has a unique solution, called the fitted parameters and denoted by a circumflex, b̂, which can be found in closed form from the normal equations as

φ(X)^T φ(X) b = φ(X)^T y
b̂ = [φ(X)^T φ(X)]^{−1} φ(X)^T y     (1.7)

The resulting fitted function is then given by ŷ = φ(X)b̂. It is sometimes more convenient to use the root mean square error, defined by

RMSE(b) = √(2 RSS(b) / n)

in which the division by n allows datasets of different sizes to be compared on an equal footing, and the square root ensures that the RMSE is measured on the same scale (and in the same units) as the target variable y.
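The whole fitting procedure can be sketched in a few lines of Python, assuming a simple polynomial basis: the normal equations of Equation 1.7 are solved and the RMSE is reported. The basis, the data and the names are illustrative choices, not prescriptions.

import numpy as np

def polynomial_basis(x, degree):
    # Design matrix phi(X) with basis functions phi_h(x) = x**h, h = 0, ..., degree.
    return np.column_stack([x**h for h in range(degree + 1)])

def fit_least_squares(Phi, y):
    # Solve the normal equations phi(X)^T phi(X) b = phi(X)^T y (Equation 1.7).
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=x.shape)   # structural part plus noise

Phi = polynomial_basis(x, degree=3)
b_hat = fit_least_squares(Phi, y)

residuals = Phi @ b_hat - y
rss = 0.5 * np.sum(residuals**2)            # Equation 1.6
rmse = np.sqrt(2 * rss / len(y))            # same scale and units as the target variable
print(b_hat, rmse)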

However, the most common criterion for determining the parameters of a probability distribution from an observed dataset is to find the parameter values that maximize the likelihood function. The likelihood function is the conditional probability of the dataset, P(y|X, b, σ), seen as a function of its unknown parameters b and σ. This might seem like a strange criterion, because it would seem more natural to maximize the probability of the parameters given the data, not the probability of the data given the parameters. However, these two criteria are strictly related. With the previous assumptions of Gaussian noise and i.i.d. data, the likelihood is equal to

L(b, σ | y, X) = ∏_{i=1}^n (1/√(2πσ²)) exp{−(1/2σ²) [y_i − φ(x_i)^T b]²}

In practice, it is usually more convenient to maximize the log of the likelihood function. Because the logarithm is a monotonically increasing function of its argument, maximization of the log of a function is equivalent to maximization of the function itself.

ln L(b, σ | y, X) = −(1/2σ²) Σ_{i=1}^n [y_i − φ(x_i)^T b]² − (n/2) ln σ² − (n/2) ln(2π)     (1.8)

Taking the log not only simplifies the subsequent mathematical analysis, but it also helps numerically because the product of a large number of small probabilities can easily underflow the numerical precision of the computer, and this is solved by computing instead the sum of the log probabilities.

Computing the first derivatives with respect to μ = φ(X)b and σ and setting them equal to zero, the following results are obtained

μ̂ = (1/n) Σ_{i=1}^n y_i = (1/n) Σ_{i=1}^n φ(x_i)^T b     (1.9)

σ̂² = (1/n) Σ_{i=1}^n (y_i − μ̂)² = (1/n) Σ_{i=1}^n [φ(x_i)^T b − (1/n) Σ_{i=1}^n φ(x_i)^T b]²     (1.10)

which are respectively the sample mean and the sample variance measured with respect to the sample mean. Usually a joint maximization of Equation 1.8 should be performed with respect to μ and σ², but in the case of the Gaussian distribution the solution for μ decouples from that for σ², so that Equation 1.9 can be solved first and its result subsequently used to evaluate Equation 1.10. However, both solutions depend on the parameter b, which is still unknown. It can be estimated following the same procedure, computing the derivative of the likelihood with respect to b and setting it equal to zero, but this time the vectorial formula of the likelihood, equal to the second relation of Equation 1.5, is considered as the starting point. As a consequence, b̂ is equal to

b̂ = [φ(X)^T φ(X)]^{−1} φ(X)^T y     (1.11)

which becomes, in the simplest linear regression model, i.e. the model that is linear also in the variables, the famous formula

b̂ = (X^T X)^{−1} X^T y     (1.12)

Once b̂ has been computed, it can be substituted in the previous equations. The maximum likelihood approach has significant drawbacks: for instance, if data follow a Gaussian distribution, it systematically underestimates the variance^9. This is an example of a phenomenon called bias^10 that is related to the problem of overfitting. Instead of maximizing the log likelihood, the negative log likelihood can equivalently be

9 The mean of the maximum likelihood solution for the variance is equal to E[σ̂²] = ((n − 1)/n) σ², so that every estimated value is less than the true value of σ² by a factor of (n − 1)/n.
10 Bias is defined as the difference between the mean of an estimator and the true value of the parameter. When the mean of the estimator is equal to the true value of the parameter, so that the difference is null, the estimator is said to be unbiased or correct. On the contrary, for example, the sample variance is a biased estimator of the variance. However, bias becomes less significant as the number n of data points increases, and in the limit n → +∞ the estimator equals the true variance of the distribution that generated the data. In practice, for anything other than small n, this will not prove to be a serious problem and the sample variance is said to be asymptotically correct.

minimized. As a consequence, under the assumption of a Gaussian noise distribution, maximizing the likelihood is equivalent, so far as determining b is concerned, to minimizing the sum of squares error function defined by Equation 1.6. Indeed, the result of Equation 1.11 is equal to that of Equation 1.7. Thus the sum of squares error function has arisen as a consequence of maximizing the likelihood.
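The underestimation mentioned in footnote 9 can be checked with a small simulation sketch: averaging the maximum likelihood variance estimate over many samples approaches ((n − 1)/n)σ² rather than σ². The sample size and the number of trials below are arbitrary illustrative values.

import numpy as np

rng = np.random.default_rng(3)
n, sigma2, trials = 5, 4.0, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
mu_hat = samples.mean(axis=1, keepdims=True)
sigma2_ml = np.mean((samples - mu_hat) ** 2, axis=1)   # ML estimate divides by n, not n - 1

print(sigma2_ml.mean())          # close to (n - 1)/n * sigma2 = 3.2, not 4.0
print((n - 1) / n * sigma2)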

However, an important limitation of the parametric modelling approach is that the chosen density might be a poor model of the distribution that generates the data, which can result in poor predictive performance. Nonparametric approaches to density estimation exist, and they make few assumptions about the form of the distribution. The histogram is, for instance, the simplest frequentist method for modelling a probability distribution from an observed dataset. Standard histograms simply partition the observed values into distinct bins of width Δ_i and then count the number n_i of observations falling in bin i. In order to turn this count into a normalized probability density, it should simply be divided by the total number n of observations and by the width Δ_i of the bins, so that the probability values for each bin are obtained as

p_i = n_i / (n Δ_i)

This gives a model for the density p(x) that is constant over the width of each bin, and often the bins are chosen to have the same width Δ_i = Δ. In principle, a histogram density model also depends on the choice of the edge location of the bins, though this is typically much less significant than the value of Δ. The histogram method has the property that, once the histogram has been computed, the dataset itself can be discarded, which can be advantageous if the dataset is large. Also, the histogram approach is easily applied if data points arrive sequentially. In practice, the histogram technique can be useful for obtaining a quick visualization of data in one or two dimensions, but it is unsuited to most density estimation applications. One obvious problem is that the estimated density has discontinuities that are due to the bin edges rather than to any property of the underlying distribution that generated the data. Another major limitation of the histogram approach is its scaling with dimensionality. If we divide each variable of a k-dimensional space into M bins, then the total number of bins will be M^k. This exponential scaling with k is an example of the curse of dimensionality. In a space of high dimensionality, the quantity of data needed to provide meaningful estimates of the local probability density would be prohibitive.
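A short sketch of the normalized histogram estimate p_i = n_i / (n Δ_i) with equal-width bins could read as follows; the bin grid and the data are illustrative.

import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(0.0, 1.0, size=1000)

bins = np.linspace(-4.0, 4.0, 21)          # 20 equal-width bins of width delta
delta = bins[1] - bins[0]
counts, _ = np.histogram(data, bins=bins)

density = counts / (len(data) * delta)     # p_i = n_i / (n * delta_i)
print(density.sum() * delta)               # sums to (approximately) one over the covered range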

1.2.2 Model selection

After model training, based on the likelihood approach or equivalently on minimizing the RSS criterion, a label can be assigned also to new inputs. Indeed, once the parameters are fitted, the whole probability distribution of the data is known and the values of the target variables for new inputs can be obtained. In this case, labels are denoted by ŷ_i and are called predictions.

They are formally defined as the conditional mean E[y_i|x_i; b̂] that, as previously shown in Equation 1.4, is equal to φ(x_i)^T b̂, which is known as the regression function^11. Supervised learning applications require accurate predictions for the target variable of incoming data. However, good predictions are provided if and only if supervised learning assures a satisfactory generalization performance. This means that the model learned by the algorithms should be able to accurately predict the next observations, beyond describing the given ones. In other words, it should correctly represent a generative function of both observed and future data. The learned model has indeed a twofold aim, since it should provide both the description of the known dataset and, at the same time, the predictions of new instances. These goals are however opposite, making model fitting a challenging problem. For example, an excellent fit to the training data, where the fitted curve passes exactly through each data point and RSS = 0, can give a perfect description of the observed data but a very poor representation of the real function, so that prediction of future values will be seriously damaged and wrong. This behaviour is known as overfitting. In order to control overfitting and to assure a good prediction ability, some quantitative insights into the dependence of generalization performance on the model are needed, in particular on its form and number of parameters. This issue is called model complexity, model comparison or model selection, and it is concerned with choosing the order of the computed function by making a trade-off between fit to the data and representation of the function that generates the data. First of all, increasing the dataset size reduces the overfitting problem. As a consequence, the larger the dataset, the more complex and more flexible the model can be. However, sampling is usually expensive and the size of the dataset cannot always be changed. So, there are many other techniques to manage this problem. One approach that is often used to control the overfitting phenomenon is regularization, which involves adding a penalty term to the error function (Equation 1.6) in order to discourage the coefficients from reaching large values. The simplest such penalty term takes the form of a sum of squares of all of the coefficients, leading to a modified error function of the form

RSS_r(b) = (1/2) Σ_{i=1}^n [φ(x_i)^T b − y_i]² + (η/2) ||b||²     (1.13)

where ||b||² = b^T b is the squared norm of the parameter vector b and the coefficient η governs the relative importance of the regularization term compared with the sum of squares error term, and hence the effective complexity of the model. Often the coefficient b_0 is omitted, because its inclusion causes the results to depend on the choice of the origin for the target

11 In Subsection 1.2.3, the paragraph concerned with regression explains more formally the reasons for this statement, introducing the concept of loss function and describing the assumptions of Decision Theory that support the conditional mean, equivalent to the regression function φ(x_i)^T b, as the optimal solution for a regression setting (Equation 1.15).

variable. Otherwise it may be included, but with its own regularization coefficient. Again, the error function of Equation 1.13 can be minimized exactly in closed form. Such techniques are known in the Statistics literature as shrinkage methods, because they reduce the value of the coefficients towards zero. The particular choice of a quadratic regularizer is called ridge regression^12. In the Machine Learning literature, this approach is known as weight decay, because it encourages weight values to decay towards zero, unless supported by the data. In the maximum likelihood approach, results on the training set are not a good indicator of predictive performance on unseen data, due to the problem of overfitting. So, a simple way to determine a suitable value for model complexity is to consider a separate set of independent data on which the trained model can be evaluated. If data is plentiful, a simple method suggests taking the available data and partitioning it into a training set and a separate test set, called validation or hold-out set. The former is used to train a range of models, determining their parameters, or a given model with a range of values for its complexity parameters; the latter allows one to optimize the model complexity, comparing models and selecting the one having the best predictive performance. A similar method can be applied with the error function approach, where the residual sum of squares is computed for the training data as in Equation 1.6 but also on the test set, since the test set error is a measure of prediction ability. In many applications, however, this will prove to be too wasteful of valuable training data, and more sophisticated techniques should be sought. Indeed, the supply of data for training and testing can be limited, and in order to build good models, it is better to use as much of the available data as possible for training. Moreover, if the model design is iterated many times using a limited size dataset, then some overfitting to the validation data can occur and so it may be necessary to keep aside a third test set on which the performance of the selected model is finally evaluated. Furthermore, if the validation set is small, it will give a relatively noisy estimate of predictive performance. One solution to this dilemma is to use cross validation (Stone, 1974 [64]). This allows a proportion (S − 1)/S of the available data to be used for training while making use of all of the data to assess performance. When data is particularly scarce, it may be appropriate to consider the case S = n, where n is the total number of data points, which gives the leave-one-out technique, also called full cross validation. In this case the number of parameters of the best model is fitted as

m̂ = argmin_{0 ≤ m ≤ rk(X^T X)} Σ_{i=1}^n (y_i − ŷ_{m\i})²

where ŷ_{m\i} denotes the prediction of the m-th model computed on the sample data with the i-th observation removed. In Chemometrics, model selection is nearly always done through ordinary cross validation. One major drawback of cross validation is that the number of training runs that must be performed is increased by a factor of S, and this can

12 Ridge Regression is described more widely in Section 3.4, whereas some details are added in Section 3.6.

be problematic for models in which training is itself computationally expensive. A further problem with techniques that use separate data to assess performance, such as cross validation, is that there might be multiple complexity parameters for a single model. Exploring combinations of settings for such parameters could, in the worst case, require a number of training runs that is exponential in the number of parameters. Ideally, a better approach should rely only on training data and should allow multiple hyperparameters and model types to be compared in a single training run. A measure of performance should therefore be defined which depends only on the training data and which does not suffer from bias due to overfitting. Historically, various "information criteria" have been proposed that attempt to correct for the bias of maximum likelihood by the addition of a penalty term to compensate for the overfitting of more complex models. For example, the Akaike Information Criterion, or AIC (Akaike, 1974 [1]), chooses the model for which the quantity

AIC = −2 ln L(b̂, σ̂ | y, X) + 2m

is smallest, where L(b̂, σ̂ | y, X) is the best fit likelihood and m is the number of adjustable parameters b in the model. A variant of this quantity is called the Bayesian Information Criterion, or BIC (Schwarz, 1978)

BIC = −2 ln L(b̂, σ̂ | y, X) + m ln n

However, such criteria do not take account of uncertainty in the model parameters, and in practice they tend to favour overly simple models.
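As an illustration of how the regularized error function of Equation 1.13 and cross validation work together, the following sketch selects the regularization coefficient η by S-fold cross validation using the closed-form ridge solution; the basis, the grid of η values and the number of folds are illustrative assumptions.

import numpy as np

def fit_ridge(Phi, y, eta):
    # Closed-form minimizer of Equation 1.13: (Phi^T Phi + eta I) b = Phi^T y.
    k = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + eta * np.eye(k), Phi.T @ y)

def cv_score(Phi, y, eta, n_folds=5):
    # Average held-out squared error over S = n_folds folds for a given eta.
    folds = np.array_split(np.arange(len(y)), n_folds)
    errors = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(y)), fold)
        b = fit_ridge(Phi[train], y[train], eta)
        errors.append(np.mean((Phi[fold] @ b - y[fold]) ** 2))
    return np.mean(errors)

rng = np.random.default_rng(5)
x = rng.uniform(0.0, 1.0, size=60)
Phi = np.column_stack([x**h for h in range(10)])          # deliberately flexible polynomial basis
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=x.shape)

etas = [1e-6, 1e-4, 1e-2, 1.0]
best_eta = min(etas, key=lambda eta: cv_score(Phi, y, eta))
print("selected eta:", best_eta)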

1.2.3 Decision making

The definition of a joint probability distribution from a set of training data is an example of inference and, even if it is typically a very difficult problem, it assures a complete summary of the uncertainty associated with the variables. Moreover, once the whole probability distribution of the data is known, specific predictions ŷ_i for the target variable given a new input x_i can be made. Because of the probabilistic nature of the fitted function, predictions are expressed in terms of a predictive distribution that gives the probability distribution over y_i rather than simply a point estimate. However, the goal of practical supervised learning applications is a bit more general: indeed, one should take a specific action based on an understanding of the values y_i is likely to take. This aspect is the subject of Decision Theory, which, when combined with Probability Theory, allows one to make optimal decisions in situations involving uncertainty such as those encountered in supervised learning. In particular, Decision Theory provides some useful methods that are able to decide whether or not to take an action, given the appropriate probabilities, considering that the choice should be optimal in some appropriate sense. This is the decision step, which is indeed generally very simple, even trivial, once the inference problem is solved.

In order to present the main concepts of Decision Theory in a more concrete way, let us consider classification as a starting point, with G classes C_g. Two distinct approaches to solving decision problems can be identified; both have been used in practical applications and consist of two phases with the same goals. Indeed, both aim first to find the class probabilities P(C_g|X) and then to use Decision Theory to determine the class membership for each new input x. But the former step is reached in different ways. The first kind of approach models, explicitly or implicitly, the distribution of inputs as well as outputs; these are known as generative models, because by sampling from them it is possible to generate synthetic data points in the input space. In practice, they solve the inference problem of determining the class conditional densities P(X|C_g) for each class C_g individually. Then, they infer separately the class probabilities P(C_g). Finally, they apply Bayes theorem to compute the conditional class probabilities P(C_g|X). Equivalently, they directly model the joint distribution P(X, C_g) and then normalize it to obtain the class probabilities P(C_g|X). On the contrary, models that belong to the second approach directly solve the inference problem of determining the class probabilities P(C_g|X); they are called discriminative models. The first techniques are the most demanding, because they involve finding the joint distribution over both X and C_g. For many applications X will have high dimensionality, and consequently a large training set is needed in order to be able to determine the class conditional densities to reasonable accuracy. Note that the class probabilities P(C_g) can often be estimated simply from the fractions of training set data points in each of the classes. One advantage of this approach, however, is that it also allows the marginal density of the data P(X) to be determined. This can be useful for detecting new data points that have low probability under the model and for which predictions may be of low accuracy, which is known as outlier detection or novelty detection. However, if the main goal is to make classification decisions, then finding the joint distribution P(X, C_g) can be wasteful of computational resources and excessively demanding of data. When in fact only the probabilities P(C_g|X) are really needed, they can be obtained directly through the other approach. Indeed, the class-conditional densities P(X|C_g) may contain a lot of structure that has little effect on the probabilities P(C_g|X).

The second step assigns the correct class among {C_g}_{g=1}^G to each new observation x_i and involves understanding how probabilities play a role in making decisions. Informally, a correct idea is to follow a criterion that minimizes the chance of assigning x_i to the wrong class. As a consequence, the class having the higher probability P(C_g|x_i) should intuitively be chosen. More formally, a rule should be found that divides the input space into regions {R_g}_{g=1}^G called decision regions, one for each class, such that all points in R_g are assigned to class C_g. The boundaries between decision regions are called decision boundaries or decision surfaces. Each decision region need not be contiguous but could comprise some number of disjoint regions. In addition, this rule should assign each value {x_i}_{i=1}^n to one of the available classes such that as few misclassifications as possible are made. A mistake occurs when an input vector belonging to one class is assigned to another, different class. The probability of misclassification is equal to the sum of the probabilities that each observation is assigned to an incorrect class. Since the goal of the decision step is to minimize the probability of making a mistake, the minimum is obtained if each input value x_i is assigned to the class with the higher joint probability P(x_i, C_g). Using the product rule P(x_i, C_g) = P(C_g|x_i)P(x_i), and noting that the factor P(x_i) is common to all terms, the class for which the conditional probability

P(C_g|x_i) = P(x_i|C_g) P(C_g) / P(x_i)

is largest should be chosen, where any of the quantities appearing in Bayes theorem can be obtained from the joint distribution P(x_i, C_g) by either marginalizing or conditioning with respect to the appropriate variables. However, except for the case of two classes, it is usually slightly easier to maximize the probability of being correct, which is given by

P(correct) = Σ_{g=1}^G P(x_i ∈ R_g, C_g) = Σ_{g=1}^G ∫_{R_g} P(x_i, C_g) dx_i

which is maximized when the regions R_g are chosen such that each x_i is assigned to the class for which P(x_i, C_g) is largest, which is equivalent to assigning it to the class having the largest conditional probability P(C_g|x_i).

For many applications, the goal will be more complex than simply minimizing the number of misclassifications. As a consequence, it is necessary to introduce a loss function, also called a cost function, which is a single, overall measure of the loss incurred in taking any of the available decisions or actions. The goal is then to minimize the total loss incurred^13. Considering classification, let us suppose that, for a new input value x_i, the true class is C_g and that x_i is assigned to class C_j (where j may or may not be equal to g). In so doing, some level of loss is incurred, denoted by L_gj, which can be viewed as the (g, j) element of a loss matrix. The optimal solution is the one which minimizes the loss function. However, the loss function depends on the true class, which is unknown. For a given input vector x_i, the uncertainty in the true class is expressed through the joint probability distribution P(x_i, C_g) and so the average loss should instead be minimized, where

13 Some authors consider instead a utility function, whose value they aim to maximize. These are equivalent concepts, since utility is simply the negative of loss.

the average is computed with respect to this distribution and is given by

E[L] = Σ_g Σ_j ∫_{R_j} L_gj P(x_i, C_g) dx_i     (1.14)

Each input x_i can be assigned independently to one of the decision regions R_j. The criterion is to choose the regions R_j in order to minimize the expected loss of Equation 1.14, which implies that for each x_i the quantity Σ_g L_gj P(x_i, C_g) should be minimized. As before, the product rule can be used to eliminate the common factor P(x_i). Thus, the decision rule that minimizes the expected loss is the one that assigns each new x_i to the class j for which the quantity

Σ_g L_gj P(C_g|x_i)

is a minimum. This is clearly trivial to do, once the class probabilities are known (or fitted).
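A tiny numerical sketch of this rule, with made-up posterior probabilities and a made-up loss matrix, could read as follows.

import numpy as np

# Made-up loss matrix L[g, j]: rows index the true class g, columns the assigned class j;
# misclassifying class 0 as class 1 is penalized most heavily.
L = np.array([[0.0, 10.0, 1.0],
              [1.0,  0.0, 1.0],
              [1.0,  1.0, 0.0]])

# Fitted posterior probabilities P(C_g | x_i) for three new inputs (each row sums to one).
posteriors = np.array([[0.5, 0.3, 0.2],
                       [0.1, 0.8, 0.1],
                       [0.3, 0.3, 0.4]])

# Expected loss of assigning x_i to class j: sum over g of L[g, j] * P(C_g | x_i).
expected_loss = posteriors @ L
decisions = expected_loss.argmin(axis=1)
print(decisions)               # compare with posteriors.argmax(axis=1), the zero-one-loss rule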

Classification errors arise from regions of the input space where the joint distributions P(X, C_g) have comparable values. These are the regions where there is relative uncertainty about class membership. In some applications, it will be appropriate to avoid making decisions on the difficult cases, in anticipation of a lower error rate on those examples for which a classification decision is made. This is known as the reject option: it introduces a threshold and rejects those inputs x_i for which the largest of the probabilities P(C_g|x_i) is less than or equal to the threshold. Setting the threshold equal to 1 will ensure that all examples are rejected, whereas, if there are G classes, setting the threshold less than 1/G will ensure that no examples are rejected. Thus the fraction of examples that get rejected is controlled by the value of the threshold. The reject criterion can easily be extended to minimize the expected loss, when a loss matrix is given, taking account of the loss incurred when a reject decision is made.

Until now, the classification problem has been broken down into two separate steps: inference, in which training data is used to learn a model for P(C_g|X), and the subsequent decision stage, in which the probability P(C_g|X) is used to make optimal class assignments. An alternative would be to solve both problems together, thereby combining the inference and decision stages into a single learning problem. This simply implies learning a function f(x_i), called a discriminant function, that maps inputs {x_i}_{i=1}^n directly onto a class label, or onto decisions. For instance, in the case of two-class problems, the discriminant function might be binary valued, such that f(x_i) = 0 represents class C_1 and f(x_i) = 1 represents class C_2. This possibility is even simpler, since the probabilities P(C_g|x) play no role, even if there are many powerful reasons for wanting to compute these probabilities.

As regards regression, characterized by real-valued variables and a model expressed by Equation 1.3, Decision Theory allows one to choose a specific prediction ŷ_i

of each value y_i for any new input x_i in such a way as to minimize the average, or expected, value of a suitably chosen loss function L(y_i, ŷ_i), equal to

E[L] = ∫∫ L(y_i, ŷ_i) P(x_i, y_i) dx_i dy_i

A common choice of loss function in regression problems is the squared loss, given by

L(y_i, ŷ_i) = {ŷ_i − y_i}²

In this case, the expected loss can be written as

E[L] = ∫∫ {ŷ_i − y_i}² P(x_i, y_i) dx_i dy_i

Since the criterion is to choose ŷ_i so as to minimize E[L], the optimal solution is given by

ŷ_i = E[y_i|x_i] = φ(x_i)^T b̂     (1.15)

which is the conditional average of y_i given x_i shown in Equation 1.4 and is known as the regression function. In the light of this explanation, a complete overview of all the previous results can be given. First of all, it is now clear why predictions are defined as the conditional mean, and this result gains plausibility since such predictions are optimal with respect to some criteria. Then, the criterion of minimizing the expected squared loss for computing optimal predictions confirms that of minimizing the RSS for fitting the parameters. Finally, the interest in the conditional distribution, and in fitting its parameters rather than those of the joint distribution, is now justified.

As with classification, there are two strategies that can be followed: either to determine the appropriate probabilities and then use them to make optimal decisions, or to build models that make decisions directly. The first approach can solve its inference problem by determining the joint density P(x_i, y_i), then normalizing to find the conditional density P(y_i|x_i), and finally marginalizing to find the conditional mean. Otherwise, it can first determine the conditional density P(y_i|x_i) and then subsequently marginalize to find the conditional mean. The second approach finds a regression function ŷ_i = φ(x_i)^T b̂ directly from the training data. The relative merits of these three approaches follow the same lines as for the classification problems above. The squared loss is not the only possible choice of loss function for regression. Indeed, there are situations in which the squared loss can lead to very poor results and more sophisticated approaches should be developed. An important example concerns situations in which the conditional distribution P(y_i|x_i) is multimodal, as often arises in the solution of inverse problems.

2 Partial Least Squares (PLS)

This Chapter presents the supervised learning technique of Partial Least Squares (PLS), the core argument of the thesis, dealing with all the main concepts related to its definition and use. After the introduction (Section 2.1), a formal description of PLS is provided, including its mathematical structure, algorithmic features and geometric representation (Section 2.2). The tasks of the PLS method are then illustrated (Section 2.3), so that PLS and its behaviour in several situations can be compared with alternative methods in the following Chapters. In addition, an overview of some PLS applications to different data analysis problems is given (Section 2.4). Finally, the main software packages that include the PLS method are reviewed (Section 2.5).

2.1 Introduction

Machine Learning, as previously said, is a discipline based on data collection and analysis. Indeed, learners process the data of the training set in order to reach their goals (representation and generalization). Thus data, which consist of measurements of individual properties made for each instance, are fundamental for solving Machine Learning problems. Such prominence of data is confirmed by their key role in the more general process of understanding real phenomena. Indeed, in order to know and study an observable event, researchers usually set up, work with and refer to a corresponding system, i.e., a group of interacting, interrelated, or interdependent variables^1. Since the system is characterized by uncertainty, it is usually assumed that it follows a probability distribution which is theoretically defined but at the same time unknown. Then, the aim of the survey is to identify the significant variables, excluding irrelevant ones, and to understand the relations between them, taking into account their random nature. Empirical evidence helps in this task, because measurements of the quantities involved in the phenomenon produce data whose analysis allows one to describe the system. As a consequence, data, which can be seen as observed values of variables^2, should be

1 In other words, the defined system corresponds to an observable event whose specific quantities are represented by variables.
2 Data are also called realizations, i.e. draws from the fixed and unknown probability distribution.

gathered together and analyzed with suitable techniques. The more data are available, the more information about the system can be extracted through data analysis. However, since the process of data collection is usually expensive and time consuming, researchers usually apply a specific sampling design. It consists in identifying the population, as the set of all individuals directly involved in the phenomenon; then, following specific sampling techniques, it defines the statistical sample as a smaller subset of the population on which the measurements of the variables are made. In this way, researchers are sure that the data describe the system and that analyses on the training set also explain the real event. With these premises, one can state that the system generates the collected data and that data analysis allows one to understand the real phenomenon. However, sometimes further conjectures should be made. The system could indeed be driven by a small number of latent, i.e. not directly measurable, variables. In this case some new quantities exist that straightforwardly control the phenomenon. So, it is necessary to build them from the original variables before fitting a model for the phenomenon of interest. The method presented in this Chapter makes this underlying assumption.

Partial Least Squares (PLS) is a wide class of techniques for modelling relations between sets of observed quantities by means of latent variables. In its general form, PLS defines new components by maximizing the covariance between two different blocks of original variables. In this way PLS builds a more compact latent space onto which the measured data are first projected; the model is then fitted on this updated dataset. As regards the assumptions that PLS requires the data to satisfy for its applicability, such hypotheses are not very constraining. PLS emerges indeed as a multivariate method that can handle collinearity^3 in the dataset. Furthermore, PLS solves the problem that arises when the number of individuals is much lower than the number of variables. In addition, PLS has minimal demands in terms of residual distribution, since it does not need data to come from normal or known distributions. PLS was developed by the econometrician H. Wold starting in 1966, when he modified his Fixed Point technique, which uses an iterative Ordinary Least Squares (OLS) algorithm to estimate the coefficients of a system of simultaneous equations (Wold, 1966 [74]). This method gave way to the Nonlinear Iterative PArtial Least Squares (NIPALS) algorithm, which consists of an iterative sequence of simple and multiple OLS regressions in order to calculate principal components and canonical correlations, respectively (Wold, 1973 [75]). The general PLS algorithm is based on NIPALS and appeared at the end of 1977 as an iterative procedure for finding latent variables. There are different techniques to extract latent vectors, and each of them gives rise

3 Collinearity exists when there is an extreme dependence between variables. The presence of collinearity involves difficulties in interpreting the results, because fitted coefficients can be insignificant for the explained variable.

to a variant of PLS. For example, the approach originally designed by Wold is called PLS Mode A, whereas PLS1 and PLS2 are the most frequently used forms. Moreover, some PLS variants model relations among more than two sets of variables. Wold's son simplified the PLS algorithm and added diagnostic interpretation with various co-workers from the eighties onwards (Wold, 1984 [79]; Wold, 1989 [78] and Wold, 1992 [77]). These later works have turned the PLS method into the general scientific data analysis tool that it is today, one that assures computational and implementation simplicity also with huge datasets (Wold, 2001 [80]). PLS has many statistical tasks: indeed, it comprises regression and classification abilities, but at the same time it can also be used as a descriptive tool. Furthermore, both dimension reduction and nonlinear PLS can be defined. PLS has nowadays proven to be a very powerful technique in many research areas and industrial contexts. This multidisciplinary applicability is a main feature of PLS that has distinguished it since its origins. Indeed, PLS was initially developed from methods related to Econometrics. Then, interest in PLS shifted from the Social Sciences to Chemometrics, where PLS first received a great amount of attention. This approval of PLS resulted in a lot of implementations in other disciplines, including Food Research, Physiology, Medicine, Pharmacology, Economics (business disciplines or the strategic management area), Engineering and Computer Science (bioinformatics, computer vision, image processing, human detection and face recognition), to name but a few. The fruitful application of PLS in so many fields depends on the availability of suitable software, since PLS needs sophisticated computations. Fortunately, all the free or commercial suites for multivariate analysis provide tools that also implement PLS methods. Such solutions can have a more general applicability, working in several industries, or be more subject oriented, as specific applications for well defined domains.

2.2 PLS theory

In this Section a complete description of the PLS procedure is provided. It includes firstly a mathematical part, which requires some basic knowledge of algebra, such as the use of vector spaces, matrices and their properties. Then, the NIPALS algorithm is presented in order to show how PLS is actually computed. Some variants of PLS are listed here, which differ from PLS in the last step. An eigenvalue problem is also described, whose solutions correspond to the results obtained by means of the NIPALS algorithm. Finally, a geometric representation of both data and PLS results is suggested for simple examples by means of basic plots; it is useful for better understanding the information carried by the dataset. In the following, matrices and vectors (columns by definition) are denoted by bold capital and bold lowercase letters, respectively. They represent data, variables, weights or unknown parameters. A superscript T denotes the transpose of a matrix or vector, so that the transpose of a vector will be a row. Lowercase letters stand for real numbers, such as indexes. In particular, j is used as a column index and i as a row index in presence of matrices. Other symbols respect standard assumptions or are explained near the corresponding equation.

2.2.1 Mathematical structure

Let n be the number of instances belonging to the collected training set on which the variables of interest are measured. Let the domain X ⊆ IR^k denote a subset of the k-dimensional Euclidean space and let X ∈ X be a matrix of dimension (n × k). X = (x_1, ..., x_j, ..., x_k) includes k vectors x_j ∈ IR^n. Every vector x_j records the values {x_ij}_{i=1}^n of the j-th variable that are measured for each instance of the training set. Similarly, let Y ⊆ IR^d be a d-dimensional Euclidean space and Y ∈ Y be a matrix of dimension (n × d). Y = (y_1, ..., y_j, ..., y_d) collects d vectors y_j ∈ IR^n of observations {y_ij}_{i=1}^n. The pairs of row vectors of X and Y are denoted by {x_i, y_i}_1^n, x_i ∈ IR^k, y_i ∈ IR^d, and collect all the information about every measured instance. Some statistics can be computed from these data. Let the sample means of X and Y be respectively

x̄ = (x̄_1, ..., x̄_j, ..., x̄_k)^T     ȳ = (ȳ_1, ..., ȳ_j, ..., ȳ_d)^T

i.e. the k- and d-dimensional vectors that collect the sample means of every x_j and y_j, computed as x̄_j = (1/n) Σ_{i=1}^n x_ij and ȳ_j = (1/n) Σ_{i=1}^n y_ij. Many statistical analyses settle on the hypothesis of variables with null mean. As a consequence, without loss of generality, it can be assumed that every vector of both the X and Y matrices has zero mean; otherwise the data can easily be centered. Moreover, let

S_X = (1/(n−1)) X^T X     S_Y = (1/(n−1)) Y^T Y     S_XY = (1/(n−1)) X^T Y     (2.1)

be respectively the (k × k) and (d × d) symmetric matrices of unbiased sample covariance and the (k × d) sample cross-product covariance matrix. The corrected sample variance of vector x_j is denoted by s_j² = (1/(n−1)) Σ_{i=1}^n (x_ij − x̄_j)² and enters on the main diagonal of matrix S_X, which contains off the diagonal s_ij = (1/(n−1)) (x_i − x̄_i)^T (x_j − x̄_j), the corrected sample covariance between two vectors x_i and x_j. The same concepts hold for S_Y. Sometimes the measured quantities are standardized, or autoscaled, which means replacing x_ij with (x_ij − x̄_j)/√(s_j²), obtaining new data characterized by mean 0 and variance 1. Analyses are then applied to the standardized variables and the resulting solutions transformed back to the original locations and scales of the observed quantities.
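For concreteness, the centered data and the sample (cross) covariance matrices of Equation 2.1 can be computed as in the following sketch; the dimensions and data are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(6)
n, k, d = 50, 6, 2
X = rng.normal(size=(n, k))
Y = rng.normal(size=(n, d))

# Center each column so that every variable has zero mean.
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean(axis=0)

S_X = Xc.T @ Xc / (n - 1)      # (k x k) sample covariance of X
S_Y = Yc.T @ Yc / (n - 1)      # (d x d) sample covariance of Y
S_XY = Xc.T @ Yc / (n - 1)     # (k x d) sample cross covariance

assert np.allclose(S_X, np.cov(X, rowvar=False))   # sanity check against numpy's estimator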

All these sample statistics (and their values) can be used as estimators (and as estimates) of the corresponding theoretical quantities. Indeed, let X = (x_1, ..., x_j, ..., x_k) and Y = (y_1, ..., y_j, ..., y_d) denote the multivariate (k-variate and d-variate) random variables that represent the properties of the observed phenomenon. Both are characterized by their own probability distribution. This means that they have means μ_X = (μ_1, ..., μ_j, ..., μ_k)^T and μ_Y = (μ_1, ..., μ_j, ..., μ_d)^T respectively, the k- and d-dimensional vectors that collect the mean of every unidimensional variable. Moreover, the (k × k) and (d × d) covariance matrices Σ_X and Σ_Y can be defined, as well as the (k × d) cross-product covariance matrix Σ_XY. The mean is defined as the expectation, or equivalently the first moment, and is usually denoted by μ_x = ave(x) = E[x] for a general unidimensional variable x. The variance is the second central moment, defined as σ_x² = var(x) = E[(x − E[x])²] = E[x²] − (E[x])². The covariance is the joint central moment, equal to σ_xy = cov(x, y) = E[(x − E[x])(y − E[y])]. Obviously, the probability distribution of the variables is unknown, since it is impossible to observe and measure all their realizations. As a consequence, data are essential to compute the statistics that estimate the corresponding theoretical quantities, allowing one to describe the unknown phenomenon.

In order to compute the best model for the observed quantities by means of latent variables, PLS decomposes the zero mean matrices X and Y according to the following formulas

X = T P^T + E
Y = U Q^T + F     (2.2)

with T and U the (n × m) score matrices, P (k × m) and Q (d × m) the loading matrices, and E (n × k) and F (n × d) the residual matrices. The quantity m can be considered a meta-parameter and should be estimated from the data through cross validation, a model selection criterion^4. It has an upper bound equal to the rank of the covariance matrix S_X, that is m ≤ rk(X^T X) ≤ k. In order to compute this decomposition, PLS looks for weight vectors (or projection vectors) w and c, of dimensions (k × 1) and (d × 1) respectively, such that

[cov(t, u)]² = max_{||w||=||c||=1} [cov(Xw, Yc)]²     (2.3)

where t and u are a column of T and U respectively^5 and cov indicates the sample covariance, defined by cov(t, u) = t^T u / n. In the constraint, ||w|| = √(w^T w) = √(Σ_{i=1}^k w_i²) is the norm of the vector w and states that the vector w has length 1.

4 Subsection 1.2.2 gives a complete overview of model selection criteria and describes other alternative techniques in detail. A straightforward application of many of the competing model selection criteria is not appropriate for PLS, since it is not a linear modelling procedure, that is, the Y values enter nonlinearly into the model estimates.
5 {t_j}_{j=1}^m is a similar notation that can be used in the following.

At the end of the PLS procedure, m vectors w and c have been computed and collected in the matrices W and C of dimensions (k × m) and (d × m) respectively. Informally, PLS extracts m latent variables as linear combinations of the original observed and measured quantities. These components define a new vector space that is a subspace of the original one to which the data belong. PLS builds the latent variables by maximizing the covariance, i.e. the separation, in the subspace. In other words, PLS learns a subspace in which every latent variable is well distinguished from the others. Moreover, the latent variables collect most of the variation of the observable X, in such a way that they may also be used to model the Y quantities. The components indeed explain both the X and the Y variability. Then, PLS projects instances and variables, computing the matrices T, P, U and Q. With these results PLS produces a sequence of m models and estimates which one is the best through cross validation.
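In practice, this sequence of models and its selection by cross validation can be sketched with scikit-learn's PLSRegression, assuming that library is available; the simulated data and the grid of components are illustrative assumptions.

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n, k = 100, 30
latent = rng.normal(size=(n, 3))                         # a few hidden factors drive both blocks
X = latent @ rng.normal(size=(3, k)) + 0.1 * rng.normal(size=(n, k))
y = latent @ rng.normal(size=3) + 0.1 * rng.normal(size=n)

scores = {}
for m in range(1, 11):
    pls = PLSRegression(n_components=m)
    scores[m] = cross_val_score(pls, X, y, cv=5).mean()  # default R^2 on held-out folds

best_m = max(scores, key=scores.get)
model = PLSRegression(n_components=best_m).fit(X, y)
print("selected number of latent variables:", best_m)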

2.2.2 NIPALS algorithm

In its classical form, PLS is based on the Nonlinear Iterative PArtial Least Squares (NIPALS) algorithm, reported in Table 2.1. The NIPALS algorithm starts with an initialization of the score vector u. If Y = y, i.e. it is unidimensional, NIPALS assigns to u_0 the values of the original (standardized) data y; otherwise u_0 is equal to a random n-dimensional vector (step 1). Then, at each iteration i = 1, . . . , k of the NIPALS algorithm, until convergence, residuals of the Y variables are partially regressed on the X residuals, both computed from the previous iteration (step 2). Partial regression consists of computing the covariance weight vector w_i (step a) and then using it as a projection vector to form a linear combination of the X_{i-1} variables, so that the score vector t_i can be obtained (step b). The Y_{i-1} variables are regressed on this linear combination, so that the weight vector c_i is computed (step c) and used to project the Y_{i-1} values in order to obtain a new score vector u_i (step d). All formulas of this phase are confirmed by Equation 2.3. Now the vectors of loadings p_i and q_i can be computed as coefficients of regressing X_{i-1} on t_i and Y_{i-1} on u_i, in agreement with the relations explained in Equation 2.2 (step 3). Then, the PLS model Ŷ_i can be fitted since all required elements are available (step 4). Furthermore, the NIPALS algorithm performs a deflation of the current matrices X_{i-1} and Y_{i-1}, by subtracting their rank-one approximations based on their projections (step 5). As a consequence, any information captured by w_i is removed, residuals are formed and they can be used as variables for the next iteration if the explanatory power of the regression is small and more projection vectors are required. Indeed, in order to achieve the desired latent space dimension, the algorithm returns to the first step to evaluate a new w_i and the whole process is repeated until the desired number of latent vectors has been extracted. In this way, the NIPALS algorithm produces a sequence of models {Ŷ_i}_1^m, where m ≤ rk(S_X), on successive passes through the “for” loop. The Ŷ_i that minimizes the cross validation score is selected as the PLS solution. The test will cause the algorithm to terminate after as many steps as the rank of the sample covariance matrix S_X (step 6).

The asymptotic space and time complexity of the NIPALS algorithm is O(kn) (assuming d ≤ k). Indeed, it is not possible to do away with the O(kn) space requirement, because this is necessary to store the X matrix. As a consequence, in spite of its good performance, the NIPALS algorithm for PLS can have computational bottlenecks in the presence of a very large sample size n and a high number of variables k. The occurrence of these conditions implies an extension of computational time. Although in some applications PLS modelling is an off-line training step, accelerating it is paramount to enable large scale modelling. Time complexity can be addressed with an efficient parallelization strategy, using for example graphical processors that achieved ∼ 30X speedups against standard CPU-based implementations (Srinivasan et al., 2010 [62]).

Table 2.1: NIPALS algorithm

Nonlinear Iterative PArtial Least Squares (NIPALS)

Given: X_0 ← X (n × k), Y_0 ← Y (n × d), Ŷ_0 ← 0 (n × d)

1) Initialize score vector u
   if Y = y (1-dimensional) then u_0 ← y
   else u_0 ← random n-dimensional vector
2) Iterate to convergence
   for i = 1 to k
   a) w_i = X_{i-1}^T u_{i-1} / (u_{i-1}^T u_{i-1})
   b) t_i = X_{i-1} w_i
   c) c_i = Y_{i-1}^T t_i / (t_i^T t_i)
   d) u_i = Y_{i-1} c_i
3) Compute loading vectors
   p_i = X_{i-1}^T t_i / (t_i^T t_i)
   q_i = Y_{i-1}^T u_i / (u_i^T u_i)
4) Compute model
   Ŷ_i = Ŷ_{i-1} + u_i q_i^T
5) Deflate matrices X and Y
   X_i ← X_{i-1} − t_i p_i^T
   Y_i ← Y_{i-1} − u_i q_i^T
6) Ending condition
   if X_i^T X_i = 0 then Exit
   end for
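For concreteness, the loop of Table 2.1 can be transcribed in R as the following hedged sketch (the function name nipals_pls and the arguments are not from the thesis). It assumes column-centred input matrices, a user-chosen number m of latent vectors, and the Mode A deflation of step 5.

# Minimal, illustrative R transcription of the NIPALS loop of Table 2.1.
# X (n x k) and Y (n x d) are assumed to be column-centred numeric matrices.
nipals_pls <- function(X, Y, m, tol = 1e-10, max_iter = 500) {
  n <- nrow(X); k <- ncol(X); d <- ncol(Y)
  Tmat <- Umat <- matrix(0, n, m)
  Wmat <- Pmat <- matrix(0, k, m)
  Cmat <- Qmat <- matrix(0, d, m)
  for (i in seq_len(m)) {
    # step 1: initialize u with y itself when d = 1, otherwise at random
    u <- if (d == 1) Y[, 1] else rnorm(n)
    for (iter in seq_len(max_iter)) {            # step 2: iterate to convergence
      w <- crossprod(X, u) / sum(u^2)            # a) w = X'u / u'u
      w <- w / sqrt(sum(w^2))                    #    rescale w to unit length
      t_i <- X %*% w                             # b) t = Xw
      c_i <- crossprod(Y, t_i) / sum(t_i^2)      # c) c = Y't / t't
      u_new <- Y %*% c_i                         # d) u = Yc
      if (sum((u_new - u)^2) < tol) { u <- u_new; break }
      u <- u_new
    }
    p_i <- crossprod(X, t_i) / sum(t_i^2)        # step 3: loading vectors
    q_i <- crossprod(Y, u) / sum(u^2)
    X <- X - t_i %*% t(p_i)                      # step 5: rank-one deflation of X
    Y <- Y - u %*% t(q_i)                        #    Mode A deflation of Y (Table 2.1)
    Tmat[, i] <- t_i; Umat[, i] <- u; Wmat[, i] <- w
    Cmat[, i] <- c_i; Pmat[, i] <- p_i; Qmat[, i] <- q_i
  }
  list(T = Tmat, U = Umat, W = Wmat, C = Cmat, P = Pmat, Q = Qmat)
}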

Variants

The name PLS applies to a very general statistical method. Indeed, the PLS methodology can be seen as a family of procedures that includes a number of specific variants, which allow one to achieve different tasks. Each PLS form is defined by the corresponding deflation scheme that is applied in the algorithm (Rosipal and Krämer, 2006 [56]). If the deflation is equal to X = X − tp^T and Y = Y − uq^T, as reported in the algorithm of Table 2.1 (step 5), the so defined PLS method is called PLS Mode A. This is the approach originally designed by H. Wold for modeling paths of causal relation between any number of blocks of variables (Wold, 1975 [76]). In this case, each iteration of the algorithm applies a rank-one deflation of the individual block matrices X and Y using the corresponding score and loading vectors. This technique models connections between different blocks of variables whose relation is symmetric. Since PLS Mode A seems to be appropriate for modelling existing relations between sets of variables, it is similar to Canonical Correlation Analysis (CCA). Otherwise, the following PLS variants can be defined by applying the corresponding deflation schemes:

• PLS1 and PLS2: X = X − tp^T and Y = Y − tt^T Y/(t^T t) = Y − tc^T, assuming that a linear inner relation between the score vectors t and u exists and is defined as

U = TD + H     (2.4)

where D is the (m × m) diagonal matrix and H denotes the matrix of residuals. In this approach the score vectors t are good predictors of Y. As a consequence, after extraction of the score vectors t, the variables Y are regressed on t. Moreover, the deflation scheme removes a regression component from Y at each iteration of the algorithm. Here, the weight vectors c are not scaled to unit norm. PLS1 and PLS2, which are the most frequently used PLS techniques, fit regression models (often denoted by the acronym PLS-R), with one block of predictors and one block of responses. Indeed, the relation between the X and Y variables is asymmetric and the computation of the score vectors reflects that. In PLS1 one of the blocks of data consists of a single variable; in PLS2 both blocks are multidimensional. Furthermore, the PLS1 and PLS2 deflation scheme of extracting one component at a time assures mutual orthogonality of the score vectors t. Finally, in PLS1 the deflation of y is technically not needed during the iterations of PLS.

• PLS-SB: in PLS-SB the computation of all weight vectors w is done at once. This involves a sequence of implicit rank-one deflations. In contrast to PLS1 and PLS2, the extracted score vectors t are in general not mutually orthogonal.

• SIMPLS: SIMPLS, introduced by de Jong, avoids the deflation steps at each iteration of PLS1 and PLS2 (de Jong, 1993 [16]). The SIMPLS approach directly finds weight vectors w̃ which are applied to the original, not deflated, matrix X. The criterion of mutually orthogonal score vectors t̃ is kept. It has been shown that SIMPLS is equal to PLS1 but differs from PLS2 when applied to a multidimensional matrix Y.

These three variants are characterized by the asymmetric relation between the different blocks of variables. For this reason, they all allow one to perform regression and they differ substantially, from a theoretical perspective, from PLS Mode A (and the corresponding NIPALS algorithm of Table 2.1). Regarding regression, since the introduction of PLS several different algorithms have been proposed that lead to the same sequence of models. Besides those listed above, the most elegant formulation is, perhaps, the one shown by Helland (Helland, 1988 [31]). Indeed, his work shows that for every iteration i the corresponding PLS model can be obtained by an OLS regression of the Y variable on the linear combination of the X variables with the covariance weights w_i, that is, on the score vectors t_i, the projection of X onto the latent space.
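The distinction between the deflation schemes can be made explicit with a small hedged sketch (function and variable names are illustrative only): Mode A removes a rank-one term from each block with its own score and loading vectors, whereas PLS1/PLS2 remove from Y the part predicted by the X score t.

# Illustrative deflation step for the variants above; t_vec, p_vec, u_vec, q_vec
# and c_vec denote the vectors computed in the current iteration.
deflate <- function(X, Y, t_vec, p_vec, u_vec, q_vec, c_vec,
                    scheme = c("modeA", "regression")) {
  scheme <- match.arg(scheme)
  X_new <- X - t_vec %*% t(p_vec)        # X deflation, common to all variants
  Y_new <- if (scheme == "modeA") {
    Y - u_vec %*% t(q_vec)               # PLS Mode A: symmetric rank-one deflation
  } else {
    Y - t_vec %*% t(c_vec)               # PLS1/PLS2: Y - t t'Y / t't = Y - t c'
  }
  list(X = X_new, Y = Y_new)
}

SIMPLS, in contrast, never deflates the data blocks themselves but adjusts the weight vectors so that they apply to the original matrices.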

Eigenvalue problem

The NIPALS algorithm corresponds to traditional power iterations for finding the dominant eigenvectors of a specific eigenvalue problem^6. Indeed, the first weight w computed with the NIPALS algorithm is equal to the first eigenvector of the following eigenvalue problem

X^T Y Y^T X w = λw     (2.5)

where the matrix involved in this equation, X^T Y Y^T X, is the squared covariance matrix between X and Y. Because the rank of the above system is limited by the number of instances n, the case n < k yields only a few dominant eigenvectors, and hence PLS works best in this scenario. Similarly, eigenvalue problems for the extraction of the t, u or c estimates can be derived. One then solves one of these eigenvalue problems, and the other score or weight vectors are readily computable using the relations defined in NIPALS. Considering PLS1 and PLS2 together with the corresponding eigenvalue problem, some interesting properties can be highlighted. First of all, the singular values of the cross product matrix X^T Y equal the sample covariance values. Then, the first singular value of the deflated cross product matrix X^T Y at iteration i + 1 is greater than or equal to the second singular value of X^T Y at iteration i. This result can also be applied to the relation of the eigenvalues of Equation 2.5, due to the fact that Equation 2.5 corresponds to the singular value decomposition of the transposed cross product matrix X^T Y. In particular, the PLS1 and PLS2 algorithms differ from the computation of all

6 The Appendix A.1 provides a more technical description of a general eigenvalue problem.

eigenvectors of Equation 2.5 in one step. In PLS-SB all eigenvectors of Equation 2.5 are computed at once. This method involves a sequence of implicit rank-one deflations of the overall cross product matrix.
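This correspondence can be checked numerically; the sketch below is only illustrative and reuses the hypothetical nipals_pls function introduced earlier.

# Check that the first NIPALS weight vector matches (up to sign) the dominant
# eigenvector of X'Y Y'X in Equation 2.5.
check_first_weight <- function(X, Y) {
  M  <- crossprod(X, Y) %*% crossprod(Y, X)       # X'Y Y'X, a (k x k) matrix
  v1 <- eigen(M, symmetric = TRUE)$vectors[, 1]   # dominant eigenvector
  w1 <- nipals_pls(X, Y, m = 1)$W[, 1]            # first NIPALS weight vector
  max(abs(abs(v1) - abs(w1)))                     # should be numerically close to zero
}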

2.2.3 Geometric representation

The algebraic definition of PLS by means of matrices and vectors allows one to use a geometric representation of these mathematical quantities in order to better understand the information carried by the data, such as their structure and the existing relations between variables and between instances, but also to detect outliers. Indeed, an important aspect of PLS is the ability to easily visualize high-dimensional data through the set of extracted latent variables. For this reason some simple diagnostic PLS tools exist, such as score and loading plots, that explain the meaning of all the results of PLS: components, scores, loadings, residuals, etc. Generally, in the geometrical drawing of a dataset (matrix), the number of variables (columns) defines the dimensionality of the space, variables are represented by axes and every instance (row) corresponds to a point, whose coordinates coincide with the values of the variables. In the following the most important graphical tools are shown with the help of a simple example. Consider a very small dataset with one instance (n = 1). Three variables, denoted by x_1, x_2, x_3, are measured on it (k = 3). Their values are 1, 2, 3 respectively. So, the (1 × 3) matrix X is equal to Table 2.2. With such data, a tridimensional space can be defined where each axis represents a variable and the instance becomes a point of the space. Its coordinates are equal to the values of the variables (Figure 2.1^7).

Table 2.2: Dataset represented by X matrix.

X    x1   x2   x3
x     1    2    3

Figure 2.1: Tridimensional vectorial space for dataset X.

7 All the figures of this Section are obtained with R Development Core Team (2011). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.

Table 2.3: PLS results.

(a) Score matrix. (b) Loading matrix. (c) Residual matrix.

T    t1   t2        P     p1   p2        E    e1   e2   e3
t     1    2        p1     1    0        e     0    0    3
                    p2     0    1
                    p3     0    0

Let us suppose that PLS is applied to this dataset and that the result is given by the matrices T (1 × 2), P (3 × 2) and E (1 × 3) as in Table 2.3. PLS defines a new vector space that is a subspace of the previous one, in which the original data are represented. The loading, score and residual matrices contain all the information needed to draw this new space with its own observations and quantities. First of all, the vectors p (columns of matrix P) provide the axes of the subspace and their position with respect to the original space. Let us denote with the label PC these new axes, which correspond to the latent variables, sometimes called Principal Components. Moreover, the rows of matrix P record the position of the original variables in the new subspace. More formally, they hold the projections of the original variables (old axes) onto the new subspace. These values are called loadings and their geometrical representation is named loading plot. Secondly, the score matrix T collects the coordinates of the instance projected into the new subspace. These values are called scores and the score plot is their graph. Finally, matrix E contains the residuals, defined as the difference between the original values of the instances and their projection. The information collected in matrices T, P and E can be added to the previous graph, as Figure 2.2a shows for the results of the example. Blue, gray, red and green colours are chosen to highlight the latent space, loading, score and residual values respectively. In this case, PLS defines a (blue) bidimensional horizontal plane with new axes positioned at (1, 0, 0) and (0, 1, 0) with respect to the original tridimensional space. The projections of the old

(a) Formal labels. (b) Meaningful labels.

Figure 2.2: Tridimensional plots of PLS results.

variables are gray points (1, 0), (0, 1) and (0, 0). The instance is projected onto the red point of the subspace with new coordinates (1, 2), and the residual is the green segment of length 3. Figure 2.2b shows the same results but its labels extend the information carried by the plot. Indeed, these labels link the points to the corresponding elements of the dataset, instances or variables, making the interpretation of the plot and the description of the dataset easier. This labelling is used in the score and loading plots usually produced by the most common software packages that implement PLS. Moreover, they generally provide several other graphical tools that can be managed in order to build every kind of useful plot. Without loss of generality, these premises can be extended to every situation in which PLS is applied, hopefully characterized by more than one instance and usually by a very high number of variables, so that PLS can compute several components. In these cases, in order to assure a clear and meaningful rendering of the figures despite the graphical difficulty of drawing more than a tridimensional space, score and loading plots are normally represented as bidimensional graphs where the axes are two chosen latent variables. Then, different pairs of latent components can be considered as axes, creating a sequence of score and loading plots, surely more readable than multivariate graphs. Figure 2.3 shows an example with the suggested simple data. The position of instances and variables in the latent space defined by the PCs can be easily recognized by observing these plots. Considering the points in the score plot, one can generally infer whether instances are similar or how much they differ. Indeed, groups of neighbouring points represent data with similar features, whereas distant points are quite different observations. Analysis of the loading plot allows one to understand which variables are important in the displayed PCs. Indeed, points near the origin of the axes represent variables that enter the PCs with a low weight, so that they are not meaningful for these PCs. Otherwise, if a variable has a high value along a PC, this means that it is substantial and heavily weighted in the computation of that PC. A further step allows

(a) Score plot. (b) Loading plot.

Figure 2.3: Bidimensional plot of PLS results.

one to make a better interpretation of the dataset by comparing scores and loadings, ideally overlapping the score and loading plots, since they are two different pictures of the same space. Indeed, if the information extracted by both score and loading plots is combined, one can say not only which observations are more or less similar but also why these differences exist, by checking which variables are more important for the latent variables. After this preliminary and partial example, a simple case of PLS regression will be considered. Graphical outputs again help to understand more quickly and more easily the information carried by the numerical results. As a consequence, with the helpful support of graphical tools, analyses, evaluations and choices can be made. Assume that a set of five wines is rated by a panel of experts on three aspects. The dependent variables record the subjective evaluation of the likeability of a wine, and how well it goes with meat or dessert. In addition, there are four predictors: price, sugar, alcohol, and acidity content of each wine. The complete dataset^8 is shown in Table 2.4 and in Figure 2.4.

Table 2.4: Dataset of a simple example of PLS regression.

               X                                Y
        Price  Sugar  Alcohol  Acidity    Hedonic  Meat  Dessert
wine1     7      7      13       7          14      7      8
wine2     4      3      14       7          10      7      6
wine3    10      5      12       5           8      5      5
wine4    16      7      11       3           2      4      7
wine5    13      3      10       3           6      2      4

(a) Tridimensional X space (last variable excluded): wines plotted on the Price, Sugar and Alcohol axes. (b) Tridimensional Y space: wines plotted on the Hedonic, Meat and Dessert axes.

Figure 2.4: An example of graphical representation for the wine dataset.

8 Data of this example are taken from H. Abdi, Partial Least Squares (PLS) Regression. In: Lewis-Beck M., Bryman A., Futing T. (Eds.) (2003). Encyclopedia of Social Sciences Research Methods. Thousand Oaks (CA): Sage.

Table 2.5: Matrices of weights W and C.

(a) W matrix. (b) C matrix.

W          w1        w2        w3         C          c1        c2        c3
Price    −0.5137   −0.3379   −0.3492      Hedonic   0.6093    0.0518    0.9672
Sugar     0.2010   −0.9400    0.1612      Meat      0.7024   −0.2684   −0.2181
Alcohol   0.5705   −0.0188   −0.8211      Dessert   0.3680   −0.9619   −0.1301
Acidity   0.6085    0.0429    0.4218

Table 2.6: Score matrix T and loading matrix P.

(a) T matrix. (b) P matrix.

T          t1        t2        t3         P          p1        p2        p3
wine1    0.4538   −0.4662    0.5716       Price    −1.8706   −0.6845   −0.1796
wine2    0.5399    0.4940   −0.4631       Sugar     0.0468   −1.9977    0.0829
wine3    0         0         0            Alcohol   1.9547    0.0283   −0.4224
wine4   −0.4304   −0.5327   −0.5301       Acidity   1.9874    0.0556    0.2170
wine5   −0.5633    0.5049    0.4217

In particular, the plot on the left illustrates the observations in the tridimensional space defined by the first three independent variables (Figure 2.4a), while the plot on the right represents the data in the tridimensional space of the Y variables (Figure 2.4b). Let us now suppose that a PLS regression is applied to these data, setting the maximum number of latent variables equal to three and choosing a proper validation method. After the computations, all the output matrices can be viewed and the numerical results displayed in plots that allow one to evaluate them in a simpler and faster way. The matrices of weights W and C are reported in Table 2.5. With these matrices the latent variables can be computed and the original data can then be projected onto the new latent space. As a consequence, matrices of scores and loadings can be computed for both X and Y variables; the corresponding results are shown in Table 2.6 and Table 2.7. In addition, the explained variance is computed and should be taken into account (Table 2.8 and Figure 2.5). Note that the residual variance is strictly related to the explained variance, since their sum is by definition equal to one. As a consequence, software packages usually compute both types of variance and one can switch between them.

Table 2.7: Score matrix for Y variable.

U          u1        u2        u3
wine1    1.9451   −0.7611    0.6191
wine2    0.9347    0.5305   −0.5388
wine3   −0.2327    0.6084    0.0823
wine4   −0.9158   −1.1575   −0.6139
wine5   −1.7313    0.7797    0.4513

Table 2.8: Variance of X and Y explained by the latent vectors.

PC    X Percentage    X Cumulative    Y Percentage    Y Cumulative
1          70              70              63              63
2          28              98              22              85
3           2             100              10              95

(a) Explained variance for X. (b) Explained variance for Y.

Figure 2.5: Percentage of explained variance.

Observing these values and plots, one finds that two latent vectors explain 98% of the variance of X and 85% of that of Y. This suggests keeping these two dimensions for the final solution. For this reason, scores and loadings can be analysed by means of these two PCs, so that the score and loading plots focus on the first two latent vectors (Figure 2.6).

(a) Score plot: wine1–wine5 plotted on PC1 vs PC2. (b) Loading plot: Price, Sugar, Alcohol and Acidity plotted on PC1 vs PC2.

Figure 2.6: Score and loading plots for X variables.
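As an illustration, the wine example can be reproduced with the R package pls (assumed available; the function plsr and the plotting options are taken from that package's documented interface, and the exact numbers may differ from the tables above depending on the scaling and algorithmic options chosen).

# Hedged sketch: PLS regression on the wine data of Table 2.4 with the pls package.
library(pls)
wines <- data.frame(
  Price   = c(7, 4, 10, 16, 13),
  Sugar   = c(7, 3, 5, 7, 3),
  Alcohol = c(13, 14, 12, 11, 10),
  Acidity = c(7, 7, 5, 3, 3),
  Hedonic = c(14, 10, 8, 2, 6),
  Meat    = c(7, 7, 5, 4, 2),
  Dessert = c(8, 6, 5, 7, 4),
  row.names = paste0("wine", 1:5)
)
X <- as.matrix(wines[, 1:4])                   # predictor block
Y <- as.matrix(wines[, 5:7])                   # response block
fit <- plsr(Y ~ X, ncomp = 3, scale = TRUE, validation = "LOO")
summary(fit)                                   # explained variance per component
plot(fit, plottype = "scores", comps = 1:2)    # score plot (PC1 vs PC2)
plot(fit, plottype = "loadings", comps = 1:2)  # loading plot (PC1 vs PC2)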

2.3 PLS tasks

In this Section the PLS tasks are considered and described. Indeed, PLS can be used as a regression or classification tool as well as for dimension reduction. Moreover, PLS also deals with nonlinearity^9. Regression, as previously said, includes any technique for modelling an asymmetric relation between two blocks of variables, labels and inputs, where the former are constrained to be continuous values. Linear regression is the most common approach and assumes that the structural form of the relation is linear, i.e. the conditional mean (seldom the median or some other quantile) of the response variable given the value of the explanatory variables is an affine function with respect to the parameters. PLS, as a regression tool, models a linear relation between a few latent variables, instead of a large number of measured variables, and in this way it simplifies the original system. Classification differs from regression in the kind of label taken into account. Indeed, it is restricted to categorical response variables whose values correspond to classes, levels, modalities, or categories. PLS performs classification by encoding the class membership in an appropriate indicator matrix. Then, after extraction of the relevant latent vectors, the best classifier is produced by the algorithm. As regards dimensionality reduction, among the many advantages of the PLS approach there is the ability to analyze the importance of individual observed variables, potentially leading to the deletion of irrelevant ones. This mainly occurs in the case of experimental design, where many insignificant terms are measured. In such situations, PLS can guide the practitioner towards more compact experimental settings, with a significant cost reduction and without the high risk associated with “blind” variable deletion. Considering nonlinearity, PLS can be a competing alternative to traditional methods, even if it originally assumes linear relations between variables. Indeed, the classical approach of PLS can be extended to model nonlinearity by two major methodologies. One is based on a nonlinear function between scores still computed as linear combinations of the original variables; the other applies a nonlinear mapping to build a subspace where linear PLS can be applied.

2.3.1 Regression

In order to define PLS regression (PLS-R), the PLS decomposition (Equation 2.2) should be combined with the assumption of a linear relation between the score vectors T and U (Equation 2.4). Considering the case of a unidimensional response variable

9 Since the previous Chapter (Section 1.1) explained all these topics in their general form, only the main concepts are briefly recalled here. Then, the following Subsections will describe in more detail how PLS implements the different tasks.

y, the former result becomes y = Uq + f, so that the following relation holds

y = TDq + Hq + f
  = Tc + f*     (2.6)

This equation is simply the decomposition of Y using OLS regression^10 on the orthogonal predictors T. The m-dimensional regression coefficient vector^11 c = Dq collects the m weights {c_i}_1^m, not scaled to length one, and f* = Hq + f denotes the n-dimensional residual vector. It is useful to redefine Equation 2.6 in terms of the original predictors X. To do this, orthonormalized score vectors are now considered, that is, T^T T = I = U^T U. Moreover, W denotes the (k × m) matrix that collects the vectors of weights w computed as coefficients of regression between X and the score vectors u in all m iterations of the NIPALS algorithm. Assuming without loss of generality that X = TP^T from Equation 2.2 and multiplying by W, T can be obtained as in the following steps

X = TP^T
XW = TP^T W
T = XW(P^T W)^{-1}

Now, Equation 2.6 becomes

y = XW(P^T W)^{-1} c + f*

With the assumption of orthonormalized score vectors the following three results can be simply obtained. From Equation 2.6 it holds that c = T^T y. Similarly, P^T = T^T X is computed from the relation X = TP^T, whereas W = X^T U by definition. So, the relation is equivalent to

y = XX^T U(T^T XX^T U)^{-1} T^T y + f*

Usually this relation is easily written as

y = X b̂_PLS + f*
b̂_PLS = X^T U(T^T XX^T U)^{-1} T^T y = W(P^T W)^{-1} c     (2.7)

where b̂_PLS represents the k-dimensional vector of PLS regression coefficients. Different scalings of the individual score vectors t and u do not influence b̂_PLS.

10 Further details about OLS are given in Section 3.2, which presents a complete overview of this method.
11 Attention should be paid to the fact that the m-dimensional vectors (so columns by definition) q and c are a row of the (d × m) matrices Q and C, respectively.
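A hedged R transliteration of Equation 2.7 is given below; it assumes that W, P and the unscaled Y-weights c come from a PLS1-type run (i.e. with the regression deflation Y − tc^T of the previous Subsection).

# b_PLS = W (P'W)^{-1} c: coefficients on the original centred predictors.
pls_regression_coefficients <- function(W, P, c_vec) {
  as.vector(W %*% solve(crossprod(P, W)) %*% c_vec)
}
# Example use (names illustrative): y_hat <- X %*% pls_regression_coefficients(W, P, c_vec)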

2.3.2 Classification

PLS can also be used for classification analysis (Barker and Rayens, 2003 [2]). In the following, the preliminary assumptions that define the background traditionally used in classification analysis are presented. The (n × k) matrix X = {x_i ∈ X ⊂ IR^k}_{i=1}^n collects all the data of the set of n instances. It is still assumed to be zero mean. Moreover, every observation belongs to a class: G denotes the number of classes and n_g is the number of instances in the g-th class, so that sum_{g=1}^G n_g = n. Statistical indexes can be computed both for every group and for the whole dataset. For example, if x_{g,i} represents the k-dimensional vector for the i-th observation in the g-th class,

x̄_g = (1/n_g) sum_{i=1}^{n_g} x_{g,i}          x̄ = (1/n) sum_{g=1}^{G} sum_{i=1}^{n_g} x_{g,i} = (1/n) sum_{g=1}^{G} n_g x̄_g

are the k-dimensional vectors that record, for every variable, its sample mean in the g-th group and its sample mean over the whole matrix X, respectively. Furthermore, the (k × k) matrices

SS_b = sum_{g=1}^{G} n_g (x̄_g − x̄)(x̄_g − x̄)^T          SS_w = sum_{g=1}^{G} sum_{i=1}^{n_g} (x_{g,i} − x̄_g)(x_{g,i} − x̄_g)^T     (2.8)

define the between-classes and within-classes sums of squares, respectively. Between, among or inter class are synonymous, whereas within equals intraclass. The (n × G−1) matrix Y records for every observation its class membership. It is a collection of dummy variables that represent which group every instance belongs to. It is equal to

        | 1_{n_1}   0_{n_1}   . . .   0_{n_1}     |
        | 0_{n_2}   1_{n_2}   . . .   0_{n_2}     |
    Y = |    .         .      . . .      .        |
        |    .         .      . . .   1_{n_{G-1}} |
        | 0_{n_G}   0_{n_G}   . . .   0_{n_G}     |

where 0_{n_g} and 1_{n_g} are (n_g × 1) vectors of all zeros and ones, respectively. The matrix Y is zero mean, too. Two-class classification corresponds to a one-dimensional Y-space. With these assumptions, the previously proposed orthonormalized PLS method can be modified in order to define a special case of PLS that proves useful for classification purposes. The one-dimensional Y-space is considered first. The optimization problem becomes

max_{||w||=||c||=1} [cov(Xw, yc)]^2 = max_{||w||=1} var(Xw)[corr(Xw, y)]^2

Here the Y-space penalty var(y) is not meaningful. As a consequence, it can be removed and only the X-space variance is involved. This modified PLS method is based on

eigensolutions of among-classes sum of squares matrix SSb. Eigenvalue problem (Equation 2.5) with orthonormalized PLS is transformed into

X^T y(y^T y)^{-1} y^T Xw = X^T ỹ ỹ^T Xw = λw     (2.9)

where ỹ = y(y^T y)^{-1/2} is the vector that represents the uncorrelated and normalized y variable. Using the following relation

(n − 1) S_Xy S_y^{-1} S_Xy^T = SS_b

the eigenvectors of Equation 2.9 are equivalent to the eigensolutions of

SS_b w = λw     (2.10)

which corresponds to maximizing the between-class variance. Interestingly, in the case of two-class classification, the direction of the first orthonormalized PLS score vector t is identical to the first score vector found by either the PLS1 or PLS-SB methods. This immediately follows from the fact that y^T y is a scalar here. In this two-class scenario X^T y is a rank-one matrix and PLS-SB extracts only one score vector t. In contrast, orthonormalized PLS can extract additional score vectors, up to the rank of X. In the case of multi-class classification, the rank of the Y matrix is equal to G − 1, which determines the maximum number of score vectors that may be extracted by the orthonormalized PLS-SB method^12. Again, similarly to the one-dimensional Y-space case, deflation of the Y matrix at each step can be done using the score vectors t of PLS2. Consider this deflation scheme in the X and Y-spaces

X_i = X − tp^T = (I − tt^T/(t^T t))X = P_i X

Ỹ_i = P_i Ỹ

where P_i = P_i^T P_i represents a projection matrix. Using these deflated matrices X_i and Ỹ_i, eigenproblem 2.9 can be written in the form

X_i^T Ỹ Ỹ^T X_i w = λw

Thus, similar to the previous two-class classification, solution of this eigenproblem can be interpreted as the solution of Equation 2.10 using the among-classes sum of squares matrix now computed on deflated matrix Xi.

12 It is considered here that G ≤ k; otherwise the number of score vectors is given by k.
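A rough, hedged sketch of the classification use just described (PLS-DA style): class labels are encoded as a dummy indicator matrix, a PLS model is fitted on it, and a new sample is assigned to the class with the largest fitted response. For simplicity the sketch uses a full G-column indicator rather than the (G − 1)-column matrix of the text, reuses the hypothetical nipals_pls function, and leaves the exact coefficient formula (which depends on the deflation variant) as an approximation.

# labels: a factor of length n; X: (n x k) numeric matrix; m: number of components.
pls_classify <- function(X, labels, m) {
  Y  <- model.matrix(~ labels - 1)                   # (n x G) indicator matrix
  Xc <- scale(X, center = TRUE, scale = FALSE)       # zero-mean blocks, as assumed above
  Yc <- scale(Y, center = TRUE, scale = FALSE)
  fit <- nipals_pls(Xc, Yc, m)
  B <- fit$W %*% solve(crossprod(fit$P, fit$W)) %*% t(fit$C)
  fitted <- Xc %*% B                                 # fitted class scores
  levels(labels)[max.col(fitted)]                    # predicted class per instance
}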

2.3.3 Dimension reduction and nonlinear PLS

Current research leads to the use of new instruments that capture data properties better. Due to the wide application of such improved tools in several real situations, the amount of available data has grown fast in recent years. However, with these tall and fat datasets (large n and k, the numbers of samples and variables respectively, usually with n << k), traditional regression algorithms often scale poorly. Computers must indeed handle huge amounts of data per second, and presenting such inputs directly to a complex method may be computationally infeasible. Besides this, another problem can appear, since collinearity is usually present in such big datasets. In order to successfully apply a regression or classification technique, improving both algorithmic scaling and multicollinearity, the most direct solution consists of three steps: i) finding a new restricted set of variables from which multicollinearity is removed; ii) projecting the original observed data onto this low-dimensional subspace; iii) analyzing the projections in this space, i.e. building a regression model or a classifier on this new representation. The first step is accomplished by dimension reduction techniques, and one way to reach its goal is creating new variables, called latent vectors, as linear combinations of the original ones in such a way that collinearity is deleted^13. Dimension reduction can also be seen as a preprocessing tool, since it is usually a preliminary step of a complete data analysis whose aim is data modelling. At the end, the resulting projections onto the just defined low dimensional space can be used with any standard regression or classification method. Although originally proposed as a regression method, PLS is a dimensionality reduction technique, widely used in different problems because of its effective performance. PLS is indeed a powerful tool that provides dimensionality reduction for even hundreds of thousands of variables, which are reduced to a much smaller subspace (with a dimension of the order of tens). In particular, PLS dimensionality reduction is performed by obtaining the weight vectors w that define the directions of the new subspace. Then, the observed data X can be projected by the weight matrix W. Moreover, since the PLS method accounts for information of the Y space, it can be called a supervised, or class aware, dimensionality reduction technique, as opposed to traditional ones that are unsupervised. In addition, PLS finds new variables that also preserve useful information. These features are then used as inputs to a regression algorithm and they can prove very effective. Finally, PLS dimension reduction might also be performed in order to speed up computation. PLS indeed finds latent variables that are fast to compute.
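As a hedged sketch of this use, the weight matrix learnt on a training block can serve as a fixed projection for new observations, whose latent coordinates then feed any standard model (all the objects below, X_train, X_test, y_train, are illustrative and assumed to exist).

# Supervised dimension reduction with PLS weights, then an ordinary downstream model.
fit <- nipals_pls(X_train, cbind(y_train), m = 10)   # hypothetical training blocks
Z_train <- X_train %*% fit$W                         # compressed training features (n x 10)
Z_test  <- X_test %*% fit$W                          # same projection applied to new data
downstream <- lm(y_train ~ Z_train)                  # any regression or classifier could follow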

In many areas of research and in industrial situations data can exhibit nonlinear behaviours. Among other methods, PLS too can be used to model such nonlinearity. However, since the PLS technique is originally based on linear assumptions, it should be

13 This strategy belongs to the family of unsupervised methods called feature extraction, as seen in the previous Chapter (Subsection 1.1.1).

slightly modified in order to be effective. As a consequence, some further hypotheses should be made in addition to its base statements^14. In particular, two major methodologies exist for modelling nonlinearity. The first group of approaches is based on reformulating the considered linear relation 2.4 between the score vectors t and u by a nonlinear model

u = g(t) + h = g(X, w) + h     (2.11)

where g(·) represents a continuous function modelling the existing nonlinear relation. Again, h denotes a vector of residuals. Polynomial functions, smoothing splines, artificial neural networks or radial basis function networks have been used to model g(·). The assumption that the score vectors t and u are linear projections of the original variables is kept. This leads to the necessity of a linearization of the nonlinear mapping g(·) by means of Taylor series expansions and to the successive iterative update of the weight vectors w. The second approach to nonlinear PLS is based on a mapping of the original data by means of a nonlinear function to a new representation (data space) where linear PLS is applied. Following this idea, nonlinear PLS can potentially be defined using, for instance, the kernel trick. The recently developed theory of kernel-based learning can indeed also be applied to PLS. This approach allows one to extend linear data analysis tools and to deal with nonlinear aspects of measured data. Moreover, the powerful machinery of kernel PLS keeps the computational and implementation simplicity of linear PLS and at the same time has been proven to be competitive with respect to other traditional kernel-based regression and classification methods.
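A minimal sketch of the second strategy follows (the explicit quadratic expansion is chosen only for illustration; a genuine kernel PLS would instead work with a kernel matrix and never form the expansion explicitly).

# Map the predictors through a fixed nonlinear expansion, then run linear PLS on it.
expand_quadratic <- function(X) cbind(X, X^2)          # original variables and their squares
Phi <- scale(expand_quadratic(X), center = TRUE, scale = FALSE)
fit_nl <- nipals_pls(Phi, Y, m = 5)                    # linear PLS in the expanded space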

2.4 PLS applications

PLS is nowadays considered a very powerful and versatile data analytical tool in many research areas and industrial applications, where it is used in order to analyze real world data and to understand existing relations between them. PLS is indeed characterized by some interesting properties that allow the PLS method to be applied successfully in many actual situations. First of all, PLS solves problems due to collinearity of the data and limited sample size, two common features of available datasets. Then, it does not require any assumption about the data distribution. Thirdly, it assures computational and implementation simplicity with huge datasets, too. PLS history confirms such multidisciplinary applicability even if, despite its advantages, initially PLS did not develop easily. Its expansion was indeed slowed down because it was difficult to position PLS within a statistical context. As previously said, PLS was introduced by Wold in the sixties, when he worked

14 Here the basic ideas about nonlinear PLS are presented, whereas Chapter 4, after a general overview, gives more details, in particular in Section 4.3. In addition, Rosipal has provided a complete description of nonlinear PLS in his recent work (Rosipal, 2011 [55]).

on Econometric models related to estimation methods for systems of simultaneous equations^15. Indeed, unlike his contemporary colleagues, who preferred maximum likelihood, Wold studied different estimation techniques based on least squares and using iterative procedures. During the seventies Wold continued to apply the PLS technique to new issues, but without remarkable results. In the eighties, research interest in PLS shifted to questions of Chemistry, into what is now known as Chemometrics (the application of statistical methods to chemical data). The person responsible for this transition was Wold's son. In 1983 S. Wold together with H. Martens adapted NIPALS to solve the problem of collinearity in linear regression models. This was the turning point. PLS indeed first received a great amount of attention in Chemometrics, where the algorithm became a standard tool for processing a wide range of chemical data problems, both organic and analytical. As a consequence, the popularity of PLS resulted in many applications in other scientific fields. Now PLS is applied in the most different subjects, from Marketing to Engineering, from Medicine to Computer Science. Hereinafter some examples are presented.

2.4.1 Chemometrics

Chemometrics is the science of extracting information from chemical systems through data driven tools. Indeed, it solves problems in chemistry and related experimental life sciences such as biochemistry, medicine, biology and chemical engineering. Recently it has emerged with a focus on analyzing data originating mostly from organic and analytical chemistry, food research, and environmental studies. Moreover, it is highly interdisciplinary and it uses methods frequently employed in core data analytic subjects such as multivariate Statistics, applied Mathematics, and Computer Science. In particular, statistical methodology has been successfully applied to many types of chemical problems for some time. Experimental design techniques have had, for instance, a strong impact on understanding and improving industrial chemical processes. However, chemometricians also suggest their own techniques, based on heuristic reasoning and intuitive ideas, and there is a growing body of empirical evidence that they perform well in many situations. As a consequence, standard chemometric methodologies are widely used industrially. Moreover, analytical instrumentation and methodology are improved following the developments of chemometric methods. At the same time, academic groups continue to study chemometric theory and applications. For this reason, Chemometrics can be defined an application driven discipline. The aim of Chemometrics is to develop a model which can be used for descriptive or predictive purposes. In the first case, the properties of chemical systems are modeled with the intent of describing, understanding and identifying the underlying relations and structure of the studied phenomenon. Such descriptive analyses include data exploration through

15 Before, Wold was already outstanding in Social Science topics such as statistical demand analysis or utility theory.

principal components and cluster analysis, as well as modern computer graphics. On the other hand, the model is developed to predict properties and behaviour of interest. Predictive modelling (regression and classification) is indeed an important goal in most applications. For this reason, regression analysis on observational data forms a major part of chemometric studies. Chemometric analyses are based on measured features of the chemical system such as pressure, flow, temperature, infrared/visible/mass spectra. Such datasets can be small, but they are often very large and highly complex, involving hundreds to thousands of both variables and observations, even if usually they are characterized by many measured variables on each of a few cases, so that the number of variables k greatly exceeds the observation count n. Furthermore, variables resulting from the digitization of analog signals, as in infrared/UV/visible spectroscopy, mass spectrometry, nuclear magnetic resonance, atomic emission/absorption and chromatography experiments, are all by nature highly collinear. However, even if datasets may be highly multivariate, a strong and often linear low-rank structure is present. As a consequence, these data suggest the use of techniques such as PLS. PLS first succeeded in Chemometrics as an alternative to OLS in the poorly or ill conditioned problems encountered there. Then, PLS has been shown over time to be very effective at empirically modelling the more chemically interesting latent variables, exploiting their interrelations in the data, and providing alternative compact coordinate systems for further numerical analyses. So, PLS has been heavily promoted in the corresponding literature and it became the most popular regression method used in Chemometrics. Finally, after PLS had been heavily used in chemometric applications for many years, it began to find regular use in other fields. In nearly all chemometric analyses, variables are standardized (autoscaled). Then the computations are applied to the standardized quantities and the resulting solutions are transformed back to the original locations and scales. As a consequence, regression methods are always assumed to include constant terms, thus making them invariant with respect to variable locations, so that translating them to all have zero means is simply a matter of convenience or numerics. Most of these techniques are not, however, invariant to the relative scaling of the variables. So, choosing them to all have the same scale is a deliberate choice on the part of the user. A different choice would give rise to different estimated models. The main advantages of multivariate tools such as PLS are that fast, cheap, or nondestructive analytical measurements can be used to estimate sample properties which would otherwise require time consuming, expensive, or destructive testing. Equally important is that multivariate methods allow for accurate quantitative analysis in the presence of heavy interference by other analytes. The selectivity of analytical techniques is provided as much by mathematical computations as by analytical measurement modalities. For example Near InfraRed (NIR) spectra, which are extremely broad and nonselective compared to other analytical measurements such as infrared spectra, can often be used successfully in conjunction with carefully developed multivariate methods to predict concentrations of analytes in very complex matrices.

Examples of chemometric applications include the analysis of spectroscopic data consisting of multi-wavelength spectral measurements. In analytical chemistry, spectroscopy is applied to solutions with known concentrations of a given compound. Then, multivariate regression techniques, such as PLS, are used to relate the concentration of the analyte of interest to the infrared spectrum. Once a regression model is built, the unknown concentration can be predicted for new samples, using the spectroscopic measurements as predictors. The advantage is obvious if the concentration is difficult or expensive to measure directly.
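A purely synthetic, hedged illustration of this calibration workflow with the R package pls follows (all numbers and the shape of the fictitious spectra are made up; only the structure of the calls matters).

# Calibrate a concentration against simulated multi-wavelength spectra, then predict.
library(pls)
set.seed(1)
n <- 40; wl <- 200                                   # samples and wavelengths (made up)
conc <- runif(n, 0, 10)                              # reference concentrations
band <- exp(-((1:wl) - 80)^2 / 200)                  # a fictitious pure-compound band
spectra <- outer(conc, band) + matrix(rnorm(n * wl, sd = 0.05), n, wl)
train <- data.frame(conc = conc, spectra = I(spectra))
calib <- plsr(conc ~ spectra, ncomp = 5, data = train, validation = "CV")
new_spectra <- outer(runif(3, 0, 10), band)          # three "unknown" samples
predict(calib, newdata = data.frame(spectra = I(new_spectra)), ncomp = 2)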

2.4.2 Computer vision

PLS has nowadays been successfully adopted also in several applications of Computer Science, for example bioinformatics and computer vision. Among others, a core business of this last discipline is the development of devices that involve people's locations and movements. Such tools produce many pattern recognition problems, such as image processing, human detection and face recognition. As a consequence, over the last few years the issue of detecting humans in single images has received considerable interest and significant research has been devoted to locating and tracking people in images and videos. Then, variations in illumination, shadows, and pose, as well as frequent inter- and intra-person occlusion, render the definition of effective techniques for human detection a challenging task. Two main approaches to human detection have recently been explored. The first class of methods consists of a generative process where detected parts of the human body are combined together according to a prior human model. The second group considers purely statistical analyses that combine a usually very big set of low level variables within a detection window in order to classify the window as containing a human or not. In particular, every detection window is firstly decomposed into overlapping blocks. Then, for each block of the window, many variables are extracted. Once the variable extraction process is performed for all blocks inside a detection window, the variables are concatenated to construct an extremely high-dimensional vector. As regards typical variables related to human detection problems, the main information is collected by grids of Histograms of Oriented Gradients (HOG). Humans in standing positions have distinguishing features, such as strong vertical edges that are present along the boundaries of the body. HOG descriptors look at the spatial distribution of the edge orientations and capture edge or gradient structures that are characteristic of local shape. Moreover, HOG has some nice properties. Since it is computed for regions of a given size within the detection window, it is robust to some location variability of body parts and it is also invariant to rotations smaller than the orientation bin size. In addition, HOG is a low level variable which outperforms features such as wavelets and shape contexts. Other sources of information prove very useful in order to avoid a number of false positive detections. Indeed, a strong set of variables can provide high discriminatory power, reducing the need for complex classification methods. As a consequence, a significant improvement in human detection can be achieved using different types or combinations of low level features. These complement classical variables and are, for example, the homogeneity of textures and the typical colours of both human clothing and background, or particularly skin colour. Both clothing and background textures are generally uniform, but the former is very different from natural textures observed outside of the body due to constraints on the manufacturing of printed cloth. As a consequence, variables are extracted from co-occurrence matrices, a method widely used for texture analysis, since they provide information regarding the homogeneity and directionality of patches. Co-occurrence matrices represent second order texture information, i.e. the joint probability distribution of gray-level pairs of neighboring pixels in a block.
Descriptors used are for example angular second moment, contrast, correlation, variance, inverse difference moment, sum average, sum variance, sum entropy, entropy, difference variance, difference entropy, and directionality. The second type of information captured is colour. Although colours may not be consistent due to variability in clothing, certain dominant colours are more often observed in humans, mainly in the face/head regions. So discriminatory colour information is found both in clothes and in the face/head regions. In order to incorporate colour, the original HOG is used to extract a descriptor called colour frequency. In spite of its simplicity, colour frequency increases detection performance. Typical low level variables for human detection are thus the original HOG descriptors with additional colour information, called colour frequency, and texture variables computed from co-occurrence matrices. A consequence of such variable augmentation is an extremely high dimensional space, rendering many classical Machine Learning techniques, such as SVM, intractable. In contrast, the number of observations in the dataset is much smaller. Furthermore, variables are extracted from neighboring blocks within a detection window, which increases the multicollinearity of the data. Since the number of detection windows within an image is very high (tens of thousands for a 640 × 480 image scanned at multiple scales), it is crucial to obtain good detection results at very small false alarm rates. These properties of datasets related to pattern recognition problems make an ideal setting for some multivariate methods such as PLS. Indeed, in this case the high dimensional vector should first be analyzed by a technique that reduces its dimensionality. Then, a simple and efficient classifier can be used to evaluate the window as containing a human or not. More formally, PLS projects the high dimensional vector onto a set of weights, resulting in a lower dimensional vector that can be handled by traditional classification methods. Recently, PLS methods have been used for many image processing, human detection and face recognition problems. Here, PLS is primarily used as a supervised dimensionality reduction tool to effectively combine several low level features. PLS indeed assures better learning, enhancing detection and recognition, and adapted PLS was shown to greatly improve the performance, especially for two-class problems. Finally, the PLS method outperforms other approaches that utilize many more sources of information such as depth maps, ground plane estimation, and occlusion reasoning (Schwartz et al., 2009 [61]).

Other examples of PLS application in pattern recognition problems are speaker recognition, experiments on finger movement detection, and cognitive fatigue prediction (Srinivasan et al., 2011 [63]). Considering the last situation, a significant reduction of the recording electrodes has been achieved by the use of a PLS model without loss of classification accuracy. As regards speaker recognition, it deals with the task of verifying a speaker's claimed identity from a sample utterance, based on a number of training utterances for which the speaker is known. However, the speech data, apart from carrying the speaker specific characteristics, are often subject to noise and reverberation, since they also encapsulate phonemic content, channel variability, and session variability. Nevertheless, variability in the phonemic content can be removed by posing the problem of recognition over a collection of data spanning several utterances. As a consequence, speaker recognition systems should be able to learn the between class separability in a supervector setting, where substantial progress in rejecting channel/session variability has been made via several techniques. So, the goal in supervector space is to discriminate between a speaker and impostors by accounting for the speaker variability while ignoring nuisance information. Commonly, only a few (often one) speeches from a very large database belong to the target speaker, which necessitates the use of a method capable of learning from not many observations in a very high dimensional space. PLS is able to do that, since it learns a unique latent space for each speaker and the dimension of the latent space is much lower than that of the original variable space. Then, PLS classifies speech in this latent space.

2.5 Software for PLS

Since PLS needs sophisticated computations, as previously seen with the description of its mathematical structure and algorithm (Section 2.2), its fruitful application depends on the availability of suitable software. Fortunately, nowadays the PLS methodology is usually integrated within all the main computer programs for multivariate analysis. Such software can have a more general applicability, as solutions working in several industries, or be more subject oriented, as specific applications for well defined domains. For example SPM, which integrates a PLS regression module, is one of the most widely used software packages for brain imaging. Similarly, there are some R packages that implement PLS methods for microarray data, such as plsgenomics. However, only the former type of software is considered here. Moreover, software can be free or proprietary. Hereinafter the most common and worldwide used software packages for multivariate and PLS analysis are listed, briefly presenting their key properties or adding some interesting historical information.

Simca v. 13, MKS Umetrics AB, Sweden is a state-of-the-art suite of software products for multivariate analysis. Indeed, with the latest graphics functionality and leading edge analytical methods, it offers all the tools to manage data, from simple visualization to advanced batch modeling. In this way, it enables one to effectively explore data, analyze processes and interpret the results. As a consequence, it transforms data into information, allowing decisions to be made quickly and with confidence. Simca has some interesting properties. First of all, it is, and it has always been, a front runner in the latest multivariate technology, continuously developed and updated. For this reason, the PLS regression method was included from the beginning and remains still today a core part of the software. Secondly, it has a completely updated interface with amazing usability, interactivity and visualization. Thirdly, some tools for helping users are provided online and can be downloaded: a user guide and tutorial in order to get started by running through a series of examples step by step; a brief introduction to the methodology together with a knowledge base about multivariate data analysis; the technical support menu to find out how to get assistance. Finally, the best way to learn everything about the methods and how to use Simca to transform data into information is to attend a training course. With these qualities, for many years Simca has been the standard tool for scientists, engineers, researchers, product developers and others coping with large amounts of data. Indeed, it can be applied in many industries and domains, for example Pharma R&D, Semiconductor, Plastics, Chemicals, Pulp and Paper, Metals, Minerals and Mining, Food and Beverage, Manufacturing. In particular, it is the standard for Process Analytical Technology and Quality by Design solutions used in pharmaceutical and biopharmaceutical development and production. Regarding its origins, Umetrics AB^16 was founded in 1987 (with the name Umetri AB, changed in 1999) by a group from the Dept. of Organic Chemistry at Umeå University. Professor S. Wold was a member of this group, which dealt with multivariate data analysis and design of experiments under the name of Chemometrics. The first business idea was to offer consulting and teaching on these topics to the European chemical and pharmaceutical industry. Then, the Umetrics market expansion program continued, until in 1991 the decision to develop their own multivariate software package resulted in an MS-DOS software called SIMCA-4R. In the same year a small package for design of experiments was also released and on-line software started to be designed. In 1992 SIMCA-P 3.0 for Windows was born. Since those years Umetrics AB has continued to develop and improve a broad range of software products for multivariate data analysis and design of experiments.

16 For more information please visit the web site http://www.umetrics.com/

The Unscrambler® X v. 10.2 is a software package for fast, smart, easy and accurate multivariate analysis of large and complex datasets. For this reason, it provides several new analytical methods, such as advanced regression and classification modeling tools, that assure deeper insights and better predictions from data. It includes for example PCA, PCR, MLR, LDA, PLS for regression and discriminant analysis, and SVM. In addition, it is the only major software application combining the power of multivariate analysis with Design of Experiments functionality in one seamless package. It was originally developed in 1986 by H. Martens, and later by Camo Software AS, Oslo, Norway^17, which has set the standard in multivariate analysis for over 25 years. The latest version is optimized for even greater usability, improved stability, increased security and reliability. Firstly, it is a program with exceptional ease of use, intuitive workflows, outstanding graphics and interactive data visualization. Secondly, it includes easy data import options that accept a wide range of data formats. Thirdly, it also assures powerful exploratory data analysis tools, such as cluster analysis, that make data mining easier and allow one to cut through large datasets to identify underlying patterns quickly and easily. Then, it has extensive data preprocessing options that ensure that the data are suitable for multivariate analysis. As added value, Camo Software AS offers an extensive knowledge base and help on hand. Indeed, the software has detailed tutorials written by world-leading experts, which explain the theory and application of multivariate analysis methods, plus practical examples and tips on using the software. Moreover, compliance mode, digital signatures, password access, Windows domain authentication and audit trails provide the necessary security requirements for regulated industries. In addition, it has a flexible and adaptable architecture. Plug-in modules for specific methods or file formats give it the flexibility to meet the needs of any industry. Its models can also be integrated into third party applications. Finally, it is suitable for almost any company and helps clients develop, manufacture and market products faster and more cost effectively. Indeed, it can be used for R&D, Engineering, Manufacturing, Quality Control, data mining and analysis of customer, product or transaction data, and predictive modeling in a wide range of industries from Aerospace and Defense to Food and Beverages, from Banking and Financial Services to Agriculture. It is also decisive for scientists working in process analysis, Chemometrics, spectroscopy, metabolomics or sensometrics, and it can be a useful support for lecturers wanting to introduce multivariate analysis in their syllabus with an easy-to-learn, user-friendly package.

R^18 is a language and environment for statistical computing and graphics. Indeed, it provides a wide variety of techniques, such as linear and nonlinear modelling, classical statistical

17 The reference web site is http://www.camo.com/
18 R Development Core Team (2011). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.

classical statistical tests, time-series analysis, classification and clustering, and it is highly extensible through new packages. R is a GNU project available since 1995 as Free Software in source code form under the terms of the Free Software Foundation's GNU General Public License (GPL) from the Comprehensive R Archive Network (CRAN)19. It compiles and runs on a wide variety of UNIX platforms and similar systems, Windows and MacOS. Regarding PLS, the pls package implements PCR and PLS and is freely available under the GNU GPL. The package was written by R. Wehrens and B.-H. Mevik and started as a merge of Wehrens's earlier package pls.pcr and an unpublished package by Mevik. A description of the package was published in R News (Mevik, 2006 [44]), whereas a slightly longer description can be found in an article by Mevik and Wehrens (Mevik and Wehrens, 2007 [45]). The pls package has many features. Firstly, it includes several algorithms, such as the traditional NIPALS algorithm (orthogonal scores), kernel PLS, wide kernel PLS, SIMPLS and PCR through singular value decomposition, and it handles multi-response models, i.e. it allows PLS2. Secondly, it offers flexible cross-validation and jackknife variance estimates of regression coefficients. Thirdly, it provides extensive and customizable plots of scores, loadings, predictions, coefficients, (R)MSEP, R2 and correlation loadings. It also has a formula interface that follows the classical linear model function “lm()”, with methods for tasks such as predict, print, summary, plot, update, etc. In addition, there are extraction commands for coefficients, scores and loadings, whereas MSEP, RMSEP and R2 estimates can be computed. Multiplicative scatter correction is also available. Finally, there are some interesting add-ons like VIP.R and specplot.R. The former implements a variable selection method called Variable Importance in Projection that can be used when multicollinearity is present, even if it currently only works with single-response orthogonal-scores PLS regression models. The latter is a function to interactively plot spectra that allows zooming and panning of the spectra, even if so far only horizontally. There are some other packages related to PLS. The package autopls is an extension of the pls package with automated backward selection of predictors. The package plsdepot contains different methods for PLS analysis of one or two data tables, such as NIPALS, SIMPLS, PLS Regression and PLS Canonical Analysis. The package plsdof provides Degrees of Freedom estimates and statistical inference for PLS Regression; model selection for PLS is based on various information criteria (AIC, BIC) or on cross-validation, and estimates for the mean and covariance of the PLS regression coefficients are available, allowing the construction of approximate confidence intervals and the application of test procedures. Further, cross-validation procedures for RR and PCR are available. The package plsRglm performs Partial Least Squares Regression for (weighted) generalized linear models and k-fold cross-validation of such models using various criteria; it allows for missing data in the explanatory variables.

19 All the information, downloads and manuals can be found on the web site http://cran.r-project.org/

Bootstrap confidence interval construction is also available. The package ppls implements penalized PLS and contains linear and nonlinear regression methods based on PLS and penalization techniques; model parameters are selected via cross-validation, and confidence intervals and tests for the regression coefficients can be obtained via jackknifing.
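To make the interface of the pls package more concrete, the following minimal sketch (in R) fits a PLS1 model with the NIPALS (orthogonal scores) algorithm and leave-one-out cross-validation. It uses the gasoline example data shipped with the package; the number of components shown is purely illustrative and would normally be chosen from the cross-validated RMSEP curve.

\begin{verbatim}
# Hedged sketch of a PLS1 fit with the pls package; 'gasoline'
# (NIR spectra with measured octane numbers) ships with the package.
library(pls)
data(gasoline)

# NIPALS ("orthogonal scores") PLS with leave-one-out cross-validation
fit <- plsr(octane ~ NIR, ncomp = 10, data = gasoline,
            method = "oscorespls", validation = "LOO")

summary(fit)                      # explained variance and cross-validated RMSEP
plot(RMSEP(fit))                  # pick the number of latent components
b    <- coef(fit, ncomp = 5)      # regression coefficients, 5-component model
yhat <- predict(fit, ncomp = 5)   # fitted responses
\end{verbatim}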

Outside these domains, SAS/STAT® software is probably the most readily available program that provides a complete, comprehensive set of tools for data analysis. It works in both specialized and general enterprise application environments and enables users to take advantage of all their data in order to uncover new business opportunities and increase revenue. SAS/STAT moves the scientific discovery process forward by applying the latest statistical techniques: it includes not only tools for traditional statistical analysis of variance and predictive modelling, but also exact methods and statistical visualization techniques. Moreover, statistical procedures are constantly updated to reflect the latest developments in statistical methodology, thus enabling users to go beyond the basics for more advanced statistical analyses. In particular, among the many methods there is regression, which also includes a PLS procedure implemented with the PROC PLS statement. Whereas OLS has the single aim of minimizing sample response prediction error, seeking linear functions of the predictors that explain as much variation in each response as possible, the techniques implemented in PROC PLS have the additional goal of accounting for variation in the predictors, under the assumption that directions in the predictor space that are well sampled should provide better prediction for new observations when the predictors are highly correlated. In PROC PLS data and methods can be specified, as well as some other options. First of all, PROC PLS fits models using any one of a number of linear predictive methods, for example PCR, reduced rank regression20 and PLS, which balances the two objectives of explaining response variation and explaining predictor variation. Secondly, PROC PLS allows two different formulations for PLS: the original method of Wold and the SIMPLS method of de Jong. Thirdly, PROC PLS enables the number of extracted factors to be chosen by cross-validation; various methods of cross-validation are available, including one-at-a-time validation, splitting the data into blocks, and test-set validation. In addition, a general linear modelling approach can be used to specify a model for the design, allowing for general polynomial effects as well as classification or ANOVA effects. Finally, the model fit by the PLS procedure on a data set can be saved and applied to new data by using the SCORE procedure. SAS/STAT handles large data sets from disparate sources, so that analysts are freed to focus on analysis rather than data issues.

20 Reduced rank regression, also known as maximum redundancy analysis, extracts factors to explain as much response variation as possible and differs from multivariate linear regression only when there are multiple responses.

Moreover, SAS/STAT provides technical support by experienced master's- and doctorate-level statisticians who can help address almost any question quickly and provide a quality of service not often found with other software vendors. In addition, it supports corporate and governmental compliance: users can produce repeatable code that is easily documented and verified for legal compliance issues. Finally, it allows higher model-scoring performance and faster time to results. When licensed with SAS Model Manager and SAS Scoring Accelerator, SAS/STAT linear models can be published as database-specific functions and used for in-database processing. This eliminates the need to move data between SAS and the database for scoring purposes, reducing the cost, complexity and latency of the scoring process. Thus, the performance of the entire modelling process is improved, enabling faster predictive results and competitive advantage. However, this aspect can become a drawback, because SAS software drives users toward dependency on SAS-specific solutions only (e.g., their proprietary data warehouses). Moreover, some users find the software difficult to use, requiring specific SAS programming expertise. It is also expensive and carries high, unpredictable annual licensing costs. Furthermore, data visualization is integral to analytics, but SAS's graphics have notable shortcomings. Regarding its origins, SAS originally stood for Statistical Analysis System and was born in 1966 at North Carolina State University as software for analyzing agricultural research data. Then, as demand for such software grew, the SAS company21 was founded in 1976 to serve all sorts of customers, from pharmaceutical companies and banks to academic and governmental entities. Nowadays, SAS is a leader in business analytics software and services and the largest independent vendor in the business intelligence market.

The PLS methods can also be implemented in practice with the Statistics Toolbox included in MATLAB®. MATLAB can be extended with many add-on products for specialized application areas, and the Statistics Toolbox belongs to this set, providing key features for mathematics, statistics and optimization. It provides algorithms and tools for organizing, analyzing, and modelling data. In particular, with multidimensional data, Statistics Toolbox users can apply regression or classification for predictive modelling with PLS techniques. Moreover, the Statistics Toolbox includes algorithms that identify key variables with sequential feature selection, transform data with principal component analysis, or apply regularization and shrinkage. In addition, users can generate random numbers for Monte Carlo simulations, use statistical plots for exploratory data analysis, and perform hypothesis tests. Furthermore, the Statistics Toolbox offers specialized data types for organizing and accessing heterogeneous data: dataset arrays store numeric data, text, and metadata in a single data container.

21 The corresponding web site is http://www.sas.com/

Built-in methods then enable users to merge datasets using a common key (join), calculate summary statistics on grouped data, and convert between tall and wide data representations. Finally, categorical arrays provide a memory-efficient data container for storing information drawn from a finite, discrete set of categories. The Statistics Toolbox, other add-on products, and MATLAB itself are designed and developed by MathWorks (Natick, Massachusetts, U.S.A.), the leading developer of mathematical computing software, founded in 198422. MATLAB is a high-performance interactive environment for numerical computation, visualization, and programming that combines comprehensive math and graphics functions with a powerful high-level language. It allows users to analyze data, develop algorithms, and create models and applications. The language, tools, and built-in math functions make it possible to explore multiple approaches and reach a solution faster than with spreadsheets or traditional programming languages. It can be used for a wide range of applications, and more than a million engineers and scientists in industry and academia use this language of technical computing.

XLSTAT-PLS is the module of XLSTAT that includes several features for multivariate analysis. In particular, it provides both traditional modelling tools, such as OLS, and advanced techniques such as PCR, PLS regression and PLS discriminant analysis. XLSTAT23 version 2013.1 is a leading and very complete data analysis and statistical solution for Microsoft Excel®. It was originally designed in 1993 by Addinsoft in order to help users save time by eliminating the need for complicated and risky data transfers between different applications, which had previously been required for data analysis. It uses Microsoft Excel as its interface, providing a wide variety of functions that enhance Microsoft Excel's analytical capabilities. As a consequence, it is integrated into Microsoft Excel, making a familiar environment the ideal tool for everyday data analysis and statistics requirements. It is easy to install and to use, quick and highly reliable, intuitive and user-friendly, modular, didactic, affordable, accessible and available in many languages, automatable and customizable. It does not require learning a new software interface, and it shares data and results seamlessly. It offers strong customer service, such as top-level assistance and support-team responses to any customer inquiry within one business day. For all these reasons, it makes statistical methods and data analysis accessible to everyone, so that people can focus on finding the information hidden in their data. Moreover, it encourages innovation and excellence in analytics in various fields and promotes education, making teachers' and students' lives easier. Furthermore, it is involved in several other programmes to facilitate students' access to the software. As a consequence, it has grown to be one of the most commonly used statistical software packages on the market.

22 The company web site is http://www.mathworks.co.uk/
23 For more information please visit the web site http://www.xlstat.com/en/

STATISTICA is a graphically oriented line of statistics software that has been developed since 1990 by StatSoft®, Inc. to offer, in several areas, levels of functionality not available at the time in any other data analysis software. StatSoft®, Inc. was founded in 1984 in Tulsa, Oklahoma, U.S.A. and is now one of the largest global providers of analytic software24. The latest version, released in 2012, is STATISTICA 11.0, and the line includes many products, for example STATISTICA Advanced. STATISTICA Advanced combines the functionality of STATISTICA Base, STATISTICA Multivariate Exploratory Techniques, STATISTICA Advanced Linear/Nonlinear Models and STATISTICA Power Analysis and Interval Estimation. In particular, STATISTICA Advanced Linear/Nonlinear Models contains a wide array of advanced linear and nonlinear modelling tools; it supports continuous and categorical predictors, interactions, and hierarchical models; it includes automatic model selection facilities as well as variance components, time series, and many other methods; and all analyses incorporate extensive, interactive graphical support and built-in Visual Basic scripting. Among the modules that STATISTICA Advanced Linear/Nonlinear Models features, there is also PLS.

Weka (Waikato Environment for Knowledge Analysis) is a Data Mining software package developed at the University of Waikato, New Zealand25. It is open source software issued under the GNU GPL and the latest version is Weka 3. It is fully implemented in the Java programming language and its algorithms can either be applied directly to a dataset or called from Java code. Moreover, it is well suited for developing new Machine Learning schemes. In addition, it runs on almost any modern computing platform and has a graphical user interface for easy access to its functionalities. Weka contains a comprehensive collection of Machine Learning algorithms and visualization tools that support several standard Data Mining tasks: it implements techniques for data analysis and predictive modelling, covering data preprocessing, clustering, classification, regression, visualization, association rules, and feature selection. Whereas the original non-Java version of Weka, developed from 1993, was primarily designed as a tool for analyzing data from agricultural domains, Weka 3, whose development started in 1997, is now used in many different application areas, in particular for educational purposes and research. Regarding PLS, the package weka.filters.supervised.attribute includes the class PLSFilter, which runs PLS regression over the given instances and computes the resulting matrix for prediction. By default it replaces missing values and centers the data.

24 The company web site is http://www.statsoft.com/
25 For more information please visit the corresponding web site http://www.cs.waikato.ac.nz/ml/weka/

Several options can be specified: turning on debugging output; setting the number of components to compute (the default is 20); updating the class attribute (off by default); turning on replacement of missing values; choosing between the SIMPLS algorithm and the default PLS1; and selecting the preprocessing applied to the data (none, the default centering, or standardization).

3 Related methods

This Chapter briefly presents some methods that share goals and tasks with PLS but belong to a different scientific field. Indeed, even if PLS was initially developed and successfully applied in Chemometrics, it soon became a topic for statisticians as well (Section 3.1). First of all, Ordinary Least Squares is briefly described (Section 3.2), since it is the traditional estimation procedure used for linear regression and can be seen as the benchmark for all the following techniques. Then, PLS is related to Canonical Correlation Analysis, where latent vectors with maximal correlation are extracted (Section 3.3). Moreover, Ridge Regression (Section 3.4), Principal Component Analysis and Principal Component Regression (Section 3.5) are presented. As regression procedures, all these techniques can be cast under a unifying approach called continuum regression (Section 3.6). Furthermore, as regards classification, there is a close connection between PLS and Linear Discriminant Analysis (Section 3.7). The aim of this Chapter is to bring all these methods together into a common framework, attempting to identify their similarities and differences.

3.1 Introduction

Although PLS was heavily promoted and used in Chemometrics, it was at first overlooked by statisticians, since it was considered an algorithm rather than a rigorous statistical method. As a consequence, the properties of PLS eluded theoretical understanding for some time. This led to unsubstantiated claims concerning its performance relative to other regression procedures. In particular, it is often asserted that PLS makes fewer assumptions about the nature of the data; but simply not understanding the nature of the hypotheses being made does not mean that they do not exist. Nevertheless, successful applications of PLS to real-world data increased the attention of statisticians to these techniques, so that interest in the statistical properties of PLS has risen and relevant advances have been made in recent years. The effectiveness of PLS has hence been studied theoretically in terms of its variance and shrinkage properties, and its performance has been investigated in several simulation studies. Some researchers connected PLS with other methods that are better understood in the statistical community, and they showed a very competitive behaviour of PLS. Further studies will surely reveal additional aspects of PLS and will help to better characterize the data structures and problems for which the use of PLS becomes strategic in comparison with other techniques. Finally, connections between individual procedures could help to design new algorithms by joining their good properties, thus resulting in more powerful tools. PLS and other approaches are indeed sometimes competing, but they could also be combined.

3.2 Ordinary Least Squares

Ordinary Least Squares (OLS) is the traditional estimation procedure used for linear regression (Equation 1.3). Considering the simplest form of linear regression, which involves a linear combination of the input variables, the OLS estimator $\hat{\mathbf{b}}_{OLS}$ is formally defined as the solution of

\[
\operatorname*{argmin}_{\mathbf{b}} \|\mathbf{y} - \mathbf{X}\mathbf{b}\|^2 \;=\; \operatorname*{argmin}_{\mathbf{b}} (\mathbf{y} - \mathbf{X}\mathbf{b})^T (\mathbf{y} - \mathbf{X}\mathbf{b}) \;=\; \operatorname*{argmin}_{\mathbf{b}} \operatorname{var}(\mathbf{y} - \mathbf{X}\mathbf{b}) \tag{3.1}
\]

Computing the first derivative and setting it equal to zero, this problem becomes equivalent to the equation

\[
\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{b}) = \mathbf{0} \tag{3.2}
\]

then, applying the method of normal equations

\[
\mathbf{X}^T\mathbf{X}\mathbf{b} = \mathbf{X}^T\mathbf{y} \tag{3.3}
\]

the solution1 is given by

\[
\hat{\mathbf{b}}_{OLS} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \tag{3.4}
\]

The estimation of $\hat{\mathbf{b}}_{OLS}$ requires computing $(\mathbf{X}^T\mathbf{X})^{-1}$, which enters the pseudoinverse $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ of $\mathbf{X}$. As a consequence, the matrix $\mathbf{X}^T\mathbf{X}$ must be invertible, i.e. it should not be singular or, equivalently, its determinant should not be null. This remark suggests which hypotheses the data should satisfy for OLS to be applied correctly and successfully. Indeed, OLS regression requires an (n × k) input matrix with n >> k and linearly independent columns and rows. This means that OLS regression is best used when instances and explanatory variables are uncorrelated. Observations are usually not related to each other, but the collinearity of the variables should be accurately evaluated.

1 In Chapter 1 (Subsection 1.2.1) this result was introduced by presenting the parameter fitting procedure that minimizes an error function, where the most common choice is the Residual Sum of Squares (RSS). The same equation is also obtained following a more general approach for estimating unknown parameters, called maximum likelihood (a comparison with Equation 1.12 can be made).

The OLS estimator has two very interesting and important statistical properties, related to its mean and variance, which are respectively

\[
\begin{aligned}
E[\hat{\mathbf{b}}_{OLS}] &= E[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}] \\
&= E[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\mathbf{b} + \mathbf{e})] \\
&= E[\mathbf{b} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{e}] = \mathbf{b}
\end{aligned} \tag{3.5}
\]

\[
\begin{aligned}
\operatorname{var}[\hat{\mathbf{b}}_{OLS}] &= \operatorname{var}[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}] \\
&= \operatorname{var}[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\mathbf{b} + \mathbf{e})] \\
&= \operatorname{var}[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{e}] \\
&= (\mathbf{X}^T\mathbf{X})^{-1}\Sigma_e
\end{aligned} \tag{3.6}
\]

where $\Sigma_e$ is the diagonal covariance matrix of the independent and identically distributed (i.i.d.) random noise $\mathbf{e}$, containing the variances $\sigma^2$; the random noise is also assumed to have zero mean. Since its mean equals the vector $\mathbf{b}$, the OLS estimator is unbiased. Moreover, by the Gauss-Markov theorem its variance is lower than the variance of any other estimator that is a linear combination of the observations $\mathbf{y}$. As a consequence, being unbiased and efficient, the OLS estimator is called the Best Linear Unbiased Estimator (BLUE). In order to briefly summarize other OLS characteristics, some algebraic properties should first be recalled. In what follows, the eigendecomposition and the singular value decomposition (SVD) will be used2. The SVD allows us to write $\mathbf{X}$ as

\[
\mathbf{X} = \mathbf{R}\boldsymbol{\Gamma}\mathbf{Z}^T \tag{3.7}
\]

with $\mathbf{R}$ and $\mathbf{Z}$ orthonormal matrices consisting of the singular vectors of $\mathbf{X}$, and $\boldsymbol{\Gamma}$ a diagonal matrix collecting the singular values of $\mathbf{X}$. With some basic algebra it can be shown that

\[
\mathbf{X}^T\mathbf{X} = \mathbf{Z}\boldsymbol{\Lambda}\mathbf{Z}^T = \sum_{i=1}^{k} \lambda_i \mathbf{z}_i \mathbf{z}_i^T \tag{3.8}
\]

where $\{\mathbf{z}_i\}_1^k$ are the k-dimensional column vectors of the matrix $\mathbf{Z}$. The eigendecomposition of $\mathbf{X}^T\mathbf{X}$ is thus obtained from the SVD of $\mathbf{X}$; as a consequence, the right singular vectors of the latter coincide with the eigenvectors of the former. Moreover, the diagonal matrix $\boldsymbol{\Lambda} = \boldsymbol{\Gamma}^T\boldsymbol{\Gamma} = \boldsymbol{\Gamma}^2$ has elements $\lambda_i = \gamma_i^2$ (the squared singular values of $\mathbf{X}$ are equal to the eigenvalues of $\mathbf{X}^T\mathbf{X}$). Here $\{\lambda_i\}_1^k$ are arranged in descending order ($\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_k$). Considering Equations 3.7 and 3.8, the definition of $\boldsymbol{\Lambda}$ and the properties of orthonormal matrices, it

2 A more technical description of both is provided in Appendix A.1 and A.2, respectively.

follows that

\[
\begin{aligned}
\hat{\mathbf{b}}_{OLS} &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \\
&= (\mathbf{Z}\boldsymbol{\Lambda}\mathbf{Z}^T)^{-1}(\mathbf{R}\boldsymbol{\Gamma}\mathbf{Z}^T)^T\mathbf{y} \\
&= \mathbf{Z}\boldsymbol{\Lambda}^{-1}\mathbf{Z}^T\mathbf{Z}\boldsymbol{\Gamma}^T\mathbf{R}^T\mathbf{y} \\
&= \mathbf{Z}\boldsymbol{\Gamma}^*\mathbf{R}^T\mathbf{y}
\end{aligned} \tag{3.9}
\]

where $\boldsymbol{\Gamma}^* = \boldsymbol{\Lambda}^{-1}\boldsymbol{\Gamma}^T$ is a (k × n) diagonal matrix with elements $\{1/\gamma_i\}_1^k = \{1/\sqrt{\lambda_i}\}_1^k$. This equation can be written as

\[
\hat{\mathbf{b}}_{OLS} = \sum_{i=1}^{rk(\mathbf{X}^T\mathbf{X})} \frac{\mathbf{r}_i^T\mathbf{y}}{\sqrt{\lambda_i}}\,\mathbf{z}_i = \sum_{i=1}^{rk(\mathbf{X}^T\mathbf{X})} \hat{\mathbf{b}}_i \tag{3.10}
\]

where $\hat{\mathbf{b}}_i = \frac{\mathbf{r}_i^T\mathbf{y}}{\sqrt{\lambda_i}}\,\mathbf{z}_i$ is the component of $\hat{\mathbf{b}}_{OLS}$ along $\mathbf{r}_i$ and $rk(\cdot)$ denotes the rank of a matrix.
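A short numerical sketch can make Equations 3.4 and 3.10 concrete: on simulated data (all names, sizes and values below are illustrative assumptions), the OLS coefficients obtained from the normal equations coincide with the sum of the components along the singular directions of $\mathbf{X}$.

\begin{verbatim}
# Illustrative check of Eq. 3.4 and Eq. 3.10 on simulated data
set.seed(1)
n <- 100; k <- 5
X <- matrix(rnorm(n * k), n, k)
b_true <- c(2, -1, 0.5, 0, 1)
y <- X %*% b_true + rnorm(n, sd = 0.3)

# OLS via the normal equations (Eq. 3.3-3.4)
b_ols <- solve(t(X) %*% X, t(X) %*% y)

# The same estimator assembled from the SVD of X (Eq. 3.10):
# each term is (r_i^T y / sqrt(lambda_i)) * z_i, with sqrt(lambda_i) = gamma_i
s <- svd(X)                          # s$u = R, s$d = singular values, s$v = Z
b_svd <- rowSums(sapply(seq_along(s$d), function(i)
  as.numeric(crossprod(s$u[, i], y) / s$d[i]) * s$v[, i]))

all.equal(as.numeric(b_ols), b_svd)  # TRUE up to numerical error
\end{verbatim}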

Since, as seen in Equation 2.7, the coefficient $\hat{\mathbf{b}}_{PLS}$ is related to an OLS estimator $\mathbf{c}$ computed from the regression of $\mathbf{y}$ on the orthogonal predictors $\mathbf{T}$, the relation shown in Equation 3.10 also holds for $\hat{\mathbf{b}}_{PLS}$ with proper corrections. As a consequence,

\[
\hat{\mathbf{b}}_{PLS} = \mathbf{W}(\mathbf{P}^T\mathbf{W})^{-1} \sum_{i=1}^{rk(\mathbf{T}^T\mathbf{T})} \frac{\mathbf{r}_i^T\mathbf{y}}{\sqrt{\lambda_i}}\,\mathbf{z}_i = \mathbf{W}(\mathbf{P}^T\mathbf{W})^{-1} \sum_{i=1}^{rk(\mathbf{T}^T\mathbf{T})} \hat{\mathbf{c}}_i
\]

where $\hat{\mathbf{c}}_i = \frac{\mathbf{r}_i^T\mathbf{y}}{\sqrt{\lambda_i}}\,\mathbf{z}_i$ is the component of $\hat{\mathbf{c}}_{OLS}$ along $\mathbf{r}_i$, which, like $\mathbf{z}_i$ and $\lambda_i$, belongs to the SVD of the matrix $\mathbf{T}$.

3.3 Canonical Correlation Analysis

Canonical Correlation Analysis (CCA), introduced by Hotelling, extracts the latent vectors that describe the directions of maximal correlation (Hotelling, 1933 [34]). Formally, CCA solves the following optimization problem

\[
[\operatorname{corr}(\mathbf{t}, \mathbf{u})]^2 = \max_{\|\mathbf{w}\|=\|\mathbf{c}\|=1} [\operatorname{corr}(\mathbf{X}\mathbf{w}, \mathbf{Y}\mathbf{c})]^2 \tag{3.11}
\]

where $[\operatorname{corr}(\mathbf{t}, \mathbf{u})]^2 = [\operatorname{cov}(\mathbf{t}, \mathbf{u})]^2 / \operatorname{var}(\mathbf{t})\operatorname{var}(\mathbf{u})$ denotes the sample squared correlation. The corresponding eigenvalue problem providing the solution to this optimization criterion is given by

\[
(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}(\mathbf{Y}^T\mathbf{Y})^{-1}\mathbf{Y}^T\mathbf{X}\mathbf{w} = \lambda\mathbf{w} \tag{3.12}
\]

where $\mathbf{w}$ represents a weight vector for projecting the original X-space data into the latent space.

Regression with PLS can be compared to CCA, and the connections between them are useful for understanding further properties of both methods. CCA, like PLS, is a projection method onto latent variables. It is easy to see that the PLS criterion (Equation 2.3) represents a form of CCA where the principle of maximal correlation is balanced with the requirement of explaining as much variance as possible in both the X- and Y-spaces.

\[
\max_{\|\mathbf{w}\|=\|\mathbf{c}\|=1} [\operatorname{cov}(\mathbf{X}\mathbf{w}, \mathbf{Y}\mathbf{c})]^2 = \max_{\|\mathbf{w}\|=\|\mathbf{c}\|=1} \operatorname{var}(\mathbf{X}\mathbf{w})\,[\operatorname{corr}(\mathbf{X}\mathbf{w}, \mathbf{Y}\mathbf{c})]^2\,\operatorname{var}(\mathbf{Y}\mathbf{c}) \tag{3.13}
\]

Note that in the case of a one-dimensional Y-space only the X-space variance is involved. In general, CCA is solved in a way similar to PLS-SB, that is, the eigenvectors and eigenvalues of Equation 3.12 are extracted at once by an implicit deflation of the cross-product matrix $\mathbf{X}^T\mathbf{Y}$.
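As an illustration of the CCA criterion in Equations 3.11 and 3.12, the following hedged sketch uses the base-R routine cancor() on two simulated blocks; the data and dimensions are purely illustrative.

\begin{verbatim}
# Hedged sketch: CCA on simulated blocks via base R's cancor()
set.seed(2)
n <- 200
X <- matrix(rnorm(n * 4), n, 4)
Y <- cbind(X[, 1] + rnorm(n), X[, 2] - X[, 3] + rnorm(n))

cc <- cancor(X, Y)     # solves the eigenproblem behind Eq. 3.12
cc$cor                 # sample canonical correlations corr(Xw, Yc)
cc$xcoef               # weight vectors w for the X block
cc$ycoef               # weight vectors c for the Y block
\end{verbatim}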

3.4 Ridge Regression

Ridge Regression (RR) was introduced by Hoerl and Kennard as a method for stabilizing regression estimates in the presence of extreme collinearity, that is, when the sample covariance matrix $\mathbf{S}_X$ is singular or nearly so (Hoerl and Kennard, 1970 [33]). Considering the simplest linear model, with a unidimensional response variable and a linear combination of the inputs (Equation 1.3), RR solves the following optimization problem

\[
\begin{aligned}
\max_{\|\mathbf{w}\|=\|\mathbf{c}\|=1} \frac{[\operatorname{cov}(\mathbf{X}\mathbf{w}, \mathbf{y}c)]^2}{(1-\eta_X)\operatorname{var}(\mathbf{X}\mathbf{w}) + \eta_X}
&= \max_{\|\mathbf{w}\|=1} \frac{[\operatorname{cov}(\mathbf{X}\mathbf{w}, \mathbf{y})]^2}{\operatorname{var}(\mathbf{X}\mathbf{w}) - \eta_X\operatorname{var}(\mathbf{X}\mathbf{w}) + \eta_X} \\
&= \max_{\|\mathbf{w}\|=1} \frac{[\operatorname{cov}(\mathbf{X}\mathbf{w}, \mathbf{y})]^2}{\operatorname{var}(\mathbf{X}\mathbf{w}) + \eta} \\
&= \max_{\|\mathbf{w}\|=1} \frac{[\operatorname{corr}(\mathbf{X}\mathbf{w}, \mathbf{y})]^2\operatorname{var}(\mathbf{X}\mathbf{w})}{\operatorname{var}(\mathbf{X}\mathbf{w}) + \eta}
\end{aligned} \tag{3.14}
\]

with $\eta_X$ and $\eta = -\eta_X\operatorname{var}(\mathbf{X}\mathbf{w}) + \eta_X$ representing regularization terms that are usually called “ridge” parameters; they range in $0 \leq \eta_X \leq 1$ and $\eta \geq 0$, respectively. They are generally considered metaparameters of the procedure and are estimated through a model selection procedure. Since the response values $\{y_i\}_1^n$ enter linearly in the model estimates $\{\hat{y}_i\}_1^n$, any of the competing model selection criteria can be applied straightforwardly. The corresponding eigenvalue problem providing the solution to the RR optimization criterion is given by

\[
[(1-\eta_X)\mathbf{X}^T\mathbf{X} + \eta_X\mathbf{I}]^{-1}\mathbf{X}^T\mathbf{y}\mathbf{y}^T\mathbf{X}\mathbf{w} = \lambda\mathbf{w} \tag{3.15}
\]

where $\mathbf{w}$ represents a weight vector for projecting the original X-space data into the latent space. The RR optimization problem can equivalently be written as a penalized least squares criterion3, with the penalty proportional to the squared norm of the coefficient vector $\mathbf{b}$,

\[
\operatorname*{argmin}_{\mathbf{b}} \|\mathbf{y} - \mathbf{X}\mathbf{b}\|^2 + \eta\|\mathbf{b}\|^2 = \operatorname*{argmin}_{\mathbf{b}} \operatorname{var}(\mathbf{y} - \mathbf{X}\mathbf{b}) + \eta\operatorname{var}(\mathbf{b})
\]

with $\eta \geq 0$. Then, the regression coefficients $\mathbf{b}$ are taken to be the solutions

\[
\hat{\mathbf{b}}_\eta = (\mathbf{X}^T\mathbf{X} + \eta\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y} \tag{3.16}
\]

with $\mathbf{I}$ being the (k × k) identity matrix. RR estimators are approximate solutions of Equation 3.4 in which the inverse of the (possibly) ill-conditioned predictor covariance matrix is stabilized by adding to it a multiple of $\mathbf{I}$; the degree of stabilization is regulated by the value of the ridge parameter.
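The stabilizing effect of Equation 3.16 can be sketched directly: adding a multiple of the identity to the (near-singular) matrix $\mathbf{X}^T\mathbf{X}$ makes the normal equations well conditioned. The simulated data, the collinear construction and the value of the ridge parameter below are illustrative assumptions only.

\begin{verbatim}
# Hedged sketch of Eq. 3.16 on simulated, nearly collinear data
set.seed(3)
n <- 50; k <- 10
X <- scale(matrix(rnorm(n * k), n, k))
X[, 2] <- X[, 1] + rnorm(n, sd = 0.01)     # nearly collinear predictors
y <- scale(X[, 1] - X[, 3] + rnorm(n))

# Ridge estimator: (X'X + eta I)^{-1} X'y
ridge <- function(X, y, eta)
  solve(crossprod(X) + eta * diag(ncol(X)), crossprod(X, y))

b_ols   <- ridge(X, y, 0)    # eta = 0: ordinary, ill-conditioned OLS
b_ridge <- ridge(X, y, 1)    # eta = 1: shrunken, stabilised coefficients
data.frame(OLS = as.numeric(b_ols), ridge = as.numeric(b_ridge))
\end{verbatim}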

3.4.1 Canonical Ridge Analysis

The relation between OLS, CCA, RR and PLS can be seen through the concept of Canonical Ridge Analysis. This method indeed provides a single general approach that collects all the previously described estimation procedures as special cases. Canonical Ridge Analysis solves the following optimization problem

\[
\max_{\|\mathbf{w}\|=\|\mathbf{c}\|=1} \frac{[\operatorname{cov}(\mathbf{X}\mathbf{w}, \mathbf{Y}\mathbf{c})]^2}{[(1-\eta_X)\operatorname{var}(\mathbf{X}\mathbf{w}) + \eta_X]\,[(1-\eta_Y)\operatorname{var}(\mathbf{Y}\mathbf{c}) + \eta_Y]}
\]

with $0 \leq \eta_X, \eta_Y \leq 1$ regularization terms usually called “ridge” parameters. The corresponding eigenvalue problem is

\[
[(1-\eta_X)\mathbf{X}^T\mathbf{X} + \eta_X\mathbf{I}]^{-1}\mathbf{X}^T\mathbf{Y}\,[(1-\eta_Y)\mathbf{Y}^T\mathbf{Y} + \eta_Y\mathbf{I}]^{-1}\mathbf{Y}^T\mathbf{X}\mathbf{w} = \lambda\mathbf{w}
\]

There are two extreme situations of this optimization criterion and the corresponding eigenvalue problem:

- for $\eta_X = 0$ and $\eta_Y = 0$, CCA is obtained (Equations 3.11 and 3.12);

- for $\eta_X = 1$ and $\eta_Y = 1$, PLS results (Equations 2.3 and 2.5).

By continuously changing $\eta_X$ and $\eta_Y$, solutions lying between these two cornerstones are obtained. In particular, interesting settings are

3 As previously seen in Subsection 1.2.2, RR belongs to the family of shrinkage methods (or weight decay) that follow a regularization approach to control overfitting.

- $\eta_X = 1$ and $\eta_Y = 0$, which represents a form of orthonormalized PLS (OPLS) where the Y-space data variance does not influence the final PLS solution;

- $\eta_X = 0$ and $\eta_Y = 1$, which consists of OPLS where the X-space variance is similarly ignored.

The last important setting that should be highlighted is $0 < \eta_X < 1$ with a unidimensional response variable $\mathbf{y}$: in this case RR is obtained (Equations 3.14 and 3.15). Restricting the analysis to this setting, further considerations can be made. Indeed, a value of $\eta = \infty$ results in the model being just the response mean, $\hat{y} = 0$, whereas $\eta = 0$ gives rise to the unregularized OLS estimates.
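A hedged sketch of the canonical ridge eigenproblem given above follows: the leading eigenvector recovers the CCA weight vector for $\eta_X = \eta_Y = 0$ and the first PLS weight vector for $\eta_X = \eta_Y = 1$. The simulated data and dimensions are illustrative assumptions.

\begin{verbatim}
# Hedged sketch of the canonical ridge eigenproblem on simulated data
set.seed(4)
n <- 200
X <- scale(matrix(rnorm(n * 5), n, 5))
Y <- scale(cbind(X[, 1] + rnorm(n), X[, 4] + rnorm(n)))

# Leading eigenvector for given eta_X, eta_Y
canonical_ridge_w <- function(X, Y, etaX, etaY) {
  RX <- (1 - etaX) * crossprod(X) + etaX * diag(ncol(X))
  RY <- (1 - etaY) * crossprod(Y) + etaY * diag(ncol(Y))
  M  <- solve(RX) %*% crossprod(X, Y) %*% solve(RY) %*% crossprod(Y, X)
  w  <- eigen(M)$vectors[, 1]   # imaginary parts, if any, are numerical noise
  Re(w) / sqrt(sum(Re(w)^2))
}

w_cca <- canonical_ridge_w(X, Y, 0, 0)   # eta_X = eta_Y = 0: CCA direction
w_pls <- canonical_ridge_w(X, Y, 1, 1)   # eta_X = eta_Y = 1: first PLS weight
\end{verbatim}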

3.5 Principal Component Analysis and Regression

Principal Component Analysis (PCA) specifies the directions of maximal variance, which are called principal components (PC). The PC can be seen as latent variables defining a new space onto which the original data are projected. Formally, the problem can be written as

\[
\max_{\|\mathbf{w}\|=1} \operatorname{var}(\mathbf{X}\mathbf{w}) \tag{3.17}
\]

which is equivalent to solving the following eigenproblem

\[
\mathbf{X}^T\mathbf{X}\,\mathbf{w} = \lambda\mathbf{w}
\]

Intuitively, PCA begins by computing both the sample covariance matrix $\mathbf{S}_X$ (Equation 2.1) and its eigendecomposition (Equation 3.8). Here the weight vectors are equal to the eigenvectors of $\mathbf{X}^T\mathbf{X}$, i.e. $\mathbf{w} = \mathbf{z}$. Then the (n × k) matrix $\mathbf{X}\mathbf{Z}$ can be obtained; it records the projections of the data onto the latent space, i.e. the coordinates of every observation of the dataset on each PC.

PCA, like PLS, is a projection method onto a new reduced space. As a consequence, the two can be compared through the optimization criterion they use to define the latent variables. Whereas PLS creates orthogonal weight vectors by maximizing the covariance between observations in X and Y (Equation 2.3), PCA is based on the criterion of maximum data variation in the X-space alone; PCA indeed results in a single projection irrespective of the task. Since PCA does not take into account any information about the Y variables, it is an unsupervised technique. Furthermore, the connection between PLS and PCA can be seen by means of the CCA optimization criterion. Considering Equation 3.13, PLS can be viewed as a PCA where the criterion of maximal variance in the X-space is weighted by the correlation between the X and Y variables and by the variance in the Y-space. PLS indeed extends PCA with a regression phase, so that PC related only to X will explain the covariance between X and

Y as far as possible. Finally, both PCA and PLS are traditional dimensionality reduction techniques widely used in Machine Learning, but PLS provides a more principled dimensionality reduction in comparison with PCA.

When PCA is performed on the X matrix and the computed PC are used as predictors for the Y response, Principal Component Regression (PCR) is obtained. Excluding PLS, PCR, developed by Massy, is the most popular and heavily promoted regression method in Chemometrics (Massy, 1965 [42]). On the other hand, it is known to, but seldom recommended by, statisticians: as for PLS, the original ideas motivating PCR were entirely heuristic, and its statistical properties remained largely unexplored for a long time. So PCR has been in the statistical literature for some time, even if it has seen relatively little use compared to OLS. Formally, PCR models the relation between the $\mathbf{y}$ variable and the PC linearly as $\mathbf{y} = \mathbf{X}\mathbf{Z}\mathbf{a}$, where the k-dimensional vector $\mathbf{a}$ collects the regression coefficients and can be computed as a traditional OLS estimator from $\mathbf{y}$ and $\mathbf{X}\mathbf{Z}$

\[
\begin{aligned}
\hat{\mathbf{a}} &= (\mathbf{Z}^T\mathbf{X}^T\mathbf{X}\mathbf{Z})^{-1}\mathbf{Z}^T\mathbf{X}^T\mathbf{y} = \boldsymbol{\Lambda}^{-1}\mathbf{Z}^T\mathbf{X}^T\mathbf{y} \\
\hat{a}_i &= \frac{\mathbf{z}_i^T\mathbf{X}^T\mathbf{y}}{\lambda_i} \qquad i = 1, \ldots, k
\end{aligned}
\]

since, from the eigendecomposition of $\mathbf{X}^T\mathbf{X}$, it can easily be shown that $\mathbf{Z}^T\mathbf{X}^T\mathbf{X}\mathbf{Z} = \boldsymbol{\Lambda}$. The last line describes every element of the regression coefficient vector $\hat{\mathbf{a}}$: the component in the i-th position is the regression coefficient corresponding to the i-th PC. Moreover, $\mathbf{X}\mathbf{Z} = \mathbf{R}\boldsymbol{\Gamma}\mathbf{Z}^T\mathbf{Z} = \mathbf{R}\boldsymbol{\Gamma}$ and $\boldsymbol{\Gamma}^* = \boldsymbol{\Lambda}^{-1}\boldsymbol{\Gamma}^T$ as previously seen. As a consequence, the estimator is equivalent to

\[
\begin{aligned}
\hat{\mathbf{a}} &= \boldsymbol{\Lambda}^{-1}\boldsymbol{\Gamma}^T\mathbf{R}^T\mathbf{y} = \boldsymbol{\Gamma}^*\mathbf{R}^T\mathbf{y} \\
\hat{a}_i &= \frac{\mathbf{r}_i^T\mathbf{y}}{\sqrt{\lambda_i}} \qquad i = 1, \ldots, k
\end{aligned}
\]

The last line shows the i-th component of the regression coefficient vector. With this estimator, PCR produces a sequence of regression models $\{\hat{\mathbf{y}}_m\}_{m=0}^{rk(\mathbf{X}^T\mathbf{X})}$, where the rank equals the number of nonzero eigenvalues. By convention, for m = 0 the model is just the response mean, $\hat{\mathbf{y}}_0 = \mathbf{0}$. The m-th model is the OLS regression of $\mathbf{y}$ on the first m PC.

\[
\hat{\mathbf{y}}_m = \sum_{i=1}^{m} \mathbf{X}\mathbf{z}_i a_i = \sum_{i=1}^{m} \frac{\mathbf{X}\mathbf{z}_i\mathbf{z}_i^T\mathbf{X}^T\mathbf{y}}{\lambda_i} \tag{3.18}
\]

The goal of PCR is to choose the particular model $\hat{\mathbf{y}}_{m^*}$ with the lowest prediction mean squared error

\[
m^* = \operatorname*{argmin}_{0 \leq m \leq rk(\mathbf{X}^T\mathbf{X})} \tilde{E}(\mathbf{y} - \hat{\mathbf{y}}_m)^2
\]

where $\tilde{E}$ is the average over future data, not part of the collected sample. The quantity m can thus be considered a metaparameter whose value is estimated from the data through a model selection procedure. Since the PCR estimator that regresses $\mathbf{y}$ on the first m PC is an approximate solution of Equation 3.4, there is a strict relation between $\hat{\mathbf{b}}_{PCR}$ and $\hat{\mathbf{b}}_{OLS}$. In particular, considering the model with $m = rk(\mathbf{X}^T\mathbf{X})$, it holds that

\[
\hat{\mathbf{b}}_{PCR} = \mathbf{Z}\hat{\mathbf{a}} = \mathbf{Z}(\mathbf{Z}^T\mathbf{X}^T\mathbf{X}\mathbf{Z})^{-1}\mathbf{Z}^T\mathbf{X}^T\mathbf{y} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \hat{\mathbf{b}}_{OLS}
\]

or equivalently

\[
\hat{\mathbf{b}}_{PCR} = \mathbf{Z}\boldsymbol{\Gamma}^*\mathbf{R}^T\mathbf{y} = \hat{\mathbf{b}}_{OLS}
\]

PCR, like PLS, aims to reduce dimensionality and thereby tackle the problems of regression parameter estimation that often occur with sets of highly multicollinear explanatory variables. In particular, PCR captures the maximum variability of the explanatory variables, and PLS does the same whilst also taking into account the relation between X and Y. Indeed, PCR focuses on reducing the dimensionality of X without considering the relationship that exists between X and Y. As a consequence, the optimal subset of PC must be chosen carefully, since the PC are computed with respect to X alone and there is no guarantee that they will be pertinent for explaining Y. Finally, PCR is a less prediction-oriented method than PLS.
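A compact sketch of PCR follows, under the assumptions of simulated autoscaled data and an illustrative number of components m; the same fit is also available directly through the pcr() function of the pls package.

\begin{verbatim}
# Hedged sketch of PCR: PC of X used as predictors for y (Eq. 3.18)
set.seed(5)
n <- 60; k <- 8
X <- scale(matrix(rnorm(n * k), n, k))
y <- scale(X[, 1] + 0.5 * X[, 2] + rnorm(n))

m  <- 3                                       # illustrative number of PC
pc <- prcomp(X, center = FALSE)               # data already centred and scaled
Tm <- pc$x[, 1:m]                             # scores on the first m PC
a  <- solve(crossprod(Tm), crossprod(Tm, y))  # OLS of y on the scores
b_pcr <- pc$rotation[, 1:m] %*% a             # coefficients in the original variables
yhat  <- X %*% b_pcr                          # the m-component PCR fit
\end{verbatim}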

3.6 Continuum regression

A comparison of PLS, RR and PCR brings out some interesting properties of their corresponding estimators and models. These regression procedures can be evaluated from different perspectives. From the description of their algorithms, for instance, it might appear that PCR, RR and PLS are very different methods leading to quite distinct model estimates. However, a heuristic comparison suggests that they are, in fact, quite similar, in that they all attempt to achieve the same operational result in slightly different ways. As a consequence, a unifying approach called continuum regression can be defined that relates PLS to PCR and RR, taking OLS as a reference method (Frank and Friedman, 1993 [24]). The unidimensional case is considered, where the n-dimensional response vector $\mathbf{y}$ is regressed on the (n × k) matrix of predictors $\mathbf{X}$ by means of the regression coefficient $\mathbf{b}$, a k-dimensional vector. Each regression procedure can then be regarded as a two-step process: first an m-dimensional subspace of the k-dimensional Euclidean space is defined, then the regression is performed under the restriction that the coefficient vector $\mathbf{b}$ lies in that subspace,

\[
\mathbf{b} = \sum_{i=1}^{m} b_i\mathbf{w}_i \tag{3.19}
\]

where the k-dimensional unit vectors $\{\mathbf{w}_i\}_1^m$, with $\mathbf{w}_i^T\mathbf{w}_i = 1$, span the prescribed subspace. Finally, the properties of the regression coefficient estimators can be highlighted. Regression procedures can therefore be compared by three criteria: i) the way in which they define the subspace $\{\mathbf{w}_i\}_1^m$; ii) the manner in which the (constrained) regression is performed; iii) the behaviour of the regression coefficient estimator and its form. One possibility for assessing the quality of an estimator $\hat{\mathbf{b}}$ is to compute its Mean Squared Error (MSE), which should be as low as possible and is defined as

\[
\begin{aligned}
MSE(\hat{\mathbf{b}}) &= E[(\hat{\mathbf{b}} - \mathbf{b})^T(\hat{\mathbf{b}} - \mathbf{b})] \\
&= (E[\hat{\mathbf{b}}] - \mathbf{b})^T(E[\hat{\mathbf{b}}] - \mathbf{b}) + E[(\hat{\mathbf{b}} - E[\hat{\mathbf{b}}])^T(\hat{\mathbf{b}} - E[\hat{\mathbf{b}}])] \\
&= (E[\hat{\mathbf{b}}] - \mathbf{b})^T(E[\hat{\mathbf{b}}] - \mathbf{b}) + \operatorname{var}[\hat{\mathbf{b}}]
\end{aligned}
\]

This is the bias-variance decomposition of the MSE, where the first part is the squared bias and the second part is the variance term. The OLS estimator (Equations 3.4 and 3.9) has no bias if $\mathbf{b} \in \operatorname{range}(\mathbf{X}^T\mathbf{X})$, as previously seen (Equation 3.5). As a consequence, one possibility for decreasing the MSE is to modify the estimator by shrinking the directions that are responsible for a high variance. The variance term (Equation 3.6) depends on the nonzero eigenvalues of $\mathbf{X}^T\mathbf{X}$: if some eigenvalues are very small, the variance of $\hat{\mathbf{b}}_{OLS}$ can be very high, which leads to a high MSE value. Moreover, small eigenvalues of $\mathbf{X}^T\mathbf{X}$ correspond to principal directions $\mathbf{z}_i$ of $\mathbf{X}$ that have a low sample spread. So, in order to improve the OLS estimator by decreasing its MSE, new shrinkage estimators can be computed as

\[
\hat{\mathbf{b}}_{shr} = \sum_{i=1}^{rk(\mathbf{X}^T\mathbf{X})} f(\lambda_i)\,\hat{\mathbf{b}}_i \tag{3.20}
\]

where $\hat{\mathbf{b}}_i = \frac{\mathbf{r}_i^T\mathbf{y}}{\sqrt{\lambda_i}}\,\mathbf{z}_i$ as defined in Equation 3.10. In this way, the estimates of each method (for a given value of the metaparameter fitted through a model selection criterion) depend on the data only through the vector of OLS estimates $\hat{\mathbf{b}}_i$ and the eigenvalues of the predictor covariance matrix $\{\lambda_j\}_1^k$. Moreover, $f(\cdot)$ is some real-valued function and the $f(\lambda_i)$ are called shrinkage factors; they scale the OLS solution along each eigendirection. If no element of Equation 3.20 depends on $\mathbf{y}$, that is, if $\hat{\mathbf{b}}_{shr}$ is linear in $\mathbf{y}$, any factor $f(\lambda_i) \neq 1$ increases the bias of the i-th component. The variance of the i-th component decreases for $|f(\lambda_i)| < 1$ and increases for $|f(\lambda_i)| > 1$. The OLS estimator is shrunk in the hope that the increase in bias is small compared to the decrease in variance. Since the PLS, PCR and RR estimators can be expanded in terms of the OLS components $\hat{\mathbf{b}}_i$, they are characterized by a shrinkage structure, too. Indeed, PLS, PCR and RR, unlike OLS, produce biased estimates, since their criteria all involve the scale of the predictor matrix through its sample variance. Usually, the degree of this bias is regulated by the value of the model selection parameter. The effect of this bias is to shrink the regression coefficient vector away from the OLS solution toward directions in the predictor variable space in which the projected data have larger spread: this is the main idea of continuum regression. Then, since shrinkage away from low-spread directions controls the estimator's variance, all these methods also try to improve the properties of their own estimators.

Considering OLS, the subspace is defined by a single unit vector $\mathbf{w}_{OLS}$. Projecting the original variables $\mathbf{X}$ by means of $\mathbf{w}_{OLS}$ into the new vector space, new predictor variables can be computed as the linear combination $\mathbf{X}\mathbf{w}_{OLS}$. The single unit vector $\mathbf{w}_{OLS}$ maximizes the squared sample correlation between the response and the new predictor variables,

\[
\mathbf{w}_{OLS} = \operatorname*{argmax}_{\mathbf{w}^T\mathbf{w}=1} [\operatorname{corr}(\mathbf{y}, \mathbf{X}\mathbf{w})]^2 \tag{3.21}
\]

The OLS model is then a simple least squares regression of $\mathbf{y}$ on the new variables $\mathbf{X}\mathbf{w}_{OLS}$ belonging to the vector subspace

\[
\hat{\mathbf{y}}_{OLS} = \mathbf{X}\mathbf{w}_{OLS}[\mathbf{w}_{OLS}^T\mathbf{X}^T\mathbf{X}\mathbf{w}_{OLS}]^{-1}\mathbf{w}_{OLS}^T\mathbf{X}^T\mathbf{y} = \mathbf{X}[\mathbf{X}^T\mathbf{X}]^{-1}\mathbf{X}^T\mathbf{y}
\]

As a consequence, as seen in Equation 3.4, the OLS estimator is given by

\[
\begin{aligned}
\hat{\mathbf{b}}_{OLS} &= [\mathbf{X}^T\mathbf{X}]^{-1}\mathbf{X}^T\mathbf{y} \\
&= \mathbf{w}_{OLS}[\mathbf{w}_{OLS}^T\mathbf{X}^T\mathbf{X}\mathbf{w}_{OLS}]^{-1}\mathbf{w}_{OLS}^T\mathbf{X}^T\mathbf{y} \\
&= \mathbf{w}_{OLS}\,b_{OLS}
\end{aligned}
\]

In this case there is a single factor b for Equation 3.19, namely

\[
b_{OLS} = [\mathbf{w}_{OLS}^T\mathbf{X}^T\mathbf{X}\mathbf{w}_{OLS}]^{-1}\mathbf{w}_{OLS}^T\mathbf{X}^T\mathbf{y}
\]

The OLS criterion is invariant to the scale of the linear combinations $\mathbf{X}\mathbf{w}$. Indeed, OLS estimation is equivariant under all nonsingular affine transformations of the variable axes, such as linear transformations, rotation and scaling.

Analyzing RR, the subspace is defined by a single unit vector $\mathbf{w}_{RR}$, as in OLS, but the criterion is somewhat different

\[
\mathbf{w}_{RR} = \operatorname*{argmax}_{\mathbf{w}^T\mathbf{w}=1} [\operatorname{corr}(\mathbf{y}, \mathbf{X}\mathbf{w})]^2 \frac{\operatorname{var}(\mathbf{X}\mathbf{w})}{\operatorname{var}(\mathbf{X}\mathbf{w}) + \eta}
\]

where $\eta$ is the ridge parameter and $\mathbf{w}_{RR}$ projects the explanatory variables onto the defined subspace. This result confirms the optimization problem previously seen in Equation 3.14. The ridge model is then taken to be a (shrinking) ridge regression of $\mathbf{y}$ on $\mathbf{X}\mathbf{w}_{RR}$ with the same value of the ridge parameter,

\[
\hat{\mathbf{y}}_{RR} = \mathbf{X}\mathbf{w}_{RR}[\mathbf{w}_{RR}^T\mathbf{X}^T\mathbf{X}\mathbf{w}_{RR} + \eta]^{-1}\mathbf{w}_{RR}^T\mathbf{X}^T\mathbf{y} = \mathbf{X}[\mathbf{X}^T\mathbf{X} + \eta\mathbf{I}_k]^{-1}\mathbf{X}^T\mathbf{y}
\]

So the ridge estimator, as seen in Equation 3.16, is equal to

\[
\begin{aligned}
\hat{\mathbf{b}}_{RR} &= [\mathbf{X}^T\mathbf{X} + \eta\mathbf{I}_k]^{-1}\mathbf{X}^T\mathbf{y} \\
&= \mathbf{w}_{RR}[\mathbf{w}_{RR}^T\mathbf{X}^T\mathbf{X}\mathbf{w}_{RR} + \eta]^{-1}\mathbf{w}_{RR}^T\mathbf{X}^T\mathbf{y} \\
&= \mathbf{w}_{RR}\,b_{RR}
\end{aligned}
\]

The single factor b for Equation 3.19 is now

\[
b_{RR} = [\mathbf{w}_{RR}^T\mathbf{X}^T\mathbf{X}\mathbf{w}_{RR} + \eta]^{-1}\mathbf{w}_{RR}^T\mathbf{X}^T\mathbf{y}
\]

As regards the RR estimator (Equation 3.16), it can also be written in the form

\[
\hat{\mathbf{b}}_\eta = \sum_{i=1}^{rk(\mathbf{X}^T\mathbf{X})} \frac{\lambda_i}{\lambda_i + \eta}\,\hat{\mathbf{b}}_i
\]

since $\mathbf{X}^T\mathbf{X} = \mathbf{Z}\boldsymbol{\Lambda}\mathbf{Z}^T$ and $\mathbf{X} = \mathbf{R}\boldsymbol{\Gamma}\mathbf{Z}^T$. As a consequence, the RR estimator is an example of a shrinkage estimator for the regression coefficient $\mathbf{b}$ (Equation 3.20) with shrinkage factor

\[
f(\lambda_i) = \frac{\lambda_i}{\lambda_i + \eta}
\]

As previously seen, setting $\eta = 0$ in the RR approach yields the unbiased OLS solution, whereas $\eta > 0$ introduces increasing bias toward larger values of $\operatorname{var}(\mathbf{X}\mathbf{w})$ and increased shrinkage of the length of the fitted coefficient vector. For small values of $\eta$, the former effect is the most pronounced; for example, for $\eta > 0$ the RR solution will have no projection in any subspace for which $\operatorname{var}(\mathbf{X}\mathbf{w}) = 0$, and very little projection on subspaces for which it is small. RR is not equivariant under scaling, even though it is under rotation.

PCR defines a sequence of m-dimensional subspaces, each spanned by the first m unit vectors of $\{\mathbf{w}_i\}_1^{rk(\mathbf{X}^T\mathbf{X})}$, which are the solutions to

\[
\mathbf{w}_{i(PCR)} = \operatorname*{argmax}_{\substack{\{\mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w}_h = 0\}_1^{i-1} \\ \mathbf{w}^T\mathbf{w} = 1}} \operatorname{var}(\mathbf{X}\mathbf{w}) \tag{3.22}
\]

This confirms the previous result shown in Equation 3.17. The first constraint in this equation is the $\mathbf{X}^T\mathbf{X}$ orthogonality, and it ensures that the linear combinations associated with the different solution vectors are uncorrelated over the data

\[
\operatorname{corr}(\mathbf{X}\mathbf{w}_i, \mathbf{X}\mathbf{w}_h) = 0 \qquad i \neq h
\]

As a consequence, they also turn out to be orthogonal, $\mathbf{w}_i^T\mathbf{w}_h = 0$, $i \neq h$. Because of their definition, every unit vector $\mathbf{w}_i$ coincides with $\mathbf{z}_i$, an eigenvector of $\mathbf{X}^T\mathbf{X}$. The m-th PCR model is then given by a least squares regression of the response on the m linear combinations $\{\mathbf{X}\mathbf{w}_i\}_1^m$, the previously defined PC, collected in the (n × m) matrix $\mathbf{X}\mathbf{W}$. Since they are uncorrelated, this reduces to the sum of univariate regressions on each one

\[
\hat{\mathbf{y}}_{m(PCR)} = \mathbf{X}\mathbf{W}\mathbf{b}_m = \sum_{i=1}^{m} \mathbf{X}\mathbf{w}_{i(PCR)}\,b_i
\]

where the (k × m) matrix $\mathbf{W}$ collects the k-dimensional unit vectors $\{\mathbf{w}_i\}_1^m$ and the m-dimensional vector $\mathbf{b}_{m(PCR)}$ records the factors $b_i$. According to Equation 3.19, for the m-th PCR model the regression coefficient is then the k-dimensional vector of the form

\[
\hat{\mathbf{b}}_{(PCR)} = \mathbf{W}\mathbf{b}_m = \sum_{i=1}^{m} \mathbf{w}_{i(PCR)}\,b_i
\]

Comparing this result with Equation 3.18, the factors $b_i$ are equal to

\[
b_i = \frac{\mathbf{w}_i^T\mathbf{X}^T\mathbf{y}}{\lambda_i} \qquad i = 1, \ldots, m
\]

PCR is also a shrinkage regression method for estimating the k-dimensional vector of regression coefficients $\mathbf{b}$. The shrinkage factors of Equation 3.20 are defined as

\[
f_{PCR}(\lambda_i) = \begin{cases} 1 & \text{if } \lambda_i^2 \geq \lambda_m^2 \ (i \leq m) \\ 0 & \text{if } \lambda_i^2 < \lambda_m^2 \ (i > m) \end{cases}
\]

both of which are linear in that they do not involve the sample response values. In PCR, the degree of bias is controlled by the value of m, the dimension of the

constraining subspace spanned by $\{\mathbf{w}_{i(PCR)}\}_1^m$, that is, the number of components m used in the PCR model. If $m = rk(\mathbf{X}^T\mathbf{X})$, an unbiased OLS solution is obtained. For $m < rk(\mathbf{X}^T\mathbf{X})$, bias is introduced: the smaller the value of m, the larger the bias. As with RR, the effect of this bias is to draw the solution toward larger values of $\operatorname{var}(\mathbf{X}\mathbf{w})$. This is because constraining $\mathbf{w}$ to lie in the subspace spanned by the first m eigenvectors of $\mathbf{X}^T\mathbf{X}$ places a lower bound on the sample variance of $\mathbf{X}\mathbf{w}$, that is, $\operatorname{var}(\mathbf{X}\mathbf{w}) \geq \lambda_m$. Since the eigenvectors, and hence the subspaces, are ordered on decreasing values of $\lambda_m$, increasing m has the effect of easing this restriction, thereby reducing the bias. PCR, like RR, is not equivariant under scaling, even though it is under rotation.
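The contrast between the RR and PCR shrinkage factors can be made explicit with a small sketch; the eigenvalues, the ridge parameters and the number of components below are illustrative values only.

\begin{verbatim}
# Hedged comparison of shrinkage factors along the eigendirections of X^T X:
# RR shrinks smoothly, PCR truncates (keep the first m, drop the rest).
lambda <- c(5, 2, 1, 0.2, 0.01)      # illustrative eigenvalues, decreasing

f_rr  <- function(lambda, eta) lambda / (lambda + eta)          # RR factor
f_pcr <- function(lambda, m) as.numeric(seq_along(lambda) <= m) # PCR factor

round(rbind(RR_eta_0.5 = f_rr(lambda, 0.5),
            RR_eta_2   = f_rr(lambda, 2),
            PCR_m_3    = f_pcr(lambda, 3)), 3)
\end{verbatim}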

PLS regression also produces a sequence of m-dimensional subspaces spanned by successive unit vectors, and the m-th PLS solution is then obtained by a least squares fit of the response onto the corresponding m linear combinations, in a strategy similar to PCR. The only difference from PCR is in the criterion used to define the vectors that span the m-dimensional subspace and hence the corresponding linear combinations. The criterion that gives rise to PLS is

\[
\begin{aligned}
\mathbf{w}_{i(PLS)} &= \operatorname*{argmax}_{\substack{\{\mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w}_h = 0\}_1^{i-1} \\ \mathbf{w}^T\mathbf{w} = 1}} [\operatorname{corr}(\mathbf{y}, \mathbf{X}\mathbf{w})]^2 \operatorname{var}(\mathbf{X}\mathbf{w}) \\
&= \operatorname*{argmax}_{\substack{\{\mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w}_h = 0\}_1^{i-1} \\ \mathbf{w}^T\mathbf{w} = 1}} [\operatorname{cov}(\mathbf{y}, \mathbf{X}\mathbf{w})]^2
\end{aligned}
\]

since the squared correlation between two variables equals their squared covariance divided by the product of their variances, and $\operatorname{var}(\mathbf{y}) = 1$ because the response is autoscaled (Stone and Brooks, 1990 [65]). This result corresponds to Equation 2.3. As with PCR, the vectors $\{\mathbf{w}_{i(PLS)}\}_1^{rk(\mathbf{X}^T\mathbf{X})}$ are constrained to be mutually $\mathbf{X}^T\mathbf{X}$ orthogonal, so that the corresponding linear combinations $\{\mathbf{X}\mathbf{w}_i\}_1^{rk(\mathbf{X}^T\mathbf{X})}$ are uncorrelated over the data. This causes the m-dimensional least squares fit to be equivalent to the sum of m univariate regressions on each linear combination separately, as with PCR. Unlike PCR, however, the $\{\mathbf{w}_{i(PLS)}\}_1^{rk(\mathbf{X}^T\mathbf{X})}$ are not orthogonal, owing to the different criterion used to obtain them. Formally, the model is

\[
\hat{\mathbf{y}}_{m(PLS)} = \mathbf{X}\mathbf{W}\mathbf{b}_{m(PLS)} = \sum_{i=1}^{m} \mathbf{X}\mathbf{w}_{i(PLS)}\,b_i
\]

where the (k × m) matrix $\mathbf{W}$ collects the k-dimensional unit vectors $\{\mathbf{w}_i\}_1^m$ and the m-dimensional vector $\mathbf{b}_{m(PLS)}$ records the factors $b_i$. According to Equation 3.19, for the m-th PLS model the regression coefficient is then the k-dimensional vector of the form

\[
\hat{\mathbf{b}}_{(PLS)} = \mathbf{W}\mathbf{b}_{m(PLS)} = \sum_{i=1}^{m} \mathbf{w}_{i(PLS)}\,b_i
\]

For PLS, the situation is similar to that of PCR. The degree of bias is regulated by m, the number of components used. For $m = rk(\mathbf{X}^T\mathbf{X})$, an unbiased OLS solution is produced; decreasing m generally increases the degree of bias. An exception occurs when $\mathbf{X}^T\mathbf{X} = \mathbf{I}$ (totally uncorrelated predictor variables), in which case an unbiased OLS solution is reached for m = 1 and remains the same for all m (though for m ≥ 2 the regressions are singular, all of the regressors being identical). This can be seen from the PLS criterion (Equation 2.3): in this case, $\operatorname{var}(\mathbf{X}\mathbf{w}) = 1$ for all $\mathbf{w}$, and the PLS criterion reduces to that for OLS (Equation 3.21). With this exception, the effect of decreasing m is to attract the solution coefficient vector toward larger values of $\operatorname{var}(\mathbf{X}\mathbf{w})$, as in PCR. For a given m, however, the degree of this attraction depends jointly on the covariance structure of the predictor variables and on the OLS solution, which in turn depends on the sample response values. This fact is often presented as an argument in favour of PLS over PCR. Unlike PCR, there is no sharp lower bound on $\operatorname{var}(\mathbf{X}\mathbf{w})$ for a given m. PLS, unlike OLS, is not equivariant under scaling, even though it is under rotation, like PCR and RR. Even though the PLS estimator has shrinkage properties as well, its shrinkage factors cannot be expressed by a simple formula, as those for RR and PCR can. As a consequence, in order to show the PLS shrinkage structure, some characteristics of the PLS method should first be described. In particular, PLS can be seen as a regularized least squares fit by means of the conjugate gradient method and Lanczos algorithms (see Phatak and de Hoog, 2002 [52]). Indeed, it can be shown that the PLS algorithm is equivalent to the conjugate gradient method, a procedure that iteratively computes approximate solutions of Equation 3.4 by minimizing the quadratic function

\[
\frac{1}{2}\,\mathbf{b}^T\mathbf{X}^T\mathbf{X}\mathbf{b} - \mathbf{y}^T\mathbf{X}\mathbf{b}
\]

along directions that are $\mathbf{X}^T\mathbf{X}$ orthogonal. The approximate solution achieved after m steps is equal to the PLS estimator obtained after m iterations. Moreover, the conjugate gradient method is in turn closely related to the Lanczos algorithm, a technique for approximating eigenvalues. Define a matrix $\mathbf{K}$ as

\[
\mathbf{K} = \left(\mathbf{X}^T\mathbf{y},\; (\mathbf{X}^T\mathbf{X})\mathbf{X}^T\mathbf{y},\; \ldots,\; (\mathbf{X}^T\mathbf{X})^{m-1}\mathbf{X}^T\mathbf{y}\right)
\]

The space spanned by the columns of $\mathbf{K}$ is denoted by $\mathcal{K}$ and is called the m-dimensional Krylov space of $\mathbf{X}^T\mathbf{X}$ and $\mathbf{X}^T\mathbf{y}$. In the Lanczos algorithm, an orthogonal basis

\[
\mathbf{W} = (\mathbf{w}_1, \ldots, \mathbf{w}_m) \tag{3.23}
\]

of $\mathcal{K}$ is computed. Given the linear map $\mathbf{X}^T\mathbf{X}$, its restriction to $\mathcal{K}$ for an element $\mathbf{k} \in \mathcal{K}$ is defined as the orthogonal projection of $\mathbf{X}^T\mathbf{X}\mathbf{k}$ onto the space $\mathcal{K}$. Moreover, the map is represented by the m × m matrix

\[
\mathbf{L} = \mathbf{W}^T\mathbf{X}^T\mathbf{X}\mathbf{W}
\]

This matrix is tridiagonal. Its m eigenvector-eigenvalue pairs

\[
(\mathbf{v}_i, \delta_i) \tag{3.24}
\]

are called Ritz pairs. They are the best approximation of the eigenpairs of $\mathbf{X}^T\mathbf{X}$ given only the information encoded in $\mathcal{K}$. Finally, the weight vectors $\mathbf{w}$ in Equation 2.3 of PLS1 are identical to the basis vectors in Equation 3.23. In particular, the weight vectors are a basis of the Krylov space and the PLS estimator is the solution of the optimization problem

\[
\operatorname*{argmin}_{\mathbf{b}} \|\mathbf{y} - \mathbf{X}\mathbf{b}\|^2 \quad \text{subject to } \mathbf{b} \in \mathcal{K}
\]

Now the PLS shrinkage factors can be defined simply, since they are closely related to the Ritz pairs (Equation 3.24). The PLS shrinkage factors $f(\lambda_i)$ that correspond to the estimator $\hat{\mathbf{b}}_{PLS}$ after m iterations of PLS are indeed

\[
f(\lambda_i) = 1 - \prod_{j=1}^{m} \left(1 - \frac{\lambda_i}{\delta_j}\right)
\]

These shrinkage factors have some remarkable properties. Most importantly, $f(\lambda_i) > 1$ can occur for certain combinations of i and m. Note, however, that the PLS estimator is not linear in the response values $\mathbf{y}$: the factors $f(\lambda_i)$ depend on the eigenvalues of the matrix $\mathbf{L}$ (Equation 3.24), and $\mathbf{L}$ in turn depends on $\mathbf{y}$. It is therefore not clear in which way this shrinkage behaviour influences the MSE of PLS1. The amount of absolute shrinkage $|1 - f(\lambda_i)|$ is particularly prominent if m is small. The shrinkage structure of the PLS estimator for an m-component model can equivalently be expressed as

\[
f_{im}(PLS) = \sum_{h=1}^{m} b_h \lambda_i^h \tag{3.25}
\]

where the vector $\mathbf{b} = \{b_h\}_1^m$ is given by $\mathbf{b} = \mathbf{W}^{-1}\mathbf{w}$, with the m components of the vector $\mathbf{w}$ being

\[
w_i = \sum_{h=1}^{k} \hat{b}_h^2 \lambda_h^{i+1}
\]

and the elements of the (m × m) matrix $\mathbf{W}$ given by

\[
W_{ij} = \sum_{h=1}^{k} \hat{b}_h^2 \lambda_h^{i+j+1}
\]

Both formulas contain the element $\hat{b}_h = \mathbf{z}_h^T\mathbf{X}^T\mathbf{y}/\lambda_h$, which is the projection of the OLS solution on $\mathbf{z}_h$. This structure of the shrinkage factors confirms the previous conclusions: they depend both on the number of components m used and on the eigenstructure $\{\lambda_h\}_1^k$, as do the factors for RR and PCR, but not in a simple way. Moreover, they also depend on the OLS solution $\{\hat{b}_j\}_1^k$, which in turn depends on the response values. The PLS shrinkage factors are seen to be independent of the length of the OLS solution $\|\hat{\mathbf{b}}\|^2$, since they are related only to the relative values of $\{\hat{b}_j\}_1^k$. Although the shrinkage factors for PLS cannot be expressed by a simple formula, as those for RR and PCR can, they can be computed for given values of m, $\{\lambda_j\}_1^k$ and $\{\hat{b}_j\}_1^k$, and compared to those of RR and PCR in corresponding situations, as done by Frank and Friedman in their work (Frank and Friedman, 1993 [24]). An interesting aspect of the PLS solution is that, unlike RR and PCR, it not only shrinks the OLS solution in some eigendirections ($f(\lambda_i) \leq 1$) but expands it in others ($f(\lambda_i) > 1$). For an m-component PLS solution, the OLS solution is expanded in the subspace defined by the eigendirections associated with the eigenvalues closest to the m-th eigenvalue. Directions associated with somewhat larger eigenvalues tend to be slightly shrunk, and those with smaller eigenvalues are substantially shrunk. The expression for the mean squared prediction error suggests that, at least for linear estimators, using any $f(\lambda_i) > 1$ can be highly detrimental, because it increases both the squared bias and the variance of the model estimate. This suggests that the performance of PLS might be improved by using modified scale factors $\{\tilde{f}_{im}(PLS)\}_1^k$ with $\tilde{f}_{im}(PLS) \leftarrow \min(f_{im}(PLS), 1)$, although this is not certain, since PLS is not linear and the mean squared error was derived assuming linear estimates. It would, in any case, largely remove the preference of PLS for (true) coefficient vectors that align with the eigendirections whose eigenvalues are close to the m-th eigenvalue. In conclusion, PLS and PCR make the assumption that the true distribution-theoretic random variables have particular preferential alignments with the high-spread directions of the observed predictor variable distribution. As a consequence, PLS and PCR shrink more heavily away from the low-spread directions than RR does. In addition, PLS places increased probability mass on the true coefficient vector aligning with the m-th principal component direction, where m is the number of PLS components used, in fact expanding the OLS solution in that direction. However, the solutions, and hence the performance, of PLS, PCR and RR tend to be quite similar in most situations, largely because they are applied to problems involving high collinearity, in which variance tends to dominate the bias, especially in the directions of small predictor spread, causing all three methods to shrink along those directions. In the presence of more symmetric designs, larger differences between them might well emerge.
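The PLS shrinkage factors can also be computed numerically from the Ritz values, following the Krylov-space construction described above. The sketch below uses simulated data and an illustrative number of components m; it only mirrors the formula $f(\lambda_i) = 1 - \prod_j (1 - \lambda_i/\delta_j)$ and is not a full PLS implementation.

\begin{verbatim}
# Hedged sketch of the PLS1 shrinkage factors via Ritz values
set.seed(6)
n <- 80; k <- 6; m <- 2                     # m = number of PLS components
X <- scale(matrix(rnorm(n * k), n, k))
y <- scale(X[, 1] - X[, 5] + rnorm(n))

A <- crossprod(X)                 # X^T X
v <- as.numeric(crossprod(X, y))  # X^T y

K <- matrix(0, k, m)              # Krylov matrix: columns v, Av, ..., A^{m-1} v
K[, 1] <- v
if (m > 1) for (j in 2:m) K[, j] <- A %*% K[, j - 1]

W <- qr.Q(qr(K))                  # orthonormal basis of the Krylov space
L <- t(W) %*% A %*% W             # restriction of X^T X to the Krylov space
delta  <- eigen(L, symmetric = TRUE)$values    # Ritz values
lambda <- eigen(A, symmetric = TRUE)$values    # eigenvalues of X^T X

# f(lambda_i) = 1 - prod_j (1 - lambda_i / delta_j); values above 1 show
# that PLS expands the OLS solution along some eigendirections
f_pls <- sapply(lambda, function(l) 1 - prod(1 - l / delta))
round(f_pls, 3)
\end{verbatim}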

3.7 Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) is both a classification and a dimensionality reduction technique widely used in Machine Learning. LDA consists of a linear projection of the data onto a new subspace whose directions are computed by solving a specific optimization problem: the projection weights $\mathbf{w}$ are found by maximizing the between-classes variance while at the same time minimizing the within-classes variance in the projected space. Recalling $\mathbf{SS}_b$ and $\mathbf{SS}_w$, the among-classes and within-classes sums of squares defined in Equation 2.8, the LDA optimization criterion can be written as

\[
\max_{\|\mathbf{w}\|=1} \mathbf{w}^T\mathbf{SS}_w^{-1}\mathbf{SS}_b\,\mathbf{w}
\]

As a consequence, the weight vectors coincide with the eigenvectors $\mathbf{w}$ of the following eigenvalue problem

\[
[\mathbf{SS}_w^{-1}\mathbf{SS}_b]\,\mathbf{w} = \lambda\mathbf{w}
\]

As usual, the connections between PLS and LDA can be seen through the optimization criteria they use to define the projection directions. In this case, the modified PLS method defined in Equation 2.10 should be considered: the among-classes sum of squares matrix $\mathbf{SS}_b$ appears in both criteria. Moreover, observing both criteria and tracing backwards the reasoning done for PLS classification (Subsection 2.3.2), it can be said that LDA, like PLS, has a supervised nature, because it takes into account not only the variance between the observations of the samples but also uses the Y data to determine the projection space. As a consequence, LDA results in different subspaces for different Y. For this reason, LDA is preferred over PCA for classification tasks. However, if Y spans G classes, LDA produces at most a (G − 1)-dimensional subspace, since it can compute at most G − 1 meaningful latent variables. So, the size of the LDA projection space is bounded by G − 1, which can limit the representation ability of the projected space. Additionally, when handling a few individuals with thousands of variables (n << k), the covariance estimates in LDA are not of full rank and the weight vectors cannot be extracted. On the contrary, PLS does not have the G − 1 limit on the projection space dimension, since it can extract further projection directions beyond G − 1. Furthermore, unlike LDA, PLS is well suited to data with n << k. Finally, the LDA classification directions are identical to the directions given by CCA using a dummy matrix Y for group membership (Bartlett, 1938 [3]). So, there is also a close connection between classification with PLS and LDA by means of CCA.
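As a brief illustration of LDA in practice, the following hedged sketch uses the lda() function from the R package MASS on Fisher's iris data (three classes, hence at most two discriminant directions); the dataset is used only as an example.

\begin{verbatim}
# Hedged sketch: LDA on the iris data with MASS::lda()
library(MASS)

fit <- lda(Species ~ ., data = iris)      # G = 3 classes -> at most 2 directions
fit$scaling                               # the discriminant weight vectors
pred <- predict(fit, iris)
table(observed = iris$Species, predicted = pred$class)   # confusion matrix
\end{verbatim}

4 Support Vector Machines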

This Chapter opens with a short introduction to the Support Vector Machine (SVM) that presents the intuitive ideas behind its algorithmic aspects, briefly summarizes its historical development, and lists some applications (Section 4.1). Then, the general theory of SVM is described in order to define the most important concepts involved in every form of SVM algorithm (Section 4.2). Even if the chosen approach considers the simplest case of binary output, it can easily be generalized to more complex situations. This Section illustrates three specific procedures step by step, distinguishing between linear and nonlinear classification, and analyzing two different possibilities for linear classifiers, i.e. when they are applied to linearly separable data and to nonlinearly separable data. Finally, examples of nonlinear kernel PLS are described (Section 4.3); they complete both approaches to nonlinear PLS regression previously described in Subsection 2.3.3 with the kernel-based technique of SVM.

4.1 Introduction

A Support Vector Machine (SVM) is a set of techniques that represents a major development in Machine Learning algorithms. It defines a group of supervised learning methods with a twofold aim: firstly, to analyze data and recognize patterns; secondly, to generalize the information learned to new observations. Hence, SVM can be seen as a competitor of statistical methodologies that apply classification and regression in order to describe given known data and to predict future data. SVM covers a wide range of situations with respect to data features. First of all, with regard to categorical quantities, SVM belongs to the family of generalized linear classifiers. In this case, SVM works both on linearly separable and on nonlinearly separable data. In particular, considering a set of training examples, each marked as belonging to one of two categories, the aim of the SVM method is to specify a classification function that will hopefully classify new data correctly. Hence, SVM defines a non-probabilistic binary linear classifier: an SVM training algorithm builds a model that assigns examples to one category or the other, and with a new dataset it is able to predict, for each given input, from which of the two possible classes it comes.

Secondly, a kind of SVM consists of kernel-based methods that define a nonlinear classifier. Thirdly, SVM can provide an alternative approach to the learning of polynomial classifiers, opposed to classical techniques such as neural networks. Indeed, SVM overcomes problems and drawbacks of both one-layer and multi-layer neural networks. The former have an efficient learning algorithm¹, but they are useful only if the data are linearly separable. The latter can represent nonlinear functions but are very difficult to train, since the weight space has a high dimension². On the contrary, SVM has an efficient algorithm and is able to represent complex nonlinear functions.

A graphical representation can help to understand a general SVM model. The given data become points in space, mapped so that the examples of different categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on. More formally, an SVM constructs a hyperplane (or a set of hyperplanes in a high- or infinite-dimensional space) which is used for classification, regression, or other tasks such as prediction. Indeed, usually SVM first learns one or more separating hyperplanes between groups of training data, then it uses these hyperplanes to predict new observations. Intuitively, a good separation is achieved by the hyperplane that has the largest distance (the so-called functional margin) to the nearest training data point of any class (the so-called support vector), since in general the larger the margin, the lower the generalization error of the classifier. SVM has some interesting qualities. Usually SVM is able to handle a high dimensional input space and at the same time it does not require a dataset of large cardinality. In addition, it can successfully manage the presence of sparse instances and irrelevant variables. Then, SVM can assure a good generalization ability with a high prediction accuracy and without the risk of overfitting. Of course, its performance should improve as the amount of training data increases. Finally, it has nice mathematical properties, since it involves a simple convex optimization problem which is guaranteed to converge to a single global solution. However, there are some drawbacks, too. When data are modelled by means of a multidimensional space and the solution refers to new features, the interpretability of the results may be neither easy nor meaningful. A trade-off should therefore be made between the speed and advantages of the computation and the directness of the explanation of the results. Then, SVM is sensitive to noise, and a relatively small number of mislabeled examples can dramatically decrease the performance. Finally, users should learn how to tune an SVM, because some phases, such as kernel or parameter selection, are usually done with a sequence of trials in which various values are tested in a lengthy series of experiments.

¹ The characteristic parameters of the network are obtained through the solution of a convex quadratic programming problem with equality or box constraints (the value of the parameter should belong to a range), which has a unique global minimum.
² Traditional techniques, such as back propagation, obtain the weights of the network by solving a non-convex and unconstrained optimization problem. As a consequence, there is an indeterminate number of local minima.

Domain experts can, however, give assistance, for example in formulating appropriate similarity measures. Otherwise, in the absence of reliable criteria, applications rely on the use of a validation procedure to set such parameters. The traditional SVM approach usually assumes two classes, defining binary classifiers; as a consequence, a tricky question is how to manage multi-class data. A possible answer is to consider as many SVMs as there are groups in the data. In this case, the first SVM learns the differences between the first group and everything else; the second SVM does the same with the second class, and so on. A problem can thus be viewed as a series of binary classifications, one for each category. Then, in order to predict the output for a new input, one just predicts with each SVM and finds out which one puts the prediction furthest into the positive region (a minimal sketch of this strategy is given below).
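As a minimal illustration of this one-versus-rest strategy, the following Python sketch (assuming scikit-learn and NumPy are available; the toy data and the number of classes are purely illustrative) trains one binary SVM per class and assigns a new point to the class whose classifier pushes it furthest into the positive region.

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Three illustrative Gaussian clusters, one per class.
X = np.vstack([rng.randn(30, 2) + c for c in ([0, 0], [4, 0], [0, 4])])
y = np.repeat([0, 1, 2], 30)

# One binary SVM per class: class g versus everything else.
machines = []
for g in np.unique(y):
    svm = SVC(kernel='linear', C=1.0)
    svm.fit(X, np.where(y == g, 1, -1))
    machines.append(svm)

# Predict a new point: pick the class whose SVM gives the largest
# signed distance into the positive region.
x_new = np.array([[3.5, 0.5]])
scores = [svm.decision_function(x_new)[0] for svm in machines]
print("per-class scores:", scores, "-> predicted class:", int(np.argmax(scores)))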

SVM belongs to the statistical learning theory known as the theory of Vapnik-Chervonenkis, introduced in Russia by Vapnik and colleagues during the sixties - when the Generalized Portrait algorithm was defined (Vapnik and Lerner, 1963 [71], Vapnik and Chervonenkis, 1964 [68] and [69]) - and the seventies (Vapnik and Chervonenkis, 1974 [70]). The original SVM algorithm was indeed proposed by V. N. Vapnik in 1982 as an extension of the Generalized Portrait algorithm (Vapnik, 2006 [67]). Then, SVM was developed at AT&T Bell Laboratories by Vapnik and colleagues (Boser et al., 1992 [8], Guyon et al., 1993 [29]) until the current standard version (soft margin) was defined a few years later (Cortes and Vapnik, 1995 [15], Schölkopf et al., 1995 [58] and 1998 [60], [59]) and finally gained increasing popularity in the late 1990s.

Since its origins in an industrial context, research on SVM has led to many applications. The first works concerned optical and handwritten character recognition, where SVM quickly became competitive with the best traditional methods. Other fields where SVM has been successfully applied are object detection (Blanz et al., 1996 [7]); speaker recognition (Schmidt, 1996 [57]); content-based image retrieval and face recognition (Osuna et al., 1997 [49]); text and hypertext categorization (Joachims, 1998 [36]). In the last case the aim of SVM is the classification of natural text documents into a fixed number of predefined categories based on their content. Applications are, for example, email filtering, web searching, sorting documents by topic, etc. More recently, SVM has been largely used in bioinformatics applications such as protein sequences, gene expression, phylogenetic information, cancer classification, etc. For example, a protein sequence can be mapped into a 20-dimensional space considering its amino acid composition, that is, a vector of frequencies. The secondary and 3D structure of proteins can also be predicted, and remote homologies between proteins or genes can be identified. Finally, some other applications of SVM are the functional classification of proteins, pattern recognition in biological sequences and the analysis of data carried on microarrays. In conclusion, SVM is currently among the best performers for a number of

classification tasks and it can be applied to complex data types beyond feature vectors (e.g. graphs, sequences, relational data) by designing kernel functions for such data. Moreover, SVM techniques have also been extended to unsupervised learning methods such as PCA.

4.2 Binary classifier

In this Section a general SVM learning process is formulated. The simplest case of binary classification is considered. It can be viewed as the task of separating two classes in the input space. However, its results can easily be generalized to situations with more than two classes. An opening part describes preliminary assumptions, notation and the aim of SVM, highlighting its problems and the solutions for overcoming them. Then, linear classifiers are taken into account, considering both the case where they work on linearly separable data and the case of soft margin classification. Finally, the nonlinear classifier is presented. In each case the learning process is defined step by step.

A set of training examples that are preclassified is defined by

S = {(x_1, y_1), ..., (x_i, y_i), ..., (x_n, y_n)},   x_i ∈ X ⊆ IR^k,   y_i ∈ Y = {−1, +1},   i = 1, ..., n

where X is called the input space and Y the output space. The input space is a subset of the Euclidean space IR^k, so that every datum can be drawn as a point of this space whose coordinates coincide with the corresponding values of the data vector. As a consequence, the graphical representation allows to visualize vector algebra and calculus with data. Every pair (x_i, y_i), i = 1, ..., n, is drawn from an unknown distribution P(x, y). An SVM learns from these given and known data the border between observations belonging to different classes.

A set of threshold functions can be defined as

H = {f_γ(x), γ ∈ Γ ⊂ IR},   f_γ(x): IR^k → {−1, +1}

with f(·) a real function and Γ the set of real parameters γ that creates a machine able to solve a particular problem³. Each threshold function f_γ(x) corresponds to a hyperplane in the Euclidean space that represents the input space. Moreover, each of them is a hypothesis and is also named classifier or separator, so that the set H is called the hypothesis space. Some properties of H can be defined. The VC (Vapnik-Chervonenkis) dimension of

³ For instance, Γ can correspond to the weights of the synapses of a neural network.

the hypothesis space H coincides with the VC dimension of the classifier f_γ(x). It is the natural number corresponding to the largest number of points that can be split (shattered) in all possible ways by the set of functions. The VC dimension is a measure of the complexity of the set H and is denoted by h. Given a set of n points, if for each of the 2^n possible labellings (−1, +1) a function f_γ(x) exists that correctly classifies the observations, then the set of points is split by the set of functions.

The mean theoretical error is defined as

R(γ) = ∫ |f_γ(x) − y| P(x, y) dx dy

and it is a measure of the goodness of a hypothesis for predicting the class y of a point x. The aim of the SVM learning process is to find a function f_{γ*}(x) that minimizes the mean theoretical error:

f_{γ*}(x) = argmin_γ R(γ)

However, the mean theoretical error R(γ) is impossible to compute, since the probability distribution P(x, y) is not known. This problem can be solved when a training set is available. Indeed, a training set is a sample extracted from a population characterized by the probability distribution P(x, y). As a consequence, an approximation of the mean theoretical error R(γ) can be computed as

R_emp(γ) = (1/n) Σ_{i=1}^{n} |f_γ(x_i) − y_i|

and it is called the mean empirical error. The law of large numbers assures that the mean empirical error converges in probability towards the mean theoretical error. As a consequence, the aim of the learning process becomes the minimization of the mean empirical error. The theory of uniform convergence in probability developed by Vapnik and Chervonenkis gives an upper bound on the deviation of the mean empirical error from the mean theoretical error:

R(γ) ≤ R_emp(γ) + sqrt( [ h log(2n/h) − log(η/4) ] / n )

with 0 ≤ η ≤ 1 and h the VC dimension of f_γ(x). In order to obtain the minimum mean theoretical error, both the mean empirical error and the ratio between the VC dimension and the number of points (h/n) should be minimized. Usually the mean empirical error is a decreasing function of h. As a consequence, for every given number of points, an optimal value of the VC dimension exists, and it follows from a trade-off between R_emp and h/n. The SVM algorithm solves this problem efficiently, minimizing at the same time the VC dimension and the number of errors on the training set.
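The two quantities above can be made concrete with a few lines of NumPy (a minimal sketch only; the labels, the predictions and the values of n, h and η are illustrative, and the bound is loose for such a small sample).

import numpy as np

y   = np.array([+1, -1, +1, +1, -1])   # true labels, illustrative
f_x = np.array([+1, -1, -1, +1, -1])   # classifier outputs f_gamma(x_i), illustrative

# Mean empirical error R_emp(gamma) = (1/n) * sum |f(x_i) - y_i|
# (for labels in {-1, +1} this equals twice the fraction of misclassified points)
n = len(y)
R_emp = np.mean(np.abs(f_x - y))

# Vapnik-Chervonenkis confidence term added to R_emp in the bound on R(gamma)
h, eta = 3, 0.05                        # VC dimension and confidence level, illustrative
vc_term = np.sqrt((h * np.log(2 * n / h) - np.log(eta / 4)) / n)
print("R_emp =", R_emp, " upper bound on R =", R_emp + vc_term)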

Moreover, Vapnik has proved that the class of optimal linear separators has VC dimension h bounded above as

h ≤ min( D² / ρ², k ) + 1

where ρ is the separation margin, D the diameter of the smallest sphere that encloses all the training examples, and k the dimensionality of the learning data. Intuitively, this implies that regardless of the dimensionality k, the VC dimension can be minimized by maximizing the separation margin ρ. Thus, the complexity of the classifier is kept small regardless of the data dimensionality. As a consequence, in order to minimize the mean theoretical error through the minimization of both the mean empirical error and the VC dimension, SVM should maximize the separation margin; i.e. the best classification is achieved by the hyperplane that has the largest separation margin.

4.2.1 Linearly separable data

Before describing the SVM process for linearly separable data, the characteristics of this kind of dataset are illustrated and some useful concepts are introduced.

A set of data is linearly separable when it is possible to find at least one pair (w, b) such that

w^T x_i + b > 0   if y_i = +1
w^T x_i + b < 0   if y_i = −1     (4.1)

where w ∈ IR^k is a k-dimensional vector and b ∈ IR is a real number. The output value y_i corresponds to the input vector x_i for every i-th observation of the dataset. A hyperplane can be associated to every pair (w, b). It corresponds to the points that satisfy the relation

w^T x + b = 0

and it has distance equal to |b| / ||w|| from the origin. As a consequence, a set of data is linearly separable when at least one hyperplane that correctly classifies the data exists. Usually there are a number of hyperplanes with this required property. Figure 4.1 shows an example where the input data are bidimensional, x = (x_1, x_2)^T, so that the input space is a Euclidean space that can be represented as usual by a Cartesian plane. In this case points belonging to different classes are drawn with two different colours (white and black points). It is easy to see that a pair (w, b) with the properties required above exists; in this case it is, for example, w = [1, −1] and b = 0.7. Moreover, the hyperplane associated to this pair can be drawn; it is the line x_1 − x_2 + 0.7 = 0 because of the bidimensional space (Figure 4.1a). The horizontal line x_2 = 3.5 can also be considered a good hyperplane, meaning one able to correctly classify the data (Figure 4.1b). In this case w = [0, 1] and b = −3.5. The vertical line x_1 = 3.1 is another hyperplane that is able to separate the different observations (Figure 4.1c). In this case w = [1, 0] and b = −3.1. The point x_i = (3.5, 2) has value 2.2 > 0 in the first case, −1.5 < 0 in the second and 0.4 > 0 in the last. There is no problem if the same class has opposite output values when different hyperplanes are considered.

[Figure 4.1: An example of binary linearly separable data with different hyperplanes: (a) x_1 − x_2 + 0.7 = 0; (b) x_2 − 3.5 = 0; (c) x_1 − 3.1 = 0.]

The hypothesis space H is given by the set of functions defined as

f_{w,b}(x) = sign(w^T x + b)

where sign(·) is the binary discriminator (+/−, 0/1, true/false, etc.). The distance between a point x_i and the hyperplane associated to the pair (w, b) is equal to

d(x_i; w, b) = |w^T x_i + b| / ||w||

The points closest to the hyperplane are called support vectors. They are the most important training points because they define the separating hyperplane, whereas the other training examples can be ignored. Their distance from the hyperplane is called the margin and it is denoted by r. Hence, if ||w|| ≤ a with a ∈ IR, the distance between the hyperplane and the nearest point should be greater than 1/a, and the number of admissible hyperplanes is thus reduced. The optimal hyperplane has the same distance from the support vectors belonging to the two different classes. As a consequence, twice the margin defines the separation margin, denoted ρ:

ρ = 2r
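The small numerical example discussed above can be verified directly (a minimal NumPy sketch; the three hyperplanes and the test point are the ones used in Figure 4.1).

import numpy as np

# The three hyperplanes of the example: pairs (w, b)
hyperplanes = [
    (np.array([1.0, -1.0]), 0.7),   # x1 - x2 + 0.7 = 0
    (np.array([0.0,  1.0]), -3.5),  # x2 - 3.5 = 0
    (np.array([1.0,  0.0]), -3.1),  # x1 - 3.1 = 0
]
x_i = np.array([3.5, 2.0])

for w, b in hyperplanes:
    value = w @ x_i + b                      # its sign gives the predicted class
    dist  = abs(value) / np.linalg.norm(w)   # d(x_i; w, b) = |w^T x_i + b| / ||w||
    print(f"w={w}, b={b}: value={value:+.1f}, class={np.sign(value):+.0f}, distance={dist:.3f}")

Running the loop reproduces the values 2.2, −1.5 and 0.4 quoted in the text.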

Figure 4.2 shows a point x_i and its distance d_i to the hyperplane, the support vectors (blue circled points) with their distance (the margin r) to the hyperplane, and the separation margin ρ, for the data represented in Figure 4.1. Since the optimal hyperplane must be equidistant from the support vectors of the two classes (ρ = 2r), the hyperplane of Figure 4.2c cannot be a solution of the optimization problem.

If the data are linearly separable, as the hypotheses state, the aim of SVM is to find the optimal hyperplane among all those that correctly classify the training set. The ability of correct classification is required in the following way.

[Figure 4.2: Examples of hyperplanes with corresponding support vectors, distance and margin for linearly separable data: (a) x_1 − x_2 + 0.7 = 0; (b) x_2 − 3.5 = 0; (c) x_1 − 3.1 = 0.]

As previously seen, data are linearly separable if there exists a hyperplane, defined by a pair (w, b), such that Equation 4.1 holds. Since there are infinitely many solutions that differ only by a scale factor on w - in other words, the hyperplane does not change if its normal vector is rescaled - the previous relation can be written as

w^T x_i + b ≥ +1   if y_i = +1
w^T x_i + b ≤ −1   if y_i = −1     (4.2)

by applying a scale transformation to both w and b. Here x_i represents every observation of the training set. Equality holds if and only if x_i is a support vector. In this case

|w^T x_i + b| = 1 = r ||w||

from the definition of r, the distance of a support vector from the hyperplane. As a consequence,

r = 1 / ||w||   and   ρ = 2r = 2 / ||w||

The condition of Equation 4.2 is equivalent to

y_i (w^T x_i + b) ≥ 1     (4.3)

for every point of the dataset. This equation expresses the ability of correct classification. Then, as previously defined, the optimal hyperplane should have minimum mean theoretical error; this means small mean empirical error and small VC dimension. The last condition is guaranteed if the hyperplane has maximum margin ρ = 2r = 2/||w||, or equivalently minimum norm ||w||². In order to define the optimal hyperplane, one should correctly classify the points of the training set into the two classes y_i ∈ {−1, +1} through a hyperplane that uses a coefficient w with the smallest norm. Formally, the problem of the Linear Support Vector Machine (LSVM) becomes

min   (1/2) ||w||²
sub   y_i (w^T x_i + b) ≥ 1,   i = 1, ..., n     (4.4)

The function that should be optimized is a quadratic form. Quadratic optimization problems are a well known class of mathematical programming problems for which several (nontrivial) algorithms exist⁴. In this case the cost function is strictly convex and the constraints are convex, so that the optimization problem has a dual form and the solution of the latter coincides with the optimum of the former. Moreover, the dual problem is usually simpler to solve than the primal. In order to define the dual problem, the Kuhn-Tucker theorem is followed. A new unconstrained problem is first formulated by means of the Lagrange multipliers λ_i that are introduced for every inequality constraint of the primal problem. The Lagrangian is

L(w, b, λ) = (1/2) ||w||² − Σ_{i=1}^{n} λ_i [ y_i (w^T x_i + b) − 1 ]     (4.5)

where λ = (λ_1, ..., λ_n) is the vector of nonnegative Lagrange multipliers derived from the constraints of Equation 4.4. The Lagrangian should be minimized with respect to w and b and at the same time maximized with respect to λ ≥ 0. The solutions are given by computing the derivatives and setting them equal to zero:

∂L(w, b, λ) / ∂w = w − Σ_{i=1}^{n} λ_i y_i x_i = 0

and

∂L(w, b, λ) / ∂b = Σ_{i=1}^{n} λ_i y_i = 0     (4.6)

From the first equation it follows that

w = Σ_{i=1}^{n} λ_i y_i x_i = X^T Λ y     (4.7)

where X is the (n × k) matrix of input data, y is the n-dimensional vector of output data and the (n × n) diagonal matrix Λ records the n Lagrange multipliers. This means that the optimal hyperplane can be written as a linear combination of vectors belonging to the training set.

⁴ The most popular optimization algorithms for SVM use decomposition to hill-climb over a subset of Lagrange multipliers at a time, e.g. Sequential Minimal Optimization (Platt, 1999 [53]).

Replacing Equations 4.7 and 4.6 in Equation 4.5, the original problem can be rewritten in the following simpler dual form

max_λ   Σ_{i=1}^{n} λ_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} λ_i λ_j y_i y_j x_i^T x_j

sub   λ_i ≥ 0,   i = 1, ..., n
      Σ_{i=1}^{n} λ_i y_i = 0     (4.8)

with Σ_{i=1}^{n} Σ_{j=1}^{n} λ_i λ_j y_i y_j x_i^T x_j = ||w*||². In this way the previous constraints are replaced by conditions on the multipliers λ_i, which can be computed as the solution of this problem and are then denoted λ_i*. The optimal values λ_i* allow to define the optimal hyperplane except for b. Indeed, from Equation 4.7,

w* = Σ_{i=1}^{n} λ_i* y_i x_i = X^T Λ* y

However, for every support vector x_l it holds that y_l (w^T x_l + b) = 1 (from Equation 4.3), so that

b* = 1/y_l − x_l^T w* = y_l − x_l^T Σ_{i=1}^{n} λ_i* y_i x_i,   for every l such that λ_l > 0

Then the optimal hyperplane is given by

x^T w* + b* = x^T Σ_{i=1}^{n} λ_i* y_i x_i + y_l − x_l^T Σ_{i=1}^{n} λ_i* y_i x_i

where the data are only involved through inner products. Finally, the classifier is equal to

f(x) = sign( x^T Σ_{i=1}^{n} λ_i* y_i x_i + y_l − x_l^T Σ_{i=1}^{n} λ_i* y_i x_i )

for every vector x. In the solution, every point x_i whose multiplier λ_i is strictly greater than zero is a support vector and belongs to one of the two hyperplanes H_1, H_2 defined by the threshold function. All the remaining points of the training set have the corresponding λ_i equal to zero, so that they do not influence the classifier. The support vectors are the critical points of the training set and are the nearest to the separating hyperplane. If all the other points were removed, or their positions changed without going beyond the planes H_1, H_2, and the learning algorithm were repeated, the result would be exactly the same. Since the vectors of the dataset are only involved through products between the test point x and the support vectors x_i, solving the optimization problem implies computing the products x_i^T x_j between all training points. The computational cost of SVM is therefore O(n²k), where n is the number of vectors and k is their dimension.
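A minimal sketch of this construction, assuming scikit-learn is available: a linear SVM with a very large C approximates the hard-margin problem of Equation 4.4, and w* and b* are recovered from the support vectors and their multipliers (the toy data below are illustrative, not taken from the thesis).

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
# Two well separated clusters (illustrative, linearly separable data).
X = np.vstack([rng.randn(20, 2) + [0, 0], rng.randn(20, 2) + [5, 5]])
y = np.hstack([-np.ones(20), np.ones(20)])

# A very large C approximates the hard-margin formulation.
svm = SVC(kernel='linear', C=1e6).fit(X, y)

# dual_coef_ stores y_i * lambda_i* for the support vectors only.
alpha_y = svm.dual_coef_.ravel()
sv      = svm.support_vectors_

# w* = sum_i lambda_i* y_i x_i, recovered from the support vectors.
w_star = alpha_y @ sv
b_star = svm.intercept_[0]

print("number of support vectors:", len(sv))
print("w* =", w_star, " (coef_ agrees:", svm.coef_.ravel(), ")")
print("separation margin rho = 2/||w*|| =", 2 / np.linalg.norm(w_star))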

Figure 4.3: An example of binary nonlinearly separable data with hyperplane, support vectors and margin.

4.2.2 Soft margin classification

Sometimes the training set holds points that are not linearly separable. This means that there are some noisy data, which have an anomalous position with respect to the other elements of the same class, as for instance the point x_i in Figure 4.3. The aim of SVM with the soft margin approach is to find a separating plane for this kind of set; in this case one should admit that some constraints may be broken. Hence, in order to allow misclassification of difficult examples, new quantities ξ_i, called slack variables, are added to every constraint. Moreover, the cost function is modified so that non-null slack variables are penalized. The resulting margin is called soft margin. Suppose that if a point x_i has an anomalous position, it has a distance equal to −ξ_i/||w|| from its class, as reported in Figure 4.3. As a consequence, ξ_i is large if anomalous points are distant from their class. If x_i belongs to the correct class, ξ_i = 0. As in the previous case, the hyperplane has distance |b|/||w|| from the origin and is determined by the support vectors. Formally, the constraints of Equation 4.4 become

y_i (w^T x_i + b) ≥ 1 − ξ_i,   ξ_i ≥ 0,   i = 1, ..., n

The new constraint allows a certain degree of tolerance, equal to ξ_i, to errors. A point of the training set is misclassified if and only if the corresponding ξ_i is greater than one. As a consequence, Σ_i ξ_i is an upper bound on the maximum number of errors on the dataset.

Then the problem represented by Equation 4.4 can be reformulated as

min   (1/2) ||w||² + C ( Σ_{i=1}^{n} ξ_i )^k
sub   y_i (w^T x_i + b) ≥ 1 − ξ_i     (4.9)

      ξ_i ≥ 0

with respect to the variables w, b, ξ. Here C and k are parameters that should be defined a priori. The constant C > 0 can be viewed as a cost that allows to control overfitting, since it trades off the relative importance of maximizing the margin and fitting the training data. A large value of C corresponds to a high penalty assigned to errors. In practice, the SVM algorithm tries to minimize ||w|| and at the same time to separate the given points well, making the minimum number of errors. The solution of the optimization problem given by Equation 4.9 can be found following the same procedure as in the case of linearly separable data. The Lagrangian is

L(w, b, λ, ξ, γ) = (1/2) ||w||² − Σ_{i=1}^{n} λ_i [ y_i (w^T x_i + b) − 1 + ξ_i ] − Σ_{i=1}^{n} γ_i ξ_i + C ( Σ_{i=1}^{n} ξ_i )^k

whose multipliers λ = (λ_1, ..., λ_n) and γ = (γ_1, ..., γ_n) are linked to the constraints of Equation 4.9. This Lagrangian should be minimized with respect to w, b, ξ and maximized with respect to λ ≥ 0 and γ ≥ 0. Considering the simple case where k = 1, the optimization problem can be reformulated in a manner similar to Equation 4.8:

max_λ   Σ_{i=1}^{n} λ_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} λ_i λ_j y_i y_j x_i^T x_j

sub   0 ≤ λ_i ≤ C,   i = 1, ..., n
      Σ_{i=1}^{n} y_i λ_i = 0

where the only difference with respect to the previous case is that the dual variables λ_i are constrained by an upper bound equal to C. The solution of this problem is similar to the solution for separable data. Again, the x_i with nonzero λ_i are the support vectors. The solution of the dual problem is

w* = Σ_{i=1}^{n} λ_i y_i x_i

b* = y_l (1 − ξ_l) − Σ_{i=1}^{n} λ_i y_i x_i^T x_l,   ∀ l such that λ_l > 0

Then the classifier is

f(x) = sign( x^T w* + b* )
     = sign( x^T Σ_{i=1}^{n} y_i λ_i* x_i + y_l (1 − ξ_l) − Σ_{i=1}^{n} λ_i y_i x_i^T x_l )
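The role of the parameter C can be illustrated with a short scikit-learn sketch (a minimal example; the overlapping toy data and the three values of C are illustrative): a small C tolerates more violations ξ_i and yields more support vectors and a wider margin, while a large C penalizes errors heavily.

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(2)
# Two overlapping clusters: not perfectly linearly separable.
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + [2.0, 2.0]])
y = np.hstack([-np.ones(50), np.ones(50)])

for C in (0.01, 1.0, 100.0):
    svm = SVC(kernel='linear', C=C).fit(X, y)
    w = svm.coef_.ravel()
    margin = 2 / np.linalg.norm(w)            # rho = 2 / ||w||
    errors = np.sum(svm.predict(X) != y)      # training points on the wrong side
    print(f"C={C:7.2f}: {len(svm.support_vectors_)} support vectors, "
          f"margin={margin:.2f}, training errors={errors}")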

4.2.3 Nonlinear classifier

In the situations considered previously, the data belong to a finite dimensional space where the classification problem may be managed and solved. Indeed, both the linear SVM and the soft margin approach look for a separating hyperplane that maximizes its margin in the input space. In particular, in the first case the data are linearly separable directly in their input space. Similarly, in the second case the classification can be done correctly by working on the input space and introducing penalized slack variables. Here the data are indeed linearly separable except for a few anomalous observations. However, the last solution does not always assure good performance, since a hyperplane can only represent dichotomies. On the contrary, sometimes the classes are not at all linearly separable in the input space. For example, no line exists that correctly splits the data represented in the plot on the left of Figure 4.4. For this reason, it was proposed that the original finite dimensional space be mapped into a much higher dimensional space, presumably making the separation easier in the new space. Cover's theorem on separability states indeed that a complex classification problem has a larger probability of being linearly separable if it is mapped through a nonlinear function of the data into a high dimensional space than if it is kept in a low dimensional space. Ideally, the points are mapped through a nonlinear function φ into this new space F where they are linearly separable. The plot on the right side of Figure 4.4 represents this new space, called "feature" space. Then, the optimal hyperplane is computed in the feature space following the optimization problem with slack variables. The soft margin approach indeed controls overfitting. Moreover, to keep the computational load reasonable, the mappings used by SVM schemes are designed to ensure that the products required in the solution may be computed easily from the original variables. This requires the definition of such products in terms of a kernel function selected to suit the problem. Hence, nonlinear SVM projects the examples into a multidimensional space and looks for the best separating hyperplane in this space. The optimal separating hyperplane maximizes its margin with simple computations by means of the kernel function. Suppose the original data, which are not linearly separable in their input space X ⊆ IR^k, are mapped into a new space of higher dimension F ⊆ IR^p with p > k using a mapping function φ : X → F. The data are linearly separable in F. As a consequence, the coordinates in the feature space of the points that represent the observations of the dataset are equal to φ(x). Then the optimal hyperplane belonging to the feature space is identified by the equation

φ(x)^T w + b = 0

[Figure 4.4: An example of data requiring a nonlinear classifier, represented in the input space (left) and in the feature space (right).]

with w ∈ IR^p, which becomes

φ(x)^T Σ_{i=1}^{n} y_i λ_i φ(x_i) + b = 0

with w = Σ_{i=1}^{n} y_i λ_i φ(x_i). In this situation the learning algorithm depends on the data only by means of the products between their images through φ in IR^p, i.e. φ(x_i)^T φ(x). As a consequence, the high dimensional space causes computational problems, since the learning algorithm should work with vectors of larger dimension. In order to overcome this difficulty, a kernel function can be introduced. A kernel function gives the product between the images of its two arguments, i.e.

K(x_i, x_j) = φ(x_i)^T φ(x_j)     (4.10)

Using the kernel function K(x_i, x_j) inside the algorithm, it is thus possible to ignore the explicit form of φ and to avoid computing the data images (vectors of the feature space) and their products. Indeed, the hyperplane can be identified by the equation

Σ_{i=1}^{n} y_i λ_i K(x_i, x) = 0

An example of a kernel function with bidimensional vectors x = [x_1, x_2] is K(x_i, x_j) = (1 + x_i^T x_j)². The proof needs to show that K(x_i, x_j) = φ(x_i)^T φ(x_j), and it is the following:

K(x_i, x_j) = (1 + x_i^T x_j)²
            = (1 + x_{i1} x_{j1} + x_{i2} x_{j2})²
            = 1 + x_{i1}² x_{j1}² + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}² x_{j2}² + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}
            = [1, x_{i1}², √2 x_{i1} x_{i2}, x_{i2}², √2 x_{i1}, √2 x_{i2}] · [1, x_{j1}², √2 x_{j1} x_{j2}, x_{j2}², √2 x_{j1}, √2 x_{j2}]^T
            = φ(x_i)^T φ(x_j)

where φ(x) = [1, x_1², √2 x_1 x_2, x_2², √2 x_1, √2 x_2]. Thus, a kernel function implicitly maps data to a high-dimensional space without the need to compute each φ(x) explicitly. The kernel function can be represented by a kernel matrix K equal to

K = [ K(1, 1)  K(1, 2)  ...  K(1, n)
      K(2, 1)  K(2, 2)  ...  K(2, n)
      ...
      K(n, 1)  K(n, 2)  ...  K(n, n) ]     (4.11)

It has a number of rows and columns equal to the number of data points n, and every cell records the value of the kernel function with arguments corresponding to the elements indicated by the row and column indexes. It contains all the information needed for learning and summarizes the information about the data and the kernel function. Since K(i, j) = K(j, i) it is symmetric; moreover, it is a positive semi-definite matrix. Mercer's theorem states that every positive semi-definite symmetric function is a kernel, i.e. it can be seen as a product of two vectors⁵. In addition, positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix⁶. As a consequence, every positive semi-definite symmetric matrix can be seen as a kernel matrix. Replacing x_i^T x_j with K(x_i, x_j) in every step of the algorithm, an SVM that works on IR^p is obtained. It gives the result in the same amount of time as one that works with the original non-mapped data. This is a linear classifier built in a different space, so all the previous considerations still hold. The extension to complex decision spaces is rather simple: the input variable x is mapped into a new space with higher dimension, then a linear classifier works in this new space.
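The identity proved above can be checked numerically (a minimal NumPy sketch; the two test vectors are arbitrary).

import numpy as np

def phi(x):
    # Explicit feature map for the kernel K(a, b) = (1 + a^T b)^2 with 2-d inputs.
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

def K(a, b):
    return (1.0 + a @ b) ** 2

xi = np.array([0.3, -1.2])
xj = np.array([2.0,  0.5])

print(K(xi, xj), phi(xi) @ phi(xj))   # the two numbers coincide
assert np.isclose(K(xi, xj), phi(xi) @ phi(xj))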

Formally, a nonlinear classifier works as described in the following. A point x is mapped into a new feature vector by means of the function φ:

x → φ(x) = (a_1 φ_1(x), a_2 φ_2(x), ..., a_p φ_p(x))     (4.12)

⁵ A more detailed description of Mercer's theorem is given in Appendix A.3.
⁶ A formal definition of the Gram matrix is presented in Appendix A.4.

where the a_i are real numbers and the φ_i are real functions, with i = 1, ..., p. Then the same algorithm as in the soft margin case can be applied, where every value x_i is replaced by the new feature vector φ(x_i). The decision function with the mapping of Equation 4.12 becomes

f(x) = sign( φ(x)^T w + b ) = sign( φ(x)^T Σ_{i=1}^{n} y_i λ_i φ(x_i) + b )

Replacing the products φ(x_i)^T φ(x_j) with a kernel function

K(x_i, x_j) ≡ φ(x_i)^T φ(x_j) = Σ_{l=1}^{+∞} a_l² φ_l(x_i) φ_l(x_j)

the Lagrangian becomes

L(λ) = Σ_{i=1}^{n} λ_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} λ_i λ_j y_i y_j K(x_i, x_j)

and it should be maximized with the constraints

Σ_{i=1}^{n} λ_i y_i = 0

0 ≤ λ_i ≤ C,   i = 1, ..., n

Solving the optimization problem, the following result is obtained:

f(x) = sign( Σ_{i=1}^{n} y_i λ_i* K(x, x_i) + b* ) = sign( Σ_{x_i ∈ SV} y_i λ_i* K(x, x_i) + b* )

where SV is the set of support vectors and the λ_i* are the optimal values corresponding to the support vectors; the others are null, so they do not contribute to the sum.

The choice of the kernel function is a tricky problem: it is always possible to map elements into a space of dimension higher than the input space in order to produce a perfect classifier; however, it will work poorly on new data because of overfitting. Some kernel functions usually applied are, for instance:

- Linear: K(x_i, x_j) = x_i^T x_j, with mapping φ(x) = x

- Polynomial of power m: K(x_i, x_j) = (1 + x_i^T x_j)^m, with a mapping φ(x) that has (d+m choose m) dimensions

- Radial Basis Function: K(x_i, x_j) = exp(−γ ||x_i − x_j||²)

- Gaussian Radial Basis Function: K(x_i, x_j) = exp( −||x_i − x_j||² / (2σ²) ), with a mapping φ(x) that is infinite-dimensional⁷

- Multi-layer Perceptron or Sigmoid: K(x_i, x_j) = tanh( b (x_i^T x_j) − c )

As regards the kernel, Gaussian or polynomial ones are used by default, but if they are ineffective, more elaborate kernels should be defined. An SVM is based on two concepts: a general learning module and a kernel function which is specific to the problem. Every SVM can work with every kind of kernel. Every SVM algorithm is then composed of four main steps: i) selection of the kernel function; ii) choice of the value for C; iii) solution of the quadratic programming problem, for which many software packages are available; iv) construction of the discriminant function from the support vectors found (a minimal sketch following these steps is given below). Nonlinear SVM has the ability to handle large feature spaces and its complexity does not depend on the dimensionality of the feature space.
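The four steps above can be followed with scikit-learn in a few lines (a minimal sketch; the ring-shaped toy data and the parameter values are illustrative), comparing a linear kernel with two of the nonlinear kernels listed above.

import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Ring-shaped data: not linearly separable in the input space.
X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

# i) kernel selection and ii) choice of C; steps iii)-iv) are handled inside SVC.fit.
candidates = {
    "linear":       SVC(kernel="linear", C=1.0),
    "polynomial":   SVC(kernel="poly", degree=2, C=1.0),
    "Gaussian RBF": SVC(kernel="rbf", gamma=1.0, C=1.0),
}
for name, svm in candidates.items():
    score = cross_val_score(svm, X, y, cv=5).mean()
    print(f"{name:12s}: cross-validated accuracy = {score:.2f}")

On such data the nonlinear kernels clearly outperform the linear one, as expected from Cover's theorem.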

4.3 Kernel PLS

Two major strategies for constructing nonlinear PLS have been previously mentioned (Subsection 2.3.3). One defines a nonlinear model for the score vectors t and u; the other maps the original input data into a new feature space where linear PLS is applied as usual. Both can be combined with the Machine Learning techniques defined in this Chapter. As regards the former approach (Equation 2.11), the concept of kernel-based learning (Subsection 4.2.3) can also be used for modelling the nonlinear relation between the score vectors t and u. An example would be a support vector regression model for the nonlinear continuous function g(·). However, among all the tools of Machine Learning, the powerful machinery of kernel-based algorithms can also be usefully combined with PLS following the latter approach. Indeed, it is an elegant way of extending linear data analysis procedures to nonlinear problems. As a consequence, the nonlinear kernel PLS methodology is briefly described in this Section. The idea of the kernel PLS method is based on the mapping of the original data from their k-dimensional space (X ⊆ IR^k) into a high dimensional feature space F corresponding to a reproducing kernel Hilbert space. Then, the estimation of PLS in the feature space F reduces to the application of linear algebra, as simple as in linear PLS. Formally, a function

φ : X → F

should be defined in order to compute the new coordinates {φ(x_i)}_{i=1}^{n}, since the products φ(x_i)^T φ(x_j) are required for classification. However, by applying the kernel trick all these steps are simplified.

⁷ Every point is mapped to a Gaussian function and the separator is the combination of the functions for the support vectors.

Table 4.1: Kernel form of NIPALS algorithm.

Kernel form of the NIPALS algorithm. Given: X (n × k), Y (n × d).

1) Initialize the vector u: if Y is 1-dimensional, assign u := Y, else u := random (n × 1) vector
2) Iterate to convergence:
   a) t = Φ Φ^T u = K u
   b) ||t|| → 1
   c) c = Y^T t
   d) u = Y c
   e) ||u|| → 1
3) p = X^T t / (t^T t) and q = Y^T u / (u^T u)
4) Deflate X and Y: X ← X − t p^T, Y ← Y − u q^T. If more projection vectors are required, go to step 2.

Indeed, the kernel trick uses the fact that the value of the product between two vectors in F can be easily evaluated by a kernel function

K(x_i, x_j) = φ(x_i)^T φ(x_j),   ∀ x_i, x_j ∈ X

as previously seen in Equation 4.10. As a consequence, the Gram matrix K of the cross products between all mapped input elements can be defined, as shown in Equation 4.11, equal to

K = Φ Φ^T

n where Φ denotes the matrix of mapped data {Φ(xi) ∈ F}i=1. Kernel trick implies that elements i, j of X correspond to value of kernel function K(xi, xj). Now, a modified version of the NIPALS algorithm should be considered where steps 2a and 2d of classical form described in Table 2.1 are merged and vectors t and u are scaled to unit norm instead of the vectors w and c. The kernel form of the NIPALS algorithm is so obtained (Table 4.1)8. Similarly, in applications where an analogous kernel mapping of the Y-space is considered, steps 2c) and 2d) can be further merged. When kernel and traditional PLS methodology are compared, it is difficult to de- fine the favourable one. While the kernel PLS approach is easily implementable, computationally less demanding and capable to model difficult nonlinear relations,

⁸ In the case of a one-dimensional y-space, computationally more efficient kernel PLS algorithms have been proposed.

When kernel and traditional PLS methodologies are compared, it is difficult to say which is preferable. While the kernel PLS approach is easily implementable, computationally less demanding and capable of modelling difficult nonlinear relations, the loss of interpretability of the results with respect to the original data limits its use in some applications. On the other hand, it is common to find situations where the classical approach of keeping latent variables as linear projections of the original data may not be adequate. In practice, a researcher needs to evaluate the adequacy of a particular approach based on the problem, making a trade-off between requirements such as simplicity of the solution and of the implementation, and interpretability of the results.

II Second Part

5 Fumonisins detection in maize

In this Chapter a preliminary description of the concepts involved in a PLS application on real data is reported. First of all, Section 5.1 gives a gradual but complete overview of the area of applicability, introducing all the topics that are deepened in the following and showing the reasons that lie behind this kind of problem and that support further analyses. Then, Section 5.2 presents the issues of interest, fungi and mycotoxins, illustrating in a simple way their specific features together with the most troublesome factors related to their development and their management strategies. Section 5.3 goes into detail explaining the core subjects, Fusarium and fumonisins, and their critical aspects. Finally, Section 5.4 describes the applied measurement processes, which include both traditional reference analytical methods and some alternatives based on spectroscopic techniques and chemometric analyses.

5.1 Introduction

A real data PLS application, whose goal is to quickly detect fumonisin contamination in maize meal from NIR spectra, is taken into account as the reference situation. Maize plants indeed host Fusarium fungi that produce fumonisins. Since fumonisins are toxic and both cattle and people eat maize-based feed or food, contaminated crops are a very serious problem. In the following, the most important elements that characterize this question are introduced, and in the next Sections of this Chapter more details are given in order to present step by step the application field considered.

Fumonisins are a kind of mycotoxin, poisonous substances produced by the metabolism of fungi that commonly infect agricultural commodities. In every environment many species of fungi exist, and many types of mycotoxins are synthesized. Even if fungi and mycotoxins have been widely studied in the past, there are a lot of aspects of both of them that are not yet well understood. Such variability and uncertainty cause a lot of difficulties in mycotoxin control. As a consequence, after the definition of mycotoxins and their features, the most troublesome factors related to fungi and mycotoxin development and their management strategies should be highlighted.

As a natural contaminant of maize, Fusarium lives within the plant, taking some nutrients from it and changing the maize's living conditions as well as the plant's aspect. Fusarium indeed damages the whole host organism and infects the grains too. Then, if it produces fumonisins, Fusarium-contaminated maize becomes toxic. When poisonous maize is used as cattle feed or food, it causes diseases both in animals and in human beings. There are a lot of animal diseases directly linked with the regular and continuous consumption of contaminated maize-based feed, and some human illnesses are probably due to the ingestion of fumonisins. However, as regards fumonisins as well as many other mycotoxins, there is a Tolerable Daily Intake (TDI) that, if not exceeded, assures healthy living conditions. In order to avoid dangerous situations both for cattle breeding and for human beings, international authorities set such values as legal limits for many mycotoxins with respect to different maize-based feeds and foodstuffs. Other regulations suggest methods to keep mycotoxins at a practically acceptable level.

In order to abide by the legal limits and at the same time avoid economic losses, maize producers should monitor the fumonisin content in maize batches so that only uncontaminated grains are selected. For this reason, procedures for fungi and mycotoxin detection are very important. Currently, the reference measurement processes consist of analytical methods that include a wide array of in-laboratory testing. They are nowadays chosen as quality and safety control methodologies by the food and feed industries. However, even if accurate, they are expensive and time consuming. As a consequence, cheaper and faster methods are highly needed. Recently, alternatives based on spectroscopic techniques have for instance been suggested. Indeed, Near Infrared (NIR) spectra carry useful information about the composition of the analyzed substance and they can be recorded quickly. The relationship between NIR spectra and fungi or mycotoxin content should then be investigated. Since this is not a simple and direct link, and it is based on many variables, chemometric analysis and statistical tools are required. With a multidisciplinary approach, a successful application of the NIR methodology can be obtained in which the mycotoxin content can be predicted for future observations (a minimal sketch of such a calibration is given below).
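As a minimal illustration of this kind of chemometric calibration (a sketch only: the spectra below are synthetic, and scikit-learn's PLSRegression stands in for the PLS implementation discussed in the thesis), a PLS model is trained to predict a contamination level from spectrum-like inputs with many more variables than instances.

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
# Synthetic "spectra": 120 samples x 500 wavelengths (more variables than instances).
n, k = 120, 500
concentration = rng.uniform(0, 10, n)                 # illustrative contamination level
wavelengths = np.linspace(0, 1, k)
peak = np.exp(-((wavelengths - 0.6) ** 2) / 0.01)     # absorption band tied to the analyte
spectra = (concentration[:, None] * peak
           + rng.randn(n, k) * 0.2                    # measurement noise
           + rng.uniform(0, 1, (n, 1)))               # baseline shift

X_train, X_test, y_train, y_test = train_test_split(
    spectra, concentration, test_size=0.3, random_state=0)

pls = PLSRegression(n_components=5).fit(X_train, y_train)
print("R^2 on held-out spectra:", pls.score(X_test, y_test))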

5.2 Mycotoxins

Food and health are primary requirements for every human being, as Maslow's hierarchy of needs (Maslow, 1954 [41]) states, and as daily experience shows, too. As a direct consequence, everyone is interested in consuming safe and healthy dishes that do not cause disease or illness. On the other hand, from the producers' point of view, safety assessment and quality assurance of foodstuffs are the most important goals of industries which take care of consumers' interests in order to win in a competitive market. But health and quality are not only properties of the final product, to be reached in the last phase of the production process; actually they are already required at the beginning of the supply chain, from the fields where agricultural commodities grow and are harvested. Here, real situations are usually characterized by many critical factors that producers should control in order to assure the optimal features of the products that consumers want to be satisfied. The presence of mycotoxins in food or feed components is, for example, one of the most common problems, causing several alarming situations throughout the world.

Agricultural commodities, as host organisms, usually host micro-organisms like mushrooms and moulds, already in the field or after harvest, in a symbiotic relationship. Some fungi are indeed defined endophytes, namely they live within another living being without apparently harming the plant and the ears. Metabolism is the set of chemical reactions that occur in living organisms in order to maintain life: by breaking down organic matter or using energy, this process allows the organism to grow, developing and maintaining its structures, to reproduce and to respond to its environment. Metabolites are the relatively small molecules of organic compounds that play the role of intermediates or products in the chemical reactions of metabolism. There is a distinction between two groups of metabolites based on their functions: primary metabolites are directly involved in the normal growth, development and reproduction of the organism; secondary metabolites are necessary neither for the growth nor for the development of fungi but usually have other functions.

Mycotoxins are toxic secondary metabolites produced by organisms of the fungus kingdom, including mushrooms, moulds and yeasts. Their role in the life of fungi is not always completely known. Sometimes the importance of these compounds to the organism is of an ecological nature, since they are used as defences against predators, parasites and diseases, for interspecies competition and to facilitate the reproductive processes (colouring agents, attractive smells, etc.). Sometimes mycotoxins are harmful to other micro-organisms such as other fungi or even bacteria (penicillin is one example). But in many cases the reason for the production of mycotoxins has not yet been discovered. Moreover, unlike for primary metabolites, the absence of secondary metabolites results not in immediate death, but in a long-term impairment of the organism's survivability and fecundity or aesthetics, or perhaps in no significant change at all. In addition, the relationship between fungal infection and mycotoxin occurrence is also not yet very well understood. Once the fungus enters the plant, mycotoxins can eventually reach the food or feed chain. If the plant has visible symptoms of fungal infection but mycotoxin levels are null, there is no problem, even if the fungi will probably produce mycotoxins as soon as suitable conditions occur. Usually it gets worse, because fungi often occur in symptomless kernels beyond the seedling stage and can contribute, without visual signs, to the total mycotoxin contamination. So even if there are no visible symptoms of fungal infection, both fungi and toxin can exist. Dowell et al. observed for instance that all the analyzed groups of asymptomatic kernels had measurable fumonisins (Dowell et al., 2002 [19]). Furthermore, mycotoxins can be found in cereals where there are no fungi, because they can persist beyond the death and presence of the fungi. Finally, it should be stressed that both fungi and mycotoxins may occur at each post-harvesting stage; for example, inadequate drying of the fruit before storage promotes fungi and mycotoxin presence in cereal batches delivered to elevators or mills. But here fungal development and mycotoxin synthesis can be potentially well controlled, respecting the maximum limits of relative humidity and other environmental conditions. Usually one species of fungus may synthesize many different mycotoxins, and a mycotoxin can be produced by several fungi. Due to the minute size of spores, fungi and mycotoxins are found almost everywhere in extremely small quantities; then, under suitable conditions, fungi proliferate into colonies and mycotoxin levels become high. Mycotoxins can be found in many types of food components like spices, peanuts and nuts throughout the world. But cereals are often the most suitable environment where some species of fungi grow up, live together, develop and produce mycotoxins. Cereals, especially maize (Zea mays L.), are indeed the most commonly contaminated crops in Europe. Fungal infection of field grown maize is a chronic troublesome problem, because moulds and mushrooms generally attack the whole plant, impairing health and quality, reducing yield and causing great economic losses. Mycotoxins are poisonous substances that greatly resist decomposition or being broken down in digestion, and neither temperature nor treatments such as cooking and freezing destroy them.
So, whereas fungi damage plants, from roots to ears, the ingestion of mycotoxins can cause several acute or chronic diseases (sometimes even death) both in cattle breeding and in human beings, considering that cereal grains are components of cattle feed and that they contribute to over 60% of the total world food production. In particular, mycotoxins enter the food chain either directly, with the contamination of cereal-based food, or indirectly, because mycotoxins can still be found in meat and dairy products.

Table 5.1 reports five mycotoxins considered to be extremely important throughout the world both from a toxicological and an economical point of view, the fungi that produce them and the registered diseases they cause to maize-fed livestock and to humans. Aflatoxin (there are 4 types) is the most abundantly found and most commonly known mycotoxin. AFB1 typically occurs in small concentrations, leaving little indication of its presence on the kernel surface; it infects the kernel germ and its effects on cereals are alteration of membrane integrity and inhibition of protein synthesis. The fungi that produce Deoxynivalenol (DON) are the most important and widespread agents of Fusarium head blight (FHB), a fungal disease of small grains, mainly winter cereals and maize, which reduces yield and seed quality in several areas worldwide. Fusarium is a species that commonly lives in maize and, among other mycotoxins, it also produces fumonisins (further details are given in Section 5.3).

Table 5.1: Mycotoxins, the fungi that produce them and the diseases they cause.

Mycotoxin - Fungi species - Disease (IARC*)

Aflatoxin - Fungi: Aspergillus (A. flavus, A. parasiticus, A. nominus); Penicillium. Disease: AFB1 and AFM are carcinogenic (liver cancer), mutagenic, hepatotoxic, immunosuppressive; they belong to group 1 of the carcinogen classification for animals and humans.

Ochratoxin - Fungi: Aspergillus ochraceus; Penicillium verrucosum. Disease: Ochratoxin A is a potent renal toxin in animals, but there is no evidence of renal carcinogenicity.

Deoxynivalenol - Fungi: Fusarium graminearum; Fusarium culmorum. Disease: protein synthesis inhibition, effects on DNA and RNA synthesis, mitochondrial function inhibition, effects on cellular membranes and on correct cell division, apoptosis; immunosuppressive effect.

Zearalenon - Fungi: Fusarium graminearum.

Fumonisins - Fungi: Fusarium verticillioides; Fusarium proliferatum; Fusarium graminearum. Disease: B1 causes leukoencephalomalacia, pulmonary edema in pigs and cancer in rats. FB1 belongs to "Group 2B carcinogens", i.e. possibly carcinogenic to humans.

*IARC: International Agency for Research on Cancer, 2002 [35]

5.2.1 Fungi and mycotoxin development

Even if food-related fungi and mycotoxins were studied extensively worldwide throughout the 20th century, at present there is still a lack of accurate knowledge about their development conditions. Indeed, fungal inoculum and the following mycotoxin production can be defined as very complex systems that depend on many variables, some of them not yet clearly identified, and whose relationships are still poorly understood.

In general, the amount of fungal inoculum in a field varies greatly in its severity and is caused by many different factors. First of all, the use of infected seeds is surely the most obvious source of fungi in the field. Then, the field location has a primary role because it defines the extrinsic environmental conditions, such as weather variables (like temperature, humidity or drought) and soil properties, that clearly support the occurrence of fungi in kernels.

Rain and wind are directly involved in the process of spore dispersal, because they can scatter fungi even from a distance of 300 − 400 km. In particular, rainfall duration and intensity as well as droplet dimension are important: the heavier the rain, the more spores are released from the source and lost to the ground, but rain intensity is not linearly related to the number of dispersed spores. As regards wind, a minimum speed is required for spore liberation and dispersal. However, even if some conceptual models have been proposed, the specifics of the role of rain and wind in spore liberation and dispersal have yet to be quantified in detail. Other sources of inoculum in the field are the soil itself and the cereal residues incorporated into or covering the soil. The litter layer formed by previous crop residues breaks down into organic matter, modifying the soil's chemical and biological characteristics: humus is less well degraded and the soil quickly becomes more acidic, favouring the development of fungi. Moreover, residues keep water on the surface of the soil, increasing the release of Fusarium spp. spores. In particular, previous crop residues like stalks, grains or straw of maize, barley, wheat and other cereals are considered the main inoculum sources for F. graminearum and F. culmorum. Some studies state indeed that about 90% of the Fusarium spp. population is located in the first 10 cm of soil. In addition, it was observed that the mean total amount of previous crop residues (especially maize) on the soil surface shows a significant positive correlation with wheat Fusarium spp. infection, causing the plant disease called FHB (Fusarium Head Blight) and DON contamination; analogous results can be expected for other cereals and mycotoxins (Maiorano et al., 2008 [39]). Also, fungal inoculum is promoted by stress or damage of the plant due to insect activity at and after harvest. In temperate areas the second generation larvae of Ostrinia nubilalis, commonly named European Corn Borer (ECB), are an important vector principally of F. verticillioides infection in maize kernels. Its feeding activity is crucial because damaged ears can suffer fumonisin contamination at rates forty times higher than healthy ones. The ECB makes the infection of F. verticillioides easy in two ways: the larvae directly damage the kernels by breaking the pericarp and giving the fungus a point of direct entry, and the same larvae can act as vectors of the inoculum and carry it inside the kernels. Fungal inoculum may occur in the field at any moment of pre-harvesting, but the most sensitive stages of maize growth are the late milk and early dough stages of kernel maturity (anthesis and grain filling), when the process of grain dry-down takes place as grain moisture loss. During flowering, spores are deposited on the tassel, then through silk infection the spores enter the ears and the fungi invade the kernels. This process is switched on/off by the appearance and duration of silking, which is determined by the maize phenological stage. This is another critical factor of fungi occurrence in maize. Then, the hybrid genotype of the host organism promotes or prevents, with more or less strength, the development of fungi. Indeed, it defines intrinsic environmental conditions such as susceptibility, metabolism, defence mechanisms and other micro-climatic factors inside the crop that change fungi levels. Finally, once the inoculum enters the maize, most fungi, due to their aerobic nature (use of oxygen), consume organic matter wherever humidity and temperature are sufficient, so they live and grow, also producing mycotoxins.

[Figure 5.1: Diagram of the critical factors that influence fungi occurrence. Arrows explain cause-effect relationships and more details are given near them.]

In conclusion, a lot of variables cause fungal infection in cereal crops, and Figure 5.1 summarizes all those listed above. However, since many aspects are not yet under control, further investigations to develop more accurate quantitative relationships between the critical factors are needed. Nevertheless, by applying suitable agronomic techniques, the effects of these critical factors and of the interactions between them on fungal proliferation and mycotoxin synthesis can be reduced. Some examples of best practices are given in the following Subsection.

5.2.2 Risk management strategies

Until now the general features of mycotoxins have been described: toxicity, unknown role and uncertain relationships with their producers, frequent occurrence of fungi in cereals and the wide geographical distribution of cereals, high resistance and association with known animal and human diseases. Moreover, the previous Subsection presented many variables that influence fungi and mycotoxin development but that cannot be so simply controlled. Considering all these critical aspects, fungi and mycotoxins can nowadays potentially cause rather dangerous situations, since they cannot be completely removed. However, the risk related to the consumption of fumonisin-contaminated maize-based products, both by maize-fed livestock and by human beings, can be minimized and the yield improved. Indeed, authorities set some guidelines as regards mycotoxin management strategies and promote the modernization of agricultural techniques in order to obtain higher hygienic and sanitary standards for grain lots, above all for those destined for human consumption, so that health can be assured. In particular, they recommend some agricultural and manufacturing best practices that allow to manage carefully the critical factors and the interactions between them. As a consequence, suitable development conditions for fungi and mycotoxins can be prevented, the probability of contamination will decrease, mycotoxin occurrence can be reduced and critical situations for both producers and consumers can be avoided. EC 583/2006 [11] concerns the prevention and reduction of Fusarium toxins in cereals and cereal products and it lists the risk factors to be taken into account in good agricultural practices to improve maize grain quality from sowing to harvesting, from storage to transport, from field to dry kiln, mills or farm, in every phase and in every place where fungi and mycotoxins can develop. Some examples of risk management strategies are described hereinafter.

An accurate choice of the sowing time is, for many reasons, the main simple strategy that can be adopted to reduce the probability of fungal infection. In temperate regions, indeed, the sowing time sets the pedoclimatic conditions of the whole following growth period. First of all, a wrong sowing time can dramatically change conditions such as relative humidity, temperature and silk wetness that control the silking stage. Secondly, a delay in sowing also leads to a later harvest with a more prolonged kernel dry-down. As a consequence, extending the crop's permanence in the field increases the probability of having the final part of ripening with both the worst weather conditions and high kernel moisture, which favour fungal development. Thirdly, later sowing times shift the ear development phase to a period with greater insect activity, producing more kernel injuries. Considering that mycotoxins are clearly related to ear damage caused by insect activity and that the ECB attack is consistent in European temperate areas, an early sowing time is preferred to a later one, particularly in cooler and wetter sites, in order to obtain a lower amount of stressed and damaged plants. It is indeed an effective technique to reduce the ECB impact on maize production and fumonisin accumulation, because it delays the appearance of ECB injury and reduces ECB feeding activity. It should hopefully be combined with the use of chemical treatments, the main strategy adopted in Italy to manage ECB damage.

As regards cereal residues on soil, maize debris is the principal substrate where the infection process starts its cycle and represents an ideal source of inoculum up to the second year. Direct drilling and minimum tillage, with only a partial and very superficial burying, leave a high density of crop residues on the soil surface. As a consequence, even if their cost is almost null, they cause high fungal infection and mycotoxin contamination, and the expected sanitary quality level of the grain is then very low. Stalk baling, which removes residues from the soil surface, can be an effective agronomic management strategy to assure a medium-high grain quality level, and it can represent a good compromise between grain safety and management costs; its effectiveness, however, cannot fully match that of manual harvesting of residues from the soil surface. Ploughing is the most useful and safe crop residue management strategy because it removes residues from the first layers of soil, burying them to a depth of about 15–30 cm, and thus it strongly decreases the amount of inoculum source available for disease development. The negative aspect of ploughing is its higher cost. In Italy, and in all of Europe, a trend towards reduced tillage has been observed in recent years for environmental (avoiding erosion and increasing soil organic matter) and economic (tillage costs) reasons. As a consequence, minimum tillage and no-tillage (direct drilling) systems are used without taking into account the consequences on product health and quality.

In addition, the selection of hybrids with higher early vigour and tolerance to biotic and abiotic stresses improves maize productivity and genetic resistance to visible kernel mould; in recent years this has become one of the most commonly chosen practices. On the contrary, the use of a late-maturity hybrid could lead to a higher risk of mycotoxin contamination, as it can prolong the ripening period and maintain conditions more favourable to fungal development. Finally, plant density and nitrogen fertilization are other important crop techniques that, like sowing and the choice of hybrid maturity group, can influence vegetative plant growth and the length of the maturation period.

5.3 Fumonisins

Fumonisins, first isolated in 1988 by Gelderblom et al., are mycotoxins synthesized as secondary metabolites by sixteen fungal species primarily belonging to the Fusarium genus (Gelderblom et al., 1988 [27]). The main producers of fumonisins are Fusarium verticillioides (Saccardo) Nirenberg 1976 (synonym: Fusarium moniliforme J. Sheldon 1904; teleomorph Gibberella moniliformis Wineland) and Fusarium proliferatum (Matsushima) Nirenberg 1976, both belonging to the Liseola section, and Fusarium graminearum from the Discolor section (Nelson et al., 1983 [47]). Twenty-eight different fumonisins have been identified to date, and they are clustered into four groups (A, B, C and P series) based on structural similarities. Fumonisin B1 (FB1) and fumonisin B2 (FB2) occur most frequently and have toxicological significance: they indeed support cancer-promoting activity in maize-fed livestock and in humans. On the contrary, other types of fumonisins (such as B3, B4, A1 and A2) occur in very low concentrations and are less toxic.

Fusarium species are among the most common natural contaminants of maize plants (Zea mays) worldwide and, compared with other cereals such as sorghum, wheat and barley, maize shows the highest fumonisin production (Munkvold and Desjardins, 1997 [46]).

Figure 5.2: Chemical structure of fumonisins FB1 and FB2.

In particular, FB1 usually accounts for about 70% of the total fumonisin content found in naturally infected maize (Rheeder et al., 2002 [54]). Indeed, every year Fusarium infections can be isolated from most field-grown maize, and fumonisin contamination is reported in the most important maize-growing areas, especially where maize is grown as a monoculture. Visible symptoms of Fusarium infection in the maize ear can be reddish-purple streaks restricted to the pericarp, a small portion of the endosperm showing white discolouration, or shrivelling over the kernel surface. In addition, Fusarium also infects cereal seedlings, roots, stalks and grains, robbing nutrients, decreasing fat, protein and vitamin content, and often changing their colour and consistency (texture). However, both Fusarium and fumonisins can also be found in symptomless kernels and plants.

Fusarium species have historically been recognized as the cause of different, potentially serious diseases of maize-fed livestock. Fumonisins, indeed, bring on acute or chronic primary mycotoxicosis, since they inhibit the biosynthesis of complex sphingolipids, which are endogenous antitumour agents. Moreover, fumonisin mycotoxicosis can involve secondary diseases of maize-fed livestock, so that fumonisins can be considered carcinogenic and immunosuppressive, sometimes causing the animals' death. For instance, FB1 is directly related to equine leukoencephalomalacia1 (Marasas et al., 1988 [40]) and pig pulmonary edema (Harrison et al., 1990 [30]). In several animal species, such as rats and mice, it is hepatotoxic and nephrotoxic (Gelderblom et al., 1991 [28]), and it is linked to hypercholesterolemia, immunological alterations, and renal and liver toxicity. Fumonisins are characterized by a high thermostability, since their molecular structure (reported in Figure 5.2) is destroyed only after thermal exposure above 220 °C. Moreover, fumonisins are not metabolized by animals so that, due to carry-over, they could be found in cattle products like milk, cheese, meat or eggs. As a consequence, human contamination has two kinds of sources: consumption of directly contaminated maize-based dietary staples or of indirectly toxic food produced by infected animals. A possible association between human esophageal cancer and a maize-based diet with high levels of fumonisins has been suggested in South Africa and China, although at present no definitive conclusion can be made. The International Agency for Research on Cancer states that there is inadequate evidence in humans for the carcinogenicity of fumonisins, so that it classifies FB1 in "Group 2B carcinogens", i.e. possibly carcinogenic to humans (IARC, 2002 [35]).

1Leukoencephalomalacia (ELEM) is a fatal brain disease of horses, donkeys, mules, and rabbits.

Beyond the previously cited best practices, international authorities also set legal limits for mycotoxins in several maize-based feed and foodstuffs, in order to avoid unsafe situations for both animals and humans. The EU Scientific Committee established a Tolerable Daily Intake2 (TDI) for fumonisins with respect to several maize products intended for either animals or humans. Commission Regulation (EC) No 1126/2007 [13], amending (EC) No 1881/2006 [12], sets maximum levels for certain contaminants in several foodstuffs as regards Fusarium toxins in maize and maize products. In particular, the legal limits for the sum of FB1 and FB2 with respect to different kinds of food are listed in Table 5.2, whereas Table 5.3 shows the legal limits for cattle feed as reported by Commission Recommendation 2006/576/EC [10].

2TDI is an estimate of the amount of mycotoxin in food and feed that can be ingested daily over a lifetime without appreciable health risk.

Table 5.2: Maximum levels for the fumonisins sum [FB1+FB2] in maize and maize-based foodstuffs as fixed by (EC) No 1126/2007.

2.6 Fumonisins [FB1+FB2], maximum level (µg kg−1)
2.6.1 Unprocessed maize, with the exception of unprocessed maize intended to be processed by wet milling*: 4000
2.6.2 Maize intended for direct human consumption, maize-based foods for direct human consumption, with the exception of foodstuffs listed in 2.6.3 and 2.6.4: 1000
2.6.3 Maize-based breakfast cereals and maize-based snacks: 800
2.6.4 Processed maize-based foods and baby foods for infants and young children: 200
2.6.5 Milling fractions of maize with particle size > 500 micron falling within CN code 1103 13 or 1103 20 40 and other maize milling products with particle size > 500 micron not used for direct human consumption falling within CN code 1904 10 10: 1400
2.6.6 Milling fractions of maize with particle size ≤ 500 micron falling within CN code 1102 20 and other maize milling products with particle size ≤ 500 micron not used for direct human consumption falling within CN code 1904 10 10: 2000
*The exemption applies only for maize for which it is evident, e.g. through labelling or destination, that it is intended for use in a wet milling process only (starch production).

Table 5.3: Maximum levels for the fumonisins sum [FB1+FB2] in maize and maize-based feed as fixed by Commission Recommendation 2006/576/EC.

Fumonisins [FB1+FB2], maximum level (mg kg−1)
Feed materials: maize and maize products: 60
Complementary and complete feedingstuffs for:
  pigs, horses (Equidae), rabbits and pet animals: 5
  fish: 10
  poultry, calves (< 4 months), lambs and kids: 20
  adult ruminants (> 4 months) and mink: 50

5.4 Measurement methods

Considering that large amounts of cereals are processed in the food and feed industry every day and that mycotoxin ingestion has really serious consequences, farmers, breeders and food producers in general have to pay great attention to complying with legal limits, in order to assure health standards that prevent risk to both humans and animals, and to avoid economic losses due to legal sanctions and dead stock. Since, as previously seen, mycotoxin contamination starts in the field and, even following the best agricultural and manufacturing strategies, its complete removal is impossible, the last resort to reduce critical situations is the extraction of contaminated lots from the production chain and their possible use for some different purpose. Producers are therefore primarily interested in assessing the presence of mycotoxins in their own commodities because, in the logic of the free market, this knowledge assures a competitive gain. As a consequence, it is necessary to monitor and accurately measure the mycotoxin content of cereal batches, so that grains can be screened and only those with low levels of mycotoxins selected as feed or food. Methodologies for quality and safety control of raw materials, able to detect undesirable or toxic substances as soon as possible, are therefore required. The ideal method should first of all provide accurate measurements; if it also assures cheap equipment and a fast, simple and nondestructive procedure, so much the better. Moreover, measurements are needed along the whole production chain but, of course, the earlier they are made, the lower the economic loss. It would therefore be sensible to think of procedures that could be used directly at harvest on the farm yard, or that can screen for fumonisin occurrence in cereal batches delivered to elevators or mills. Unfortunately, due to their intrinsic nature, better described hereinafter, existing techniques provide accurate results but are expensive and time consuming, and only experts can manage their elaborate laboratory phases, so that traditional methods are unsuitable for real-time control measures. As a consequence, alternatives for fungal and mycotoxin detection and quantification in food and feed components should be sought. They should of course be accurate, but preferably they should also work quickly and with minimum effort and cost.
The NIR methodology, illustrated below, is a successful solution for this kind of problem. It consists of a multidisciplinary approach that involves spectroscopy, reference chemistry, statistics and computing. NIR spectroscopy provides a technology for recording information on biological materials. In particular, the measured spectrum of a sample can be considered as its fingerprint, so that it carries all the relevant data needed. Statistical analyses, such as specific inference methods based on multivariate regression (like PLS calibration with full cross validation), turn this information into knowledge by finding out the relationship between spectra and mycotoxin level. The core of this kind of statistical analysis is indeed to define a model, based on NIR spectroscopy data, for detecting mycotoxins in cereals. Then, with the obtained statistical model, the mycotoxin value can be predicted for future observations. Both collecting NIR spectra and applying the statistical model are not very time-consuming activities. Moreover, they do not require expensive equipment or trained staff, and they are nondestructive. For these reasons, they can be considered as a measurement method for safety and quality evaluation of agricultural commodities. In conclusion, the NIR methodology could provide a first screening method to be used along the production chain, and it would have great potential for on-line quality control monitoring in food and feed industries, or for inspection services checking the maximum mycotoxin levels allowed by legislation. Indeed, it can detect mycotoxins in cereal samples; a subsequent selection of harvested cereal batches with low or negligible fumonisin levels can then be made; finally, cereal quality and health standards can be achieved.

5.4.1 Reference analytical method

To guarantee that mycotoxins are kept at a practical level, various compliance programs are currently adopted as tools to control the food and feed industries. In particular, the process of mycotoxin regulation includes a wide array of in-laboratory testing. Usually, every general analytical method consists of several phases, such as extensive extraction, clean-up and separation techniques (derivatization procedures). After sampling, the mycotoxin of interest should indeed be extracted from the matrix. The extraction is usually followed by a clean-up step in which unwanted substances, which may interfere with the detection of the analyte, are removed from the extract. A final separation step completes the procedure (Krska et al., 2007 [38]). At present, most official regulations and control methods are based on high-performance liquid chromatography (HPLC), whose standards in Europe are set by the European Committee for Standardization (CEN). Other quantitative methods of analysis for most Fusarium toxins are gas chromatography (GC) with or without mass spectroscopy, molecular immunoaffinity fluorescence identification techniques such as the enzyme-linked immunosorbent assay (ELISA), or thin-layer chromatography (TLC). In the past, HPLC with an electrospray mass spectrometer and evaporative light scattering detectors, capillary zone electrophoresis, or ion-pairing chromatographic separation with post-column derivatization and subsequent fluorometric detection were sometimes used. Fusarium mycotoxins constitute a very heterogeneous group of compounds, and separate procedures are usually applied for the quantification of individual toxins; however, multiple analytical protocols for the determination of various toxins belonging to different groups are also available. Simultaneous screening of mycotoxins was, for instance, developed using an image analysis system in combination with a line immunoblot assay. All these available chemical procedures are the most widely accepted reference methods for determining mycotoxin levels in cereals: they assure reliable and accurate measurements. Moreover, they are the best techniques when compared with alternative studies that focus on the relationship between fungal mass (ergosterol) and mycotoxin content, considering that the two are not strictly related. Nevertheless, existing methods are considered time consuming because of the extensive procedures that every phase consists of; even the more recent techniques require at least 30 min to process each mixture. Moreover, expensive solvents and reagents are needed and only trained staff can carry out such complex chemical analyses. In addition, these methods are frequently destructive and involve large quantities of grain. Finally, reliable analytical methods do not yet exist for some mycotoxins, causing a lack of sufficient surveillance data.

5.4.2 NIR measurements

Organic molecules have specific absorption patterns in the NIR region of the electromagnetic spectrum that report the chemical composition of the biological material being analyzed (Williams and Norris, 2001 [73]). For this reason, both reflectance and transmittance3 NIR spectroscopy has been used routinely to detect organic compounds and nutritive parameters, such as protein, oil, starch, fatty acids and moisture content, in matter like meat, milk, dairy products, soft and durum wheat, maize, barley, oats, soy and oleaginous seeds or flours. For example, single kernel NIR data have been used to develop a predictive model for wheat protein content (Delwiche and Hruschka, 2000 [18]). Simultaneous determination of multiple constituents is also possible. More recent studies have expanded the fields of NIR application, using NIR measurements also to detect a wide range of critical factors in many different materials. Pearson, for instance, built a near infrared inspection system to automatically detect whole almond kernels with concealed damage (Pearson, 1999 [50]). Moreover, Dowell et al. showed that NIR spectroscopy is able to differentiate uninfested wheat kernels from seeds infested with larvae of three primary insect pests (Dowell et al., 1998 [21]). Furthermore, NIR measurements have been applied to identify different mycotoxins in many commodities, such as aflatoxins in spices (Hernández-Hierro et al., 2008 [32]) or DON in single wheat kernels (Dowell et al., 1999 [20]).

Regarding maize, visible and near infrared spectroscopy are nowadays commonly used to predict the composition of maize kernels, such as the content of gross energy, oil, protein, starch and fatty acids. Comparisons between transmittance and reflectance, between maize kernels and meal, and between absolute and relative values have been studied in order to define which NIR measurement is better. In the early nineties, for instance, Orman and Schumann evaluated the accuracy of protein, oil and starch content prediction in maize grain, using near infrared calibrations developed from three types of spectral data: diffuse reflectance on both ground grain and whole grain, and diffuse transmission on whole grain (Orman and Schumann, 1991 [48]). They found that the best constituent predictions were given by equations developed from ground grain diffuse reflectance data. Moreover, Cogdill et al. confirmed that single kernel predictive models for maize oil content based on NIR transmittance would be difficult to develop (Cogdill et al., 2004 [9]). Indeed, transmittance spectra are potentially more sensitive to the density or total mass of the kernel, because maize kernels are relatively large and longer wavelength light does not penetrate individual kernels. In addition, a more recent study by Baye et al. tried to determine whether NIR reflectance or transmittance spectroscopy could be used to predict absolute or relative composition (protein, oil, fatty acids, calories, starch and weight) in both maize kernels and meal (Baye et al., 2006 [4]). They concluded that transmittance spectra have high levels of noise and are not suitable for estimating internal kernel composition, and that percent values of constituents in meal give poor predictions, whereas single kernel NIR reflectance spectra report an absolute amount of each component.

3NIR spectra can be collected either from the reflectance of an element or from the transmittance through a substance.

In recent years, interest in NIR spectroscopy has widened to include fungal infections and the presence of mycotoxins in maize. Some examples are listed hereinafter. Fernández-Ibañez et al. explored and evaluated aflatoxin AFB1 detection in intact, non-milled grains of maize and barley (Fernández-Ibañez et al., 2009 [22]). Pearson et al. analyzed transmittance and reflectance spectra to show that both could be used to distinguish aflatoxin contamination in single whole corn kernels (Pearson et al., 2001 [51]). Kramer et al. investigated whether NIR spectroscopy of single maize kernels could be used to identify transgenic kernels containing high levels of avidin, which is toxic to, and prevents the development of, insects that damage grains during storage (Kramer et al., 2000 [37]).

Finally, considering fumonisins in maize, Dowell et al. used reflectance and transmittance visible and NIR spectroscopy to detect fumonisins in single corn kernels infected with Fusarium verticillioides (Dowell et al., 2002 [19]). They concluded that NIR analysis of many kernels would not be as sensitive to fumonisin contamination as single kernel analysis. Moreover, classification results were generally better for oriented kernels than for kernels randomly placed in the spectrometer viewing area. In addition, as previously said, models based on reflectance spectra generally give higher correct classification than models based on transmittance spectra. Furthermore, Berardo et al. showed that NIR could accurately predict the percentage of F. verticillioides infection in maize kernels, as well as the quantity of ergosterol and fumonisin B1 in maize meal (Berardo et al., 2005 [6]).

In practice, NIR spectral measurements are very simple, and only the preliminary material preparation phase demands a little attention. Indeed, as in every measurement process, a standard procedure should be followed in order to assure reproducible sampling and to obtain spectra that can be compared and analyzed together, avoiding noise in the subsequent statistical analysis caused by variables that are not taken into account. For this reason, researchers should decide, for example, how to place a single kernel into the spectrometer viewing area, and staff should then maintain the same starting conditions for every analysis. Similarly, if maize meal is used, the same granularity should be guaranteed during the milling process, and NIR should analyze the same amount every time, characterized by a homogeneous and compact surface in contact with the glass of the dish. Considering this, NIR spectral measurements are faster than traditional techniques because materials can be screened rapidly (in less than one minute) and, as described above, minimum preparation is required compared with the elaborate and time-consuming procedures of chemical methods. Moreover, NIR spectroscopy needs no reagents during the analyses, so it is also cheaper than analytical methods. Due to its simplicity, no expert staff is required. Finally, NIR spectroscopy is a nondestructive technology, since it preserves the maize after the measurement for further analyses. As a consequence, NIR spectroscopy may have practical applications in on-line sample screening, and automated systems for detecting properties of single seeds in real time have been described. For example, in contrast to many other procedures used to detect internal insect infestations in wheat grain, the previously cited system developed by Dowell et al. could be incorporated into the current grain inspection process and provide the grain industry with quantitative data on internal insect infestations in wheat. Similarly, the technology of Pearson's study should provide a valuable tool for rapidly detecting aflatoxin in corn. Finally, the technology developed by Dowell et al. can be used to rapidly and nondestructively screen single corn kernels for the presence of fumonisins, and may be adaptable to on-line detection and sorting. However, the first result that gave importance to NIR spectroscopy is the Canadian wheat grading and classification method; its precision is comparable to that of official methods, but at a much lower cost. From a technical point of view, a comparison between NIR spectroscopy tools has confirmed the quality of spectral data obtained with Fourier transform instruments (FT-NIR) as opposed to dispersive NIR (grating) instruments, because their use can improve the accuracy and quality of the spectral information. Moreover, regression models developed on FT-NIR spectral data give better statistical results without mathematical pretreatment. Other methods able to detect fungal infection and toxic metabolites in infected corn samples are infrared photoacoustic spectroscopy (IR-PAS), diffuse reflectance spectroscopy (DSR), mid infrared spectroscopy with attenuated total reflection (ATR), and single kernel nuclear magnetic resonance (NMR) technology for liquid constituents.

6 Model fitting with PLS

This Chapter presents an application of PLS regression to real data. In this case, PLS regression is iteratively computed in order to define the relationship between the fumonisin content of maize and its NIR spectra. The goal of the research is indeed the discovery of an accurate, fast, cheap, easy and nondestructive method able to detect fumonisin contamination in maize. First of all, the data are described in Section 6.1, where the data collection procedures, involving sampling and the applied measurement methods, are explained; the variable of interest, the fumonisin contamination level of maize, is also analyzed. Then, Section 6.2 shows the final result, that is, the best model obtained during the regression analyses; both numerical values and graphical outputs are given and discussed. Finally, Section 6.3 draws some conclusions in order to highlight the most important aspects that should be taken into account more accurately in the future. The outcome depicted in this Chapter results from a number of different analyses in which several hypotheses were tested and specific working procedures followed. In particular, among all the evaluated models, an interesting earlier one was presented in Gaspardo et al. (Gaspardo et al., 2012 [26]). Since those results arose from a partial analysis and were promising, further research was done in order to improve them and to extract as much information as possible from the available observations. The following model belongs to these later analyses and is also described in a corresponding article submitted to Food Chemistry (Della Riccia and Del Zotto [17]).

6.1 Dataset

The dataset (134 rows × 927 columns) contains information on 134 instances, i.e. analyzed maize amounts: their NIR spectra (926 wavelengths) and their fumonisin contamination level, measured by the HPLC technique. Hereinafter, the data collection procedures followed are described: first the sampling method, then HPLC, finally the NIR measurements. The sampling method defines where, when and how maize amounts should be picked up in order to build a consistent and significant dataset. All the observations should indeed be chosen following agreed rules, so that they share the same properties and can be compared together. By doing so, the instances represent the population they belong to. As a consequence, results extracted from the specific and limited available data have a more general meaning and can be extended to the whole population. Moreover, in this way, unknown sources of variability and uncertainty are removed as far as possible and the noise carried by the data is minimized. Finally, the helpful information provided by the sampling method is usually summarized in the label that identifies every instance; these further data can be made explicit by filling new variables, allowing more useful analyses. As regards the variable of interest, the fumonisin contamination level in maize, the presence of fumonisins FB1 and FB2 is measured and their sum is considered for the following survey. The chosen reference analytical method is HPLC; procedure and equipment are described in detail. A preliminary descriptive analysis is then developed and some critical aspects are highlighted. The procedure used to record the NIR measurements is given, as well as the NIR instrument used and its specific features. The chosen NIR variables are declared and some examples of NIR spectra are shown.

6.1.1 Sampling

Maize amounts are collected in the Friuli Venezia Giulia Region (Italy) from 4 drying and storage stations and 18 dairy farms located in 5 different areas, previously identified on the basis of latitude, pedoclimatic conditions, agronomic features and farm concentration, so that representative data would be available. Freshly harvested maize is collected during September and October 2008 from the drying and storage stations. Maize harvested and stored in farm silos for at least 8 months is sampled three times (May, July and October) for two consecutive years (2008 and 2009) from the dairy farms. Sampling is carried out according to EC Regulation No 401/2006 [14] and, after drying at 60 °C overnight, maize grains are ground in a laboratory mill for 2 min at high speed, mixed accurately to ensure homogeneity and stored in a cool, dry place until analysis.

6.1.2 HPLC

FB1 and FB2 concentrations in maize are determined according to the AOAC Official Method 2001.04 (Visconti et al., 2001 [72]). FumoniTest Wide Bore immunoaffinity columns and ortho-phthaldialdehyde solution are purchased from Vicam (Watertown, MA); HPLC-grade solvents and sodium dihydrogen phosphate are from Sigma-Aldrich (Chemie GmbH, Steinheim, Germany); fumonisins from Alexis Biochemicals, Axxora (Deutschland GmbH) are used; a Nova-Pack C18 column (3.9 × 150 mm, 4 µm particle size, Waters, Ireland) is employed. The HPLC system (Varian Analytical Instruments, USA) is equipped with a Prostar 363

Table 6.1: Values of some statistics for the response variable, the fumonisins sum [FB1+FB2].

[FB1+FB2] (mg kg−1)
N. of missing: 0
Min: 0.357
Max: 11.845
Mean: 3.353
Standard deviation: 2.188
Skewness: 1.057
Kurtosis: 1.512

fluorescence detector set at 335 nm excitation and 440 nm emission. Twenty grams of maize are analyzed each time. The fumonisins sum is used in the following statistical analysis and is denoted by [FB1+FB2]. Values are on the "ppm" scale, meaning "parts per million", which is equivalent to mg kg−1.

A preliminary descriptive analysis of the variable of interest is carried out both with graphical tools, such as the histogram, and with numerical results like the main statistics (mean, min, max, standard deviation, number of missing data, etc.). The sum of fumonisins FB1 and FB2 varies from 0.357 mg kg−1 to 11.845 mg kg−1 with a mean value equal to 3.353 mg kg−1; there are no missing values, and the main statistics of [FB1+FB2] are reported in Table 6.1. The positive skewness and the histogram (Figure 6.1) confirm that there are fewer highly contaminated observations than instances with a low value of fumonisin contamination. In particular, there are only 59 observations with [FB1+FB2] greater than 4 mg kg−1, whereas the remaining 66% of observations do not exceed this threshold, which is the legal limit fixed by the European Commission for human consumption of unprocessed maize.

Figure 6.1: Histogram of the dependent variable [FB1+FB2], which records the fumonisin contamination level measured by HPLC.
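For illustration only, the following minimal sketch (in Python, with a hypothetical file name and column label) shows how the statistics of Table 6.1 and the share of observations around the legal limit could be reproduced from the HPLC results; it is not the procedure actually used for the thesis.

import pandas as pd

data = pd.read_csv("maize_fumonisins.csv")   # hypothetical file with the HPLC results
fb = data["FB1_FB2_sum"]                     # hypothetical column holding [FB1+FB2] in mg/kg

summary = {
    "n_missing": int(fb.isna().sum()),
    "min": fb.min(),
    "max": fb.max(),
    "mean": fb.mean(),
    "std": fb.std(),
    "skewness": fb.skew(),                   # positive value indicates a right-skewed distribution
    "kurtosis": fb.kurt(),                   # excess kurtosis
}
print(summary)
print("share not exceeding 4 mg/kg:", (fb <= 4).mean())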

6.1.3 FT-NIR

One hundred grams of maize meal are scanned with a Perkin Elmer FT-NIR spectrometer (Perkin Elmer Italia S.p.A.) using the integrating sphere with a Sample Cup Spinner adapter for rotation. The scanning time is approximately 50 s. The software Spectrum v.5.3 (Perkin Elmer Italia S.p.A.), installed on a personal computer interfaced with the spectrometer, records the FT-NIR spectrum as the result of the following procedure. An interferogram, a component of the detector signal modulated as a function of the optical path difference, is first stored. The interferogram is then converted to a frequency domain single-beam spectrum. A reference reflectance spectrum, taken before scanning, is finally subtracted from the single-beam spectrum to give a spectrum collected in diffuse reflectance mode. This procedure is repeated until fifty spectra have been acquired and automatically averaged into one absorbance spectrum. Absorbance is defined as the logarithm of the inverse of reflectance. Values are collected at 2 nm intervals for wavelengths between 650 nm and 2500 nm (926 variables). The final spectral data are exported into the standard JCAMP data format for the following statistical analysis.
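As a purely numerical illustration of the last step of this acquisition procedure, the sketch below averages fifty reflectance scans and converts them to absorbance; the array names and the placeholder reflectance values are assumptions, not the output of the Spectrum software.

import numpy as np

wavelengths = np.arange(650, 2501, 2)                          # 650-2500 nm at 2 nm steps: 926 variables
scans = np.random.uniform(0.2, 0.8, (50, wavelengths.size))    # placeholder for 50 diffuse-reflectance scans

reflectance = scans.mean(axis=0)                               # average of the fifty repeated spectra
absorbance = np.log10(1.0 / reflectance)                       # absorbance = logarithm of the inverse of reflectance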

Every NIR spectrum consists of the absorbance value for each wavelength from 650 nm to 2500 nm, at intervals of 2 nm, recorded in 926 predictors (Figure 6.2 shows some spectra as examples).

Figure 6.2: Examples of maize NIR spectra.

6.2 Results

Statistical inference is applied in order to explain and quantify the relationship between the fumonisin contamination level (response or dependent variable) and the maize NIR spectra (926 explanatory variables or predictors). Specific techniques are required with spectral data because of the high cardinality and collinearity of the predictors. In this case, multivariate linear regression is applied, choosing Partial Least Squares (PLS) with full cross validation, and every computation is applied to centered data. After running the PLS algorithm with full cross validation, graphical and numerical results are provided. They allow the resulting model to be checked, verifying whether it satisfies the required criteria and parameters. Summary statistics are indeed calculated from both calibration and validation residuals. Moreover, several graphical tools show the model's properties. All the chemometric analyses are developed with the software The Unscrambler v.9.6 (CAMO Software AS, Oslo, Norway), which applies the PLS1 algorithm to compute the regression model for a single response variable1.
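The Unscrambler's internal implementation is not reproduced here, but the following minimal sketch shows how an equivalent PLS1 regression with full (leave-one-out) cross validation on centered data could be set up with scikit-learn; X (the NIR predictors) and y (the [FB1+FB2] values) are assumed to be available as arrays.

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def pls1_full_cv(X, y, n_components):
    """Fit PLS1 on centered data and return calibration and full-CV predictions."""
    pls = PLSRegression(n_components=n_components, scale=False)     # centering only, no scaling
    pls.fit(X, y)                                                   # calibration on all samples
    y_cal = pls.predict(X).ravel()
    y_val = cross_val_predict(pls, X, y, cv=LeaveOneOut()).ravel()  # full (leave-one-out) cross validation
    return pls, y_cal, y_val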

The model defined through PLS with full cross validation should have a twofold purpose: i) to assure screening of maize amounts with respect to the threshold of 4 mg kg−1, in order to comply with the European Regulation for human consumption of maize-based products; ii) to provide a tool for classification and rather accurate predictions over the whole scale of fumonisin contamination level. Both these aims, together with the properties that every statistical model should satisfy, suggest some criteria and guidelines for model assessment. First of all, the final model should separate, correctly and without false positives, the observations of the dataset with respect to the EU-defined threshold of 4 mg kg−1. Secondly, the indexes of model goodness and of predictive ability should be satisfactory. In particular, the correlation between measured and predicted values is taken into account. Correlation measures the similarity between measured and predicted values, telling how close they are, so it is an indicator of the model's accuracy. The ideal model has a correlation between measured and estimated values that is as high as possible, and the predicted vs measured plot should show points along the diagonal. Thirdly, the residual variance should be decreasing and the final model should not exceed a reasonable number of PC. The calibration or validation residual variance is a measure of the information that the model cannot take into account; it is defined as the mean of the squared deviations of the residuals from their average. The aim of regression analysis is to find a model that minimizes the residual variances. So, the plot of both calibration and validation residual variance versus the number of PC included in the model should show decreasing lines, and it suggests, where the variance is minimum, how many PC should enter the model. Finally, with respect to the previous partial analysis (Gaspardo et al., 2012 [26]), the subsequent research was done in order to i) improve the full cross validation outcomes; ii) carry out a more precise survey of the wavelengths that enter the model; iii) extract as much information as possible from the observations. The analyses are developed in order to meet all these requirements, even if sometimes a trade-off between opposing goals must be made. Some aspects are therefore discussed in order to show which ones are chosen as priorities.

1More details about the algorithms used by the software The Unscrambler v.9.6 can be found in the document "The Unscrambler Appendices: Method References" on the web page http://www.camo.com/downloads/

Regarding the regression analysis, several PLS models with full cross validation have been tried, considering as a starting point the whole dataset and all the wavelengths. Then, in order to refine the obtained results, some rather small changes to the observations and to the subsets of variables taken into account are tested, making different hypotheses about the instances and variables that enter the computations. Indeed, some maize amounts and some ranges of wavelengths show specific properties when different models are fitted. As a consequence, their behaviour should be monitored and the effect of their presence in, or absence from, the model evaluated afterwards. Finally, the first 52 variables are excluded from every following computation and one observation is removed from the dataset. The removed wavelengths belong to the visible spectrum (650–752 nm), whose signal carries too much noise, making the extraction of significant information more difficult. This is probably due to the NIR instrument, which also works with the visible spectrum even though this should be considered an extreme range where spectral reproducibility, accuracy and precision cannot be assured. Similarly, the removed observation behaves differently from the others, standing alone in the plots and showing unexpected values. This is likely caused by errors that occurred during the measurement process, such as following a non-standard procedure or the presence of polluting substances in the measured amount. A model is then fitted on this new reduced dataset and its results are described hereinafter. The calibration set has 133 observations, and Table 6.2 summarizes the defined model, reporting its statistics and the values of its most important indexes. In particular, some results quantify the shape of the regression line, such as slope and offset, whereas the remaining ones characterize the probability distribution of the model's errors. The curve of residual variance is a monotonically decreasing function of the number of PC considered in the model (Figure 6.3), and it suggests a number of PC equal to 21, which is rather low compared with the almost one thousand original variables. In particular, the increments become irrelevant after 21 PC have entered the model (in the plot, the line tends to become horizontal after the 21st PC). The relation between the fumonisin content in maize and the NIR spectra can thus be expressed by a linear combination of 21 PC. With fewer PC the model would be incomplete and unable to describe the phenomenon under study correctly and exhaustively, while with more PC the model would overfit and take in too much information, even that which could be

Table 6.2: Summary statistics and values of the most important indexes of the described model.

Final model
Number of observations in CS: 133
Number of PC: 21
Number of removed variables: 294

Calibration
Correlation: 0.998
Slope: 0.995
Offset: 0.015
Bias: 5.724e-05
SEC: 0.144
RMSEC: 0.143

Validation
Correlation: 0.909
Slope: 0.799
Offset: 0.669
Bias: -0.012
SEP: 0.893
RMSEP: 0.890
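The indexes reported in Table 6.2 can be obtained from the calibration and validation predictions; the sketch below follows the usual chemometric definitions, and the exact conventions adopted by The Unscrambler (for instance for SEC and SEP) may differ slightly.

import numpy as np

def regression_indexes(y_measured, y_predicted):
    """Correlation, slope, offset, bias, SE and RMSE of predicted vs measured values."""
    residuals = y_predicted - y_measured
    n = len(y_measured)
    slope, offset = np.polyfit(y_measured, y_predicted, 1)     # line fitted to the predicted-vs-measured plot
    bias = residuals.mean()
    rmse = np.sqrt(np.mean(residuals ** 2))                    # RMSEC on calibration, RMSEP on validation
    se = np.sqrt(np.sum((residuals - bias) ** 2) / (n - 1))    # bias-corrected standard error (SEC / SEP)
    correlation = np.corrcoef(y_measured, y_predicted)[0, 1]
    return correlation, slope, offset, bias, se, rmse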

Figure 6.3: Plot of calibration and validation residual variance versus number of PC.

Figure 6.4: Calibration plot: measured [FB1+FB2] versus calibrated values. The threshold equal to the EU limit of 4 mg kg−1 is highlighted, as are some other classes.

noise. The calibration and validation values of every observation that enters the model can be compared with the corresponding measured value. So, predicted and measured values can be plotted in a graph, in which the points should ideally lie along the diagonal, and the correlation between them can be computed. The correlations of calibration and validation are respectively equal to 0.998 and 0.909. Since both correlation values are very high, we can state that the model i) accurately describes the relation between fumonisin content and NIR spectra while avoiding overfitting; ii) would have a good prediction ability for future unknown observations that could be regarded as members of the population to which the dataset belongs (i.e. they have properties similar to the observations of the dataset, so that the model can be applied). Further remarks can be made. Indeed, in order to evaluate the screening and classification ability of the model, lines corresponding to the target thresholds can be added to the calibration plot (Figure 6.4). They make it easy to visualize whether there are false positives and/or false negatives (by positive we mean contaminated above 4 mg kg−1), whether the estimated values agree with the measured ones, or whether some observation does not fall in the right group. First of all, analyzing the calibration plot around the value of 4 mg kg−1, we can observe that no observation is misclassified and that the legal limit is satisfied without false positives and/or false negatives. Since no contaminated maize amount is predicted as a safe one, and vice versa, the developed model assures a good screening ability. Moreover, some other interesting intervals are drawn in this figure. Every calibrated

Figure 6.5: Calibration and validation plot: measured [FB1+FB2] versus calibrated values (blue circled points); measured [FB1+FB2] versus validated values (red filled points).

contamination level is indeed sufficiently accurate that no point falls outside the highlighted areas. As a consequence, the model can also be used as a classifier. The identified classes are [0; 1), [1; 2), [2; 2.5), [2.5; 3), [3; 3.5), [3.5; 4), [4; 5), [5; 6), [6; 7) and values higher than 7 mg kg−1. In particular, the legal limits equal to 1 and 2 mg kg−1 are also highlighted: they are two other maximum levels established in Commission Regulation (EC) No 1126/2007, as reported in Table 5.2. Finally, the behaviour of two observations whose fumonisin level exceeds 10 mg kg−1 should be noticed. Even if both are well calibrated, their validation value is rather distant from the diagonal (Figure 6.5). This is not troublesome, because at present there are few highly contaminated maize amounts; but if similar extreme observations are collected in the future and enter the model, such values will support each other and their validation will improve. Further studies, new data (in particular more observations with a high level of fumonisin content) and other analyses are required so that the model becomes more robust and its behaviour as accurate as possible for highly infected maize amounts, too. In this research, the collected fumonisin values cover a wide range, but the descriptive statistics highlight an unbalanced data distribution with few extreme samples. Such levels of fumonisin contamination correctly reflect the historical trend of recent years in the Friuli Venezia Giulia Region. From a strictly statistical perspective, this situation offers two alternatives. The described result presents a unique complete model that covers the whole range of fumonisin contamination levels; in this case, the model definition overrides the software's suggestions to treat observations with high fumonisin content as outliers due to their low frequency. On the other hand, two distinct models could be estimated, one for low levels of fumonisin contamination and the other for extreme observations, if the original dataset were split into two subsets (the threshold should be chosen accurately). Future research could consider this hypothesis to develop other models.
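The screening ability discussed above can also be checked programmatically; the following minimal sketch counts false positives and false negatives around the EU limit of 4 mg kg−1, where the variable names are assumptions.

import numpy as np

def screening_check(y_measured, y_predicted, threshold=4.0):
    """Count misclassifications around the EU limit (positive = above threshold)."""
    measured_pos = y_measured > threshold
    predicted_pos = y_predicted > threshold
    false_positives = int(np.sum(~measured_pos & predicted_pos))   # safe lots predicted as contaminated
    false_negatives = int(np.sum(measured_pos & ~predicted_pos))   # contaminated lots predicted as safe
    return false_positives, false_negatives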

The described model is characterized by interesting and promising properties. First of all, its correlation of calibration is very high; as a consequence, the model assures an accurate fitting of the data. Secondly, it has a high correlation of validation and a rather low number of PC. Both benefit the model, because they imply a good prediction ability for a simple model. However, a very important feature is given by the set of variables that enter the model. From this point of view, some considerations should be discussed and a trade-off between opposing goals made before taking decisions or reaching a conclusion. As to the number of variables to be used in the definition of the model, all frequencies are considered as a starting point. Then, in order to improve the results, it is checked whether some variables should be removed. In the end the model reduces its number of independent variables, dropping three intervals of wavelengths and combining 632 variables to build the PC. In contrast, several models working with all the wavelengths, except those belonging to the visible spectrum, were developed during the analysis. Comparisons between these two opposite approaches can be made. The latter strategy is surely the most universal and complete, and has a general validity, whereas the former, focusing on particular intervals, risks ignoring important features of the data. Moreover, we drop almost 300 variables to gain only a few PC, while we would expect the model to be strongly simplified given such a high number of removed wavelengths. Nevertheless, since the outcomes of the model are very satisfactory, we can state that the excluded wavelengths are redundant and carry noisy information, perhaps recording only noise, so that they can be disregarded without trouble. In addition, the first strategy can be supported by further analyses and by researchers following a chemical approach, according to which spectra can be used in a selective way. Indeed, one can argue that a more direct relation exists between fumonisin content and a specific, limited and well defined range of NIR wavelengths, due to the fumonisin chemical structure, so that other unnecessary ranges, presumably corresponding to humidity or other chemical components unrelated to fumonisins, can be removed. As a consequence, in the future chemists can confirm the adopted strategy and help to choose the most appropriate range of NIR wavelengths. In conclusion, these aspects should be investigated more deeply. For example, the choice between the two approaches should be made by a research group with members coming preferably from different scientific fields, in order to evaluate their advantages and drawbacks, to take into account every interest and to decide on the best shared solution.

6.3 Discussion

Maize is a dietary staple both for cattle breeding and for human beings in many countries. Fusarium development in maize is a very complex phenomenon that depends on environmental patterns (relations and interactions between plant, soil and weather properties). Fusarium species damage the maize plant and then, when suitable conditions appear, produce fumonisins as secondary metabolites. The factors that cause fumonisin infection are not so clear, and the role of fumonisins in the life of fungi is also not always known. Moreover, Fusarium presence does not always involve fumonisin occurrence and, at the same time, Fusarium absence does not imply a lack of fumonisin contamination, since fumonisins can be observed without visible symptoms of Fusarium infection. Both Fusarium and fumonisins can occur in every phase of the maize's life, from sowing to storage, from field to mills. At present, Fusarium and fumonisin contamination can be controlled and reduced, by applying some reasonably achievable agricultural and manufacturing practices, but not completely removed. Taking into account that i) Fusarium and fumonisin infection of maize causes both serious health consequences and economic losses due to the decreased quality and quantity of batches, and ii) legal limits must be complied with, the fumonisin presence in maize should be periodically monitored in order to know as soon as possible whether lots are contaminated. With this information, maize with a significant fumonisin level can be identified, so that infected maize amounts can be extracted before they are commingled with maize characterized by low or negligible fumonisin content. The fumonisin-contaminated maize can then be safely removed from batches intended for food or feed use and assigned to other purposes, such as energy production. Since traditional fumonisin detection procedures, like HPLC, are accurate but expensive and time consuming, a method based on NIR spectroscopy and multivariate linear regression has been tested. The results show that it allows the screening of maize to comply with the European Regulation, a good classification ability and a rather accurate fumonisin prediction, since the validation outcomes are promising. These properties, together with cheaper, simpler and faster, almost real-time measurements, would suggest the NIR methodology as a suitable alternative to traditional analyses for fumonisin detection in maize. Further studies are however needed in order to confirm its good properties, verify its applicability or develop other models based on slightly different assumptions suggested by these preliminary remarks.

Conclusions

C.1 Theoretical background

The first Chapter opens with an introduction to Machine Learning that defines it before listing the most common techniques developed in this discipline. Machine Learning, which is strictly related to both Statistics and Data Mining, consists of five subfields that collect a wide variety of algorithms. In this thesis the focus is on unsupervised and supervised learning, for which more details are given: they indeed present the general framework where PLS takes place. Subsequently, a general working procedure for dealing with supervised problems is described, following an incremental explanation and an intuitive approach. Thus, model fitting, model selection and the decision stage are illustrated in order to understand gradually how to address the purpose of supervised learning. After considering the twofold nature of data, which includes both an underlying regularity and random noise, the adaptive model can be defined as the sum of a structural component and a random variable. Then, curve fitting is explained as the problem of computing the model, with its form and the values of its parameters. There are many criteria for solving this task, but every one of them should take into account the problem of model selection, which controls overfitting and assures a good prediction ability, the two opposing aims that a model should reach at the same time. Suitable strategies deal with this trade-off, and their application assures that the fitted model describes the data accurately while also having a good prediction ability. As a consequence, after obtaining predictions, supervised learning algorithms can decide either to perform an action or not, based on an understanding of the values that the target is likely to take and considering that the choice should be optimal in some appropriate sense.
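In generic notation, assumed here for convenience rather than taken from Chapter 1, the adaptive model just recalled can be written as

y = f(\mathbf{x}; \mathbf{w}) + \varepsilon,

where f(\mathbf{x}; \mathbf{w}) is the structural component with adjustable parameters \mathbf{w} and \varepsilon is the random noise term.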

The second Chapter provides a simple but complete overview of PLS. Since the aim is to be formally exhaustive, the methodology is described gradually, including theoretical formulas together with algorithmic features, graphical tools and informal descriptions. PLS is a wide class of supervised multivariate techniques for modelling relations between sets of observed collinear quantities through latent variables, defined by maximizing the covariance in the original space. In this way, PLS projects the measured observations and fits a model on these new data. PLS is a powerful analytical tool with rather soft assumptions and minimal demands in terms of measurement scales, sample size and residual distribution. Firstly, PLS not only manages collinearity but can also remove it from the dataset. Secondly, PLS is effective even if the instances are fewer than the variables. Thirdly, PLS does not need the data to come from normal or known distributions. With these relaxed constraints, PLS turns out to be very flexible and brings many benefits. Moreover, PLS also assures computational and implementation simplicity, and good performance is consistently confirmed. In particular, the PLS algorithm consists of an iterative process that can handle large amounts of data and whose computational time scales well with the dataset dimensions; the asymptotic space requirement and time complexity of the PLS algorithm are those necessary to store the dataset matrix. Furthermore, PLS proves to be a useful tool for problems of regression, classification and dimensionality reduction, as well as in the presence of nonlinear relations between variables. Regarding regression and classification, PLS was originally developed as a technique that avoids the effect of multicollinearity in parameter estimation, but in turn PLS may be more appropriate for predictive purposes; indeed, Wold himself asserted that PLS is mainly suited to predictive causal analyses in highly complex situations with poorly developed theoretical understanding. With respect to dimensionality reduction, PLS analyzes the importance of the individual observed variables, potentially leading to the deletion of irrelevant ones: PLS indeed defines more compact settings through an informed deletion of variables. Even if its classical approach assumes linear relations between variables, PLS can also be extended to model nonlinearity through the recently developed theory of kernel-based learning. Kernel PLS keeps the computational and implementation simplicity of linear PLS and at the same time is competitive with respect to other traditional nonlinear methods. All these interesting properties make PLS a very powerful and versatile data analytical tool and favour its successful application in many research areas and industrial contexts that provide tall and fat datasets, where collinearity often occurs and data exhibit nonlinear behaviour. PLS has thus proven to be very effective and to have a wide multidisciplinary applicability, which has distinguished it since its origins. This Chapter ends with a review of the most common packages that include the PLS methods. Indeed, PLS needs sophisticated computations, so that its fruitful application depends on the availability of a suitable computer program. Fortunately, nowadays the PLS methodology is usually integrated within all the main suites for multivariate analysis. PLS is the core argument of the thesis and, after its introduction in this Chapter, it is subsequently compared with competing techniques.
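As a reminder of the iterative construction summarized above, the sketch below extracts a single latent variable in a NIPALS-style PLS1 step; the notation and the single-response simplification are choices made here for brevity, not the exact formulation given in Chapter 2.

import numpy as np

def pls1_component(X, y):
    """One latent variable from centered X (n x p) and centered y (n,).
    With a single response the NIPALS inner loop converges in one pass,
    so no explicit iteration is shown."""
    w = X.T @ y
    w /= np.linalg.norm(w)          # weights: direction of maximal covariance with y
    t = X @ w                       # scores of the latent variable
    p = (X.T @ t) / (t @ t)         # X loadings
    q = (y @ t) / (t @ t)           # y loading (a scalar for a single response)
    X_next = X - np.outer(t, p)     # deflate X before extracting the next component
    y_next = y - t * q              # deflate y
    return w, t, p, q, X_next, y_next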

The third Chapter presents methods that share the goals and tasks of PLS but achieve them in a different way. The comparisons reveal both additional aspects and the competitive behaviour of PLS, showing that it performs better than the alternatives. Connections between individual techniques help to design new algorithms, resulting in more powerful tools. First, Ordinary Least Squares (OLS) is the traditional estimation procedure used for linear regression and can be seen as the benchmark for the other options. OLS usually defines the best linear unbiased estimator, but only if the data comply with restrictions such as the absence of collinearity. There is a relation between OLS and PLS, because the latter is computed as an OLS regression of the observed response on orthogonal latent predictors. This link requires weight, loading and score vectors and allows an OLS estimator computed in the latent space to be transformed into a PLS estimator that refers to the original variable space. CCA is, like PLS, a projection method onto latent variables, but it extracts latent vectors with maximal correlation. In particular, the PLS criterion represents a form of CCA where the principle of maximal correlation is balanced with the requirement of explaining as much variance as possible in both the X- and Y-spaces. So PLS is more complete, since it takes into account a larger amount of the information carried by the data. RR was born as a method for stabilizing regression estimates in the presence of extreme collinearity. The RR optimization criterion is equivalent to that of PLS divided by the sum of the X covariance matrix and a metaparameter that should be estimated through a model selection procedure. Furthermore, the RR estimator of the regression coefficients can be computed from a penalized OLS criterion with the penalty proportional to their squared norm. Canonical Ridge Analysis provides a unique approach that collects all the previously defined estimation procedures as special cases. Its optimization problem consists of a weighted PLS criterion where the weight is in the denominator and includes a metaparameter; by choosing specific values of this metaparameter, or letting it vary over its allowed range, the optimization criterion of every presented technique is obtained. Considering other chemometric techniques, PCA, like PLS, is a projection method that specifies directions of maximal variance, called PC. However, PCA is defined as an unsupervised technique, since it does not take into account any information about the Y variables. Indeed, PLS extends PCA with a regression phase, so that PC related only to X will explain the covariance between X and Y as far as possible. Similarly, both PCA and PLS are traditional dimensionality reduction techniques widely used in Machine Learning, but PLS provides a more principled dimensionality reduction in comparison with PCA. If the PC computed by a PCA are used as predictors in a linear regression model, PCR is defined. However, there is the problem of choosing the optimum subset of PC, since the PC have been computed with respect to X and there is no guarantee that they will be pertinent for explaining Y. Finally, PCR, like PCA and PLS, deals with collinearity, but it is a less prediction-oriented method than PLS. All these procedures can be cast under a unifying approach called continuum regression, which relates PLS to PCR and RR, taking OLS as a reference method.
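For reference, the penalized OLS criterion just mentioned for RR can be written, in generic notation that may differ from that of Chapter 3, as

\hat{\boldsymbol{\beta}}_{RR} = \arg\min_{\boldsymbol{\beta}} \; \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^{2} + \lambda \|\boldsymbol{\beta}\|^{2} = (\mathbf{X}^{\top}\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^{\top}\mathbf{y}, \qquad \lambda \ge 0,

where \lambda is the metaparameter to be chosen by model selection and \lambda = 0 recovers the OLS estimator.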
The comparison is carried out through the mean squared error (MSE), which should be as low as possible to assure a high estimator quality, and which can be written in the form of its bias-variance decomposition. The OLS estimator is usually unbiased. PLS, PCR and RR, unlike OLS, produce biased estimates. The effect of this bias is to shrink the regression coefficients along the directions that are responsible for a high variance.
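For a generic estimator β̂ of the coefficient vector β, the decomposition referred to above is the standard one (recalled here as a reminder, not as a formula specific to the thesis):

MSE(β̂) = E‖β̂ − β‖² = ‖E[β̂] − β‖² + E‖β̂ − E[β̂]‖²

i.e. squared bias plus variance. OLS keeps the first term at (essentially) zero, while PLS, PCR and RR accept some bias in exchange for a smaller variance term.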

Thus, the PLS, PCR and RR estimators can be expanded in terms of OLS components, and they can be characterized by a shrinkage structure that accounts for the good features of the corresponding estimators. As regards classification, there is a close connection between PLS and LDA, which is also a dimensionality reduction technique widely used in Machine Learning. LDA, like PLS, has a supervised nature and is a projection method; however, it is limited by some constraints that PLS does not have. After this critical comparison, which confirms the advantages of PLS over competing approaches, PLS is extended to nonlinear problems in the following Chapter, completing the list of its benefits.

The fourth Chapter introduces SVM, a set of supervised learning techniques that can also be seen as competitors of the PLS methods. A general overview of SVM is firstly presented, including its benefits and drawbacks, its historical development and its applications. Secondly, the classical SVM theory is described considering binary classification. Following an approach of gradually increasing complexity, both the linear and the nonlinear case are explained and, with regard to the linear classifier, both linearly and nonlinearly separable data are taken into account. Thirdly, the nonlinear PLS procedures previously illustrated are completed with the SVM approach.

In general terms, considering categorical quantities, SVM can be defined as a non-probabilistic binary linear classifier, so that it belongs to the family of generalized linear classifiers. SVM has some interesting properties. Indeed, it is usually able to handle a high dimensional input space and, at the same time, it does not require datasets of large cardinality. In addition, it can successfully manage sparse instances and irrelevant variables. SVM can also assure a good generalization ability, with high prediction accuracy and a limited risk of overfitting. SVM has nice mathematical qualities as well, since it consists of a simple convex optimization problem which is guaranteed to converge to a single global solution. Finally, SVM has also been extended to unsupervised learning methods, such as PCA. However, there are also some drawbacks. For example, the interpretability of the results may be limited, so a trade-off must be made between the speed and advantages of the computation on one side and the directness of the interpretation of the results on the other. SVM is also sensitive to noise. In addition, users must learn how to tune an SVM, sometimes through a lengthy series of experiments, the assistance of domain experts, or a validation procedure. Moreover, a tricky question is how to manage multi-class data, since traditional SVM defines binary classifiers. Research on SVM has led to many applications in various fields, where it has quickly become competitive with traditional methods. Since PLS can be used in the same real situations to reach the same goals, the comparison between SVM and PLS can also be carried out through a critical analysis of such concrete cases.

If SVM is combined with nonlinear PLS, linear data analysis procedures can be extended to nonlinear problems. The simplest solution uses a kernel-based SVM to model the nonlinear relation between the PLS score vectors. Another methodology, called nonlinear kernel PLS, maps the original input data into a new feature space where linear PLS is applied as usual. The kernel trick should also be considered, since it allows the inner product between two vectors in the feature space to be evaluated easily, without constructing the feature space explicitly. In this way, the kernel form of the NIPALS algorithm is obtained as a modified version of the traditional NIPALS algorithm. The comparison between kernel and traditional PLS methodology is not so simple, and it is difficult to choose the preferable one. While the kernel PLS approach is easily implementable, computationally less demanding and capable of modelling difficult nonlinear relations, a loss of interpretability of the results with respect to the original data limits its use in some applications. On the other hand, it is common to find situations where the classical approach of keeping latent variables as linear projections of the original data may not be adequate.
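As a small illustration of the kernel trick just mentioned (a toy example of mine, not code from the thesis), the degree-2 polynomial kernel evaluates an inner product in an explicit feature space without ever constructing that space:

import numpy as np

# For the degree-2 polynomial kernel k(x, z) = (x.z)^2 the feature map is
# phi(x) = (x1^2, x2^2, sqrt(2) x1 x2); the kernel value equals the inner
# product of the mapped vectors, computed without building phi explicitly.
def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2.0) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print(phi(x) @ phi(z))     # inner product in feature space: 16 (up to rounding)
print((x @ z) ** 2)        # same value obtained directly through the kernel

Kernel PLS and kernel SVM exploit exactly this shortcut, working with the matrix of kernel evaluations instead of the mapped data.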
This Chapter completes the PLS theoretical background by including nonlinear PLS, which relies on a kernel-based approach developed on SVM theory. Even if in this thesis nonlinear PLS was not applied to any case study, first of all because the available data were not characterized by nonlinear relations, one interesting perspective for future research is the application of nonlinear PLS to real data. In this way the theoretical properties and qualities of this technique could be verified and compared with competing solutions. A concrete linear case study is instead described in the following.

C.2 A case study

The second part of this thesis presents a case study where PLS and full cross validation are applied to define a model that detects the fumonisin content of contaminated maize from NIR spectra. Firstly, a preliminary, gradual but complete description of the application area is given, in order to show the main questions that lie behind this kind of problem and that support the further analyses. Indeed, fungi and mycotoxins are briefly introduced, together with their features and the most troublesome factors related to their development and management. Then, more details are given about Fusarium and fumonisins and their critical aspects. The measurement processes are also explained; they include both the traditional reference analytical methods (HPLC) and some alternatives based on spectroscopic techniques (NIR). Secondly, the results of a study on real data are shown. Its aim is to evaluate an accurate, fast, cheap, easy and nondestructive method able to detect fumonisin contamination in maize. The dataset is described by explaining the sampling procedure, the applied measurement methods and the variable of interest, which consists of the sum of fumonisins. Then, the best multivariate regression model is illustrated. Both numerical values and graphical outputs are given and discussed. The final comments highlight the most important conclusions and the aspects that should be taken into account more accurately in the future.

The fifth Chapter introduces the problem of cereal crops contaminated by fungi that produce mycotoxins, in order to focus subsequently on maize contaminated by Fusarium species that synthesize fumonisins.

Agricultural commodities usually host microorganisms in a symbiotic relationship, both directly in the field and after harvest. Mycotoxins are toxic secondary metabolites produced by organisms of the fungus kingdom, including mushrooms, moulds and yeasts. Their role in the life of fungi is not always completely understood. In addition, there is a lack of knowledge about the relationship between fungal infection and mycotoxin occurrence. Moreover, fungi and mycotoxins cannot be completely removed, even if fungal proliferation and mycotoxin synthesis can be reduced by applying suitable agronomic techniques. As a consequence, authorities fix guidelines for mycotoxin management strategies and recommend best practices in order to obtain higher hygienic and sanitary standards for grain lots. Furthermore, mycotoxins strongly resist decomposition and digestion, and not even temperature treatments destroy them. Thus, mycotoxins enter the food chain with the contamination of cereal-based food, or can still be found in meat and dairy products. Mycotoxin ingestion can therefore cause several acute or chronic diseases in both livestock and human beings.

Mycotoxins can be found in many types of food components throughout the world, but cereals are often the most suitable environment for some species of fungi. In particular, Fusarium species are among the most common natural contaminants of maize plants worldwide, and maize has the highest fumonisin production with respect to other cereals. Fumonisins FB1 and FB2 occur most frequently and have toxicological significance. As a consequence, the EU Scientific Committee established values of Tolerable Daily Intake (TDI) for fumonisins with respect to several maize-based feed and foodstuffs.

Procedures for fungi and mycotoxin detection are important because, if the fumonisin content of maize batches is monitored, only uncontaminated grains enter the food chain, which makes it possible to assure the safety and quality of maize, to abide by legal limits and to avoid economic losses. Currently, the reference measurement processes for most mycotoxins are analytical methods that include a wide array of laboratory testing and are based on HPLC. These tools are largely used by the food and feed industries, since they are accurate and reliable, but they are also expensive, time consuming and destructive, and only experts can carry out such complex steps. As a consequence, cheaper and faster alternatives for fungal and mycotoxin detection in food and feed components are highly sought. Recently, alternatives based on NIR spectroscopic techniques have been suggested. NIR spectroscopy involves a minimum preparation of the raw materials, which can be screened quickly following a very simple procedure where neither expert staff nor expensive equipment are needed, and it preserves the maize after the measurement for further analyses. So, NIR spectroscopy is nondestructive, very simple, faster and cheaper than traditional analytical methods. After recording the NIR spectra, the relationship between these and the fungal or mycotoxin contamination should be investigated through statistical analysis. Multivariate regression techniques indeed make it possible to define a model that should be able to predict future values of the target variable.
For these reasons, NIR spectroscopy can be considered a measurement method for the safety and quality evaluation of agricultural commodities. As a consequence, it may have practical applications in on-line screening and real-time automated systems.

The sixth Chapter opens with the description of the available dataset, which records all the information about the analyzed maize amounts: their NIR spectra and their fumonisin contamination level, measured by the HPLC technique. Firstly, sampling was carried out according to the corresponding EC Regulation. Secondly, the presence of fumonisins FB1 and FB2 is measured and their sum is considered in the subsequent analysis. The chosen reference analytical method is HPLC, applied according to the AOAC Official Method. Moreover, a preliminary descriptive analysis of the fumonisin content is carried out, both with graphical tools and with the main summary statistics; this highlights some critical aspects. Thirdly, the procedure used to record the NIR measurements is given, together with a description of the FT-NIR spectrometer employed. In addition, the chosen NIR variables, which record absorbance values, are declared and some examples of NIR spectra are shown.

In order to explain and quantify the relationship between the fumonisin contamination level and the NIR spectra, PLS with full cross validation is applied to the centered data. The model should have a twofold purpose: i) to assure the screening of maize amounts with respect to the legal limit of 4 mg kg⁻¹; ii) to provide a tool for classification and rather accurate predictions over the whole scale of fumonisin contamination levels. The dataset reduces to a calibration set with 133 observations and the model has 21 PC, as suggested by a decreasing residual variance curve. The correlations of calibration and validation are very high. This means that the model describes accurately the relation between fumonisin content and NIR spectra while avoiding overfitting, and that it should have a good prediction ability for future unknown observations. Moreover, observing the calibration plot, the legal limit is satisfied and other interesting intervals can be drawn, so that no observation is misclassified. As a consequence, the developed model assures a good screening ability and can also be used as a classifier. Considering the set of variables that enter the model, two opposite approaches should be compared; such critical aspects should however be investigated more deeply in order to decide on the best solution.
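The modelling step can be sketched with scikit-learn as follows. This is an illustration of mine on synthetic data that only mimics the shape of the study (133 observations, many collinear absorbance-like variables, 21 components, leave-one-out as full cross validation); the printed correlations say nothing about the real dataset.

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic, highly collinear stand-in for the NIR data: 133 "spectra" with
# 1000 variables generated from a few latent factors, and one response playing
# the role of the fumonisin content.
rng = np.random.default_rng(0)
latent = rng.normal(size=(133, 5))
X = latent @ rng.normal(size=(5, 1000)) + 0.1 * rng.normal(size=(133, 1000))
y = latent @ np.array([3.0, -2.0, 1.0, 0.5, -0.5]) + 0.3 * rng.normal(size=133)

Xc = X - X.mean(axis=0)                      # centering, as done before fitting
yc = y - y.mean()

pls = PLSRegression(n_components=21)         # number of components used in the study
y_cal = pls.fit(Xc, yc).predict(Xc).ravel()  # calibration (fitted) values

# Full (leave-one-out) cross validation provides the validation predictions.
y_val = cross_val_predict(pls, Xc, yc, cv=LeaveOneOut()).ravel()

print("calibration correlation:", np.corrcoef(yc, y_cal)[0, 1])
print("validation  correlation:", np.corrcoef(yc, y_val)[0, 1])

cross_val_predict refits the model once per left-out observation, which is exactly what full cross validation requires; the calibration and validation correlations can then be compared as described in the text.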

In conclusion, maize is one of the most common dietary staples for both livestock and human beings in many countries. Fusarium development in maize is a very complex phenomenon that depends on environmental patterns. The factors that cause fumonisin infection are not fully clear, and the role of fumonisins in the life of fungi is not always known. The occurrence of Fusarium and fumonisins in maize causes serious health consequences to both animals and humans. Moreover, it involves economic losses due to the decreased quality and quantity of the batches. Legal limits are fixed by international organizations and they must be abided by. Currently, Fusarium and fumonisin contamination can be controlled and reduced by applying some reasonably achievable agricultural and manufacturing practices, but it cannot be completely eliminated. As a consequence, the fumonisin presence in maize should be periodically monitored in order to know as soon as possible whether lots are contaminated. In this way, maize with a significant fumonisin level can be identified and extracted before being commingled with safe maize characterized by a low or negligible fumonisin content. Then, fumonisin-contaminated maize can be safely removed from batches intended for food or feed use and assigned to other purposes. Since traditional fumonisin detection procedures, such as HPLC, are accurate but expensive and time consuming, a method based on NIR spectroscopy and multivariate linear regression is evaluated. The results show that it allows the screening of maize to comply with the European Regulation, a good classification ability and a rather accurate fumonisin prediction, since the validation outcomes are promising. These properties, together with cheaper, simpler and faster, almost real-time measurements, suggest the NIR methodology as a suitable alternative to the traditional analyses for fumonisin detection in maize. Further studies are however needed in order to confirm its good properties, verify its applicability or develop other models based on slightly different assumptions suggested by these preliminary remarks.

A Technical developments

A.1 Eigenvalue and eigenvector decomposition

Let X (k × k) denote a square matrix, λ be a scalar and v represent a nonzero k-dimensional vector. Let us consider the following linear equation

Xv = λv    (A.1)

It defines the eigenvalue problem, and its solutions, i.e. the values of λ and v that satisfy it, are called respectively eigenvalues of X and eigenvectors of X corresponding to λ. The prefix “eigen” is adopted from the German word “eigen”, meaning “own”, in the sense of a characteristic description. As a consequence, eigenvectors and eigenvalues are sometimes also known as characteristic vectors and characteristic values. Moreover, the set of eigenvalues is also sometimes named the spectrum of the matrix.

The eigenvalue equation can be interpreted geometrically. Usually, the multiplication of a vector by a square matrix changes both the length and the direction of the vector it acts on. But in the special case of eigenvectors and eigenvalues, it changes only the scale of the vector: it stretches, shrinks, leaves unchanged or flips (switches to the opposite direction) the vector. In other words, the eigenvectors of a square matrix are the nonzero vectors that, after being multiplied by the matrix, remain parallel to the original vector. For each eigenvector, the corresponding eigenvalue is the factor by which the eigenvector is scaled when multiplied by the matrix. If the eigenvalue λ > 1, the vector is stretched by this factor. If λ = 1, the vector is not affected at all by the multiplication. If 0 < λ < 1, the vector is shrunk or compressed. The case λ = 0 means that the vector shrinks to a point represented by the origin. If λ < 0, the vector flips and points in the opposite direction, as well as being scaled by a factor equal to the absolute value of λ.

In order to solve the problem, the eigenvalue equation (A.1) can be written as a linear homogeneous system

Xv − λv = 0_k
(X − λI_k)v = 0_k    (A.2)

where I_k is the (k × k) identity matrix, that is, by definition, a square diagonal matrix with all diagonal elements equal to 1, and 0_k is a k-dimensional null vector, whose elements are all equal to zero. The eigenvalue problem consists of a system of k equations. However, since there are k + 1 unknown quantities, the k elements of the vector v plus the scalar λ, it is undetermined. As a consequence, two cases should be distinguished. Firstly, if the inverse (X − λI_k)⁻¹ exists, then both sides can be left-multiplied by it, to obtain v = 0_k. In this case, the problem has only the trivial solution, which cannot be accepted because it goes against the hypothesis of a nonnull vector v; this is the reason why v is assumed to be nonnull. Moreover, if λ is such that (X − λI_k) is invertible, λ cannot be an eigenvalue. It can be shown that the converse holds, too: if X − λI_k is not invertible, λ is an eigenvalue. A matrix is not invertible if and only if its determinant is zero. Thus, adding the constraint that v should be normalized, i.e. v^T v = 1, for a given value of λ this system of equations admits a nontrivial solution if and only if

|X − λI_k| = 0    (A.3)

This is called the characteristic equation. The left-hand side of this condition provides (using Leibniz's rule for the determinant) a polynomial function in λ, whose coefficients depend on the entries of the matrix. Its degree is k, that is, the highest power of λ occurring in this polynomial is λ^k. After computing the k values of λ, every value of λ can be substituted into Equation (A.2). For each λ, solving the homogeneous system of k equations in k variables, the corresponding eigenvector v can be found. The eigenvectors can be collected in the matrix V = [v_1, . . . , v_k].

If the matrix X is symmetric, its eigenvalues and eigenvectors have some interesting properties. Indeed, in this case X admits k real eigenvalues and the corresponding eigenvectors can be chosen orthogonal, v_i^T v_j = 0 for i ≠ j. As a consequence, the matrix V is also orthogonal, so that V⁻¹ = V^T, since it holds that V^T V = I = V V^T. The matrix V also has full rank. Furthermore, the k eigenvalue equations can be collected in the matrix form XV = VΛ, where Λ = diag(λ_1, . . . , λ_k). As a consequence,

V^T X V = Λ

in other words, the matrix of eigenvectors V diagonalizes the matrix X, and X is said to be diagonalizable. Moreover,

X = VΛV^T = Σ_{i=1}^{k} λ_i v_i v_i^T

This equation, called eigendecomposition or sometimes spectral decomposition, is the factorization of a matrix into a canonical form. Note that only diagonalizable matrices can be factorized in this way. Since V has full rank, rank(X) = rank(Λ); that means that the rank of a matrix is equal to the number of non-null eigenvalues. If a matrix has one or more null eigenvalues, then Xv = 0 for the corresponding eigenvectors, and the matrix X is singular (by definition its determinant is null) and has reduced rank. Other important properties of the eigendecomposition are that the determinant of a matrix is equal to the product of its eigenvalues,

|X| = |VΛV^T| = |V||Λ||V^T| = |V^T V||Λ| = |Λ| = Π_{i=1}^{k} λ_i

whereas the trace of a matrix is equal to the sum of its eigenvalues,

tr(X) = Σ_{i=1}^{k} λ_i
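The properties listed above can be checked numerically; the snippet below (an illustrative example of mine with an arbitrary symmetric matrix) verifies the orthogonality of V, the decomposition itself, and the determinant and trace identities.

import numpy as np

X = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])            # a small symmetric matrix

lam, V = np.linalg.eigh(X)                 # eigenvalues and orthonormal eigenvectors

print(np.allclose(V.T @ V, np.eye(3)))             # V^T V = I
print(np.allclose(V @ np.diag(lam) @ V.T, X))      # X = V Λ V^T
print(np.isclose(np.linalg.det(X), lam.prod()))    # |X| = Π λ_i
print(np.isclose(np.trace(X), lam.sum()))          # tr(X) = Σ λ_i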

A.2 Singular value decomposition

Let X be a general (n × k) matrix. The Singular Value Decomposition (SVD) makes it possible to write every such matrix X in the following way

X = RΓZ^T    (A.4)

where the matrices R and Z are orthogonal of dimension (n × n) and (k × k) respectively, and Γ is an (n × k) “diagonal” matrix. Indeed, with h = min{n, k}, the element of Γ in position (i, j) is equal to γ_i if i = j ≤ h, in other words if it lies on the “diagonal”; otherwise it is equal to zero. Moreover, the elements {γ_i}_1^h are ordered, γ_1 ≥ γ_2 ≥ ... ≥ γ_h ≥ 0. The elements {γ_i}_1^h are called singular values, whereas the matrices R and Z collect the left singular vectors and the right singular vectors respectively.

Generally, the SVD can be reduced in the following way

X = [R_1  R_2] [ Γ_1  0 ] [ Z_1^T ]
               [ 0    0 ] [ Z_2^T ]
  = R_1 Γ_1 Z_1^T
  = Σ_{i=1}^{l} γ_i r_i z_i^T

with l equal to the number of singular values that are not null.

From the SVD of a general matrix X some interesting properties can be derived. Computing the quantity X^T X,

X^T X = (RΓZ^T)^T (RΓZ^T) = Z(RΓ)^T RΓZ^T = ZΓ^T R^T RΓZ^T = ZΓ^T ΓZ^T = ZΓ²Z^T

its eigendecomposition is obtained, since R is orthogonal, i.e. R^T R = I_n. As a consequence, Z collects the eigenvectors of X^T X. Similarly,

X X^T = (RΓZ^T)(RΓZ^T)^T = RΓ²R^T

so that R collects the eigenvectors of X X^T, since Z is also orthogonal. Furthermore, Γ² records the eigenvalues λ_i of both X^T X and X X^T; in particular, the relationship γ_i = √λ_i holds. Note that the singular vectors, i.e. the matrices R and Z, are not uniquely determined; on the contrary, the singular values and the matrix Γ are unique.
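These relations between the SVD and the eigendecomposition can also be verified directly (an illustrative check of mine on an arbitrary random matrix):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                        # a general (n x k) matrix

R, gamma, ZT = np.linalg.svd(X, full_matrices=False)
lam = np.linalg.eigvalsh(X.T @ X)                  # eigenvalues of X^T X (ascending)

print(np.allclose(X, R @ np.diag(gamma) @ ZT))     # X = R Γ Z^T (reduced form)
print(np.allclose(np.sort(gamma**2), lam))         # γ_i² are the eigenvalues of X^T X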

A.3 Mercer’s Theorem

Let K(x_i, x_j) be a continuous symmetric kernel defined on the closed interval a ≤ x_i, x_j ≤ b. The kernel K(x_i, x_j) can be expanded into the series

K(x_i, x_j) = Σ_{l=1}^{+∞} γ_l φ_l(x_i) φ_l(x_j)    (A.5)

with γ_l > 0. This expansion holds with absolute and uniform convergence if and only if

∫_a^b ∫_a^b K(x_i, x_j) ψ(x_i) ψ(x_j) dx_i dx_j ≥ 0    (A.6)

for all ψ(·) that satisfy

∫_a^b ψ²(x) dx < +∞    (A.7)

A.4 Gram matrix

Given an (n × k) matrix X that collects n k-dimensional vectors x_i ∈ IR^k, the Gram matrix G is the (n × n) matrix of all possible inner products between the elements of X.

Its element in the i-th row and j-th column is denoted by g_{i,j} and is equal to the inner product between the i-th and j-th elements of X, i.e.

g_{i,j} = x_i^T x_j    (A.8)

For example, if X is equal to

X = [ x_1 ]
    [ x_2 ]    (A.9)

with x_1, x_2 ∈ IR^k, its Gram matrix is

G = [ x_1^T x_1   x_1^T x_2 ]
    [ x_2^T x_1   x_2^T x_2 ]    (A.10)

Since the symmetry property holds for the inner product, in other words x_i^T x_j = x_j^T x_i, the Gram matrix is symmetric. Note that the Gram matrix as defined here is equal to X X^T and is the Gram matrix of the rows of X, whereas X^T X is the Gram matrix of the columns of X.
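As an illustration tying together A.3 and A.4 (a sketch of mine on arbitrary random vectors, not data from the thesis), the Gram matrix of a Mercer kernel is symmetric and positive semidefinite, which is the finite-sample counterpart of condition (A.6):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                        # 5 vectors in IR^3, one per row

G_lin = X @ X.T                                    # linear kernel: g_ij = x_i^T x_j
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
G_rbf = np.exp(-0.5 * sq_dists)                    # Gaussian (RBF) kernel Gram matrix

for G in (G_lin, G_rbf):
    print(np.allclose(G, G.T),                     # symmetry
          np.all(np.linalg.eigvalsh(G) >= -1e-10)) # no negative eigenvalues (PSD)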
