Synergies between Chemometrics and Machine Learning
Rickard Sjögren
Department of Chemistry, Umeå University, Umeå, Sweden, 2021
This work is protected by the Swedish Copyright Legislation (Act 1960:729)
Dissertation for PhD
ISBN: 9789178555581 (print)
ISBN: 9789178555598 (pdf)
The cover art illustrates a brain and a tree. I simply thought it looked nice.
Image attribution: Gordon Johnson, USA (GDJ @ Pixabay)
Electronic version available at: http://umu.diva-portal.org/
Printed by: Cityprint i Norr AB, Umeå, Sverige, 2021

Don't worry, be happy.

Table of Contents
Abstract ...... ii
Sammanfattning på svenska ...... iii
List of Publications ...... iv
Other Publications ...... v
Abbreviations ...... vii
Notation ...... viii
1 Introduction ...... 1
1.1 Scope of this Thesis ...... 3
2 Background ...... 4
2.1 What is data? ...... 4
2.2 Machine Learning ...... 5
2.2.1 Machine Learning Models ...... 7
2.2.2 Deep Learning ...... 9
2.2.3 Data Preprocessing ...... 14
2.3 Chemometrics ...... 15
2.3.1 Data generation ...... 15
2.3.2 Data Analysis ...... 17
2.3.3 Predictive Modelling ...... 18
2.3.4 Model Diagnostics ...... 20
3 Results ...... 21
3.1 Summary of Paper Contributions ...... 23
3.2 Paper I: MVDA for Patent Analytics ...... 24
3.3 Paper II: DOE for Data Processing Pipelines ...... 26
3.4 Paper III: Outlier Detection in ANNs ...... 27
3.4.1 Additional Case Studies ...... 29
3.5 Paper IV: Quality Assurance in Large Data Projects ...... 33
4 Future Perspectives ...... 36
4.1 Separating Sources of Variation ...... 36
4.2 Rethinking PLS Weights ...... 37
Acknowledgements ...... 40
Bibliography ...... 42
Abstract
Thanks to digitization and automation, data in all shapes and forms are generated in ever-growing quantities throughout society, industry and science. Data-driven methods, such as machine learning algorithms, are already widely used to benefit from all these data in all kinds of applications, ranging from text suggestion in smartphones to process monitoring in industry. To ensure maximal benefit to society, we need workflows to generate, analyze and model data that are performant as well as robust and trustworthy.
There are several scientific disciplines aiming to develop data-driven methodologies, two of which are machine learning and chemometrics. Machine learning is part of artificial intelligence and develops algorithms that learn from data. Chemometrics, on the other hand, is a subfield of chemistry aiming to generate and analyze complex chemical data in an optimal manner. There is already a certain overlap between the two fields, where machine learning algorithms are used for predictive modelling within chemometrics. However, since both fields aim to increase the value of data and have disparate backgrounds, there are plenty of possible synergies to benefit both fields. Thanks to its wide applicability, there are many tools and lessons learned within machine learning that go beyond the predictive models used within chemometrics today. On the other hand, chemometrics has always been application-oriented, and this pragmatism has made it widely used for quality assurance within regulated industries.
This thesis serves to nuance the relationship between the two fields and show that knowledge in either field can be used to benefit the other. We explore how tools widely used in applied machine learning can help chemometrics break new ground in a case study of text analysis of patents in Paper I. We then draw inspiration from chemometrics and show how principles of experimental design can help us optimize large-scale data processing pipelines in Paper II, and how a method common in chemometrics can be adapted to allow artificial neural networks to detect outlier observations in Paper III. We then show how experimental design principles can be used to ensure quality in the core of contemporary machine learning, namely generation of large-scale datasets, in Paper IV. Lastly, we outline directions for future research and how state-of-the-art research in machine learning can benefit chemometric method development.
Sammanfattning på svenska
[Translated from Swedish:] Thanks to digitization and automation, growing amounts of data are generated in all kinds of forms throughout society, industry and academia. To make the best use of these data, so-called data-driven methods, for example machine learning, are already used today in a multitude of applications, from suggesting the next word in text messages on smartphones to process monitoring in industry. To maximize the societal benefit of the data that are generated, we need robust and reliable workflows to create, analyze and model data for all conceivable applications.

There are many scientific fields that develop and use data-driven methods, two of which are machine learning and chemometrics. Machine learning falls within what is called artificial intelligence and develops algorithms that learn from data. Chemometrics, on the other hand, has its origin in chemistry and develops methods to generate, analyze and maximize the value of complex chemical data. There is a certain overlap between the fields, where machine learning algorithms are used extensively for predictive modelling within chemometrics. Since both fields try to increase the value of data and have widely different backgrounds, there are many potential synergies. Because machine learning is so widely used, it offers many tools and lessons beyond the predictive models already used within chemometrics. Chemometrics, for its part, has always been oriented towards practical application, and thanks to this pragmatism it is today widely used for quality assurance within regulated industries.

This thesis aims to nuance the relationship between chemometrics and machine learning and to show that lessons from either field can benefit the other. We show how tools common in machine learning can help chemometrics break new ground in a case study on text analysis of patent collections in Paper I. We then borrow from chemometrics and show how experimental design can be used to optimize large-scale computational workflows in Paper II, and how a method common in chemometrics can be reformulated to detect deviating measurements in artificial neural networks in Paper III. After that, we show how principles from experimental design can be used to ensure quality at the core of modern machine learning, namely the creation of large datasets, in Paper IV. Finally, we suggest directions for future research and how the latest methods in machine learning can benefit method development within chemometrics.
List of Publications
This thesis is based upon the following publications:
Paper I Multivariate patent analysis — Using chemometrics to analyze collections of chemical and pharmaceutical patents, Rickard Sjögren, Kjell Stridh, Tomas Skotare, Johan Trygg Journal of Chemometrics 34.1 (2020): e3041. DOI: 10.1002/cem.3041
Paper II doepipeline: a systematic approach to optimizing multi-level and multi-step data processing workflows, Daniel Svensson†, Rickard Sjögren†, David Sundell, Andreas Sjödin, Johan Trygg BMC Bioinformatics 20.1 (2019): 498. DOI: 10.1186/s12859-019-3091-z
Paper III Out-of-Distribution Example Detection in Deep Neural Networks using Distance to Modelled Embedding, Rickard Sjögren, Johan Trygg Manuscript under review at Machine Learning
Paper IV LIVECell – A large-scale dataset for label-free live cell segmentation, Christoffer Edlund†, Timothy R Jackson†, Nabeel Khalid†, Nicola Bevan, Timothy Dale, Andreas Dengel, Sheraz Ahmed, Johan Trygg, Rickard Sjögren Manuscript under review at Nature Methods
Where applicable, the papers are reprinted with permission from the publisher.
†These authors made equal contributions
Other Publications
These publications by the author are not included in this thesis:
Paper V Visualization of descriptive multiblock analysis, Tomas Skotare, Rickard Sjögren, Izabella Surowiec, David Nilsson, Johan Trygg Journal of Chemometrics 34.1 (2020): e3071. DOI: 10.1002/cem.3071
Paper VI Joint and unique multiblock analysis of biological data – multiomics malaria study, Izabella Surowiec, Tomas Skotare, Rickard Sjögren, Sandra Gouveia-Figueira, Judy Orikiiriza, Sven Bergström, Johan Normark, Johan Trygg Faraday Discussions 218 (2019): 268-283. DOI: 10.1039/C8FD00243F
Paper VII A geographically matched control population efficiently limits the number of candidate disease-causing variants in an unbiased whole-genome analysis, Matilda Rentoft, Daniel Svensson, Andreas Sjödin, Pall I Olason, Olle Sjöström, Carin Nylander, Pia Osterman, Rickard Sjögren, Sergiu Netotea, Carl Wibom, Kristina Cederquist, Andrei Chabes, Johan Trygg, Beatrice S Melin, Erik Johansson PLoS ONE 14.3 (2019): e0213350. DOI: 10.1371/journal.pone.0213350
Paper VIII A Machine Vision Approach for Bioreactor Foam Sensing, Jonas Austerjost, Robert Söldner, Christoffer Edlund, Johan Trygg, David Pollard, Rickard Sjögren SLAS Technology (2021) DOI: 10.1177/24726303211008861
The following patent applications relate to Paper IV included in this thesis:
I Computer-implemented method, computer program product and system for data analysis, Rickard Sjögren, Johan Trygg E.P. Patent Application: EP3620983A1 W.O. Patent Application: WO2020049094A1 U.S. Patent Application: US20200074269A1

II Computer-implemented method, computer program product and system for analysis of cell images, Rickard Sjögren, Johan Trygg E.P. Patent Application: EP3620986A1 W.O. Patent Application: WO2020049098A1

III Computer-implemented method, computer program product and system for anomaly detection and/or predictive maintenance, Rickard Sjögren, Johan Trygg W.O. Patent Application: WO2020049087A1
Abbreviations
ANN Artificial Neural Network
CNN Convolutional Neural Network
DOE Design of Experiments
GSD Generalized Subset Design
MLR Multiple Linear Regression
MVDA Multivariate Data Analysis

OOD Out-of-distribution, also known as outlier
PCA Principal Component Analysis
PLS Partial Least Squares, or Projections to Latent Structures, regression
Q2 Goodness of Prediction
QA Quality Assurance
QbD Quality-by-Design
QSAR Quantitative Structure Activity Relationship
RNN Recurrent Neural Network
RMSE Root Mean Square Error
Notation
X The observation space
Y The response space
R The set of all real numbers
x An observation in X
y A response in Y
n Number of observations in a dataset
p Number of variables in a twodimensional dataset
q Number of responses in a twodimensional dataset
k Dimensionality of a model, e.g. number of components
x A length p vector of an observation
t A length k score vector of an observation
p A length p vector of variable loadings
y A length q vector of measured responses
ŷ A length q vector of predicted responses
X n × p observation matrix
Y n × q matrix of measured responses
Ŷ n × q matrix of predicted responses
T n × k matrix of observation scores
W p × k matrix of variable weights
Wi ki−1 × ki matrix of variable weights in a multilayer model
P p × k matrix of variable loadings
C q × k matrix of PLS Yloadings
θ A set of model parameters in any form.
θ A vector of model parameters.
L(y, ŷ) The loss function L : Y² → R.
1 | Introduction
Going into the 2020s, generation and utilization of data is ever-present in the scientific community and industry, as well as throughout society in general, often under the umbrella term artificial intelligence (AI). Many forecasts project that the global AI market will grow into hundreds of billions of USD, meaning that its implications for society will be profound. Although AI encompasses a wide array of both technical and philosophical subfields, the boom we are experiencing now is largely driven by advances in data-driven methods, especially machine learning. There are many definitions of what machine learning means; a common one paraphrases Arthur Samuel [1] and describes machine learning as algorithms that:
[...] build a model based on sample data, known as ”training data”, in order to make predictions or decisions without being explicitly programmed to do so.
Wikipedia, Machine Learning
In short, this means that we show machine learning algorithms examples of input data and the desired output data, and the algorithms then serve to find a map between input and output. In other words, data-driven methods such as machine learning algorithms can be used to fit a model of the process transforming our input to the output, which we can then use to predict the output for new input data.
Working with this definition, there are virtually endless possibilities for what data can be used as input and what output we may want to predict. Most people in developed countries regularly encounter IT applications that use machine learning in daily life. Some examples include smartphones that use machine learning to suggest a suitable next word based on the currently written characters in text messages, streaming services that use machine learning to recommend what TV series to watch next based on previously watched content, and social media that use machine learning to decide what content on your timeline you are likely to interact with based on previous interactions. In industry, we find machine learning methods at all levels of the production chain, from automated visual inspection for quality control [2] to forecasting wind farm energy output to help optimize equipment use [3]. Looking ahead, we have seen exciting developments in machine learning that will have a profound impact on the society of the future. For instance, DeepMind recently used machine learning to predict protein structures accurately enough to be practically useful in structural biology [4], which will have immense impact on biology and drug development.
Although data-driven methods such as machine learning algorithms show great promise to solve many problems throughout society, they are intimately connected to the strengths and weaknesses of the data itself. A machine learning model will never be better than the data it was trained on, nor can we expect it to perform well on new data dissimilar to the training data. This is widely known among machine learning practitioners and poses fundamental limitations on what we can expect from these types of systems. Despite this, less effort than we may like is spent systematizing the data itself to ensure that we maximize the benefit to our applications and society. In contrast, immense effort is spent optimizing algorithms for predictive performance and applying algorithms to new problems. No matter how well-optimized a data-driven algorithm may be, the premises in terms of what data we use must be right in order for it to be useful.
This leads us to the topic of this thesis, namely the field of chemometrics and its relationship to contemporary machine learning. Originally named in a grant application in 1971, one of its pioneers summarizes chemometrics as:
[...] the branch of chemistry concerned with the analysis of chemical data (extracting information from data) and ensuring that experimental data contain maximum information (the design of experiments).
Svante Wold [5]
Despite being a subfield of chemistry, smaller and younger than machine learning, the underlying principles of chemometrics are applicable far beyond its original formulation. Chemometrics is characterized by pragmatic use of mathematical, statistical and data-driven methods to maximize the value of generated data, using statistical experimental design and structured data analysis to maximize the knowledge gained from the data we have. It is easy to argue that these characteristics are applicable, and desirable, in other domains outside of chemistry. In fact, the principles and methods of chemometrics have gained widespread traction in regulated industries, such as the pharmaceutical industry [6]. With its long history of practical applications, there are many lessons learned among chemometricians that deserve wider use.
Although chemometrics is sometimes described as having an adverse relationship with machine learning [7], machine learning methods are widely used in chemometrics. We leave details for later chapters, but linear latent variable-based methods such as principal component analysis (PCA) [8] and partial least squares regression (PLS) [9], used in chemometrics for decades, can be viewed as machine learning algorithms. In recent years, we have seen an influx of a wide variety of machine learning algorithms into chemometrics, including so-called artificial neural networks [10, 11], which are largely responsible for the last decade's AI boom. Cross-pollination between these two fields already shows great benefits, and this thesis aims to show that there are more potential synergies and that we should strive towards a symbiotic relationship.
1.1. Scope of this Thesis
This thesis serves to nuance the relationship between the fields of machine learning and chemometrics and to show that knowledge in either field can be used to benefit the other. Machine learning algorithms are already used in chemometrics and there are principles in common between the two fields; this thesis argues that there are more potential synergies to be found.
We start by exploring how tools widely used in applied machine learning can help chemometrics break new ground. Thanks to its wide array of applications, particularly in the domain of unstructured data such as text and images, machine learning features a toolbox that spans far beyond model fitting algorithms. These tools are readily available, and Paper I demonstrates that they can be reused to help chemometrics enter new applications in a case study of text analysis of patents.
Simultaneously, chemometrics' pragmatic view of how to work with data is directly transferable to problems where machine learning algorithms are used, as well as to the algorithms themselves. For instance, experimental design methods that help generate and evaluate small datasets with maximal information are straightforward to adapt, and were applied in Paper II to implement an algorithm to optimize large-scale genomic data processing pipelines. Paper III then draws inspiration from chemometric approaches to determine which predictions from a machine learning model can be trusted, and develops a method to allow artificial neural network models to detect unrecognized data. Paper IV then demonstrates how experimental design principles can be used to ensure quality in the core of contemporary machine learning, namely generation of large-scale datasets. Lastly, we outline directions for future research where state-of-the-art research in machine learning can benefit chemometric method development.
2 | Background
2.1. What is data?
Since this thesis regards the topics of generation and analysis of data, we first explain what we mean by data. One definition states that data are pieces of information or facts, often numerical, that are collected through observation [12]. For the purpose of this work, the term information is used in a colloquial sense rather than the formal information-theoretic sense. This definition is deliberately broad for the purpose of this thesis, which involves analysis of several types of data. A dataset is then simply a collection of observations of the same type of information, where each observation may carry one or more pieces of information, or variables.
Since data analysis and machine learning deal with mathematical manipulation of data, data must be in numeric form and observations must follow the same data model in order to be useful for those purposes. Data where observations follow a defined data model are often referred to as structured data. For instance, tabular data where each row consists of an observation is perhaps the most famous example of structured data. Other types of structured data include observations of single, real-valued measurements from a single sensor, for instance a pH meter. Further examples include spectral data, where each observation is a sequence, or vector, of real-valued intensities or absorbances measured at different wavelengths. A frequently occurring problem in real-world structured data is missing data, where one or more variables lack measurements. Missing data may occur either randomly or systematically, where the latter is more problematic from a data analytic perspective since it may for instance introduce biases.
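As a minimal illustration, a structured tabular dataset can be held as an n × p table of rows; the numbers below are hypothetical, and `None` marks a missing measurement:

```python
# A small hypothetical structured dataset: each row is one observation,
# each column one variable (say temperature, pH and absorbance).
data = [
    [20.1, 7.4, 0.31],
    [21.5, None, 0.28],  # the pH measurement is missing here
    [19.8, 7.1, 0.35],
]

n = len(data)     # number of observations
p = len(data[0])  # number of variables

# Count missing values per variable: missingness concentrated in one
# column hints at a systematic cause, spread-out missingness at a random one.
missing_per_variable = [sum(row[j] is None for row in data) for j in range(p)]
print(n, p, missing_per_variable)  # 3 3 [0, 1, 0]
```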
The opposite of structured data is so-called unstructured data, which refers to data that lack a defined data model. An unstructured observation is a piece of information, but lacks a direct way to represent it numerically for the purpose of data analysis or machine learning. An example of unstructured data is provided by textual data, which typically contain information but cannot be directly represented in a structured numeric form. Another example is provided by image data, which lack a direct way of defining how the pixels relate to each other in order to be useful, even though each pixel value is numeric. This means that raw unstructured data must be subject to data preprocessing, such as feature extraction, in order to convert it into structured form (see section 2.2.3).
Since many problems in both machine learning and chemometrics involve fitting a model to transform one form of data into another, we can distinguish different types of data based on whether or not the observations form pairs. For brevity we do not discuss multiblock analysis, where observations form triplets or more, which has been discussed at length elsewhere [13, 14]. Naming of variables in either input or output data is sometimes a source of confusion in interdisciplinary communication. The variables in the input data may be called predictors, regressors, factors, covariates or independent variables, or often features in machine learning, whereas the variables in the output data are often called responses in chemometrics, targets or labels in machine learning, or dependent variables or regressands, among others, depending on the field. The output variables are further distinguished based on whether they are measured or predicted. Furthermore, the process of manually creating targets for input data is sometimes called annotation or labelling, which is most common for unstructured data.
2.2. Machine Learning
This section serves to provide a basic understanding of the machine learning concepts used in this thesis. It by no means provides a comprehensive overview of the field.
Machine learning is a subfield of artificial intelligence (AI) and regards the specific type of problem of enabling software to learn from examples (figure 2.1). AI is a vast field and involves techniques such as search and planning algorithms and concepts such as reasoning. Although machine learning comprises a subset of AI, the two terms are sometimes used interchangeably. Furthermore, in most applications encountered in the literature, machine learning is restricted to learning from a fixed training set to fit a model that is then used for predictive purposes. Considering the complex dynamics of the world, the scope of such learning is very narrow, but it has proven very useful. Although there are machine learning algorithms that learn from interactions with a dynamic environment, such as reinforcement learning [15], they are beyond the scope of this thesis.

Figure 2.1: Illustration of the hierarchy of artificial intelligence and its subfields machine learning and deep learning.
The term machine learning may refer to two different, but related, topics. It may refer to the process of applied machine learning, aiming to develop useful models for the problem at hand (illustrated in figure 2.2). Applied machine learning includes transforming data to useful form by means of data preprocessing (section 2.2.3), application of machine learning algorithms to fit a predictive model, scoring and evaluation of models (described in section 2.3.3), and deploying models for downstream use, where they may be monitored to ensure that they are suitable for their task (section 2.3.4). The term may also refer to the actual machine learning algorithms, which are the main topic of this section.
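One step in this workflow, the split of preprocessed data into a training part and an evaluation part, can be sketched in a few lines of pure Python. The observations and the 80/20 split ratio below are illustrative assumptions, not prescriptions:

```python
import random

# Sketch of one step of the applied machine-learning workflow: split
# preprocessed data into a training part (used to fit the model) and an
# evaluation part (used to score it). Data and split ratio are hypothetical.
random.seed(0)                          # fixed seed for reproducibility
observations = list(range(10))          # stand-ins for preprocessed observations
random.shuffle(observations)            # avoid any ordering bias in the split

n_train = int(0.8 * len(observations))  # 80/20 split, a common convention
train, evaluation = observations[:n_train], observations[n_train:]
print(len(train), len(evaluation))      # 8 2
```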
Figure 2.2: Illustration of applied machine learning as a multi-step process. Given raw data, the data is preprocessed into suitable form for model training. The preprocessed data is then typically split into training data, used to update the model, and evaluation data, used to evaluate and score the model. In most cases, many different models are trained and scored before the final model is deployed for downstream use.
In broad terms, machine learning algorithms are divided into two main regimes: supervised and unsupervised learning. The most widely used regime is supervised learning, which refers to when the training set consists of input-output pairs and we train a model to predict the target output well based on the input features. Measured targets are considered ground truth and are used to determine how well the model performs, and we train our models to minimize the error with respect to the ground truth, meaning that the training is supervised by our actual measured targets. Problem types falling under supervised learning are typically divided based on the nature of the target and include, but are not limited to, regression, where the target is a continuous value, and classification, where the target is a discrete category.
Unsupervised learning, on the other hand, deals with training data where the observations have no target associated with them, and aims to find a useful representation of the data. This representation may be used to interrogate the data and help find general trends or subgroups in the dataset. An example of an unsupervised machine learning problem is dimensionality reduction, which serves to find a low-dimensional representation of a dataset with many variables, for instance by means of PCA [8]. Closely related to unsupervised learning is self-supervised learning, which has shown great results in several domains recently [16, 17, 18]. Self-supervised learning uses supervised learning algorithms, but the targets are generated from the training inputs themselves instead of being provided externally. An example in natural language processing shows that large artificial neural network models can be trained to fill in blanks in text at very large scale, learning representations of language useful for downstream tasks such as document classification [16]. Another example is provided by so-called consistency-based approaches, which show convincing results in computer vision and train models to output consistent representations for an image and a randomly perturbed version of the same image [18, 17]. These types of approaches, combined with smaller amounts of labelled data in a so-called semi-supervised fashion, are quickly growing popular in domains where raw data is plentiful.
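To make the dimensionality-reduction idea concrete, the following pure-Python sketch projects two correlated, mean-centred variables onto the leading eigenvector of their 2 × 2 covariance matrix, giving one latent score per observation instead of two variables. The numbers are hypothetical, and this hand-rolled 2 × 2 eigendecomposition only illustrates the principle behind PCA, not the algorithm as used in practice:

```python
import math

# Toy PCA-style dimensionality reduction on hypothetical, mean-centred data:
# five observations of two correlated variables.
X = [(-2.0, -1.9), (-1.0, -1.1), (0.0, 0.1), (1.0, 0.9), (2.0, 2.0)]
n = len(X)

# Entries of the 2 x 2 sample covariance matrix [[sxx, sxy], [sxy, syy]].
sxx = sum(x * x for x, _ in X) / (n - 1)
syy = sum(y * y for _, y in X) / (n - 1)
sxy = sum(x * y for x, y in X) / (n - 1)

# Largest eigenvalue of a symmetric 2 x 2 matrix, and its eigenvector.
lam = 0.5 * (sxx + syy + math.sqrt((sxx - syy) ** 2 + 4 * sxy ** 2))
v = (sxy, lam - sxx)
norm = math.hypot(v[0], v[1])
v = (v[0] / norm, v[1] / norm)  # unit-length direction of maximal variance

# Scores: a single latent coordinate per observation.
scores = [x * v[0] + y * v[1] for x, y in X]
print([round(t, 2) for t in scores])
```

Each score summarizes both original variables along the direction where the data vary the most, which is the essence of projecting onto latent variables.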
2.2.1. Machine Learning Models
Models as function approximators In its most general form, we describe a machine learning model as a function approximator. This means that the model is used to approximate a mathematical function that maps from the observation space X to the response space Y. Given an observation x ∈ X, and our model f : X → Y parameterized by the set θ, the response y ∈ Y is approximated by the prediction ŷ ∈ Y as:
y ≈ ŷ = f(x, θ)    (2.1)
For structured and numerical data, such as tabular or spectral data, x is an observation vector x and the observation space X = R^p, where p is the dimensionality of x, and the matrix of all n input observations is the n × p matrix X. Similarly, the target y is also often a vector y and Y = R^q with dimensionality q, and the matrix of targets for all observations is an n × q matrix Y. The predicted target ŷ has the same form as y, with corresponding prediction vector ŷ and matrix of predicted responses Ŷ. In the following sections, we only discuss numeric vector observations unless specified otherwise.
The specifics of the actual model are hidden behind f above, and there is a myriad of model types available with different complexity, useful for different problems. The complexities of these models range from something as simple as basic multiple linear regression (MLR), with regression parameter vector θ = θ = [θ0, θ1, ..., θp] predicting a scalar ŷ:

$$\hat{y} = \theta_0 + x_1\theta_1 + \dots + x_p\theta_p \qquad (2.2)$$
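In code, equation 2.2 is a single dot product plus an intercept. This pure-Python sketch uses hypothetical parameter values:

```python
# Prediction with multiple linear regression (equation 2.2):
# y_hat = theta_0 + x_1 * theta_1 + ... + x_p * theta_p
def predict_mlr(x, theta):
    """x: observation of length p; theta: [theta_0, theta_1, ..., theta_p]."""
    return theta[0] + sum(xj * tj for xj, tj in zip(x, theta[1:]))

# Hypothetical parameters: intercept 1.0, slopes 2.0 and -0.5 (p = 2).
theta = [1.0, 2.0, -0.5]
print(predict_mlr([3.0, 4.0], theta))  # 1 + 2*3 - 0.5*4 = 5.0
```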
A slightly more complex model, which is popular in chemometrics, is partial least squares regression (PLS) [9], parameterized by weight matrices θ = {W, C}, where the p × k matrix W is used to project to an intermediate, so-called latent, space, from which the q × k matrix C is then used to project from the latent space to the output space. It may be written as:

$$\hat{\boldsymbol{y}} = \boldsymbol{x}\boldsymbol{W}\boldsymbol{C}^T \qquad (2.3)$$

Other popular machine learning models include Random Forests [19], which consist of many shallow decision trees, a so-called ensemble, that we denote with slightly abused notation as θ in equation 2.1. The decision trees' learning algorithm trains the decision trees to divide observations into compartments in an optimal way according to a criterion of interest. Other well-known examples include k-nearest neighbor regression or classification [20] and support vector machines [21]. There are also so-called meta-algorithms available to optimally combine several models into an ensemble, such as a random forest, where examples include the popular AdaBoost algorithm [22]. A model type that has grown highly popular in the past decade is provided by so-called Artificial Neural Networks (ANNs) and deep learning, and will be described separately below in 2.2.2, Deep Learning. In conclusion, machine learning models aim to approximate the underlying function that gives rise to our observed data. The specifics of how to model these functions are governed by the many different available model types and their respective learning algorithms.
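The two projections of equation 2.3 can be sketched in a few lines of pure Python. The weight matrices below are small hypothetical examples (p = 3 variables, k = 2 latent components, q = 1 response), not fitted PLS weights:

```python
# Sketch of PLS-style prediction (equation 2.3): project the observation x
# onto k latent components with W, then map the scores to the response with C.
def pls_predict(x, W, C):
    p, k = len(W), len(W[0])
    t = [sum(x[j] * W[j][a] for j in range(p)) for a in range(k)]   # t = x W
    return [sum(t[a] * c_row[a] for a in range(k)) for c_row in C]  # t C^T

# Hypothetical weights (not fitted): p = 3, k = 2 latent components, q = 1.
W = [[0.5, 0.1],
     [0.5, -0.1],
     [0.0, 1.0]]
C = [[1.0, 2.0]]
print(pls_predict([2.0, 2.0, 1.0], W, C))  # scores t = [2.0, 1.0] -> [4.0]
```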
Loss functions In order to quantify to what extent our model is correct, we use so-called loss functions L : Y² → R that numerically describe the model error by comparing the predicted and measured outputs. How to compare predicted and measured outputs, as well as what particular loss function to use, depends on the problem at hand. An important part of the machine learning practitioner's work is therefore to determine suitable loss functions. For instance, the loss of a regression model is often given by the mean squared error (MSE), where the MSE calculated over all or a subset of observations in the dataset is given by:
$$\mathcal{L}_{\mathrm{MSE}}(\boldsymbol{Y}, \hat{\boldsymbol{Y}}) = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{q}\left(Y_{ij} - \hat{Y}_{ij}\right)^2 \qquad (2.4)$$
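Computed directly, equation 2.4 is only a few lines of pure Python; the numbers below are hypothetical:

```python
# Mean squared error (equation 2.4) for an n x q matrix of measured
# responses Y and predictions Y_hat, averaged over the n observations.
def mse(Y, Y_hat):
    n = len(Y)
    return sum(
        (y - yh) ** 2
        for y_row, yh_row in zip(Y, Y_hat)
        for y, yh in zip(y_row, yh_row)
    ) / n

Y = [[1.0], [2.0], [3.0]]      # hypothetical measured responses (q = 1)
Y_hat = [[1.5], [2.0], [2.0]]  # hypothetical predictions
print(mse(Y, Y_hat))  # (0.25 + 0.0 + 1.0) / 3
```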
For classification models, where each ground-truth target variable y_i is 1 for the class the observation belongs to and 0 otherwise, and the model predicts probabilities ŷ_i ∈ [0, 1] so that $\sum_{i=1}^{q} \hat{y}_i = 1$, the loss is often given by the average cross-entropy:

$$\mathcal{L}_{\mathrm{CE}}(\boldsymbol{Y}, \hat{\boldsymbol{Y}}) = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{q} Y_{ij}\log\left(\hat{Y}_{ij}\right) \qquad (2.5)$$
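The same pattern gives the average cross-entropy of equation 2.5; the one-hot targets and predicted probabilities below are hypothetical:

```python
import math

# Average cross-entropy (equation 2.5): each row of Y is a one-hot
# ground-truth vector, each row of Y_hat the predicted class probabilities.
def cross_entropy(Y, Y_hat):
    n = len(Y)
    return -sum(
        y * math.log(yh)
        for y_row, yh_row in zip(Y, Y_hat)
        for y, yh in zip(y_row, yh_row)
        if y > 0  # with one-hot targets only the true class contributes
    ) / n

Y = [[1, 0], [0, 1]]              # hypothetical one-hot targets
Y_hat = [[0.8, 0.2], [0.4, 0.6]]  # hypothetical predicted probabilities
print(round(cross_entropy(Y, Y_hat), 3))  # -(log 0.8 + log 0.6) / 2
```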
The loss function may have significant impact on the the behaviour of the resulting model and developing new loss functions is an active research area in itself. A full survey of avail able loss functions is however beyond the scope of this thesis. An example to illustrate the flexibility of loss functions is provided by MSE that is widely used for regression problems but suffers from high sensitivity to outlier observations due to the squared residual. Differ ent countermeasures are available such as using the absolute value of the residual instead, also known as L1loss. L1loss is less sensitive to outliers but results in a derivative with constant magnitude that has implications for optimization as we will see below. So, one possible way to balance between insensitivity to outliers and ease of optimization is pro vided by a recently proposed family of loss functions [23] that allows us to tune sensitivity to outliers. For a scalar output y we let ϵ = y − yˆ and the loss is given by: ! |2 − α| ϵ2 α/2 L (ϵ, α) = − 1 (2.6) Barron α |2 − α|
where α ∈ ℝ is a parameter used to tune the sensitivity, with for instance $\lim_{\alpha \to 2} L_{\text{Barron}} = L_{\text{MSE}}$. To conclude, by designing the loss function carefully we gain great control over the properties of our machine learning model.
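To illustrate the tunable sensitivity, a direct transcription of equation 2.6 can be sketched as follows (α = 0 and α = 2 are excluded since the loss is defined there only in the limit):

```python
def barron_loss(eps, alpha):
    """General robust loss of Barron (equation 2.6); alpha tunes the
    sensitivity to outliers. Undefined at alpha = 0 and alpha = 2,
    where only the limits exist."""
    a = abs(2.0 - alpha)
    return (a / alpha) * ((eps ** 2 / a + 1.0) ** (alpha / 2.0) - 1.0)

eps = 3.0
# Near alpha = 2 the loss approaches eps**2 / 2, i.e. MSE-like:
print(barron_loss(eps, 1.999))   # ~4.5
# Smaller alpha grows sub-quadratically, down-weighting outliers:
print(barron_loss(eps, 1.0), barron_loss(eps, 0.5))
```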
Model fitting Model training, or fitting, describes the actual learning part of machine learning and is a central part of the field. Our model f(x, θ) describes how to map an input observation to an output given a set of parameters, and the loss function quantifies how wrong its predictions are. The missing piece is then how to find parameters that minimize the loss, which is what training algorithms aim to solve. Depending on what model and loss function are used, there are many algorithms available, where some algorithms are specific to the model at hand and others are more general.
Certain combinations of model and loss function have closed-form solutions, meaning that the model parameters are found by solving an equation. For instance, multiple linear regression (equation 2.2) minimizing the MSE (equation 2.4) has the closed-form solution:
$\theta = (X^T X)^{-1} X^T y$ (2.7)
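Equation 2.7 can be checked numerically on synthetic data (a sketch with made-up data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))            # 50 observations, 3 variables
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.01 * rng.normal(size=50)

# Normal-equations solution (equation 2.7):
theta = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta)  # close to [1.0, -2.0, 0.5]
```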
Similarly, PLS (equation 2.3) maximizes the covariance between XW and YC, which can be solved via the first eigenvector of the combined variance–covariance matrix X^T Y Y^T X.
In most cases there is no closed-form solution available, and iterative optimization algorithms are necessary to minimize the loss function. Optimization algorithms can be divided into two coarse categories: gradient-based and gradient-free algorithms. As the name implies, gradient-based algorithms rely on the gradients, that is the first- or second-order (partial) derivatives, of the loss function, whereas gradient-free algorithms do not. We will not describe gradient-free algorithms in depth, but example methods include random search [24], particle swarm optimization [25], simulated annealing [26] and many more.
Among gradient-based algorithms, we focus on a method widely used in machine learning, namely gradient descent. The key idea behind gradient descent is to stepwise adjust the parameters in the opposite direction of the gradient of the loss function at the current point. The gradient describes the direction of steepest slope, and the whole procedure may be metaphorically described as finding the bottom of a bowl by taking small steps in the steepest downwards direction. Formally, for the average loss L_{x,y}, calculated over all training set observations, at iteration i we update the kth model parameter θ_{k,i} using the update rule:
$\theta_{k,i+1} = \theta_{k,i} - \lambda \nabla_{\theta_{k,i}}(L_{x,y})$ (2.8)
where $\nabla_{\theta_{k,i}}(\cdot)$ denotes the gradient with respect to θ_{k,i} and the learning rate λ ∈ ℝ determines how large steps we take. Although popular, gradient descent struggles with certain types of functions and is not guaranteed to find the global minimum in non-convex problems (multiple hollows on a surface, where the global minimum is the deepest hollow, according to our metaphor). Despite these shortcomings, gradient descent often works very well in practice. For large datasets, and especially in deep learning, the average loss is often calculated from a random subset of the training dataset instead of all observations. This procedure is typically called stochastic gradient descent (SGD), or sometimes mini-batch gradient descent. There are many popular variants of gradient descent that aim to improve model convergence, for instance by adding different types of momentum [27].
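A minimal sketch of the update rule in equation 2.8, fitting a one-parameter linear model on hypothetical data (a full-batch rather than stochastic variant):

```python
import numpy as np

# Fit y = theta * x by gradient descent on the MSE (equation 2.8).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x                      # true slope is 2

theta, lr = 0.0, 0.05            # initial parameter and learning rate
for _ in range(200):
    residual = theta * x - y
    grad = 2.0 * np.mean(residual * x)   # dL/dtheta of the MSE
    theta = theta - lr * grad            # the update rule
print(theta)  # converges to ~2.0
```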
2.2.2. Deep Learning
This section serves to give a brief introduction to deep learning and relate it to the general machine learning concepts described above. Deep learning is a fast-moving field that has been surrounded by hype and high expectations in the past decade. We will not describe state-of-the-art methods but rather explain core concepts.
As described in a previous section, deep learning is a subfield of machine learning (figure 2.1) that deals with a specific type of machine learning model called deep artificial neural networks (ANNs). Deep learning and its history have been described at length in previous work [28, 29], but we will summarize them briefly. Deep ANNs describe a multi-step, non-linear transformation of input data to output data (see illustration in figure 2.3) and have their roots in seminal work on computational models of biological neurons in the 1940s [30]. The Perceptron, a progenitor to modern ANNs, was published in the 1950s [31], and the first multi-layered networks were developed little more than a decade after that [32]. Although dating back many decades, deep ANNs did not have their mainstream breakthrough until 2012, when deep learning was used to win a major computer vision competition with a huge margin [33]. The main reason it took so long is that deep ANNs require vast datasets and computational resources to work well, something that was not widely available until the 2010s. Since then, ANNs and deep learning have made rapid progress and are used for complex prediction problems in biology [34, 35], chemistry [36], medical imaging [37], material science [38], fluid dynamics [39] and many other fields. With rapid algorithmic development, increased computational power and ever-growing data availability, as well as raised expectations on predictive power, deep learning is likely here to stay.
Forward pass In most cases, ANNs describe a sequence of mathematical operations transforming input to output that can be represented as a directed acyclic graph. These types of ANNs are referred to as feed-forward networks, and applying a feed-forward ANN to an observation is often called the forward pass. One example of non-feed-forward networks is provided by recurrent neural networks, briefly described below in section 2.2.2 ANN architectures.
As illustrated in figure 2.3, the predicted output of a feed-forward ANN is given by sequentially multiplying variables with parameters, typically called weights in this context, summing the products, adding a bias parameter and then applying a non-linear activation function. Disregarding the activation function for a moment, the intermediate values, also called activations, are given by equations identical to multiple linear regression above (equation 2.2). For two-dimensional data and a fully connected network as shown in figure 2.3, we collect the weights of each layer l ∈ ℕ in weight matrices W_l and bias vectors b_l and conveniently write the prediction as the nested function:
$\hat{y} = f_l\big(f_{l-1}\big(\ldots f_1(x W_1 + b_1) W_2 + b_2 \ldots\big) W_l + b_l\big)$ (2.9)
Note that without activation functions, PLS (equation 2.3) can be written as an ANN with a single intermediate layer and bias vectors b₁ = b₂ = 0.
The activation functions f_l : ℝ → ℝ can be any differentiable function and serve to induce non-linearity, helping the ANN learn more complex functions. One example of an activation function is the rectified linear unit, ReLU(x) = max(0, x), which is among the most widely used activation functions in practice.
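The forward pass of equation 2.9 can be sketched for the small fully connected network of figure 2.3 (random weights, purely illustrative; ReLU on the hidden layers and a linear output):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def forward(x, weights, biases):
    """Forward pass of a fully connected feed-forward ANN
    (equation 2.9): repeated affine transform plus activation,
    with a linear output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)              # hidden layers
    return h @ weights[-1] + biases[-1]  # output layer

rng = np.random.default_rng(1)
# 2 inputs -> 3 hidden -> 3 hidden -> 1 output, as in figure 2.3.
shapes = [(2, 3), (3, 3), (3, 1)]
weights = [rng.normal(size=s) for s in shapes]
biases = [np.zeros(s[1]) for s in shapes]

x = np.array([[0.5, -1.0]])       # one observation, two variables
print(forward(x, weights, biases).shape)  # (1, 1)
```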
As already mentioned, the ANN model describes a directed graph of sum-products and activations, meaning that we are not restricted to two-dimensional data. To generalize from 2D to any regular input, intermediate and output data, as well as weights and biases, we denote them using tensors.

Figure 2.3: Illustration of the forward pass using an ANN with two input variables x_i, two intermediate layers H1 and H2 with three intermediate variables each, and a single output y as example. During the forward pass, each intermediate value, or activation, h_{·j} ∈ ℝ is given by the sum of the values in the preceding layer multiplied by parameters, or weights, w_{·j} ∈ ℝ, plus a bias term b_j ∈ ℝ, passed through a differentiable non-linear activation function f : ℝ → ℝ. Adapted from [28].

Tensors can be viewed as multi-dimensional arrays, or as a generalization of vectors (tensors of order 1) and matrices (tensors of order 2), and serve as the basic data structure in widely used software libraries for deep learning [40, 41]. Tensor manipulation is especially useful for unstructured data such as images, which can be described as order-3 h × w × c tensors, where h is the height, w the width and c the number of channels (e.g. three for RGB images, or more for multi- or hyperspectral ones).
Backpropagation Typical ANNs are multi-layered, complex models that are often over-parameterized, meaning that they have more parameters than there are observations available to train the model. Perhaps obviously, this means that ANNs have no closed-form solution for model training and must be trained using optimization algorithms as described above (section 2.2.1 Model fitting). The most widely used approach to train ANNs is the backpropagation algorithm (summarized in figure 2.4) combined with some variant of gradient descent [42], an approach originally developed in the field of control theory [43]. As described above, gradient descent updates the model parameters, i.e. the ANN weights, based on the first-order derivative of the loss with respect to each parameter.

Figure 2.4: Illustration of backpropagation in an ANN. During backpropagation, the derivative of the loss with respect to each intermediate layer is calculated by repeatedly applying the chain rule of partial derivatives from the output and backwards through the ANN layers. Adapted from [28].

In a multi-layered model, such as a deep ANN, calculating the gradient of the output loss with respect to a parameter in the middle of a large model is not trivial. To solve this problem, backpropagation provides a straightforward algorithm based on the chain rule of partial derivatives that calculates the gradients starting with the output layer and propagates the gradient backwards, as described in figure 2.4. That is, backpropagation gives us the gradients $\nabla_{\theta_{k,i}}$ used to update the parameters in gradient descent (equation 2.8).
We shall note that training deep ANNs is a highly non-convex problem, with high dimensionality due to the large number of parameters and a large number of saddle points, which makes optimization non-trivial [44]. Despite these difficulties, the seemingly simple algorithms of backpropagation and stochastic gradient descent help deep ANNs converge to models that generalize to an unprecedented extent on many problems. At the time of writing this thesis, the theory of why deep learning works so well is poorly understood. Unravelling the reasons behind it is a highly active research field [45, 46, 47], sometimes showing counter-intuitive results such as the "double-descent" phenomenon [48]. We conclude that backpropagation has helped deep learning revolutionize many aspects of modern machine learning, although why is not fully understood.
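To make the chain rule concrete, a backward pass through a hypothetical one-hidden-layer network can be written out by hand and checked against a finite-difference estimate:

```python
import numpy as np

# Backpropagation for a tiny one-hidden-layer network with MSE loss:
# gradients flow from the loss backwards via the chain rule.
rng = np.random.default_rng(2)
x = rng.normal(size=(4, 2))      # 4 observations, 2 inputs
y = rng.normal(size=(4, 1))
W1, W2 = rng.normal(size=(2, 3)), rng.normal(size=(3, 1))

# Forward pass (tanh hidden activation), keeping intermediates.
h = np.tanh(x @ W1)
y_hat = h @ W2
loss = np.mean((y_hat - y) ** 2)

# Backward pass: chain rule from output towards input.
d_yhat = 2.0 * (y_hat - y) / len(y)     # dL/d y_hat
dW2 = h.T @ d_yhat                      # dL/dW2
d_h = d_yhat @ W2.T                     # propagate to hidden layer
dW1 = x.T @ (d_h * (1.0 - h ** 2))      # through the tanh derivative

# Sanity check dW1 against a finite-difference estimate.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
loss_p = np.mean((np.tanh(x @ W1p) @ W2 - y) ** 2)
print(abs((loss_p - loss) / eps - dW1[0, 0]))  # ~0
```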
ANN architectures Another aspect of deep ANNs that has had a large impact on their success is that of model architectures, which define the way the sequence of operations is arranged within the model. The most basic architecture is the so-called fully connected feed-forward network, in which the ANN is defined as a simple sequence of layers where the activations in a layer depend on all activations in the preceding layer (as shown in figure 2.3). Thanks to the flexible network structure of ANNs, we can design sophisticated architectures with better inductive biases, i.e. make the networks better predisposed to learn and generalize on certain tasks. The need for inductive biases is well known in machine learning [49, 50], and designing ANN architecture variants is a research field in itself.
Perhaps the most famous example of a family of ANN architectures is provided by so-called convolutional neural networks (CNNs), which are widely used for all data with spatial correlations, especially for vision tasks. The basic mechanism was proposed in the 1980s [51, 52]: the model weights are arranged into small kernels that are reused over the whole spatial extent of an observation in a sliding-window fashion (illustrated in figure 2.5). Using kernels that operate locally allows CNNs to efficiently learn general features that can be reused over the whole spatial extent and then combined into complex features in deeper layers of the network. Large-scale training of CNNs was used to win the ImageNet Large Scale Visual Recognition Challenge 2012 [33] with a large margin, which led to the widespread adoption of ANNs that we see today.

Figure 2.5: Illustration of the application of a 2 × 2 kernel in a convolutional neural network for image data. In a sliding-window fashion, the pixel values in each region of the input are multiplied by the kernel weights and then summed to form the intensity of one pixel in the next layer.
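The sliding-window kernel mechanism of figure 2.5 can be sketched as follows (a naive loop implementation; deep learning libraries use far more efficient routines, and, like them, this computes cross-correlation rather than flipped convolution):

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D convolution: slide the kernel over the image and
    sum the elementwise products at each position."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16.0).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])  # a 2x2 kernel as in figure 2.5
out = conv2d(image, kernel)
print(out.shape)  # (3, 3): one output pixel per kernel position
```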
Although the most famous example, CNNs are not the only family of ANN architectures available. To process sequential observations, different variants of so-called recurrent neural networks (RNNs) are sometimes used, an example of non-feed-forward ANNs. RNNs operate by passing on the activations of the preceding observation along with the current one to enable learning of complex time-dependent features, and include models such as Long Short-term Memory networks [53]. In the past few years, so-called transformers [54] have emerged as a flexible and promising architecture family. Transformers also operate on sequential data, by repeatedly applying scaled dot-product attention (not described here), and have been massively successful for textual data [16], recently showing great promise for image data as well [55]. To summarize, part of why deep learning has been immensely successful is that network architectures create inductive biases enabling deep ANNs to efficiently learn to solve certain problems.
2.2.3. Data Preprocessing
Preprocessing is a broad term and may include everything that data are subjected to before modelling. This ranges from digitizing measured but analog data, identifying relevant observations in historical databases and merging disjoint datasets, to scaling individual numeric values. To limit the scope of the discussion, we will only briefly describe the process of transforming data from unstructured to structured format, often referred to as feature extraction or feature engineering. That is, the process of finding a representation of the data useful for the problem at hand.
As described above (section 2.1), data need to be in a structured and numeric format in order to be useful to machine learning algorithms. For most unstructured data, there are many options available for how to transform it, each with their strengths and weaknesses. Consider textual data as an example: a simple way of preprocessing is provided by so-called bags-of-words [56]. A bag-of-words is an n × p matrix of n observations (such as sentences or documents) where each element describes how many times the observation uses each of the p words in the vocabulary. Note that there is no definite way to construct a bag-of-words from a given dataset, and configuration parameters, such as the minimum number of times a word must be used in order to be included in the vocabulary, have a large impact on the resulting representation. The bag-of-words representation is simple and effective but comes with the drawback that all location information is discarded, meaning that any sense of the order in which words are used is missing.
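A minimal bag-of-words construction, including the minimum-count configuration parameter mentioned above (whitespace tokenization is a simplifying assumption):

```python
from collections import Counter

def bag_of_words(documents, min_count=1):
    """Build an n x p bag-of-words matrix. Words used fewer than
    `min_count` times in total are dropped from the vocabulary."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    totals = Counter()
    for c in counts:
        totals.update(c)
    vocab = sorted(w for w, n in totals.items() if n >= min_count)
    matrix = [[c[w] for w in vocab] for c in counts]
    return vocab, matrix

docs = ["the cat sat", "the cat ate the fish"]
vocab, X = bag_of_words(docs)
print(vocab)  # ['ate', 'cat', 'fish', 'sat', 'the']
print(X)      # [[0, 1, 0, 1, 1], [1, 1, 1, 0, 2]]
```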
Feature extraction is widely studied for other more or less unstructured types of data. Computer vision, for instance, offers a vast array of methods to describe image data in ways useful for both pixel-wise and whole-view analysis. These methods range from low-level operations such as edge detection [57] to more elaborate methods for shape detection [58, 59]. Another example is time-series data which, although structured in itself, is often subject to preprocessing such as binning into fixed time intervals, calculation of moving statistics at different time resolutions, differentiation, lagging and autocorrelation of the different features, and so on. Similarly, differentiation over the spectral dimension is common when analyzing spectral data in chemometrics.
Feature extraction is intimately connected to the inductive biases that we strive for in machine learning (discussed briefly in section 2.2.2 ANN architectures). Good feature extraction creates good inductive biases for our machine learning models by drawing on human expertise. In the past decade, the rise of deep learning has resulted in feature extraction being largely replaced by feature learning in certain domains. In contrast to feature extraction, feature learning describes the use of machine learning algorithms to find an optimal representation of the data. Deep ANNs have shown a remarkable ability to learn useful representations for many kinds of complex, unstructured data [28, 29]. For instance, convolutional neural networks not only learn low-level features such as edge detectors, they also learn how to assemble them into abstract high-level concepts that are more robust than what has been accomplished using feature engineering approaches [60]. In natural language processing, even relatively simplistic feature learning methods, such as word embeddings, learn semantic relationships between words in a way that is incredibly challenging to accomplish using feature extraction [61]. By allowing general machine learning principles to find optimal features for the problem at hand, deep ANNs have not only established new levels of state-of-the-art performance. They have also, to a certain extent, democratized machine learning by making it more accessible for unstructured data problems.
2.3. Chemometrics
In contrast to machine learning, which focuses on developing principles and algorithms for learning regardless of problem domain, chemometrics aims to extract information from a certain type of data, chemical data, using methods that are practical. Due to the complex nature of chemistry, the general principles of chemometrics emerged partially as a reaction to the single-factor experiments and analysis prominent in the experimental sciences [5]. Instead, chemometrics draws upon multivariate statistics and machine learning algorithms to generate and analyze data with many factors or variables, with a large focus on knowledge generation. For simplicity, we choose to view chemometrics as a four-step process, as illustrated in figure 2.6, where each step comes with its own set of methods.
Figure 2.6: Illustration of chemometrics as a process to generate and apply knowledge using data-driven methods, in four steps: data generation, data analysis, predictive modelling and model diagnostics.
An important part of chemometrics is the use of data-driven methods all the way throughout the data lifecycle. Statistical design of experiments (DOE) is used to help generate information-rich datasets, and multivariate data analysis (MVDA) to extract information from data with many variables. In terms of predictive modelling, chemometrics already uses machine learning models, although with a large focus on linear latent variable methods. For decades [62], chemometrics has placed great focus on model diagnostics to validate that predictive models and new observations fall within what is expected. These principles are widely applicable and are now used in frameworks such as Quality-by-Design (QbD), which is increasingly used in the pharmaceutical industry [6].
2.3.1. Data generation
Not all data are created equal. Seemingly obvious but sometimes overlooked, an important part of data-driven research is to ensure that the data are representative of the problem to be solved. Perhaps the best way to ensure representative data is to design the experiments in such a way that the resulting data are optimal for data-driven modelling, which is the purpose of DOE and stems from century-old research [63]. Most real-life systems are influenced by more than one factor, meaning that it is necessary to investigate interactions between factors to fully characterize the system. A fundamental challenge as the number of factors increases is that the number of experiments needed to test all combinations of factors grows exponentially. DOE provides methods to select the experiments that are maximally useful for data-driven modelling of the investigated system.

Figure 2.7: Graphic illustration of three types of 3-factor designs (full factorial, fractional factorial and generalized subset design), where the blue circles indicate unique combinations of factors A, B and C to include as experiments in the design. The full and fractional factorial designs investigate two levels of all factors, whereas the generalized subset design investigates three levels in factors A and B and two in factor C.
To generate data optimal for linear model fitting, designs based on combinations of factors investigated at two levels are widely used (figure 2.7). For instance, the classical full factorial design explores all possible factor combinations (plus optional center-point experiments), which allows identification of all possible interaction terms in a multiple linear regression model (equation 2.2). The downside of full factorial designs is that they are not very economical, since the number of required experiments grows exponentially with the number of factors. Fractional factorial designs, on the other hand, sacrifice the possibility to identify certain interaction terms in order to cut the required number of experiments by a power of two. These types of designs can also be generalized to more than two levels per factor, for instance using generalized subset designs (GSDs) [64] that allow the design of data-efficient subsets in multi-level experiments.
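Generating the two-level full factorial design and a half-fraction can be sketched as follows (the defining relation I = ABC for the fraction is a standard textbook choice, not taken from the thesis):

```python
from itertools import product

def full_factorial(n_factors, levels=(-1, 1)):
    """All level combinations of n factors: 2**n runs for two levels."""
    return list(product(levels, repeat=n_factors))

design = full_factorial(3)
print(len(design))  # 8 runs for 3 two-level factors
print(design[0])    # (-1, -1, -1)

# A half-fraction keeps the runs where the product of the three
# factor levels is +1, aliasing C with the AB interaction:
fraction = [run for run in design if run[0] * run[1] * run[2] == 1]
print(len(fraction))  # 4 runs
```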
A more general type of design is provided by optimal designs. As the name implies, these are designs that are optimal with respect to some statistical criterion and found using optimization algorithms. In short, out of a candidate set of possible experiments, an optimal design picks the set of experiments that best meets the criterion. By selecting a suitable candidate set, it is straightforward to incorporate constraints on which experiments are allowed. There are many types of optimal designs available, a popular one being the so-called D-optimal designs [65], which maximize the determinant of the information matrix:
$X_{\text{D-optimal}} = \underset{X \in X_C}{\text{argmax}} \; |X^T X|$ (2.10)
where X ∈ X_C describes the candidate set. This criterion can be interpreted geometrically: the square root of the determinant is inversely proportional to the volume of the confidence ellipsoids, or uncertainty, of a linear model's parameters [65]. Hence, D-optimal designs minimize the resulting model parameter uncertainty, which is useful for instance during screening experiments when factor importances are investigated. Although we have not provided a comprehensive summary of available designs, we conclude that DOE serves as a framework to generate data suitable for the problem at hand.
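A brute-force sketch of equation 2.10 on a small hypothetical candidate set (real D-optimal software uses exchange algorithms rather than exhaustive search):

```python
import numpy as np
from itertools import combinations

def d_optimal(candidates, n_runs):
    """Brute-force D-optimal design (equation 2.10): pick the subset
    of candidate experiments maximizing det(X^T X). Exhaustive, so
    only feasible for small candidate sets."""
    best, best_det = None, -np.inf
    for idx in combinations(range(len(candidates)), n_runs):
        X = candidates[list(idx)]
        d = np.linalg.det(X.T @ X)
        if d > best_det:
            best, best_det = X, d
    return best, best_det

# Candidate set: a 3x3 grid of two factors, with an intercept column.
grid = np.array([[1.0, a, b] for a in (-1, 0, 1) for b in (-1, 0, 1)])
design, det = d_optimal(grid, n_runs=4)
print(design)  # the four corner points of the grid
```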
2.3.2. Data Analysis
Structured multivariate data analysis (MVDA) using linear latent variable models is often seen as a hallmark of chemometrics. Latent variables as a concept refer to the idea that many observed variables depend on some underlying variable that we cannot observe directly. In chemometrics, latent variables are inferred from the observed variables using data-driven models, typically linear ones denoting the latent variables as components [66, 62]. Historically, this approach has been referred to as "soft" modelling, in contrast to "hard" modelling where all variables are directly observed. One advantage of using latent variables instead of observed variables directly is that many real-world phenomena are difficult to measure directly, and each observed variable only shows a noisy part of the truth. By relating many observed variables to one latent variable, for instance by analyzing their covariance, we can often draw far better conclusions than by analyzing them one by one. For instance, in healthcare there is, at least to the knowledge of the author, no single variable to measure that indicates the general health of a patient. But combining many different measurements, such as blood pressure, blood cholesterol, BMI, age, and so on, allows us to form a bigger picture of health status. Another advantage is that the number of latent variables is often many times smaller than the number of observed variables, making interpretation of the underlying relationships, for instance to analyze trends or spot outliers, far easier. A third advantage is that latent variable models handle collinearity, i.e. correlated observed variables, well. Rather than struggling with model identification under collinearity, latent variable models often benefit from it and summarize said correlations.
A fundamental latent variable method is provided by principal component analysis (PCA) [67, 8], an unsupervised linear latent variable method based on matrix decomposition. The PCA components form an orthonormal projection basis that maximizes the explained variance of the training data X. That is, using k components, we decompose the n × p data matrix X into the n × k score matrix T and the p × k loading matrix P = [p₁, ..., p_k] as X = TP^T, and the first loading vector is given by:

$p_1 = \underset{\|p\|=1}{\text{argmax}} \; \|Xp\|^2$ (2.11)
Finding the first loading vector is an optimization problem (see section 2.2.1 Model fitting) with many possible solutions, one of which is the closed-form solution given by the eigenvector with the largest eigenvalue of the covariance matrix X^T X.
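The closed-form route to the first component can be sketched on synthetic two-variable data (hypothetical numbers, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
# Correlated 2-variable data: most variance lies along one direction.
latent = rng.normal(size=(200, 1))
X = np.hstack([latent, 0.9 * latent]) + 0.1 * rng.normal(size=(200, 2))
X = X - X.mean(axis=0)            # PCA assumes centred data

# Closed-form PCA: eigendecomposition of the covariance matrix X^T X.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
p1 = eigvecs[:, -1]               # loading with the largest eigenvalue
t1 = X @ p1                       # first score vector

# p1 maximizes the explained variance ||X p||^2 (equation 2.11):
print(eigvals[-1] / eigvals.sum())  # fraction of variance explained
```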
Being unsupervised and linear, PCA provides a basic tool for straightforward investigation of high-dimensional datasets by examining the found components ranked by their explained variance. Since there are far fewer components than variables, inspecting the score vectors provides a convenient way to get an overview of the dominant sources of variation in the dataset. Inspecting the loading vectors then provides transparent insight into how the original variables relate to the found components. Closely related to PCA but supervised, PLS (equation 2.3) allows targeted investigation by extracting components directed by the target of interest. Other decomposition-based methods include independent component analysis [68], multivariate curve resolution [69] and non-negative matrix factorization [70, 71]. Analysis of data where observations are characterized using two or more multivariate sets, or blocks, of variables falls under the scope of multi-block, or multi-way, data analysis [13, 14].
Although linear decomposition-based models benefit from straightforward implementation, making them attractive also for large-scale datasets [7], they are limited in terms of the complexity they can model. There are many methods available for non-linear modelling in data analysis, including formulations of common decomposition-based models as kernel methods [72]. Recently, methods for non-linear dimension reduction common in the machine learning community have found use in chemometric applications as well, for instance t-distributed stochastic neighbor embeddings [73, 74]. Since non-linear models are better able to capture complex relationships, they may provide a denser summarization of observations than linear models. This, however, comes at the price of reduced interpretability, since the relationship between observed trends and the original variables is modelled in a more complex way than by linear combinations. All in all, linear model or not, MVDA describes the underlying trends in complex, multivariate data and shows that the whole is greater than the sum of its parts.
2.3.3. Predictive Modelling
Whereas DOE serves to generate meaningful data and MVDA to create new knowledge from data, predictive modelling serves to put that knowledge to use. To a large extent, predictive modelling in chemometrics means the application of machine learning algorithms, which was covered in more depth in section 2.2. While linear latent variable models based on PLS are the go-to methods in many chemometric applications, ANN-based methods have been adopted in recent years [10, 11]. Many chemometric prediction problems fall under multivariate calibration [62], which aims to predict properties of interest for a chemical system based on multivariate measurements, such as infrared, Raman or nuclear magnetic resonance spectroscopy. Another chemometric application where predictive models are used is so-called quantitative structure–activity relationship (QSAR) modelling, where biological activity is modelled as a function of chemical structure [75].
In any application where predictive modelling is used, not only chemometrics, there are certain pitfalls to avoid, a major one being how to detect and avoid overfitting. Any predictive model needs to balance model complexity, the number of parameters in equation 2.1, to find a model that is expressive enough to solve the problem efficiently but not so expressive that it starts to model spurious variation. An underfitted model fails to learn the underlying function, for instance a straight line modelling a curved relationship, whereas an overfitted model is built on spurious variation not related to the underlying relationship and fails to generalize to new data. To monitor for overfitting, different types of cross-validation are common in all fields, meaning that the model is fitted on one part of the dataset and evaluated on another. For small datasets, so-called k-fold cross-validation is widely used, where several models are fitted and evaluated on different subsets of the complete dataset in a round-robin fashion and the average performance is reported. For large datasets it is far more common to use a fixed split into training and one or more validation sets. How to split the data is highly application-dependent in order to ensure realistic performance evaluation. In time series, for instance, observations very close in time are often highly correlated and easy to predict from their neighbours, meaning that splitting observations randomly will likely give overly optimistic performance evaluations. In such situations, it is better to split at a certain time point and use the earlier period for training and the later period for evaluation, or, in the case of batch-wise time series, to hold out complete batches for validation. Which metrics to use to evaluate models is problem-dependent and there is a myriad of options available that will not be listed here.
The metrics may be given by loss functions (section 2.2.1) or by other metrics suitable for the problem at hand, such as goodness-of-fit/prediction (R²/Q²) for regression or the percentage of correct predictions (accuracy) for classification.
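The round-robin k-fold splitting described above can be sketched as follows (a simple contiguous split; any remainder observations stay in the training sets):

```python
def k_fold_splits(n, k):
    """Yield (train, validation) index lists for k-fold cross-validation:
    each fold is held out once while the rest is used for training."""
    fold_size = n // k
    indices = list(range(n))
    for i in range(k):
        val = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, val

splits = list(k_fold_splits(10, 5))
print(len(splits))   # 5 folds
print(splits[0][1])  # [0, 1] held out in the first fold
```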
When a cross-validation strategy has been selected for the problem at hand, there are several ways to control model expressivity. Firstly, the number of parameters determines the complexity of the relationships the model can approximate. By adding polynomial features to an MLR model, more components to a PLS model, more decision trees to a random forest or more layers to a deep ANN, the approximation becomes more complex. By monitoring the difference in predictive performance between the training and validation sets, we can find the complexity sweet spot. Another widely used approach is regularization, where model complexity is penalized as part of model fitting. For example, in MLR models a symptom of overfitting is that parameters grow in magnitude. The model can then be regularized by adding a penalty on large weights to its loss function, for instance on their squared size (ridge regression [76]) or on their absolute size (LASSO [77]). How to regularize depends on the model used; in deep learning, for instance, so-called dropout is simple and popular, and works by ignoring random sets of nodes for each update during training [78].
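A sketch of ridge regression's shrinkage effect, using its closed-form solution (synthetic data; the λ values are arbitrary):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge regression: MSE loss plus an L2 penalty lam * ||theta||^2
    has the closed-form solution below, shrinking the parameters."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + 0.1 * rng.normal(size=30)

theta_ols = ridge(X, y, lam=0.0)    # ordinary least squares
theta_reg = ridge(X, y, lam=10.0)   # penalized fit
# The penalty shrinks the overall parameter magnitude:
print(np.linalg.norm(theta_reg) < np.linalg.norm(theta_ols))  # True
```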
To summarize, machine learning models are already heavily used in chemometrics for predictive modelling. Limited to neither machine learning nor chemometrics, there are certain challenges to overcome for successful prediction. Regardless of what problem is studied, predictive models should be appropriately validated, but exactly what is appropriate depends on the problem at hand.
2.3.4. Model Diagnostics
Given any predictive model, a key question to ask is how much we can trust it. This is important for all problem domains where predictive models are used and is by no means unique to chemometrics. However, chemometrics has a long tradition of using quantitative metrics to ensure that the model is applicable to the new observations it is intended to be used on [62]. Note that quantifying to what extent an individual prediction can be trusted is complementary to how well predictions can be trusted on average according to the model validation described above.
There are two main reasons why predictions cannot be trusted: outliers and distribution shift. Outliers are rare observations that differ significantly from expected data. Outlier detection, sometimes called anomaly detection, uses different types of metrics to detect when an observation is different enough and flag it for further action [79]. Examples of how to perform outlier detection include quantifying the density between observations in observation space and detecting outliers as observations in unexpectedly low-density regions [80], or using the size of the reconstruction residual of the observation under a model of the training set observations [62]. The latter is particularly common in chemometrics, where the training dataset is typically modelled using linear latent variable methods. For instance, when using PLS, an observation is approximated as $x \approx tW^T$, where $t = xW$ (simplified by ignoring calculation of the loadings of the non-deflated original variables). The reconstruction residual sum-of-squares (RSS):

$$\mathrm{RSS} = \sum_{i=1}^{p} \left( x_i - (tW^T)_i \right)^2 \qquad (2.12)$$

then provides a straightforward metric of how distant the observation is from the hyperplane spanned by the columns of $W$ [81].
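Equation 2.12 can be sketched numerically. The toy example below (our own construction, using an orthonormal PCA-style basis in place of the PLS weights $W$) builds training data near a low-dimensional hyperplane, derives an RSS cutoff from the training distribution, and flags an observation that lies far off the hyperplane:

```python
import numpy as np

rng = np.random.default_rng(0)
# Training data living near a 2-dimensional hyperplane in 10-d space.
scores = rng.standard_normal((200, 2))
basis = np.linalg.qr(rng.standard_normal((10, 2)))[0]  # orthonormal W (10 x 2)
X_train = scores @ basis.T + 0.05 * rng.standard_normal((200, 10))

mean = X_train.mean(axis=0)

def rss(X, W, mean):
    """Reconstruction residual sum-of-squares per observation:
    project onto the hyperplane spanned by W, reconstruct, sum squared residuals."""
    Xc = X - mean
    T = Xc @ W            # scores t = xW
    resid = Xc - T @ W.T  # residual x - tW'
    return (resid ** 2).sum(axis=1)

# Cutoff derived from the training distribution, e.g. its 99th percentile.
cutoff = np.percentile(rss(X_train, basis, mean), 99)

x_in = scores[0] @ basis.T   # lies on the modelled hyperplane
x_out = x_in + 5.0           # shifted far off the hyperplane
print(rss(x_in[None], basis, mean)[0] < cutoff)   # True
print(rss(x_out[None], basis, mean)[0] > cutoff)  # True
```

The percentile-based cutoff is one common convention; in practice F-distribution-based limits on the residuals are also used in chemometric software.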
Distribution shift, or dataset shift, on the other hand, refers to when the underlying probability distribution that observations are drawn from changes over time, which is common in real-world systems. When the underlying data change, the model becomes obsolete and can no longer be trusted. To detect when distribution shifts occur, statistics of observations encountered after training time can be recorded and compared to the training set distribution, for instance by simply performing two-sample tests of the null hypothesis that the prediction-time and training set distributions are equal [82].
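As a minimal sketch of such a two-sample test, the following (an illustration with synthetic data, not the method of reference [82]) applies the Kolmogorov-Smirnov test to one recorded feature, once for a batch drawn from the training distribution and once for a batch whose mean has drifted:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=1000)

# Prediction-time batch from the same distribution: the null should survive.
same = rng.normal(loc=0.0, scale=1.0, size=500)
# Batch whose mean has drifted by half a standard deviation: null rejected.
shifted = rng.normal(loc=0.5, scale=1.0, size=500)

p_same = ks_2samp(train_feature, same).pvalue
p_shift = ks_2samp(train_feature, shifted).pvalue
print(f"no shift: p = {p_same:.3f}, shift: p = {p_shift:.3g}")
```

In a monitoring setting this test would be run per feature on each incoming batch, with a multiple-testing correction when many features are monitored simultaneously.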
3 | Results
This chapter summarizes the contributions of this thesis and will discuss the findings of each paper in relation to the aims formulated in section 1.1.
Jumping straight to the conclusion of this thesis: there is plenty of room for synergies between chemometrics and machine learning. Using machine learning algorithms to fit predictive models is commonplace in chemometrics and has been so for many years. Historically, and still today, linear models have been used extensively with great success. These linear models have mostly centred around latent variable methods such as PCA and PLS and their variants. Other methods, including ANNs, are increasingly used as data quantities and expectations on predictive performance grow. Even though predictive performance is relevant to study, this thesis describes ways to nuance the relationship between the fields of machine learning and chemometrics beyond predictive modelling, and shows that knowledge in either field can be used to benefit the other.
As described in section 2.3, we refer to chemometrics as a four-step process of data generation, data analysis, predictive modelling and model diagnostics (figure 2.6), and to machine learning as the applied workflow of fitting performant predictive models given raw data (figure 2.2). The papers included in this thesis then serve as examples of how each field may improve the other in aspects beyond providing yet another machine learning algorithm (summarized in figure 3.1). The included examples by no means provide an exhaustive list of the possibilities but will hopefully serve as inspiration for future work.
We start by exploring how MVDA, combined with feature engineering techniques borrowed from machine learning, enables us to study chemistry from a new dimension, namely in the form of text. Few would dispute that text provides a source of chemical, and any other type of, knowledge, but systematic analysis of large text collections is rarely seen as a task for chemometrics. In Paper I we demonstrate how the combination of structured MVDA and interpretable feature extraction allows us to quickly gain new knowledge on trends and clusters from collections of hundreds to thousands of text documents.
We then turn to the problem of optimizing large-scale data processing pipelines, common in the computational sciences. We know that how we configure each step in the pipeline influences the end results, but depending on the complexity, finding the best configuration may be challenging. Originally developed for manual experimentation, DOE provides tools to find optimal solutions in multi-factor problems. In Paper II we develop an algorithm implementing DOE to optimize multi-step data pipelines for genomic data processing.
[Figure 3.1: diagram aligning the machine learning workflow (raw data → data preprocessing → training/evaluation data → model training and evaluation → model deployment) with the chemometrics workflow (data generation → data analysis → predictive modelling → model diagnostics), annotated with where Papers I-IV contribute.]
Figure 3.1: Illustration of how each paper in this thesis contributes to nuancing the relationship between the fields of machine learning and chemometrics, which today revolves around using machine learning algorithms for predictive modelling. Paper I shows how data preprocessing common in machine learning can broaden the applicability of MVDA. Paper II then illustrates how DOE methods can be used to optimize data pipelines. Paper III draws inspiration from MVDA to provide a new method to monitor ANNs. Finally, Paper IV applies DOE principles to assure quality in large-scale datasets.
An important aspect of the linear latent variable methods prominent in chemometrics is that they provide a model of expected variation based on training data. Comparing new observations to these models allows us to easily determine whether they fall outside the expected variation and should be viewed as outliers. By definition, we lack evidence that the model generalizes to outlier observations, meaning that outlier detection gives us a clear indication of whether we should trust the prediction at all. Inspired by this property of linear latent variable methods, we describe a new method in Paper III that gives ANNs an equivalent capability.
Lastly, we turn to the core of modern machine learning, namely collection, annotation and quality control of large-scale datasets. Collecting and annotating a large dataset of images is complex and expensive, both economically and in terms of labour. The pragmatism and transparency of chemometrics have made it widely used for industrial process quality work. In Paper IV we present a case study in how we can design large-scale image datasets to ensure quality from the start.
3.1. Summary of Paper Contributions
Table 3.1 provides a high-level overview of my contributions to each paper. All work was performed under the supervision of Prof. Johan Trygg and co-supervisor Dr. Olivier Cloarec, who acted as advisors and provided feedback on the research described in this thesis.
Table 3.1: Summary of my contributions for each paper included in this thesis.
Paper I: i) Formulated research question together with co-authors; ii) Collected and curated data; iii) Implemented algorithms and ran all experiments; iv) Wrote the paper.

Paper II: i) Implemented an existing DOE-based algorithm; ii) Extended the algorithm with new functionality for screening based on generalized subset designs and automatic regression model selection; iii) Wrote the algorithm description in the paper.

Paper III: i) Formulated research question together with supervisor; ii) Developed the algorithm; iii) Implemented algorithm and ran all experiments; iv) Wrote the paper.

Paper IV: i) Formulated research question; ii) Collected and curated data together with co-authors; iii) Developed processes for annotation, including DOE, and quality assurance together with co-authors; iv) Oversaw and provided feedback on all experiments; v) Performed MVDA on resulting data; vi) Designed and implemented benchmarks together with co-authors; vii) Wrote the paper together with co-authors.
3.2. Paper I: MVDA for Patent Analytics
Patents, which are sometimes overlooked in science, provide a rich source of technological and scientific knowledge, since much information is disclosed only in patents. Analyzing patent collections provides information on emerging technological trends, prior art for assessment of new innovations, technical inspiration for problem solving in R&D, the legal landscape of the market of a new field for a company, technology gaps for national policy makers, and much more. Due to the quickly growing number of patents available worldwide, the task of finding valuable information is growing increasingly challenging. Today, patent offices receive hundreds of thousands of applications per year [83], making algorithms and data mining methods for quickly analyzing vast patent collections crucial. Machine learning algorithms are increasingly used to aid patent analysis by various means, such as technology categorization, prediction of future value and so on [84, 85]. Due to the business-critical nature of patent analytics, fully automated models only partially relieve the challenges, since they are never 100 % correct. Since investigating patent content is commonly done by legally trained experts, there is a need for data-driven tools that accelerate their work, even though full automation will be difficult in the short term.
Aim Text analysis, machine learning and data analysis are all widely used in patent analytics. However, they are often limited to machine learning-based predictive models or simple univariate analysis of patent metadata. On the other hand, the structured approach to analyzing complex datasets provided by MVDA has proven insightful in many other domains, but it has not yet been fully used in text-based analysis due to disparate research communities. This study aimed to use common text preprocessing methods borrowed from the machine learning community and analyze patent collections using a structured MVDA approach to explore its value for patent analytics.
Results We propose a workflow based on MVDA applied to a data matrix formed by preprocessing a collection of patent documents into bags-of-words subject to noun filtering and term frequency-inverse document frequency (tf-idf) transformation [86]. We demonstrate how the workflow can be used as a tool to interrogate large collections of more than ten thousand documents (Paper I Figure 1), as well as a tool for more targeted analysis in smaller collections (Paper I Figures 2-4). Thanks to the interpretability of variable influence provided by linear latent variable loadings, we can easily draw conclusions on the general technology trends observed in the interrogated collections.
Discussion Neither preprocessing patents into bags-of-words nor applying machine learning to patent analysis is novel in itself. What our workflow demonstrates is that data-driven models, such as PCA, are useful for much more than predictions, even in the case of large patent collections. The structured approach to working with high-dimensional datasets provided by MVDA lends itself well as a tool to support domain experts, even for non-chemical use cases. Additionally, the bag-of-words representation is itself highly interpretable when combined with loading-based model interpretation, where strong loadings are directly interpreted as frequent use of the words in question.
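The core of the workflow (bag-of-words, tf-idf weighting, PCA, loading inspection) can be illustrated on a toy corpus. This sketch is our own simplification: a four-document "collection", a hand-rolled tf-idf and an SVD-based PCA, with no noun filtering:

```python
import numpy as np

# Toy "patent" corpus; the real collections hold thousands of documents.
docs = [
    "battery electrode lithium cell",
    "lithium battery anode material",
    "neural network training model",
    "deep neural network model data",
]
vocab = sorted({w for d in docs for w in d.split()})
tf = np.array([[d.split().count(w) for w in vocab] for d in docs], float)

# tf-idf: term frequency weighted by inverse document frequency.
df = (tf > 0).sum(axis=0)
tfidf = tf * np.log(len(docs) / df)

# PCA via SVD on the centered matrix; the right singular vectors are loadings.
Xc = tfidf - tfidf.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * S    # document coordinates in latent space
loadings = Vt.T   # one weight per word and component

# Words with the strongest |loading| on component 1 drive the main contrast,
# here separating the battery documents from the neural-network documents.
top = [vocab[i] for i in np.argsort(-np.abs(loadings[:, 0]))[:3]]
print(top)
```

Interpretation then proceeds as in the paper: score plots cluster documents, and strong loadings point directly at the frequent words behind each cluster.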
Figure 3.2: Schematic illustration of the MVDA-based workflow for patent text analytics proposed in Paper I.
For predictive modelling, the models we use are far from state-of-the-art, in patent analytics as well as in text analytics in general. In most text-based prediction problems, large ANNs based on the transformer architecture and pretrained for language modelling on massive text datasets [16] are considered state-of-the-art. Although offering a certain degree of interpretability [87], these types of models are still far from offering the intuitive and straightforward explanations of what they see and why that linear latent variable methods provide. With that said, the development of interpretation and explainability methods for complex nonlinear models is making great progress [88, 89], and there is great interest in "opening the black box" of complex models. Since data are growing rapidly in both quantity and complexity, within chemometrics and virtually all other fields, more complex data-driven models are needed to capture said complexity. We will, however, need to take care not to sacrifice the interpretability of today's methods.
Future perspective As we demonstrated in a case study in patent analytics, MVDA as embraced within chemometrics is useful beyond chemical applications. By carefully choosing preprocessing and representations, even unstructured data, such as text, can benefit from structured and interpretable workflows. Chemometric approaches are already well established for other complex data, such as omics data [90], and by borrowing preprocessing tools from the machine learning community they may well prove useful beyond chemical applications.
3.3. Paper II: DOE for Data Processing Pipelines
Bioinformatics software often requires configuration of a large number of parameters for the software to work properly. Parameters may be quantitative as well as qualitative and are often non-obvious in how they relate to the end results, making configuration far from trivial. Considering that software tools are often used in multi-step pipelines, these difficulties are further exacerbated. Making it even more challenging, certain types of instruments, such as DNA sequencing platforms [91], show subtle differences in the underlying data depending on which platform is used, which need to be taken into account.
Aim Often, parameter settings are selected based on personal or peer experience obtained in a trial-and-error fashion, or by simply retaining the default values, risking suboptimal results. Trying all possible configurations, so-called brute-force search, quickly becomes infeasible. A more efficient alternative, selecting a limited number of combinations at random in a so-called random search [92], is common within machine learning but leaves the results to chance. DOE offers a systematic and efficient approach to experimentation in general, and this study aims to explore how DOE can be used to select parameter configurations for complex pipelines.
Results We present a pipeline optimization algorithm based on DOE to optimize bioinformatic data processing pipelines (Paper II Figure 1), supporting quantitative and qualitative factors as well as multiple-response optimization. The algorithm is divided into two phases. Firstly, generalized subset designs [64] are used to span a large configuration parameter space, find an approximate optimum and select preferred settings for qualitative factors. Secondly, response surface designs, ordinary least squares regression and numeric optimization are used to fine-tune quantitative factors around the approximate optimum. We demonstrated the algorithm in four bioinformatic case studies (Paper II Tables 1-5) and compared it to brute-force search for validation.
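The two-phase idea can be sketched in miniature. The following is a deliberate simplification of the paper's algorithm, not its implementation: a plain factorial screen stands in for the generalized subset design, a quadratic polynomial fit stands in for the response-surface phase, and `pipeline_score`, the factor names and their levels are all hypothetical:

```python
import itertools
import numpy as np

def pipeline_score(aligner, threshold):
    """Stand-in for running the pipeline; imagine a quality metric to maximize."""
    base = {"bwa": 0.80, "bowtie": 0.70}[aligner]
    return base - (threshold - 0.3) ** 2  # peak at threshold = 0.3

# Phase 1: coarse factorial screen over a qualitative and a quantitative factor.
aligners = ["bwa", "bowtie"]
thresholds = [0.1, 0.5, 0.9]
screen = [(a, t, pipeline_score(a, t))
          for a, t in itertools.product(aligners, thresholds)]
best_aligner, t0, _ = max(screen, key=lambda r: r[2])

# Phase 2: quadratic response-surface fit around the screening optimum,
# holding the chosen qualitative setting fixed.
ts = np.linspace(max(t0 - 0.3, 0.0), min(t0 + 0.3, 1.0), 5)
ys = [pipeline_score(best_aligner, t) for t in ts]
c2, c1, c0 = np.polyfit(ts, ys, 2)
t_opt = -c1 / (2 * c2)  # vertex of the fitted parabola

print(best_aligner, round(t_opt, 3))
```

Even this toy version shows the appeal: the screen fixes the qualitative factor with six runs instead of an exhaustive grid, and the second phase locates the quantitative optimum from a handful of runs via the fitted model rather than further brute force.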
Discussion Principled search for optimal factor combinations is a common problem in many disciplines. The use of DOE in chemistry is well proven to reduce the cost of experimentation, in both time and labour, when optimizing complex processes. We demonstrate that this transfers well to reducing the cost of large-scale computational pipelines. Unfortunately, the software implementation was originally developed for in-house use cases and published later, which made it not directly compatible with standard pipeline-running tools, such as Snakemake [93], limiting its ease of use.
Future perspective With ever-growing computational pipelines, the problems that we face in bioinformatics today will become even more prevalent. The efficiency provided by DOE lends itself naturally to such problems, something we have already seen in applied machine learning [94]. Promoting and developing open-source and easy-to-use software packages would help establish DOE as a solution to a common problem for computational scientists.
3.4. Paper III: Outlier Detection in ANNs
Thanks to the powerful transformations learned by deep ANNs, deep learning is increasingly applied throughout society and impacts many aspects of modern life. Due to their strong predictive performance in many tasks, ANNs are increasingly used in safety-critical systems as well, with vision systems for autonomous cars as a well-known example. However, all machine learning models have undefined behaviour when input data differ from training data, which means that there are no guarantees of how they will behave for outliers.
Adopting deep learning in increasingly complex and possibly safety-critical systems makes it crucial to know not only whether the model's predictions are accurate, but also whether the model should predict at all for a given observation. Equipping our models with the ability to detect outliers, also called out-of-distribution (OOD) examples, would allow us to implement systems that can fall back to safe behaviour when encountering outliers, which may minimize the negative consequences of faulty predictions. A tragic real-life example of the possible consequences is the fatal collision between a pedestrian and an autonomous vehicle in March 2018 [95]. In the moments before the crash, the car's vision system erratically classified the pedestrian as different object classes moving at different speeds in different frames, a possible symptom of OOD examples. To avoid similar accidents in the future, it is important to understand the limits of our models' learned representations in order to detect when observations are not recognized, so that autonomous decision making based on deep learning can be improved.
Aim In chemometrics, the linear latent variable models commonly used for predictive modelling allow for outlier detection in a straightforward manner by, for instance, monitoring the reconstruction residual sum of squares (RSS) (see equation 2.12). An observation with unexpectedly large RSS is deemed "too far" from the model of the training set. If the observation is too far away, then we have no evidence that we can trust the model's prediction for that observation. This study set out to explore what options are available to let deep ANNs detect OOD observations and to develop a new method to fill in the gaps.
Results We view ANN activations as a multivariate description of an observation and draw inspiration from MVDA to develop a new metric for outlier detection. We propose DIstance to Modelled Embedding (DIME) (Paper III Figure 1 & Equation 4), which provides an analogue to RSS-based outlier detection in linear latent variable models. DIME works by approximating the ANN activations using a high-rank PCA model and reporting the RSS of new observations' activations relative to the PCA approximation. We demonstrate that DIME is able to capture complex cases of outliers in computer vision and natural language processing (Paper III Figures 2 & 3). Compared to state-of-the-art methods, DIME provides both high detection performance and low computational overhead.
In addition to simply working well, DIME also has relatively few configuration parameters, so-called hyperparameters, to tune. The two main parameters are which layer's activations to use and the PCA rank. In terms of which layer to use, we have found it to correspond well to what deep learning practitioners would use for transfer learning (for instance, a late but not penultimate layer in a CNN). Regarding PCA rank, we found that, in contrast to most chemometric applications where we typically limit the analysis to relatively few principal components, DIME did not work well until the explained variance was very high (around 99 %) (Paper III Figure 4).
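The mechanics of the approach can be sketched as follows. This is a minimal illustration of the idea, not the Paper III implementation: random matrices stand in for network activations, and the helper names are our own.

```python
import numpy as np

def fit_dime(train_activations, var_explained=0.99):
    """Fit a PCA model of training-set activations; keep enough
    components to reach the requested explained variance."""
    mean = train_activations.mean(axis=0)
    Xc = train_activations - mean
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = np.cumsum(S ** 2) / np.sum(S ** 2)
    k = int(np.searchsorted(ratio, var_explained)) + 1
    return mean, Vt[:k].T  # (mean, loadings W)

def dime(activations, mean, W):
    """Distance to the modelled embedding: RSS of each activation vector
    relative to its reconstruction on the PCA hyperplane."""
    Xc = activations - mean
    resid = Xc - (Xc @ W) @ W.T
    return (resid ** 2).sum(axis=1)

rng = np.random.default_rng(0)
# Stand-in for training-set activations: a low-rank 64-d embedding.
train = rng.standard_normal((500, 8)) @ rng.standard_normal((8, 64))
mean, W = fit_dime(train)
cutoff = np.percentile(dime(train, mean, W), 99.9)

outlier = rng.standard_normal((1, 64)) * 5  # not confined to the embedding
print(dime(outlier, mean, W)[0] > cutoff)  # True
```

At prediction time the only extra cost per observation is one projection and one reconstruction, which is what keeps the computational overhead low.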
Discussion DIME is based on a very simple idea, namely that the intermediate representations of a deep ANN provide a useful, if abstract, representation of any data, which we then simply approximate using a hyperplane. To our surprise, we found no previous methods working along these principles, but quickly found the approach to work well. DIME thus serves as a prime example of how an approach long used within chemometrics can inspire a solution to an important problem within machine learning.
Outlier detection is an important but somewhat understudied field within deep learning. A common perception among deep learning practitioners is that the notion of outliers should be encoded into deep ANNs and easily caught either by inspecting the model's confidence [96] or by adding an extra "catch-all" output. DIME, alongside other methods [97], demonstrates that reality is not so simple and that dedicated outlier detection is necessary for a more complete solution.
In addition to the technical perspective, this study also illustrates an honest attempt to do good science in machine learning. In the past year, it has come to light that poor use of basic statistics and substandard baselines are widespread when comparing machine learning methods [98, 99]. These practices make it easy to present any new method as outperforming the prior alternatives, whereas the difference may be due to chance, or non-existent had the baseline been properly tuned. When evaluating DIME, we tuned the baseline methods for each experiment and only used their best values to avoid crippling the baselines. We also performed a meta-analysis of all our experimental results, finding that DIME performed slightly better than all alternatives on average, although the difference was not always statistically significant (Paper III Table 1). We believe that this approach offers a fairer comparison and a more realistic portrayal of the performance of our method, and we hope to set an example of how experimentation in machine learning can be improved.
Future perspective Although our experiments show convincing results, outlier detection is not yet a solved problem. Our experiments are limited to relatively small scales, roughly equivalent to what most practitioners would use for transfer learning, as are most benchmarks in the field [100]. Datasets are growing larger and problems more complex, making outlier detection ever more important while at the same time more difficult to study. We have already seen deep ANNs with hundreds of billions of parameters trained on petabytes of data [101], making the gap between the state-of-the-art in model training and the state-of-the-art in outlier detection incomprehensibly large. Still, we conclude that as data-driven models become ever more present throughout society, they need a basic degree of common sense: the ability to admit that they are sometimes too uncertain to be trustworthy.
3.4.1. Additional Case Studies
When developing DIME, we performed several experiments that did not end up in Paper III due to space constraints. Some of these case studies are summarized here to demonstrate its applicability further.
Predictive Maintenance Predictive maintenance techniques aim to automatically determine when a piece of equipment requires maintenance to avoid breakage. Automatically estimating the degradation of the equipment allows for significant cost savings compared to routine or time-based maintenance. When following a fixed schedule, there is a risk that maintenance is performed earlier than necessary, leading to excess expenses. There is also a risk that the equipment fails earlier than expected, which may lead to catastrophic failure or process stops. To demonstrate the applicability of DIME in predictive maintenance, we use the Turbofan Engine Degradation dataset provided by the Prognostics CoE at NASA Ames [102]. The dataset consists of simulated turbofan engines running to failure under different operational conditions and fault modes. The engines were monitored over time using 21 sensors, and 3 control settings were recorded, for 100 engines in the training dataset and 100 in the test dataset. The challenge is to predict failure 15 sensor cycles before it happens to avoid catastrophic failure.
We trained a three-layer RNN based on Long Short-Term Memory (LSTM) blocks [53] to predict failure ahead of time. To train the LSTM model, we used sliding windows that were 50 cycles long as input and a binary response indicating whether the last cycle of the sliding window is within 15 cycles of failure. We used the 21 sensor outputs, the 3 control settings and the current cycle since the beginning as variables for prediction. We scaled all variables to the range 0-1, and used the scaling parameters from the training data to scale the test data. To monitor training progression, we used 10 % of the training dataset for validation. We trained the model until the binary cross-entropy for the validation set stopped improving (eight epochs) using the Adam optimizer [103]. The resulting model achieved a test set F1-score of 94.1 % at sliding window classification.
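The window-and-label construction described above can be sketched for a single engine. This is an illustrative reimplementation on random data, not the code used in the study, and the exact labelling convention at the boundary is our own assumption:

```python
import numpy as np

def make_windows(sensor_series, failure_cycle, window=50, horizon=15):
    """Build sliding windows over one engine's multivariate series and a
    binary label: does the window end within `horizon` cycles of failure?"""
    X, y = [], []
    for end in range(window, len(sensor_series) + 1):
        X.append(sensor_series[end - window:end])
        y.append(int(failure_cycle - end < horizon))
    return np.array(X), np.array(y)

# One toy engine: 120 cycles, 25 variables (21 sensors + 3 settings + cycle).
rng = np.random.default_rng(0)
series = rng.standard_normal((120, 25))
X, y = make_windows(series, failure_cycle=120)

print(X.shape)        # (71, 50, 25): 71 windows of 50 cycles x 25 variables
print(y[:3], y[-3:])  # early windows labelled 0, windows near failure labelled 1
```

Stacking such windows over all 100 training engines gives the (samples, timesteps, features) tensor an LSTM expects, with the label attached to each window's final cycle.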
To simulate prediction-time outliers where a sensor stops working, we randomly chose half of the test set engines to serve as outliers. For each of these engines, we selected one sensor at random and set its output to zero starting at the middle of the full time sequence. To provide features for DIME, we collected the activations from the first two LSTM layers of the predictive maintenance model and averaged them across time to summarize each sliding window. For the last LSTM layer, we simply used the activations of the last cycle in the time sequence. The activations from these three layers were then concatenated into a single matrix.
We used DIME for outlier detection, with a 100-component PCA model fitted on all the training data and the 99.9th percentile of training set DIME as cutoff for outliers. We then predicted failure status and calculated DIME for the outlier and remaining test set LSTM features for all sliding windows, successfully distinguishing between outliers and normal variation (precision = recall = 1 for this experiment). Following sensor breakage, the LSTM model predicts that many of the engines are going to break, but DIME captures all cases as outliers. This experiment shows that by using DIME we are able to distinguish between predictions that can be trusted and those that should not be. It also shows that simply inspecting model confidence is not always reliable; in this case the predictions behaved erratically with very high confidence (figure 3.4 d). Finally, it shows that DIME is applicable not only to feedforward networks, but to recurrent networks as well.

Figure 3.4: Qualitative test set results of outlier detection in the predictive maintenance case study. The left-hand side (a, c & e) shows the test set without sensor breakage aligned at the middle time point, and the right-hand side (b, d & f) shows sequences where one random sensor breaks, aligned at sensor breakage (the vertical dashed line indicates when the sensor broke). The top row (a & b) shows DIME (denoted ResSS here) for outlier detection; the horizontal dashed line indicates the 99.9th percentile based on training set distances. The middle row (c & d) shows the model's prediction of whether the turbine is going to break within 15 cycles, where 1 means that the engine is predicted to fail. The bottom row (e & f) shows the labels indicating whether the turbine is going to break within 15 cycles, where 1 means that the engine is going to fail.
To explore how the number of components and the residual sum-of-squares influence the results, we varied the number of components from 1 to 150 and reported the precision and recall for outlier detection (see figure 3.3). Components with very small explained variation are required to achieve high precision and recall (99.99 % cumulative R² is reached after only 36 components). This implies that the variation that differs between outliers and normal test set sequences consists of small nuances that require very small PCA components to be captured. We also observe overfitting, with reduced precision, when the number of components becomes too large, confirming results from other experiments that the PCA rank should be high but not full. Similar trends are observed for other DIME cutoffs (not shown).

Figure 3.3: DIME precision and recall for classifying outliers in the predictive maintenance case study.
Speech Commands To show how outlier detection can be used to distinguish between short speech commands and background sounds, we use the Speech Commands dataset [104]. The Speech Commands dataset consists of 105 000 short utterances of 35 words recorded at a 16 kHz sampling rate. The words include the digits zero to nine and the command words yes, no, up, down, left, right, on, off, stop, go, backward, forward, follow, learn and visual. The dataset also includes a set of arbitrary words: bed, bird, cat, dog, happy, house, marvin, sheila, tree and wow. In addition to the short utterances, the dataset includes background noise: white and pink noise, noise from an exercise bike, a man sounding like a cat, the sound of a running tap and the sound of a person doing the dishes. In our experiment, we train a classification model on all utterances and use the background noise as OOD examples.
We transform the utterances, and the background noise, into 64 Mel-Frequency Cepstral Coefficients [105], using a frame length of 2048 samples and a frame stride of 512 samples. We then trained a CNN classifier on all 35 words. The CNN consisted of three convolutional blocks with a Conv-Conv-2×2 MaxPool-SpatialDropout structure. The convolutional layers of each block had 32, 64 and 128 filters respectively, kernel size 3×3 and ReLU activation. All spatial dropout layers used 20 % dropout. The input was batch normalized before the first convolutional block. After the last convolutional block, a final convolutional classifier layer using softmax activation was pooled using global average pooling. The model was trained for 20 epochs using stochastic gradient descent with 0.9 momentum. The initial learning rate was 0.1, which was reduced to 0.01 after 10 epochs and further reduced to 0.001 after another 5 epochs, achieving a top-1 test set accuracy of 86 % across 35 classes.
To detect OOD examples, we compared DIME to prediction confidence, i.e. the maximal predicted probability p_max [96], and to Monte Carlo (MC) Dropout entropy of predictions [106]. For DIME we used the globally average pooled output from the last convolutional layer before the last spatial dropout, and a PCA model explaining 99.9 % of the variance of the embedding variation. For MC-Dropout we used the Shannon entropy of predictions from 30 Monte Carlo samples. Since the datasets are unbalanced (around 60 outlier examples compared to thousands of test set examples), the results were compared using areas under Precision-Recall curves (PR-AUC).

Table 3.2: Area under Precision-Recall curves (PR-AUC) from detection of OOD examples on the Speech Commands experiment.

Noise type      p_max   DIME    Entropy
White Noise     0.964   0.995   0.978
Pink Noise      0.955   0.997   0.964
Running Tap     0.036   0.998   0.175
Dishwasher      0.621   0.994   0.692
Cat Sounds      0.798   0.996   0.849
Exercise Bike   0.816   0.995   0.947
DIME was able to distinguish all types of background noise reliably (see results in Table 3.2). MC-dropout entropy and the baseline approach, p_max, detected four of the six background noises successfully (PR-AUC > 0.8). Interestingly, the model confidently predicted noise from a running tap as the word six or house, which may be because the trailing sibilant s-sounds resemble a running tap. Beyond showing that DIME performs well in this example, this illustrates the importance of dedicated outlier detection methods compared to relying only on a model's predicted confidence.
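The evaluation metric used above can be made concrete with a small pure-Python sketch (an illustration, not the evaluation code used in the paper): given outlier scores where higher means more likely OOD, the precision-recall curve is traced by sweeping a threshold down the sorted scores and its area accumulated with the rectangle rule over recall.

```python
def pr_auc(scores, labels):
    """Area under the precision-recall curve, rectangle rule over recall.

    scores: outlier scores (higher = more likely OOD)
    labels: 1 for true OOD examples, 0 for in-distribution examples
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    tp = fp = 0
    auc, prev_recall = 0.0, 0.0
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / n_pos
        auc += precision * (recall - prev_recall)
        prev_recall = recall
    return auc

# Toy imbalanced example: 2 outliers among 8 test points,
# both ranked above every inlier -> perfect PR-AUC of 1.0.
scores = [0.9, 0.8, 0.3, 0.2, 0.15, 0.1, 0.05, 0.01]
labels = [1,   1,   0,   0,   0,    0,   0,    0]
print(pr_auc(scores, labels))  # 1.0
```

Unlike ROC-AUC, this quantity is not inflated by the large number of in-distribution examples, which is why it suits the heavily unbalanced outlier-detection setting here.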
3.5. Paper IV: Quality Assurance in Large Data Projects
Light microscopic images provide a cheap and accessible data source to study biological phenomena in a non-invasive manner. When acquired in a high-throughput manner and combined with well-established protocols for two-dimensional cell cultures, 2D microscopy allows detailed quantification of cell count, morphology, proliferation, migration and cellular interactions, making it an attractive tool for studying cell biology. To gain insight into the behaviour of individual cells, segmentation algorithms are necessary to detect and segment individual cells directly from images. Due to the low contrast and high object density of microscopic images, cell segmentation is an incredibly challenging computer vision problem. In the past decade, convolutional neural network (CNN)-based segmentation models have far outperformed their non-deep-learning counterparts. However, deep CNNs require vast annotated datasets, and no suitable resource is available for label-free cellular imaging.
Aim: This study set out to address the lack of large-scale resources for training deep learning-based cell segmentation models on label-free 2D microscopic images. Annotating images for cell segmentation is complex and extraordinarily labour-intensive, precisely because no algorithms exist to do the job for us. To provide the labour, large teams of non-expert annotators are used, which introduces a whole other dimension of complexity when it comes to guaranteeing data quality. For the purposes of this thesis, we focus on how to plan and assure quality in large-scale, complex image annotation projects.
Results: We present LIVECell, a high-quality microscopic image dataset that is the largest of its kind to date, comprising over 1.6 million manually outlined cells from a diverse set of cell morphologies and culture densities. To create LIVECell, we selected a morphologically diverse set of cell types that pose different challenges for cell segmentation (Paper IV Table 1) and prepared 5240 images balanced over cell types and cell culture densities. Annotating microscopic images is challenging, and the annotator team is expected to learn the task over time, meaning that there is a risk of introducing time-dependent bias in annotation quality unless this is taken into account during planning. To avoid this risk, we divided the images into eight balanced batches using generalized subset designs (GSD) with cell type, timestamp and well as factors, and uploaded the batches sequentially for annotation (see illustration in figure 3.5). During annotation we performed extensive quality assurance at two levels, firstly by the annotation managers and secondly by an experienced cell biologist.
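The GSD algorithm itself is described in [64]; as a simplified stand-in, the balancing idea can be illustrated by a round-robin assignment stratified on the metadata factors (an illustration of the principle only, not the actual generalized subset design, and with hypothetical metadata values):

```python
from collections import defaultdict
from itertools import cycle

def balanced_batches(images, n_batches, key):
    """Assign images to n_batches so that each metadata stratum
    (e.g. (cell type, density)) is spread evenly across the batches."""
    strata = defaultdict(list)
    for img in images:
        strata[key(img)].append(img)
    batches = [[] for _ in range(n_batches)]
    for group in strata.values():
        # Round-robin each stratum over the batches.
        for img, b in zip(group, cycle(range(n_batches))):
            batches[b].append(img)
    return batches

# Toy metadata: 16 images over 2 cell types x 2 densities, 4 per stratum.
images = [{"cell": c, "density": d, "i": i}
          for c in ("A", "B") for d in ("low", "high") for i in range(4)]
batches = balanced_batches(images, 4, key=lambda x: (x["cell"], x["density"]))
print([len(b) for b in batches])  # [4, 4, 4, 4]
```

Because every batch samples each stratum equally, any drift in annotator skill over time affects all experimental conditions alike rather than confounding one of them.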
To verify the morphological diversity of LIVECell, we characterized each cell by a multivariate description of commonly used shape and morphology descriptors. Using PCA of image-wise averages, we then confirmed that LIVECell indeed spans a wide set of morphologies (Paper IV Figure 1). To demonstrate the utility of LIVECell, we designed benchmarks measuring different aspects of cell segmentation performance and trained CNN-based segmentation models that were evaluated on each benchmark (Paper IV Figures 3 & 4). Finally, we demonstrated the benefit of scale by training models on subsets of varied size and evaluating them on the complete test set (Paper IV Figure 5).
Figure 3.5: Illustration of the rationale behind uploading images for annotation in balanced batches as given by DOE. The left-hand side (a) illustrates how a generalized subset design is used to split images into batches balanced according to metadata (cell type, cell culture density, well). The right-hand side (b) illustrates how the annotators' skill increases over time; the batches are provided in sequence to minimize the risk of introducing systematic bias in annotation quality due to annotators learning over time.
Discussion: To date, no family of machine learning methods comes close to the performance of deep ANNs trained on large datasets for unstructured data problems. This means that data generation is more important than ever to maximize the benefit of machine learning. For complex problems, such as cell segmentation in microscopic images, manual annotation is currently the most reliable method to create quality responses for raw data. The large scale combined with the reliance on manual labour means that there are strong economic incentives to maximize the benefit of the data sent for annotation.
We know that machine learning models, not only deep learning-based ones, benefit not only from dataset scale but also from diversity. By training on diverse datasets, the resulting models become more widely applicable. This has direct parallels to DOE, which aims to systematically maximize the diversity of datasets, although typically at small scale. Our study is directly inspired by DOE principles: we carefully selected the cell types to work with, ensured that the images were balanced over the conditions in question, and explicitly used DOE methods to split the datasets into batches to avoid introducing systematic bias in annotation quality.
To confirm the morphological diversity of LIVECell, we opted for a simple method, PCA (Paper IV Figure 1). By using a transparent model for dimension reduction combined with interpretable morphology metrics, we could easily draw conclusions about the different morphologies represented in LIVECell. This is in contrast to the trend of using non-linear dimension reduction methods (see for example [107] Figure 2) that reduce any data to two dimensions at the cost of interpretability. We argue that the principle of simplicity also applies to data analysis, and that simpler, interpretable methods should be used as long as they do the job.
Future perspective: In the past few years, there has been intense discussion on the bias and fairness of machine learning systems. For instance, it was demonstrated that commercial facial recognition systems displayed systematic bias with respect to both gender and skin color by performing significantly worse for female and dark-skinned faces [108]. Part of the reason machine learning systems become biased is that most real-world datasets are biased, for the simple reason that they are collected in our society and reflect the biases and unfairness within it. In the case of facial recognition, this discussion spurred the creation of new balanced datasets to make it possible to study the prevalence of systematic bias and how to avoid replicating it [109].
This thesis is not a study of bias and ethics in large-scale machine learning, and the problem is not solved simply by balancing datasets. However, the discussion of biased datasets and the deliberate efforts to create more balanced versions are related to the dataset design that we explored in Paper IV. Instead of simply grabbing whatever data is available to train our machine learning models, we ought to think about how we design our datasets for the purpose they should fulfill. Without deliberate effort, real-world data is very rarely balanced with respect to the different factors influencing the system, whether it comes from applicant statistics in historical job applications or statistics on equipment breakdown in industrial processes. Knowing this, and that machine learning systems become only as good as the data they are trained on, we can improve the situation by thinking ahead, in an attempt to be aware of and account for the imbalances and biases we might face.
4 | Future Perspectives
While this thesis has shown several ways in which chemometrics and machine learning may benefit each other, there are still many open avenues to explore. Three of the four papers in this thesis describe how chemometric principles can be used in a machine learning context, whereas this chapter outlines two ways of going in the other direction.
4.1. Separating Sources of Variation
One of the most widely used variants of PLS is orthogonal partial least squares regression (OPLS) [110]. From a prediction perspective, OPLS is equivalent to PLS. The main difference is that OPLS explicitly separates the predictive latent variables from other systematic variation in order to simplify interpretation. That is, the OPLS model of X is:
X = T P^T + T_o P_o^T + E    (4.1)
With n × k matrices of predictive scores T and orthogonal scores T_o, p × k matrices of predictive loadings P and orthogonal loadings P_o, and the n × p residual matrix E. Exploring the different sources separately is then simply a matter of inspecting the pairs T/P and T_o/P_o independently of each other.
In linear models, this separation is fairly straightforward to accomplish, since we can define systematic variation unrelated to the response as orthogonal to y and separate the two types of variation using linear algebra. The biggest advantage of separating different sources of variation is simplified interpretation. Inspecting the variation related to the response, unconfounded with other systematic variation, is less complex than interpreting all variation at once. Even though prediction performance is equivalent, better possibilities for interpretation allow for better data-driven decision-making.
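As a minimal numpy illustration of this linear-algebra separation (a sketch of the idea only, not the actual OPLS algorithm of [110]), the variation in X can be split into a part along y and a part orthogonal to y by projection:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))   # observation matrix
y = rng.normal(size=(50, 1))    # response vector

# Project every column of X onto y: the y-related part of the variation.
X_pred = y @ (y.T @ X) / (y.T @ y)
# The remainder is systematic variation orthogonal (i.e. unrelated) to y.
X_orth = X - X_pred

assert np.allclose(X_pred + X_orth, X)  # the split reconstructs X
assert np.allclose(y.T @ X_orth, 0)     # the remainder is orthogonal to y
```

The two parts can then be inspected separately, in the same spirit as the T/P and T_o/P_o pairs of the OPLS model above.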
In the general case, this type of separation is conceptually related to disentangled representation learning. Even for complex and nonlinear relationships, there is a hypothesis that real-world data is a mixture of several underlying phenomena that may be separated, or disentangled, from each other. In recent years there has been great interest in identifying these underlying relationships, since separating sources of variation may increase both interpretability and usability, as already demonstrated by OPLS in the linear case. An example of disentangled representation learning is provided by attempts to separate the semantic content of text, meaning the underlying message, from its style, meaning the specific way it was written [111]. In facial recognition, there are methods for disentangled representation learning that separate the face of the person from the pose of the face, meaning which way the person is looking [112]. Going one step further, there is ongoing work to develop causal representation learning [89], which refers to machine learning algorithms that aim to identify the underlying cause-effect relationships. Understanding what variation causes, rather than merely correlates with, other variation would provide even stronger possibilities for interpretation. Part of the reason why the AI community aims for disentanglement and causality is that separated sources would make it easier to develop autonomous agents. An autonomous agent working with disentangled or causal representations of its environment would be better predisposed to focus on the right information for the problem it tries to solve.
The transparency of linear latent variable methods has long benefited chemometrics. As in most disciplines, data-driven problems are getting more complex and expectations on predictive models are growing. While we may adopt more expressive models for increased predictive performance, this often comes at the cost of sacrificing transparency. Adopting recent developments in disentangled, or even causal, representations may provide an avenue towards performant and understandable predictive models for future chemical problems.
4.2. Rethinking PLS Weights
Few would argue against the claim that PLS [9, 113] and its variants have been incredibly successful for many problems in chemometrics. Although successful, PLS has certain limitations, such as sensitivity to outliers and a proneness to overfitting. Here we present an alternative formulation of how to calculate the PLS model, inspired by how we train other machine learning models such as ANNs.
We write the full PLS model as:

X = T P^T + E
Y = U C^T + F    (4.2)

with the n × k X-score matrix T, the n × k Y-score matrix U, the p × k X-loading matrix P, the q × k Y-weight matrix C, the n × p residual matrix E in X, and the n × q residual matrix F in Y, where T = U when the model fits well.
During PLS model fitting, we also obtain the p × k PLS weight matrix W = [w_1, w_2, ..., w_k].
The weight vectors, w_i, are typically calculated sequentially, where the first weight vector pair is given by:
{w_1, c_1} = argmax_{∥w_1∥ = ∥c_1∥ = 1} Cov(t_1, u_1)    (4.3)
where the first X-score vector t_1 = X w_1 is the projection of the observations onto the weights, the Y-score vector u_1 = Y c_1 is the corresponding response projection, and Cov(t_1, u_1): R^n × R^n → R,

Cov(t_1, u_1) = 1/(n − 1) (t_1 − E[t_1])^T (u_1 − E[u_1])    (4.4)

denotes the empirical covariance between the score vectors. As mentioned in section 2.2.1, equation 4.3 is solved by the first eigenvector of the combined variance-covariance matrix X^T Y Y^T X, typically calculated using a variant of the iterative NIPALS algorithm [66]. The corresponding loading vector p_1 = X^T t_1 / (t_1^T t_1), where t_1 = X w_1, is then used to deflate
Figure 4.1: Toy problem results illustrating the possibilities of the alternative formulation of PLS weights. From left to right: the true inner relation used to generate the data (note the outlier), regular PLS results on the validation set (RMSE = 0.748), and the alternative PLS results using the L1-norm as loss instead of covariance on the validation set (RMSE = 0.498). Black lines indicate the identity relationship.
the observation matrix as X_1 = X − t_1 p_1^T. The next weight vector is then given by the same procedure using the deflated matrix X_1 instead of the original X. Depending on which PLS variant is used, the response matrix Y may also be deflated.
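The eigenvector computation and the deflation step above can be sketched in a few lines of numpy (a simplified single-component illustration using power iteration, not a full NIPALS implementation as in [66]):

```python
import numpy as np

def first_pls_weight(X, Y, n_iter=100):
    """First PLS weight vector: the dominant eigenvector of X^T Y Y^T X,
    found here by simple power iteration."""
    w = np.ones(X.shape[1])
    for _ in range(n_iter):
        w = X.T @ (Y @ (Y.T @ (X @ w)))  # multiply by X^T Y Y^T X
        w /= np.linalg.norm(w)           # keep unit norm
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
Y = rng.normal(size=(30, 1))

w1 = first_pls_weight(X, Y)
t1 = X @ w1                      # first X-score vector t1 = X w1
p1 = X.T @ t1 / (t1 @ t1)        # loading vector p1 = X^T t1 / (t1^T t1)
X1 = X - np.outer(t1, p1)        # deflation: X1 = X - t1 p1^T

# After deflation the first score direction is removed: t1^T X1 = 0.
assert np.allclose(t1 @ X1, 0)
```

With a single response column, X^T Y Y^T X is rank one, so the power iteration converges immediately; for multiple responses it iterates toward the dominant eigenvector, which is what NIPALS computes implicitly.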
We note that we may in fact view equation 4.3 as a loss function, as described in section 2.2.1, and restate weight identification as optimization of an arbitrary loss function:
{w_1, c_1} = argmax_{∥w_1∥ = ∥c_1∥ = 1} L(X, Y, w_1, c_1)    (4.5)
Knowing that loss functions are powerful instruments to influence the behaviour of machine learning models (section 2.2.1), being able to tune PLS behaviour in this way is tempting. As pointed out in relation to equation 2.9, prediction by PLS according to equation 2.3 is equivalent to that of an ANN with a single hidden layer, meaning two weight matrices, and no activation function or bias vectors. Since we can readily train arbitrarily deep ANNs using the backpropagation algorithm and gradient descent optimization, we may use whatever loss function we like to find the weight vectors of our PLS model. A related formulation is provided by PCA, which is closely related to PLS and has already been rewritten in a similar fashion to be robust to extreme data corruption [114].
Thanks to the flexibility provided by auto-differentiation in modern software frameworks for deep learning [40, 41], implementing and exploring ideas such as this is straightforward. To demonstrate the capabilities of our alternative PLS formulation on a toy problem, we generate a small regression dataset of 50 observations governed by a single latent variable in the presence of a strong outlier. The data is generated by drawing scores t_1 for 50 observations from a normal distribution and generating the response by multiplying the scores by two and adding normally distributed noise with half the standard deviation of the scores. A strong outlier is set manually in the inner relation. Loadings, p_1, are then generated by drawing 500 elements from a normal distribution and scaling to unit norm, and are then used to generate the observation matrix X = t p^T. We also generated a separate validation set of 100 observations, without outliers, using the same underlying relation.
We then fit single-component PLS models both using the regular covariance and using the L1-norm of the prediction residuals,

{w_1, c_1} = argmin_{∥w_1∥ = ∥c_1∥ = 1} Σ_{i=1}^{n} | y_i − (X w_1 c_1^T)_i |    (4.6)

as loss function instead of the covariance. The L1-norm is well known to be more robust to outliers than least squares alternatives such as mean squared error or covariance. The alternative PLS model is implemented in the Python-based deep learning framework PyTorch [41] and trained using the backpropagation algorithm and gradient descent (equation 2.8) for 500 iterations with learning rate λ = 10^-2 (taking far less than a second on a mid-range laptop). The two models are then compared using the root mean squared error (RMSE) on the validation set (see figure 4.1), showing that the PLS model trained using the L1 loss achieved substantially lower prediction error (RMSE = 0.5 compared to 0.75 for normal PLS). If the outlier is removed, both models perform equally well (RMSE = 0.5), confirming that the L1-based loss function indeed reduced the PLS model's sensitivity to outliers.
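The experiment itself was implemented in PyTorch; the same principle can be sketched in plain numpy using subgradient descent with renormalization onto the unit sphere (a simplified, hypothetical setup: a single component, a scalar inner-relation coefficient b standing in for the full c, and smaller dimensions than in the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 5
t = rng.normal(size=n)                        # latent scores
w_true = rng.normal(size=p)
w_true /= np.linalg.norm(w_true)
X = np.outer(t, w_true) + 0.01 * rng.normal(size=(n, p))
y = 2.0 * t                                   # inner relation: y = 2 t
y[0] = 25.0                                   # one strong outlier

# Unit-norm weight vector plus a scalar coefficient for the inner relation.
w = X.T @ y
w /= np.linalg.norm(w)
b, lr = 0.0, 1e-2
for _ in range(1000):
    r = y - b * (X @ w)                       # prediction residuals
    s = np.sign(r)                            # subgradient of the L1 loss
    w += lr * b * (X.T @ s) / n               # gradient step on w ...
    b += lr * (s @ (X @ w)) / n               # ... and on b
    w /= np.linalg.norm(w)                    # project back onto unit sphere

# Evaluate on clean validation data from the same latent relation.
t_val = rng.normal(size=100)
rmse = np.sqrt(np.mean((2.0 * t_val - b * (np.outer(t_val, w_true) @ w)) ** 2))
```

On this toy setup the L1 objective keeps the fit close to the true slope of two despite the outlier, mirroring the behaviour of the L1-PLS model in figure 4.1; swapping in a squared loss would pull the slope toward the outlier.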
Although this is merely a toy problem, these results illustrate that we can easily give PLS-based models new capabilities by letting the weight vectors optimize functions other than the covariance. Armed with whatever loss functions machine learning can offer, we may have a way to express latent variables differently from what we are used to, something we leave for future research. We conclude, however, that formulating how PLS can be trained using ANN principles provides a rather neat way of wrapping up this thesis.
Acknowledgements
Over these years, many people have in one way or another contributed to making the writing of this thesis possible. Some of you I would like to thank in particular.
Many thanks to my supervisor Johan Trygg, who gave me this opportunity. I am immensely grateful that you have always believed in me and in what I do, and for all the help I have received. It has been quite a journey, and I think we have both learned a lot along the way. I would also like to thank my co-supervisor Olivier Cloarec, who enjoys geeking out just as much as I do, for all the stimulating discussions during these years. I would also like to thank my reference persons Mats Josefsson and Erik Elmroth for the feedback I have received.
I would also like to thank all colleagues and group members, both new and old, starting with the old gang in the CLiC corridor who welcomed me into the world of academia. Thank you, my dear office mate Tomas, for all the hours of discussion about everything under the sun, all the times I borrowed your hammer drill, for taking charge of the barbecue, and so on, but not for your loud headphones; Izabella for all the invaluable advice, and for arranging all things big and small; Frida for welcoming me to the group as my mentor during my first summer; Matilda for your infectious optimism and for also enjoying endless discussions about boring grown-up things like parenting, politics and society; David for sharing all your experience; Daniel for the good collaboration and for always being incurably pleasant no matter what happens.
Thanks to all you wonderful people at Sartorius, where I have worked for the past three years; you are far too many to mention every one of you. But I would especially like to thank my constant companion-in-arms Christoffer, as well as you other team members, current and former, not already mentioned: Emil, Chris, Brandon, Nick, Martin, Robert, Anne and Jonas, for all the good times and great projects, and also my new dear co-workers Elsa, Frida and Alexander for the fantastic job you do. Also, special thanks to my other Sartorius colleagues Tim J, Gillian, and all the rest of you for the great work we do together. I would also like to thank Oscar for believing in my work and getting me on board. Thanks also to Henrik, Paul and Stefan, among others, for your constant curiosity and pleasant discussions, and Erik for sharing your long experience, your tireless curiosity about new projects, and for always saying what you think. Thanks also to you after-work friends; you know who you are, and I hope we meet over a pint again soon. I hope no one feels forgotten. I promise I have not forgotten anyone on purpose.
Thanks to the rest of you at the university, such as the brand-new group members Edvin, Mattias and Nikita; I hope we will work more together. Many thanks to Andreas and Hans for pleasant discussions and good advice. Thanks also to all of you with whom I have had the privilege of sharing corridors and lunch breaks over the years: Benny, Pär, Jocke, Kristina, Sebastian, Anna, Carl, Sergiu, Nabeel, Beatriz, Olena, and surely more whose names escape me. I would also like to thank the management and administration of the Department of Chemistry for making everything work through the years, for making it possible to print this thesis, and for making sure there was always coffee in the lunch room.
I would also like to thank the other friends I have come to know over the years, in BiT10, in parent groups, at the gym and everywhere else. Thanks also to Mum, Dad, my brothers and the whole extended family for always being there.
Finally, I want to thank my wonderful family. A special thank you to you, Emma, for all these years together, for always being there and for putting up with me. Thanks to my wonderful kids Liam and Arvid for all the joy you spread and for giving me something worth fighting for, and to the latest addition, Vera, who lit a fire under me to finish this thesis before you came into the world. I love you ♡
That is all from me, thank you.
Rickard
Bibliography
[1] A. L. Samuel, “Some studies in machine learning using the game of checkers,” IBM Journal of Research and Development, vol. 3, no. 3, pp. 210–229, 1959.
[2] “Amazon Lookout for Vision.” https://aws.amazon.com/lookout-for-vision/. Accessed: 2021-02-25.
[3] “Machine learning can boost the value of wind energy.” https://deepmind.com/blog/article/machine-learning-can-boost-value-wind-energy. Accessed: 2021-02-25.
[4] E. Callaway, “’It will change everything’: DeepMind’s AI makes gigantic leap in solving protein structures,” Nature, vol. 588, pp. 203–204, 2020.
[5] S. Wold, “Chemometrics; what do we mean with it, and what do we want from it?,” Chemometrics and Intelligent Laboratory Systems, vol. 30, no. 1, pp. 109–115, 1995.
[6] European Medicines Agency, “ICH guideline Q8 (R2) on pharmaceutical development,” 2017. https://www.ema.europa.eu/en/documents/scientific-guideline/international-conference-harmonisation-technical-requirements-registration-pharmaceuticals-human-use_en-11.pdf.
[7] H. Martens, “Quantitative big data: where chemometrics can contribute,” Journal of Chemometrics, vol. 29, no. 11, pp. 563–581, 2015.
[8] S. Wold, K. Esbensen, and P. Geladi, “Principal component analysis,” Chemometrics and Intelligent Laboratory Systems, vol. 2, no. 1–3, pp. 37–52, 1987.
[9] P. Geladi and B. R. Kowalski, “Partial least-squares regression: A tutorial,” Analytica Chimica Acta, vol. 185, pp. 1–17, 1986.
[10] S. Malek, F. Melgani, and Y. Bazi, “One-dimensional convolutional neural networks for spectroscopic signal regression,” Journal of Chemometrics, vol. 32, no. 5, p. e2977, 2018.
[11] C. Cui and T. Fearn, “Modern practical convolutional neural networks for multivariate regression: Applications to NIR calibration,” Chemometrics and Intelligent Laboratory Systems, vol. 182, pp. 9–20, 2018.
[12] OECD Publishing, OECD Glossary of Statistical Terms, p. 119. Organisation for Economic Co-operation and Development, 2008.
[13] D. Lahat, T. Adali, and C. Jutten, “Multimodal data fusion: an overview of methods, challenges, and prospects,” Proceedings of the IEEE, vol. 103, no. 9, pp. 1449–1477, 2015.
[14] T. Skotare, Multivariate integration and visualization of multiblock data in chem ical and biological applications. PhD thesis, Umeå universitet, 2019.
[15] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
[16] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” 2019.
[17] K. Sohn, D. Berthelot, C.-L. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel, “FixMatch: Simplifying semi-supervised learning with consistency and confidence,” arXiv preprint arXiv:2001.07685, 2020.
[18] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for con trastive learning of visual representations,” 2020.
[19] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001.
[20] N. S. Altman, “An introduction to kernel and nearestneighbor nonparametric re gression,” The American Statistician, vol. 46, no. 3, pp. 175–185, 1992.
[21] C. Cortes and V. Vapnik, “Supportvector networks,” Machine learning, vol. 20, no. 3, pp. 273–297, 1995.
[22] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.
[23] J. T. Barron, “A general and adaptive robust loss function,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[24] L. Rastrigin, “The convergence of the random search method in the extremal control of a many parameter system,” Automation and Remote Control, vol. 24, pp. 1337–1342, 1963.
[25] R. Eberhart and J. Kennedy, “A new optimizer using particle swarm theory,” in MHS’95. Proceedings of the Sixth International Symposium on Micro Machine and Human Science, pp. 39–43, IEEE, 1995.
[26] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Optimization by simulated annealing,” Science, vol. 220, no. 4598, pp. 671–680, 1983.
[27] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in International Conference on Machine Learning, pp. 1139–1147, PMLR, 2013.
[28] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[29] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85–117, 2015.
[30] W. S. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in nervous activity,” Bulletin of Mathematical Biology, vol. 52, no. 1–2, pp. 99–115, 1943.
[31] F. Rosenblatt, “The perceptron: a probabilistic model for information storage and organization in the brain.,” Psychological review, vol. 65, no. 6, p. 386, 1958.
[32] A. G. Ivakhnenko, “Polynomial theory of complex systems,” IEEE transactions on Systems, Man, and Cybernetics, no. 4, pp. 364–378, 1971.
[33] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in Neural Information Processing Systems, vol. 25, pp. 1097–1105, 2012.
[34] S. Webb, “Deep learning for biology,” Nature, vol. 554, no. 7693, 2018.
[35] C. Angermueller, T. Pärnamaa, L. Parts, and O. Stegle, “Deep learning for computational biology,” Molecular Systems Biology, vol. 12, no. 7, p. 878, 2016.
[36] A. C. Mater and M. L. Coote, “Deep learning in chemistry,” Journal of chemical information and modeling, vol. 59, no. 6, pp. 2545–2559, 2019.
[37] B. Sahiner, A. Pezeshk, L. M. Hadjiiski, X. Wang, K. Drukker, K. H. Cha, R. M. Summers, and M. L. Giger, “Deep learning in medical imaging and radiation therapy,” Medical Physics, vol. 46, no. 1, pp. e1–e36, 2019.
[38] M. Ge, F. Su, Z. Zhao, and D. Su, “Deep learning analysis on microscopic imaging in materials science,” Materials Today Nano, p. 100087, 2020.
[39] J. N. Kutz, “Deep learning in fluid dynamics,” Journal of Fluid Mechanics, vol. 814, pp. 1–4, 2017.
[40] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., “Tensorflow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283, 2016.
[41] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, highperformance deep learning library,” in Advances in Neural Information Processing Systems 32, pp. 8024–8035, Curran Associates, Inc., 2019.
[42] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by backpropagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, 1986.
[43] H. J. Kelley, “Gradient theory of optimal flight paths,” Ars Journal, vol. 30, no. 10, pp. 947–954, 1960.
[44] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization,” in Advances in Neural Information Processing Systems (Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, eds.), vol. 27, pp. 2933–2941, Curran Associates, Inc., 2014.
[45] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein, “Visualizing the loss landscape of neural nets,” in Advances in Neural Information Processing Systems (S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds.), vol. 31, pp. 6389–6399, Curran Associates, Inc., 2018.
[46] Y. Bahri, J. Kadmon, J. Pennington, S. S. Schoenholz, J. Sohl-Dickstein, and S. Ganguli, “Statistical mechanics of deep learning,” Annual Review of Condensed Matter Physics, 2020.
[47] J. Huang and H.T. Yau, “Dynamics of deep neural networks and neural tangent hierarchy,” in International Conference on Machine Learning, pp. 4542–4551, PMLR, 2020.
[48] M. Belkin, D. Hsu, S. Ma, and S. Mandal, “Reconciling modern machinelearning practice and the classical bias–variance tradeoff,” Proceedings of the National Academy of Sciences, vol. 116, no. 32, pp. 15849–15854, 2019.
[49] T. M. Mitchell, The need for biases in learning generalizations. Department of Computer Science, Laboratory for Computer Science Research, Rutgers University, 1980.
[50] D. F. Gordon and M. Desjardins, “Evaluation and selection of biases in machine learning,” Machine Learning, vol. 20, no. 1–2, pp. 5–22, 1995.
[51] K. Fukushima and S. Miyake, “Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition,” in Competition and Cooperation in Neural Nets, pp. 267–285, Springer, 1982.
[52] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural computation, vol. 1, no. 4, pp. 541–551, 1989.
[53] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[54] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), vol. 30, pp. 5998–6008, Curran Associates, Inc., 2017.
[55] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021.
[56] Z. S. Harris, “Distributional structure,” Word, vol. 10, no. 2–3, pp. 146–162, 1954.
[57] J. Canny, “A computational approach to edge detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 6, pp. 679–698, 1986.
[58] D. G. Lowe, “Object recognition from local scale-invariant features,” in Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2, pp. 1150–1157, IEEE, 1999.
[59] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 1, pp. 886–893, IEEE, 2005.
[60] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European Conference on Computer Vision, pp. 818–833, Springer, 2014.
[61] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
[62] H. Martens and T. Naes, Multivariate Calibration. John Wiley & Sons, 1992.
[63] R. A. Fisher, The Design of Experiments. Oliver and Boyd, Edinburgh, 1937.
[64] I. Surowiec, L. Vikström, G. Hector, E. Johansson, C. Vikström, and J. Trygg, “Generalized subset designs in analytical chemistry,” Analytical Chemistry, vol. 89, no. 12, pp. 6491–6497, 2017.
[65] P. F. de Aguiar, B. Bourguignon, M. Khots, D. Massart, and R. Phan-Tan-Luu, “D-optimal designs,” Chemometrics and Intelligent Laboratory Systems, vol. 30, no. 2, pp. 199–210, 1995.
[66] H. Wold, “Soft modelling by latent variables: the nonlinear iterative partial least squares (NIPALS) approach,” Journal of Applied Probability, vol. 12, no. S1, pp. 117–142, 1975.
[67] K. Pearson, “LIII. On lines and planes of closest fit to systems of points in space,” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, vol. 2, no. 11, pp. 559–572, 1901.
[68] A. Hyvärinen and E. Oja, “Independent component analysis: algorithms and applications,” Neural Networks, vol. 13, no. 4–5, pp. 411–430, 2000.
[69] R. Tauler, B. Kowalski, and S. Fleming, “Multivariate curve resolution applied to spectral data from multiple runs of an industrial process,” Analytical Chemistry, vol. 65, no. 15, pp. 2040–2047, 1993.
[70] W. H. Lawton and E. A. Sylvestre, “Self modeling curve resolution,” Technometrics, vol. 13, no. 3, pp. 617–633, 1971.
[71] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[72] D.-S. Cao, Y.-Z. Liang, Q.-S. Xu, Q.-N. Hu, L.-X. Zhang, and G.-H. Fu, “Exploring nonlinear relationships in chemical data using kernel-based methods,” Chemometrics and Intelligent Laboratory Systems, vol. 107, no. 1, pp. 106–115, 2011.
[73] G. H. Tinnevelt and J. J. Jansen, “Resolving complex hierarchies in chemical mixtures: how chemometrics may serve in understanding the immune system,” Faraday Discussions, vol. 218, pp. 317–338, 2019.
[74] M. J. Cejas and R. C. Barrick, “Chemometric mapping of polychlorinated dibenzo-p-dioxin (PCDD) and dibenzofuran (PCDF) congeners from the Passaic River, NJ: Integrated application of RSIMCA, PVA, and t-SNE,” Environmental Forensics, pp. 1–17, 2020.
[75] L. Eriksson and E. Johansson, “Multivariate design and modeling in QSAR,” Chemometrics and Intelligent Laboratory Systems, vol. 34, no. 1, pp. 1–19, 1996.
[76] A. E. Hoerl and R. W. Kennard, “Ridge regression: Biased estimation for nonorthogonal problems,” Technometrics, vol. 12, no. 1, pp. 55–67, 1970.
[77] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
[78] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[79] V. Hodge and J. Austin, “A survey of outlier detection methodologies,” Artificial Intelligence Review, vol. 22, no. 2, pp. 85–126, 2004.
[80] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “LOF: Identifying density-based local outliers,” in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 93–104, 2000.
[81] L. Eriksson, T. Byrne, E. Johansson, J. Trygg, and C. Vikström, Multi- and Megavariate Data Analysis: Basic Principles and Applications, vol. 1. Umetrics Academy, 2013.
[82] S. Rabanser, S. Günnemann, and Z. Lipton, “Failing loudly: An empirical study of methods for detecting dataset shift,” in Advances in Neural Information Processing Systems (H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, eds.), vol. 32, Curran Associates, Inc., 2019.
[83] EPO, “Patent Index 2019: Statistics at a glance,” tech. rep., European Patent Office, 2020.
[84] L. Aristodemou and F. Tietze, “The state-of-the-art on intellectual property analytics (IPA): A literature review on artificial intelligence, machine learning and deep learning methods for analysing intellectual property (IP) data,” World Patent Information, vol. 55, pp. 37–51, 2018.
[85] Y. Lu, X. Xiong, W. Zhang, J. Liu, and R. Zhao, “Research on classification and similarity of patent citation based on deep learning,” Scientometrics, vol. 123, no. 2, pp. 813–839, 2020.
[86] K. S. Jones, “A statistical interpretation of term specificity and its application in retrieval,” Journal of Documentation, 1972.
[87] A. Rogers, O. Kovaleva, and A. Rumshisky, “A primer in BERTology: What we know about how BERT works,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 842–866, 2021.
[88] S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” in Advances in Neural Information Processing Systems 30 (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), pp. 4765–4774, Curran Associates, Inc., 2017.
[89] B. Schölkopf, F. Locatello, S. Bauer, N. R. Ke, N. Kalchbrenner, A. Goyal, and Y. Bengio, “Towards causal representation learning,” 2021.
[90] J. Trygg, E. Holmes, and T. Lundstedt, “Chemometrics in metabonomics,” Journal of Proteome Research, vol. 6, no. 2, pp. 469–479, 2007.
[91] H. Li, “Minimap2: pairwise alignment for nucleotide sequences,” Bioinformatics, vol. 34, no. 18, pp. 3094–3100, 2018.
[92] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” Journal of Machine Learning Research, vol. 13, no. 2, 2012.
[93] J. Köster and S. Rahmann, “Snakemake—a scalable bioinformatics workflow engine,” Bioinformatics, vol. 28, no. 19, pp. 2520–2522, 2012.
[94] L. Wei, X. Xu, Gurudayal, J. Bullock, and J. W. Ager, “Machine learning optimization of p-type transparent conducting films,” Chemistry of Materials, vol. 31, no. 18, pp. 7340–7350, 2019.
[95] National Transportation Safety Board, “Accident No.: HWY18MH010, Collision between vehicle controlled by developmental automated driving system and pedestrian,” 2018.
[96] D. Hendrycks and K. Gimpel, “A baseline for detecting misclassified and out-of-distribution examples in neural networks,” in International Conference on Learning Representations, 2017.
[97] K. Lee, K. Lee, H. Lee, and J. Shin, “A simple unified framework for detecting out-of-distribution samples and adversarial attacks,” in Advances in Neural Information Processing Systems, pp. 7167–7177, 2018.
[98] K. Musgrave, S. Belongie, and S.-N. Lim, “A metric learning reality check,” in European Conference on Computer Vision, pp. 681–699, Springer, 2020.
[99] S. Narang, H. W. Chung, Y. Tay, W. Fedus, T. Fevry, M. Matena, K. Malkan, N. Fiedel, N. Shazeer, Z. Lan, et al., “Do transformer modifications transfer across implementations and applications?,” arXiv preprint arXiv:2102.11972, 2021.
[100] “Papers with Code: Out-of-distribution detection.” https://paperswithcode.com/task/out-of-distribution-detection. Accessed 2021-02-28.
[101] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” in Advances in Neural Information Processing Systems (H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, eds.), vol. 33, pp. 1877–1901, Curran Associates, Inc., 2020.
[102] A. Saxena, K. Goebel, D. Simon, and N. Eklund, “Damage propagation modeling for aircraft engine run-to-failure simulation,” in 2008 International Conference on Prognostics and Health Management, pp. 1–9, IEEE, 2008.
[103] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Third International Conference on Learning Representations, 2015.
[104] P. Warden, “Speech Commands: A dataset for limited-vocabulary speech recognition,” arXiv preprint arXiv:1804.03209, 2018.
[105] S. B. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” in Readings in speech recognition, pp. 65–74, Elsevier, 1990.
[106] Y. Gal and Z. Ghahramani, “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning,” in International Conference on Machine Learning, pp. 1050–1059, 2016.
[107] C. Stringer, T. Wang, M. Michaelos, and M. Pachitariu, “Cellpose: a generalist algorithm for cellular segmentation,” Nature Methods, vol. 18, no. 1, pp. 100–106, 2021.
[108] J. Buolamwini and T. Gebru, “Gender shades: Intersectional accuracy disparities in commercial gender classification,” in Conference on fairness, accountability and transparency, pp. 77–91, PMLR, 2018.
[109] K. Karkkainen and J. Joo, “FairFace: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1548–1558, 2021.
[110] J. Trygg and S. Wold, “Orthogonal projections to latent structures (OPLS),” Journal of Chemometrics: A Journal of the Chemometrics Society, vol. 16, no. 3, pp. 119–128, 2002.
[111] V. John, L. Mou, H. Bahuleyan, and O. Vechtomova, “Disentangled representation learning for non-parallel text style transfer,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, (Florence, Italy), pp. 424–434, Association for Computational Linguistics, July 2019.
[112] L. Tran, X. Yin, and X. Liu, “Disentangled representation learning GAN for pose-invariant face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[113] S. Wold, M. Sjöström, and L. Eriksson, “PLS-regression: A basic tool of chemometrics,” Chemometrics and Intelligent Laboratory Systems, vol. 58, pp. 109–130, Oct. 2001.
[114] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?,” Journal of the ACM (JACM), vol. 58, no. 3, pp. 1–37, 2011.