Synergies between Chemometrics and Machine Learning

Rickard Sjögren

Department of Chemistry
Umeå University, Umeå, Sweden, 2021

This work is protected by the Swedish Copyright Legislation (Act 1960:729)
Dissertation for PhD
ISBN: 978-91-7855-558-1 (print)
ISBN: 978-91-7855-559-8 (pdf)

The cover art illustrates a brain and a tree. I simply thought it looked nice.
Image attribution: Gordon Johnson, USA (GDJ @ Pixabay)
Electronic version available at: http://umu.diva-portal.org/
Printed by: Cityprint i Norr AB, Umeå, Sweden, 2021

Don’t worry, be happy.

Table of Contents

Abstract ...... ii
Sammanfattning på svenska ...... iii
List of Publications ...... iv
Other Publications ...... v
Abbreviations ...... vii
Notation ...... viii

1 Introduction ...... 1
1.1 Scope of this Thesis ...... 3

2 Background ...... 4
2.1 What is data? ...... 4
2.2 Machine Learning ...... 5
2.2.1 Machine Learning Models ...... 7
2.2.2 Deep Learning ...... 9
2.2.3 Data Preprocessing ...... 14
2.3 Chemometrics ...... 15
2.3.1 Data generation ...... 15
2.3.2 Data Analysis ...... 17
2.3.3 Predictive Modelling ...... 18
2.3.4 Model Diagnostics ...... 20

3 Results ...... 21
3.1 Summary of Paper Contributions ...... 23
3.2 Paper I: MVDA for Patent Analytics ...... 24
3.3 Paper II: DOE for Data Processing Pipelines ...... 26
3.4 Paper III: Outlier Detection in ANNs ...... 27
3.4.1 Additional Case Studies ...... 29
3.5 Paper IV: Quality Assurance in Large Data Projects ...... 33

4 Future Perspectives ...... 36
4.1 Separating Sources of Variation ...... 36
4.2 Rethinking PLS Weights ...... 37

Acknowledgements ...... 40

Bibliography ...... 42

Abstract

Thanks to digitization and automation, data in all shapes and forms are generated in ever-growing quantities throughout society, industry and science. Data-driven methods, such as machine learning algorithms, are already widely used to benefit from all these data in all kinds of applications, ranging from text suggestion in smartphones to process monitoring in industry. To ensure maximal benefit to society, we need workflows to generate, analyze and model data that are performant as well as robust and trustworthy.

There are several scientific disciplines aiming to develop data-driven methodologies, two of which are machine learning and chemometrics. Machine learning is part of artificial intelligence and develops algorithms that learn from data. Chemometrics, on the other hand, is a subfield of chemistry aiming to generate and analyze complex chemical data in an optimal manner. There is already a certain overlap between the two fields, where machine learning algorithms are used for predictive modelling within chemometrics. However, since both fields aim to increase the value of data and have disparate backgrounds, there are plenty of possible synergies to benefit both fields. Thanks to its wide applicability, there are many tools and lessons learned within machine learning that go beyond the predictive models used within chemometrics today. On the other hand, chemometrics has always been application-oriented, and this pragmatism has made it widely used for quality assurance within regulated industries.

This thesis serves to nuance the relationship between the two fields and show that knowledge in either field can be used to benefit the other. We explore how tools widely used in applied machine learning can help chemometrics break new ground in a case study of text analysis of patents in Paper I. We then draw inspiration from chemometrics and show how principles of experimental design can help us optimize large-scale data processing pipelines in Paper II, and how a method common in chemometrics can be adapted to allow artificial neural networks to detect outlier observations in Paper III. We then show how experimental design principles can be used to ensure quality at the core of modern machine learning, namely the generation of large-scale datasets, in Paper IV. Lastly, we outline directions for future research and how state-of-the-art research in machine learning can benefit chemometric method development.

Sammanfattning på svenska

Thanks to digitization and automation, growing amounts of data are generated in all kinds of forms throughout society, industry and academia. To make the best use of these data, so-called data-driven methods, such as machine learning, are already used today in a multitude of applications, ranging from suggesting the next word in text messages on smartphones to process monitoring in industry. To maximize the societal benefit of the data being generated, we need robust and reliable workflows to create, analyze and model data for all conceivable applications.

There are many scientific fields that develop and use data-driven methods, two of which are machine learning and chemometrics. Machine learning falls within what is called artificial intelligence and develops algorithms that learn from data. Chemometrics, in contrast, has its origins in chemistry and develops methods to generate, analyze and maximize the value of complex chemical data. There is a certain overlap between the fields, where machine learning algorithms are used extensively for predictive modelling within chemometrics. Since both fields try to increase the value of data and have widely different backgrounds, there are many potential synergies. Because machine learning is so widely used, there are many tools and lessons learned beyond the predictive models already used within chemometrics. On the other hand, chemometrics has always been oriented towards practical application, and thanks to this pragmatism it is today widely used for quality work within regulated industry.

This thesis aims to nuance the relationship between chemometrics and machine learning and to show that lessons from either field can benefit the other. We show how tools common within machine learning can be used to help chemometrics break new ground in a case study on text analysis of patent collections in Paper I. We then borrow from chemometrics and show how experimental design can be used to optimize large-scale computational workflows in Paper II, and how a common chemometric method can be reformulated to detect deviating measurements in artificial neural networks in Paper III. After that, we show how principles from experimental design can be used to ensure quality at the core of modern machine learning, namely the creation of large datasets, in Paper IV. Finally, we give suggestions for future research and for how the latest methods within machine learning can benefit method development within chemometrics.

List of Publications

This thesis is based upon the following publications:

Paper I Multivariate patent analysis — Using chemometrics to analyze collections of chemical and pharmaceutical patents, Rickard Sjögren, Kjell Stridh, Tomas Skotare, Johan Trygg Journal of Chemometrics 34.1 (2020): e3041. DOI: 10.1002/cem.3041

Paper II doepipeline: a systematic approach to optimizing multi-level and multi-step data processing workflows, Daniel Svensson†, Rickard Sjögren†, David Sundell, Andreas Sjödin, Johan Trygg BMC Bioinformatics 20.1 (2019): 498. DOI: 10.1186/s12859-019-3091-z

Paper III Out-of-Distribution Example Detection in Deep Neural Networks using Distance to Modelled Embedding, Rickard Sjögren, Johan Trygg Manuscript under review - Machine Learning

Paper IV LIVECell - A large-scale dataset for label-free live cell segmentation, Christoffer Edlund†, Timothy R Jackson†, Nabeel Khalid†, Nicola Bevan, Timothy Dale, Andreas Dengel, Sheraz Ahmed, Johan Trygg, Rickard Sjögren Manuscript under review - Nature Methods

Where applicable, the papers are reprinted with permission from the publisher.

†These authors made equal contributions

Other Publications

These publications by the author are not included in this thesis:

Paper V Visualization of descriptive multiblock analysis, Tomas Skotare, Rickard Sjögren, Izabella Surowiec, David Nilsson, Johan Trygg Journal of Chemometrics 34.1 (2020): e3071. DOI: 10.1002/cem.3071

Paper VI Joint and unique multiblock analysis of biological data - multiomics malaria study, Izabella Surowiec, Tomas Skotare, Rickard Sjögren, Sandra Gouveia-Figueira, Judy Orikiiriza, Sven Bergström, Johan Normark, Johan Trygg Faraday Discussions 218 (2019): 268-283. DOI: 10.1039/C8FD00243F

Paper VII A geographically matched control population efficiently limits the number of candidate disease-causing variants in an unbiased whole-genome analysis, Matilda Rentoft, Daniel Svensson, Andreas Sjödin, Pall I Olason, Olle Sjöström, Carin Nylander, Pia Osterman, Rickard Sjögren, Sergiu Netotea, Carl Wibom, Kristina Cederquist, Andrei Chabes, Johan Trygg, Beatrice S Melin, Erik Johansson PLoS ONE 14.3 (2019): e0213350. DOI: 10.1371/journal.pone.0213350

Paper VIII A Machine Vision Approach for Bioreactor Foam Sensing, Jonas Austerjost, Robert Söldner, Christoffer Edlund, Johan Trygg, David Pollard, Rickard Sjögren SLAS Technology (2021) DOI: 10.1177/24726303211008861

The following patent applications relate to Paper IV included in this thesis:

I Computer-implemented method, computer program product and system for data analysis, Rickard Sjögren, Johan Trygg. E.P. Patent Application: EP-3620983-A1; W.O. Patent Application: WO-2020049094-A1; U.S. Patent Application: US-20200074269-A1

II Computer-implemented method, computer program product and system for analysis of cell images, Rickard Sjögren, Johan Trygg. E.P. Patent Application: EP-3620986-A1; W.O. Patent Application: WO-2020049098-A1

III Computer-implemented method, computer program product and system for anomaly detection and/or predictive maintenance, Rickard Sjögren, Johan Trygg. W.O. Patent Application: WO-2020049087-A1

Abbreviations

ANN Artificial Neural Network
CNN Convolutional Neural Network
DOE Design of Experiments
GSD Generalized Subset Design
MLR Multiple Linear Regression
MVDA Multivariate Data Analysis
OOD Out-of-distribution, also known as outlier
PCA Principal Component Analysis
PLS Partial Least Squares, or Projections to Latent Structures, regression
Q2 Goodness of Prediction
QA Quality Assurance
QbD Quality-by-Design
QSAR Quantitative Structure-Activity Relationship
R2 Goodness of Fit
RNN Recurrent Neural Network
RMSE Root Mean Square Error

Notation

$\mathcal{X}$ The observation space
$\mathcal{Y}$ The response space
$\mathbb{R}$ The set of all real numbers
$x$ An observation in $\mathcal{X}$
$y$ A response in $\mathcal{Y}$
$n$ Number of observations in a dataset
$p$ Number of variables in a two-dimensional dataset
$q$ Number of responses in a two-dimensional dataset
$k$ Dimensionality of a model, e.g. number of components
$\mathbf{x}$ A length $p$ vector of an observation
$\mathbf{t}$ A length $k$ score vector of an observation
$\mathbf{p}$ A length $p$ vector of variable loadings
$\mathbf{y}$ A length $q$ vector of measured responses
$\hat{\mathbf{y}}$ A length $q$ vector of predicted responses
$X$ $n \times p$ observation matrix
$Y$ $n \times q$ matrix of measured responses
$\hat{Y}$ $n \times q$ matrix of predicted responses
$T$ $n \times k$ matrix of observation scores
$W$ $p \times k$ matrix of variable weights
$W_i$ $k_{i-1} \times k_i$ matrix of variable weights in a multi-layer model
$P$ $p \times k$ matrix of variable loadings
$C$ $q \times k$ matrix of PLS Y-loadings
$\theta$ A set of model parameters in any form
$\boldsymbol{\theta}$ A vector of model parameters
$L(y, \hat{y})$ The loss function $L : \mathcal{Y}^2 \to \mathbb{R}$

1 | Introduction

Going into the 2020s, generation and utilization of data is ever-present in the scientific community, in industry, and throughout society in general, often under the umbrella term artificial intelligence (AI). Many forecasts project that the global AI market will grow into hundreds of billions of USD, meaning that its implications for society will be profound. Although AI encompasses a wide array of both technical and philosophical subfields, the boom we are experiencing now is largely driven by advances in data-driven methods, and especially machine learning. There are many definitions of what machine learning is, and a common one paraphrases Arthur Samuel [1], describing machine learning as algorithms that:

[...] build a model based on sample data, known as ”training data”, in order to make predictions or decisions without being explicitly programmed to do so.

Wikipedia - Machine Learning

In short, this means that we show examples of expected input data and desired output data to machine learning algorithms, which then serve to find a map between input and output. In other words, data-driven methods such as machine learning algorithms can be used to fit a model of the process that transforms our input into the output, which we can then use to predict the output for new input data.

Working with this definition, there are virtually endless possibilities for what data can be used as input and what output we may want to predict. Most people in developed countries regularly encounter machine learning in everyday IT applications. Some examples include smartphones that use machine learning to suggest a suitable next word based on the currently written characters in text messages, streaming services that use machine learning to recommend what TV series to watch next based on previously watched content, and social media that use machine learning to decide what content on your timeline you are likely to interact with based on previous interactions. In industry, we find machine learning methods at all levels of the production chain, from automated visual inspection for quality control [2] to forecasting wind farm energy output to help optimize equipment use [3]. Looking ahead, we have seen exciting developments in machine learning that will have a profound impact on the society of the future. For instance, DeepMind recently used machine learning to predict protein structures accurately enough to be practically useful in structural biology [4], which will have immense impact on biology and drug development.


Although data-driven methods such as machine learning algorithms show great promise to solve many problems throughout society, they are intimately connected to the strengths and weaknesses of the data itself. A machine learning model will never be better than the data it was trained on, nor can we expect it to perform well on new data dissimilar to the training data. This is widely known among machine learning practitioners and poses fundamental limitations on what we can expect from these types of systems. Despite being widely known, less effort than we may like is spent systematizing the data itself to ensure that we maximize the benefit to our applications and society. In contrast, immense effort is spent optimizing algorithms for predictive performance and applying algorithms to new problems. No matter how well-optimized a data-driven algorithm may be, we must make sure that the premises, in terms of what data we use, are right in order for it to be useful.

This leads us to the topic of this thesis, namely the field of chemometrics and its relationship to modern machine learning. Originally named in a grant application in 1971, chemometrics is summarized by one of its pioneers as:

[...] the branch of chemistry concerned with the analysis of chemical data (extracting information from data) and ensuring that experimental data contain maximum information (the design of experiments).

Svante Wold [5]

Despite being a subfield of chemistry, smaller and younger than machine learning, the underlying principles of chemometrics are applicable far beyond its original formulation. Chemometrics is characterized by the pragmatic use of mathematical, statistical and data-driven methods to maximize the value of generated data, by means of statistical experimental design and structured data analysis that maximize the knowledge gained from the data we have. It is easy to argue that these characteristics are applicable, and desirable, in domains outside of chemistry. In fact, the principles and methods of chemometrics have gained widespread traction in regulated industries, such as the pharmaceutical industry [6]. With its long history of practical applications, there are many lessons learned among chemometricians that deserve wider use.

Although chemometrics is sometimes described as having an adverse relationship with machine learning [7], machine learning methods are widely used in chemometrics. We will leave details for later chapters, but linear latent variable-based methods such as principal component analysis (PCA) [8] and partial least squares regression (PLS) [9], used in chemometrics for decades, can be viewed as machine learning algorithms. In recent years, we have seen an influx of a wide variety of machine learning algorithms into chemometrics, including so-called artificial neural networks [10, 11] that are largely responsible for the last decade's AI boom. Cross-pollination between these two fields already shows great benefits, and this thesis aims to show that there are more potential synergies and that we should strive towards a symbiotic relationship.


1.1. Scope of this Thesis

This thesis serves to nuance the relationship between the fields of machine learning and chemometrics and to show that knowledge in either field can be used to benefit the other. Machine learning algorithms are already used in chemometrics, and the two fields have principles in common; this thesis argues that there are more potential synergies to be found.

We start by exploring how tools widely used in applied machine learning can help chemometrics break new ground. Thanks to its wide array of applications, particularly in the domain of unstructured data such as text and images, machine learning features a toolbox that spans far beyond model-fitting algorithms. These tools are readily available, and Paper I demonstrates that they can be reused to help chemometrics enter new applications in a case study of text analysis of patents.

Simultaneously, chemometrics' pragmatic view of how to work with data is directly transferable to problems where machine learning algorithms are used, as well as to the algorithms themselves. For instance, experimental design methods that help generate and evaluate small datasets with maximal information are straightforward to adapt and were applied in Paper II to implement an algorithm to optimize large-scale genomic data processing pipelines. Paper III then draws inspiration from chemometric approaches to determine which predictions from a machine learning model can be trusted and develops a method to allow artificial neural network models to detect unrecognized data. Paper IV then demonstrates how experimental design principles can be used to ensure quality at the core of modern machine learning, namely the generation of large-scale datasets. Lastly, we outline directions for future research where state-of-the-art research in machine learning can benefit chemometric method development.

2 | Background

2.1. What is data?

Since this thesis regards the topics of generation and analysis of data, we first explain what we mean by data. One definition states that data are pieces of information or facts, often numerical, that are collected through observation [12]. For the purpose of this work, the term information is referred to in a colloquial sense rather than the formal information-theoretic sense. This definition is deliberately broad for the purpose of this thesis, which involves analysis of several types of data. A dataset is then simply a collection of observations of the same type of information, where each observation may carry one or more pieces of information, or variables.

Since data analysis and machine learning deal with mathematical manipulation of data, data must be in numeric form and observations must follow the same data model in order to be useful for those purposes. Data where observations follow a defined data model are often referred to as structured data. For instance, tabular data where each row consists of an observation is perhaps the most familiar example of structured data. Other types of structured data include observations of single, real-valued measurements from a single sensor, for instance a pH meter. Further examples include spectral data, where each observation is a sequence, or vector, of real-valued intensities or absorbances measured at different wavelengths. A frequently occurring problem in real-world structured data is missing data, where one or more variables lack measurements. Missing data may occur either randomly or systematically, where the latter is more problematic from a data analytic perspective since it may for instance introduce biases.

The opposite of structured data is so-called unstructured data, which refers to data that lack a defined data model. An unstructured observation is a piece of information, but it lacks a direct numerical representation for the purpose of data analysis or machine learning. An example of unstructured data is provided by textual data, which typically contain information but cannot be directly represented in a structured numeric form. Another example is provided by image data: even though each pixel value is numeric, image data lack a direct definition of how the pixels relate to each other in order to be useful. This means that raw unstructured data must be subject to data preprocessing, such as feature extraction, in order to convert it into structured form (see section 2.2.3).

Since many problems in both machine learning and chemometrics involve fitting a model to transform one form of data into another, we can distinguish different types of data based on whether or not the observations form pairs. For brevity, we do not discuss multiblock analysis, where observations form triplets or more, which has been discussed at length elsewhere [13, 14]. Naming of variables in either input or output data is sometimes a source of confusion in interdisciplinary communication. The variables in the input data may be called predictors, regressors, factors, covariates or independent variables, or often features in machine learning. The variables in the output data are often called responses in chemometrics; targets or labels in machine learning; or dependent variables or regressands, among other names depending on the field. The output variables are then further distinguished based on whether they are measured or predicted. Furthermore, the process of manually creating targets for input data is sometimes called annotation or labelling, which is most common in terms of unstructured data.

2.2. Machine Learning

This section serves to provide a basic understanding of the machine learning concepts used in this thesis. It in no way provides a comprehensive overview of the field.

Machine learning is a subfield of artificial intelligence (AI) and regards the specific type of problem of enabling software to learn from examples (figure 2.1). AI is a vast field and involves techniques such as search and planning algorithms and concepts such as reasoning. Although machine learning comprises a subset of AI, the two terms are sometimes used interchangeably. Furthermore, in most applications encountered in the literature, machine learning is restricted to learning from a fixed training set to fit a model that is then used for predictive purposes. Considering the complex dynamics of the world, the scope of such learning is very narrow but has proven very useful. Although there are machine learning algorithms that learn from interactions with a dynamic environment, such as reinforcement learning [15], they are beyond the scope of this thesis.

Figure 2.1: Illustration of the hierarchy of artificial intelligence and its subfields machine learning and deep learning.

The term machine learning may refer to two different, but related, topics. It may refer to the process of applied machine learning, aiming to develop useful models for the problem at hand (illustrated in figure 2.2). Applied machine learning includes transforming data to a useful form by means of data preprocessing (section 2.2.3), application of machine learning algorithms to fit a predictive model, scoring and evaluation of models (described in section 2.3.3), and deploying models for downstream use, where they may be monitored to ensure that they are suitable for their task (section 2.3.4). The term may also refer to the actual machine learning algorithms, which are the main topic of this section.


Figure 2.2: Illustration of applied machine learning as a multi-step process. Given raw data, the data is preprocessed into suitable form for model training. The preprocessed data is then typically split into training data, used to update the model, and evaluation data, used to evaluate and score the model. In most cases, many different models are trained and scored before the final model is deployed for downstream use.

In broad terms, machine learning algorithms are divided into two main regimes: supervised and unsupervised learning. The most widely used regime is supervised learning, which refers to when the training set consists of input-output pairs and we train a model to predict the target output well based on the input features. Measured targets are considered ground truth and are used to determine how well the model performs, and we train our models to minimize the error with respect to the ground truth, meaning that the training is supervised by our actual measured targets. Problem types falling under supervised learning are typically divided based on the nature of the target and include, but are not limited to, regression, where the target is a continuous value, and classification, where the target is a discrete category.

Unsupervised learning, on the other hand, deals with training data where the observations have no target associated with them and aims to find a useful representation of the data. This representation may be used to interrogate the data and help find general trends or subgroups in the dataset. An example of an unsupervised machine learning problem is dimensionality reduction, which serves to find a low-dimensional representation of a dataset with many variables, for instance by means of PCA [8]. Closely related to unsupervised learning is self-supervised learning, which has shown great results in several domains recently [16, 17, 18]. Self-supervised learning uses supervised learning algorithms, but the targets are generated from the training inputs themselves instead of being provided externally. An example in natural language processing shows that large artificial neural network models can be trained to fill in blanks in text at very large scale, learning representations of language useful for downstream tasks such as document classification [16]. Another example is provided by so-called consistency-based approaches, which show convincing results in computer vision and train models to output consistent representations for an image and a randomly perturbed version of the same image [18, 17]. These types of approaches, combined with smaller amounts of labelled data in a so-called semi-supervised fashion, are quickly growing popular in domains where raw data is plentiful.


2.2.1. Machine Learning Models

Models as function approximators In its most general form, we describe a machine learning model as a function approximator. This means that the model is used to approximate a mathematical function that maps from the observation space $\mathcal{X}$ to the response space $\mathcal{Y}$. Given an observation $x \in \mathcal{X}$ and our model $f : \mathcal{X} \to \mathcal{Y}$, parameterized by the set $\theta$, the response $y \in \mathcal{Y}$ is approximated by the prediction $\hat{y} \in \mathcal{Y}$ as:

$$y \approx \hat{y} = f(x, \theta) \tag{2.1}$$

For structured and numerical data, such as tabular or spectral data, $x$ is an observation vector $\mathbf{x}$ and the observation space is $\mathcal{X} = \mathbb{R}^p$, where $p$ is the dimensionality of $\mathbf{x}$, and the matrix of all $n$ input observations is the $n \times p$ matrix $X$. Similarly, the target $y$ is also often a vector $\mathbf{y}$ and $\mathcal{Y} = \mathbb{R}^q$ with dimensionality $q$, and the matrix of targets for all observations is an $n \times q$ matrix $Y$. The predicted target $\hat{y}$ is of the same form as $y$, with corresponding prediction vector $\hat{\mathbf{y}}$ and matrix of predicted responses $\hat{Y}$. In the following sections, we will only discuss numeric vector observations unless specified otherwise.

The specifics of the actual model are hidden behind $f$ above, and there is a myriad of model types available with different complexity, which are useful for different problems. The complexities of these models range from something as simple as basic multiple linear regression

(MLR) with regression parameter vector $\theta = \boldsymbol{\theta} = [\theta_0, \theta_1, \dots, \theta_p]$ predicting a scalar $\hat{y}$:

$$\hat{y} = \theta_0 + x_1\theta_1 + \dots + x_p\theta_p \tag{2.2}$$

A slightly more complex model, which is popular in chemometrics, is partial least squares regression (PLS) [9], which is parameterized by weight matrices $\theta = \{W, C\}$, where the $p \times k$ matrix $W$ is used to project observations to an intermediate, so-called latent, space, from which the $q \times k$ matrix $C$ is then used to project from the latent space to the output space. It may be written as:

$$\hat{\mathbf{y}} = \mathbf{x} W C^T \tag{2.3}$$

Other popular machine learning models include Random Forests [19], which consist of many shallow decision trees, a so-called ensemble, that we denote with slightly abused notation as $\theta$ in equation 2.1. The learning algorithm trains the decision trees to divide observations into compartments in an optimal way according to a criterion of interest. Other well-known examples include k-nearest neighbor regression or classification [20] and support vector machines [21]. There are also so-called meta-algorithms available to optimally combine several models into an ensemble, such as a random forest, where examples include the popular AdaBoost algorithm [22]. A model type that has grown highly popular in the past decade is provided by so-called Artificial Neural Networks (ANNs) and deep learning, which will be described separately below in section 2.2.2, Deep Learning. In conclusion, machine learning models aim to approximate the underlying function that gives rise to our observed data. The specifics of how to model these functions are governed by the many different available model types and their respective learning algorithms.


Loss functions In order to quantify to what extent our model is correct, we use so-called loss functions $L : \mathcal{Y}^2 \to \mathbb{R}$ that numerically describe the model error by comparing the predicted and measured outputs. How to compare predicted and measured outputs, as well as what particular loss function to use, depends on the problem at hand. An important part of the machine learning practitioner's work is therefore to determine suitable loss functions. For instance, the loss of a regression model is often given by the mean squared error (MSE), where the MSE calculated over all or a subset of observations in the dataset is given by:

$$L_{\text{MSE}}(Y, \hat{Y}) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{q} (Y_{ij} - \hat{Y}_{ij})^2 \tag{2.4}$$

For classification models, where each ground-truth target variable $y_i$ is 1 for the class the observation belongs to and 0 otherwise, and the model predicts probabilities $\hat{y}_i \in [0, 1]$ such that $\sum_{i=1}^{q} \hat{y}_i = 1$, the loss is often given by the average cross-entropy:

$$L_{\text{CE}}(Y, \hat{Y}) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{q} Y_{ij} \log(\hat{Y}_{ij}) \tag{2.5}$$
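To make equations 2.4 and 2.5 concrete, here is a minimal NumPy sketch of both losses (the function names and toy data are ours, purely for illustration):

```python
import numpy as np

def mse_loss(Y, Y_hat):
    # Equation 2.4: mean over the n observations of the summed squared
    # residuals across the q response variables.
    return np.mean(np.sum((Y - Y_hat) ** 2, axis=1))

def cross_entropy_loss(Y, Y_hat, eps=1e-12):
    # Equation 2.5: Y holds one-hot ground-truth rows, Y_hat predicted
    # class probabilities; eps guards against log(0).
    return -np.mean(np.sum(Y * np.log(Y_hat + eps), axis=1))

# Toy example: three observations, two classes.
Y = np.array([[1, 0], [0, 1], [1, 0]], dtype=float)
Y_hat = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
print(mse_loss(Y, Y_hat), cross_entropy_loss(Y, Y_hat))
```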

The loss function may have significant impact on the behaviour of the resulting model, and developing new loss functions is an active research area in itself. A full survey of available loss functions is however beyond the scope of this thesis. An example illustrating the flexibility of loss functions is provided by MSE, which is widely used for regression problems but suffers from high sensitivity to outlier observations due to the squared residual. Different counter-measures are available, such as using the absolute value of the residual instead, also known as L1-loss. L1-loss is less sensitive to outliers but results in a derivative with constant magnitude, which has implications for optimization as we will see below. One possible way to balance between insensitivity to outliers and ease of optimization is provided by a recently proposed family of loss functions [23] that allows us to tune the sensitivity to outliers. For a scalar output $y$ we let $\epsilon = y - \hat{y}$ and the loss is given by:

$$L_{\text{Barron}}(\epsilon, \alpha) = \frac{|2 - \alpha|}{\alpha} \left( \left( \frac{\epsilon^2}{|2 - \alpha|} + 1 \right)^{\alpha/2} - 1 \right) \tag{2.6}$$

where $\alpha \in \mathbb{R}$ is a parameter used to tune the sensitivity, with for instance $\lim_{\alpha \to 2} L_{\text{Barron}} = L_{\text{MSE}}$. To conclude, by designing the loss function correctly we have great control over the properties of our machine learning model.
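As a quick numerical illustration of equation 2.6, the sketch below (our own, with an illustrative function name) checks the limiting behaviour as α approaches 2, where the loss becomes proportional to the squared residual:

```python
import numpy as np

def barron_loss(eps, alpha):
    # Equation 2.6 for residuals eps = y - y_hat; alpha tunes how
    # strongly large residuals (outliers) influence the loss.
    a = np.abs(2.0 - alpha)
    return (a / alpha) * ((eps ** 2 / a + 1.0) ** (alpha / 2.0) - 1.0)

eps = np.linspace(-3.0, 3.0, 7)
print(barron_loss(eps, 1.999))  # approximately eps**2 / 2
print(eps ** 2 / 2.0)
print(barron_loss(eps, 1.0))    # grows roughly linearly: robust to outliers
```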

Model fitting Model training, or fitting, describes the actual learning part of machine learning and is a central part of the field. Our model $f(x, \theta)$ describes how to map an input observation to an output given a set of parameters, and the loss function quantifies how wrong its predictions are. The missing piece is then how to find parameters that minimize the loss, which is what the training algorithms aim to solve. Depending on what model and loss function are used, there are many algorithms available, where some algorithms are specific to the model at hand and others are more general.

Certain combinations of model and loss function have closed-form solutions, meaning that the model parameters are found by solving an equation. For instance, multiple linear regression, equation 2.2, minimizing MSE, equation 2.4, has the closed-form solution:

$$\boldsymbol{\theta} = (X^T X)^{-1} X^T \mathbf{y} \tag{2.7}$$
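As a small worked example of such a closed-form fit, the following sketch solves equation 2.7 with NumPy on synthetic data (all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n = 100 observations, an intercept column plus p = 3 variables.
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
true_theta = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=100)

# Closed-form solution of equation 2.7; np.linalg.solve is preferred
# over forming an explicit matrix inverse for numerical stability.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # close to [1.0, 2.0, -1.0, 0.5]
```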

Similarly, PLS (equation 2.3) maximizes the covariance between $XW$ and $YC$, which can be solved via the first eigenvector of the combined covariance matrix $X^T Y Y^T X$.

In most cases there is no closed-form solution available, and iterative optimization algorithms are necessary to minimize the loss function. Optimization algorithms can be divided into two coarse categories: gradient-based and gradient-free algorithms. As the name implies, gradient-based algorithms rely on the gradients, that is the first- or second-order (partial) derivatives, of the loss function, whereas gradient-free algorithms do not. We will not describe gradient-free algorithms in depth, but example methods include random search [24], particle swarm optimization [25], simulated annealing [26] and many more.

In terms of gradient-based algorithms, we focus on a method widely used in machine learning, namely gradient descent. The key idea behind gradient descent is to step-wise adjust the parameters in the opposite direction of the gradient of the loss function at the current point. The gradient describes the direction of steepest slope, and the whole procedure may be metaphorically described as finding the bottom of a bowl by taking small steps along the steepest downward slope. Formally, for the average loss $L_{x,y}$, calculated over all training set observations, at iteration $i$, we update the $k$-th model parameter $\theta_{k,i}$ using the update rule:

$$\theta_{k,i+1} = \theta_{k,i} - \lambda \nabla_{\theta_{k,i}}(L_{x,y}) \tag{2.8}$$

where $\nabla_{\theta_{k,i}}(\cdot)$ denotes the gradient with respect to $\theta_{k,i}$ and the learning rate $\lambda \in \mathbb{R}$ determines how large steps we take. Although popular, gradient descent struggles with certain types of functions and is not guaranteed to find the global minimum in non-convex problems (multiple hollows on a surface where the global minimum is the deepest hollow, according to our metaphor). Despite these shortcomings, gradient descent often works very well in practice. For large datasets, and especially in deep learning, the average loss is often calculated from a random subset of the training dataset instead of all observations. This procedure is typically called stochastic gradient descent (SGD), and sometimes mini-batch gradient descent. There are many popular variants of gradient descent that aim to improve model convergence, for instance by adding different types of momentum [27].
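A minimal sketch of the update rule in equation 2.8, fitting the linear model of equation 2.2 by full-batch gradient descent on the MSE loss (a toy illustration under our own naming, not production code):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = X @ np.array([0.5, 2.0, -1.0]) + rng.normal(scale=0.1, size=200)

theta = np.zeros(3)  # initial parameters
lam = 0.1            # learning rate, lambda in equation 2.8
for i in range(500):
    grad = 2.0 / len(y) * X.T @ (X @ theta - y)  # gradient of the MSE loss
    theta = theta - lam * grad                   # update rule, equation 2.8
print(theta)  # approaches the closed-form solution of equation 2.7
```

Sampling a random subset of the rows of X and y in each iteration, instead of using them all, would turn this into stochastic gradient descent.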

2.2.2. Deep Learning

This section serves to give a brief introduction to deep learning and relate it to the general machine learning concepts described above. Deep learning is a fast-moving field that has been surrounded by hype and high expectations in the past decade. We will not describe state-of-the-art methods but rather explain core concepts.

As described in a previous section, deep learning is a subfield of machine learning (figure 2.1) and deals with a specific type of machine learning model called deep artificial neural networks (ANNs). Deep learning and its history have been described at length in previous work [28, 29], but we will summarize it briefly. Deep ANNs describe a multistep, nonlinear transformation of input data to output data (see illustration in figure 2.3) and have their roots in seminal work on computational models of biological neurons in the 1940s [30]. The Perceptron, a progenitor of modern ANNs, was published in the 1950s [31], and the first multilayered networks were developed little more than a decade after that [32]. Although dating back many decades, deep ANNs did not have their mainstream breakthrough until 2012, when deep learning was used to win a major computer vision competition with a huge margin [33]. The reason it took so long is mainly that deep ANNs require vast datasets and computational resources to work well, something that was not widely available until the 2010s. Since then, ANNs and deep learning have made rapid progress and are used in complex prediction problems in biology [34, 35], chemistry [36], medical imaging [37], material science [38], fluid dynamics [39] and many more fields. With rapid algorithmic development, increased computational power and ever-growing data availability, as well as raised expectations on predictive power, deep learning is likely here to stay.

Forward pass In most cases, ANNs describe a sequence of mathematical operations, transforming input to output, that can be described by a directed acyclic graph. These types of ANNs are referred to as feedforward networks, and applying a feedforward ANN to an observation is often called the forward pass. One example of non-feedforward networks is provided by recurrent neural networks, briefly described below in section 2.2.2 - ANN architectures.

As illustrated in figure 2.3, the predicted output of a feedforward ANN is given by sequentially multiplying variables with parameters, typically called weights in this context, summing the products, adding a bias parameter and then applying a nonlinear activation function. Disregarding the activation function for a moment, the intermediate values, also called activations, are given by equations identical to multiple linear regression above (equation 2.2). For two-dimensional data and a fully connected network as shown in figure 2.3, we put the weights of each layer $l \in \mathbb{N}$ in weight matrices $W_l$ and bias vectors $\mathbf{b}_l$ and conveniently write the prediction as the nested function:

$$\hat{y} = f_l\Big(f_{l-1}\big(\dots f_1(\mathbf{x} W_1 + \mathbf{b}_1) W_2 + \mathbf{b}_2 \dots\big) W_l + \mathbf{b}_l\Big) \tag{2.9}$$

Note that, without activation functions, PLS (equation 2.3) can be written as an ANN with a single intermediate layer and bias vectors $\mathbf{b}_1 = \mathbf{b}_2 = 0$.

The activation functions $f_l : \mathbb{R} \to \mathbb{R}$ can be any differentiable function and serve to induce nonlinearity to help the ANN learn more complex functions. An example of an activation function is the rectified linear unit, $\mathrm{ReLU}(x) = \max(0, x)$, which is one of the most widely used activation functions in practice.
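A minimal NumPy sketch of the forward pass in equation 2.9, mirroring the small network of figure 2.3 (two inputs, two intermediate layers of three units each, one output; the random weights are placeholders):

```python
import numpy as np

def relu(x):
    # Rectified linear unit activation.
    return np.maximum(0.0, x)

rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)  # input -> H1
W2, b2 = rng.normal(size=(3, 3)), np.zeros(3)  # H1 -> H2
W3, b3 = rng.normal(size=(3, 1)), np.zeros(1)  # H2 -> output

x = np.array([0.5, -1.2])   # one observation
h1 = relu(x @ W1 + b1)      # activations of the first intermediate layer
h2 = relu(h1 @ W2 + b2)     # activations of the second intermediate layer
y_hat = h2 @ W3 + b3        # predicted output
print(y_hat)
```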

As already mentioned, the ANN model describes a directed graph of sum-products and activations, meaning that we are not restricted to two-dimensional data. To generalize from 2D to any regularity, input, intermediate and output data, as well as weights and biases, are denoted using tensors. Tensors can be viewed as multi-dimensional arrays, or generalizations of vectors (tensors of order 1) and matrices (tensors of order 2), and serve as the basic data structure in widely used software libraries for deep learning [40, 41]. Tensor manipulation is especially useful for unstructured data such as images, which can be described as order 3 $h \times w \times c$ tensors, where $h$ is the height, $w$ the width and $c$ the number of channels (e.g. three for RGB images, or more for multi- or hyperspectral ones).

Figure 2.3: Illustration of the forward pass using an ANN with two input variables $x_i$, two intermediate layers $H_1$ and $H_2$ with three intermediate variables each, and a single output $y$ as example. During the forward pass, each intermediate value, or activation, $h_{\cdot j} \in \mathbb{R}$ is given by the sum of the values in the preceding layer multiplied by parameters, or weights, $w_{\cdot j} \in \mathbb{R}$, plus a bias term $b_j \in \mathbb{R}$, and passed through a differentiable non-linear activation function $f : \mathbb{R} \to \mathbb{R}$. Adapted from [28].

Back-propagation Typical ANNs are multi-layered, complex models that are often over-parameterized, meaning that they have more parameters than there are observations available to train the model. Perhaps obviously, this means that ANNs have no closed-form solution for model training and must be trained using optimization algorithms as described above (section 2.2.1 - Model fitting). The most widely used way to train ANNs is using the backpropagation algorithm (summarized in figure 2.4) and some variant of gradient descent [42], an approach originally developed in the field of control theory [43]. As described above, gradient descent updates the model parameters, i.e. the ANN weights, based on the first-order derivative of the loss with respect to each parameter. In a multi-layered model, such as a deep ANN, calculating the gradient of the output loss with respect to a parameter in the middle of a large model is not trivial. To solve this problem, backpropagation provides a straightforward algorithm, based on the chain rule of partial derivatives, that calculates the gradients starting with the output layer and propagates them backward, as described in figure 2.4. That is, backpropagation gives us the gradients $\nabla_{\theta_{k,i}}$ used to update the parameters in gradient descent (equation 2.8).

Figure 2.4: Illustration of back-propagation of the ANN. During back-propagation, the derivative of the loss with respect to each intermediate layer is calculated by repeatedly applying the chain rule of partial derivatives from the output and backwards through the ANN layers. Adapted from [28].
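As a compact illustration of backpropagation (our own sketch, not code from the thesis), the following trains a one-hidden-layer network on the MSE loss, deriving each gradient with the chain rule from the output backwards:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(256, 2))
y = np.sin(X[:, :1]) + 0.5 * X[:, 1:]  # toy nonlinear regression target

W1, b1 = 0.5 * rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = 0.5 * rng.normal(size=(8, 1)), np.zeros(1)
lam = 0.05

for step in range(2000):
    # Forward pass (equation 2.9 with one intermediate layer).
    a1 = X @ W1 + b1
    h1 = np.maximum(0.0, a1)              # ReLU activation
    y_hat = h1 @ W2 + b2
    # Backward pass: chain rule applied from the output loss backwards.
    d_yhat = 2.0 * (y_hat - y) / len(y)   # derivative of MSE w.r.t. y_hat
    dW2 = h1.T @ d_yhat
    db2 = d_yhat.sum(axis=0)
    d_h1 = d_yhat @ W2.T                  # propagate gradient to hidden layer
    d_a1 = d_h1 * (a1 > 0)                # derivative through ReLU
    dW1 = X.T @ d_a1
    db1 = d_a1.sum(axis=0)
    # Gradient descent updates (equation 2.8).
    W1 -= lam * dW1; b1 -= lam * db1
    W2 -= lam * dW2; b2 -= lam * db2

print(np.mean((y_hat - y) ** 2))  # training MSE shrinks over the iterations
```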

We shall note that training deep ANNs is a highly non-convex problem of high dimensionality, due to the large number of parameters, and with a large number of saddle points, which makes optimization non-trivial [44]. Despite these difficulties, the seemingly simple algorithms of backpropagation and stochastic gradient descent help deep ANNs converge to models that generalize to an unprecedented extent in many problems. At the time of writing this thesis, the theory of why deep learning works so well is poorly understood. Unravelling the reasons behind this is a highly active research field [45, 46, 47], sometimes showing counter-intuitive results such as the "double-descent" phenomenon [48]. We conclude that backpropagation has helped deep learning revolutionize many aspects of modern machine learning, although why is not fully understood.


ANN architectures Another aspect of deep ANNs that has had a large impact on their success is that of model architectures, which define the way the sequence of operations is arranged within the model. The most basic architecture is provided by the so-called fully connected feed-forward network. In a fully connected network, the ANN is defined as a simple sequence of layers where the activations in a layer depend on all activations in the preceding layer (as shown in figure 2.3). Thanks to the flexible network structure of ANNs, we can design sophisticated architectures to allow the ANN to benefit from better inductive biases, i.e. make it better predisposed to learn and generalize certain tasks. The need for inductive biases is well known in machine learning [49, 50], and designing ANN architecture variants is a research field in itself.

Perhaps the most famous example of a family of ANN architectures is provided by so-called convolutional neural networks (CNNs), which are widely used for all data with spatial correlations, especially for vision tasks. The basic mechanism was proposed in the 1980s [51, 52]: the model weights are arranged into small kernels that are reused over the whole spatial extent of an observation in a sliding-window fashion (illustrated in figure 2.5). Using kernels that operate locally allows CNNs to efficiently learn general features that can be re-used over the whole spatial extent and then combined into complex features in deeper layers of the network. Large-scale training of CNNs was used to win the ImageNet Large Scale Visual Recognition Challenge 2012 [33] with a large margin, which led to the widespread adoption of ANNs that we see today.

Figure 2.5: Illustration of the application of a 2 × 2 kernel in a convolutional neural network for image data. In a sliding-window fashion, the pixel values in each region of the input are multiplied by the kernel weights and then summed to the intensity of one pixel in the next layer.
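A bare-bones sketch of the sliding-window operation in figure 2.5, applying a single 2 × 2 kernel to a small grayscale image (a deliberately naive loop that ignores stride, padding and multiple channels):

```python
import numpy as np

def conv2d_single(image, kernel):
    # Slide the kernel over the image; each output pixel is the sum of
    # the kernel weights multiplied by the covered input region.
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])   # a tiny 2 x 2 kernel
print(conv2d_single(image, kernel))            # 3 x 3 output map
```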

Although the most famous example, CNNs are not the only ANN architecture family available. To process sequential observations, different variants of so-called recurrent neural networks (RNNs) are sometimes used, which are an example of non-feedforward ANNs. RNNs operate by passing on the activations of the preceding observation along with the current one, enabling learning of complex time-dependent features, and include models such as Long Short-Term Memory networks [53]. In the past few years, so-called transformers [54] have emerged as a flexible and promising architecture family. Transformers also operate on sequential data, by repeatedly applying scaled dot-product attention (not described here), and have been massively successful for textual data [16] and have recently shown great promise for image data as well [55]. To summarize, part of why deep learning has been immensely successful is that network architectures create inductive biases enabling deep ANNs to efficiently learn to solve certain problems.


2.2.3. Data Preprocessing

Preprocessing is a broad term and may include everything that data are subject to before modelling. This ranges from digitizing measured but analog data, identifying relevant observations in historical databases and merging disjoint datasets, to scaling individual numeric values. To limit the scope of the discussion, we will only briefly describe the process of transforming data from unstructured to structured format, often referred to as feature extraction or feature engineering. That is, the process of finding a representation of the data useful for the problem at hand.

As described above (section 2.1), data need to be in structured and numeric format in order to be useful for machine learning algorithms. For most unstructured data, there are many different options available for how to transform it, each with their strengths and weaknesses. If we consider textual data as an example, a simple way of preprocessing is provided by so-called bags-of-words [56]. A bag-of-words is an n × p matrix of n observations (such as sentences or documents) where each element describes how many times each observation uses each of the p words in the vocabulary. Note that there is no definite way of constructing a bag-of-words from a given dataset, and configuration parameters, such as the minimum number of times a word must be used in order to be included in the vocabulary, will have large impact on the resulting representation. The bag-of-words representation is simple and effective but comes with the drawback that all location information is discarded, meaning that any sense of what order the words are used in is missing.
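A minimal sketch of how such an n × p bag-of-words matrix can be built from raw sentences (whitespace tokenization only; real implementations add normalization, stemming and frequency cut-offs):

```python
import numpy as np

docs = ["the cat sat on the mat",
        "the dog sat",
        "dog and cat"]

# Build the vocabulary: one column per unique word.
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: j for j, word in enumerate(vocab)}

# Fill the n x p count matrix.
X = np.zeros((len(docs), len(vocab)), dtype=int)
for i, doc in enumerate(docs):
    for word in doc.split():
        X[i, index[word]] += 1

print(vocab)
print(X)  # note that word order within each document is discarded
```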

Feature extraction is widely studied for other more or less unstructured types of data. Computer vision, for instance, offers a vast array of methods to describe image data in ways useful for both pixel-wise and whole-view analysis. These methods range from low-level operations such as edge detection [57] to more elaborate methods for shape detection [58, 59]. Another example is provided by time-series data which, although structured in themselves, are often subject to preprocessing such as binning into fixed time intervals, calculation of moving averages on different time resolutions, differentiation, lagging and auto-correlation of the different features, and so on. Similarly, differentiation over the spectral dimension is common when analyzing spectral data in chemometrics.

Feature extraction is intimately connected to the inductive biases that we strive for in machine learning (discussed briefly in section 2.2.2 - ANN architectures). Good feature extraction creates good inductive biases for our machine learning models by drawing on human expertise. In the past decade, the rise of deep learning has resulted in feature extraction being largely replaced by feature learning in certain domains. In contrast to feature extraction, feature learning describes the use of machine learning algorithms to find an optimal representation of the data. Deep ANNs have shown a remarkable ability to learn useful representations for many kinds of complex, unstructured data [28, 29]. For instance, convolutional neural networks not only learn low-level features such as edge detectors, they also learn how to assemble them into abstract high-level concepts that are more robust than what we have accomplished using feature engineering approaches [60]. In natural language processing, even relatively simplistic feature learning methods, such as word embeddings, learn semantic relationships between words in a way that is incredibly challenging to accomplish using feature extraction [61]. By allowing general machine learning principles to find optimal features for the problem at hand, deep ANNs have not only established new levels of state-of-the-art performance. They have also, to a certain extent, democratized machine learning by making it more available for unstructured data problems.

2.3. Chemometrics

In contrast to machine learning, which focuses on developing principles and algorithms for learning regardless of problem domain, chemometrics aims to extract information from a certain type of data, chemical data, using methods that are practical. Due to the complex nature of chemistry, the general principles of chemometrics emerged partially as a reaction to the single-factor experiments and analyses prominent in the experimental sciences [5]. Instead, chemometrics draws upon statistics and machine learning algorithms to generate and analyze data with many factors or variables, while keeping a large focus on knowledge generation. For simplicity, we choose to view chemometrics as a four-step process, as illustrated in figure 2.6, where each step comes with its own set of methods.

Figure 2.6: Illustration of chemometrics as a process to generate and apply knowledge using data-driven methods.

An important part of chemometrics is the use of data-driven methods all the way throughout the data life-cycle. Statistical design of experiments (DOE) is used to help generate information-rich datasets, and multivariate data analysis (MVDA) to extract information from data with many variables. In terms of predictive modelling, chemometrics already uses machine learning models, although with a great focus on linear latent variable methods. For decades [62], chemometrics has put great focus on model diagnostics to validate that predictive models and new observations fall within what is expected. These principles are widely applicable and are now used in frameworks such as Quality-by-Design (QbD), which is increasingly used in the pharmaceutical industry [6].

2.3.1. Data generation

Not all data are created equal. Seemingly obvious but sometimes overlooked, an important part of data-driven research is to ensure that the data are representative of the problem that is to be solved. Perhaps the best way to ensure representative data is to design experiments in such a way that the resulting data are optimal for data-driven modelling, which is the purpose of DOE, a field stemming from century-old research [63].





Figure 2.7: Graphic illustration of three types of 3-factor designs (full factorial, fractional factorial and generalized subset design), where the blue circles indicate unique factor combinations to include as experiments in the design. The full and fractional factorial designs investigate two levels of all factors, whereas the generalized subset design investigates three levels in factors A and B and two in factor C.

Most real-life systems are influenced by more than one factor, meaning that it is strictly necessary to investigate interactions between factors to fully characterize the system. A fundamental challenge when the number of factors increases is that the number of experiments necessary to test all combinations of factors grows exponentially. DOE provides methods to select experiments that are maximally useful for data-driven modelling of the investigated system.

To generate data optimal for model fitting, designs based on combinations of factors investigated at two levels are widely used (figure 2.7). For instance, the classical full factorial design explores all possible factor combinations (plus optional center-point experiments), which allows identification of all possible terms in a multiple linear regression model (equation 2.2). The downside of full factorial designs is that they are not very economical in terms of required experiments, since the number of required experiments grows exponentially with the number of factors. Fractional factorial designs, on the other hand, sacrifice the possibility to identify certain interaction terms in order to cut the required number of experiments by a power of two. These types of designs can also be generalized to more than two levels of each factor, for instance by using generalized subset designs (GSDs) [64] that allow design of data-efficient subsets in multi-level experiments.
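As a small illustration (our own sketch), a full factorial design is simply the Cartesian product of the factor levels, which makes the exponential growth in experiments explicit, and a regular half-fraction can be built by aliasing one factor with an interaction:

```python
from itertools import product

# Two-level settings for three factors A, B, C (coded -1/+1).
levels = {"A": [-1, 1], "B": [-1, 1], "C": [-1, 1]}

full_factorial = list(product(*levels.values()))
print(len(full_factorial))  # 2**3 = 8 experiments

# Half-fraction: alias factor C with the A*B interaction, keeping only
# the runs where C == A * B (2**(3-1) = 4 experiments).
half_fraction = [run for run in full_factorial if run[2] == run[0] * run[1]]
for run in half_fraction:
    print(dict(zip(levels, run)))
```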

A more general type of design is provided by optimal designs. As the name implies, these are designs that are optimal according to some statistical criterion and are found using optimization algorithms. In short, out of a candidate set of possible experiments, optimal designs pick the set of experiments that best meets the criterion. By selecting a suitable candidate set, it is straightforward to incorporate constraints on what experiments are allowed in optimal designs. There are many types of optimal designs available, and a popular one is provided by so-called D-optimal designs [65]. The D-optimal designs find designs that maximize the

16 2.3. Chemometrics determinant of the information matrix as:

T XD­optimal = argmax |X X| (2.10) X∈XC

where $X \in \mathcal{X}_C$ describes the candidate set. This criterion can be interpreted geometrically: the square root of the determinant is inversely proportional to the volume of the confidence ellipsoids, or uncertainty, of a linear model's parameters [65]. Hence, D-optimal designs minimize the resulting model parameter uncertainty, which for instance is useful during screening experiments when factor importances are investigated. Although we have not provided a comprehensive summary of available designs, we conclude that DOE serves as a framework to generate data that are suitable for the problem at hand.
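To make the criterion in equation 2.10 concrete, the sketch below picks experiments from a candidate set with a naive greedy search that repeatedly adds the candidate increasing det(XᵀX) the most (real D-optimal software uses considerably more refined exchange algorithms):

```python
import numpy as np
from itertools import product

# Candidate set: all 27 runs of a 3-level, 3-factor grid (coded -1, 0, 1).
candidates = np.array(list(product([-1, 0, 1], repeat=3)), dtype=float)

def greedy_d_optimal(candidates, n_runs):
    # Greedily grow a design that maximizes det(X^T X), equation 2.10.
    chosen = []
    for _ in range(n_runs):
        best_idx, best_det = 0, -np.inf
        for idx in range(len(candidates)):
            X = np.array(chosen + [candidates[idx]])
            det = np.linalg.det(X.T @ X)
            if det > best_det:
                best_idx, best_det = idx, det
        chosen.append(candidates[best_idx])
    return np.array(chosen)

design = greedy_d_optimal(candidates, n_runs=6)
print(design)
```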

2.3.2. Data Analysis

Structured multivariate data analysis (MVDA) using linear latent variable models is often seen as a hallmark of chemometrics. The concept of latent variables refers to the notion that many observed variables depend on some underlying variable that we cannot observe directly. In chemometrics, latent variables are instead inferred from observed variables by using data-driven models, typically linear ones denoting the latent variables as components [66, 62]. Historically, this approach has been referred to as "soft" modelling, in contrast to "hard" modelling where all variables are directly observed. One advantage of using latent variables instead of observed variables directly is that many real-world phenomena are difficult to measure directly, and each observed variable only shows a noisy part of the truth. By relating many observed variables to one latent variable, for instance by analyzing their covariance, we can often draw far better conclusions than by analyzing them one by one. For instance, in healthcare there is, at least to the knowledge of the author, no single variable to measure that indicates the general health of a patient. But combining many different measurements, such as blood pressure, blood cholesterol, BMI, age, and so on, allows us to form a bigger picture of health status. Another advantage is that the number of latent variables is often many times smaller than the number of observed variables, making interpretation of the underlying relationships, for instance to analyze trends or spot outliers, far easier. A third advantage is that latent variable models handle collinearity, i.e. correlated observed variables, well. Rather than struggling with model identification under collinearity, latent variable models instead often benefit from it and summarize said correlations.

A fundamental latent variable method is provided by principal component analysis (PCA) [67, 8], which is an unsupervised linear latent variable method based on matrix decomposition. The PCA components form an orthonormal projection basis that maximizes the variance explained of the training data X. That is, using k components, we decompose the n×p data matrix X into the n×k score matrix T and the p×k loadings matrix P = [p1, ..., pk] as X = TP^T, where the first loading vector is given by:

$$p_1 = \underset{\|p\|=1}{\operatorname{argmax}} \; \|Xp\|^2 \tag{2.11}$$

Finding the first loading vector is an optimization problem (see section 2.2.1, Model fitting), which can be solved in several ways. One of these is the closed-form solution given by the eigenvector of the covariance matrix $X^TX$ corresponding to its largest eigenvalue.
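A minimal sketch of the decomposition, assuming mean-centred data and using the SVD (one of the solution strategies mentioned above; the function name is ours):

```python
import numpy as np

def pca_decompose(X, k):
    """Decompose mean-centred X (n x p) into scores T (n x k) and
    loadings P (p x k), X ~ T P^T, via singular value decomposition."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    P = Vt[:k].T           # loadings: top-k right singular vectors
    T = X @ P              # scores: projections onto the loadings
    return T, P

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
X -= X.mean(axis=0)        # PCA assumes centred data
T, P = pca_decompose(X, k=2)
print(np.allclose(P.T @ P, np.eye(2)))  # loadings form an orthonormal basis
```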

Being unsupervised and linear, PCA provides a basic tool for straightforward investigation of high-dimensional datasets by examining the found components ranked by their explained variance. Since there are far fewer components than variables, inspecting the score vectors provides a convenient way to get an overview of the dominant sources of variation in the dataset. Inspecting the loading vectors then provides transparent insight into how the original variables relate to the found components. Closely related to PCA but supervised, PLS (equation 2.3) allows targeted investigation by extracting components directed by the target of interest. Other decomposition-based methods include independent component analysis [68], multivariate curve resolution [69] and non-negative matrix factorization [70, 71]. When observations are characterized using two or more multivariate sets, or blocks, of variables, the corresponding methods fall under the scope of multiblock, or multiway, data analysis [13, 14].

Although linear decomposition-based models benefit from straightforward implementation, making them attractive also for large-scale datasets [7], they are limited in terms of the complexity they can model. There are many methods available for non-linear data analysis, including formulating common decomposition-based models as kernel methods [72]. Recently, methods for non-linear dimension reduction common in the machine learning community have found use in chemometric applications as well, for instance t-distributed stochastic neighbor embeddings [73, 74]. Since non-linear models can capture more complex relationships, they may provide a denser summarization of observations than linear models. However, this comes at the price of reduced interpretability, since the relationship between observed trends and the original variables is modelled in a more complex way than by linear combinations. All in all, linear model or not, MVDA describes the underlying trends in complex, multivariate data and shows that the whole is greater than the sum of its parts.

2.3.3. Predictive Modelling

Whereas DOE serves to generate meaningful data and MVDA to create new knowledge from data, predictive modelling serves to put that knowledge to use. To a large extent, predictive modelling in chemometrics means application of machine learning algorithms, which were covered in more depth in section 2.2. While linear latent variable models based on PLS are the go-to methods in many chemometric applications, we have seen adoption of ANN-based methods in recent years [10, 11]. Many chemometric prediction problems fall under multivariate calibration [62], which aims to predict properties of interest for a chemical system based on multivariate measurements, such as infrared, Raman or nuclear magnetic resonance spectroscopy. Another chemometric application where predictive models are used is so-called quantitative structure–activity relationship (QSAR) modelling, where biological activity is modelled as a function of chemical structure [75].


In any application where predictive modelling is used, not only chemometrics, there are certain pitfalls to avoid, a major one being how to detect and avoid overfitting. Any predictive model needs to balance model complexity, the number of parameters in equation 2.1, to find a model that is expressive enough to solve the problem efficiently but not so expressive that it starts to model spurious variation. An underfitted model fails to learn the underlying function, for instance a straight line modelling a curved relationship, whereas an overfitted model is built on spurious variation not related to the underlying relationship and fails to generalize to new data. To monitor for overfitting, different types of cross-validation are common in all fields, meaning that the model is fitted on one part of the dataset and evaluated on another. For small datasets, so-called k-fold cross-validation is widely used, where several models are fitted and evaluated on different subsets of the complete dataset in a round-robin fashion, reporting the average performance. For large datasets it is far more common to use a fixed split into training and one or more validation sets. How to split the data is highly application dependent and must ensure a realistic performance evaluation. In time series, for instance, observations very close in time are often highly correlated and easy to predict from their neighbours, meaning that splitting observations randomly will likely give overly optimistic performance evaluations. In such situations, it is better to split at a certain time point and use the early period for training and the later period for evaluation, or, in the case of batch-wise data, to split out complete batches for validation. What metrics to use to evaluate models is problem-dependent and there is a myriad of options available that will not be listed here. The metrics may be given by loss functions (section 2.2.1) or other metrics suitable for the problem at hand, such as goodness-of-fit/prediction (R²/Q²) for regression or the percentage of correct predictions (accuracy) for classification.
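To illustrate the two splitting strategies, here is a short sketch using scikit-learn's KFold and TimeSeriesSplit (the data are a stand-in, not from any specific application):

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)  # stand-in for time-ordered data

# Round-robin k-fold: appropriate for small, exchangeable observations.
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    pass  # fit on train_idx, evaluate on val_idx, average the scores

# Time-based split: training folds always precede validation folds,
# avoiding the optimistic bias of random splits on correlated series.
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
    assert train_idx.max() < val_idx.min()
```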

When a cross-validation strategy has been selected for the problem at hand, there are several ways to control model expressivity. Firstly, the number of parameters determines the complexity of the relationships the model can approximate. By adding polynomial features to an MLR model, more components to a PLS model, more decision trees to a random forest or more layers to a deep ANN, the approximation becomes more complex. By monitoring the difference in predictive performance between the training and validation sets, we can find the complexity sweet spot. Another widely used approach is regularization, where the model complexity is penalized as part of model fitting. One example is MLR models, where a symptom of overfitting is that parameters grow in magnitude. The model can then be regularized by adding a penalty on large weights to its loss function, for instance on their squared size (ridge regression [76]) or on their absolute size (LASSO [77]). How to regularize depends on which model is used; for instance, so-called dropout is simple and popular in deep learning and works by ignoring random sets of nodes for each update during training [78].
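A small sketch of weight-penalty regularization using scikit-learn's Ridge and Lasso on synthetic data (the data and penalty strengths are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 20))              # more variables than is comfortable
y = X[:, 0] + 0.1 * rng.normal(size=40)    # only one variable truly matters

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)         # penalizes squared weight size
lasso = Lasso(alpha=0.1).fit(X, y)         # penalizes absolute weight size

# The penalties shrink the coefficients; LASSO drives many exactly to zero.
print(np.abs(ols.coef_).sum(), np.abs(ridge.coef_).sum(), np.abs(lasso.coef_).sum())
```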

To summarize, machine learning models are already heavily used in chemometrics for predictive modelling. Limited to neither machine learning nor chemometrics, there are certain challenges to overcome for successful prediction. Regardless of what problem is studied, predictive models should be appropriately validated, but exactly what is appropriate depends on the problem at hand.


2.3.4. Model Diagnostics

Given any predictive model, a key question to ask is how much we can trust it. This is important in all problem domains where predictive models are used and is by no means unique to chemometrics. However, chemometrics has a long tradition of using quantitative metrics to ensure that the model is applicable to the new observations it is intended to be used on [62]. Note that quantifying to what extent an individual prediction can be trusted is complementary to how well predictions can be trusted on average, as assessed by the model validation described above.

There are two main reasons why predictions cannot be trusted: outliers and distribution shift. Outliers are rare observations that differ significantly from expected data. Outlier detection, sometimes called anomaly detection, then uses different types of metrics to detect when an observation is different enough and flag it for further action [79]. Some examples of how to perform outlier detection include quantifying the density between observations in observation space and detecting outliers as observations in unexpectedly low-density regions [80], or using the size of the reconstruction residual of the observation under a model of the training set observations [62]. The latter is particularly common in chemometrics, where the training dataset is typically modelled using linear latent variable methods. For instance, when using PLS, an observation is approximated as $x \approx tW^T$, where $t = xW$ (simplified by ignoring calculation of the loadings of the non-deflated original variables). The reconstruction residual sum-of-squares (RSS):

$$\text{RSS} = \sum_{i=1}^{p} \left( x_i - (tW^T)_i \right)^2 \tag{2.12}$$

then provides a straightforward metric of how distant the observation is from the hyperplane spanned by the columns of W [81].
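As a minimal sketch of equation 2.12, the routine below computes the RSS of new observations relative to a PCA approximation of the training set (PCA loadings stand in for the PLS weight matrix W here; the function name is ours):

```python
import numpy as np

def rss_to_model(X_train, X_new, k):
    """Reconstruction residual sum-of-squares (equation 2.12) of new
    observations relative to a k-component PCA model of the training
    set. Large RSS flags observations far from the model hyperplane."""
    mu = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    W = Vt[:k].T                       # (p, k) projection basis
    Xc = X_new - mu
    resid = Xc - Xc @ W @ W.T          # x - t W^T with t = x W
    return (resid ** 2).sum(axis=1)

# Observations far outside the training variation receive large RSS values.
```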

Distribution shift, or dataset shift, on the other hand, refers to when the distribution the underlying observations are drawn from changes over time, which is common in real-world systems. When the underlying data changes, the model becomes obsolete and is no longer trustworthy. To detect when distribution shifts occur, statistics of observations encountered after training time can be recorded and compared to the training set distribution, for instance by simply performing two-sample tests of the null hypothesis that the prediction-time and training-set distributions are equal [82].
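A minimal sketch of such monitoring, using SciPy's two-sample Kolmogorov-Smirnov test variable by variable (the threshold and function name are illustrative choices):

```python
import numpy as np
from scipy.stats import ks_2samp

def drift_alarm(train_col, live_col, alpha=0.01):
    """Two-sample Kolmogorov-Smirnov test of the null hypothesis that
    training-time and prediction-time values share a distribution."""
    stat, p_value = ks_2samp(train_col, live_col)
    return p_value < alpha  # True signals a possible distribution shift

rng = np.random.default_rng(2)
print(drift_alarm(rng.normal(0, 1, 500), rng.normal(0, 1, 500)))    # False
print(drift_alarm(rng.normal(0, 1, 500), rng.normal(0.5, 1, 500)))  # True
```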

3 | Results

This chapter summarizes the contributions of this thesis and discusses the findings of each paper in relation to the aims formulated in section 1.1.

Jumping straight to the conclusion of this thesis: there is plenty of room for synergies between chemometrics and machine learning. Using machine learning algorithms to fit predictive models is commonplace in chemometrics and has been so for many years. Historically, and still today, linear models have been used extensively with great success. These linear models have mostly centered around latent variable methods such as PCA and PLS and their variants. Other methods, including ANNs, are already in use and are increasingly adopted as data quantities and expectations on predictive performance grow. Even though predictive performance is relevant to study, this thesis serves to describe ways in which we can nuance the relationship between the fields of machine learning and chemometrics beyond predictive modelling, and to show that knowledge in either field can be used to benefit the other.

As described in section 2.3, we refer to chemometrics as a four-step process of data generation, data analysis, predictive modelling and model diagnostics (figure 2.6), and to machine learning as the applied workflow of fitting performant predictive models given raw data (figure 2.2). The papers included in this thesis then serve as examples of how each of the fields may improve the other in aspects beyond providing yet another machine learning algorithm (summarized in figure 3.1). The included examples by no means provide an exhaustive list of the possibilities but will hopefully serve as inspiration for future work.

We start by exploring how MVDA, combined with feature engineering techniques borrowed from machine learning, enables us to study chemistry from a new dimension, namely in the form of text. Few would argue against text being a source of chemical, and any other type of, knowledge, but systematic analysis of large text collections is rarely seen as a task for chemometrics. In Paper I we demonstrate how the combination of structured MVDA and interpretable feature extraction allows us to quickly gain new knowledge on trends and clusters from collections of hundreds to thousands of text documents.

We then turn to the problem of optimizing large-scale data processing pipelines, which is common in the computational sciences. We know that how we configure each step in the pipeline influences the end results, but depending on the complexity, finding the best configuration may be challenging. Originally developed for manual experimentation, DOE provides tools to find optimal solutions in multi-factor problems. In Paper II we develop an algorithm implementing DOE to optimize multi-step data pipelines for genomic data processing.



Figure 3.1: Illustration of how each paper in this thesis contributes to nuancing the relationship between the fields of machine learning and chemometrics, which today revolves around using machine learning algorithms for predictive modelling. Paper I shows how data preprocessing common in machine learning can broaden the applicability of MVDA. Paper II then illustrates how DOE methods can be used to optimize data pipelines. Paper III draws inspiration from MVDA to provide a new method to monitor ANNs. Finally, Paper IV applies DOE principles to assure quality in large-scale datasets.

An important aspect of the linear latent variable methods prominent in chemometrics is that they provide a model of expected variation based on training data. Comparing new observations to these models allows us to easily determine whether they fall outside the expected variation and should be viewed as outliers. By definition, we lack evidence that the model generalizes to outlier observations, meaning that outlier detection gives us a clear indication of whether we should trust the prediction at all. Inspired by this property of linear latent variable methods, we describe a new method in Paper III that gives ANNs an equivalent capability.

Lastly, we turn to the core of modern machine learning, namely the collection, annotation and quality control of large-scale datasets. Collecting and annotating a large dataset of images is complex and expensive, both economically and in terms of labour. The pragmatism and transparency of chemometrics has made it widely used for industrial process quality work. In Paper IV we demonstrate a case study in how we can design large-scale image datasets to ensure quality from the start.


3.1. Summary of Paper Contributions

Table 3.1 provides a high-level overview of my contributions to each paper. All work was performed under the supervision of Prof. Johan Trygg and co-supervisor Dr. Olivier Cloarec, who acted as advisors and provided feedback on the research described in this thesis.

Table 3.1: Summary of my contributions for each paper included in this thesis.

Paper I: i) Formulated research question together with co-authors; ii) Collected and curated data; iii) Implemented algorithms and ran all experiments; iv) Wrote the paper.

Paper II: i) Implemented an existing DOE-based algorithm; ii) Extended the algorithm with new functionality for screening based on generalized subset designs and automatic regression; iii) Wrote the algorithm description in the paper.

Paper III: i) Formulated research question together with supervisor; ii) Developed the algorithm; iii) Implemented the algorithm and ran all experiments; iv) Wrote the paper.

Paper IV: i) Formulated research question; ii) Collected and curated data together with co-authors; iii) Developed processes for annotation, including DOE, and quality assurance together with co-authors; iv) Oversaw and provided feedback on all experiments; v) Performed MVDA on resulting data; vi) Designed and implemented benchmarks together with co-authors; vii) Wrote the paper together with co-authors.


3.2. Paper I: MVDA for Patent Analytics

Patents, which are sometimes overlooked in science, provide a rich source of technological and scientific knowledge, since much information is disclosed in patents only. Analyzing patent collections provides information on emerging technological trends, prior art for assessment of new innovations, technical inspiration for problem solving in R&D, the legal landscape of a new field's market for a company, technology gaps for national policy makers, and much more. Due to the quickly growing number of patents available worldwide, the task of finding valuable information is growing increasingly challenging. Today, patent offices receive hundreds of thousands of applications per year [83], making algorithms and data mining methods for quickly analyzing vast patent collections crucial. Machine learning algorithms are increasingly used to aid patent analysis by various means, such as technology categorization, prediction of future value and so on [84, 85]. Due to the business-critical nature of patent analytics, and because they are never 100 % correct, fully automated models only partially relieve these challenges. Since investigating patent content is commonly done by legally trained experts, there is a need for data-driven tools that accelerate their work, even though full automation will remain difficult in the short term.

Aim Text analysis, machine learning and data analysis are all widely used in patent analytics. However, they are often limited to machine learning-based predictive models or simple univariate analysis of patent metadata. On the other hand, the structured approach to analyzing complex datasets provided by MVDA has proven insightful in many other domains, but it has not yet been fully used in text-based analysis due to disparate research communities. This study aimed to use common text preprocessing methods borrowed from the machine learning community and analyze patent collections using a structured MVDA approach to explore its value for patent analytics.

Results We propose a workflow based on MVDA of a data matrix formed by preprocessing a collection of patent documents into bags-of-words subject to noun filtering and term frequency–inverse document frequency (tf-idf) transformation [86]. We demonstrate how the workflow can be used as a tool to interrogate large collections of more than ten thousand documents (Paper I - Figure 1), as well as a tool for more targeted analysis of smaller collections (Paper I - Figures 2-4). Thanks to the interpretability of variable influence provided by linear latent variable loadings, we can easily draw conclusions on the general technology trends observed in the interrogated collections.
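The spirit of the workflow can be sketched as below; this is not the Paper I implementation (noun filtering is omitted, the documents are stand-ins, and TruncatedSVD plays the role of the PCA step on the sparse term matrix):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["method for battery electrode coating",      # stand-ins for
        "battery cell electrode active materials",   # patent texts
        "image sensor pixel readout circuit"]

# Bag-of-words with tf-idf weighting (the noun filtering step is omitted).
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

# SVD-based decomposition: scores map the documents, loadings tie
# the components back to individual words for interpretation.
svd = TruncatedSVD(n_components=2, random_state=0)
scores = svd.fit_transform(X)
words = tfidf.get_feature_names_out()
print(words[abs(svd.components_[0]).argsort()[::-1][:5]])  # top words
```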

Discussion Neither preprocessing patents into bags-of-words nor applying machine learning to patent analysis is novel in itself. What our workflow demonstrates is that data-driven models, such as PCA, are useful for much more than predictions, even in the case of large patent collections. The structured approach to working with high-dimensional datasets provided by MVDA lends itself well as a tool to support domain experts, even for non-chemical use cases. Additionally, the bag-of-words representation is itself highly interpretable when combined with loading-based model interpretation, where strong loadings are directly interpreted as frequent use of the words in question.


Figure 3.2: Schematic illustration of the MVDA-based workflow for patent text analytics proposed in Paper I.

For predictive modelling, the models we use are far from state-of-the-art, in patent analytics as well as in text analytics in general. In most text-based prediction problems, large ANNs based on the transformer architecture and pretrained for language modelling on massive text datasets [16] are considered state-of-the-art. Although offering a certain degree of interpretability [87], these types of models are still far from offering the intuitive and straightforward explanations of what they see and why that we can obtain using linear latent variable methods. With that said, development of interpretation and explainability methods for complex non-linear models is making great progress [88, 89], and there is great interest in ”opening the black box” of complex models. Since data are growing rapidly in both quantity and complexity, within chemometrics and virtually all other fields, more complex data-driven models are needed to capture said complexity. We will, however, need to take care not to sacrifice the interpretability of today's methods.

Future perspective As we demonstrated in a case study in patent analytics, MVDA as embraced within chemometrics is useful beyond chemical applications. By carefully choosing preprocessing and representations, even unstructured data, such as text, can benefit from structured and interpretable workflows. Chemometric approaches are already well established for other complex data, such as omics data [90], and by borrowing preprocessing tools from the machine learning community they may well prove useful beyond chemical applications.


3.3. Paper II: DOE for Data Processing Pipelines

Bioinformatics software often requires configuration of a large number of parameters to work properly. Parameters may be quantitative as well as qualitative and are often non-obvious in how they relate to the end results, making configuration far from trivial. Considering that software tools are often used in multi-step pipelines, these difficulties are further exacerbated. Making it even more challenging, certain types of instruments, such as DNA sequencing platforms [91], show subtle differences in the underlying data depending on which platform is used, which needs to be taken into account.

Aim Often, parameter settings are selected based on personal or peer experience obtained in a trial-and-error fashion, or by simply retaining the default values, risking suboptimal results. Trying all possible configurations, so-called brute-force search, quickly becomes infeasible. A more efficient alternative, selecting a limited number of combinations at random in a so-called random search [92], is common within machine learning but leaves the results to chance. DOE offers a systematic and efficient approach to experimentation in general, and this study aims to explore how DOE can be used to select parameter configurations for complex pipelines.

Results We present a pipeline optimization algorithm based on DOE to optimize bioinformatic data processing pipelines (Paper II - Figure 1), supporting quantitative and qualitative factors as well as multiple-response optimization. The algorithm is divided into two phases. Firstly, generalized subset designs [64] are used to span a large configuration parameter space and find an approximate optimum and preferred settings for the qualitative factors. Secondly, response surface designs, regression and numeric optimization are used to fine-tune the quantitative factors around the approximate optimum. We demonstrated the algorithm in four bioinformatic case studies (Paper II - Tables 1-5) and compared it to brute-force search for validation.
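As a rough sketch of the second phase, one can fit a quadratic response surface to runs placed around the approximate optimum and optimize it numerically; the design, responses and helper below are toy stand-ins, not the algorithm from Paper II:

```python
import numpy as np
from scipy.optimize import minimize

def quad_features(x):
    """Design-matrix row for a two-factor quadratic response surface."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

# Toy runs around the approximate optimum found in the screening phase.
runs = np.array([[-1, -1], [1, -1], [-1, 1], [1, 1],
                 [0, 0], [-1, 0], [1, 0], [0, -1], [0, 1]], float)
y = np.array([2.1, 3.0, 2.4, 3.2, 3.6, 2.9, 3.4, 3.1, 3.3])

F = np.array([quad_features(r) for r in runs])
beta, *_ = np.linalg.lstsq(F, y, rcond=None)   # least-squares surface fit

# Numerically maximize the fitted surface within the explored region.
res = minimize(lambda x: -quad_features(x) @ beta, x0=np.zeros(2),
               bounds=[(-1, 1), (-1, 1)])
print(res.x)  # fine-tuned settings for the quantitative factors
```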

Discussion Principled search for optimal factor combinations is a common problem in many disciplines. The use of DOE in chemistry is well proven to reduce the time and labor cost of experimentation when optimizing complex processes. We demonstrate that this transfers well to reducing cost for large-scale computational pipelines too. Unfortunately, the software implementation was originally developed for in-house use cases and published later, which made it not directly compatible with standard pipeline-running tools, such as Snakemake [93], limiting its ease of use.

Future perspective With ever-growing computational pipelines, the problems that we face in bioinformatics today will become even more prevalent. The systematic approach provided by DOE lends itself naturally to such problems, something we have already seen in applied machine learning [94]. Promoting and developing open-source and easy-to-use software packages would help establish DOE as a solution to a common problem for computational scientists.


3.4. Paper III: Outlier Detection in ANNs

Thanks to the powerful transformations learned by deep ANNs, deep learning is increasingly applied throughout society and impacts many aspects of modern life. Due to their strong predictive performance on many tasks, ANNs are increasingly used in safety-critical systems as well, with vision systems for autonomous cars as a well-known example. However, all machine learning models have undefined behaviour when input data differ from training data, which means that there are no guarantees of how they will behave for outliers.

Adopting deep learning in increasingly complex and possibly safety-critical systems makes it crucial to know not only whether the model's predictions are accurate, but also whether the model should predict at all for a given observation. Equipping our models with the ability to detect outliers, also called out-of-distribution (OOD) examples, would allow us to implement systems that can fall back to safe behaviour when encountering outliers, which may minimize the negative consequences of faulty predictions. A tragic real-life example of the possible consequences is the fatal collision between an autonomous vehicle and a pedestrian in March 2018 [95]. In the moments before the crash, the car's vision system erratically classified the pedestrian as different object classes moving at different speeds in different frames, which is a possible symptom of OOD examples. To avoid similar accidents in the future, it is important to understand the limits of our models' learned representations in order to detect when observations are not recognized, so that autonomous decision making based on deep learning can be improved.

Aim In chemometrics, the linear latent variable models commonly used for predictive modelling allow for outlier detection in a straightforward manner, for instance by monitoring the reconstruction residual sum of squares (RSS) (see equation 2.12). An observation with unexpectedly large RSS is deemed ”too far” from the model of the training set. If the observation is too far away, we have no evidence that we can trust the model's prediction for that observation. This study set out to explore what options are available to let deep ANNs detect OOD observations and to develop a new method to fill in the gaps.

Results We view ANN activations as a multivariate description of an observation and draw inspiration from MVDA to develop a new metric for outlier detection. We propose DIstance to Modelled Embedding (DIME) (Paper III - Figure 1 & Equation 4), which provides an analogue to RSS-based outlier detection in linear latent variable models. DIME works by approximating the ANN activations using a high-rank PCA model and reporting the RSS of new observations' activations relative to the PCA approximation. We demonstrate that DIME is able to capture complex cases of outliers in computer vision and natural language processing (Paper III - Figures 2 & 3). Compared to state-of-the-art methods, DIME provides both high detection performance and low computational overhead.
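The core idea can be sketched in a few lines (a simplification of the description above, not the exact Paper III implementation; class and method names are ours):

```python
import numpy as np

class DimeMonitor:
    """Sketch of the DIME idea: approximate the training-set embeddings
    with a high-rank PCA hyperplane and flag new embeddings whose
    reconstruction residuals exceed a training-set percentile."""

    def fit(self, E_train, k=100, percentile=99.9):
        self.mu = E_train.mean(axis=0)
        _, _, Vt = np.linalg.svd(E_train - self.mu, full_matrices=False)
        self.V = Vt[:k].T                      # (d, k) PCA basis
        self.cutoff = np.percentile(self._rss(E_train), percentile)
        return self

    def _rss(self, E):
        Ec = E - self.mu
        resid = Ec - Ec @ self.V @ self.V.T    # residual off the hyperplane
        return (resid ** 2).sum(axis=1)

    def is_outlier(self, E_new):
        return self._rss(E_new) > self.cutoff

# Usage: E_train/E_new hold network activations for training/new data.
# monitor = DimeMonitor().fit(E_train); flags = monitor.is_outlier(E_new)
```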

In addition to working well, DIME also has relatively few configuration parameters, so-called hyperparameters, to tune. The two main parameters are which layer's activations to use and the PCA rank. In terms of which layer to use, we have found it to correspond well to what deep learning practitioners would use for transfer learning (for instance, a late but not the penultimate layer in a CNN). Regarding PCA rank, we found that, in contrast to most chemometric applications where we typically limit the analysis to relatively few principal components, DIME did not work well until the explained variance was very high (around 99 %) (Paper III - Figure 4).

Discussion DIME is based on a very simple idea, namely that the intermediate representations of a deep ANN provide a useful, albeit abstract, representation of any data, which we then simply approximate using a hyperplane. To our surprise, we found no previous methods working along these principles, but quickly found the approach to work well. DIME thus serves as a prime example of how an approach long used within chemometrics can provide inspiration for a solution to an important problem within machine learning.

Outlier detection is an important but somewhat understudied field within deep learning. A common perception among deep learning practitioners is that the notion of outliers should be encoded into deep ANNs and easily caught by either inspecting the model's confidence [96] or adding an extra ”catch-all” output. DIME, alongside other methods [97], demonstrates that reality is not so simple and that dedicated outlier detection is necessary for a more complete solution.

In addition to the technical perspective, this study also illustrates an honest attempt to do good science in machine learning. In the past year, it has come to light that poor use of basic statistics and substandard baselines are widespread when comparing machine learning methods [98, 99]. These practices make it easy to present any new method as outperforming the prior alternatives, whereas the difference may be due to chance, or non-existent had the baselines been properly tuned. When evaluating DIME, we tuned the baseline methods for each experiment and only used their best values to avoid crippling the baselines. We also performed a meta-analysis of all our experimental results, finding that DIME performed slightly better than all alternatives on average, although the difference was not always statistically significant (Paper III - Table 1). We believe that this approach offers a fairer comparison and a more realistic portrayal of the performance of our method, and we hope to set an example of how experimentation in machine learning can be improved.

Future perspective Although our experiments show convincing results, outlier detection is not yet a solved problem. Our experiments are limited to relatively small-scale settings, roughly equivalent in size to what most practitioners would use for transfer learning, as are most benchmarks in the field [100]. Datasets are growing larger and problems more complex, making outlier detection ever more important while at the same time more difficult to study. We have already seen deep ANNs with hundreds of billions of parameters trained on petabytes of data [101], making the gap between the state-of-the-art in model training and the state-of-the-art in outlier detection incomprehensibly large. We conclude that as data-driven models become ever more present throughout society, they need a basic degree of common sense, admitting that they are sometimes uncertain, in order to be trustworthy.


3.4.1. Additional Case Studies

When developing DIME, we performed several experiments that did not end up in Paper III due to space constraints. Some of these case studies are summarized here to further demonstrate its applicability.

Predictive Maintenance Predictive maintenance techniques aim to automatically determine when a piece of equipment requires maintenance in order to avoid breakage. Automatically estimating the degradation of the equipment allows for significant cost savings compared to routine or time-based maintenance. When following a fixed schedule, there is a risk that maintenance is performed earlier than necessary, leading to excess expenses. There is also a risk that the equipment fails earlier than expected, which may lead to catastrophic failure or process stops. To demonstrate the applicability of DIME in predictive maintenance, we use the Turbofan Engine Degradation dataset provided by the Prognostics CoE at NASA Ames [102]. The dataset consists of simulated turbofan engines running to failure under different operational conditions and fault modes. The engines were monitored over time using 21 sensors, and 3 control settings were recorded, for 100 engines in the training dataset and 100 in the test dataset. The challenge is to predict failure 15 sensor cycles before it happens, to avoid catastrophic failure.

We trained a three-layer RNN based on Long Short-Term Memory (LSTM) blocks [53] to predict failure ahead of time. To train the LSTM model, we used sliding windows 50 cycles long as input and a binary response indicating whether the last cycle of the sliding window is within 15 cycles of failure. We used the 21 sensor outputs, the 3 control settings and the current cycle count since the beginning as variables for prediction. We scaled all variables to the range 0-1 and used the training-data scaling parameters to scale the test data. To monitor training progression, we used 10 % of the training dataset for validation. We trained the model using the Adam optimizer [103] until the binary cross-entropy for the validation set stopped improving (eight epochs). The resulting model achieved a test-set F1 score of 94.1 % at sliding-window classification.
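A sketch of the sliding-window construction described above (variable names and the toy engine are ours; a label marks windows whose last cycle is within 15 cycles of failure):

```python
import numpy as np

def make_windows(engine_data, failure_cycle, window=50, horizon=15):
    """Cut one engine's (cycles x variables) series into overlapping
    windows; the label is 1 when the window's last cycle falls within
    `horizon` cycles of the engine's failure."""
    X, y = [], []
    for end in range(window, engine_data.shape[0] + 1):
        X.append(engine_data[end - window:end])
        y.append(int(failure_cycle - end <= horizon))
    return np.stack(X), np.array(y)

# Toy engine: 120 recorded cycles, 25 variables (21 sensors,
# 3 control settings and the cycle counter), failing at cycle 120.
engine = np.random.rand(120, 25)
X, y = make_windows(engine, failure_cycle=120)
print(X.shape, int(y.sum()))  # (71, 50, 25); the final windows are positive
```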

To simulate prediction-time outliers where a sensor stops working, we randomly chose half of the test-set engines to serve as outliers. For each of these engines, we selected one sensor at random and set its output to zero starting at the middle of the full time sequence. To provide features for DIME, we collected the activations from the first two LSTM layers of the predictive maintenance model and averaged them across time to summarize each sliding window. For the last LSTM layer, we simply used the activations of the last cycle in the time sequence. The activations from these three layers were then concatenated into a single matrix.

We used DIME for outlier detection, using a 100-component PCA model fitted on all the training data, and used the 99.9th percentile of the training-set DIME values as the cutoff for outliers. We then predicted failure status and calculated DIME for the outlier and remaining test-set LSTM features for all sliding windows, successfully distinguishing between outliers and normal variation (precision = recall = 1 for this experiment). Following sensor breakage, the LSTM model predicts that many of the engines are going to break, but DIME captures all such cases as outliers. This experiment shows that by using DIME we are able to distinguish between predictions that can be trusted and those that should not be. It also shows that simply inspecting model confidence is not always reliable; in this case the predictions behaved erratically with very high confidence (figure 3.4 d). Finally, it shows that DIME is applicable not only to feed-forward networks but to recurrent networks as well.

Figure 3.4: Qualitative test-set results of outlier detection in the predictive maintenance case study. The left-hand side (a, c & e) shows the test set without sensor breakage aligned at the middle time point, and the right-hand side (b, d & f) shows sequences where one random sensor breaks, aligned at sensor breakage (the vertical dashed line indicates when the sensor broke). The top row (a & b) shows DIME (denoted Res-SS here) for outlier detection; the horizontal dashed line indicates the 99.9th percentile based on training-set distances. The middle row (c & d) shows the model's prediction of whether the turbine is going to break within 15 cycles, where 1 means that the engine is predicted to fail. The bottom row (e & f) shows the labels indicating whether the turbine is going to break within 15 cycles, where 1 means that the engine is going to fail.


To explore how the number of components and the residual sum-of-squares influence the results, we varied the number of components from 1 to 150 and report the precision and recall for outlier detection (see figure 3.3). Components with very small explained variation are required to achieve high precision and recall (99.99 % cumulative R² is reached after only 36 components). This implies that the variation that differs between outliers and normal test-set sequences consists of small nuances that require very small PCA components to be captured. We also observe overfitting, with reduced precision, when the number of components becomes too large, confirming results from other experiments that the PCA rank should be high but not full. Similar trends are observed for other DIME cutoffs (not shown).

Figure 3.3: DIME precision and recall for classifying outliers in the predictive maintenance case study.

Speech Commands To show how outlier detection can be used to distinguish between short speech commands and background sounds, we use the Speech Commands dataset [104]. The Speech Commands dataset consists of 105 000 short utterances of 35 words recorded at a 16 kHz sampling rate. The words include the digits zero to nine and the command words yes, no, up, down, left, right, on, off, stop, go, backward, forward, follow, learn and visual. The dataset also includes a set of arbitrary words: bed, bird, cat, dog, happy, house, marvin, sheila, tree and wow. In addition to the short utterances, the dataset includes background noise, including white and pink noise, noise from an exercise bike, a man sounding like a cat, the sound of a running tap and the sound of a person doing the dishes. In our experiment, we train a classification model on all utterances and use the background noise as OOD examples.

We transform the utterances, and the background noise, into 64 Mel-frequency cepstral coefficients [105], using a frame length of 2048 samples and a frame stride of 512 samples. We then trained a CNN classifier on all 35 words. The CNN consisted of three convolutional blocks with a Conv-Conv-2x2 MaxPool-SpatialDropout structure. The convolutional layers of each block had 32, 64 and 128 filters respectively, kernel size 3×3 and ReLU activation. All spatial dropout layers used 20 % dropout. The input was batch-normalized before the first convolutional block. After the last convolutional block, a final convolutional classifier layer using softmax activation was pooled using global average pooling. The model was trained for 20 epochs using stochastic gradient descent with 0.9 momentum. The initial learning rate was 0.1, which was reduced to 0.01 after 10 epochs and then further to 0.001 after another 5 epochs, achieving a top-1 test-set accuracy of 86 % across the 35 classes.
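The feature extraction step could look roughly as follows using librosa, assuming the frame length and stride stated above (the file path is a hypothetical example):

```python
import librosa

def utterance_to_mfcc(path):
    """Load a short Speech Commands clip and compute 64 Mel-frequency
    cepstral coefficients with a 2048-sample frame length and a
    512-sample stride, as described above."""
    signal, sr = librosa.load(path, sr=16000)   # resample to 16 kHz
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=64,
                                n_fft=2048, hop_length=512)
    return mfcc.T  # (frames, 64), ready as CNN input

# features = utterance_to_mfcc("speech_commands/yes/example.wav")
```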


To detect OOD examples, we compared DIME to prediction confidence, i.e. the maximal predicted probability $p_{max}$ [96], and to the Monte Carlo (MC) dropout entropy of predictions [106]. For DIME we used the globally average-pooled output from the last convolutional layer before the last spatial dropout, and a PCA model explaining 99.9 % of the embedding variation. For MC dropout we used the Shannon entropy of predictions from 30 Monte Carlo samples. Since the datasets are unbalanced (around 60 outlier examples compared to thousands of test-set examples), the results were compared using areas under precision-recall curves (PR-AUC).

Table 3.2: Area under precision-recall curves (PR-AUC) from detection of OOD examples in the Speech Commands experiment.

                  p_max   DIME    Entropy
  White Noise     0.964   0.995   0.978
  Pink Noise      0.955   0.997   0.964
  Running Tap     0.036   0.998   0.175
  Dishwasher      0.621   0.994   0.692
  Cat Sounds      0.798   0.996   0.849
  Exercise Bike   0.816   0.995   0.947

DIME was able to distinguish all types of background noise reliably (see results in Table 3.2). The MC-dropout entropy and the baseline approach, $p_{max}$, detected four of the six background noises successfully (PR-AUC > 0.8). Interestingly, the model confidently predicted noise from a running tap as the word six or house, which may be because trailing sibilant s-sounds resemble a running tap. Beyond showing that DIME performs well in this example, this illustrates the importance of dedicated outlier detection methods compared to relying only on models' predicted confidence.


3.5. Paper IV: Quality Assurance in Large Data Projects

Light microscopic images provide a cheap and accessible data source for studying biological phenomena in a noninvasive manner. Acquired in a high-throughput manner and combined with well-established protocols for two-dimensional cell cultures, 2D microscopy allows detailed quantification of cell count, morphology, proliferation, migration and cellular interactions, making it an attractive tool for studying cell biology. To gain insight into the behaviour of individual cells, segmentation algorithms are necessary to detect and segment individual cells directly from images. Due to the low contrast and high object density of microscopic images, cell segmentation is an incredibly challenging computer vision problem. In the past decade, convolutional neural network (CNN)-based segmentation models have far outperformed their non-deep learning counterparts. However, deep CNNs require vast annotated datasets, and no suitable resource has been available for label-free cellular imaging.

Aim This study set out to address the gap of missing large-scale resources for training deep learning-based cell segmentation models on label-free 2D microscopic images. Annotating images for cell segmentation is complex and extraordinarily labour-intensive due to the lack of algorithms to do the job for us. To provide labour, large teams of non-expert annotators are used, which introduces a whole other dimension of complexity when it comes to guaranteeing data quality. For the purpose of this thesis, we focus on the aspect of how to plan and assure quality in large-scale, complex image annotation projects.

Results We present LIVECell, a high-quality microscopic image dataset that is the largest of its kind to date, comprising over 1.6 million manually outlined cells from a diverse set of cell morphologies and culture densities. To create LIVECell, we selected a morphologically diverse set of cell types that pose different challenges for cell segmentation (Paper IV - Table 1) and prepared 5240 images balanced over cell types and cell culture density. Annotating microscopic images is challenging and the annotator team is expected to learn the task over time, meaning that there is a risk of introducing time-dependent bias in annotation quality unless this is taken into account during planning. To avoid this risk, we divided the images into eight balanced batches using generalized subset designs (GSD), with cell type, timestamp and well as factors, and uploaded the batches sequentially for annotation (see illustration in figure 3.5). During annotation we performed extensive quality assurance at two levels, firstly by the annotation managers and secondly by an experienced cell biologist.
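As a simplified stand-in for the GSD-based batching (not the actual generalized subset design algorithm of [64]), the sketch below spreads every metadata combination evenly across batches:

```python
import pandas as pd

def balanced_batches(images, n_batches=8):
    """Assign images to batches so that every (cell type, density,
    well) combination is spread evenly across the batches. A simplified
    stand-in for the generalized subset designs used in Paper IV."""
    df = pd.DataFrame(images)
    df["batch"] = (
        df.groupby(["cell_type", "density", "well"]).cumcount() % n_batches
    )
    return df

# Toy metadata: 2 cell types x 2 densities x 4 wells x 8 replicates.
imgs = [{"cell_type": ct, "density": d, "well": w, "image_id": i}
        for i, (ct, d, w) in enumerate(
            (ct, d, w) for ct in "AB" for d in ("low", "high")
            for w in range(4) for _ in range(8))]
print(balanced_batches(imgs)["batch"].value_counts())  # 16 images per batch
```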

To verify the morphological diversity of LIVECell, we characterized each cell by a multivariate description of commonly used shape and morphology descriptors. Using PCA of image-wise averages, we then confirmed that LIVECell indeed spans a wide set of morphologies (Paper IV - Figure 1). To demonstrate the utility of LIVECell, we designed benchmarks measuring different aspects of cell segmentation performance and trained CNN-based segmentation models that were evaluated on each benchmark (Paper IV - Figures 3 & 4). Finally, we demonstrated the benefit of the scale by training models on subsets of varied size and evaluating them on the complete test set (Paper IV - Figure 5).



Figure 3.5: Illustration of the rationale behind uploading images for annotation in balanced batches as given by DOE. The left-hand side (a) illustrates how a generalized subset design is used to split images into batches balanced according to metadata (cell type, cell culture density, well). The right-hand side (b) illustrates how the annotators' skill increases over time; the batches are provided in sequence to minimize the risk of introducing systematic bias in annotation quality due to annotators learning over time.

Discussion To date, there is no family of machine learning methods close to the performance of deep ANNs trained on large datasets for unstructured data problems. This means that data generation is more important than ever to maximize the benefit of machine learning. For complex problems, such as cell segmentation in microscopic images, manual annotation is currently the most reliable method to create quality responses for raw data. The large scale, combined with the reliance on manual labour, means that there are strong economic incentives to maximize the benefit of the data sent for annotation.

We know that machine learning models, not only deep learning-based ones, benefit not only from dataset scale but also from diversity. By training on diverse datasets, the resulting models become more widely applicable. This has direct parallels to DOE, which aims to systematically maximize the diversity of datasets, although typically at small scale. Our study is directly inspired by DOE principles: we carefully selected the cell types to work with, ensured that the images were balanced over the conditions in question, and explicitly used DOE methods to split the datasets into batches to avoid introducing systematic bias in annotation quality.

To confirm the morphological diversity of LIVECell, we opted for a simple alternative, namely PCA (Paper IV - Figure 1). By using a transparent model for dimension reduction combined with interpretable morphology metrics, we could easily draw conclusions on the different morphologies represented in LIVECell. This is in contrast to the trend of using non-linear dimension reduction methods (example in [107] - Figure 2) that reduce any data to two dimensions at the cost of interpretability. We argue that the principle of simplicity also applies to data analysis, and that simpler, interpretable methods should be used as long as they do the job.


Future perspective In the past few years, there has been intense discussion of the bias and fairness of machine learning systems. For instance, it was demonstrated that commercial facial recognition systems displayed systematic bias with respect to both gender and skin color by performing significantly worse for female and dark-skinned faces [108]. Part of the reason why machine learning systems become biased is that most real-world datasets are biased, due to the simple fact that they are collected in our society and reflect the biases and unfairness therein. In the case of facial recognition, this discussion spurred the creation of new balanced datasets to make it possible to study the prevalence of systematic bias and how to avoid replicating it [109].

This thesis is not a study of bias and ethics in large-scale machine learning, and the problem is not solved simply by balancing datasets. However, the discussion of biased datasets and the deliberate efforts to create more balanced versions are related to the dataset design that we explored in Paper IV. Instead of simply grabbing whatever data is available to train our machine learning models, we ought to think about how we design our datasets for the purpose they should fulfill. Without deliberate effort, real-world data is very rarely balanced with respect to the different factors influencing the system, regardless of whether it comes from applicant statistics in historical job applications or statistics on equipment breakdown in industrial processes. Knowing this, and that machine learning systems simply get good at the data they train on, we can make the situation a bit better by thinking ahead, in an attempt to be aware of and account for the imbalances and biases we might face.

4 | Future Perspectives

While this thesis has shown several ways in which chemometrics and machine learning may benefit each other, there are still many open avenues to explore. Three out of four papers in this thesis describe how chemometric principles can be used in a machine learning context, whereas this chapter outlines two ways of going in the other direction.

4.1. Separating Sources of Variation

One of the most widely used variants of PLS is orthogonal partial least squares regression (O-PLS) [110]. From a prediction perspective, O-PLS is equivalent to PLS. The main difference is that O-PLS explicitly separates predictive latent variables from other systematic variation in order to simplify interpretation. That is, the O-PLS model of X is:

$$X = TP^T + T_oP_o^T + E \tag{4.1}$$

with n × k matrices of predictive scores T and orthogonal scores T_o, p × k matrices of predictive loadings P and orthogonal loadings P_o, and the n × p residual matrix E. Exploring the different sources separately is then simply a matter of inspecting the pairs T/P and T_o/P_o separately from each other.

In linear models, this separation is fairly straightforward to accomplish, since we can define systematic variation unrelated to the response as orthogonal to y and separate the two types of variation using linear algebra. The biggest advantage of separating different sources of variation is simplified interpretation. Inspecting the variation related to the response, unconfounded with other systematic variation, is less complex than interpreting all variation at once. Even though prediction performance is equivalent, better possibilities for interpretation allow for better data-driven decision-making.

In the general case, this type of separation is conceptually related to disentangled representation learning. Even for complex and non-linear relationships, there is a hypothesis that real-world data is a mixture of several underlying phenomena that may be separated, or disentangled, from each other. In recent years there has been great interest in identifying these underlying relationships, since separating sources of variation may increase both interpretability and usability, as already demonstrated by O-PLS in the linear case. One example of disentangled representation learning is provided by attempts to separate the semantic content of text, meaning the underlying message, from the style, meaning the specific way it was written [111]. In facial recognition, there are methods for disentangled representation learning that separate the face of the person from the pose of the face, meaning which way the person is looking [112]. Going one step further, there is ongoing work to develop causal representation learning [89], which refers to machine learning algorithms that aim to identify the underlying cause-effect relationships. Understanding what variation causes, rather than just correlates with, other variation would provide even stronger possibilities for interpretation. Part of the reason why the AI community aims for disentanglement and causality is that separated sources would make it easier to develop autonomous agents. An autonomous agent working with disentangled or causal representations of its environment would be better predisposed to focus on the right information for the problem it tries to solve.

The transparency of linear latent variable methods has long benefited chemometrics. As in most disciplines, data-driven problems are getting more complex and expectations on predictive models are growing. While we may adopt more expressive models for increased predictive performance, this often comes at the cost of sacrificing transparency. Adopting recent developments in disentangled, or even causal, representations may provide an avenue to performant and understandable predictive models for future chemical problems.

4.2. Rethinking PLS Weights

Few would argue against the claim that PLS [9, 113] and its variants have been incredibly successful for many problems in chemometrics. Although successful, PLS has certain limitations, such as sensitivity to outliers and being prone to overfitting. Here we present an alternative formulation of how to calculate the PLS model, inspired by how we train other machine learning models such as ANNs.

We write the full PLS model as:

$$X = TP^T + E$$
$$Y = UC^T + F \tag{4.2}$$

with the n × k X-score matrix T, the n × k Y-score matrix U, the p × k X-loading matrix P, the q × k Y-weight matrix C, the n × p residual matrix E in X, and the n × q residual matrix F in Y, where T = U in case the model fits well.

During PLS model fitting, we also get the p × k PLS weight matrix W = [w1, w2, ..., wk].

The weight vectors, wi, are typically calculated sequentially, where the first weight vector pair is given by:

$$\{w_1, c_1\} = \underset{\|w_1\|=\|c_1\|=1}{\operatorname{argmax}} \; \operatorname{Cov}(t_1, u_1) \tag{4.3}$$

where the first X-score vector $t_1 = Xw_1$ is the projection of the observations onto the weights, the Y-score vector $u_1 = Yc_1$ is the corresponding response projection, and $\operatorname{Cov}(t_1, u_1): \mathbb{R}^n \times \mathbb{R}^n \rightarrow \mathbb{R}$:

$$\operatorname{Cov}(t_1, u_1) = \frac{1}{n-1} \left( t_1 - E[t_1] \right)^T \left( u_1 - E[u_1] \right) \tag{4.4}$$

denotes the empirical covariance between the score vectors. As mentioned in section 2.2.1, equation 4.3 is solved by the first eigenvector of the combined variance-covariance matrix $X^TYY^TX$, typically calculated using a variant of the iterative NIPALS algorithm [66]. The corresponding loading vector $p_1 = X^Tt/(t^Tt)$, where $t = Xw_1$, is then used to deflate the observation matrix as $X_1 = X - t_1p_1^T$.
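A minimal numpy sketch of this first component, assuming mean-centred X and Y (the function name is ours):

```python
import numpy as np

def first_pls_weights(X, Y):
    """First PLS weight pair via the dominant eigenvector of
    X^T Y Y^T X (equation 4.3), for mean-centred X and Y."""
    M = X.T @ Y @ Y.T @ X
    eigvals, eigvecs = np.linalg.eigh(M)   # M is symmetric
    w1 = eigvecs[:, -1]                    # eigenvector of largest eigenvalue
    t1 = X @ w1                            # X-scores
    c1 = Y.T @ t1
    c1 /= np.linalg.norm(c1)               # unit-norm Y-weights
    p1 = X.T @ t1 / (t1 @ t1)              # loadings used for deflation
    return w1, c1, p1
```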


[Figure 4.1 consists of three panels, from left to right: "Training set - Inner relation", "PLS (RMSE = 0.748)" and "L1-PLS (RMSE = 0.498)", with axes t[1] and y(pred) against y(obs).]

Figure 4.1: Toy-problem results illustrating the possibilities of the alternative formulation of PLS weights. From left to right: the true inner relation used to generate the data (note the outlier), regular PLS results on the validation set, and alternative PLS results on the validation set using the L1-norm as loss instead of covariance. Black lines indicate the identity relationship.

The next weight vector is then given by the same procedure, using the deflated matrix $X_1$ instead of the original X. Depending on which PLS variant is used, the response matrix Y may also be deflated.

We note that we may in fact view equation 4.3 as a loss function as described in section 2.2.1 and restate weight identification as optimization of any loss function:

$$\{w_1, c_1\} = \underset{\|w_1\|=\|c_1\|=1}{\operatorname{argmax}} \; L(X, Y, w_1, c_1) \tag{4.5}$$

Knowing that loss functions are powerful instruments for influencing the behaviour of machine learning models (section 2.2.1), being able to tune PLS behaviour in this way is tempting. As pointed out in relation to equation 2.9, the prediction by PLS according to equation 2.3 is equivalent to that of an ANN with a single hidden layer, meaning two weight matrices, and no activation functions or bias vectors. Knowing that we can readily train arbitrarily deep ANNs using the back-propagation algorithm and gradient descent optimization, we may use whatever loss function we like to find the weight vectors for our PLS model. A related formulation is provided by PCA, which is closely related to PLS and has already been rewritten in a similar fashion to be robust to extreme data corruption [114].

Thanks to the flexibility provided by auto-differentiation in modern software frameworks for deep learning [40, 41], implementing and exploring ideas such as this is straightforward. To demonstrate the capabilities of our alternative PLS formulation on a toy problem, we generate a small regression dataset of 50 observations governed by a single latent variable in the presence of a strong outlier. The data is generated by drawing scores, t_1, for 50 observations from a normal distribution and generating the response by multiplying the scores by two and adding normally distributed noise with half the standard deviation of the scores. A strong outlier is set manually in the inner relation. Loadings, p_1, are then generated by again drawing 500 elements from a normal distribution, scaled to unit norm, and are then used to generate the observation matrix $X = tp^T$. We also generated a separate validation set of 100 observations, without outliers, using the same underlying relation.

We then fit single-component PLS models both using the regular covariance and using the L1-norm of the prediction residuals:

$$\{w_1, c_1\} = \underset{\|w_1\|=\|c_1\|=1}{\operatorname{argmin}} \; \sum_{i=1}^{n} \left| y_i - (Xw_1c_1^T)_i \right| \tag{4.6}$$

as loss function instead of the covariance. The L1-norm is well known to be more robust to outliers than least-squares alternatives such as the mean squared error or covariance. The alternative PLS model is implemented in the Python-based deep learning framework PyTorch [41] and trained using the back-propagation algorithm and gradient descent (equation 2.8) for 500 iterations with learning rate λ = 10⁻² (taking far less than a second on a mid-end laptop). The two models are then compared using the root mean squared error (RMSE) on the validation set (see figure 4.1), showing that the PLS model trained using the L1 loss achieved substantially lower prediction error (RMSE = 0.5 compared to 0.75 for normal PLS). If the outlier is removed, both models perform equally well (RMSE = 0.5), confirming that the L1-based loss function indeed reduced the PLS model's sensitivity to outliers.
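A minimal PyTorch sketch of how such an L1-based weight pair can be fitted by gradient descent (a simplification under the stated settings, not the exact implementation used here; the unit-norm constraint is handled by renormalizing inside the loss):

```python
import torch

def fit_l1_pls_component(X, y, n_iter=500, lr=1e-2):
    """Find a single PLS weight pair {w, c} by minimizing the L1-norm
    of the prediction residuals (equation 4.6) with gradient descent.
    X: (n, p) tensor, y: (n, 1) tensor, both assumed mean-centred."""
    w = torch.randn(X.shape[1], 1, requires_grad=True)
    c = torch.randn(y.shape[1], 1, requires_grad=True)
    opt = torch.optim.SGD([w, c], lr=lr)
    for _ in range(n_iter):
        opt.zero_grad()
        w_n = w / w.norm()          # renormalize to satisfy ||w|| = 1
        c_n = c / c.norm()          # and ||c|| = 1
        t = X @ w_n                 # scores t = Xw
        loss = (y - t @ c_n.T).abs().sum()   # L1-norm of the residuals
        loss.backward()
        opt.step()
    with torch.no_grad():
        w_n = w / w.norm()
        t = X @ w_n
        p = X.T @ t / (t.T @ t)     # loadings p = X^T t / t^T t for deflation
    return w_n, p
```

Swapping the loss expression for the (negative) empirical covariance of equation 4.4 recovers regular PLS weights in the same framework.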

Although merely a toy problem, these results serve to illustrate that we can easily give PLS-based models new capabilities by letting the weight vectors optimize functions other than the covariance. Armed with whatever loss functions machine learning can offer, we may be able to express latent variables in ways different from what we are used to, something we leave for future research. We conclude, however, that formulating how PLS can be trained using ANN principles provides a rather neat way of wrapping up this thesis.

Acknowledgements

Over these years, many people have in one way or another contributed to making the writing of this thesis possible. Some of you I would like to thank in particular.

Many thanks to my supervisor Johan Trygg, who gave me this opportunity. I am immensely grateful that you have always believed in me and in what I do, and for all the help I have received. It has been quite a journey, and I think we both learned a lot along the way. I would also like to thank my co-supervisor Olivier Cloarec, who enjoys geeking out just as much as I do, for all stimulating discussions during these years. I would also like to thank my reference persons Mats Josefsson and Erik Elmroth for the feedback I have received.

I would also like to thank all colleagues and group members, both new and old, starting with the old gang in the CLiC corridor who welcomed me into the world of academia. Thank you, my dear office mate Tomas, for all the hours of discussions about everything under the sun, all the times I borrowed your hammer drill, for organizing the barbecue, and so on, but not for your loud headphones; Izabella for all invaluable advice, and for arranging all things big and small; Frida for welcoming me to the group as my mentor during my first summer; Matilda for your contagious optimism and for also enjoying never-ending discussions about boring grown-up matters such as parenting, politics and society; David for sharing all your experience; and Daniel for the good collaboration and for always being incurably pleasant no matter what happens.

Thanks to all you wonderful people at Sartorius, where I have worked for the past three years; you are far too many to mention the whole lot. In particular, I want to thank my constant brother-in-arms Christoffer, as well as you other team members, current and former, not already mentioned: Emil, Chris, Brandon, Nick, Martin, Robert, Anne and Jonas, for all the good times and great projects, and also my new dear co-workers Elsa, Frida and Alexander for the fantastic job you do. Also, special thanks to my other Sartorius colleagues Tim J, Gillian, and all the rest of you for the great work we do together. I would also like to thank Oscar for believing in my work and getting me on board. Thanks also to Henrik, Paul and Stefan, among others, for your constant curiosity and pleasant discussions, and to Erik for sharing your long experience, your tireless curiosity about new projects, and for always saying what you think. Thanks also to you after-work friends; you know who you are, and I hope we meet over a glass again soon. I hope nobody feels forgotten. I promise I have not forgotten anyone on purpose.

Thanks to the rest of you at the university, such as the brand-new group members Edvin, Mattias and Nikita; I hope we will work more together. Many thanks to Andreas and Hans for pleasant discussions and good advice. Thanks also to the rest of you I have had the privilege of sharing corridors and lunch breaks with over the years: Benny, Pär, Jocke, Kristina, Sebastian, Anna, Carl, Sergiu, Nabeel, Beatriz, Olena, and surely more whose names escape me. I would also like to thank the management and administration of the Department of Chemistry for keeping everything working over the years, for making it possible to print this thesis, and for making sure there was always coffee in the lunch room.

I would also like to thank all the other friends I have come to know over the years in BiT10, in parent groups, at the gym and everywhere else. Thanks also to Mum, Dad, my brothers and the whole extended family for always being there.

Finally, I want to thank my wonderful family. Special thanks to you, Emma, for all these years together, for always being there and for putting up with me. Thank you to my wonderful kids Liam and Arvid for all the joy you spread and for giving me something worth fighting for, and to the newest addition, Vera, who lit a fire under me to finish this thesis before you came into the world. Love you ♡

Thank you, everyone.

Rickard

Bibliography

[1] A. L. Samuel, “Some studies in machine learning using the game of checkers,” IBM Journal of Research and Development, vol. 3, no. 3, pp. 210–229, 1959.

[2] “Amazon Lookout for Vision.” https://aws.amazon.com/lookout-for-vision/. Accessed: 2021­02­25.

[3] “Machine learning can boost the value of wind energy.” https://deepmind.com/blog/article/machine-learning-can-boost-value-wind-energy. Accessed 2021-02-25.

[4] E. Callaway, “’It will change everything’: DeepMind’s AI makes gigantic leap in solving protein structures.,” Nature, vol. 588, pp. 203–204, 2020.

[5] S. Wold, “Chemometrics; what do we mean with it, and what do we want from it?,” Chemometrics and Intelligent Laboratory Systems, vol. 30, no. 1, pp. 109–115, 1995.

[6] European Medicines Agency, “ICH guideline Q8 (R2) on pharmaceutical development,” 2017. https://www.ema.europa.eu/en/documents/scientific-guideline/international-conference-harmonisation-technical-requirements-registration-pharmaceuticals-human-use_en-11.pdf.

[7] H. Martens, “Quantitative big data: where chemometrics can contribute,” Journal of Chemometrics, vol. 29, no. 11, pp. 563–581, 2015.

[8] S. Wold, K. Esbensen, and P. Geladi, “Principal component analysis,” Chemometrics and intelligent laboratory systems, vol. 2, no. 1­3, pp. 37–52, 1987.

[9] P. Geladi and B. R. Kowalski, “Partial least-squares regression: A tutorial,” Analytica Chimica Acta, vol. 185, pp. 1–17, 1986.

[10] S. Malek, F. Melgani, and Y. Bazi, “One-dimensional convolutional neural networks for spectroscopic signal regression,” Journal of Chemometrics, vol. 32, no. 5, p. e2977, 2018.

[11] C. Cui and T. Fearn, “Modern practical convolutional neural networks for multivariate regression: Applications to NIR calibration,” Chemometrics and Intelligent Laboratory Systems, vol. 182, pp. 9–20, 2018.

[12] OECD Publishing, OECD glossary of statistical terms, p. 119. Organisation for Economic Co-operation and Development, 2008.

[13] D. Lahat, T. Adali, and C. Jutten, “Multimodal data fusion: an overview of methods, challenges, and prospects,” Proceedings of the IEEE, vol. 103, no. 9, pp. 1449–1477, 2015.

[14] T. Skotare, Multivariate integration and visualization of multiblock data in chemical and biological applications. PhD thesis, Umeå universitet, 2019.

[15] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.

[16] J. Devlin, M.­W. Chang, K. Lee, and K. Toutanova, “BERT: Pre­training of deep bidirectional transformers for language understanding,” 2019.

[17] K. Sohn, D. Berthelot, C.-L. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel, “Fixmatch: Simplifying semi-supervised learning with consistency and confidence,” arXiv preprint arXiv:2001.07685, 2020.

[18] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” 2020.

[19] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001.

[20] N. S. Altman, “An introduction to kernel and nearest-neighbor nonparametric regression,” The American Statistician, vol. 46, no. 3, pp. 175–185, 1992.

[21] C. Cortes and V. Vapnik, “Support­vector networks,” Machine learning, vol. 20, no. 3, pp. 273–297, 1995.

[22] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of computer and system sciences, vol. 55, no. 1, pp. 119–139, 1997.

[23] J. T. Barron, “A general and adaptive robust loss function,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

[24] L. Rastrigin, “The convergence of the random search method in the extremal control of a many parameter system,” Automation and Remote Control, vol. 24, pp. 1337–1342, 1963.

[25] R. Eberhart and J. Kennedy, “A new optimizer using particle swarm theory,” in MHS’95. Proceedings of the Sixth International Symposium on Micro Machine and Human Science, pp. 39–43, IEEE, 1995.

[26] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Optimization by simulated annealing,” Science, vol. 220, no. 4598, pp. 671–680, 1983.

[27] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in International conference on machine learning, pp. 1139–1147, PMLR, 2013.

[28] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[29] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural networks, vol. 61, pp. 85–117, 2015.

[30] W. S. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in nervous activity,” Bulletin of mathematical biology, vol. 52, no. 1­2, pp. 99–115, 1943.

[31] F. Rosenblatt, “The perceptron: a probabilistic model for information storage and organization in the brain.,” Psychological review, vol. 65, no. 6, p. 386, 1958.

[32] A. G. Ivakhnenko, “Polynomial theory of complex systems,” IEEE transactions on Systems, Man, and Cybernetics, no. 4, pp. 364–378, 1971.

[33] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, pp. 1097–1105, 2012.

[34] S. Webb, “Deep learning for biology,” Nature, vol. 554, no. 7693, 2018.

[35] C. Angermueller, T. Pärnamaa, L. Parts, and O. Stegle, “Deep learning for computational biology,” Molecular systems biology, vol. 12, no. 7, p. 878, 2016.

[36] A. C. Mater and M. L. Coote, “Deep learning in chemistry,” Journal of chemical information and modeling, vol. 59, no. 6, pp. 2545–2559, 2019.

[37] B. Sahiner, A. Pezeshk, L. M. Hadjiiski, X. Wang, K. Drukker, K. H. Cha, R. M. Summers, and M. L. Giger, “Deep learning in medical imaging and radiation therapy,” Medical physics, vol. 46, no. 1, pp. e1–e36, 2019.

[38] M. Ge, F. Su, Z. Zhao, and D. Su, “Deep learning analysis on microscopic imaging in materials science,” Materials Today Nano, p. 100087, 2020.

[39] J. N. Kutz, “Deep learning in fluid dynamics,” Journal of Fluid Mechanics, vol. 814, pp. 1–4, 2017.

[40] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., “Tensorflow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283, 2016.

[41] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high­performance deep learning library,” in Advances in Neural Information Processing Systems 32, pp. 8024–8035, Curran Associates, Inc., 2019.

[42] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back­propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, 1986.

[43] H. J. Kelley, “Gradient theory of optimal flight paths,” ARS Journal, vol. 30, no. 10, pp. 947–954, 1960.

[44] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization,” in Advances in Neural Information Processing Systems (Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, eds.), vol. 27, pp. 2933–2941, Curran Associates, Inc., 2014.

[45] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein, “Visualizing the loss landscape of neural nets,” in Advances in Neural Information Processing Systems (S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds.), vol. 31, pp. 6389–6399, Curran Associates, Inc., 2018.

[46] Y. Bahri, J. Kadmon, J. Pennington, S. S. Schoenholz, J. Sohl-Dickstein, and S. Ganguli, “Statistical mechanics of deep learning,” Annual Review of Condensed Matter Physics, 2020.

[47] J. Huang and H.­T. Yau, “Dynamics of deep neural networks and neural tangent hierarchy,” in International Conference on Machine Learning, pp. 4542–4551, PMLR, 2020.

[48] M. Belkin, D. Hsu, S. Ma, and S. Mandal, “Reconciling modern machine­learning practice and the classical bias–variance trade­off,” Proceedings of the National Academy of Sciences, vol. 116, no. 32, pp. 15849–15854, 2019.

[49] T. M. Mitchell, The need for biases in learning generalizations. Department of Computer Science, Laboratory for Computer Science Research, Rutgers University, 1980.

[50] D. F. Gordon and M. Desjardins, “Evaluation and selection of biases in machine learning,” Machine learning, vol. 20, no. 1­2, pp. 5–22, 1995.

[51] K. Fukushima and S. Miyake, “Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition,” in Competition and cooperation in neural nets, pp. 267–285, Springer, 1982.

[52] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural computation, vol. 1, no. 4, pp. 541–551, 1989.

[53] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[54] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), vol. 30, pp. 5998–6008, Curran Associates, Inc., 2017.

[55] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021.

[56] Z. S. Harris, “Distributional structure,” Word, vol. 10, no. 2-3, pp. 146–162, 1954.

[57] J. Canny, “A computational approach to edge detection,” IEEE Transactions on pattern analysis and machine intelligence, no. 6, pp. 679–698, 1986.

[58] D. G. Lowe, “Object recognition from local scale-invariant features,” in Proceedings of the seventh IEEE international conference on computer vision, vol. 2, pp. 1150–1157, IEEE, 1999.

[59] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), vol. 1, pp. 886–893, IEEE, 2005.

[60] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European conference on computer vision, pp. 818–833, Springer, 2014.

[61] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.

[62] H. Martens and T. Naes, Multivariate Calibration. John Wiley & Sons, 1992.

[63] R. A. Fisher, The Design of Experiments. Oliver And Boyd; Edinburgh; London, 1937.

[64] I. Surowiec, L. Vikstrom, G. Hector, E. Johansson, C. Vikstrom, and J. Trygg, “Generalized subset designs in analytical chemistry,” Analytical chemistry, vol. 89, no. 12, pp. 6491–6497, 2017.

[65] P. F. de Aguiar, B. Bourguignon, M. Khots, D. Massart, and R. Phan-Than-Luu, “D-optimal designs,” Chemometrics and intelligent laboratory systems, vol. 30, no. 2, pp. 199–210, 1995.

[66] H. Wold, “Soft modelling by latent variables: the non­linear iterative partial least squares (NIPALS) approach,” Journal of Applied Probability, vol. 12, no. S1, pp. 117–142, 1975.

[67] K. Pearson, “LIII. On lines and planes of closest fit to systems of points in space,” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, vol. 2, no. 11, pp. 559–572, 1901.

[68] A. Hyvärinen and E. Oja, “Independent component analysis: algorithms and applications,” Neural networks, vol. 13, no. 4-5, pp. 411–430, 2000.

[69] R. Tauler, B. Kowalski, and S. Fleming, “Multivariate curve resolution applied to spectral data from multiple runs of an industrial process,” Analytical chemistry, vol. 65, no. 15, pp. 2040–2047, 1993.

[70] W. H. Lawton and E. A. Sylvestre, “Self modeling curve resolution,” Technometrics, vol. 13, no. 3, pp. 617–633, 1971.

[71] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.

[72] D.-S. Cao, Y.-Z. Liang, Q.-S. Xu, Q.-N. Hu, L.-X. Zhang, and G.-H. Fu, “Exploring nonlinear relationships in chemical data using kernel-based methods,” Chemometrics and Intelligent Laboratory Systems, vol. 107, no. 1, pp. 106–115, 2011.

[73] G. H. Tinnevelt and J. J. Jansen, “Resolving complex hierarchies in chemical mixtures: how chemometrics may serve in understanding the immune system,” Faraday discussions, vol. 218, pp. 317–338, 2019.

[74] M. J. Cejas and R. C. Barrick, “Chemometric mapping of polychlorinated dibenzo-p-dioxin (PCDD) and dibenzofuran (PCDF) congeners from the Passaic River, NJ: Integrated application of RSIMCA, PVA, and t-SNE,” Environmental Forensics, pp. 1–17, 2020.

[75] L. Eriksson and E. Johansson, “Multivariate design and modeling in QSAR,” Chemometrics and intelligent laboratory systems, vol. 34, no. 1, pp. 1–19, 1996.

[76] A. E. Hoerl and R. W. Kennard, “Ridge regression: Biased estimation for nonorthogonal problems,” Technometrics, vol. 12, no. 1, pp. 55–67, 1970.

[77] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.

[78] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.

[79] V. Hodge and J. Austin, “A survey of outlier detection methodologies,” Artificial intelligence review, vol. 22, no. 2, pp. 85–126, 2004.

[80] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “LOF: Identifying density-based local outliers,” in Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pp. 93–104, 2000.

[81] L. Eriksson, T. Byrne, E. Johansson, J. Trygg, and C. Vikström, Multi- and megavariate data analysis: basic principles and applications, vol. 1. Umetrics Academy, 2013.

[82] S. Rabanser, S. Günnemann, and Z. Lipton, “Failing loudly: An empirical study of methods for detecting dataset shift,” in Advances in Neural Information Processing Systems (H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché­Buc, E. Fox, and R. Garnett, eds.), vol. 32, Curran Associates, Inc., 2019.

[83] EPO, “Patent index 2019 ­ Statistics at a glance,” tech. rep., European Patent Office, 2020.

[84] L. Aristodemou and F. Tietze, “The state-of-the-art on intellectual property analytics (IPA): A literature review on artificial intelligence, machine learning and deep learning methods for analysing intellectual property (IP) data,” World Patent Information, vol. 55, pp. 37–51, 2018.

[85] Y. Lu, X. Xiong, W. Zhang, J. Liu, and R. Zhao, “Research on classification and similarity of patent citation based on deep learning,” Scientometrics, vol. 123, no. 2, pp. 813–839, 2020.

[86] K. S. Jones, “A statistical interpretation of term specificity and its application in retrieval,” Journal of documentation, 1972.

[87] A. Rogers, O. Kovaleva, and A. Rumshisky, “A primer in BERTology: What we know about how BERT works,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 842–866, 2021.

[88] S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” in Advances in Neural Information Processing Systems 30 (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), pp. 4765–4774, Curran Associates, Inc., 2017.

[89] B. Schölkopf, F. Locatello, S. Bauer, N. R. Ke, N. Kalchbrenner, A. Goyal, and Y. Bengio, “Towards causal representation learning,” 2021.

[90] J. Trygg, E. Holmes, and T. Lundstedt, “Chemometrics in metabonomics,” Journal of proteome research, vol. 6, no. 2, pp. 469–479, 2007.

[91] H. Li, “Minimap2: pairwise alignment for nucleotide sequences,” Bioinformatics, vol. 34, no. 18, pp. 3094–3100, 2018.

[92] J. Bergstra and Y. Bengio, “Random search for hyper­parameter optimization.,” Journal of machine learning research, vol. 13, no. 2, 2012.

[93] J. Köster and S. Rahmann, “Snakemake—a scalable bioinformatics workflow engine,” Bioinformatics, vol. 28, no. 19, pp. 2520–2522, 2012.

[94] L. Wei, X. Xu, Gurudayal, J. Bullock, and J. W. Ager, “Machine learning optimization of p-type transparent conducting films,” Chemistry of Materials, vol. 31, no. 18, pp. 7340–7350, 2019.

[95] National Transportation Safety Board, “Accident No.: HWY18MH010 - Collision between vehicle controlled by developmental automated driving system and pedestrian,” 2018.

[96] D. Hendrycks and K. Gimpel, “A baseline for detecting misclassified and out-of-distribution examples in neural networks,” in International Conference on Learning Representations, 2017.

[97] K. Lee, K. Lee, H. Lee, and J. Shin, “A simple unified framework for detecting out-of-distribution samples and adversarial attacks,” in Advances in Neural Information Processing Systems, pp. 7167–7177, 2018.

[98] K. Musgrave, S. Belongie, and S.-N. Lim, “A metric learning reality check,” in European Conference on Computer Vision, pp. 681–699, Springer, 2020.

[99] S. Narang, H. W. Chung, Y. Tay, W. Fedus, T. Fevry, M. Matena, K. Malkan, N. Fiedel, N. Shazeer, Z. Lan, et al., “Do transformer modifications transfer across implementations and applications?,” arXiv preprint arXiv:2102.11972, 2021.

[100] “paperswithcode - out-of-distribution detection.” https://paperswithcode.com/task/out-of-distribution-detection. Accessed 2021-02-28.

[101] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” in Advances in Neural Information Processing Systems (H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, eds.), vol. 33, pp. 1877–1901, Curran Associates, Inc., 2020.

[102] A. Saxena, K. Goebel, D. Simon, and N. Eklund, “Damage propagation modeling for aircraft engine run­to­failure simulation,” in 2008 international conference on prognostics and health management, pp. 1–9, IEEE, 2008.

[103] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Third International Conference on Learning Representations, 2015.

[104] P. Warden, “Speech Commands: A dataset for limited-vocabulary speech recognition,” arXiv preprint arXiv:1804.03209, 2018.

[105] S. B. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” in Readings in speech recognition, pp. 65–74, Elsevier, 1990.

[106] Y. Gal and Z. Ghahramani, “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning,” in International Conference on Machine Learning, pp. 1050–1059, 2016.

[107] C. Stringer, T. Wang, M. Michaelos, and M. Pachitariu, “Cellpose: a generalist algorithm for cellular segmentation,” Nature Methods, vol. 18, no. 1, pp. 100–106, 2021.

[108] J. Buolamwini and T. Gebru, “Gender shades: Intersectional accuracy disparities in commercial gender classification,” in Conference on fairness, accountability and transparency, pp. 77–91, PMLR, 2018.

[109] K. Karkkainen and J. Joo, “FairFace: Face attribute dataset for balanced race, gen­ der, and age for bias measurement and mitigation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1548–1558, 2021.

[110] J. Trygg and S. Wold, “Orthogonal projections to latent structures (O­PLS),” Journal of Chemometrics: A Journal of the Chemometrics Society, vol. 16, no. 3, pp. 119– 128, 2002.

[111] V. John, L. Mou, H. Bahuleyan, and O. Vechtomova, “Disentangled representa­ tion learning for non­parallel text style transfer,” in Proceedings of the 57th An­ nual Meeting of the Association for Computational Linguistics, (Florence, Italy), pp. 424–434, Association for Computational Linguistics, July 2019.

49 [112] L. Tran, X. Yin, and X. Liu, “Disentangled representation learning GAN for pose­ invariant face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[113] S. Wold, M. Sjöström, and L. Eriksson, “PLS­regression: A basic tool of chemomet­ rics,” Chemometrics and Intelligent Laboratory Systems, vol. 58, pp. 109–130, Oct. 2001.

[114] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?,” Journal of the ACM (JACM), vol. 58, no. 3, pp. 1–37, 2011.
