DEGREE PROJECT IN ENGINEERING PHYSICS, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2019

Non-Parametric Calibration for Classification

JONATHAN WENGER

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Kungliga Tekniska högskolan (KTH)

Teknisk fysik

Degree Project

Non-Parametric Calibration for Classification Jonathan Wenger [email protected]

Supervisor: Prof. Dr. Hedvig Kjellström Examiner: Prof. Dr. Danica Kragic Submission Date: July 3, 2019

Abstract

Many applications for classification methods not only require high accuracy but also reliable estimation of predictive uncertainty. This is of particular importance in fields such as computer vision or robotics, where safety-critical decisions are made based on classification outcomes. However, while many current classification frameworks, in particular deep neural network architectures, provide very good results in terms of accuracy, they tend to incorrectly estimate their predictive uncertainty.

In this thesis we focus on probability calibration, the notion that a classifier’s confi- dence in a prediction matches the empirical accuracy of that prediction. We study calibration from a theoretical perspective and connect it to over- and underconfi- dence, two concepts first introduced in the context of active learning.

The main contribution of this work is a novel algorithm for classifier calibration. We propose a non-parametric calibration method which is, in contrast to existing approaches, based on a latent Gaussian process and specifically designed for multi- class classification. It allows for the incorporation of prior knowledge, can be applied to any classification method that outputs confidence estimates and is not limited to neural networks.

We demonstrate the universally strong performance of our method across different classifiers and benchmark data sets from computer vision in comparison to exist- ing classifier calibration techniques. Finally, we empirically evaluate the effects of calibration on querying efficiency in active learning.

Sammanfattning

Many applications of classification methods require not only high accuracy but also reliable estimation of the uncertainty of the predicted outcome. This is of particular importance in fields such as computer vision or robotics, where safety-critical decisions are made based on classification results. While many current classification tools, in particular deep neural network architectures, deliver good results in terms of accuracy, they tend to incorrectly estimate their predictive uncertainty.

In this degree project we focus on probability calibration, i.e. how well a classifier's confidence in a prediction matches the actual empirical accuracy. We study calibration from a theoretical perspective and connect it to over- and underconfidence, two concepts first introduced in the context of active learning.

The main part of this work is the development of a new algorithm for classifier calibration. We propose a non-parametric calibration method which, in contrast to existing approaches, is based on a latent Gaussian process and is specifically designed for multi-class classification. The algorithm is not limited to neural networks but can be applied to any classification method that outputs confidence estimates.

We demonstrate the universally strong performance of our method across different classifiers and well-known data sets from computer vision in comparison to existing classifier calibration techniques. Finally, the effect of calibration on querying efficiency in active learning is evaluated empirically.

Acronyms

CNN    convolutional neural network
ECE    expected calibration error
ELBO   evidence lower bound
GP     Gaussian process
GPS    global positioning system
LA     Laplace approximation
MC     Monte Carlo
MCE    maximum calibration error
MCMC   Markov chain Monte Carlo
NLL    negative log-likelihood
NN     neural network
RKHS   reproducing kernel Hilbert space
SVGP   scalable variational Gaussian process
SVM    support vector machine

Notation

Scalars, Vectors and Matrices

θ       scalar or (probability distribution) parameter
x       (column) vector
A       matrix or random variable
tr A    trace of the (square) matrix A

Probability Theory

p(x)          probability density function or probability mass function
p(y | x)      conditional density function
X ∼ D         random variable X is distributed according to distribution D
iid           independent and identically distributed
N(µ, Σ)       (multivariate) normal distribution with mean µ and covariance Σ
N(x | µ, Σ)   density of the (multivariate) normal distribution
Cat(ρ)        categorical distribution with category probabilities ρ
Cat(x | ρ)    probability mass function of the categorical distribution
Beta(α, β)    beta distribution with shape parameters α and β
GP(µ, k)      Gaussian process with mean function µ(·) and covariance function k(·, · | θ)
KL[p ∥ q]     Kullback-Leibler divergence of probability distributions p and q
H(p)          information-theoretic entropy of probability distribution p

Classification and Calibration

x        feature vector
y        class label
ŷ        class prediction
z        output of a classifier, either a vector of class probabilities or logits
ẑ        confidence in prediction
ECE_p    expected calibration error for 1 ≤ p ≤ ∞
N        cardinality of the training data
K        number of classes
M        number of inducing points in a scalable variational approximation

Contents

Abstract i

Acronyms v

Notation vi

List of Tables xi

List of Figures xii

1 Introduction 1
  1.1 Research Question and Contribution 2
  1.2 Related Work 3
  1.3 Societal Aspects, Ethics and Sustainability 4
  1.4 Organisation 6

2 Background 7
  2.1 Classification 7
  2.2 Uncertainty Representation 9
  2.3 Measures of Uncertainty Representation 9
    2.3.1 Negative Log-Likelihood and Cross Entropy 10
    2.3.2 Calibration and Sharpness 10
    2.3.3 Over- and Underconfidence 13
  2.4 Calibration Methods 13
    2.4.1 Binary Calibration 14
    2.4.2 Multi-class Calibration 17
  2.5 Relations between Measures of Uncertainty Representation 18
    2.5.1 Calibration, Over- and Underconfidence 18
    2.5.2 Sharpness, Over- and Underconfidence 20

3 Gaussian Process Calibration 21
  3.1 Definition 21
  3.2 Inference 22
    3.2.1 Inducing Points 23
    3.2.2 Bound on the Marginal Likelihood 24
    3.2.3 Computation of the Expectation Terms 25
  3.3 Prediction 26

  3.4 Online Calibration 27
  3.5 Implementation 27

4 Experiments 29
  4.1 Synthetic Data 31
  4.2 Binary Benchmark Data 33
  4.3 Multi-class Benchmark Data 34
  4.4 Active Learning 36

5 Conclusion 39
  5.1 Summary 39
  5.2 Future Work 40

Bibliography 41

A Additional Experimental Results 47

B Multivariate Normal Distribution 51

List of Tables

2.1 Examples of common loss functions used in classification. Loss functions allow the comparison of different classification models by scoring them using samples from (X,Y ). We list a few common loss functions for a single input - output pair (x, y)...... 8

4.1 Calibration results on binary classification benchmark data sets. Average ECE1 and standard deviation of ten Monte-Carlo cross validation folds on binary benchmark data sets. Lowest calibration error per data set and classification model is indicated in bold. . . . 35 4.2 Calibration results on multi-class classification benchmark data sets. Average ECE1 and standard deviation of ten Monte-Carlo cross validation folds on multi-class benchmark data sets. Lowest calibration error per data set and classification model is indicated in bold...... 37

A.1 Accuracy after calibration on binary data. Average accuracy and standard deviation of ten Monte-Carlo cross validation folds on binary benchmark data sets. 48
A.2 Accuracy after calibration on multi-class data. Average accuracy and standard deviation of ten Monte-Carlo cross validation folds on multi-class benchmark data sets. 49

List of Figures

1.1 Example classification task in autonomous driving. Segmented scenery of Tübingen from the cityscapes data set [3] with a bounding box around an object, demonstrating an example classification task for an autonomous car...... 2 1.2 Motivating example for calibration. We trained a neural network with one hidden layer on MNIST [19] and computed the classification error, the negative log-likelihood (NLL) and the expected calibration error (ECE1) over training epochs. We observe that while accuracy continues to improve on the test set, the ECE1 increases after 20 epochs. Note that this is different from classical overfitting, as the test error continues to decrease. This shows that training and calibration need to be considered independently. This can be mitigated by post- hoc calibration using our method (dashed red line). The uncertainty estimation is improved with maintained classification accuracy. . . .3

2.1 Illustration of the two approaches to modelling in classifica- tion. One can take one of two approaches when trying to model the latent relationship between inputs and outputs in the training data. Either one takes a discriminative approach, modelling the posterior fX,Y (y x) directly or a generative approach modelling the joint dis- | tribution fX,Y (x, y). Reprinted from [46]...... 8 2.2 Illustration of calibration and sharpness. Examples of reliabil- ity diagrams and confidence histograms for a miscalibrated and not sharp classifier, a calibrated, but not sharp classifier, a classifier which is both miscalibrated and sharp and finally a calibrated and sharp classifier. The last classifier is generally the most desirable out of the four shown as its confidence estimates match its empirical accuracy and they are sufficiently close to 0 and 1 to be informative...... 12 2.3 Effect of confidence boosting in active learning. Comparison of gradient and confidence boosting on various data sets with respect to accuracy and querying efficiency. Panel (a) shows learning curves for active and passive gradient and confidence boosting on the PenDigits data set. Gradient boosting displays better accuracy for less queried labels. Panel (b) compares the number of queries per learning epoch of gradient versus confidence boosting on different data sets. Figure reprinted from [49]...... 14

2.4 Diagram illustrating probability calibration. Calibration methods act post-hoc on the output of a classifier in order to improve its uncertainty representation. First, a small subset of the training data is split off and the classification model is trained on the remaining data. Then the split-off data is classified by the model and is used along with the true labels to train the calibration method. Finally, when new data comes in, it is first classified by the underlying model and then the calibration method adjusts the resulting confidence output. 15
2.5 Illustration of the effect of probability calibration. Uncertainty contour plot of a synthetic binary classification problem in two-dimensional feature space. Red indicates probability of class 1 and blue indicates probability of class 0. The first panel shows an uncalibrated classifier. The second panel shows the uncertainty post-calibration. The underlying classifier is underconfident in the border region between the two classes, which is rectified by the calibration method. 15
2.6 Modern NN architectures are miscalibrated. Confidence histograms (top) and reliability diagrams (bottom) of a simple and a modern neural network architecture's confidence estimates on the CIFAR-100 data set [53]. The modern neural network displays lower error but is more overconfident and thus less calibrated. Graphs reprinted from [5]. 18

3.1 Multi-class calibration using a latent Gaussian process. The top panel shows the latent function of multi-class GP calibration with prior mean µ(z) = log(z) on a synthetic calibration data set with four classes and 100 calibration samples. Shading represents a 95% credibility interval. The bottom panel shows input confidence from the calibration data and its labels. One can see that the calibration uncertainty is higher in regions with less input data...... 22 3.2 Taylor approximation of the log-softargmax function. Illustra- tion of the second-order Taylor approximation to the log-softargmax function (3.10) for a binary calibration problem with y = 0 and mean > of the variational distribution ϕn = (0, 0) ...... 25

4.1 Traffic scene from the KITTI data set. Still image captured from an example sequence of the KITTI data set [60] showing point clouds in white on black background, ground truth bounding boxes in color and a road overlay. The image at the top shows the camera image recorded by the stereo camera system with bounding boxes added. . 30 4.2 Sample scans from the PCam data set. Example images from the PCam data set [63] depicting scans of lymph node tissue. Samples with metastatic tissue in the center are indicated by green boxes and given a positive label...... 31 4.3 Sample digits from the MNIST data set. Randomly drawn samples from the MNIST database [19] of handwritten digits. . . . . 31

4.4 Samples from the ImageNet data set. Illustratory samples from the ImageNet data set showing a wide variety of different classes. During classification images are rescaled to uniform dimensions. Reprinted from [64]. 32
4.5 Reliability diagrams before and after GP calibration. Reliability diagrams for synthetic data with 10 classes and a train set with 100 data points showing the effect of GP calibration on a test set with 900 instances. The uncalibrated reliability diagram is styled after effects often observed in modern network based image classifiers, which tend to be overconfident. 34
4.6 Active learning and calibration. ECE1 and classification error for two Mondrian forests trained online on labels requested through an entropy query strategy on the KITTI data set. One Mondrian forest is calibrated at regularly spaced intervals (in gray) using GP calibration. Raw data and a Gaussian process regression up to the average number of queried samples across folds is shown. 38
4.7 Effects of calibration on over- and underconfidence in active learning. Over- and underconfidence for two Mondrian forests trained online in an active fashion. The Mondrian forest which was calibrated in regularly spaced intervals (in gray) demonstrates a shift in over- and underconfidence to the ratio determined by Theorem 2.6. Raw data and a Gaussian process regression up to the average number of queried samples across folds is shown. 38

Chapter 1

Introduction

With the recent achievements in machine learning, in particular in the area of deep learning, the range of applications for learning methods has also increased significantly. Especially in challenging fields such as computer vision or speech recognition, important advancements have been made using powerful and complex network architectures, trained on very large data sets. Most of these techniques are used for classification tasks, e.g. object recognition as illustrated in Figure 1.1. We also consider classification in this thesis. However, in addition to achieving high classification accuracy, our goal is to also provide reliable uncertainty estimates for predictions. This is of particular relevance in safety-critical applications [1], such as autonomous driving and robotics. Reliable uncertainties can be used to increase a classifier's precision by reporting only class labels that are predicted with low uncertainty, or for information-theoretic analyses of what was learned and what was not. The latter is especially interesting in the context of active learning [2], where the learner actively selects the most relevant data samples for training via a query function based on the posterior predictive uncertainty of the model.

Unfortunately, current probabilistic classification approaches that inherently provide good uncertainty estimates, such as Gaussian processes, often suffer from a lower accuracy and a higher computational complexity on high-dimensional classification tasks compared to state-of-the-art convolutional neural network (CNN) architec- tures. It was recently observed that many modern CNNs are overconfident [4] and miscalibrated [5]. Here, calibration refers to adapting the confidence output of a classifier such that it matches its true probability of being correct. Originally devel- oped in the context of forecasting [6, 7], probability calibration has seen increased interest in recent years [5, 8–11], partly because of the popularity of CNNs which lack inherent uncertainty representation. Earlier studies show that also classical methods such as decision trees, boosting, SVMs and naive Bayes classifiers tend to be miscalibrated [8, 12–14]. Therefore, we claim that training and calibrating a classifier can be two different objectives that need to be considered separately, as exemplified in a toy example in Figure 1.2. Here, a simple neural network contin- ually improves its accuracy on the test set during training, but eventually overfits in terms of NLL and calibration error. A similar phenomenon was observed in [5] for more complex models. Calibration methods perform a post-hoc improvement to

uncertainty estimation using a small subset of the training data. In this thesis we develop a multi-class calibration method for arbitrary classifiers, to provide reliable predictive uncertainty estimates in addition to maintaining high accuracy.

Figure 1.1: Example classification task in autonomous driving. Segmented scenery of Tübingen from the cityscapes data set [3] with a bounding box around an object, demonstrating an example classification task for an autonomous car.

We note that in contrast to recent approaches which strive to improve uncertainty estimation only for neural networks, including Bayesian neural networks [15, 16] and Laplace approximations (LA) [17, 18], our aim is a framework that is not based on tuning a specific classification method. This has the advantage that the method operates independently of the training process of the classifier and does not rely on training-specific values such as the curvature of the loss function as in LA methods.

1.1 Research Question and Contribution

The research question which will be examined in this thesis is the following. How can prediction uncertainty of a multi-class classifier, applied to computer vision problems, be accurately represented independent of model specification? We make the following contributions in this thesis in an attempt to answer this question.

We show a theoretical link between calibration, over- and underconfidence, con- necting these formerly disparate concepts. Further, we demonstrate on a range of classification models and benchmark data sets that popular classification models are often not calibrated. The main contribution of this thesis is a new multi-class and model-agnostic approach to calibration, based on Gaussian processes. Finally, we study the relationship between active learning and calibration from a theoretical and empirical perspective.

Figure 1.2: Motivating example for calibration. We trained a neural network with one hidden layer on MNIST [19] and computed the classification error, the negative log-likelihood (NLL) and the expected calibration error (ECE1) over training epochs. We observe that while accuracy continues to improve on the test set, the ECE1 increases after 20 epochs. Note that this is different from classical overfitting, as the test error continues to decrease. This shows that training and calibration need to be considered independently. This can be mitigated by post-hoc calibration using our method (dashed red line). The uncertainty estimation is improved with maintained classification accuracy.

1.2 Related Work

Estimation of uncertainty is of considerable interest in the machine learning com- munity at the moment. There are two main approaches in classification. First, by defining a model and loss function which inherently learns a good representation and second, by post-hoc calibration methods which transform output of the underlying model. Uncertainty estimation is also connected to adversarial robustness. Theo- retical results on calibration were previously considered in the fairness literature. Finally, calibration in a broader sense is studied in the regression setting and other applications. We give a short overview of related work in the following paragraphs.

Uncertainty Estimation for Neural Networks Uncertainty estimation in deep learning [20] is generally done by some form of reg- ularisation. Pereyra et al. [21] evaluate two output regularizers for deep NNs, a maximum entropy based confidence penalty and label smoothing. They find that both improve generalisation on common benchmark data sets. Kumar et al. [10] suggest a trainable measure of calibration as a regulariser in an attempt to improve calibration during training. Finally, Maddox et al. [22] employ an approximate technique using stochastic weight averaging to obtain an approx- imate posterior distribution over network weights. Bayesian model averaging is then performed by sampling from the resulting Gaussian distribution.

Gaussian Processes for Large-Scale Problems Gaussian processes provide a principled way to represent uncertainty, but generally perform subpar with regard to accuracy on high-dimensional problems and scalability for very large data sets. Hensman et al. [23] propose a variational inference technique to scale Gaussian processes to large data sets and perform inference for intractable likelihoods. Milios et al. [24] approximate Gaussian process classifiers, which tend to

have good uncertainty estimates, by GP regression on transformed labels for improved scalability.

Calibration Methods for Classification Research on calibration goes back to statistical forecasting [6, 7] and approaches to provide uncertainty estimates for non-probabilistic binary classifiers [25–27]. More recently, Bayesian binning into quantiles [8] and Beta calibration [9] for binary clas- sification and temperature scaling [5] for multi-class problems were proposed. Guo et al. [5] also discovered that modern CNN architectures do not provide calibrated output. A theoretical framework for evaluating calibration in classification was sug- gested by Vaicenavicius et al. [11].

Adversarial Robustness Adversarial robustness is measured via the minimum perturbation in feature space needed to change the classification of a test sample. High uncertainty for adversar- ial samples is desirable. Croce et al. [28] introduce a regularizer which pushes the decision boundary away from data points and thus gives provable robustness guar- antees against adversarial samples. Kuleshov and Ermon [29] propose an algorithm for online re-calibration and assess performance against an adversary.

Algorithmic Fairness Calibration is also a topic in the algorithmic fairness literature [30, 31]. Here, cal- ibration is considered in the sense that if a certain probability is predicted for an outcome, then this probability should match the empirical fraction of the population with this outcome uniformly across all population subgroups.

Calibration Methods for Regression and Other Applications In a broader sense calibration can also be defined for regression. Kuleshov et al. [32] propose a procedure to calibrate an arbitrary regression algorithm and evaluate it on various network architectures. Song et al. [33] introduce the concept of distribution calibration and a method based on multi-output Gaussian processes. Finally, Jabbari et al. [34] use a shallow neural network to perform calibration in the discovery of causal structure from observational data.

1.3 Societal Aspects, Ethics and Sustainability

The impact of artificial intelligence and machine learning methods on society has been substantial in recent years and this trend is likely to continue. Entire industries such as production, transportation, media and entertainment, medicine and others were revolutionised. For example, machine learning methods such as recommender systems drive consumption, computer vision techniques perform quality control by classification and advertisements are targeted based on individual traits and inter- ests. This rapid shift has had and will have noticeable economic impact, in particular on the job market. Jobs such as accounting, translation or operation of vehicles are

likely to be replaced by automated systems in the future [35]. Widespread use of artificial intelligence also raises many questions regarding ethics, privacy, fairness and environmental impact.

One area which has been impacted heavily by automated statistical analysis is pri- vacy. It is routine business practice of social media companies to use personalised advertising as the main stream of revenue. This relies on building a statistical model of consumer behaviour, based on their interaction data with the specific site. It is very important to protect an individual’s right to privacy, in particular since many users of such a website are not aware of how their data is being used. These changes require careful analysis and possible regulatory action to protect the consumer [36]. One area where privacy is particularly crucial is facial recognition. Such technologies can be easily misused by organisations or governments to control and monitor.

The routine reliance on data in order to make decisions and the apparent objectiv- ity of statistical models can introduce unwanted bias. Fairness, the concept that subgroups of a population are treated equally in a model should be considered in or- der to avoid discrimination. There are many examples of systems like credit scores and crime risk being heavily skewed towards economically disadvantaged popula- tion groups or minorities [37], job platforms ranking people based on qualifications have been found to disproportionally undervalue women [38] and facial recognition software, trained predominantly on light skinned faces, fails when presented with a human face with darker complexion [39]. These examples underline the challenges when relying on automated systems learning from data and their ethical impact.

Computing also has a considerable environmental impact [40]. Many components of modern computers use rare materials which are extracted and manufactured under dangerous conditions, often in economically disadvantaged nations with low wage levels. Further computing in general and training large scale machine learning mod- els in particular has significant energy cost (e.g. Google Deepmind’s AlphaGo [41]). This raises questions of sustainability and the created societal value of a certain application with regard to its power usage.

This work specifically touches on many of the general aspects mentioned above. All benchmark data sets for our application are from computer vision, one specifically for autonomous driving. It is conceivable that our method could be used in facial recognition software at some point in the future. Further, our approach which we will outline later in this thesis is not robust against biased data and thus its use may raise questions of fairness. The main societal relevance of our thesis is in improv- ing classification systems. As mentioned above automated statistical classification is ubiquitous in modern society. By improving uncertainty representation in such methods we aim to make automated systems safer, easier to use and interpret and faster to improve. This work has a theoretical and research focus and is thus targeted towards the research community.

1.4 Organisation

We begin by introducing different measures of uncertainty representation and argue for their importance in active learning applications. We then motivate the problem of calibration of classification models and introduce existing binary and multi-class calibration methods. Next, we study the theoretical relationship between active learning and calibration and prove a theorem connecting over- and underconfidence and calibration.

Next, we outline a novel multi-class and model-agnostic approach to calibration, based on Gaussian processes, which have a number of desirable properties making them suitable as a calibration tool. This approach is non-parametric, can take prior knowledge into account and provides calibration uncertainty.

In the experimental section of this work, we demonstrate that popular classification models in computer vision and robotics are often not calibrated. We empirically com- pare our proposed approach to calibration versus state-of-the-art calibration meth- ods on a range of computer vision benchmark data sets and classification models. Our method exemplifies universally strong performance across different classifiers and data sets in contrast to existing classifier calibration techniques. Finally, we conclude this work with an empirical study of the effect of calibration on querying efficiency in active learning.

Chapter 2

Background

Suppose we are trying to learn the relationship between a set of inputs x and outputs y with the goal of predicting the output of unseen inputs. For example, we might be interested in predicting the classes of objects visible in an image in order to decide whether a robot can interact with them safely. If y takes on a discrete set of values or classes, we call this problem classification. This problem falls under the broader category of supervised learning, meaning we have access to a set of training data 𝒟 = {(x_n, y_n)}_{n=1}^N of examples of the relationship between inputs and outputs. More rigorously, we can formulate this problem as a form of function approximation, where inputs and outputs come from an underlying distribution which we are trying to uncover. Out of the many introductory texts on supervised learning and classification which exist, we relied mostly on [42–44] for this introduction. Taking a probabilistic view, we define the problem formally below.

2.1 Classification

Let 𝒳 be a vector space and 𝒴 a set of finite cardinality K = |𝒴|. Further, let (Ω, ℱ, P) be a probability space and X : Ω → 𝒳, Y : Ω → 𝒴 random variables on said space. We assume we have access to a training data set 𝒟 of N independent and identically distributed samples from (X, Y). The relationship between X and Y is fully determined by their joint density function f_{X,Y} : 𝒳 × 𝒴 → ℝ. Modeling the relationship between X and Y comes down to approximating this joint density function. We call a function f : 𝒳 × 𝒴 → ℝ a classifier or model and ŷ = argmax_{y ∈ 𝒴} f(x, y) for some x ∈ 𝒳 its class prediction. We will abuse notation and sometimes use f : 𝒳 → ℝ^K with output z = f(x), prediction ŷ = argmax_i(z_i) and associated confidence score ẑ = max_i(z_i) instead. Modeling the relationship defined by f_{X,Y} can be approached in two ways, according to Ng and Jordan [45]. One can either take a generative approach and model the joint distribution p(x, y), or a discriminative approach and model the posterior p(y | x) directly, i.e. learn a mapping from inputs x to outputs y. These two approaches are illustrated in Figure 2.1.

In order to decide between classifiers, a loss function 𝓛(f, 𝒟) is used.

(a) Discriminative model (b) Generative model

Figure 2.1: Illustration of the two approaches to modelling in classification. One can take one of two approaches when trying to model the latent relationship between inputs and outputs in the training data. Either one takes a discriminative approach, modelling the posterior f_{Y|X}(y | x) directly, or a generative approach, modelling the joint distribution f_{X,Y}(x, y). Reprinted from [46].

Table 2.1: Examples of common loss functions used in classification. Loss functions allow the comparison of different classification models by scoring them using samples from (X,Y ). We list a few common loss functions for a single input - output pair (x, y).

Loss function Definition

0/1 loss: 1_{y ≠ ŷ}
Squared loss: (1 − y f(x))²
Exponential loss: exp(−α y f(x))
Hinge loss: max(0, 1 − y f(x))
Log loss (cross entropy): −((y+1)/2) log(f(x)) − (1 − (y+1)/2) log(1 − f(x))

A loss function scores a classifier by comparing the predictions and associated confidence scores of the classifier on a set of inputs with the true outputs or labels. Some common examples of loss functions are presented in Table 2.1. As the space of all possible functions is too vast to be useful, one restricts the class of functions which model the relationship defined by f_{X,Y}. This modelling task is where knowledge about the application the data is coming from is essential. For example, sometimes a mechanistic understanding of a physical system is available, or some rules about what type of data is classified into which class are known a priori. During training one uses the training data to compute the loss for a set of models from the chosen class in order to choose the best fitting one.
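To make the entries of Table 2.1 concrete, the following minimal sketch evaluates each listed loss for a single example with label y ∈ {−1, +1} and real-valued score f(x); the helper name and the sigmoid used to turn the score into a probability for the log loss are illustrative assumptions, not part of the text.

```python
import numpy as np

def classification_losses(y, fx, alpha=1.0):
    """Losses from Table 2.1 for one example with label y in {-1, +1} and
    real-valued classifier score fx. For the log loss the score is squashed
    through a sigmoid to obtain a probability (an assumption made here)."""
    y01 = (y + 1) / 2                      # map {-1, +1} -> {0, 1}
    p = 1.0 / (1.0 + np.exp(-fx))          # probability of the positive class
    return {
        "zero_one": float(np.sign(fx) != y),
        "squared": (1.0 - y * fx) ** 2,
        "exponential": np.exp(-alpha * y * fx),
        "hinge": max(0.0, 1.0 - y * fx),
        "log": -y01 * np.log(p) - (1.0 - y01) * np.log(1.0 - p),
    }

print(classification_losses(y=1, fx=0.8))
```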

Often one also introduces a regularisation term 𝓡(f) which penalises functions from the chosen class in different ways. This can be useful to combat overfitting, the phenomenon of modelling the training data too well, resulting in a lack of generalisation, i.e. small loss on the training data but large loss on independent data sampled from (X, Y).

2.2 Uncertainty Representation

In this work we are particularly interested in uncertainty representation, i.e. how well a classifier is aware of what it does not know. This is important because in ap- plications it is often not sufficient to have high accuracy on a classification task. For example consider an autonomous robot which is deployed in a novel environment, such as a remote planet or a city destroyed by an earthquake. During navigation this robot has to make choices continuously of what type of objects are in its path and whether it is safe to interact with them. Some of these classifications can be safety-critical for example whether to drive over a ledge or to interact with a po- tential disaster victim. Proper uncertainty about its prediction allows the robot to refrain from making potentially dangerous decisions. For example when the robot has high uncertainty on whether it is safe to drive to a certain location, it can first ask for feedback on its camera image from a human supervisor on earth, before tak- ing action. It could also record high uncertainty predictions in order to obtain the true classification at a later date from an expert in order to improve its predictions in the future. This type of learning strategy is called active learning by uncertainty sampling.

We are interested in correctly modelling predictive uncertainty, the posterior probability of the class prediction f_{Y|X}(y | x), or, viewed from the classifier perspective, the uncertainty a classifier has about its prediction. One can further split the predictive uncertainty into at least two types [16]: epistemic uncertainty or model uncertainty, which is caused by uncertainty about the correct parameters and structure of the underlying model, and aleatoric uncertainty, which is caused by inherent noise in the training data.

2.3 Measures of Uncertainty Representation

So far we have not defined what it means to have good uncertainty representation. This is due to the fact that this comes down to the modelling choices made and can vary for different applications. One usually tries to measure how close the model f is to the true data distribution fX,Y . This can be done in different ways. One can define a metric on a space of probability measures (e.g. Wasserstein metric), one can measure the distance between probability distributions (e.g. KL divergence) if the true data distribution is known, or one can represent a probability distribution as an element of a reproducing kernel Hilbert space (e.g. maximum mean discrepancy). Typically these distances rely on samples from one or both distributions as fX,Y is usually unknown. Focussing on uncertainty representation means putting value on closeness with respect to the chosen statistical distance and not only on accuracy.

In the following, we will introduce a set of measures used to quantify uncertainty representation. We are particularly interested in the concept of calibration, the notion that a classifier’s uncertainty in its prediction matches its empirical accuracy.

In practice, we evaluate these measures on a test data set, which we assume to consist of i.i.d. samples from the ground truth distribution.

2.3.1 Negative Log-Likelihood and Cross Entropy We begin by defining cross entropy, an information-theoretic quantity which can be used to compare probability distributions and is commonly used as a loss function, for example in deep learning.

Definition 2.1 Let f(y | x) be a model approximating the conditional distribution of the data. We define the cross entropy as

H(f_{Y|X}, f) = E_{X,Y}[−log f].   (2.1)

Remark 2.2 The following identity for the cross entropy holds

E_{X,Y}[−log f] = H(f_{Y|X}) + KL[f_{Y|X} ∥ f],
where H is the information-theoretic entropy and KL[f_{Y|X} ∥ f] the Kullback-Leibler divergence. Since KL[· ∥ ·] ≥ 0, we see that the ground truth distribution minimises the cross entropy.

Cross-entropy in the context of machine learning is closely related to maximum likelihood estimation. When fitting a family of parametric statistical models, maximum likelihood estimation (MLE) is used to identify the value of the parameter for which the probability of the observed sample is maximised. Due to the monotonicity of the natural logarithm we can equivalently minimise the negative log-likelihood. In fact, when evaluating the negative log-likelihood on an i.i.d. sample from (X, Y), it is given by
𝓛 = −∑_{i=1}^{n} log f(y_i | x_i),
which can be seen as a Monte-Carlo estimator of (2.1).
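As an illustration of the negative log-likelihood as a Monte-Carlo estimate of (2.1), the following minimal sketch (function name and NumPy conventions are our own choices, and the sum is normalised by the sample size) averages the negative log-probability assigned to the observed labels:

```python
import numpy as np

def negative_log_likelihood(probs, labels, eps=1e-12):
    """Average negative log-probability of the observed labels, a Monte-Carlo
    estimate of the cross entropy (2.1), here normalised by the sample size.

    probs:  (N, K) array of predicted class probabilities
    labels: (N,)   integer class labels in {0, ..., K-1}
    """
    p_true = probs[np.arange(len(labels)), labels]      # probability of the true class
    return -np.mean(np.log(np.clip(p_true, eps, 1.0)))  # clip to avoid log(0)
```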

2.3.2 Calibration and Sharpness Originally introduced in the context of statistical forecasting [6, 7] calibration de- scribes how well the confidence of a classifier in its prediction matches the empirical frequency of its prediction being correct.

Definition 2.3 Let f be a model, yˆ its class prediction and zˆ the associated confidence score. A classifier is called calibrated if

P(ŷ = y | ẑ = z) = z   for all z ∈ [0, 1],

or equivalently
E[1_{ŷ=y} | ẑ] = ẑ.   (2.2)
In order to measure the degree of calibration, we define the expected calibration error [8] for 1 ≤ p < ∞ by
ECE_p = E[ |ẑ − E[1_{ŷ=y} | ẑ]|^p ]^{1/p}   (2.3)
and the maximum calibration error [8] by

ECE_∞ = max_{z ∈ [0,1]} |z − E[1_{ŷ=y} | ẑ = z]|.   (2.4)

In practice, we estimate the calibration error as suggested by Naeini et al. [8] by introducing a fixed binning θ_0 < θ_1 < ⋯ < θ_B such that
ECE_p ≈ ( (1/B) ∑_{b=1}^{B} |z̄_b − acc_b|^p )^{1/p},
where z̄_b = (1/N_b) ∑_{θ_{b−1} < ẑ_n ≤ θ_b} ẑ_n is the average confidence, acc_b the empirical accuracy and N_b the number of samples in bin b.
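A minimal sketch of this binned estimator, assuming equal-width bins and NumPy arrays of predicted-class confidences and correctness indicators (function and argument names are illustrative):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15, p=1):
    """Binned estimate of ECE_p: compare average confidence and empirical
    accuracy of the predictions falling into each confidence bin.

    confidences: (N,) confidence of the predicted class, in [0, 1]
    correct:     (N,) boolean array, True where the prediction was correct
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    gaps = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if np.any(in_bin):
            gaps.append(abs(confidences[in_bin].mean() - correct[in_bin].mean()) ** p)
    # uniform average over bins as in the text; weighting bins by their
    # sample count is a common alternative
    return (np.sum(gaps) / n_bins) ** (1.0 / p)
```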

Calibration on its own is not a sufficient criterion for confidence estimates of a classifier to be meaningful. In a two-class problem with equal prior probability for both classes, a classifier which chooses either of the two classes at random with probability 0.5 is calibrated. However, it is immediately apparent that such a classifier is of little use. Intuitively, for a classifier's confidence estimates to be meaningful, they need to be sufficiently close to 0 or 1, at least some of the time.

Definition 2.4 We define the sharpness of f by

sharp(f) = (4k² / (k − 1)²) Var[ẑ]  ∈ [0, 1].   (2.5)

The sharpness represents the scaled variance of the confidence in the predicted class. It is scaled such that it is always in the unit interval no matter the number of classes of the problem. Sharpness has been defined in various ways in previous works, re- flecting the fact that there are a multitude of measures of concentration for a random variable. Variations of this notion are also known as refinement [7, 47, 48].

11 Calibration can be visualised by plotting uncertainty estimates versus the empirical accuracy on a test set. This type of plot is called a reliability diagram [7, 13]. A calibrated classifier will display a perfect diagonal. An illustrative example is given in Figure 2.2. The example also shows confidence histograms illustrating the concept of sharpness.


Panels: (a) miscalibrated and not sharp, (b) calibrated and not sharp, (c) miscalibrated and sharp, (d) calibrated and sharp.

Figure 2.2: Illustration of calibration and sharpness. Examples of reliability diagrams and confidence histograms for a miscalibrated and not sharp classifier, a calibrated, but not sharp classifier, a classifier which is both miscalibrated and sharp and finally a calibrated and sharp classifier. The last classifier is generally the most desirable out of the four shown as its confidence estimates match its empirical accuracy and they are sufficiently close to 0 and 1 to be informative.

12 2.3.3 Over- and Underconfidence In the context of active learning, where only the most informative data is queried for labels, an accurate representation of uncertainty is important in order for the classifier to obtain informative samples. Informative samples are those that improve the classifier’s accuracy on future data. In particular, obtaining more samples in regions of the input space, which are misclassified with the current model and less for regions which the classifier already predicts correctly as these are uninformative is desirable. Over- and underconfidence, introduced in [49] capture this notion.

Definition 2.5 Let zˆ [0, 1] be the confidence score output by a model f at x. We define the ∈ overconfidence of f as the expected confidence on the misclassified samples

o(f) = E[ẑ | ŷ ≠ y]
and analogously underconfidence as the average uncertainty on the correctly classified samples
u(f) = E[1 − ẑ | ŷ = y].

Overconfidence measures the average confidence a classifier has in the samples it classifies wrongly. When using an uncertainty sampling based strategy in active learning, this means that wrongly classified samples are rarely requested, as the classifier has high confidence in its prediction for them. The flip-side of this is underconfidence. It describes the average uncertainty of the classifier about its correctly classified samples. If a classifier has high underconfidence it queries many samples which it already classifies correctly, i.e. predominantly uninformative samples. Ideally both over- and underconfidence are low. It is also important to note that both quantities are by definition independent of accuracy.
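The following sketch computes empirical over- and underconfidence (and, for completeness, sharpness) from held-out predictions; the function names and array conventions are assumptions made for illustration:

```python
import numpy as np

def over_under_confidence(confidences, correct):
    """Empirical o(f) = E[z | wrong] and u(f) = E[1 - z | correct] (Def. 2.5).

    confidences: (N,) confidence of the predicted class
    correct:     (N,) boolean array, True where the prediction was correct
    """
    o = confidences[~correct].mean() if (~correct).any() else 0.0
    u = (1.0 - confidences[correct]).mean() if correct.any() else 0.0
    return o, u

def sharpness(confidences, n_classes):
    """Scaled variance of the predicted-class confidence (Definition 2.4)."""
    k = n_classes
    return 4.0 * k ** 2 / (k - 1) ** 2 * np.var(confidences)
```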

In [49] these notions of introspective capability of a classifier were used to improve uncertainty sampling in the context of active learning. The authors introduce a variant of a gradient boosting algorithm which puts more weight on samples that were wrongly classified with high confidence. Figure 2.3 shows how this strategy improved accuracy compared to regular gradient boosting and lowered the number of queried labels.

2.4 Calibration Methods

Calibration methods were originally developed to provide probabilistic output for discriminative models such as support vector machines (SVMs). They were later adapted to be used as post-hoc methods to improve uncertainty representation by lowering calibration error. They work by using a small subset of the training data and subsequently adjusting the confidence output of the underlying model. A dia- grammatic explanation of calibration is shown in Figure 2.4. Figure 2.5 illustrates the effect of calibration in a binary classification problem on prediction uncertainty.

(a) Learning curves of passive and active gradient and confidence boosting on the PenDigits data set. (b) Number of new label queries per epoch on different data sets.

Figure 2.3: Effect of confidence boosting in active learning. Comparison of gradient and confidence boosting on various data sets with respect to accuracy and querying efficiency. Panel (a) shows learning curves for active and passive gradient and confidence boosting on the PenDigits data set. Gradient boosting displays better accuracy for less queried labels. Panel (b) compares the number of queries per learning epoch of gradient versus confidence boosting on different data sets. Figure reprinted from [49].

Calibration has seen a resurgence of interest in recent years, partly due to the pop- ularity of large neural network architectures and their lack of calibration [5] even when combined with principled Bayesian approaches [50]. An example of this is shown in Figure 2.6. In this section, we introduce the most prevalent methods.

2.4.1 Binary Calibration We begin by introducing common binary calibration methods. We denote a binary calibration method by v : ℝ → ℝ. It transforms the confidence for the positive class z_1 and then computes the calibrated confidence for the negative class by 1 − v(z_1).

Originally introduced in the context of SVMs, Platt Scaling [25, 26] is a parametric method designed to output calibrated posterior probabilities for non-probabilistic binary classifiers. It works by fitting a logistic regression model to the model output using the negative log-likelihood as a loss function. Let z_1 ∈ ℝ be the output of a model. The probabilistic score computed via Platt scaling is defined as
v(z_1) = 1 / (1 + exp(−a z_1 − b)),
where a, b ∈ ℝ are the parameters determined in the fitting procedure. The parametric assumption made corresponds to the case where the scores of each class are normally distributed with identical variance across classes [51].
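A possible implementation sketch of Platt scaling uses an essentially unregularised logistic regression on the one-dimensional scores; note that Platt's original procedure additionally smooths the binary targets, which is omitted here, and the class name is our own:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class PlattScaling:
    """One-dimensional logistic regression v(z1) = 1 / (1 + exp(-a*z1 - b)),
    fit on held-out scores; C is set large to make it nearly unregularised."""

    def fit(self, scores, labels):
        self.lr = LogisticRegression(C=1e10)
        self.lr.fit(np.asarray(scores).reshape(-1, 1), labels)
        return self

    def predict(self, scores):
        # calibrated probability of the positive class
        return self.lr.predict_proba(np.asarray(scores).reshape(-1, 1))[:, 1]
```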

Isotonic Regression [27] is a non-parametric approach to mapping non-probabilistic classifier scores to probabilities. It relaxes the assumption


Figure 2.4: Diagram illustrating probability calibration. Calibration methods act post-hoc on the output of a classifier in order to improve its uncertainty repre- sentation. First, a small subset of the training data is split off and the classification model is trained on the remaining data. Then the split off data is classified by the model and is used along with the true labels to train the calibration method. Finally, when new data comes in, it is first classified by the underlying model and then the calibration method adjusts the resulting confidence output.

Figure 2.5: Illustration of the effect of probability calibration. Uncertainty contour plot of a synthetic binary classification problem in two-dimensional feature space. Red indicates probability of class 1 and blue indicates probability of class 0. The first panel (classification uncertainty) shows an uncalibrated classifier. The second panel (calibrated uncertainty) shows the uncertainty post-calibration. The underlying classifier is underconfident in the border region between the two classes, which is rectified by the calibration method.

15 of a sigmoidal relationship between the model scores and empirical frequencies made by Platt scaling to an isotonic (non-decreasing) one. The following model

v(z_1) = m(z_1) + ε
is assumed for the probabilistic scores. The isotonic function m is found by minimising a squared loss function. In practice, piece-wise constant solutions can be found by using the pair-adjacent violators (PAV) algorithm [52].
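A sketch using scikit-learn's isotonic regression on a synthetic, overconfident calibration set (the data here is made up purely for illustration):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
scores = rng.uniform(size=500)          # stand-in held-out classifier scores
labels = rng.binomial(1, scores ** 2)   # synthetic, overconfident ground truth

# Non-decreasing map from scores to probabilities; scikit-learn's solver is
# based on the pair-adjacent violators algorithm.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(scores, labels)
print(iso.predict(np.array([0.2, 0.5, 0.9])))   # calibrated probabilities
```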

Beta Calibration Specifically designed for probabilistic classifiers with output range [0, 1], Beta calibration [9, 51] is a recently introduced parametric approach to calibration. Here, a calibration map family is defined based on the likelihood ratio between two Beta distributions. This parametric assumption is appropriate if the marginal class distributions follow Beta distributions. The model is given by
v(z_1) = 1 / (1 + exp(−c) (1 − z_1)^b / z_1^a),
where a, b, c ∈ ℝ are parameters. One theoretical advantage of Beta calibration over Platt scaling is that it defines a richer family of calibration maps. For example, the identity map emerges for a = 1, b = 1 and c = 0, which is not part of the sigmoid family. When applying Platt scaling to an already calibrated classifier, the result will therefore be miscalibrated.
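Since the beta calibration map can be rewritten as a logistic function of (ln z_1, −ln(1 − z_1)), its parameters can be estimated with an off-the-shelf logistic regression. The following sketch assumes this reformulation and uses our own function names:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_beta_calibration(z1, labels, eps=1e-12):
    """Fit a, b, c of the beta calibration map 1/(1 + exp(-c)*(1-z)^b / z^a)
    via logistic regression on the features (ln z, -ln(1-z)).
    (The original method additionally constrains a, b >= 0; omitted here.)"""
    z1 = np.clip(z1, eps, 1 - eps)
    features = np.column_stack([np.log(z1), -np.log(1.0 - z1)])
    lr = LogisticRegression(C=1e10).fit(features, labels)
    a, b = lr.coef_[0]
    c = lr.intercept_[0]
    return a, b, c

def beta_calibrate(z1, a, b, c, eps=1e-12):
    z1 = np.clip(z1, eps, 1 - eps)
    return 1.0 / (1.0 + np.exp(-c) * (1.0 - z1) ** b / z1 ** a)
```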

Histogram Binning Histogram binning [12] is a straightforward approach to minimising the calibration error. The classifier output range is binned into a fixed number of bins with thresholds
0 = θ_1 < θ_2 < ⋯ < θ_{B+1} = 1.
Then the empirical accuracy in each bin is computed on the calibration data set, giving values a_1, …, a_B. The calibration map is then defined by the piecewise constant map

v(z_1) = a_j  for θ_j < z_1 ≤ θ_{j+1}.
The bin edges can be determined for example by equal width or equal frequency.
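A sketch of histogram binning with equal-width bins (equal-frequency bins would only change how the edges are chosen); function names are illustrative:

```python
import numpy as np

def fit_histogram_binning(z1, labels, n_bins=10):
    """Calibrated probability per bin is the empirical accuracy of that bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    acc = np.zeros(n_bins)
    for j in range(n_bins):
        in_bin = (z1 > edges[j]) & (z1 <= edges[j + 1])
        acc[j] = labels[in_bin].mean() if np.any(in_bin) else edges[j:j + 2].mean()
    return edges, acc

def apply_histogram_binning(z1, edges, acc):
    # np.digitize with the interior edges maps each score to its bin index
    idx = np.clip(np.digitize(z1, edges[1:-1], right=True), 0, len(acc) - 1)
    return acc[idx]
```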

Bayesian Binning into Quantiles BBQ [8] extends the histogram binning ap- proach in a Bayesian fashion. Here, multiple equal-frequency binning models are constructed and scored. A binning model M is scored as follows

Score = P(M) P(𝒟 | M).
The marginal likelihood P(𝒟 | M) can be computed in closed form under the following assumptions: all samples are i.i.d. and each bin's class distribution is modelled as a binomial random variable. We assume a Beta(α_b, β_b) prior on the parameter of the binomial distribution in bin b. Then the marginal likelihood is given by

P(𝒟 | M) = ∏_{b=1}^{B} [ Γ(N′/B) / Γ(N_b + N′/B) ] · [ Γ(m_b + α_b) / Γ(α_b) ] · [ Γ(n_b + β_b) / Γ(β_b) ],

where N′ is the equivalent sample size controlling the influence of the prior, N_b is the total number of samples in bin b, and n_b and m_b are the number of class 0 and class 1 instances in bin b respectively. The parameters of the Beta priors are set to α_b = p_b N′/B and β_b = (1 − p_b) N′/B, where p_b is the midpoint of bin b. The prior over binning models P(M) is chosen as uniform. The above score is then used to perform model averaging across all possible binning models in a given size range.

2.4.2 Multi-class Calibration Up until recently no true multi-class calibration methods existed. Calibration was performed by extending binary calibration methods in a one-vs-all fashion. We K K denote a multi-class calibration method by v : R R . It is applied directly to the output confidence vector z of a multi-class classifier.→

Extension of Binary Models Multi-class calibration can be done by defining a set of binary calibration problems using a one-versus-all approach. Zadrozny and Elkan [27] propose to form K binary classification problems by treating all other classes {C_i}_{i≠j} as one class. The K trained classifiers are then calibrated using some calibration method for binary classification. For a new data point, the output vector formed by the normalised predictions of all K calibrated classifiers is then used as a confidence estimate. As most modern classifiers are inherently multi-class, this approach is not feasible anymore. We instead use a one-vs-all approach for the output z of the multi-class classifier, train a calibration method on each split and average their predictions.

Temperature Scaling Introduced as a calibration method for neural networks, temperature scaling [5] is a multi-class extension of Platt scaling. Guo et al. [5] showed that modern neural network architectures are miscalibrated (see also Figure 2.6) and benefit from a scaling procedure. For an output logit vector z of a neural network and a temperature parameter T > 0, the calibrated confidence is defined as
v(z) = σ(z / T),  i.e.  v(z)_i = exp(z_i / T) / ∑_{k=1}^{K} exp(z_k / T),   (2.6)
where all functions are applied component-wise. The parameter T is found by optimising the negative log-likelihood on a validation data set. It is important to note that the predicted class does not change when applying this transformation, since dividing the logits by T preserves their ordering. This ensures that the accuracy of the model is the same after scaling.
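A sketch of temperature scaling that fits T by minimising the validation NLL with SciPy; optimising log T to keep T positive is a convenience of this sketch, not prescribed by [5]:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)          # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels):
    """Find T > 0 minimising the NLL of the scaled softmax output on a
    validation set (logits: (N, K), labels: (N,) integer classes)."""
    def nll(log_T):
        probs = softmax(logits, T=np.exp(log_T))
        return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
    res = minimize_scalar(nll, bounds=(-3, 3), method="bounded")  # T in ~[0.05, 20]
    return float(np.exp(res.x))
```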

Matrix Scaling [5] Similarly, a more general extension of Platt scaling can be defined by using a linear transformation of the logits,
v(z) = σ(Az + b),  i.e.  v(z)_j = exp((Az + b)_j) / ∑_{k=1}^{K} exp((Az + b)_k),
for a matrix A ∈ ℝ^{K×K} and a vector b ∈ ℝ^K. Again these parameters are optimized with respect to the negative log-likelihood. However, this variant has proven ineffective [5].



Figure 2.6: Modern NN architectures are miscalibrated. Confidence histograms (top) and reliability diagrams (bottom) of a simple and a modern neural network architecture's confidence estimates on the CIFAR-100 data set [53]. The modern neural network displays lower error but is more overconfident and thus less calibrated. Graphs reprinted from [5].

2.5 Relations between Measures of Uncertainty Representation

While the importance of different measures of uncertainty quantification is application specific, many of them are inherently linked. Here we will study more closely how calibration, over- and underconfidence and sharpness are linked.

2.5.1 Calibration, Over- and Underconfidence

Since over- and underconfidence are properties independent of accuracy, they seem at first glance to be independent of calibration as well, a property defined through classification accuracy. But it turns out there is quite an important connection: the closer a classifier is to being calibrated, the more closely the ratio between its over- and underconfidence is determined by the odds of the classifier making a correct prediction.
Theorem 2.6 Let $1 \leq p < q \leq \infty$, then the following relationship between over-, underconfidence and the expected calibration error holds:

$$\big|o(f)\,\mathbb{P}(\hat{y} \neq y) - u(f)\,\mathbb{P}(\hat{y} = y)\big| \;\leq\; \mathrm{ECE}_p \;\leq\; \mathrm{ECE}_q. \qquad (2.7)$$

Proof. By linearity and the law of total expectation it holds that

$$\mathbb{E}[\hat{z}] = \mathbb{E}\big[\hat{z} + \mathbb{E}[\mathbb{1}_{\hat{y}=y} \mid \hat{z}] - \mathbb{E}[\mathbb{1}_{\hat{y}=y} \mid \hat{z}]\big] = \mathbb{E}\big[\hat{z} - \mathbb{E}[\mathbb{1}_{\hat{y}=y} \mid \hat{z}]\big] + \mathbb{P}(\hat{y} = y).$$

Conversely, by decomposing the average confidence we have

$$\begin{aligned}
\mathbb{E}[\hat{z}] &= \mathbb{E}[\hat{z} \mid \hat{y} \neq y]\,\mathbb{P}(\hat{y} \neq y) + \mathbb{E}[\hat{z} \mid \hat{y} = y]\,\mathbb{P}(\hat{y} = y) \\
&= \mathbb{E}[\hat{z} \mid \hat{y} \neq y]\,\mathbb{P}(\hat{y} \neq y) + \big(1 - \mathbb{E}[1 - \hat{z} \mid \hat{y} = y]\big)\,\mathbb{P}(\hat{y} = y) \\
&= o(f)\,\mathbb{P}(\hat{y} \neq y) + \big(1 - u(f)\big)\,\mathbb{P}(\hat{y} = y).
\end{aligned}$$

Combining the above we obtain

$$\mathbb{E}\big[\hat{z} - \mathbb{E}[\mathbb{1}_{\hat{y}=y} \mid \hat{z}]\big] = o(f)\,\mathbb{P}(\hat{y} \neq y) - u(f)\,\mathbb{P}(\hat{y} = y).$$

Now, since $f(x) = |x|^p$ is convex for $1 \leq p < \infty$, we have by Jensen's inequality

$$\big|\mathbb{E}[\hat{z} - \mathbb{E}[\mathbb{1}_{\hat{y}=y} \mid \hat{z}]]\big|^p \leq \mathbb{E}\big[|\hat{z} - \mathbb{E}[\mathbb{1}_{\hat{y}=y} \mid \hat{z}]|^p\big]$$

and finally by Hölder's inequality with $1 \leq p < q \leq \infty$ it follows that

$$\mathrm{ECE}_p = \mathbb{E}\big[|\hat{z} - \mathbb{E}[\mathbb{1}_{\hat{y}=y} \mid \hat{z}]|^p\big]^{\frac{1}{p}} \leq \mathbb{E}\big[|\hat{z} - \mathbb{E}[\mathbb{1}_{\hat{y}=y} \mid \hat{z}]|^q\big]^{\frac{1}{q}} = \mathrm{ECE}_q,$$

which concludes the proof.

While a similar result for $\mathrm{ECE}_1$ was shown in the context of fairness in [30] for different population groups in $\mathcal{X}$, the sharpness of the bound and the generalisation to $1 \leq p < q \leq \infty$ are original results to the best of our knowledge.

Corollary 2.7 Assume $f$ is calibrated, then

$$o(f)\,\mathbb{P}(\hat{y} \neq y) = u(f)\,\mathbb{P}(\hat{y} = y), \qquad (2.8)$$

i.e. the odds of making a correct prediction determine the ratio between over- and underconfidence. Assuming $\mathbb{P}(\hat{y} \neq y) \notin \{0, 1\}$ we obtain

$$\frac{o(f)}{u(f)} = \frac{\mathbb{P}(\hat{y} = y)}{\mathbb{P}(\hat{y} \neq y)}.$$

Proof. Since $f$ is calibrated we have by definition

$$\mathrm{ECE}_p = \mathbb{E}\big[|\hat{z} - \mathbb{E}[\mathbb{1}_{\hat{y}=y} \mid \hat{z}]|^p\big]^{\frac{1}{p}} = 0,$$

i.e. the calibration gap is zero. By Theorem 2.6 we have

$$o(f)\,\mathbb{P}(\hat{y} \neq y) - u(f)\,\mathbb{P}(\hat{y} = y) = 0.$$

Rearranging terms concludes the proof.

The relationship described in Corollary 2.7 was previously established in the fairness literature by [30, 31]. The authors show that the above holds for each population group under separate calibration of each group.
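The identity in Corollary 2.7 can be checked numerically. The following is a minimal Python sketch that simulates a calibrated binary classifier and compares the two sides of eq. (2.8); the definitions of over- and underconfidence used here, $o(f) = \mathbb{E}[\hat{z} \mid \hat{y} \neq y]$ and $u(f) = \mathbb{E}[1 - \hat{z} \mid \hat{y} = y]$, are assumed to match those introduced earlier in the thesis.

import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# A synthetic *calibrated* binary classifier: the confidence z_hat of the predicted
# class lies in [0.5, 1] and the prediction is correct with probability z_hat.
z_hat = rng.uniform(0.5, 1.0, size=n)
correct = rng.random(n) < z_hat

# Assumed definitions: o(f) = E[z_hat | wrong], u(f) = E[1 - z_hat | correct]
o = z_hat[~correct].mean()
u = (1.0 - z_hat[correct]).mean()
p_wrong = (~correct).mean()
p_correct = correct.mean()

# Corollary 2.7: o(f) P(y_hat != y) equals u(f) P(y_hat == y) under calibration
print(o * p_wrong, u * p_correct)   # both approximately 0.167 for this setup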

2.5.2 Sharpness, Over- and Underconfidence

While the previous subsection established a relationship between calibration and over- and underconfidence, it does not yet provide us with a way to minimise them. A fixed ratio of the two can still imply that both are high. Here we will establish how sharpness influences over- and underconfidence. Intuitively, a sharp classifier makes either very confident or very uncertain predictions. If the classifier is calibrated, these have high and low accuracy respectively. The definition of over- and underconfidence then suggests that increased sharpness under calibration should reduce both. This subsection formalises this heuristic argument.

Proposition 2.8 The following relationship between sharpness and over- / underconfidence of $f$ holds:

$$\mathrm{sharp}(f) = \frac{4k^2}{(k-1)^2}\Big(\mathbb{P}(\hat{y} \neq y)\big(\mathrm{Var}[\hat{z} \mid \hat{y} \neq y] + (o(f) - \mathbb{E}[\hat{z}])^2\big) + \mathbb{P}(\hat{y} = y)\big(\mathrm{Var}[1 - \hat{z} \mid \hat{y} = y] + (u(f) - \mathbb{E}[1 - \hat{z}])^2\big)\Big). \qquad (2.9)$$

Proof. Using the law of total variance, we obtain

$$\begin{aligned}
\mathrm{Var}[\hat{z}] &= \mathbb{E}\big[\mathrm{Var}[\hat{z} \mid \mathbb{1}_{\hat{y}=y}]\big] + \mathrm{Var}\big[\mathbb{E}[\hat{z} \mid \mathbb{1}_{\hat{y}=y}]\big] \\
&= \mathbb{E}\Big[\mathrm{Var}[\hat{z} \mid \mathbb{1}_{\hat{y}=y}] + \big(\mathbb{E}[\hat{z} \mid \mathbb{1}_{\hat{y}=y}] - \mathbb{E}[\mathbb{E}[\hat{z} \mid \mathbb{1}_{\hat{y}=y}]]\big)^2\Big] \\
&= \mathbb{P}(\hat{y} \neq y)\big(\mathrm{Var}[\hat{z} \mid \hat{y} \neq y] + (\mathbb{E}[\hat{z} \mid \hat{y} \neq y] - \mathbb{E}[\hat{z}])^2\big) + \mathbb{P}(\hat{y} = y)\big(\mathrm{Var}[\hat{z} \mid \hat{y} = y] + (\mathbb{E}[\hat{z} \mid \hat{y} = y] - \mathbb{E}[\hat{z}])^2\big) \\
&= \mathbb{P}(\hat{y} \neq y)\big(\mathrm{Var}[\hat{z} \mid \hat{y} \neq y] + (\mathbb{E}[\hat{z} \mid \hat{y} \neq y] - \mathbb{E}[\hat{z}])^2\big) + \mathbb{P}(\hat{y} = y)\big(\mathrm{Var}[1 - \hat{z} \mid \hat{y} = y] + (\mathbb{E}[\hat{z} \mid \hat{y} = y] - 1 + 1 - \mathbb{E}[\hat{z}])^2\big) \\
&= \mathbb{P}(\hat{y} \neq y)\big(\mathrm{Var}[\hat{z} \mid \hat{y} \neq y] + (o(f) - \mathbb{E}[\hat{z}])^2\big) + \mathbb{P}(\hat{y} = y)\big(\mathrm{Var}[1 - \hat{z} \mid \hat{y} = y] + (u(f) - \mathbb{E}[1 - \hat{z}])^2\big).
\end{aligned}$$

Now the result follows directly from the definition of sharpness.

The combination of Corollary 2.7 and Proposition 2.8 implies that for a calibrated classifier for which one of the regularity conditions

$$o(f) \leq \mathbb{E}[\hat{z}] \qquad \text{or} \qquad u(f) \leq \mathbb{E}[1 - \hat{z}]$$

holds, a sufficient increase in the sharpness of $f$ decreases both over- and underconfidence, as long as the individual variance terms can be controlled. In the rest of this thesis we will focus solely on improving calibration and leave simultaneous calibration and increase in sharpness based on this theoretical result for future work. We hypothesise that the difficulty lies in calibrating and controlling the variance terms simultaneously.

Chapter 3

Gaussian Process Calibration

We outline our non-parametric calibration method in the following sections. Our aim is to develop a calibration algorithm that is inherently multi-class, suitable for arbitrary classifiers, makes no parametric assumption on the shape of the calibration map and can take prior knowledge into account. These desired properties readily lead to our approach using a latent Gaussian process [54]. This has the added benefit that we obtain calibration uncertainty, providing us with information about how much we can trust the calibration map in different regions of its input space.

3.1 Definition

Assume a one-dimensional Gaussian process prior over the latent function $f(z)$, i.e.

$$f \sim \mathcal{GP}\big(\mu(\cdot),\, k(\cdot, \cdot \mid \theta)\big)$$

with mean function $\mu$, kernel $k$ and kernel parameters $\theta$. A common kernel choice, motivated by a smoothness assumption on the calibration map, is the squared exponential kernel with added noise

$$k(z_i, z_j) = \underbrace{\sigma^2 \exp\left(-\frac{(z_i - z_j)^2}{2l^2}\right)}_{\text{squared exponential}} + \underbrace{\delta_{ij}\,\sigma^2_{\text{noise}}}_{\text{Gaussian noise}}. \qquad (3.1)$$
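To make eq. (3.1) concrete, the following is a minimal NumPy sketch of this kernel for scalar inputs. The function and parameter names (variance, lengthscale, noise) are illustrative and are not taken from the pycalib implementation.

import numpy as np

def sq_exp_kernel(z1, z2, variance=1.0, lengthscale=1.0, noise=1e-4):
    """Squared exponential kernel with added white noise, cf. eq. (3.1).

    z1, z2: 1-D arrays of scalar inputs; returns the len(z1) x len(z2) Gram matrix.
    """
    sq_dist = (z1[:, None] - z2[None, :]) ** 2
    K = variance * np.exp(-sq_dist / (2.0 * lengthscale ** 2))
    # The noise term delta_ij * sigma_noise^2 only contributes on identical inputs.
    if z1.shape == z2.shape and np.array_equal(z1, z2):
        K = K + noise * np.eye(len(z1))
    return K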

Further, let the calibrated output be given by the softargmax inverse link function applied to the latent process evaluated at the model output

$$v(z)_j = \sigma(f(z))_j = \frac{\exp(f(z_j))}{\sum_{k=1}^K \exp(f(z_k))}. \qquad (3.2)$$

Note the similarity to multi-class Gaussian process classification, but in contrast we consider one shared latent function applied to each component of $z$ individually instead of $K$ latent functions. We use the categorical likelihood

$$\mathrm{Cat}(y \mid \sigma(f(z))) = \prod_{k=1}^K \sigma(f(z))_k^{[y=k]} \qquad (3.3)$$

to obtain a prior on the class prediction. Making the prior assumption that the given classifier is calibrated and no further calibration is necessary corresponds to either $\mu(z) = \log(z)$ if the inputs are confidence estimates, or to the identity function $\mu(z) = z$ if the inputs are logits. The formulation is inspired by temperature scaling defined in (2.6). We replace the linear map by a Gaussian process to allow for a more flexible calibration map able to incorporate prior knowledge concerning its shape. An example of a latent function for a synthetic data set is shown in Figure 3.1. If the latent function $f$ is monotonically increasing in its domain, the accuracy of the underlying classifier is unchanged.

Figure 3.1: Multi-class calibration using a latent Gaussian process. The top panel shows the latent function of multi-class GP calibration with prior mean $\mu(z) = \log(z)$ on a synthetic calibration data set with four classes and 100 calibration samples. Shading represents a 95% credibility interval. The bottom panel shows input confidence from the calibration data and its labels. One can see that the calibration uncertainty is higher in regions with less input data.
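As a small illustration of the prior mean choice discussed above, the following Python sketch evaluates the calibration map of eq. (3.2) for a given latent function; the function name is illustrative. With $\mu(z) = \log(z)$ the softargmax recovers the input probabilities, i.e. the prior corresponds to an already calibrated classifier.

import numpy as np

def calibration_map(z, latent_f):
    """Calibrated probabilities v(z) from eq. (3.2): softargmax of f applied elementwise."""
    f = latent_f(z)
    f = f - f.max()                      # numerical stability
    return np.exp(f) / np.exp(f).sum()

z = np.array([0.7, 0.2, 0.1])
# Prior mean mu(z) = log(z): softargmax(log(z)) returns z unchanged.
print(calibration_map(z, np.log))        # -> [0.7 0.2 0.1]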

3.2 Inference

In order to infer the calibration map, we need to fit the underlying Gaussian process based on the confidence predictions or logits and the ground truth classes in the calibration set. By our choice of likelihood, the posterior is not analytically tractable. In order for our method to scale to large data sets, we only retain a sparse representation of the input data, making inference of the latent Gaussian process computationally less intensive. We approximate the posterior through a scalable variational inference method [23].

According to our definition of the latent Gaussian process and the inverse link function, the joint distribution of the calibration data $(Z, y)$ and the latent variables $f$ is given by

$$p(y, f) = p(y \mid f)\,p(f) = \prod_{n=1}^N p(y_n \mid f_n)\,p(f) = \prod_{n=1}^N \mathrm{Cat}(y_n \mid \sigma(f_n))\;\mathcal{N}(f \mid \mu, \Sigma_f),$$

where $y \in \{1, \dots, K\}^N$, $f = (f_1, f_2, \dots, f_N)^\top \in \mathbb{R}^{NK}$ and $f_n = (f(z_{n1}), \dots, f(z_{nK}))^\top \in \mathbb{R}^K$. The covariance matrix $\Sigma_f$ has block-diagonal structure by independence of the calibration data, as follows

$$\Sigma_f = \begin{pmatrix} A_{1,1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & A_{N,N} \end{pmatrix} \in \mathbb{R}^{NK \times NK}$$

and each submatrix is given via the kernel function

$$A_{i,j} = \begin{pmatrix} k(z_i^1, z_j^1 \mid \theta) & \cdots & k(z_i^1, z_j^K \mid \theta) \\ \vdots & \ddots & \vdots \\ k(z_i^K, z_j^1 \mid \theta) & \cdots & k(z_i^K, z_j^K \mid \theta) \end{pmatrix} \in \mathbb{R}^{K \times K},$$

where $i, j \in \{1, \dots, N\}$. If performance is critical, a further diagonal assumption can be made. This would correspond to the assumption that the confidence estimates for the different classes of a given data point are independent. Note that we drop the explicit dependence on $Z$ and $\theta$ throughout to lighten the notation.

3.2.1 Inducing Points

Our goal is to compute the posterior $p(f \mid y)$. In order to reduce the computational complexity from $\mathcal{O}((NK)^3)$ we focus on inducing point methods. We define $M$ inducing inputs $W = (w_1, \dots, w_M)^\top \in \mathbb{R}^M$ and inducing variables $u \in \mathbb{R}^M$ with the goal to only retain a sparse representation of our Gaussian process at these points. The joint distribution is given by

$$p(f, u) = \mathcal{N}\left(\begin{pmatrix} f \\ u \end{pmatrix} \,\middle|\, \begin{pmatrix} \mu_f \\ \mu_u \end{pmatrix}, \begin{pmatrix} \Sigma_f & \Sigma_{f,u} \\ \Sigma_{f,u}^\top & \Sigma_u \end{pmatrix}\right), \qquad (3.4)$$

where the covariance matrix between the calibration data and the inducing points is given by

$$\Sigma_{f,u} = \begin{pmatrix} k(z_1^1, w_1) & \cdots & k(z_1^1, w_M) \\ \vdots & \ddots & \vdots \\ k(z_1^K, w_1) & \cdots & k(z_1^K, w_M) \\ \vdots & \ddots & \vdots \\ k(z_N^K, w_1) & \cdots & k(z_N^K, w_M) \end{pmatrix} \in \mathbb{R}^{NK \times M}$$

and the covariance matrix at the inducing points by

$$\Sigma_u = \begin{pmatrix} k(w_1, w_1 \mid \theta) & \cdots & k(w_1, w_M \mid \theta) \\ \vdots & \ddots & \vdots \\ k(w_M, w_1 \mid \theta) & \cdots & k(w_M, w_M \mid \theta) \end{pmatrix} \in \mathbb{R}^{M \times M}.$$

Using Bayes' theorem and the conditional independence of $y$ and $u$ given $f$, the joint can be factorised as

$$p(y, f, u) = p(y \mid f)\,p(f \mid u)\,p(u). \qquad (3.5)$$

We aim to find a variational approximation $q(u) = \mathcal{N}(u \mid m, S)$ to the posterior $p(u \mid y)$. For general treatments of variational inference we refer interested readers to [55, 56].

3.2.2 Bound on the Marginal Likelihood

We find the variational parameters $m$ and $S$, the locations of the inducing inputs $w$ and the kernel parameters $\theta$ by optimising a lower bound to the marginal log-likelihood $\log p(y)$. To begin, consider the following bound, derived by marginalisation and Jensen's inequality,

$$\log p(y \mid u) \geq \mathbb{E}_{p(f \mid u)}[\log p(y \mid f)]. \qquad (3.6)$$

We then substitute eq. (3.6) into the lower bound to the evidence (ELBO) as follows

$$\begin{aligned}
\log p(y) &= \mathrm{KL}[q(u) \,\|\, p(u \mid y)] + \mathrm{ELBO}(q(u)) \\
&\geq \mathrm{ELBO}(q(u)) \\
&= \mathbb{E}_{q(u)}[\log p(y, u)] - \mathbb{E}_{q(u)}[\log q(u)] \\
&= \mathbb{E}_{q(u)}[\log p(y \mid u)] - \mathrm{KL}[q(u) \,\|\, p(u)] \\
&\geq \mathbb{E}_{q(u)}\big[\mathbb{E}_{p(f \mid u)}[\log p(y \mid f)]\big] - \mathrm{KL}[q(u) \,\|\, p(u)] \\
&= \mathbb{E}_{q(f)}[\log p(y \mid f)] - \mathrm{KL}[q(u) \,\|\, p(u)] \\
&= \mathbb{E}_{q(f)}\Big[\log \prod_{n=1}^N p(y_n \mid f_n)\Big] - \mathrm{KL}[q(u) \,\|\, p(u)] \\
&= \sum_{n=1}^N \mathbb{E}_{q(f_n)}[\log p(y_n \mid f_n)] - \mathrm{KL}[q(u) \,\|\, p(u)],
\end{aligned} \qquad (3.7)$$

where the second to last equality holds by independence of the training data and

$$q(f) := \int p(f \mid u)\,q(u)\,\mathrm{d}u.$$

By eq. (3.4) and Theorem B.3 we obtain

$$p(f \mid u) = \mathcal{N}\big(f \mid \mu_f + \Sigma_{f,u}\Sigma_u^{-1}(u - \mu_u),\; \Sigma_f - \Sigma_{f,u}\Sigma_u^{-1}\Sigma_{f,u}^\top\big).$$

With $q(u) = \mathcal{N}(u \mid m, S)$ and $A := \Sigma_{f,u}\Sigma_u^{-1}$, we have

$$q(f) := \int \underbrace{p(f \mid u)\,q(u)}_{q(f,u)}\,\mathrm{d}u = \mathcal{N}\big(f \mid \mu_f + A(m - \mu_u),\; \Sigma_f + A(S - \Sigma_u)A^\top\big), \qquad (3.8)$$

as $q(f, u)$ is again normally distributed by Theorem B.4 and marginals of multivariate normal distributions are normally distributed by Theorem B.1.

Figure 3.2: Taylor approximation of the log-softargmax function. Illustration of the second-order Taylor approximation to the log-softargmax function (3.10) for a binary calibration problem with $y = 0$ and mean of the variational distribution $\varphi_n = (0, 0)^\top$.

3.2.3 Computation of the Expectation Terms

In order to obtain the variational objective eq. (3.7) we need to compute the expected value terms

$$\mathbb{E}_{q(f_n)}[\log p(y_n \mid f_n)] = \mathbb{E}_{q(f_n)}\left[\log \frac{\exp(f_n^{y_n})}{\sum_{k=1}^K \exp(f_n^k)}\right] = \varphi_n^{y_n} - \mathbb{E}_{q(f_n)}\left[\log \sum_{k=1}^K \exp\big(f_n^k\big)\right] \qquad (3.9)$$

with respect to the $K$-dimensional marginals of $q(f)$,

$$q(f_n) = \int p(f_n \mid u)\,q(u)\,\mathrm{d}u = \mathcal{N}(f_n \mid \varphi_n, C_n),$$

which are normally distributed. To compute the intractable expectation terms (3.9), we use a second order Taylor approximation of

$$h(f_n) := \log p(y_n \mid f_n) = \log \frac{\exp(f_n^{y_n})}{\sum_{k=1}^K \exp(f_n^k)} \qquad (3.10)$$

at $f_n = \varphi_n$. An illustration is shown in Figure 3.2. We begin by computing the Hessian of the log-softargmax. We have

$$D_f \log \sigma(f)_y = D_f \log \frac{\exp(f_y)}{\sum_{k=1}^K \exp(f_k)} = D_f\left[f_y - \log \sum_{k=1}^K \exp(f_k)\right] = e_y - \sigma(f)$$

where $e_y$ is the $y$-th unit vector. Then

$$D_f^2 \log \sigma(f)_y = -D_f \sigma(f) = -\begin{pmatrix} \sigma(f)_1(1 - \sigma(f)_1) & -\sigma(f)_1\sigma(f)_2 & \cdots \\ -\sigma(f)_1\sigma(f)_2 & \sigma(f)_2(1 - \sigma(f)_2) & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix} = \sigma(f)\sigma(f)^\top - \mathrm{diag}(\sigma(f)).$$

Note that, somewhat surprisingly, this expression does not depend on $y$. Now we obtain, by using $x^\top M x = \mathrm{tr}(x^\top M x)$, the linearity of the trace and its invariance under cyclic permutations, that

$$\begin{aligned}
\mathbb{E}_{q(f_n)}[\log p(y_n \mid f_n)] &= \mathbb{E}_{q(f_n)}[h(f_n)] \\
&\approx \mathbb{E}_{q(f_n)}\Big[h(\varphi_n) + D_{f_n}h(\varphi_n)^\top (f_n - \varphi_n) + \tfrac{1}{2}(f_n - \varphi_n)^\top D^2_{f_n}h(\varphi_n)(f_n - \varphi_n)\Big] \\
&= h(\varphi_n) + \tfrac{1}{2}\,\mathbb{E}_{q(f_n)}\Big[(f_n - \varphi_n)^\top \big(\sigma(\varphi_n)\sigma(\varphi_n)^\top - \mathrm{diag}(\sigma(\varphi_n))\big)(f_n - \varphi_n)\Big] \\
&= h(\varphi_n) + \tfrac{1}{2}\,\mathrm{tr}\Big[\mathbb{E}_{q(f_n)}\big[(f_n - \varphi_n)(f_n - \varphi_n)^\top\big]\big(\sigma(\varphi_n)\sigma(\varphi_n)^\top - \mathrm{diag}(\sigma(\varphi_n))\big)\Big] \\
&= \log p(y_n \mid \varphi_n) + \tfrac{1}{2}\,\mathrm{tr}\Big[C_n\big(\sigma(\varphi_n)\sigma(\varphi_n)^\top - \mathrm{diag}(\sigma(\varphi_n))\big)\Big] \\
&= \log p(y_n \mid \varphi_n) + \tfrac{1}{2}\Big(\mathrm{tr}\big[\sigma(\varphi_n)^\top C_n \sigma(\varphi_n)\big] - \mathrm{tr}\big[C_n\,\mathrm{diag}(\sigma(\varphi_n))\big]\Big) \\
&= \log p(y_n \mid \varphi_n) + \tfrac{1}{2}\Big(\sigma(\varphi_n)^\top C_n \sigma(\varphi_n) - \mathrm{diag}(C_n)^\top \sigma(\varphi_n)\Big),
\end{aligned}$$

where the first-order term vanishes since $\mathbb{E}_{q(f_n)}[f_n - \varphi_n] = 0$. The final expression can be computed in $\mathcal{O}(K^2)$ by expressing the term inside the parentheses as a double sum over $K$ terms. Computing the KL-divergence term in (3.7) is in $\mathcal{O}(M^3)$. Therefore, computing the objective (3.7) has complexity $\mathcal{O}(NK^2 + M^3)$. As we assume $M \ll N$, most of the computational expense will be in computing the $N$ expectation terms. Note that this can be remedied through parallelisation, as all $N$ expectation terms can be computed independently. The optimisation over the variational parameters, the kernel parameters and the inducing point locations can then be performed with a gradient-based optimiser, where the required gradients can be derived analytically or, as in our case, obtained by automatic differentiation.
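As a sanity check on the closed form above, the following minimal NumPy sketch evaluates the Taylor-approximated expectation term for a single calibration point given the marginal mean $\varphi_n$ and covariance $C_n$; the function and variable names are illustrative.

import numpy as np

def softargmax(f):
    f = f - f.max()                       # numerical stability
    e = np.exp(f)
    return e / e.sum()

def approx_expected_log_lik(y_n, phi_n, C_n):
    """Second-order Taylor approximation of E_{q(f_n)}[log p(y_n | f_n)]."""
    s = softargmax(phi_n)
    log_p_at_mean = np.log(s[y_n])
    correction = 0.5 * (s @ C_n @ s - np.diag(C_n) @ s)
    return log_p_at_mean + correction

# Example with K = 3 classes
phi_n = np.array([1.0, 0.2, -0.5])
C_n = 0.1 * np.eye(3)
print(approx_expected_log_lik(0, phi_n, C_n))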

3.3 Prediction

Now that the latent process is fit to the calibration data, we can predict calibrated uncertainties for new input data. Given the approximate posterior

$$p(f, u \mid y) \approx q(f, u) := p(f \mid u)\,q(u),$$

predictions at new inputs $Z_*$ are obtained via

$$\begin{aligned}
p(f_* \mid y) &= \int p(f_* \mid f, u)\,p(f, u \mid y)\,\mathrm{d}f\,\mathrm{d}u \\
&\approx \int p(f_* \mid f, u)\,p(f \mid u)\,q(u)\,\mathrm{d}f\,\mathrm{d}u \\
&= \int p(f_* \mid u)\,q(u)\,\mathrm{d}u.
\end{aligned}$$

Note that $p(f_* \mid y)$ is Gaussian by Theorem B.4, as in eq. (3.8). Means and variances of a latent value $f_* \in \mathbb{R}^K$ can be computed in $\mathcal{O}(KM^2)$. The class prediction $y_*$ is then obtained by evaluating the integral

$$p(y_* \mid y) = \int p(y_* \mid f_*)\,p(f_* \mid y)\,\mathrm{d}f_*$$

via Monte-Carlo integration. While inference and prediction have a higher computational cost than other calibration methods, it is small compared to the training time of the underlying classifier, since usually only a small fraction of the data is necessary for calibration.

3.4 Online Calibration

Streaming sparse Gaussian process approximations [57] could allow for an extension of our approach to the online setting. This is particularly interesting in active learning applications, where we aim to calibrate as data is coming in sequentially.

The comparatively higher computational cost of Gaussian process calibration is remedied in the online setting by three factors. First, calibration is completely independent of model training and prediction. Second, calibration can be done in parallel to online classification and be incorporated once it is completed. Third, in the streaming setting fewer samples for calibration can be requested and there is ample time between them to calibrate.

3.5 Implementation

All calibration methods from Section 2.4 were implemented using Python 3.6 and are available as a package with documentation at https://www.github.com/JonathanWenger/pycalib. An example script demonstrating the use of pycalib is shown in Listing 3.1. We implemented the inference method for GP calibration outlined above using gpflow [58], a Python framework for Gaussian process models which builds on tensorflow [59]. This allows for automatic differentiation to obtain the gradient of the variational objective eq. (3.7) with respect to the variational parameters m and S, the locations of the inducing inputs w and the kernel parameters θ. While these gradients can be derived analytically, automatic differentiation reduces implementation length and complexity.

Listing 3.1: Code usage example of pycalib. Python code demonstrating how to calibrate a classifier on the MNIST data set using Gaussian process calibration.

# Package imports
import numpy as np
import sklearn.datasets
import sklearn.model_selection
from sklearn.ensemble import RandomForestClassifier
import pycalib.calibration_methods as calm

# Seed and data size
seed = 0
n_test = 10000
n_calib = 1000

# Download MNIST data
X, y = sklearn.datasets.fetch_openml('mnist_784', version=1, return_X_y=True, cache=True)
X = X / 255.
y = np.array(y, dtype=int)

# Split data into train, calibration and test
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, test_size=n_test, random_state=seed)
X_train, X_calib, y_train, y_calib = sklearn.model_selection.train_test_split(
    X_train, y_train, test_size=n_calib, random_state=seed)

# Train classifier
rf = RandomForestClassifier(random_state=seed)
rf.fit(X_train, y_train)
p_uncal = rf.predict_proba(X_test)

# Calibrate on the calibration set and predict calibrated probabilities
gpc = calm.GPCalibration(n_classes=10, random_state=seed)
gpc.fit(rf.predict_proba(X_calib), y_calib)
p_pred = gpc.predict_proba(p_uncal)

Chapter 4

Experiments

We experimentally evaluate our approach against the calibration methods presented in Section 2.4. We use a range of different classifiers on a set of binary and multi-class computer vision benchmark data sets and subsequently calibrate them. Besides convolutional neural networks, we are also interested in ensemble methods such as boosting and forests. These models are still relevant in computer vision and robotics due to their smaller computational cost during training and prediction compared to large-scale neural network architectures.

All methods and experiments were implemented in Python 3.6. We either used the authors' original code, if available, or re-implemented the calibration methods based on the respective publications. For our GP-based method we used a log mean function and a sum kernel consisting of an RBF and a white noise kernel, as in eq. (3.1). All experiments were performed with the implementation described in Section 3.5.
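For reference, such a sum kernel can be constructed in a few lines with GPflow; the snippet below assumes GPflow 2.x (the exact version used in our implementation may differ) and the parameter values are illustrative rather than the ones used in the experiments.

import gpflow

# Sum kernel: squared exponential (RBF) part plus white noise, cf. eq. (3.1)
kernel = gpflow.kernels.RBF(variance=1.0, lengthscales=1.0) \
         + gpflow.kernels.White(variance=1e-4)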

We report the average ECE1, estimated with 100 bins, over 10 Monte-Carlo cross validation runs. This means we sample a calibration data set without replacement from the total data each time; splits may therefore contain the same data points. We choose this cross validation strategy as it allows us to choose the size of the calibration set and the number of validation splits freely. A sketch of the binned ECE1 estimator is given after the data set list below. We used the following data sets with indicated train, calibration and test splits:

• KITTI [60, 61]: Stream-based urban traffic scenes with features [62] from seg- mented 3D point clouds. 8 or 2 classes, dimension 60, train: 16000, calibration: 1000, test: 8000. • PCam [63]: Detection of metastatic tissue in histopathologic scans of lymph node sections converted to gray scale. 2 classes, dimension 96 96, train: 22768, calibration: 1000, test: 9000. × • MNIST [19]: Handwritten digit recognition. 10 classes, dimension 28 28, train: 60000, calibration: 1000, test: 9000. × • ImageNet 2012 [64]: Image database of natural objects and scenes. 1000 classes, train: 1.2 million, calibration: 1000, test: 9000.
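The following is a minimal NumPy sketch of a standard binned ECE1 estimator (equal-width bins, weighted by the fraction of samples per bin); we assume this matches the estimator used for the reported numbers, and the function name is illustrative.

import numpy as np

def ece_1(probs, labels, n_bins=100):
    """Binned estimate of the expected calibration error ECE_1.

    probs: (N, K) array of predicted class probabilities; labels: (N,) true classes.
    """
    conf = probs.max(axis=1)                      # confidence z_hat of the predicted class
    correct = probs.argmax(axis=1) == labels
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap            # weight by fraction of samples in the bin
    return ece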

We give a more detailed explanation of each data set in the following paragraphs.

KITTI

The KITTI vision benchmark suite [60] is an autonomous driving data set captured via video cameras, a laser scanner and a GPS localisation system. The sensors were mounted on a car driving through various German urban traffic environments. We use a version of the data set from the benchmark suite as described in [61], which consists of 18 streams of annotated, segmented point clouds, which were concatenated into one stream. Each feature vector represents points in a bounding box around an object. Subsequently, a 60-dimensional feature vector is computed via point feature histograms [62]. These features are particularly suitable for the online setting as they are computable in real time and low-dimensional compared to the original data. For the binary setting we group all classes which are not cars into one class. An illustration of the KITTI data set is shown in Figure 4.1. Each of the indicated bounding boxes contains a point cloud which is then converted to a feature representation as described.

Figure 4.1: Traffic scene from the KITTI data set. Still image captured from an example sequence of the KITTI data set [60] showing point clouds in white on black background, ground truth bounding boxes in color and a road overlay. The image at the top shows the camera image recorded by the stereo camera system with bounding boxes added.

PCam

The PatchCamelyon [63] data set consists of 96x96 color images depicting histopathological scans of lymph node tissue. Each image is labelled according to the presence of metastatic tissue. As medical applications are becoming increasingly important in machine learning and particularly computer vision, this data set provides a suitable benchmark. Detection of metastatic tissue is a clinically relevant task and thus motivates the choice of this data set. A subset of the images is shown in Figure 4.2. In our application we converted the images to gray scale to reduce the dimensionality of the data set and to allow for training of ensemble methods without preceding feature extraction. The color uniformity in the data for different scans justifies this choice.

Figure 4.2: Sample scans from the PCam data set. Example images from the PCam data set [63] depicting scans of lymph node tissue. Samples with metastatic tissue in the center are indicated by green boxes and given a positive label.

MNIST

The MNIST database [19] of handwritten digits is a commonly used computer vision benchmark data set containing 70,000 size-normalised and centered 28x28 pixel gray scale images. The images depict handwritten digits 0 through 9 and thus the data set has 10 classes. Some random samples from the data set are shown in Figure 4.3.

Figure 4.3: Sample digits from the MNIST data set. Randomly drawn samples from the MNIST database [19] of handwritten digits.

ImageNet

ImageNet [64] is an annotated database of more than 14 million images of everyday objects and scenery classified into over 20,000 different categories. It was published as a benchmark vision data set as part of a computer vision research competition. Here, we use a curated subset of the database with 1000 different classes and a labelled validation set of 50,000 images. A subset of images contained in the data set is shown in Figure 4.4.

4.1 Synthetic Data

We begin by outlining a procedure to generate synthetic calibration data directly, without having to classify data first, i.e. a procedure to generate vectors of probabilities. In this way we can test calibration methods and generate illustrative figures.

Figure 4.4: Samples from the ImageNet data set. Illustrative samples from the ImageNet data set showing a wide variety of different classes. During classification images are rescaled to uniform dimensions. Reprinted from [64].

We draw confidence estimates for the predicted class from a Beta distribution, apply a given function miscalibrating the confidence estimates and finally sample class predictions. In this way we can control the shape of the confidence histogram and the degree of miscalibration.

We begin by sampling a set of confidence estimates from a Beta distribution scaled to the interval $[\frac{1}{K}, 1]$. We do this to be able to choose the distribution of confidence estimates freely. We obtain

$$\hat{z} \sim \left(1 - \frac{1}{K}\right)\mathrm{Beta}(\alpha, \beta) + \frac{1}{K},$$

where $\alpha, \beta \in (0, \infty)$. Next, we sample ground truth class labels for a multi-class problem from a categorical distribution

$$y \sim \mathrm{Categorical}(\rho),$$

where $\rho_k = p(y = C_k) \in [0, 1]$ determines the marginal class probabilities. Since we are aiming to generate miscalibrated predictions, we now sample the correct prediction not with probability $\hat{z}$, but with probability $g(\hat{z})$, where $g : [\frac{1}{K}, 1] \to [\frac{1}{K}, 1]$ is called the miscalibration function. It specifies the mapping from predicted confidence to accuracy of our synthetic classification output. This allows us to specify the degree and type of miscalibration for the synthetic experiment. For example, we can emulate predictions from a random forest by choosing a function which is always larger than the identity. This produces underconfident predictions, as is often observed for ensemble methods. Thus, we sample the predicted class labels from a two-point distribution, which is a Bernoulli distribution with arbitrary support $\{a, b\}$ rather than $\{0, 1\}$. We have

$$\hat{y} \sim \mathrm{TwoPoint}(g(\hat{z}), a, b),$$

where $a = y$ and $b = j$ for $j$ uniformly sampled from the remaining classes $\{1, \dots, K\} \setminus \{y\}$.

Finally, we need to generate the predicted probabilities for the other classes besides $\hat{y}$. There are two constraints: one, $\hat{z}$ has to stay maximal and two, the $z_i$ need to sum to one as they represent posterior class probabilities. We achieve this by using a conditional stick-breaking process conditioned on $\hat{z}$ being maximal. We set $\theta_0 = \hat{z}$ and sample $(K - 1)$ times from a Beta distribution

$$\theta_k \sim \mathrm{Beta}(1, \gamma),$$

where $\gamma \in (0, \infty)$, and rescale each time such that

$$\max\left(1 - \sum_{l=0}^k \theta_l - (K - k - 1)\theta_0,\; 0\right) \leq \theta_{k+1} \leq \min\left(\theta_0,\; 1 - \sum_{l=0}^k \theta_l\right). \qquad (4.1)$$

This implies $\theta_0 \geq \theta_k$ for $k \in \{1, \dots, K-1\}$. Finally, in order to remove the dependence between the class probabilities introduced by the monotonicity from eq. (4.1), we uniformly draw probabilities for the non-predicted classes $\{1, \dots, K\} \setminus \{\hat{y}\}$ out of $\{\theta_k\}_{k=1}^{K-1}$. This completes the synthetic data generation. Figure 4.5 illustrates the use of GP calibration on a synthetic data set generated with this procedure.
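A minimal Python sketch of this generator is given below. It follows the first steps of the procedure (Beta-distributed confidence, miscalibration function g, two-point sampling of the predicted label); the remaining class probabilities are filled in by rejection-sampled Dirichlet weights instead of the conditional stick-breaking scheme, and all function names are illustrative.

import numpy as np

def sample_synthetic_predictions(n, K, alpha, beta, g, rho=None, seed=0):
    """Simplified sketch of the synthetic calibration data generator described above."""
    rng = np.random.default_rng(seed)
    rho = np.full(K, 1.0 / K) if rho is None else np.asarray(rho)

    # Confidence of the predicted class, Beta-distributed and scaled to [1/K, 1]
    z_hat = (1.0 - 1.0 / K) * rng.beta(alpha, beta, size=n) + 1.0 / K
    # Ground-truth labels with marginal class probabilities rho
    y = rng.choice(K, size=n, p=rho)

    # Two-point distribution: prediction equals y with probability g(z_hat),
    # otherwise a uniformly drawn different class
    correct = rng.random(n) < g(z_hat)
    offsets = rng.integers(1, K, size=n)
    y_hat = np.where(correct, y, (y + offsets) % K)

    # Fill in the non-predicted class probabilities so that they sum to 1 - z_hat
    # and z_hat stays maximal (rejection sampling instead of stick-breaking)
    Z = np.zeros((n, K))
    for i in range(n):
        while True:
            rest = (1.0 - z_hat[i]) * rng.dirichlet(np.ones(K - 1))
            if rest.max() <= z_hat[i]:
                break
        Z[i, y_hat[i]] = z_hat[i]
        Z[i, [k for k in range(K) if k != y_hat[i]]] = rest
    return Z, y, y_hat

# Example: overconfident predictions via a miscalibration function below the identity
Z, y, y_hat = sample_synthetic_predictions(
    n=1000, K=10, alpha=5.0, beta=1.0, g=lambda z: z ** 2, seed=0)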

4.2 Binary Benchmark Data

We trained two boosting variants (AdaBoost [65, 66], XGBoost [67]), two forest variants (Mondrian Forest [68], Random Forest [69]) and a simple one layer neural network on the binary KITTI and PCam data sets. We were interested in boosting and forests as they are typically underconfident, in contrast to neural networks, which are overconfident.

We report the average ECE1 in Table 4.1. For binary problems all calibration methods perform similarly, with the exception of isotonic regression, which has particularly low calibration error on the KITTI data set. However, due to its piece-wise constant calibration map, the resulting confidence distribution of the predicted class has a set of singular peaks instead of a smooth distribution. While GP calibration is competitive across data sets and classifiers, it does not outperform any of the other methods and is computationally more expensive. Hence, if exclusively binary problems are of interest, a simple calibration method such as isotonic regression or Beta calibration should be preferred. The simple one layer neural network on the KITTI data set is already well calibrated; nonetheless, all calibration methods except isotonic regression and GP calibration increase the ECE1.

Figure 4.5: Reliability diagrams before and after GP calibration. Reliability diagrams for synthetic data with 10 classes and a train set with 100 data points showing the effect of GP calibration on a test set with 900 instances. (a) No calibration: ECE1 = 0.293. (b) GP calibration: ECE1 = 0.069. The uncalibrated reliability diagram is styled after effects often observed in modern network based image classifiers, which tend to be overconfident.

4.3 Multi-class Benchmark Data

Aside from the aforementioned classification models, which were trained on MNIST, we also calibrated pre-trained convolutional neural network architectures on ImageNet. The following CNNs were used:

• AlexNet [70]
• VGG19 [71]
• ResNet50, ResNet152 [72]
• DenseNet121, DenseNet201 [73]
• Inception v4 [74]
• SE ResNeXt50, SE ResNeXt101 [75, 76]

All binary calibration methods were extended to the multi-class setting in a one-vs-all manner. Temperature scaling was applied to logits for all CNNs and otherwise directly to probability scores.

Table 4.1: Calibration results on binary classification benchmark data sets. Average ECE1 and standard deviation of ten Monte-Carlo cross validation folds on binary benchmark data sets. Lowest calibration error per data set and classification model is indicated in bold.

Data Set | Model | Uncal. | Platt | Isotonic | Beta | BBQ | Temp. | GPcalib
KITTI | AdaBoost | .4301 | .0182 ± .0018 | .0134 ± .0021 | .0180 ± .0016 | .0190 ± .0055 | .0185 ± .0017 | .0192 ± .0018
KITTI | XGBoost | .0434 | .0198 ± .0019 | .0114 ± .0026 | .0178 ± .0015 | .0184 ± .0038 | .0204 ± .0009 | .0186 ± .0017
KITTI | Mondr. Forest | .0546 | .0198 ± .0011 | .0142 ± .0018 | .0252 ± .0099 | .0218 ± .0035 | .0200 ± .0008 | .0202 ± .0018
KITTI | Rand. Forest | .0768 | .0147 ± .0027 | .0135 ± .0030 | .0159 ± .0027 | .0652 ± .0469 | .0126 ± .0020 | .0182 ± .0032
KITTI | 1 layer NN | .0153 | .0285 ± .0034 | .0121 ± .0043 | .0174 ± .0026 | .0178 ± .0056 | .0280 ± .0015 | .0156 ± .0020
PCam | AdaBoost | .2506 | .0409 ± .0020 | .0335 ± .0047 | .0397 ± .0024 | .0330 ± .0077 | .0381 ± .0033 | .0389 ± .0032
PCam | XGBoost | .0605 | .0378 ± .0010 | .0323 ± .0058 | .0356 ± .0028 | .0312 ± .0110 | .0399 ± .0020 | .0332 ± .0039
PCam | Mondr. Forest | .0415 | .0428 ± .0024 | .0291 ± .0066 | .0349 ± .0040 | .0643 ± .0161 | .0427 ± .0013 | .0347 ± .0043
PCam | Rand. Forest | .0798 | .0237 ± .0035 | .0233 ± .0052 | .0293 ± .0053 | .0599 ± .0084 | .0210 ± .0013 | .0285 ± .0023
PCam | 1 layer NN | .2090 | .0717 ± .0051 | .0297 ± .0092 | .0501 ± .0049 | .0296 ± .0102 | .0542 ± .0015 | .0461 ± .0034

The average expected calibration error is shown in Table 4.2. While binary methods still perform reasonably well for 10 classes in the case of MNIST, they worsen calibration in the case of 1000 classes on ImageNet. Moreover, they also skew the posterior predictive distribution so much that accuracy is sometimes severely affected, disqualifying them from use (see Table A.2 in the appendix). Temperature scaling preserves the underlying accuracy of the classifier by definition. Even though GP calibration has no such guarantees, our experiments show very little effect on accuracy (see Table A.2). GP calibration outperforms temperature scaling for boosting methods on MNIST. These tend to be severely underconfident and in the case of AdaBoost have low confidence overall. Only our method is able to handle this. Both temperature scaling and GP calibration perform well across CNN architectures on ImageNet, whereas GP calibration performs particularly well on CNNs which demonstrate high accuracy. Further, in contrast to all other methods, GP calibration preserves low ECE1 for Inception v4. We attribute this desirable behaviour, also seen in the binary case, to the prior assumption that the underlying classification method is already calibrated. The increased flexibility of the non-parametric latent function and its prior assumptions allow our approach to better adapt to various classifiers and data sets.
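For reference, a minimal sketch of one way to extend a binary calibration method to the multi-class setting in a one-vs-all fashion is given below, here using isotonic regression from scikit-learn. The renormalisation step and the function name are illustrative assumptions rather than the exact scheme used in our implementation.

import numpy as np
from sklearn.isotonic import IsotonicRegression

def one_vs_all_calibrate(p_calib, y_calib, p_test):
    """One-vs-all extension of a binary calibrator (isotonic regression here).

    Each class column is calibrated against the binary indicator [y = k]; the
    calibrated scores are then renormalised to sum to one (assumed scheme).
    """
    K = p_calib.shape[1]
    out = np.zeros_like(p_test, dtype=float)
    for k in range(K):
        iso = IsotonicRegression(out_of_bounds="clip")
        iso.fit(p_calib[:, k], (y_calib == k).astype(float))
        out[:, k] = iso.predict(p_test[:, k])
    out = np.clip(out, 1e-12, None)
    return out / out.sum(axis=1, keepdims=True)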

4.4 Active Learning

We hypothesise that a better uncertainty estimation of the posterior through calibration leads to an improved learning process when performing active learning. To evaluate this, we use the multi-class KITTI data set, for which we trained two Mondrian forests. These are particularly suited to the online setting, as they are computationally efficient to train and have the same distribution whether trained online or in batch. We randomly shuffled the data 10 times and requested samples based on an entropy query strategy with a threshold of 0.5. Any samples above the threshold are used for training. Both forests are trained for 1000 samples and subsequently one of them uses 200 samples exclusively for calibration at regularly spaced intervals.
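The entropy query criterion can be sketched in a few lines of Python. Whether the entropy is taken with the natural logarithm and left unnormalised, as below, is an assumption; the thesis only specifies the threshold value of 0.5.

import numpy as np

def request_label(probs, threshold=0.5):
    """Entropy-based query criterion: request a label if the predictive entropy exceeds the threshold."""
    entropy = -np.sum(probs * np.log(np.clip(probs, 1e-12, None)))
    return entropy > threshold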

We report the expected calibration and classification error in Figure 4.6. As we can see, calibration initially incurs a penalty on accuracy for the calibrated forest, as fewer samples are used for training. This penalty is remedied over time through more efficient querying. The same error as the uncalibrated Mondrian forest is reached after a pass through the entire data set, while fewer samples overall were requested.

A look at the influence of calibration on over- and underconfidence in Figure 4.7 illustrates the effect of Theorem 2.6 and the reason for the more conservative label requests and therefore improved efficiency. Underconfidence is reduced at the expense of overconfidence, leading to a more conservative sampling strategy which does not penalise accuracy in the long run.

36 Table 4.2: Calibration results on multi-class classification benchmark data sets. Average ECE1 and standard deviation of ten Monte-Carlo cross validation folds on multi-class benchmark data sets. Lowest calibration error per data set and classification model is indicated in bold.

(The Platt, Isotonic, Beta and BBQ columns are one-vs-all extensions of the binary methods.)

Data Set | Model | Uncal. | Platt | Isotonic | Beta | BBQ | Temp. | GPcalib
MNIST | AdaBoost | .6121 | .2267 ± .0137 | .1319 ± .0108 | .2222 ± .0134 | .1384 ± .0104 | .1567 ± .0122 | .0414 ± .0085
MNIST | XGBoost | .0740 | .0449 ± .0021 | .0176 ± .0018 | .0184 ± .0014 | .0207 ± .0020 | .0222 ± .0015 | .0180 ± .0014
MNIST | Mondr. Forest | .2163 | .0357 ± .0049 | .0282 ± .0021 | .0383 ± .0057 | .0762 ± .0111 | .0208 ± .0012 | .0213 ± .0020
MNIST | Rand. Forest | .1178 | .0273 ± .0039 | .0207 ± .0042 | .0259 ± .0070 | .1233 ± .0005 | .0121 ± .0012 | .0148 ± .0021
MNIST | 1 layer NN | .0262 | .0126 ± .0031 | .0140 ± .0017 | .0168 ± .0018 | .0186 ± .0027 | .0195 ± .0060 | .0239 ± .0023
ImageNet | AlexNet | .0354 | .1143 ± .0128 | .2771 ± .0118 | .2321 ± .006 | .1344 ± .006 | .0336 ± .0038 | .0354 ± .0024
ImageNet | VGG19 | .0375 | .1018 ± .0083 | .2656 ± .0481 | .2484 ± .0069 | .1642 ± .0136 | .0347 ± .0036 | .0351 ± .0042
ImageNet | ResNet50 | .0444 | .0911 ± .0086 | .2632 ± .054 | .2239 ± .0077 | .1627 ± .0119 | .0333 ± .0032 | .0333 ± .0024
ImageNet | ResNet152 | .0525 | .0862 ± .0098 | .2374 ± .0238 | .2177 ± .0159 | .1665 ± .0076 | .0328 ± .003 | .0336 ± .0032
ImageNet | DenseNet121 | .0369 | .0941 ± .0076 | .2374 ± .011 | .2277 ± .009 | .1536 ± .0105 | .0333 ± .0034 | .0331 ± .0038
ImageNet | DenseNet201 | .0421 | .0923 ± .0066 | .2306 ± .0195 | .2195 ± .015 | .1602 ± .0071 | .0319 ± .0029 | .0336 ± .004
ImageNet | Inception v4 | .0311 | .0852 ± .0062 | .2795 ± .0408 | .1628 ± .0095 | .1569 ± .0117 | .0460 ± .0061 | .0307 ± .0017
ImageNet | SE ResNeXt50 | .0432 | .0837 ± .0038 | .2570 ± .0391 | .1723 ± .0179 | .1717 ± .0206 | .0462 ± .0028 | .0311 ± .0033
ImageNet | SE ResNeXt101 | .0571 | .0837 ± .007 | .2718 ± .0367 | .1660 ± .0098 | .1513 ± .0084 | .0435 ± .0061 | .0317 ± .0031


Figure 4.6: Active learning and calibration. ECE1 and classification error for two Mondrian forests trained online on labels requested through an entropy query strategy on the KITTI data set. One Mondrian forest is calibrated at regularly spaced intervals (in gray) using GP calibration. Raw data and a Gaussian process regression up to the average number of queried samples across folds is shown.


Figure 4.7: Effects of calibration on over- and underconfidence in active learning. Over- and underconfidence for two Mondrian forests trained online in an active fashion. The Mondrian forest which was calibrated in regularly spaced intervals (in gray) demonstrates a shift in over- and underconfidence to the ratio determined by Theorem 2.6. Raw data and a Gaussian process regression up to the average number of queried samples across folds is shown.

Chapter 5

Conclusion

In this final chapter we give an overview of this thesis and its conclusions. We then give a more detailed description of further research directions with regard to calibration, our specific approach based on Gaussian processes, and the connection between calibration and active learning.

5.1 Summary

This thesis concerned itself with uncertainty representation in classification with applications in computer vision. We began by introducing different notions of uncertainty representation with a particular focus on calibration. Next, we demonstrated a theoretical connection between over- and underconfidence, two concepts from active learning, and probabilistic calibration. We showed that under perfect calibration, the ratio between over- and underconfidence is determined by the odds of the classifier making a correct prediction. The main contribution of this thesis is a novel multi-class calibration method for arbitrary classifiers based on a latent Gaussian process, allowing for the incorporation of prior knowledge. Its parameters are inferred through an adaptation of scalable variational inference for Gaussian processes. We tested our method against state-of-the-art calibration methods with a range of classifiers on a set of benchmark data sets from computer vision. We found that our method performed well universally across classifiers and data sets, which we attributed to its non-parametric nature. In particular, it showed low calibration error for boosting methods and high-accuracy CNNs. It was also the only method which did not worsen calibration for models which were already calibrated. We further found that on binary classification problems most binary calibration methods perform better than multi-class approaches. However, all binary calibration methods extended via a one-vs-all approach fail for multi-class problems, in particular when the number of classes grows large. Finally, we empirically studied the impact of calibration on querying efficiency in active learning. Our experiment showed that probability calibration can improve uncertainty sampling and result in fewer queries overall. It also empirically demonstrated the theoretical relationship between over- and underconfidence outlined in the beginning.

5.2 Future Work

There are many opportunities for further work on the topics covered in this thesis. We describe three areas in more detail: first, the general study of calibration; second, analysis and extension of our proposed calibration approach; and finally, the relationship between calibration and active learning.

The need for and effectiveness of calibration suggests that accuracy and uncertainty estimation benefit from being treated separately. It would be of considerable interest to develop a theoretical backing for this empirical observation. Further, to judge the usefulness of calibration on various data sets, in particular when sample size is small, a thorough analysis of the effect of calibration set size would be beneficial. The impact on accuracy could be studied in a set of experiments similar to the ones performed here. We hypothesise that there are different optimal ratios between training and calibration data for different classifiers and calibration methods. In particular, parametric methods may need less data, but will not be as adaptable to different classifiers as non-parametric ones.

Our proposed inference approach for GP calibration can be used with an arbitrary kernel and choice of hyperparameters. Calibration could possibly be improved by a different choice of kernel and prior distribution over hyperparameters. Further, in our implementation we fixed the sample size for the Monte-Carlo inference procedure. Here, one could either save computational expense or improve accuracy by analysing what effect the number of samples has on calibration error. Likewise, a more detailed analysis of the Taylor approximation to the expectation terms in the variational objective might yield insights into how close the variational approximation is to the true distribution. The GP calibration approach could further be improved by enforcing a monotone latent process, e.g. via derivative observations [77], to obtain a guarantee on the preservation of accuracy. Finally, it might be possible to extend the variational approach to the online setting (e.g. [57]), which would allow for continuous calibration.

A more thorough study of the effect of calibration on active learning could shed more light on its benefits. We noticed in our experiments that if the underlying classifier did not receive sufficient pre-training, it did not benefit from calibration enough to close the gap to the uncalibrated classifier. Similarly, the optimal size of the calibration data set in the online setting has not yet been studied. A further possibly fruitful direction of research is the development of a switching strategy between calibration and training in the online setting. This is reminiscent of an explore-exploit strategy and also tightly connected to the sample selection for calibration. Finally, calibration could possibly be improved by selecting samples for calibration not by the query criterion, but via what we call active calibration: an active learning query strategy which switches between requesting samples for model training and for probability calibration, based on the uncertainty of the latent Gaussian process of our calibration method.

Bibliography

[1] Dario Amodei et al. “Concrete Problems in AI Safety”. In: CoRR abs/1606.06565 (2016).
[2] Burr Settles. Active learning literature survey. Tech. rep. 55-66. University of Wisconsin, Madison, 2010, p. 11.
[3] Marius Cordts et al. “The Cityscapes Dataset for Semantic Urban Scene Understanding”. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016.
[4] Balaji Lakshminarayanan et al. “Simple and scalable predictive uncertainty estimation using deep ensembles”. In: Advances in Neural Information Processing Systems. 2017, pp. 6402–6413.
[5] Chuan Guo et al. “On calibration of modern neural networks”. In: Proceedings of the 34th International Conference on Machine Learning (ICML). 2017.
[6] Allan H. Murphy. “A New Vector Partition of the Probability Score”. In: Journal of Applied Meteorology (1962-1982) 12.4 (1973), pp. 595–600.
[7] Morris H. DeGroot and Stephen E. Fienberg. “The Comparison and Evaluation of Forecasters”. In: Journal of the Royal Statistical Society. Series D (The Statistician) 32.1/2 (1983), pp. 12–22.
[8] Mahdi Pakdaman Naeini et al. “Obtaining Well Calibrated Probabilities Using Bayesian Binning”. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA. Ed. by Blai Bonet and Sven Koenig. AAAI Press, 2015, pp. 2901–2907.
[9] Meelis Kull et al. “Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers”. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics. Vol. 54. Proceedings of Machine Learning Research. Fort Lauderdale, FL, USA: PMLR, 2017, pp. 623–631.
[10] Aviral Kumar et al. “Trainable Calibration Measures For Neural Networks From Kernel Mean Embeddings”. In: Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018. 2018, pp. 2810–2819.
[11] Juozas Vaicenavicius et al. “Evaluating model calibration in classification”. In: Proceedings of Machine Learning Research. Vol. 89. Proceedings of Machine Learning Research. PMLR, 2019, pp. 3459–3467.

[12] Bianca Zadrozny and Charles Elkan. “Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers”. In: Proceedings of the 18th International Conference on Machine Learning. 2001, pp. 609–616.
[13] Alexandru Niculescu-Mizil and Rich Caruana. “Predicting good probabilities with supervised learning”. In: Proceedings of the 22nd International Conference on Machine Learning. ACM. 2005, pp. 625–632.
[14] Alexandru Niculescu-Mizil and Rich Caruana. “Obtaining Calibrated Probabilities from Boosting.” In: UAI. 2005, p. 413.
[15] David J. C. MacKay. “A Practical Bayesian Framework for Backpropagation Networks”. In: Neural Computation 4.3 (1992), pp. 448–472.
[16] Yarin Gal. “Uncertainty in Deep Learning”. PhD thesis. University of Cambridge, 2016.
[17] James Martens and Roger Grosse. “Optimizing neural networks with kronecker-factored approximate curvature”. In: International conference on machine learning. 2015, pp. 2408–2417.
[18] Jimmy Ba et al. “Distributed Second-Order Optimization using Kronecker-Factored Approximations”. In: ICLR. 2017.
[19] Yann LeCun et al. “Gradient-Based Learning Applied to Document Recognition”. In: Proceedings of the IEEE. Vol. 86/11. 1998, pp. 2278–2324.
[20] Alex Kendall and Yarin Gal. “What uncertainties do we need in bayesian deep learning for computer vision?” In: Advances in Neural Information Processing Systems 30. 2017, pp. 5574–5584.
[21] Gabriel Pereyra et al. “Regularizing Neural Networks by Penalizing Confident Output Distributions”. In: 5th International Conference on Learning Representations, ICLR. 2017.
[22] Wesley Maddox et al. “A Simple Baseline for Bayesian Uncertainty in Deep Learning”. In: arXiv preprint arXiv:1902.02476 (2019).
[23] James Hensman et al. “Scalable Variational Gaussian Process Classification”. In: Proceedings of AISTATS. 2015.
[24] Dimitrios Milios et al. “Dirichlet-based Gaussian Processes for Large-scale Calibrated Classification”. In: Advances in Neural Information Processing Systems 31. 2018, pp. 6008–6018.
[25] John C. Platt. “Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods”. In: Advances in Large-Margin Classifiers. MIT Press, 1999, pp. 61–74.
[26] Hsuan-Tien Lin et al. “A note on Platt’s probabilistic outputs for support vector machines”. In: Machine learning 68.3 (2007), pp. 267–276.
[27] Bianca Zadrozny and Charles Elkan. “Transforming Classifier Scores into Accurate Multiclass Probability Estimates”. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’02. Edmonton, Alberta, Canada: ACM, 2002, pp. 694–699.

[28] Francesco Croce et al. “Provable Robustness of ReLU networks via Maximization of Linear Regions”. In: AISTATS (2019).
[29] Volodymyr Kuleshov and Stefano Ermon. “Estimating Uncertainty Online Against an Adversary.” In: AAAI. 2017, pp. 2110–2116.
[30] Geoff Pleiss et al. “On fairness and calibration”. In: Advances in Neural Information Processing Systems. 2017, pp. 5680–5689.
[31] Jon Kleinberg. “Inherent Trade-Offs in Algorithmic Fairness”. In: SIGMETRICS Perform. Eval. Rev. 46.1 (2018), pp. 40–40.
[32] Volodymyr Kuleshov et al. “Accurate Uncertainties for Deep Learning Using Calibrated Regression”. In: Proceedings of the 35th International Conference on Machine Learning. Vol. 80. Proceedings of Machine Learning Research. PMLR, 2018, pp. 2796–2804.
[33] Hao Song et al. “Distribution Calibration for Regression”. In: Proceedings of the 36th International Conference on Machine Learning (2019).
[34] Fattaneh Jabbari et al. “Obtaining Accurate Probabilistic Causal Inference by Post-Processing Calibration”. In: NIPS Workshop on Causal Inference and Machine Learning. 2017.
[35] Carl Benedikt Frey and Michael A. Osborne. “The Future of Employment: How Susceptible Are Jobs to Computerisation?” In: Oxford Martin 114 (Jan. 2013).
[36] Sandra Wachter and Brent Mittelstadt. “A right to reasonable inferences: Rethinking data protection law in the age of Big Data and AI”. In: Columbia Business Law Review 2019 (Apr. 2019).
[37] Julia Angwin et al. Machine Bias. 2016. url: https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing (visited on 06/16/2019).
[38] Preethi Lahoti et al. “iFair: Learning Individually Fair Data Representations for Algorithmic Decision Making”. In: IEEE 35th International Conference on Data Engineering (ICDE) (2018), pp. 1334–1345.
[39] Joy Buolamwini and Timnit Gebru. “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification”. In: Proceedings of the 1st Conference on Fairness, Accountability and Transparency. Ed. by Sorelle A. Friedler and Christo Wilson. Vol. 81. Proceedings of Machine Learning Research. New York, NY, USA: PMLR, 2018, pp. 77–91.
[40] Eric D. Williams et al. “The 1.7 Kilogram Microchip: Energy and Material Use in the Production of Semiconductor Devices”. In: Environmental Science and Technology 36.24 (2002), pp. 5504–5510.
[41] David Silver et al. “Mastering the game of Go with deep neural networks and tree search”. In: Nature 529 (2016), pp. 484–489.
[42] Trevor Hastie et al. The Elements of Statistical Learning. Springer Series in Statistics. New York, NY, USA: Springer New York Inc., 2001.

[43] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Berlin, Heidelberg: Springer-Verlag, 2006.
[44] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
[45] Andrew Y. Ng and Michael I. Jordan. “On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes”. In: Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic. NIPS’01. Vancouver, British Columbia, Canada: MIT Press, 2001, pp. 841–848.
[46] Afshine Amidi and Shervine Amidi. CS 229 Stanford - Machine Learning. 2018. url: https://stanford.edu/~shervine/teaching/cs-229.html (visited on 10/21/2018).
[47] Allan H. Murphy and Robert L. Winkler. “Diagnostic verification of probability forecasts”. In: International Journal of Forecasting 7 (1992), pp. 435–455.
[48] Ira Cohen and Moises Goldszmidt. “Properties and Benefits of Calibrated Classifiers”. In: 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD). Springer, 2004, pp. 125–136.
[49] D. Mund et al. “Active online confidence boosting for efficient object classification”. In: 2015 IEEE International Conference on Robotics and Automation (ICRA). 2015, pp. 1367–1373.
[50] Gia-Lac Tran et al. “Calibrating Deep Convolutional Gaussian Processes”. In: arXiv preprint arXiv:1805.10522 (2018).
[51] Meelis Kull et al. “Beyond sigmoids: How to obtain well-calibrated probabilities from binary classifiers with beta calibration”. In: Electronic Journal of Statistics 11.2 (2017), pp. 5052–5080.
[52] Miriam Ayer et al. “An Empirical Distribution Function for Sampling with Incomplete Information”. In: The Annals of Mathematical Statistics 26.4 (1955), pp. 641–647.
[53] Alex Krizhevsky et al. “CIFAR-100 (Canadian Institute for Advanced Research)”. In: (2009). url: http://www.cs.toronto.edu/~kriz/cifar.html.
[54] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.
[55] David M Blei et al. “Variational inference: A review for statisticians”. In: Journal of the American Statistical Association 112.518 (2017), pp. 859–877.
[56] Cheng Zhang et al. “Advances in Variational Inference”. In: IEEE transactions on pattern analysis and machine intelligence (2018).
[57] Thang D Bui et al. “Streaming sparse Gaussian process approximations”. In: Advances in Neural Information Processing Systems. 2017, pp. 3299–3307.
[58] Alexander G. de G. Matthews et al. “GPflow: A Gaussian process library using TensorFlow”. In: Journal of Machine Learning Research 18.40 (2017), pp. 1–6.

[59] Martin Abadi et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org. 2015.
[60] Andreas Geiger et al. “Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite”. In: Conference on Computer Vision and Pattern Recognition (CVPR). 2012.
[61] Alexander Narr et al. “Stream-based active learning for efficient and adaptive classification of 3d objects”. In: Robotics and Automation (ICRA), 2016 IEEE International Conference on. IEEE. 2016, pp. 227–233.
[62] Michael Himmelsbach et al. “Real-time object classification in 3D point clouds using point feature histograms”. In: IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE. 2009, pp. 994–1000.
[63] Bastiaan S Veeling et al. “Rotation equivariant CNNs for digital pathology”. In: International Conference on Medical image computing and computer-assisted intervention. Springer. 2018, pp. 210–218.
[64] Olga Russakovsky et al. “ImageNet Large Scale Visual Recognition Challenge”. In: International Journal of Computer Vision (IJCV) 115.3 (2015), pp. 211–252.
[65] Yoav Freund and Robert E Schapire. “A decision-theoretic generalization of on-line learning and an application to boosting”. In: Journal of computer and system sciences 55.1 (1997), pp. 119–139.
[66] Trevor Hastie et al. “Multi-class adaboost”. In: Statistics and its Interface 2.3 (2009), pp. 349–360.
[67] Tianqi Chen and Carlos Guestrin. “XGBoost: A Scalable Tree Boosting System”. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’16. San Francisco, California, USA: ACM, 2016, pp. 785–794.
[68] Balaji Lakshminarayanan et al. “Mondrian Forests: Efficient Online Random Forests”. In: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. NIPS’14. Montreal, Canada: MIT Press, 2014, pp. 3140–3148.
[69] Leo Breiman. “Random Forests”. In: Machine Learning 45.1 (2001), pp. 5–32.
[70] Alex Krizhevsky et al. “ImageNet Classification with Deep Convolutional Neural Networks”. In: Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1. NIPS’12. Lake Tahoe, Nevada: Curran Associates Inc., 2012, pp. 1097–1105.
[71] S. Liu and W. Deng. “Very deep convolutional neural network based image classification using small training sample size”. In: 3rd IAPR Asian Conference on Pattern Recognition (ACPR). 2015, pp. 730–734.
[72] Kaiming He et al. “Deep Residual Learning for Image Recognition”. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), pp. 770–778.
[73] Gao Huang et al. “Densely connected convolutional networks”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.

[74] Christian Szegedy et al. “Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning”. In: AAAI. 2016.
[75] Saining Xie et al. “Aggregated Residual Transformations for Deep Neural Networks”. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), pp. 5987–5995.
[76] Jie Hu et al. “Squeeze-and-Excitation Networks”. In: IEEE Conference on Computer Vision and Pattern Recognition. 2018.
[77] Jaakko Riihimäki and Aki Vehtari. “Gaussian processes with monotonicity information”. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 2010, pp. 645–652.
[78] Thomas B Schön and Fredrik Lindsten. Manipulating the multivariate Gaussian density. Tech. rep. 2011. url: http://users.isy.liu.se/en/rt/schon/Publications/SchonL2011.pdf.

Appendix A

Additional Experimental Results

In this section we present some additional results from the experiments conducted in Chapter 4. The accuracy of the classifiers after calibration is shown in Table A.1 for the binary and in Table A.2 for the multi-class experiments. One very clear result from the multi-class experiments is that one-vs-all calibration methods are not suitable for problems with a large number of classes. Most of them drop significantly in accuracy, disqualifying them from use.

Table A.1: Accuracy after calibration on binary data. Average accuracy and standard deviation of ten Monte-Carlo cross validation folds on binary benchmark data sets.

Data Set | Model | Uncal. | Platt | Isotonic | Beta | BBQ | Temp. | GPcalib
KITTI | AdaBoost | .9463 | .9499 ± .0009 | .9497 ± .0006 | .9499 ± .0009 | .9444 ± .0072 | .9463 ± .0006 | .9465 ± .0005
KITTI | XGBoost | .9674 | .9673 ± .0007 | .9660 ± .0017 | .9671 ± .0011 | .9640 ± .0043 | .9674 ± .0006 | .9675 ± .0006
KITTI | Mondr. Forest | .9536 | .9539 ± .0004 | .9523 ± .0021 | .9532 ± .0009 | .9439 ± .0041 | .9536 ± .0003 | .9536 ± .0004
KITTI | Rand. Forest | .9639 | .9628 ± .0011 | .9616 ± .0017 | .9625 ± .0017 | .8922 ± .0055 | .9639 ± .0007 | .9637 ± .0007
KITTI | 1 layer NN | .9620 | .9644 ± .0007 | .9686 ± .0012 | .9684 ± .0009 | .9647 ± .0060 | .9620 ± .0006 | .9620 ± .0006
PCam | AdaBoost | .7586 | .7609 ± .0030 | .7644 ± .0025 | .7610 ± .0030 | .7638 ± .0032 | .7586 ± .0022 | .7588 ± .0020
PCam | XGBoost | .8086 | .8065 ± .0013 | .8050 ± .0018 | .8068 ± .0015 | .8020 ± .0066 | .8086 ± .0016 | .8084 ± .0016
PCam | Mondr. Forest | .7946 | .7976 ± .0013 | .7954 ± .0032 | .7976 ± .0017 | .7950 ± .0027 | .7946 ± .0012 | .7946 ± .0013
PCam | Rand. Forest | .8487 | .8484 ± .0015 | .8473 ± .0016 | .8482 ± .0016 | .8110 ± .0041 | .8487 ± .0007 | .8483 ± .0008
PCam | 1 layer NN | .5925 | .6239 ± .0070 | .6504 ± .0019 | .6487 ± .0031 | .6458 ± .0082 | .5925 ± .0008 | .5779 ± .0041

Table A.2: Accuracy after calibration on multi-class data. Average accuracy and standard deviation of ten Monte-Carlo cross validation folds on multi-class benchmark data sets.

(The Platt, Isotonic, Beta and BBQ columns are one-vs-all extensions of the binary methods.)

Data Set | Model | Uncal. | Platt | Isotonic | Beta | BBQ | Temp. | GPcalib
MNIST | AdaBoost | .7311 | .6601 ± .0097 | .6787 ± .0049 | .6642 ± .009 | .6540 ± .0061 | .7311 ± .0009 | .7289 ± .0020
MNIST | XGBoost | .9333 | .933 ± .0011 | .9312 ± .0014 | .9331 ± .0011 | .9274 ± .0022 | .9333 ± .0006 | .9333 ± .0006
MNIST | Mondr. Forest | .9133 | .9144 ± .0015 | .9118 ± .0014 | .9142 ± .0015 | .7475 ± .0138 | .9133 ± .0008 | .9132 ± .0008
MNIST | Rand. Forest | .9448 | .9461 ± .0012 | .9445 ± .001 | .9453 ± .0010 | .0004 ± .0003 | .9448 ± .0006 | .9457 ± .0011
MNIST | 1 layer NN | .9625 | .9624 ± .0007 | .9620 ± .0011 | .9626 ± .0011 | .9557 ± .0026 | .9625 ± .0007 | .9517 ± .0011
ImageNet | AlexNet | .5649 | .3437 ± .0072 | .3476 ± .0050 | .3490 ± .0076 | .1861 ± .0039 | .5649 ± .0031 | .5626 ± .0037
ImageNet | VGG19 | .7247 | .4475 ± .0074 | .4584 ± .0055 | .4496 ± .0079 | .2584 ± .0105 | .7247 ± .0026 | .7233 ± .0036
ImageNet | ResNet50 | .7600 | .4654 ± .0085 | .4731 ± .0080 | .4780 ± .0087 | .2648 ± .0088 | .7600 ± .0027 | .7587 ± .0025
ImageNet | ResNet152 | .7850 | .4790 ± .0088 | .4919 ± .0109 | .4938 ± .0078 | .2747 ± .0079 | .7850 ± .0043 | .7834 ± .0044
ImageNet | DenseNet121 | .7451 | .4598 ± .0072 | .4698 ± .0102 | .4615 ± .0097 | .2402 ± .0055 | .7451 ± .0039 | .7430 ± .0027
ImageNet | DenseNet201 | .7702 | .4754 ± .0070 | .4804 ± .0100 | .4823 ± .0037 | .2583 ± .0060 | .7702 ± .0035 | .7707 ± .0023
ImageNet | Inception v4 | .8000 | .4939 ± .0055 | .5060 ± .0119 | .5051 ± .0121 | .2610 ± .0096 | .8000 ± .0032 | .8009 ± .0043
ImageNet | SE ResNeXt50 | .7914 | .4865 ± .0107 | .4999 ± .0058 | .4965 ± .0076 | .3154 ± .0058 | .7914 ± .0043 | .7890 ± .0035
ImageNet | SE ResNeXt101 | .8021 | .4963 ± .0076 | .5111 ± .0079 | .5040 ± .0098 | .2525 ± .0067 | .8021 ± .0029 | .8018 ± .0026

Appendix B

Multivariate Normal Distribution

Here we collect some useful results on the multivariate normal distribution, which are used throughout this thesis. Let $x \sim \mathcal{N}(\mu, \Sigma)$, where $\mu \in \mathbb{R}^n$ and $\Sigma \in \mathbb{R}^{n \times n}$ is positive definite. Assume that we can partition $x$, its mean and covariance as follows:

$$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_1 & \Sigma_{1,2} \\ \Sigma_{2,1} & \Sigma_2 \end{pmatrix}\right).$$

Then the following theorems hold.

Theorem B.1 (Marginalization) Let $x \sim \mathcal{N}(\mu, \Sigma)$, then the marginal distribution of $x_2$ is given by

$$x_2 \sim \mathcal{N}(\mu_2, \Sigma_2).$$

Proof. The result follows directly from the definition of the multivariate normal distribution, the rules of integration and standard application of linear algebra.

Theorem B.2 (Affine Transformation) Let $\mu \in \mathbb{R}^n$ and $\Sigma \in \mathbb{R}^{n \times n}$ be symmetric and positive definite, then for $x \sim \mathcal{N}(\mu, \Sigma)$, $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$, we have

$$y = Ax + b \sim \mathcal{N}(A\mu + b,\, A\Sigma A^\top).$$

Proof. One can prove this by checking the characteristic function

$$\varphi_y(s) = \mathbb{E}\left[e^{i s^\top y}\right]$$

and using some linear algebra. The result follows by the uniqueness property of characteristic functions. A detailed proof can be found in any introductory book on probability theory.

Theorem B.3 (Conditioning) Let $x \sim \mathcal{N}(\mu, \Sigma)$, then the conditional distribution of $x_1 \mid x_2$ is Gaussian and given by

$$x_1 \mid x_2 \sim \mathcal{N}\big(\mu_1 + \Sigma_{1,2}\Sigma_2^{-1}(x_2 - \mu_2),\; \Sigma_{1|2}\big),$$

where

$$\Sigma_{1|2} = \Sigma_{1,1} - \Sigma_{1,2}\Sigma_2^{-1}\Sigma_{2,1}.$$

Proof. A proof is available in any standard textbook on probability theory.

Analogously we consider this result in the other direction.

Corollary B.4 Let $p(x_2) = \mathcal{N}(x_2 \mid \mu_2, \Sigma_2)$ and $p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid M x_2 + b, \Sigma_{1|2})$, where $x_1 \in \mathbb{R}^{n_1}$, $x_2 \in \mathbb{R}^{n_2}$, $M \in \mathbb{R}^{n_1 \times n_2}$ and $b \in \mathbb{R}^{n_1}$. Then the joint distribution of $x_1$ and $x_2$ is given by

$$p(x_1, x_2) = \mathcal{N}\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \,\middle|\, \begin{pmatrix} M\mu_2 + b \\ \mu_2 \end{pmatrix}, \bar{\Sigma}\right), \qquad \bar{\Sigma} = \begin{pmatrix} \Sigma_{1|2} + M\Sigma_2 M^\top & M\Sigma_2 \\ \Sigma_2 M^\top & \Sigma_2 \end{pmatrix}. \qquad (B.1)$$

Proof. A proof can be found in Appendix A.2 of [78].

TRITA-EECS-EX-2019:495

www.kth.se