DEGREE PROJECT IN ENGINEERING PHYSICS, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2019

Non-Parametric Calibration for Classification

JONATHAN WENGER

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Kungliga Tekniska högskolan (KTH)

Teknisk fysik

Degree Project

Non-Parametric Calibration for Classification Jonathan Wenger [email protected]

Supervisor: Prof. Dr. Hedvig Kjellström Examiner: Prof. Dr. Danica Kragic Submission Date: July 3, 2019

Abstract

Many applications for classification methods not only require high accuracy but also reliable estimation of predictive uncertainty. This is of particular importance in fields such as computer vision or robotics, where safety-critical decisions are made based on classification outcomes. However, while many current classification frameworks, in particular deep neural network architectures, provide very good results in terms of accuracy, they tend to incorrectly estimate their predictive uncertainty.

In this thesis we focus on probability calibration, the notion that a classifier’s confi- dence in a prediction matches the empirical accuracy of that prediction. We study calibration from a theoretical perspective and connect it to over- and underconfi- dence, two concepts first introduced in the context of active learning.

The main contribution of this work is a novel algorithm for classifier calibration. We propose a non-parametric calibration method which is, in contrast to existing approaches, based on a latent Gaussian process and specifically designed for multi- class classification. It allows for the incorporation of prior knowledge, can be applied to any classification method that outputs confidence estimates and is not limited to neural networks.

We demonstrate the universally strong performance of our method across different classifiers and benchmark data sets from computer vision in comparison to exist- ing classifier calibration techniques. Finally, we empirically evaluate the effects of calibration on querying efficiency in active learning.

Sammanfattning

Many applications of classification methods require not only high accuracy but also reliable estimation of the uncertainty of the predicted outcome. This is of particular importance in fields such as computer vision or robotics, where safety-critical decisions are made based on classification results. While many current classification tools, in particular deep neural network architectures, deliver good results in terms of accuracy, they tend to incorrectly estimate their predictive uncertainty.

In this degree project we focus on probability calibration, i.e. how well a classifier's confidence in a prediction matches the actual empirical accuracy. We study calibration from a theoretical perspective and connect it to over- and underconfidence, two concepts first introduced in the context of active learning.

The main part of this work is the development of a new algorithm for classifier calibration. We propose a non-parametric calibration method which, in contrast to existing approaches, is based on a latent Gaussian process and is specifically designed for multi-class classification. The algorithm is not limited to neural networks but can be applied to any classification method that outputs confidence estimates.

We demonstrate the universally strong performance of our method across different classifiers and well-known data sets from computer vision in comparison to existing classifier calibration techniques. Finally, the effect of calibration on querying efficiency in active learning is evaluated empirically.

Acronyms

CNN    convolutional neural network
ECE    expected calibration error
ELBO   evidence lower bound
GP     Gaussian process
GPS    global positioning system
LA     Laplace approximation
MC     Monte Carlo
MCE    maximum calibration error
MCMC   Markov chain Monte Carlo
NLL    negative log-likelihood
NN     neural network
RKHS   reproducing kernel Hilbert space
SVGP   scalable variational Gaussian process
SVM    support vector machine

Notation

Scalars, Vectors and Matrices

θ       scalar or (probability distribution) parameter
x       (column) vector
A       matrix or random variable
tr A    trace of the (square) matrix A

Probability Theory

p(x)          probability density function or probability mass function
p(y | x)      conditional density function
X ∼ D         random variable X is distributed according to distribution D
iid           independent and identically distributed
N(µ, Σ)       (multivariate) normal distribution with mean µ and covariance Σ
N(x | µ, Σ)   density of the (multivariate) normal distribution
Cat(ρ)        categorical distribution with category probabilities ρ
Cat(x | ρ)    probability mass function of the categorical distribution
Beta(α, β)    beta distribution with shape parameters α and β
GP(µ, k)      Gaussian process with mean function µ(·) and covariance function k(·, · | θ)
KL[p ∥ q]     Kullback-Leibler divergence of probability distributions p and q
H(p)          information-theoretic entropy of probability distribution p

Classification and Calibration

x        feature vector
y        class label
ŷ        class prediction
z        output of a classifier, either a vector of class probabilities or logits
ẑ        confidence in prediction
ECE_p    expected calibration error for 1 ≤ p ≤ ∞
N        cardinality of the training data
K        number of classes
M        number of inducing points in a scalable variational approximation

Contents

Abstract i

Acronyms v

Notation vi

List of Tables xi

List of Figures xii

1 Introduction 1
  1.1 Research Question and Contribution 2
  1.2 Related Work 3
  1.3 Societal Aspects, Ethics and Sustainability 4
  1.4 Organisation 6

2 Background 7
  2.1 Classification 7
  2.2 Uncertainty Representation 9
  2.3 Measures of Uncertainty Representation 9
    2.3.1 Negative Log-Likelihood and Cross Entropy 10
    2.3.2 Calibration and Sharpness 10
    2.3.3 Over- and Underconfidence 13
  2.4 Calibration Methods 13
    2.4.1 Binary Calibration 14
    2.4.2 Multi-class Calibration 17
  2.5 Relations between Measures of Uncertainty Representation 18
    2.5.1 Calibration, Over- and Underconfidence 18
    2.5.2 Sharpness, Over- and Underconfidence 20

3 Gaussian Process Calibration 21
  3.1 Definition 21
  3.2 Inference 22
    3.2.1 Inducing Points 23
    3.2.2 Bound on the Marginal Likelihood 24
    3.2.3 Computation of the Expectation Terms 25
  3.3 Prediction 26

  3.4 Online Calibration 27
  3.5 Implementation 27

4 Experiments 29
  4.1 Synthetic Data 31
  4.2 Binary Benchmark Data 33
  4.3 Multi-class Benchmark Data 34
  4.4 Active Learning 36

5 Conclusion 39
  5.1 Summary 39
  5.2 Future Work 40

Bibliography 41

A Additional Experimental Results 47

B Multivariate Normal Distribution 51

List of Tables

2.1 Examples of common loss functions used in classification. Loss functions allow the comparison of different classification models by scoring them using samples from (X,Y ). We list a few common loss functions for a single input - output pair (x, y)...... 8

4.1 Calibration results on binary classification benchmark data sets. Average ECE1 and standard deviation of ten Monte-Carlo cross validation folds on binary benchmark data sets. Lowest calibration error per data set and classification model is indicated in bold. . . . 35 4.2 Calibration results on multi-class classification benchmark data sets. Average ECE1 and standard deviation of ten Monte-Carlo cross validation folds on multi-class benchmark data sets. Lowest calibration error per data set and classification model is indicated in bold...... 37

A.1 Accuracy after calibration on binary data. Average accuracy and standard deviation of ten Monte-Carlo cross validation folds on binary benchmark data sets. 48
A.2 Accuracy after calibration on multi-class data. Average accuracy and standard deviation of ten Monte-Carlo cross validation folds on multi-class benchmark data sets. 49

List of Figures

1.1 Example classification task in autonomous driving. Segmented scenery of Tübingen from the cityscapes data set [3] with a bounding box around an object, demonstrating an example classification task for an autonomous car...... 2 1.2 Motivating example for calibration. We trained a neural network with one hidden layer on MNIST [19] and computed the classification error, the negative log-likelihood (NLL) and the expected calibration error (ECE1) over training epochs. We observe that while accuracy continues to improve on the test set, the ECE1 increases after 20 epochs. Note that this is different from classical overfitting, as the test error continues to decrease. This shows that training and calibration need to be considered independently. This can be mitigated by post- hoc calibration using our method (dashed red line). The uncertainty estimation is improved with maintained classification accuracy. . . .3

2.1 Illustration of the two approaches to modelling in classifica- tion. One can take one of two approaches when trying to model the latent relationship between inputs and outputs in the training data. Either one takes a discriminative approach, modelling the posterior fX,Y (y x) directly or a generative approach modelling the joint dis- | tribution fX,Y (x, y). Reprinted from [46]...... 8 2.2 Illustration of calibration and sharpness. Examples of reliabil- ity diagrams and confidence histograms for a miscalibrated and not sharp classifier, a calibrated, but not sharp classifier, a classifier which is both miscalibrated and sharp and finally a calibrated and sharp classifier. The last classifier is generally the most desirable out of the four shown as its confidence estimates match its empirical accuracy and they are sufficiently close to 0 and 1 to be informative...... 12 2.3 Effect of confidence boosting in active learning. Comparison of gradient and confidence boosting on various data sets with respect to accuracy and querying efficiency. Panel (a) shows learning curves for active and passive gradient and confidence boosting on the PenDigits data set. Gradient boosting displays better accuracy for less queried labels. Panel (b) compares the number of queries per learning epoch of gradient versus confidence boosting on different data sets. Figure reprinted from [49]...... 14

2.4 Diagram illustrating probability calibration. Calibration methods act post-hoc on the output of a classifier in order to improve its uncertainty representation. First, a small subset of the training data is split off and the classification model is trained on the remaining data. Then the split-off data is classified by the model and is used along with the true labels to train the calibration method. Finally, when new data comes in, it is first classified by the underlying model and then the calibration method adjusts the resulting confidence output. 15
2.5 Illustration of the effect of probability calibration. Uncertainty contour plot of a synthetic binary classification problem in two-dimensional feature space. Red indicates probability of class 1 and blue indicates probability of class 0. The first panel shows an uncalibrated classifier. The second panel shows the uncertainty post-calibration. The underlying classifier is underconfident in the border region between the two classes, which is rectified by the calibration method. 15
2.6 Modern NN architectures are miscalibrated. Confidence histograms (top) and reliability diagrams (bottom) of a simple and a modern neural network architecture's confidence estimates on the CIFAR-100 data set [53]. The modern neural network displays lower error but is more overconfident and thus less calibrated. Graphs reprinted from [5]. 18

3.1 Multi-class calibration using a latent Gaussian process. The top panel shows the latent function of multi-class GP calibration with prior mean µ(z) = log(z) on a synthetic calibration data set with four classes and 100 calibration samples. Shading represents a 95% credibility interval. The bottom panel shows input confidence from the calibration data and its labels. One can see that the calibration uncertainty is higher in regions with less input data...... 22 3.2 Taylor approximation of the log-softargmax function. Illustra- tion of the second-order Taylor approximation to the log-softargmax function (3.10) for a binary calibration problem with y = 0 and mean > of the variational distribution ϕn = (0, 0) ...... 25

4.1 Traffic scene from the KITTI data set. Still image captured from an example sequence of the KITTI data set [60] showing point clouds in white on black background, ground truth bounding boxes in color and a road overlay. The image at the top shows the camera image recorded by the stereo camera system with bounding boxes added. . 30 4.2 Sample scans from the PCam data set. Example images from the PCam data set [63] depicting scans of lymph node tissue. Samples with metastatic tissue in the center are indicated by green boxes and given a positive label...... 31 4.3 Sample digits from the MNIST data set. Randomly drawn samples from the MNIST database [19] of handwritten digits. . . . . 31

4.4 Samples from the ImageNet data set. Illustratory samples from the ImageNet data set showing a wide variety of different classes. During classification images are rescaled to uniform dimensions. Reprinted from [64]. 32
4.5 Reliability diagrams before and after GP calibration. Reliability diagrams for synthetic data with 10 classes and a train set with 100 data points showing the effect of GP calibration on a test set with 900 instances. The uncalibrated reliability diagram is styled after effects often observed in modern network based image classifiers, which tend to be overconfident. 34
4.6 Active learning and calibration. ECE1 and classification error for two Mondrian forests trained online on labels requested through an entropy query strategy on the KITTI data set. One Mondrian forest is calibrated at regularly spaced intervals (in gray) using GP calibration. Raw data and a Gaussian process regression up to the average number of queried samples across folds is shown. 38
4.7 Effects of calibration on over- and underconfidence in active learning. Over- and underconfidence for two Mondrian forests trained online in an active fashion. The Mondrian forest which was calibrated in regularly spaced intervals (in gray) demonstrates a shift in over- and underconfidence to the ratio determined by Theorem 2.6. Raw data and a Gaussian process regression up to the average number of queried samples across folds is shown. 38

Chapter 1

Introduction

With the recent achievements in machine learning, in particular in the area of deep learning, the range of applications for learning methods has also increased significantly. Especially in challenging fields such as computer vision or speech recognition, important advancements have been made using powerful and complex network architectures, trained on very large data sets. Most of these techniques are used for classification tasks, e.g. object recognition as illustrated in Figure 1.1. We also consider classification in this thesis. However, in addition to achieving high classification accuracy, our goal is to also provide reliable uncertainty estimates for predictions. This is of particular relevance in safety-critical applications [1], such as autonomous driving and robotics. Reliable uncertainties can be used to increase a classifier's precision by reporting only class labels that are predicted with low uncertainty, or for information-theoretic analyses of what was learned and what was not. The latter is especially interesting in the context of active learning [2], where the learner actively selects the most relevant data samples for training via a query function based on the posterior predictive uncertainty of the model.

Unfortunately, current probabilistic classification approaches that inherently provide good uncertainty estimates, such as Gaussian processes, often suffer from a lower accuracy and a higher computational complexity on high-dimensional classification tasks compared to state-of-the-art convolutional neural network (CNN) architec- tures. It was recently observed that many modern CNNs are overconfident [4] and miscalibrated [5]. Here, calibration refers to adapting the confidence output of a classifier such that it matches its true probability of being correct. Originally devel- oped in the context of forecasting [6, 7], probability calibration has seen increased interest in recent years [5, 8–11], partly because of the popularity of CNNs which lack inherent uncertainty representation. Earlier studies show that also classical methods such as decision trees, boosting, SVMs and naive Bayes classifiers tend to be miscalibrated [8, 12–14]. Therefore, we claim that training and calibrating a classifier can be two different objectives that need to be considered separately, as exemplified in a toy example in Figure 1.2. Here, a simple neural network contin- ually improves its accuracy on the test set during training, but eventually overfits in terms of NLL and calibration error. A similar phenomenon was observed in [5] for more complex models. Calibration methods perform a post-hoc improvement to

uncertainty estimation using a small subset of the training data. In this thesis we develop a multi-class calibration method for arbitrary classifiers, to provide reliable predictive uncertainty estimates in addition to maintaining high accuracy.

Figure 1.1: Example classification task in autonomous driving. Segmented scenery of Tübingen from the cityscapes data set [3] with a bounding box around an object, demonstrating an example classification task for an autonomous car.

We note that in contrast to recent approaches which strive to improve uncertainty estimation only for neural networks, including Bayesian neural networks [15, 16] and Laplace approximations (LA) [17, 18], our aim is a framework that is not based on tuning a specific classification method. This has the advantage that the method operates independently of the training process of the classifier and does not rely on training-specific values such as the curvature of the loss function as in LA methods.

1.1 Research Question and Contribution

The research question which will be examined in this thesis is the following. How can prediction uncertainty of a multi-class classifier, applied to computer vision problems, be accurately represented independent of model specification? We make the following contributions in this thesis in an attempt to answer this question.

We show a theoretical link between calibration, over- and underconfidence, con- necting these formerly disparate concepts. Further, we demonstrate on a range of classification models and benchmark data sets that popular classification models are often not calibrated. The main contribution of this thesis is a new multi-class and model-agnostic approach to calibration, based on Gaussian processes. Finally, we study the relationship between active learning and calibration from a theoretical and empirical perspective.

Figure 1.2: Motivating example for calibration. We trained a neural network with one hidden layer on MNIST [19] and computed the classification error, the negative log-likelihood (NLL) and the expected calibration error (ECE1) over training epochs. We observe that while accuracy continues to improve on the test set, the ECE1 increases after 20 epochs. Note that this is different from classical overfitting, as the test error continues to decrease. This shows that training and calibration need to be considered independently. This can be mitigated by post-hoc calibration using our method (dashed red line). The uncertainty estimation is improved with maintained classification accuracy.

1.2 Related Work

Estimation of uncertainty is of considerable interest in the machine learning com- munity at the moment. There are two main approaches in classification. First, by defining a model and loss function which inherently learns a good representation and second, by post-hoc calibration methods which transform output of the underlying model. Uncertainty estimation is also connected to adversarial robustness. Theo- retical results on calibration were previously considered in the fairness literature. Finally, calibration in a broader sense is studied in the regression setting and other applications. We give a short overview of related work in the following paragraphs.

Uncertainty Estimation for Neural Networks Uncertainty estimation in deep learning [20] is generally done by some form of reg- ularisation. Pereyra et al. [21] evaluate two output regularizers for deep NNs, a maximum entropy based confidence penalty and label smoothing. They find that both improve generalisation on common benchmark data sets. Kumar et al. [10] suggest a trainable measure of calibration as a regulariser in an attempt to improve calibration during training. Finally, Maddox et al. [22] employ an approximate technique using stochastic weight averaging to obtain an approx- imate posterior distribution over network weights. Bayesian model averaging is then performed by sampling from the resulting Gaussian distribution.

Gaussian Processes for Large-Scale Problems Gaussian processes provide a principled way to represent uncertainty, but generally perform subpar with regard to accuracy on high-dimensional problems and scalability for very large data sets. Hensman et al. [23] propose a variational inference technique to scale Gaussian processes to large data sets and perform inference for intractable likelihoods. Milios et al. [24] approximate Gaussian process classifiers, which tend to

have good uncertainty estimates, by GP regression on transformed labels for improved scalability.

Calibration Methods for Classification Research on calibration goes back to statistical forecasting [6, 7] and approaches to provide uncertainty estimates for non-probabilistic binary classifiers [25–27]. More recently, Bayesian binning into quantiles [8] and Beta calibration [9] for binary clas- sification and temperature scaling [5] for multi-class problems were proposed. Guo et al. [5] also discovered that modern CNN architectures do not provide calibrated output. A theoretical framework for evaluating calibration in classification was sug- gested by Vaicenavicius et al. [11].

Adversarial Robustness Adversarial robustness is measured via the minimum perturbation in feature space needed to change the classification of a test sample. High uncertainty for adversar- ial samples is desirable. Croce et al. [28] introduce a regularizer which pushes the decision boundary away from data points and thus gives provable robustness guar- antees against adversarial samples. Kuleshov and Ermon [29] propose an algorithm for online re-calibration and assess performance against an adversary.

Algorithmic Fairness Calibration is also a topic in the algorithmic fairness literature [30, 31]. Here, cal- ibration is considered in the sense that if a certain probability is predicted for an outcome, then this probability should match the empirical fraction of the population with this outcome uniformly across all population subgroups.

Calibration Methods for Regression and Other Applications In a broader sense calibration can also be defined for regression. Kuleshov et al. [32] propose a procedure to calibrate an arbitrary regression algorithm and evaluate it on various network architectures. Song et al. [33] introduce the concept of distribution calibration and a method based on multi-output Gaussian processes. Finally, Jabbari et al. [34] use a shallow neural network to perform calibration in the discovery of causal structure from observational data.

1.3 Societal Aspects, Ethics and Sustainability

The impact of artificial intelligence and machine learning methods on society has been substantial in recent years and this trend is likely to continue. Entire industries such as production, transportation, media and entertainment, medicine and others were revolutionised. For example, machine learning methods such as recommender systems drive consumption, computer vision techniques perform quality control by classification and advertisements are targeted based on individual traits and inter- ests. This rapid shift has had and will have noticeable economic impact, in particular on the job market. Jobs such as accounting, translation or operation of vehicles are

likely to be replaced by automated systems in the future [35]. Widespread use of artificial intelligence also raises many questions regarding ethics, privacy, fairness and environmental impact.

One area which has been impacted heavily by automated statistical analysis is pri- vacy. It is routine business practice of social media companies to use personalised advertising as the main stream of revenue. This relies on building a statistical model of consumer behaviour, based on their interaction data with the specific site. It is very important to protect an individual’s right to privacy, in particular since many users of such a website are not aware of how their data is being used. These changes require careful analysis and possible regulatory action to protect the consumer [36]. One area where privacy is particularly crucial is facial recognition. Such technologies can be easily misused by organisations or governments to control and monitor.

The routine reliance on data in order to make decisions and the apparent objectiv- ity of statistical models can introduce unwanted bias. Fairness, the concept that subgroups of a population are treated equally in a model should be considered in or- der to avoid discrimination. There are many examples of systems like credit scores and crime risk being heavily skewed towards economically disadvantaged popula- tion groups or minorities [37], job platforms ranking people based on qualifications have been found to disproportionally undervalue women [38] and facial recognition software, trained predominantly on light skinned faces, fails when presented with a human face with darker complexion [39]. These examples underline the challenges when relying on automated systems learning from data and their ethical impact.

Computing also has a considerable environmental impact [40]. Many components of modern computers use rare materials which are extracted and manufactured under dangerous conditions, often in economically disadvantaged nations with low wage levels. Further computing in general and training large scale machine learning mod- els in particular has significant energy cost (e.g. Google Deepmind’s AlphaGo [41]). This raises questions of sustainability and the created societal value of a certain application with regard to its power usage.

This work specifically touches on many of the general aspects mentioned above. All benchmark data sets for our application are from computer vision, one specifically for autonomous driving. It is conceivable that our method could be used in facial recognition software at some point in the future. Further, our approach which we will outline later in this thesis is not robust against biased data and thus its use may raise questions of fairness. The main societal relevance of our thesis is in improv- ing classification systems. As mentioned above automated statistical classification is ubiquitous in modern society. By improving uncertainty representation in such methods we aim to make automated systems safer, easier to use and interpret and faster to improve. This work has a theoretical and research focus and is thus targeted towards the research community.

1.4 Organisation

We begin by introducing different measures of uncertainty representation and argue for their importance in active learning applications. We then motivate the problem of calibration of classification models and introduce existing binary and multi-class calibration methods. Next, we study the theoretical relationship between active learning and calibration and prove a theorem connecting over- and underconfidence and calibration.

Next, we outline a novel multi-class and model-agnostic approach to calibration, based on Gaussian processes, which have a number of desirable properties making them suitable as a calibration tool. This approach is non-parametric, can take prior knowledge into account and provides calibration uncertainty.

In the experimental section of this work, we demonstrate that popular classification models in computer vision and robotics are often not calibrated. We empirically com- pare our proposed approach to calibration versus state-of-the-art calibration meth- ods on a range of computer vision benchmark data sets and classification models. Our method exemplifies universally strong performance across different classifiers and data sets in contrast to existing classifier calibration techniques. Finally, we conclude this work with an empirical study of the effect of calibration on querying efficiency in active learning.

Chapter 2

Background

Suppose we are trying to learn the relationship between a set of inputs x and outputs y with the goal of predicting the output of unseen inputs. For example, we might be interested in predicting the classes of objects visible in an image in order to decide whether a robot can interact with them safely. If y takes on a discrete set of values or classes, we call this problem classification. This problem falls under the broader category of supervised learning, meaning we have access to a set of training data 𝒟 = {(x_n, y_n)}_{n=1}^N of examples of the relationship between inputs and outputs. More rigorously, we can formulate this problem as a form of function approximation, where inputs and outputs come from an underlying distribution which we are trying to uncover. Out of the many introductory texts on supervised learning and classification which exist, we relied mostly on [42–44] for this introduction. Taking a probabilistic view, we define the problem formally below.

2.1 Classification

Let 𝒳 be a vector space and 𝒴 a set of finite cardinality K = |𝒴|. Further, let (Ω, ℱ, P) be a probability space and X : Ω → 𝒳, Y : Ω → 𝒴 random variables on said space. We assume we have access to a training data set 𝒟 of N independent and identically distributed samples from (X, Y). The relationship between X and Y is fully determined by their joint density function f_{X,Y} : 𝒳 × 𝒴 → ℝ. Modeling the relationship between X and Y comes down to approximating this joint density function. We call a function f : 𝒳 × 𝒴 → ℝ a classifier or model and ŷ = argmax_{y ∈ 𝒴} f(x, y) for some x ∈ 𝒳 its class prediction. We will abuse notation and sometimes use f : 𝒳 → ℝ^K with output z = f(x), prediction ŷ = argmax_i(z_i) and associated confidence score ẑ = max_i(z_i) instead. Modeling the relationship defined by f_{X,Y} can be approached in two ways, according to Ng and Jordan [45]. One can either take a generative approach and model the joint distribution p(x, y), or a discriminative approach and model the posterior p(y | x) directly, i.e. learn a mapping from inputs x to outputs y. These two approaches are illustrated in Figure 2.1.

In order to decide between classifiers, a loss function 𝓛(f, 𝒟) is used.

(a) Discriminative model (b) Generative model

Figure 2.1: Illustration of the two approaches to modelling in classification. One can take one of two approaches when trying to model the latent relationship between inputs and outputs in the training data. Either one takes a discriminative approach, modelling the posterior f_{Y|X}(y | x) directly, or a generative approach, modelling the joint distribution f_{X,Y}(x, y). Reprinted from [46].

Table 2.1: Examples of common loss functions used in classification. Loss functions allow the comparison of different classification models by scoring them using samples from (X,Y ). We list a few common loss functions for a single input - output pair (x, y).

Loss function Definition

0/1 loss: 1_{y ≠ ŷ}
Squared loss: (1 − y f(x))²
Exponential loss: exp(−α y f(x))
Hinge loss: max(0, 1 − y f(x))
Log loss (cross entropy): −((y+1)/2) log(f(x)) − (1 − (y+1)/2) log(1 − f(x))

A loss function scores a classifier by comparing the predictions and associated confidence scores of the classifier on a set of inputs with the true outputs or labels. Some common examples of loss functions are presented in Table 2.1. As the space of all possible functions is too vast to be useful, one restricts the class of functions which model the relationship defined by f_{X,Y}. This modelling task is where knowledge about the application the data is coming from is essential. For example, sometimes a mechanistic understanding of a physical system is available, or some rules about what type of data is classified into which class are known a priori. During training one uses the training data to compute the loss for a set of models from the chosen class in order to choose the best fitting one.
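To make the entries of Table 2.1 concrete, the following minimal sketch evaluates each listed loss for a single example with label y ∈ {−1, +1} and real-valued score f(x); the helper name and the sigmoid used to turn the score into a probability for the log loss are illustrative assumptions, not part of the text.

```python
import numpy as np

def classification_losses(y, fx, alpha=1.0):
    """Losses from Table 2.1 for one example with label y in {-1, +1} and
    real-valued classifier score fx. For the log loss the score is squashed
    through a sigmoid to obtain a probability (an assumption made here)."""
    y01 = (y + 1) / 2                      # map {-1, +1} -> {0, 1}
    p = 1.0 / (1.0 + np.exp(-fx))          # probability of the positive class
    return {
        "zero_one": float(np.sign(fx) != y),
        "squared": (1.0 - y * fx) ** 2,
        "exponential": np.exp(-alpha * y * fx),
        "hinge": max(0.0, 1.0 - y * fx),
        "log": -y01 * np.log(p) - (1.0 - y01) * np.log(1.0 - p),
    }

print(classification_losses(y=1, fx=0.8))
```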

Often one also introduces a regularisation term 𝓡(f) which penalises functions from the chosen class in different ways. This can be useful to combat overfitting, the phenomenon of modelling the training data too well, resulting in a lack of generalisation, i.e. small loss on the training data but large loss on independent data sampled from (X, Y).

2.2 Uncertainty Representation

In this work we are particularly interested in uncertainty representation, i.e. how well a classifier is aware of what it does not know. This is important because in ap- plications it is often not sufficient to have high accuracy on a classification task. For example consider an autonomous robot which is deployed in a novel environment, such as a remote planet or a city destroyed by an earthquake. During navigation this robot has to make choices continuously of what type of objects are in its path and whether it is safe to interact with them. Some of these classifications can be safety-critical for example whether to drive over a ledge or to interact with a po- tential disaster victim. Proper uncertainty about its prediction allows the robot to refrain from making potentially dangerous decisions. For example when the robot has high uncertainty on whether it is safe to drive to a certain location, it can first ask for feedback on its camera image from a human supervisor on earth, before tak- ing action. It could also record high uncertainty predictions in order to obtain the true classification at a later date from an expert in order to improve its predictions in the future. This type of learning strategy is called active learning by uncertainty sampling.

We are interested in correctly modelling predictive uncertainty, the posterior probability of the class prediction f_{Y|X}(y | x), or, viewed from the classifier perspective, the uncertainty a classifier has about its prediction. One can further split the predictive uncertainty into at least two types [16]: epistemic uncertainty or model uncertainty, which is caused by uncertainty about the correct parameters and structure of the underlying model, and aleatoric uncertainty, which is caused by inherent noise in the training data.

2.3 Measures of Uncertainty Representation

So far we have not defined what it means to have good uncertainty representation. This is due to the fact that this comes down to the modelling choices made and can vary for different applications. One usually tries to measure how close the model f is to the true data distribution fX,Y . This can be done in different ways. One can define a metric on a space of probability measures (e.g. Wasserstein metric), one can measure the distance between probability distributions (e.g. KL divergence) if the true data distribution is known, or one can represent a probability distribution as an element of a reproducing kernel Hilbert space (e.g. maximum mean discrepancy). Typically these distances rely on samples from one or both distributions as fX,Y is usually unknown. Focussing on uncertainty representation means putting value on closeness with respect to the chosen statistical distance and not only on accuracy.

In the following, we will introduce a set of measures used to quantify uncertainty representation. We are particularly interested in the concept of calibration, the notion that a classifier’s uncertainty in its prediction matches its empirical accuracy.

In practice, we evaluate these measures on a test data set, which we assume to consist of i.i.d. samples from the ground truth distribution.

2.3.1 Negative Log-Likelihood and Cross Entropy We begin by defining cross entropy, an information-theoretic quantity which can be used to compare probability distributions and is commonly used as a loss function, for example in deep learning.

Definition 2.1 Let f(y | x) be a model approximating the conditional distribution of the data. We define the cross entropy as

H(f_{Y|X}, f) = E_{X,Y}[−log f].   (2.1)

Remark 2.2 The following identity for the cross entropy holds

E_{X,Y}[−log f] = H(f_{Y|X}) + KL[f_{Y|X} ∥ f],
where H is the information-theoretic entropy and KL[f_{Y|X} ∥ f] the Kullback-Leibler divergence. Since KL[· ∥ ·] ≥ 0, we see that the ground truth distribution minimises the cross entropy.

Cross-entropy in the context of machine learning is closely related to maximum likelihood estimation. When fitting a family of parametric statistical models, maximum likelihood estimation (MLE) is used to identify the value of the parameter for which the probability of the observed sample is maximised. Due to the monotonicity of the natural logarithm we can equivalently minimise the negative log-likelihood. In fact, when evaluating the negative log-likelihood on an i.i.d. sample from (X, Y), it is given by
𝓛 = −∑_{i=1}^{n} log f(y_i | x_i),
which can be seen as a Monte-Carlo estimator of (2.1).
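As an illustration of the negative log-likelihood as a Monte-Carlo estimate of (2.1), the following minimal sketch (function name and NumPy conventions are our own choices, and the sum is normalised by the sample size) averages the negative log-probability assigned to the observed labels:

```python
import numpy as np

def negative_log_likelihood(probs, labels, eps=1e-12):
    """Average negative log-probability of the observed labels, a Monte-Carlo
    estimate of the cross entropy (2.1), here normalised by the sample size.

    probs:  (N, K) array of predicted class probabilities
    labels: (N,)   integer class labels in {0, ..., K-1}
    """
    p_true = probs[np.arange(len(labels)), labels]      # probability of the true class
    return -np.mean(np.log(np.clip(p_true, eps, 1.0)))  # clip to avoid log(0)
```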

2.3.2 Calibration and Sharpness Originally introduced in the context of statistical forecasting [6, 7] calibration de- scribes how well the confidence of a classifier in its prediction matches the empirical frequency of its prediction being correct.

Definition 2.3 Let f be a model, yˆ its class prediction and zˆ the associated confidence score. A classifier is called calibrated if

P(ŷ = y | ẑ = z) = z   for all z ∈ [0, 1],

or equivalently
E[1_{ŷ=y} | ẑ] = ẑ.   (2.2)
In order to measure the degree of calibration, we define the expected calibration error [8] for 1 ≤ p < ∞ by
ECE_p = E[ |ẑ − E[1_{ŷ=y} | ẑ]|^p ]^{1/p}   (2.3)
and the maximum calibration error [8] by

ECE_∞ = max_{z ∈ [0,1]} |z − E[1_{ŷ=y} | ẑ = z]|.   (2.4)

In practice, we estimate the calibration error as suggested by Naeini et al. [8] by introducing a fixed binning θ_0 < θ_1 < ⋯ < θ_B such that
ECE_p ≈ ( (1/B) ∑_{b=1}^{B} |z̄_b − acc_b|^p )^{1/p},
where z̄_b = (1/N_b) ∑_{θ_{b−1} < ẑ_n ≤ θ_b} ẑ_n is the average confidence, acc_b the empirical accuracy and N_b the number of samples in bin b.
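A minimal sketch of this binned estimator, assuming equal-width bins and NumPy arrays of predicted-class confidences and correctness indicators (function and argument names are illustrative):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15, p=1):
    """Binned estimate of ECE_p: compare average confidence and empirical
    accuracy of the predictions falling into each confidence bin.

    confidences: (N,) confidence of the predicted class, in [0, 1]
    correct:     (N,) boolean array, True where the prediction was correct
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    gaps = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if np.any(in_bin):
            gaps.append(abs(confidences[in_bin].mean() - correct[in_bin].mean()) ** p)
    # uniform average over bins as in the text; weighting bins by their
    # sample count is a common alternative
    return (np.sum(gaps) / n_bins) ** (1.0 / p)
```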

Calibration on its own is not a sufficient criterion for confidence estimates of a classifier to be meaningful. In a two-class problem with equal prior probability for both classes, a classifier which chooses either of the two classes at random with probability 0.5 is calibrated. However, it is immediately apparent that such a classifier is of little use. Intuitively, for a classifier's confidence estimates to be meaningful, they need to be sufficiently close to 0 or 1, at least some of the time.

Definition 2.4 We define the sharpness of f by

sharp(f) = (4k² / (k − 1)²) Var[ẑ]  ∈ [0, 1].   (2.5)

The sharpness represents the scaled variance of the confidence in the predicted class. It is scaled such that it is always in the unit interval no matter the number of classes of the problem. Sharpness has been defined in various ways in previous works, re- flecting the fact that there are a multitude of measures of concentration for a random variable. Variations of this notion are also known as refinement [7, 47, 48].

11 Calibration can be visualised by plotting uncertainty estimates versus the empirical accuracy on a test set. This type of plot is called a reliability diagram [7, 13]. A calibrated classifier will display a perfect diagonal. An illustrative example is given in Figure 2.2. The example also shows confidence histograms illustrating the concept of sharpness.


Panels: (a) miscalibrated and not sharp, (b) calibrated and not sharp, (c) miscalibrated and sharp, (d) calibrated and sharp.

Figure 2.2: Illustration of calibration and sharpness. Examples of reliability diagrams and confidence histograms for a miscalibrated and not sharp classifier, a calibrated, but not sharp classifier, a classifier which is both miscalibrated and sharp and finally a calibrated and sharp classifier. The last classifier is generally the most desirable out of the four shown as its confidence estimates match its empirical accuracy and they are sufficiently close to 0 and 1 to be informative.

12 2.3.3 Over- and Underconfidence In the context of active learning, where only the most informative data is queried for labels, an accurate representation of uncertainty is important in order for the classifier to obtain informative samples. Informative samples are those that improve the classifier’s accuracy on future data. In particular, obtaining more samples in regions of the input space, which are misclassified with the current model and less for regions which the classifier already predicts correctly as these are uninformative is desirable. Over- and underconfidence, introduced in [49] capture this notion.

Definition 2.5 Let zˆ [0, 1] be the confidence score output by a model f at x. We define the ∈ overconfidence of f as the expected confidence on the misclassified samples

o(f) = E[ẑ | ŷ ≠ y]
and analogously underconfidence as the average uncertainty on the correctly classified samples
u(f) = E[1 − ẑ | ŷ = y].

Overconfidence measures the average confidence a classifier has in the samples it classifies wrongly. When using an uncertainty sampling based strategy in active learning, this means that wrongly classified samples are rarely requested, as the classifier has high confidence in its prediction for them. The flip-side of this is underconfidence. It describes the average uncertainty of the classifier about its correctly classified samples. If a classifier has high underconfidence it queries many samples which it already classifies correctly, i.e. predominantly uninformative samples. Ideally both over- and underconfidence are low. It is also important to note that both quantities are by definition independent of accuracy.
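The following sketch computes empirical over- and underconfidence (and, for completeness, sharpness) from held-out predictions; the function names and array conventions are assumptions made for illustration:

```python
import numpy as np

def over_under_confidence(confidences, correct):
    """Empirical o(f) = E[z | wrong] and u(f) = E[1 - z | correct] (Def. 2.5).

    confidences: (N,) confidence of the predicted class
    correct:     (N,) boolean array, True where the prediction was correct
    """
    o = confidences[~correct].mean() if (~correct).any() else 0.0
    u = (1.0 - confidences[correct]).mean() if correct.any() else 0.0
    return o, u

def sharpness(confidences, n_classes):
    """Scaled variance of the predicted-class confidence (Definition 2.4)."""
    k = n_classes
    return 4.0 * k ** 2 / (k - 1) ** 2 * np.var(confidences)
```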

In [49] these notions of introspective capability of a classifier were used to improve uncertainty sampling in the context of active learning. The authors introduce a variant of a gradient boosting algorithm which puts more weight on samples that were wrongly classified with high confidence. Figure 2.3 shows how this strategy improved accuracy compared to regular gradient boosting and lowered the number of queried labels.

2.4 Calibration Methods

Calibration methods were originally developed to provide probabilistic output for discriminative models such as support vector machines (SVMs). They were later adapted to be used as post-hoc methods to improve uncertainty representation by lowering calibration error. They work by using a small subset of the training data and subsequently adjusting the confidence output of the underlying model. A dia- grammatic explanation of calibration is shown in Figure 2.4. Figure 2.5 illustrates the effect of calibration in a binary classification problem on prediction uncertainty.

(a) Learning curves of passive and active gradient and confidence boosting on the PenDigits data set. (b) Number of new label queries per epoch on different data sets.

Figure 2.3: Effect of confidence boosting in active learning. Comparison of gradient and confidence boosting on various data sets with respect to accuracy and querying efficiency. Panel (a) shows learning curves for active and passive gradient and confidence boosting on the PenDigits data set. Gradient boosting displays better accuracy for less queried labels. Panel (b) compares the number of queries per learning epoch of gradient versus confidence boosting on different data sets. Figure reprinted from [49].

Calibration has seen a resurgence of interest in recent years, partly due to the pop- ularity of large neural network architectures and their lack of calibration [5] even when combined with principled Bayesian approaches [50]. An example of this is shown in Figure 2.6. In this section, we introduce the most prevalent methods.

2.4.1 Binary Calibration We begin by introducing common binary calibration methods. We denote a binary calibration method by v : ℝ → ℝ. It transforms the confidence for the positive class z_1 and then computes the calibrated confidence for the negative class by 1 − v(z_1).

Originally introduced in the context of SVMs, Platt Scaling [25, 26] is a parametric method designed to output calibrated posterior probabilities for non-probabilistic binary classifiers. It works by fitting a logistic regression model to the model output using the negative log-likelihood as a loss function. Let z_1 ∈ ℝ be the output of a model. The probabilistic score computed via Platt scaling is defined as
v(z_1) = 1 / (1 + exp(−a z_1 − b)),
where a, b ∈ ℝ are the parameters determined in the fitting procedure. The parametric assumption made corresponds to the case where the scores of each class are normally distributed with identical variance across classes [51].
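A possible implementation sketch of Platt scaling uses an essentially unregularised logistic regression on the one-dimensional scores; note that Platt's original procedure additionally smooths the binary targets, which is omitted here, and the class name is our own:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class PlattScaling:
    """One-dimensional logistic regression v(z1) = 1 / (1 + exp(-a*z1 - b)),
    fit on held-out scores; C is set large to make it nearly unregularised."""

    def fit(self, scores, labels):
        self.lr = LogisticRegression(C=1e10)
        self.lr.fit(np.asarray(scores).reshape(-1, 1), labels)
        return self

    def predict(self, scores):
        # calibrated probability of the positive class
        return self.lr.predict_proba(np.asarray(scores).reshape(-1, 1))[:, 1]
```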

Isotonic Regression [27] is a non-parametric approach to mapping non-probabilistic classifier scores to probabilities. It relaxes the assumption


Figure 2.4: Diagram illustrating probability calibration. Calibration methods act post-hoc on the output of a classifier in order to improve its uncertainty repre- sentation. First, a small subset of the training data is split off and the classification model is trained on the remaining data. Then the split off data is classified by the model and is used along with the true labels to train the calibration method. Finally, when new data comes in, it is first classified by the underlying model and then the calibration method adjusts the resulting confidence output.

Figure 2.5: Illustration of the effect of probability calibration. Uncertainty contour plot of a synthetic binary classification problem in two-dimensional feature space. Red indicates probability of class 1 and blue indicates probability of class 0. The first panel (classification uncertainty) shows an uncalibrated classifier. The second panel (calibrated uncertainty) shows the uncertainty post-calibration. The underlying classifier is underconfident in the border region between the two classes, which is rectified by the calibration method.

15 of a sigmoidal relationship between the model scores and empirical frequencies made by Platt scaling to an isotonic (non-decreasing) one. The following model

v(z_1) = m(z_1) + ε
is assumed for the probabilistic scores. The isotonic function m is found by minimising a squared loss function. In practice, piece-wise constant solutions can be found by using the pair-adjacent violators (PAV) algorithm [52].
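A sketch using scikit-learn's isotonic regression on a synthetic, overconfident calibration set (the data here is made up purely for illustration):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
scores = rng.uniform(size=500)          # stand-in held-out classifier scores
labels = rng.binomial(1, scores ** 2)   # synthetic, overconfident ground truth

# Non-decreasing map from scores to probabilities; scikit-learn's solver is
# based on the pair-adjacent violators algorithm.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(scores, labels)
print(iso.predict(np.array([0.2, 0.5, 0.9])))   # calibrated probabilities
```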

Beta Calibration Specifically designed for probabilistic classifiers with output range [0, 1], Beta calibration [9, 51] is a recently introduced parametric approach to calibration. Here, a calibration map family is defined based on the likelihood ratio between two Beta distributions. This parametric assumption is appropriate if the marginal class distributions follow Beta distributions. The model is given by
v(z_1) = 1 / (1 + exp(−c) (1 − z_1)^b / z_1^a),
where a, b, c ∈ ℝ are parameters. One theoretical advantage of Beta calibration over Platt scaling is that it defines a richer family of calibration maps. For example, the identity map emerges for a = 1, b = 1 and c = 0, which is not part of the sigmoid family. When applying Platt scaling to an already calibrated classifier, the result will therefore be miscalibrated.
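Since the beta calibration map can be rewritten as a logistic function of (ln z_1, −ln(1 − z_1)), its parameters can be estimated with an off-the-shelf logistic regression. The following sketch assumes this reformulation and uses our own function names:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_beta_calibration(z1, labels, eps=1e-12):
    """Fit a, b, c of the beta calibration map 1/(1 + exp(-c)*(1-z)^b / z^a)
    via logistic regression on the features (ln z, -ln(1-z)).
    (The original method additionally constrains a, b >= 0; omitted here.)"""
    z1 = np.clip(z1, eps, 1 - eps)
    features = np.column_stack([np.log(z1), -np.log(1.0 - z1)])
    lr = LogisticRegression(C=1e10).fit(features, labels)
    a, b = lr.coef_[0]
    c = lr.intercept_[0]
    return a, b, c

def beta_calibrate(z1, a, b, c, eps=1e-12):
    z1 = np.clip(z1, eps, 1 - eps)
    return 1.0 / (1.0 + np.exp(-c) * (1.0 - z1) ** b / z1 ** a)
```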

Histogram Binning Histogram binning [12] is a straightforward approach to minimising the calibration error. The classifier output range is binned into a fixed number of bins with thresholds
0 = θ_1 < θ_2 < ⋯ < θ_{B+1} = 1.
Then the empirical accuracy in each bin is computed on the calibration data set, giving values a_1, …, a_B. The calibration map is then defined by the piecewise constant map

v(z_1) = a_j  for θ_j < z_1 ≤ θ_{j+1}.
The bin edges can be determined for example by equal width or equal frequency.
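A sketch of histogram binning with equal-width bins (equal-frequency bins would only change how the edges are chosen); function names are illustrative:

```python
import numpy as np

def fit_histogram_binning(z1, labels, n_bins=10):
    """Calibrated probability per bin is the empirical accuracy of that bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    acc = np.zeros(n_bins)
    for j in range(n_bins):
        in_bin = (z1 > edges[j]) & (z1 <= edges[j + 1])
        acc[j] = labels[in_bin].mean() if np.any(in_bin) else edges[j:j + 2].mean()
    return edges, acc

def apply_histogram_binning(z1, edges, acc):
    # np.digitize with the interior edges maps each score to its bin index
    idx = np.clip(np.digitize(z1, edges[1:-1], right=True), 0, len(acc) - 1)
    return acc[idx]
```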

Bayesian Binning into Quantiles BBQ [8] extends the histogram binning ap- proach in a Bayesian fashion. Here, multiple equal-frequency binning models are constructed and scored. A binning model M is scored as follows

Score = P(M) P(𝒟 | M).
The marginal likelihood P(𝒟 | M) can be computed in closed form under the following assumptions: all samples are i.i.d. and each bin's class distribution is modelled as a binomial random variable. We assume a Beta(α_b, β_b) prior on the parameter of the binomial distribution in bin b. Then the marginal likelihood is given by

P(𝒟 | M) = ∏_{b=1}^{B} [ Γ(N′/B) / Γ(N_b + N′/B) ] · [ Γ(m_b + α_b) / Γ(α_b) ] · [ Γ(n_b + β_b) / Γ(β_b) ],

where N′ is the equivalent sample size controlling the influence of the prior, N_b is the total number of samples in bin b, and n_b and m_b are the number of class 0 and class 1 instances in bin b respectively. The parameters of the Beta priors are set to α_b = p_b N′/B and β_b = (1 − p_b) N′/B, where p_b is the midpoint of bin b. The prior over binning models P(M) is chosen as uniform. The above score is then used to perform model averaging across all possible binning models in a given size range.

2.4.2 Multi-class Calibration Up until recently no true multi-class calibration methods existed. Calibration was performed by extending binary calibration methods in a one-vs-all fashion. We K K denote a multi-class calibration method by v : R R . It is applied directly to the output confidence vector z of a multi-class classifier.→

Extension of Binary Models Multi-class calibration can be done by defining a set of binary calibration problems using a one-versus-all approach. Zadrozny and Elkan [27] propose to form K binary classification problems by treating all other classes {C_i}_{i≠j} as one class. The K trained classifiers are then calibrated using some calibration method for binary classification. For a new data point, the output vector formed by the normalised predictions of all K calibrated classifiers is then used as a confidence estimate. As most modern classifiers are inherently multi-class, this approach is not feasible anymore. We instead use a one-vs-all approach for the output z of the multi-class classifier, train a calibration method on each split and average their predictions.

Temperature Scaling Introduced as a calibration method for neural networks, temperature scaling [5] is a multi-class extension of Platt scaling. Guo et al. [5] showed that modern neural network architectures are miscalibrated (see also Figure 2.6) and benefit from a scaling procedure. For an output logit vector z of a neural network and a temperature parameter T > 0, the calibrated confidence is defined as
v(z) = σ(z / T),  i.e.  v(z)_i = exp(z_i / T) / ∑_{k=1}^{K} exp(z_k / T),   (2.6)
where all functions are applied component-wise. The parameter T is found by optimising the negative log-likelihood on a validation data set. It is important to note that the predicted class does not change when applying this transformation, since dividing the logits by T preserves their ordering. This ensures that the accuracy of the model is the same after scaling.
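A sketch of temperature scaling that fits T by minimising the validation NLL with SciPy; optimising log T to keep T positive is a convenience of this sketch, not prescribed by [5]:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)          # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels):
    """Find T > 0 minimising the NLL of the scaled softmax output on a
    validation set (logits: (N, K), labels: (N,) integer classes)."""
    def nll(log_T):
        probs = softmax(logits, T=np.exp(log_T))
        return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
    res = minimize_scalar(nll, bounds=(-3, 3), method="bounded")  # T in ~[0.05, 20]
    return float(np.exp(res.x))
```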

Matrix Scaling [5] Similarly, a more general extension of Platt scaling can be defined by using a linear transformation of the logits,
v(z) = σ(Az + b),  i.e.  v(z)_j = exp((Az + b)_j) / ∑_{k=1}^{K} exp((Az + b)_k),
for a matrix A ∈ ℝ^{K×K} and a vector b ∈ ℝ^K. Again these parameters are optimized with respect to the negative log-likelihood. However, this variant has proven ineffective [5].



Figure 2.6: Modern NN architectures are miscalibrated. Confidence histograms (top) and reliability diagrams (bottom) of a simple and a modern neural network architecture's confidence estimates on the CIFAR-100 data set [53]. The modern neural network displays lower error but is more overconfident and thus less calibrated. Graphs reprinted from [5].

2.5 Relations between Measures of Uncertainty Representation

While the importance of different measures of uncertainty quantification is application specific, many of them are inherently linked. Here we will study more closely how calibration, over- and underconfidence and sharpness are linked.

2.5.1 Calibration, Over- and Underconfidence

Since over- and underconfidence are properties independent of accuracy, they seem at first glance to be independent of calibration as well, a property defined through classification accuracy. But it turns out there is quite an important connection: the closer a classifier is to being calibrated, the more closely the ratio between its over- and underconfidence is determined by the odds of the classifier making a correct prediction.
Theorem 2.6 Let $1 \leq p < q \leq \infty$, then the following relationship between over-, underconfidence and the expected calibration error holds:

$$\big|o(f)\,\mathbb{P}(\hat{y} \neq y) - u(f)\,\mathbb{P}(\hat{y} = y)\big| \;\leq\; \mathrm{ECE}_p \;\leq\; \mathrm{ECE}_q. \qquad (2.7)$$

Proof. By linearity and the law of total expectation it holds that

$$\mathbb{E}[\hat{z}] = \mathbb{E}\big[\hat{z} + \mathbb{E}[\mathbb{1}_{\hat{y}=y} \mid \hat{z}] - \mathbb{E}[\mathbb{1}_{\hat{y}=y} \mid \hat{z}]\big] = \mathbb{E}\big[\hat{z} - \mathbb{E}[\mathbb{1}_{\hat{y}=y} \mid \hat{z}]\big] + \mathbb{P}(\hat{y} = y).$$

Conversely, by decomposing the average confidence we have

$$\begin{aligned}
\mathbb{E}[\hat{z}] &= \mathbb{E}[\hat{z} \mid \hat{y} \neq y]\,\mathbb{P}(\hat{y} \neq y) + \mathbb{E}[\hat{z} \mid \hat{y} = y]\,\mathbb{P}(\hat{y} = y) \\
&= \mathbb{E}[\hat{z} \mid \hat{y} \neq y]\,\mathbb{P}(\hat{y} \neq y) + \big(1 - \mathbb{E}[1 - \hat{z} \mid \hat{y} = y]\big)\,\mathbb{P}(\hat{y} = y) \\
&= o(f)\,\mathbb{P}(\hat{y} \neq y) + \big(1 - u(f)\big)\,\mathbb{P}(\hat{y} = y).
\end{aligned}$$

Combining the above we obtain

$$\mathbb{E}\big[\hat{z} - \mathbb{E}[\mathbb{1}_{\hat{y}=y} \mid \hat{z}]\big] = o(f)\,\mathbb{P}(\hat{y} \neq y) - u(f)\,\mathbb{P}(\hat{y} = y).$$

Now, since $f(x) = |x|^p$ is convex for $1 \leq p < \infty$, we have by Jensen's inequality

$$\big|\mathbb{E}[\hat{z} - \mathbb{E}[\mathbb{1}_{\hat{y}=y} \mid \hat{z}]]\big|^p \leq \mathbb{E}\big[|\hat{z} - \mathbb{E}[\mathbb{1}_{\hat{y}=y} \mid \hat{z}]|^p\big]$$

and finally by Hölder's inequality with $1 \leq p < q \leq \infty$ it follows that

$$\mathrm{ECE}_p = \mathbb{E}\big[|\hat{z} - \mathbb{E}[\mathbb{1}_{\hat{y}=y} \mid \hat{z}]|^p\big]^{\frac{1}{p}} \leq \mathbb{E}\big[|\hat{z} - \mathbb{E}[\mathbb{1}_{\hat{y}=y} \mid \hat{z}]|^q\big]^{\frac{1}{q}} = \mathrm{ECE}_q,$$

which concludes the proof.

While a similar result for $\mathrm{ECE}_1$ was shown in the context of fairness in [30] for different population groups in $\mathcal{X}$, the sharpness of the bound and the generalisation to $1 \leq p < q \leq \infty$ are original results to the best of our knowledge.

Corollary 2.7 Assume $f$ is calibrated, then

$$o(f)\,\mathbb{P}(\hat{y} \neq y) = u(f)\,\mathbb{P}(\hat{y} = y), \qquad (2.8)$$

i.e. the odds of making a correct prediction determine the ratio between over- and underconfidence. Assuming $\mathbb{P}(\hat{y} \neq y) \notin \{0, 1\}$ we obtain

$$\frac{o(f)}{u(f)} = \frac{\mathbb{P}(\hat{y} = y)}{\mathbb{P}(\hat{y} \neq y)}.$$

Proof. Since $f$ is calibrated we have by definition

$$\mathrm{ECE}_p = \mathbb{E}\big[|\hat{z} - \mathbb{E}[\mathbb{1}_{\hat{y}=y} \mid \hat{z}]|^p\big]^{\frac{1}{p}} = 0,$$

i.e. the calibration gap is zero. By Theorem 2.6 we have

$$o(f)\,\mathbb{P}(\hat{y} \neq y) - u(f)\,\mathbb{P}(\hat{y} = y) = 0.$$

Rearranging terms concludes the proof.

The relationship described in Corollary 2.7 was previously established in the fairness literature by [30, 31]. The authors show that the above holds for each population group under separate calibration of each group.
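The identity in Corollary 2.7 can be checked numerically. The following is a minimal Python sketch that simulates a calibrated binary classifier and compares the two sides of eq. (2.8); the definitions of over- and underconfidence used here, $o(f) = \mathbb{E}[\hat{z} \mid \hat{y} \neq y]$ and $u(f) = \mathbb{E}[1 - \hat{z} \mid \hat{y} = y]$, are assumed to match those introduced earlier in the thesis.

import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# A synthetic *calibrated* binary classifier: the confidence z_hat of the predicted
# class lies in [0.5, 1] and the prediction is correct with probability z_hat.
z_hat = rng.uniform(0.5, 1.0, size=n)
correct = rng.random(n) < z_hat

# Assumed definitions: o(f) = E[z_hat | wrong], u(f) = E[1 - z_hat | correct]
o = z_hat[~correct].mean()
u = (1.0 - z_hat[correct]).mean()
p_wrong = (~correct).mean()
p_correct = correct.mean()

# Corollary 2.7: o(f) P(y_hat != y) equals u(f) P(y_hat == y) under calibration
print(o * p_wrong, u * p_correct)   # both approximately 0.167 for this setup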

2.5.2 Sharpness, Over- and Underconfidence

While the previous subsection established a relationship between calibration and over- and underconfidence, it does not yet provide us with a way to minimise them. A fixed ratio of the two can still imply that both are high. Here we will establish how sharpness influences over- and underconfidence. Intuitively, a sharp classifier makes either very confident or very uncertain predictions. If the classifier is calibrated, these have high and low accuracy respectively. The definition of over- and underconfidence then suggests that increased sharpness under calibration should reduce both. This subsection formalises this heuristic argument.

Proposition 2.8 The following relationship between sharpness and over- / underconfidence of $f$ holds:

$$\mathrm{sharp}(f) = \frac{4k^2}{(k-1)^2}\Big(\mathbb{P}(\hat{y} \neq y)\big(\mathrm{Var}[\hat{z} \mid \hat{y} \neq y] + (o(f) - \mathbb{E}[\hat{z}])^2\big) + \mathbb{P}(\hat{y} = y)\big(\mathrm{Var}[1 - \hat{z} \mid \hat{y} = y] + (u(f) - \mathbb{E}[1 - \hat{z}])^2\big)\Big). \qquad (2.9)$$

Proof. Using the law of total variance, we obtain

$$\begin{aligned}
\mathrm{Var}[\hat{z}] &= \mathbb{E}\big[\mathrm{Var}[\hat{z} \mid \mathbb{1}_{\hat{y}=y}]\big] + \mathrm{Var}\big[\mathbb{E}[\hat{z} \mid \mathbb{1}_{\hat{y}=y}]\big] \\
&= \mathbb{E}\Big[\mathrm{Var}[\hat{z} \mid \mathbb{1}_{\hat{y}=y}] + \big(\mathbb{E}[\hat{z} \mid \mathbb{1}_{\hat{y}=y}] - \mathbb{E}[\mathbb{E}[\hat{z} \mid \mathbb{1}_{\hat{y}=y}]]\big)^2\Big] \\
&= \mathbb{P}(\hat{y} \neq y)\big(\mathrm{Var}[\hat{z} \mid \hat{y} \neq y] + (\mathbb{E}[\hat{z} \mid \hat{y} \neq y] - \mathbb{E}[\hat{z}])^2\big) + \mathbb{P}(\hat{y} = y)\big(\mathrm{Var}[\hat{z} \mid \hat{y} = y] + (\mathbb{E}[\hat{z} \mid \hat{y} = y] - \mathbb{E}[\hat{z}])^2\big) \\
&= \mathbb{P}(\hat{y} \neq y)\big(\mathrm{Var}[\hat{z} \mid \hat{y} \neq y] + (\mathbb{E}[\hat{z} \mid \hat{y} \neq y] - \mathbb{E}[\hat{z}])^2\big) + \mathbb{P}(\hat{y} = y)\big(\mathrm{Var}[1 - \hat{z} \mid \hat{y} = y] + (\mathbb{E}[\hat{z} \mid \hat{y} = y] - 1 + 1 - \mathbb{E}[\hat{z}])^2\big) \\
&= \mathbb{P}(\hat{y} \neq y)\big(\mathrm{Var}[\hat{z} \mid \hat{y} \neq y] + (o(f) - \mathbb{E}[\hat{z}])^2\big) + \mathbb{P}(\hat{y} = y)\big(\mathrm{Var}[1 - \hat{z} \mid \hat{y} = y] + (u(f) - \mathbb{E}[1 - \hat{z}])^2\big).
\end{aligned}$$

Now the result follows directly from the definition of sharpness.

The combination of Corollary 2.7 and Proposition 2.8 implies that for a calibrated classifier for which one of the regularity conditions

$$o(f) \leq \mathbb{E}[\hat{z}] \qquad \text{or} \qquad u(f) \leq \mathbb{E}[1 - \hat{z}]$$

holds, a sufficient increase in the sharpness of $f$ decreases both over- and underconfidence, as long as the individual variance terms can be controlled. In the rest of this thesis we will focus solely on improving calibration and leave simultaneous calibration and increase in sharpness based on this theoretical result for future work. We hypothesise that the difficulty lies in calibrating and controlling the variance terms simultaneously.

Chapter 3

Gaussian Process Calibration

We outline our non-parametric calibration method in the following sections. Our aim is to develop a calibration algorithm that is inherently multi-class, suitable for arbitrary classifiers, makes no parametric assumption on the shape of the calibration map and can take prior knowledge into account. These desired properties readily lead to our approach using a latent Gaussian process [54]. This has the added benefit that we obtain calibration uncertainty, providing us with information about how much we can trust the calibration map in different regions of its input space.

3.1 Definition

Assume a one-dimensional Gaussian process prior over the latent function $f(z)$, i.e.

$$f \sim \mathcal{GP}\big(\mu(\cdot),\, k(\cdot, \cdot \mid \theta)\big)$$

with mean function $\mu$, kernel $k$ and kernel parameters $\theta$. A common kernel choice, motivated by a smoothness assumption on the calibration map, is the squared exponential kernel with added noise

$$k(z_i, z_j) = \underbrace{\sigma^2 \exp\left(-\frac{(z_i - z_j)^2}{2l^2}\right)}_{\text{squared exponential}} + \underbrace{\delta_{ij}\,\sigma^2_{\text{noise}}}_{\text{Gaussian noise}}. \qquad (3.1)$$
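To make eq. (3.1) concrete, the following is a minimal NumPy sketch of this kernel for scalar inputs. The function and parameter names (variance, lengthscale, noise) are illustrative and are not taken from the pycalib implementation.

import numpy as np

def sq_exp_kernel(z1, z2, variance=1.0, lengthscale=1.0, noise=1e-4):
    """Squared exponential kernel with added white noise, cf. eq. (3.1).

    z1, z2: 1-D arrays of scalar inputs; returns the len(z1) x len(z2) Gram matrix.
    """
    sq_dist = (z1[:, None] - z2[None, :]) ** 2
    K = variance * np.exp(-sq_dist / (2.0 * lengthscale ** 2))
    # The noise term delta_ij * sigma_noise^2 only contributes on identical inputs.
    if z1.shape == z2.shape and np.array_equal(z1, z2):
        K = K + noise * np.eye(len(z1))
    return K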

Further, let the calibrated output be given by the softargmax inverse link function applied to the latent process evaluated at the model output

$$v(z)_j = \sigma(f(z))_j = \frac{\exp(f(z_j))}{\sum_{k=1}^K \exp(f(z_k))}. \qquad (3.2)$$

Note the similarity to multi-class Gaussian process classification, but in contrast we consider one shared latent function applied to each component of $z$ individually instead of $K$ latent functions. We use the categorical likelihood

$$\mathrm{Cat}(y \mid \sigma(f(z))) = \prod_{k=1}^K \sigma(f(z))_k^{[y=k]} \qquad (3.3)$$

to obtain a prior on the class prediction. Making the prior assumption that the given classifier is calibrated and no further calibration is necessary corresponds to either $\mu(z) = \log(z)$ if the inputs are confidence estimates, or to the identity function $\mu(z) = z$ if the inputs are logits. The formulation is inspired by temperature scaling defined in (2.6). We replace the linear map by a Gaussian process to allow for a more flexible calibration map able to incorporate prior knowledge concerning its shape. An example of a latent function for a synthetic data set is shown in Figure 3.1. If the latent function $f$ is monotonically increasing in its domain, the accuracy of the underlying classifier is unchanged.

Figure 3.1: Multi-class calibration using a latent Gaussian process. The top panel shows the latent function of multi-class GP calibration with prior mean $\mu(z) = \log(z)$ on a synthetic calibration data set with four classes and 100 calibration samples. Shading represents a 95% credibility interval. The bottom panel shows input confidence from the calibration data and its labels. One can see that the calibration uncertainty is higher in regions with less input data.
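As a small illustration of the prior mean choice discussed above, the following Python sketch evaluates the calibration map of eq. (3.2) for a given latent function; the function name is illustrative. With $\mu(z) = \log(z)$ the softargmax recovers the input probabilities, i.e. the prior corresponds to an already calibrated classifier.

import numpy as np

def calibration_map(z, latent_f):
    """Calibrated probabilities v(z) from eq. (3.2): softargmax of f applied elementwise."""
    f = latent_f(z)
    f = f - f.max()                      # numerical stability
    return np.exp(f) / np.exp(f).sum()

z = np.array([0.7, 0.2, 0.1])
# Prior mean mu(z) = log(z): softargmax(log(z)) returns z unchanged.
print(calibration_map(z, np.log))        # -> [0.7 0.2 0.1]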

3.2 Inference

In order to infer the calibration map, we need to fit the underlying Gaussian process based on the confidence predictions or logits and the ground truth classes in the calibration set. By our choice of likelihood, the posterior is not analytically tractable. In order for our method to scale to large data sets, we only retain a sparse representation of the input data, making inference of the latent Gaussian process computationally less intensive. We approximate the posterior through a scalable variational inference method [23].

According to our definition of the latent Gaussian process and the inverse link function, the joint distribution of the calibration data $(Z, y)$ and the latent variables $f$ is given by

$$p(y, f) = p(y \mid f)\,p(f) = \prod_{n=1}^N p(y_n \mid f_n)\,p(f) = \prod_{n=1}^N \mathrm{Cat}(y_n \mid \sigma(f_n))\;\mathcal{N}(f \mid \mu, \Sigma_f),$$

where $y \in \{1, \dots, K\}^N$, $f = (f_1, f_2, \dots, f_N)^\top \in \mathbb{R}^{NK}$ and $f_n = (f(z_{n1}), \dots, f(z_{nK}))^\top \in \mathbb{R}^K$. The covariance matrix $\Sigma_f$ has block-diagonal structure by independence of the calibration data, as follows

$$\Sigma_f = \begin{pmatrix} A_{1,1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & A_{N,N} \end{pmatrix} \in \mathbb{R}^{NK \times NK}$$

and each submatrix is given via the kernel function

$$A_{i,j} = \begin{pmatrix} k(z_i^1, z_j^1 \mid \theta) & \cdots & k(z_i^1, z_j^K \mid \theta) \\ \vdots & \ddots & \vdots \\ k(z_i^K, z_j^1 \mid \theta) & \cdots & k(z_i^K, z_j^K \mid \theta) \end{pmatrix} \in \mathbb{R}^{K \times K},$$

where $i, j \in \{1, \dots, N\}$. If performance is critical, a further diagonal assumption can be made. This would correspond to the assumption that the confidence estimates for the different classes of a given data point are independent. Note that we drop the explicit dependence on $Z$ and $\theta$ throughout to lighten the notation.

3.2.1 Inducing Points

Our goal is to compute the posterior $p(f \mid y)$. In order to reduce the computational complexity from $\mathcal{O}((NK)^3)$ we focus on inducing point methods. We define $M$ inducing inputs $W = (w_1, \dots, w_M)^\top \in \mathbb{R}^M$ and inducing variables $u \in \mathbb{R}^M$ with the goal to only retain a sparse representation of our Gaussian process at these points. The joint distribution is given by

$$p(f, u) = \mathcal{N}\left(\begin{pmatrix} f \\ u \end{pmatrix} \,\middle|\, \begin{pmatrix} \mu_f \\ \mu_u \end{pmatrix}, \begin{pmatrix} \Sigma_f & \Sigma_{f,u} \\ \Sigma_{f,u}^\top & \Sigma_u \end{pmatrix}\right), \qquad (3.4)$$

where the covariance matrix between the calibration data and the inducing points is given by

$$\Sigma_{f,u} = \begin{pmatrix} k(z_1^1, w_1) & \cdots & k(z_1^1, w_M) \\ \vdots & \ddots & \vdots \\ k(z_1^K, w_1) & \cdots & k(z_1^K, w_M) \\ \vdots & \ddots & \vdots \\ k(z_N^K, w_1) & \cdots & k(z_N^K, w_M) \end{pmatrix} \in \mathbb{R}^{NK \times M}$$

and the covariance matrix at the inducing points by

$$\Sigma_u = \begin{pmatrix} k(w_1, w_1 \mid \theta) & \cdots & k(w_1, w_M \mid \theta) \\ \vdots & \ddots & \vdots \\ k(w_M, w_1 \mid \theta) & \cdots & k(w_M, w_M \mid \theta) \end{pmatrix} \in \mathbb{R}^{M \times M}.$$

Using Bayes' theorem and the conditional independence of $y$ and $u$ given $f$, the joint can be factorised as

$$p(y, f, u) = p(y \mid f)\,p(f \mid u)\,p(u). \qquad (3.5)$$

We aim to find a variational approximation $q(u) = \mathcal{N}(u \mid m, S)$ to the posterior $p(u \mid y)$. For general treatments of variational inference we refer interested readers to [55, 56].

3.2.2 Bound on the Marginal Likelihood

We find the variational parameters $m$ and $S$, the locations of the inducing inputs $w$ and the kernel parameters $\theta$ by optimising a lower bound to the marginal log-likelihood $\log p(y)$. To begin, consider the following bound, derived by marginalisation and Jensen's inequality,

$$\log p(y \mid u) \geq \mathbb{E}_{p(f \mid u)}[\log p(y \mid f)]. \qquad (3.6)$$

We then substitute eq. (3.6) into the lower bound to the evidence (ELBO) as follows

$$\begin{aligned}
\log p(y) &= \mathrm{KL}[q(u) \,\|\, p(u \mid y)] + \mathrm{ELBO}(q(u)) \\
&\geq \mathrm{ELBO}(q(u)) \\
&= \mathbb{E}_{q(u)}[\log p(y, u)] - \mathbb{E}_{q(u)}[\log q(u)] \\
&= \mathbb{E}_{q(u)}[\log p(y \mid u)] - \mathrm{KL}[q(u) \,\|\, p(u)] \\
&\geq \mathbb{E}_{q(u)}\big[\mathbb{E}_{p(f \mid u)}[\log p(y \mid f)]\big] - \mathrm{KL}[q(u) \,\|\, p(u)] \\
&= \mathbb{E}_{q(f)}[\log p(y \mid f)] - \mathrm{KL}[q(u) \,\|\, p(u)] \\
&= \mathbb{E}_{q(f)}\Big[\log \prod_{n=1}^N p(y_n \mid f_n)\Big] - \mathrm{KL}[q(u) \,\|\, p(u)] \\
&= \sum_{n=1}^N \mathbb{E}_{q(f_n)}[\log p(y_n \mid f_n)] - \mathrm{KL}[q(u) \,\|\, p(u)],
\end{aligned} \qquad (3.7)$$

where the second to last equality holds by independence of the training data and

$$q(f) := \int p(f \mid u)\,q(u)\,\mathrm{d}u.$$

By eq. (3.4) and Theorem B.3 we obtain

$$p(f \mid u) = \mathcal{N}\big(f \mid \mu_f + \Sigma_{f,u}\Sigma_u^{-1}(u - \mu_u),\; \Sigma_f - \Sigma_{f,u}\Sigma_u^{-1}\Sigma_{f,u}^\top\big).$$

With $q(u) = \mathcal{N}(u \mid m, S)$ and $A := \Sigma_{f,u}\Sigma_u^{-1}$, we have

$$q(f) := \int \underbrace{p(f \mid u)\,q(u)}_{q(f,u)}\,\mathrm{d}u = \mathcal{N}\big(f \mid \mu_f + A(m - \mu_u),\; \Sigma_f + A(S - \Sigma_u)A^\top\big), \qquad (3.8)$$

as $q(f, u)$ is again normally distributed by Theorem B.4 and marginals of multivariate normal distributions are normally distributed by Theorem B.1.

Figure 3.2: Taylor approximation of the log-softargmax function. Illustration of the second-order Taylor approximation to the log-softargmax function (3.10) for a binary calibration problem with $y = 0$ and mean of the variational distribution $\varphi_n = (0, 0)^\top$.

3.2.3 Computation of the Expectation Terms

In order to obtain the variational objective eq. (3.7) we need to compute the expected value terms

$$\mathbb{E}_{q(f_n)}[\log p(y_n \mid f_n)] = \mathbb{E}_{q(f_n)}\left[\log \frac{\exp(f_n^{y_n})}{\sum_{k=1}^K \exp(f_n^k)}\right] = \varphi_n^{y_n} - \mathbb{E}_{q(f_n)}\left[\log \sum_{k=1}^K \exp\big(f_n^k\big)\right] \qquad (3.9)$$

with respect to the $K$-dimensional marginals of $q(f)$,

$$q(f_n) = \int p(f_n \mid u)\,q(u)\,\mathrm{d}u = \mathcal{N}(f_n \mid \varphi_n, C_n),$$

which are normally distributed. To compute the intractable expectation terms (3.9), we use a second order Taylor approximation of

$$h(f_n) := \log p(y_n \mid f_n) = \log \frac{\exp(f_n^{y_n})}{\sum_{k=1}^K \exp(f_n^k)} \qquad (3.10)$$

at $f_n = \varphi_n$. An illustration is shown in Figure 3.2. We begin by computing the Hessian of the log-softargmax. We have

$$D_f \log \sigma(f)_y = D_f \log \frac{\exp(f_y)}{\sum_{k=1}^K \exp(f_k)} = D_f\left[f_y - \log \sum_{k=1}^K \exp(f_k)\right] = e_y - \sigma(f)$$

where $e_y$ is the $y$-th unit vector. Then

$$D_f^2 \log \sigma(f)_y = -D_f \sigma(f) = -\begin{pmatrix} \sigma(f)_1(1 - \sigma(f)_1) & -\sigma(f)_1\sigma(f)_2 & \cdots \\ -\sigma(f)_1\sigma(f)_2 & \sigma(f)_2(1 - \sigma(f)_2) & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix} = \sigma(f)\sigma(f)^\top - \mathrm{diag}(\sigma(f)).$$

Note that, somewhat surprisingly, this expression does not depend on $y$. Now we obtain, by using $x^\top M x = \mathrm{tr}(x^\top M x)$, the linearity of the trace and its invariance under cyclic permutations, that

$$\begin{aligned}
\mathbb{E}_{q(f_n)}[\log p(y_n \mid f_n)] &= \mathbb{E}_{q(f_n)}[h(f_n)] \\
&\approx \mathbb{E}_{q(f_n)}\Big[h(\varphi_n) + D_{f_n}h(\varphi_n)^\top (f_n - \varphi_n) + \tfrac{1}{2}(f_n - \varphi_n)^\top D^2_{f_n}h(\varphi_n)(f_n - \varphi_n)\Big] \\
&= h(\varphi_n) + \tfrac{1}{2}\,\mathbb{E}_{q(f_n)}\Big[(f_n - \varphi_n)^\top \big(\sigma(\varphi_n)\sigma(\varphi_n)^\top - \mathrm{diag}(\sigma(\varphi_n))\big)(f_n - \varphi_n)\Big] \\
&= h(\varphi_n) + \tfrac{1}{2}\,\mathrm{tr}\Big[\mathbb{E}_{q(f_n)}\big[(f_n - \varphi_n)(f_n - \varphi_n)^\top\big]\big(\sigma(\varphi_n)\sigma(\varphi_n)^\top - \mathrm{diag}(\sigma(\varphi_n))\big)\Big] \\
&= \log p(y_n \mid \varphi_n) + \tfrac{1}{2}\,\mathrm{tr}\Big[C_n\big(\sigma(\varphi_n)\sigma(\varphi_n)^\top - \mathrm{diag}(\sigma(\varphi_n))\big)\Big] \\
&= \log p(y_n \mid \varphi_n) + \tfrac{1}{2}\Big(\mathrm{tr}\big[\sigma(\varphi_n)^\top C_n \sigma(\varphi_n)\big] - \mathrm{tr}\big[C_n\,\mathrm{diag}(\sigma(\varphi_n))\big]\Big) \\
&= \log p(y_n \mid \varphi_n) + \tfrac{1}{2}\Big(\sigma(\varphi_n)^\top C_n \sigma(\varphi_n) - \mathrm{diag}(C_n)^\top \sigma(\varphi_n)\Big),
\end{aligned}$$

where the first-order term vanishes since $\mathbb{E}_{q(f_n)}[f_n - \varphi_n] = 0$. The final expression can be computed in $\mathcal{O}(K^2)$ by expressing the term inside the parentheses as a double sum over $K$ terms. Computing the KL-divergence term in (3.7) is in $\mathcal{O}(M^3)$. Therefore, computing the objective (3.7) has complexity $\mathcal{O}(NK^2 + M^3)$. As we assume $M \ll N$, most of the computational expense will be in computing the $N$ expectation terms. Note that this can be remedied through parallelisation, as all $N$ expectation terms can be computed independently. The optimisation over the variational parameters, the kernel parameters and the inducing point locations can then be performed with a gradient-based optimiser, where the required gradients can be derived analytically or, as in our case, obtained by automatic differentiation.
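As a sanity check on the closed form above, the following minimal NumPy sketch evaluates the Taylor-approximated expectation term for a single calibration point given the marginal mean $\varphi_n$ and covariance $C_n$; the function and variable names are illustrative.

import numpy as np

def softargmax(f):
    f = f - f.max()                       # numerical stability
    e = np.exp(f)
    return e / e.sum()

def approx_expected_log_lik(y_n, phi_n, C_n):
    """Second-order Taylor approximation of E_{q(f_n)}[log p(y_n | f_n)]."""
    s = softargmax(phi_n)
    log_p_at_mean = np.log(s[y_n])
    correction = 0.5 * (s @ C_n @ s - np.diag(C_n) @ s)
    return log_p_at_mean + correction

# Example with K = 3 classes
phi_n = np.array([1.0, 0.2, -0.5])
C_n = 0.1 * np.eye(3)
print(approx_expected_log_lik(0, phi_n, C_n))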

3.3 Prediction

Now that the latent process is fit to the calibration data, we can predict calibrated uncertainties for new input data. Given the approximate posterior

$$p(f, u \mid y) \approx q(f, u) := p(f \mid u)\,q(u),$$

predictions at new inputs $Z_*$ are obtained via

$$\begin{aligned}
p(f_* \mid y) &= \int p(f_* \mid f, u)\,p(f, u \mid y)\,\mathrm{d}f\,\mathrm{d}u \\
&\approx \int p(f_* \mid f, u)\,p(f \mid u)\,q(u)\,\mathrm{d}f\,\mathrm{d}u \\
&= \int p(f_* \mid u)\,q(u)\,\mathrm{d}u.
\end{aligned}$$

Note that $p(f_* \mid y)$ is Gaussian by Theorem B.4, as in eq. (3.8). Means and variances of a latent value $f_* \in \mathbb{R}^K$ can be computed in $\mathcal{O}(KM^2)$. The class prediction $y_*$ is then obtained by evaluating the integral

$$p(y_* \mid y) = \int p(y_* \mid f_*)\,p(f_* \mid y)\,\mathrm{d}f_*$$

via Monte-Carlo integration. While inference and prediction have a higher computational cost than other calibration methods, it is small compared to the training time of the underlying classifier, since usually only a small fraction of the data is necessary for calibration.

3.4 Online Calibration

Streaming sparse Gaussian process approximations [57] could allow for an extension of our approach to the online setting. This is particularly interesting in active learning applications, where we aim to calibrate as data is coming in sequentially.

The comparatively higher computational cost of Gaussian process calibration is remedied in the online setting by three factors. First, calibration is completely independent of model training and prediction. Second, calibration can be done in parallel to online classification and be incorporated once it is completed. Third, in the streaming setting fewer samples for calibration can be requested and there is ample time between them to calibrate.

3.5 Implementation

All calibration methods from Section 2.4 were implemented using Python 3.6 and are available as a package with documentation at https://www.github.com/JonathanWenger/pycalib. An example script demonstrating the use of pycalib is shown in Listing 3.1. We implemented the inference method for GP calibration outlined above using gpflow [58], a Python framework for Gaussian process models which builds on tensorflow [59]. This allows for automatic differentiation to obtain the gradient of the variational objective eq. (3.7) with respect to the variational parameters m and S, the locations of the inducing inputs w and the kernel parameters θ. While these gradients can be derived analytically, automatic differentiation reduces implementation length and complexity.

Listing 3.1: Code usage example of pycalib. Python code demonstrating how to calibrate a classifier on the MNIST data set using Gaussian process calibration.

# Package imports
import numpy as np
import sklearn.datasets
import sklearn.model_selection
from sklearn.ensemble import RandomForestClassifier
import pycalib.calibration_methods as calm

# Seed and data size
seed = 0
n_test = 10000
n_calib = 1000

# Download MNIST data
X, y = sklearn.datasets.fetch_openml('mnist_784', version=1, return_X_y=True, cache=True)
X = X / 255.
y = np.array(y, dtype=int)

# Split data into train, calibration and test
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, test_size=n_test, random_state=seed)
X_train, X_calib, y_train, y_calib = sklearn.model_selection.train_test_split(
    X_train, y_train, test_size=n_calib, random_state=seed)

# Train classifier
rf = RandomForestClassifier(random_state=seed)
rf.fit(X_train, y_train)
p_uncal = rf.predict_proba(X_test)

# Calibrate on the calibration set and predict calibrated probabilities
gpc = calm.GPCalibration(n_classes=10, random_state=seed)
gpc.fit(rf.predict_proba(X_calib), y_calib)
p_pred = gpc.predict_proba(p_uncal)

Chapter 4

Experiments

We experimentally evaluate our approach against the calibration methods presented in Section 2.4. We use a range of different classifiers on a set of binary and multi-class computer vision benchmark data sets and subsequently calibrate them. Besides convolutional neural networks, we are also interested in ensemble methods such as boosting and forests. These models are still relevant in computer vision and robotics due to their smaller computational cost during training and prediction compared to large-scale neural network architectures.

All methods and experiments were implemented in Python 3.6. We either used the authors' original code, if available, or re-implemented the calibration methods based on the respective publications. For our GP-based method we used a log mean function and a sum kernel consisting of an RBF and a white noise kernel, as in eq. (3.1). All experiments were performed with the implementation described in Section 3.5.
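For reference, such a sum kernel can be constructed in a few lines with GPflow; the snippet below assumes GPflow 2.x (the exact version used in our implementation may differ) and the parameter values are illustrative rather than the ones used in the experiments.

import gpflow

# Sum kernel: squared exponential (RBF) part plus white noise, cf. eq. (3.1)
kernel = gpflow.kernels.RBF(variance=1.0, lengthscales=1.0) \
         + gpflow.kernels.White(variance=1e-4)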

We report the average ECE1, estimated with 100 bins, over 10 Monte-Carlo cross validation runs. This means we sample a calibration data set without replacement from the total data each time; splits may therefore contain the same data points. We choose this cross validation strategy as it allows us to choose the size of the calibration set and the number of validation splits freely. A sketch of the binned ECE1 estimator is given after the data set list below. We used the following data sets with indicated train, calibration and test splits:

• KITTI [60, 61]: Stream-based urban traffic scenes with features [62] from seg- mented 3D point clouds. 8 or 2 classes, dimension 60, train: 16000, calibration: 1000, test: 8000. • PCam [63]: Detection of metastatic tissue in histopathologic scans of lymph node sections converted to gray scale. 2 classes, dimension 96 96, train: 22768, calibration: 1000, test: 9000. × • MNIST [19]: Handwritten digit recognition. 10 classes, dimension 28 28, train: 60000, calibration: 1000, test: 9000. × • ImageNet 2012 [64]: Image database of natural objects and scenes. 1000 classes, train: 1.2 million, calibration: 1000, test: 9000.
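The following is a minimal NumPy sketch of a standard binned ECE1 estimator (equal-width bins, weighted by the fraction of samples per bin); we assume this matches the estimator used for the reported numbers, and the function name is illustrative.

import numpy as np

def ece_1(probs, labels, n_bins=100):
    """Binned estimate of the expected calibration error ECE_1.

    probs: (N, K) array of predicted class probabilities; labels: (N,) true classes.
    """
    conf = probs.max(axis=1)                      # confidence z_hat of the predicted class
    correct = probs.argmax(axis=1) == labels
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap            # weight by fraction of samples in the bin
    return ece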

We give a more detailed explanation of each data set in the following paragraphs.

KITTI

The KITTI vision benchmark suite [60] is an autonomous driving data set captured via video cameras, a laser scanner and a GPS localisation system. The sensors were mounted on a car driving through various German urban traffic environments. We use a version of the data set from the benchmark suite as described in [61], which consists of 18 streams of annotated, segmented point clouds, which were concatenated into one stream. Each feature vector represents points in a bounding box around an object. Subsequently, a 60-dimensional feature vector is computed via point feature histograms [62]. These features are particularly suitable for the online setting as they are computable in real time and low-dimensional compared to the original data. For the binary setting we group all classes which are not cars into one class. An illustration of the KITTI data set is shown in Figure 4.1. Each of the indicated bounding boxes contains a point cloud which is then converted to a feature representation as described.

Figure 4.1: Traffic scene from the KITTI data set. Still image captured from an example sequence of the KITTI data set [60] showing point clouds in white on black background, ground truth bounding boxes in color and a road overlay. The image at the top shows the camera image recorded by the stereo camera system with bounding boxes added.

PCam

The PatchCamelyon [63] data set consists of 96x96 color images depicting histopathological scans of lymph node tissue. Each image is labelled according to the presence of metastatic tissue. As medical applications are becoming increasingly important in machine learning and particularly computer vision, this data set provides a suitable benchmark. Detection of metastatic tissue is a clinically relevant task and thus motivates the choice of this data set. A subset of the images is shown in Figure 4.2. In our application we converted the images to gray scale to reduce the dimensionality of the data set and to allow for training of ensemble methods without preceding feature extraction. The color uniformity in the data for different scans justifies this choice.

Figure 4.2: Sample scans from the PCam data set. Example images from the PCam data set [63] depicting scans of lymph node tissue. Samples with metastatic tissue in the center are indicated by green boxes and given a positive label.

MNIST

The MNIST database [19] of handwritten digits is a commonly used computer vision benchmark data set containing 70,000 size-normalised and centered 28x28 pixel gray scale images. The images depict handwritten digits 0 through 9 and thus the data set has 10 classes. Some random samples from the data set are shown in Figure 4.3.

Figure 4.3: Sample digits from the MNIST data set. Randomly drawn samples from the MNIST database [19] of handwritten digits.

ImageNet

ImageNet [64] is an annotated database of more than 14 million images of everyday objects and scenery classified into over 20,000 different categories. It was published as a benchmark vision data set as part of a computer vision research competition. Here, we use a curated subset of the database with 1000 different classes and a labelled validation set of 50,000 images. A subset of images contained in the data set is shown in Figure 4.4.

4.1 Synthetic Data

We begin by outlining a procedure to generate synthetic calibration data directly, without having to classify data first, i.e. a procedure to generate vectors of probabilities. In this way we can test calibration methods and generate illustrative figures.

Figure 4.4: Samples from the ImageNet data set. Illustrative samples from the ImageNet data set showing a wide variety of different classes. During classification images are rescaled to uniform dimensions. Reprinted from [64].

We draw confidence estimates for the predicted class from a Beta distribution, apply a given function miscalibrating the confidence estimates and finally sample class predictions. In this way we can control the shape of the confidence histogram and the degree of miscalibration.

We begin by sampling a set of confidence estimates from a Beta distribution scaled to the interval $[\frac{1}{K}, 1]$. We do this to be able to choose the distribution of confidence estimates freely. We obtain

$$\hat{z} \sim \left(1 - \frac{1}{K}\right)\mathrm{Beta}(\alpha, \beta) + \frac{1}{K},$$

where $\alpha, \beta \in (0, \infty)$. Next, we sample ground truth class labels for a multi-class problem from a categorical distribution

$$y \sim \mathrm{Categorical}(\rho),$$

where $\rho_k = p(y = C_k) \in [0, 1]$ determines the marginal class probabilities. Since we are aiming to generate miscalibrated predictions, we now sample the correct prediction not with probability $\hat{z}$, but with probability $g(\hat{z})$, where $g : [\frac{1}{K}, 1] \to [\frac{1}{K}, 1]$ is called the miscalibration function. It specifies the mapping from predicted confidence to accuracy of our synthetic classification output. This allows us to specify the degree and type of miscalibration for the synthetic experiment. For example, we can emulate predictions from a random forest by choosing a function which is always larger than the identity. This produces underconfident predictions, as is often observed for ensemble methods. Thus, we sample the predicted class labels from a two-point distribution, which is a Bernoulli distribution with arbitrary support $\{a, b\}$ rather than $\{0, 1\}$. We have

$$\hat{y} \sim \mathrm{TwoPoint}(g(\hat{z}), a, b),$$

where $a = y$ and $b = j$ for $j$ uniformly sampled from the remaining classes $\{1, \dots, K\} \setminus \{y\}$.

Finally, we need to generate the predicted probabilities for the other classes besides $\hat{y}$. There are two constraints: one, $\hat{z}$ has to stay maximal and two, the $z_i$ need to sum to one as they represent posterior class probabilities. We achieve this by using a conditional stick-breaking process conditioned on $\hat{z}$ being maximal. We set $\theta_0 = \hat{z}$ and sample $(K - 1)$ times from a Beta distribution

$$\theta_k \sim \mathrm{Beta}(1, \gamma),$$

where $\gamma \in (0, \infty)$, and rescale each time such that

$$\max\left(1 - \sum_{l=0}^k \theta_l - (K - k - 1)\theta_0,\; 0\right) \leq \theta_{k+1} \leq \min\left(\theta_0,\; 1 - \sum_{l=0}^k \theta_l\right). \qquad (4.1)$$

This implies $\theta_0 \geq \theta_k$ for $k \in \{1, \dots, K-1\}$. Finally, in order to remove the dependence between the class probabilities introduced by the monotonicity from eq. (4.1), we uniformly draw probabilities for the non-predicted classes $\{1, \dots, K\} \setminus \{\hat{y}\}$ out of $\{\theta_k\}_{k=1}^{K-1}$. This completes the synthetic data generation. Figure 4.5 illustrates the use of GP calibration on a synthetic data set generated with this procedure.
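A minimal Python sketch of this generator is given below. It follows the first steps of the procedure (Beta-distributed confidence, miscalibration function g, two-point sampling of the predicted label); the remaining class probabilities are filled in by rejection-sampled Dirichlet weights instead of the conditional stick-breaking scheme, and all function names are illustrative.

import numpy as np

def sample_synthetic_predictions(n, K, alpha, beta, g, rho=None, seed=0):
    """Simplified sketch of the synthetic calibration data generator described above."""
    rng = np.random.default_rng(seed)
    rho = np.full(K, 1.0 / K) if rho is None else np.asarray(rho)

    # Confidence of the predicted class, Beta-distributed and scaled to [1/K, 1]
    z_hat = (1.0 - 1.0 / K) * rng.beta(alpha, beta, size=n) + 1.0 / K
    # Ground-truth labels with marginal class probabilities rho
    y = rng.choice(K, size=n, p=rho)

    # Two-point distribution: prediction equals y with probability g(z_hat),
    # otherwise a uniformly drawn different class
    correct = rng.random(n) < g(z_hat)
    offsets = rng.integers(1, K, size=n)
    y_hat = np.where(correct, y, (y + offsets) % K)

    # Fill in the non-predicted class probabilities so that they sum to 1 - z_hat
    # and z_hat stays maximal (rejection sampling instead of stick-breaking)
    Z = np.zeros((n, K))
    for i in range(n):
        while True:
            rest = (1.0 - z_hat[i]) * rng.dirichlet(np.ones(K - 1))
            if rest.max() <= z_hat[i]:
                break
        Z[i, y_hat[i]] = z_hat[i]
        Z[i, [k for k in range(K) if k != y_hat[i]]] = rest
    return Z, y, y_hat

# Example: overconfident predictions via a miscalibration function below the identity
Z, y, y_hat = sample_synthetic_predictions(
    n=1000, K=10, alpha=5.0, beta=1.0, g=lambda z: z ** 2, seed=0)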

4.2 Binary Benchmark Data

We trained two boosting variants (AdaBoost [65, 66], XGBoost [67]), two forest variants (Mondrian Forest [68], Random Forest [69]) and a simple one layer neural network on the binary KITTI and PCam data sets. We were interested in boosting and forests as they are typically underconfident, in contrast to neural networks, which are overconfident.

We report the average ECE1 in Table 4.1. For binary problems all calibration methods perform similarly, with the exception of isotonic regression, which has particularly low calibration error on the KITTI data set. However, due to its piece-wise constant calibration map, the resulting confidence distribution of the predicted class has a set of singular peaks instead of a smooth distribution. While GP calibration is competitive across data sets and classifiers, it does not outperform any of the other methods and is computationally more expensive. Hence, if exclusively binary problems are of interest, a simple calibration method such as isotonic regression or Beta calibration should be preferred. The simple one layer neural network on the KITTI data set is already well calibrated; nonetheless, all calibration methods except isotonic regression and GP calibration increase the ECE1.

Figure 4.5: Reliability diagrams before and after GP calibration. Reliability diagrams for synthetic data with 10 classes and a train set with 100 data points showing the effect of GP calibration on a test set with 900 instances. (a) No calibration: ECE1 = 0.293. (b) GP calibration: ECE1 = 0.069. The uncalibrated reliability diagram is styled after effects often observed in modern network based image classifiers, which tend to be overconfident.

4.3 Multi-class Benchmark Data

Aside from the aforementioned classification models, which were trained on MNIST, we also calibrated pre-trained convolutional neural network architectures on ImageNet. The following CNNs were used:

• AlexNet [70]
• VGG19 [71]
• ResNet50, ResNet152 [72]
• DenseNet121, DenseNet201 [73]
• Inception v4 [74]
• SE ResNeXt50, SE ResNeXt101 [75, 76]

All binary calibration methods were extended to the multi-class setting in a one-vs-all manner. Temperature scaling was applied to logits for all CNNs and otherwise directly to probability scores.

Table 4.1: Calibration results on binary classification benchmark data sets. Average ECE1 and standard deviation of ten Monte-Carlo cross validation folds on binary benchmark data sets. Lowest calibration error per data set and classification model is indicated in bold.

Data Set | Model | Uncal. | Platt | Isotonic | Beta | BBQ | Temp. | GPcalib
KITTI | AdaBoost | .4301 | .0182 ± .0018 | .0134 ± .0021 | .0180 ± .0016 | .0190 ± .0055 | .0185 ± .0017 | .0192 ± .0018
KITTI | XGBoost | .0434 | .0198 ± .0019 | .0114 ± .0026 | .0178 ± .0015 | .0184 ± .0038 | .0204 ± .0009 | .0186 ± .0017
KITTI | Mondr. Forest | .0546 | .0198 ± .0011 | .0142 ± .0018 | .0252 ± .0099 | .0218 ± .0035 | .0200 ± .0008 | .0202 ± .0018
KITTI | Rand. Forest | .0768 | .0147 ± .0027 | .0135 ± .0030 | .0159 ± .0027 | .0652 ± .0469 | .0126 ± .0020 | .0182 ± .0032
KITTI | 1 layer NN | .0153 | .0285 ± .0034 | .0121 ± .0043 | .0174 ± .0026 | .0178 ± .0056 | .0280 ± .0015 | .0156 ± .0020
PCam | AdaBoost | .2506 | .0409 ± .0020 | .0335 ± .0047 | .0397 ± .0024 | .0330 ± .0077 | .0381 ± .0033 | .0389 ± .0032
PCam | XGBoost | .0605 | .0378 ± .0010 | .0323 ± .0058 | .0356 ± .0028 | .0312 ± .0110 | .0399 ± .0020 | .0332 ± .0039
PCam | Mondr. Forest | .0415 | .0428 ± .0024 | .0291 ± .0066 | .0349 ± .0040 | .0643 ± .0161 | .0427 ± .0013 | .0347 ± .0043
PCam | Rand. Forest | .0798 | .0237 ± .0035 | .0233 ± .0052 | .0293 ± .0053 | .0599 ± .0084 | .0210 ± .0013 | .0285 ± .0023
PCam | 1 layer NN | .2090 | .0717 ± .0051 | .0297 ± .0092 | .0501 ± .0049 | .0296 ± .0102 | .0542 ± .0015 | .0461 ± .0034

The average expected calibration error is shown in Table 4.2. While binary methods still perform reasonably well for 10 classes in the case of MNIST, they worsen calibration in the case of 1000 classes on ImageNet. Moreover, they also skew the posterior predictive distribution so much that accuracy is sometimes severely affected, disqualifying them from use (see Table A.2 in the appendix). Temperature scaling preserves the underlying accuracy of the classifier by definition. Even though GP calibration has no such guarantees, our experiments show very little effect on accuracy (see Table A.2). GP calibration outperforms temperature scaling for boosting methods on MNIST. These tend to be severely underconfident and in the case of AdaBoost have low confidence overall. Only our method is able to handle this. Both temperature scaling and GP calibration perform well across CNN architectures on ImageNet, whereas GP calibration performs particularly well on CNNs which demonstrate high accuracy. Further, in contrast to all other methods, GP calibration preserves low ECE1 for Inception v4. We attribute this desirable behaviour, also seen in the binary case, to the prior assumption that the underlying classification method is already calibrated. The increased flexibility of the non-parametric latent function and its prior assumptions allow our approach to better adapt to various classifiers and data sets.
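For reference, a minimal sketch of one way to extend a binary calibration method to the multi-class setting in a one-vs-all fashion is given below, here using isotonic regression from scikit-learn. The renormalisation step and the function name are illustrative assumptions rather than the exact scheme used in our implementation.

import numpy as np
from sklearn.isotonic import IsotonicRegression

def one_vs_all_calibrate(p_calib, y_calib, p_test):
    """One-vs-all extension of a binary calibrator (isotonic regression here).

    Each class column is calibrated against the binary indicator [y = k]; the
    calibrated scores are then renormalised to sum to one (assumed scheme).
    """
    K = p_calib.shape[1]
    out = np.zeros_like(p_test, dtype=float)
    for k in range(K):
        iso = IsotonicRegression(out_of_bounds="clip")
        iso.fit(p_calib[:, k], (y_calib == k).astype(float))
        out[:, k] = iso.predict(p_test[:, k])
    out = np.clip(out, 1e-12, None)
    return out / out.sum(axis=1, keepdims=True)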

4.4 Active Learning

We hypothesise that a better uncertainty estimation of the posterior through calibration leads to an improved learning process when performing active learning. To evaluate this, we use the multi-class KITTI data set, for which we trained two Mondrian forests. These are particularly suited to the online setting, as they are computationally efficient to train and have the same distribution whether trained online or in batch. We randomly shuffled the data 10 times and requested samples based on an entropy query strategy with a threshold of 0.5. Any samples above the threshold are used for training. Both forests are trained for 1000 samples and subsequently one of them uses 200 samples exclusively for calibration at regularly spaced intervals.
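The entropy query criterion can be sketched in a few lines of Python. Whether the entropy is taken with the natural logarithm and left unnormalised, as below, is an assumption; the thesis only specifies the threshold value of 0.5.

import numpy as np

def request_label(probs, threshold=0.5):
    """Entropy-based query criterion: request a label if the predictive entropy exceeds the threshold."""
    entropy = -np.sum(probs * np.log(np.clip(probs, 1e-12, None)))
    return entropy > threshold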

We report the expected calibration and classification error in Figure 4.6. As we can see, calibration initially incurs a penalty on accuracy for the calibrated forest, as fewer samples are used for training. This penalty is remedied over time through more efficient querying. The same error as the uncalibrated Mondrian forest is reached after a pass through the entire data set, while fewer samples overall were requested.

A look at the influence of calibration on over- and underconfidence in Figure 4.7 illustrates the effect of Theorem 2.6 and the reason for the more conservative label requests and therefore improved efficiency. Underconfidence is reduced at the expense of overconfidence, leading to a more conservative sampling strategy which does not penalise accuracy in the long run.

36 Table 4.2: Calibration results on multi-class classification benchmark data sets. Average ECE1 and standard deviation of ten Monte-Carlo cross validation folds on multi-class benchmark data sets. Lowest calibration error per data set and classification model is indicated in bold.

(The Platt, Isotonic, Beta and BBQ columns are one-vs-all extensions of the binary methods.)

Data Set | Model | Uncal. | Platt | Isotonic | Beta | BBQ | Temp. | GPcalib
MNIST | AdaBoost | .6121 | .2267 ± .0137 | .1319 ± .0108 | .2222 ± .0134 | .1384 ± .0104 | .1567 ± .0122 | .0414 ± .0085
MNIST | XGBoost | .0740 | .0449 ± .0021 | .0176 ± .0018 | .0184 ± .0014 | .0207 ± .0020 | .0222 ± .0015 | .0180 ± .0014
MNIST | Mondr. Forest | .2163 | .0357 ± .0049 | .0282 ± .0021 | .0383 ± .0057 | .0762 ± .0111 | .0208 ± .0012 | .0213 ± .0020
MNIST | Rand. Forest | .1178 | .0273 ± .0039 | .0207 ± .0042 | .0259 ± .0070 | .1233 ± .0005 | .0121 ± .0012 | .0148 ± .0021
MNIST | 1 layer NN | .0262 | .0126 ± .0031 | .0140 ± .0017 | .0168 ± .0018 | .0186 ± .0027 | .0195 ± .0060 | .0239 ± .0023
ImageNet | AlexNet | .0354 | .1143 ± .0128 | .2771 ± .0118 | .2321 ± .006 | .1344 ± .006 | .0336 ± .0038 | .0354 ± .0024
ImageNet | VGG19 | .0375 | .1018 ± .0083 | .2656 ± .0481 | .2484 ± .0069 | .1642 ± .0136 | .0347 ± .0036 | .0351 ± .0042
ImageNet | ResNet50 | .0444 | .0911 ± .0086 | .2632 ± .054 | .2239 ± .0077 | .1627 ± .0119 | .0333 ± .0032 | .0333 ± .0024
ImageNet | ResNet152 | .0525 | .0862 ± .0098 | .2374 ± .0238 | .2177 ± .0159 | .1665 ± .0076 | .0328 ± .003 | .0336 ± .0032
ImageNet | DenseNet121 | .0369 | .0941 ± .0076 | .2374 ± .011 | .2277 ± .009 | .1536 ± .0105 | .0333 ± .0034 | .0331 ± .0038
ImageNet | DenseNet201 | .0421 | .0923 ± .0066 | .2306 ± .0195 | .2195 ± .015 | .1602 ± .0071 | .0319 ± .0029 | .0336 ± .004
ImageNet | Inception v4 | .0311 | .0852 ± .0062 | .2795 ± .0408 | .1628 ± .0095 | .1569 ± .0117 | .0460 ± .0061 | .0307 ± .0017
ImageNet | SE ResNeXt50 | .0432 | .0837 ± .0038 | .2570 ± .0391 | .1723 ± .0179 | .1717 ± .0206 | .0462 ± .0028 | .0311 ± .0033
ImageNet | SE ResNeXt101 | .0571 | .0837 ± .007 | .2718 ± .0367 | .1660 ± .0098 | .1513 ± .0084 | .0435 ± .0061 | .0317 ± .0031


Figure 4.6: Active learning and calibration. ECE1 and classification error for two Mondrian forests trained online on labels requested through an entropy query strategy on the KITTI data set. One Mondrian forest is calibrated at regularly spaced intervals (in gray) using GP calibration. Raw data and a Gaussian process regression up to the average number of queried samples across folds is shown.


Figure 4.7: Effects of calibration on over- and underconfidence in active learning. Over- and underconfidence for two Mondrian forests trained online in an active fashion. The Mondrian forest which was calibrated in regularly spaced intervals (in gray) demonstrates a shift in over- and underconfidence to the ratio determined by Theorem 2.6. Raw data and a Gaussian process regression up to the average number of queried samples across folds is shown.

Chapter 5

Conclusion

In this final chapter we give an overview of this thesis and its conclusions. We then give a more detailed description of further research directions with regard to calibration, our specific approach based on Gaussian processes, and the connection between calibration and active learning.

5.1 Summary

This thesis concerned itself with uncertainty representation in classification with applications in computer vision. We began by introducing different notions of uncertainty representation with a particular focus on calibration. Next, we demonstrated a theoretical connection between over- and underconfidence, two concepts from active learning, and probabilistic calibration. We showed that under perfect calibration, the ratio between over- and underconfidence is determined by the odds of the classifier making a correct prediction. The main contribution of this thesis is a novel multi-class calibration method for arbitrary classifiers based on a latent Gaussian process, allowing for the incorporation of prior knowledge. Its parameters are inferred through an adaptation of scalable variational inference for Gaussian processes. We tested our method against state-of-the-art calibration methods with a range of classifiers on a set of benchmark data sets from computer vision. We found that our method performed well universally across classifiers and data sets, which we attributed to its non-parametric nature. In particular, it showed low calibration error for boosting methods and high-accuracy CNNs. It was also the only method which did not worsen calibration for models which were already calibrated. We further found that on binary classification problems most binary calibration methods perform better than multi-class approaches. However, all binary calibration methods extended via a one-vs-all approach fail for multi-class problems, in particular when the number of classes grows large. Finally, we empirically studied the impact of calibration on querying efficiency in active learning. Our experiment showed that probability calibration can improve uncertainty sampling and result in fewer queries overall. It also empirically demonstrated the theoretical relationship between over- and underconfidence outlined in the beginning.

5.2 Future Work

There are many opportunities for further work on the topics covered in this thesis. We describe three areas in more detail: first, the general study of calibration; second, analysis and extension of our proposed calibration approach; and finally, the relationship between calibration and active learning.

The need for and effectiveness of calibration suggests that accuracy and uncertainty estimation benefit from being treated separately. It would be of considerable interest to develop a theoretical backing for this empirical observation. Further, to judge the usefulness of calibration on various data sets, in particular when sample size is small, a thorough analysis of the effect of calibration set size would be beneficial. The impact on accuracy could be studied in a set of experiments similar to the ones performed here. We hypothesise that there are different optimal ratios between training and calibration data for different classifiers and calibration methods. In particular, parametric methods may need less data, but will not be as adaptable to different classifiers as non-parametric ones.

Our proposed inference approach for GP calibration can be used with an arbitrary kernel and choice of hyperparameters. Calibration could possibly be improved by a different choice of kernel and prior distribution over hyperparameters. Further, in our implementation we fixed the sample size for the Monte-Carlo inference procedure. Here, one could either save computational expense or improve accuracy by analysing what effect the number of samples has on calibration error. Likewise, a more detailed analysis of the Taylor approximation to the expectation terms in the variational objective might yield insights into how close the variational approximation is to the true distribution. The GP calibration approach could further be improved by enforcing a monotone latent process, e.g. via derivative observations [77], to obtain a guarantee on the preservation of accuracy. Finally, it might be possible to extend the variational approach to the online setting (e.g. [57]), which would allow for continuous calibration.

A more thorough study of the effect of calibration on active learning could shed more light on its benefits. We noticed in our experiments that if the underlying classifier did not receive sufficient pre-training, it did not benefit from calibration enough to close the gap to the uncalibrated classifier. Similarly, the optimal size of the calibration data set in the online setting has not yet been studied. A further possibly fruitful direction of research is the development of a switching strategy between calibration and training in the online setting. This is reminiscent of an explore-exploit strategy and also tightly connected to the sample selection for calibration. Finally, calibration could possibly be improved by selecting samples for calibration not by the query criterion, but via what we call active calibration: an active learning query strategy which switches between requesting samples for model training and for probability calibration, based on the uncertainty of the latent Gaussian process of our calibration method.

Bibliography

[1] Dario Amodei et al. “Concrete Problems in AI Safety”. In: CoRR abs/1606.06565 (2016).
[2] Burr Settles. Active learning literature survey. Tech. rep. 55-66. University of Wisconsin, Madison, 2010, p. 11.
[3] Marius Cordts et al. “The Cityscapes Dataset for Semantic Urban Scene Understanding”. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016.
[4] Balaji Lakshminarayanan et al. “Simple and scalable predictive uncertainty estimation using deep ensembles”. In: Advances in Neural Information Processing Systems. 2017, pp. 6402–6413.
[5] Chuan Guo et al. “On calibration of modern neural networks”. In: Proceedings of the 34th International Conference on Machine Learning (ICML). 2017.
[6] Allan H. Murphy. “A New Vector Partition of the Probability Score”. In: Journal of Applied Meteorology (1962-1982) 12.4 (1973), pp. 595–600.
[7] Morris H. DeGroot and Stephen E. Fienberg. “The Comparison and Evaluation of Forecasters”. In: Journal of the Royal Statistical Society. Series D (The Statistician) 32.1/2 (1983), pp. 12–22.
[8] Mahdi Pakdaman Naeini et al. “Obtaining Well Calibrated Probabilities Using Bayesian Binning”. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA. Ed. by Blai Bonet and Sven Koenig. AAAI Press, 2015, pp. 2901–2907.
[9] Meelis Kull et al. “Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers”. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics. Vol. 54. Proceedings of Machine Learning Research. Fort Lauderdale, FL, USA: PMLR, 2017, pp. 623–631.
[10] Aviral Kumar et al. “Trainable Calibration Measures For Neural Networks From Kernel Mean Embeddings”. In: Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018. 2018, pp. 2810–2819.
[11] Juozas Vaicenavicius et al. “Evaluating model calibration in classification”. In: Proceedings of Machine Learning Research. Vol. 89. Proceedings of Machine Learning Research. PMLR, 2019, pp. 3459–3467.

[12] Bianca Zadrozny and Charles Elkan. “Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers”. In: Proceedings of the 18th International Conference on Machine Learning. 2001, pp. 609–616.
[13] Alexandru Niculescu-Mizil and Rich Caruana. “Predicting good probabilities with supervised learning”. In: Proceedings of the 22nd International Conference on Machine Learning. ACM. 2005, pp. 625–632.
[14] Alexandru Niculescu-Mizil and Rich Caruana. “Obtaining Calibrated Probabilities from Boosting.” In: UAI. 2005, p. 413.
[15] David J. C. MacKay. “A Practical Bayesian Framework for Backpropagation Networks”. In: Neural Computation 4.3 (1992), pp. 448–472.
[16] Yarin Gal. “Uncertainty in Deep Learning”. PhD thesis. University of Cambridge, 2016.
[17] James Martens and Roger Grosse. “Optimizing neural networks with kronecker-factored approximate curvature”. In: International conference on machine learning. 2015, pp. 2408–2417.
[18] Jimmy Ba et al. “Distributed Second-Order Optimization using Kronecker-Factored Approximations”. In: ICLR. 2017.
[19] Yann LeCun et al. “Gradient-Based Learning Applied to Document Recognition”. In: Proceedings of the IEEE. Vol. 86/11. 1998, pp. 2278–2324.
[20] Alex Kendall and Yarin Gal. “What uncertainties do we need in bayesian deep learning for computer vision?” In: Advances in Neural Information Processing Systems 30. 2017, pp. 5574–5584.
[21] Gabriel Pereyra et al. “Regularizing Neural Networks by Penalizing Confident Output Distributions”. In: 5th International Conference on Learning Representations, ICLR. 2017.
[22] Wesley Maddox et al. “A Simple Baseline for Bayesian Uncertainty in Deep Learning”. In: arXiv preprint arXiv:1902.02476 (2019).
[23] James Hensman et al. “Scalable Variational Gaussian Process Classification”. In: Proceedings of AISTATS. 2015.
[24] Dimitrios Milios et al. “Dirichlet-based Gaussian Processes for Large-scale Calibrated Classification”. In: Advances in Neural Information Processing Systems 31. 2018, pp. 6008–6018.
[25] John C. Platt. “Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods”. In: Advances in Large-Margin Classifiers. MIT Press, 1999, pp. 61–74.
[26] Hsuan-Tien Lin et al. “A note on Platt’s probabilistic outputs for support vector machines”. In: Machine learning 68.3 (2007), pp. 267–276.
[27] Bianca Zadrozny and Charles Elkan. “Transforming Classifier Scores into Accurate Multiclass Probability Estimates”. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’02. Edmonton, Alberta, Canada: ACM, 2002, pp. 694–699.

[28] Francesco Croce et al. “Provable Robustness of ReLU networks via Maximization of Linear Regions”. In: AISTATS (2019).
[29] Volodymyr Kuleshov and Stefano Ermon. “Estimating Uncertainty Online Against an Adversary.” In: AAAI. 2017, pp. 2110–2116.
[30] Geoff Pleiss et al. “On fairness and calibration”. In: Advances in Neural Information Processing Systems. 2017, pp. 5680–5689.
[31] Jon Kleinberg. “Inherent Trade-Offs in Algorithmic Fairness”. In: SIGMETRICS Perform. Eval. Rev. 46.1 (2018), pp. 40–40.
[32] Volodymyr Kuleshov et al. “Accurate Uncertainties for Deep Learning Using Calibrated Regression”. In: Proceedings of the 35th International Conference on Machine Learning. Vol. 80. Proceedings of Machine Learning Research. PMLR, 2018, pp. 2796–2804.
[33] Hao Song et al. “Distribution Calibration for Regression”. In: Proceedings of the 36th International Conference on Machine Learning (2019).
[34] Fattaneh Jabbari et al. “Obtaining Accurate Probabilistic Causal Inference by Post-Processing Calibration”. In: NIPS Workshop on Causal Inference and Machine Learning. 2017.
[35] Carl Benedikt Frey and Michael A. Osborne. “The Future of Employment: How Susceptible Are Jobs to Computerisation?” In: Oxford Martin 114 (Jan. 2013).
[36] Sandra Wachter and Brent Mittelstadt. “A right to reasonable inferences: Rethinking data protection law in the age of Big Data and AI”. In: Columbia Business Law Review 2019 (Apr. 2019).
[37] Julia Angwin et al. Machine Bias. 2016. url: https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing (visited on 06/16/2019).
[38] Preethi Lahoti et al. “iFair: Learning Individually Fair Data Representations for Algorithmic Decision Making”. In: IEEE 35th International Conference on Data Engineering (ICDE) (2018), pp. 1334–1345.
[39] Joy Buolamwini and Timnit Gebru. “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification”. In: Proceedings of the 1st Conference on Fairness, Accountability and Transparency. Ed. by Sorelle A. Friedler and Christo Wilson. Vol. 81. Proceedings of Machine Learning Research. New York, NY, USA: PMLR, 2018, pp. 77–91.
[40] Eric D. Williams et al. “The 1.7 Kilogram Microchip: Energy and Material Use in the Production of Semiconductor Devices”. In: Environmental Science and Technology 36.24 (2002), pp. 5504–5510.
[41] David Silver et al. “Mastering the game of Go with deep neural networks and tree search”. In: Nature 529 (2016), pp. 484–489.
[42] Trevor Hastie et al. The Elements of Statistical Learning. Springer Series in Statistics. New York, NY, USA: Springer New York Inc., 2001.

[43] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Berlin, Heidelberg: Springer-Verlag, 2006.
[44] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
[45] Andrew Y. Ng and Michael I. Jordan. “On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes”. In: Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic. NIPS’01. Vancouver, British Columbia, Canada: MIT Press, 2001, pp. 841–848.
[46] Afshine Amidi and Shervine Amidi. CS 229 Stanford - Machine Learning. 2018. url: https://stanford.edu/~shervine/teaching/cs-229.html (visited on 10/21/2018).
[47] Allan H. Murphy and Robert L. Winkler. “Diagnostic verification of probability forecasts”. In: International Journal of Forecasting 7 (1992), pp. 435–455.
[48] Ira Cohen and Moises Goldszmidt. “Properties and Benefits of Calibrated Classifiers”. In: 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD). Springer, 2004, pp. 125–136.
[49] D. Mund et al. “Active online confidence boosting for efficient object classification”. In: 2015 IEEE International Conference on Robotics and Automation (ICRA). 2015, pp. 1367–1373.
[50] Gia-Lac Tran et al. “Calibrating Deep Convolutional Gaussian Processes”. In: arXiv preprint arXiv:1805.10522 (2018).
[51] Meelis Kull et al. “Beyond sigmoids: How to obtain well-calibrated probabilities from binary classifiers with beta calibration”. In: Electronic Journal of Statistics 11.2 (2017), pp. 5052–5080.
[52] Miriam Ayer et al. “An Empirical Distribution Function for Sampling with Incomplete Information”. In: The Annals of Mathematical Statistics 26.4 (1955), pp. 641–647.
[53] Alex Krizhevsky et al. “CIFAR-100 (Canadian Institute for Advanced Research)”. In: (2009). url: http://www.cs.toronto.edu/~kriz/cifar.html.
[54] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.
[55] David M Blei et al. “Variational inference: A review for statisticians”. In: Journal of the American Statistical Association 112.518 (2017), pp. 859–877.
[56] Cheng Zhang et al. “Advances in Variational Inference”. In: IEEE transactions on pattern analysis and machine intelligence (2018).
[57] Thang D Bui et al. “Streaming sparse Gaussian process approximations”. In: Advances in Neural Information Processing Systems. 2017, pp. 3299–3307.
[58] Alexander G. de G. Matthews et al. “GPflow: A Gaussian process library using TensorFlow”. In: Journal of Machine Learning Research 18.40 (2017), pp. 1–6.

[59] Martin Abadi et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org. 2015.
[60] Andreas Geiger et al. “Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite”. In: Conference on Computer Vision and Pattern Recognition (CVPR). 2012.
[61] Alexander Narr et al. “Stream-based active learning for efficient and adaptive classification of 3d objects”. In: Robotics and Automation (ICRA), 2016 IEEE International Conference on. IEEE. 2016, pp. 227–233.
[62] Michael Himmelsbach et al. “Real-time object classification in 3D point clouds using point feature histograms”. In: IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE. 2009, pp. 994–1000.
[63] Bastiaan S Veeling et al. “Rotation equivariant CNNs for digital pathology”. In: International Conference on Medical image computing and computer-assisted intervention. Springer. 2018, pp. 210–218.
[64] Olga Russakovsky et al. “ImageNet Large Scale Visual Recognition Challenge”. In: International Journal of Computer Vision (IJCV) 115.3 (2015), pp. 211–252.
[65] Yoav Freund and Robert E Schapire. “A decision-theoretic generalization of on-line learning and an application to boosting”. In: Journal of computer and system sciences 55.1 (1997), pp. 119–139.
[66] Trevor Hastie et al. “Multi-class adaboost”. In: Statistics and its Interface 2.3 (2009), pp. 349–360.
[67] Tianqi Chen and Carlos Guestrin. “XGBoost: A Scalable Tree Boosting System”. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’16. San Francisco, California, USA: ACM, 2016, pp. 785–794.
[68] Balaji Lakshminarayanan et al. “Mondrian Forests: Efficient Online Random Forests”. In: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. NIPS’14. Montreal, Canada: MIT Press, 2014, pp. 3140–3148.
[69] Leo Breiman. “Random Forests”. In: Machine Learning 45.1 (2001), pp. 5–32.
[70] Alex Krizhevsky et al. “ImageNet Classification with Deep Convolutional Neural Networks”. In: Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1. NIPS’12. Lake Tahoe, Nevada: Curran Associates Inc., 2012, pp. 1097–1105.
[71] S. Liu and W. Deng. “Very deep convolutional neural network based image classification using small training sample size”. In: 3rd IAPR Asian Conference on Pattern Recognition (ACPR). 2015, pp. 730–734.
[72] Kaiming He et al. “Deep Residual Learning for Image Recognition”. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), pp. 770–778.
[73] Gao Huang et al. “Densely connected convolutional networks”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.

[74] Christian Szegedy et al. “Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning”. In: AAAI. 2016.
[75] Saining Xie et al. “Aggregated Residual Transformations for Deep Neural Networks”. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), pp. 5987–5995.
[76] Jie Hu et al. “Squeeze-and-Excitation Networks”. In: IEEE Conference on Computer Vision and Pattern Recognition. 2018.
[77] Jaakko Riihimäki and Aki Vehtari. “Gaussian processes with monotonicity information”. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 2010, pp. 645–652.
[78] Thomas B Schön and Fredrik Lindsten. Manipulating the multivariate Gaussian density. Tech. rep. 2011. url: http://users.isy.liu.se/en/rt/schon/Publications/SchonL2011.pdf.

Appendix A

Additional Experimental Results

In this section we present some additional results from the experiments conducted in Chapter 4. The accuracy of the classifiers after calibration is shown in Table A.1 for the binary and in Table A.2 for the multi-class experiments. One very clear result from the multi-class experiments is that one-vs-all calibration methods are not suitable for problems with a large number of classes. Most of them drop significantly in accuracy, disqualifying them from use.

Table A.1: Accuracy after calibration on binary data. Average accuracy and standard deviation of ten Monte-Carlo cross validation folds on binary benchmark data sets.

Data Set | Model | Uncal. | Platt | Isotonic | Beta | BBQ | Temp. | GPcalib
KITTI | AdaBoost | .9463 | .9499 ± .0009 | .9497 ± .0006 | .9499 ± .0009 | .9444 ± .0072 | .9463 ± .0006 | .9465 ± .0005
KITTI | XGBoost | .9674 | .9673 ± .0007 | .9660 ± .0017 | .9671 ± .0011 | .9640 ± .0043 | .9674 ± .0006 | .9675 ± .0006
KITTI | Mondr. Forest | .9536 | .9539 ± .0004 | .9523 ± .0021 | .9532 ± .0009 | .9439 ± .0041 | .9536 ± .0003 | .9536 ± .0004
KITTI | Rand. Forest | .9639 | .9628 ± .0011 | .9616 ± .0017 | .9625 ± .0017 | .8922 ± .0055 | .9639 ± .0007 | .9637 ± .0007
KITTI | 1 layer NN | .9620 | .9644 ± .0007 | .9686 ± .0012 | .9684 ± .0009 | .9647 ± .0060 | .9620 ± .0006 | .9620 ± .0006
PCam | AdaBoost | .7586 | .7609 ± .0030 | .7644 ± .0025 | .7610 ± .0030 | .7638 ± .0032 | .7586 ± .0022 | .7588 ± .0020
PCam | XGBoost | .8086 | .8065 ± .0013 | .8050 ± .0018 | .8068 ± .0015 | .8020 ± .0066 | .8086 ± .0016 | .8084 ± .0016
PCam | Mondr. Forest | .7946 | .7976 ± .0013 | .7954 ± .0032 | .7976 ± .0017 | .7950 ± .0027 | .7946 ± .0012 | .7946 ± .0013
PCam | Rand. Forest | .8487 | .8484 ± .0015 | .8473 ± .0016 | .8482 ± .0016 | .8110 ± .0041 | .8487 ± .0007 | .8483 ± .0008
PCam | 1 layer NN | .5925 | .6239 ± .0070 | .6504 ± .0019 | .6487 ± .0031 | .6458 ± .0082 | .5925 ± .0008 | .5779 ± .0041

Table A.2: Accuracy after calibration on multi-class data. Average accuracy and standard deviation of ten Monte-Carlo cross validation folds on multi-class benchmark data sets.

(The Platt, Isotonic, Beta and BBQ columns are one-vs-all extensions of the binary methods.)

Data Set | Model | Uncal. | Platt | Isotonic | Beta | BBQ | Temp. | GPcalib
MNIST | AdaBoost | .7311 | .6601 ± .0097 | .6787 ± .0049 | .6642 ± .009 | .6540 ± .0061 | .7311 ± .0009 | .7289 ± .0020
MNIST | XGBoost | .9333 | .933 ± .0011 | .9312 ± .0014 | .9331 ± .0011 | .9274 ± .0022 | .9333 ± .0006 | .9333 ± .0006
MNIST | Mondr. Forest | .9133 | .9144 ± .0015 | .9118 ± .0014 | .9142 ± .0015 | .7475 ± .0138 | .9133 ± .0008 | .9132 ± .0008
MNIST | Rand. Forest | .9448 | .9461 ± .0012 | .9445 ± .001 | .9453 ± .0010 | .0004 ± .0003 | .9448 ± .0006 | .9457 ± .0011
MNIST | 1 layer NN | .9625 | .9624 ± .0007 | .9620 ± .0011 | .9626 ± .0011 | .9557 ± .0026 | .9625 ± .0007 | .9517 ± .0011
ImageNet | AlexNet | .5649 | .3437 ± .0072 | .3476 ± .0050 | .3490 ± .0076 | .1861 ± .0039 | .5649 ± .0031 | .5626 ± .0037
ImageNet | VGG19 | .7247 | .4475 ± .0074 | .4584 ± .0055 | .4496 ± .0079 | .2584 ± .0105 | .7247 ± .0026 | .7233 ± .0036
ImageNet | ResNet50 | .7600 | .4654 ± .0085 | .4731 ± .0080 | .4780 ± .0087 | .2648 ± .0088 | .7600 ± .0027 | .7587 ± .0025
ImageNet | ResNet152 | .7850 | .4790 ± .0088 | .4919 ± .0109 | .4938 ± .0078 | .2747 ± .0079 | .7850 ± .0043 | .7834 ± .0044
ImageNet | DenseNet121 | .7451 | .4598 ± .0072 | .4698 ± .0102 | .4615 ± .0097 | .2402 ± .0055 | .7451 ± .0039 | .7430 ± .0027
ImageNet | DenseNet201 | .7702 | .4754 ± .0070 | .4804 ± .0100 | .4823 ± .0037 | .2583 ± .0060 | .7702 ± .0035 | .7707 ± .0023
ImageNet | Inception v4 | .8000 | .4939 ± .0055 | .5060 ± .0119 | .5051 ± .0121 | .2610 ± .0096 | .8000 ± .0032 | .8009 ± .0043
ImageNet | SE ResNeXt50 | .7914 | .4865 ± .0107 | .4999 ± .0058 | .4965 ± .0076 | .3154 ± .0058 | .7914 ± .0043 | .7890 ± .0035
ImageNet | SE ResNeXt101 | .8021 | .4963 ± .0076 | .5111 ± .0079 | .5040 ± .0098 | .2525 ± .0067 | .8021 ± .0029 | .8018 ± .0026

Appendix B

Multivariate Normal Distribution

Here we collect some useful results on the multivariate normal distribution, which are used throughout this thesis. Let $x \sim \mathcal{N}(\mu, \Sigma)$, where $\mu \in \mathbb{R}^n$ and $\Sigma \in \mathbb{R}^{n \times n}$ is positive definite. Assume that we can partition $x$, its mean and covariance as follows:

$$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_1 & \Sigma_{1,2} \\ \Sigma_{2,1} & \Sigma_2 \end{pmatrix}\right).$$

Then the following theorems hold.

Theorem B.1 (Marginalization) Let $x \sim \mathcal{N}(\mu, \Sigma)$, then the marginal distribution of $x_2$ is given by

$$x_2 \sim \mathcal{N}(\mu_2, \Sigma_2).$$

Proof. The result follows directly from the definition of the multivariate normal distribution, the rules of integration and standard application of linear algebra.

Theorem B.2 (Affine Transformation) Let $\mu \in \mathbb{R}^n$ and $\Sigma \in \mathbb{R}^{n \times n}$ be symmetric and positive definite, then for $x \sim \mathcal{N}(\mu, \Sigma)$, $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$, we have

$$y = Ax + b \sim \mathcal{N}(A\mu + b,\, A\Sigma A^\top).$$

Proof. One can prove this by checking the characteristic function

$$\varphi_y(s) = \mathbb{E}\left[e^{i s^\top y}\right]$$

and using some linear algebra. The result follows by the uniqueness property of characteristic functions. A detailed proof can be found in any introductory book on probability theory.

Theorem B.3 (Conditioning) Let $x \sim \mathcal{N}(\mu, \Sigma)$, then the conditional distribution of $x_1 \mid x_2$ is Gaussian and given by

$$x_1 \mid x_2 \sim \mathcal{N}\big(\mu_1 + \Sigma_{1,2}\Sigma_2^{-1}(x_2 - \mu_2),\; \Sigma_{1|2}\big),$$

where

$$\Sigma_{1|2} = \Sigma_{1,1} - \Sigma_{1,2}\Sigma_2^{-1}\Sigma_{2,1}.$$

Proof. A proof is available in any standard textbook on probability theory.

Analogously we consider this result in the other direction.

Corollary B.4 Let $p(x_2) = \mathcal{N}(x_2 \mid \mu_2, \Sigma_2)$ and $p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid M x_2 + b, \Sigma_{1|2})$, where $x_1 \in \mathbb{R}^{n_1}$, $x_2 \in \mathbb{R}^{n_2}$, $M \in \mathbb{R}^{n_1 \times n_2}$ and $b \in \mathbb{R}^{n_1}$. Then the joint distribution of $x_1$ and $x_2$ is given by

$$p(x_1, x_2) = \mathcal{N}\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \,\middle|\, \begin{pmatrix} M\mu_2 + b \\ \mu_2 \end{pmatrix}, \bar{\Sigma}\right), \qquad \bar{\Sigma} = \begin{pmatrix} \Sigma_{1|2} + M\Sigma_2 M^\top & M\Sigma_2 \\ \Sigma_2 M^\top & \Sigma_2 \end{pmatrix}. \qquad (B.1)$$

Proof. A proof can be found in Appendix A.2 of [78].

TRITA-EECS-EX-2019:495

www.kth.se