Calibration Techniques for Binary Classification Problems: A Comparative Analysis

Alessio Martino, Enrico De Santis, Luca Baldini and Antonello Rizzi
Department of Information Engineering, Electronics and Telecommunications, University of Rome "La Sapienza", Via Eudossiana 18, 00184 Rome, Italy

Keywords: Calibration, Classification, Supervised Learning, Support Vector Machine, Probability Estimates.

Abstract: Calibrating a classification system consists in transforming the output scores, which somehow state the confidence of the classifier regarding the predicted output, into proper probability estimates. Having a well-calibrated classifier has a non-negligible impact on many real-world applications, for example the synthesis of decision-making systems for anomaly detection/fault prediction. In such industrial scenarios, risk assessment is certainly related to costs which must be covered. In this paper we review three state-of-the-art calibration techniques (Platt's Scaling, Isotonic Regression and SplineCalib) and we propose three lightweight procedures based on a plain fitting of the reliability diagram. Computational results show that the three proposed techniques have comparable performance with respect to the three state-of-the-art approaches.

1 INTRODUCTION

Classification is one of the most important problems falling under the machine learning and, specifically, under the supervised learning umbrella. Generally speaking, it is possible to sketch three main families: clustering, regression/function approximation and classification. These problems mainly differ in the nature of the process to be modelled by the learning system (Martino et al., 2018a).

More in detail, let P : X → Y be an orientated process from the input space X (domain) towards the output space Y (codomain) and let ⟨x, y⟩ be a generic input-output pair drawn from P, that is y = P(x). In supervised learning a finite set S = ⟨X, Y⟩ of input-output pairs is supposed to be known, and common supervised learning tasks can be divided into classification and function approximation. In the former case, the output space Y is a non-normed space and output values usually belong to a finite categorical set of possible values. Conversely, in the latter case, the output space is a normed space (usually R). In unsupervised learning there are no output values and regularities have to be discovered using only information from X. The seminal example is data clustering, where the aim of the learning system is to return groups (clusters) of data in such a way that patterns belonging to the same cluster are more similar to each other than to patterns belonging to other clusters (Jain et al., 1999; Martino et al., 2017b; Martino et al., 2018b; Martino et al., 2019; Di Noia et al., 2019).

Synthesizing a classifier (predictive model) consists in feeding some ⟨x, y⟩ pairs to a training algorithm in such a way as to automatically learn the underlying model structure. In other words, the classifier learns a decision function f that, given an input x, returns a predicted class label ŷ, i.e. a prediction regarding the class that pattern may belong to:

    ŷ = f(x)    (1)

Eq. (1) is usually referred to as hard classification. Probabilistic classifiers can also return a posterior probability P(output|input), which can be useful for many real-world applications, for example condition-based maintenance, decision support systems or anomaly/fault detection, as operators usually want to know the probability of a specific piece of equipment failing given some known input state/conditions (De Santis et al., 2018b).
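As a concrete illustration of Eq. (1) (a sketch, not code from the paper; it assumes scikit-learn is available and uses a hypothetical toy dataset), a trained classifier realizes the decision function f mapping inputs to hard labels:

```python
from sklearn.svm import SVC

# Toy 1-D binary problem: negative class clustered near 0, positive near 1.
X = [[0.0], [0.1], [0.2], [0.9], [1.0], [1.1]]
y = [0, 0, 0, 1, 1, 1]

# The training algorithm consumes <x, y> pairs and learns the model structure.
clf = SVC(kernel="linear").fit(X, y)

# Eq. (1): hard classification, y_hat = f(x).
y_hat = clf.predict([[0.05], [1.05]])
print(y_hat.tolist())  # [0, 1]
```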
Trivially, probabilistic classifiers can be 'forced' to return hard predictions by letting

    ŷ = argmax_y P(Y = y|X)    (2)

that is, for a given input pattern x ∈ X, the classifier assigns the output label y ∈ Y which corresponds to the maximum posterior probability.

a https://orcid.org/0000-0003-1730-5436
b https://orcid.org/0000-0003-4915-0723
c https://orcid.org/0000-0003-4391-2598
d https://orcid.org/0000-0001-8244-0015

Martino, A., De Santis, E., Baldini, L. and Rizzi, A. Calibration Techniques for Binary Classification Problems: A Comparative Analysis. DOI: 10.5220/0008165504870495. In Proceedings of the 11th International Joint Conference on Computational Intelligence (IJCCI 2019) – NCTA 2019, 11th International Conference on Neural Computation Theory and Applications, pages 487-495. ISBN: 978-989-758-384-1. Copyright © 2019 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved.

Albeit not all classifiers are probabilistic classifiers, some classifiers such as Support Vector Machines (SVM) (Boser et al., 1992; Cortes and Vapnik, 1995; Schölkopf and Smola, 2002; Cristianini and Shawe-Taylor, 2000) or Naïve Bayes may return a score s(x) which somewhat states the 'confidence' in the prediction of a given pattern x. As regards Naïve Bayes, this score can be seen as the probability estimate for class membership. However, this score is not calibrated (Domingos and Pazzani, 1996). For SVMs, the score is basically the distance with respect to the separating hyperplane: the sign of s(x) determines whether x has been classified as positive or negative, whereas the magnitude of s(x) determines the distance with respect to the hyperplane. Conversely to Naïve Bayes, SVMs' scores not only are not calibrated, but are also not bounded in [0, 1], albeit some re-scaling can be performed (Zadrozny and Elkan, 2002).

Formally speaking, a classifier is said to be well-calibrated if the probability P(y|s(x) = s) for a pattern x to belong to a label y converges to the score s(x) = s as the number of samples tends to infinity (Murphy and Winkler, 1977; Zadrozny and Elkan, 2002). In plain terms, the calibration of a classification system consists in mapping the scores (or non-calibrated probability estimates) into proper probability estimates, bounded in the range [0, 1] by definition.

The aim of this paper is to investigate several calibration techniques by considering binary classification problems using SVM as the classification system. The remainder of this paper is structured as follows: in Section 2 we give an overview of existing calibration techniques and figures of merit for assessing the goodness of the calibration, along with three new lightweight procedures to be compared with state-of-the-art approaches; in Section 3 we describe the datasets used for the experiments, along with comparative results amongst the considered methods; Section 4 concludes the paper, suggesting future research and applications.

2 AN OVERVIEW OF CALIBRATION TECHNIQUES

2.1 Current Approaches

In order to quantify the calibration of a given classifier, the reliability diagram is usually employed (Murphy and Winkler, 1977). The reliability diagram is built as follows:

• scores/probabilities go on the x-axis
• empirical probabilities P(y|s(x) = s), namely the ratio between the number of patterns in class y with score s and the total number of patterns with score s, go on the y-axis

and if the classifier is well-calibrated, then all points lie on the y = x line (i.e., the scores are equal to the empirical probabilities). In case of binary classification, the empirical probabilities regard the positive instances only (i.e., the ratio between the number of positive instances having score s and the total number of instances with score s).

Since scores are normally real-valued scalars, it is practically impossible to count the number of data points sharing the same score¹. In this case, a binning procedure is needed:

• on the x-axis, the average score value within the bin is considered
• on the y-axis, we take the ratio between the number of patterns in class y lying in a given bin and the total number of patterns lying in the same bin.

In works such as (Zadrozny and Elkan, 2002) and (Niculescu-Mizil and Caruana, 2005) the authors proposed to consider 10 equally-spaced bins in the range [0, 1], regardless of the distribution of the scores within that range. For some datasets, however, this might not be a good choice, and suitable alternatives which somewhat consider the available samples are:

• Scott's rule (Scott, 1979) evaluates the bin width according to the number of samples (scores) n and their standard deviation σ as follows:

    bin width = 3.5 · σ / n^(1/3)

• The Freedman–Diaconis rule (Freedman and Diaconis, 1981) evaluates the bin width as follows:

    bin width = 2 · IQR / n^(1/3)

where IQR is the interquartile range

• Sturges' formula (Sturges, 1926) evaluates the number of bins as follows:

    number of bins = 1 + ⌈log₂ n⌉

where ⌈·⌉ denotes the ceiling function

• The square root choice, where the number of bins is given by

    number of bins = √n

¹ This counting procedure will return the (trivial) value of 1 for any s(x).
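The binning rules above, and the binned reliability diagram they feed, can be sketched as follows (illustrative NumPy code, not from the paper; the function names and the synthetic scores/labels are ours):

```python
import numpy as np

def n_bins_from_rules(scores):
    """Number of bins suggested by each rule, for scores in [0, 1]."""
    n = len(scores)
    span = scores.max() - scores.min()
    scott_width = 3.5 * scores.std() / n ** (1 / 3)       # Scott's rule
    iqr = np.subtract(*np.percentile(scores, [75, 25]))
    fd_width = 2 * iqr / n ** (1 / 3)                     # Freedman-Diaconis rule
    return {
        "scott": int(np.ceil(span / scott_width)),
        "freedman_diaconis": int(np.ceil(span / fd_width)),
        "sturges": 1 + int(np.ceil(np.log2(n))),          # Sturges' formula
        "sqrt": int(np.ceil(np.sqrt(n))),                 # square root choice
    }

def reliability_diagram(scores, y, n_bins):
    """Per-bin (mean score, empirical positive fraction) for binary labels y."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(scores, edges) - 1, 0, n_bins - 1)
    xs, ys = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            xs.append(scores[mask].mean())  # x-axis: average score in the bin
            ys.append(y[mask].mean())       # y-axis: positives / total in the bin
    return np.array(xs), np.array(ys)

rng = np.random.default_rng(0)
scores = rng.uniform(0, 1, 1000)
# Labels drawn so that P(y=1 | score) = score: calibrated by construction.
labels = (rng.uniform(0, 1, 1000) < scores).astype(int)

print(n_bins_from_rules(scores))
xs, ys = reliability_diagram(scores, labels, 10)
print(np.round(np.abs(xs - ys).max(), 2))  # points lie close to the y = x line
```

For a well-calibrated score source like the synthetic one here, the per-bin points track the diagonal up to sampling noise; a miscalibrated classifier would show a systematic bow above or below it.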
However, using a single binning, even if evaluated according to one of the four alternatives above, might not be a good choice, especially if the data do not follow a specific underlying distribution (e.g., a uniform distribution in case of uniform binning, or a normal distribution in case of Sturges' formula). To this end, in (Naeini et al., 2015) the authors proposed the Bayesian Binning into Quantiles technique, which considers different binnings (and their combination) in order to make the calibration procedure more robust.

... and a weights vector w ∈ Rⁿ such that xᵢ ≥ xᵢ₋₁ and wᵢ > 0 for all i = 1, ..., n, then the isotonic regression of a function f(x) consists in finding a function g(x) minimizing the mean squared error criterion

    ∑ᵢ₌₁ⁿ wᵢ (g(xᵢ) − f(xᵢ))²    (6)

where g(x) must be a piecewise non-decreasing (isotonic)
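The weighted least-squares problem of Eq. (6) under the non-decreasing constraint can be sketched as follows (illustrative code, not from the paper; it assumes scikit-learn, whose IsotonicRegression solves this criterion via pool adjacent violators, and uses made-up data):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Sorted abscissae x_i, noisy non-monotone targets f(x_i), positive weights w_i.
x = np.arange(10, dtype=float)
f = np.array([0.1, 0.0, 0.3, 0.2, 0.5, 0.45, 0.7, 0.65, 0.9, 1.0])
w = np.ones_like(f)

# g = argmin sum_i w_i * (g(x_i) - f(x_i))^2  s.t. g non-decreasing (Eq. (6))
iso = IsotonicRegression(increasing=True)
g = iso.fit_transform(x, f, sample_weight=w)

assert np.all(np.diff(g) >= 0)  # the fitted g is piecewise non-decreasing
print(np.round(g, 3))
```

Adjacent violating pairs (e.g. 0.1 followed by 0.0) are pooled into their weighted mean, which is what makes the fit piecewise constant where the raw targets decrease.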