Calibration: What and Why

Classifier Calibration Tutorial, ECML PKDD 2020

Prof. Peter Flach
[email protected]
classifier-calibration.github.io/

Table of Contents

Introduction
  Taking inspiration from forecasting
  Why are we interested in calibration?
  Common sources of miscalibration
  A first look at some calibration techniques

Multi-class calibration
  What's so special about multi-class calibration?
  Definitions of calibration for more than two classes

Summary and tutorial overview

Taking inspiration from forecasting

Weather forecasters started thinking about calibration a long time ago (Brier, 1950).
• A forecast '70% chance of rain' should be followed by rain 70% of the time.
This is immediately applicable to binary classification:
• A prediction '70% chance of spam' should be spam 70% of the time.
and to multi-class classification:
• A prediction '70% chance of setosa, 10% chance of versicolor and 20% chance of virginica' should be setosa/versicolor/virginica 70/10/20% of the time.
In general:
• A predicted probability (vector) should match empirical (observed) probabilities.

Q: What does 'x% of the time' mean?

Forecasting example

Let's consider a small toy example:
• Two predictions of '10% chance of rain' were both followed by 'no rain'.
• Two predictions of '40% chance of rain' were once followed by 'no rain', and once by 'rain'.
• Three predictions of '70% chance of rain' were once followed by 'no rain', and twice by 'rain'.
• One prediction of '90% chance of rain' was followed by 'rain'.

Q: Is this forecaster well-calibrated?

Over- and under-estimates

     p̂     y
0    0.1   0
1    0.1   0
2    0.4   0
3    0.4   1
4    0.7   0
5    0.7   1
6    0.7   1
7    0.9   1

This forecaster is doing a pretty decent job:
• '10% chance of rain' was a slight over-estimate (ȳ = 0/2 = 0%);
• '40% chance of rain' was a slight under-estimate (ȳ = 1/2 = 50%);
• '70% chance of rain' was a slight over-estimate (ȳ = 2/3 ≈ 67%);
• '90% chance of rain' was a slight under-estimate (ȳ = 1/1 = 100%).
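To make the grouping explicit, here is a minimal sketch (using pandas; not part of the original slides) that reproduces the ȳ values above by averaging the outcomes for each distinct predicted probability:

```python
import pandas as pd

# Toy forecasting data from the slide: predicted probability of rain and observed outcome
df = pd.DataFrame({
    "p_hat": [0.1, 0.1, 0.4, 0.4, 0.7, 0.7, 0.7, 0.9],
    "y":     [0,   0,   0,   1,   0,   1,   1,   1],
})

# Empirical frequency of rain (and count) for each distinct forecast value
empirical = df.groupby("p_hat")["y"].agg(["mean", "count"])
print(empirical)
# p_hat 0.1 -> mean 0.00, 0.4 -> 0.50, 0.7 -> 0.67 (approx.), 0.9 -> 1.00
```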

Visualising forecasts: the reliability diagram

     p̂     y
0    0.1   0
1    0.1   0
2    0.4   0
3    0.4   1
4    0.7   0
5    0.7   1
6    0.7   1
7    0.9   1

[Figure: reliability diagram plotting empirical frequency of rain against predicted probability for these eight forecasts.]

Changing the numbers slightly

     p̂     y
0    0.1   0
1    0.2   0
2    0.3   0
3    0.4   1
4    0.6   0
5    0.7   1
6    0.8   1
7    0.9   1

[Figure: reliability diagram for these forecasts.]

Or should we group forecasts differently?

(Same eight forecasts as on the previous slide.)

[Figure: reliability diagram for the same forecasts with a different grouping of predicted probabilities into bins.]

Or not at all?

(Again the same eight forecasts.)

[Figure: reliability diagram with each distinct predicted probability treated as its own bin.]

Binning or pooling predictions is a fundamental notion

We need bins to evaluate the degree of calibration:
• In order to decide whether a weather forecaster is well-calibrated, we need to look at a good number of forecasts, say over one year.
• We also need to make sure that there are a reasonable number of forecasts for separate probability values, so we can obtain reliable empirical estimates.
• Trade-off: large bins give better empirical estimates; small bins allow a more fine-grained assessment of calibration.

But adjusting forecasts in groups also gives rise to practical calibration methods (a binning sketch follows below):
• empirical binning;
• isotonic regression (aka ROC convex hull).
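As a minimal sketch of the binning idea (assumptions: NumPy and equal-width bins; none of this code is from the tutorial), the following computes, per bin, the average predicted probability and the empirical frequency of positives, i.e. the two coordinates plotted in a reliability diagram:

```python
import numpy as np

def reliability_bins(p_hat, y, n_bins=4):
    """Group predictions into equal-width bins and compare the mean prediction
    with the empirical frequency of positives in each bin."""
    p_hat, y = np.asarray(p_hat, float), np.asarray(y, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # np.digitize assigns each prediction to a bin; clip so that p = 1.0 falls in the last bin
    bin_idx = np.clip(np.digitize(p_hat, edges[1:-1]), 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            rows.append((edges[b], edges[b + 1],
                         p_hat[mask].mean(), y[mask].mean(), int(mask.sum())))
    return rows  # (bin lower edge, bin upper edge, mean prediction, empirical frequency, count)

# The eight forecasts from the earlier slide
print(reliability_bins([0.1, 0.1, 0.4, 0.4, 0.7, 0.7, 0.7, 0.9],
                       [0, 0, 0, 1, 0, 1, 1, 1]))
```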

Why are we interested in calibration?

To calibrate means to employ a known scale with known properties.
• E.g., an additive scale with a well-defined zero, so that ratios are meaningful.

For classifiers we want to use the probability scale, so that we can
• justifiably use default decision rules (e.g., maximum posterior probability);
• adjust these decision rules in a straightforward way to account for different class priors or misclassification costs;
• combine probability estimates in a well-founded way.

Q: Is the probability scale additive?
Q: How would you combine probability estimates from several well-calibrated models?

Optimal decisions I

Denote the cost of predicting class $j$ for an instance of true class $i$ as $C(\hat{Y}=j \mid Y=i)$. The expected cost of predicting class $j$ for instance $x$ is
$$
C(\hat{Y}=j \mid X=x) = \sum_i P(Y=i \mid X=x)\, C(\hat{Y}=j \mid Y=i)
$$
where $P(Y=i \mid X=x)$ is the probability of instance $x$ having true class $i$ (as would be given by the Bayes-optimal classifier).

The optimal decision is then to predict the class with lowest expected cost:
$$
\hat{Y}^{*} = \arg\min_j C(\hat{Y}=j \mid X=x) = \arg\min_j \sum_i P(Y=i \mid X=x)\, C(\hat{Y}=j \mid Y=i)
$$
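As a small illustrative sketch (hypothetical cost matrix and probability vector; NumPy assumed), the expected-cost decision rule can be written as:

```python
import numpy as np

def optimal_decision(p, C):
    """p[i]   : calibrated probability that the true class is i
       C[i, j]: cost of predicting class j when the true class is i
       Returns the class j minimising the expected cost sum_i p[i] * C[i, j]."""
    expected_cost = p @ C          # one expected cost per candidate prediction j
    return int(np.argmin(expected_cost)), expected_cost

# Hypothetical 3-class example: mistaking class 0 for class 2 is very costly
p = np.array([0.2, 0.3, 0.5])
C = np.array([[0.0, 1.0, 5.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
print(optimal_decision(p, C))   # may differ from argmax(p) because of the asymmetric costs
```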

Optimal decisions II

In binary classification we have:
$$
\begin{aligned}
C(\hat{Y}={+} \mid X=x) &= P(+\mid x)\, C(+\mid+) + \bigl(1 - P(+\mid x)\bigr)\, C(+\mid-) \\
C(\hat{Y}={-} \mid X=x) &= P(+\mid x)\, C(-\mid+) + \bigl(1 - P(+\mid x)\bigr)\, C(-\mid-)
\end{aligned}
$$
On the optimal decision boundary these two expected costs are equal, which gives
$$
P(+\mid x) = \frac{C(+\mid-) - C(-\mid-)}{C(+\mid-) - C(-\mid-) + C(-\mid+) - C(+\mid+)} \triangleq c
$$
This gives the optimal threshold on the hypothetical Bayes-optimal probabilities. It is also the best thing to do in practice, as long as the probabilities are well-calibrated!

Optimal decisions III

Without loss of generality we can set the cost of true positives and true negatives to zero; $c = \frac{c_{FP}}{c_{FP} + c_{FN}}$ is then the cost of a false positive in proportion to the combined cost of one false positive and one false negative.
• E.g., if false positives are 4 times as costly as false negatives, then we set the decision threshold to 4/(4 + 1) = 0.8 in order to only make positive predictions if we're pretty certain.

Similar reasoning applies to changes in class priors:
• if we trained on balanced classes but want to deploy with 4 times as many positives compared to negatives, we lower the decision threshold to 0.2;
• more generally, if we trained for class ratio $r$ and deploy for class ratio $r'$, we set the decision threshold to $r/(r + r')$.

Cost and class prior changes can be combined in the obvious way.
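As a minimal sketch (plain Python; the helper names and numbers are illustrative, not from the slides), the threshold adjustments described above can be written as:

```python
def cost_threshold(c_fp, c_fn):
    """Decision threshold on P(+|x) when a false positive costs c_fp and a false negative c_fn."""
    return c_fp / (c_fp + c_fn)

def prior_shift_threshold(r_train, r_deploy):
    """Adjusted threshold when trained at class ratio r_train but deployed at ratio r_deploy
    (both ratios expressed the same way), following the slide's r / (r + r')."""
    return r_train / (r_train + r_deploy)

# False positives 4x as costly as false negatives -> predict positive only if P(+|x) > 0.8
print(cost_threshold(4, 1))          # 0.8

# Trained on balanced classes (r = 1), deployed with 4x as many positives -> threshold 0.2
print(prior_shift_threshold(1, 4))   # 0.2
```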

Common sources of miscalibration

Underconfidence: a classifier thinks it's worse at separating classes than it actually is.
• Hence we need to pull predicted probabilities away from the centre.

Overconfidence: a classifier thinks it's better at separating classes than it actually is.
• Hence we need to push predicted probabilities toward the centre.

A classifier can be overconfident for one class and underconfident for the other, in which case all predicted probabilities need to be increased or decreased.
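A quick numeric illustration (my own sketch, not from the slides): rescaling the log-odds with a factor greater than 1 pulls probabilities away from the centre (correcting underconfidence), while a factor smaller than 1 pushes them toward the centre (correcting overconfidence), which is essentially what sigmoidal calibration maps do.

```python
import numpy as np

def logodds_rescale(p, w):
    """Rescale probabilities on the log-odds scale: w > 1 spreads them away from 0.5,
    w < 1 squeezes them toward 0.5."""
    logit = np.log(p / (1 - p))
    return 1 / (1 + np.exp(-w * logit))

p = np.array([0.3, 0.5, 0.7])
print(logodds_rescale(p, 2.0))   # ~[0.16, 0.50, 0.84]: spread out (correcting underconfidence)
print(logodds_rescale(p, 0.5))   # ~[0.40, 0.50, 0.60]: squeezed in (correcting overconfidence)
```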

Underconfidence example

• Underconfidence typically gives sigmoidal distortions.
• To calibrate these means to pull predicted probabilities away from the centre.

[Figure: page from Niculescu-Mizil and Caruana (2005) showing histograms of predicted values and reliability diagrams for boosted decision trees on COV_TYPE, ADULT, LETTER.P1, LETTER.P2, MEDIS, SLAC, HS and MG: uncalibrated, calibrated with Platt's method, and calibrated with isotonic regression. Source: (Niculescu-Mizil and Caruana, 2005)]

Overconfidence example

• Overconfidence is very common, and usually a consequence of over-counting evidence.
• Here, distortions are inverse-sigmoidal.
• Calibrating these means to push predicted probabilities toward the centre.

[Figure: histograms of predicted values and reliability diagrams on the SLAC dataset for ten learning methods (boosted trees, boosted stumps, SVM, ANN, logistic regression, bagged trees, decision trees, random forest, KNN, naive Bayes). Source: (Niculescu-Mizil and Caruana, 2005)]

A first look at some calibration techniques

Parametric calibration involves modelling the score distributions within each class.
• Platt scaling = logistic calibration: can be derived by assuming that the scores within both classes are normally distributed with the same variance (Platt, 2000).
• Beta calibration employs Beta distributions instead, to deal with scores already on a [0, 1] scale (Kull et al., 2017).
• Dirichlet calibration for more than two classes (Kull et al., 2019).

Non-parametric calibration often ignores scores and employs ranks instead.
• E.g., isotonic regression = pool adjacent violators = ROC convex hull (Zadrozny and Elkan, 2001; Fawcett and Niculescu-Mizil, 2007).

These techniques will be more fully discussed in Part III of the tutorial.
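As a preview, here is a minimal sketch of logistic (Platt-style) calibration: fit a logistic regression with the classifier's score as the single feature on a held-out calibration set. (scikit-learn assumed; variable names and data are illustrative, and Platt's original procedure includes further details omitted in this sketch.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(scores_cal, y_cal):
    """Logistic calibration: logistic regression on the raw score as a single feature."""
    lr = LogisticRegression()
    lr.fit(np.asarray(scores_cal).reshape(-1, 1), y_cal)
    return lambda s: lr.predict_proba(np.asarray(s).reshape(-1, 1))[:, 1]

# Illustrative usage with made-up held-out scores and labels
scores_cal = [-2.1, -1.3, -0.4, 0.2, 0.9, 1.8, 2.5]
y_cal      = [0,    0,    0,    1,   0,   1,   1]
calibrate = fit_platt(scores_cal, y_cal)
print(calibrate([-1.0, 0.0, 1.0]))   # calibrated probabilities for new scores
```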

Platt scaling

[Poster excerpt: Kull, Silva Filho and Flach, "Beta calibration"]
• Logistic calibration, also known as Platt scaling (Platt, 2000), fits a parametric family with two parameters:
$$
\mu_{\text{logistic}}(s; w, m) = \frac{1}{1 + \exp\bigl(-w(s - m)\bigr)}
$$
• The family contains only sigmoids.
• Logistic calibration is perfect if the class-conditional score densities $f^{+}$ and $f^{-}$ are Gaussian with equal variance; then $w = (\mu_{pos} - \mu_{neg})/\sigma^{2}$ and $m = (\mu_{pos} + \mu_{neg})/2$.
• Easily implemented by fitting logistic regression on the single feature $s$.

Beta calibration

[Poster excerpt, continued]
• Beta calibration fits a parametric family with three parameters:
$$
\mu_{\text{beta}}(s; a, b, c) = \frac{1}{1 + 1 \Big/ \Bigl(e^{c}\, \frac{s^{a}}{(1-s)^{b}}\Bigr)}
$$
• The family contains sigmoids, inverse sigmoids, the identity and more.
• Beta calibration is perfect if the class-conditional score densities $f^{+}$ and $f^{-}$ are Beta distributions; then $a = \alpha_{pos} - \alpha_{neg}$ and $b = \beta_{neg} - \beta_{pos}$.
• Easily implemented by fitting logistic regression on the two features $\ln(s)$ and $-\ln(1-s)$.
• Better calibrated probabilities than from logistic calibration in the authors' experiments on three model classes.
• Code, Python/R packages and tutorials: https://betacal.github.io
• Dirichlet calibration is the multi-class generalisation: https://dircal.github.io
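A minimal sketch of the "logistic regression on $\ln(s)$ and $-\ln(1-s)$" recipe above (scikit-learn assumed; the clipping constant and variable names are my own choices, and the betacal package linked above is the reference implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_beta_calibration(scores_cal, y_cal, eps=1e-12):
    """Beta calibration sketch: logistic regression on the features ln(s) and -ln(1-s)."""
    def features(scores):
        s = np.clip(np.asarray(scores, float), eps, 1 - eps)   # keep logs finite
        return np.column_stack([np.log(s), -np.log(1 - s)])
    lr = LogisticRegression()
    lr.fit(features(scores_cal), y_cal)
    return lambda scores: lr.predict_proba(features(scores))[:, 1]

# Illustrative usage with made-up uncalibrated probabilities from some classifier
calibrate = fit_beta_calibration([0.05, 0.2, 0.35, 0.5, 0.6, 0.8, 0.95],
                                 [0,    0,   1,    0,   1,   1,   1])
print(calibrate([0.1, 0.5, 0.9]))
```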

Isotonic regression

[Figure: isotonic regression fits a non-decreasing, piecewise-constant map from original scores to calibrated scores (pool adjacent violators / ROC convex hull). Source: (Flach, 2016)]
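A minimal sketch using scikit-learn's IsotonicRegression (applied to the toy forecasts from the 'Changing the numbers slightly' slide) shows the characteristic step-function calibration map:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# The eight forecasts from the 'Changing the numbers slightly' slide
scores_cal = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
y_cal      = np.array([0,   0,   0,   1,   0,   1,   1,   1])

# Fit a non-decreasing step function from scores to empirical probabilities
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(scores_cal, y_cal)

print(iso.predict([0.15, 0.5, 0.85]))   # calibrated probabilities
```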


What's so special about multi-class calibration?

As with classification itself, some calibration methods are inherently multi-class but most are not.

This leads to (at least) three different ways of defining what it means to be fully multiclass-calibrated.
• Many recent papers use the (weak) notion of confidence calibration.

Evaluating multi-class calibration is in its full generality still an open problem.

Definitions of calibration for more than two classes

The following definitions of calibration are equivalent for binary classification but increasingly strong for more than two classes:

Confidence calibration: only consider the highest predicted probability.

Class-wise calibration: only consider marginal probabilities.

Multi-class calibration: consider the entire vector of predicted probabilities.

Confidence calibration

This was proposed by (Guo et al., 2017), requiring that among all instances where the probability of the most likely class is predicted to be c, the expected accuracy is c. (We call this ‘confidence calibration’ to distinguish it from the stronger notions of calibration.)

Formally, a probabilistic classifier $\hat{p} : \mathcal{X} \to \Delta_k$ is confidence-calibrated if, for any confidence level $c \in [0, 1]$, the actual proportion of the predicted class, among all possible instances $x$ being predicted this class with confidence $c$, is equal to $c$:
$$
P(Y = i \mid \hat{p}_i(x) = c) = c \qquad \text{where } i = \arg\max_j \hat{p}_j(x).
$$

Class-wise calibration

Originally proposed by (Zadrozny and Elkan, 2002), this requires that all one-vs-rest probability estimators obtained from the original multiclass model are calibrated.

Formally, a probabilistic classifier $\hat{p} : \mathcal{X} \to \Delta_k$ is classwise-calibrated if, for any class $i$ and any predicted probability $q_i$ for this class, the actual proportion of class $i$, among all possible instances $x$ getting the same prediction $\hat{p}_i(x) = q_i$, is equal to $q_i$:
$$
P(Y = i \mid \hat{p}_i(x) = q_i) = q_i \qquad \text{for } i = 1, \ldots, k.
$$

Multi-class calibration

This is the strongest form of calibration for multiple classes, subsuming the previous two definitions.

A probabilistic classifier $\hat{p} : \mathcal{X} \to \Delta_k$ is multiclass-calibrated if, for any prediction vector $q = (q_1, \ldots, q_k) \in \Delta_k$, the proportions of classes among all possible instances $x$ getting the same prediction $\hat{p}(x) = q$ are equal to the prediction vector $q$:
$$
P(Y = i \mid \hat{p}(x) = q) = q_i \qquad \text{for } i = 1, \ldots, k.
$$

Reminder: binning needed

For practical purposes, the conditions in these definitions need to be relaxed. This is where binning comes in.

Once we have the bins, we can draw a reliability diagram as in the two-class case. For class-wise calibration, we can show per-class reliability diagrams or a single averaged one.

The degree of calibration is assessed using the gaps in the reliability diagram. All of this will be elaborated in the next part of the tutorial.
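For instance, a rough sketch of a binned check of confidence calibration (my own illustration; equal-width bins and NumPy assumed) could look like the following; class-wise calibration would apply the same binning separately to each class's probability column:

```python
import numpy as np

def confidence_reliability(probs, y, n_bins=10):
    """probs: (n, k) array of predicted probability vectors; y: (n,) true class labels.
    For each confidence bin, compare average confidence with accuracy of the predicted class."""
    probs, y = np.asarray(probs, float), np.asarray(y)
    conf = probs.max(axis=1)                 # confidence = highest predicted probability
    pred = probs.argmax(axis=1)              # predicted class
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.clip(np.digitize(conf, edges[1:-1]), 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            rows.append((conf[mask].mean(), (pred[mask] == y[mask]).mean(), int(mask.sum())))
    return rows   # (average confidence, accuracy, count) per non-empty bin

# Tiny made-up example with three classes
probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4], [0.9, 0.05, 0.05]])
y = np.array([0, 1, 2, 1])
print(confidence_reliability(probs, y, n_bins=5))
```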


Important points to remember

Only well-calibrated probability estimates deserve to be called probabilities: otherwise they are just scores that happen to be in the [0, 1] range.

Binning will be required in some form: instance-based probability evaluation metrics such as Brier score or log-loss always measure calibration plus something else.

In multi-class settings, think carefully about which form of calibration you need: e.g., confidence-calibration is too weak in a cost-sensitive setting.

What happens next

14.45 - Telmo Silva-Filho: Evaluation metrics and proper scoring rules
  Expected/maximum calibration error; proper scoring rules; hypothesis test for calibration
15.30 - Break and preparation for hands-on session
15.50 - Hao Song: Calibrators
  Binary approaches; multi-class approaches; regularisation and Bayesian treatments; implementation
16.50 - Miquel Perello-Nieto: Hands-on session
17.30 - Peter Flach, Hao Song: Advanced topics and conclusion
  Cost curves; calibrating for F-score; regressor calibration

All times in CEST. We thank Meelis Kull (U Tartu, Estonia) for his help in preparing this material.

References I

G. W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950.

T. Fawcett and A. Niculescu-Mizil. PAV and the ROC convex hull. Machine Learning, 68(1):97–106, 2007.

P. A. Flach. ROC analysis. In Encyclopedia of Machine Learning and Data Mining. Springer, 2016.

C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On Calibration of Modern Neural Networks. In 34th International Conference on Machine Learning, pages 1321–1330, Sydney, Australia, 2017.

M. Kull, T. M. Silva Filho, and P. Flach. Beyond Sigmoids: How to obtain well-calibrated probabilities from binary classifiers with beta calibration. Electronic Journal of Statistics, 11(2):5052–5080, 2017.

References II

M. Kull, M. Perello-Nieto, M. Kängsepp, T. S. Filho, H. Song, and P. Flach. Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with Dirichlet calibration. In Advances in Neural Information Processing Systems (NIPS'19), pages 12316–12326, 2019.

A. Niculescu-Mizil and R. Caruana. Predicting good probabilities with supervised learning. In 22nd International Conference on Machine Learning (ICML'05), pages 625–632, New York, New York, USA, 2005. ACM Press.

J. Platt. Probabilities for SV Machines. In A. J. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large-Margin Classifiers, pages 61–74. MIT Press, 2000.

B. Zadrozny and C. Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In 18th International Conference on Machine Learning (ICML'01), pages 609–616, 2001.

References III

B. Zadrozny and C. Elkan. Transforming Classifier Scores into Accurate Multiclass Probability Estimates. In 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'02), pages 694–699, New York, New York, USA, 2002. ACM Press.

Acknowledgements

• The work of MPN was supported by the SPHERE Next Steps Project funded by the UK Engineering and Physical Sciences Research Council (EPSRC), Grant EP/R005273/1.
• The work of PF and HS was supported by The Alan Turing Institute under EPSRC Grant EP/N510129/1.
• The work of MK was supported by the Estonian Research Council under grant PUT1458.
• The background used in the title slide has been modified by MPN from an original picture by Ed Webster with license CC BY 2.0.
