Temporal Probability Calibration

Temporal Probability Calibration Tim Leathart 1 Maksymilian Polaczuk 1 Abstract 0:35 Deep Averaging Network In many applications, accurate class probability Recurrent Neural Network Transformer estimates are required, but many types of mod- 0:3 els produce poor quality probability estimates de- 0:25 spite achieving acceptable classification accuracy. Even though probability calibration has been a hot 0:2 topic of research in recent times, the majority of this has investigated non-sequential data. In this 0:15 paper, we consider calibrating models that produce class probability estimates from sequences 0:1 of data, focusing on the case where predictions are obtained from incomplete sequences. We Expected Calibration Error 0:05 show that traditional calibration techniques are not sufficiently expressive for this task, and pro- 0 pose methods that adapt calibration schemes de- 22 23 24 25 26 27 28 29 pending on the length of an input sequence. Ex- Sequence Length perimental evaluation shows that the proposed methods are often substantially more effective at Figure 1. Test expected calibration error (ECE) of common sequen- calibrating probability estimates from modern se- tial architectures for sequences of different lengths from the Large quential architectures for incomplete sequences Movie Review dataset. As the sequence length increases, the level across a range of application domains. of calibration for each model changes substantially. 1. Introduction can be used to minimise expected monetary losses to the Sequential data is abundant in the modern world, commonly lender if the probability of defaulting is accurately estimated. seen in forms such as natural language (Harper & Konstan, In semi-autonomous vehicles, low estimated confidence for 2016; Rajpurkar et al., 2016), video streams (Cordts et al., the most likely class for some object may indicate that extra 2016) and financial trading patterns (Brown et al., 2013). caution is required, possibly alerting the driver to intervene. Modern approaches to making predictions from these kinds Sometimes, an accurate probability distribution, rather than of data typically involves using model architectures such a hard classification, is required to have a functioning system as deep averaging networks (Iyyer et al., 2015), recurrent at all. For example, if a model is used to predict the winner arXiv:2002.02644v2 [cs.LG] 15 Feb 2020 neural networks (Hochreiter & Schmidhuber, 1997; Cho of a sports game in order to automatically set betting odds, et al., 2014) and more recently, transformers (Vaswani et al., accurate probability estimates are necessary. 2017; Devlin et al., 2018; Radford et al., 2019). Even though most modern machine learning algorithms na- In many domains, predicting the most likely class label yî tively produce an estimated probability distribution over for an instance i with features xi and label yi is sufficient the class labels for a given instance, it is not always the for classification tasks. However, it is often the case in real case that they closely reflect the true probabilities of each applications that an estimated probability distribution pî class (Niculescu-Mizil & Caruana, 2005; Guo et al., 2017; over the labels can improve the quality or usefulness of the Leathart et al., 2019; Kumar & Sarawagi, 2019). Models system. For example, an automated loan approval system for which this is true are said to be poorly calibrated. Prob- ability calibration is an additional step one can apply when 1Sportsflare AI. Correspondence to: Tim Leathart <tim@sportsflare.io>. training a model f, where its class probability estimates (or logits) are used as inputs for another model π that scales them appropriately to better match the true probabilities. Temporal Probability Calibration This work considers situations in which we wish to obtain proxy metrics that aim to emulate this intuition. Common class probability estimates for a sequence at any time during metrics for evaluating the quality of probability estimates the formation of the sequence. This situation is fairly com- include negative log likelihood (NLL) mon, e.g., offering a “help” article to a website user while n they are typing a description of the issue they are facing, 1 X NLL = − y log ^p (2) predicting if an investor should buy or sell an option up until n i i i=1 the expiration date, or predicting the outcome of a sports game given information about the current state of the game. and the Brier score (Brier, 1950), also known as mean For these types of problems, the prediction task typically squared error (MSE) gets easier as the end of the full sequence draws nearer—if n the score in a football game is 1-0 with one minute remain- 1 X MSE = (y − ^p )2: (3) ing, we should be much more confident in our prediction n i i of the winner than if the score is 1-0 at half-time. Figure1 i=1 shows how expected calibration error, a commonly used Technically speaking, neither of these metrics directly mea- calibration metric described in Section 2.1, changes for the sure calibration, as for each example i, the estimated prob- Large Movie Review dataset (Maas et al.) as the sequence ability ^pi is compared to its label yi rather than the true length increases for several different models. In this exam- probability distribution over the label space. However, they ple, deep average networks and transformers have poorer are good proxy metrics, and possess convenient properties calibration for shorter sequences than longer, and vice-versa for machine learning such as being differentiable, applicable for the recurrent network. Intuitively, a global calibration to individual examples, and easy to compute. strategy that applies the same calibration to sequences of any length will not be suitable for these models. Of course, obtaining a true probability distribution over the label space for a single example is usually not possible in In this paper, we propose several simple strategies to adapt practice, as typically, only labels are supplied. Expected cal- calibration schemes to better handle incomplete sequences, ibration error (ECE) (Naeini et al., 2015) is a binning-based and evaluate them against traditional, global calibration approach that attempts to approximate this distribution em- methods. The paper is structured as follows. First, an pirically. The probability space [0; 1] is split into K bins, introduction to probability calibration is provided, where and the test examples are grouped into the bins based on definitions, existing approaches and evaluation methods are their estimated probabilities. The accuracy and average discussed. Then, our proposed temporal probability calibra- confidence of each bin Bk are computed as tion techniques are described, considering both discrete and n continuous sequences of fixed or variable length. Experi- 1 Xk acc(B ) = (^y = y ) (4) ments are described and their results discussed. Finally, we k n I i i go over conclusions and future work. k i n 1 Xk conf(B ) = p^ : (5) k n i 2. Probability Calibration k i A probabilistic classifier is said to be perfectly calibrated ECE is then defined as when the probability estimates for each example exactly K match the true class probabilities of the example. For in- X nk ECE = acc(Bk) − conf(Bk) : (6) stance, for those examples that are assigned a confidence of n 75% by a perfectly calibrated classifier, 75% of them should k=1 actually be classified correctly. For a perfectly calibrated model, the accuracy and con- More formally, for a probabilistic classifier f for an M- fidence of each bin should be equal. Even though ECE class classification task and predicted probability distribu- measures (average) calibration directly, it is not without tion ^p = [^p1;:::; p^M ], the proportions of classes for all problems—the choice for number of bins is arbitrary, prob- possible instances that would be assigned the prediction ^p abilities for individual examples are discarded in favour of by f are equal to ^p (Kull et al., 2019): aggregated bins, and its applicability to multiclass problems is debated (Nixon et al., 2019; Leathart, 2019). Strategies P(Y = m j f(X) = ^p) =p ^m for m 2 f1 :::Mg: (1) such as classwise-ECE (Kull et al., 2019) have been proposed to better handle the multiclass case. 2.1. Evaluating Calibration Accuracy and confidence are often compared visually in Without an infinite number of samples, the condition in (1) reliability diagrams, where they are plotted against each is not possible to achieve. However, there exist several other (DeGroot & Fienberg, 1983). In reliability diagrams, Temporal Probability Calibration perfect calibration is shown by a straight diagonal line. Re- 2.3. Nonparametric Calibration Methods gions where the curve sits above the diagonal represent Histogram binning (Zadrozny & Elkan, 2001) is a simple underconfidence, and regions where the curve sits under the nonparametric approach to probability calibration. In his- diagonal represent overconfidence. togram binning, the model’s output space is split into K bins, typically by equal-width or equal-frequency strate- 2.2. Parametric Calibration Methods gies. A calibrated probability per bin is assigned such that One of the most well-known approaches to probability cali- the MSE for each bin is minimised, which turns out to be bration is Platt scaling (Platt, 1999), in which a univariate equal to the percentage of positive examples in each bin logistic regression model with parameters (α; β) is learned respectively. The calibrated probability estimate for a test to minimise NLL between a binary model’s outputs f(xi) example is given by the assigned value for the bin that and the labels yi. Calibrated probabilities pî for an instance it lands in. Naeini et al.(2015) proposed an extension of i can be obtained by histogram binning called Bayesian binning into quantiles, which performs Bayesian model averaging over all possible p^ = π(f(x ); α; β) = σ (αf(x ) + β) (7) i i i equal-frequency binning schemes.

Load more