Temporal Probability Calibration

Tim Leathart 1 Maksymilian Polaczuk 1

Abstract

In many applications, accurate class probability estimates are required, but many types of models produce poor quality probability estimates despite achieving acceptable classification accuracy. Even though probability calibration has been a hot topic of research in recent times, the majority of this has investigated non-sequential data. In this paper, we consider calibrating models that produce class probability estimates from sequences of data, focusing on the case where predictions are obtained from incomplete sequences. We show that traditional calibration techniques are not sufficiently expressive for this task, and propose methods that adapt calibration schemes depending on the length of an input sequence. Experimental evaluation shows that the proposed methods are often substantially more effective at calibrating probability estimates from modern sequential architectures for incomplete sequences across a range of application domains.

[Figure 1 plot: expected calibration error (y-axis) against sequence length (x-axis); legend entries: Deep Averaging Network, Transformer.]

Figure 1. Test expected calibration error (ECE) of common sequential architectures for sequences of different lengths from the Large Movie Review dataset. As the sequence length increases, the level of calibration for each model changes substantially.

1 Sportsflare AI. Correspondence to: Tim Leathart.

arXiv:2002.02644v2 [cs.LG] 15 Feb 2020

1. Introduction

Sequential data is abundant in the modern world, commonly seen in forms such as natural language (Harper & Konstan, 2016; Rajpurkar et al., 2016), video streams (Cordts et al., 2016) and financial trading patterns (Brown et al., 2013). Modern approaches to making predictions from these kinds of data typically involve model architectures such as deep averaging networks (Iyyer et al., 2015), recurrent neural networks (Hochreiter & Schmidhuber, 1997; Cho et al., 2014) and, more recently, transformers (Vaswani et al., 2017; Devlin et al., 2018; Radford et al., 2019).

In many domains, predicting the most likely class label $\hat{y}_i$ for an instance $i$ with features $x_i$ and label $y_i$ is sufficient for classification tasks. However, it is often the case in real applications that an estimated probability distribution $\hat{p}_i$ over the labels can improve the quality or usefulness of the system. For example, an automated loan approval system can be used to minimise expected monetary losses to the lender if the probability of defaulting is accurately estimated. In semi-autonomous vehicles, low estimated confidence for the most likely class for some object may indicate that extra caution is required, possibly alerting the driver to intervene. Sometimes, an accurate probability distribution, rather than a hard classification, is required to have a functioning system at all. For example, if a model is used to predict the winner of a sports game in order to automatically set betting odds, accurate probability estimates are necessary.

Even though most modern algorithms natively produce an estimated probability distribution over the class labels for a given instance, these estimates do not always closely reflect the true probabilities of each class (Niculescu-Mizil & Caruana, 2005; Guo et al., 2017; Leathart et al., 2019; Kumar & Sarawagi, 2019). Models whose estimates do not reflect the true probabilities are said to be poorly calibrated. Probability calibration is an additional step one can apply when training a model $f$, where its class probability estimates (or logits) are used as inputs for another model $\pi$ that scales them appropriately to better match the true probabilities.

This work considers situations in which we wish to obtain class probability estimates for a sequence at any time during the formation of the sequence. This situation is fairly common, e.g., offering a “help” article to a website user while they are typing a description of the issue they are facing, predicting if an investor should buy or sell an option up until the expiration date, or predicting the outcome of a sports game given information about the current state of the game. For these types of problems, the prediction task typically gets easier as the end of the full sequence draws nearer: if the score in a football game is 1-0 with one minute remaining, we should be much more confident in our prediction of the winner than if the score is 1-0 at half-time. Figure 1 shows how expected calibration error, a commonly used calibration metric described in Section 2.1, changes for the Large Movie Review dataset (Maas et al.) as the sequence length increases for several different models. In this example, deep averaging networks and transformers have poorer calibration for shorter sequences than for longer ones, and vice versa for the recurrent network. Intuitively, a global calibration strategy that applies the same calibration to sequences of any length will not be suitable for these models.

In this paper, we propose several simple strategies to adapt calibration schemes to better handle incomplete sequences, and evaluate them against traditional, global calibration methods. The paper is structured as follows. First, an introduction to probability calibration is provided, where definitions, existing approaches and evaluation methods are discussed. Then, our proposed temporal probability calibration techniques are described, considering both discrete and continuous sequences of fixed or variable length. Experiments are described and their results discussed. Finally, we go over conclusions and future work.

2. Probability Calibration

A probabilistic classifier is said to be perfectly calibrated when the probability estimates for each example exactly match the true class probabilities of the example. For instance, for those examples that are assigned a confidence of 75% by a perfectly calibrated classifier, 75% of them should actually be classified correctly.

More formally, for a probabilistic classifier $f$ for an $M$-class classification task and predicted probability distribution $\hat{\mathbf{p}} = [\hat{p}_1, \dots, \hat{p}_M]$, the proportions of classes for all possible instances that would be assigned the prediction $\hat{\mathbf{p}}$ by $f$ are equal to $\hat{\mathbf{p}}$ (Kull et al., 2019):

$$P(Y = m \mid f(X) = \hat{\mathbf{p}}) = \hat{p}_m \quad \text{for } m \in \{1, \dots, M\}. \tag{1}$$

2.1. Evaluating Calibration

Without an infinite number of samples, the condition in (1) is not possible to achieve. However, there exist several proxy metrics that aim to emulate this intuition. Common metrics for evaluating the quality of probability estimates include negative log likelihood (NLL)

$$\mathrm{NLL} = -\frac{1}{n} \sum_{i=1}^{n} y_i \log \hat{p}_i \tag{2}$$

and the Brier score (Brier, 1950), also known as mean squared error (MSE)

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{p}_i)^2. \tag{3}$$

Technically speaking, neither of these metrics directly measures calibration, as for each example $i$, the estimated probability $\hat{p}_i$ is compared to its label $y_i$ rather than the true probability distribution over the label space. However, they are good proxy metrics, and possess convenient properties for machine learning such as being differentiable, applicable to individual examples, and easy to compute.

Of course, obtaining a true probability distribution over the label space for a single example is usually not possible in practice, as typically, only labels are supplied. Expected calibration error (ECE) (Naeini et al., 2015) is a binning-based approach that attempts to approximate this distribution empirically. The probability space $[0, 1]$ is split into $K$ bins, and the test examples are grouped into the bins based on their estimated probabilities. The accuracy and average confidence of each bin $B_k$ are computed as

$$\mathrm{acc}(B_k) = \frac{1}{n_k} \sum_{i}^{n_k} \mathbb{I}(\hat{y}_i = y_i) \tag{4}$$

$$\mathrm{conf}(B_k) = \frac{1}{n_k} \sum_{i}^{n_k} \hat{p}_i, \tag{5}$$

where $n_k$ is the number of examples in bin $B_k$ and $\mathbb{I}$ is the indicator function. ECE is then defined as

$$\mathrm{ECE} = \sum_{k=1}^{K} \frac{n_k}{n} \left| \mathrm{acc}(B_k) - \mathrm{conf}(B_k) \right|. \tag{6}$$

For a perfectly calibrated model, the accuracy and confidence of each bin should be equal. Even though ECE measures (average) calibration directly, it is not without problems: the choice for number of bins is arbitrary, probabilities for individual examples are discarded in favour of aggregated bins, and its applicability to multiclass problems is debated (Nixon et al., 2019; Leathart, 2019). Strategies such as classwise-ECE (Kull et al., 2019) have been proposed to better handle the multiclass case.

Accuracy and confidence are often compared visually in reliability diagrams, where they are plotted against each other (DeGroot & Fienberg, 1983). In reliability diagrams, perfect calibration is shown by a straight diagonal line. Regions where the curve sits above the diagonal represent underconfidence, and regions where the curve sits under the diagonal represent overconfidence.

2.2. Parametric Calibration Methods

One of the most well-known approaches to probability calibration is Platt scaling (Platt, 1999), in which a univariate logistic regression model with parameters $(\alpha, \beta)$ is learned to minimise NLL between a binary model's outputs $f(x_i)$ and the labels $y_i$. Calibrated probabilities $\hat{p}_i$ for an instance $i$ can be obtained by

$$\hat{p}_i = \pi(f(x_i); \alpha, \beta) = \sigma(\alpha f(x_i) + \beta) \tag{7}$$

where $\sigma$ is the sigmoid function. Platt scaling can be applied to multiclass problems using the widely known one-vs-rest method (Zadrozny & Elkan, 2002). Note that $f(x)$, the input to the calibration model, should be a logit rather than a probability. This is because logistic regression assumes a linear relationship between the inputs $f(x)$ and the output logits.

Guo et al. (2017) proposed several variants of Platt scaling (vector scaling, matrix scaling and temperature scaling) that can be applied to multiclass problems. Matrix and vector scaling are identical to multinomial logistic regression where a weights matrix $\mathbf{W}$ and bias vector $\mathbf{b}$ are learned:

$$\hat{\mathbf{p}}_i = \pi(f(x_i); \mathbf{W}, \mathbf{b}) = \sigma(\mathbf{W} f(x_i) + \mathbf{b}). \tag{8}$$

The difference between them is that in vector scaling, $\mathbf{W}$ is restricted to be a diagonal matrix, while matrix scaling imposes no constraints on $\mathbf{W}$. Vector scaling can be seen as applying Platt scaling in a one-vs-rest strategy, but with jointly optimised weights. Finally, temperature scaling is the simplest strategy, which learns a single parameter $\tau$ (referred to as the temperature) that scales the logits for all classes in the same way:

$$\hat{\mathbf{p}}_i = \pi(f(x_i); \tau) = \sigma\left(\frac{f(x_i)}{\tau}\right). \tag{9}$$

An interesting property of temperature scaling is that because there is no bias term applied, it simply makes probabilistic predictions more or less extreme without changing their classification, and hence does not affect the classification accuracy (Guo et al., 2017).

Beta scaling (Kull et al., 2017) and Dirichlet scaling (Kull et al., 2019) are, in a practical sense, very similar to Platt scaling and matrix scaling respectively. They model beta and Dirichlet distributions on probabilities $f(x) \in [0, 1]$, but convert them to logits and log-probabilities respectively and use logistic regression to fit them. A regularisation scheme that penalises off-diagonal and bias parameters was also proposed, which led to strong results for both Dirichlet scaling and matrix scaling (Kull et al., 2019).

2.3. Nonparametric Calibration Methods

Histogram binning (Zadrozny & Elkan, 2001) is a simple nonparametric approach to probability calibration. In histogram binning, the model's output space is split into $K$ bins, typically by equal-width or equal-frequency strategies. A calibrated probability per bin is assigned such that the MSE for each bin is minimised, which turns out to be equal to the percentage of positive examples in each bin respectively. The calibrated probability estimate for a test example is given by the assigned value for the bin that it lands in. Naeini et al. (2015) proposed an extension of histogram binning called Bayesian binning into quantiles, which performs Bayesian model averaging over all possible equal-frequency binning schemes.

Isotonic regression (Zadrozny & Elkan, 2002) learns a piecewise constant function that minimises MSE between the uncalibrated probabilities and labels, with the constraint that it must be monotonically increasing. This can be seen as an extension of histogram binning in that it optimises the bin widths and predictions jointly. Naeini & Cooper (2016) proposed an extension of this where monotonicity is not constrained, but encouraged through regularisation. The removal of the monotonicity constraint leads to overfitting if training is performed to completion; to combat this, the collection of so-called near-isotonic regression models produced at each step of training are used in an ensemble, with their predictions weighted by the Bayesian information criterion (Schwarz et al., 1978).

3. Temporal Probability Calibration

Temporal probability calibration is motivated by the idea that a model ought to be more or less confident about its predictions at different stages of completion of a sequence. Leathart et al. (2017) showed that overall calibration can be improved by applying different calibration models in different regions of the input space; this work takes a similar approach in the temporal dimension.

3.1. Discrete Fixed-Length Sequences

For classification problems where predictions are made at discrete timesteps $t \in \{1, \dots, T\}$, it is simple and highly effective to produce a series of $T$ calibration models, each parametrised by $\theta_t$, $\pi_1(f; \theta_1), \dots, \pi_T(f; \theta_T)$, corresponding to sequences of each possible length. Calibration of a model output $f(X)$ can be performed by

$$\pi(f(X), t; \theta) = \pi_t(f(X); \theta_t). \tag{10}$$

These calibration models $\pi_1, \dots, \pi_T$ are straightforward to fit by optimising the NLL for a held-out calibration set:

$$\arg\min_{\theta_t} \mathrm{NLL}\big(\mathbf{y}_t, \pi_t(f(X_t); \theta_t)\big) \quad t \in \{1, \dots, T\} \tag{11}$$

where $X_t$ and $\mathbf{y}_t$ are the sequential features and labels respectively of a sample of examples from the calibration set of length $t$.
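
As a rough illustration of equations (10) and (11), the sketch below fits one temperature-scaling model per timestep for a binary classifier that outputs a single logit. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`fit_inverse_temperature`, `fit_temporal_calibration`, `calibrate`), the golden-section search, the search bounds, and the toy calibration data are all hypothetical choices made to keep the example dependency-free. The parameter fitted for each timestep is the inverse temperature $1/\tau_t$.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mean_nll(logits, labels, inv_temp):
    # Mean binary NLL of sigmoid(inv_temp * logit); labels are 0 or 1.
    total = 0.0
    for z, y in zip(logits, labels):
        sign = 1.0 if y == 1 else -1.0
        total += softplus(-sign * inv_temp * z)
    return total / len(logits)


def fit_inverse_temperature(logits, labels, lo=0.01, hi=10.0, iters=60):
    # Golden-section search over 1/tau (an assumed optimiser). The NLL is
    # convex in the inverse temperature, so a bracketing search finds the
    # minimiser within the assumed bounds [lo, hi].
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iters):
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if mean_nll(logits, labels, c) < mean_nll(logits, labels, d):
            b = d
        else:
            a = c
    return (a + b) / 2.0


def fit_temporal_calibration(calibration_set):
    # calibration_set maps timestep t -> (logits, labels) for truncated
    # sequences of length t; returns t -> fitted inverse temperature.
    return {t: fit_inverse_temperature(logits, labels)
            for t, (logits, labels) in calibration_set.items()}


def calibrate(logit, t, inv_temps):
    # Equation (10): dispatch to the calibration model for timestep t.
    return sigmoid(inv_temps[t] * logit)


# Hypothetical toy data: overconfident at t=1, underconfident at t=2.
calibration_set = {
    1: ([4.0] * 10, [1] * 7 + [0] * 3),
    2: ([1.0] * 10, [1] * 9 + [0] * 1),
}
inv_temps = fit_temporal_calibration(calibration_set)
temperatures = {t: 1.0 / a for t, a in inv_temps.items()}
```

On this toy data the fitted temperature is above one at $t = 1$ (softening overconfident outputs) and below one at $t = 2$ (sharpening underconfident ones), which a single global temperature could not do for both timesteps at once.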

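To check whether a per-timestep calibration model helps, the ECE of equations (4) to (6) can be computed separately for each timestep. The following is a minimal sketch under stated assumptions: it uses equal-width bins, takes the confidence to be the estimated probability of the predicted class, and the names `expected_calibration_error` and `ece_by_timestep` are hypothetical rather than taken from the paper.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    # Equations (4)-(6) with equal-width bins over [0, 1] (an assumption).
    # confidences: probability assigned to the predicted class.
    # correct: 1 if the prediction matched the label, else 0.
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        k = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes in the last bin
        bins[k].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        n_k = len(members)
        acc = sum(hit for _, hit in members) / n_k
        avg_conf = sum(conf for conf, _ in members) / n_k
        ece += (n_k / n) * abs(acc - avg_conf)
    return ece


def ece_by_timestep(records, num_bins=10):
    # records: iterable of (t, confidence, correct) triples.
    grouped = {}
    for t, conf, hit in records:
        confs, hits = grouped.setdefault(t, ([], []))
        confs.append(conf)
        hits.append(hit)
    return {t: expected_calibration_error(confs, hits, num_bins)
            for t, (confs, hits) in grouped.items()}
```

Comparing the per-timestep values before and after calibration shows at which sequence lengths a calibration scheme helps or hurts, which a single global ECE figure can hide.
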
3.2. Variable-Length and Continuous Sequences

For classification problems where predictions are made at any real-valued or integer time $t \in (0, T]$ with possibly infinite $T$, a suitable function must be chosen to continuously evolve the parameters $\theta$ of a calibration model $\pi : \mathbb{R} \to [0, 1]$. Throughout this section, we consider temperature scaling as the main calibration method for simplicity of notation; however, the ideas presented are quite general and could, in theory, be applied to other (parametric) calibration methods.

A suitable type of function for modelling an adaptive temperature used in calibration for many problems is a saturating function, such as an exponentially decaying function:

$$g(t; \alpha, \beta, \gamma, s) = \gamma - \alpha e^{(-\beta t - s)} \tag{12}$$

which can then be applied in a calibration function like so (parameters of $g$ have been written as $\theta$):

$$\pi(f(X), t; g, \theta) = \sigma\big(g(t; \theta) f(X)\big). \tag{13}$$

The coefficient of $f(X)$ in (13) can be interpreted as the inverse of $\tau$ in temperature scaling. A function of this form is flexible enough to allow many different temporal calibration schemes, e.g., continuously decaying toward an upper or lower bound from any starting position (footnote 1). Figure 2 shows some examples of possible temperature functions of this form. A convenient feature of applying temperature scaling in a temporal fashion in this way is that, like in the global calibration case, the total accuracy is left unchanged.

Footnote 1: This assumes that $\beta$ is positive. One can square this term to enforce this property.

[Figure 2 plot: inverse temperature (y-axis) against sequence length (x-axis).]

Figure 2. Examples of possible forms of (12).

Similarly to the discrete case, the parameters can be fit to minimise NLL on a held-out calibration set:

$$\arg\min_{\theta} \mathrm{NLL}\big(\mathbf{y}, \pi(f(X); g, \theta)\big) \tag{14}$$

except that in this case, $X$ and $\mathbf{y}$ must be the sequential features and labels respectively of a held-out set containing sequences that have been artificially truncated to produce a range of sequence lengths. We found that (14) can be optimised reliably using typical off-the-shelf optimisers when $t$ is normalised to the range $[0, 1]$, based on the maximum length of sequences in the calibration set.

3.3. Alternative Temporal Measures

In some problems, it may be more appropriate to use a different indicator of sequence completion than the time $t$ directly. Consider a game played between two players, where the goal is to win five rounds, and a draw is not possible in each round. Clearly, there are a maximum of nine rounds. Intuitively, one might think that as the number of rounds passed increases, we should be more confident of the outcome. However, if a game reaches the ninth round, then the players are likely to be of a similar skill level; thus the probabilities of each outcome become more difficult to predict accurately.

A superior measure of sequence completion than the round number $t$ for a game like this may be to fit calibration models to subsets of historical games with equal absolute score difference $s$. Sequences from other domains may have ways of representing an estimation of completion other than directly using the time, and it is left to the practitioner to decide the best approach for their specific situation.

4. Experimental Results

In this section, our experimental methods are described and results discussed. As explained in the introduction, there are many situations in which accurate probability estimates from incomplete sequences are useful. However, no datasets for any of these specific tasks exist in the public domain. Nevertheless, we artificially create incomplete sequences from natural language datasets, as well as introduce a sequential dataset for esports outcome prediction, and show that temporal calibration works well across these domains.

For all results, we present the mean of ten runs with ten different random seeds. In our results tables, the method with the best mean score for each metric is bolded. Additionally, we compute statistical significance by the Friedman test followed by the Nemenyi post-hoc test at p = 0.05 (Demšar, 2006). This is a nonparametric test for multiple classifier comparison that compares average ranks of each method across the ten runs. If the difference between the average ranks of two methods is less than a critical difference (a function of number of classifiers, number of runs and desired p-value), then they are considered statistically indistinguishable from each other. In our tables, a bullet (•) indicates that a method is in the best-performing group.

Experiments were conducted using the PyTorch neural networks framework (Paszke et al., 2019) and Huggingface's transformers library (Wolf et al., 2019) on a Google Cloud Service instance equipped with 16 vCPUs, 60GB of memory and an NVIDIA P100 GPU.

4.1. Natural Language Datasets

We use two natural language datasets: Large Movie Review (Maas et al.) and Amazon Fine Food Review (McAuley & Leskovec, 2013). Large Movie Review is a collection of film reviews taken from the Internet Movie Database (IMDB) (Maas et al.). The classification task is to predict if reviews speak positively or negatively about the film. The dataset is split into 20,000 train, 5,000 calibration and 25,000 test examples. The average length of samples in this dataset is 233 words. Amazon Fine Food Review is a larger dataset of food reviews taken from Amazon. The original dataset has classes from one to five stars; in this investigation, we combine the positive reviews (four and five stars) and the neutral/negative reviews (one to three stars) to form a binary classification problem for simplicity. There are 450,000 train, 50,000 calibration, and 68,484 test examples, with an average length of 85 words.

4.1.1. EXPERIMENTAL SETUP

For these datasets, we compare no calibration, global temperature scaling and temporal temperature scaling for three simple natural language processing (NLP) models: a deep averaging network (DAN) (Iyyer et al., 2015), a recurrent network using gated recurrent units (GRU) (Cho et al., 2014) and a BERT-based classifier (Devlin et al., 2018). Even though transformers are quickly becoming the de-facto models for NLP research, we decided to include DANs and GRUs in this investigation because they have much smaller computational and memory requirements in order to be effective, and are still commonly used in industrial applications. We use 300-dimensional GloVe embeddings to represent the words for DAN and GRU models (Pennington et al., 2014).

The DAN had two fully-connected layers after the embedding layer, of 512 and 256 units respectively, before the output layer. The recurrent network contained a GRU followed by one fully-connected layer, each of 256 units, before the output layer. The sum of hidden outputs from the GRU at each time step was passed to the fully-connected layer. In both of these networks, ReLU activations (Krizhevsky et al., 2012) and dropout (Srivastava et al., 2014) with p = 0.25 were used between the fully-connected layers. For the BERT-based classifier, we truncate sequences to a maximum length of 512 tokens and use fixed pre-trained weights, only learning the final layer for prediction. Adam (Kingma & Ba, 2015) with default settings was used to optimise each network and temporal calibration parameters.

As the sequence length of the examples in these datasets is not uniform, we optimised a continuously decaying exponential function for temporal calibration as in (13) in each experiment. The test and calibration sets were constructed by truncating each sequence in the original test set at five random uniformly-sampled points in order to obtain datasets containing a range of sequence completenesses.

Table 1. Global performance metrics for natural language datasets. Values presented are the means and standard deviations of ten independent runs. Test examples have been randomly truncated to simulate incomplete sequences.

                              Large Movie Review                   Amazon Fine Food Review
Model  Method                 NLL               ECE                NLL               ECE
DAN    No Calibration         0.646 ± 0.008     0.107 ± 0.077      0.412 ± 0.005     0.048 ± 0.027 •
DAN    Global Calibration     0.548 ± 0.005 •   0.110 ± 0.048      0.386 ± 0.032     0.152 ± 0.046
DAN    Temporal Calibration   0.497 ± 0.004 •   0.034 ± 0.012 •    0.340 ± 0.007 •   0.037 ± 0.023 •
GRU    No Calibration         0.464 ± 0.011     0.076 ± 0.018 •    0.233 ± 0.009     0.039 ± 0.055 •
GRU    Global Calibration     0.458 ± 0.010     0.089 ± 0.011      0.229 ± 0.008 •   0.047 ± 0.061
GRU    Temporal Calibration   0.450 ± 0.008 •   0.072 ± 0.010 •    0.226 ± 0.008 •   0.038 ± 0.058 •
BERT   No Calibration         0.593 ± 0.012     0.098 ± 0.029 •    0.533 ± 0.045     0.172 ± 0.068
BERT   Global Calibration     0.592 ± 0.011     0.114 ± 0.024      0.534 ± 0.047     0.157 ± 0.075 •
BERT   Temporal Calibration   0.577 ± 0.009 •   0.085 ± 0.011 •    0.521 ± 0.038 •   0.162 ± 0.074 •

4.1.2. RESULTS AND DISCUSSION

Table 1 lists the global NLL and ECE for each calibration method and model for both NLP datasets. Temporal calibration is in the top performing group of results at p = 0.05

in every comparison (footnote 2). In all but one experiment, the temporal calibration approach achieves the lowest NLL and ECE on the test sets. The effect of temporal calibration is most pronounced for DANs. Interestingly, applying a global calibration scheme often resulted in degraded overall ECE compared to the baseline, despite an improvement in NLL.

Footnote 2: In fact, temporal calibration achieved an average rank of exactly one for most comparisons, but the Nemenyi test at n = 10, k = 3 and p = 0.05 only finds two methods to be statistically distinguishable if the difference in average ranks is greater than 1.04. This is why in some cases, pairs of methods with seemingly large NLL and ECE differences are reported as being statistically indistinguishable.

A more enlightening approach to the evaluation of temporal calibration is to visualise calibration for incomplete sequences of different lengths. Figure 3 shows how ECE evolves as the sequence length grows. The altered test sets with truncated examples were binned by sequence length such that each bin contains approximately the same number of samples, and ECE was computed for each bin. For all dataset-model pairs, temporal calibration gives the lowest (or tied lowest) ECE for the majority of the time.

[Figure 3 plots: ECE (y-axis) against sequence length (x-axis) for No Calibration, Global Calibration and Temporal Calibration. Top row, Large Movie Review: (a) DAN, (b) GRU, (c) BERT. Bottom row, Amazon Fine Food Review: (d) DAN, (e) GRU, (f) BERT.]

Figure 3. ECE for Large Movie Review (top) and Amazon Fine Food Review (bottom) over different sequence lengths. Each point and set of error bars represents the mean and standard deviation ECE respectively over ten independent runs.

For both datasets, DANs saw marked improvements when using temporal calibration, compared to the baseline as well as global calibration (Figures 3a, 3d). The effect of a fine-grained calibration strategy is most obvious for shorter sequences for both datasets. As the sequence length reaches the higher end, temporal calibration matches the performance of the baseline. This may be because a DAN has no notion of sequence length, so it is not able to learn behaviour for specifically dealing with very short sequences, and is often overconfident without the full context available.

The global calibration scheme baseline had interesting behaviour: usually it matches the performance of temporal scaling for one bin with ECE rising on either side, especially for the smaller dataset, Large Movie Review. It is intuitive that global calibration should behave in this manner, if our assumption that a different calibration strategy is required at different sequence lengths is correct, as a global calibration scheme will be unable to adapt to the needs of different regions of temporal space. This result sheds light on how overall ECE can be worse after applying global calibration, despite slight improvements in overall NLL.

In general, calibration appears to be more impactful for smaller datasets when considering GRUs (Figures 3b, 3e) and the BERT-based classifier (Figures 3c, 3f). We did not see substantial improvements for these models with the

larger Amazon Fine Food Review dataset. On the other hand, calibration of these models is greatly improved for Large Movie Review. The BERT-based classifier, like the DAN, sees the most improvement for shorter sequences. The calibration curve for BERT (Figure 3c) appears to share a mixture of the behaviours of DAN and GRU (Figures 3a, 3b). This may be because the bidirectional self-attention of BERT acts like a weighted average over the input tokens at each timestep, similarly to DANs. However, unlike DANs, BERT retains temporal information through its positional encodings.

The GRU has a stronger sequential assumption than the DAN and BERT built into its architecture. Temporal calibration of GRUs matches the performance of the baseline for shorter sequences, only seeing a reduction in ECE for longer sequences. The performance drop for longer sequences is likely due to the GRU “forgetting” about the first tokens as the sequences grow. However, the network has learned to make low-confidence predictions at the beginning of sequences, so there is little to gain from calibration here.

4.2. CS:GO Round Sequences

Counter-Strike: Global Offensive (CS:GO) is a first-person shooter multiplayer video game in which two teams of five players play against each other. A game comprises two halves of fifteen rounds, where the winning team is the first team to win sixteen rounds. In the event of a draw, a six-round overtime is added to the match until a winner is decided.

We introduce the first public dataset for CS:GO game winner prediction. It contains sequences of round statistics where each sequence element includes information such as the round winner, number of surviving players and current score. There are 12,362 games in the dataset, which we split into 8,751 training samples, 2,907 calibration samples and 704 test samples. More information about the dataset can be found in the supplementary material.

4.2.1. EXPERIMENTAL SETUP

For this dataset, we only use a recurrent neural network with gated recurrent units (Cho et al., 2014) that uses the raw features as inputs. The GRU has 96 units, followed by two feed-forward fully-connected layers of 96 units each. ReLU activations (Krizhevsky et al., 2012) are applied between the linear fully-connected layers, as well as dropout (Srivastava et al., 2014) with p = 0.25.

As before, no calibration and global temperature scaling are used as baselines. These sequences have a maximum length and discrete timesteps, so we use discrete temporal calibration as described in Section 3.1. Temperature scaling is used to calibrate each timestep. We experiment with using the round number and the absolute score difference as measures of time. As with the NLP datasets, we produce an augmented test and calibration set where examples are randomly truncated.

Table 2. Global metrics for CS:GO dataset. Values presented are the means and standard deviations of ten independent runs. Test examples have been randomly truncated to simulate incomplete sequences.

Calibration Method    NLL                ECE
No Calibration        0.3830 ± 0.006     0.0316 ± 0.013
Global Calibration    0.3815 ± 0.006     0.0290 ± 0.012 •
Temporal (Round)      0.3793 ± 0.006 •   0.0235 ± 0.012 •
Temporal (Score)      0.4062 ± 0.004     0.0450 ± 0.018

[Figure 4 plots: ECE (y-axis) against round number (x-axis), each compared with No Calibration. Panels: (a) Global Calibration, (b) Temporal Calibration (Round Number), (c) Temporal Calibration (Score Difference).]

Figure 4. ECE of each temporal calibration technique for CS:GO dataset at different round numbers. Each point and set of error bars represents the mean and standard deviation ECE respectively over ten independent runs.

4.2.2. RESULTS AND DISCUSSION

Table 2 shows global NLL and ECE of each calibration method for the CS:GO dataset. For this data, global temperature scaling shows slight improvements overall compared to the baseline, and temporal scaling based on round number achieves the best results. Interestingly, temporal scaling

1 No Calibration Temporal Calibration

0.8

0.6

Accuracy 0.4

0.2

Round 2, τ = 0.948 Round 6, τ = 0.799 Round 10, τ = 0.838 Round 14, τ = 0.989 0

1

0.8

0.6

Accuracy 0.4

0.2

based on score difference has the worst calibration overall, with a degradation compared to the baseline.

Figure 4 shows how ECE evolves with the length of the sequence. While global calibration does achieve slight ECE improvements between rounds five and sixteen, as well as improvements at the end of the game and during overtime, the calibration degrades in between these areas. Again, it is clear that global calibration strategies are suboptimal. Temporal calibration by round number results in lower ECE than global calibration in these areas, while matching the performance of the baseline during the mid-game rounds. Temporal calibration by score difference, hypothesised in Section 3.3 to be more appropriate than round number for this type of sequence, turned out to have relatively poor performance, with ECE rising fairly steadily above that of the baseline after round sixteen. This may be due to the score difference not being able to differentiate between, for example, an absolute score difference of two at round two as opposed to at the penultimate round. The truncation strategy employed results in comparatively more short sequences than long ones, which may explain why calibration degrades near the end of the games.

[Figure 5, panels shown: Round 18, τ = 1.014; Round 22, τ = 1.080; Round 26, τ = 0.961; Round 30, τ = 0.749. x-axis: Confidence.]

Figure 5. Reliability diagrams of round-based temporal calibration and the baseline for the CS:GO dataset at different rounds. Temperature (τ), fitted from the calibration set, is listed for each round number.

Figure 5 shows reliability diagrams comparing the baseline to temporal calibration (by round number). Each plot shows the calibration at a different round number, illustrating the progression of the predictions across the length of a whole game. In reliability diagrams, perfect calibration is shown by a perfect diagonal line. We use ten equal-frequency bins in these plots. Even though the calibration of the baseline at each step is reasonably good, visible improvements are made at rounds six, ten, and thirty, while the calibration of the middle rounds stays virtually the same. As equal-frequency binning and temperature scaling are employed for each round number, the accuracy of each bin (on the y-axis) does not change, but the confidence (on the x-axis) moves closer to the diagonal line after calibration is applied.

5. Conclusion

In this paper, we investigated probability calibration for sequential data; specifically looking at how calibration of the classification of a sequence that is being produced changes over time. Two simple methods for calibrating different types of sequences (exponentially-decaying temperature for continuous sequences and timestep-based binning for discrete sequences) were proposed, each showing an improvement in NLL and ECE in their respective areas of application. Especially high performance is obtained when a DAN is used for natural language data, or a relatively low amount of training data is available.

This paper touches on the calibration of predictions made by transformer-based classifiers, specifically BERT; an interesting avenue of future research would be to perform a thorough investigation on this topic. Some work has been done in this area for neural machine translation (Kumar & Sarawagi, 2019; Müller et al., 2019), but further investigation for standard classification tasks would be a valuable research contribution.
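As an illustration of the round-based temporal calibration and the equal-frequency reliability analysis discussed in Section 4, the following is a minimal sketch rather than the implementation used in the experiments. It assumes a binary task with one logit per instance, fits one temperature per round by a simple grid search on NLL (a stand-in for whatever optimiser is actually used), and evaluates on the same synthetic data it was fitted on, whereas a held-out test set would be used in practice. The helper names (`make_round`, `fit_temperature`, `binned`) and the synthetic rounds are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(logits, labels, tau):
    """Mean negative log-likelihood after dividing the logits by tau."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z / tau), 1e-12), 1.0 - 1e-12)
        total -= math.log(p if y == 1 else 1.0 - p)
    return total / len(logits)

def fit_temperature(logits, labels):
    """Grid search over tau in [0.25, 5.0] (assumed stand-in for an optimiser)."""
    grid = [0.25 + 0.05 * i for i in range(96)]
    return min(grid, key=lambda tau: nll(logits, labels, tau))

def binned(logits, labels, tau, n_bins=10):
    """Equal-frequency bins of (confidence, correct) pairs, sorted by confidence."""
    pairs = []
    for z, y in zip(logits, labels):
        p = sigmoid(z / tau)
        pairs.append((max(p, 1.0 - p), int((p >= 0.5) == (y == 1))))
    pairs.sort(key=lambda pc: pc[0])
    n = len(pairs)
    edges = [round(i * n / n_bins) for i in range(n_bins + 1)]
    return [pairs[a:b] for a, b in zip(edges, edges[1:]) if b > a]

def bin_accuracies(bins):
    return [sum(c for _, c in b) / len(b) for b in bins]

def ece(bins):
    """Expected calibration error: size-weighted |accuracy - confidence| per bin."""
    n = sum(len(b) for b in bins)
    total = 0.0
    for b in bins:
        acc = sum(c for _, c in b) / len(b)
        conf = sum(p for p, _ in b) / len(b)
        total += len(b) / n * abs(acc - conf)
    return total

def make_round(base, step, wrong_every, n=40):
    """Synthetic logits with distinct magnitudes; every wrong_every-th one is wrong."""
    logits, labels = [], []
    for i in range(n):
        sign = 1 if i % 2 == 0 else -1
        logits.append(sign * (base + step * i))
        predicted = 1 if sign > 0 else 0
        labels.append(predicted if i % wrong_every else 1 - predicted)
    return logits, labels

# Hypothetical calibration data: an overconfident round and an underconfident one.
calibration_set = {6: make_round(3.0, 0.05, 4), 30: make_round(0.5, 0.02, 10)}
taus = {r: fit_temperature(z, y) for r, (z, y) in calibration_set.items()}

results = {}
for r, (z, y) in calibration_set.items():
    before, after = binned(z, y, 1.0), binned(z, y, taus[r])
    same_accuracy = bin_accuracies(before) == bin_accuracies(after)
    results[r] = (ece(before), ece(after), same_accuracy)
    print(r, taus[r], results[r])
```

Because dividing the logits by a positive temperature preserves both the predicted class and the ranking of confidences, each equal-frequency bin contains the same instances before and after calibration. This is why only the confidence of each bin moves in the reliability diagrams, while its accuracy stays fixed.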

Acknowledgements

The authors thank Christopher Laing and Chris Herrmann for useful discussions. The authors also thank Google for providing cloud credits that were used to run the experiments using Google Cloud Platform.

References

Brier, G. W. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950.

Brown, M. S., Pelosi, M. J., and Dirska, H. Dynamic-radius species-conserving genetic algorithm for the financial forecasting of Dow Jones index stocks. In Proceedings of the International Workshop on Machine Learning and Data Mining in Pattern Recognition, pp. 27–41. Springer, 2013.

Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1724–1734, 2014.

Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., and Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition, 2016.

DeGroot, M. H. and Fienberg, S. E. The comparison and evaluation of forecasters. Journal of the Royal Statistical Society: Series D (The Statistician), 32(1-2):12–22, 1983.

Demšar, J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(Jan):1–30, 2006.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 1321–1330, 2017.

Harper, F. M. and Konstan, J. A. The movielens datasets: History and context. ACM Transactions on Interactive Intelligent Systems, 5(4):19, 2016.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

Iyyer, M., Manjunatha, V., Boyd-Graber, J., and Daumé III, H. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pp. 1681–1691, 2015.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, 2015.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Proceedings of Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Kull, M., Silva Filho, T. M., Flach, P., et al. Beyond sigmoids: How to obtain well-calibrated probabilities from binary classifiers with beta calibration. Electronic Journal of Statistics, 11(2):5052–5080, 2017.

Kull, M., Nieto, M. P., Kängsepp, M., Silva Filho, T., Song, H., and Flach, P. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration. In Proceedings of Advances in Neural Information Processing Systems, pp. 12295–12305, 2019.

Kumar, A. and Sarawagi, S. Calibration of encoder decoder models for neural machine translation. arXiv preprint arXiv:1903.00802, 2019.

Leathart, T., Frank, E., Holmes, G., and Pfahringer, B. Probability calibration trees. In Proceedings of the 9th Asian Conference on Machine Learning, pp. 145–160. PMLR, 2017.

Leathart, T., Frank, E., Pfahringer, B., and Holmes, G. On calibration of nested dichotomies. In Proceedings of the 23rd Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 69–80. Springer, 2019.

Leathart, T. M. Tree-structured multiclass probability estimators. PhD thesis, University of Waikato, 2019.

Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142–150.

McAuley, J. J. and Leskovec, J. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. In Proceedings of the 22nd International Conference on World Wide Web, pp. 897–908, 2013.

Müller, R., Kornblith, S., and Hinton, G. E. When does label smoothing help? In Proceedings of Advances in Neural Information Processing Systems, pp. 4696–4705, 2019.

Naeini, M. P. and Cooper, G. F. Binary classifier calibration using an ensemble of near isotonic regression models. In Proceedings of the 16th IEEE International Conference on Data Mining, pp. 360–369. IEEE, 2016.

Naeini, M. P., Cooper, G., and Hauskrecht, M. Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, 2015.

Niculescu-Mizil, A. and Caruana, R. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pp. 625–632. ACM, 2005.

Nixon, J., Dusenberry, M. W., Zhang, L., Jerfel, G., and Tran, D. Measuring calibration in deep learning. In Proceedings of the 32nd IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 38–41, 2019.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of Advances in Neural Information Processing Systems, pp. 8024–8035, 2019.

Pennington, J., Socher, R., and Manning, C. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1532–1543. Association for Computational Linguistics, 2014.

Platt, J. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.

Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392, 2016.

Schwarz, G. et al. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Proceedings of Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., and Brew, J. Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.

Zadrozny, B. and Elkan, C. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In Proceedings of the 18th International Conference on Machine Learning, volume 1, pp. 609–616. Citeseer, 2001.

Zadrozny, B. and Elkan, C. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 694–699. ACM, 2002.