Obtaining Well Calibrated Probabilities Using Bayesian Binning
Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence

Mahdi Pakdaman Naeini 1, Gregory F. Cooper 1,2, and Milos Hauskrecht 1,3
1 Intelligent Systems Program, University of Pittsburgh, PA, USA
2 Department of Biomedical Informatics, University of Pittsburgh, PA, USA
3 Computer Science Department, University of Pittsburgh, PA, USA

Abstract

Learning probabilistic predictive models that are well calibrated is critical for many prediction and decision-making tasks in artificial intelligence. In this paper we present a new non-parametric calibration method called Bayesian Binning into Quantiles (BBQ) which addresses key limitations of existing calibration methods. The method post-processes the output of a binary classification algorithm; thus, it can be readily combined with many existing classification algorithms. The method is computationally tractable, and empirically accurate, as evidenced by the set of experiments reported here on both real and simulated datasets.

Introduction

A rational problem-solving agent aims to maximize its utility subject to the existing constraints (Russell and Norvig 1995). To be able to maximize the utility function for many practical prediction and decision-making tasks, it is crucial to develop an accurate probabilistic prediction model from data. Unfortunately, many existing machine learning and data mining models and algorithms are not optimized for obtaining accurate probabilities, and the predictions they produce may be miscalibrated. Generally, a set of predictions of a binary outcome is well calibrated if the outcomes predicted to occur with probability p do occur about p fraction of the time, for each probability p that is predicted. This concept can be readily generalized to outcomes with more than two values.

Figure 1 shows a hypothetical example of a reliability curve (DeGroot and Fienberg 1983; Niculescu-Mizil and Caruana 2005), which displays the calibration performance of a prediction method. The curve shows, for example, that when the method predicts Z = 1 to have probability 0.5, the outcome Z = 1 occurs in about 0.57 fraction of the instances (cases). The curve indicates that the method is fairly well calibrated, but it tends to assign probabilities that are too low. In general, perfect calibration corresponds to a straight line from (0, 0) to (1, 1). The closer a calibration curve is to this line, the better calibrated is the associated prediction method.

Figure 1: The solid line shows a calibration (reliability) curve for predicting Z = 1. The dotted line is the ideal calibration curve.

Producing well-calibrated probabilistic predictions is critical in many areas of science (e.g., determining which experiments to perform), medicine (e.g., deciding which therapy to give a patient), business (e.g., making investment decisions), and others. However, model calibration and the learning of well-calibrated probabilistic models have not been studied in the machine learning literature as extensively as, for example, discriminative machine learning models that are built to achieve the best possible discrimination among classes of objects. One way to achieve a high level of model calibration is to develop methods for learning probabilistic models that are well-calibrated, ab initio. However, this approach would require one to modify the objective function used for learning the model, and it may increase the computational cost of the associated optimization task. An alternative approach is to construct well-calibrated models by relying on the existing machine learning methods and by modifying their outputs in a post-processing step to obtain the desired model. This approach is often preferred because of its generality, flexibility, and the fact that it frees the designer of the machine learning model from the need to add additional calibration measures into the objective function used to learn the model. The existing approaches developed for this purpose include histogram binning, Platt scaling, and isotonic regression (Platt 1999; Zadrozny and Elkan 2001; 2002). In all of these, the post-processing step can be seen as a function that maps the output of a prediction model to probabilities that are intended to be well-calibrated. Figure 1 shows an example of such a mapping.
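To make the notion of a reliability curve concrete, the short sketch below (not from the paper; the function name reliability_curve, the choice of ten equal-width probability bins, and the NumPy dependency are illustrative assumptions) computes the points of such a curve by grouping predictions into probability bins and comparing each bin's mean predicted probability with its observed fraction of positive outcomes:

```python
import numpy as np

def reliability_curve(probs, outcomes, n_bins=10):
    """Group predicted probabilities into equal-width bins; for each non-empty bin,
    return the mean predicted probability and the observed fraction of positives."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=int)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to one of n_bins bins (indices 0 .. n_bins - 1).
    bin_ids = np.clip(np.digitize(probs, edges[1:-1]), 0, n_bins - 1)
    mean_pred, frac_pos = [], []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            mean_pred.append(probs[mask].mean())    # average predicted probability in bin b
            frac_pos.append(outcomes[mask].mean())  # empirical frequency of Z = 1 in bin b
    return np.array(mean_pred), np.array(frac_pos)

# Hypothetical example: a predictor whose probabilities are slightly too low,
# so the observed frequency in each bin exceeds the mean predicted probability.
rng = np.random.default_rng(0)
p_true = rng.uniform(0.05, 0.95, size=5000)
z = rng.binomial(1, p_true)
p_hat = np.clip(p_true - 0.05, 0.0, 1.0)
print(np.round(np.column_stack(reliability_curve(p_hat, z)), 3))
```

In a plot of observed frequency against predicted probability, points above the diagonal, as in the hypothetical curve of Figure 1, indicate predicted probabilities that are too low.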
Existing post-processing calibration methods can be divided into two groups: parametric and non-parametric methods. An example of a parametric method is Platt's method, which applies a sigmoidal transformation that maps the output of a predictive model to a calibrated probability output (Platt 1999). The parameters of the sigmoidal transformation function are learned using a maximum likelihood estimation framework. The key limitation of the approach is the (sigmoidal) form of the transformation function, which only rarely fits the true distribution of predictions. The most common non-parametric methods are based either on binning (Zadrozny and Elkan 2001) or isotonic regression (Zadrozny and Elkan 2002). In the histogram binning approach, also known as quantile binning, the raw predictions of a binary classifier are sorted first, and then they are partitioned into B subsets of equal size, called bins. Given a prediction y, the method finds the bin containing that prediction and returns as ŷ the fraction of positive outcomes (Z = 1) in the bin. Histogram binning has several limitations, including the need to define the number of bins and the fact that the bins and their associated boundaries remain fixed over all predictions (Zadrozny and Elkan 2002).
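As a point of reference for the histogram (quantile) binning approach just described, here is a minimal sketch of the procedure; it is not the authors' code, and the helper names fit_histogram_binning and calibrate, as well as the NumPy dependency, are assumptions made for illustration:

```python
import numpy as np

def fit_histogram_binning(scores, labels, n_bins=10):
    """Sort the raw classifier predictions, split them into n_bins equal-frequency
    (quantile) bins, and record each bin's fraction of positive (Z = 1) outcomes."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Quantile edges give (approximately) equal-size bins; widen the outer edges
    # so that any future prediction falls into some bin.
    edges = np.quantile(scores, np.linspace(0.0, 1.0, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    bin_ids = np.digitize(scores, edges[1:-1])
    counts = np.bincount(bin_ids, minlength=n_bins)
    positives = np.bincount(bin_ids, weights=labels, minlength=n_bins)
    bin_frac_pos = np.where(counts > 0, positives / np.maximum(counts, 1), 0.5)
    return edges, bin_frac_pos

def calibrate(y, edges, bin_frac_pos):
    """Map a raw prediction y to the positive fraction of the bin that contains it."""
    return bin_frac_pos[np.digitize(y, edges[1:-1])]
```

The single fixed number of bins and the fixed bin boundaries in this sketch are exactly the limitations that motivate BBQ below.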
The other non-parametric calibration method is based on isotonic regression (Zadrozny and Elkan 2002). This method only requires that the mapping function be isotonic (monotonically increasing) (Niculescu-Mizil and Caruana 2005). A commonly used method for computing the isotonic regression is the pool adjacent violators (PAV) algorithm (Barlow et al. 1972). The isotonic calibration method based on the PAV algorithm can be viewed as a binning algorithm in which the positions of the boundaries and the sizes of the bins are selected according to how well the classifier ranks the examples in the training data (Zadrozny and Elkan 2002). Recently, a variation of the isotonic-regression-based calibration method for predicting accurate probabilities with a ranking loss was proposed (Menon et al. 2012). Although isotonic-regression-based calibration yields good performance in many real data applications, violations of the isotonicity assumption are quite frequent in practice, so relaxing the isotonicity constraints may be appropriate.

This paper presents a new binary classifier calibration method called Bayesian Binning into Quantiles (BBQ) that is applied as a post-processing step. The approach can be viewed as a refinement of the histogram-binning calibration method in that it considers multiple different binnings and their combination to yield more robust calibrated predictions. Briefly, by considering only one fixed bin discretization, one may not be able to guess the optimal bin width correctly.

Methods

BBQ extends the simple histogram-binning calibration method (Zadrozny and Elkan 2001) by considering multiple binning models and their combination. The main challenge here is to decide how to pick the models and how to combine them. BBQ considers multiple equal-frequency binning models that distribute the data points in the training set equally across all bins. The different binning models differ in the number of bins they have. We combine them using a Bayesian score derived from the BDeu score (Heckerman, Geiger, and Chickering 1995) used for learning Bayesian network structures.

Let y_i and z_i denote, respectively, an uncalibrated classifier prediction and the true class of the i'th instance. Also, let D denote the set of all training instances (y_i, z_i), and let S be the sorted set of all uncalibrated classifier predictions {y_1, y_2, ..., y_N}, where N is the total number of training instances. In addition, let Pa denote a partitioning of S into B equal-frequency bins. A binning model M induced by the training set is defined as M ≡ {B, Pa, Θ}, where B is the number of bins over the set S and Θ is the set of all calibration model parameters, Θ = {θ_1, ..., θ_B}, which are defined as follows. For a bin b, the distribution of the class variable P(Z = 1 | B = b) is modeled as a binomial distribution with parameter θ_b. Thus, Θ specifies the binomial distributions for all the existing bins in Pa. We score a binning model M as follows:

Score(M) = P(M) · P(D | M)    (1)

The marginal likelihood P(D | M) in Equation 1 has a closed-form solution under the following assumptions (Heckerman, Geiger, and Chickering 1995): (1) all samples are i.i.d. and the class distribution P(Z | B = b), which is the class distribution for instances located in bin number b, is modeled using a binomial distribution with parameter θ_b; (2) the distributions of the class variable over two different bins are independent of each other; and (3) the prior distribution over each binning model parameter θ_b is modeled using a Beta distribution. We also assume that the parameters of the Beta distribution associated with θ_b are α_b and β_b. We set them to be equal to α_b = (N'/B) p_b and β_b = (N'/B)(1 − p_b), where N' is the equivalent sample size expressing the strength of our belief in the prior distribution and p_b is the midpoint of the interval defining the b'th bin in the binning model M. Given the above assumptions, the marginal likelihood can be expressed as (Heckerman, Geiger, and Chickering 1995):

P(D | M) = ∏_{b=1}^{B} [ Γ(N'/B) / Γ(N_b + N'/B) ] · [ Γ(m_b + α_b) / Γ(α_b) ] · [ Γ(n_b + β_b) / Γ(β_b) ],

where N_b is the total number of training instances in bin b, and m_b and n_b are, respectively, the number of class one (Z = 1) and class zero (Z = 0) instances among them.
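To make the scoring step concrete, the sketch below computes the logarithm of the marginal likelihood above for a single equal-frequency binning model. It is not the paper's implementation; the function name log_marginal_likelihood, the SciPy dependency, the default equivalent sample size, and the use of each bin's observed score range to approximate the bin midpoint p_b are assumptions:

```python
import numpy as np
from scipy.special import gammaln  # log-Gamma, to avoid overflow in the Gamma ratios

def log_marginal_likelihood(scores, labels, n_bins, n_prime=2.0):
    """log P(D | M) for one equal-frequency binning model with B = n_bins bins,
    using Beta priors alpha_b = (N'/B) * p_b and beta_b = (N'/B) * (1 - p_b)."""
    order = np.argsort(scores)
    scores = np.asarray(scores, dtype=float)[order]
    labels = np.asarray(labels, dtype=int)[order]
    log_ml = 0.0
    for s_b, z_b in zip(np.array_split(scores, n_bins), np.array_split(labels, n_bins)):
        N_b = len(s_b)            # training instances in bin b
        m_b = int(z_b.sum())      # class 1 instances in bin b
        n_b = N_b - m_b           # class 0 instances in bin b
        # Midpoint of the bin's score interval, approximated from the scores it
        # contains and clipped so the Beta prior parameters stay strictly positive.
        p_b = np.clip(0.5 * (s_b.min() + s_b.max()), 1e-3, 1.0 - 1e-3)
        alpha_b = (n_prime / n_bins) * p_b
        beta_b = (n_prime / n_bins) * (1.0 - p_b)
        log_ml += (gammaln(n_prime / n_bins) - gammaln(N_b + n_prime / n_bins)
                   + gammaln(m_b + alpha_b) - gammaln(alpha_b)
                   + gammaln(n_b + beta_b) - gammaln(beta_b))
    return log_ml
```

The score in Equation 1 then adds the log prior, log P(M), to this quantity; BBQ evaluates such scores for binning models with different numbers of bins B in order to combine them.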