Predicting Accurate Probabilities with a Ranking Loss

Predicting accurate probabilities with a ranking loss Aditya Krishna Menon1 [email protected] Xiaoqian Jiang1 [email protected] Shankar Vembu2 [email protected] Charles Elkan1 [email protected] Lucila Ohno-Machado1 [email protected] 1University of California, San Diego, 9500 Gilman Drive, La Jolla CA 92093, USA 2University of Toronto, 160 College Street, Toronto, ON M5S 3E1, Canada Abstract cally where our approach can provide more reliable estimates than standard statistical workhorses for probability In many real-world applications of machine estimation, such as logistic regression. The model attempts learning classifiers, it is essential to predict the to achieve good ranking (in an area under ROC sense) and probability of an example belonging to a particu- regression (in a squared error sense) performance simul- lar class. This paper proposes a simple technique taneously, which is important in many real-world appli- for predicting probabilities based on optimizing cations (Sculley, 2010). Further, our model is much less a ranking loss, followed by isotonic regression. expensive to train than full-blown nonparametric methods, This semi-parametric technique offers both good such as kernel logistic regression. It is thus an appeal- ranking and regression performance, and mod- ing choice in situations where parameteric models are em- els a richer set of probability distributions than ployed for probability estimation, such as medical infor- statistical workhorses such as logistic regression. matics and credit scoring. We provide experimental results that show the ef- fectiveness of this technique on real-world appli- The paper is organized as follows. First, we provide mo- cations of probability prediction. tivating examples for predicting probabilities, and define the fundamental concept of proper losses. We then review existing methods used to predict probabilities, and discuss 1. Introduction their limitations. Next, we detail our method to estimate probabilities, based on optimizing a ranking loss and feed- Classification is the problem of learning a mapping from ing the results into isotonic regression. Finally, we provide examples to labels, with the goal of categorizing future ex- experimental results on real-world datasets to validate our amples into one of several classes. However, many real- analysis and to test the efficacy of our method. world applications instead require that we estimate the probability of an example having a particular label. For ex- We first fix our notation. We focus on probability estima- ample, when studying the click behaviour of ads in compu- tion for examples x 2 X with labels y 2 f0;1g. Each x tational advertising, it is essential to model the probability has a conditional probability function h(x) := Pr[y = 1jx]. of an ad being clicked, rather than just predicting whether For our purposes, a model is some deterministic mapping or not it will be clicked (Richardson et al., 2007). Accurate sˆ : X ! R. A probabilistic model hˆ is a model whose out- probabilities are also essential for medical screening tools puts are in [0;1], and may be derived by composing a model to trigger early assessment and admission to an ICU (Subbe with a link function f : R ! [0;1]. The scores of a model et al., 2001). may be thresholded to give a classifiery ˆ : X ! f0;1g. We n assumes ˆ is learned from a training set f(xi;yi)g of n iid In this paper, we propose a simple semi-parametric model i=1 draws from X × f0;1g. for predicting accurate probabilities that uses isotonic regression in conjunction with scores derived from optimizing a ranking loss. We analyze theoretically and empiri- 2. Background and motivation Appearing in Proceedings of the 29th International Conference on Classically, the supervised learning literature has focussed Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright on the scenario where we want to minimize the number 2012 by the author(s)/owner(s). of misclassified examples on test data. However, practical Predicting accurate probabilities with a ranking loss applications of machine learning models often have more then (s∗)−1(s∗(h)) = h, so that the optimal scores are complex constraints and requirements, which demand that some transformation of h(x). In such cases, we call the we output the probability of an example possessing a label. corresponding probabilistic loss `P a proper (or Fisher- Examples of such applications include: consistent) loss (Buja et al., 2005), and say that ` corre- sponds to a proper loss. Building meta-classifiers, where the output of a model is fed to a meta-classifier that uses additional domain knowl- Many commonly used loss functions, such as square edge to make a prediction. For example, doctors prefer to loss `(y;sˆ) = (y − sˆ)2, and logistic loss `(y;sˆ) = log(1 + use a classifier’s prediction as evidence to aid their own e−(2y−1)sˆ), correspond to a proper loss function. Thus, decision-making process (Manickam & Abidi, 1999). In a model with good regression performance according to such scenarios, it is essential that the classifier assess the squared error, say, can be thought to yield meaningful confidence in its predictions being correct, which may be probability estimates. The hinge loss of SVMs, `(y;sˆ) = captured using probabilities; max(0;1−(2y−1)sˆ), is Bayes consistent but does not correspond to a proper loss function, which is why SVMs do Using predictions to take actions, such as deciding not output meaningful probabilities (Platt, 1999). whether or not to contact a person for a marketing cam- paign. Such actions have an associated utility that is to be maximized, and maximization of expected utility is most 3. Analysis of existing paradigms to learn naturally handled by estimating probabilities rather than accurate probabilities making hard decisions (Zadrozny & Elkan, 2001); We now analyze two major paradigms for probability esti- Non-standard learning tasks, where problem constraints mation, and study their possible failure modes. demand estimating uncertainty. For example, in the task of learning from only positive and unlabelled examples, train- 3.1. Optimization of a proper loss ing a probabilistic model that distinguishes labelled versus unlabelled examples is a provably (under some assump- A direct approach to predicting probabilities is to optimize tions) sufficient strategy (Elkan & Noto, 2008). a proper loss function on the training data using some hy- pothesis class, e.g. linear separators. Examples include lo- Intuitively, probability estimates hˆ (·) are accurate if, on av- gistic regression and linear regression (after truncation to erage, they are close to the true probability h(·). Quantify- [0;1]), which are instances of the generalized linear model ing “close to” requires picking some sensible discrepancy T framework, which assumes E[yjx] = f (w x) for some link measure, and this idea is formalized by the theory of proper function f (·). The loss-dependent error measure, L`(h;sˆ), loss functions, which we now discuss. A model for binary is one metric by which we can choose amongst proper classification uses a loss function ` : f0;1g × R ! R+ to losses. For example, the discrepancy measures for square measure the discrepancy between a label y and the model’s and logistic loss are (Zhang, 2004) predictions ˆ for some example x. If our model outputs 2 probability estimates hˆ by transforming scores with a link Lsquare(h;sˆ) = (h − sˆ) +C1 (2) function f (·), we may equivalently think of there being a 1 P P Llogistic(h;sˆ) = KL h +C2; (3) probabilistic loss ` (·;·) such that `(y;sˆ) = ` (y; f (sˆ)). The 1 + e−sˆ empirical error ofs ˆ with respect to the loss ` is where KL denotes the Kullback-Leibler divergence, and n 1 C1;C2 are independent of the predictions ˆ. Based on this, Eemp(sˆ(·)) = ∑ `(yi;sˆ(xi)); n i=1 Zhang (2004) notes that logistic regression has difficulty when h(x)(1 − h(x)) ≈ 0 for some x, by virtue of requir- which is a surrogate for the generalization error ing jsˆ(x)j ! ¥. This has been observed in practical uses of logistic regression with imbalanced classes (King & Zeng, E (sˆ(·)) = ExEyjx`(y;sˆ(x)) 2001; Foster & Stine, 2004), with the latter proposing the = Ex [h(x)`(1;sˆ(x)) + (1 − h(x))`(0;sˆ(x))] use of linear regression as a more robust alternative. := ExL`(h(x);sˆ(x)): (1) 3.2. Post-processing methods The term L`(h;sˆ) is a measure of discrepancy between an example’s probability of being positive and its pre- A distinct strategy is to train a model in some manner, ∗ dicted score. Let s (h) = argmins L`(h;s). Then, we and then extract probability estimates from it in a post- call a loss function ` Bayes consistent (Buja et al., 2005) processing step. Three popular techniques of this type ∗ 1 if for every h 2 [0;1], s (h) · (h − 2 ) ≥ 0, meaning that are Platt scaling (Platt, 1999), binning (Zadrozny & Elkan, we have the same sign as the optimal prediction under 2001), and isotonic regression (Zadrozny & Elkan, 2002). the 0-1 loss `(y;sˆ) = 1[ysˆ ≤ 0]. If s∗(h) is invertible, We focus on the latter, as it is more flexible than the former Predicting accurate probabilities with a ranking loss two approaches by virtue of being nonparametric, and has up to the choice of link function, i.e. h(x) = f (wT x), but been shown to work well empirically for a range of input f (·) is not the sigmoid function. The maximum likelihood models (Niculescu-Mizil & Caruana, 2005). estimates of a generalized linear model with a misspecified link function are known to be asymptotically biased (Czado Isotonic regression is a nonparametric technique to find a & Santner, 1992). Isotonic regression alleviates this partic- monotone fit to a set of target values.

Load more