CIS 520: Machine Learning                                        Spring 2018: Lecture 10

Performance Measures

Lecturer: Shivani Agarwal

Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the material discussed in the lecture (and vice versa).

Outline
• Introduction
• Binary labels
• Multiclass labels

1 Introduction

So far, in supervised learning problems with binary (or multiclass) labels, we have focused mostly on classification with the 0-1 loss (where the error on an example is zero if a model predicts the correct label and one if it predicts a wrong label). In many problems, however, performance is measured differently from the 0-1 loss. In general, for any learning problem, the choice of a learning algorithm should factor in the type of training data available, the type of model that is desired to be learned, and the performance measure that will be used to evaluate the model; in essence, the performance measure defines the objective of learning.¹ Here we discuss a variety of performance measures used in practice in settings with binary as well as multiclass labels.

¹ Other factors to consider when choosing an algorithm include training and prediction times, memory/storage requirements, interpretability or other requirements on the learned model, possible prior knowledge, etc.

2 Binary Labels

Say we are given training examples $S = ((x_1, y_1), \ldots, (x_m, y_m))$ with instances $x_i \in X$ and binary labels $y_i \in \{\pm 1\}$. There are several types of models one might want to learn; for example, the goal could be to learn a classification model $h : X \to \{\pm 1\}$ that predicts the binary label of a new instance, or to learn a class probability estimation (CPE) model $\hat{\eta} : X \to [0, 1]$ that predicts the probability of a new instance having label +1, or to learn a ranking or scoring model $f : X \to \mathbb{R}$ that assigns higher scores to positive instances than to negative ones. Let's consider each of these in turn; in each case, several different performance measures are used in practice.

2.1 Binary Classification

Here the goal is to learn a classification model $h : X \to \{\pm 1\}$ that predicts the binary label of a new instance.

2.1.1 0-1 Loss

The most common performance measure for binary classification, and the one we have focused on so far, is the 0-1 loss, which simply assigns a fixed penalty of 1 if the predicted label $\hat{y}$ differs from the true label $y$, and a penalty of 0 otherwise:

                 ŷ = −1    ŷ = +1
    y = −1         0          1
    y = +1         1          0

Here, given a new test sample $((x_1', y_1'), \ldots, (x_n', y_n'))$, one computes the 0-1 loss on each example $(x_i', y_i')$ and evaluates the model $h$ in terms of the average loss:

$$
\mathrm{er}^{0\text{-}1}_{\mathrm{test}}[h] = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\big(h(x_i') \neq y_i'\big)\,.
$$

As we have seen, several algorithms are well suited for binary classification with the 0-1 loss, including Naïve Bayes, logistic regression, SVMs, neural networks, nearest neighbors, decision trees, boosting, etc.
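As a quick illustration, here is a minimal NumPy sketch of how the 0-1 test error above might be computed; the names (zero_one_error, X_test, y_test) are purely illustrative, and the classifier h is assumed to be a Python function mapping an array of instances to ±1 labels.

```python
import numpy as np

def zero_one_error(h, X_test, y_test):
    """Average 0-1 loss of a classifier h on a test sample.

    h      : callable mapping an (n, d) array of instances to an (n,) array of +/-1 labels
    X_test : (n, d) array of test instances x'_1, ..., x'_n
    y_test : (n,) array of test labels y'_i in {-1, +1}
    """
    y_pred = h(X_test)
    return np.mean(y_pred != y_test)   # fraction of misclassified test examples

# Illustrative usage: a toy classifier that thresholds the first feature at 0.
h = lambda X: np.where(X[:, 0] > 0, 1, -1)
X_test = np.array([[0.5], [-1.2], [2.0]])
y_test = np.array([1, -1, -1])
print(zero_one_error(h, X_test, y_test))   # 1/3: only the last example is misclassified
```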
2.1.2 Cost-Sensitive Loss (Asymmetric Classification Costs)

In many applications, the cost of a false positive (predicting +1 when the true label is −1) is different from the cost of a false negative (predicting −1 when the true label is +1). For example, in medical diagnosis, classifying a patient with a disease as healthy is generally a very costly error, since it means the patient may not receive treatment; classifying a healthy patient as having a disease is also costly but usually less so (in the latter case, one might be able to conduct follow-up tests and determine the patient does not need treatment, or in the worst case, even if the patient is given a treatment he/she does not need, it may not be a threat to the patient's life). In such settings, we have asymmetric classification costs, say $c \in (0, 1)$ for false positives and $1 - c$ for false negatives (the costs can always be scaled to add up to 1 without affecting the problem):

                 ŷ = −1    ŷ = +1
    y = −1         0          c
    y = +1       1 − c        0

Here, given a new test sample $((x_1', y_1'), \ldots, (x_n', y_n'))$, one computes the above cost-sensitive loss on each example $(x_i', y_i')$ and evaluates the model $h$ in terms of the average loss:

$$
\mathrm{er}^{c}_{\mathrm{test}}[h] = \frac{1}{n} \sum_{i=1}^{n} \Big[ c \, \mathbf{1}\big(y_i' = -1,\, h(x_i') = +1\big) + (1 - c) \, \mathbf{1}\big(y_i' = +1,\, h(x_i') = -1\big) \Big]\,.
$$

For binary classification with such a cost-sensitive loss, two common approaches are the following:

• Class probability estimation model with a modified threshold. Say we have learned a CPE model $\hat{\eta} : X \to [0, 1]$ (see also below). Then instead of making binary predictions by thresholding $\hat{\eta}(x)$ at $\frac{1}{2}$ as done for the 0-1 loss, for the cost-sensitive loss above we make binary predictions by thresholding at $c$ instead:

$$
h(x) = \mathrm{sign}\big(\hat{\eta}(x) - c\big) = \begin{cases} +1 & \text{if } \hat{\eta}(x) > c \\ -1 & \text{otherwise.} \end{cases}
$$

To see why this makes sense, denote by $\eta(x) = \mathbf{P}(Y = +1 \mid X = x)$ the true probability of a positive label given $x$. Now, given an instance $x$, if we predict a label of +1, then the expected loss is

$$
\eta(x) \cdot 0 + (1 - \eta(x)) \cdot c = (1 - \eta(x)) \cdot c\,.
$$

Similarly, for a prediction of −1, the expected loss is

$$
\eta(x) \cdot (1 - c) + (1 - \eta(x)) \cdot 0 = \eta(x) \cdot (1 - c)\,.
$$

Therefore, if we knew $\eta(x)$, the optimal prediction would be +1 if $(1 - \eta(x)) \cdot c < \eta(x) \cdot (1 - c)$, i.e. if $\eta(x) > c$, and −1 otherwise. The above classification model $h$ simply uses the CPE model's estimated probability $\hat{\eta}(x)$ instead of the true probability $\eta(x)$ in this decision rule.

• Weighted surrogate loss minimization. For algorithms that minimize a surrogate loss over the training sample, such as logistic regression (logistic loss) and SVMs (hinge loss), an alternative approach is to incorporate the cost-sensitive loss above by replacing the usual loss minimization

$$
\min_{f} \; \frac{1}{m} \sum_{i=1}^{m} \ell\big(y_i, f(x_i)\big)
$$

with the weighted loss minimization

$$
\min_{f} \; \frac{1}{m} \sum_{i=1}^{m} \Big[ c \, \ell\big(y_i, f(x_i)\big) \cdot \mathbf{1}(y_i = -1) + (1 - c) \, \ell\big(y_i, f(x_i)\big) \cdot \mathbf{1}(y_i = +1) \Big]\,.
$$

This has the effect of weighing the losses on positive and negative examples differently during training. In this case, after obtaining $f : X \to \mathbb{R}$ through the above minimization, the final classifier is obtained by thresholding $f$ at 0 as usual: $h(x) = \mathrm{sign}(f(x))$.
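To make the first approach concrete, here is a minimal sketch of cost-sensitive prediction by thresholding a CPE model at $c$, together with the cost-sensitive test error defined above. The names are illustrative; eta_hat is assumed to be any function returning estimated probabilities of a positive label.

```python
import numpy as np

def cost_sensitive_predict(eta_hat, X, c):
    """Predict +1 iff the estimated probability of a positive label exceeds the threshold c."""
    return np.where(eta_hat(X) > c, 1, -1)

def cost_sensitive_error(y_true, y_pred, c):
    """Average cost-sensitive loss: cost c per false positive, cost (1 - c) per false negative."""
    false_pos = (y_true == -1) & (y_pred == 1)
    false_neg = (y_true == 1) & (y_pred == -1)
    return np.mean(c * false_pos + (1 - c) * false_neg)

# Illustrative usage with a toy CPE model and c = 0.2 (false negatives are 4x as costly).
eta_hat = lambda X: 1.0 / (1.0 + np.exp(-X[:, 0]))   # e.g. some fitted probability model
X_test = np.array([[-0.5], [0.3], [2.0]])
y_test = np.array([1, -1, 1])
y_pred = cost_sensitive_predict(eta_hat, X_test, c=0.2)
print(cost_sensitive_error(y_test, y_pred, c=0.2))
```

For the second approach, many libraries expose per-class weights: in scikit-learn, for instance, it roughly corresponds to passing class_weight={-1: c, +1: 1 - c} to LogisticRegression or SVC (assuming the labels are encoded as −1/+1), which reweights each example's loss term by the weight of its class.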
2.1.3 Complex Performance Measures

In many settings, the performance measure is more complex and cannot be expressed as an average loss over individual examples. For example, in settings with class imbalance, where one class is rare compared to the other, one often needs to consider more complex performance measures. To see the need for this, consider a situation where only 1% of all examples are positive. Here, if one uses the 0-1 loss, one can obtain 99% accuracy just by predicting all examples to be negative! Clearly, a different approach is needed.

Often in class imbalance settings, one looks at the error rates on the positive and negative examples separately, and then combines these in some way. In particular, given a test sample $((x_1', y_1'), \ldots, (x_n', y_n'))$, one can compute the true positive rate (TPR) and true negative rate (TNR) of the model $h$ as follows:²

$$
\mathrm{TPR}[h] = \frac{\sum_{i=1}^{n} \mathbf{1}\big(y_i' = +1,\, h(x_i') = +1\big)}{\sum_{i=1}^{n} \mathbf{1}(y_i' = +1)}\,, \qquad
\mathrm{TNR}[h] = \frac{\sum_{i=1}^{n} \mathbf{1}\big(y_i' = -1,\, h(x_i') = -1\big)}{\sum_{i=1}^{n} \mathbf{1}(y_i' = -1)}\,.
$$

TPR is simply the fraction of positive examples that are predicted correctly; TNR is the fraction of negative examples that are predicted correctly. Then some common performance measures used to evaluate classification models in class imbalance settings are the arithmetic mean (AM) and geometric mean (GM) of the TPR and TNR:

$$
\mathrm{AM}[h] = \tfrac{1}{2}\big(\mathrm{TPR}[h] + \mathrm{TNR}[h]\big)\,, \qquad
\mathrm{GM}[h] = \sqrt{\mathrm{TPR}[h] \cdot \mathrm{TNR}[h]}\,.
$$

Note that higher values of AM and GM are better.

² In some applications such as computational biology, TPR is commonly referred to as sensitivity and TNR as specificity.

Another type of setting where complex performance measures arise is when using a classification model for a 'retrieval' or 'detection' system, where the goal is to successfully retrieve or detect members of one class. For example, in information retrieval, where documents may be relevant or irrelevant, one is interested in successfully 'retrieving' the relevant documents; similarly, in object detection problems in computer vision, one is interested in successfully 'detecting' some class of objects of interest (such as faces or cars). In these settings too, there is typically class imbalance (relevant documents are fewer than irrelevant documents; objects of interest are present in only a few of the image windows searched by a detection system). However, the classification accuracy on the negative class is not of as much interest here, since even a high TNR may mean many false retrievals/detections (e.g. 99% TNR would still mean falsely detecting a face in 1% of image windows searched, which could be a very large number compared to the actual number of faces present); instead, one often looks at the recall (which is the same as TPR) and the precision of the classification model $h$:

$$
\mathrm{Recall}[h] = \mathrm{TPR}[h]\,, \qquad
\mathrm{Precision}[h] = \frac{\sum_{i=1}^{n} \mathbf{1}\big(y_i' = +1,\, h(x_i') = +1\big)}{\sum_{i=1}^{n} \mathbf{1}\big(h(x_i') = +1\big)}\,.
$$

Precision is the fraction of positive predictions that actually have positive labels.
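The quantities in this subsection are straightforward to compute from a labeled test sample. Below is a minimal NumPy sketch with illustrative names; it assumes the test sample contains at least one positive example, one negative example, and one positive prediction, so that no denominator is zero.

```python
import numpy as np

def imbalance_metrics(y_true, y_pred):
    """TPR, TNR, AM, GM and precision for +/-1 label vectors y_true (truth) and y_pred (predictions)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))     # true positives
    tn = np.sum((y_true == -1) & (y_pred == -1))   # true negatives
    n_pos = np.sum(y_true == 1)                    # number of positive examples
    n_neg = np.sum(y_true == -1)                   # number of negative examples
    n_pred_pos = np.sum(y_pred == 1)               # number of positive predictions
    tpr, tnr = tp / n_pos, tn / n_neg
    return {
        "TPR (recall)": tpr,
        "TNR": tnr,
        "AM": 0.5 * (tpr + tnr),
        "GM": np.sqrt(tpr * tnr),
        "Precision": tp / n_pred_pos,
    }

# Illustrative usage on a small imbalanced test sample.
y_true = np.array([1, -1, -1, -1, -1, 1])
y_pred = np.array([1, -1, -1, 1, -1, -1])
print(imbalance_metrics(y_true, y_pred))   # TPR 0.5, TNR 0.75, AM 0.625, GM ~0.61, Precision 0.5
```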