Analysis of Classification Models

The Essentials of Data Analytics and Machine Learning
[A guide for anyone who wants to learn practical machine learning using R]

Author: Dr. Mike Ashcroft
Editor: Ali Syed

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.

© 2016 Dr. Michael Ashcroft and Persontyle Limited

Module 10: Analysis of Classification Models

We introduced misclassification error in module 9, where we also discussed the use of mean squared error for probability estimates. These are the most common error scores used for fitting parameters. You should note that misclassification is the simple accuracy statistic discussed in the next section, and hence shares its problems.

In this module we will see a number of alternative performance measures, including precision, recall, and other confusion-matrix-based statistics, as well as AUROC. It is also possible to use the Gini index and cross-entropy as error scores. However, these are most commonly used in the generation of decision trees, and we defer introducing them until we discuss tree-based methods in module 15. Besides basic performance measures, we also discuss cost-weighted performance optimization, confidence intervals, and significance tests for comparing two classifiers.

Confusion Matrices

Confusion matrices are an excellent way of providing all information about the performance of your classifier on test data. They provide a matrix of actual values vs classified values. These classified values may include special values, such as "uncertain", if your model outputs them. The simplest version is the binary confusion matrix. When we are seeking to classify objects as being of a class or not, a number of terms are associated with the different elements:

Actual Class \ Predicted Class    F                      T
F                                 TN – true negative     FP – false positive
T                                 FN – false negative    TP – true positive

False positives and false negatives are also called type I and type II errors respectively. A number of statistics can be read directly off such confusion matrices. Unfortunately, the terminology differs between fields:

Accuracy                                                     (TN+TP)/(TN+FP+FN+TP)
Error rate                                                   (FN+FP)/(TN+FP+FN+TP)
Recall / Sensitivity / True Positive Rate / Hit Rate (TPR)   TP/(TP+FN)
Specificity / True Negative Rate (TNR)                       TN/(TN+FP)
Precision / Positive Predictive Value                        TP/(TP+FP)
Negative Predictive Value                                    TN/(TN+FN)
False Omission Rate                                          FN/(TN+FN)
False Discovery Rate                                         FP/(FP+TP)
Fall Out / False Positive Rate (FPR)                         FP/(FP+TN)
False Negative Rate / Miss Rate (FNR)                        FN/(FN+TP)
Positive Likelihood Ratio (LR+)                              TPR/FPR
Negative Likelihood Ratio (LR−)                              FNR/TNR
Diagnostic Odds Ratio                                        LR+/LR−
F1 Score                                                     2TP/(2TP+FP+FN)
Balanced Accuracy                                            (TPR+TNR)/2
Informedness                                                 TPR+TNR−1

The most important of these are accuracy, recall, and precision, and these are the terms we will use for them. The importance of accuracy should be obvious, but in fact recall and precision are generally the most important statistics of a classifier. Imagine we are testing for a rare cancer that occurs in only 1 in 10,000 individuals. A classifier that always estimates that the cancer is absent will have 99.99% accuracy on future data and is completely useless. It has high accuracy, but zero recall and precision.
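As a minimal sketch (the confusion_stats helper and the counts below are illustrative assumptions, not part of the original text), these statistics can be computed directly from the four cells of a binary confusion matrix in R:

confusion_stats <- function(TN, FP, FN, TP) {
  # Compute the key statistics from the four confusion matrix cells.
  list(
    accuracy    = (TN + TP) / (TN + FP + FN + TP),
    recall      = TP / (TP + FN),   # sensitivity / true positive rate
    precision   = TP / (TP + FP),   # positive predictive value
    specificity = TN / (TN + FP)    # true negative rate
  )
}

# Rare-cancer example: 10,000 people, 1 case, classifier always says "no cancer".
confusion_stats(TN = 9999, FP = 0, FN = 1, TP = 0)
# accuracy = 0.9999, recall = 0, precision = 0/0 (NaN in R)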
Let us imagine a second situation, where we attempt to identify aliased social media accounts (different accounts that belong to the same individual). To do this, we create a classifier that classifies whether pairs of accounts are aliases or not. We train and test the model on balanced data, where half of the pairs are aliased and half are not. Let us imagine our results are:

Actual \ Predicted    Not Aliased    Aliased
Not Aliased           490            10
Aliased               20             480

Here our statistics are:

Accuracy:  970/1000 = 97%
Recall:    480/500  = 96%
Precision: 480/490  ≈ 97.96%

We also have:

True Negative Rate: 490/500 = 98%

The recall (true positive rate) and the true negative rate provide us with estimates of the class accuracies of the classifier, that is to say, the accuracy at predicting unaliased accounts as unaliased and aliased accounts as aliased. Let us consider what these rates would mean when dealing with wild, unbalanced data. It is likely that very few pairs of accounts are actually aliased; let's estimate the real number at 1 in a million. If we used our classifier on 1 billion pairs of accounts in the wild, we would expect NA = 999,999,000 not-aliased pairs and A = 1,000 aliased pairs, and hence:

Actual \ Predicted    Not Aliased                     Aliased
Not Aliased           NA × 490/500 = 979,999,020      NA × 10/500 = 19,999,980
Aliased               A × 20/500 = 40                 A × 480/500 = 960

With statistics:

Accuracy:  ≈ 98%
Recall:    96%
Precision: 960/20,000,940 ≈ 0.005%

With the class balance found in the wild, our precision is washed away by the inevitable deluge of false positives due to the massive preponderance of the negative class. In such a situation it is clear that we should value precision far more highly than recall when evaluating our models. Of course, we have made things difficult by using a balanced dataset for training and testing. But this may be suitable, and certainly we will need to use data that has a far more equal balance than that found in the wild, so as to have sufficient positive cases to hope to find any pattern in them and to avoid our classifier defaulting to simply classifying everything as not-aliased. A rule of thumb is that few machine learning algorithms work well with binary data where one class has less than 10% of the cases.

There are reverse cases where recall rather than precision is to be valued, though since we are often interested in classifying unusual or valuable events, it is often the case that our positive class is outnumbered by the negative one and precision is rightly valued more highly. In any case, you should think about the balance of classes in the population as well as in the training and testing data. Depending on this balance, you should consider selecting models on the basis of precision or recall, or a weighted combination of both, rather than merely accuracy. In such cases, or when you are unsure of the population balance, you may also wish to use some of the statistics towards the end of our list that evaluate models based on both recall and precision, such as balanced accuracy or informedness.
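To make the effect of the base rate concrete, here is a short R sketch (wild_precision is a hypothetical helper, not from the original text) that applies the class-conditional rates estimated on the balanced test set to populations with different positive rates:

tpr <- 480 / 500   # recall estimated on the balanced test set
fpr <- 10 / 500    # false positive rate estimated on the balanced test set

wild_precision <- function(n_pairs, positive_rate) {
  # Expected precision when the fixed TPR and FPR are applied to a population
  # with the given proportion of positive (aliased) pairs.
  pos <- n_pairs * positive_rate
  neg <- n_pairs - pos
  tp  <- pos * tpr
  fp  <- neg * fpr
  tp / (tp + fp)
}

wild_precision(1e9, 0.5)    # balanced classes: about 0.98
wild_precision(1e9, 1e-6)   # 1 in a million: about 0.000048, i.e. roughly 0.005%

The class-conditional rates never change; only the mix of actual positives and negatives does, and that alone is enough to collapse precision.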
The caret package has a confusionMatrix function that will produce a confusion matrix and a number of statistics. To demonstrate this, let us generate some possible estimates of a pretend classifier (setting the seed for reproducibility):

> set.seed(0)
> y=sample(c(T,F),100,replace=T)
> m1=y
> m1_err=sample(1:100,sample(15:30,1))
> m1[m1_err]=!m1[m1_err]

Now we can type (depending on your version of caret, you may first need to convert the logical vectors to factors, e.g. factor(m1) and factor(y)):

> caret::confusionMatrix(m1,y,positive="TRUE")
Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE    33    9
     TRUE     15   43

               Accuracy : 0.76
                 95% CI : (0.6643, 0.8398)
    No Information Rate : 0.52
    P-Value [Acc > NIR] : 6.939e-07

                  Kappa : 0.5169
 Mcnemar's Test P-Value : 0.3074

            Sensitivity : 0.8269
            Specificity : 0.6875
         Pos Pred Value : 0.7414
         Neg Pred Value : 0.7857
             Prevalence : 0.5200
         Detection Rate : 0.4300
   Detection Prevalence : 0.5800
      Balanced Accuracy : 0.7572

       'Positive' Class : TRUE

We will see how to interpret a number of the new statistics given here in later sections.

ROC

The receiver operating characteristic (ROC) provides a means of evaluating both models and parameters of models. ROC space is given by the true and false positive rates, both ranging between 0 and 1 inclusive. A point in this space specifies a model's performance, with the vertical coordinate giving the proportion of true positives vs false negatives, and the horizontal coordinate giving the proportion of false positives vs true negatives. A perfect classifier would give results that map to X = 0 (0% false positives, so 100% true negatives) and Y = 1 (100% true positives, so 0% false negatives). This corresponds to the green dot in the plot to the left.

A model that performs no better than random guessing would fall on the red line, the diagonal where the true positive rate equals the false positive rate. Its location on this line would depend on the ratio of the classes in the data being classified, also termed the positive and negative base rates. Accordingly, a model's performance can be judged by its distance from the red line. Strictly speaking, this assumes that all models will be better than random guessing (and fall in the top-left half of the ROC graph). However, note that were a model to be worse than random guessing (and fall in the lower-right half of the ROC graph), we could use it as a model that is better than random guessing simply by negating its predictions. The distance from the red line of this inverted model, located in the top-left half, would be equal to the distance from the red line of the uninverted model, located in the bottom-right half. Solely in the case of binary classification, a terrible model is as valuable as a wonderful one.

Evaluation is often done visually, but it is simple to specify the distance of a point to the red line, dROC, analytically:

dROC = √((TPR − FPR)² / 2)

This statistic has a very nice characteristic: it is invariant to changes in the base rates of the classes.
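As an illustrative sketch (the d_roc helper and the example rates are assumptions, not part of the original text), this distance can be computed directly from a classifier's true and false positive rates:

d_roc <- function(TPR, FPR) {
  # Distance from the point (FPR, TPR) to the random-guessing diagonal TPR = FPR.
  sqrt((TPR - FPR)^2 / 2)
}

# The aliased-accounts classifier from earlier: TPR = 0.96, FPR = 0.02.
d_roc(TPR = 0.96, FPR = 0.02)   # about 0.665
d_roc(TPR = 1, FPR = 0)         # a perfect classifier: 1/sqrt(2), about 0.707

Because the computation uses only the class-conditional rates TPR and FPR, changing the proportion of positive and negative cases in the data leaves the value unchanged.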