Performance Measures for Machine Learning

Performance Measures

• Accuracy
• Weighted (Cost-Sensitive) Accuracy
• Lift
• ROC – ROC Area
• Precision/Recall – F – Break Even Point
• Similarity of Various Performance Metrics via MDS (Multi-Dimensional Scaling)


Accuracy

• Target: 0/1, -1/+1, True/False, …
• Prediction = f(inputs) = f(x): 0/1 or Real
• Threshold: f(x) > thresh => 1, else => 0
• If threshold(f(x)) and targets are both 0/1:

    accuracy = 1 - \frac{1}{N} \sum_{i=1}^{N} \left| target_i - threshold(f(x_i)) \right|

• #right / #total
• p("correct"): p(threshold(f(x)) = target)

                 Predicted 1      Predicted 0
    True 1       a (correct)      b (incorrect)
    True 0       c (incorrect)    d (correct)

    accuracy = (a + d) / (a + b + c + d)
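A minimal sketch of thresholded accuracy in Python (the function name and NumPy implementation are mine, not from the slides):

```python
import numpy as np

def thresholded_accuracy(targets, scores, thresh=0.5):
    """Accuracy of 0/1 predictions obtained by thresholding real-valued f(x).

    Implements accuracy = 1 - (1/N) * sum(|target_i - threshold(f(x_i))|),
    which equals (a + d) / (a + b + c + d) on the confusion matrix.
    """
    targets = np.asarray(targets)
    preds = (np.asarray(scores) > thresh).astype(int)  # f(x) > thresh => 1, else 0
    return 1.0 - np.mean(np.abs(targets - preds))

# e.g. thresholded_accuracy([1, 0, 1], [0.9, 0.2, 0.4]) -> 2/3
```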

Four equivalent ways to label the confusion-matrix cells:

                 Predicted 1        Predicted 0
    True 1       true positive      false negative
    True 0       false positive     true negative

                 Predicted 1    Predicted 0
    True 1           TP             FN
    True 0           FP             TN

                 Predicted 1        Predicted 0
    True 1       hits               misses
    True 0       false alarms       correct rejections

                 Predicted 1    Predicted 0
    True 1       P(pr1|tr1)     P(pr0|tr1)
    True 0       P(pr1|tr0)     P(pr0|tr0)

Prediction Threshold

• threshold > MAX(f(x))
  – all cases predicted 0
  – (b + d) = total
  – accuracy = %False = %0's

                 Predicted 1    Predicted 0
    True 1           0              b
    True 0           0              d

• threshold < MIN(f(x))
  – all cases predicted 1
  – (a + c) = total
  – accuracy = %True = %1's

                 Predicted 1    Predicted 0
    True 1           a              0
    True 0           c              0

Problems with Accuracy

• Assumes equal cost for both kinds of errors
  – cost(b-type error) = cost(c-type error)
• Is 99% accuracy good?
  – can be excellent, good, mediocre, poor, terrible
  – depends on problem
• Is 10% accuracy bad?
  – information retrieval
• BaseRate = accuracy of predicting the predominant class (on most problems obtaining BaseRate accuracy is easy)

[Figure: distribution of predicted values with the optimal threshold marked; 82% 0's in data, 18% 1's in data]


Percent Reduction in Error

• 80% accuracy = 20% error
• suppose learning increases accuracy from 80% to 90%
• error reduced from 20% to 10%
• 50% reduction in error
• 99.90% to 99.99% = 90% reduction in error
• 50% to 75% = 50% reduction in error
• can be applied to many other measures

Costs (Error Weights)

                 Predicted 1    Predicted 0
    True 1          w_a            w_b
    True 0          w_c            w_d

• Often w_a = w_d = zero and w_b ≠ w_c ≠ zero
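A minimal sketch of the cost-weighted idea above, assuming the cell weights from the slide; the function name and the particular weight values are illustrative, not from the slides:

```python
import numpy as np

def weighted_error(targets, preds, w_a=0.0, w_b=1.0, w_c=5.0, w_d=0.0):
    """Cost-weighted error: each confusion-matrix cell contributes its weight.

    With w_a = w_d = 0 and w_b != w_c, false negatives (b) and
    false positives (c) are penalized differently.
    """
    targets = np.asarray(targets)
    preds = np.asarray(preds)
    a = np.sum((targets == 1) & (preds == 1))   # true positives
    b = np.sum((targets == 1) & (preds == 0))   # false negatives
    c = np.sum((targets == 0) & (preds == 1))   # false positives
    d = np.sum((targets == 0) & (preds == 0))   # true negatives
    return (w_a * a + w_b * b + w_c * c + w_d * d) / (a + b + c + d)
```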



Lift

• not interested in accuracy on entire dataset
• want accurate predictions for 5%, 10%, or 20% of dataset
• don't care about remaining 95%, 90%, 80%, resp.
• typical application: marketing
• how much better than random prediction on the fraction of the dataset predicted true (f(x) > threshold)

                 Predicted 1    Predicted 0
    True 1           a              b
    True 0           c              d

    lift = \frac{a / (a + b)}{(a + c) / (a + b + c + d)}

    lift(threshold) = \frac{\%\,positives > threshold}{\%\,dataset > threshold}
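A minimal sketch of lift at a given threshold (function name is mine; assumes at least one case scores above the threshold):

```python
import numpy as np

def lift_at_threshold(targets, scores, thresh):
    """lift = (% of positives above threshold) / (% of dataset above threshold)."""
    targets = np.asarray(targets)
    predicted_pos = np.asarray(scores) > thresh
    pct_positives_above = predicted_pos[targets == 1].mean()  # a / (a + b)
    pct_dataset_above = predicted_pos.mean()                  # (a + c) / N
    return pct_positives_above / pct_dataset_above
```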


[Figure: lift curve; e.g., lift = 3.5 if mailings are sent to 20% of the customers]

Lift and Accuracy do not always correlate well

[Figure: Problem 1 vs. Problem 2 (thresholds arbitrarily set at 0.5 for both lift and accuracy)]

ROC Plot and ROC Area

• Receiver Operator Characteristic
• Developed in WWII to statistically model false positive and false negative detections of radar operators
• Better statistical foundations than most other measures
• Standard measure in medicine and biology
• Becoming more popular in ML

ROC Plot

• Sweep threshold and plot (see the sketch below)
  – TPR vs. FPR
  – Sensitivity vs. 1 − Specificity
  – P(true|true) vs. P(true|false)
• Sensitivity = a/(a+b) = LIFT numerator = Recall (see later)
• 1 − Specificity = 1 − d/(c+d)
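A minimal sketch of the threshold sweep that traces out the ROC curve (function name and implementation are mine, not from the slides):

```python
import numpy as np

def roc_points(targets, scores):
    """Sweep the threshold over all unique scores, collecting (FPR, TPR) pairs.

    TPR = sensitivity = a/(a+b); FPR = 1 - specificity = c/(c+d).
    """
    targets = np.asarray(targets)
    scores = np.asarray(scores)
    pos, neg = (targets == 1), (targets == 0)
    points = [(0.0, 0.0)]                       # threshold above max: all predicted 0
    for t in np.sort(np.unique(scores))[::-1]:  # sweep from high to low
        pred1 = scores >= t
        tpr = pred1[pos].mean()                 # a / (a + b)
        fpr = pred1[neg].mean()                 # c / (c + d)
        points.append((fpr, tpr))
    return points                               # ends at (1, 1): all predicted 1
```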


Properties of ROC

• ROC Area:
  – 1.0: perfect prediction
  – 0.9: excellent prediction
  – 0.8: good prediction
  – 0.7: mediocre prediction
  – 0.6: poor prediction
  – 0.5: random prediction
  – <0.5: something wrong!

[Figure: ROC plot; diagonal line is random prediction]


Wilcoxon-Mann-Whitney

    ROCA = 1 - \frac{\#\,pairwise\ inversions}{\#POS \times \#NEG}

where

    \#\,pairwise\ inversions = \sum_{i,j} I\big[(P(x_i) > P(x_j)) \wedge (T(x_i) < T(x_j))\big]

Example (target, predicted value), sorted by predicted value:

    target   f(x)
      1      0.9
      1      0.8
      1      0.7
      1      0.6
      0      0.5
      0      0.4
      0      0.3
      0      0.3
      0      0.2
      0      0.1
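A minimal sketch of the Wilcoxon-Mann-Whitney computation (mine, not from the slides; like the formula above it counts only strict inversions, whereas the usual convention credits positive/negative ties 1/2):

```python
import numpy as np

def roc_area(targets, scores):
    """ROC area via Wilcoxon-Mann-Whitney: 1 - inversions / (#POS * #NEG).

    An inversion is a (positive, negative) pair in which the negative
    is scored strictly higher than the positive.
    """
    targets = np.asarray(targets)
    scores = np.asarray(scores)
    pos_scores = scores[targets == 1]
    neg_scores = scores[targets == 0]
    # O(#POS * #NEG) pairwise comparison via broadcasting; fine for small examples
    inversions = np.sum(neg_scores[:, None] > pos_scores[None, :])
    return 1.0 - inversions / (len(pos_scores) * len(neg_scores))

# The perfectly ranked example above has no inversions:
# roc_area([1]*4 + [0]*6, [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.3, 0.2, 0.1]) -> 1.0
```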


Two more examples (target, predicted value):

    target   f(x)        target   f(x)
      1      0.9           1      0.9
      1      0.8           1      0.8
      1      0.7           1      0.7
      1      0.6           1      0.6
      0      0.65          0      0.75
      0      0.4           0      0.4
      0      0.3           0      0.3
      0      0.3           0      0.3
      0      0.2           0      0.2
      0      0.1           0      0.1
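Counting inversions by hand (my arithmetic, not from the slides): on the left the only negative outscoring a positive is 0.65 > 0.6, so ROCA = 1 − 1/24 ≈ 0.96; on the right 0.75 outscores both 0.7 and 0.6, so ROCA = 1 − 2/24 ≈ 0.92.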



Two further examples (target, predicted value):

    target   f(x)        target   f(x)
      1      0.9           1      0.9
      1      0.25          1      0.8
      1      0.7           1      0.7
      1      0.6           1      0.6
      0      0.75          0      0.75
      0      0.4           0      0.4
      0      0.3           0      0.3
      0      0.3           0      0.93
      0      0.2           0      0.2
      0      0.1           0      0.1
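By the same hand count (mine, not from the slides): on the left, 0.75 outscores 0.25, 0.7, and 0.6, and 0.4, 0.3, and 0.3 each outscore 0.25, giving 6 inversions; on the right, 0.75 outscores 0.7 and 0.6, and 0.93 outscores all four positives, also 6 inversions. Both give ROCA = 1 − 6/24 = 0.75.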


Further examples (target, predicted value):

    target   f(x)        target   f(x)
      1      0.09          1      0.09
      1      0.25          1      0.025
      1      0.07          1      0.07
      1      0.06          1      0.06
      0      0.75          0      0.75
      0      0.4           0      0.4
      0      0.3           0      0.3
      0      0.3           0      0.3
      0      0.2           0      0.2
      0      0.1           0      0.1


Properties of ROC

• Slope is non-increasing
• Each point on the ROC represents a different tradeoff (cost ratio) between false positives and false negatives
• Slope of the line tangent to the curve defines the cost ratio
• ROC Area represents performance averaged over all possible cost ratios
• If two ROC curves do not intersect, one method dominates the other
• If two ROC curves intersect, one method is better for some cost ratios, and the other method is better for other cost ratios

[Figure: ROC curves for Problem 1 and Problem 2]


Precision and Recall

• typically used in document retrieval
• Precision:
  – how many of the returned documents are correct
  – precision(threshold)
• Recall:
  – how many of the positives does the model return
  – recall(threshold)
• Precision/Recall Curve: sweep thresholds (see the sketch below)

                 Predicted 1    Predicted 0
    True 1           a              b
    True 0           c              d

    PRECISION = a / (a + c)
    RECALL = a / (a + b)
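A minimal sketch of the precision/recall threshold sweep (function name and implementation are mine, not from the slides):

```python
import numpy as np

def precision_recall_curve(targets, scores):
    """Sweep the threshold and collect (precision, recall) pairs.

    precision = a / (a + c), recall = a / (a + b), using the
    confusion-matrix labels from the slides.
    """
    targets = np.asarray(targets)
    scores = np.asarray(scores)
    curve = []
    for t in np.sort(np.unique(scores))[::-1]:  # sweep from high to low
        pred1 = scores >= t
        a = np.sum(pred1 & (targets == 1))      # true positives
        c = np.sum(pred1 & (targets == 0))      # false positives
        b = np.sum(~pred1 & (targets == 1))     # false negatives
        curve.append((a / (a + c), a / (a + b)))
    return curve
```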

Summary Stats: F & BreakEvenPt

    PRECISION = a / (a + c)
    RECALL = a / (a + b)

    F = \frac{2 \times (PRECISION \times RECALL)}{PRECISION + RECALL}    (harmonic average of precision and recall)

    BreakEvenPoint: the point on the precision/recall curve where PRECISION = RECALL
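A minimal sketch of both summary stats (mine; the break-even point is found approximately, as the curve point where precision and recall are closest to equal):

```python
def f_score(precision, recall):
    """Harmonic average of precision and recall."""
    return 2 * (precision * recall) / (precision + recall)

def break_even_point(curve):
    """Point on a list of (precision, recall) pairs where the two are closest."""
    return min(curve, key=lambda pr: abs(pr[0] - pr[1]))
```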


F and BreakEvenPoint do not always correlate well

[Figure: Problem 1 vs. Problem 2, with regions of better performance and worse performance marked]


[Figures: additional Problem 1 vs. Problem 2 comparisons]


Many Other Metrics

• Mitre F-Score
• Kappa score
• Balanced Accuracy
• RMSE (squared error)
• Log-loss (cross entropy)
• Calibration
  – reliability diagrams and summary scores
• …

Summary

• the measure you optimize for makes a difference
• the measure you report makes a difference
• use a measure appropriate for the problem/community
• accuracy often is not sufficient/appropriate
• ROC is gaining popularity in the ML community
• only a few of these (e.g. accuracy) generalize easily to >2 classes
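A minimal sketch of two of the listed metrics, RMSE and log-loss (function names and the clipping epsilon are mine, not from the slides):

```python
import numpy as np

def rmse(targets, scores):
    """Root mean squared error between 0/1 targets and predicted values."""
    targets, scores = np.asarray(targets), np.asarray(scores)
    return np.sqrt(np.mean((targets - scores) ** 2))

def log_loss(targets, scores, eps=1e-12):
    """Cross entropy between 0/1 targets and predicted probabilities."""
    targets = np.asarray(targets)
    p = np.clip(np.asarray(scores), eps, 1 - eps)  # avoid log(0)
    return -np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p))
```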


Different Models Best on Different Metrics

[Figure: Andreas Weigend, Connectionist Models Summer School, 1993]


Really does matter what you optimize!


2-D Multi-Dimensional Scaling

