San José State University Math 251: Statistical and Machine Learning Classification

Evaluation Criteria (for binary classification)

Dr. Guangliang Chen

Outline of the presentation:

• Evaluation criteria

– Precision, recall/sensitivity, specificity, F1 score

– ROC curves, AUC

• References

– Stanford lecture: http://cs229.stanford.edu/section/evaluation_metrics_spring2020.pdf

– Wikipedia page: https://en.wikipedia.org/wiki/Receiver_operating_characteristic

Motivation

The two main criteria we have been using for evaluating classifiers are

• Accuracy (overall)

• Running time

The overall accuracy does not reflect the classwise accuracy scores, and can be dominated by the largest class(es).

For example, with two imbalanced classes (80 : 20), constantly predicting the dominant label already achieves 80% overall accuracy.


The binary classification setting

More performance metrics can be introduced for binary classification, where the data points only have two different labels:

• Logistic regression: y = 1 (positive), y = 0 (negative)

• SVM: y = 1 (positive), y = −1 (negative)


Additionally, we suppose that the classifier outputs a continuous score instead of a discrete label directly, e.g.,

• Logistic regression: p(x; θ) = 1 / (1 + e^(−θ·x))

• SVM: w · x + b

A threshold will need to be specified in order to make predictions (i.e., to convert scores to labels).
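For concreteness, here is a minimal Python sketch (not from the slides) of this conversion; the score values and the threshold of 0.5 are invented for illustration.

```python
# Minimal sketch: converting continuous scores to 0/1 labels with a threshold.
# The scores and the threshold value (0.5) are illustrative, not from the slides.
scores = [0.92, 0.41, 0.77, 0.08, 0.55]   # e.g., logistic-regression outputs p(x; theta)
threshold = 0.5

labels = [1 if s >= threshold else 0 for s in scores]
print(labels)   # [1, 0, 1, 0, 1]
```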


Interpretation of the confusion table

The confusion table summarizes the 4 different combinations of true conditions and predicted labels:

                      Actual Positive (H1)    Actual Negative (H0)
Predicted Positive    TP                      FP (Type-I error)
Predicted Negative    FN (Type-II error)      TN

(H0: the test point is negative; H1: the test point is positive)

The overall accuracy of the classifier is

Accuracy = (TP + TN) / (TP + FP + FN + TN)
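The following sketch (with illustrative label vectors, not from the slides) tallies TP, FP, FN, TN from 0/1 labels and computes the overall accuracy with exactly this formula.

```python
# Sketch: build the confusion counts from 0/1 labels and compute overall accuracy.
# The label vectors are made up for illustration.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
TN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (TP + TN) / (TP + FP + FN + TN)
print(TP, FP, FN, TN, accuracy)   # 3 1 1 3 0.75
```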


Example

                      Actual Positive    Actual Negative
Predicted Positive    TP = 7             FP = 3
Predicted Negative    FN = 1             TN = 9

The overall accuracy of this classifier is

Accuracy = (7 + 9) / (7 + 3 + 1 + 9) = 16/20 = 0.8


Remark. The overall accuracy is not a good measure when the classes are imbalanced.

Example. In a cancer detection problem with 100 people, only 5 have cancer. Suppose our model is very bad and predicts every case as No Cancer. In doing so, it classifies the 95 non-cancer patients correctly and the 5 cancer patients as non-cancerous. Even though the model is terrible at detecting cancer, its accuracy is still 95%.


More performance metrics

• Precision

• Recall or Sensitivity

• Specificity

• F1 score


Precision

                      Actual Positive    Actual Negative
Predicted Positive    TP = 7             FP = 3
Predicted Negative    FN = 1             TN = 9

The precision of this classifier is defined as

Precision = TP / (TP + FP) = 7 / (7 + 3) = 7/10


Recall or sensitivity

                      Actual Positive    Actual Negative
Predicted Positive    TP = 7             FP = 3
Predicted Negative    FN = 1             TN = 9

The recall/sensitivity of this classifier is defined as

Recall = TP / (TP + FN) = 7 / (7 + 1) = 7/8
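As a quick check of the last two definitions, this sketch recomputes precision and recall from the example counts TP = 7, FP = 3, FN = 1, TN = 9.

```python
# Sketch: precision and recall for the example confusion table (TP=7, FP=3, FN=1, TN=9).
TP, FP, FN, TN = 7, 3, 1, 9

precision = TP / (TP + FP)   # 7/10 = 0.7
recall    = TP / (TP + FN)   # 7/8  = 0.875
print(precision, recall)     # 0.7 0.875
```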


When to use precision and when to use recall:

Precision is about being precise. So even if we managed to capture only one positive case, and we captured it correctly, then we are 100% precise.

Recall is about capturing all positive cases. So if we simply predict every case as positive, we have 100% recall.

Some common patterns:

• High precision is a hard constraint; optimize recall (e.g., search results, grammar correction): intolerant to FP

• High recall is a hard constraint; optimize precision (e.g., medical diagnosis): intolerant to FN


Specificity

                      Actual Positive    Actual Negative
Predicted Positive    TP = 7             FP = 3
Predicted Negative    FN = 1             TN = 9

The specificity of this classifier is defined as

Specificity = TN / (FP + TN) = 9 / (3 + 9) = 9/12


F1 score

                      Actual Positive    Actual Negative
Predicted Positive    TP = 7             FP = 3
Predicted Negative    FN = 1             TN = 9

Precision = 7/10, Recall = 7/8

The F1 score is defined as the harmonic mean of precision and recall:

F1 score = 2 / (precision⁻¹ + recall⁻¹) = 2 · precision · recall / (precision + recall) = 14/18
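Continuing with the same example counts, this sketch computes the specificity and the F1 score as well.

```python
# Sketch: specificity and F1 score for the same example (TP=7, FP=3, FN=1, TN=9).
TP, FP, FN, TN = 7, 3, 1, 9

precision   = TP / (TP + FP)                          # 7/10
recall      = TP / (TP + FN)                          # 7/8
specificity = TN / (FP + TN)                          # 9/12 = 0.75
f1 = 2 * precision * recall / (precision + recall)    # 14/18 ≈ 0.778
print(specificity, f1)
```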


Remark. The harmonic mean of two different positive numbers s, t is closer to the smaller number than to the larger one. For example, with s = 0.2 and t = 0.8:

(s + t)/2 = 0.5, while 2st/(s + t) = 0.32


Changing threshold

                      Actual Positive    Actual Negative
Predicted Positive    TP = 8             FP = 5
Predicted Negative    FN = 0             TN = 7

Precision = 8/13, Recall = 1 (sensitivity = 1), Specificity = 7/12

Accuracy = 15/20, and F1 score = (2 · 1 · 8/13) / (1 + 8/13) = 16/21
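The numbers on this slide can be reproduced in the same way, starting from the counts TP = 8, FP = 5, FN = 0, TN = 7 in the table above.

```python
# Sketch: metrics at the lowered threshold (TP=8, FP=5, FN=0, TN=7).
TP, FP, FN, TN = 8, 5, 0, 7

precision   = TP / (TP + FP)                          # 8/13
recall      = TP / (TP + FN)                          # 1.0 (sensitivity = 1)
specificity = TN / (FP + TN)                          # 7/12
accuracy    = (TP + TN) / (TP + FP + FN + TN)         # 15/20
f1 = 2 * precision * recall / (precision + recall)    # 16/21
print(precision, recall, specificity, accuracy, f1)
```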


Receiver operating characteristic (ROC) curves

An ROC curve is a graphical plot of the true positive rate

TPR = TP / (TP + FN) = Sensitivity

against the false positive rate

FPR = FP / (FP + TN) = 1 − Specificity

at various threshold settings. It illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied (in hypothesis-testing terms, "power vs the Type-I error").
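As a rough sketch of how the curve is traced out, the snippet below sweeps the threshold over a small set of invented scores and labels and collects the resulting (FPR, TPR) points.

```python
# Sketch: trace ROC points by sweeping the decision threshold over the scores.
# Scores and true labels are illustrative.
scores = [0.9, 0.8, 0.7, 0.55, 0.5, 0.4, 0.3, 0.2]
y_true = [1,   1,   0,   1,    0,   1,   0,   0]

P = sum(y_true)         # number of positive instances
N = len(y_true) - P     # number of negative instances

roc_points = []
for thr in sorted(set(scores), reverse=True):
    y_pred = [1 if s >= thr else 0 for s in scores]
    TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    roc_points.append((FP / N, TP / P))   # (FPR, TPR) at this threshold

print(roc_points)
```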


Each point in the ROC space corresponds to a unique confusion table (due to a different threshold used in making predictions):

• (0, 1): perfect classification

• Diagonal: random guess

• Points above the diagonal represent good classification results (better than random); points below represent bad results (worse than random). (Figure source: Wikipedia)


Area under the curve (AUC)

The area under the curve (often referred to as simply the AUC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative'):

Area = P(X1 > X0), where X1 is the score for a positive instance and X0 is the score for a negative instance.
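This probabilistic interpretation can be checked directly on small samples: the AUC equals the fraction of (positive, negative) score pairs in which the positive instance scores higher, counting ties as 1/2. The scores below are invented for illustration.

```python
# Sketch: estimate AUC = P(X1 > X0) by comparing all (positive, negative) score pairs.
# Illustrative scores for positive (X1) and negative (X0) instances.
pos_scores = [0.9, 0.8, 0.55, 0.4]   # X1
neg_scores = [0.7, 0.5, 0.3, 0.2]    # X0

wins = sum(1.0 if x1 > x0 else 0.5 if x1 == x0 else 0.0
           for x1 in pos_scores for x0 in neg_scores)
auc = wins / (len(pos_scores) * len(neg_scores))
print(auc)   # 13/16 = 0.8125
```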

The ROC AUC is commonly used for model comparison in the machine learning community: the larger, the better.


Multiclass classification

The confusion table still applies (with large diagonal entries preferred).

Most metrics (except accuracy) are generally computed in a one-vs-rest (1-vs-many) fashion, treating each class in turn as the positive class.
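As a minimal sketch of the one-vs-rest idea (labels invented for illustration), per-class precision and recall can be obtained by treating each class in turn as positive and all other classes as negative.

```python
# Sketch: per-class precision/recall via one-vs-rest on a small multiclass example.
# The label vectors are illustrative.
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]

for c in sorted(set(y_true)):
    TP = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
    FP = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
    FN = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
    precision = TP / (TP + FP) if TP + FP > 0 else 0.0
    recall    = TP / (TP + FN) if TP + FN > 0 else 0.0
    print(f"class {c}: precision = {precision:.2f}, recall = {recall:.2f}")
```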
