Lecture 8 Multiclass/Log-linear models, Evaluation, and Human Labels

Intro to NLP, CS585, Fall 2014 http://people.cs.umass.edu/~brenocon/inlp2014/ Brendan O’Connor (http://brenocon.com)

Today

• Multiclass
• Evaluation. Humans!
• Next PS: out next week

Multiclass?

• Task: classify into one of |Y| mutually exclusive categories
  • e.g. Y = {sports, travel, politics}
  • Language models (word prediction)? Y = the vocabulary
• One option: combine multiple binary classifiers into a single multiclass classifier
  • Train |Y| one-versus-rest classifiers. Hard prediction rule: choose the most confident class (see the sketch below).
• But what about probabilistic prediction?
  • Does the above procedure sum to 1?
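A minimal sketch of the one-versus-rest idea, assuming we already have one trained binary scorer per class (the per-class confidence functions below are made up for illustration). Taking the argmax gives a hard prediction, but the per-class confidences need not sum to 1.

    # Sketch: one-versus-rest hard prediction with hypothetical per-class scorers.
    # Each scorer returns its own p(y = this class | x), so the scores are
    # comparable for argmax but do NOT form a probability distribution.

    def predict_one_vs_rest(x, classifiers):
        """classifiers: dict mapping class label -> function x -> confidence in [0, 1]."""
        scores = {label: clf(x) for label, clf in classifiers.items()}
        return max(scores, key=scores.get), scores

    # Toy example with made-up confidence functions:
    classifiers = {
        "sports":   lambda x: 0.7,
        "travel":   lambda x: 0.4,
        "politics": lambda x: 0.2,
    }
    label, scores = predict_one_vs_rest("some document", classifiers)
    print(label, sum(scores.values()))   # "sports", 1.3 -- not a distribution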

Binary vs Multiclass logreg

• Binary logreg: let x be a feature vector, y either 0 or 1, and \theta a weight vector across the x features.

    p(y = 1 \mid x, \theta) = \frac{\exp(\theta^T x)}{1 + \exp(\theta^T x)}

• Multiclass logreg: y is a categorical variable that takes one of several values in Y.

Each \theta_y is a weight vector across all x features.

    p(y \mid x, \theta) = \frac{\exp(\theta_y^T x)}{\sum_{y' \in Y} \exp(\theta_{y'}^T x)}
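A small sketch of this formula, assuming a dense feature vector and one toy weight vector per class (all numbers below are made up): take a dot product per class, exponentiate, and normalize.

    from math import exp

    # Sketch of multiclass logistic regression prediction.

    def multiclass_logreg_probs(x, weights):
        """weights: dict mapping class label -> weight list (same length as x)."""
        scores = {y: sum(w_j * x_j for w_j, x_j in zip(w, x)) for y, w in weights.items()}
        Z = sum(exp(s) for s in scores.values())
        return {y: exp(s) / Z for y, s in scores.items()}

    x = [1.0, 0.0, 2.0]                        # toy feature vector
    weights = {"sports":   [0.5, -0.2, 0.1],   # toy per-class weight vectors
               "travel":   [-0.3, 0.4, 0.0],
               "politics": [0.1, 0.1, -0.1]}
    print(multiclass_logreg_probs(x, weights))  # these probabilities do sum to 1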

Log-linear models

Here's the NLP-style notation:

f(x, y): a feature function of the input x and the output y. It produces a very long feature vector. In this view, it's the feature engineer's job, when implementing f(x, y), to include features for all combinations of aspects of inputs x and outputs y. For example:

    f_{105}(x, y) = 1 if "awesome" in x and y = POSITIVE, 0 otherwise
    f_{106}(x, y) = 1 if "awesome" in x and y = NEGATIVE, 0 otherwise
    f_{533}(x, y) = count("awesome" in x) if y = POSITIVE, 0 if y != POSITIVE

\theta: one long, single weight vector. \exp(\theta^T f(x, y)) is an unnormalized, positive "exp-goodness score"; the normalizer sums the exp-goodnesses over all possible outcomes:

    p(y \mid x, \theta) = \frac{\exp(\theta^T f(x, y))}{\sum_{y' \in Y} \exp(\theta^T f(x, y'))}
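A sketch of this joint feature function and the exp-goodness score, with string feature names standing in for the integer indices above (the weights and example sentence are made up):

    from math import exp
    from collections import defaultdict

    # Sketch: joint feature function f(x, y) plus the unnormalized exp-goodness score.

    def f(x, y):
        """x: list of tokens, y: class label. Returns a sparse feature dict."""
        feats = defaultdict(float)
        for word in x:
            feats[("word=%s" % word, y)] += 1.0   # one feature per (word, label) pair
        return feats

    def goodness(x, y, theta):
        """Unnormalized positive score exp(theta . f(x, y)); theta is a sparse dict."""
        return exp(sum(theta.get(k, 0.0) * v for k, v in f(x, y).items()))

    theta = {("word=awesome", "POSITIVE"): 1.5,    # toy weights
             ("word=awesome", "NEGATIVE"): -0.8}
    x = "this movie is awesome".split()
    print(goodness(x, "POSITIVE"), goodness(x, "NEGATIVE"))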

Softmax function

Define s_y = \theta^T f(x, y), the "goodness score" for potential output y. The log-linear distribution is then

    p(y \mid x, \theta) = \frac{\exp(s_y)}{\sum_{y' \in Y} \exp(s_{y'})}

The softmax function turns goodness scores into probabilities that sum to 1: exponentiate, then normalize.

    \mathrm{softmax}(s_1, \ldots, s_{|Y|}) = \left( \frac{\exp(s_1)}{\sum_{y' \in Y} \exp(s_{y'})},\ \frac{\exp(s_2)}{\sum_{y' \in Y} \exp(s_{y'})},\ \ldots,\ \frac{\exp(s_{|Y|})}{\sum_{y' \in Y} \exp(s_{y'})} \right)

    from math import exp

    def softmax(scores):
        exponentiated = [exp(s) for s in scores]
        Z = sum(exponentiated)
        return [escore / Z for escore in exponentiated]

In-class demo [html] [ipynb]
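One practical aside (not on the original slide): exp can overflow for large scores, so implementations usually subtract the maximum score before exponentiating. The output is unchanged, because softmax is invariant to adding a constant to every score.

    from math import exp

    def softmax_stable(scores):
        m = max(scores)                               # shift so the largest score is 0
        exponentiated = [exp(s - m) for s in scores]
        Z = sum(exponentiated)
        return [e / Z for e in exponentiated]

    print(softmax_stable([1000.0, 1001.0, 1002.0]))   # plain softmax would overflow here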

“Log-linear” ?

    p(y) \propto \exp(\theta^T f(y))

"Proportional to" notation, since the denominator is invariant to y:

    p(y) = \frac{\exp(\theta^T f(y))}{\sum_{y' \in Y} \exp(\theta^T f(y'))}

The log prob is

    \log p(y) = \theta^T f(y) - \log \sum_{y' \in Y} \exp(\theta^T f(y'))

    \log p(y) \propto \theta^T f(y)

The last line is abusive "log proportional to" notation... somewhat common, and sometimes convenient. So the log prob is linear in the weights and features... um, except for the log-sum-exp term. Log-sum-exp is an important function for these models.
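A small sketch of that log-sum-exp term, computing log p(y) directly in log space; the max-shift trick is the standard implementation detail (not from the slides) that keeps it numerically stable.

    from math import exp, log

    def log_sum_exp(scores):
        m = max(scores)
        return m + log(sum(exp(s - m) for s in scores))

    def log_prob(scores, y_index):
        """log p(y) = s_y - logsumexp(all scores), for goodness scores s_y = theta . f(x, y)."""
        return scores[y_index] - log_sum_exp(scores)

    scores = [2.0, 0.5, -1.0]   # toy goodness scores for three outputs
    print([log_prob(scores, i) for i in range(3)])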

Log-linear gradients

• Similar to before: for each feature, the gradient is the difference between the feature's value under the gold label and its expected value under the model's predicted distribution (sketch below):

    \nabla_\theta \log p(y^* \mid x, \theta) = f(x, y^*) - \sum_{y \in Y} p(y \mid x, \theta)\, f(x, y)
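A sketch of that gradient plus one stochastic gradient ascent step for a single example, assuming the sparse-dict feature function style from earlier (the feature function, labels, and learning rate below are illustrative, not from the slides).

    from math import exp
    from collections import defaultdict

    def probs(x, labels, theta, f):
        scores = {y: sum(theta.get(k, 0.0) * v for k, v in f(x, y).items()) for y in labels}
        m = max(scores.values())
        Z = sum(exp(s - m) for s in scores.values())
        return {y: exp(s - m) / Z for y, s in scores.items()}

    def sgd_step(x, y_gold, labels, theta, f, lr=0.1):
        p = probs(x, labels, theta, f)
        grad = defaultdict(float)
        for k, v in f(x, y_gold).items():           # observed (gold) feature values
            grad[k] += v
        for y in labels:                            # minus expected feature values
            for k, v in f(x, y).items():
                grad[k] -= p[y] * v
        for k, g in grad.items():
            theta[k] = theta.get(k, 0.0) + lr * g   # ascend the log-likelihood

    # Toy usage with a hypothetical bag-of-words feature function:
    def f(x, y):
        return {("word=%s" % w, y): 1.0 for w in x}

    theta = {}
    sgd_step("great movie".split(), "POSITIVE", ["POSITIVE", "NEGATIVE"], theta, f)
    print(theta)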

Log-linear models

• Such a great idea that it has been reinvented and renamed in many different fields:

• Multinomial logistic regression: also known as polytomous, polychotomous, or multi-class logistic regression, or just multilogit regression.

• Maximum entropy classifier: common in 1990s-era NLP literature; still sometimes used. Logistic regression estimation obeys the maximum entropy principle, so logistic regression is sometimes called "maximum entropy modeling" and the resulting classifier the "maximum entropy classifier".

• Neural network, classification with a single neuron: currently popular again. Binary logistic regression is equivalent to a single-output neural network with a logistic activation trained under log loss; this is sometimes called classification with a single neuron.

• Generalized linear model and softmax regression: logistic regression is a generalized linear model with the logit link function. The logistic link function is sometimes called softmax, and given its use of exponentiation to convert linear predictors to probabilities, it is sometimes called an exponential model.

http://alias-i.com/lingpipe/demos/tutorial/logistic-regression/read-me.html

Evaluation

• Given gold-standard labels, how good is a classifier? Evaluate on a test set (or dev set).
• Evaluating probabilistic predictions
  • Log-likelihood: we were already doing this for LMs.
• Evaluating hard predictions
  • Accuracy: the fraction of predictions that are correct (see the sketch below).
  • What about class imbalance? Note also the "most frequent class" baseline.
  • Many different statistics are available from the confusion matrix:
    • Precision, recall, F-score
    • Expected utility...
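A tiny sketch of accuracy and the "most frequent class" baseline on a made-up, imbalanced test set, to make the class-imbalance point concrete.

    from collections import Counter

    def accuracy(gold, pred):
        return sum(g == p for g, p in zip(gold, pred)) / len(gold)

    gold = ["ham"] * 95 + ["spam"] * 5              # imbalanced toy test set
    pred = ["ham"] * 100                            # a classifier that never predicts spam

    majority = Counter(gold).most_common(1)[0][0]
    print(accuracy(gold, pred))                     # 0.95 -- looks great, catches no spam
    print(accuracy(gold, [majority] * len(gold)))   # most-frequent-class baseline: also 0.95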

Confusion matrix metrics

Notes from http://brenocon.com/confusion_matrix_diagrams.pdf (Brendan O'Connor, 2013-06-17). These are conditional probabilities from counts on a binary confusion matrix; just some notes, no guarantees of correctness. Odds notation: X:Y ≡ X/(X+Y), and 1 - (X:Y) = Y:X.

    Confusion matrix     pred=1    pred=0
    gold=1                 TP        FN
    gold=0                 FP        TN

Accuracy = p(correct) = (TP+TN) : (FN+FP). But often you want to know about FN's versus FP's, hence all the following metrics.

Truth-conditional metrics: p(correct | gold). If you take the same classifier and apply it to a new dataset with different ground-truth population proportions, these metrics stay the same. Makes sense for, say, medical tests.
• Sensitivity a.k.a. Recall a.k.a. TruePosRate = p(correct | gold=1) = TP:FN = 1 - FalseNegRate = 1 - (FN:TP)
• Specificity a.k.a. TrueNegRate = p(correct | gold=0) = TN:FP = 1 - FalsePosRate = 1 - (FP:TN)

Prediction-conditional metrics: p(correct | pred). These are properties jointly of the classifier and the dataset. For example, p(person has disease | test says "positive") depends on the prior: by Bayes' rule, PPV = Sens*PriorPos / (Sens*PriorPos + (1-Spec)*PriorNeg).
• Precision a.k.a. PosPredictiveValue = p(correct | pred=1) = TP:FP = 1 - FalseDiscoveryRate = 1 - (FP:TP)
• NegPredictiveValue = p(correct | pred=0) = TN:FN

Confidence tradeoffs: the classifier says YES if its confidence in '1' exceeds a threshold t. Varying the threshold trades off FP's against FN's. Any one metric can be made arbitrarily good by being either very liberal or very conservative, so you are usually interested in a pair of metrics: one that cares about FP's and one that cares about FN's. Two popular pairings; varying the threshold traces out a curve for each:
• Sens-Spec tradeoff: the ROC curve, (sens, spec) = (TP:FN, TN:FP), plotted as TPR vs. FPR.
• Sens-Prec tradeoff: the Precision-Recall curve, (prec, rec) = (TP:FP, TP:FN). Ignores true negatives, which makes sense for needle-in-haystack problems where TN's dominate and are uninteresting.

Summary statistics that make an FP-FN metric pairing invariant to the choice of threshold:
• AUC a.k.a. ROCAUC: area under the ROC curve. Has a semantics: the probability that the classifier correctly orders a randomly chosen positive/negative example pair (like Wilcoxon-Mann-Whitney or Kendall's tau). Also allows neat analytic tricks with convex hulls.
• PRAUC: area under the PR curve. Similar to mean average precision. No comparable semantics.
• PR breakeven: find the threshold where prec = rec, and report that value.

Summary statistics that require a specific threshold (or that you can use without confidences), designed to care about both FP and FN costs:
• Balanced accuracy: the arithmetic mean of sens and spec, i.e. accuracy on a population with equal class proportions. "Balanced" is a.k.a. macroaveraged. Corresponds to linear isoclines in ROC space. (In contrast, AUC is a balanced rank-ordering accuracy.)
• F-score: the harmonic mean of prec and rec; usually F1, the unweighted harmonic mean. Corresponds to curvy isoclines in PR space.

Classical frequentist hypothesis testing: the task is to detect null-hypothesis violations; define "reject null" as a positive prediction. Type I error = false positive; size (α) = 1 - specificity. Type II error = false negative; power = 1 - β = sensitivity. The point is to give non-Bayesian probabilistic semantics to a single test, so these relationships are a little more subtle: size is a worst-case bound on specificity, and a p-value is for one example, the probability that the null could generate a test statistic at least as extreme (the size of the strictest test that would reject the null on that example). "Significance" refers to either the p-value or the size (i.e. FPR). For a single test, PPV and NPV are not frequentist-friendly concepts since they depend on priors, but in multiple testing they can be frequentist again; e.g. Benjamini-Hochberg bounds expected PPV by assuming PriorNeg = 0.999.

References: diagrams based on William Press's slides, http://www.nr.com/CS395T/lectures2008/17-ROCPrecisionRecall.pdf; listings of many metrics at http://en.wikipedia.org/wiki/Receiver_operating_characteristic; Fawcett (2006), "An Introduction to ROC Analysis", http://people.inf.elte.hu/kiss/13dwhdm/roc.pdf; Manning et al. (2008), "Introduction to Information Retrieval", e.g. http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html; hypothesis testing: Casella and Berger, "Statistical Inference" (ch. 8), and Wasserman, "All of Statistics" (ch. 10).

Example: a spam classifier's confusion matrix (counts):

                        Predicted Spam    Predicted Non-Spam
    Actual Spam              5000                100
    Actual Non-Spam             7                400

    Sensitivity (Recall) = 5000 / 5100
    Precision            = 5000 / 5007

• Precision and recall are metrics for binary classification.
• You can make either one arbitrarily high by changing your decision threshold.
• F-score: the harmonic mean of P and R (see the sketch below).

Decision threshold

Decide "1" if p(y = 1 | x) > t ... and we could vary the threshold t.


Many other metrics

• Expected utility: perhaps there's -5 points of user happiness for reading a spam message, but -1000 points for a false-positive spam decision.
• Metrics that are invariant to the decision threshold:
  • Log-likelihood
  • "ROC AUC": rank examples by confidence value; for a randomly chosen gold-positive and gold-negative example, are they ranked correctly? ("Area under the receiver operating characteristic curve"; see the sketch below.)
• Many other related quantities with different names from different disciplines (medicine, engineering, statistics...)

http://brenocon.com/confusion_matrix_diagrams.pdf
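A sketch of ROC AUC computed directly from its pairwise-ranking semantics: the probability that a randomly chosen gold-positive example is scored above a randomly chosen gold-negative one (ties counted as half). This brute-force version compares all pairs, which is fine for a small test set; real implementations sort instead. The example labels and scores are made up.

    def roc_auc(gold, scores):
        """gold: list of 0/1 labels; scores: classifier confidences for label 1."""
        pos = [s for g, s in zip(gold, scores) if g == 1]
        neg = [s for g, s in zip(gold, scores) if g == 0]
        correct = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
        return correct / (len(pos) * len(neg))

    print(roc_auc([1, 0, 1, 0, 0], [0.9, 0.4, 0.65, 0.5, 0.1]))   # 1.0: perfectly ranked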

Human annotations

• We usually treat data from humans as the "gold standard".
• But our tasks are subjective! What does this mean?

• Compare answers to your neighbor. How many did you agree on, out of 10?

Tweet | num HAPPY | num SAD | fraction HAPPY | Truth (?)
@AppleEl @melissaclse Oh ok! GO! - Lol you guys, I wanna join you... damn homework _EMOTICON_ | 0 | 45 | 0.000 | >:(
My phone is so bad... _EMOTICON_ | 0 | 46 | 0.000 | :(
Swim is on pause. It's raining. _EMOTICON_ | 2 | 44 | 0.043 | :(
wat does that text mean.? _EMOTICON_ | 4 | 40 | 0.091 | :)
@Moonchild66 ouch, i empathise, i get that a bit from sleeping awkwardly! _EMOTICON_ | 7 | 39 | 0.152 | :(
surreal knowing middle school is officially over _EMOTICON_ | 19 | 26 | 0.422 | :(
If dnt nobody else love me i love me _EMOTICON_ | 41 | 5 | 0.891 | :)
Fireflies - Owl City #nowplaying _EMOTICON_ | 45 | 1 | 0.978 | =)
Oh, is anyone watching Dancing with the Stars? That old lady is made of win _EMOTICON_ | 45 | 1 | 0.978 | ^_^
U see the name _EMOTICON_ http://t.co/Bns4wQyF | 45 | 1 | 0.978 | :)

Thursday, September 25, 14 @AppleEl @melissaclse Oh ok! GO! - Lol you guys, @AppleEl @melissaclse Oh ok! GO! - Lol you guys, SAD I wanna join you... damn homework _EMOTICON_ I wanna join you... damn homework >:(

surreal knowing middle school is officially over SAD surreal knowing middle school is officially over :[ _EMOTICON_

@Moonchild66 ouch, i empathise, i get that a bit @Moonchild66 ouch, i empathise, i get that a bit SAD from sleeping awkwardly! _EMOTICON_ from sleeping awkwardly! :(

HAPPY If dnt nobody else love me i love me _EMOTICON_ If dnt nobody else love me i love me :)

SAD My phone is so bad... _EMOTICON_ My phone is so bad... :(

Oh, is anyone watching Dancing with the Stars? Oh, is anyone watching Dancing with the Stars? HAPPY That old lady is made of win _EMOTICON_ That old lady is made of win ^_^

SAD Swim is on pause. It's raining. _EMOTICON_ Swim is on pause. It's raining. :-(

U see the name _EMOTICON_ http://t.co/ HAPPY U see the name :) http://t.co/Bns4wQyF Bns4wQyF

HAPPY Fireflies - Owl City #nowplaying _EMOTICON_ Fireflies - Owl City #nowplaying =)

HAPPY wat does that text mean.? _EMOTICON_ wat does that text mean.? :)

17

Human agreement

• Should we expect machines to do subjective tasks?
  • Is the task "real", or is it a fake made-up thing? What does "real" mean anyway? Is sentiment a "real" thing?
• Inter-annotator agreement rate (IAA) is the standard way to measure the "realness" and quality of human annotations.
  • Have two annotators annotate the same items.
  • IAA = the fraction of the time they agree.
  • Alternate view: the accuracy rate one annotator achieves when trying to predict the other.
  • Cohen's kappa: a variation on IAA that controls for base rates, comparing observed agreement against the chance agreement implied by the annotators' overall label proportions (e.g., everyone answering the most common answer). See the sketch below.
• Human factors affect agreement rates!!
  • Are the annotators trained similarly?
  • Are the guidelines clear?
  • Is your labeling theory of sentiment/semantics/etc. "real"?
• Common wisdom: IAA is an upper bound on machine accuracy. Really? Discuss.
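A sketch of raw inter-annotator agreement and Cohen's kappa for two annotators' label lists. Chance agreement is estimated from each annotator's own label proportions, which is the standard kappa formulation; the "most common answer" comparison mentioned above is the simpler related baseline. The example annotations are made up.

    from collections import Counter

    def agreement(a, b):
        return sum(x == y for x, y in zip(a, b)) / len(a)

    def cohens_kappa(a, b):
        n = len(a)
        p_observed = agreement(a, b)
        ca, cb = Counter(a), Counter(b)
        p_chance = sum((ca[l] / n) * (cb[l] / n) for l in set(a) | set(b))
        return (p_observed - p_chance) / (1 - p_chance)

    ann1 = ["HAPPY", "SAD", "SAD", "HAPPY", "SAD", "HAPPY", "SAD", "SAD", "HAPPY", "SAD"]
    ann2 = ["HAPPY", "SAD", "HAPPY", "HAPPY", "SAD", "HAPPY", "SAD", "SAD", "SAD", "SAD"]
    print(agreement(ann1, ann2), cohens_kappa(ann1, ann2))   # 0.8 raw agreement, ~0.58 kappa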
