Journal of Machine Learning Technologies ISSN: 2229-3981 & ISSN: 2229-399X, Volume 2, Issue 1, 2011, pp 37-63 Available online at http://www.bioinfo.in/contents.php?id=51
EVALUATION: FROM PRECISION, RECALL AND F-MEASURE TO ROC, INFORMEDNESS, MARKEDNESS & CORRELATION
POWERS, D.M.W.* AILab, School of Computer Science, Engineering and Mathematics, Flinders University, South Australia, Australia. *Corresponding author. E-mail: [email protected]
Received: February 18, 2011; Accepted: February 27, 2011
Abstract
Commonly used evaluation measures including Recall, Precision, F-Measure and Rand Accuracy are biased and should not be used without clear understanding of the biases, and corresponding identification of chance or base case levels of the statistic. Using these measures, a system that performs worse in the objective sense of Informedness can appear to perform better under any of these commonly used measures. We discuss several concepts and measures that reflect the probability that prediction is informed versus chance, in particular Informedness, and introduce Markedness as a dual measure for the probability that prediction is marked versus chance. Finally we demonstrate elegant connections between the concepts of Informedness, Markedness, Correlation and Significance as well as their intuitive relationships with Recall and Precision, and outline the extension from the dichotomous case to the general multi-class case.
Keywords – Recall and Precision, F-Measure, Rand Accuracy, Kappa, Informedness and Markedness, DeltaP, Correlation, Significance.
INTRODUCTION
A common but poorly motivated way of evaluating results of Machine Learning experiments is using Recall, Precision and F-measure. These measures are named for their origin in Information Retrieval and present specific biases, namely that they ignore performance in correctly handling negative examples, they propagate the underlying marginal prevalences and biases, and they fail to take into account the chance level performance. In the Medical Sciences, Receiver Operating Characteristics (ROC) analysis has been borrowed from Signal Processing to become a standard for evaluation and standard setting, comparing True Positive Rate and False Positive Rate. In the Behavioural Sciences, Specificity and Sensitivity are commonly used. Alternate techniques, such as Rand Accuracy and Cohen Kappa, have some advantages but are nonetheless still biased measures. We will recapitulate some of the literature relating to the problems with these measures, as well as considering a number of other techniques that have been introduced and argued within each of these fields, aiming/claiming to address the problems with these simplistic measures.

This paper recapitulates and re-examines the relationships between these various measures, develops new insights into the problem of measuring the effectiveness of an empirical decision system or a scientific experiment, analyzing and introducing new probabilistic and information theoretic measures that overcome the problems with Recall, Precision and their derivatives.

THE BINARY CASE
It is common to introduce the various measures in the context of a dichotomous binary classification problem, where the labels are by convention + and − and the predictions of a classifier are summarized in a four-cell contingency table. This contingency table may be expressed using raw counts of the number of times each predicted label is associated with each real class, or may be expressed in relative terms. Cell and margin labels may be formal probability expressions, may derive cell expressions from margin labels or vice versa, may use alphabetic constant labels a, b, c, d or A, B, C, D, or may use acronyms for the generic terms for True and False, Real and Predicted Positives and Negatives. Often UPPER CASE is used where the values are counts, and lower case letters where the values are probabilities or proportions relative to N or the marginal probabilities – we will adopt this convention throughout this paper (always written in typewriter font), and in addition will use Mixed Case (in the normal text font) for popular nomenclature that may or may not correspond directly to one of
our formal systematic names. True and False Positives (TP/FP) refer to the number of Predicted Positives that were correct/incorrect, and similarly for True and False Negatives (TN/FN), and these four cells sum to N. On the other hand tp, fp, fn, tn and rp, rn and pp, pn refer to the joint and marginal probabilities, and the four contingency cells and the two pairs of marginal probabilities each sum to 1. We will attach other popular names to some of these probabilities in due course.

We thus make the specific assumptions that we are predicting and assessing a single condition that is either positive or negative (dichotomous), that we have one predicting model, and one gold standard labeling. Unless otherwise noted we will also for simplicity assume that the contingency is non-trivial in the sense that both positive and negative states of both predicted and real conditions occur, so that none of the marginal sums or probabilities is zero.

We illustrate in Table 1 the general form of a binary contingency table using both the traditional alphabetic notation and the directly interpretable systematic approach. Both definitions and derivations in this paper are made relative to these labellings, although English terms (e.g. from Information Retrieval) will also be introduced for various ratios and probabilities. The green positive diagonal represents correct predictions, and the pink negative diagonal incorrect predictions. The predictions of the contingency table may be the predictions of a theory, of some computational rule or system (e.g. an Expert System or a Neural Network), or may simply be a direct measurement, a calculated metric, or a latent condition, symptom or marker. We will refer generically to "the model" as the source of the predicted labels, and "the population" or "the world" as the source of the real conditions. We are interested in understanding to what extent the model "informs" predictions about the world/population, and the world/population "marks" conditions in the model.

Recall & Precision, Sensitivity & Specificity
Recall or Sensitivity (as it is called in Psychology) is the proportion of Real Positive cases that are correctly Predicted Positive. This measures the Coverage of the Real Positive cases by the +P (Predicted Positive) rule. Its desirable feature is that it reflects how many of the relevant cases the +P rule picks up. It tends not to be very highly valued in Information Retrieval (on the assumptions that there are many relevant documents, that it doesn't really matter which subset we find, that we can't know anything about the relevance of documents that aren't returned). Recall tends to be neglected or averaged away in Machine Learning and Computational Linguistics (where the focus is on how confident we can be in the rule or classifier). However, in a Computational Linguistics/Machine Translation context Recall has been shown to have a major weight in predicting the success of Word Alignment [1]. In a Medical context Recall is moreover regarded as primary, as the aim is to identify all Real Positive cases, and it is also one of the legs on which ROC analysis stands. In this context it is referred to as True Positive Rate (tpr). Recall is defined, with its various common appellations, by equation (1):

$$\text{Recall} = \text{Sensitivity} = \text{tpr} = \text{tp}/\text{rp} = \text{TP}/\text{RP} = A/(A+C) \tag{1}$$

Conversely, Precision or Confidence (as it is called in Data Mining) denotes the proportion of Predicted Positive cases that are correctly Real Positives. This is what Machine Learning, Data Mining and Information Retrieval focus on, but it is totally ignored in ROC analysis. It can however analogously be called True Positive Accuracy (tpa), being a measure of accuracy of Predicted Positives in contrast with the rate of discovery of Real Positives (tpr). Precision is defined in (2):

$$\text{Precision} = \text{Confidence} = \text{tpa} = \text{tp}/\text{pp} = \text{TP}/\text{PP} = A/(A+B) \tag{2}$$

These two measures and their combinations focus only on the positive examples and predictions, although between them they capture some information about the rates and kinds of errors made. However, neither of them captures any information about how well the model handles negative cases. Recall relates only to the +R column and Precision only to the +P row. Neither of these takes into account the number of True Negatives. This also applies to their Arithmetic, Geometric and Harmonic Means: A, G and $F = G^2/A$ (the F-factor or F-measure). Note that the F1 measure effectively references the True Positives to the Arithmetic Mean of Predicted Positives and Real Positives, being a constructed rate normalized to an idealized value. Expressed in this form it is known in statistics as a Proportion of Specific Agreement, as it is applied to a specific class; applied to the Positive Class, it is PS+. It also corresponds to the set-theoretic Dice Coefficient.

Copyright © 2011, Bioinfo Publications
Table 1. Systematic and traditional notations in a binary contingency table. Shading indicates correct (light=green) and incorrect (dark=red) rates or counts in the contingency table.

        Systematic              Traditional
        +R    −R                +R     −R
  +P    tp    fp    pp    +P    A      B      A+B
  −P    fn    tn    pn    −P    C      D      C+D
        rp    rn    1           A+C    B+D    N
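The definitions in (1) and (2), and the observation that Recall, Precision and their means ignore True Negatives, can be checked numerically. The following is a minimal sketch in Python, not part of the original paper: the class name `Contingency`, the helper `demo` and the example counts are hypothetical and chosen purely for illustration, using the traditional A, B, C, D cells of Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contingency:
    """Binary contingency table in the traditional notation of Table 1."""
    A: int  # TP: Predicted Positive, Real Positive
    B: int  # FP: Predicted Positive, Real Negative
    C: int  # FN: Predicted Negative, Real Positive
    D: int  # TN: Predicted Negative, Real Negative

    @property
    def N(self) -> int:
        # The four cells sum to N.
        return self.A + self.B + self.C + self.D

    def recall(self) -> float:
        # Equation (1): tpr = A / (A + C)
        return self.A / (self.A + self.C)

    def precision(self) -> float:
        # Equation (2): tpa = A / (A + B)
        return self.A / (self.A + self.B)

    def f1(self) -> float:
        # Harmonic mean of Recall and Precision (F = G^2 / A).
        r, p = self.recall(), self.precision()
        return 2 * r * p / (r + p)

    def dice(self) -> float:
        # TP referenced to the arithmetic mean of Predicted and Real Positives.
        predicted_pos = self.A + self.B
        real_pos = self.A + self.C
        return self.A / ((predicted_pos + real_pos) / 2)


def demo():
    # Hypothetical counts; the two tables differ only in D (True Negatives).
    t1 = Contingency(A=30, B=10, C=20, D=40)
    t2 = Contingency(A=30, B=10, C=20, D=4000)
    return t1, t2


if __name__ == "__main__":
    for t in demo():
        print(t.N, t.recall(), t.precision(), t.f1(), t.dice())
```

Under these assumed counts, both tables give identical Recall, Precision and F1 even though the number of True Negatives differs by two orders of magnitude, and F1 coincides with the Dice form up to floating-point rounding.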
The Geometric Mean of Recall and Precision (G-measure) effectively normalizes TP to the Geometric Mean of Predicted Positives and Real Positives, and its Information content corresponds to the Arithmetic Mean of the Information represented by Recall and Precision.

In fact, there is in principle nothing special about the Positive case, and we can define Inverse statistics in terms of the Inverse problem in which we interchange positive and negative and are predicting the opposite case. Inverse Recall or Specificity is thus the proportion of Real Negative cases that are correctly Predicted Negative (3), and is also known as the True Negative Rate (tnr). Conversely, Inverse Precision is the proportion of Predicted Negative cases that are indeed Real Negatives (4), and can also be called True Negative Accuracy (tna):

$$\text{Inverse Recall} = \text{tnr} = \text{tn}/\text{rn} = \text{TN}/\text{RN} = D/(B+D) \tag{3}$$

$$\text{Inverse Precision} = \text{tna} = \text{tn}/\text{pn} = \text{TN}/\text{PN} = D/(C+D) \tag{4}$$

The inverse of F1 is not known in AI/ML/CL/IR but is just as well known as PS+ in statistics, being the Proportion of Specific Agreement for the class of negatives, PS−. Note that whereas F1 is advocated in AI/ML/CL/IR as a single measure to capture the effectiveness of a system, it still completely ignores TN, which can vary freely without affecting the statistic. In statistics, PS+ is used in conjunction with PS− to ensure the contingencies are completely captured, and similarly Specificity (Inverse Recall) is always recorded along with Sensitivity (Recall).

Rand Accuracy explicitly takes into account the classification of negatives, and is expressible (5) both as a weighted average of Precision and Inverse Precision and as a weighted average of Recall and Inverse Recall:

$$\text{Accuracy} = \text{tca} = \text{tcr} = \text{tp}+\text{tn} = \text{rp}\cdot\text{tpr} + \text{rn}\cdot\text{tnr} = \text{pp}\cdot\text{tpa} + \text{pn}\cdot\text{tna} = (\text{TP}+\text{TN})/N = (A+D)/N \tag{5}$$

$$\text{Dice} = F1 = \text{tp}/(\text{tp}+(\text{fn}+\text{fp})/2) = A/(A+(B+C)/2) = 1/(1+\text{mean}(\text{FN},\text{FP})/\text{TP}) \tag{6}$$

$$\text{Jaccard} = \text{tp}/(\text{tp}+\text{fn}+\text{fp}) = \text{TP}/(N-\text{TN}) = A/(A+B+C) = A/(N-D) = 1/(1+2\,\text{mean}(\text{FN},\text{FP})/\text{TP}) = F1/(2-F1) \tag{7}$$

As shown in (5), Rand Accuracy is effectively a prevalence-weighted average of Recall and Inverse Recall, as well as a bias-weighted average of Precision and Inverse Precision. Whilst it does take into account TN in the numerator, the sensitivity to bias and prevalence is an issue since these are independent variables, with prevalence varying as we apply to data sampled under different conditions, and bias being directly under the control of the system designer (e.g. as a threshold). Similarly, we can note that one of N, FP or FN is free to vary. Whilst the Jaccard (or Tanimoto) similarity coefficient apparently takes TN into account in the form TP/(N−TN), using it to heuristically discount the correct classification of negatives, it can be written (7) independently of TN and N in a way similar to the effectively equivalent Dice or PS+ or F1 (6), or in terms of them. It is therefore subject to bias as TN or N is free to vary, and these measures fail to capture the contingencies fully without knowing the inverse statistics too.

Each of the above also has a complementary form defining an error rate, of which some have specific names and importance: Fallout or False Positive Rate (fpr) is the proportion of Real Negatives that occur as Predicted Positive (ring-ins); Miss Rate or False Negative Rate (fnr) is the proportion of Real Positives that are Predicted Negatives (false-drops). False Positive Rate is the second of the legs on which ROC analysis is based.

$$\text{Fallout} = \text{fpr} = \text{fp}/\text{rn} = \text{FP}/\text{RN} = B/(B+D) \tag{8}$$

$$\text{Miss Rate} = \text{fnr} = \text{fn}/\text{rp} = \text{FN}/\text{RP} = C/(A+C) \tag{9}$$

Note that FP and FN are sometimes referred to as Type I and Type II Errors, and the rates fp and fn as alpha and beta, respectively – referring to falsely rejecting or accepting a hypothesis. More correctly, these terms apply specifically to the meta-level problem discussed later of whether the precise pattern of counts (not rates) in the contingency table fits the null hypothesis of random distribution rather than reflecting the effect of some alternative hypothesis (which is not in general the one represented by +P→+R or −P→−R or both). Note that all the measures discussed individually leave at least two degrees of freedom (plus N) unspecified and free to control, and this leaves the door open for bias, whilst N is needed too for estimating significance and power.

Prevalence, Bias, Cost & Skew
We now turn our attention to the various forms of bias that detract from the utility of all of the above surface measures [2]. We will first note that rp represents the Prevalence of positive cases, RP/N, and is assumed to be a property of the population of interest – it may be constant, or it may vary across subpopulations, but is regarded here as not being under the control of the experimenter, and so we want a prevalence-independent measure. By contrast, pp represents the (label) Bias of the model [3], the tendency of the model to output positive labels, PP/N, and is directly under the control of the experimenter, who can change the model by changing the theory or algorithm, or some parameter or threshold, to better fit the world/population being modeled. As discussed
earlier, F-factor (or Dice or Jaccard) effectively references tp (probability or proportion of True Positives) to the Arithmetic Mean of Bias and Prevalence (6-7). A common rule of thumb, or even a characteristic of some algorithms, is to parameterize a model so that Prevalence = Bias, viz. rp = pp. Corollaries of this setting are Recall = Precision (= Dice but not Jaccard), Inverse Recall = Inverse Precision and Fallout = Miss Rate.

Alternate characterizations of Prevalence are in terms of Odds [4] or Skew [5], being the Class Ratio $c_s = \text{rn}/\text{rp}$, recalling that by definition rp+rn = 1 and RN+RP = N. If the distribution is highly skewed (typically there are many more negative cases than positive), the number of errors due to poor Inverse Recall will be much greater than the number of errors due to poor Recall. Given the cost of both False Positives and False Negatives is equal, individually, the overall component of the total cost due to False Positives (as Negatives) will be much greater at any significant level of chance performance, due to the higher Prevalence of Real Negatives.

Note that the normalized binary contingency table with unspecified margins has three degrees of freedom – setting any three non-redundant ratios determines the rest (setting any count supplies the remaining information to recover the original table of counts with its four degrees of freedom). In particular, Recall, Inverse Recall and Prevalence, or equivalently tpr, fpr and $c_s$, suffice to determine all ratios and measures derivable from the normalized contingency table, but N is also required to determine significance. As another case of specific interest, Precision, Inverse Precision and Bias, in combination, suffice to determine all ratios or measures, although we will show later that an alternate characterization of Prevalence and Bias in terms of Evenness allows for even simpler relationships to be exposed.

We can also take into account a differential value for positives ($c_p$) and negatives ($c_n$) – this can be applied to errors as a cost (loss or debit) and/or to correct cases as a gain (profit or credit), and can be combined into a single Cost Ratio $c_v = c_n/c_p$. Note that the value and skew determined costs have similar effects, and may be multiplied to produce a single skew-like cost factor $c = c_v c_s$. Formulations of measures that are expressed using tpr, fpr and $c_s$ may be made cost-sensitive by using $c = c_v c_s$ in place of $c = c_s$, or can be made skew/cost-insensitive by using $c = 1$ [5].

ROC and PN Analyses
Flach [5] highlighted the utility of ROC analysis to the Machine Learning community, and characterized the skew sensitivity of many measures in that context, utilizing the ROC format to give geometric insights into the nature of the measures and their sensitivity to skew. [6] further elaborated this analysis, extending it to the unnormalized PN variant of ROC, and targeting their analysis specifically to rule learning. We will not examine the advantages of ROC analysis here, but will briefly explain the principles and recapitulate some of the results.

ROC analysis plots the rate tpr against the rate fpr, whilst PN plots the unnormalized TP against FP. This difference in normalization only changes the scales and gradients, and we will deal only with the normalized form of ROC analysis. A perfect classifier will score in the top left hand corner (fpr=0, tpr=100%). A worst case classifier will score in the bottom right hand corner (fpr=100%, tpr=0). A random classifier would be expected to score somewhere along the positive diagonal (tpr=fpr) since the model will throw up positive and negative examples at the same rate (relative to their populations – these are Recall-like scales: tpr = Recall, 1−fpr = Inverse Recall). The negative diagonal (tpr + c⋅fpr = 1) corresponds to matching Bias to Prevalence for a skew of c.

Fig. 1 - Illustration of ROC Analysis (plot of tpr against fpr, with points labelled best, good, perverse and worst, and further labels sse0, sse1 and B). The main diagonal represents chance with parallel isocost lines representing equal cost-performance. Points above the diagonal represent performance better than chance, those below worse than chance. For a single good (dotted=green) system, AUC is area under curve (trapezoid between green line and x=[0,1]). The perverse (dashed=red) system shown is the same (good) system with class labels reversed.

The ROC plot allows us to compare classifiers (models and/or parameterizations) and choose the one that is closest to (0,1) and furthest from tpr=fpr in some sense. These conditions for choosing the optimal parameterization or model are not identical, and in fact the most common condition is to maximize the area under the curve (AUC), which for a single parameterization of a model is defined by a single point and the segments connecting it to (0,0) and (1,1). For a
parameterized model it will be a monotonic function consisting of a sequence of segments from (0,0) to (1,1). A particular cost model and/or accuracy measure defines an isocost gradient, which for a skew and cost insensitive model will be c=1, and hence another common approach is to choose a tangent point on the highest isocost line that touches the curve. The simple condition of choosing the point on the curve nearest the optimum point (0,1) is not commonly used, but this distance to (0,1) is given by $\sqrt{\text{fpr}^2 + (1-\text{tpr})^2}$, and minimizing this amounts to minimizing the sum of squared normalized error, $\text{fpr}^2+\text{fnr}^2$.

A ROC curve with concavities can also be locally interpolated to produce a smoothed model following the convex hull of the original ROC curve. It is even possible to locally invert across the convex hull to repair concavities, but this may overfit and thus not generalize to unseen data. Such repairs can lead to selecting an improved model, and the ROC curve can also be used to retune a model to changing Prevalence and costs. The area under such a multipoint curve is thus of some value, but the optimum in practice is the area under the simple trapezoid defined by the model:

$$\text{AUC} = (\text{tpr}-\text{fpr}+1)/2 = (\text{tpr}+\text{tnr})/2 = 1-(\text{fpr}+\text{fnr})/2 \tag{10}$$

For the cost and skew insensitive case, with c=1, maximizing AUC is thus equivalent to maximizing tpr−fpr or minimizing a sum of (absolute) normalized error fpr+fnr. The chance line corresponds to tpr−fpr=0, and parallel isocost lines for c=1 have the form tpr−fpr=k. The highest isocost line also maximizes tpr−fpr and AUC so that these two approaches are equivalent. Minimizing a sum of squared normalized error, $\text{fpr}^2+\text{fnr}^2$, corresponds to a Euclidean distance minimization heuristic that is equivalent only under appropriate constraints, e.g. fpr=fnr, or equivalently, Bias=Prevalence, noting that all cells are non-negative by construction.

We now summarize relationships between the various candidate accuracy measures as rewritten [5,6] in terms of tpr, fpr and the skew, c, as well as in terms of Recall, Bias and Prevalence:

$$\text{Accuracy} = [\text{tpr}+c\,(1-\text{fpr})]/[1+c] = 2\cdot\text{Recall}\cdot\text{Prev}+1-\text{Bias}-\text{Prev} \tag{11}$$

$$\text{Precision} = \text{tpr}/[\text{tpr}+c\cdot\text{fpr}] = \text{Recall}\cdot\text{Prev}/\text{Bias} \tag{12}$$

$$\text{F-Measure } F1 = 2\,\text{tpr}/[\text{tpr}+c\cdot\text{fpr}+1] = 2\cdot\text{Recall}\cdot\text{Prev}/[\text{Bias}+\text{Prev}] \tag{13}$$

$$\text{WRAcc} = 4c\,[\text{tpr}-\text{fpr}]/[1+c]^2 = 4\,[\text{Recall}-\text{Bias}]\cdot\text{Prev} \tag{14}$$

The last measure, Weighted Relative Accuracy, was defined [7] to subtract off the component of the True Positive score that is attributable to chance and rescale to the range ±1. Note that maximizing WRAcc is equivalent to maximizing AUC or tpr−fpr = 2AUC−1, as c is constant. Thus WRAcc is an unbiased accuracy measure, and the skew-insensitive form of WRAcc, with c=1, is precisely tpr−fpr. Each of the other measures (11−13) shows a bias in that it cannot be maximized independent of skew, although skew-insensitive versions can be defined by setting c=1. The recasting of Accuracy, Precision and F-Measure in terms of Recall makes clear how all of these vary only in terms of the way they are affected by Prevalence and Bias. Prevalence is regarded as a constant of the target condition or data set (and c=[1−Prev]/Prev), whilst parameterizing or selecting a model can be viewed in terms of trading off tpr and fpr as in ROC analysis, or equivalently as controlling the relative number of positive and negative predictions, namely the Bias, in order to maximize a particular accuracy measure (Recall, Precision, F-Measure, Rand Accuracy and AUC). Note that for a given Recall level, the other measures (10−13) all decrease with increasing Bias towards positive predictions.

DeltaP, Informedness and Markedness
Powers [4] also derived an unbiased accuracy measure to avoid the bias of Recall, Precision and Accuracy due to population Prevalence and label bias. The Bookmaker algorithm costs wins and losses in the same way a fair bookmaker would set prices based on the odds. Powers then defines the concept of Informedness which represents the 'edge' a punter has in making his bet, as evidenced and quantified by his winnings. Fair pricing based on correct odds should be zero sum – that is, guessing will leave you with nothing in the long run, whilst a punter with certain knowledge will win every time. Informedness is the probability that a punter is making an informed bet and is explained in terms of the proportion of the time the edge works out versus ends up being pure guesswork. Powers defined Bookmaker Informedness for the general, K-label, case, but we will defer discussion of the general case for now and present a simplified formulation of Informedness, as well as the complementary concept of Markedness.

Definition 1
Informedness quantifies how informed a predictor is for the specified condition, and specifies the probability that a prediction is informed in relation to the condition (versus chance).

Definition 2
Markedness quantifies how marked a condition is for the specified predictor, and specifies the probability that a condition is marked by the predictor (versus chance).

These definitions are aligned with the psychological and linguistic uses of the terms condition and marker. The condition represents the experimental outcome we are trying to determine by indirect means. A marker or predictor (cf. biomarker or
neuromarker) represents the indicator we are using to determine the outcome. There is no implication of causality – that is something we will address later. However there are two possible directions of implication we will address now. Detection of the predictor may reliably predict the outcome, with or without the occurrence of a specific outcome condition reliably triggering the predictor.

For the binary case we have

$$\text{Informedness} = \text{Recall} + \text{Inverse Recall} - 1 = \text{tpr}-\text{fpr} = 1-\text{fnr}-\text{fpr} \tag{15}$$

$$\text{Markedness} = \text{Precision} + \text{Inverse Precision} - 1 = \text{tpa}-\text{fna} = 1-\text{fpa}-\text{fna}$$

We noted above that maximizing AUC or the unbiased WRAcc measure effectively maximized tpr−fpr and indeed WRAcc reduced to this in the skew-independent case. This is not surprising given both Powers [4] and Flach [5-7] set out to produce an unbiased measure, and the linear definition of Informedness will define a unique linear form. Note that while Informedness is a deep measure of how consistently the Predictor predicts the Outcome by combining surface measures about what proportion of Outcomes are correctly predicted, Markedness is a deep measure of how consistently the Outcome has the Predictor as a Marker by combining surface measures about what proportion of Predictions are correct.

In the Psychology literature, Markedness is known as DeltaP and is empirically a good predictor of human associative judgements – that is, it seems we develop associative relationships between a predictor and an outcome when DeltaP is high, and this is true even when multiple predictors are in competition [8]. In the context of experiments on information use in syllable processing, [9] note that Shanks [8] sees DeltaP as "the normative measure of contingency", but propose a complementary, backward, additional measure of strength of association, DeltaP' aka dichotomous Informedness. Perruchet and Peereman [9] also note the analog of DeltaP to regression coefficient, and that the Geometric Mean of the two measures is a dichotomous form of the Pearson correlation coefficient, the Matthews' Correlation Coefficient, which is appropriate unless a continuous scale is being measured dichotomously, in which case a Tetrachoric Correlation estimate would be appropriate [10,11].

Causality, Correlation and Regression
In a linear regression of two variables, we seek to predict one variable, y, as a linear combination of the other, x, finding a line of best fit in the sense of minimizing the sum of squared error (in y). The equation of fit has the form $y = y_0 + r_x x$ where

$$r_x = \left[n\sum xy - \sum x \sum y\right] / \left[n\sum x^2 - \sum x \sum x\right] \tag{16}$$

Substituting in counts from the contingency table, for the regression of predicting +R (1) versus −R (0) given +P (1) versus −P (0), we obtain this gradient of best fit (minimizing the error in the real values R):

$$r_P = [AD-BC]/[(A+B)(C+D)] = A/(A+B) - C/(C+D) = \text{DeltaP} = \text{Markedness} \tag{17}$$

Conversely, we can find the regression coefficient for predicting P from R (minimizing the error in the predictions P):

$$r_R = [AD-BC]/[(A+C)(B+D)] = A/(A+C) - B/(B+D) = \text{DeltaP}' = \text{Informedness} \tag{18}$$

Finally we see that the Matthews correlation, a contingency matrix method of calculating the Pearson product-moment correlation coefficient, ρ, is defined by

$$r_G = [AD-BC]/\sqrt{(A+C)(B+D)(A+B)(C+D)} = \text{Correlation} = \pm\sqrt{\text{Informedness}\cdot\text{Markedness}} \tag{19}$$

Given that the regressions find the same line of best fit, these gradients should be reciprocal, defining a perfect Correlation of 1. However, both Informedness and Markedness are probabilities with an upper bound of 1, so perfect correlation requires perfect regression. The squared correlation is a coefficient of proportionality indicating the proportion of the variance in R that is explained by P, and is traditionally also interpreted as a probability. We can now interpret it either as the joint probability that P informs R and R marks P, given that the two directions of predictability are independent, or as the probability that the variance is (causally) explained reciprocally. The sign of the Correlation will be the same as the sign of Informedness and Markedness and indicates whether a correct or perverse usage of the information has been made – take note in interpreting the final part of (19).

Psychologists traditionally explain DeltaP in terms of causal prediction, but it is important to note that the direction of stronger prediction is not necessarily the direction of causality, and the fallacy of abductive reasoning is that the truth of A → B does not in general have any bearing on the truth of B → A.

If Pi is one of several independent possible causes of R, Pi→R is strong, but R→Pi is in general weak for any specific Pi. If Pi is one of several necessary contributing factors to R, Pi→R is weak for any single Pi, but R→Pi is strong. The directions of the implication are thus not in general dependent.

In terms of the regression to fit R from P, since there are only two correct points and two error points, and errors are calculated in the vertical (R) direction only, all errors contribute equally to tilting the regression down from the ideal line of fit. This Markedness regression thus provides information about the consistency of the Outcome in terms of having the Predictor as a Marker – the errors measured from the Outcome R relate to the failure of the Marker P to be present.
We can gain further insight into the nature of these regression and correlation coefficients by reducing the top and bottom of each expression to probabilities (dividing by N², noting that the original contingency counts sum to N, and the joint probabilities after reduction sum to 1). The numerator is the determinant of the contingency matrix, and common across all three coefficients, reducing to dtp, whilst the reduced denominator of the regression coefficients depends only on the Prevalence or Bias of the base variates. The regression coefficients, Bookmaker Informedness (B) and Markedness (M), may thus be re-expressed in terms of Precision (Prec) or Recall, along with Bias and Prevalence (Prev) or their inverses (I-):
M = dtp / [Bias (1−Bias)]
= dtp / [pp pn] = dtp / pg²
= dtp / BiasG² = dtp / EvennessP
= [Precision − Prevalence] / IBias   (20)
B = dtp / [Prevalence (1−Prevalence)]
= dtp / [rp rn] = dtp / rg²
= dtp / PrevG² = dtp / EvennessR
= [Recall − Bias] / IPrev
= Recall − Fallout
= Recall + IRecall − 1
= Sensitivity + Specificity − 1
= (LR−1) (1−Specificity)
= (1−NLR) Specificity
= (LR−1) (1−NLR) / (LR−NLR)   (21)
In the medical and behavioural sciences, the Likelihood Ratio is LR = Sensitivity/[1−Specificity], and the Negative Likelihood Ratio is NLR = [1−Sensitivity]/Specificity. For non-negative B, LR > 1 > NLR, with 1 as the chance case. We also express Informedness in these terms in (21).

The Matthews/Pearson correlation is expressed in reduced form as the Geometric Mean of Bookmaker Informedness and Markedness, abbreviating their product as BookMark (BM) and recalling that it is BookMark that acts as a probability-like coefficient of determination, not its root, the Geometric Mean (BookMarkG or BMG):
BMG = dtp / √[Prev (1−Prev) Bias (1−Bias)]
= dtp / [PrevG BiasG]
= dtp / EvennessG
= √[(Recall−Bias) (Prec−Prev) / (IPrev IBias)]   (22)
These equations clearly indicate how the Bookmaker coefficients of regression and correlation depend only on the proportion of True Positives and the Prevalence and Bias applicable to the respective variables. Furthermore, Prev Bias represents the Expected proportion of True Positives (etp) relative to N, showing that the coefficients each represent the proportion of Delta True Positives (deviation from expectation, dtp = tp − etp) renormalized in different ways to give different probabilities. Equations (20-22) illustrate this, showing that these coefficients depend only on dtp and either Prevalence, Bias or their combination. Note that for a particular dtp these coefficients are minimized when the Prevalence and/or Bias are at the evenly biased 0.5 level; however, in a learning or parameterization context changing the Prevalence or Bias will in general change both tp and etp, and hence can change dtp.

It is also worth considering further the relationship of the denominators to the Geometric Means, PrevG of Prevalence and Inverse Prevalence (IPrev = 1−Prev is Prevalence of Real Negatives) and BiasG of Bias and Inverse Bias (IBias = 1−Bias is bias to Predicted Negatives). These Geometric Means represent the Evenness of Real classes (EvennessR = PrevG²) and Predicted labels (EvennessP = BiasG²). We also introduce the concept of Global Evenness as the Geometric Mean of these two natural kinds of Evenness, EvennessG. From this formulation we can see that for a given relative delta of true positive prediction above expectation (dtp), the correlation is at minimum when predictions and outcomes are both evenly distributed (√EvennessG = √EvennessR = √EvennessP = Prev = Bias = 0.5), and Markedness and Bookmaker are individually minimal when Bias resp. Prevalence are evenly distributed (viz. Bias resp. Prev = 0.5). This suggests that setting Learner Bias (and regularized, cost-weighted or subsampled Prevalence) to 0.5, as sometimes performed in Artificial Neural Network training, is in fact inappropriate on theoretical grounds, as has previously been shown both empirically and based on Bayesian principles; rather it is best to use Learner/Label Bias = Natural Prevalence, which is in general much less than 0.5 [12].

Note that in the above equations (20-22) the denominator is always strictly positive since we have occurrences and predictions of both Positives and Negatives by earlier assumption, but we note that if in violation of this constraint we have a degenerate case in which there is nothing to predict or we make no effective prediction, then tp = etp and dtp = 0, and all the above regression and correlation coefficients are defined in the limit approaching zero. Thus the coefficients are zero if and only if dtp is zero, and they have the same sign as dtp otherwise. Assuming that we are using the model the right way round, then dtp, B and M are non-negative, and BMG is similarly non-negative as expected. If the model is the wrong way round, then dtp, B, M and BMG can indicate this by expressing below-chance performance, negative regressions and negative correlation, and we can reverse the sense of P to correct this.

The absolute value of the determinant of the contingency matrix, dp = dtp, in these probability
formulae (20-22), also represents the sum of the absolute deviations from the expectation represented by any individual cell and hence 2dp = 2DP/N is the total absolute relative error versus the null hypothesis. Additionally it has a geometric interpretation as the area of a trapezoid in PN space, the unnormalized variant of ROC [6].

We already observed that in (normalized) ROC analysis, Informedness is twice the triangular area between a positively informed system and the chance line, and it thus corresponds to the area of the trapezoid defined by a system (assumed to perform no worse than chance), and any of its perversions (interchanging prediction labels but not the real classes, or vice versa, so as to derive a system that performs no better than chance), and the endpoints of the chance line (the trivial cases in which the system labels all cases true or conversely all are labelled false). Such a kite-shaped area is delimited by the dotted (system) and dashed (perversion) lines in Fig. 1 (interchanging class labels), but the alternate parallelogram (interchanging prediction labels) is not shown. The Informedness of a perverted system is the negation of the Informedness of the correctly polarized system.

We now also express the Informedness and Markedness forms of DeltaP in terms of deviations from expected values along with the Harmonic mean of the marginal cardinalities of the Real classes or Predicted labels respectively, defining DP, DELTAP, RH, PH and related forms in terms of their N-Relative probabilistic forms defined as follows:
etp = rp pp; etn = rn pn   (23)
dp = tp − etp = dtp = dtn = tn − etn
deltap = dtp + dtn = 2dp   (24)
rh = 2 rp rn / [rp+rn] = rg²/ra
ph = 2 pp pn / [pp+pn] = pg²/pa   (25)
DeltaP' or Bookmaker Informedness may now be expressed in terms of deltap and rh, and DeltaP or Markedness analogously in terms of deltap and ph:
B = DeltaP' = [etp+dtp]/rp − [efp−dtp]/rn
= etp/rp − efp/rn + 2dtp/rh
= 2dp/rh = deltap/rh   (26)
M = DeltaP = 2dp/ph = deltap/ph   (27)
These harmonic relationships connect directly with the previous geometric evenness terms by observing that HarmonicMean = GeometricMean²/ArithmeticMean, as seen in (25) and used in the alternative expressions for normalization for Evenness in (26-27). The use of HarmonicMean makes the relationship with F-measure clearer, but use of GeometricMean is generally preferred as a consistent estimate of central tendency that more accurately estimates the mode for skewed (e.g. Poisson) data bounded below by 0 and unbounded above, and as the central limit of the family of Lp-based averages. Viz. the Geometric (L0) Mean is the Geometric Mean of the Harmonic (L−1) and Arithmetic (L+1) Means, with positive values of p being biased higher (toward L+∞ = Max) and negative values of p being biased lower (toward L−∞ = Min).

Effect of Bias & Prev on Recall & Precision
The final form of the equations (26-27) cancels out the common Bias and Prevalence (Prev) terms that denormalized tp to tpr (Recall) or tpa (Precision). We now recast the Bookmaker Informedness and Markedness equations to show Recall and Precision as subject (28-29), in order to explore the effect of Bias and Prevalence on Recall and Precision, as well as clarify the relationship of Bookmaker and Markedness to these other ubiquitous but iniquitous measures.
Recall = Bookmaker (1−Prevalence) + Bias
Bookmaker = (Recall−Bias) / (1−Prevalence)   (28)
Precision = Markedness (1−Bias) + Prevalence
Markedness = (Precision−Prevalence) / (1−Bias)   (29)
Bookmaker and Markedness are unbiased estimators of above-chance performance (relative to respectively the predicting conditions or the predicted markers). Equations (28-29) clearly show the nature of the bias introduced by both Label Bias and Class Prevalence. If operating at chance level, both Bookmaker and Markedness will be zero, and Recall, Precision, and derivatives such as the F-measure, will be skewed by the biases. Note that increasing Bias or decreasing Prevalence increases Recall and decreases Precision, for a constant level of unbiased performance. We can more specifically see that the regression coefficient for the prediction of Recall from Prevalence is −Informedness, and from Bias is +1, and similarly the regression coefficient for the prediction of Precision from Bias is −Markedness, and from Prevalence is +1. Using the heuristic of setting Bias = Prevalence then sets Recall = Precision = F1 and Bookmaker Informedness = Markedness = Correlation. Setting Bias = 1 (Prevalence < 1) may be seen to make Precision track Prevalence with Recall = 1, whilst Prevalence = 1 (Bias < 1) means Recall = Bias with Precision = 1, and under either condition there is no information utilized (Bookmaker Informedness = Markedness = 0).

In summary, Recall reflects the Bias plus a discounted estimation of Informedness and Precision reflects the Prevalence plus a discounted estimation of Markedness. Given usually Prevalence << ½ and Bias << ½, their complements Inverse Prevalence >> ½ and Inverse Bias >> ½ represent substantial weighting up of the true unbiased performance in both these measures, and hence also in F1. High Bias drives Recall up strongly and Precision down according to the strength of Informedness; high Prevalence drives
Precision up and Recall down according to the strength of Markedness.

Alternately, Informedness can be viewed (21) as a renormalization of Recall after subtracting off the chance level of Recall, Bias, and Markedness (20) can be seen as a renormalization of Precision after subtracting off the chance level of Precision, Prevalence (and Flach's WRAcc, the unbiased form being equivalent to Bookmaker Informedness, was defined in this way as discussed in §2.3). Informedness can also be seen (21) as a renormalization of LR or NLR after subtracting off their chance level performance. The Kappa measure [13-16] commonly used in assessor agreement evaluation was similarly defined as a renormalization of Accuracy after subtracting off an estimate of the expected Accuracy, for Cohen Kappa being the dot product of the Biases and Prevalences, and expressible as a normalization of the discriminant of contingency, dtp, by the mean error rate (cf. F1; viz. Kappa is dtp/[dtp+mean(fp,fn)]). All three measures are invariant in the sense that they are properties of the contingency tables that remain unchanged when we flip to the Inverse problem (interchange positive and negative for both conditions and predictions). That is, we observe:
Inverse Informedness = Informedness,
Inverse Markedness = Markedness,
Inverse Kappa = Kappa.
The Dual problem (interchange antecedent and consequent) reverses which condition is the predictor and the predicted condition, and hence interchanges Precision and Recall, Prevalence and Bias, as well as Markedness and Informedness. For cross-evaluator agreement, both Informedness and Markedness are meaningful although the polarity and orientation of the contingency is arbitrary. Similarly when examining causal relationships (conventionally DeltaP vs DeltaP'), it is useful to evaluate both deductive and abductive directions in determining the strength of association. For example, the connection between cloud and rain involves cloud as one causal antecedent of rain (but sunshowers occur occasionally), and rain as one causal consequent of cloud (but cloudy days aren't always wet); only once we have identified the full causal chain can we reduce to equivalence, and lack of equivalence may be a result of unidentified causes, alternate outcomes or both.

The Perverse systems (interchanging the labels on either the predictions or the classes, but not both) have similar performance but occur below the chance line (since we have assumed strictly better than chance performance in assigning labels to the given contingency matrix).

Note that the effect of Prevalence on Accuracy, Recall and Precision has also been characterized above (§2.3) in terms of Flach's demonstration of how skew enters into their characterization in ROC analysis, and effectively assigns different costs to (False) Positives and (False) Negatives. This can be controlled for by setting the parameter c appropriately to reflect the desired skew and cost tradeoff, with c=1 defining skew and cost insensitive versions. However, only Informedness (or equivalents such as DeltaP' and skew-insensitive WRAcc) precisely characterizes the probability with which a model informs the condition, and conversely only Markedness (or DeltaP) precisely characterizes the probability that a condition marks (informs) the predictor. Similarly, only the Correlation (aka Coefficient of Proportionality aka Coefficient of Determination aka Squared Matthews Correlation Coefficient) precisely characterizes the probability that condition and predictor inform/mark each other, under our dichotomous assumptions. Note the Tetrachoric Correlation is another estimate of the Pearson Correlation made under the alternate assumption of an underlying continuous variable (assumed normally distributed), and is appropriate if we instead assume that we are dichotomizing a normal continuous variable [11]. But in this article we are making the explicit assumption that we are dealing with a right/wrong dichotomy that is intrinsically discontinuous.
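The renormalizations and invariances described above can be illustrated with a short Python sketch. It computes Informedness as (Recall − Bias)/IPrev, Markedness as (Precision − Prevalence)/IBias and Kappa as dtp/[dtp + mean(fp,fn)], then checks the Inverse and Dual problems. The function name `measures` and the use of the first Table 2 example as data are assumptions made for this illustration only.

```python
def measures(tp, fp, fn, tn):
    """Return (Informedness, Markedness, Kappa) from raw contingency counts."""
    n = tp + fp + fn + tn
    prev = (tp + fn) / n                 # Prevalence of real positives
    bias = (tp + fp) / n                 # Bias towards positive predictions
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    dtp = tp / n - prev * bias           # deviation of tp from expectation
    informedness = (recall - bias) / (1 - prev)
    markedness = (precision - prev) / (1 - bias)
    kappa = dtp / (dtp + (fp + fn) / (2 * n))
    return informedness, markedness, kappa


def cohen_kappa(tp, fp, fn, tn):
    """Textbook form: (Accuracy - expected Accuracy) / (1 - expected Accuracy)."""
    n = tp + fp + fn + tn
    acc = (tp + tn) / n
    prev, bias = (tp + fn) / n, (tp + fp) / n
    expected = prev * bias + (1 - prev) * (1 - bias)
    return (acc - expected) / (1 - expected)


tp, fp, fn, tn = 56, 20, 12, 12          # first example of Table 2

b, m, k = measures(tp, fp, fn, tn)

# Inverse problem: swap positive and negative for both classes and labels.
b_inv, m_inv, k_inv = measures(tn, fn, fp, tp)

# Dual problem: swap the roles of predictor and predicted condition.
b_dual, m_dual, k_dual = measures(tp, fn, fp, tn)

print(f"B={b:.4f} M={m:.4f} Kappa={k:.4f}")
print(f"Inverse: B={b_inv:.4f} M={m_inv:.4f} Kappa={k_inv:.4f}")
print(f"Dual:    B={b_dual:.4f} M={m_dual:.4f} Kappa={k_dual:.4f}")
```

For these counts the Inverse problem leaves all three values unchanged, while the Dual problem exchanges Informedness and Markedness; the dtp-based form of Kappa agrees with the textbook definition.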
Although Kappa does attempt to renormalize a debiased estimate of Accuracy, and is thus much more meaningful than Recall, Precision, Accuracy, and their biased derivatives, it is intrinsically non-linear, doesn't account for error well, and retains an influence of bias, so that there does not seem to be any situation when Kappa would be preferable to Correlation as a standard independent measure of agreement [16,13]. As we have seen, Bookmaker Informedness, Markedness and Correlation reflect the discriminant of relative contingency normalized according to different Evenness functions of the marginal Biases and Prevalences, and reflect probabilities relative to the corresponding marginal cases. However, we have seen that Kappa scales the discriminant in a way that reflects the actual error without taking into account expected error due to chance, and in effect it is really just using the discriminant to scale the actual mean error: Kappa is dtp/[dtp+mean(fp,fn)] = 1/[1+mean(fp,fn)/dtp], which approximates for small error to 1 − mean(fp,fn)/dtp. The relatively good fit of Kappa to Correlation and Informedness is illustrated in Fig. 2, along with the poor fit of the Rank Weighted Average and the Geometric and Harmonic (F-factor) means. The fit of the Evenness-weighted determinant is perfect and not easily distinguishable, but the separate components (Determinant and geometric means of Real Prevalences and Prediction Biases) are also shown (+1 for clarity).

Fig. 2 - Accuracy of traditional measures. 110 Monte Carlo simulations with 11 stepped expected Informedness levels (red) with Bookmaker-estimated Informedness (red dot), Markedness (green dot) and Correlation (blue dot), and showing (dashed) Kappa versus the biased traditional measures Rank Weighted Average (Wav), Geometric Mean (Gav) and Harmonic Mean F1 (Fav). The Determinant (D) and Evenness kth roots (gR=PrevG and gP=BiasG) are shown +1. K=4, N=128. (Online version has figures in colour.)

Significance and Information Gain
The ability to calculate various probabilities from a contingency table says nothing about the significance of those numbers: is the effect real, or is it within the expected range of variation around the values expected by chance? Usually this is explored by considering deviation from the expected values (ETP and its relatives) implied by the marginal counts (RP, PP and relatives), or from expected rates implied by the biases (Class Prevalence and Label Bias). In the case of Machine Learning, Data Mining, or other artificially derived models and rules, there is the further question of whether the training and parameterization of the model has set the 'correct' or 'best' Prevalence and Bias (or Cost) levels. Furthermore, should this determination be undertaken by reference to the model evaluation measures (Recall, Precision, Informedness, Markedness and their derivatives), or should the model be set to maximize the significance of the results?

This raises the question of how our measures of association and accuracy, Informedness, Markedness and Correlation, relate to standard measures of significance.

This article has been written in the context of a prevailing methodology in Computational Linguistics and Information Retrieval that concentrates on target positive cases and ignores the negative case for the purpose of both measures of association and significance. A classic example is saying “water” can only be a noun because the system is inadequate to the task of Part of Speech identification and this boosts Recall and hence F-factor, or at least setting the Bias to nouns close to 1, and the Inverse Bias to verbs close to 0. Of course, Bookmaker will then be 0 and Markedness unstable (undefined, and very sensitive to any words that do actually get labelled verbs). We would hope that significance would also be 0 (or
near zero given only a relatively small number of verb labels). We would also like to be able to calculate significance based on the positive case alone, as either the full negative information is unavailable, or it is not labelled.

Generally when dealing with contingency tables it is assumed that unused labels or unrepresented classes are dropped from the table, with corresponding reduction of degrees of freedom. For simplicity we have assumed that the margins are all non-zero, but the freedoms are there whether they are used or not, so we will not reduce them or reduce the table.

There are several schools of thought about significance testing, but all agree on the utility of calculating a p-value [19], by specifying some statistic or exact test T(X) and setting p = Prob(T(X) ≥ T(Data)). In our case, the Observed Data is summarized in a contingency table and there are a number of tests which can be used to evaluate the significance of the contingency table.

For example, Fisher's exact test calculates the proportion of contingency tables that are at least as favourable to the Prediction/Marking hypothesis, rather than the null hypothesis, and provides an accurate estimate of the significance of the entire contingency table without any constraints on the values or distribution. The log-likelihood based G² test and Pearson's approximating χ² tests are compared against a Chi-Squared Distribution of appropriate degree of freedom (r=1 for the binary contingency table given the marginal counts are known), and depend on assumptions about the distribution, and may focus only on the Predicted Positives.

χ² captures the Total Squared Deviation relative to expectation, and is here calculated only in relation to positive predictions, as often only the overt prediction is considered and the implicit prediction of the negative case is ignored [17-19], noting that it is sufficient to count r=1 cells to determine the table and make a significance estimate. However, χ² is valid only for reasonably sized contingencies (one rule of thumb is that the expectation for the smallest cell is at least 5, and the Yates and Williams corrections will be discussed in due course [18,19]):
χ²+P = (TP−ETP)²/ETP + (FP−EFP)²/EFP
= DTP²/ETP + DFP²/EFP
= 2DP²/EHP, EHP = 2 ETP EFP/[ETP+EFP]
= 2N dp²/ehp, ehp = 2 etp efp/[etp+efp]
= 2N dp²/[rh pp] = N dp²/PrevG²/Bias
= N B² EvennessR/Bias = N rP² PrevG²/Bias
≈ (N+PN) rP² PrevG²   (Bias → 1)
= (N+PN) B² EvennessR   (30)
G² captures Total Information Gain, being N times the Average Information Gain in nats, otherwise known as Mutual Information, which however is normally expressed in bits. We will discuss this separately under the General Case. We deal with G² for positive predictions in the case of small effect, that is dp close to zero, showing that G² is twice as sensitive as χ² in this range.
G²+P/2 = TP ln(TP/ETP) + FP ln(FP/EFP)
= TP ln(1+DTP/ETP) + FP ln(1+DFP/EFP)
≈ TP (DTP/ETP) + FP (DFP/EFP)
= 2N dp²/ehp
= 2N dp²/[rh pp]
= N dp²/PrevG²/Bias
= N B² EvennessR/Bias
= N rP² PrevG²/Bias
≈ (N+PN) rP² PrevG²   (Bias → 1)
= (N+PN) B² EvennessR   (31)
In fact χ² is notoriously unreliable for small N and small cell values, and G² is to be preferred. The Yates correction (applied only for cell values under 5) is to subtract 0.5 from the absolute dp value for that cell before squaring and completing the calculation [17-19].

Our result (30-31) shows that the χ² and G² significance of the Informedness effect increases with N as expected, but also with the square of Bookmaker, the Evenness of Prevalence (EvennessR = PrevG² = Prev (1−Prev)) and the number of Predicted Negatives (viz. with Inverse Bias)! This is as expected. The more Informed the contingency regarding positives, the less data will be needed to reach significance. The more Biased the contingency towards positives, the less significant each positive is and the more data is needed to ensure significance. The Bias-weighted average over all Predictions (here for the K=2 case: Positive and Negative) is simply KN B² PrevG², which gives us an estimate of the significance without focussing on either case in particular.
χ²KB = 2N dtp²/PrevG²
= 2N rP² PrevG²
= 2N rP² EvennessR
= 2N B² EvennessR   (32)
Analogous formulae can be derived for the significance of the Markedness effect for positive real classes, noting that EvennessP = BiasG².
χ²KM = 2N dtp²/BiasG²
= 2N rR² BiasG²
= 2N M² EvennessP   (33)
The Geometric Mean of these two overall estimates for the full contingency table is
χ²KBM = 2N dtp²/[PrevG BiasG]
= 2N rP rR PrevG BiasG
= 2N rG² EvennessG = 2N ρ² EvennessG
= 2N B M EvennessG   (34)
This is simply the total Sum of Squares Deviance (SSD) accounted for by the correlation coefficient BMG (22) over the N data points discounted by the Global Evenness factor, being the squared
Geometric Mean of all four Positive and Negative Bias and Prevalence terms (EvennessG = PrevG BiasG). The less even the Bias and Prevalence, the more data will be required to achieve significance, the maximum evenness value of 0.25 being achieved with both even bias and even Prevalence. Note that for even bias or Prevalence, the corresponding positive and negative significance estimates match the global estimate.

When χ²+P or G²+P is calculated for a specific label in a dichotomous contingency table, it has one degree of freedom for the purposes of assessment of significance. The full table also has one degree of freedom, and summing for goodness of fit over only the positive prediction label will clearly lead to a lower χ² estimate than summing across the full table, and while summing for only the negative label will often give a similar result it will in general be different. Thus the weighted arithmetic mean calculated by χ²KB is an expected value independent of the arbitrary choice of which predictive variate is investigated. This is used to see whether a hypothesized main effect (the alternate hypothesis, HA) is borne out by a significant difference from the usual distribution (the null hypothesis, H0). Summing over the entire table (rather than averaging of labels) is used for χ² or G² independence testing independent of any specific alternate hypothesis [21], and can be expected to achieve a χ² estimate approximately twice that achieved by the above estimates, effectively cancelling out the Evenness term, and is thus far less conservative (viz. it is more likely to satisfy p<α):
χ²BM = N rG² = N ρ² = N φ² = N B M   (35)
Note that this equates Pearson's Rho, ρ, with the Phi Correlation Coefficient, φ, which is defined in terms of the Inertia φ² = χ²/N. We now have confirmed that not only does a factor of N connect the full contingency G² to Mutual Information (MI), but it also normalizes the full approximate χ² contingency to Matthews/Pearson (=BMG=Phi) Correlation, at least for the dichotomous case. This tells us moreover that MI and Correlation are measuring essentially the same thing, but MI and Phi do not tell us anything about the direction of the correlation, whereas the sign of Matthews or Pearson or BMG Correlation does (it is the Biases and Prevalences that are multiplied and square-rooted).

The individual or averaged goodness of fit estimates are in general much more conservative than full contingency table estimation of p by the Fisher Exact Test, but the full independence estimate can over-inflate the statistic due to summation of more than there are degrees of freedom. The conservativeness has to do both with distributional assumptions of the χ² or G² estimates that are only asymptotically valid as well as the approximative nature of χ² in particular.

Also note that α bounds the probability of the null hypothesis, but 1−α is not a good estimate of the probability of any specific alternate hypothesis. Based on a Bayesian equal probability prior for the null hypothesis (H0, e.g. B=M=0 as population effect) and an unspecific one-tailed alternate hypothesis (HA, e.g. the measured B and C as true population effect), we can estimate new posterior probability estimates for Type I (H0 rejection, Alpha(p)) and Type II (HA rejection, Beta(p)) errors from the posthoc likelihood estimation [22]:
L(p) = Alpha(p)/Beta(p) ≈ −e p log(p)   (36)
Alpha(p) = 1/[1+1/L(p)]   (37)
Beta(p) = 1/[1+L(p)]   (38)

Confidence Intervals and Deviations
An alternative to significance estimation is confidence estimation in the statistical rather than the data mining sense. We noted earlier that selecting the highest isocost line or maximizing AUC or Bookmaker Informedness, B, is equivalent to minimizing fpr+fnr = (1−B) or maximizing tpr+tnr = (1+B), which maximizes the sum of normalized squared deviations of B from chance, sseB = B² (as is seen geometrically from Fig. 1). Note that this contrasts with minimizing the sum of squares distance from the optimum, which minimizes the relative sum of squared normalized error of the aggregated contingency, sseB = fpr²+fnr². However, an alternate definition calculating the sum of squared deviation from optimum is as a normalization of the square of the minimum distance to the isocost of contingency, sseB = (1−B)².

This approach contrasts with the approach of considering the error versus a specific null hypothesis representing the expectation from margins. Normalization is to the range [0,1] like |B| and normalizes (due to similar triangles) all orientations of the distance between isocosts (Fig. 1). With these estimates the relative error is constant and the relative size of confidence intervals around the null and full hypotheses only depend on N, as |B| and |1−B| are already standardized measures of deviation from null or full correlation respectively (σ/μ=1). Note however that if the empirical value is 0 or 1, these measures admit no error versus no information or full information resp. If the theoretical value is B=0, then a full ±1 error is possible, particularly in the discrete low N case where it can be equilikely and will be more likely than expected values that are fractional and thus likely to become zeros. If the theoretical value is B=1, then no variation is expected unless due to measurement error. Thus |1−B| reflects the maximum (low N) deviation in the absence of measurement error.

The standard Confidence Interval is defined in terms of the Standard Error, SE = √[SSE/(N (N−1))]
= √[sse/(N−1)]. It is usual to use a multiplier X of around X=2 as, given the central limit theorem applies and the distribution can be regarded as normal, a multiplier of 1.96 corresponds to a confidence of 95% that the true mean lies in the specified interval around the estimated mean, viz. the probability that the derived confidence interval will bound the true mean is 0.95, and the test thus corresponds approximately to a significance test with alpha=0.05 as the probability of rejecting a correct null hypothesis, or a power test with beta=0.05 as the probability of rejecting a true full or partial correlation hypothesis. A number of other distributions also approximate 95% confidence at 2SE.

We specifically reject the more traditional approach which assumes that both Prevalence and Bias are fixed, defining margins which in turn define a specific chance case rather than an isocost line representing all chance cases; we cannot assume that any solution on an isocost line has greater error than any other since all are by definition equivalent. The above approach is thus argued to be appropriate for Bookmaker and ROC statistics which are based on the isocost concept, and reflects the fact that most practical systems do not in fact preset the Bias or match it to Prevalence, and indeed Prevalences in early trials may be quite different from those in the field.

The specific estimate of sse that we present for alpha, the probability of the current estimate for B occurring if the true Informedness is B=0, is √sseB0 = |1−B| = 1, which is appropriate for testing the null hypothesis, and thus for defining unconventional error bars on B=0. Conversely, √sseB2 = |B| = 0 is appropriate for testing deviation from the full hypothesis in the absence of measurement error, whilst √sseB2 = |B| = 1 conservatively allows for full range measurement error, and thus defines unconventional error bars on B=M=C=1.

In view of the fact that there is confusion between the use of beta in relation to a specific full dependency hypothesis, B=1 as we have just considered, and the conventional definition of an arbitrary and unspecific alternate contingent hypothesis, B≠0, we designate the probability of incorrectly excluding the full hypothesis by gamma, and propose three possible related kinds of correction for the √sse for beta: some kind of mean of |B| and |1−B| (the unweighted arithmetic mean is 1/2, the geometric mean is less conservative and the harmonic mean least conservative), the maximum or minimum (actually a special case of the last, the maximum being conservative and the minimum too low an underestimate in general), or an asymmetric interval that has one value on the null side and another on the full side (a parameterized special case of the last that corresponds to percentile-based usages like box plots, being more appropriate to distributions that cannot be assumed to be symmetric).

The √sse means may be weighted or unweighted, and in particular a self-weighted arithmetic mean gives our recommended definition, √sseB1 = 1−2|B|+2B², whilst an unweighted geometric mean gives √sseB1 = √[|B|−B²] and an unweighted harmonic mean gives √sseB1 = |B|−B². All of these are symmetric, with the weighted arithmetic mean giving a minimum of 0.5 at B=±0.5 and a maximum of 1 at both B=0 and B=±1, contrasting maximally with sseB0 and sseB2 resp. in these neighbourhoods, whilst the unweighted harmonic and geometric means have their minimum of 0 at both B=0 and B=±1, acting like sseB0 and sseB2 resp. in these neighbourhoods (which there evidence zero variance around their assumed true values). The maximum at B=±0.5 for the geometric mean is 0.5 and for the harmonic mean, 0.25.

For this probabilistic |B| range, the weighted arithmetic mean is never less than the arithmetic mean and the geometric mean is never more than the arithmetic mean. These relations demonstrate the complementary nature of the weighted/arithmetic and unweighted geometric means. The maxima at the extremes are arguably more appropriate in relation to power, as intermediate results should calculate squared deviations from a strictly intermediate expectation based on the theoretical distribution, and will thus be smaller on average if the theoretical hypothesis holds, whilst providing emphasized differentiation when near the null or full hypothesis. The minima of 0 at the extremes are not very appropriate in relation to significance versus the null hypothesis due to the expectation of a normal distribution, but its power dual versus the full hypothesis is appropriately a minimum as perfect correlation admits no error distribution. Based on Monte Carlo simulations, we have observed that setting √sseB1 = √sseB2 = 1−|B| as per the usual convention is appropriately conservative on the upside but a little broad on the downside, whilst the weighted arithmetic mean, √sseB1 = 1−2|B|+2B², is sufficiently conservative on the downside, but unnecessarily conservative for high B.

Note that these two-tailed ranges are valid for Bookmaker Informedness and Markedness that can go positive or negative, but a one-tailed test would be appropriate for unsigned statistics or where a particular direction of prediction is assumed, as we have for our contingency tables. In these cases a smaller multiplier of 1.65 would suffice; however, the convention is to use the overlapping of the
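As an illustration only, the percentage measures reported in Table 2 can be recomputed from the four cell counts of each contingency table. The sketch below is not the author's implementation: the function name `contingency_measures` and the dictionary keys are hypothetical, and the χ² columns are omitted because their definitions lie outside this section. It also checks the relations Recall = B(1−Prev) + Bias and Precision = M(1−Bias) + Prev given in the accompanying text.

```python
from math import sqrt


def contingency_measures(tp, fp, fn, tn):
    """Measures for a 2x2 contingency table (hypothetical helper).

    tp, fp: counts in the positive-prediction row (real +, real -)
    fn, tn: counts in the negative-prediction row (real +, real -)
    """
    n = tp + fp + fn + tn
    prev = (tp + fn) / n              # Prevalence: proportion of real positives
    bias = (tp + fp) / n              # Bias: proportion of positive predictions
    recall = tp / (tp + fn)
    inverse_recall = tn / (fp + tn)
    precision = tp / (tp + fp)
    inverse_precision = tn / (fn + tn)
    informedness = recall + inverse_recall - 1        # B (DeltaP')
    markedness = precision + inverse_precision - 1    # M (DeltaP)
    # Matthews correlation, computed directly from the counts.
    correlation = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    accuracy = (tp + tn) / n                          # Rand Accuracy
    expected = bias * prev + (1 - bias) * (1 - prev)  # chance-level accuracy
    kappa = (accuracy - expected) / (1 - expected)    # Cohen Kappa
    return {
        "prev": prev, "bias": bias,
        "recall": recall, "precision": precision,
        "informedness": informedness, "markedness": markedness,
        "correlation": correlation, "accuracy": accuracy, "kappa": kappa,
        "f1": 2 * precision * recall / (precision + recall),
        "g": sqrt(precision * recall),
    }


if __name__ == "__main__":
    # Cell counts of the two example tables in Table 2.
    for counts in [(56, 20, 12, 12), (30, 12, 30, 28)]:
        m = contingency_measures(*counts)
        print(counts, {k: round(100 * v, 2) for k, v in m.items()})
        # Bias relations stated in the text.
        assert abs(m["recall"] - (m["informedness"] * (1 - m["prev"]) + m["bias"])) < 1e-12
        assert abs(m["precision"] - (m["markedness"] * (1 - m["bias"]) + m["prev"])) < 1e-12
```

Computing Correlation directly from the counts, rather than as the geometric mean of B and M, keeps the sign when Informedness and Markedness are negative.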
Table 2. Binary contingency tables. Colour coding highlights example counts of correct (green) and incorrect (pink) decisions with the resulting Bookmaker Informedness (B=WRacc=DeltaP'), Markedness (M=DeltaP), Matthews Correlation (C), Recall, Precision, Rand Accuracy, Harmonic Mean of Recall and Precision (F=F1), Geometric Mean of Recall and Precision (G), Cohen Kappa (κ), and χ² calculated using Bookmaker (χ²+P), Markedness (χ²+R) and standard (χ²) methods across the positive prediction or condition only, as well as calculated across the entire K=2 class contingency, all of which are designed to be referenced to alpha (α) according to the χ² distribution, with the latter more reliable due to taking into account all contingencies. Single-tailed threshold is shown for α=0.05.

First example (χ² threshold @α=0.05: 3.85)

| | +R | −R | Total | Bias |
|---|---|---|---|---|
| +P | 56 | 20 | 76 | 76.0% |
| −P | 12 | 12 | 24 | 24.0% |
| Total | 68 | 32 | 100 | |
| Prevalence | 68.0% | 32.0% | | |

Second example (χ² threshold @α=0.05: 3.85)

| | +R | −R | Total | Bias |
|---|---|---|---|---|
| +P | 30 | 12 | 42 | 42.0% |
| −P | 30 | 28 | 58 | 58.0% |
| Total | 60 | 40 | 100 | |
| Prevalence | 60.0% | 40.0% | | |

| Measure | First example | Second example |
|---|---|---|
| B | 19.85% | 20.00% |
| M | 23.68% | 19.70% |
| C | 21.68% | 19.85% |
| Recall | 82.35% | 50.00% |
| Precision | 73.68% | 71.43% |
| Rand Acc | 68.00% | 58.00% |
| F | 77.78% | 58.82% |
| G | 77.90% | 59.76% |
| κ | 21.26% | 18.60% |
| χ²+P | 1.13 | 2.29 |
| χ²+R | 1.61 | 2.22 |
| χ² | 1.13 | 2.29 |
| χ²KB | 1.72 | 1.92 |
| χ²KM | 2.05 | 1.89 |
| χ²KBM | 1.87 | 1.91 |

confidence bars around the various hypotheses (although usually the null is not explicitly represented).

Thus for any two hypotheses (including the null hypothesis, or one from a different contingency table or other experiment deriving from a different theory or system) the traditional approach of checking that 1.95SE (or 2SE) error bars don’t overlap is rather conservative (it is enough for the value to be outside the range for a two sided test), whilst checking overlap of 1SE error bars is usually insufficiently conservative given that the upper represents beta

close relationship (10, 15) with ROC AUC. A system that makes an informed (correct) decision for a target condition with probability B, and guesses the remainder of the time, will exhibit a Bookmaker Informedness (DeltaP') of B and a Recall of B(1−Prev) + Bias. Conversely a proposed marker which is marked (correctly) for a target condition with probability M, and according to chance the remainder of the time, will exhibit a Markedness (DeltaP) of M and a Precision of M(1−Bias) + Prev. Precision and Recall are thus biased by Prevalence and Bias, and variation of system parameters can make them rise or fall independently of