Journal of Machine Learning Technologies. ISSN: 2229-3981 & ISSN: 2229-399X, Volume 2, Issue 1, 2011, pp. 37-63. Available online at http://www.bioinfo.in/contents.php?id=51

EVALUATION: FROM PRECISION, RECALL AND F-MEASURE TO ROC, INFORMEDNESS, MARKEDNESS & CORRELATION

POWERS, D.M.W.* – *AI Lab, School of Computer Science, Engineering and Mathematics, Flinders University, South Australia, Australia. Corresponding author. Email: [email protected]

Received: February 18, 2011; Accepted: February 27, 2011

Abstract – Commonly used evaluation measures including Recall, Precision, F-Measure and Rand Accuracy are biased and should not be used without clear understanding of the biases, and corresponding identification of chance or base case levels of the statistic. Using these measures, a system that performs worse in the objective sense of Informedness can appear to perform better under any of these commonly used measures. We discuss several concepts and measures that reflect the probability that prediction is informed versus chance, Informedness, and introduce Markedness as a dual measure for the probability that prediction is marked versus chance. Finally we demonstrate elegant connections between the concepts of Informedness, Markedness, Correlation and Significance, as well as their intuitive relationships with Recall and Precision, and outline the extension from the dichotomous case to the general multiclass case.

Keywords – Recall and Precision, F-Measure, Rand Accuracy, Kappa, Informedness and Markedness, DeltaP, Correlation, Significance.

INTRODUCTION

A common but poorly motivated way of evaluating results of Machine Learning experiments is using Recall, Precision and F-measure. These measures are named for their origin in Information Retrieval and present specific biases, namely that they ignore performance in correctly handling negative examples, they propagate the underlying marginal prevalences and biases, and they fail to take account of the chance level performance. In the Medical Sciences, Receiver Operating Characteristics (ROC) analysis has been borrowed from Signal Processing to become a standard for evaluation and standard setting, comparing True Positive Rate and False Positive Rate. In the Behavioural Sciences, Specificity and Sensitivity are commonly used. Alternate techniques, such as Rand Accuracy and Cohen Kappa, have some advantages but are nonetheless still biased measures. We will recapitulate some of the literature relating to the problems with these measures, as well as considering a number of other techniques that have been introduced and argued within each of these fields, aiming/claiming to address the problems with these simplistic measures.

This paper recapitulates and re-examines the relationships between these various measures, develops new insights into the problem of measuring the effectiveness of an empirical decision system or a scientific experiment, analyzing and introducing new probabilistic and information theoretic measures that overcome the problems with Recall, Precision and their derivatives.

THE BINARY CASE

It is common to introduce the various measures in the context of a dichotomous binary classification problem, where the labels are by convention + and − and the predictions of a classifier are summarized in a four-cell contingency table. This contingency table may be expressed using raw counts of the number of times each predicted label is associated with each real class, or may be expressed in relative terms. Cell and margin labels may be formal probability expressions, may derive cell expressions from margin labels or vice-versa, may use alphabetic constant labels a, b, c, d or A, B, C, D, or may use acronyms for the generic terms for True and False, Real and Predicted Positives and Negatives. Often UPPER CASE is used where the values are counts, and lower case letters where the values are probabilities or proportions relative to N or the marginal probabilities – we will adopt this convention throughout this paper (always written in typewriter font), and in addition will use Mixed Case (in the normal text font) for popular nomenclature that may or may not correspond directly to one of our formal systematic names.

True and False Positives (TP/FP) refer to the number of Predicted Positives that were correct/incorrect, and similarly for True and False Negatives (TN/FN), and these four cells sum to N. On the other hand tp, fp, fn, tn and rp, rn and pp, pn refer to the joint and marginal probabilities, and the four contingency cells and the two pairs of marginal probabilities each sum to 1. We will attach other popular names to some of these probabilities in due course.

We thus make the specific assumptions that we are predicting and assessing a single condition that is either positive or negative (dichotomous), that we have one predicting model, and one gold standard labeling. Unless otherwise noted we will also for simplicity assume that the contingency is non-trivial, in the sense that both positive and negative states of both predicted and real conditions occur, so that none of the marginal sums or probabilities is zero.

We illustrate in Table 1 the general form of a binary contingency table using both the traditional alphabetic notation and the directly interpretable systematic approach. Both definitions and derivations in this paper are made relative to these labellings, although English terms (e.g. from Information Retrieval) will also be introduced for various ratios and probabilities. The green positive diagonal represents correct predictions, and the pink negative diagonal incorrect predictions. The predictions of the contingency table may be the predictions of a theory, of some computational rule or system (e.g. an Expert System or a Neural Network), or may simply be a direct measurement, a calculated metric, or a latent condition, symptom or marker. We will refer generically to "the model" as the source of the predicted labels, and "the population" or "the world" as the source of the real conditions. We are interested in understanding to what extent the model "informs" predictions about the world/population, and the world/population "marks" conditions in the model.

Recall & Precision, Sensitivity & Specificity

Recall or Sensitivity (as it is called in Psychology) is the proportion of Real Positive cases that are correctly Predicted Positive. This measures the Coverage of the Real Positive cases by the +P (Predicted Positive) rule. Its desirable feature is that it reflects how many of the relevant cases the +P rule picks up. It tends not to be very highly valued in Information Retrieval (on the assumptions that there are many relevant documents, that it doesn't really matter which subset we find, and that we can't know anything about the relevance of documents that aren't returned). Recall tends to be neglected or averaged away in Machine Learning and Computational Linguistics (where the focus is on how confident we can be in the rule or classifier). However, in a Computational Linguistics/Machine Translation context Recall has been shown to have a major weight in predicting the success of Word Alignment [1]. In a Medical context Recall is moreover regarded as primary, as the aim is to identify all Real Positive cases, and it is also one of the legs on which ROC analysis stands. In this context it is referred to as True Positive Rate (tpr). Recall is defined, with its various common appellations, by equation (1):

Recall = Sensitivity = tpr = tp/rp = TP/RP = A/(A+C)   (1)

Conversely, Precision or Confidence (as it is called in Data Mining) denotes the proportion of Predicted Positive cases that are correctly Real Positives. This is what Machine Learning, Data Mining and Information Retrieval focus on, but it is totally ignored in ROC analysis. It can however analogously be called True Positive Accuracy (tpa), being a measure of the accuracy of Predicted Positives, in contrast with the rate of discovery of Real Positives (tpr). Precision is defined in (2):

Precision = Confidence = tpa = tp/pp = TP/PP = A/(A+B)   (2)

These two measures and their combinations focus only on the positive examples and predictions, although between them they capture some information about the rates and kinds of errors made. However, neither of them captures any information about how well the model handles negative cases. Recall relates only to the +R column and Precision only to the +P row. Neither of these takes into account the number of True Negatives. This also applies to their Arithmetic, Geometric and Harmonic Means: A, G and F = G²/A (the F-factor or F-measure). Note that the F1 measure effectively references the True Positives to the Arithmetic Mean of Predicted Positives and Real Positives, being a constructed rate normalized to an idealized value; expressed in this form it is known in statistics as a Proportion of Specific Agreement, as it is applied to a specific class – applied to the Positive Class, it is PS+. It also corresponds to the set-theoretic Dice Coefficient.
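To make the definitions concrete, the following minimal sketch (ours, not part of the original paper; the counts are hypothetical) computes (1), (2) and the F1 combination directly from the cells A, B, C, D of Table 1 below:

    A, B, C, D = 70, 10, 30, 90          # TP, FP, FN, TN (hypothetical counts)
    N = A + B + C + D

    recall = A / (A + C)                 # (1) tp/rp: coverage of Real Positives
    precision = A / (A + B)              # (2) tp/pp: accuracy of Predicted Positives
    f1 = 2 * A / (2 * A + B + C)         # harmonic mean of (1) and (2); Dice / PS+

    print(f"Recall={recall:.3f} Precision={precision:.3f} F1={f1:.3f}")

Note that D (the True Negatives) appears nowhere in these three formulae, which is precisely the blind spot discussed below.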

Table 1. Systematic and traditional notations in a binary contingency table. Shading indicates correct (light=green) and incorrect (dark=red) rates or counts in the contingency table.

            +R      −R                          +R      −R
    +P      tp      fp      pp          +P      A       B       A+B
    −P      fn      tn      pn          −P      C       D       C+D
            rp      rn      1                   A+C     B+D     N

The Geometric Mean of Recall and Precision (G-measure) effectively normalizes TP to the Geometric Mean of Predicted Positives and Real Positives, and its Information content corresponds to the Arithmetic Mean of the Information represented by Recall and Precision.

In fact, there is in principle nothing special about the Positive case, and we can define Inverse statistics in terms of the Inverse problem in which we interchange positive and negative and are predicting the opposite case. Inverse Recall or Specificity is thus the proportion of Real Negative cases that are correctly Predicted Negative (3), and is also known as the True Negative Rate (tnr). Conversely, Inverse Precision is the proportion of Predicted Negative cases that are indeed Real Negatives (4), and can also be called True Negative Accuracy (tna):

Inverse Recall = tnr = tn/rn = TN/RN = D/(B+D)   (3)
Inverse Precision = tna = tn/pn = TN/PN = D/(C+D)   (4)

The inverse of F1 is not well known in AI/ML/CL/IR, but is just as well known as PS+ in statistics, being the Proportion of Specific Agreement for the class of negatives, PS−. Note that whereas F1 is advocated in AI/ML/CL/IR as a single measure to capture the effectiveness of a system, it still completely ignores TN, which can vary freely without affecting the statistic. In statistics, PS+ is used in conjunction with PS− to ensure the contingencies are completely captured, and similarly Specificity (Inverse Recall) is always recorded along with Sensitivity (Recall).

Rand Accuracy explicitly takes into account the classification of negatives, and is expressible (5) both as a weighted average of Precision and Inverse Precision and as a weighted average of Recall and Inverse Recall:

Accuracy = tca = tcr = tp + tn
         = rp·tpr + rn·tnr = (TP+TN)/N
         = pp·tpa + pn·tna = (A+D)/N   (5)
Dice = F1 = tp/(tp+(fn+fp)/2) = A/(A+(B+C)/2)
     = 1/(1+mean(FN,FP)/TP)   (6)
Jaccard = tp/(tp+fn+fp) = TP/(N−TN)
        = A/(A+B+C) = A/(N−D)
        = 1/(1+2·mean(FN,FP)/TP) = F1/(2−F1)   (7)

As shown in (5), Rand Accuracy is effectively a prevalence-weighted average of Recall and Inverse Recall, as well as a bias-weighted average of Precision and Inverse Precision. Whilst it does take TN into account in the numerator, the sensitivity to bias and prevalence is an issue, since these are independent variables, with prevalence varying as we apply to data sampled under different conditions, and bias being directly under the control of the system designer (e.g. as a threshold). Similarly, we can note that one of N, FP or FN is free to vary. Whilst it apparently takes TN into account in the numerator, the Jaccard (or Tanimoto) similarity coefficient uses it to heuristically discount the correct classification of negatives, but it can be written (7) independently of FN and N in a way similar to the effectively equivalent Dice or PS+ or F1 (6), or in terms of them, and so is subject to bias as FN or N is free to vary; both fail to capture the contingencies fully without knowing the inverse statistics too.

Each of the above also has a complementary form defining an error rate, of which some have specific names and importance: Fallout or False Positive Rate (fpr) is the proportion of Real Negatives that occur as Predicted Positive (ring-ins); Miss Rate or False Negative Rate (fnr) is the proportion of Real Positives that are Predicted Negative (false-drops). False Positive Rate is the second of the legs on which ROC analysis is based.

Fallout = fpr = fp/rn = FP/RN = B/(B+D)   (8)
Miss Rate = fnr = fn/rp = FN/RP = C/(A+C)   (9)

Note that FP and FN are sometimes referred to as Type I and Type II Errors, and the rates fp and fn as alpha and beta, respectively – referring to falsely rejecting or accepting a hypothesis. More correctly, these terms apply specifically to the meta-level problem discussed later, of whether the precise pattern of counts (not rates) in the contingency table fits the null hypothesis of random distribution rather than reflecting the effect of some alternative hypothesis (which is not in general the one represented by +P→+R, or −P→−R, or both).

Note that all the measures discussed individually leave at least two degrees of freedom (plus N) unspecified and free to control, and this leaves the door open for bias, whilst N is needed too for estimating significance and power.

Prevalence, Bias, Cost & Skew

We now turn our attention to the various forms of bias that detract from the utility of all of the above surface measures [2]. We will first note that rp represents the Prevalence of positive cases, RP/N, and is assumed to be a property of the population of interest – it may be constant, or it may vary across subpopulations, but is regarded here as not being under the control of the experimenter, and so we want a prevalence-independent measure. By contrast, pp represents the (label) Bias of the model [3], the tendency of the model to output positive labels, PP/N, and is directly under the control of the experimenter, who can change the model by changing the theory or algorithm, or some parameter or threshold, to better fit the world/population being modeled.
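The inverse and error-rate statistics (3)–(9), and the Dice/Jaccard equivalence noted in (7), can be checked with a short sketch (ours; the same hypothetical counts as before):

    A, B, C, D = 70, 10, 30, 90          # TP, FP, FN, TN
    N = A + B + C + D

    inverse_recall = D / (B + D)         # (3) Specificity, tnr
    inverse_precision = D / (C + D)      # (4) tna
    rand_accuracy = (A + D) / N          # (5)
    dice_f1 = A / (A + (B + C) / 2)      # (6)
    jaccard = A / (A + B + C)            # (7)
    fallout = B / (B + D)                # (8) fpr
    miss_rate = C / (A + C)              # (9) fnr

    # Jaccard is a monotone transform of F1, as claimed in (7)
    assert abs(jaccard - dice_f1 / (2 - dice_f1)) < 1e-12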


As discussed earlier, F-factor (or Dice or Jaccard) effectively references tp (the probability or proportion of True Positives) to the Arithmetic Mean of Bias and Prevalence (6–7). A common rule of thumb, or even a characteristic of some algorithms, is to parameterize a model so that Prevalence = Bias, viz. rp = pp. Corollaries of this setting are Recall = Precision (= Dice, but not Jaccard), Inverse Recall = Inverse Precision, and Fallout = Miss Rate.

Alternate characterizations of Prevalence are in terms of Odds [4] or Skew [5], being the Class Ratio cs = rn/rp, recalling that by definition rp + rn = 1 and RN + RP = N. If the distribution is highly skewed, there are typically many more negative cases than positive, and this means the number of errors due to poor Inverse Recall will be much greater than the number of errors due to poor Recall. Given the cost of both False Positives and False Negatives is equal, individually, the overall component of the total cost due to False Positives (as Negatives) will be much greater at any significant level of chance performance, due to the higher Prevalence of Real Negatives.

Note that the normalized binary contingency table with unspecified margins has three degrees of freedom – setting any three non-redundant ratios determines the rest (setting any count supplies the remaining information to recover the original table of counts with its four degrees of freedom). In particular, Recall, Inverse Recall and Prevalence, or equivalently tpr, fpr and cs, suffice to determine all ratios and measures derivable from the normalized contingency table, but N is also required to determine significance. As another case of specific interest, Precision, Inverse Precision and Bias, in combination, suffice to determine all ratios or measures, although we will show later that an alternate characterization of Prevalence and Bias in terms of Evenness allows even simpler relationships to be exposed.

We can also take into account a differential value for positives (cp) and negatives (cn) – this can be applied to errors as a cost (loss or debit) and/or to correct cases as a gain (profit or credit), and can be combined into a single Cost Ratio cv = cn/cp. Note that the value-determined and skew-determined costs have similar effects, and may be multiplied to produce a single skew-like cost factor c = cv·cs. Formulations of measures that are expressed using tpr, fpr and cs may be made cost-sensitive by using c = cv·cs in place of c = cs, or can be made skew/cost-insensitive by using c = 1 [5].

ROC and PN Analyses

Flach [5] highlighted the utility of ROC analysis to the Machine Learning community, and characterized the skew sensitivity of many measures in that context, utilizing the ROC format to give geometric insights into the nature of the measures and their sensitivity to skew. [6] further elaborated this analysis, extending it to the unnormalized PN variant of ROC, and targeting their analysis specifically to rule learning. We will not examine the advantages of ROC analysis here, but will briefly explain the principles and recapitulate some of the results.

ROC analysis plots the rate tpr against the rate fpr, whilst PN plots the unnormalized TP against FP. This difference in normalization only changes the scales and gradients, and we will deal only with the normalized form of ROC analysis. A perfect classifier will score in the top left hand corner (fpr=0, tpr=100%). A worst case classifier will score in the bottom right hand corner (fpr=100%, tpr=0). A random classifier would be expected to score somewhere along the positive diagonal (tpr=fpr), since the model will throw up positive and negative examples at the same rate (relative to their populations – these are Recall-like scales: tpr = Recall, 1−fpr = Inverse Recall). The negative diagonal (tpr + c·fpr = 1) corresponds to matching Bias to Prevalence for a skew of c.

Fig. 1 – Illustration of ROC Analysis. The main diagonal represents chance, with parallel isocost lines representing equal cost-performance. Points above the diagonal represent performance better than chance, those below worse than chance. For a single good (dotted = green) system, AUC is the area under the curve (the trapezoid between the green line and x = [0,1]). The perverse (dashed = red) system shown is the same (good) system with class labels reversed.

The ROC plot allows us to compare classifiers (models and/or parameterizations) and choose the one that is closest to (0,1) and furthest from tpr = fpr in some sense. These conditions for choosing the optimal parameterization or model are not identical, and in fact the most common condition is to maximize the area under the curve (AUC), which for a single parameterization of a model is defined by a single point and the segments connecting it to (0,0) and (1,1).
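The geometry just described is easy to compute directly. The following sketch (ours, with hypothetical counts) locates a classifier in normalized ROC space and computes the skew factor cs defined above:

    A, B, C, D = 70, 10, 30, 90          # TP, FP, FN, TN (hypothetical)
    N = A + B + C + D
    rp, rn = (A + C) / N, (B + D) / N    # Prevalence of Real Positives/Negatives
    tpr, fpr = A / (A + C), B / (B + D)  # Recall and Fallout: the ROC coordinates
    cs = rn / rp                         # Class Ratio (Skew); 1.0 for this table
    cv = 1.0 / 4.0                       # hypothetical Cost Ratio cn/cp
    c = cv * cs                          # combined skew-like cost factor

    above_chance = tpr > fpr                             # above the main diagonal?
    on_neg_diagonal = abs(tpr + cs * fpr - 1) < 1e-12    # Bias matched to Prevalence?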


For a parameterized model, the ROC curve will be a monotonic function consisting of a sequence of segments from (0,0) to (1,1). A particular cost model and/or accuracy measure defines an isocost gradient, which for a skew- and cost-insensitive model will be c=1, and hence another common approach is to choose a tangent point on the highest isocost line that touches the curve. The simple condition of choosing the point on the curve nearest the optimum point (0,1) is not commonly used, but this distance to (0,1) is given by √[fpr² + (1−tpr)²], and minimizing it amounts to minimizing the sum of squared normalized error, fpr² + fnr².

A ROC curve with concavities can also be locally interpolated to produce a smoothed model following the convex hull of the original ROC curve. It is even possible to locally invert across the convex hull to repair concavities, but this may overfit and thus not generalize to unseen data. Such repairs can lead to selecting an improved model, and the ROC curve can also be used to retune a model to changing Prevalence and costs. The area under such a multipoint curve is thus of some value, but the optimum in practice is the area under the simple trapezoid defined by the model:

AUC = (tpr − fpr + 1)/2
    = (tpr + tnr)/2
    = 1 − (fpr + fnr)/2   (10)

For the cost- and skew-insensitive case, with c=1, maximizing AUC is thus equivalent to maximizing tpr − fpr, or minimizing the sum of (absolute) normalized error, fpr + fnr. The chance line corresponds to tpr − fpr = 0, and parallel isocost lines for c=1 have the form tpr − fpr = k. The highest isocost line also maximizes tpr − fpr and AUC, so that these two approaches are equivalent. Minimizing the sum of squared normalized error, fpr² + fnr², corresponds to a Euclidean distance minimization heuristic that is equivalent only under appropriate constraints, e.g. fpr = fnr, or equivalently Bias = Prevalence, noting that all cells are non-negative by construction.

We now summarize the relationships between the various candidate accuracy measures as rewritten [5,6] in terms of tpr, fpr and the skew, c, as well as in terms of Recall, Bias and Prevalence (Prev):

Accuracy = [tpr + c·(1−fpr)]/[1+c]
         = 2·Recall·Prev + 1 − Bias − Prev   (11)
Precision = tpr/[tpr + c·fpr]
          = Recall·Prev/Bias   (12)
F-Measure F1 = 2·tpr/[tpr + c·fpr + 1]
             = 2·Recall·Prev/[Bias + Prev]   (13)
WRAcc = 4c·[tpr − fpr]/[1+c]²
      = 4·[Recall − Bias]·Prev   (14)

The last measure, Weighted Relative Accuracy, was defined [7] to subtract off the component of the True Positive score that is attributable to chance and rescale to the range ±1. Note that maximizing WRAcc is equivalent to maximizing AUC or tpr − fpr = 2·AUC − 1, as c is constant. Thus WRAcc is an unbiased accuracy measure, and the skew-insensitive form of WRAcc, with c=1, is precisely tpr − fpr. Each of the other measures (11–13) shows a bias, in that it cannot be maximized independent of skew, although skew-insensitive versions can be defined by setting c=1. The recasting of Accuracy, Precision and F-Measure in terms of Recall makes clear how all of these vary only in terms of the way they are affected by Prevalence and Bias. Prevalence is regarded as a constant of the target condition or data set (and c = [1−Prev]/Prev), whilst parameterizing or selecting a model can be viewed in terms of trading off tpr and fpr as in ROC analysis, or equivalently as controlling the relative number of positive and negative predictions, namely the Bias, in order to maximize a particular accuracy measure (Recall, Precision, F-Measure, Rand Accuracy or AUC). Note that for a given Recall level, the other measures (10–13) all decrease with increasing Bias towards positive predictions.
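The recastings (10)–(14) are easy to verify numerically. The following sketch (ours; the normalized table is hypothetical) asserts each identity against direct computation:

    import math

    tp, fp, fn, tn = 0.35, 0.05, 0.15, 0.45   # hypothetical normalized table
    rp, pp = tp + fn, tp + fp                 # Prevalence (Prev) and Bias
    tpr, fpr = tp / rp, fp / (1 - rp)
    recall, prev, bias = tpr, rp, pp
    c = (1 - rp) / rp                         # skew cs

    auc = (tpr - fpr + 1) / 2                                  # (10)
    acc = (tpr + c * (1 - fpr)) / (1 + c)                      # (11)
    assert math.isclose(acc, 2 * recall * prev + 1 - bias - prev)
    assert math.isclose(acc, tp + tn)
    prec = tpr / (tpr + c * fpr)                               # (12)
    assert math.isclose(prec, recall * prev / bias)
    f1 = 2 * tpr / (tpr + c * fpr + 1)                         # (13)
    assert math.isclose(f1, 2 * recall * prev / (bias + prev))
    wracc = 4 * c * (tpr - fpr) / (1 + c) ** 2                 # (14)
    assert math.isclose(wracc, 4 * (recall - bias) * prev)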
DeltaP, Informedness and Markedness

Powers [4] also derived an unbiased accuracy measure to avoid the bias of Recall, Precision and Accuracy due to population Prevalence and label Bias. The Bookmaker algorithm costs wins and losses in the same way a fair bookmaker would set prices based on the odds. Powers then defines the concept of Informedness, which represents the 'edge' a punter has in making his bet, as evidenced and quantified by his winnings. Fair pricing based on correct odds should be zero-sum – that is, guessing will leave you with nothing in the long run, whilst a punter with certain knowledge will win every time. Informedness is the probability that a punter is making an informed bet, and is explained in terms of the proportion of the time the edge works out versus ends up being pure guesswork. Powers defined Bookmaker Informedness for the general, K-label, case, but we will defer discussion of the general case for now and present a simplified formulation of Informedness, as well as the complementary concept of Markedness.

Definition 1
Informedness quantifies how informed a predictor is for the specified condition, and specifies the probability that a prediction is informed in relation to the condition (versus chance).

Definition 2
Markedness quantifies how marked a condition is for the specified predictor, and specifies the probability that a condition is marked by the predictor (versus chance).

These definitions are aligned with the psychological and linguistic uses of the terms condition and marker. The condition represents the experimental outcome we are trying to determine by indirect means. A marker or predictor (cf. biomarker or neuromarker) represents the indicator we are using to determine the outcome.

There is no implication of causality – that is something we will address later. However, there are two possible directions of implication we will address now. Detection of the predictor may reliably predict the outcome, with or without the occurrence of a specific outcome condition reliably triggering the predictor.

For the binary case we have:

Informedness = Recall + Inverse Recall – 1
             = tpr − fpr = 1 − fnr − fpr   (15)
Markedness = Precision + Inverse Precision – 1
           = tpa − fna = 1 − fpa − fna

We noted above that maximizing AUC or the unbiased WRAcc measure effectively maximizes tpr − fpr, and indeed WRAcc reduced to this in the skew-independent case. This is not surprising, given both Powers [4] and Flach [5–7] set out to produce an unbiased measure, and the linear definition of Informedness will define a unique linear form. Note that while Informedness is a deep measure of how consistently the Predictor predicts the Outcome, by combining surface measures about what proportion of Outcomes are correctly predicted, Markedness is a deep measure of how consistently the Outcome has the Predictor as a Marker, by combining surface measures about what proportion of Predictions are correct.

In the Psychology literature, Markedness is known as DeltaP and is empirically a good predictor of human associative judgements – that is, it seems we develop associative relationships between a predictor and an outcome when DeltaP is high, and this is true even when multiple predictors are in competition [8]. In the context of experiments on information use in syllable processing, Perruchet and Peereman [9] note that Shanks [8] sees DeltaP as "the normative measure of contingency", but propose a complementary, backward, additional measure of strength of association, DeltaP' – aka dichotomous Informedness. Perruchet and Peereman [9] also note the analogy of DeltaP to a regression coefficient, and that the Geometric Mean of the two measures is a dichotomous form of the Pearson correlation coefficient, the Matthews Correlation Coefficient, which is appropriate unless a continuous scale is being measured dichotomously, in which case a Tetrachoric Correlation estimate would be appropriate [10,11].

Causality, Correlation and Regression

In a linear regression of two variables, we seek to predict one variable, y, as a linear combination of the other, x, finding a line of best fit in the sense of minimizing the sum of squared error (in y). The equation of fit has the form y = y0 + rx·x, where:

rx = [n·∑xy − ∑x·∑y] / [n·∑x² − (∑x)²]   (16)

Substituting in counts from the contingency table, for the regression predicting +R (1) versus −R (0) given +P (1) versus −P (0), we obtain this gradient of best fit (minimizing the error in the real values R):

rP = [AD – BC] / [(A+B)(C+D)]
   = A/(A+B) – C/(C+D)
   = DeltaP = Markedness   (17)

Conversely, we can find the regression coefficient for predicting P from R (minimizing the error in the predictions P):

rR = [AD – BC] / [(A+C)(B+D)]
   = A/(A+C) – B/(B+D)
   = DeltaP' = Informedness   (18)

Finally we see that the Matthews correlation, a contingency matrix method of calculating the Pearson product-moment correlation coefficient, ρ, is defined by:

rG = [AD – BC] / √[(A+C)(B+D)(A+B)(C+D)]
   = Correlation = ±√[Informedness·Markedness]   (19)

Had the regressions found the same line of best fit, these gradients would be reciprocal, defining a perfect Correlation of 1. However, both Informedness and Markedness are probabilities with an upper bound of 1, so perfect correlation requires perfect regression. The squared correlation is a coefficient of proportionality indicating the proportion of the variance in R that is explained by P, and is traditionally also interpreted as a probability. We can now interpret it either as the joint probability that P informs R and R marks P, given that the two directions of predictability are independent, or as the probability that the variance is (causally) explained reciprocally. The sign of the Correlation will be the same as the sign of Informedness and Markedness, and indicates whether a correct or perverse usage of the information has been made – take note in interpreting the final part of (19).

Psychologists traditionally explain DeltaP in terms of causal prediction, but it is important to note that the direction of stronger prediction is not necessarily the direction of causality, and the fallacy of abductive reasoning is that the truth of A → B does not in general have any bearing on the truth of B → A. If Pi is one of several independent possible causes of R, Pi → R is strong, but R → Pi is in general weak for any specific Pi. If Pi is one of several necessary contributing factors to R, Pi → R is weak for any single Pi, but R → Pi is strong. The directions of the implication are thus not in general dependent.

In terms of the regression to fit R from P, since there are only two correct points and two error points, and errors are calculated in the vertical (R) direction only, all errors contribute equally to tilting the regression down from the ideal line of fit. This Markedness regression thus provides information about the consistency of the Outcome in terms of having the Predictor as a Marker – the errors measured from the Outcome R relate to the failure of the Marker P to be present.


We can gain further insight into the nature of these regression and correlation coefficients by reducing the top and bottom of each expression to probabilities (dividing by N², noting that the original contingency counts sum to N, and the joint probabilities after reduction sum to 1). The numerator is the determinant of the contingency matrix, common across all three coefficients, and reduces to dtp, whilst the reduced denominator of each regression coefficient depends only on the Prevalence or Bias of the base variates. The regression coefficients, Bookmaker Informedness (B) and Markedness (M), may thus be re-expressed in terms of Precision (Prec) or Recall, along with Bias and Prevalence (Prev) or their inverses (I):

M = dtp / [Bias·(1−Bias)]
  = dtp / [pp·pn] = dtp / pg²
  = dtp / BiasG² = dtp / EvennessP
  = [Precision – Prevalence] / IBias   (20)

B = dtp / [Prevalence·(1−Prevalence)]
  = dtp / [rp·rn] = dtp / rg²
  = dtp / PrevG² = dtp / EvennessR
  = [Recall – Bias] / IPrev
  = Recall – Fallout
  = Recall + IRecall – 1
  = Sensitivity + Specificity – 1
  = (LR – 1)·(1 – Specificity)
  = (1 – NLR)·Specificity
  = (LR – 1)·(1 – NLR) / (LR – NLR)   (21)

In the medical and behavioural sciences, the Likelihood Ratio is LR = Sensitivity/[1 – Specificity], and the Negative Likelihood Ratio is NLR = [1 – Sensitivity]/Specificity. For non-negative B, LR > 1 > NLR, with 1 as the chance case. We also express Informedness in these terms in (21).

The Matthews/Pearson correlation is expressed in reduced form as the Geometric Mean of Bookmaker Informedness and Markedness, abbreviating their product as BookMark (BM) and recalling that it is BookMark that acts as a probability-like coefficient of determination, not its root, the Geometric Mean (BookMarkG or BMG):

BMG = dtp / √[Prev·(1−Prev)·Bias·(1−Bias)]
    = dtp / [PrevG·BiasG]
    = dtp / EvennessG
    = √[(Recall−Bias)·(Prec−Prev) / (IPrev·IBias)]   (22)

These equations clearly indicate how the Bookmaker coefficients of regression and correlation depend only on the proportion of True Positives and the Prevalence and Bias applicable to the respective variables. Furthermore, Prev·Bias represents the Expected proportion of True Positives (etp) relative to N, showing that the coefficients each represent the proportion of Delta True Positives (the deviation from expectation, dtp = tp − etp) renormalized in different ways to give different probabilities. Equations (20−22) illustrate this, showing that these coefficients depend only on dtp and either Prevalence, Bias or their combination. Note that for a particular dtp these coefficients are minimized when the Prevalence and/or Bias are at the evenly biased 0.5 level; however, in a learning or parameterization context, changing the Prevalence or Bias will in general change both tp and etp, and hence can change dtp.

It is also worth considering further the relationship of the denominators to the Geometric Means: PrevG of Prevalence and Inverse Prevalence (IPrev = 1−Prev is the Prevalence of Real Negatives), and BiasG of Bias and Inverse Bias (IBias = 1−Bias is the bias to Predicted Negatives). These Geometric Means represent the Evenness of the Real classes (EvennessR = PrevG²) and the Predicted labels (EvennessP = BiasG²). We also introduce the concept of Global Evenness as the Geometric Mean of these two natural kinds of Evenness, EvennessG. From this formulation we can see that, for a given relative delta of true positive prediction above expectation (dtp), the correlation is at a minimum when predictions and outcomes are both evenly distributed (√EvennessG = √EvennessR = √EvennessP = Prev = Bias = 0.5), and Markedness and Bookmaker are individually minimal when Bias resp. Prevalence are evenly distributed (viz. Bias resp. Prev = 0.5). This suggests that setting Learner Bias (and regularized, cost-weighted or subsampled Prevalence) to 0.5, as sometimes performed in Artificial Neural Network training, is in fact inappropriate on theoretical grounds, as has previously been shown both empirically and based on Bayesian principles – rather, it is best to use Learner/Label Bias = Natural Prevalence, which is in general much less than 0.5 [12].

Note that in the above equations (20−22) the denominator is always strictly positive, since we have occurrences and predictions of both Positives and Negatives by earlier assumption; but we note that if, in violation of this constraint, we have a degenerate case in which there is nothing to predict or we make no effective prediction, then tp = etp and dtp = 0, and all the above regression and correlation coefficients are defined in the limit approaching zero. Thus the coefficients are zero if and only if dtp is zero, and they have the same sign as dtp otherwise. Assuming that we are using the model the right way round, dtp, B and M are non-negative, and BMG is similarly non-negative as expected. If the model is the wrong way round, then dtp, B, M and BMG can indicate this by expressing below-chance performance, negative regressions and negative correlation, and we can reverse the sense of P to correct this.
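The Evenness formulations (20)–(22) can be checked as follows (our sketch; hypothetical normalized table):

    tp, fp, fn, tn = 0.35, 0.05, 0.15, 0.45   # hypothetical normalized table
    prev, bias = tp + fn, tp + fp
    dtp = tp - prev * bias                    # deviation from expectation etp

    evenness_r = prev * (1 - prev)            # EvennessR = PrevG**2
    evenness_p = bias * (1 - bias)            # EvennessP = BiasG**2

    B = dtp / evenness_r                              # (21) Bookmaker Informedness
    M = dtp / evenness_p                              # (20) Markedness
    BMG = dtp / (evenness_r * evenness_p) ** 0.5      # (22) correlation

    assert abs(B - (tp / prev - fp / (1 - prev))) < 1e-12    # equals tpr − fpr
    assert abs(M - (tp / bias - fn / (1 - bias))) < 1e-12    # equals tpa − fna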


The absolute value of the determinant of the contingency matrix, dp = |dtp|, in these probability formulae (20–22) also represents the sum of the absolute deviations from expectation represented by any individual cell, and hence 2dp = 2DP/N is the total absolute relative error versus the null hypothesis. Additionally it has a geometric interpretation as the area of a trapezoid in PN-space, the unnormalized variant of ROC [6]. We already observed that in (normalized) ROC analysis, Informedness is twice the triangular area between a positively informed system and the chance line, and it thus corresponds to the area of the trapezoid defined by a system (assumed to perform no worse than chance), any of its perversions (interchanging prediction labels but not real classes, or vice-versa, so as to derive a system that performs no better than chance), and the endpoints of the chance line (the trivial cases in which the system labels all cases true or, conversely, all are labelled false). Such a kite-shaped area is delimited by the dotted (system) and dashed (perversion) lines in Fig. 1 (interchanging class labels), but the alternate parallelogram (interchanging prediction labels) is not shown. The Informedness of a perverted system is the negation of the Informedness of the correctly polarized system.

We now also express the Informedness and Markedness forms of DeltaP in terms of deviations from expected values, along with the Harmonic Means of the marginal cardinalities of the Real classes or Predicted labels respectively, defining DP, DELTAP, RH, PH and related forms in terms of their N-relative probabilistic forms as follows:

etp = rp·pp;  etn = rn·pn   (23)
dp = tp – etp = dtp = dtn = (tn – etn);
deltap = dtp + dtn = 2dp   (24)
rh = 2·rp·rn / [rp+rn] = rg²/ra
ph = 2·pp·pn / [pp+pn] = pg²/pa   (25)

DeltaP' or Bookmaker Informedness may now be expressed in terms of deltap and rh, and DeltaP or Markedness analogously in terms of deltap and ph:

B = DeltaP' = [etp+dtp]/rp – [efp−dtp]/rn
  = etp/rp – efp/rn + 2dtp/rh
  = 2dp/rh = deltap/rh   (26)
M = DeltaP = 2dp/ph = deltap/ph   (27)

These harmonic relationships connect directly with the previous geometric Evenness terms, by observing that HarmonicMean = GeometricMean²/ArithmeticMean, as seen in (25) and used in the alternative expressions for the Evenness normalizations in (26−27). The use of the Harmonic Mean makes the relationship with F-measure clearer, but use of the Geometric Mean is generally preferred as a consistent estimate of central tendency that more accurately estimates the mode for skewed (e.g. Poisson) data bounded below by 0 and unbounded above, and as the central limit of the family of Lp-based averages. Viz. the Geometric (L0) Mean is the Geometric Mean of the Harmonic (L−1) and Arithmetic (L+1) Means, with positive values of p being biased higher (toward L+∞ = Max) and negative values of p being biased lower (toward L−∞ = Min).

Effect of Bias & Prev on Recall & Precision

The final forms of equations (26−27) cancel out the common Bias and Prevalence (Prev) terms that denormalized tp to tpr (Recall) or tpa (Precision). We now recast the Bookmaker Informedness and Markedness equations to show Recall and Precision as subject (28−29), in order to explore the effect of Bias and Prevalence on Recall and Precision, as well as to clarify the relationship of Bookmaker and Markedness to these other ubiquitous but iniquitous measures:

Recall = Bookmaker·(1−Prevalence) + Bias
Bookmaker = (Recall − Bias)/(1−Prevalence)   (28)
Precision = Markedness·(1−Bias) + Prevalence
Markedness = (Precision − Prevalence)/(1−Bias)   (29)

Bookmaker and Markedness are unbiased estimators of above-chance performance (relative to, respectively, the predicting conditions or the predicted markers). Equations (28−29) clearly show the nature of the bias introduced by both Label Bias and Class Prevalence. If operating at chance level, both Bookmaker and Markedness will be zero, and Recall, Precision, and derivatives such as the F-measure will merely reflect the biases. Note that increasing Bias or decreasing Prevalence increases Recall and decreases Precision, for a constant level of unbiased performance. We can more specifically see that the regression coefficient for the prediction of Recall from Prevalence is −Informedness, and from Bias is +1; similarly, the regression coefficient for the prediction of Precision from Bias is −Markedness, and from Prevalence is +1. Using the heuristic of setting Bias = Prevalence then sets Recall = Precision = F1 and Bookmaker Informedness = Markedness = Correlation. Setting Bias = 1 (Prevalence < 1) may be seen to make Precision track Prevalence with Recall = 1, whilst Prevalence = 1 (Bias < 1) means Recall = Bias with Precision = 1, and under either condition there is no information utilized (Bookmaker Informedness = Markedness = 0).

In summary, Recall reflects the Bias plus a discounted estimation of Informedness, and Precision reflects the Prevalence plus a discounted estimation of Markedness. Given usually Prevalence << ½ and Bias << ½, their complements Inverse Prevalence >> ½ and Inverse Bias >> ½ represent a substantial weighting up of the true unbiased performance in both these measures, and hence also in F1. High Bias drives Recall up strongly and Precision down according to the strength of Informedness; high Prevalence drives


Precision up and Recall down according to the strength of Markedness.

Alternately, Informedness can be viewed (21) as a renormalization of Recall after subtracting off the chance level of Recall, namely Bias; and Markedness (20) can be seen as a renormalization of Precision after subtracting off the chance level of Precision, namely Prevalence (Flach's WRAcc, the unbiased form being equivalent to Bookmaker Informedness, was defined in this way, as discussed earlier). Informedness can also be seen (21) as a renormalization of LR or NLR after subtracting off their chance level performance. The Kappa measure [13−16] commonly used in assessor agreement evaluation was similarly defined as a renormalization of Accuracy after subtracting off an estimate of the expected Accuracy; for Cohen Kappa this estimate is the dot product of the Biases and Prevalences, and Kappa is expressible as a normalization of the discriminant of contingency, dtp, by the mean error rate (cf. F1; viz. Kappa is dtp/[dtp + mean(fp,fn)]). All three measures are invariant in the sense that they are properties of the contingency table that remain unchanged when we flip to the Inverse problem (interchanging positive and negative for both conditions and predictions). That is, we observe:

Inverse Informedness = Informedness,
Inverse Markedness = Markedness,
Inverse Kappa = Kappa.

The Dual problem (interchanging antecedent and consequent) reverses which condition is the predictor and which the predicted condition, and hence interchanges Recall and Precision, Prevalence and Bias, as well as Markedness and Informedness. For cross-evaluator agreement, both Informedness and Markedness are meaningful, although the polarity and orientation of the contingency are arbitrary. Similarly, when examining causal relationships (conventionally DeltaP vs DeltaP'), it is useful to evaluate both the deductive and abductive directions in determining the strength of association. For example, the connection between cloud and rain involves cloud as one causal antecedent of rain (but sunshowers occur occasionally), and rain as one causal consequent of cloud (but cloudy days aren't always wet) – only once we have identified the full causal chain can we reduce to equivalence, and lack of equivalence may be a result of unidentified causes, alternate outcomes, or both.

The Perverse systems (interchanging the labels on either the predictions or the classes, but not both) have similar performance but occur below the chance line (since we have assumed strictly better than chance performance in assigning labels to the given contingency matrix).

Note that the effect of Prevalence on Accuracy, Recall and Precision has also been characterized above in terms of Flach's demonstration of how skew enters into their characterization in ROC analysis, which effectively assigns different costs to (False) Positives and (False) Negatives. This can be controlled for by setting the parameter c appropriately to reflect the desired skew and cost tradeoff, with c=1 defining skew- and cost-insensitive versions. However, only Informedness (or equivalents such as DeltaP' and skew-insensitive WRAcc) precisely characterizes the probability with which a model informs the condition, and conversely only Markedness (or DeltaP) precisely characterizes the probability that a condition marks (informs) the predictor. Similarly, only the Correlation (aka the Coefficient of Proportionality, aka the Coefficient of Determination, aka the Squared Matthews Correlation Coefficient) precisely characterizes the probability that condition and predictor inform/mark each other, under our dichotomous assumptions. Note that the Tetrachoric Correlation is another estimate of the Pearson Correlation, made under the alternate assumption of an underlying continuous variable (assumed normally distributed), and is appropriate if we instead assume that we are dichotomizing a normal continuous variable [11]. But in this article we are making the explicit assumption that we are dealing with a right/wrong dichotomy that is intrinsically discontinuous.


Although Kappa does attempt to renormalize a debiased estimate of Accuracy, and is thus much more meaningful than Recall, Precision, Accuracy and their biased derivatives, it is intrinsically non-linear, doesn't account for error well, and retains an influence of bias, so that there does not seem to be any situation in which Kappa would be preferable to Correlation as a standard independent measure of agreement [13,16]. As we have seen, Bookmaker Informedness, Markedness and Correlation reflect the discriminant of the relative contingency normalized according to different Evenness functions of the marginal Biases and Prevalences, and reflect probabilities relative to the corresponding marginal cases. However, Kappa scales the discriminant in a way that reflects the actual error without taking into account the expected error due to chance; in effect it is really just using the discriminant to scale the actual mean error: Kappa is dtp/[dtp + mean(fp,fn)] = 1/[1 + mean(fp,fn)/dtp], which approximates for small error to 1 − mean(fp,fn)/dtp. The relatively good fit of Kappa to Correlation and Informedness is illustrated in Fig. 2, along with the poor fit of the Rank Weighted Average and the Geometric and Harmonic (F-factor) means. The fit of the Evenness-weighted determinant is perfect and not easily distinguishable, but the separate components (the Determinant, and the geometric means of the Real Prevalences and Prediction Biases) are also shown (+1 for clarity).

Fig. 2 – Accuracy of traditional measures. 110 Monte Carlo simulations with 11 stepped expected Informedness levels (red), with estimated Informedness (red dot), Markedness (green dot) and Correlation (blue dot), and showing (dashed) Kappa versus the biased traditional measures Rank Weighted Average (Wav), Geometric Mean (Gav) and F1 (Fav). The Determinant (D) and Evenness kth roots (gR = PrevG and gP = BiasG) are shown +1. K=4, N=128. (Online version has figures in colour.)
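The identity between the dtp-based form of Kappa above and Cohen's usual definition, and its divergence from the Evenness-based Correlation, can be checked with a sketch (ours; hypothetical normalized table):

    tp, fp, fn, tn = 0.35, 0.05, 0.15, 0.45
    prev, bias = tp + fn, tp + fp
    dtp = tp - prev * bias

    kappa = dtp / (dtp + (fp + fn) / 2)       # discriminant scaled by mean error
    p_o = tp + tn                             # observed agreement
    p_e = prev * bias + (1 - prev) * (1 - bias)    # expected agreement
    assert abs(kappa - (p_o - p_e) / (1 - p_e)) < 1e-12   # = Cohen's Kappa

    corr = dtp / (prev * (1 - prev) * bias * (1 - bias)) ** 0.5
    print(kappa, corr)                        # 0.600 vs 0.612 for this table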

Significance and Information Gain

The ability to calculate various probabilities from a contingency table says nothing about the significance of those numbers – is the effect real, or is it within the expected range of variation around the values expected by chance? Usually this is explored by considering deviation from the expected values (ETP and its relatives) implied by the marginal counts (RP, PP and relatives) – or from the expected rates implied by the biases (Class Prevalence and Label Bias). In the case of Machine Learning, Data Mining, or other artificially derived models and rules, there is the further question of whether the training and parameterization of the model has set the 'correct' or 'best' Prevalence and Bias (or Cost) levels. Furthermore, should this determination be undertaken by reference to the model evaluation measures (Recall, Precision, Informedness, Markedness and their derivatives), or should the model be set to maximize the significance of the results?

This raises the question of how our measures of association and accuracy – Informedness, Markedness and Correlation – relate to standard measures of significance.

This article has been written in the context of a prevailing methodology in Computational Linguistics and Information Retrieval that concentrates on the target positive cases and ignores the negative case, for the purpose of both measures of association and significance. A classic example is saying "water" can only be a noun because the system is inadequate to the task of Part of Speech identification, and this boosts Recall and hence F-factor – or at least setting the Bias to nouns close to 1, and the Inverse Bias to verbs close to 0. Of course, Bookmaker will then be 0 and Markedness unstable (undefined, and very sensitive to any words that do actually get labelled verbs). We would hope that significance would also be 0 (or


near zero, given only a relatively small number of verb labels). We would also like to be able to calculate significance based on the positive case alone, as either the full negative information is unavailable, or it is not labelled.

Generally when dealing with contingency tables it is assumed that unused labels or unrepresented classes are dropped from the table, with a corresponding reduction in degrees of freedom. For simplicity we have assumed that the margins are all non-zero, but the freedoms are there whether they are used or not, so we will not reduce them or reduce the table.

There are several schools of thought about significance testing, but all agree on the utility of calculating a p-value [19], by specifying some statistic or exact test T(X) and setting p = Prob(T(X) ≥ T(Data)). In our case, the Observed Data is summarized in a contingency table, and there are a number of tests which can be used to evaluate the significance of the contingency table.

For example, Fisher's exact test calculates the proportion of contingency tables that are at least as favourable to the Prediction/Marking hypothesis, rather than the null hypothesis, and provides an accurate estimate of the significance of the entire contingency table without any constraints on the values or distribution. The log-likelihood-based G² test and Pearson's approximating χ² test are compared against a Chi-Squared Distribution of appropriate degree of freedom (r=1 for the binary contingency table, given the marginal counts are known), depend on assumptions about the distribution, and may focus only on the Predicted Positives.

χ² captures the Total Squared Deviation relative to expectation, and is here calculated only in relation to positive predictions, as often only the overt prediction is considered and the implicit prediction of the negative case is ignored [17−19], noting that it is sufficient to count r=1 cells to determine the table and make a significance estimate. However, χ² is valid only for reasonably sized contingencies (one rule of thumb is that the expectation for the smallest cell is at least 5; the Yates and Williams corrections will be discussed in due course [18,19]):

χ²+P = (TP−ETP)²/ETP + (FP−EFP)²/EFP
     = DTP²/ETP + DFP²/EFP
     = 2DP²/EHP,  EHP = 2·ETP·EFP/[ETP+EFP]
     = 2N·dp²/ehp,  ehp = 2·etp·efp/[etp+efp]
     = 2N·dp²/[rh·pp] = N·dp²/PrevG²/Bias
     = N·B²·EvennessR/Bias
     ≈ (N+PN)·B²·EvennessR  (Bias → 1)   (30)

G² captures the Total Information Gain, being N times the Average Information Gain in nats, otherwise known as Mutual Information, which however is normally expressed in bits. We will discuss this separately under the General Case. We deal with G² for positive predictions in the case of small effect, that is dp close to zero, showing that G² is twice as sensitive as χ² in this range:

G²+P/2 = TP·ln(TP/ETP) + FP·ln(FP/EFP)
       = TP·ln(1+DTP/ETP) + FP·ln(1+DFP/EFP)
       ≈ TP·(DTP/ETP) + FP·(DFP/EFP)
       = 2N·dp²/ehp = 2N·dp²/[rh·pp]
       = N·dp²/PrevG²/Bias
       = N·B²·EvennessR/Bias
       ≈ (N+PN)·B²·EvennessR  (Bias → 1)   (31)

In fact χ² is notoriously unreliable for small N and small cell values, and G² is to be preferred. The Yates correction (applied only for cell values under 5) is to subtract 0.5 from the absolute dp value for that cell before squaring and completing the calculation [17−19].

Our result (30−31) shows that the χ² and G² significance of the Informedness effect increases with N as expected, but also with the square of Bookmaker, the Evenness of Prevalence (EvennessR = PrevG² = Prev·(1−Prev)) and the number of Predicted Negatives (viz. with Inverse Bias)! This is as expected. The more Informed the contingency regarding positives, the less data will be needed to reach significance. The more Biased the contingency towards positives, the less significant each positive is, and the more data is needed to ensure significance. The Bias-weighted average over all Predictions (here for the K=2 case: Positive and Negative) is simply K·N·B²·PrevG², which gives us an estimate of the significance without focussing on either case in particular:

χ²KB = 2N·dtp²/PrevG² = 2N·B²·EvennessR   (32)

Analogous formulae can be derived for the significance of the Markedness effect for positive real classes, noting that EvennessP = BiasG²:

χ²KM = 2N·dtp²/BiasG² = 2N·M²·EvennessP   (33)

The Geometric Mean of these two overall estimates for the full contingency table is:

χ²KBM = 2N·dtp²/[PrevG·BiasG]
      = 2N·ρ²·EvennessG = 2N·B·M·EvennessG   (34)

This is simply the total Sum of Squares Deviance (SSD) accounted for by the correlation coefficient BMG (22) over the N data points, discounted by the Global Evenness factor, being the squared Geometric Mean of all four Positive and Negative Bias and Prevalence terms (EvennessG = PrevG·BiasG).


The less even the Bias and Prevalence, the more data will be required to achieve significance, the maximum Evenness value of 0.25 being achieved with both even Bias and even Prevalence. Note that for even Bias or Prevalence, the corresponding positive and negative significance estimates match the global estimate.

When χ²+P or G²+P is calculated for a specific label in a dichotomous contingency table, it has one degree of freedom for the purposes of assessment of significance. The full table also has one degree of freedom, and summing for goodness of fit over only the positive prediction label will clearly lead to a lower χ² estimate than summing across the full table; and while summing for only the negative label will often give a similar result, it will in general be different. Thus the weighted arithmetic mean calculated by χ²KB is an expected value independent of the arbitrary choice of which predictive variate is investigated. This is used to see whether a hypothesized main effect (the alternate hypothesis, HA) is borne out by a significant difference from the usual distribution (the null hypothesis, H0). Summing over the entire table (rather than averaging over labels) is used for χ² or G² independence testing, independent of any specific alternate hypothesis [21], and can be expected to achieve a χ² estimate approximately twice that achieved by the above estimates, effectively cancelling out the Evenness term; it is thus far less conservative (viz. it is more likely to satisfy p < α):

χ²BM = N·rG² = N·ρ² = N·φ² = N·B·M   (35)

Note that this equates Pearson's Rho, ρ, with the Phi Correlation Coefficient, φ, which is defined in terms of the Inertia, φ² = χ²/N. We have now confirmed not only that a factor of N connects the full contingency G² to Mutual Information (MI), but also that it normalizes the full approximate χ² contingency to the Matthews/Pearson (= BMG = Phi) Correlation, at least for the dichotomous case. This tells us, moreover, that MI and Correlation are measuring essentially the same thing; but MI and Phi do not tell us anything about the direction of the correlation, whereas the sign of the Matthews or Pearson or BMG Correlation does (it is the Biases and Prevalences that are multiplied and square-rooted).

The individual or averaged goodness-of-fit estimates are in general much more conservative than full contingency table estimation of p by the Fisher Exact Test, but the full independence estimate can over-inflate the statistic, due to summation over more terms than there are degrees of freedom. The conservativeness has to do both with the distributional assumptions of the χ² and G² estimates, which are only asymptotically valid, and with the approximative nature of χ² in particular.

Also note that α bounds the probability of the null hypothesis, but 1−α is not a good estimate of the probability of any specific alternate hypothesis. Based on a Bayesian equal-probability prior for the null hypothesis (H0, e.g. B = M = 0 as the population effect) and an unspecific one-tailed alternate hypothesis (HA, e.g. the measured B and M as the true population effect), we can estimate new posterior probability estimates for Type I (H0 rejection, Alpha(p)) and Type II (HA rejection, Beta(p)) errors from the post-hoc likelihood estimation [22]:

L(p) = Alpha(p)/Beta(p) ≈ −e·p·log(p)   (36)
Alpha(p) = 1/[1 + 1/L(p)]   (37)
Beta(p) = 1/[1 + L(p)]   (38)

Confidence Intervals and Deviations

An alternative to significance estimation is confidence estimation, in the statistical rather than the data mining sense. We noted earlier that selecting the highest isocost line or maximizing AUC or Bookmaker Informedness, B, is equivalent to minimizing fpr + fnr = (1−B) or maximizing tpr + tnr = (1+B), which maximizes the sum of normalized squared deviations of B from chance, sseB = B² (as is seen geometrically from Fig. 1). Note that this contrasts with minimizing the sum of squared distances from the optimum, which minimizes the relative sum of squared normalized error of the aggregated contingency, sseB = fpr² + fnr². However, an alternate definition calculates the sum of squared deviation from optimum as a normalization of the square of the minimum distance to the isocost line of the contingency, sseB = (1−B)².

This approach contrasts with the approach of considering the error versus a specific null hypothesis representing the expectation from the margins. Normalization is to the range [0,1], like |B|, and normalizes (due to similar triangles) all orientations of the distance between isocosts (Fig. 1). With these estimates the relative error is constant, and the relative size of confidence intervals around the null and full hypotheses depends only on N, as |B| and |1−B| are already standardized measures of deviation from null or full correlation respectively (σ = 1). Note however that if the empirical value is 0 or 1, these measures admit no error versus no information or full information respectively. If the theoretical value is B=0, then a full ±1 error is possible, particularly in the discrete low-N case where it can be equilikely, and will be more likely than expected values that are fractional and thus likely to become zeros. If the theoretical value is B=1, then no variation is expected unless due to measurement error. Thus |1−B| reflects the maximum (low N) deviation in the absence of measurement error.

The standard Confidence Interval is defined in terms of the Standard Error, SE = √[SSE/(N(N−1))]

=√[sse/(N1)]. It is usual to use a multiplier X of null side and another on the full side (a around X=2 as, given the central limit theorem parameterized special case of the last that applies and the distribution can be regarded as corresponds to percentilebased usages like box normal, a multiplier of 1.96 corresponds to a plots, being more appropriate to distributions that confidence of 95% that the true mean lies in the cannot be assumed to be symmetric). specified interval around the estimated mean, viz. The √sse means may be weighted or unweighted the probability that the derived confidence interval and in particular a selfweighted arithmetic mean will bound the true mean is 0.95 and the test thus gives our recommended definition, √sse B1 =1 corresponds approximately to a significance test 2|B|+2B 2, whilst an unweighted geometric mean with alpha=0.05 as the probability of rejecting a gives √sse B1 =√[|B|B2] and an unweighted harmonic correct null hypothesis, or a power test with mean gives √sse B1 =|B|B2. All of these are beta=0.05 as the probability of rejecting a true full or symmetric, with the weighted arithmetic mean giving partial correlation hypothesis. A number of other a minimum of 0.5 at B=±0.5 and a maximum of 1 at distributions also approximate 95% confidence at both B=0 and B=±1, contrasting maximally with 2SE. sse B0 and sse B2 resp in these neighbourhoods, whilst We specifically reject the more traditional approach the unweighted harmonic and geometric means which assumes that both Prevalence and Bias are having their minimum of 0 at both B=0 and B=±1, fixed, defining margins which in turn define a acting like sse B0 and sse B2 resp in these specific chance case rather than an isocost line neighbourhoods (which there evidence zero representing all chance cases – we cannot assume variance around their assumed true values). The that any solution on an isocost line has greater error minimum at B=±0.5 for the geometric mean is 0.5 than any other since all are by definition equivalent. and for the harmonic mean, 0.25. The above approach is thus argued to be For this probabilistic |B| range, the weighted appropriate for Bookmaker and ROC statistics arithmetic mean is never less than the arithmetic which are based on the isocost concept, and mean and the geometric mean is never more than reflects the fact that most practical systems do not the arithmetic mean. These relations demonstrate in fact preset the Bias or match it to Prevalence, the complementary nature of the and indeed Prevalences in early trials may be quite weighted/arithmetic and unweighted geometric different from those in the field. means. The maxima at the extremes is arguably he specific estimate of sse that we present for more appropriate in relation to power as alpha, the probability of the current estimate for B intermediate results should calculate squared occurring if the true Informedness is B=0, deviations from a strictly intermediate expectation is√sse B0 =|1B|=1, which is appropriate for testing based on the theoretical distribution, and will thus the null hypothesis, and thus for defining be smaller on average if the theoretical hypothesis unconventional error bars on B=0. Conversely, holds, whilst providing emphasized differentiation √sse B2 =|B|=0, is appropriate for testing deviation when near the null or full hypothesis. 
The minima of from the full hypothesis in the absence of 0 at the extremes are not very appropriate in measurement error, whilst √sse B2 =|B|=1 relation to significance versus the null hypothesis conservatively allows for full range measurement due the expectation of a normal distribution, but its error, and thus defines unconventional error bars on power dual versus the full hypothesis is B=M=C=1. appropriately a minimum as perfect correlation admits no error distribution. Based on Monte Carlo In view of the fact that there is confusion between simulations, we have observed that setting the use of beta in relation to a specific full sse B1 =√sse B2 =1|B| as per the usual convention is dependency hypothesis, B=1 as we have just appropriately conservative on the upside but a little considered, and the conventional definition of an broad on the downside, whilst the weighted arbitrary and unspecific alternate contingent arithmetic mean, √sse B1 =12|B|+2B 2, is sufficiently hypothesis, B≠0, we designate the probability of conservative on the downside, but unnecessarily incorrectly excluding the full hypothesis by gamma, conservative for high B. and propose three possible related kinds of correction for the √sse for beta: some kind of mean Note that these twotailed ranges are valid for of |B| and |1B| (the unweighted arithmetic mean is Bookmaker Informedness and Markedness that can 1/2, the geometric mean is less conservative and go positive or negative, but a one tailed test would the harmonic mean least conservative), the be appropriate for unsigned statistics or where a maximum or minimum (actually a special case of particular direction of prediction is assumed as we the last, the maximum being conservative and the have for our contingency tables. In these cases a minimum too low an underestimate in general), or smaller multiplier of 1.65 would suffice, however the an asymmetric interval that has one value on the convention is to use the overlapping of the
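As a minimal, hedged illustration of the post-hoc estimates (36)–(38), the following Matlab sketch converts a conventional p-value into the posterior error estimates described above (the variable names are ours, not the paper's):

    % Sketch of the post-hoc posterior error estimates (36)-(38).
    % log() is the natural logarithm; the bound is meaningful for p < 1/e.
    p = 0.01;                      % an illustrative p-value
    L = -exp(1) * p * log(p);      % (36): L(p) ~ -e.p.log(p)
    alpha_p = 1 / (1 + 1/L);       % (37): posterior Type I error estimate
    beta_p  = 1 / (1 + L);         % (38): posterior Type II error estimate

Note that Alpha(p)/Beta(p) = L(p) by construction, so (37) and (38) are just the two complementary normalizations of the same likelihood ratio.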


Table 2. Binary contingency tables. Colour coding highlights example counts of correct (green) and incorrect (pink) decisions, with the resulting Bookmaker Informedness (B=WRacc=DeltaP'), Markedness (M=DeltaP), Matthews Correlation (C), Recall, Precision, Rand Accuracy, Harmonic Mean of Recall and Precision (F=F1), Geometric Mean of Recall and Precision (G), Cohen Kappa (κ), and χ² calculated using Bookmaker (χ²+P), Markedness (χ²+R) and standard (χ²) methods across the positive prediction or condition only, as well as calculated across the entire K=2 class contingency (χ²KB, χ²KM, χ²KBM), all of which are designed to be referenced to alpha (α) according to the χ² distribution, with the latter more reliable due to taking into account all contingencies. The single-tailed threshold is shown for α=0.05.

             68.0%   32.0%          χ²@α=0.05  3.85
    76.0%     56      20     76     B 19.85%   Recall    82.35%   F 77.78%   χ²+P 1.13   χ²KB  1.72
    24.0%     12      12     24     M 23.68%   Precision 73.68%   G 77.90%   χ²+R 1.61   χ²KM  2.05
              68      32    100     C 21.68%   Rand Acc  68.00%   κ 21.26%   χ²   1.13   χ²KBM 1.87
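The first example can be reproduced directly from its raw counts. The following Matlab sketch (our own illustration, applying the χ²K forms at K=2 as used in the table) recovers the tabulated values:

    % Reproduce the first contingency table in Table 2 from raw counts.
    % Rows are predictions (+,-); columns are real classes (+,-).
    tp = 56; fp = 20; fn = 12; tn = 12;  N = tp+fp+fn+tn;
    rp = tp+fn; rn = fp+tn;              % real margins: N*Prev, N*IPrev
    pp = tp+fp; pn = fn+tn;              % predicted margins: N*Bias, N*IBias
    Recall = tp/rp; Precision = tp/pp;   % 0.8235, 0.7368
    B = tp/rp - fp/rn;                   % Informedness (DeltaP'): 0.1985
    M = tp/pp - fn/pn;                   % Markedness (DeltaP):    0.2368
    C = sign(B)*sqrt(abs(B*M));          % BMG/Matthews Correlation: 0.2168
    F = 2*Recall*Precision/(Recall+Precision);  % F1: 0.7778
    G = sqrt(Recall*Precision);                 % 0.7790
    racc  = (tp+tn)/N;                          % Rand Accuracy: 0.68
    eacc  = (rp*pp + rn*pn)/N^2;                % chance-level accuracy
    kappa = (racc-eacc)/(1-eacc);               % Cohen Kappa: 0.2126
    K = 2; evR = (rp/N)*(rn/N); evP = (pp/N)*(pn/N); evG = sqrt(evR*evP);
    chi2KB  = K*N*B^2*evR;               % 1.72, vs the tabulated threshold
    chi2KM  = K*N*M^2*evP;               % 2.05
    chi2KBM = K*N*B*M*evG;               % 1.87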

             60.0%   40.0%          χ²@α=0.05  3.85
    42.0%     30      12     42     B 20.00%   Recall    50.00%   F 58.82%   χ²+P 2.29   χ²KB  1.92
    58.0%     30      28     58     M 19.70%   Precision 71.43%   G 59.76%   χ²+R 2.22   χ²KM  1.89
              60      40    100     C 19.85%   Rand Acc  58.00%   κ 18.60%   χ²   2.29   χ²KBM 1.91

Thus for any two hypotheses (including the null hypothesis, or one from a different contingency table or other experiment deriving from a different theory or system), the traditional approach of checking that 1.96SE (or 2SE) error bars don't overlap is rather conservative (it is enough for the value to be outside the range for a two-sided test), whilst checking overlap of 1SE error bars is usually insufficiently conservative, given that the upper bar represents beta.

Bookmaker Informedness has a close relationship [10, 15] with ROC AUC. A system that makes an informed (correct) decision for a target condition with probability B, and guesses the remainder of the time, will exhibit a Bookmaker Informedness (DeltaP') of B and a Recall of B·(1−Prev) + Bias. Conversely, a proposed marker which is marked (correctly) for a target condition with probability M, and marked according to chance the remainder of the time, will exhibit a Markedness (DeltaP) of M and a Precision of M·(1−Bias) + Prev. Precision and Recall are thus biased by Prevalence and Bias, and variation of system parameters can make them rise or fall independently of the underlying Informedness.
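The Recall relation just stated is easy to check by simulation; this Matlab sketch (ours, with illustrative parameter names) draws a system that is informed with probability B and guesses at a fixed rate otherwise:

    % Check Recall = B*(1-Prev) + Bias for an informed-or-guessing system.
    N = 1e6; B = 0.3; Prev = 0.4; guessrate = 0.5;
    real     = rand(N,1) < Prev;           % real labels
    informed = rand(N,1) < B;              % informed decisions are correct
    guess    = rand(N,1) < guessrate;      % otherwise guess at a fixed rate
    pred     = (informed & real) | (~informed & guess);
    Bias   = mean(pred);                   % empirical Bias
    Recall = mean(pred & real)/mean(real); % ~ B*(1-Prev) + Bias
    check  = B*(1-Prev) + Bias;            % should match Recall closely

Algebraically, Recall = B + (1−B)·guessrate and Bias = B·Prev + (1−B)·guessrate, so Recall − Bias = B·(1−Prev) as claimed.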

(Marking) problems show higher χ² significance, approaching the 0.05 level in some instances (and far exceeding it for the Inverse Dual). The KB variant gives a single conservative significance level for the entire table, sensitive only to the direction of the proposed implication, and is thus to be preferred over the standard versions that depend on the choice of condition.

Incidentally, the Fisher Exact Test shows significance at the 0.05 level for both the examples in Table 2. This corresponds to an assumption of a hypergeometric distribution rather than normality – viz. all assignments of events to cells are assumed to be equally likely given the marginal constraints (Bias and Prevalence). However it is inappropriate given that the Bias and Prevalence are not specified by the experimenter in advance of the experiment, as is assumed by the conditions of this test. This has also been demonstrated empirically through Monte Carlo simulation, as discussed later. See [22] for a comprehensive discussion of issues with significance testing, as well as Monte Carlo simulations.

PRACTICAL CONSIDERATIONS
If we have a fixed-size dataset, then it is arguably sufficient to maximize the determinant of the unnormalized contingency matrix, DT. However this is not comparable across datasets of different sizes, and we thus need to normalize for N, and hence consider the determinant of the normalized contingency matrix, dt. However, this value is still influenced by both Bias and Prevalence.

In the case where two evaluators or systems are being compared with no a priori preference, the Correlation gives the correct normalization by their respective Biases, and is to be preferred to Kappa. In the case where an unimpeachable Gold Standard is employed for evaluation of a system, the appropriate normalization is for Prevalence, or the Evenness of the real gold-standard values, giving Informedness. Since this is constant, optimizing Informedness and optimizing dt are equivalent. More generally, we can look not only at which proposed solution best solves a problem, by comparing Informedness, but at which problem is most usefully solved by a proposed system. In a medical context, for example, it is usual to come up with potentially useful medications or tests, and then explore their effectiveness across a wide range of complaints. In this case Markedness may be appropriate for the comparison of performance across different conditions.

Recall and Informedness, as biased and unbiased variants of the same measure, are appropriate for testing effectiveness relative to a set of conditions, and the importance of Recall is increasingly being recognized as having an important role in matching human performance, for example in Word Alignment for Machine Translation [1]. Precision and Markedness, as biased and unbiased variants of the same measure, are appropriate for testing effectiveness relative to a set of predictions. This is particularly appropriate where we do not have an appropriate gold standard giving correct labels for every case, and is the primary measure used in Information Retrieval for this reason, as we cannot know the full set of relevant documents for a query and thus cannot calculate Recall.

However, in this latter case of an incompletely characterized test set, we do not have a fully specified contingency matrix and cannot apply any of the other measures we have introduced. Rather, whether for Information Retrieval or Medical Trials, it is assumed that a test set is developed in which all real labels are reliably (but not necessarily perfectly) assigned. Note that in some domains labels are assigned reflecting different levels of assurance, but this has led to further confusion in relation to possible measures and the effectiveness of the techniques evaluated [1]. In Information Retrieval, the labelling of a subset of relevant documents selected by an initial collection of systems can lead to relevant documents being labelled as irrelevant because they were missed by the first-generation systems – so, for example, systems are actually penalized for improvements that lead to the discovery of relevant documents that do not contain all specified query words. Thus here too it is important to develop test sets that are of appropriate size, fully labelled, and appropriate for the correct application of both Informedness and Markedness as unbiased versions of Recall and Precision.

This Information Retrieval paradigm indeed provides a good example for the understanding of the Informedness and Markedness measures. Not only can documents retrieved be assessed in terms of prediction of relevance labels for a query using Informedness, but queries can be assessed in terms of their appropriateness for the desired documents using Markedness, and the different kinds of search tasks can be evaluated with the combination of the two measures. The standard Information Retrieval mantra that we do not need to find all relevant documents (so that Recall or Informedness is not so relevant) applies only where there are huge numbers of documents containing the required information and a small number can be expected to provide that information with confidence. However, another kind of Document Retrieval task involves a specific and rather small set of documents for which we need to be confident that all or most of them have been found (and so Recall or Informedness are especially relevant).


This is quite typical of literature review in a specialized area, and may be complicated by new developments being presented in quite different forms by researchers who are coming at it from quite different directions, if not different disciplinary backgrounds.

THE GENERAL CASE
So far we have examined only the binary case with dichotomous Positive versus Negative classes and labels. It is beyond the scope of this article to consider the continuous or multi-valued cases, although the Matthews Correlation is a discretization of the Pearson Correlation with its continuous-valued assumption, the Spearman Rank Correlation is an alternate form applicable to arbitrary discrete-valued (Likert) scales, and the Tetrachoric Correlation is available to estimate the correlation of an underlying continuous scale [11]. If continuous measures corresponding to Informedness and Markedness are required due to the canonical nature of one of the scales, the corresponding Regression Coefficients are available. It is, however, useful in concluding this article to consider briefly the generalization to the multiclass case, and we will assume that both real classes and predicted classes are categorized with K labels, and again we will assume that each class is non-empty unless explicitly allowed (this is because Precision is ill-defined where there are no predictions of a label, and Recall is ill-defined where there are no members of a class).

Generalization of Association
Powers [4] derives Bookmaker Informedness (41) analogously to Mutual Information and Conditional Entropy (39-40), as a pointwise average across the contingency cells, expressed in terms of label probabilities PP(l), where PP(l) is the probability of Prediction l, and label-conditioned class probabilities PR(c|l), where PR(c|l) is the probability that the Prediction labelled l is actually of Real class c, so that in particular PR(l|l) = Precision(l). We use delta functions as mathematical shorthands for Boolean expressions interpreted algorithmically as in C, with true expressions taking the value 1 and false expressions 0, so that δ|c−l| ≡ (c = l) represents a Dirac measure (limit as δ→0), and ∂|c−l| ≡ (c ≠ l) represents its logical complement (1 if c ≠ l and 0 if c = l):

    MI(R||P) = ∑l PP(l) ∑c PR(c|l) · log[PR(c|l)/PR(c)]    (39)
    H(R|P) = ∑l PP(l) ∑c PR(c|l) · [−log PR(c|l)]    (40)
    B(R|P) = ∑l PP(l) ∑c PR(c|l) · [PP(l)/(PR(l) − ∂|c−l|)]    (41)

We now define a binary dichotomy for each label l, with l and the corresponding c as the Positive case (and all other labels/classes grouped as the Negative case). We next denote its Prevalence by Prev(l) and its dichotomous Bookmaker Informedness by B(l), and so can simplify (41) to

    B(R|P) = ∑l Prev(l) · B(l)    (42)

Analogously we define the dichotomous Bias(c) and Markedness M(c) and derive

    M(P|R) = ∑c Bias(c) · M(c)    (43)

These formulations remain consistent with the definition of Informedness as the probability of an informed decision versus chance, and of Markedness as its dual. The Geometric Mean of multiclass Informedness and Markedness would appear to give us a new definition of Correlation, whose square provides a well-defined Coefficient of Determination. Recall that the dichotomous forms of Markedness (20) and Informedness (21) have the determinant of the contingency matrix as common numerator, and have denominators that relate only to the margins, to Bias and Prevalence respectively. Correlation, Markedness and Informedness are thus equal when Prevalence = Bias. The dichotomous Correlation Coefficient would thus appear to have three factors: a common factor across Markedness and Informedness representing their conditional dependence, and factors representing the Evenness of Bias (cancelled in Markedness) and the Evenness of Prevalence (cancelled in Informedness), each representing a marginal independence.

In fact, Bookmaker Informedness can be driven arbitrarily close to 0 whilst Markedness is driven arbitrarily close to 1, demonstrating their independence – in this case Recall and Precision will be driven to, or close to, 1. The 'arbitrarily close' hedge relates to our assumption that all predicted and real classes are non-empty, although appropriate limits could be defined to deal with the divide-by-zero problems associated with these extreme cases. Technically, Informedness and Markedness are conditionally independent – once the determinant numerator is fixed, their values depend only on their respective marginal denominators, which can vary independently. To the extent that they are independent, the Coefficient of Determination acts as the joint probability of mutual determination; to the extent that they are dependent, the Correlation Coefficient itself acts as the joint probability of mutual determination. These conditions carry over to the definition of Correlation in the multiclass case as the Geometric Mean of Markedness and Informedness – once all numerators are fixed, the denominators demonstrate marginal independence.
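A direct Matlab sketch of (42)–(43) treats each label as a one-vs-rest dichotomy and weights the dichotomous values by Prevalence or Bias. The function name and layout (rows = predictions, columns = real classes) are our own conventions, not the paper's released scripts:

    % Multiclass Bookmaker Informedness (42) and Markedness (43).
    function [B, M] = bookmaker(X)       % X(i,j): count of prediction i, real j
      N = sum(X(:));
      prev = sum(X,1)/N;                 % column margins: Prev(l)
      bias = sum(X,2)'/N;                % row margins: Bias(c)
      K = size(X,1); Bl = zeros(1,K); Ml = zeros(1,K);
      for l = 1:K
        tp = X(l,l);
        fp = sum(X(l,:)) - tp;           % predicted l but not real l
        fn = sum(X(:,l)) - tp;           % real l but not predicted l
        tn = N - tp - fp - fn;
        Bl(l) = tp/(tp+fn) - fp/(fp+tn); % dichotomous Informedness B(l)
        Ml(l) = tp/(tp+fp) - fn/(fn+tn); % dichotomous Markedness M(l)
      end
      B = sum(prev .* Bl);               % eq (42)
      M = sum(bias .* Ml);               % eq (43)
    end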


We now reformulate the Informedness and Markedness measures in terms of the Determinant of the Contingency and Evenness, generalizing (20-22). In particular, we note that the definition of Evenness in terms of the Geometric Mean or product of Biases or Prevalences is consistent with the formulation in terms of the determinants DET and det (generalizing dichotomous DP=DTP and dp=dtp) and their geometric interpretation as the area of a parallelogram in PN-space, together with its normalization to ROC-space by the product of Prevalences, giving Informedness, or conversely its normalization to Markedness by the product of Biases. The generalization of DET to a volume in high-dimensional PN-space, and of det to its normalization by the product of Prevalences or Biases, is sufficient to guarantee generalization of (20-22) to K classes by reducing from K-D to SSD, so that BMG has the form of a coefficient of proportionality of variance:

    M ≈ [det / BiasG^K]^(2/K) = det^(2/K) / EvennessP+    (44)
    B ≈ [det / PrevG^K]^(2/K) = det^(2/K) / EvennessR+    (45)
    BMG ≈ det^(2/K) / [PrevG·BiasG] = det^(2/K) / EvennessG+    (46)

We have marked the Evenness terms in these equations with a trailing plus to distinguish them from other usages, and their definitions are clear from comparison of the denominators. Note that the Evenness terms for the generalized regressions (44-45) are not Arithmetic Means but have the form of (squared) Geometric Means. Furthermore, the dichotomous case emerges for K=2 as expected. Empirically (Fig. 3), this generalization matches well near B=0 or B=1, but fares less well in between the extremes, suggesting a mismatched exponent in the heuristic conversion of K dimensions to 2.

Here we set up the Monte Carlo simulation as follows: we define the diagonal of a random perfect-performance contingency table with expected N entries using a random uniform distribution; we define a random chance-level contingency table, setting margins independently using a random binormal distribution and then distributing randomly across cells around their expected values; we combine the two (perfect and chance) random contingency tables with respective weights I and (1−I); and finally we increment or decrement cells randomly to achieve cardinality N, which is the expected number but is not constrained by the process for generating the random (perfect and chance) matrices. This procedure was used to ensure that the Informedness and Markedness estimates retain a level of independence; otherwise they tend to correlate very highly, with overly uniform margins for higher K and lower N (conditional independence is lost once the margins are specified), and in particular Informedness, Markedness, Correlation and Kappa would always agree perfectly for either I=1 or perfectly uniform margins. Note this use of Informedness to define a target probability of an informed decision, followed by random inclusion or deletion of cases when there is a mismatch versus the expected number of instances N – the preset Informedness level is thus not a fixed preset Informedness but a target level that permits jitter around that level, and in particular will be an overestimate for the step I=1 (no negative counts possible), which can be detected by excess deviation beyond the set Confidence Intervals for high Informedness steps.

Fig. 3 – Determinant-based estimates of correlation. 110 Monte Carlo simulations with 11 stepped expected Informedness levels (red line), with Bookmaker-estimated Informedness (red dots), Markedness (green dots) and Correlation (blue dots); significance (p+1) calculated using G², χ², and Fisher estimates; and Correlation estimates calculated from the Determinant of Contingency using two different exponents, 2/K (DB & DM) and 1/[3K−2] (DBa & DMa). The difference between the estimates is also shown. Here K=4, N=128, X=1.96, α=β=0.05.
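The determinant-based estimates (44)–(46) can be sketched as follows; the 3-class table is an arbitrary illustration of ours, and, as noted above, the 2/K exponent is only approximate away from the extremes (bookmaker() is the earlier sketch):

    % Determinant-based approximations (44)-(46) versus direct computation.
    X = [40 5 5; 4 30 6; 6 5 27];        % arbitrary 3-class example, N = 128
    N = sum(X(:)); K = size(X,1);
    dt = det(X/N);                       % normalized contingency determinant
    PrevG = prod(sum(X,1)/N)^(1/K);      % Geometric Mean of Prevalences
    BiasG = prod(sum(X,2)/N)^(1/K);      % Geometric Mean of Biases
    Mhat = dt^(2/K) / BiasG^2;           % (44): EvennessP+ = BiasG^2
    Bhat = dt^(2/K) / PrevG^2;           % (45): EvennessR+ = PrevG^2
    Chat = dt^(2/K) / (PrevG*BiasG);     % (46)
    [B, M] = bookmaker(X);               % exact values, for comparison

For this table the direct B is around 0.64 while Bhat is around 0.54, illustrating the mid-range mismatch the text describes.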

In Fig. 3 we therefore show and compare an alternate exponent of 1/(3K−2) rather than the exponent of 2/K shown in (44-45). This also reduces to 1, and hence the expected exact correspondence, for K=2. This suggests that what is important is not just the number of dimensions but also the number of marginal degrees of freedom, K+2(K−1); but although it matches well for high degrees of association, it shows similar error at low informedness. The precise relationship between a Determinant and Correlation, Informedness and Markedness for the general case remains a matter for further investigation. We however continue with the use of the approximation based on 2/K.

Generalization of Significance
In relation to Significance, the single-class χ²+P and G²+P definitions can both be formulated in terms of cell counts and a function of ratios, and would normally be summed over at least (K−1)² cells of a K-class contingency table, with (K−1)² degrees of freedom, to produce a statistic for the table as a whole. However, these statistics are not independent of which variables are selected for evaluation or summation, and the p-values obtained are thus quite misleading, and for highly skewed distributions (in terms of Bias or Prevalence) can be outlandishly incorrect. If we sum log-likelihood (31) over all K² cells, we get N·MI(R||P), which is invariant over Inverses and Duals.

The analogous Prevalence-weighted multiclass statistic generalized from the Bookmaker Informedness form of the Significance statistic, and the Bias-weighted statistic generalized from the Markedness form, extend Eqns 32-34 to the K>2 case by probability-weighted summation (this is a weighted Arithmetic Mean of the individual cases, targeted to r=K−1 degrees of freedom):

    χ²KB = K·N·B²·EvennessR−    (47)
    χ²KM = K·N·M²·EvennessP−    (48)
    χ²KBM = K·N·B·M·EvennessG−    (49)

For K=2 and r=1, the Evenness terms were the product of two complementary Prevalence or Bias terms in both the Bookmaker derivations and the Significance derivations, and (30) derived a single multiplicative Evenness factor from a squared Evenness factor in the numerator, deriving from dtp², and a single Evenness factor in the denominator. We will discuss both these Evenness terms in a later section. We have marked the Evenness terms in (47-49) with a trailing minus to distinguish them from the forms used in (20-22, 44-46).

The EvennessR (Prev·IPrev) concept corresponds to the concept of Odds (IPrev/Prev), where Prev+IPrev=1, and Powers [4] shows that (multiclass) Bookmaker Informedness corresponds to the expected return per bet made with a fair Bookmaker (hence the name). From the perspective of a given bet (prediction), the return increases as the probability of winning decreases, which means that an increase in the number of other winners can increase the return for a bet on a given horse (predicting a particular class) through changing the Prevalences, and thus EvennessR and the Odds. The overall return can thus increase irrespective of the success of bets in relation to those new wins. In practice, we normally assume that we are making our predictions on the basis of fixed (but not necessarily known) Prevalences, which may be estimated a priori (from past data) or post hoc (from the experimental data itself), and which for our purposes are assumed to be estimated from the contingency table.
One specific issue with the goodness-of-fit approach applied to K-class contingency tables relates to the up to (K−1)² degrees of freedom, which we focus on now. The assumption of independence of the counts in (K−1)² of the cells is appropriate for testing the null hypothesis, H0, and the calculation versus alpha, but is patently not the case when the cells are generated by K condition variables and K prediction variables that mirror them. Thus a correction is in order for the calculation of beta for some specific alternate hypothesis HA, or to examine the significance of the difference between two specific hypotheses HA and HB which may have some lesser degree of difference.

Whilst many corrections are possible, in this case correcting the degrees of freedom directly seems appropriate, and whilst using r = (K−1)² degrees of freedom is appropriate for alpha, using r = K−1 degrees of freedom is suggested for beta under the conditions where significance is worth testing, given that the association (mirroring) between the variables is almost complete. Beta is here tested against as a threshold on the probability that a specific alternate hypothesis of the tested association being valid should be rejected. The difference in a χ² statistic between two systems (r = K−1) can thus be tested for significance as part of comparing two systems (the Correlation-based statistics are recommended in this case). The approach can also compare a system against a model with specified Informedness (or Markedness). Two special cases are relevant here: H0, the null hypothesis corresponding to null Informedness (B=0: testing alpha with r=(K−1)²), and H1, the full hypothesis corresponding to full Informedness (B=1: testing beta with r=K−1). Equations 47-49 are proposed for interpretation under r=K−1 degrees of freedom (plus noise) and are hypothesized to be more accurate for investigating the probability of the alternate hypothesis in question, HA (beta).
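A Matlab sketch of (47)–(49) follows, taking the Evenness− terms as Arithmetic Means of the dichotomous Evenness values (as the sketch proof below suggests); bookmaker() is the earlier sketch, and the table is our illustration:

    % Generalized K-class significance statistics (47)-(49), r = K-1.
    X = [40 5 5; 4 30 6; 6 5 27];  N = sum(X(:));  K = size(X,1);
    prev = sum(X,1)/N;  bias = sum(X,2)'/N;
    [B, M] = bookmaker(X);
    evR = mean(prev.*(1-prev));          % EvennessR- (Arithmetic Mean)
    evP = mean(bias.*(1-bias));          % EvennessP-
    evG = sqrt(evR*evP);                 % EvennessG-
    chi2KB  = K*N*B^2*evR;               % (47)
    chi2KM  = K*N*M^2*evP;               % (48)
    chi2KBM = K*N*B*M*evG;               % (49)
    % p-value against K-1 d.o.f. (Statistics Toolbox, if available):
    % p = 1 - chi2cdf(chi2KB, K-1);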


Equations 50-52 are derived by summing over the (K−1) complements of each class and label before applying the Prevalence- or Bias-weighted sum across all predictions and conditions. These measures are thus applicable for interpretation under r = (K−1)² degrees of freedom (plus biases) and are theoretically more accurate for estimating the probability of the null hypothesis H0 (alpha). In practice, the difference should always be slight (as the cumulative density function of the gamma distribution χ² is locally near-linear in r – see Fig. 4), reflecting the usual assumption that alpha and beta may be calculated from the same distribution. Note that there is no difference in either the formulae or r when K=2.

    χ²XB = K(K−1)·N·B²·EvennessR−    (50)
    χ²XM = K(K−1)·N·M²·EvennessP−    (51)
    χ²XBM = K(K−1)·N·B·M·EvennessG−    (52)

Equations 53-55 are applicable to naïve unweighted summation over the entire contingency table, but also correspond to the independence test with r = (K−1)² degrees of freedom, as well as slightly underestimating but asymptotically approximating the case where Evenness is maximal in (50-52), at 1/K². When the contingency table is uneven, the Evenness factors will be lower and a more conservative p-value will result from (50-52), whilst summing naively across all cells (53-55) can lead to inflated statistics and underestimated p-values. However, they are the equations that correspond to common usage of the χ² and G² statistics, as well as giving rise implicitly to Cramer's V = [χ²/N(K−1)]^(1/2) as the corresponding estimate of the Pearson correlation coefficient, ρ, so that Cramer's V is thus also likely to be inflated as an estimate of association where Evenness is low. We however note these, consistent with the usual conventions, as our definitions of the conventional forms of the χ² statistics applied to the multiclass generalizations of the Bookmaker accuracy/association measures:

    χ²B = (K−1)·N·B²    (53)
    χ²M = (K−1)·N·M²    (54)
    χ²BM = (K−1)·N·B·M    (55)

Note that Cramer's V calculated from standard full-contingency χ² and G² estimates tends to vastly overestimate the level of association as measured by Bookmaker and Markedness or constructed empirically. It is also important to note that the full-matrix significance estimates (and hence Cramer's V and similar estimates from these χ² statistics) are independent of the permutations of predicted labels (or real classes) assigned to the contingency tables, and that in order to give such an independent estimate using the above family of Bookmaker significance statistics, which vary according to the assignment of labels, it is essential that the optimal assignment of labels is made – perverse solutions with suboptimal allocations of labels will underestimate the significance of the contingency table, as they clearly do not take into account what one is trying to demonstrate and how well we are achieving that goal.

The empirical observation concerning Cramer's V suggests that the strict probabilistic interpretation of the multiclass generalized Informedness and Markedness measures (the probability of an informed or marked decision) is not reflected by the traditional correlation measures, the squared correlation being a coefficient of proportionate determination of variance, and that outside of the 2-D case, where they match up with BMG, we do not know how to interpret them as a probability. However, we also note that Informedness and Markedness tend to correlate and are at most conditionally independent (given any one cell, e.g. given tp), so that their product cannot necessarily be interpreted as a joint probability (they are conditionally dependent given a margin, viz. prevalence rp or bias pp: specifying one of B or M then constrains the other; setting bias=prevalence, as a common heuristic learning constraint, maximizes correlation at BMG=B=M).

We note further that we have not considered the tetrachoric correlation, which estimates the regression of assumed underlying continuous variables to allow calculation of their Pearson Correlation.
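For comparison, this sketch computes the conventional full-table χ² and the Cramer's V it implies, alongside the Bookmaker-based values; per the discussion above, V can be expected to diverge from B, M and BMG when Evenness is low (the table is again our illustration):

    % Standard full-table chi-squared and Cramer's V versus Bookmaker.
    X = [40 5 5; 4 30 6; 6 5 27];  N = sum(X(:));  K = size(X,1);
    E = sum(X,2)*sum(X,1)/N;             % expected counts under independence
    chi2 = sum(sum((X-E).^2 ./ E));      % conventional chi-squared
    V = sqrt(chi2/(N*(K-1)));            % Cramer's V estimate of correlation
    [B, M] = bookmaker(X);
    BMG = sign(B)*sqrt(abs(B*M));        % compare V against BMG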


Sketch Proof of General Chi-squared Test
The traditional χ² statistic sums over a number of terms specified by r degrees of freedom, stopping once dependency emerges. The G² statistic derives from a log-likelihood analysis, which is also approximated, but less reliably, by the χ² statistic. In both cases, the variates are assumed to be asymptotically normal and are expected to be normalized to mean 0 and standard deviation σ=1, and both the Pearson and Matthews correlations and the χ² and G² significance statistics implicitly perform such a normalization. However, this leads to the question of which term is in focus if we sum over r rather than K² terms. In the binary dichotomous case, it makes sense to sum over only the condition of primary focus, but in the general case it involves leaving out one case (label and class). By the Central Limit Theorem, summing over (K−1)² such independent z-scores gives us a normal distribution with σ=(K−1).

We define a single-case χ²+lP from the χ²+P (30) calculated for label l = class c as the positive dichotomous case. We next sum over these for all cases other than our target l, to get a (K−1)² degree of freedom estimate χ²−lXP given by

    χ²−lXP = ∑c≠l χ²+cP = ∑c χ²+cP − χ²+lP    (56)

We then perform a Bias(l)-weighted sum over χ²−lXP to achieve our label-independent (K−1)² degree of freedom estimate χ²XB as follows (substituting from equation 30 and then 39):

    χ²XB = ∑l Bias(l) · [N·B²·EvennessR(l)/Bias(l) − χ²+lP]
         = K·χ²KB − χ²KB = (K−1)·χ²KB = K(K−1)·N·B²·EvennessR−    (57)

This proves the Informedness form of the generalized (K−1)² degree of freedom χ² statistic (50), and defines EvennessR− as the Arithmetic Mean of the individual dichotomous EvennessR(l) terms (assuming B is constant). The Markedness form of the statistic (51) follows by an analogous (Dual) argument, and the Correlation form (52) is simply the Geometric Mean of these two forms. Note however that this proof assumes that B is constant across all labels, and that assuming instead that the determinant det is constant leads to a derivative of (20-21) involving a Harmonic Mean of Evenness, as discussed in the next section.

The simplified (K−1) degree of freedom χ²K statistics were motivated as weighted averages of the dichotomous statistics, but can also be seen to approximate the χ²X statistics, given the observation that for a rejection threshold on the null hypothesis H0 of alpha<0.05, the χ² cumulative isodensity lines are locally linear in r (Fig. 4). Testing differences within a beta threshold, as discussed above, is appropriate using the χ²K series of statistics, since they are postulated to have (K−1) degrees of freedom. Alternately they may be tested according to the χ²X series of statistics, given they are postulated to differ in (K−1)² degrees of freedom, namely the noise, artefact and error terms that make the cells differ between the two hypotheses (viz. that contribute to decorrelation). In practice, when used to test two systems or models other than the null, the models should be in a sufficiently linear part of the isodensity contour to be insensitive to the choice of statistic and the assumptions about degrees of freedom. When tested against the null model, a relatively constant error term can be expected to be introduced by using the lower degree of freedom model.

The error introduced by the Cramer's V (K−1 degree of freedom) approximation to significance from G² or χ² can be viewed in two ways. If we start with a G² or χ² estimate, as intended by Cramer, we can test the accuracy of the estimate versus the true correlation, markedness and informedness, as illustrated in Fig. 3. Note that we can see there that Cramer's V underestimates association for high levels of informedness, whilst it is reasonably accurate for lower levels.

Fig. 5 – Illustration of significance and Cramer's V. 110 Monte Carlo simulations with 11 stepped expected Informedness levels (red line) with Bookmaker-estimated Informedness (red dots), Markedness (green dots) and Correlation (blue dots), with significance (p+1) calculated using G², χ², and Fisher estimates, and (skewed) Cramer's V Correlation estimates calculated from both G² and χ². Here K=4, N=128, X=1.96, α=β=0.05.


If we use (53) to (55) to estimate significance from the empirical association measures, we will thus underestimate significance under conditions of high association – viz. the test is more conservative as the magnitude of the effect increases.

Generalization of Evenness
The proof that the product of dichotomous Evenness factors is the appropriate generalization in relation to the multiclass definition of Bookmaker Informedness and Markedness does not imply that it is an appropriate generalization of the dichotomous usage of Evenness in relation to Significance, and we have seen that the Arithmetic rather than the Geometric Mean emerged in the above sketch proof. Whilst in general one would assume that Arithmetic and Harmonic Means approximate the Geometric Mean, we argue that the latter is the more appropriate basis, and indeed one may note that it not only approximates the Geometric Mean of the other two means, but is much more stable, as the Arithmetic and Harmonic Means can diverge radically from it in very uneven situations, and increasingly so with higher dimensionality. On the other hand, the Arithmetic Mean is insensitive to evenness, and is thus appropriate as a baseline in determining evenness. Thus the ratios between the means, as well as between the Geometric Mean and the geometric mean of the Arithmetic and Harmonic Means, give rise to good measures of evenness.

On geometric grounds we introduced the Determinant of Correlation, det, generalizing dp and representing the volume of possible deviations from chance covered by the target system and its perversions, showing that its normalization to an Informedness-like statistic is by EvennessR+, the product of the Prevalences (and is exactly Informedness for K=2). This gives rise to an alternative dichotomous formulation for the aggregate false positive error for an individual case in terms of the K−1 negative cases, using a ratio of submatrix determinant to submatrix product of Prevalences. This can be extended to all K cases, while reflecting K−1 degrees of freedom, by extending to the full contingency matrix determinant, det, and the full product of Prevalences, as our definition of another form of Evenness, EvennessR#, being the Harmonic Mean of the dichotomous Evenness terms for constant determinant:

    χ²KB = K·N·det^(2/K) / EvennessR#    (58)
    χ²KM = K·N·det^(2/K) / EvennessP#    (59)
    χ²KBM = K·N·det^(2/K) / EvennessG#    (60)

Recall that the + form of Evenness is exemplified by

    EvennessR+ = [Πl Prev(l)]^(2/K) = PrevG²    (61)

and that the relationship between the three forms of Evenness is of the form

    EvennessR− = EvennessR+ / EvennessR#    (62)

where the + form is defined as the squared Geometric Mean (44-46), again suggesting that the − form is best approximated as an Arithmetic Mean (47-49). The above division by the Harmonic Mean is reminiscent of the Williams correction, which divides the G² values by an Evenness-like term q = 1+(a²−1)/6Nr, where a is the number of categories for a goodness-of-fit test, K [18-20], or more generally K/PrevH [17], which has maximum K when Prevalence is even, and r=K−1 degrees of freedom; but for the more relevant usage as an independence test on a complete contingency table with r=(K−1)² degrees of freedom it is given by a²−1 = (K/PrevH−1)(K/BiasH−1), where PrevH and BiasH are the Harmonic Means across the K classes or labels respectively [17-23].

In practice, any reasonable excursion from Evenness will be reflected adequately by any of the means discussed; however it is important to recognize that the + form is actually a squared Geometric Mean and is the product of the other two forms, as shown in (62). An uneven Bias or Prevalence will reduce all the corresponding Evenness forms, and compensate against reduced measures of association and significance due to lowered determinants.

Whereas broad assumptions and gross accuracy within an order of magnitude may be acceptable for calculating significance tests and p-values [23], this is clearly not appropriate for estimating the strength of associations. Thus the basic idea of Cramer's V is flawed, given the rough assumptions and substantial errors associated with significance tests. It is thus better to start with a good measure of association, and use analogous formulae to estimate significance or confidence.

Generalization of Confidence
The discussion of confidence generalizes directly to the general case, with the approximation using Bookmaker Informedness¹, or analogously Markedness, applying directly (the Informedness form is again a Prevalence-weighted sum, in this case of a sum of squared versus absolute errors), viz.

    CIB2 = X·[1−|B|] / √[2E(N−1)]    (63)
    CIM2 = X·[1−|B|] / √[2E(N−1)]    (64)
    CIC2 = X·[1−|B|] / √[2E(N−1)]    (65)

¹ Informedness may be dichotomous, and relates in this form to DeltaP, WRacc and the Gini Coefficient as discussed below. Bookmaker Informedness refers to the polychotomous generalization based on the Bookmaker analogy and algorithm [4].
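A sketch of (63) in Matlab, with X the normal multiplier and the Evenness factor E taken, as suggested later in the text for the multiclass case, to be PrevG·BiasG·K²; all concrete numbers are illustrative assumptions:

    % Confidence interval (63) about a full-correlation hypothesis.
    B = 0.64; N = 128; K = 3; Xmul = 1.96;
    PrevG = 0.33; BiasG = 0.33;              % geometric-mean margins (example)
    E = PrevG*BiasG*K^2;                     % assumed Evenness factor
    CIB2 = Xmul*(1-abs(B)) / sqrt(2*E*(N-1));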


In Equations 63-65, the Confidence Intervals derived from the sse estimates of §2.8 are subscripted to show those appropriate to the different measures of association (Bookmaker Informedness, B; Markedness, M; and their geometric mean as a symmetric measure of Correlation, C). Those shown relate to beta (the empirical hypothesis based on the calculated B, giving rise to a test of power), but are also appropriate both for significance testing of the null hypothesis (B=0) and for providing tight (0-width) bounds on the full correlation (B=1) hypothesis, as appropriate to its signification of an absence of random variation and hence 100% power (extending this to include measurement error, discretization error, etc.). The numeric subscripts reflect the different assumptions behind the calculation of the confidence intervals (0 for the null hypothesis corresponding to alpha=0.05, 1 for the alternate hypothesis corresponding to beta=0.05 based on the weighted arithmetic model, and 2 for the full correlation hypothesis corresponding to gamma=0.05); notwithstanding these differences, for practical purposes it is reasonable to use |1−B| to define the basic confidence interval for CIB0, CIB1 and CIB2, given that variation is due solely to unknown factors other than measurement and discretization error. Note that all error, of whatsoever kind, will lead to empirical estimates B<1.

If the empirical (CIB1) confidence intervals include B=1, the broad confidence intervals (CIB2) around a theoretical expectation of B=1 would also include the empirical contingency – it is a matter of judgement, based on an understanding of the contributing error, whether the hypothesis B=1 is supported given non-zero error. In general B=1 should be achieved empirically for a true correlation unless there are measurement or labelling errors that are excluded from the informedness model, since B<1 is always significantly different from B=1 by definition (1−B=0 unaccounted variance due to guessing).

None of the traditional confidence or significance measures fully accounts for discretization error (N<8K) or for the distribution of margins, which are ignored by traditional approaches. To deal with discretization error we can adopt an sse estimate that is either constant, independent of B, such as the unweighted arithmetic mean, or a non-trivial function that is non-zero at both B=0 and B=1, such as the weighted arithmetic mean, which leads to:

    CIB1 = X·[1−2|B|+2B²] / √[2E(N−1)]    (66)
    CIM1 = X·[1−2|B|+2B²] / √[2E(N−1)]    (67)
    CIC1 = X·[1−2|B|+2B²] / √[2E(N−1)]    (68)

Fig. 6 – Illustration of significance and confidence. 110 Monte Carlo simulations with 11 stepped expected Informedness levels (red line) with Bookmaker-estimated Informedness (red dots), Markedness (green dots) and Correlation (blue dots), with significance (p+1) calculated using G², χ², and Fisher estimates, and confidence bands shown for both the theoretical Informedness and the B=0 and B=1 levels (parallel, almost meeting at B=0.5). The lower theoretical band is calculated twice, using both CIB1 and CIB2. Here K=4, N=16, X=1.96, α=β=0.05.

Substituting B=0 and B=1 into this gives equivalent CIs for the null and full hypotheses. In fact it is sufficient to use the B=0 and B=1 confidence intervals based on this variant, since for X=2 they overlap at N<16.
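The overlap behaviour just described can be sketched numerically. The Evenness value here is an assumption (the even dichotomous case), so the exact crossover N depends on how E is set:

    % Null (B=0) and full (B=1) intervals from (66) as N varies.
    sse1 = @(B) 1 - 2*abs(B) + 2*B.^2;       % weighted arithmetic mean sse
    E = 0.25; Xmul = 2; N = 4:32;            % assumed Evenness and multiplier
    ci0 = Xmul*sse1(0) ./ sqrt(2*E*(N-1));   % half-width about B=0
    ci1 = Xmul*sse1(1) ./ sqrt(2*E*(N-1));   % half-width about B=1
    overlap = (ci0 + ci1) > 1;               % do the two intervals meet?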

We illustrate such a marginal significance case in Fig. 6, where the large difference between the significance estimates is clear, with Fisher showing marginal significance or better almost everywhere, G² for B>~0.6, and χ² for B>~0.8. More than 95% of Bookmaker estimates are within the confidence bands as required (with 100% bounded by the more conservative lower band); however our B=0 and B=1 confidence intervals almost meet, showing that we cannot distinguish intermediate B values other than B=0.5, which is marginal. Viz. we can say that this data seems to be random (B<0.5) or informed (B>0.5), but cannot be specific about the level of informedness for this small N (except for B=0.5±0.25).

If there is a mismatch of the marginal weights between the respective prevalences and biases, this is taken to contravene our assumption that the Bookmaker statistics are calculated for the optimal assignment of class labels. Thus we assume that any mismatch is one of evenness only, and thus we set the Evenness factor E = PrevG·BiasG·K². Note that the difference between Informedness and Markedness also relates to Evenness, but Markedness values are likely to lie outside bounds attached to Informedness with probability greater than the specified beta. Our model can thus take into account the distribution of margins, provided the optimal allocation of predictions to categories (labelling) is assigned.

The multiplier X shown is set from the appropriate (inverse cumulative) Normal or Poisson distribution, and under the two-tailed form of the hypothesis X=1.96 gives alpha, beta and gamma of 0.05. A multiplier of X=1.65 is appropriate for a one-tailed hypothesis at the 0.05 level.
Significance of difference from another model is satisfied to the specified level if the specified model (including null or full) does not lie in the confidence interval of the alternate model. Power is adequate to the specified level if the alternate model does not lie in the confidence interval of the specified model. Figure 7 further illustrates the effectiveness of the 95% empirical and theoretical confidence bounds in relation to the significance achievable at N=128 (K=5).

Fig. 7 – Illustration of significance and confidence. 110 Monte Carlo simulations with 11 stepped expected Informedness levels (red line) with Bookmaker-estimated Informedness (red dots), Markedness (green dots) and Correlation (blue dots), with significance (p+1) calculated using G², χ², and Fisher estimates, and confidence bands shown for both the theoretical Informedness and the B=0 and B=1 levels (parallel, almost meeting at B=0.5). The lower theoretical band is calculated twice, using both CIB1 and CIB2. Here K=5, N=128, X=1.96, α=β=0.05.

EXPLORATION AND FUTURE WORK
Powers' Bookmaker Informedness has been used extensively by the proponent and his students over the last 10 years, in particular in the PhD theses and other publications relating to Audio-Visual Speech Recognition [25-26] and EEG/Brain Computer Interface [27-28], plus Matlab scripts that are available for calculating both the standard and Bookmaker statistics² (these were modified by the present author to produce the results presented in this paper). The connection with DeltaP was noted in the course of collaborative research in Psycholinguistics, and provides an important psychological justification or confirmation of the measure where biological plausibility is desired. We have referred extensively to the equivalence of Bookmaker Informedness to ROC AUC, as used standardly in Medicine, although AUC has the form of an undemeaned probability based on a parameterized classifier or a series of classifiers, and B is a demeaned, renormalized, kappa-like form based on a single fully specified classifier.

The Informedness measure has thus proven its worth across a wide range of disciplines, at least in its dichotomous form. A particular feature of the major PhD studies that used Informedness is that they covered different numbers of classes (exercising the multiclass form of Bookmaker as implemented in Matlab), as well as a number of different noise and artefact conditions. Both of these aspects of their work meant that the traditional measures and derivatives of Recall, Precision and Accuracy were useless for comparing the different runs and the different conditions, whilst Bookmaker gave clear, unambiguous, easily interpretable results, which were contrasted with the traditional measures in these studies.

The new χ²KB, χ²KM and χ²KBM, and χ²XB, χ²XM and χ²XBM correlation statistics were developed heuristically with approximative sketch proofs/arguments, and have only been investigated to date in toy contrived situations and the Monte Carlo simulations of Figs 3 to 5. In particular, whilst they work well in the dichotomous case, where they demonstrate a clear advantage over traditional χ² approaches, there has as yet been no application to our multiclass experiments and no major body of work comparing new and conventional approaches to significance.

Just as Bookmaker (or DeltaP') is the normative measure of accuracy for a system against a Gold Standard, so χ²XB is the proposed χ² significance statistic for this most common situation in the absence of a more specific model (noting that ∑x = ∑x² for dichotomous data in {0,1}). For the cross-rater or cross-system comparison, where neither is normative, the BMG Correlation is the appropriate measure, and correspondingly we propose that χ²KBM is the appropriate χ² significance statistic. To explore these thoroughly is a matter for future research. However, in practice we tend to recommend the use of Confidence Intervals, as illustrated in Figs 4 and 5, since these give a direct indication of power versus the confidence interval on the null hypothesis, as well as power when used with confidence intervals on an alternate hypothesis.

² http://www.mathworks.com/matlabcentral/fileexchange/5648-informedness-of-a-contingency-matrix


Furthermore, when used on the empirical mean (correlation, markedness or informedness), the overlap of the interval with another system, and vice versa, gives a direct indication of both the significance and the power of the difference between them. If a system occurs in another confidence interval it is not significantly different from that system or hypothesis, and if it does not, it is significantly different. If its own confidence interval also avoids overlapping the alternate mean, this mutual significance is actually a reflection of statistical power at a complementary level. However, as with significance tests, it is important to avoid reading too much into non-overlap of interval and mean (rather than of intervals), as the actual probabilities of the hypotheses depend also on unknown priors.

Thus whilst our understanding of Informedness and Markedness as performance measures is now quite mature, particularly in view of the clear relationships with existing measures exposed in this article, we do not regard current practice in relation to significance and confidence, or indeed our present discussion, as having the same level of maturity, and a better understanding of the significance and confidence measures remains a matter for further work, including in particular research into the multiclass application of the technique, and exploration of the asymmetry in the degrees of freedom appropriate to alpha and beta, which does not seem to have been explored hitherto. Nonetheless, based on pilot experiments, the dichotomous χ²KB family of statistics seems to be more reliable than the traditional χ² and G² statistics, and the confidence intervals seem to be more reliable than both.

It is also important to recall that the marginal assumptions underlying both the χ² and G² statistics and the Fisher Exact Test are not actually valid for contingencies based on a parameterized or learned system (as opposed to naturally occurring pre- and post-conditions), as the different trade-offs and algorithms will reflect different margins (biases).

It also remains to explore the relationship between Informedness, Markedness, Evenness and the Determinant of Contingency in the general multiclass case. In particular, the determinant generalizes to multiple dimensions to give a volume of space that represents the coverage of parameterizations that are more random than the contingency matrix and its perverted forms (that is, permutations of the classes or labels that make it suboptimal or subchance).

Maximizing the determinant is necessary to maximize Informedness and Markedness, and hence Correlation, and the normalization of the determinant to give those measures as defined by (42-43) defines respective multiclass Evenness measures satisfying a generalization of (20-21). This alternate definition needs to be characterized, and is the exact form that should be used in equations 30 to 46. The relationship to the discussed mean-based definitions remains to be explored, and they must at present be regarded as approximative. However, it is possible (and arguably desirable), instead of using Geometric Means as outlined above, to calculate Evenness as defined by the combination of (20-22, 42-43). It may be that there is a simplified identity or a simple relationship with the Geometric Mean definition, but such simplifications have yet to be investigated.


MONTE CARLO SIMULATION
Whilst the Bookmaker measures are exact estimates of various probabilities, as expected values they are means of distributions influenced not only by the underlying decision probability but also by the marginal and joint distributions of the contingent variables. In developing these estimates a minimum of assumptions has been made, including avoiding the assumption that the margins are predetermined or that bias tracks prevalence, and thus it is arguable that there is no attractor at the expected values produced as the independent product of marginal probabilities. For the purposes of Monte Carlo simulation, these have been implemented in Matlab 6R12 using a variety of distributions across the full contingency table, the uniform variant modelling events hitting any cell with equal probability in a discrete distribution with K²−1 degrees of freedom (given N is fixed). In practice, a (pseudo)random number generator will not automatically set K² random numbers so that they add exactly to N, and setting K²−1 cells and allowing the final cell to be determined would give it o(K) times the standard deviation of the other cells. Thus another approach is to approximately specify N and either leave the number of elements as it comes, or randomly increment or decrement cells to bring it back to N, or ignore integer discreteness constraints and renormalize by multiplication. This raises the question of what other constraints we want to maintain, e.g. that cells are integral and non-negative, and that margins are integral and strictly positive.

An alternate approach is to separately determine the prediction bias and real prevalence margins, using a uniform distribution, and then use conventional distributions around the expected value of each cell. If we believe the appropriate distribution is normal, or that the central limit theorem applies, as is conventionally assumed in the theory of χ² significance as well as the theory of confidence intervals, then a normal distribution can be used. However, if as in the previous model we envisage events that are allocated to cells with some probability, then a binomial distribution is appropriate, noting that this is a discrete distribution and that for reasonably large N it approaches the normal distribution; indeed the sum of independent events meets the definition of the normal distribution except that discretization will cause deviation. In general, it is possible that the expected marginal distribution is not met, or in particular that the assumption that no marginal probability is 0 is not reflected in the empirical distributions.
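A minimal Matlab sketch of the cell-level variant just described (our own reconstruction, not the paper's actual script): events are allocated to a perfect table and a chance table, mixed with weights I and 1−I, and the total is then repaired to exactly N:

    % Monte Carlo generation of a K-class contingency with target level I.
    K = 4; N = 128; I = 0.5;                 % target Informedness level
    prev = ones(1,K)/K;                      % real-class distribution (even)
    perfect = zeros(K); chance = zeros(K);
    for n = 1:N
      c = find(rand <= cumsum(prev), 1);     % draw a real class
      perfect(c,c) = perfect(c,c) + 1;       % informed: correct prediction
      r = randi(K);                          % chance: random prediction
      chance(r,c) = chance(r,c) + 1;
    end
    X = round(I*perfect + (1-I)*chance);     % mix and re-discretize
    d = sum(X(:)) - N;                       % repair the total back to N
    while d ~= 0
      i = randi(K); j = randi(K);
      if d > 0 && X(i,j) > 0
        X(i,j) = X(i,j) - 1; d = d - 1;
      elseif d < 0
        X(i,j) = X(i,j) + 1; d = d + 1;
      end
    end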
ultimately use gammaln to calculate values and thus An alternate approach is to separately determine the copula technique is of reasonable order, its the prediction bias and real prevalence margins, results being comparable with those of absolute using a uniform distribution, and then using normal. conventional distributions around the expected Figures 2, 3, 5, 6 and 7 have thus all been based on value of each cell. If we believe the appropriate premarginalized simulations in Matlab, with distribution is normal, or the central limit applies, as discretized absolute normal distributions using post is conventionally assumed in the theory of χ 2 processing as discussed above to ensure significance as well as the theory of confidence maintenance of all constraints, for K=2 to 102 with 1 9 intervals, then a normal distribution can be used. expected value of N/K = 2 to 2 and expected B of However, if as in the previous model we envisage 0/10 to 10/10, noting that the forced constraint events that are allocated to cells with some process introduces additional randomness and that probability, then a binomial distribution is the relative amount of correction required is appropriate, noting that this is a discrete distribution expected to decrease with K. and that for reasonably large N it approaches the CONCLUSION normal distribution, and indeed the sum of The system of relationships we have discovered is independent events meets the definition of the amazingly elegant. From a contingency matrix in normal distribution except that discretization will count or reduced form (as probabilities), we can cause deviation. In general, it is possible that the expected marginal distribution is not met, or in particular that the assumption that no marginal 3 The author has since found and corrected this Matlab initialization bug.
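As a concrete illustration of the two generation schemes, the following sketch reconstructs them in outline (our code, not the paper's; written for a modern Matlab with randi and accumarray rather than Matlab 6R12, and using the slow direct simulation of the binomial over N events rather than binornd). Variant 1 scatters N events uniformly over the K² cells, so the total is exact by construction; Variant 2 premarginalizes, fills each cell binomially around its expected value, restores the total by random increments or decrements, and patches zero margins with 1s to avoid NaNs:

    % Monte Carlo contingency tables -- illustrative reconstruction only.
    K = 5;  N = 128;                        % classes and events (cf. Figs 4-5)

    % Variant 1: N events hit any of the K^2 cells with equal probability,
    % a discrete distribution with K^2 - 1 degrees of freedom (N fixed).
    cells = randi(K*K, N, 1);               % uniform cell index per event
    tbl   = reshape(accumarray(cells, 1, [K*K, 1]), K, K);

    % Variant 2: premarginalized. Draw margins uniformly, then fill each
    % cell with a Binomial(N, prev(i)*bias(j)) count, simulated directly
    % over N events (the o(N) direct route noted in the text).
    prev = rand(K, 1);  prev = prev / sum(prev);   % real prevalences
    bias = rand(1, K);  bias = bias / sum(bias);   % prediction biases
    tbl2 = zeros(K);
    for i = 1:K
        for j = 1:K
            tbl2(i, j) = sum(rand(N, 1) < prev(i) * bias(j));
        end
    end

    % Randomly increment/decrement cells to bring the total back to N.
    while sum(tbl2(:)) ~= N
        c = randi(K*K);
        tbl2(c) = max(0, tbl2(c) + sign(N - sum(tbl2(:))));
    end

    % Enforce strictly positive margins: a 1 at the intersection of paired
    % zero rows/columns, or in an arbitrary position for unpaired ones
    % (this perturbs the total slightly, adding the extra randomness the
    % text attributes to the forced constraint process).
    zr = find(sum(tbl2, 2) == 0);           % all-zero rows
    zc = find(sum(tbl2, 1) == 0);           % all-zero columns
    for m = 1:min(numel(zr), numel(zc))
        tbl2(zr(m), zc(m)) = 1;             % paired row/column
    end
    for m = numel(zc)+1 : numel(zr)
        tbl2(zr(m), randi(K)) = 1;          % unpaired zero row
    end
    for m = numel(zr)+1 : numel(zc)
        tbl2(randi(K), zc(m)) = 1;          % unpaired zero column
    end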

CONCLUSION
The system of relationships we have discovered is amazingly elegant. From a contingency matrix in count or reduced form (as probabilities), we can construct both dichotomous and mutually exclusive multiclass statistics that correspond to debiased versions of Recall and Precision (28,29). These may be related to the Area under the Curve and distance from (1,1) in the Recall-based ROC analysis, and its dual Precision-based method, for a single fully-specified classifier (viz. after fixing threshold and/or other parameters). There are further insightful relationships with Matthews Correlation, with the determinant of either form of the matrix (DTP or dtp), and the Area of the Triangle defined by the ROC point and the chance line, or equivalently the Area of the Parallelogram or Trapezoid defined by its perverted forms.
Also useful is the direct relationship of the three Bookmaker goodness measures (Informedness, Markedness and Matthews Correlation) with both standard (biased) single-variable significance tests as well as the clean generalization to unbiased significance tests in both dependent (low degree of freedom) and independent (high degree of freedom) forms, along with simple formulations for estimating confidence intervals. More useful still is the simple extension to confidence intervals, which have the advantage that we can compare against models other than the null hypothesis corresponding to B=0. In particular we also introduce the full hypothesis corresponding to full informedness at B=1, mediated by measurement or labelling errors, and can thus distinguish when it is appropriate to recognize a specific value of partial informedness, 0<B<1 […] as a demeaned average. Evenness is the square of the Geometric Mean of Prevalence and Inverse Prevalence and/or Bias and Inverse Bias. χ² testing is just multiplication by a constant, and conservative confidence intervals are then a matter of taking a square root.
There is also an intuitive relationship between the unbiased measures and their significance and confidence, and we have sought to outline a rough rationale for this, but this remains somewhat short of formal proof of optimal formulae defining close bounds on significance and confidence.
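To make the last two observations concrete, here is a minimal sketch (ours; it uses the standard one-degree-of-freedom relationship χ² = N·r² for a 2×2 table rather than the paper's Evenness-weighted generalizations, and the correlation figures for the two systems are invented for illustration):

    % Significance as multiplication by a constant, confidence as a square root.
    N  = 128;  X = 1.96;            % sample size and z-value for alpha = 0.05
    rA = 0.30;  rB = 0.12;          % correlation estimates for two systems

    chi2A = N * rA^2;               % chi-squared against the null (B=0), 1 df
    hw    = X / sqrt(N);            % confidence half-width around a hypothesis

    sigA = abs(rA) > hw;            % A significantly different from chance?
    % Interval-overlap comparison of the two systems (cf. the discussion of
    % CI B1/CI B2 above): if each estimate lies inside the other's interval,
    % the difference between the systems is not significant.
    notDifferent = abs(rA - rB) <= hw;
    fprintf('chi2A = %.2f, half-width = %.3f, sigA = %d, notDifferent = %d\n', ...
            chi2A, hw, sigA, notDifferent);

Centring the interval at B=1 instead of 0 gives the corresponding test against the full-informedness hypothesis, as discussed above.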
