
Two Information-Theoretic Tools to Assess the Performance of Multi-class Classifiers

Francisco J. Valverde-Albacete*, Carmen Peláez-Moreno
Departamento de Teoría de la Señal y de las Comunicaciones, Universidad Carlos III de Madrid, Avda. de la Universidad 30, 28911 Leganés, Spain
*Corresponding author. Phone: +34 91 624 87 38. Fax: +34 91 624 87 49. Email addresses: [email protected] (Francisco J. Valverde-Albacete), [email protected] (Carmen Peláez-Moreno).

Abstract: We develop two tools to analyze the behavior of multiple-class, or multi-class, classifiers by means of entropic measures on their confusion matrix or contingency table. First we obtain a balance equation on the entropies that captures interesting properties of the classifier. Second, by normalizing this balance equation we first obtain a 2-simplex in a three-dimensional entropy space and then the de Finetti entropy diagram or entropy triangle. We also give examples of the assessment of classifiers with these tools.

Key words: Multi-class classifier, confusion matrix, contingency table, performance measure, evaluation criterion, de Finetti diagram, entropy triangle

1. Introduction

Let V_X = {x_i}_{i=1..n} and V_Y = {y_j}_{j=1..p} be sets of input and output class identifiers, respectively, in a multiple-class classification task. The basic classification event consists in "presenting a pattern of input class x_i to the classifier to obtain output class identifier y_j", (X = x_i, Y = y_j). The behavior of the classifier can be sampled over N iterated experiments to obtain a count matrix N_XY, where N_XY(x_i, y_j) = N_ij counts the number of times that the joint event (X = x_i, Y = y_j) occurs. We say that N_XY is the (count-based) confusion matrix or contingency table of the classifier.

Since a confusion matrix is an aggregate recording of the classifier's decisions, the characterization of the classifier's performance by means of a measure or set of measures over its confusion matrix is an interesting goal.

One often-used measure is accuracy, the proportion of times the classifier takes the correct decision, A(N_XY) ≈ Σ_i N_XY(x_i, y_i)/N. But this has often been deemed biased towards classifiers acting on non-uniform prior distributions of input patterns (Ben-David, 2007; Sindhwani et al., 2004). For instance, with continuous speech corpora the silence class may account for 40-60% of input patterns, making a majority classifier that always decides Y = silence, the most prevalent class, quite accurate but useless. Related measures based on proportions over the confusion matrix can be found in Sokolova and Lapalme (2009).

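To make the accuracy computation and the bias just described concrete, here is a minimal Python sketch. It is not part of the original letter: the count matrix N_majority is a hypothetical example of a majority classifier on a task where a silence-like class dominates the priors.

    import numpy as np

    def accuracy(N):
        """Proportion of correct decisions, A(N_XY) = sum_i N(x_i, y_i) / N."""
        N = np.asarray(N, dtype=float)
        return np.trace(N) / N.sum()

    # Hypothetical confusion matrix of a majority classifier on a 3-class task
    # where class 0 ("silence") accounts for 60% of the input patterns: every
    # input is assigned to class 0, so only the first column is populated.
    N_majority = np.array([[60, 0, 0],
                           [25, 0, 0],
                           [15, 0, 0]])

    print(accuracy(N_majority))   # 0.60: fairly high, yet the classifier is useless
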
On these grounds, Kononenko and Bratko (1991) have argued for factoring out the influence of prior class probabilities in such measures. Yet Ben-David (2007) has argued for the use of measures that correct naturally for random decisions, like Cohen's kappa, although this particular measure seems to be affected by the marginal distributions.

The Receiver Operating Characteristic (ROC) curve (Fawcett, 2006) has often been considered a good visual characterization of binary confusion matrices built upon proportion measures, but its generalization to higher input and output set cardinalities is not as effective. Likewise, an extensive Area Under the Curve (AUC) for a ROC has often been considered an indication of a good classifier (Bradley, 1997; Fawcett, 2006), but the calculation of its higher-dimensional analogue, the Volume Under the Surface (VUS) (Hand and Till, 2001), is less manageable. It may also suffer from comparability issues across classifiers (Hand, 2009).

A better ground for discussing performance than count confusion matrices may be empirical estimates of the joint distribution between input and output, like the maximum likelihood estimate used throughout this letter, P_XY(x_i, y_j) ≈ P̂^MLE_XY(x_i, y_j) = N(x_i, y_j)/N. The subsequent consideration of the classifier as an analogue of a communication channel between input and output class identifiers enables the importing of information-theoretic tools to characterize the "classification channel". This technique is already implicit in the work of Miller and Nicely (1955).

With this model in mind, Sindhwani et al. (2004) argued for entropic measures that take into account the information transfer through the classifier, like the expected mutual information between the input and output distributions (Fano, 1961),

    MI_{P_XY} = Σ_{x,y} P_{X,Y}(x, y) log [ P_{X,Y}(x, y) / (P_X(x) P_Y(y)) ]    (1)

and provided a contrived example with three confusion matrices with the same accuracy but clearly differing performances, in their opinion due to differences in mutual information. Such examples are akin to those put forth by Ben-David (2007) to argue for Cohen's kappa as an evaluation metric for classifiers.

For the related task of clustering, Meila (2007) used the Variation of Information, which actually amounts to the sum of the mutually conditioned entropies, as a true distance between the two random variables,

    VI_{P_XY} = H_{P_X|Y} + H_{P_Y|X}.

In this letter we first try to reach a more complete understanding of what a good classifier is by developing an overall constraint on the total entropy balance attached to its joint distribution. Generalizing over the input and output class set cardinalities will allow us to present a visualization tool in Section 2.2 for classifier evaluation, which we further explore on examples from both real and synthetic data in Section 2.3. In Section 2.4 we try to extend the tools to unmask majority classifiers as bad classifiers. Finally, we discuss the affordances of these tools in the context of previously used techniques.

2. Information-Theoretic Analysis of Confusion Matrices

2.1. The balance equation and the 2-simplex

Let P_XY(x, y) be an estimate of the joint probability mass function (pmf) between input and output, with marginals P_X(x) = Σ_{y_j ∈ Y} P_{X,Y}(x, y_j) and P_Y(y) = Σ_{x_i ∈ X} P_{X,Y}(x_i, y). (From now on we drop the explicit variable notation in the distributions.)

Let Q_XY = P_X · P_Y be the pmf with the same marginals as P_XY, considering them to be independent (that is, describing independent variables). Let U_XY = U_X · U_Y be the product of the uniform, maximally entropic pmfs over X and Y, with U_X(x) = 1/n and U_Y(y) = 1/p. Then the loss in uncertainty from U_XY to Q_XY is the difference in entropies:

    ∆H_{P_X·P_Y} = H_{U_X·U_Y} − H_{P_X·P_Y}    (2)

Intuitively, ∆H_{P_X·P_Y} measures how far the classifier is operating from the most general situation possible, where all inputs are equally probable, which prevents the classifier from specializing in an overrepresented class to the detriment of classification accuracy in others. Since H_{U_X} = log n and H_{U_Y} = log p, ∆H_{P_X·P_Y} may vary from ∆H^min_{P_X·P_Y} = 0, when the marginals themselves are uniform (P_X = U_X and P_Y = U_Y), to a maximum value ∆H^max_{P_X·P_Y} = log n + log p, when they are Kronecker delta distributions.

We would like to relate this entropy decrement to the expected mutual information MI_{P_XY} of a joint distribution. For that purpose, we realize that the mutual information formula (1) describes the decrease in entropy when passing from distribution Q_XY = P_X · P_Y to P_XY,

    MI_{P_XY} = H_{P_X·P_Y} − H_{P_XY}.    (3)

And finally we invoke the well-known formula relating the joint entropy H_{P_XY} and the expected mutual information MI_{P_XY} to the conditional entropies of X given Y, H_{P_X|Y} (of Y given X, H_{P_Y|X}, respectively):

    H_{P_XY} = H_{P_X|Y} + H_{P_Y|X} + MI_{P_XY}    (4)

Therefore MI_{P_XY} may range from MI^min_{P_XY} = 0, when P_XY = P_X · P_Y, a bad classifier, to a theoretical maximum MI^max_{P_XY} = (log n + log p)/2 in the case where the marginals are uniform and input and output are completely dependent, an excellent classifier.

Recall the variation of information definition, restated in Eq. (5):

    VI_{P_XY} = H_{P_X|Y} + H_{P_Y|X}    (5)

For optimal classifiers, with a deterministic relation from the input to the output and diagonal confusion matrices, VI^min_{P_XY} = 0, i.e., all the information about X is borne by Y and vice versa. On the contrary, when they are independent, VI^max_{P_XY} = H_{P_X} + H_{P_Y}, the case of inaccurate classifiers which uniformly redistribute inputs among all outputs.

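The quantities in Eqs. (1)-(5) can be estimated directly from a count confusion matrix through the MLE joint pmf. The following self-contained Python sketch is our illustration, not the authors' code; the helper names are ours and natural logarithms are used, although any base yields the same normalized fractions introduced below.

    import numpy as np

    def entropy(p):
        """Shannon entropy of a pmf given as an array (natural log; 0 log 0 = 0)."""
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return float(-(p * np.log(p)).sum())

    def entropic_measures(N):
        """Entropies used in Eqs. (1)-(5), estimated from a count matrix N."""
        P = np.asarray(N, dtype=float)
        P = P / P.sum()                      # MLE estimate of P_XY
        PX, PY = P.sum(axis=1), P.sum(axis=0)
        n, p = P.shape
        H_Uxy = np.log(n) + np.log(p)        # entropy of the uniform product pmf U_XY
        H_PxPy = entropy(PX) + entropy(PY)   # entropy of the independent pmf Q_XY
        H_Pxy = entropy(P)
        MI = H_PxPy - H_Pxy                  # Eq. (3), equivalent to Eq. (1)
        dH = H_Uxy - H_PxPy                  # Eq. (2)
        VI = H_Pxy - MI                      # Eqs. (4)-(5): H_{P_X|Y} + H_{P_Y|X}
        return dict(H_Uxy=H_Uxy, H_PxPy=H_PxPy, H_Pxy=H_Pxy, MI=MI, dH=dH, VI=VI)

    # Example: matrix b of Fig. 3 below.
    print(entropic_measures([[16, 2, 2], [2, 16, 2], [1, 1, 18]]))
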
Collecting Eqs. (2)-(5) results in the balance equation for the information related to a joint distribution, our first result,

    H_{U_XY} = ∆H_{P_X·P_Y} + 2·MI_{P_XY} + VI_{P_XY}    (6)

The balance equation suggests an information diagram somewhat more complete than the one normally used for the relations between the entropies of two variables, as depicted in Fig. 1(a) (compare to Yeung, 1991, Fig. 1). In this diagram we distinguish the familiar decomposition of the joint entropy H_{P_XY} into the two entropies H_{P_X} and H_{P_Y}, whose intersection is MI_{P_XY}. But notice that the increment between H_{P_XY} and H_{P_X·P_Y} is yet again MI_{P_XY}; hence the expected mutual information appears twice in the diagram. Further, the interior of the outer rectangle represents H_{U_X·U_Y}, the interior of the inner rectangle represents H_{P_X·P_Y}, and ∆H_{P_X·P_Y} represents their difference in areas. The absence of the encompassing outer rectangle in Fig. 1(a) was specifically puzzled at by Yeung (1991).

Notice that, since both U_X and U_Y on the one hand and P_X and P_Y on the other are independent, as marginals of U_XY and Q_XY respectively, we may write

    ∆H_{P_X·P_Y} = (H_{U_X} − H_{P_X}) + (H_{U_Y} − H_{P_Y}) = ∆H_{P_X} + ∆H_{P_Y}    (7)

where

    ∆H_{P_X} = H_{U_X} − H_{P_X},    ∆H_{P_Y} = H_{U_Y} − H_{P_Y}.    (8)

This, and the occurrence of twice the expected mutual information in Eq. (6), suggests a different information diagram, depicted in Fig. 1(b). Both variables X and Y now appear somehow decoupled, in the sense that the areas representing them are disjoint, yet there is a strong coupling in that the expected mutual information appears in both H_{P_X} and H_{P_Y}. This suggests writing separate balance equations for each variable, to be used in Sec. 2.4,

    H_{U_X} = ∆H_{P_X} + MI_{P_XY} + H_{P_X|Y},    H_{U_Y} = ∆H_{P_Y} + MI_{P_XY} + H_{P_Y|X}.    (9)

Our interpretation of the balance equation is that the "raw" uncertainty available in U_XY, minus the deviation of the input data from the uniform distribution, ∆H_{P_X}, a given, is redistributed in the classifier-building process to the information being transferred from input to output, MI_{P_XY}. This requires as much mutual information again to stochastically bind the input to the output, thereby transforming P_X·P_Y into P_XY, and incurs an uncertainty decrease at the output equal to ∆H_{P_Y}. The residual uncertainty H_{P_X|Y} + H_{P_Y|X} should measure how efficient the process is: the smaller, the better.

To gain further understanding of the entropy decomposition suggested by the balance equation, from Eq. (6) and the paragraphs following Eqs. (2)-(5) we obtain

    H_{U_XY} = ∆H_{P_X·P_Y} + 2·MI_{P_XY} + VI_{P_XY},    0 ≤ ∆H_{P_X·P_Y}, 2·MI_{P_XY}, VI_{P_XY} ≤ H_{U_XY},

imposing severe constraints on the values these quantities may take, the most conspicuous of which is that, given two of the quantities, the third one is fixed. Normalizing by H_{U_XY} we get

    1 = ∆H'_{P_X·P_Y} + 2·MI'_{P_XY} + VI'_{P_XY},    0 ≤ ∆H'_{P_X·P_Y}, 2·MI'_{P_XY}, VI'_{P_XY} ≤ 1.    (10)

This is the 2-simplex in normalized ∆H'_{P_X·P_Y} × 2MI'_{P_XY} × VI'_{P_XY} space depicted in Fig. 2(a), a three-dimensional representation of classifier performance: each classifier with joint distribution P_XY can be characterized by its joint entropy fractions, F_XY(P_XY) = [∆H'_{P_X·P_Y}, 2·MI'_{P_XY}, VI'_{P_XY}].

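As a numerical sanity check of the balance equation (6) and of the normalized coordinates of Eq. (10), the following Python sketch (ours, with the same entropy helper as above) verifies that the three terms add up to H_{U_XY} and returns the joint fractions F_XY; the test matrices are illustrative.

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return float(-(p * np.log(p)).sum())

    def joint_fractions(N):
        """F_XY(P_XY) = [dH', 2 MI', VI'] of Eq. (10), normalized by H_{U_XY}."""
        P = np.asarray(N, dtype=float)
        P = P / P.sum()
        PX, PY = P.sum(axis=1), P.sum(axis=0)
        n, p = P.shape
        H_Uxy = np.log(n) + np.log(p)
        H_PxPy = entropy(PX) + entropy(PY)
        MI = H_PxPy - entropy(P)
        dH = H_Uxy - H_PxPy
        VI = entropy(P) - MI
        # Balance equation (6): dH + 2 MI + VI must equal H_Uxy.
        assert np.isclose(dH + 2 * MI + VI, H_Uxy)
        return np.array([dH, 2 * MI, VI]) / H_Uxy

    print(joint_fractions([[15, 0, 5], [0, 15, 5], [0, 0, 20]]))  # matrix a of Fig. 3
    print(joint_fractions(np.eye(4)))  # a perfect 4-class classifier -> [0, 1, 0]
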
[Figure 1 about here: two entropy area diagrams, panels (a) and (b), whose regions are labeled with H_{U_X·U_Y}, H_{P_X·P_Y}, H_{P_XY}, H_{U_X}, H_{U_Y}, H_{P_X}, H_{P_Y}, ∆H_{P_X}, ∆H_{P_Y}, ∆H_{P_X·P_Y}, MI_{P_XY}, H_{P_X|Y} and H_{P_Y|X}.]

Figure 1: Extended information diagrams of entropies related to a bivariate distribution: the expected mutual information appears twice. (a) Extended diagram, and (b) Modified extended diagram.

[Figure 2 about here: panel (a), the 2-simplex plotted in the normalized entropy space with axes ∆H'_{P_X·P_Y} (norm. entropy decrement), VI'_{P_XY} (norm. variation of information) and 2·MI'_{P_XY} (norm. mutual information); panel (b), its projection as the entropy triangle.]

Figure 2: (color on-line) Entropic representations for the bivariate distributions of the synthetic examples of Fig. 3: (a) the 2-simplex in three-dimensional, normalized entropy space ∆H'_{P_X·P_Y} × VI'_{P_XY} × 2MI'_{P_XY}, and (b) the de Finetti entropy diagram or entropy triangle, a projection of the 2-simplex onto a two-dimensional space (explanations in Sec. 2.3).

2.2. De Finetti entropy diagrams

Since the ROC curve is a bi-dimensional characterization of binary confusion matrices, we might wonder whether the constrained plane above has a simpler visualization. Consider the 2-simplex in Eq. (10) and Fig. 2(a). Its projection onto the plane with director vector (1, 1, 1) is its de Finetti (entropy) diagram, represented in Fig. 2(b). Alternatively to the three-dimensional representation, each classifier can be represented as a point at coordinates F_XY in the de Finetti diagram.

The de Finetti entropy diagram shows as an equilateral triangle, hence the alternative name entropy triangle, each of whose sides and vertices represents classifier performance-related qualities:

- If P_X and P_Y are independent in Q_XY = P_X·P_Y, then F_XY(Q_XY) = [·, 0, ·]. The lower side is the geometric locus of distributions with no mutual information transfer between input and output: the closer a classifier is to this side, the more unreliable its decisions are.

- If the marginals of P_XY are uniform, P_X = U_X and P_Y = U_Y, then F_XY(P_XY) = [0, ·, ·]. This is the locus of classifiers that are not trained with overrepresented classes and therefore cannot specialize in any of them: the closer to this side, the more generic the classifier.

- Finally, if P_XY is a diagonal matrix, then P_X = P_Y and F_XY(P_XY) = [·, ·, 0]. The right-hand side is the region of classifiers with no variation of information, that is, no remanent information in the conditional entropies: this characterizes classifiers which transfer as much information from H_{P_X} to H_{P_Y} as they can.

Moving away from these sides, the corresponding magnitudes grow until saturating at the opposite vertices, which therefore represent ideal, prototypical classifier loci:

- The upper vertex, F_XY(optimal) = [0, 1, 0], represents optimal classifiers, with the highest information transfer from input to output and highly entropic priors.

- The vertex to the left, F_XY(inaccurate) = [0, 0, 1], represents inaccurate classifiers, with low information transfer despite highly entropic priors.

- The vertex to the right, F_XY(underperforming) = [1, 0, 0], represents underperforming classifiers, with low information transfer and low-entropic priors, either at an easy task or refusing to deliver performance.

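For plotting, a point F_XY = [∆H', 2MI', VI'] can be mapped to the triangle with the standard barycentric-to-Cartesian projection. The Python sketch below is our illustration of Sec. 2.2, not the authors' plotting code; the particular placement of the three vertices is an arbitrary convention of ours.

    import numpy as np

    # Vertices of the entropy triangle in the plane (one possible layout):
    # [1,0,0] (underperforming) at the origin, [0,1,0] (optimal) at the apex,
    # and [0,0,1] (inaccurate) at (1, 0).
    VERTICES = np.array([[0.0, 0.0],               # dH'  = 1
                         [0.5, np.sqrt(3) / 2.0],  # 2MI' = 1
                         [1.0, 0.0]])              # VI'  = 1

    def to_triangle(F):
        """Barycentric-to-Cartesian projection of F = [dH', 2MI', VI'] (sums to 1)."""
        F = np.asarray(F, dtype=float)
        return F @ VERTICES

    print(to_triangle([0.0, 1.0, 0.0]))   # optimal classifiers map to the apex
    print(to_triangle([1/3, 1/3, 1/3]))   # the barycenter of the triangle

Any affine placement of the three vertices gives an equivalent diagram; only the barycentric coordinates F_XY themselves carry the classifier information.
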
In the next section we develop intuitions about the de Finetti diagram by observing how typical examples, real and synthetic, appear in it.

But first we would like to extend it theoretically to cope with the separate information balances of the marginal distributions. Recall that the modified information diagram in Fig. 1(b) suggests a decoupling of the information flow from input to output, further supported by Eq. (9). These equations describe the marginal fractions of entropy when the normalization is done with H_{U_X} and H_{U_Y}, respectively:

    F_X(P_XY) = [∆H'_{P_X}, MI'_{P_XY}, VI'_X = H'_{P_X|Y}]    (11)
    F_Y(P_XY) = [∆H'_{P_Y}, MI'_{P_XY}, VI'_Y = H'_{P_Y|X}]

hence we may consider the de Finetti marginal entropy diagrams for both F_X and F_Y to visualize the entropy changes from input to output.

Furthermore, since the normalization factors involved are directly related to those in the joint entropy balance, and MI'_{P_XY} has the same value in both marginal diagrams when n = p, we may represent the fractions of F_X and F_Y side by side with those of F_XY in an extended de Finetti entropy diagram: the point F_XY, being an average of F_X and F_Y, will appear in the diagram flanked by the latter two. We show examples of such extended diagrams and their use in Sec. 2.4.

2.3. Examples

To clarify the usefulness of our tools in assessing classifier performance, we explored data from real classifiers and synthetic examples that highlight special behaviors.

First, consider:

- matrices a, b, and c from Sindhwani et al. (2004), reproduced with the same names in Fig. 3,

- a matrix whose marginals are closer to a uniform distribution, a matrix whose marginals are closer to a Kronecker delta, and the confusion matrix of a majority classifier, with a delta output distribution but a more spread input distribution (matrices d, e and f in Fig. 3, respectively), and

- a series of distributions obtained by convex combination, P_XY = (1 − λ)·(P_X·P_Y) + λ·(P_{X=Y}), from a uniform bivariate distribution (P_X·P_Y) to a uniform diagonal distribution (P_{X=Y}), as the combination coefficient λ ranges in [0, 1].

The contrived examples in Sindhwani et al. (2004), matrices a, b, and c (represented in both diagrams of Fig. 2 as a diamond, a pentagram, and a hexagram, respectively), pointed out a need for new performance metrics, since they all show the same accuracy. The diagrams support the intuition that matrix a describes a slightly better classifier than matrix b, which in turn describes a better classifier than matrix c (see Sec. 2.4 for a further analysis of the behavior of c).

Figs. 2(a) and 2(b) demonstrate that there are clear differences in performance between a classifier with more uniform marginals and one with marginals more alike Kronecker deltas (matrices d and e in Fig. 3, the circle and square, respectively). Furthermore, an example of a majority classifier (matrix f, the downwards triangle) shows in the diagram as underperforming: it will be further analyzed in Sec. 2.4.

From the convex combination we plotted the line of asterisks at ∆H'_{P_X·P_Y} = 0 in Figs. 2(a) and (b). When the interpolation coefficient for the diagonal is null, we obtain the point at ∆H'_{P_X·P_Y} = 0, MI'_{P_XY} = 0, the worst classifier. As the coefficient increases, the asterisks denote better and better hypothetical classifiers until reaching the apex of the triangle, the best. We simulated in this guise the estimation of classifiers at improving SNRs for each point in the line, as shown below on real data.

In order to appraise the usefulness of the representation on real data, we visualized in Fig. 4(a) the performance of several series of classifiers. The circles to the right describe a classical example of the performance of human listeners in a 16-consonant human-speech recognition task at different SNRs (Miller and Nicely, 1955). They evidence the outstanding recognition capabilities of humans, always close to the maximum available information transfer at VI'_{P_XY} = 0, with a graceful degradation as the available information decreases with decreasing SNR, from 12 dB at the top of the line to −18 dB at the bottom. And they also testify to the punctiliousness of those authors in keeping to maximally generic input and output distributions at ∆H'_{P_X·P_Y} ≈ 0.

On the other hand, the asterisks, plus signs and crosses are series of automatic speech recognizers using the SpeechDat database (Moreno, 1997). They motivated this work in characterizing classifiers by means of entropic measures.

    a = ( 15  0  5 ;  0 15  5 ;  0  0 20 )    b = ( 16  2  2 ;  2 16  2 ;  1  1 18 )    c = (  1  0  4 ;  0  1  4 ;  1  1 48 )
    d = ( 15  0  0 ;  0 18  0 ;  0  0 27 )    e = (  1  0  0 ;  0  2  0 ;  0  0 57 )    f = (  0  0  5 ;  0  0  5 ;  0  0 50 )

Figure 3: Examples of synthetic confusion matrices with varied behavior: a, b and c from (Sindhwani et al., 2004), d a matrix whose marginals tend towards uniformity, e a matrix whose marginals tend to Kronecker’s delta and f the confusion matrix of a majority classifier.

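The joint entropy fractions of the six matrices of Fig. 3, and of the convex-combination series described above, can be computed with the same machinery. The following self-contained Python snippet is our illustration, not the authors' original plotting code; the helper names are ours.

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return float(-(p * np.log(p)).sum())

    def joint_fractions(P):
        """[dH', 2MI', VI'] of Eq. (10) for a joint pmf or count matrix P."""
        P = np.asarray(P, dtype=float)
        P = P / P.sum()
        PX, PY = P.sum(axis=1), P.sum(axis=0)
        H_U = np.log(P.shape[0]) + np.log(P.shape[1])
        MI = entropy(PX) + entropy(PY) - entropy(P)
        dH = H_U - entropy(PX) - entropy(PY)
        return np.array([dH, 2 * MI, entropy(P) - MI]) / H_U

    matrices = {
        "a": [[15, 0, 5], [0, 15, 5], [0, 0, 20]],
        "b": [[16, 2, 2], [2, 16, 2], [1, 1, 18]],
        "c": [[1, 0, 4], [0, 1, 4], [1, 1, 48]],
        "d": [[15, 0, 0], [0, 18, 0], [0, 0, 27]],
        "e": [[1, 0, 0], [0, 2, 0], [0, 0, 57]],
        "f": [[0, 0, 5], [0, 0, 5], [0, 0, 50]],
    }
    for name, N in matrices.items():
        print(name, np.round(joint_fractions(N), 3))

    # The convex-combination series of Sec. 2.3: from the uniform bivariate pmf
    # (lambda = 0, no information transfer) to the uniform diagonal pmf (lambda = 1).
    k = 3
    uniform_bivariate = np.full((k, k), 1.0 / k**2)
    uniform_diagonal = np.eye(k) / k
    for lam in np.linspace(0.0, 1.0, 5):
        P = (1 - lam) * uniform_bivariate + lam * uniform_diagonal
        print(round(lam, 2), np.round(joint_fractions(P), 3))

As expected from the text, the interpolated distributions keep uniform marginals, so their first coordinate stays at zero while the mutual information fraction grows with λ.
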

Figure 4: (color on-line) Examples of use of the de Finetti entropy diagram to assess classifiers: (a) human and machine classifier performance in consonant recognition tasks, and (b) the performance of some prototypical communication channel models.

- The series of squares describes an 18-class phonetic recognition task with worsening SNR that does not use any lexical information. This is roughly comparable to the experiments in Miller and Nicely (1955) and highlights the wide gap at present between human and machine performance in phonetic recognition.

- The series of plus signs describes phonetic confusions on the same phonetic recognition task when lexical information is incorporated. Notice that the tendency in either series is not towards the apex of the entropy triangle but towards increasing ∆H'_{P_X·P_Y}, suggesting that the learning technique used to build the classifiers is not doing a good job of extracting all the phonetic information available from the data, choosing to specialize the classifiers instead. Further, the additional lexical constraints on their own do not seem to be able to span the gap with human performance.

- Finally, the asterisks describe a series of classifiers for a 10-digit recognition task on the same data. The very high values of all the coordinates suggest that this is a well-solved task at all those noise conditions.

Notice that, although all these tasks have different class set cardinalities, they can be equally well compared in the same entropy triangle.

Since the simplex was developed for joint distributions, other objects characterized by them, such as communication channel models, may also be explored with the technique. These are high-level descriptions of the end-to-end input- and output-symbol classification capabilities of a communication system. Fig. 4(b) depicts three types of channels from MacKay (2003):

- the binary symmetric channel with n = p = 2, where we have made the probability of error range over p_e ∈ [0, 0.5] in 0.05 steps to obtain the series plotted with asterisks,

- the binary erasure channel with n = 2, p = 3, with the erasure probability ranging over p_e ∈ [0, 1.0] in 0.01 steps, plotted with circles, and

- the noisy typewriter with n = p = 27, describing a typewriter with a convention on the errors it commits, plotted as a pentagram.

As channels are actually defined by conditional distributions P_Y|X(y|x), we multiplied them with a uniform prior P_X = U_X to plot them. Although P_X = U_X, the same cannot be said of P_Y, which accounts for the fact that at most of the sample points of the binary erasure channel we have ∆H'_{P_X·P_Y} ≠ 0. On the other hand, the symmetries in the binary symmetric channel and the noisy typewriter account for ∆H'_{P_X·P_Y} = 0.

Notice how in the entropy triangle we can even make sense of a communication channel with different input and output symbol set cardinalities, e.g. the binary erasure channel.

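To illustrate how a channel model enters the same machinery, here is a Python sketch (ours) that builds the joint pmf from the conditional P_Y|X and a uniform prior U_X for the binary symmetric and binary erasure channels, and computes their triangle coordinates; row i of the conditional matrix holds P_Y|X(·|x_i), and p_e = 0.1 is an arbitrary example value.

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return float(-(p * np.log(p)).sum())

    def joint_fractions(P):
        P = np.asarray(P, dtype=float)
        P = P / P.sum()
        PX, PY = P.sum(axis=1), P.sum(axis=0)
        H_U = np.log(P.shape[0]) + np.log(P.shape[1])
        MI = entropy(PX) + entropy(PY) - entropy(P)
        return np.array([H_U - entropy(PX) - entropy(PY), 2 * MI, entropy(P) - MI]) / H_U

    def channel_joint(cond):
        """Joint pmf obtained by feeding the channel P_Y|X with a uniform prior U_X."""
        cond = np.asarray(cond, dtype=float)
        return cond / cond.shape[0]

    pe = 0.1
    bsc = [[1 - pe, pe],          # binary symmetric channel, n = p = 2
           [pe, 1 - pe]]
    bec = [[1 - pe, pe, 0.0],     # binary erasure channel, n = 2, p = 3
           [0.0, pe, 1 - pe]]     # the middle output symbol is the erasure

    print("BSC", np.round(joint_fractions(channel_joint(bsc)), 3))
    print("BEC", np.round(joint_fractions(channel_joint(bec)), 3))

In line with the text, the symmetric channel keeps both marginals uniform (first coordinate zero), whereas the erasure channel yields a non-uniform output marginal and hence a non-zero entropy decrement.
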
2.4. De Finetti diagram analysis of majority classifiers

Majority classifiers are capable of achieving a very high accuracy rate but are of limited interest. It is often required that good performance evaluation measures for classifiers show a baseline both for random and majority classifiers (Ben-David, 2007). For instance, majority classifiers should:

- have a low output entropy, hence a high ∆H'_{P_Y}, whatever their ∆H'_{P_X} value;

- have a low information transfer MI'_{P_XY};

- have some remanent output uncertainty, hence some VI'_{P_XY}.

Matrix f in Fig. 3 is the confusion matrix of a majority classifier with a non-uniform input marginal. We would like to know whether this behavior could be gleaned from a de Finetti diagram.

In Fig. 5(a) we have plotted again the joint entropy fractions for the synthetic cases analyzed above, together with the entropy fractions of their marginals. For most of the cases all three points coincide, showing as crosses within circles.

But matrices a, c and f (diamond, hexagram and downwards triangle in Fig. 5(a)) show differences between joint and marginal fractions. The most striking cases are those of matrices c and f, whose uncertainty diminishes dramatically from input to output.

Matrix f in Fig. 3 models the behavior of a majority classifier with the same input marginal as a-e. The marginal fraction points appear flanking it, at F_X(f) = [0.45, 0, 0.55] and F_Y(f) = [1, 0, 0]. The accuracy of this classifier would be around 0.83.

In a sense, this classifier is cheating: without any knowledge of the actual classification instances it has optimized the average accuracy, but it will be defeated if the input distribution gets biased towards a different class in the deployment (test) phase. It is now quite clear that c, being close to a majority classifier, attains its accuracy by specialization too.

Indeed, observing matrix a we may pinpoint the fact that its zero-pattern seems to be the interpolation of a diagonal confusion matrix and the confusion matrix of a majority classifier. This fact shows as the two marginal fractions flanking the diamond at approximately F_XY(a) = [0.03, 0.6, 0.37] in Fig. 5. However, since P_X was wisely kept uniform, ∆H'_{P_X} = 0 at F_X(a) = [0, 0.6, 0.4], and the classifier could only specialize to F_Y(a) = [0.06, 0.6, 0.34].

These examples suggest that:

- Specialization is a reduction in VI'_{P_XY} caused by the reduction in VI'_{P_Y} brought about by the increase in ∆H'_{P_Y}, that is, by manipulation of the output marginal distribution.

- Classifiers with diagonal matrices (VI'_{P_XY} = 0) need not specialize, and classifiers with uniform marginals (∆H'_{P_X·P_Y} = 0) cannot.

- Maintaining uniform input marginals amounts to a sort of regularization, preventing specialization from further transforming all of ∆H'_{P_Y} into a decrement of VI'_{P_XY}.

For real classifiers, we have plotted in Fig. 5(b) the marginal fractions of all the classifiers in Fig. 4(a). Again, for most of them the marginal fractions coincide with the joint fractions. But for the phonetic SpeechDat task, plotted with squares, we observe how with decreasing SNR the classifier has to resort to specialization. With increasing SNR it can concentrate on increasing the expected mutual information transmitted from input to output.

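The "flanking" behavior discussed above can be checked numerically. This Python sketch (ours) computes the marginal fractions of Eq. (11) next to the joint fractions of Eq. (10) for the majority-classifier matrix f; exact figures depend on the estimate of the joint pmf, so small deviations from the values quoted in the text are to be expected.

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return float(-(p * np.log(p)).sum())

    def extended_fractions(N):
        """Return F_X, F_XY, F_Y (Eqs. (10) and (11)) for a count matrix N."""
        P = np.asarray(N, dtype=float)
        P = P / P.sum()
        PX, PY = P.sum(axis=1), P.sum(axis=0)
        n, p = P.shape
        MI = entropy(PX) + entropy(PY) - entropy(P)
        H_XgY = entropy(P) - entropy(PY)       # H_{P_X|Y}
        H_YgX = entropy(P) - entropy(PX)       # H_{P_Y|X}
        FX = np.array([np.log(n) - entropy(PX), MI, H_XgY]) / np.log(n)
        FY = np.array([np.log(p) - entropy(PY), MI, H_YgX]) / np.log(p)
        H_U = np.log(n) + np.log(p)
        FXY = np.array([H_U - entropy(PX) - entropy(PY), 2 * MI, H_XgY + H_YgX]) / H_U
        return FX, FXY, FY

    f = [[0, 0, 5], [0, 0, 5], [0, 0, 50]]     # the majority classifier of Fig. 3
    for label, F in zip(("F_X", "F_XY", "F_Y"), extended_fractions(f)):
        print(label, np.round(F, 2))

Since F_XY is the componentwise average of F_X and F_Y when n = p, the joint point necessarily sits between the two marginal points, which is exactly the flanking pattern of the extended diagram.
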

Figure 5: (color on-line) Extended de Finetti entropy diagrams for synthetic and real examples: (a) for the synthetic confusion matrices of Fig. 2(b), and (b) for the real confusion matrices of Fig. 4(a). The expected mutual information coordinate is maintained in the three points for each confusion matrix.

3. Discussion and conclusions

We have provided a mathematical tool to analyze the behavior of multi-class classifiers by means of the balance of entropies of the joint probability mass distribution of input and output classes, as estimated from their confusion matrix or contingency table.

The balance equation takes into consideration the Kullback-Leibler divergence between the uniform distribution and the independent distribution with the same marginals as the original one; twice the expected mutual information between the independent and joint distributions with identical marginals (also a Kullback-Leibler divergence); and the variation of information, the difference between the joint entropy and the expected mutual information.

This balance equation can either be visualized as the 2-simplex in three-dimensional entropy space, with dimensions being normalized instances of the quantities mentioned above, or it can be projected to obtain a ternary plot, a conceptual diagram for classifiers resembling a triangle whose vertices characterize optimal, inaccurate, or underperforming classifiers.

Motivated by the need to explain the accuracy-improving behavior of majority classifiers, we also introduced the extended de Finetti entropy diagram, where input and output marginal entropy fractions are visualized side by side with the joint entropy fractions. This allows us to detect those classifiers resorting to specialization to increase their accuracy without increasing the mutual information. It also shows how this behavior can be limited by maintaining adequately uniform input marginals.

We have used these tools to visualize confusion matrices for both human and machine performance in several tasks of different complexities. The balance equation and de Finetti diagrams highlight the following facts:

- The expected mutual information transmitted from input to output is limited by the need to use as much entropy to bind both variables together in stochastic dependency: MI_{P_XY} ≤ H_{P_X·P_Y}/2.

- Even when the mutual information between input and output is low, if the marginals have in-between uncertainty, 0 < ∆H_{P_X·P_Y} < log n + log p, and P_X ≠ P_Y, a classifier may become specific (e.g. specialize in overrepresented classes) to decrease the variation of information, effectively increasing its accuracy.

- The variation of information is actually the information not being transmitted by the classifier, that is, the uncoupled information between input and output. This is a good target for improving accuracy without decreasing the genericity of the resulting classifier, e.g., its non-specificity.

All in all, the three leading assertions contextualize and nuance the assertion in Sindhwani et al. (2004), viz. that the higher the mutual information, the more generic and accurate (less specialized and inaccurate) a classifier's performance will be.

The generality and applicability of the techniques have been improved by using information-theoretic measures that pertain not only to the study of confusion matrices but, in general, to bivariate distributions such as communication channel models.

However, the influence of the probability estimation method is as yet unexplored. Unlike Meila (2007), we have not had to suppose equality of the sets of events in the input and output spaces, or symmetric confusion matrices.

Comparing the de Finetti entropy diagram and the ROC is, at best, risky for the time being. On the one hand, the ROC is a well-established technique that affords a number of intuitions in which practitioners are well versed, including a rather direct relation to accuracy. Also, the VUS shows promise of actually becoming a figure of merit for multi-class classifiers. For a more widespread use, the entropy triangle should offer such succinct, intuitive affordances too. In the case of accuracy, we intend to use Fano's inequality to bridge our understanding of proportion- and entropy-based measures.

On the other hand, the ROC only takes into consideration those judgments of the classifier within the joint entropy area of the information diagram, and is thus unable to judge how close to genericity the classifier is, unlike the ∆H'_{P_X·P_Y} coordinate of the entropy triangle. Likewise, the ROC has so far been unable to yield the result that as much information as is actually transmitted from input to output must go into creating the stochastic dependency between them.

To conclude, however suggestive aggregate measures like entropy or mutual information may be for capturing at a glance the behavior of classifiers, they offer little in the way of analyzing the actual classification errors populating their confusion matrices. We believe the analysis of mutual information as a random variable of a bivariate distribution (Fano, 1961, pp. 27-31) may offer more opportunities for improving classifiers, as opposed to assessing them.

Acknowledgements

This work has been partially supported by the Spanish Government, Comisión Interministerial de Ciencia y Tecnología, projects 2008-06382/TEC and 2008-02473/TEC, and by the regional projects S-505/TIC/0223 (DGUI-CM) and CCG08-UC3M/TIC-4457 (Comunidad Autónoma de Madrid - UC3M).

The authors would like to thank C. Bousoño-Calzón and A. Navia-Vázquez for comments on early versions of this paper, and A. I. García-Moral for providing the confusion matrices from the automatic speech recognizers.

References

Ben-David, A., 2007. A lot of randomness is hiding in accuracy. Engineering Applications of Artificial Intelligence 20 (7), 875-885.

Bradley, A. P., 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30 (7), 1145-1159.

Fano, R. M., 1961. Transmission of Information: A Statistical Theory of Communication. The MIT Press.

Fawcett, T., 2006. An introduction to ROC analysis. Pattern Recognition Letters 27 (8), 861-874.

Hand, D. J., 2009. Measuring classifier performance: a coherent alternative to the area under the ROC curve. Machine Learning 77 (1), 103-123.

Hand, D. J., Till, R. J., 2001. A simple generalisation of the Area Under the ROC Curve for multiple class classification problems. Machine Learning 45, 171-186.

Kononenko, I., Bratko, I., 1991. Information-based evaluation criterion for classifier's performance. Machine Learning 6, 67-80.

MacKay, D. J., 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.

Meila, M., 2007. Comparing clusterings—an information based distance. Journal of Multivariate Analysis 28, 875-893.

Miller, G. A., Nicely, P. E., 1955. An analysis of perceptual confusions among some English consonants. The Journal of the Acoustical Society of America 27 (2), 338-352.

Moreno, A., 1997. SpeechDat Spanish Database for Fixed Telephone Network. Tech. rep., Technical University of Catalonia, Barcelona, Spain.

Sindhwani, V., Rakshit, S., Deodhare, D., Erdogmus, D., Principe, J., Niyogi, P., 2004. Feature selection in MLPs and SVMs based on maximum output information. IEEE Transactions on Neural Networks 15 (4), 937-948.

Sokolova, M., Lapalme, G., 2009. A systematic analysis of performance measures for classification tasks. Information Processing & Management 45 (4), 427-437.

Yeung, R., 1991. A new outlook on Shannon's information measures. IEEE Transactions on Information Theory 37 (3), 466-474.