Article

Assessing Information Transmission in Data Transformations with the Channel Multivariate Entropy Triangle

Francisco J. Valverde-Albacete † and Carmen Peláez-Moreno *,†
Department of Signal Theory and Communications, Universidad Carlos III de Madrid, Leganés 28911, Spain; [email protected]
* Correspondence: [email protected]; Tel.: +34-91-624-8771
† These authors contributed equally to this work.

 Received: 3 May 2018; Accepted: 20 June 2018; Published: 27 June 2018 

Abstract: Data transformation, e.g., feature transformation and selection, is an integral part of any machine learning procedure. In this paper, we introduce an information-theoretic model and tools to assess the quality of data transformations in machine learning tasks. In an unsupervised fashion, we analyze the transformation of a discrete, multivariate source of information X into a discrete, multivariate sink of information Y related by a distribution $P_{XY}$. The first contribution is a decomposition of the maximal potential entropy of (X, Y), which we call a balance equation, into its (a) non-transferable, (b) transferable, but not transferred, and (c) transferred parts. Such balance equations can be represented in (de Finetti) entropy diagrams, our second set of contributions. The most important of these, the aggregate channel multivariate entropy triangle, is a visual exploratory tool to assess the effectiveness of multivariate data transformations in transferring information from input to output variables. We also show how these decomposition and balance equations also apply to the entropies of X and Y, respectively, and generate entropy triangles for them. As an example, we present the application of these tools to the assessment of information transfer efficiency for Principal Component Analysis and Independent Component Analysis as unsupervised feature transformation and selection procedures in supervised classification tasks.

Keywords: entropy; entropy visualization; entropy balance equation; Shannon-type relations; multivariate analysis; machine learning evaluation; data transformation

1. Introduction

Information-related considerations are often cursorily invoked in many machine learning applications, sometimes to suggest why a system or procedure is seemingly better than another at a particular task. In this paper, we set out to ground on measurable evidence phrases such as “this transformation retains more information from the data” or “this learning method uses the information from the data better than this other”. This has become particularly relevant with the increase in complexity of machine learning methods, such as deep neural architectures [1], which prevents straightforward interpretations. Nowadays, these learning schemes almost always become black-boxes, where researchers try to optimize a prescribed performance metric without looking inside. However, there is a need to assess what the deep layers are actually accomplishing. Although some answers have started to appear [2,3], the issue is by no means settled. In this paper, we put forward that framing the previous problem in a generic information-theoretical model can shed light on it by exploiting the versatility of information-theoretic concepts and measures.


For instance, a classical end-to-end example of an information-based model evaluation can be observed in Figure 1a. In this supervised scheme introduced in [4], the evaluation of the performance of the classifier involves only the comparison of the true labels K vs. the predicted labels K̂. This means that all the complexity enclosed in the classifier box cannot be accessed, measured or interpreted.

(a) Conceptual representation of a supervised classification task as a communication channel (modified from [5]). (b) The end-to-end view for evaluation: a “classifier chain” is trained to predict labels K̂ from the true emitted labels K. (c) Focusing on the transformation block implementing Y = f(X); X becomes the data source and Y the sink.

Figure 1. Different views of a supervised classification task as an information channel: (a) as individualized blocks; (b) for end-to-end evaluation; and (c) focused on the transformation.

In this paper, we want to expand the previous model into the scheme of Figure 1a, which provides a more detailed picture of the contents of the black-box, where:

• A random source of classification labels K is subjected to a measurement process that returns random observations X. The n instances of pairs $(k_i, x_i)$, $1 \le i \le n$, are often called the (task) dataset.
• Then, a generic data transformation block may transform the available data, e.g., the observations in the dataset X, into other data with “better” characteristics, the transformed feature vectors Y. These characteristics may be representational power, independence among individual dimensions, reduction of complexity offered to a classifier, etc. The process is normally called feature transformation and selection.
• Finally, the Y are the inputs to an actual classifier of choice that obtains the predicted labels K̂.

Conventionally, the process of generalizing the performance of the classifier for eventual new data requires a series of good practices in the use of the available data to train and then evaluate it [2,3]. In this supervised scheme, the evaluation of the performance of the classifier involves the comparison of the true labels K vs. the predicted labels K̂, as the abstracted diagram in Figure 1b shows. This would allow us to better understand the flow of information in the classification process with a view toward assessing and improving it.

Note the similarity between the classical setting of Figure 1b and the transformation block of Figure 1a, reproduced in Figure 1c for convenience. Despite this, the former represents a Single-Input Single-Output (SISO) block with $(K, \hat{K}) \sim P_{K\hat{K}}$, whereas the latter represents a multivariate Multiple-Input Multiple-Output (MIMO) block described by the joint distribution of random vectors $(\mathbf{X}, \mathbf{Y}) \sim P_{XY}$.

This MIMO kind of block may represent an unsupervised transformation method, for instance, Principal Component Analysis (PCA) or Independent Component Analysis (ICA), in which case the “effectiveness” of the transformation is supplied by a heuristic principle, e.g., least reconstruction error on some test data, maximum mutual information, etc. However, it may also represent a supervised transformation method, for instance, when X are the feature instances and Y are the (multi-)labels or classes in a classification task, or the activation signals of a convolutional neural network trained using an implicit target signal, in which case the “effectiveness” should measure the conformance to the supervisory signal.

In [4], we argued for carrying out the evaluation of classification tasks that can be modeled by Figure 1b with the new framework of entropy balance equations and their related entropy triangles [4–6]. This has provided a means of quantifying and visualizing the end-to-end information transfer for SISO architectures. The gist of this framework is explained in Section 2.1: if a classifier working on a certain dataset obtained a confusion matrix $P_{K\hat{K}}$, then we can information-theoretically assess the classifier by analyzing the entropies and information in the related distribution $P_{K\hat{K}}$ with the help of a balance equation [6].

However, looking inside the black-box poses a challenge, since X and Y are random vectors and most information-theoretic quantities are not readily available in their multivariate version. If we want to extend the same framework of evaluation to random vectors in general, we need the multivariate generalizations of the information-theoretic measures involved in the balance equations, an issue that is not free of contention. With this purpose in mind, we review the best-known multivariate generalizations of mutual information in Section 2.2.

We present our contributions in Section 3. As a first result, we develop a balance equation for the joint distribution $P_{XY}$ and a related representation in Sections 3.1 and 3.2, respectively.
However, we are also able to obtain split equations for the input and output multivariate sources, only tied by one multivariate extension of mutual information, much as in the SISO case. As an instance of use, in Section 3.3, we analyze the transfer of information in PCA and ICA transformations applied to some well-known UCI datasets. We conclude with a discussion of the tools in light of this application in Section 3.4.

2. Methods

In Section 3, we will build a solution to our problem by finding the least common multiple, so to speak, of our previous solutions for the SISO block, described in Section 2.1, and for the multivariate source case, described in Section 2.2.

2.1. The Channel Bivariate Entropy Balance Equation and Triangle

A solution to conceptualizing and visualizing the transmission of information through a channel where input and output are reduced to a single variable, that is, with $|\mathbf{X}| = 1$ and $|\mathbf{Y}| = 1$, was presented in [6] and later extended in [4]. For this case, we simply use X and Y to describe the random variables. Notice that in the Introduction, and later in the example application, these are called K and K̂, but here we want to present this case as a simpler version of the one we set out to solve in this paper. Figure 2a, then, depicts a classical information diagram (i-diagram) [7,8] of an entropy decomposition around $P_{XY}$ in which we have included the exterior boundaries arising from the entropy balance equation, as we will show later. Three crucial regions can be observed:

• The (normalized) redundancy ([9], Section 2.4), or divergence with respect to uniformity (yellow area), $\Delta H_{P_X \cdot P_Y}$, between the joint distribution where $P_X$ and $P_Y$ are independent and the uniform distributions with the same cardinality of events as $P_X$ and $P_Y$,

$$\Delta H_{P_X \cdot P_Y} = H_{U_X \cdot U_Y} - H_{P_X \cdot P_Y}. \tag{1}$$

• The mutual information, $MI_{P_{XY}}$ [10] (each of the green areas), quantifies the force of the stochastic binding between $P_X$ and $P_Y$, “towards the outside” in Figure 2a,

$$MI_{P_{XY}} = H_{P_X \cdot P_Y} - H_{P_{XY}} \tag{2}$$

but also “towards the inside”,

$$MI_{P_{XY}} = H_{P_X} - H_{P_{X|Y}} = H_{P_Y} - H_{P_{Y|X}}. \tag{3}$$


• The variation of information (the sum of the red areas), $VI_{P_{XY}}$ [11], embodies the residual entropy not used in binding the variables,

$$VI_{P_{XY}} = H_{P_{X|Y}} + H_{P_{Y|X}}. \tag{4}$$


(a) Extended entropy diagram. (b) Schematic split entropy diagram.

Figure 2. Extended entropy diagram related to a bivariate distribution, from [4].

Then, we may write the following entropy balance equation between the entropies of X and Y:

$$H_{U_X \cdot U_Y} = \Delta H_{P_X \cdot P_Y} + 2 \cdot MI_{P_{XY}} + VI_{P_{XY}} \tag{5}$$
$$0 \le \Delta H_{P_X \cdot P_Y}, MI_{P_{XY}}, VI_{P_{XY}} \le H_{U_X \cdot U_Y}$$

where the bounds are easily obtained from distributional considerations [6]. If we normalize (5) by the overall entropy $H_{U_X \cdot U_Y}$, we obtain:

$$1 = \Delta H'_{P_X \cdot P_Y} + 2 \cdot MI'_{P_{XY}} + VI'_{P_{XY}} \qquad 0 \le \Delta H'_{P_X \cdot P_Y}, MI'_{P_{XY}}, VI'_{P_{XY}} \le 1 \tag{6}$$

Equation (6) is the 2-simplex in normalized $\Delta H'_{P_X \cdot P_Y} \times 2 MI'_{P_{XY}} \times VI'_{P_{XY}}$ space. Each joint distribution $P_{XY}$ can be characterized by its joint entropy fractions, $F(P_{XY}) = [\Delta H'_{P_X \cdot P_Y}, 2 MI'_{P_{XY}}, VI'_{P_{XY}}]$, whose projection onto the plane with director vector $(1, 1, 1)$ is its de Finetti or compositional diagram [12]. This diagram of the 2-simplex is an equilateral triangle, the coordinates of which are $F(P_{XY})$, so every bivariate distribution is shown as a point in the triangle, and each zone in the triangle is indicative of the characteristics of the distributions whose coordinates fall in it. This is what we call the Channel Bivariate Entropy Triangle (CBET), whose schematic is shown in Figure 3.

We can actually decompose (5) and the quantities in it into two split balance equations,

$$H_{U_X} = \Delta H_{P_X} + MI_{P_{XY}} + H_{P_{X|Y}} \qquad H_{U_Y} = \Delta H_{P_Y} + MI_{P_{XY}} + H_{P_{Y|X}} \tag{7}$$
96 Bivariate Entropy Triangle, CBET, one of whose instances is shown below in Fig.3. with the obvious limits. These can be each normalized by HUX , respectively HUY , leading to the 2-simplex equations:

$$1 = \Delta H'_{P_X} + MI'_{P_{XY}} + H'_{P_{X|Y}} \qquad 1 = \Delta H'_{P_Y} + MI'_{P_{XY}} + H'_{P_{Y|X}}. \tag{8}$$

Since these are also equations on a 2-simplex, we can actually represent the coordinates $F_X(P_{XY}) = [\Delta H'_{P_X}, MI'_{P_{XY}}, H'_{P_{X|Y}}]$ and $F_Y(P_{XY}) = [\Delta H'_{P_Y}, MI'_{P_{XY}}, H'_{P_{Y|X}}]$ in the same triangle, side by side with the original $F(P_{XY})$, whereby the representation seems to split in two.
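As a computational illustration (not part of the original presentation), the coordinates $F(P_{XY})$ of (6) and the split coordinates of (8) can be obtained directly from a joint count table, such as the confusion matrix used in Section 2.1.1. The following minimal R sketch uses base-2 logarithms; the function name cbet_coords and its layout are hypothetical and do not reproduce the API of the authors’ entropies package.

```r
# Illustrative sketch: CBET coordinates from a joint count table N (rows: X, columns: Y).
# cbet_coords() is a hypothetical helper written from scratch for this example.
cbet_coords <- function(N) {
  P   <- N / sum(N)                          # estimated joint distribution P_XY
  Px  <- rowSums(P); Py <- colSums(P)        # marginals P_X and P_Y
  H   <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }
  Hx  <- H(Px); Hy <- H(Py); Hxy <- H(P)
  Hux <- log2(nrow(N)); Huy <- log2(ncol(N)) # uniform entropies H_UX and H_UY
  MI  <- Hx + Hy - Hxy                       # Equation (2)
  HxgY <- Hxy - Hy; HygX <- Hxy - Hx         # conditional entropies H_{P_X|Y}, H_{P_Y|X}
  U   <- Hux + Huy                           # overall entropy H_{U_X . U_Y}
  list(F  = c(DH = (U - Hx - Hy) / U, MI2 = 2 * MI / U, VI = (HxgY + HygX) / U), # Eq. (6)
       FX = c(DH = (Hux - Hx) / Hux, MI = MI / Hux, HXgY = HxgY / Hux),          # Eq. (8)
       FY = c(DH = (Huy - Hy) / Huy, MI = MI / Huy, HYgX = HygX / Huy))          # Eq. (8)
}

# Example: a mostly diagonal 3-class confusion matrix; each coordinate triple sums to 1.
N <- matrix(c(45, 5, 0, 4, 40, 6, 1, 5, 44), nrow = 3, byrow = TRUE)
cbet_coords(N)
```

Each classifier (or data source) thus becomes one point, or a pair of split points, in the de Finetti diagram of Figure 3.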

[Figure 3 here: the CBET 2-simplex with bottom axis $\Delta H'_{P_{K\hat{K}}}$ and side axes $2 \cdot MI'_{P_{K\hat{K}}}$ and $VI'_{P_{K\hat{K}}}$; callouts include “perfect classifier”, “worst classifier” and “useless classifier”, with sides annotated as balanced ($U_K = P_K$, $U_{\hat{K}} = P_{\hat{K}} \Leftrightarrow (0, \cdot, \cdot)$), diagonal/specialized ($(\cdot, \cdot, 0)$) and random ($P_K \perp P_{\hat{K}} \Leftrightarrow (\cdot, 0, \cdot)$).]

Figure 3. Schematic CBET as applied to supervised classifier assessment. An actual triangle shows dots for each classifier (or its split coordinates, see Figure 6 for an example) and none of the callouts for specific types of classifiers (from [4]). The callouts situated in the center of the sides of the triangle apply to the whole side.

2.1.1. Application: The Evaluation of Multiclass Classification

The CBET can be used to visualize the performance of supervised classifiers in a straightforward manner, as announced in the Introduction: consider the confusion matrix $N_{K\hat{K}}$ of a classifier chain on a supervised classification task, given the random variable of true class labels $K \sim P_K$ and that of predicted labels $\hat{K} \sim P_{\hat{K}}$ as depicted in Figure 1a, which now play the role of $P_X$ and $P_Y$. From this confusion matrix, we can estimate the joint distribution $P_{K\hat{K}}$ between the random variables, so that the entropy triangle for $P_{K\hat{K}}$ produces valuable information about the actual classifier used to solve the task [6,13] and even the theoretical limits of the task; for instance, whether it can be solved in a trustworthy manner by classification technology and with what effectiveness.

The CBET acts, in this case, as an exploratory data analysis tool for visual assessment, as shown in Figure 3. The success of this approach in the bivariate, supervised classification case is a strong hint that the multivariate extension will likewise be useful for other machine learning tasks. See [4] for a thorough explanation of this procedure.

2.2. Quantities around the Multivariate Mutual Information

The main hurdle for a multivariate extension of the balance Equation (5) and the CBET is the multivariate generalization of binary mutual information, since it quantifies the information transport from input to output in the bivariate case and is also crucial for the decoupling of (5) into the split balance Equation (7). For this reason, we next review the different “flavors” of information measures describing sets of more than two variables, looking for these two properties. We start from very basic definitions, both in the interest of self-containment and to provide a script of the process of developing future analogues for other information measures.

To fix notation, let $\mathbf{X} = \{X_i \mid 1 \le i \le m\}$ be a set of discrete random variables with joint multivariate distribution $P_X = P_{X_1 \ldots X_m}$ and the corresponding marginals $P_{X_i}(x_i) = \sum_{j \ne i} P_X(x)$, where $x = x_1 \ldots x_m$ is a tuple of $m$ elements; likewise for $\mathbf{Y} = \{Y_j \mid 1 \le j \le l\}$, with $P_Y = P_{Y_1 \ldots Y_l}$ and the marginals $P_{Y_j}$. Furthermore, let $P_{XY}$ be the joint distribution of the $(m + l)$-length tuples XY. Note that two different situations can be clearly distinguished:

Situation 1: All the random variables form part of the same set $\mathbf{X}$, and we are looking at information transfer within this set, or

Situation 2: They are partitioned into two different sets $\mathbf{X}$ and $\mathbf{Y}$, and we are looking at information transfer between these sets.

An up-to-date review of multivariate information measures in both situations is [14], which follows the interesting methodological point from [15] of calling information those measures that involve amounts of entropy shared by multiple variables and entropies those that do not, although this poses a conundrum for the entropy written as the self-information $H_{P_X} = MI_{P_{XX}}$. Since i-diagrams are a powerful tool to visualize the interaction of distributions in the bivariate case, we will also try to use them for sets of random variables.
For multivariate generalizations of mutual information as seen in the i-diagrams, the following caveats apply:

• Their multivariate generalization is only warranted when signed measures of probability are considered, since it is well known that some of these “areas” can be negative, contrary to geometric intuitions in this respect.
• We should retain the bounding rectangles that appear when considering the most entropic distributions with similar support to the ones being graphed [6]. This is the sense of the bounding rectangles in Figure 4a,b.


(a) Extended entropy diagram of a trivariate distribution (from [5]). (b) Partitioned distribution entropy diagram.

Figure 4. (Color online) Extended entropy diagram of multivariate distributions for (a) a trivariate distribution (from [5]) as an instance of Situation 1; and (b) a joint distribution where a partitioning of the variables is made evident (Situation 2). The color scheme follows that of Figure 2, to be explained in the text.

With great insight, the authors of [15] point out that some of the multivariate information measures stem from focusing on a particular property of the bivariate mutual information and generalizing it to the multivariate setting. The properties in question, including the already stated (2) and (3), are:

$$MI_{P_{XY}} = H_{P_X} + H_{P_Y} - H_{P_{XY}}$$
$$MI_{P_{XY}} = H_{P_X} - H_{P_{X|Y}} = H_{P_Y} - H_{P_{Y|X}}$$
$$MI_{P_{XY}} = \sum_{x,y} P_{XY}(x, y) \log \frac{P_{XY}(x, y)}{P_X(x) P_Y(y)} \tag{9}$$

Regarding the first situation of a vector of random variables $X \sim P_X$, let $\Pi_X = \prod_{i=1}^{n} P_{X_i}$ be the (jointly) independent distribution with similar marginals to $P_X$. To picture this (virtual) distribution, consider Figure 4a depicting an i-diagram for $X = [X_1, X_2, X_3]$. Then, $\Pi_X = P_{X_1} \cdot P_{X_2} \cdot P_{X_3}$ is the inner rectangle containing both green areas. The different extensions of mutual information that concentrate on different properties are:

• the total correlation [16], integration [17] or multi-information [18], which is a generalization of (2), represented by the green area outside $H_{P_X}$:

$$C_{P_X} = H_{\Pi_X} - H_{P_X} \tag{10}$$

• the dual total correlation [19,20] or interaction complexity [21], which is a generalization of (3), represented by the green area inside $H_{P_X}$:

$$D_{P_X} = H_{P_X} - VI_{P_X} \tag{11}$$

• the interaction information [22], multivariate mutual information [23] or co-information [24], which is the generalization of (9), the total amount of information to which all variables contribute:

$$MI_{P_X} = \sum_{x} P_X(x) \log \frac{P_X(x)}{\Pi_X(x)} \tag{12}$$

It is represented by the inner convex green area (within the dual total correlation), but note that it may in fact be negative for $n > 2$ [25].

• the local exogenous information [15] or the bound information [26], which is the addition of the total correlation and the dual total correlation:

$$M_{P_X} = C_{P_X} + D_{P_X}. \tag{13}$$

Some of these generalizations of the multivariate case were used in [5,26] to develop a similar technique to the CBET, but applied to analyzing the information content of data sources. For this purpose, it was necessary to define for every random variable a residual entropy $H_{P_{X_i | X_i^c}}$, where $X_i^c = \mathbf{X} \setminus \{X_i\}$, which is not explained by the information provided by the other variables. We call residual information [15] or (multivariate) variation of information [11,26] the generalization of the same quantity in the bivariate case, i.e., the sum of these quantities across the set of random variables:

$$VI_{P_X} = \sum_{i=1}^{n} H_{P_{X_i | X_i^c}}. \tag{14}$$

Then, the variation of information can easily be seen to consist of the sum of the red areas in Figure 4a and amounts to information particular to each variable. The main question regarding this issue is which, if any, of these generalizations of bivariate mutual information are adequate for an analogue of the entropy balance equations and triangles. Note that all of these generalizations consider X as a homogeneous set of variables, that is, Situation 1 described at the beginning of this section, and none consider the partitioning of the variables in X into two subsets (Situation 2), for instance to distinguish between input and output ones, so the answer cannot be straightforward. This issue is clarified in Section 3.1.
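To make these Situation 1 measures concrete, the following R sketch estimates the total correlation (10), the dual total correlation (11), the bound information (13) and the multivariate variation of information (14) from samples of three binary variables in which one is the exclusive OR of the other two; all names are illustrative and the code is written from scratch rather than taken from any package.

```r
# Illustrative sketch (Situation 1): multivariate measures of a set of discrete variables,
# estimated empirically from samples; entropies in bits.
H <- function(counts) { p <- counts / sum(counts); p <- p[p > 0]; -sum(p * log2(p)) }

set.seed(42)
X1 <- sample(0:1, 1000, replace = TRUE)
X2 <- sample(0:1, 1000, replace = TRUE)
X3 <- as.integer(xor(X1, X2))              # X3 is fully determined by X1 and X2
X  <- data.frame(X1, X2, X3)

H_joint <- H(table(X))                                    # H_{P_X}
H_marg  <- sapply(X, function(v) H(table(v)))             # marginal entropies H_{P_{X_i}}
H_resid <- sapply(names(X), function(n)                   # residual entropies H_{P_{X_i|X_i^c}}
  H(table(X)) - H(table(X[setdiff(names(X), n)])))

C  <- sum(H_marg) - H_joint     # total correlation, Eq. (10)
VI <- sum(H_resid)              # multivariate variation of information, Eq. (14)
D  <- H_joint - VI              # dual total correlation, Eq. (11)
M  <- C + D                     # bound information, Eq. (13)
c(C = C, D = D, M = M, VI = VI) # here VI is (near) zero: no variable has residual entropy
```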

3. Results

Our goal is now to find a decomposition of the entropies around a joint distribution $P_{XY}$ between random vectors X and Y in ways analogous to those of (5), but considering multivariate input and output. Note that it provides no advantage to try to do this on continuous distributions, as the entropic measures used are basic. Rather, what we actually capitalize on is the outstanding existence of a balance equation between these apparently simple entropic concepts, and what their intuitive meanings afford to the problem of measuring the transfer of information in data processing tasks. As we set out to demonstrate in this section, our main results are in complete analogy to those of the binary case, but with the flavour of the multivariate case.

3.1. The Aggregate and Split Channel Multivariate Balance Equation

Consider the modified information diagram of Figure 4b highlighting entropies for some distributions around $P_{XY}$. When we distinguish two random vectors in the set of variables, X and Y, a proper multivariate generalization of the variation of information in (4) is

$$VI_{P_{XY}} = H_{P_{X|Y}} + H_{P_{Y|X}} \tag{15}$$

and we will also call it the variation of information. It represents the addition of the information in X not shared with Y and vice versa, as captured by the red area in Figure 4b. Note that this is a non-negative quantity, since it is the addition of two entropies. Next, consider

• $U_{XY}$, the uniform distribution over the supports of X and Y, and
• $P_X \times P_Y$, the distribution created with the marginals of $P_{XY}$ considered independent.

Then, we may define a multivariate divergence with respect to uniformity, in analogy to (1), as

$$\Delta H_{P_X \times P_Y} = H_{U_{XY}} - H_{P_X \times P_Y}. \tag{16}$$

This is the yellow area in Figure 4b representing the divergence of the virtual distribution $P_X \times P_Y$ with respect to uniformity. The virtuality comes from the fact that this distribution does not properly exist in the context being studied. Rather, it only appears in the extreme situation that the marginals of

$P_{XY}$ are independent.

Furthermore, recall that both the total entropy of the uniform distribution and the divergence from uniformity factor into individual equalities: $H_{U_X U_Y} = H_{U_X} + H_{U_Y}$, since uniform joint distributions always have independent marginals, and $H_{P_X \times P_Y} = H_{P_X} + H_{P_Y}$. Therefore, (16) admits splitting as $\Delta H_{P_X \times P_Y} = \Delta H_{P_X} + \Delta H_{P_Y}$, where

$$\Delta H_{P_X} = H_{U_X} - H_{P_X} \qquad \Delta H_{P_Y} = H_{U_Y} - H_{P_Y}. \tag{17}$$

Now, both $U_X$ and $U_Y$ are the most entropic distributions definable on the supports of X and Y, whence both $\Delta H_{P_X}$ and $\Delta H_{P_Y}$ are non-negative, as is their addition. These generalizations are straightforward and intuitive, meaning that we expect them to agree with the intuitions developed in the CBET, which is an important usability concern.

The problem is finding a quantity that fulfills the same role as the (bivariate) mutual information. The first property that we would like to have is for this quantity to be a “transmitted information” after conditioning away any of the entropy of either partition, so we propose the following as a definition:

$$I_{P_{XY}} = H_{P_{XY}} - VI_{P_{XY}} \tag{18}$$

represented by the inner green area in the i-diagram of Figure 4b. This can easily be “refocused” on each of the subsets of the partition:

Lemma 1. Let PXY be a discrete joint distribution. Then

$$H_{P_X} - H_{P_{X|Y}} = H_{P_Y} - H_{P_{Y|X}} = I_{P_{XY}} \tag{19}$$

Proof. Recalling that the conditional entropies are easily related to the joint entropy by the chain rule $H_{P_{XY}} = H_{P_X} + H_{P_{Y|X}} = H_{P_Y} + H_{P_{X|Y}}$, simply subtract $VI_{P_{XY}}$.

This property introduces the notion that this information is within each of X and Y independently but mutually induced. It is easy to see that this quantity appears once again in the i-diagram:

Lemma 2. Let PXY be a discrete joint distribution. Then

$$I_{P_{XY}} = H_{P_X \times P_Y} - H_{P_{XY}}. \tag{20}$$

Proof. Consider the entropy decomposition of $P_X \times P_Y$:

$$H_{P_X \times P_Y} - H_{P_{XY}} = H_{P_X} + H_{P_Y} - (H_{P_Y} + H_{P_{X|Y}}) = H_{P_X} - H_{P_{X|Y}} = I_{P_{XY}}$$

In other words, this is the quantity of information required to bind $P_X$ and $P_Y$; equivalently, it is the amount of information lost from $P_X \times P_Y$ to achieve the binding in $P_{XY}$. Pictorially, this is the outermost green area in Figure 4b, and it must be non-negative, since $P_X \times P_Y$ is more entropic than $P_{XY}$. Notice that (18) and (20) are the analogues of (11) and (10), respectively, but with the flavor of (3) and (2). Therefore, this quantity must be the multivariate mutual information of $P_{XY}$ as per the Kullback–Leibler divergence definition:

Lemma 3. Let PXY be a discrete joint distribution. Then

$$I_{P_{XY}} = \sum_{i,j} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_X(x_i) P_Y(y_j)} \tag{21}$$

Proof. This is an easy manipulation.

$$\sum_{i,j} P_{XY}(x_i, y_j) \log \frac{P_{XY}(x_i, y_j)}{P_X(x_i) P_Y(y_j)} = \sum_{i,j} P_{XY}(x_i, y_j) \log \frac{P_{X|Y=y_j}(x_i|y_j)}{P_X(x_i)}$$
$$= \sum_{i} P_X(x_i) \log \frac{1}{P_X(x_i)} - \sum_{j} P_Y(y_j) \sum_{i} P_{X|Y=y_j}(x_i|y_j) \log \frac{1}{P_{X|Y=y_j}(x_i|y_j)} = H_{P_X} - H_{P_{X|Y}} = I_{P_{XY}},$$

after a step of marginalization and considering (3).

With these relations we can state our first theorem:

Theorem 1. Let PXY be a discrete joint distribution. Then the following decomposition holds:

$$H_{U_X \times U_Y} = \Delta H_{P_X \times P_Y} + 2 \cdot I_{P_{XY}} + VI_{P_{XY}} \tag{22}$$
$$0 \le \Delta H_{P_X \times P_Y}, I_{P_{XY}}, VI_{P_{XY}} \le H_{U_X \times U_Y}$$

Proof. From (16), we have $H_{U_X \times U_Y} = \Delta H_{P_X \times P_Y} + H_{P_X \times P_Y}$, whence by introducing (18) and (20) we obtain:

$$H_{U_X \times U_Y} = \Delta H_{P_X \times P_Y} + I_{P_{XY}} + H_{P_{XY}} = \Delta H_{P_X \times P_Y} + I_{P_{XY}} + I_{P_{XY}} + VI_{P_{XY}}. \tag{23}$$

Recall that each quantity is non-negative by (15), (16) and (21), so the only things left to be proven are the limits for each quantity in the decomposition. For that purpose, consider the following clarifying conditions:

1. X marginal uniformity, when $H_{P_X} = H_{U_X}$; Y marginal uniformity, when $H_{P_Y} = H_{U_Y}$; and marginal uniformity, when both conditions co-occur.
2. Marginal independence, when $P_{XY} = P_X \times P_Y$.

3. Y determines X, when $H_{P_{X|Y}} = 0$; X determines Y, when $H_{P_{Y|X}} = 0$; and mutual determination, when both conditions hold.

Notice that these conditions are independent of each other and that each fixes the value of one of the quantities in the balance:

• For instance, in case $H_{P_X} = H_{U_X}$, then $\Delta H_{P_X} = 0$ after (17). Similarly, if $H_{P_Y} = H_{U_Y}$, then $\Delta H_{P_Y} = 0$. Hence, when marginal uniformity holds, we have $\Delta H_{P_X \times P_Y} = 0$.
• Similarly, when marginal independence holds, we see that $I_{P_{XY}} = 0$ from (20). Otherwise stated, $H_{P_{X|Y}} = H_{P_X}$ and $H_{P_{Y|X}} = H_{P_Y}$.
• Finally, if mutual determination holds, that is to say, the variables in either set are deterministic functions of those of the other set, then by the definition of the multivariate variation of information we have $VI_{P_{XY}} = 0$.

Therefore, these three conditions fix the lower bounds for their respectively related quantities. Likewise, the upper bounds hold when two of the conditions hold at the same time. This is easily seen by invoking the previously found balance equation (23):

• For instance, if marginal uniformity holds, then $\Delta H_{P_X \times P_Y} = 0$. If marginal independence also holds, then $I_{P_{XY}} = 0$, whence by (23) $VI_{P_{XY}} = H_{U_X \times U_Y}$.
• If both marginal uniformity and mutual determination hold, then we have $\Delta H_{P_X \times P_Y} = 0$ and $VI_{P_{XY}} = 0$, so that $I_{P_{XY}} = H_{U_X \times U_Y}$.
• Finally, if both mutual determination and marginal independence hold, then a fortiori $\Delta H_{P_X \times P_Y} = H_{U_X \times U_Y}$.

This concludes the proof.
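Although it is not part of the proof, the decomposition (22) and (23) is easy to verify numerically: for any discrete joint table over the compound symbols of X and Y (below, an arbitrary 4 × 3 support), the three quantities add up to $H_{U_X \times U_Y}$. A minimal R check with illustrative names:

```r
# Numeric check of the balance (22): H_UXxUY = DeltaH_PXxPY + 2*I_PXY + VI_PXY.
set.seed(1)
H  <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }
P  <- matrix(runif(4 * 3), nrow = 4); P <- P / sum(P)  # a random joint P_XY over a 4 x 3 support
Px <- rowSums(P); Py <- colSums(P)
Hu <- log2(4) + log2(3)            # H_{U_X x U_Y}: uniform over the product support
I  <- H(Px) + H(Py) - H(P)         # Equations (20) and (21)
VI <- H(P) - I                     # Equation (18) rearranged
DH <- Hu - (H(Px) + H(Py))         # Equation (16)
all.equal(Hu, DH + 2 * I + VI)     # TRUE
```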

Notice how the bounds also allow an interpretation similar to that of (5). In particular, the interpretation of the conditions for actual joint distributions will be taken up again in Section 3.2. The next question is whether the balance equation also admits splitting.

Theorem 2. Let PXY be a discrete joint distribution. Then the Channel Multivariate Entropy Balance equation can be split as:

$$H_{U_X} = \Delta H_{P_X} + I_{P_{XY}} + H_{P_{X|Y}} \qquad 0 \le \Delta H_{P_X}, I_{P_{XY}}, H_{P_{X|Y}} \le H_{U_X} \tag{24}$$
$$H_{U_Y} = \Delta H_{P_Y} + I_{P_{XY}} + H_{P_{Y|X}} \qquad 0 \le \Delta H_{P_Y}, I_{P_{XY}}, H_{P_{Y|X}} \le H_{U_Y} \tag{25}$$

Proof. We prove (24); the proof of (25) is similar mutatis mutandis.

In a similar way as for (22), we have that $H_{U_X} = \Delta H_{P_X} + H_{P_X}$. By introducing the value of $H_{P_X}$ from (19), we obtain the decomposition of $H_{U_X}$ of (24).

These quantities are non-negative, as mentioned. Next, consider the X marginal uniformity condition applied to the input vector, introduced in the proof of Theorem 1. Clearly, $\Delta H_{P_X} = 0$. Marginal independence, again, is the condition so that $I_{P_{XY}} = 0$. Finally, if Y determines X, then $H_{P_{X|Y}} = 0$. These conditions individually provide the lower bounds on each quantity.

On the other hand, when we put together any two of these conditions, we obtain the upper bound for the unspecified quantity: so, if $\Delta H_{P_X} = 0$ and $I_{P_{XY}} = 0$, then $H_{P_{X|Y}} = H_{P_X} = H_{U_X}$. Also, if $I_{P_{XY}} = 0$ and $H_{P_{X|Y}} = 0$, then $H_{P_X} = H_{P_{X|Y}} = 0$ and $\Delta H_{P_X} = H_{U_X} - 0$. Finally, if $H_{P_{X|Y}} = 0$ and $\Delta H_{P_X} = 0$, then $I_{P_{XY}} = H_{P_X} - H_{P_{X|Y}} = H_{U_X} - 0$.

3.2. Visualizations: From i-Diagrams to Entropy Triangles

3.2.1. The Channel Multivariate Entropy Triangle

Our next goal is to develop an exploratory analysis tool similar to the CBET introduced in Section 2.1. As in that case, we need the equation of a simplex to represent the information balance of a multivariate transformation. For that purpose, as in (6), we may normalize by the overall entropy $H_{U_X \times U_Y}$ to obtain the equation of the 2-simplex in multivariate entropic space,

$$1 = \Delta H'_{P_X \times P_Y} + 2 \cdot I'_{P_{XY}} + VI'_{P_{XY}} \qquad 0 \le \Delta H'_{P_X \times P_Y}, I'_{P_{XY}}, VI'_{P_{XY}} \le 1. \tag{26}$$

The de Finetti diagram of this equation then provides the aggregated Channel Multivariate Entropy Triangle (CMET). A formal graphical assessment of a multivariate joint distribution with the CMET is fairly simple using the schematic in Figure 5a and the conditions of Theorem 1:

[Panel (a): the aggregate CMET with bottom axis $\Delta H'_{P_X \times P_Y}$ and side axes $2 \cdot I'_{P_{XY}}$ and $VI'_{P_{XY}}$; callouts: “faithful” at the apex, “mutual determination $(\cdot, \cdot, 0)$” on the right side, “marginal uniformity, $U_X = P_X$, $U_Y = P_Y \Leftrightarrow (0, \cdot, \cdot)$” on the left side, “marginal independence, $P_{XY} = P_X \times P_Y \Leftrightarrow (\cdot, 0, \cdot)$” on the bottom side, and “rigid” and “randomizing” at the right and left bottom vertices, respectively.]

(a) Schematic CMET with a formal interpretation.

[Panel (b): the split CMETs, with bottom axes $\Delta H'_{P_X}$, $\Delta H'_{P_Y}$ and side axes $I'_{P_{XY}}$ and $H'_{P_{X|Y}}$, $H'_{P_{Y|X}}$; callouts: “X, Y marginal uniformity, $U_X = P_X$, $U_Y = P_Y \Leftrightarrow (0, \cdot, \cdot)$” on the left side, “X determines Y, Y determines X $(\cdot, \cdot, 0)$” on the right side, and “marginal independence, $P_{XY} = P_X \times P_Y \Leftrightarrow (\cdot, 0, \cdot)$” on the bottom side.]

(b) Schematic split CMETs with formal interpretations. Note that there are two types of overimposed entropy triangles in this figure.

Figure 5. Schematic Channel Multivariate Entropy Triangles (CMET) showing interpretable zones and extreme cases using formal conditions. The annotations on the center of each side are meant to hold for that whole side; those for the vertices are meant to hold in their immediate neighborhood too.

• The lower side of the triangle, with $I'_{P_{XY}} = 0$, affected by marginal independence $P_{XY} = P_X \times P_Y$, is the locus of partitioned joint distributions that do not share information between the two blocks X and Y.

• The right side of the triangle, with $VI'_{P_{XY}} = 0$, described by mutual determination $H'_{P_{X|Y}} = 0 = H'_{P_{Y|X}}$, is the locus of partitioned joint distributions whose groups do not carry supplementary information to that provided by the other group.

• The left side, with $\Delta H'_{P_X \times P_Y} = 0$, describing distributions with uniform marginals $P_X = U_X$ and $P_Y = U_Y$, is the locus of partitioned joint distributions that offer as much potential information for transformations as possible.

Based on these characterizations we can attach interpretations to other regions of the CMET:

• If we want a transformation from X to Y to be faithful, then we want to maximize the information used for mutual determination, $I'_{P_{XY}} \to 1$, and, equivalently, to minimize at the same time the divergence from uniformity, $\Delta H'_{P_X \times P_Y} \to 0$, and the information that only pertains to each of the blocks in the partition, $VI'_{P_{XY}} \to 0$. So the coordinates of a faithful partitioned joint distribution will lie close to the apex of the triangle.

• However, if the coordinates of a distribution lie close to the left vertex, $VI'_{P_{XY}} \to 1$, then it shows marginal uniformity, $\Delta H'_{P_X \times P_Y} \to 0$, but shares little or no information between the blocks, $I'_{P_{XY}} \to 0$; hence, it must be a randomizing transformation.
• Distributions whose coordinates lie close to the right vertex, $\Delta H'_{P_X \times P_Y} \to 1$, are essentially deterministic and in that sense carry no information, $I'_{P_{XY}} \to 0$, $VI'_{P_{XY}} \to 0$. Indeed, in this instance there does not seem to exist a transformation, whence we call them rigid.

These qualities are annotated on the vertices of the schematic CMET of Figure 5a. Note that different applications may call for partitioned distributions with different qualities, and the one used above is pertinent when the partitioned joint distribution models a transformation of X into Y or vice versa.
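Computationally, the aggregate coordinates of (26) for a partitioned dataset can be obtained by treating each block as a compound discrete variable (every distinct tuple of values is one symbol). The following R sketch assumes all variables are discrete or have already been binned; the function name cmet_coords is hypothetical and is not the interface of the authors’ entropies package.

```r
# Illustrative sketch: aggregate CMET coordinates (Eq. (26)) of a partitioned data frame.
# Each block (xcols, ycols) is collapsed into a compound variable; entropies in bits.
cmet_coords <- function(df, xcols, ycols) {
  xid <- interaction(df[xcols], drop = TRUE)   # compound symbol for the X block
  yid <- interaction(df[ycols], drop = TRUE)   # compound symbol for the Y block
  P   <- table(xid, yid) / nrow(df)            # estimated joint distribution P_XY
  H   <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }
  Hxy <- H(P); Hx <- H(rowSums(P)); Hy <- H(colSums(P))
  # uniform reference entropies over the Cartesian-product supports of each block
  Hux <- sum(sapply(df[xcols], function(v) log2(length(unique(v)))))
  Huy <- sum(sapply(df[ycols], function(v) log2(length(unique(v)))))
  I   <- Hx + Hy - Hxy                         # Equations (20) and (21)
  VI  <- Hxy - I                               # Equation (15), via (18)
  DH  <- (Hux + Huy) - (Hx + Hy)               # Equation (16)
  U   <- Hux + Huy
  c(DH = DH / U, I2 = 2 * I / U, VI = VI / U)  # the three coordinates of Eq. (26)
}

# Tiny example: y1 is a noisy copy of x1, while y2 carries no information at all.
set.seed(7)
d <- data.frame(x1 = sample(0:1, 500, replace = TRUE), x2 = sample(0:1, 500, replace = TRUE))
d$y1 <- ifelse(runif(500) < 0.9, d$x1, 1 - d$x1)
d$y2 <- sample(0:1, 500, replace = TRUE)
cmet_coords(d, xcols = c("x1", "x2"), ycols = c("y1", "y2"))
```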

3.2.2. Normalized Split Channel Multivariate Balance Equations

With a normalization similar to that from (7) to (8), (24) and (25) naturally lead to 2-simplex equations normalizing by $H_{U_X}$ and $H_{U_Y}$, respectively:

$$1 = \Delta H'_{P_X} + I'_{P_{XY}} + H'_{P_{X|Y}} \qquad 0 \le \Delta H'_{P_X}, I'_{P_{XY}}, H'_{P_{X|Y}} \le 1 \tag{27}$$
$$1 = \Delta H'_{P_Y} + I'_{P_{XY}} + H'_{P_{Y|X}} \qquad 0 \le \Delta H'_{P_Y}, I'_{P_{XY}}, H'_{P_{Y|X}} \le 1 \tag{28}$$

Note that the quantities $\Delta H'_{P_X}$ and $\Delta H'_{P_Y}$ have been independently motivated and named redundancies ([9], Section 2.4).

These are actually two different representations, one for each of the two blocks in the partitioned joint distribution. Using the fact that they share one coordinate, $I'_{P_{XY}}$, and that the rest are analogues ($\Delta H'_{P_X}$ and $\Delta H'_{P_Y}$ on one side, and $H'_{P_{X|Y}}$ and $H'_{P_{Y|X}}$ on the other), we can represent both equations at the same time in a single de Finetti diagram. We call this representation the split Channel Multivariate Entropy Triangle, a schema of which can be seen in Figure 5b. The qualifier “split” then refers to the fact that each partitioned joint distribution appears as two points in the diagram. Note the double annotation in the left and bottom coordinates, implying that there are two different diagrams overlapping. Conventionally, the point referring to the X block described by (27) is represented with a cross, while the point referring to the Y block described by (28) is represented with a circle, as will be noted in Figure 6.
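The split coordinates of (27) and (28) follow from the same quantities; the sketch below (again with hypothetical names, sharing the conventions of the cmet_coords sketch above) returns one coordinate triple per block, i.e., the cross and circle points just described:

```r
# Illustrative sketch: split CMET coordinates, Eqs. (27) and (28), one row per block.
cmet_split_coords <- function(df, xcols, ycols) {
  xid <- interaction(df[xcols], drop = TRUE)
  yid <- interaction(df[ycols], drop = TRUE)
  P   <- table(xid, yid) / nrow(df)
  H   <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }
  Hxy <- H(P); Hx <- H(rowSums(P)); Hy <- H(colSums(P))
  Hux <- sum(sapply(df[xcols], function(v) log2(length(unique(v)))))
  Huy <- sum(sapply(df[ycols], function(v) log2(length(unique(v)))))
  I   <- Hx + Hy - Hxy
  rbind(X = c(DH = (Hux - Hx) / Hux, I = I / Hux, Hcond = (Hxy - Hy) / Hux),  # Eq. (27)
        Y = c(DH = (Huy - Hy) / Huy, I = I / Huy, Hcond = (Hxy - Hx) / Huy))  # Eq. (28)
}
# Each row sums to 1, but note that the two rows are normalized by different entropies.
```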

[Figure 6 panels: (a) PCA on Iris; (b) ICA on Iris; (c) PCA on Glass; (d) ICA on Glass; (e) PCA on Arthritis; (f) ICA on Arthritis.]

Figure 6. (Color online) Split CMET exploration of feature transformation and selection with PCA (left) and ICA (right) on Iris, Glass and Arthritis when selecting the first n ranked features as obtained for each method. The colors of the axes have been selected to match those of Figure 4.

The formal interpretation of this split diagram with the conditions of Theorem 1 follows that of the aggregate CMET, but considering only one block at a time, for instance, for X:

• The lower side of the triangle is interpreted as before.

• The right side of the triangle is the locus of the partitioned joint distributions whose X block is completely determined by the Y block, that is, $H'_{P_{X|Y}} = 0$.

• The left side of the triangle, $\Delta H'_{P_X} = 0$, is the locus of those partitioned joint distributions whose X marginal is uniform, $P_X = U_X$.

The interpretation is analogous for Y, mutatis mutandis. The purpose of this representation is to investigate the formal conditions separately on each block. However, for this split representation we have to take into consideration that the normalizations may not be the same, that is, $H_{U_X}$ and $H_{U_Y}$ are, in general, different. A full example of the interpretation of both types of diagrams, the CMET and the split CMET, is provided in the next section in the context of feature transformation and selection.

3.3. Example Application: The Analysis of Feature Transformation and Selection with Entropy Triangles

In this section we present an application of the results obtained above to a machine learning subtask: the transformation and selection of features for supervised classification.

The task. An extended practice in supervised classification is to explore different transformations of the observations and then to evaluate such approaches on different classifiers for a particular task [27]. Instead of this "in the loop" evaluation, which conflates the evaluation of the transformation with that of the classification, we will use the CMET to evaluate only the transformation block, using the information transferred from the original to the transformed features as a heuristic. As specific instances of transformations, we will evaluate the use of Principal Component Analysis (PCA) [28] and Independent Component Analysis (ICA) [29], which are often employed for dimensionality reduction. Note that we may evaluate feature transformation and dimensionality reduction at the same time with the techniques developed above: the transformation procedure in the case of PCA and ICA may provide Y as a ranking of features, so that we may carry out feature selection afterwards by selecting subsets $Y_j$ spanning from the first-ranked to the $j$-th feature.

The tools. PCA is a staple technique in statistical data analysis and machine learning, based on the Singular Value Decomposition of the data matrix to obtain projections along the singular vectors that account for its variance in decreasing amounts, so PCA ranks the transformed features in this order. The implementation used in our examples is that of the publicly available R package stats (v. 3.3.3) (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/00Index.html, accessed on 11 June 2018). While PCA aims at the orthogonalization of the projections, ICA finds the projections, also known as factors, by maximizing their statistical independence, in our example by minimizing a cost term related to their mutual information [30]. However, this does not result in a ranking of the transformed features, hence we have created a pseudo-ranking by carrying out an ICA transformation obtaining $j$ transformed features for all sensible values of $1 \le j \le l$ using independent runs of the ICA algorithm. The implementation used in our examples is that of fastICA [30] as implemented in the R package fastICA (v. 1.2-1) (https://cran.r-project.org/package=fastICA, accessed on 11 June 2018), with standard parameter values (alg.typ="parallel", fun="logcosh", alpha=1, method="C", row.norm=FALSE, maxit=200, tol=0.0001). The entropy diagrams and calculations were carried out with the open-source entropies experimental R package that provides an implementation of the present framework (available at https://github.com/FJValverde/entropies.git, accessed on 11 June 2018). The analysis carried out in this section is part of an illustrative vignette for the package and will remain so in future releases.

Analysis of results. We analyzed in this way some UCI classification datasets [31], whose number of classes k, features m, and observations n are listed in Table 1.

Table 1. Datasets analyzed.

    Name           k    m    n
 1  Ionosphere     2   34  351
 2  Iris           3    4  150
 3  Glass          7    9  214
 4  Arthritis      3    3   84
 5  BreastCancer   2    9  699
 6  Sonar          2   60  208
 7  Wine           3   13  178
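As a concrete illustration of the tools just described, the following is a hedged R sketch (not the code of the entropies vignette; the object names Y_pca and Y_ica are ours) of how the nested candidate feature sets $Y_j$ can be produced for Iris with the packages and parameter values stated above. The transformed features would then be discretized (e.g., by binning, not shown here) before computing the entropy coordinates, since the framework operates on discrete variables.

```r
library(fastICA)

X <- as.matrix(log(iris[, 1:4]))   # logarithm of Anderson's Iris features

# PCA: the components come already ranked by decreasing variance,
# so Y_j is simply the first j principal components.
pca   <- prcomp(X)
Y_pca <- lapply(1:4, function(j) pca$x[, 1:j, drop = FALSE])

# ICA: there is no intrinsic ranking, so we build a pseudo-ranking by
# extracting j components in an independent fastICA run for each j,
# with the standard parameter values quoted in the text.
Y_ica <- lapply(1:4, function(j)
  fastICA(X, n.comp = j, alg.typ = "parallel", fun = "logcosh", alpha = 1,
          method = "C", row.norm = FALSE, maxit = 200, tol = 0.0001)$S)
```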

For reasons of simplicity, we decided to illustrate our new techniques on three datasets: Iris, Glass and Arthritis. Ionosphere, BreastCancer, Sonar and Wine show a pattern similar to that of Glass, but less interesting, as commented below. Besides, both Ionosphere and Wine have too many features for the kind of neat visualization we are trying to use in this paper. We have also used slightly modified entropy triangles in which the colors of the axes are related to those of the information diagrams of Figure 4b.

For instance, Figure 6a presents the results of the PCA transformation on the logarithm of the features of Anderson's Iris. Crosses represent the information decomposition of the input features X using (27), circles represent the information decomposition of the transformed features $Y_j$ using (28), and filled circles the aggregate decomposition of (26). We represent several possible feature sets $Y_j$ as output, where each is obtained by selecting the first $j$ features in the ranking provided by PCA. For example, since Iris has four features, we can make four different feature sets of 1 to $j$ features, named in the figure as "1_j", that is, "1_1" to "1_4". The figure then explores how the information in the whole database X is transported to different, nested candidate feature sets $Y_j$ as per the PCA recipe: choose as many ranked features as required to increase the transmitted information.

We first notice that all the points for X lie on a line parallel to the left side of the triangle and their average transmitted information is increasing, parallel to a decrease in remanent information. Indeed, the redundancy $\Delta H'_{P_X} = \Delta H_{P_X} / H_{U_X}$ is the same regardless of the choice of $Y_j$. The monotonic increase with the number of features selected $j$ in average transmitted information $I'_{P_{XY_j}} = I_{P_{XY_j}} / H_{U_X}$ in (27) corresponds to the monotonic increase in absolute transmitted information $I_{P_{XY_j}}$: for a given input set of features X, the more output features are selected, the higher the mutual information between input and output. This is the basis of the effectiveness of the feature-selection procedure.

Regarding the points for $Y_j$, note that the absolute transmitted information also appears in the average transmitted information (with respect to $Y_j$) as $I'_{P_{XY_j}} = I_{P_{XY_j}} / H_{U_{Y_j}}$ in (28). While $I_{P_{XY_j}}$ increases with $j$, as mentioned, we actually see a monotonic decrease in $I'_{P_{XY_j}}$. The reason for this is the rapidly increasing value of the denominator $H_{U_{Y_j}}$ as we select more and more features.

Finally, notice how these two tendencies are conflated in the aggregate plot for the $XY_j$ in Figure 6a, which shows a lopsided, inverted-U pattern, peaking before $j$ reaches its maximum. This suggests that if we balance aggregated transmitted information against the number of features selected (the complexity of the representation) in the search for a faithful representation, the average transmitted information is the quantity to optimize, that is, the mutual determination between the two feature sets.

Figure 6b presents similar results for the ICA transformation on the logarithm of the features of Anderson's Iris, with the same glyph convention as before, but with a ranking resulting from carrying out the ICA method in full for each value of $j$. That is, we first work out $Y_1$, which is a single component, then we calculate $Y_2$, which consists of the two best ICA components, and so on. The reason for this is that ICA does not rank the features it produces, so we have to create this ranking by running the ICA algorithm for all values of $j$ to obtain each $Y_j$.
Note that the transformed features produced by PCA and ICA are, in principle, very different, but the phenomena described for PCA are also apparent here: an increase in aggregate transmitted information, checked by the increase of the denominator represented by $H_{U_{Y_j}}$, which implies a decreasing transmitted information per feature for $Y_j$.

With the present framework, the question of which transformation is "better" for this dataset can be given content and rephrased as which transformation transmits more information on average on this dataset, and also, importantly, whether the aggregate information available in the dataset is being transmitted by either of these methods. This is explored in Figure 7 for Iris, Glass and Arthritis, where, for reference, we have included a point for the (deterministic) transformation of the logarithm, the cross, giving an idea of what a lossless information transformation can achieve.
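Restating the argument above in a single display, and keeping the notation of (27) and (28), the opposing trends of the crosses and the circles stem from two different normalizations of the same absolute quantity:

$$\frac{I_{P_{XY_j}}}{H_{U_X}} \ \ \text{(crosses, non-decreasing in } j\text{)} \qquad \text{vs.} \qquad \frac{I_{P_{XY_j}}}{H_{U_{Y_j}}} \ \ \text{(circles, decreasing in these experiments, since } H_{U_{Y_j}} \text{ grows with } j\text{).}$$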

[Figure 7 panels: (a) comparing the transformations on Iris; (b) on Glass; (c) on Arthritis.]

Figure 7. (Color online) Comparison of PCA and ICA as data transformations using the CMET on Iris, Glass and Arthritis. Note that these are the same positions represented as inverted triangles in Figure 6a,b.

Consider Figure 7a for Iris. The first interesting observation is that neither technique is transmitting all of the information in the database, which can be gleaned from the fact that both feature sets "1_4", for which all the available features have been selected, lie below the cross. This clearly follows the data processing inequality, but is still surprising, since transformations like ICA and PCA are extensively used and considered to work well in practice. In this instance it can only be explained by the advantages of the achieved dimensionality reduction. Actually, the observation in the CMET suggests that we can improve on the average transmitted information per feature by retaining the first three features for each of PCA and ICA.

The analysis of Iris turns out to be an intermediate case between those of Arthritis and Glass, the latter being the most typical in our analysis. This is the case with many original features X that transmit very little private, distinctive information per feature. The typical behavior, both for PCA and ICA, is to select at first features that carry very little average information ($Y_1$). As we select more and more transformed features, information accumulates, but at a very slow pace, as shown in Figure 6c,d. Typically, the transformed features chosen last are very redundant. In the case of Glass, specifically, there is no point in retaining features beyond the sixth (out of 9) for either PCA or ICA, as shown in Figure 7b. As to comparing the techniques, in some similarly-behaving datasets PCA is better, while in others ICA is. In the case of Glass, it is better to use ICA when retaining up to two transformed features, but it is better to use PCA when retaining between 2 and 6.

The case of Arthritis is quite different, perhaps due to the small number of original features (m = 3). Our analyses show that just choosing the first ICA component $Y_1$, perhaps the first two, provides an excellent characterization of the dataset, being extremely efficient as regards information transmission. This phenomenon is also seen in the first PCA component, but is lost as we aggregate more PCA components. Crucially, taking the 3 ICA components amounts to taking all of the original information in the dataset, while taking the 3 components in the case of PCA is rather inefficient, as confirmed by Figure 7c.

All in all, our analyses show that the unsupervised transformation and selection of features in datasets can be assessed using an information-theoretical heuristic: maximize the average mutual information accumulated by the transformed features. We have also shown how to carry out this assessment with entropic balance equations and entropy triangles.
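As a hedged sketch of this heuristic (the function name is ours and purely illustrative), given the average transmitted information already computed for each candidate feature set $Y_j$, the selection rule reduces to picking the peak:

```r
# Pick the number of transformed features j maximizing the average transmitted
# information; which.max breaks ties in favour of the smaller (cheaper) j.
select_by_transmitted_info <- function(avg_transmitted_info) {
  which.max(avg_transmitted_info)
}
select_by_transmitted_info(c(0.20, 0.34, 0.41, 0.38))  # -> 3 features
```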

3.4. Discussion

The development of the multivariate case is quite parallel to the bivariate case. An important point to realize is that the multivariate transmitted information between two different random vectors, $I_{P_{XY}}$, is the proper generalization of the usual mutual information $MI_{P_{XY}}$ in the bivariate case, rather than the more complex alternatives used for multivariate sources (see Section 2.2 and [5,14]). Indeed, properties (18) and (20) are crucial in transporting the structure and intuitions built from the bivariate channel entropy triangle to the multivariate one, of which the former is a proper instance. This was not the case with balance equations and entropy triangles for stochastic sources of information [5].

The crucial quantities in the balance equation and the triangle have been independently motivated in other works. First, multivariate mutual information is fundamental in Information Theory, and we have already mentioned the redundancy $\Delta H_{P_X}$ [9]. We also mentioned the input-entropy-normalized $I'_{P_{XY}}$ used as a standalone assessment measure in intrusion detection [32]. Perhaps the least known quantity in the paper was the variation of information. Despite being inspired by the concept proposed by Meila [11], to the best of our knowledge it is completely new in the multivariate setting. However, the underlying concepts of conditional or remanent entropies have proven their usefulness time and again. All of the above is indirect proof that the quantities studied in this paper are significant, and that the existence of a balance equation binding them together is important.

The paragraph above notwithstanding, there are researchers who claim that Shannon-type relations cannot capture all the dependencies inside multivariate random vectors [33]. Due to the novelty of that work, it is not clear how much the "standard" theory of Shannon measures would have to change to accommodate the objections raised to it in that respect. But this question seems to be off the mark for our purposes: the framework of channel balance equations and entropy triangles has not been developed to look into the question of dependency, but of aggregate information transfer, wherever that information comes from. It may be relevant to source balance equations and triangles [5], which have a different purpose, but that still has to be researched.

The normalizations involved in (6) and (26), respectively (8), (27) and (28), are conceptually similar: to divide by the logarithm of the total size of the domains involved, whether it is the size of the bivariate domain X × Y or that of the multivariate one. Notice, first, that this is the same as taking logarithms to the base of these sizes in the non-normalized equations. The resulting units would not be bits for the multivariate case proper, since the size of X or Y is at least 2 × 2 = 4. But since the entropy triangles represent compositions [12], which are inherently dimensionless, this allows us to represent many different, and otherwise incomparable, systems, e.g., univariate and multivariate ones, with the same kind of diagram. Second, this type of normalization allows for an interpretation of the extension of these measures to the continuous case as a limit in the process of equipartitioning a compact support, as done, for instance, for the Rényi entropy in ([34], Section 3), which is known to be a generalization of Shannon's. There are hopes, then, for a continuous version of the balance equations for Rényi's entropy.
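To make the first remark explicit, writing $b$ for the total size of the joint domain (a shorthand introduced here, e.g., $b = |X \times Y|$ in the bivariate case), dividing an entropy in bits by $\log_2 b$ is just a change of logarithm base:

$$H'_{P} \;=\; \frac{H_{P}}{\log_2 b} \;=\; -\sum_i p_i \, \frac{\log_2 p_i}{\log_2 b} \;=\; -\sum_i p_i \log_b p_i,$$

so the normalized coordinates can be read as entropies measured in base-$b$ units, which is what makes systems with different domain sizes comparable in the same diagram.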
Finally, note that the application presented in Section 3.3 above, although principled in the framework presented here, is not conclusive on the quality of the analyzed transformations in general, but only as applied to the particular datasets. For that, a wider selection of data transformation approaches and many more datasets should be assessed. Furthermore, the feature selection process used the "filter" approach, which for supervised tasks seems suboptimal. Future work will address this issue, as well as how the technique developed here relates to the end-to-end assessment presented in [4] and the source characterization technique of [5].

4. Conclusions

In this paper, we have introduced a new way to assess quantitatively and visually the transfer of information from a multivariate source X to a multivariate sink of information Y, using a heretofore unknown decomposition of the entropies around the joint distribution $P_{XY}$. For that purpose, we have generalized a similar previous theory and visualization tools for bivariate sources, greatly extending the applicability of the results:

• We have been able to decompose the information of a random multivariate source into three components: (a) the non-transferable divergence from uniformity $\Delta H_{P_{XY}}$, which is an entropy "missing" from $P_{XY}$; (b) a transferable, but not transferred, part, the variation of information $VI_{P_{XY}}$; and (c) the transferable and transferred information $I_{P_{XY}}$, which is a known, but never considered in this context, generalization of bivariate mutual information.

• Using the same principles as in previous developments, we have been able to obtain a new type of visualization diagram for this balance of information using de Finetti's ternary diagrams, which is actually an exploratory data analysis tool.

We have also shown how to apply these new theoretical developments and visualization tools to the analysis of information transfer in unsupervised feature transformation and selection, a ubiquitous step in data analysis, and, specifically, to the analysis of PCA and ICA. We believe this is a fruitful approach, e.g., for the assessment of learning systems, and foresee a bevy of applications to come. Further conclusions on this issue are left for a more thorough later investigation.

Author Contributions: Conceptualization, F.J.V.-A. and C.P.-M.; Formal analysis, F.J.V.-A. and C.P.-M.; Funding acquisition, C.P.-M.; Investigation, F.J.V.-A. and C.P.-M.; Methodology, F.J.V.-A. and C.P.-M.; Software, F.J.V.-A.; Supervision, C.P.-M.; Validation, F.J.V.-A. and C.P.-M.; Visualization, F.J.V.-A. and C.P.-M.; Writing—original draft, F.J.V.-A. and C.P.-M.; Writing—review & editing, F.J.V.-A. and C.P.-M.

Funding: This research was funded by the Spanish Government-MinECo projects TEC2014-53390-P and TEC2017-84395-P.

Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

PCA   Principal Component Analysis
ICA   Independent Component Analysis
CMET  Channel Multivariate Entropy Triangle
CBET  Channel Binary Entropy Triangle
SMET  Source Multivariate Entropy Triangle

References

1. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
2. Shwartz-Ziv, R.; Tishby, N. Opening the Black Box of Deep Neural Networks via Information. arXiv 2017, arXiv:1703.00810v3.
3. Tishby, N.; Zaslavsky, N. Deep Learning and the Information Bottleneck Principle. In Proceedings of the IEEE 2015 Information Theory Workshop, San Diego, CA, USA, 1–6 February 2015.
4. Valverde-Albacete, F.J.; Peláez-Moreno, C. 100% classification accuracy considered harmful: The normalized information transfer factor explains the accuracy paradox. PLOS ONE 2014, doi:10.1371/journal.pone.0084217.
5. Valverde-Albacete, F.J.; Peláez-Moreno, C. The Evaluation of Data Sources using Multivariate Entropy Tools. Expert Syst. Appl. 2017, 78, 145–157, doi:10.1016/j.eswa.2017.02.010.
6. Valverde-Albacete, F.J.; Peláez-Moreno, C. Two information-theoretic tools to assess the performance of multi-class classifiers. Pattern Recognit. Lett. 2010, 31, 1665–1671.
7. Yeung, R. A new outlook on Shannon's information measures. IEEE Trans. Inf. Theory 1991, 37, 466–474.
8. Reza, F.M. An Introduction to Information Theory; McGraw-Hill Electrical and Electronic Engineering Series; McGraw-Hill Book Co., Inc.: New York, NY, USA; Toronto, ON, Canada; London, UK, 1961.
9. MacKay, D.J.C. Information Theory, Inference and Learning Algorithms; Cambridge University Press: Cambridge, UK, 2003.
10. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, XXVII, 379–423, 623–656.
11. Meila, M. Comparing clusterings—An information based distance. J. Multivar. Anal. 2007, 28, 875–893.
12. Pawlowsky-Glahn, V.; Egozcue, J.J.; Tolosana-Delgado, R. Modeling and Analysis of Compositional Data; John Wiley & Sons: Chichester, UK, 2015.
13. Valverde-Albacete, F.J.; de Albornoz, J.C.; Peláez-Moreno, C. A Proposal for New Evaluation Metrics and Result Visualization Technique for Sentiment Analysis Tasks. In CLEF 2013: Information Access Evaluation. Multilinguality, Multimodality and Visualization; Forner, P., Müller, H., Paredes, R., Rosso, P., Stein, B., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; Volume 8138, pp. 41–52.
14. Timme, N.; Alford, W.; Flecker, B.; Beggs, J.M. Synergy, redundancy, and multivariate information measures: An experimentalist's perspective. J. Comput. Neurosci. 2014, 36, 119–140.
15. James, R.G.; Ellison, C.J.; Crutchfield, J.P. Anatomy of a bit: Information in a time series observation. Chaos 2011, 21, 037109.
16. Watanabe, S. Information theoretical analysis of multivariate correlation. J. Res. Dev. 1960, 4, 66–82.
17. Tononi, G.; Sporns, O.; Edelman, G.M. A measure for brain complexity: Relating functional segregation and integration in the nervous system. Proc. Natl. Acad. Sci. USA 1994, 91, 5033–5037.
18. Studený, M.; Vejnarová, J. The Multiinformation Function as a Tool for Measuring Stochastic Dependence. In Learning in Graphical Models; Springer: Dordrecht, The Netherlands, 1998; pp. 261–297.
19. Han, T.S. Nonnegative entropy measures of multivariate symmetric correlations. Inf. Control 1978, 36, 133–156.
20. Abdallah, S.A.; Plumbley, M.D. A measure of statistical complexity based on predictive information with application to finite spin systems. Phys. Lett. A 2012, 376, 275–281.
21. Tononi, G. Complexity and coherency: Integrating information in the brain. Trends Cognit. Sci. 1998, 2, 474–484.
22. McGill, W.J. Multivariate information transmission. Psychometrika 1954, 19, 97–116.
23. Sun Han, T. Multiple mutual informations and multiple interactions in frequency data. Inf. Control 1980, 46, 26–45.
24. Bell, A. The co-information lattice. In Proceedings of the Fifth International Workshop on Independent Component Analysis and Blind Signal Separation, Nara, Japan, 1–4 April 2003.
25. Abdallah, S.A.; Plumbley, M.D. Predictive Information, Multiinformation and Binding Information; Technical Report C4DM-TR10-10; Queen Mary, University of London: London, UK, 2010.

26. Valverde-Albacete, F.J.; Peláez-Moreno, C. The Multivariate Entropy Triangle and Applications. In Hybrid Artificial Intelligence Systems (HAIS 2016); Springer: Seville, Spain, 2016; pp. 647–658.
27. Witten, I.H.; Eibe, F.; Hall, M.A. Data Mining. Practical Machine Learning Tools and Techniques, 3rd ed.; Morgan Kaufmann: Burlington, MA, USA, 2011.
28. Pearson, K. On Lines and Planes of Closest Fit to Systems of Points in Space. Philos. Mag. 1901, 559–572.
29. Bell, A.J.; Sejnowski, T.J. An Information-Maximization Approach to Blind Separation and Blind Deconvolution. Neural Comput. 1995, 7, 1129–1159.
30. Hyvärinen, A.; Oja, E. Independent component analysis: Algorithms and applications. IEEE Trans. Neural Netw. 2000, 13, 411–430.
31. Bache, K.; Lichman, M. UCI Machine Learning Repository; 2013.
32. Gu, G.; Fogla, P.; Dagon, D.; Lee, W.; Škorić, B. Measuring Intrusion Detection Capability: An Information-Theoretic Approach. In Proceedings of the 2006 ACM Symposium on Information, Computer and Communications Security (ASIACCS '06), Taipei, Taiwan, 21–23 March 2006; ACM: New York, NY, USA, 2006; pp. 90–101, doi:10.1145/1128817.1128834.
33. James, R.G.; Crutchfield, J.P. Multivariate Dependence beyond Shannon Information. Entropy 2017, 19, 531–545.
34. Jizba, P.; Arimitsu, T. The world according to Rényi: Thermodynamics of multifractal systems. Ann. Phys. 2004, 312, 17–59.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).